Test for 'hung' services (was Re: Upgrade works with one issue)

yary not.com at gmail.com
Tue Jun 24 17:43:52 EDT 2008


On Tue, Jun 24, 2008 at 4:33 AM, Charles Tweedy <
Charles.Tweedy at dcr.virginia.gov> wrote:

> ... I have used Argus for 3 years now, and never touched it. It just works,
> and my perceived uptime has been 100%.


The keyword is "perceived." I installed Argus 3.5 about a month and a half
ago, and discovered it wasn't warning me about some network downtimes.
Turning on graphing showed some gaps in the services. I found that when the
graph showed a gap, the service wasn't running: I could bring a network
interface down, leave it down, and Argus wouldn't complain. This is with a
simple ping test, no dependencies, and "retries: 0". When I restarted
Argus, it would resume testing and start alerting.

So I set up "argusctl hup" in a cron job to restart argus, which restarts
all the frozen test services. Important services now test without gaps.

If you want a quick list of tests that may not actually be running on your
server, use the program at the end of this message. It looks at the graph
data and produces output like this:

 $ ./argus_hound.pl
Check Top:Argus_Statistics:Self_objects samples graph for missing sample after Mon Jun 23 10:34:48 2008
Check Top:Argus_Statistics:Self_objects hours graph for missing sample after Tue Jun 17 10:00:48 2008
Check Top:Argus_Statistics:Self_services samples graph for missing sample after Sat Jun 21 02:34:35 2008
Check Top:Argus_Statistics:Self_services hours graph for missing sample after Tue Jun 17 02:00:35 2008
Top:World_Reachability:Local_Servers:Ping_example.com has not run since Sun May 25 18:22:54 2008

from an instance with argus config containing
Group "Argus Statistics" {
   graph: yes

   Service Self/objects
   Service Self/services
}

It can't check un-graphed services. If you've reduced the frequency of a
service, this script will produce a false positive. The same goes if you've
recently turned a service off, if a service has a dependency, or if you
turned off graphing. This program isn't smart and doesn't know about valid
reasons for gaps; it just finds any reduction in frequency and asks you to
check.
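
To make the false-positive case concrete, here is a minimal standalone sketch
of the same kind of gap heuristic, run on plain timestamps instead of Argus
graph data. The helper name first_gap and the sample times are made up for
illustration and are not part of Argus; the real script reads its samples
through Argus::Graph::Data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of Argus: given sample times (epoch seconds,
# ascending) and a gap multiplier, return the time of the last sample before
# the first suspicious jump, or undef if no jump is found.
sub first_gap {
  my ($times, $gap_size) = @_;
  my ($prior, $interval);

  for my $t (@$times) {
    if (!defined $prior) { $prior = $t; next }
    my $delta = $t - $prior;
    # A jump is suspicious once it exceeds $gap_size times the last interval
    return $prior if $interval && $delta > $gap_size * $interval;
    ($interval, $prior) = ($delta, $t);
  }
  return undef;
}

# Made-up data: a 60-second test that was deliberately slowed to 300 seconds
my @slowed = (0, 60, 120, 180, 480, 780, 1080);
my $gap = first_gap(\@slowed, 3);
print defined $gap ? "gap after t=$gap\n" : "no gap\n";
```

The jump from 180 to 480 is five times the previous interval, so it gets
flagged even though nothing hung. That is the kind of report you have to
check by hand.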

Here's the script I'm using. Feel free to check your own. If you want to add
graphs and then run this, let the data accumulate for a few days first. On
my setup, some services would regularly stop running after 2-3 days of
working, hence the cron job for preventative maintenance.

--- argus_hound.pl ---
#!/usr/bin/perl
# You may need to change the path in the line above and below
use lib('/usr/local/lib/argus');

# If the jump between samples is more than this many times the previous
# sample interval, complain
my $gap_size = 3;

BEGIN {
  require 'conf.pl';
  require 'misc.pl';
 }

use Argus::Graph::Data;
use Carp;

sub check_for_gaps {
  my ($data, $level) = @_;
  my ($prior_time, $freq, $err);

  for my $sample (@{$data->{'samples'}}) {
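    if (!$prior_time) { $prior_time = $sample->{'time'} }
    elsif (!$freq) {
      # second sample: record the first interval and move on to this sample
      $freq = $sample->{'time'} - $prior_time;
      $prior_time = $sample->{'time'};
    }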
    elsif ( $sample->{'time'} - $prior_time > $gap_size * $freq ) {
      print "Check $::graph_datafile $level graph for missing sample after ",
        scalar(localtime $prior_time), "\n";
      $err = 1;
      last;
    }
    else {
      $freq = $sample->{'time'} - $prior_time;
      $prior_time = $sample->{'time'};
    }
  }

  # and now see if it's been a while since the last sample
  print "$::graph_datafile has not run since ",
        scalar(localtime $prior_time), "\n"
      if !$err && $level eq 'samples'
         && time - $prior_time > $gap_size * $freq;
}

# Check each file for suspicious gaps
for $::graph_datafile (glob("$::datadir/gdata/*")) {
  $::graph_datafile =~ s-.*/--;
  my $data = Argus::Graph::Data->new($::graph_datafile);

  $data->readallsamples();
  check_for_gaps($data, 'samples');

  $data->readsummary('hours',999999);
  check_for_gaps($data,'hours');

# checking the daily summary not so helpful
#  $data->readsummary('days',999999);
#  check_for_gaps($data,'days');
}

# exit on library error
sub error { croak @_ }

