One service hangs every day, argusctl -hup fixes it

yary not.com at gmail.com
Mon Jun 16 16:29:47 EDT 2008


This is a continuation of the issue posted here:
http://www.tcp4me.com/pipermail/arguslist/2008-June/000888.html

I've noticed that some of my argus services hang: their tests stop
running. If a failure occurs during the "hang", argus won't catch it.

Running "argusctl hup" restarts all the services, so I now have cron
running it every day at 16:10.

One of my "prog" services seems to hang reliably. Here's its daily
graph, with the interruptions clearly visible:
http://i186.photobucket.com/albums/x312/fecundfec/mc/temp_2008_06_16.png

The sample-level graph, showing its last sample around 11pm and then
restarting at 16:10:
http://i186.photobucket.com/albums/x312/fecundfec/mc/temp_sample_2008_06_16.png

For comparison, here's the daily graph of another "prog" service that
does not hang, over the same time period:
http://i186.photobucket.com/albums/x312/fecundfec/mc/traf_2008_06_16.png

So argus is still running and the computer is still running; it's just
one service that hangs. If I don't run "argusctl hup", others
eventually hang as well. Even "Self/Services" has gaps in its graph.

Here's the definition of the working prog service:
Group "Network Traffic" {
  graph: yes
  frequency: 7min

  Group "Pinky" {
      service: Prog {
        uname: Rec
        label: Rec
        command: netstat -ssp ip
        pluck: (\d+) total packets received
        expect: \d+
        calc: rate
      }
      service: Prog {
        uname: Send
        label: Send
        command: netstat -ssp ip
        pluck: (\d+) packets sent from this host
        expect: \d+
        calc: rate
      }
  }

... (another group)...
}

And the failing one:
Group "Hardware" {
    graph: yes
    Service Prog {
        command: sysctl -n hw.sensors.25
        uname: Temperature
        label: Temperature
        title: Pinky CPU Temperature
        ylabel: degrees F
        pluck: Temp2[^/]*/ (\d*\.\d*) degF
    }
}
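
One thing that can be ruled out outside of argus is the "pluck"
pattern itself. The sketch below (Python, not part of argus) applies
the same regular expression to the sample line shown in the
prog::rbuffer field of the debug output further down. It assumes that
argus applies "pluck" as an unanchored Perl-style search and uses the
first capture group, and that the trailing ~x0A in the dump stands for
a newline. The function name pluck() is made up for illustration.

```python
import re

# Python's `re` accepts this pattern unchanged from the Perl-style "pluck" line.
PLUCK = r"Temp2[^/]*/ (\d*\.\d*) degF"

# Sample line copied from the prog::rbuffer field in the debug output
# (assumption: the trailing ~x0A shown there is a newline).
SAMPLE = "lm0, Temp2, temp, 46.50 degC / 115.70 degF\n"


def pluck(pattern, text):
    """Return the first capture group, or None when the pattern does not match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(pluck(PLUCK, SAMPLE))
```

If this prints the Fahrenheit reading, the pattern is at least
consistent with the srvc::result value in the dump, which would point
away from the regex as the cause of the hang.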

The only difference of note is that the failing/hanging group uses the
default frequency, whereas the one that's not hanging has a frequency
defined. On the other hand, before I put in the "argusctl hup" cron
job, services with "frequency" set would hang as well. There's no
dependency or cron involved. Here's some debug info from the hanging
service:
bios::addtfs	336
bios::inits	335
bios::reads	670
bios::settos	335
bios::shuts	335
bios::timefs	335
bios::timeouts	50
cfdepth	2
opentime	Sun 15 Jun 22:33:31 2008
overridable	1
ovstatus	up
ovstatussummary	unprintable data structure
ovstatussummary::severity	clear
ovstatussummary::total	1
ovstatussummary::up	1
prog::command	sysctl -n hw.sensors.25
prog::exit	0
prog::pid	0
prog::rbuffer	lm0, Temp2, temp, 46.50 degC / 115.70 degF~x0A
severity	critical
siren	1
sirentime	Tue 20 May 16:30:14 2008
slaves_keep_state	*
slaves_send_notifies	*
sort	1
srvc	unprintable data structure
srvc::dones	335
srvc::elapsed	59.7983829975128
srvc::finished	1
srvc::frequency	60
srvc::lasttesttime	Sun 15 Jun 22:33:31 2008
srvc::nexttesttime	Sun 15 Jun 22:35:31 2008
srvc::phi	31
srvc::result	115.70
srvc::retries	2
srvc::showreason	0
srvc::starts	335
srvc::state	done
srvc::status	up
srvc::timeout	60
srvc::tries	0
stats::lasttime	Mon 16 Jun 13:00:00 2008
stats::log	unprintable data structure
stats::monthly	unprintable data structure
stats::status	up
stats::yearly	unprintable data structure
status	up
test	unprintable data structure
test::alpha	1
test::currvalue	115.70
test::pluck	Temp2[^/]*/ (\d*\.\d*) degF
test::rawvalue	lm0, Temp2, temp, 46.50 degC / 115.70 degF~x0A
test::spike_supress	no
test::testedp	1
transtime	Tue 20 May 16:30:14 2008
type	Service
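
To put a number on the hang, the timestamps in the dump can be
compared. The sketch below (Python, not part of argus) parses a few
key/value lines copied from the dump and computes how far past
srvc::nexttesttime the service already was at stats::lasttime. Using
stats::lasttime as the reference time is an assumption: it is simply
the latest timestamp in the dump. The helper names are made up, and
the date parsing assumes an English/C locale.

```python
from datetime import datetime

# Argus prints timestamps like "Sun 15 Jun 22:33:31 2008" in the dump above.
STAMP = "%a %d %b %H:%M:%S %Y"

# Three fields copied from the dump; key and value are tab-separated there.
DUMP = """\
srvc::lasttesttime\tSun 15 Jun 22:33:31 2008
srvc::nexttesttime\tSun 15 Jun 22:35:31 2008
stats::lasttime\tMon 16 Jun 13:00:00 2008
"""


def parse_dump(text):
    """Turn "key<TAB>value" lines into a dict, skipping anything else."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            fields[key] = value.strip()
    return fields


def overdue_seconds(fields, now_key="stats::lasttime"):
    """Seconds between the scheduled next test and a later reference time."""
    due = datetime.strptime(fields["srvc::nexttesttime"], STAMP)
    now = datetime.strptime(fields[now_key], STAMP)
    return (now - due).total_seconds()


if __name__ == "__main__":
    print(overdue_seconds(parse_dump(DUMP)))
```

A large positive number here means the test was scheduled but never
started again, even though other fields in the same object were still
being updated.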

What should I be looking at? How do I debug this?

thanks

-y

