One service hangs every day, argusctl -hup fixes it

yary at
Mon Jun 16 16:29:47 EDT 2008

This is a continuation of the issue posted here-

I've noticed that some of my argus services hang: the tests stop running.
If there's a failure during the "hang", argus won't catch it.

Running argusctl -hup restarts all the services. I now have cron
running it every day at 16:10.

One of my "prog" services seems to hang reliably. Here's its daily
graph, with interruptions clearly visible-

The sample-level graph, showing its last sample around 11pm, then
restarting at 16:10

For comparison, here's another "prog" service that does not hang, same
time period, daily graph.

So argus is still running and the computer is still running; it's just
one service that hangs. If I don't run "argusctl hup", others
eventually hang as well. Even "Self/Services" has gaps in its graph.

Here's the definition of the working prog service:
Group "Network Traffic" {
  graph: yes
  frequency: 7min

  Group "Pinky" {
      service: Prog {
        uname: Rec
        label: Rec
        command: netstat -ssp ip
        pluck: (\d+) total packets received
        expect: \d+
        calc: rate
      service: Prog {
        uname: Send
        label: Send
        command: netstat -ssp ip
        pluck: (\d+) packets sent from this host
        expect: \d+
        calc: rate

... (another group)...
and the failing one:
Group "Hardware" {
    graph: yes
    Service Prog {
        command: sysctl -n hw.sensors.25
        uname: Temperature
        label: Temperature
        title: Pinky CPU Temperature
        ylabel: degrees F
        pluck: Temp2[^/]*/ (\d*\.\d*) degF
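
To rule out the pluck pattern itself, it can be tried outside argus against the raw output recorded in prog::rbuffer in the debug info further down. The sketch below does that in Python. It assumes Python's re module handles this particular pattern the same way as the Perl regex engine argus uses, and the helper name pluck() is made up for illustration; it is not an argus function.

```python
import re

# Pattern copied from the "pluck" line of the Temperature service.
PLUCK = r"Temp2[^/]*/ (\d*\.\d*) degF"


def pluck(pattern, text):
    """Return the first capture group, or None when the pattern does not match.

    Hypothetical helper for illustration; not part of argus.
    """
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Raw output as shown in prog::rbuffer (the trailing ~x0A there is a newline).
sample = "lm0, Temp2, temp, 46.50 degC / 115.70 degF\n"
print(pluck(PLUCK, sample))  # prints 115.70
```

The extracted value agrees with srvc::result in the debug info, so the pattern at least matches the last output that was captured.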

The only difference of note is that the failing/hanging group uses the
default frequency, whereas the one that's not hanging has a frequency
defined. On the other hand, before I put in the "argusctl hup" cron job,
services with "frequency" set would hang as well. There's no
dependency or cron involved. Here's some debug info from the hanging
service:
bios::addtfs	336
bios::inits	335
bios::reads	670
bios::settos	335
bios::shuts	335
bios::timefs	335
bios::timeouts	50
cfdepth	2
opentime	Sun 15 Jun 22:33:31 2008
overridable	1
ovstatus	up
ovstatussummary	unprintable data structure
ovstatussummary::severity	clear
ovstatussummary::total	1
ovstatussummary::up	1
prog::command	sysctl -n hw.sensors.25
prog::exit	0
prog::pid	0
prog::rbuffer	lm0, Temp2, temp, 46.50 degC / 115.70 degF~x0A
severity	critical
siren	1
sirentime	Tue 20 May 16:30:14 2008
slaves_keep_state	*
slaves_send_notifies	*
sort	1
srvc	unprintable data structure
srvc::dones	335
srvc::elapsed	59.7983829975128
srvc::finished	1
srvc::frequency	60
srvc::lasttesttime	Sun 15 Jun 22:33:31 2008
srvc::nexttesttime	Sun 15 Jun 22:35:31 2008
srvc::phi	31
srvc::result	115.70
srvc::retries	2
srvc::showreason	0
srvc::starts	335
srvc::state	done
srvc::status	up
srvc::timeout	60
srvc::tries	0
stats::lasttime	Mon 16 Jun 13:00:00 2008
stats::log	unprintable data structure
stats::monthly	unprintable data structure
stats::status	up
stats::yearly	unprintable data structure
status	up
test	unprintable data structure
test::alpha	1
test::currvalue	115.70
test::pluck	Temp2[^/]*/ (\d*\.\d*) degF
test::rawvalue	lm0, Temp2, temp, 46.50 degC / 115.70 degF~x0A
test::spike_supress	no
test::testedp	1
transtime	Tue 20 May 16:30:14 2008
type	Service
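
One thing that can be checked outside argus is how long the command takes on its own: the dump above shows srvc::elapsed at about 59.8 seconds against a srvc::timeout of 60, and bios::timeouts at 50. Whether that is related to the hang is not established here. The sketch below times a single run in Python; the function name time_command() and its return shape are made up for illustration and have nothing to do with argus internals.

```python
import subprocess
import time


def time_command(argv, timeout):
    """Run argv once and return (elapsed_seconds, exit_code, output).

    exit_code is None when the command did not finish within `timeout`
    seconds. Hypothetical helper for illustration; not part of argus.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        return time.monotonic() - start, proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return time.monotonic() - start, None, ""


# Example call on the monitored host, using the command from prog::command
# and a limit that mirrors srvc::timeout:
#   elapsed, code, out = time_command(["sysctl", "-n", "hw.sensors.25"], timeout=60)
#   print(f"elapsed={elapsed:.2f}s exit={code} output={out!r}")
```

Running this a few times around the hour the graph stops (about 11pm in this case) would show whether the command itself ever stalls.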

What should I be looking at? How do I debug this?


