From psk at psk.net Thu May 1 16:59:20 2008
From: psk at psk.net (Percy Kwong)
Date: Thu May 1 17:00:00 2008
Subject: Notifying within hosts and services
Message-ID: <481A2F28.9000704@psk.net>

I have a service within a group that needs a script to be run to restart
that service if it goes down. I was wondering if I define a method to
run the script and place an "notifyalso: Mymethod" within the service
definition if that would be possible?

Basically, I would like to run the restart script only for that service
and for no others within the group or host.

Do I need to create a separate group for that one service and place the
notifyalso statement within the new group definition or can i just
specify the notifyalso command within the service definition for the
existing group?

Thanks In Advance.

From not.com at gmail.com Mon May 19 13:43:40 2008
From: not.com at gmail.com (yary)
Date: Mon May 19 13:43:56 2008
Subject: Shorthand for Service UDP/DNS?
Message-ID: <75cbfa570805191043l4e091121l637eac5839464664@mail.gmail.com>

I see the shorthand for UDP/Domain ("Service UDP/Domain/example.com")
documented at http://argus.tcp4me.com/services.html

And one example under http://argus.tcp4me.com/xtservices.html shows a
shorthand for Service UDP/DNS: "UDP/DNS/Serial/example.com" - but I can't
find documentation of that abbreviation. How does that work?

From not.com at gmail.com Mon May 19 15:47:32 2008
From: not.com at gmail.com (yary)
Date: Mon May 19 15:47:38 2008
Subject: Built-in resolver hides DNS errors
Message-ID: <75cbfa570805191247r1aa6c948u3ff454c07a7428d3@mail.gmail.com>

I was trying to test out notification by putting in the config file a
group I knew would fail:

Host "Fail.notopdomain.xyz" {
        Service Ping
}

... and yet, Fail.notopdomain.xyz always came up "green", regardless
of the services I tried.

After commenting out the built-in resolver "Resolv", Argus
appropriately complains "Errors detected during startup", and
highlights the error log.
The built-in resolver should also complain if it can't resolve a hostname!

From not.com at gmail.com Mon May 19 16:04:22 2008
From: not.com at gmail.com (yary)
Date: Mon May 19 16:04:29 2008
Subject: "check_now" link, statistics?
Message-ID: <75cbfa570805191304w2bcdf95cs88c821e31fc4dc9d@mail.gmail.com>

After setting up the config file with a must-fail test:

Group "must fail" {
        Service Prog {
                command: fail
                expect: OK-Wonderful
        }
}

I ran argusctl hup, and saw "must fail" come up "green" (since it hadn't run
yet). I clicked down to the Service Prog object and then clicked
"check_now". It refreshed, but still came back green. Is "check_now"
supposed to force the argus daemon to recheck the service, or just reload
the page?

A few minutes later, the page did refresh itself and come up "red".

Also the statistics seem to be incorrect, after it ran the test and came up
red it says 100% up, 0% down, and no times run. And in fact none of my
regular, non-test entries show the total number of times run, they are
always blank.

This is with argus 3.5 downloaded from tcp4me just this weekend.

Here's what the "must fail" Prog status page shows:

name     Prog
status   down
command  fail
------------------------------
*Status: down since Mon 19 May 13:12:03 2008*

            start                     elapsed time  % up   % down  times down
Today       Mon 19 May 13:11:42 2008  0:00:21       100.0  0.00    1
This Month  Mon 19 May 13:11:42 2008  0:00:21       100.0  0.00    1
This Year   Mon 19 May 13:11:42 2008  0:00:21       100.0  0.00    1

... and now:

*Status: down since Mon 19 May 13:12:03 2008*

            start                     elapsed time  % up   % down  times down
Today       Mon 19 May 13:06:50 2008  0:09:32       54.55  45.45   1
This Month  Mon 19 May 13:06:50 2008  0:09:32       54.55  45.45   1
This Year   Mon 19 May 13:06:50 2008  0:09:32       54.55  45.45   1

how could it ever have been up? That statistic should say 0% up, 100% down.

From not.com at gmail.com Mon May 19 17:14:43 2008
From: not.com at gmail.com (yary)
Date: Mon May 19 17:14:48 2008
Subject: %O not resetting when using %O{param}?
Message-ID: <75cbfa570805191414k1fc44a5fo2ed0baf717d080d4@mail.gmail.com>

I just figured out the parameters to %O inside notifications and added to
the wiki however using this example:

Method "mail" {
        send: To: %R\n\
From: %F\n\
Content-Type: text/plain; charset=utf-8\n\
Subject: Argus %O %S %P%E\n\n\
%M\ntest defined in %O{definedinfile} line %O{definedonline}
}

When run, the subject shows the file, and not the object's unique name. It
acts as if I'd typed "Argus %O{definedinfile}" instead of simply "Argus %O".

I've fixed this by using "%{unique}" in the subject- may want to look into
fixing this in the source.

From not.com at gmail.com Mon May 19 17:27:04 2008
From: not.com at gmail.com (yary)
Date: Mon May 19 17:27:14 2008
Subject: Config file error, without error messages
Message-ID: <75cbfa570805191427t76fd8fa6x41d9b4edf8c2fa6e@mail.gmail.com>

This is my last posting (for today), promise.

There are some cases where argus cannot parse the config file, but does not
complain. For example, this works AOK-

Host "machine.asite.org" {
        Service TCP/DNS
}

On the other hand, Argus will not complain about this, but will act as if
this line and all groups/services following do not exist:

Host "machine.asite.org" { Service TCP/DNS }

I had a number of hosts with a single test that I tried making "look pretty"
by lining them up like the above, took a while to figure out how to fix. The
config docs don't mention that each group, service, and config attribute
needs to be on a line by itself.
Would be good to document and for the config parser to complain (or to allow
single line definitions)

thanks
-y

From jaw+arguslist at tcp4me.com Mon May 19 20:32:04 2008
From: jaw+arguslist at tcp4me.com (Jeff Weisberg)
Date: Mon May 19 20:32:06 2008
Subject: Built-in resolver hides DNS errors
Message-ID: <200805200032.m4K0W4dI009152@athena.tcp4me.com>

| I was trying to test out notification by putting in the config file a
| group I knew would fail:
|
| Host "Fail.notopdomain.xyz" {
|         Service Ping
| }
|
| ... and yet, Fail.notopdomain.xyz always came up "green", regardless
| of the services I tried.
|
| After commenting out the built-in resolver "Resolv", Argus
| appropriately complains "Errors detected during startup", and
| highlights the error log.
|
| The built-in resolver should also complain if it can't resolve a hostname!

it does.

if you do not use 'Resolv', the dns error will be reported right away;
if you do use 'Resolv', the error will be reported sometime later.

From jaw+arguslist at tcp4me.com Mon May 19 20:33:23 2008
From: jaw+arguslist at tcp4me.com (Jeff Weisberg)
Date: Mon May 19 20:33:26 2008
Subject: "check_now" link, statistics?
Message-ID: <200805200033.m4K0XNRQ019874@athena.tcp4me.com>

| I ran argusctl hup, and saw "must fail" come up "green" (since it hadn't run
| yet). I clicked down to the Service Prog object and then clicked
| "check_now". It refreshed, but still came back green. Is "check_now"
| supposed to force the argus daemon to recheck the service, or just reload
| the page?

it causes argus to retest the service (once).
depending on the value of the 'retries' parameter, one failed test
may not be enough for argus to declare the service down.

| Also the statistics seem to be incorrect, after it ran the test and came up
| red it says 100% up, 0% down,

argus does not update these numbers each time the web page reloads,
only when something "interesting" happens.
the most recent "interesting" thing was when it went down, up to which
point it had been up for 100% of the time.

| and no times run. And in fact none of my
| regular, non-test entries show the total number of times run, they are
| always blank.

argus does not keep track of, or report on, the number of times a test runs.

From jaw+arguslist at tcp4me.com Mon May 19 20:41:19 2008
From: jaw+arguslist at tcp4me.com (Jeff Weisberg)
Date: Mon May 19 20:41:22 2008
Subject: Config file error, without error messages
Message-ID: <200805200041.m4K0fJ2r007370@athena.tcp4me.com>

| On the other hand, Argus will not complain about this, but will act as if
| this line and all groups/services following do not exist:
|
| Host "machine.asite.org" { Service TCP/DNS }
|
| I had a number of hosts with a single test that I tried making "look pretty"
| by lining them up like the above, took a while to figure out how to fix. The
| config docs don't mention that each group, service, and config attribute
| needs to be on a line by itself. Would be good to document

see: http://argus.tcp4me.com/config.html

> Note, unlike C or Perl, you cannot
> place the opening { on a different line

> Also, the closing } must be on
> a line by itself (with optional whitespace)

From not.com at gmail.com Tue May 20 20:25:41 2008
From: not.com at gmail.com (yary)
Date: Tue May 20 20:25:55 2008
Subject: Multiple values/tests from one service?
Message-ID: <75cbfa570805201725s7daf31fdp2f202aaa235cd610@mail.gmail.com>

I have a Prog that returns a bunch of values I'd like to monitor. Is it
possible to run the command once, and then pluck, test, and graph the
results a dozen different ways?

It's not too hard to set up a dozen "service Prog" calls, but then the
system is running the program more times than it needs to, and I'd like to
be sure the values are correlated from the same run. Setting "phi" to the
same value will probably be good enough as an interim solution.
thanks in advance

From not.com at gmail.com Wed May 21 12:26:06 2008
From: not.com at gmail.com (yary)
Date: Wed May 21 12:26:12 2008
Subject: Cleaning the stats directory
Message-ID: <75cbfa570805210926v4ad14e7dme2800743b87773a2@mail.gmail.com>

While experimenting with argus, I've left behind many unneeded files in the
stats directory. Here's a few commands to clean it out-

cd /var/argus/stats
(or wherever your argus data/stats directory is)

all one line:
argusctl list|tail +2|perl -pe 's/([\040%\#\+\\\;=\"\'"'"'\`\?\&~<>\/\000-\011\013-\037\177-\377])/sprintf("~x%02X",ord($1))/ges'>obj_list

and a sanity check here. After creating obj_list, make sure the output of
these two commands is the same number:
wc -l obj_list
ls | fgrep -xf obj_list | wc -l

to see what is going to be deleted:
ls | fgrep -v -x -f obj_list

proceed to deletion-
ls | fgrep -v -x -f obj_list | xargs rm
rm obj_list

use with care, it's not perfect... if there's a better way to clean out this
directory, let me know, I'm a rank beginner.

From not.com at gmail.com Wed May 21 15:10:50 2008
From: not.com at gmail.com (yary)
Date: Wed May 21 15:10:57 2008
Subject: pluck succeeds where expect fails?
Message-ID: <75cbfa570805211210r7cdf768csb808b2a4a83fd4c5@mail.gmail.com>

I'm scratching my head over this one. I'm plucking a value successfully, but
the service is failing because the expect doesn't match, only the expect
DOES match...

in my config:

        service: Prog {
                uname: Rec
                label: Rec
                command: netstat -ssp ip
                expect: total packets received
                pluck: (\d+) total packets received
                calc: rate
        }

I have data, but the service is listed as being down from due to "expect"
not matching. Interesting lines from the debug:

srvc::reason            PROG TEST did not match expected regex
srvc::result            ip:~x0A ____ 731741145 total packets received~x0A ____ 7084 fragments received~x0A~x. . .
status                  down
test::calcdata::lastdv  1849
test::calcdata::lastt   1211397647
test::calcdata::lastv   731741145
test::expect            total packets received
test::pluck             (\d+) total packets received
test::rawvalue          ip:~x0A ____ 731741145 total packets received~x0A ____ 7084 fragments received~x0A~x. . .

I tried replacing the spaces in the regexps with \s, same behavior. Is it
"bad" to use expect and pluck in the same service? Am I missing something
simple?

From not.com at gmail.com Wed May 21 15:26:53 2008
From: not.com at gmail.com (yary)
Date: Wed May 21 15:27:24 2008
Subject: pluck succeeds where expect fails?
In-Reply-To: <75cbfa570805211210r7cdf768csb808b2a4a83fd4c5@mail.gmail.com>
References: <75cbfa570805211210r7cdf768csb808b2a4a83fd4c5@mail.gmail.com>
Message-ID: <75cbfa570805211226m1f9599dcqf067b6e39054207f@mail.gmail.com>

Never mind- expect tests the value after pluck- my config should have

        expect: \d+

sorry for the bandwidth

From not.com at gmail.com Wed May 21 16:22:25 2008
From: not.com at gmail.com (yary)
Date: Wed May 21 16:22:29 2008
Subject: Cleaning the stats directory
In-Reply-To: <75cbfa570805210926v4ad14e7dme2800743b87773a2@mail.gmail.com>
References: <75cbfa570805210926v4ad14e7dme2800743b87773a2@mail.gmail.com>
Message-ID: <75cbfa570805211322r71896981r818a558f2f8d607c@mail.gmail.com>

a simpler way is to remove files that haven't been modified recently,
since argus updates the statistics after running tests. Something like

find /var/argus/stats -mtime +1 -print0 | xargs -0 rm

is simpler than my last idea. It's also good in the gcache/gdata directories.

again, sorry for the bandwidth.

From jaw+arguslist at tcp4me.com Thu May 22 10:33:05 2008
From: jaw+arguslist at tcp4me.com (Jeff Weisberg)
Date: Thu May 22 10:33:13 2008
Subject: Cleaning the stats directory
Message-ID: <200805221433.m4MEX5rW018987@athena.tcp4me.com>

| a simpler way is to remove files that haven't been modified recently,
| since argus updates the statistics after running tests.
| Something like
|
| find /var/argus/stats -mtime +1 -print0 | xargs -0 rm
|
| is simpler than my last idea. It's also good in the gcache/gdata directories.

or you could just wait.
argus will periodically look for and remove any unused old files
in the stats, gcache, gdata, and html dirs.

From jaw+arguslist at tcp4me.com Thu May 22 10:36:06 2008
From: jaw+arguslist at tcp4me.com (Jeff Weisberg)
Date: Thu May 22 10:36:09 2008
Subject: Notifying within hosts and services
Message-ID: <200805221436.m4MEa6TQ013732@athena.tcp4me.com>

| I have a service within a group that needs a script to be run to restart
| that service if it goes down. I was wondering if I define a method to
| run the script and place an "notifyalso: Mymethod" within the service
| definition if that would be possible?
|
| Basically, I would like to run the restart script only for that service
| and for no others within the group or host.
|
| Do I need to create a separate group for that one service and place the
| notifyalso statement within the new group definition or can i just
| specify the notifyalso command within the service definition for the
| existing group?

you can add it to the service

        Service something {
                notifyalso: mymethod
                ...
        }

From not.com at gmail.com Fri May 30 20:44:52 2008
From: not.com at gmail.com (yary)
Date: Fri May 30 20:44:59 2008
Subject: Why would a service stop running?
Message-ID: <75cbfa570805301744h4aa4c55h345c6f9c6682613d@mail.gmail.com>

I have one group ("DSL Route" below), with a ping service that stopped
testing, and I'm not sure why. This group is "gravity up", and it's not
alerting when both routers appear to be down, because it apparently is only
testing one of them.
There's no "depends":

Group "World Reachability" {
        retries: 0
        countstop: yes
        frequency: 2min
        sendnotify: yes
        graph: yes

        # do not send a notification if only some are down
        # only if they are all down
        gravity: up

        Group "Local Servers" {
                sendnotify: no
                gravity: up
                service: Ping {
                        label: Stanford
                        hostname: stanford.edu
                }
                service: Ping {
                        label: PaloAlto
                        hostname: gatekeeper.city.palo-alto.ca.us
                }
        }

        Group "DSL Route" {
                sendnotify: no
                gravity: up
                Service Ping {
                        label: Speakeasy Gateway
                        hostname: 216.27.178.1
                }
                Service Ping {
                        label: Speakeasy Upstream
                        hostname: 69.17.83.177
                }
        }
}

Here's graphs- the first shows the two services combined, with one of them
stopping testing this morning
http://i186.photobucket.com/albums/x312/fecundfec/mc/missing_test.png

It's not easy to see on the combined graph which one is running, when it
might be obscured by the other graph, so here they are separated:

gateway ping, last datapoint is around 10am, though it is now 5:30pm and
argus is still running-
http://i186.photobucket.com/albums/x312/fecundfec/mc/ping_gateway.png

upstream ping, last datapoint current:
http://i186.photobucket.com/albums/x312/fecundfec/mc/ping_upstream.png

gateway ping, showing gaps earlier in history:
http://i186.photobucket.com/albums/x312/fecundfec/mc/ping_gateway_hours.png

upstream ping, without gaps:
http://i186.photobucket.com/albums/x312/fecundfec/mc/ping_upstream_hours.png

What could make the gateway ping service have those gaps while argus is
still running & the other ping service doesn't have gaps? How can I ensure
that it always runs or fails?

Debug page at end of message.
thanks in advance
-y

Debugging Dump of Top:World_Reachability:DSL_Route:Ping_216.27.178.1

acl_about root view_all
acl_annotate root staff
acl_checknow root view_all
acl_flush root view_all
acl_getconf root view_all
acl_logfile root staff
acl_mode simple
acl_ntfyack root staff
acl_ntfyackall root view_all
acl_ntfydetail root staff
acl_ntfylist root staff
acl_override root staff
acl_page root staff user view_all
aclcache unprintable data structure
alarm 0
autogenerated 0
bios unprintable data structure
bios::addtfs 752
bios::inits 751
bios::reads 1502
bios::settos 751
bios::shuts 751
bios::timefs 751
cfdepth 3
children unprintable data structure
confck unprintable data structure
config unprintable data structure
countstop 0
currseverity clear
darp unprintable data structure
definedattime Wed 28 May 09:33:01 2008
definedinfile /var/argus/config
definedonline 191
depend unprintable data structure
graph 1
graphd unprintable data structure
graphd::gr_nmax_days 1024
graphd::gr_nmax_hours 1024
graphd::gr_nmax_samples 2048
gravity down
image unprintable data structure
image::barstyle minmax
image::drawborder 1
image::gr_line_thickness 1
image::gr_show_days 1
image::gr_show_hours 1
image::gr_show_samples 1
image::gr_what result
image::labelstyle box
image::title Speakeasy Gateway
image::transparent 1
label Speakeasy Gateway
label_left Speakeasy Gateway
label_right Speakeasy Gateway
logsize 200
name Ping
nostats 0
nostatus 0
notify unprintable data structure
notify::ackonup 0
notify::autoack 1
notify::list unprintable data structure
notify::mail_from Argus
notify::message_fmt %i %m - %t
notify::messagedn Top:World_Reachability:DSL_Route:Ping_216.27.178.1 is DOWN
notify::messageup Top:World_Reachability:DSL_Route:Ping_216.27.178.1 is UP
notify::nolotsmsgs 0
notify::notify mail:n.otcomm@gmail.com
notify::renotify 300
notify::sendnotify 0
notify::shortmessages 0
opentime Thu 29 May 10:33:36 2008
overridable 1
ovstatus up
ovstatussummary unprintable data structure
ovstatussummary::severity clear
ovstatussummary::total 1
ovstatussummary::up 1
parents unprintable data structure
passive 0
ping unprintable data structure
ping::addr ~xD8~x1B~xB2~x01
ping::data 216.27.178.1 is alive (150 ms)
ping::hostname 216.27.178.1
ping::ipver 4
ping::pid 32465
ping::rbuffer
ping::rtt 150
prevovstatus down
prevstatus down
severity critical
siren 1
sirentime Thu 29 May 04:51:36 2008
slaves_keep_state *
slaves_send_notifies *
sort 1
srvc unprintable data structure
srvc::dones 751
srvc::elapsed 0.190173149108887
srvc::finished 1
srvc::frequency 120
srvc::lasttesttime Thu 29 May 10:33:36 2008
srvc::nexttesttime Thu 29 May 10:35:36 2008
srvc::phi 96
srvc::result 150
srvc::retries 0
srvc::showreason 0
srvc::starts 751
srvc::state done
srvc::status up
srvc::timeout 60
srvc::tries 0
stats unprintable data structure
stats::daily unprintable data structure
stats::lasttime Fri 30 May 17:00:00 2008
stats::log unprintable data structure
stats::monthly unprintable data structure
stats::status up
stats::yearly unprintable data structure
status up
test unprintable data structure
test::alpha 1
test::spike_supress 1
timeout Thu 29 May 10:34:41 2008
transtime Thu 29 May 05:03:37 2008
type Service
uname Ping_216.27.178.1
unique Top:World_Reachability:DSL_Route:Ping_216.27.178.1
vxml_long_name Top:World_Reachability:DSL_Route:Ping_216.27.178.1
vxml_short_name Ping_216.27.178.1
wantread 0
wantwrit 0
web unprintable data structure
web::bkgimage /img/argus.logo.gif
web::bldtime Fri 30 May 17:06:28 2008
web::cachestale 120
web::footer_argus Argus: 3.5
web::icon /img/smile.gif
web::icon_down /img/sad.gif
web::javascript /argus.js
web::nospkr_icon /img/nospkr.gif
web::refresh 60
web::shownotiflist 1
web::showstats 1
web::sirensong /sound/whoopwhoop.wav
web::style_sheet /argus.css
web::transtime Fri 30 May 17:00:00 2008
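
[Archive note] One way to read a dump like this is to compare srvc::nexttesttime with the time the page was built (web::bldtime). The Python sketch below is only an illustration and is not part of Argus. The helper names parse_dump and overdue_by are hypothetical. It assumes the dump has been saved as one "key value" pair per line, that web::bldtime is a fair stand-in for "now", and that the process runs in an English/C locale so the day and month names parse. The four sample values are copied from the dump.

```python
from datetime import datetime, timedelta

# Timestamp layout seen in the dump, e.g. "Thu 29 May 10:35:36 2008".
DUMP_TIME_FORMAT = "%a %d %b %H:%M:%S %Y"


def parse_dump(text):
    """Split 'key value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.strip().partition(" ")
        if key:
            fields[key] = value.strip()
    return fields


def overdue_by(fields, now_key="web::bldtime"):
    """How far the page build time lies past the scheduled next test."""
    next_test = datetime.strptime(fields["srvc::nexttesttime"], DUMP_TIME_FORMAT)
    page_built = datetime.strptime(fields[now_key], DUMP_TIME_FORMAT)
    return page_built - next_test


# Values copied from the debugging dump in the message.
sample = """\
srvc::frequency 120
srvc::lasttesttime Thu 29 May 10:33:36 2008
srvc::nexttesttime Thu 29 May 10:35:36 2008
web::bldtime Fri 30 May 17:06:28 2008
"""

fields = parse_dump(sample)
late = overdue_by(fields)
frequency = timedelta(seconds=int(fields["srvc::frequency"]))

# Treat the test as stalled if it is overdue by more than one full interval
# (an arbitrary threshold chosen for this sketch).
stalled = late > frequency
print(f"overdue by {late}; stalled: {stalled}")
```

For this dump the scheduled test is more than a day behind the page build time even though the frequency is 120 seconds, which matches the gap visible in the gateway graph.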