Questions - Features

Tue Apr 5 20:58:30 EDT 2005

Hello,

See answers below.

Jeremy Kister wrote:

>On 4/5/2005 1:37 PM, Francois Mikus wrote:
>  
>
>>- support for remote agents running on various platforms which return 
>>information like: state, service, message
>>Similar to Big Brother agents, Nagios agents. An easy way to extend 
>>things without re-inventing the wheel is supporting Nagios and/or Big 
>>Brother agent communications. Supporting multiple event messages per 
>>message is also very nice (bulk messages); this avoids the event bloat 
>>associated with Nagios.
>>    
>>
>
>SNMP is a universal 'agent'.  I don't know of any other specifically
>written agents, but Argus will easily accommodate anything you can throw
>at it.
>
>so long as the Big Brother or Nagios agent talks over IP, Argus can
>communicate with it.  I can't think of a time I'd rather have a daemon
>besides SNMP on the remote machines, though.
>
>Argus supports summary messages ("multiple event messages per message")
>
>  
>
>>- using queuing for receiving events. This ensures that events from 
>>remote agents are never lost when the computer is too busy to process 
>>incoming events. This enables better use of processing resources, 
>>resiliency, possibility of dropping low priority events. I do not know 
>>any open-source NMSs that really support sophisticated queuing.
>>    
>>
>
>Argus doesn't receive any events.  It proactively polls things.  One
>could easily write a daemon to listen for traps, and have Argus monitor
>that daemon, though.
>
>When you receive a trap, you are depending on everything working
>correctly, which [for anything besides informational purposes] is silly.
> When you proactively poll a service, you're sure.
>  
>
Each method (active polling and passive pushing) has its advantages, and 
the two are complementary.

I will outline some of the benefits of using remote agents. I do not 
think I need to say anything about active polling, as we would not be 
here without it.

A typical remote agent is usually a process that runs from cron every X 
minutes and verifies the state of a device. This could involve basic 
stats such as CPU, memory, interface usage, etc. It has local thresholds 
and generates an event which is returned to the monitoring station. The 
status the monitoring station receives is as fresh as the execution (a 
few seconds for most tasks). You will note that I am not talking about 
SNMP traps or syslog messages. What I am talking about is a recurring 
test which sends data back to the monitoring station on a specific 
schedule (say, every 5 minutes).
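
To make the idea concrete, here is a minimal Python sketch of such a 
cron-driven agent. Everything in it is an assumption for illustration: 
the threshold values, the JSON message layout, the function names and 
the station address are hypothetical, and this is not the Big Brother 
or Nagios wire format. It applies local thresholds and bundles all 
results into one bulk message.

```python
import json
import socket

# Hypothetical local thresholds: service -> (warning, critical), in percent.
THRESHOLDS = {"cpu": (80.0, 95.0), "mem": (85.0, 95.0)}


def classify(service, value):
    """Apply the local threshold and return a state string."""
    warn, crit = THRESHOLDS[service]
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warning"
    return "ok"


def build_report(host, samples):
    """Bundle every check result into a single bulk message (JSON text)."""
    events = [
        {
            "service": service,
            "state": classify(service, value),
            "message": f"{service}={value:.1f}%",
        }
        for service, value in sorted(samples.items())
    ]
    return json.dumps({"host": host, "events": events})


def send_report(report, station, port):
    """Push the report to the monitoring station over TCP, one line per run."""
    with socket.create_connection((station, port), timeout=10) as conn:
        conn.sendall(report.encode("utf-8") + b"\n")
```

Run from cron every 5 minutes, the agent would gather its samples, call 
build_report() and hand the result to send_report(). The monitoring 
station only has to parse one message per host per run.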

Basic advantages:
- fresh data arrives at the monitoring station
- the bulk of the data munging and processing is done by the remote 
host; the monitoring station has little processing to do on the events 
it receives

Where a remote agent actually shines is when there is lots of processing 
to be done locally. This processing would be executed on the data before 
any output values or events could be generated.

This could be because the agent process needs to query things that take 
more time (local DB calls, stat formulas on data, etc.). Or it could be 
because the agent is actually called at the tail end of another process. 
I have made use of an agent which was called at the tail end of a 
network collection and trending tool (e.g. Cricket, Torrus). The 
collection engine processes the data, applies its thresholding and sends 
all the event data back to the monitoring station in a stream of events.

The timing is important in the two cases above: when the events are 
ready, they should be sent immediately to the monitoring station.

In terms of pure scalability, active service checks are much easier to 
manage, as everything is centralized. Thus a basic SNMP agent on remote 
hosts can fill most needs, whether that agent executes local commands 
(Net-SNMP) or just maps /proc info to a MIB variable. This is much 
cleaner to maintain than custom agents. Side note: a custom agent such 
as the basic Big Brother message agent is just a 50 KB C program, very 
portable. Its only purpose is to format the message and send it to the 
Big Brother monitoring station.

When dealing with remote agents sending back data, you are not doing 
anything silly; you are distributing the logic and processing. These 
agents do require some supporting logic on the monitoring station side: 
a dead timer, plus re-activation logic for when the data starts coming 
back in.

For example:

Say I have a remote workstation which runs SAN management software.
- I use an agent sending data to the monitoring station about SAN 
availability
- I also configure a service check on the monitoring station for the 
workstation: ping, ssh, disk usage, CPU, memory, important processes

Suppose something happens to the process sending back info about the 
SAN, and the events stop reaching the monitoring station. Once the 
workstation has not sent any state data for the SAN services for X 
minutes, the monitoring station should trigger a dead timer and use a 
colour (Big Brother used purple) to indicate that these services are 
out of touch. They could be down, they could be up; you don't know. No 
alarms are generated.
During all this time, the workstation is still monitored by the basic 
service check.
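
The dead timer and re-activation logic can be sketched in a few lines of 
Python. This is only an illustration under assumptions: the class name, 
the state strings ("ok", "critical") and the explicit "now" argument 
(seconds, used to make the timing easy to follow) are hypothetical and 
not taken from Argus or Big Brother. Only the purple colour comes from 
Big Brother.

```python
import time

PURPLE = "purple"  # "out of touch": no recent report, real state unknown


class PassiveService:
    """State of a service that is reported by a remote agent."""

    def __init__(self, name, max_age_seconds):
        self.name = name
        self.max_age = max_age_seconds
        self.state = PURPLE        # nothing received yet
        self.last_report = None    # timestamp of the last report, if any

    def report(self, state, now=None):
        """Record a report from the agent; this also re-activates the service."""
        self.state = state
        self.last_report = time.time() if now is None else now

    def current_state(self, now=None):
        """Return the reported state, or purple once the dead timer expires."""
        now = time.time() if now is None else now
        if self.last_report is None or now - self.last_report > self.max_age:
            return PURPLE
        return self.state

    def should_alarm(self, now=None):
        """Stale data never raises an alarm; only a fresh critical state does."""
        return self.current_state(now) == "critical"
```

With a 10 minute dead timer, a service that reported "ok" at t=1000 
still reads "ok" at t=1300, turns purple after t=1600, and returns to 
its reported state as soon as the next report arrives.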

The value of remote agents is that you can get up-to-date bulk state 
information that would not normally be available otherwise, at the 
expense of sometimes having large blocks of state information turn stale.

SNMP daemons as remote agents with active polling are great when the 
data needed is available without delay. (And yes, I do know about 
Net-SNMP's ability to check data values from local files that have been 
pre-fetched, but this is not very elegant or scalable.)

I believe that polling and pushing are complementary, and that using 
both is the building block of a full-featured network management system.

Dealing with SNMP traps from remote devices is what *I* call silly. All 
events should instead be sent to syslog, munged and *then* be sent to 
the monitoring station as ad-hoc events, or aggregated and sent to a 
support mailing list, etc. Ad-hoc events are what eat up most sysadmins' 
time until they automate and profile those events. Treating ad-hoc 
events should be one of the *last* steps of a network management system; 
it is just gravy. IMHO.

As a side note:
Concerning queuing and events, there *is* such a thing as reliable event 
messaging. See MQSeries, or other MQ-type products. Unfortunately, I do 
not know of any open-source message queuing (MQ) software at this time.
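
For the simpler part of the queuing idea (buffering incoming events and 
dropping low-priority ones under load), a minimal in-memory Python 
sketch could look like the following. It is an assumption-laden 
illustration, not reliable messaging in the MQSeries sense: nothing is 
persisted or acknowledged, and the class name and the priority 
convention (lower number = more urgent) are hypothetical.

```python
import heapq
import itertools


class EventQueue:
    """Bounded priority queue that sheds the least urgent event when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                  # entries: (priority, sequence, event)
        self._seq = itertools.count()    # tie-breaker, keeps arrival order
        self.dropped = 0

    def put(self, priority, event):
        """Queue an event; lower priority numbers are more urgent."""
        heapq.heappush(self._heap, (priority, next(self._seq), event))
        if len(self._heap) > self.capacity:
            # Over capacity: discard the least urgent entry
            # (the newest one among equal priorities).
            worst = max(self._heap)
            self._heap.remove(worst)
            heapq.heapify(self._heap)
            self.dropped += 1

    def get(self):
        """Return the most urgent event, oldest first among equals."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

The receiving daemon would call put() as events arrive and let a 
separate worker drain the queue with get(), so a busy monitoring station 
loses informational events before it loses critical ones.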

>>- support for maintenance periods in alerts and reporting. This would 
>>include recurring periods, one time scheduled periods, administrative 
>>reason, contact name.
>>    
>>
>yep.  docs on "cron" are included.
>  
>
Well, no sense in re-inventing the wheel. :-) I went through the doc and 
missed that one. I saw it in the example configurations though.
If there is a web interface to the cron maintenance windows data, then 
this is wonderful! If not, well, it can always be added.

>>To win over people and developers you need to support user hooks: 
>>Ability to create your own service checks, external actions, support 
>>for third party agents. A monitoring platform should be able to leverage 
>>external utilities and also *be* leveraged by external utilities. No 
>>system is in a vacuum.
>>    
>>
>
>it's clear you haven't even tried Argus :) Argus is a good thing.
>
>Argus plays well with external utilities.  Provided, is an 'argusctl'
>program, which lets external utilities play with Argus.
>  
>
You are right, I have not.
Keep in mind, I am trying to provide constructive ideas to enhance what 
Argus provides. If I did not find Argus very interesting, I would not be 
having this discussion. :-)

Thank you for your response.

Wish you a great day.

Francois Mikus
Acktomic Net Architects Inc.