Questions - Features
fmikus at acktomic.com
Tue Apr 5 20:58:30 EDT 2005
See answers below.
Jeremy Kister wrote:
>On 4/5/2005 1:37 PM, Francois Mikus wrote:
>>- support for remote agent running on various platforms which return
>>information like: state, service, message
>>Similar to Big Brother agents and Nagios agents. An easy way to extend
>>things without re-inventing the wheel is supporting Nagios and/or Big
>>Brother agent communications. Supporting multiple event messages per
>>message is also very nice (bulk messages); this avoids the event bloat
>>associated with Nagios.
>SNMP is a universal 'agent'. I don't know of any other specifically
>written agents, but Argus will easily accommodate anything you can throw
>at it: so long as the Big Brother or Nagios agent talks over IP, Argus can
>communicate with it. I can't think of a time I'd rather have a daemon
>besides SNMP on the remote machines, though.
>Argus supports summary messages ("multiple event messages per message").
>>- using queueing for receiving events. This ensures that events from
>>remote agents are never lost when the computer is too busy to process
>>incoming events. This enables better use of processing resources,
>>resiliency, and the possibility of dropping low-priority events. I do not
>>know of any open-source NMSs that really support sophisticated queuing.
>Argus doesn't receive any events. It proactively polls things. One
>could easily write a daemon to listen for traps, and have Argus monitor
>that daemon, though.
>When you receive a trap, you are depending on everything working
>correctly, which [for anything besides informational purposes] is silly.
> When you proactively poll a service, you're sure.
Each method has its advantages, and the two are complementary
(active polling and passive pushing).
I will outline some of the benefits of using remote agents. I do not
think I need to say anything about active polling, as we would not be
here without it.
A typical remote agent is usually a process that runs from cron every X
minutes and verifies the state of a device. This could involve basic
stats such as CPU, memory, interface usage, etc. It has local thresholds
and generates an event which is returned to the monitoring station. The
status the monitoring station receives is as fresh as the execution (a few
seconds for most tasks). You will note that I am not talking about SNMP
traps or syslog messages. What I am talking about is a recurring test
which sends data back to the monitoring station on a specific schedule
(say, every 5 minutes).
- fresh data arrives at monitoring station
- bulk of time spent on data munging and processing is done by the
remote host, the monitoring station has little processing to do on the
event it receives
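To make the cron-driven agent above concrete, here is a minimal Python sketch, assuming a three-colour scheme (green/yellow/red, as Big Brother used) and a hypothetical one-line text format. The names `check_threshold` and `format_line` are illustrative only, and actually transmitting the line to the monitoring station is left out.

```python
import time
from dataclasses import dataclass


@dataclass
class Event:
    host: str
    service: str
    state: str      # "green", "yellow" or "red" (assumed colour scheme)
    message: str
    timestamp: float


def check_threshold(host, service, value, warn, crit, now=None):
    """Apply the local thresholds on the remote host and build one event."""
    if value >= crit:
        state = "red"
    elif value >= warn:
        state = "yellow"
    else:
        state = "green"
    msg = f"{service}={value} (warn={warn}, crit={crit})"
    ts = now if now is not None else time.time()
    return Event(host, service, state, msg, ts)


def format_line(ev):
    # Hypothetical wire format, loosely modeled on a Big Brother status line:
    # "status <host>.<service> <state> <message>"
    return f"status {ev.host}.{ev.service} {ev.state} {ev.message}"
```

Run from cron every 5 minutes, the agent would call `check_threshold` once per metric and send each formatted line; the station only has to parse it.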
Where a remote agent actually shines is when there is lots of processing
to be done locally. This processing would be executed on the data before
any output values or events could be generated.
This could be because the agent process needs to query things that take
more time (local DB calls, statistical formulas on data, etc.).
Or it could be because the agent is actually called at the tail end of
another process. I have made use of an agent which was called at the
tail end of a network collection and trending tool (e.g. Cricket, Torrus,
etc.). The collection engine processes the data, applies its
thresholding and sends all the event data back to the monitoring station
in a stream of events.
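A collection tool that finishes a run with dozens of threshold results can ship them as one bulk message instead of one message per event. The sketch below assumes JSON as the encoding; that is a choice made for illustration, not the Big Brother or Nagios wire format, and `pack_bulk`/`unpack_bulk` are hypothetical names.

```python
import json


def pack_bulk(host, results):
    """Combine many (service, state, message) results into one bulk message."""
    return json.dumps({
        "host": host,
        "events": [
            {"service": s, "state": st, "message": m} for s, st, m in results
        ],
    })


def unpack_bulk(payload):
    """Monitoring-station side: split one bulk message back into events."""
    doc = json.loads(payload)
    return [(doc["host"], e["service"], e["state"], e["message"])
            for e in doc["events"]]
```

One connection per collection run keeps the per-event overhead on the station low, which is the "event bloat" concern raised earlier in the thread.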
The timing is important in both cases above: when the events are
ready, they should be sent immediately to the monitoring station.
In terms of pure scalability, active service checks are much easier to
manage, as everything is centralized. Thus a basic SNMP agent on
remote hosts can fill most needs, whether that agent executes local
commands (Net-SNMP) or just maps /proc info to a MIB variable. This is
much cleaner to maintain than custom agents. Side note: a custom agent
such as the basic Big Brother message agent is just a 50kb C program,
very portable. This 50kb message agent's only purpose is to format the
message and send it to the Big Brother monitoring station.
When dealing with remote agents sending back data, you are not doing
anything silly; you are distributing the logic and processing. These
agents do require some logic on the monitoring station side, namely
a dead timer, plus re-activation logic for when the data starts
coming back in.
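The dead timer and re-activation logic could look roughly like this sketch. It assumes the station records a timestamp for every pushed (host, service) pair and runs a periodic sweep; the class name, the `timeout_s` parameter and the use of "purple" as the stale state (the colour Big Brother used) are illustrative.

```python
class DeadTimer:
    """Track when each pushed service last reported; mark stale ones purple."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}   # (host, service) -> time of last report
        self.state = {}       # (host, service) -> last reported state or "purple"

    def receive(self, host, service, state, now):
        """Record a pushed event. Returns True if the service was stale
        and has just been re-activated."""
        key = (host, service)
        was_stale = self.state.get(key) == "purple"
        self.last_seen[key] = now
        self.state[key] = state
        return was_stale

    def sweep(self, now):
        """Mark services with no data for timeout_s as purple.
        Returns the newly stale keys; no alarms are raised here."""
        stale = []
        for key, seen in self.last_seen.items():
            if now - seen > self.timeout_s and self.state[key] != "purple":
                self.state[key] = "purple"
                stale.append(key)
        return stale
```

Going purple rather than red matches the idea that an out-of-touch service is unknown, not down.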
Suppose I have a remote workstation running SAN management software:
- I use an agent sending data about the SAN to the monitoring station
- I also configure a service check on the monitoring station for the
workstation: ping, ssh, disk usage, cpu, mem, important processes
Suppose something happens to the process sending back info about the SAN,
and the events stop reaching the monitoring station. If after X minutes
the workstation has not sent any state data for the SAN services, the
monitoring station should trigger a dead timer and use a colour
(Big Brother used purple) to indicate that these services are
out of touch. They could be down, they could be up; you don't know. No
alarms are generated.
During all this time, the workstation is still monitored by the basic
service checks.
The value of remote agents is that you can get up-to-date bulk state
information that would not otherwise be available, at the
expense of sometimes having large blocks of state information turn stale.
SNMP daemons as remote agents with active polling are great when the
data needed is available without delay. (And yes, I do know about
Net-SNMP's ability to check data values from local files that have been
pre-fetched, but this is not very elegant or scalable)
I believe that using both types, polling and pushing, is complementary and
is a building block of a full-featured network management system.
Dealing with SNMP traps from remote devices is what *I* call silly. All
events should instead be sent to syslog, munged, and *then* sent to
the monitoring station as ad-hoc events, or aggregated and sent to a
support mailing list, etc. Ad-hoc events are what eat up most sysadmins'
time until they automate and profile those events. Treating ad-hoc events
should be one of the *last* steps of a network management system; it is
just gravy, IMHO.
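The munging step can be as simple as collapsing repeated syslog lines into counted patterns before anything reaches the monitoring station or a mailing list. This sketch assumes messages arrive as (host, message) pairs and that replacing digits is an adequate normalisation; real log profiling would refine both, and `normalize`/`aggregate` are hypothetical helpers.

```python
import re
from collections import Counter


def normalize(msg):
    # Replace numbers so "disk 91% full" and "disk 93% full" group together.
    return re.sub(r"\d+", "N", msg)


def aggregate(lines):
    """lines: iterable of (host, message).
    Returns (count, host, pattern) tuples, most frequent first."""
    counts = Counter((host, normalize(msg)) for host, msg in lines)
    return sorted(((n, h, p) for (h, p), n in counts.items()),
                  key=lambda t: (-t[0], t[1], t[2]))
```

The aggregated summary, rather than every raw line, is what would be forwarded as an ad-hoc event or mailed to the support list.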
As a side note:
Concerning queuing and events, there *is* such a thing as reliable event
messaging. See MQSeries, or other MQ-type products. Unfortunately there
does not exist any open-source message queuing (MQ) software at this time.
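Dropping low-priority events under load, as mentioned in the original feature list, can be sketched with a bounded in-memory queue. Unlike MQSeries-style messaging this is not reliable: anything queued is lost if the process dies. The class and its priority convention (higher number means more important) are assumptions for illustration.

```python
import bisect
import itertools


class EventQueue:
    """Bounded queue: when full, the least important event is dropped."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = []              # sorted by (-priority, seq): most important, oldest first
        self._seq = itertools.count() # tie-breaker so events are never compared

    def put(self, priority, event):
        """Queue an event. Returns the event dropped to make room, or None."""
        bisect.insort(self._items, (-priority, next(self._seq), event))
        if len(self._items) > self.capacity:
            return self._items.pop()[2]   # lowest priority, newest among equals
        return None

    def get(self):
        """Return the most important (then oldest) event, or None if empty."""
        return self._items.pop(0)[2] if self._items else None

    def __len__(self):
        return len(self._items)
```

A busy station would keep accepting events and drain them with `get` as processing time allows, so only the least important ones are lost when the backlog overflows.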
>>- support for maintenance periods in alerts and reporting. This would
>>include recurring periods, one time scheduled periods, administrative
>>reason, contact name.
>yep. docs on "cron" are included.
Well, no sense in re-inventing the wheel. :-) I went through the docs and
missed that one. I saw it in the example configurations, though.
If there is a web interface to the cron maintenance window data, then
this is wonderful! If not, well, it can always be added.
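The maintenance-window data described in the feature request (recurring periods, one-time periods, reason, contact) could be modelled as below. This is a generic sketch, not Argus's cron syntax; the `Window` fields and the `active_window` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Window:
    reason: str                         # administrative reason
    contact: str                        # contact name
    start: Optional[datetime] = None    # one-time window start
    end: Optional[datetime] = None      # one-time window end (exclusive)
    weekday: Optional[int] = None       # recurring: Monday=0 ... Sunday=6
    start_hour: int = 0
    end_hour: int = 24

    def covers(self, t: datetime) -> bool:
        if self.start is not None and self.end is not None:
            return self.start <= t < self.end
        if self.weekday is not None:
            return (t.weekday() == self.weekday
                    and self.start_hour <= t.hour < self.end_hour)
        return False


def active_window(windows, t):
    """Return the first maintenance window covering time t, or None.
    Alerts would be suppressed and reports annotated while one is active."""
    for w in windows:
        if w.covers(t):
            return w
    return None
```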
>>To win over people and developers you need to support user hooks:
>>the ability to create your own service checks, external actions, and
>>support for third-party agents. A monitoring platform should be able to
>>leverage external utilities and also *be* leveraged by external
>>utilities. No system is in a vacuum.
>It's clear you haven't even tried Argus :) Argus is a good thing.
>Argus plays well with external utilities. It provides an 'argusctl'
>program, which lets external utilities play with Argus.
You are right, I have not.
Keep in mind, I am trying to provide constructive ideas to enhance what
Argus provides. If I did not find Argus very interesting, I would not be
having this discussion. :-)
Thank you for your response.
Wish you a great day.
Acktomic Net Architects Inc.