Nagios deep dive: retention.dat and modified_attributes · 23 November 2010, 12:26
When Nagios core (the daemon, typically started by a script in /etc/init.d/) starts up, it follows a rather involved process to turn the configuration files and domain-specific language (DSL) contained within them into in-memory objects – the 10000 foot view of this process is:
- Parse and validate nagios.cfg
- Parse all text-based configuration files (or read the specified objects pre-cache file) based on the cfg_file, cfg_dir, and precached_object_file directives in nagios.cfg and the command line options passed to Nagios, validating the syntax of each file as it is read.
- Parse retention.dat and load desired persistent object attributes into memory for any objects that exist based on the flat configuration file or objects.pre-cache contents read in from previous steps (any objects stored in retention.dat that do not have counterparts in the objects pre-cache file or Nagios DSL-based configuration files are ignored).
modified_attributes tells Nagios which attributes of an object should be loaded into memory as Nagios reads object state from retention.dat; the code that uses this field (all DSL-related code is in the xdata/ directory of the source tree) treats it as a bitmask, using bitwise operations on the constants below to record and test which attributes should be read into memory for an object and which should be ignored.
From include/common.h:
#define MODATTR_NONE 0
#define MODATTR_NOTIFICATIONS_ENABLED 1
#define MODATTR_ACTIVE_CHECKS_ENABLED 2
#define MODATTR_PASSIVE_CHECKS_ENABLED 4
#define MODATTR_EVENT_HANDLER_ENABLED 8
#define MODATTR_FLAP_DETECTION_ENABLED 16
#define MODATTR_FAILURE_PREDICTION_ENABLED 32
#define MODATTR_PERFORMANCE_DATA_ENABLED 64
#define MODATTR_OBSESSIVE_HANDLER_ENABLED 128
#define MODATTR_EVENT_HANDLER_COMMAND 256
#define MODATTR_CHECK_COMMAND 512
#define MODATTR_NORMAL_CHECK_INTERVAL 1024
#define MODATTR_RETRY_CHECK_INTERVAL 2048
#define MODATTR_MAX_CHECK_ATTEMPTS 4096
#define MODATTR_FRESHNESS_CHECKS_ENABLED 8192
#define MODATTR_CHECK_TIMEPERIOD 16384
#define MODATTR_CUSTOM_VARIABLE 32768
#define MODATTR_NOTIFICATION_TIMEPERIOD 65536
The default value for modified_attributes is 0, which means: ignore all attributes from retention.dat that have counterpart constants in common.h.
When an object’s state for the fields listed is changed as Nagios runs, Nagios changes the value of the modified_attributes field to include the constant that represents the field; this allows the retention.dat parsing code to know which attributes to read into memory as an object is parsed from retention.dat into memory when Nagios starts.
A common use case showing this process:
- User logs into the Nagios UI and disables active checks for a host along with notifications
When these two actions are processed, Nagios core will then change modified_attributes to indicate that the state of the notifications_enabled and active_checks_enabled fields were changed from their default values by setting modified_attributes to 3, which is the result of code similar to this:
modified_attributes |= MODATTR_NOTIFICATIONS_ENABLED;
modified_attributes |= MODATTR_ACTIVE_CHECKS_ENABLED;
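To make the arithmetic concrete, here is a minimal Python sketch of the same idea. It is not Nagios code: the dictionary copies a few of the constant values from common.h listed earlier, and the encode and decode helper names are made up for illustration.

```python
# Subset of the MODATTR_* values from include/common.h (copied from the list above).
# The lowercase keys and the helper names are illustrative, not Nagios identifiers.
MODATTR = {
    "notifications_enabled": 1,
    "active_checks_enabled": 2,
    "passive_checks_enabled": 4,
    "event_handler_enabled": 8,
    "flap_detection_enabled": 16,
}


def encode(changed):
    """OR together the constants for every attribute that was changed."""
    mask = 0
    for name in changed:
        mask |= MODATTR[name]
    return mask


def decode(mask):
    """Return the names of the attributes whose bit is set in mask."""
    return [name for name, bit in MODATTR.items() if mask & bit]


mask = encode(["notifications_enabled", "active_checks_enabled"])
print(mask)          # 3
print(decode(mask))  # ['notifications_enabled', 'active_checks_enabled']
```

If you generate your own retention.dat files, the same decode step is a quick way to check that the modified_attributes value you write out actually names the fields you expect Nagios to restore.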
When Nagios is stopped, it serializes all objects from memory to disk – the modified_attributes attribute is one of the attributes written to disk.
Our team writes its own retention.dat files from Nagios object state stored in a database, as part of our current distributed Nagios implementation. Knowing how modified_attributes works let us fix a long-standing bug in our code that caused attributes for hosts and services that had been modified in-flight to be ignored when Nagios started. We hope this short article helps you avoid the same bug.
Special thanks to my managers Mike Fischer and Eric Scholz at Comcast (a great place to work as a developer!) for allowing me to share information learned while at work based on our use of open source software with the community – and special thanks to Ryan Richins for his work with me on uncovering the cause of this bug in our custom Nagios configuration distribution code.
— Max Schubert
Nagios Performance Tuning - use the RAM (but be careful!), Luke · 6 January 2010, 03:04
We found that migrating as many queues and files as we reasonably can within our Nagios architecture to RAM disks makes a huge difference to the performance of a large Nagios installation. We currently poll over 15k services on over 2k hosts in less than 5 minutes, 24×7×365.
We use RHEL5; by default RHEL mounts /dev/shm as a RAM disk with 50% of physical RAM available to the partition.
Our opinion on using RAM disks for temporary storage is controversial; a number of users on the Nagios users and developers lists have told me that disks with big caches should be as fast as RAM as files are cached in RAM, but our experience has shown that nothing beats a RAM disk for a fast queue directory or file. Our experiences also taught us that when moving queues to RAM it is very important to also implement supporting code that ensures important data is persisted across reboots or can easily be re-created across reboots.
Our experience is based on machines with SCSI disks in RAID 0, 5, and 1+0 configurations.
Queues and files we moved to RAM that sped up our Nagios architecture noticeably (by over 40% in total):
Nagios (nagios.cfg)
- log_file
- object_cache_file
- status_file
- temp_file
- temp_path
- check_result_path
- state_retention_file
Moving log_file, object_cache_file, and status_file to RAM speeds up the CGIs in a larger environment. Moving temp_file, temp_path, check_result_path, and state_retention_file to RAM lowers the latency for Nagios in a larger environment.
We have also taken the radical step of moving all configuration files into RAM, as well as plugins. We use ePN extensively, and every time Nagios goes to run an ePN plugin it checks to see if the plugin has changed. After moving plugins to RAM we noticed a speed-up.
IMPORTANT NOTE – Do not move everything to RAM without putting in custom, periodic scripts or other processes that back up important files from RAM to real disk so that if the host crashes they can be quickly recovered or re-created!
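As an illustration of the kind of periodic backup process meant here, the following Python sketch copies a list of files from a RAM disk directory to a directory on real disk. It is a sketch under assumptions, not our production code: the function name, the directory arguments, and the file list are all hypothetical and would need to match your own nagios.cfg paths.

```python
import shutil
from pathlib import Path


def backup_ram_files(ram_dir, backup_dir, names):
    """Copy each named file from the RAM disk to persistent storage.

    Returns the list of names that were actually copied.
    """
    ram_dir, backup_dir = Path(ram_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = ram_dir / name
        if not src.is_file():
            continue  # nothing to back up yet, e.g. Nagios has not written it
        tmp = backup_dir / (name + ".tmp")
        shutil.copy2(src, tmp)          # copy contents and timestamps
        tmp.replace(backup_dir / name)  # rename into place so a crash mid-copy
        copied.append(name)             # never leaves a truncated backup
    return copied
```

A call such as backup_ram_files("/dev/shm/nagios", "/var/backups/nagios", ["retention.dat"]) (both paths are placeholders) could be run from cron every few minutes; a matching restore step at boot would copy the files back before Nagios starts.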
SNMPTT (snmptt.ini)
The spool file for checks is a good one to move to RAM and speeds up processing.
PNP (npcd.conf and process_perfdata.conf)
The NPCD queue is another directory we moved to RAM and noticed a nice jump in processing time for NPCD.
Summary
Moving any of the above queues to RAM disks will increase the overall speed of your Nagios architecture; the Nagios-specific configuration changes make a very noticeable difference, but at the price of some additional supporting code to ensure the robustness of critical data. We developed this list over a period of 3-6 months, so take your time if you decide to implement any of the changes mentioned in this article; also make sure you have Nagios trending metrics in place beforehand so you can see what kind of difference the above changes make, if any, to your installation.
Special thanks to my managers Eric Scholz, Mike Fischer, and Jason Livingood for allowing us to share our experiences and knowledge with the general public, and extra special thanks to my teammates Ryan Richins and Shaofeng Yang for their work with me in creating an ever-changing and improving Nagios architecture that is stable and gives us incredible performance.
We are still hiring :), contact me if you are interested in working on a terrific team doing interesting and innovative work.
— Max Schubert
Updated Nagios::Plugin::SNMP and Nenm::Utils on GitHub (on CPAN this week) · 27 August 2009, 00:19
I have released version 1.2 of Nagios::Plugin::SNMP to GitHub:
http://github.com/perldork/nagios—plugin—snmp/tree/master
This release includes:
- Many bug fixes
- Delta processing for SNMP counters with a framework that allows you to plug in your own delta calculation routine! This version requires Cache::Memcached (and a memcached instance somewhere on the network the code can reach) to do delta processing. Delta processing itself, however, is optional, so you do not need Cache::Memcached installed to use the module without the delta processing features.
- Clustered SNMP-agent aware code – for cases where one agent out of N will have a specific OID or OIDS, you can specify multiple hosts to Nagios::Plugin::SNMP and it will try to retrieve the OID from each host listed in turn; it will only die with an error if all hosts fail to return the requested OID.
Additionally, I have released an updated version of the Nenm::Utils module that I initially created for the Syngress Nagios book project I led. This version includes:
- Multiple bug fixes
- More flexible threshold processing
This module is also available on the book site
My team at work uses both of these modules extensively to query several thousand SNMP-based agents every 5 minutes.
Special thanks to:
My teammates Ryan Richins and Shaofeng Yang for their extensive contributions to both of these modules.
My managers at Comcast, Mike Fischer and Jason Livingood, for allowing us to contribute code we have done at work back to the open source community.
Comcast is hiring! Our team is looking for a talented developer with systems administration experience to join our team. Let me know if you are in the northern Virginia area of the US and are looking for a fun and challenging place to work :).
— Max Schubert
Nagios Performance Tuning: Early Lessons Learned, Lessons Shared. Part 5 - Circular Dependency Checking · 6 August 2009, 17:36
NOTE – we are using Nagios 3.0.3, which does not have the very cool patch for the circular dependency checking algorithm recently introduced into the Nagios 3.1.x release tree.
Our startup times for our Nagios instances jumped dramatically today (more than 6x) due to some of our users adding large numbers of new services to their hosts that are associated with their hosts through the
service -> hostgroup -> host
relationship I have discussed often and that we make use of often. We always want our Nagios instance to start on a 5 minute interval as we push most of the performance data we get back from checks into a long-term trending data warehouse.
We also test every configuration release in an integration and test environment before doing a deployment.
With this in mind, we decided to try turning off circular dependency checking on startup for our production Nagios instances.
On one instance this reduced startup time from 763 (!) seconds to 16 seconds; on the other, startup time dropped from 158 seconds to 6 seconds.
There you have it, a simple way to dramatically reduce startup times, but again, only do this if you test your configuration beforehand in an environment with circular dependency checking on.
— Max Schubert
Why do I get an 'uninitialized value' error message from Getopt/Long.pm when Nagios runs my perl-based plugin under ePN? · 25 July 2009, 15:48
Had this message while debugging an ePN-based script today:
**ePN /data/nagios/etc/customers/tean/project/plugins/check_plugin_name.pl: "Use of uninitialized value in pattern match (m//) at /usr/lib/perl5/5.8.8/Getopt/Long.pm line 848,".
I was very puzzled by this, as I had never seen that error before. We run 20-30 or more ePN-based scripts, and obviously I don’t maintain that code, so how could I have introduced a bug into it?
Answer: I didn’t. What I did do was define a custom attribute for a service but not put any spaces after the attribute in my service definition. E.g.
define command {
command_name check_plugin_name
command_line $USER10$/team/project/plugins/check_plugin_name.pl \
--check-interval $_SERVICE_PROJECT_CHECK_INTERVAL$ \
--hostname $HOSTADDRESS$ \
$_SERVICE_PROJECT_ALT_HOSTS$ \
-p '$_HOST_SNMP_PORT$' \
--snmp-version 2c \
--rocommunity $_HOST_SNMP_COMMUNITY$ \
--timeout $_HOST_PLUGIN_TIMEOUT$ \
-c '$_SERVICE_PROJECT_CRIT$' \
$_SERVICE_PROJECT_WARN$
}
Notice that at the end of the command line I reference $_SERVICE_PROJECT_WARN$. This style of custom attribute calling lets the user set a warning threshold in the service definition if they want to, like so:
define service {
...
__project_warn -w my_threshold_specification
...
}
But if they don’t, no changes are needed to the command definition to let it work as the command does not require a warning threshold.
However I then defined the attribute like so in my service definition:
define service {
...
__project_warn<-- end of line, no spaces!
...
}
This caused Nagios to substitute a null or some other non-printable character as the value of the attribute in the command line before executing it, which in turn got passed through to Getopt/Long.pm as an undefined option name.
The fix .. just add spaces and an empty string to the attribute in the service definition :)
define service {
...
__snmp_port 161
__project_warn ''
...
}
Voila, no undefined option.
Could be a candidate for either a Nagios custom attribute value fix or a Getopt/Long.pm fix, I am thinking Getopt::Long should set an undefined option name to the empty string so that developers do not have to guard for this condition.
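Until one of those fixes exists, a pre-deployment lint can catch the problem. The Python sketch below scans object definition text for custom attributes (names starting with an underscore) that have no value at all. It is only a sketch: the function name is made up, and it assumes simple "define type { ... }" blocks with ";" comments rather than implementing the full Nagios configuration grammar.

```python
import re

DEFINE_RE = re.compile(r"^\s*define\s+(\w+)\s*\{")
# A custom attribute line with a name starting with "_" and nothing after it.
EMPTY_CUSTOM_RE = re.compile(r"^\s*(_\w+)\s*$")


def find_empty_custom_attributes(text):
    """Return (line_number, object_type, attribute) for each valueless custom attribute."""
    findings = []
    current_type = None
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.split(";", 1)[0]  # drop trailing ';' comments
        match = DEFINE_RE.match(line)
        if match:
            current_type = match.group(1)
        elif line.strip() == "}":
            current_type = None
        elif current_type:
            empty = EMPTY_CUSTOM_RE.match(line)
            if empty:
                findings.append((number, current_type, empty.group(1)))
    return findings
```

Running this over every object file in a test environment and failing the release when the list is non-empty would have flagged the bare __project_warn line above before it reached the plugin.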
— Max Schubert
Nagios patch withdrawal: only send recovery escalation notifications for services if a problem escalation notification was sent · 24 July 2009, 18:16
Well, I hate to say it, but mea culpa: I had to withdraw the first attempt at the patch I did in an earlier article (which I have hidden for now to make sure others do not download it) that was supposed to fix escalation recovery notification behavior.
My first attempt at the patch was overly naive; if you downloaded it, please remove it from your installation as it will most likely not work for you. It does work for us, but our configuration is very unique and very different from how most people use Nagios.
I have a new version in place at my job and I will be releasing that version next week or the week after next. Why might you trust this new one after my poor first attempts?
- The bugs in it were found through a team code review, so now 3 sets of eyes have looked at the code and they will look at it again before I release to the public.
- I have tested, and will test again, the patch with configurations that resemble how most people use Nagios, in addition to our own unique configuration, to ensure the patch works for the vast majority of Nagios systems.
My apologies if you downloaded and used the earlier patches; thankfully it will not corrupt data etc, just does not do what I promised it would do.
The current version is working for us and with typical configurations as well; I am just not going to repeat the mistakes I made last time, as I know how frustrating it is to back out code.
— Max Schubert
Nagios Performance Tuning: Early Lessons Learned, Lessons Shared. Part 4.5 - Scalable Performance Data Graphing · 13 March 2009, 16:40
In my previous post on Scalable Performance Data Graphing with Nagios, I discussed how our team is using PNP, NPCD, and modpnpsender.o to send performance data from our polling server to our report server and then process it.
A week ago our report server hit its upper limit on the number of PNP performance data events it could process (8800 events every 5 minutes). We trend on a dozen or so poller and report server metrics, including the age of events in the NPCD queue; our queue went from having one minute in every 5 where it was completely empty to having over 30,000 backlogged events and growing in a 24 hour period. This backlog meant the RRD files (and consequently the PNP UI) were up to 15 minutes behind reality as well.
I started at the beginning of the week with tuning our NPCD threads and sleep time parameters. At first I tried starting more threads and sleeping less, but the server was so overwhelmed that this had the opposite effect; the queue grew.
Next I played with starting fewer threads and eventually found that 10 threads every 5 seconds was at least letting NPCD start to drain the queue. After 48 hours (!) the queue was down to 3k events, with no events in queue older than 119 seconds. Better, but not good enough for us to say the problem was fixed.
My colleague, Ryan Richins, remembered seeing documentation on the PNP site on integrating rrdcached with PNP. I had vaguely remembered seeing it, so I gave it a second look. Ryan, meanwhile, downloaded the source to the latest stable versions but did not find that rrdcached was included. He then re-read the PNP page and we eventually downloaded the latest trunk snapshot of RRD Tool, knowing that it might not be production ready. This version did contain rrdcached.
The configure options we used are:
export LDFLAGS="-L/lib64 -lpango-1.0 -lpangocairo-1.0"
CPPFLAGS="-I/usr/include/cairo -I/usr/include/pango-1.0"
CPPFLAGS="$CPPFLAGS -I/usr/include/cairo"
CPPFLAGS="$CPPFLAGS -I/usr/include/pango-1.0/pango"
export CPPFLAGS
./configure \
  --prefix=/path/to/base \
  --enable-perl \
  --disable-ruby \
  --disable-lua \
  --disable-python \
  --disable-tcl \
  --with-rrdcached
After installing rrdcached, we downloaded the latest version of PNP and looked at the rrdcached integration. We have done a lot of local changes to process_perfdata.pl, so we back-ported the RRD integration code into our process_perfdata.pl script (just 6 lines or so of code!). We tested rrdcached on our integration environment with this /etc/default/rrdcached file:
RUN_RRDCACHED=1
RRDCACHED=/path/to/rrdcached
RRDCACHED_USER="nagios"
OPTS="-w 60 -z 6 -f 120 -F -j /path/to/temp/dir -t 10"
PIDFILE="/var/run/rrdcached/rrdcached.pid"
SOCKFILE="127.0.0.1:45000"
SOCKPERMS=0666
On our integration environment this performed quite well, so we rolled it out to production. In production, performance was not as good, so I changed the parameters to the following, which so far seems to be the best combination I have found for us:
RUN_RRDCACHED=1
RRDCACHED=/path/to/base/bin/rrdcached
RRDCACHED_USER="nagios"
OPTS="-w 75 -z 1 -f 900 -F -j /path/to/journal/dir -t 8"
PIDFILE="/var/run/rrdcached/rrdcached.pid"
SOCKFILE="127.0.0.1:45000"
SOCKPERMS=0666
IMPORTANT NOTE – when using rrdcached, if you need to restart it and npcd with a new set of parameters, do the following:
- Stop npcd
- Use pgrep or the equivalent on your system to confirm that all running process_perfdata.pl processes have completed
- Restart rrdcached
- Restart npcd
If you do not let all process_perfdata.pl processes stop before restarting rrdcached, you will lose data.
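The waiting step is the one that is easy to get wrong by hand, so here is a small Python sketch of it. The helper names, timeout, and polling interval are illustrative assumptions; the only external tool used is pgrep with its -f option, which matches against the full command line.

```python
import subprocess
import time


def pgrep_count(pattern):
    """Count processes whose command line matches pattern, using pgrep -f."""
    result = subprocess.run(["pgrep", "-f", pattern], capture_output=True, text=True)
    return len(result.stdout.split())


def wait_until_gone(count_running, timeout=300.0, poll=1.0):
    """Poll count_running() until it returns 0; return False if timeout expires first."""
    deadline = time.monotonic() + timeout
    while count_running() > 0:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True
```

Between stopping npcd and restarting rrdcached, a wrapper script could call wait_until_gone(lambda: pgrep_count("process_perfdata.pl")) and refuse to continue if it returns False.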
So after this change our queue came down to about 1.5k events, but we still were constantly processing events (no empty minute). My coworker and I started discussing using memcached to queue events and then a light went on for us .. why not use a RAM disk for the NPCD queue? Since rrdcached keeps a journal file in case it crashes, the risk from a crash is just losing 5 minutes of data or less, which was acceptable to us. By the way, this idea was not an original one; I had seen it on the Nagios / PNP lists before, I just had not considered it fully.
So, I changed the queue directory for npcd to be this (RHEL 5.x):
/dev/shm/perfdata.ram
I also changed the path for the process_perfdata.log file to be
/dev/shm/var/perfdata.log
The size of the queue in MB with 8800 events in 5 minutes is only 1-2 MB on average, so not much risk of clogging the real memory on this system (we have about 500 MB free according to top and snmpd).
I then restarted npcd and watched our NPCD metrics. Voila! Back down to 4 minutes to process all PNP files and an empty minute to spare :), with a max event queue size of 1.2k.
Something Ryan and I were expecting to see during all this, but did not, was a decrease in CPU I/O wait. The load on the server has slightly decreased and the server definitely feels more responsive, but we are still seeing a consistent 20-25% I/O wait on the CPU.
Maybe there is more we can tune? Ideas welcomed.
Special thanks to Mike Fischer, my manager at Comcast, for allowing me to share my experiences at work online; special thanks to Ryan Richins, my talented teammate, for his hard work with me on our Nagios system and welcome to Shaofeng Yang, our new teammate!
— Max Schubert
PNP-aware version of Drraw released · 27 February 2009, 11:43
I have been looking for a while for a tool to let me and our users create custom web-based dashboards from PNP RRD files using a web interface.
On the PNP users mailing list someone mentioned a perl-based tool to create dashboards from RRD files called Drraw. I installed it (very easy!) and it is quite a cool tool for this purpose. Very full-featured and flexible. The tool was written by Christophe Kalt.
I saw a few things I did not like about it:
- PNP RRD files just have numerical DS names and so in the graph/template creation UI I was just seeing DS labels named “1,” “2,” etc .. which wasn’t very human friendly.
- The program makes you type in the path to the CGI.
- The CSS seemed a bit unreadable to me.
So I added code to the project that will read from the XML meta-data descriptors PNP creates along with RRD files so that when you go to create a new template/graph in Drraw you see the DS names as specified in the perfdata output from your Nagios plugins. I also cleaned the CSS up, renamed the CGI to index.cgi, and included a little Apache configuration snippet to make it easy to set up Drraw in Apache with the index.cgi file being used as the directory index.
http://github.com/perldork/drraw-pnp/tree/master
Hope you find it useful; I have interest in integrating this functionality into PNP .. if you have interest as well and are familiar with PNP, perl, and PHP, write me, I welcome help.
Update – Christophe added PNP integration code to the project independently of my work, and his release JUST came out today :). Feel free to use my variant, but I am discussing with him, and will be talking with other developers, about rolling my changes back into the main line and helping with the project as a developer.
— Max Schubert
Nagios Performance Tuning: Early Lessons Learned, Lessons Shared. Part 4 - Scalable Performance Data Graphing · 18 January 2009, 00:28
In the last three parts of this short series I have discussed the techniques my teammate and I have used at work to tune our Nagios poller so that it completes all configured checks within a five minute interval. I will now discuss how we are storing and graphing this data in a way that will scale as our installation continues to grow (we are currently graphing data from over 5000 checks every 5 minutes).
The Nagios Plugin API and Performance Data sections of the online Nagios documentation discuss plugin development and current performance data format specifications in great detail; check out both links if you are not familiar with Nagios plugin performance data.
While Nagios does not come out of the box with a performance data graphing framework, it should come as no surprise that there are a number of ways to send performance data from Nagios to external graphing systems:
- Have Nagios write the performance data returned from a host or service check to an external file after each check returns
- Have Nagios call a program directly to process performance data after every host or service check returns
- Use a NEB (Nagios Event Broker) module to process service or host performance data after each service or host check is run
As with any other configuration choice with Nagios, each method has benefits and drawbacks in terms of its implementation difficulty and its effect on Nagios performance and resource utilization on the host running Nagios.
Since we are focusing on scalable graphing, our goals are as follows:
- Minimize the disk I/O impact on the Nagios polling server so that the majority of its processing power stays focused on completing all configured checks within a five minute interval.
- Minimize the amount of scheduling skew the additional performance data processing imposes on the Nagios poller.
- Maximize the number of performance data graphing requests that can be made within the polling interval.
- Have the actual storage and graphing of all performance data done on a server other than the Nagios poller, preferably one that can be dedicated to graphing.
There are a number of graphing frameworks available for Nagios; in this article I will focus on PNP – PNP Is Not Perfparse. It is a mostly well-documented, flexible framework. For Nagios administrators who currently use both Cacti and Nagios, I highly recommend considering PNP as an alternative. PNP eliminates the need to administer device and service configurations in two places and also means no double-polling to get both fault management and trending data.
PNP consists of four discrete components:
- A perl script, process_perfdata.pl, that takes service or host check performance data and updates one or more RRD database files.
- A threaded C-based daemon (NPCD) that can be used to spawn one or more process_perfdata.pl script instances. The end user can configure the number of instances of process_perfdata.pl to spawn at a time and how often NPCD should spawn new process_perfdata.pl processes.
- An undocumented NEB module, modpnpsender.c, that will send an XML representation of perfdata from a host or service check over a TCP socket to a location you specify.
- A PHP-based framework for viewing graphs using the RRD files created by process_perfdata.pl and custom PHP templates created by the end user.
PNP can integrate with Nagios in a number of ways:
- process_perfdata.pl called directly by Nagios using the perfdata command options of Nagios in nagios.cfg
- NPCD is run on the Nagios poller, Nagios is configured to create :: delimited queue files in a queue directory on the server, NPCD then calls process_perfdata.pl
- Nagios runs modpnpsender.o. modpnpsender.o sends perfdata over a TCP socket to a server dedicated to graphing. process_perfdata.pl is called by inetd to directly update RRD files on the graphing server.
- Nagios runs modpnpsender.o. modpnpsender.o sends perfdata over a TCP socket to a server dedicated to graphing, a simple inetd script listens for incoming PNP connections and creates NPCD-formatted queue files in the NPCD queue directory, then NPCD calls process_perfdata.pl to update RRD files.
The first two methods above place all the disk I/O burden associated with RRD files on the Nagios poller; while this is perfectly fine for smaller installations, it is not good for a larger installation. Additionally, methods one and two cause Nagios to pause as it runs the perfdata processing commands. In our environment this caused check scheduling skew to happen at an unacceptable rate. With just 1800 services we were skewing by over a minute a day for checks, i.e. a check that was initially scheduled to run at minute 01 of the hour was running at minute 02 of the hour by day two.
modpnpsender.o is a NEB module that registers for service events; when a service event occurs within Nagios modpnpsender opens a TCP connection to a remote server, sends an XML representation of the event to the remote side, then closes the socket. This transaction does not take more than a second or two depending on where in the network your reporting host sits in comparison to your Nagios poller. We made a few minor modifications to the code (which we will release in the near future) to enhance the functionality of the NEB module.
Our first modification was to add in fork() code to the NEB module. While the Nagios documentation says never to fork in a NEB module, without the fork we found that our service check schedule was skewing almost as significantly with the NEB module in place as it had been calling process_perfdata.pl directly from Nagios via the process perfdata external command options in nagios.cfg. This occurred because Nagios waits for the NEB module to finish processing before it continues. With the fork() code in place, this skew disappeared completely; we have not seen any system instability due to the additional fork() calls.
Our second modification was to make the XML buffer size in the modpnpsender.c source file a C #define parameter, as the code had a hard-coded buffer size that was not enough to accommodate the 4096 bytes of output that Nagios allows; for checks with long perfdata output this hard-coded buffer was being overrun by the code and causing Nagios to segfault.
The third PNP architecture works much better than the first two with our modpnpsender.c modifications in place; the NEB module opens a socket and sends XML to the report server; process_perfdata.pl then reads the data from the socket via inetd and updates the RRD files associated with each metric. The problem with this method is that the report server effectively experiences a denial of service attack every polling cycle if thousands of performance data records are sent to it at a time. In our case, thousands of perfdata records would arrive within two minutes, nearly knocking the server over for a few minutes each run.
My first attempt to ameliorate this problem was to have each process_perfdata.pl instance sleep for a random number of seconds ranging from 15 – 60 before the RRD update processing occurred. While this helped, it still left the kernel tracking thousands of processes at once and did not lower the impact of each check cycle enough to be satisfactory.
The solution I found to this was the fourth option for PNP data processing listed above, which is a hybrid of the methods the PNP developers outline in the online documentation:
- The NEB module sends perfdata to the report server
- A small script reads the perfdata from the TCP socket via inetd and then writes out the performance data to a spool directory.
- NPCD then periodically processes the queue files to update the RRD check databases.
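The middle step can be very small. The Python sketch below shows the general shape of such a spool writer, not our actual script: it writes the payload unchanged (converting the XML into the queue file format NPCD expects is left out), the function name and the file name prefixes are invented for illustration, and it assumes the queue reader skips dot-files while they are being written.

```python
import os
import tempfile


def spool_payload(stream, spool_dir):
    """Read one perfdata payload from stream and write it to spool_dir atomically.

    Returns the final path, or None if the payload was empty.
    """
    payload = stream.read()
    if not payload:
        return None
    prefix = ".incoming."
    # Write under a dot-prefixed temporary name first, then rename into place,
    # so a queue reader never picks up a half-written file.
    fd, tmp_path = tempfile.mkstemp(prefix=prefix, dir=spool_dir)
    with os.fdopen(fd, "wb") as handle:
        handle.write(payload)
    unique = os.path.basename(tmp_path)[len(prefix):]
    final_path = os.path.join(spool_dir, "perfdata." + unique)
    os.rename(tmp_path, final_path)  # atomic within one filesystem
    return final_path
```

Under inetd the accepted TCP connection is handed to the script as standard input, so the entry point would amount to spool_payload(sys.stdin.buffer, queue_dir), where queue_dir is whatever directory NPCD is configured to watch. Because the script only does one sequential write and exits, the expensive RRD updates stay under NPCD's control and can be throttled there.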
So far this method is much more effective in our environment than the others are at keeping load averages and I/O wait times on the report server at reasonable levels. We are currently processing over 5000 checks on 1200 hosts in four minutes with a load average of 2 or less on the report server and I/O wait CPU percentages of 20% or less. All perfdata is ingested into RRD files within 4 minutes.
In addition to our PNP graphing, we also have a daemon running on the report server that reads the Nagios perfdata output and sends it to a corporate data warehouse for long term trending.
There it is; a scalable graphing architecture with Nagios and PNP that we believe will allow us to graph thousands more checks per five minute period than we are doing now without having to upgrade hardware.
In the next article in this series I will discuss how to use PNP to monitor the performance of your Nagios poller and report server.
Special thanks to Mike Fischer, my manager at Comcast, for allowing me to share my experiences at work online; special thanks to Ryan Richins, my talented teammate, for his hard work with me on our Nagios system. We are looking for another developer to join our team in the NoVA area; write me at maxs@webwizarddesign.com if you are interested.
— Max Schubert
Nagios Performance Tuning: Early Lessons Learned, Lessons Shared. Part 3 - Tuning The Poller · 25 December 2008, 23:48
The main Nagios configuration file, nagios.cfg, gives Nagios administrators a high degree of control over the tuning process. This flexibility can be intimidating to new Nagios users. In this article I will attempt to demystify some of the settings contained in this file and explain how they can be used to make the Nagios poller operate in a manner that meets your requirements.
All parameters in this article are discussed in the online Nagios documentation extensively; read that documentation before you read this article if you have not already.
The configuration settings that will most likely be of interest to you when tuning Nagios are:
- sleep_time
- service_inter_check_delay_method
- host_inter_check_delay_method
- max_concurrent_checks
- max_service_check_spread
- max_host_check_spread
- service_interleave_factor
If you are using host and service dependencies, then these parameters become important as well:
- cached_host_check_horizon
- cached_service_check_horizon
Host check performance tuning goes hand-in-hand with service performance tuning, as failed service checks can trigger on-demand host checks. It is also important to properly define your host alive check, retry interval, and max attempts parameters to keep host checks from dragging down poller performance.
Service check related parameters
sleep_time
This variable holds the number of seconds (or fractions of a second) Nagios should sleep between running service checks. In a large installation with solid hardware and strict timing goals, I recommend setting this parameter as low as possible, because any additional delay introduced between checks makes your overall check schedule skew more quickly over time.
Warning – if you do not configure Nagios with the --enable-nanosleep flag, you can only use positive integers for this parameter.
With nanosleep enabled, we were able to reduce this value to as little as 0.01 seconds (1/100th of a second). Anything lower will fail; if you use zero, Nagios will complain and exit.
max_service_check_spread
This parameter indicates how long, in minutes, Nagios has to execute all service checks in your configuration when it is restarted. If you care a great deal about performance data getting into a performance data store at regular intervals, set this value to the interval you are using for metrics. For example, if your performance data tool expects 5 minute samples, set this value to 5 to ensure that all service checks get run within a 5 minute interval.
This can cause administrators headaches when many checks are being run and some are time sensitive for trending while others are longer checks (like robotic checks) that are oriented more towards fault management than towards trending or deviation-from-trend alerting. One way to reduce the time you spend managing poller performance for trending is to set up two instances of Nagios (it is free to use, after all), with all time-sensitive checks on one instance and all longer-running checks on the other. This gives your team one instance that is tightly bounded by time and watched to ensure it hits its interval requirements, while the other can be a little looser on timing and take on larger numbers of longer-running checks.
service_inter_check_delay_method
Nagios is designed to be flexible and work on a wide variety of hardware. This is a good thing. The inter-check delay method is a parameter that lets the administrator tell Nagios how aggressive to be when running a set of service checks on a managed host. It tunes the delay Nagios uses between scheduled checks on a host; the more delay, the less resource impact imposed on a host. There are four settings for this parameter:
- n – no delay (all checks scheduled at once)
- d – dumb delay – 1 second between each check
- s – smart delay – the delay between checks is either the average service check interval divided by the total number of services being checked, or max_service_check_spread multiplied by 60 and then divided by the total number of services being checked. The lower of the two is used.
- x.xx – custom fixed delay of x.xx seconds.
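As a rough illustration of the smart delay arithmetic described in the list above, here is a minimal Python sketch. The function name and its parameters are my own, not anything from the Nagios source, and it simply follows the description given here (take the lower of the two quotients) rather than reproducing the real scheduler code.

```python
def smart_inter_check_delay(avg_check_interval_s, max_spread_min, total_services):
    """Per-check delay in seconds, as described in the text above.

    Hypothetical helper: names and signature are illustrative only.
    """
    if total_services <= 0:
        raise ValueError("total_services must be positive")
    # Option 1: average service check interval spread over all services.
    by_interval = avg_check_interval_s / total_services
    # Option 2: max_service_check_spread (minutes) spread over all services.
    by_spread = (max_spread_min * 60) / total_services
    # The lower of the two delays is the one used.
    return min(by_interval, by_spread)


# 1000 services, 10 minute average interval, 5 minute spread:
# the spread is the tighter bound, so it sets the delay.
example_delay = smart_inter_check_delay(600, 5, 1000)
```

With these example numbers the spread-based value wins, which is why lowering max_service_check_spread tightens the initial schedule even when check intervals are long.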
While the configuration notes say to never use the n method in production, if you have hardware that can handle it, this setting will give you a huge performance boost. The hosts we monitor can all take 5-10 service checks at once without noticeable negative performance impact, and our Nagios poller is able to handle running over 1000 service checks at any given instant. Changing from the s method to the n method significantly lowered the time we took to complete all our checks, which matters to us because our team uses Nagios to populate a time-series database as well as to do fault management.
service_interleave_factor
This parameter determines how many checks are scheduled initially on each host at a time in your configuration as the scheduler creates the initial service check queue. It has two settings:
- s – smart calculation by Nagios, designed to minimize resource impact on remote hosts
- x – you provide the interleave factor
Let's pretend we have 250 hosts and 1000 checks, 4 checks per host. If you set this parameter to 1, the scheduler will schedule all four checks on host 1, then all checks on host 2, then all checks on host 3, and so on up to host 250, in essence checking the hosts serially. If you set this parameter to 250, the scheduler will first schedule check 1 across all 250 hosts, then check 2 across all 250 hosts, then check 3, and finally check 4. In a situation where finishing checks quickly matters, we have found that setting the interleave value to the number of hosts in your configuration gives a performance boost and helps soften the effect of setting the inter-check delay to none.
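The ordering in the 250 host example can be sketched in a few lines of Python. This is a simplified model of my own that reproduces the behaviour described above; the function name and the slot formula are assumptions for illustration and are not quoted from the Nagios scheduler.

```python
import math


def interleaved_slots(total_services, interleave_factor):
    """Return the schedule slot for each service, in host-major order.

    Services are assumed to be listed host by host (host 1 check 1,
    host 1 check 2, ...). Hypothetical helper for illustration only.
    """
    total_blocks = math.ceil(total_services / interleave_factor)
    slots = []
    block = 0   # which interleave block we are filling
    index = 0   # position inside the current block
    for _ in range(total_services):
        if index >= total_blocks:
            index = 0
            block += 1
        # Consecutive services are pushed interleave_factor slots apart.
        slots.append(block + index * interleave_factor)
        index += 1
    return slots


# 250 hosts x 4 checks with an interleave factor of 250.
wide = interleaved_slots(1000, 250)
# The same services with an interleave factor of 1 (serial, host by host).
serial = interleaved_slots(1000, 1)
```

In this model, with a factor of 250 the four checks of host 1 land in slots 0, 250, 500, and 750, so check 1 is scheduled across every host before any host sees its second check. With a factor of 1 each service simply gets the slot matching its position in the list.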
max_concurrent_checks
This parameter tells Nagios how many checks it is allowed to run at once. Leaving this at 0 (no limit) minimizes the amount of time Nagios takes to complete all checks, but it also maximizes the load on the Nagios host and the bursts of network traffic Nagios produces. Setting max_concurrent_checks to 1, by contrast, would force Nagios to execute just one check at a time. We use the 0 setting because we are tied to hitting a 5 minute interval for all checks for trending purposes, we are fortunate enough to have decent hardware (dual dual-core CPU, SCSI disks, nice network), and we have enough network bandwidth that bursting every few minutes does not bother anyone.
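To make the meaning of 0 versus a positive value concrete, here is a tiny sketch of the gating decision. The function is hypothetical and only mirrors the semantics described above (0 means no limit); it is not how Nagios itself is written.

```python
def may_start_check(currently_running, max_concurrent_checks):
    """Decide whether another check may start right now.

    Hypothetical helper: 0 is treated as "no limit", matching the
    behaviour described in the text.
    """
    if max_concurrent_checks == 0:
        return True
    return currently_running < max_concurrent_checks


# With the limit at 0 even a large burst is allowed through.
burst_allowed = may_start_check(1000, 0)
# With the limit at 1 a second check must wait for the first to finish.
second_allowed = may_start_check(1, 1)
```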
Finally, if you are using service dependencies, be sure to set cached_service_check_horizon to a number of seconds equal to the smallest service check interval in use. When a service that depends on another service needs to be checked, Nagios will first check the depended-on service, so if one service is depended on by many others, setting this will keep Nagios from re-executing the depended-on service check plugin more often than it needs to.
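The rule of thumb above is easy to compute from a list of check intervals. The sketch below assumes intervals are expressed in Nagios time units and that interval_length is 60 seconds (the usual default); the function name is my own.

```python
def suggested_service_check_horizon(check_intervals, interval_length=60):
    """Return a horizon in seconds equal to the smallest check interval.

    check_intervals: intervals in Nagios time units (minutes by default).
    interval_length: seconds per time unit; 60 is assumed here.
    Hypothetical helper for illustration only.
    """
    if not check_intervals:
        raise ValueError("at least one check interval is required")
    return min(check_intervals) * interval_length


# Services checked every 1, 5, and 15 minutes: the 1 minute check wins.
horizon_seconds = suggested_service_check_horizon([5, 1, 15])
```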
Tuning is not easy but if you have the time and resources to invest in it the results can be fantastic. With the parameters discussed in this section, my team has been able to have our Nagios instance execute over 3500 checks across 900+ servers in about three minutes, still well within our five minute ‘hard’ ceiling.
Host check related parameters
A key consideration in optimizing host checks is to ensure that your host check method completes quickly and to minimize the number of times it verifies a host is up. An example given in the Nagios documentation uses the ICMP ECHO (ping) check that comes with Nagios to illustrate this, telling how it is more effective to have a check ping once and then repeat up to ten times on failure rather than have a single check that pings ten times each time it is run.
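A little arithmetic shows why the single-ping approach is cheaper. The sketch below is a deliberately simple model with assumptions of my own: a healthy host answers the first ping, a down host uses up every attempt, and "ten times" is treated as ten attempts in total. The host counts are made-up example numbers.

```python
def pings_sent(hosts_up, hosts_down, pings_per_check, max_attempts):
    """Total pings sent in one round of host checks under a simple model.

    Assumes an up host passes on its first attempt and a down host
    consumes every attempt. Hypothetical helper for illustration only.
    """
    up_cost = hosts_up * pings_per_check
    down_cost = hosts_down * pings_per_check * max_attempts
    return up_cost + down_cost


# 1200 hosts, 12 of them down (example numbers only).
# Strategy A: one ping per check, up to ten attempts.
single_ping_total = pings_sent(1188, 12, pings_per_check=1, max_attempts=10)
# Strategy B: ten pings per check, a single attempt.
ten_ping_total = pings_sent(1188, 12, pings_per_check=10, max_attempts=1)
```

Under these assumptions the single-ping strategy sends 1308 pings against 12000 for the ten-ping check, because healthy hosts, which are normally the vast majority, only ever cost one ping each.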
host_inter_check_delay_method
As with the service check inter-check delay, this controls the amount of time Nagios inserts between checks of all hosts when initially scheduling them after a restart. We use the n value to finish quickly; other values will reduce the network impact of initial host checks.
max_host_check_spread
How long Nagios has to complete all host checks. We set this to 5 minutes as that is our max polling interval. The longer the interval, the lower the load will be on your Nagios poller.
cached_host_check_horizon
How long Nagios should cache host check results. When using service dependencies, this comes in handy, as Nagios will perform a host check each time a service that is depended on fails. Caching host checks for the interval of a typical depended-on service will reduce the number of ‘on demand’ live host checks Nagios has to do and help keep your “all checks done” intervals low.
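The caching decision itself boils down to comparing the age of the last result against the horizon. Here is a minimal sketch of that idea; the function and parameter names are hypothetical, and treating a horizon of 0 as "caching disabled" is an assumption on my part.

```python
def can_reuse_cached_result(last_check_time, now, horizon_seconds):
    """Return True if a previous host check result is fresh enough to reuse.

    Times are seconds since the epoch. A horizon of 0 is assumed to
    disable caching. Hypothetical helper for illustration only.
    """
    if horizon_seconds <= 0:
        return False
    return (now - last_check_time) <= horizon_seconds


# Host was checked 40 seconds ago and the horizon is 60 seconds:
# the cached result is reused and no on-demand check is run.
fresh = can_reuse_cached_result(last_check_time=1000, now=1040, horizon_seconds=60)
# Checked 90 seconds ago: too old, so a live check is needed.
stale = can_reuse_cached_result(last_check_time=1000, now=1090, horizon_seconds=60)
```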
In this article we have briefly covered some of the more important nagios.cfg performance tuning parameters. Learning to tune Nagios effectively is a process that takes time, patience, and experience with Nagios. If you are a long time Nagios administrator or have worked with a number of network/host/service fault and performance management tools, then the time you invest may well pay off many times over in the success of your Nagios deployment.
In my next article I will discuss the methods my team is using to ingest and graph Nagios performance data using PNP in a way that we hope will scale to large numbers of hosts and services.
Special thanks to Mike Fischer, my manager at Comcast, for allowing me to share my experiences at work online; special thanks to Ryan Richins, my talented teammate, for his hard work with me on our Nagios system. We are looking for another developer to join our team in the NoVA area; write me at maxs@webwizarddesign.com if you are interested.
— Max Schubert
