Nagios Performance Tuning - use the RAM (but be careful!), Luke · 6 January 2010, 02:04
We found that moving as many queues and files as we reasonably can within our Nagios architecture to RAM disks makes a huge difference to the performance of a large Nagios installation. We currently poll over 15k services on more than 2k hosts in less than 5 minutes, 24×7×365.
We use RHEL5; by default RHEL mounts /dev/shm as a RAM disk with 50% of physical RAM available to the partition.
Our opinion on using RAM disks for temporary storage is controversial; a number of users on the Nagios users and developers lists have told me that disks with big caches should be as fast as RAM as files are cached in RAM, but our experience has shown that nothing beats a RAM disk for a fast queue directory or file. Our experiences also taught us that when moving queues to RAM it is very important to also implement supporting code that ensures important data is persisted across reboots or can easily be re-created across reboots.
Our experience is based on machines with SCSI disks in RAID 0, 5, and 1+0 configurations.
Queues and files we moved to RAM that sped up our Nagios architecture noticeably (by over 40% in total):
Nagios (nagios.cfg)
- log_file
- object_cache_file
- status_file
- temp_file
- temp_path
- check_result_path
- state_retention_file
Moving log_file, object_cache_file, and status_file to RAM speeds up the CGIs in a larger environment. Moving temp_file, temp_path, check_result_path, and state_retention_file to RAM lowers the latency for Nagios in a larger environment.
We have also taken the radical step of moving all configuration files, as well as plugins, into RAM. We use ePN extensively, and every time Nagios goes to run an ePN plugin it checks to see if the plugin has changed. After moving plugins to RAM we noticed a speed-up.
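As a rough illustration of the list above, a nagios.cfg fragment might look like the following. The directive names are the ones listed above; the /dev/shm/nagios directory and every file name under it are assumptions made for this sketch, not our actual layout.

```
# nagios.cfg fragment (sketch); /dev/shm/nagios and the file names are assumed
log_file=/dev/shm/nagios/nagios.log
object_cache_file=/dev/shm/nagios/objects.cache
status_file=/dev/shm/nagios/status.dat
temp_file=/dev/shm/nagios/nagios.tmp
temp_path=/dev/shm/nagios/tmp
check_result_path=/dev/shm/nagios/checkresults
state_retention_file=/dev/shm/nagios/retention.dat
```

Keep in mind that a RAM disk is empty after a reboot, so the directories named here have to be re-created before Nagios starts.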
IMPORTANT NOTE – Do not move everything to RAM without putting in custom, periodic scripts or other processes that back up important files from RAM to real disk so that if the host crashes they can be quickly recovered or re-created!
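A minimal sketch of such a backup process is shown below. It is not the code we run; the directory locations and the list of files to keep (retention.dat and nagios.log) are assumptions for illustration, so adjust them to whatever your nagios.cfg actually points at.

```shell
#!/bin/sh
# Hypothetical locations; adjust both to your installation.
RAM_DIR="${RAM_DIR:-/dev/shm/nagios}"
DISK_DIR="${DISK_DIR:-/var/lib/nagios/ram-backup}"

# Copy the files worth keeping from the RAM disk to real disk.
# Usage: backup_ram_files SOURCE_DIR DEST_DIR
backup_ram_files() {
    src="$1"
    dst="$2"
    mkdir -p "$dst" || return 1
    for f in retention.dat nagios.log; do
        if [ -f "$src/$f" ]; then
            # Copy to a temporary name, then rename, so a crash in the
            # middle of a copy does not leave a truncated backup behind.
            cp -p "$src/$f" "$dst/$f.tmp" && mv "$dst/$f.tmp" "$dst/$f" || return 1
        fi
    done
}

# Re-create the RAM directory tree after a reboot and restore saved files.
# Usage: restore_ram_files BACKUP_DIR RAM_DIR
restore_ram_files() {
    src="$1"
    dst="$2"
    mkdir -p "$dst/tmp" "$dst/checkresults" || return 1
    for f in retention.dat nagios.log; do
        if [ -f "$src/$f" ]; then
            cp -p "$src/$f" "$dst/$f" || return 1
        fi
    done
}
```

One way to use this would be to call backup_ram_files "$RAM_DIR" "$DISK_DIR" from cron every few minutes, and restore_ram_files "$DISK_DIR" "$RAM_DIR" from the init script just before Nagios starts.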
SNMPTT (snmptt.ini)
The spool file for checks is a good one to move to RAM and speeds up processing.
PNP (npcd.conf and process_perfdata.conf)
The NPCD queue is another directory we moved to RAM and noticed a nice jump in processing time for NPCD.
Summary
Moving any of the above queues to RAM disks will increase the overall speed of your Nagios architecture; the Nagios-specific configuration changes make a very noticeable difference, but at the price of some additional supporting code to ensure the robustness of critical data. We developed this list over a period of 3-6 months, so take your time if you decide to implement any of the changes mentioned in this article; also make sure you have Nagios trending metrics in place beforehand so you can see what kind of difference the above changes make, if any, to your installation.
Special thanks to my managers Eric Scholz, Mike Fischer, and Jason Livingood for allowing us to share our experiences and knowledge with the general public, and extra special thanks to my teammates Ryan Richins and Shaofeng Yang for their work with me in creating an ever-changing and improving Nagios architecture that is stable and gives us incredible performance.
We are still hiring :), contact me if you are interested in working on a terrific team doing interesting and innovative work.
— Max Schubert
Updated Nagios::Plugin::SNMP and Nenm::Utils on GitHub (on CPAN this week) · 26 August 2009, 23:19
I have released version 1.2 of Nagios::Plugin::SNMP to GitHub:
http://github.com/perldork/nagios—plugin—snmp/tree/master
This release includes:
- Many bug fixes
- Delta processing for SNMP counters with a framework that allows you to plug in your own delta calculation routine! This version requires Cache::Memcached (and a memcache instance somewhere on the network the code can reach) to do delta processing. Delta processing itself, however, is optional, so you do not need Cache::Memcached installed to use the module without the delta processing features.
- Clustered SNMP-agent aware code – for cases where one agent out of N will have a specific OID or OIDs, you can specify multiple hosts to Nagios::Plugin::SNMP and it will try to retrieve the OID from each host listed in turn; it will only die with an error if all hosts fail to return the requested OID.
Additionally, I have released an updated version of the Nenm::Utils module that I initially created for the Syngress Nagios book project I led. This version includes:
- Multiple bug fixes
- More flexible threshold processing
This module is also available on the book site
My team at work uses both of these modules extensively to query several thousand SNMP-based agents every 5 minutes.
Special thanks to:
My teammates Ryan Richins and Shaofeng Yang for their extensive contributions to both of these modules.
My managers at Comcast, Mike Fischer and Jason Livingood, for allowing us to contribute code we have done at work back to the open source community.
Comcast is hiring! Our team is looking for a talented developer with systems administration experience to join our team. Let me know if you are in the northern Virginia area of the US and are looking for a fun and challenging place to work :).
— Max Schubert
Nagios Performance Tuning: Early Lessons Learned, Lessons Shared. Part 5 - Circular Dependency Checking · 6 August 2009, 16:36
NOTE – we are using Nagios 3.0.3, which does not have the very cool patch for the circular dependency checking algorithm recently introduced into the Nagios 3.1.x release tree.
Our startup times for our Nagios instances jumped dramatically today (more than 6x) due to some of our users adding large numbers of new services to their hosts that are associated with their hosts through the
service -> hostgroup -> host
relationship I have discussed often and that we make use of often. We always want our Nagios instance to start on a 5 minute interval as we push most of the performance data we get back from checks into a long-term trending data warehouse.
We also test every configuration release in an integration and test environment before doing a deployment.
With this in mind, we decided to try turning off circular dependency checking on startup for our production Nagios instances.
On one instance this reduced startup time from 763(!) seconds to 16 seconds; on the other, startup time dropped from 158 seconds to 6 seconds.
There you have it, a simple way to dramatically reduce startup times, but again, only do this if you test your configuration beforehand in an environment with circular dependency checking on.
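For reference, here is a sketch of how the two environments might invoke Nagios. It assumes the check is skipped with the -x (--dont-verify-paths) command line option described in the Nagios 3 fast startup documentation; the binary and configuration paths are placeholders, not our real ones.

```shell
# Assumed install paths; override them via the environment.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"

# Integration/test environment: full verification of the configuration,
# circular path checks included. Slow, but this is where errors are caught.
verify_config() {
    "$NAGIOS_BIN" -v "$NAGIOS_CFG"
}

# Production: start as a daemon (-d) and skip the circular path check (-x).
# Only sensible for a configuration that already passed verify_config.
start_without_path_check() {
    "$NAGIOS_BIN" -x -d "$NAGIOS_CFG"
}
```

Keeping the two calls in separate functions mirrors the split described above: the slow check runs only where configurations are tested, never on the production start path.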
— Max Schubert
Easy to use ruby library for interacting with Confluence - confluence4r · 31 July 2009, 16:43
http://confluence.atlassian.com/display/CONFEXT/Confluence4r
I added a gemspec for the package to the bottom of the page if you want to build it as a gem in-house.
— Max Schubert
Why do I get an 'uninitialized value' error message from Getopt/Long.pm when Nagios runs my perl-based plugin under ePN? · 25 July 2009, 14:48
Had this message while debugging an ePN-based script today:
**ePN /data/nagios/etc/customers/tean/project/plugins/check_plugin_name.pl: "Use of uninitialized value in pattern match (m//) at /usr/lib/perl5/5.8.8/Getopt/Long.pm line 848,".
I was very puzzled by this as I had never seen that error before; we run 20-30 or more ePN-based scripts, and obviously I don’t maintain that code, so how could I have introduced a bug into it?
Answer: I didn’t. What I did do was define a custom attribute for a service but not put any spaces after the attribute name in my service definition. E.g.
define command {
command_name check_plugin_name
command_line $USER10$/team/project/plugins/check_plugin_name.pl \
--check-interval $_SERVICE_PROJECT_CHECK_INTERVAL$ \
--hostname $HOSTADDRESS$ \
$_SERVICE_PROJECT_ALT_HOSTS$ \
-p '$_HOST_SNMP_PORT$' \
--snmp-version 2c \
--rocommunity $_HOST_SNMP_COMMUNITY$ \
--timeout $_HOST_PLUGIN_TIMEOUT$ \
-c '$_SERVICE_PROJECT_CRIT$' \
$_SERVICE_PROJECT_WARN$
}
Notice that at the end of the command line I reference $_SERVICE_PROJECT_WARN$. This style of custom attribute calling lets the user set a warning threshold in the service definition if they want to, like so:
define service {
...
__project_warn -w my_threshold_specification
...
}
But if they don’t, no changes are needed to the command definition to let it work as the command does not require a warning threshold.
However I then defined the attribute like so in my service definition:
define service {
...
__project_warn<-- end of line, no spaces!
...
}
This caused Nagios to substitute a null or some other non-printable character as the value of the attribute in the command line before executing it, which in turn got passed through to Getopt/Long.pm as an undefined option name.
The fix: just add spaces and an empty string to the attribute in the service definition :)
define service {
...
__snmp_port 161
__project_warn ''
...
}
Voila, no undefined option.
This could be a candidate for either a Nagios custom attribute value fix or a Getopt/Long.pm fix; I am thinking Getopt::Long should set an undefined option name to the empty string so that developers do not have to guard against this condition.
— Max Schubert
Nagios patch withdrawal: only send recovery escalation notifications for services if a problem escalation notification was sent · 24 July 2009, 17:16
Well, I hate to say it, but mea culpa: I had to withdraw the first attempt at the patch I did in an earlier article (which I have hidden for now to make sure others do not download it) that was supposed to fix escalation recovery notification behavior.
My first attempt at the patch was overly naive; if you downloaded it, please remove it from your installation as it will most likely not work for you. It does work for us, but our configuration is unusual and very different from how most people use Nagios.
I have a new version in place at my job and I will be releasing that version next week or the week after next. Why might you trust this new one after my poor first attempts?
- The bugs in it were found through a team code review, so now 3 sets of eyes have looked at the code and they will look at it again before I release to the public.
- I have tested the patch, and will test it again, with configurations that resemble how most people use Nagios, in addition to our own unusual configuration, to ensure the patch works for the vast majority of Nagios systems.
My apologies if you downloaded and used the earlier patches; thankfully they will not corrupt data, they just do not do what I promised they would do.
The current version is working for us and with typical configurations as well; I am just not going to repeat the mistakes I made last time, as I know how frustrating it is to back out code.
— Max Schubert
Are you an expert US citizen? · 19 June 2009, 18:59
Email from a recruiter this year included a request for the following skill:
- US Citizen – 10+ years of experience Expert Required
— Max Schubert
Installing CentOS via Netboot in VMWare · 4 May 2009, 22:28
A quick blurb as I always forget this:
- Download and install VMWare Server 2.0 GPL edition (or other if you have the $$)
- Create a blank VM
- Download the netboot.iso from a CentOS web site
- Add the ISO as a virtual CD-ROM for the new virtual machine
- When prompted for installation media, choose HTTP, then enter
- In the site field – domain-name-of-mirror
- In the path field – URI to the base architecture directory
Example:
- Site – mirrors.xmission.com
- Path – centos/5.3/os/i386
No trailing slash on the domain and no leading or trailing slash on the URI path
— Max Schubert
Getting ruby 1.8.7 and newer to compile with readline support on Red Hat Enterprise Linux (RHEL4 and RHEL5) · 4 April 2009, 10:20
Paraphrased from http://www.sanft.com/2008/12/01/upgrading-to-ruby-186-on-red-hat/
First, ensure you have the following packages installed:
- readline
- readline-devel
Then make sure you remove the system ruby and ruby-devel packages, otherwise gems and other extensions might find the wrong version of ruby when they look for compile flags etc:
- ruby
- ruby-devel
After unpacking the source for ruby, do the usual:
./configure --prefix=/usr/local
make all
sudo make install
Now do the following from the ruby source directory:
cd ext/readline
/usr/local/bin/ruby extconf.rb
make
make install
To ensure that ruby now has readline support, run
/usr/local/bin/ruby -rreadline -e 1
If you get no output (which should be the result), voila, readline support is now active.
— Max Schubert
Getting ruby gem mysql native extension to install on RHEL5 / CentOS 5 · 2 April 2009, 12:04
From
http://www.wzzrd.com/2008_02_01_archive.html
If you are on a 32-bit platform:
gem install mysql -- --with-mysql-conf=/usr/bin/mysql_config --with-mysql-lib=/usr/lib/mysql
If you are on a 64-bit platform:
gem install mysql -- --with-mysql-conf=/usr/bin/mysql_config --with-mysql-lib=/usr/lib64/mysql
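If you install on both kinds of hosts, the choice can be scripted. The sketch below is an illustration only: the helper names are made up, the gem flags are copied unchanged from the commands above, and it assumes x86_64 is the only 64-bit machine type you care about.

```shell
# Print the MySQL client library directory for a machine type.
# With no argument, the machine type is taken from `uname -m`.
mysql_lib_dir() {
    arch="${1:-$(uname -m)}"
    case "$arch" in
        x86_64) echo /usr/lib64/mysql ;;   # 64-bit RHEL5 / CentOS 5
        *)      echo /usr/lib/mysql ;;     # i386, i686 and anything else
    esac
}

# Run the same gem command as above with the matching library directory.
install_mysql_gem() {
    gem install mysql -- \
        --with-mysql-conf=/usr/bin/mysql_config \
        --with-mysql-lib="$(mysql_lib_dir)"
}
```

Calling install_mysql_gem on either platform then runs the appropriate one of the two commands shown above.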
— Max Schubert
