Nagios Performance Tuning - use the RAM (but be careful!), Luke · 5 January 2010, 22:04
We found that migrating as many queues and files as we reasonably can within our Nagios architecture to RAM disks makes a huge difference with the performance of a large Nagios installation. We currently poll over 15k services on over 2k+ hosts in less than 5 minutes 24×7×365.
We use RHEL5; by default RHEL mounts /dev/shm as a RAM disk with 50% of physical RAM available to the partition.
Our opinion on using RAM disks for temporary storage is controversial; a number of users on the Nagios users and developers lists have told me that disks with big caches should be as fast as RAM as files are cached in RAM, but our experience has shown that nothing beats a RAM disk for a fast queue directory or file. Our experiences also taught us that when moving queues to RAM it is very important to also implement supporting code that ensures important data is persisted across reboots or can easily be re-created across reboots.
Our experience is based on machines with SCSI disks in RAID 0, 5, and 1+0 configurations.
Queues and files we moved to RAM that sped up our Nagios architecture noticeably (by over 40% in total):
Moving log_file, object_cache_file, and status file to RAM speed up the CGIs in a larger environment. Moving the temp_file, temp_path, check_result_path, and state_retention_file to RAM lowers the latency for Nagios in a larger environment.
We have also taken the radical steps of moving all configuration files into RAM as well as plugins. We use ePN extensively, every time Nagios goes to run an ePN plugin it checks to see if the plugin has changed. Moving plugins to RAM we noticed a speed up.
IMPORTANT NOTE – Do not move everything to RAM without putting in custom, periodic scripts or other processes that back up important files from RAM to real disk so that if the host crashes they can be quickly recovered or re-created!
The spool file for checks is a good one to move to RAM and speeds up processing.
PNP (npcd.conf and process_perfdata.conf)
The NPCD queue is another directory we moved to RAM and noticed a nice jump in processing time for NPCD.
Moving any of the above queues to RAM disks will increase the overall speed of your Nagios architecture; the Nagios-specific configuration changes make a very noticable difference but at the price of some additional supporting code to ensure the robustness of critical data. We developed this list over a period of 3-6 months of time, so take your time if you decide to implement any of the changes mentioned in this article; also make sure you have Nagios trending metrics in place beforehand so you can see what kind of difference the above changes make, if any, to your installation.
Special thanks to my managers Eric Scholz, Mike Fischer, and Jason Livingood for allowing us to share our experiences and knowledge with the general public, and extra special thanks to my teammates Ryan Richins and Shaofeng Yang for their work with me in creating an ever-changing and improving Nagios architecture that is stable and gives us incredible performance.
We are still hiring :), contact me if you are interested in working on a terrific team doing interesting and innovative work.
— Max Schubert