Nagios Performance Tuning: Early Lessons Learned, Lessons Shared: Part I · Oct 30, 10:48 PM
We just went through a round of tuning our pre-production Nagios instance, gathering baseline performance data so our team can give management reasonable capacity estimates for how many hosts and services we can monitor with Nagios in our environment.
Pre-production hardware is the HP DL185 (two of these servers in use):
- Dual 64-bit AMD 2600 CPUs
- SCSI disks
- 6 GB RAM
- Dual Gigabit bonded NICs
- RHEL 5.2
Nagios configuration:
Nagios poller:
- Nagios 3.0.3
- Plugins v 1.4.12
- Nagios2JSON CGI
- NSCA
- modpnpsender installed as a NEB module
- SNMPTT with custom SNMPTT to Nagios script
Nagios report server:
- NagTrap / MySQL
- PNP web GUI, with process_perfdata.pl run via inetd
For initial testing and tuning we are polling ~250 hosts with a total of ~1800 checks. All checks are SNMP and all are scheduled at 5 minute intervals. Some are gets, some are summarizations of walks.
We are using PNP for graphing (NEB module mode, process_perfdata.pl run via inetd); RRD updates happen on a second server dedicated to reporting and visualization.
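To put those numbers in perspective, here is a rough back-of-the-envelope sketch of the average check rate this workload implies. The function names are made up for illustration, and the figures are the approximate ones quoted above, so treat the result as an estimate rather than a measurement.

```python
# Back-of-the-envelope check rate for the workload described above.
# CHECKS and INTERVAL_S come from the post; the helper names are hypothetical.
CHECKS = 1800          # approximate total number of service checks
INTERVAL_S = 5 * 60    # every check is scheduled at a 5 minute interval


def required_rate(checks: int, interval_s: float) -> float:
    """Average checks per second needed to run every check once per interval."""
    return checks / interval_s


def serial_budget_s(checks: int, interval_s: float) -> float:
    """Average time available per check if checks ran strictly one at a time."""
    return interval_s / checks


if __name__ == "__main__":
    print(f"required rate: {required_rate(CHECKS, INTERVAL_S):.1f} checks/s")
    print(f"serial budget: {serial_budget_s(CHECKS, INTERVAL_S) * 1000:.0f} ms/check")
```

With these inputs the poller has to sustain about 6 checks per second on average, which would leave roughly 167 ms per check if nothing ran in parallel. An SNMP walk can easily take longer than that, which is one way to see why parallel check execution matters here.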
At the beginning of our tuning adventure we were seeing:
- All checks completed in 3.5 minutes
- Over the course of 12 hours, scheduling was skewing by about 48 seconds, meaning that after 12 hours a check initially scheduled to run at 0, 5, 10, 15 etc minutes past the hour would be running 48 seconds late (0:48, 5:48, 10:48 and so on)
- Some checks were not making it to PNP (gaps in graphs).
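The skew figure above is easier to reason about per polling cycle. The following is a minimal sketch, assuming the drift accumulates evenly across cycles (the post only reports the total, so that is an assumption); the helper names are made up for illustration.

```python
# Rough drift arithmetic for the scheduling skew described above.
# Assumes drift accumulates evenly across cycles, which the post does not confirm.


def drift_per_cycle(total_skew_s: float, elapsed_hours: float,
                    interval_min: int = 5) -> float:
    """Average seconds of skew added per polling cycle."""
    cycles = elapsed_hours * 60 / interval_min
    return total_skew_s / cycles


def skewed_run_times(skew_s: int, interval_min: int = 5, count: int = 3) -> list:
    """First `count` run times as MM:SS past the hour, shifted by skew_s seconds."""
    times = []
    for i in range(count):
        total_s = i * interval_min * 60 + skew_s
        times.append(f"{total_s // 60:02d}:{total_s % 60:02d}")
    return times


if __name__ == "__main__":
    print(f"{drift_per_cycle(48, 12):.2f} s of skew per 5 minute cycle")
    print(skewed_run_times(48))
```

48 seconds over 12 hours is 144 cycles, so each cycle slips by about a third of a second on average. That is small per cycle, but it compounds, and it is the accumulated offset that matters when samples are meant to line up over long periods.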
This would be barely acceptable if we were only doing fault management, but we want to send all perfdata not only to PNP but also to a large time series warehouse database that another team maintains. That means our 5 minute samples need to stay close to the same intervals over time, because that database stores raw samples for years and many other teams pull data from it for graphing, reports, and other analysis.
After two weeks of tuning we have reduced our check execution time (all 1800 checks!) to < 60 seconds, with an average scheduling skew of just 7 seconds at the end of 24 hours. All performance data is successfully being graphed by PNP as well. Our tuned configuration does this without knocking over our Nagios polling server, our PNP server, or the hosts we are polling, and we have room to poll many more services and hosts using the same two servers.
How did we get from start to finish? More science than art :p. I am usually a very intuitive developer, but this time my teammate and I found we had to take a more scientific approach, and it worked.
Stay tuned (if this isn’t boring the hell out of you :p). My next few posts will talk about the process we went through, what we have learned about the many performance tuning parameters in nagios.cfg, and how we reached our current performance, which our team and our management are very happy with right now.
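For a quick side-by-side view, the sketch below puts the before and after figures from this post on the same scale. The `Run` class and its field names are made up for illustration, and since the tuned execution time is only reported as under 60 seconds, the "after" throughput is a lower bound.

```python
# Compare the before/after figures quoted in this post on a common scale.
# The Run class is a hypothetical helper; the numbers are the post's own.
from dataclasses import dataclass

CHECKS = 1800  # approximate total number of checks


@dataclass
class Run:
    exec_time_s: float   # time to complete all checks once
    skew_s: float        # scheduling skew observed over the window
    window_h: float      # length of the observation window in hours

    def throughput(self) -> float:
        """Average checks completed per second during a full pass."""
        return CHECKS / self.exec_time_s

    def skew_per_hour(self) -> float:
        """Average seconds of scheduling skew accumulated per hour."""
        return self.skew_s / self.window_h


before = Run(exec_time_s=3.5 * 60, skew_s=48, window_h=12)
after = Run(exec_time_s=60, skew_s=7, window_h=24)  # 60 s is an upper bound

if __name__ == "__main__":
    print(f"throughput: {before.throughput():.1f} -> {after.throughput():.1f} checks/s")
    print(f"skew: {before.skew_per_hour():.2f} -> {after.skew_per_hour():.2f} s/hour")
```

On these numbers a full pass went from roughly 8.6 checks per second to at least 30, and skew dropped from 4 seconds per hour to about 0.29 seconds per hour.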
— Max Schubert