
Nagios Performance Tuning: Early Lessons Learned, Lessons Shared. Part 4.5 - Scalable Performance Data Graphing · 13 March 2009, 11:40

In my previous post on Scalable Performance Data Graphing with Nagios, I discussed how our team is using PNP, NPCD, and modpnpsender.o to send performance data from our polling server to our report server and then process it.

A week ago our report server hit its upper limit on the number of PNP performance data events it could process (8800 events every 5 minutes). We trend on a dozen or so poller and report server metrics, including the age of events in the NPCD queue; the queue went from being completely empty for one minute out of every 5 to having over 30,000 backlogged events, still growing, within a 24 hour period. This backlog meant the RRD files (and consequently the PNP UI) were up to 15 minutes behind reality as well.
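
As an aside, the queue depth and event age we trend on are easy to derive from the NPCD spool directory itself; here is a minimal sketch (the spool path is a placeholder, substitute whatever perfdata_spool_dir points at in your npcd.cfg):

#!/bin/sh
# Sketch: report NPCD queue depth and the age of the oldest spooled perfdata file.
SPOOL=/path/to/perfdata/spool

depth=$(find "$SPOOL" -type f | wc -l)
oldest=$(find "$SPOOL" -type f -printf '%T@\n' | sort -n | head -1)
now=$(date +%s)

if [ -n "$oldest" ]; then
    age=$(( now - ${oldest%.*} ))
else
    age=0
fi

echo "npcd_queue_depth=$depth npcd_oldest_event_age=${age}s"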

I started at the beginning of the week with tuning our NPCD threads and sleep time parameters. At first I tried starting more threads and sleeping less, but the server was so overwhelmed that this had the opposite effect; the queue grew.

Next I played with starting fewer threads and eventually found that 10 threads every 5 seconds was at least letting NPCD start to drain the queue. After 48 hours (!) the queue was down to 3k events, with no events in queue older than 119 seconds. Better, but not good enough for us to say the problem was fixed.
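
The two knobs in question live in npcd.cfg; a fragment reflecting the settings we landed on would look roughly like this (parameter names taken from the stock npcd.cfg shipped with PNP, spool path is a placeholder):

# npcd.cfg fragment
# directory Nagios writes perfdata files into and NPCD drains
perfdata_spool_dir = /path/to/perfdata/spool
# number of worker threads NPCD may start per wakeup
npcd_max_threads = 10
# seconds NPCD sleeps between scans of the spool directory
sleep_time = 5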

My colleague, Ryan Richins, remembered seeing documentation on the PNP site about integrating rrdcached with PNP. I vaguely remembered seeing it too, so I gave it a second look. Ryan, meanwhile, downloaded the source for the latest stable RRDtool release but found that rrdcached was not included. He then re-read the PNP page, and we ended up downloading the latest trunk snapshot of RRDtool, knowing it might not be production ready. That version did contain rrdcached.

The configure options we used are:


export LDFLAGS="-L/lib64 -lpango-1.0 -lpangocairo-1.0"
CPPFLAGS="-I/usr/include/cairo -I/usr/include/pango-1.0"
CPPFLAGS="$CPPFLAGS -I/usr/include/cairo"
CPPFLAGS="$CPPFLAGS -I/usr/include/pango-1.0/pango"
export CPPFLAGS

./configure \
  --prefix=/path/to/base \
  --enable-perl \
  --disable-ruby \
  --disable-lua \
  --disable-python \
  --disable-tcl \
  --with-rrdcached

After installing rrdcached, we downloaded the latest version of PNP and looked at its rrdcached integration. We have made a lot of local changes to process_perfdata.pl, so we back-ported the rrdcached integration code into our process_perfdata.pl script (just 6 lines or so of code; a sketch of the idea follows the config below). We tested rrdcached in our integration environment with this /etc/default/rrdcached file:

RUN_RRDCACHED=1
RRDCACHED=/path/to/rrdcached
RRDCACHED_USER="nagios"
OPTS="-w 60 -z 6 -f 120 -F -j /path/to/temp/dir -t 10"
PIDFILE="/var/run/rrdcached/rrdcached.pid"
SOCKFILE="127.0.0.1:45000"
SOCKPERMS=0666
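
For context, the gist of those few back-ported lines is simply routing RRD updates through the rrdcached socket instead of writing the .rrd files directly. With the RRDtool 1.4 command line tools the same idea looks roughly like this (file name and values are made up for illustration):

# without rrdcached: the rrdtool process writes the .rrd file itself
rrdtool update /path/to/base/var/host/service.rrd N:0.42:12

# with rrdcached: the update is handed to the daemon, which batches disk writes
rrdtool update --daemon 127.0.0.1:45000 /path/to/base/var/host/service.rrd N:0.42:12

# force pending updates for a file to disk (handy before graphing or backups)
rrdtool flushcached --daemon 127.0.0.1:45000 /path/to/base/var/host/service.rrd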

In our integration environment this performed quite well, so we rolled it out to production. In production, performance was not as good, so I changed the parameters to the following, which so far is the best combination I have found for us:

RUN_RRDCACHED=1
RRDCACHED=/path/to/base/bin/rrdcached
RRDCACHED_USER="nagios"
OPTS="-w 75 -z 1 -f 900 -F -j /path/to/journal/dir -t 8"
PIDFILE="/var/run/rrdcached/rrdcached.pid"
SOCKFILE="127.0.0.1:45000"
SOCKPERMS=0666

IMPORTANT NOTE: when using rrdcached, if you need to restart it and npcd with a new set of parameters, make sure every process_perfdata.pl process has stopped before you restart rrdcached. If you do not let all process_perfdata.pl processes stop before restarting rrdcached, you will lose data. A safe restart sequence is sketched below.
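
Assuming init scripts named npcd and rrdcached (adjust the names and paths to your environment), a restart sequence along those lines would be:

# 1. stop npcd so no new process_perfdata.pl children get spawned
service npcd stop

# 2. wait for any in-flight process_perfdata.pl runs to finish
while pgrep -f process_perfdata.pl > /dev/null; do
    sleep 5
done

# 3. now it is safe to restart rrdcached with the new parameters
service rrdcached restart

# 4. bring npcd back up
service npcd start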

After this change our queue came down to about 1.5k events, but we were still constantly processing events (no empty minute). My coworker and I started discussing using memcached to queue events, and then a light bulb went off for us: why not use a RAM disk for the NPCD queue? Since rrdcached keeps a journal file in case it crashes, the risk from a crash is losing 5 minutes of data or less, which was acceptable to us. By the way, this idea was not an original one; I had seen it on the Nagios / PNP lists before, I just had not considered it fully.

So, I changed the queue directory for npcd to be this (RHEL 5.x):

/dev/shm/perfdata.ram

I also changed the path for the process_perfdata.log file to be

/dev/shm/var/perfdata.log
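
Putting it together, the whole switch is just creating the directories under /dev/shm and pointing NPCD and process_perfdata.pl at them; a sketch, assuming the stock npcd.cfg and process_perfdata.cfg parameter names and that the nagios user owns the spool:

# create the RAM-backed spool and log directories (tmpfs on RHEL 5.x)
mkdir -p /dev/shm/perfdata.ram /dev/shm/var
chown nagios:nagios /dev/shm/perfdata.ram /dev/shm/var

# npcd.cfg:             perfdata_spool_dir = /dev/shm/perfdata.ram
# process_perfdata.cfg: LOG_FILE = /dev/shm/var/perfdata.log

# sanity check how much memory the queue actually uses
du -sh /dev/shm/perfdata.ram
df -h /dev/shm

One caveat: /dev/shm is emptied at reboot, so these directories need to be recreated (an init script or rc.local entry works) before npcd starts.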

With 8800 events every 5 minutes, the queue only averages 1-2 MB in size, so there is not much risk of clogging real memory on this system (we have about 500 MB free according to top and snmpd).

I then restarted npcd and watched our NPCD metrics. Voila! Back down to 4 minutes to process all PNP files and an empty minute to spare :), with a max event queue size of 1.2k.

One thing Ryan and I expected to see during all of this, but did not, was a decrease in CPU I/O wait. The load on the server has decreased slightly and the server definitely feels more responsive, but we are still seeing a consistent 20-25% I/O wait on the CPU.
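
If anyone wants to poke at the same question, per-device numbers from the sysstat package make it easier to see which disk is actually carrying that I/O wait than the single CPU-wide figure; for example:

# extended per-device statistics every 5 seconds (sysstat package)
iostat -dxk 5
# watch the await and %util columns for the device holding the RRD files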

Maybe there is more we can tune? Ideas welcomed.

Special thanks to Mike Fischer, my manager at Comcast, for allowing me to share my experiences at work online; special thanks to Ryan Richins, my talented teammate, for his hard work with me on our Nagios system and welcome to Shaofeng Yang, our new teammate!

— Max Schubert
