semintelligent.com: personal site of Max Schubert a.k.a. "perldork"


Nagios Performance Tuning - use the RAM (but be careful!), Luke · 6 January 2010, 02:04

We found that migrating as many queues and files as we reasonably can within our Nagios architecture to RAM disks makes a huge difference to the performance of a large Nagios installation. We currently poll more than 15k services on more than 2k hosts in less than 5 minutes, 24×7×365.

We use RHEL5; by default RHEL mounts /dev/shm as a RAM disk with 50% of physical RAM available to the partition.

Our opinion on using RAM disks for temporary storage is controversial. A number of users on the Nagios users and developers lists have told me that disks with big caches should be as fast as RAM because files are cached in RAM, but our experience has shown that nothing beats a RAM disk for a fast queue directory or file. Our experience also taught us that when moving queues to RAM it is very important to implement supporting code that ensures important data is persisted across reboots or can easily be re-created after a reboot.

Our experience is based on machines with SCSI disks in RAID 0, 5, and 1+0 configurations.

Queues and files we moved to RAM that sped up our Nagios architecture noticeably (by over 40% in total):

Nagios (nagios.cfg)

Moving log_file, object_cache_file, and status_file to RAM speeds up the CGIs in a larger environment. Moving temp_file, temp_path, check_result_path, and state_retention_file to RAM lowers the latency for Nagios in a larger environment.
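As a rough illustration of the directives named above, a nagios.cfg fragment might look like the following. The directive names are standard nagios.cfg options; the /dev/shm/nagios directory and the file names are assumptions made for this sketch, not our actual layout.

```
# nagios.cfg fragment (sketch): /dev/shm/nagios is a hypothetical directory
# Files read by the CGIs
log_file=/dev/shm/nagios/nagios.log
object_cache_file=/dev/shm/nagios/objects.cache
status_file=/dev/shm/nagios/status.dat

# Files and queues used by the Nagios daemon
temp_file=/dev/shm/nagios/nagios.tmp
temp_path=/dev/shm/nagios/tmp
check_result_path=/dev/shm/nagios/checkresults
state_retention_file=/dev/shm/nagios/retention.dat
```

Because /dev/shm is empty after a reboot, the directories referenced here have to be re-created before Nagios starts.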

We have also taken the radical step of moving all configuration files into RAM, as well as plugins. We use ePN extensively, and every time Nagios goes to run an ePN plugin it checks to see if the plugin has changed. After moving plugins to RAM we noticed a speed-up.

IMPORTANT NOTE – Do not move everything to RAM without putting in custom, periodic scripts or other processes that back up important files from RAM to real disk so that if the host crashes they can be quickly recovered or re-created!
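A minimal sketch of the kind of supporting script the note above calls for. This is not the code we run; the function names and both paths are hypothetical, and a real deployment would also need locking and error reporting.

```shell
#!/bin/sh
# Sketch only: RAM_DIR and DISK_DIR are hypothetical paths.
RAM_DIR=${RAM_DIR:-/dev/shm/nagios}
DISK_DIR=${DISK_DIR:-/var/backups/nagios-ram}

# Copy everything under the RAM directory to persistent disk.
# Intended to be run periodically, for example from cron.
backup_ramdisk() {
    src=$1
    dest=$2
    [ -d "$src" ] || { echo "missing source: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    cp -pR "$src"/. "$dest"/
}

# Re-create the RAM directory after a reboot and repopulate it from the
# last backup, if one exists. Intended to run before Nagios starts.
restore_ramdisk() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    [ -d "$src" ] || return 0
    cp -pR "$src"/. "$dest"/
}
```

A cron job could source this file and call backup_ramdisk "$RAM_DIR" "$DISK_DIR" every few minutes, and an init script could call restore_ramdisk "$DISK_DIR" "$RAM_DIR" before starting Nagios. How often to back up depends on how much state you can afford to lose in a crash.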

SNMPTT (snmptt.ini)

The spool file for checks is a good one to move to RAM and speeds up processing.

PNP (npcd.conf and process_perfdata.conf)

The NPCD queue is another directory we moved to RAM, and we noticed a nice improvement in processing time for NPCD.

Summary

Moving any of the above queues to RAM disks will increase the overall speed of your Nagios architecture. The Nagios-specific configuration changes make a very noticeable difference, but at the price of some additional supporting code to ensure the robustness of critical data. We developed this list over a period of 3-6 months, so take your time if you decide to implement any of the changes mentioned in this article. Also make sure you have Nagios trending metrics in place beforehand so you can see what kind of difference the above changes make, if any, to your installation.

Special thanks to my managers Eric Scholz, Mike Fischer, and Jason Livingood for allowing us to share our experiences and knowledge with the general public, and extra special thanks to my teammates Ryan Richins and Shaofeng Yang for their work with me in creating an ever-changing and improving Nagios architecture that is stable and gives us incredible performance.

We are still hiring :), contact me if you are interested in working on a terrific team doing interesting and innovative work.

— Max Schubert


---

Updated Nagios::Plugin::SNMP and Nenm::Utils on GitHub (on CPAN this week) · 26 August 2009, 23:19

I have released version 1.2 of Nagios::Plugin::SNMP to GitHub:

http://github.com/perldork/nagios--plugin--snmp/tree/master

This release includes:

Additionally I have released an updated version of the Nenm::Utils module that I initially created for the Syngress Nagios book project I lead. This version includes:

This module is also available on the book site

http://www.nagios3book.com/

My team at work uses both of these modules extensively to query several thousand SNMP-based agents every 5 minutes.

Special thanks to:

My teammates Ryan Richins and Shaofeng Yang for their extensive contributions to both of these modules.

My managers at Comcast, Mike Fischer and Jason Livingood, for allowing us to contribute code we have done at work back to the open source community.

Comcast is hiring! Our team is looking for a talented developer with systems administration experience to join our team. Let me know if you are in the northern Virginia area of the US and are looking for a fun and challenging place to work :).

— Max Schubert


---

Nagios Performance Tuning: Early Lessons Learned, Lessons Shared. Part 5 - Circular Dependency Checking · 6 August 2009, 16:36

NOTE – we are using Nagios 3.0.3, which does not have the very cool patch for the circular dependency checking algorithm recently introduced into the Nagios 3.1.x release tree.

Our startup times for our Nagios instances jumped dramatically today (more than 6x) due to some of our users adding large numbers of new services to their hosts that are associated with their hosts through the

service -> hostgroup -> host

relationship I have discussed often and that we use heavily. We always want our Nagios instances to start within a 5-minute interval, because we push most of the performance data we get back from checks into a long-term trending data warehouse.

We also test every configuration release in an integration and test environment before doing a deployment.

With this in mind, we decided to try turning off circular dependency checking on startup for our production Nagios instances.

On one instance this reduced startup time from 763 (!) seconds to 16 seconds; on the other, startup time dropped from 158 seconds to 6 seconds.

There you have it, a simple way to dramatically reduce startup times, but again, only do this if you test your configuration beforehand in an environment with circular dependency checking on.
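One way to skip the check is the -x command line option that the Nagios 3 documentation describes under fast startup options; the sketch below assumes that mechanism. The binary and configuration paths and the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical locations; adjust for your installation.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}

# Full pre-flight check, including circular dependency checking.
# Run this in the integration/test environment before every release.
verify_config() {
    "$NAGIOS_BIN" -v "$NAGIOS_CFG"
}

# Start the daemon (-d) without circular path checking (-x).
# Only safe if verify_config has already passed on the same configuration.
start_without_path_check() {
    "$NAGIOS_BIN" -x -d "$NAGIOS_CFG"
}
```

Keeping the two steps as separate functions mirrors the split described above: the slow, complete verification happens in the test environment, and production only ever starts a configuration that has already passed it.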

— Max Schubert


---

Easy to use ruby library for interacting with Confluence - confluence4r · 31 July 2009, 16:43

http://confluence.atlassian.com/display/CONFEXT/Confluence4r

I added a gemspec for the package to the bottom of the page if you want to build it as a gem in-house.

— Max Schubert


---

Why do I get an 'uninitialized value' error message from Getopt/Long.pm when Nagios runs my perl-based plugin under ePN? · 25 July 2009, 14:48

Had this message while debugging an ePN-based script today:

**ePN /data/nagios/etc/customers/tean/project/plugins/check_plugin_name.pl: "Use of uninitialized value in pattern match (m//) at /usr/lib/perl5/5.8.8/Getopt/Long.pm line 848,".

I was very puzzled by this, as I had never seen that error before. We run 20-30 or more ePN-based scripts, and obviously I don’t maintain that code, so how could I have introduced a bug into it?

Answer: I didn’t. What I did do was define a custom attribute for a service without putting any spaces (or a value) after the attribute name in my service definition. E.g.

define command {
    command_name check_plugin_name
    command_line $USER10$/team/project/plugins/check_plugin_name.pl \
    --check-interval $_SERVICE_PROJECT_CHECK_INTERVAL$ \
    --hostname $HOSTADDRESS$ \
    $_SERVICE_PROJECT_ALT_HOSTS$ \
    -p '$_HOST_SNMP_PORT$' \
    --snmp-version 2c \
    --rocommunity $_HOST_SNMP_COMMUNITY$ \
    --timeout $_HOST_PLUGIN_TIMEOUT$ \
    -c '$_SERVICE_PROJECT_CRIT$' \
    $_SERVICE_PROJECT_WARN$
}

Notice that at the end of the command line I reference $_SERVICE_PROJECT_WARN$. This style of custom attribute calling lets the user set a warning threshold in the service definition if they want to, like so:

define service {
    ...
    __project_warn -w my_threshold_specification
    ...
}

But if they don’t, no changes are needed to the command definition to let it work as the command does not require a warning threshold.

However I then defined the attribute like so in my service definition:

define service {
    ...
    __project_warn<-- end of line, no spaces!
    ...
}

This caused Nagios to substitute a null or some other non-printable character as the value of the attribute in the command line before executing it, which in turn got passed through to Getopt/Long.pm as an undefined option name.

The fix: just add spaces and an empty string to the attribute in the service definition :)

define service {
    ...
    __snmp_port           161
    __project_warn        ''
    ...
}

Voila, no undefined option.

This could be a candidate for either a Nagios custom attribute value fix or a Getopt/Long.pm fix. I am thinking Getopt::Long should set an undefined option name to the empty string so that developers do not have to guard against this condition.

— Max Schubert


---

Nagios patch withdrawal: only send recovery escalation notifications for services if a problem escalation notification was sent · 24 July 2009, 17:16

Well, I hate to say it, but mea culpa: I had to withdraw the first attempt at the patch I did in an earlier article (which I have hidden for now to make sure others do not download it) that was supposed to fix escalation recovery notification behavior.

My first attempt at the patch was overly naive; if you downloaded it, please remove it from your installation as it will most likely not work for you. It does work for us, but our configuration is very unique and very different from how most people use Nagios.

I have a new version in place at my job and I will be releasing that version next week or the week after next. Why might you trust this new one after my poor first attempts?

My apologies if you downloaded and used the earlier patches; thankfully it will not corrupt data etc, just does not do what I promised it would do.

The current version is working for us and with typical configurations as well. I am just not going to repeat the same mistakes I made last time, as I know how frustrating it is to back out code.

— Max Schubert


---

Are you an expert US citizen? · 19 June 2009, 18:59

Email from a recruiter this year included a request for the following skill:

— Max Schubert


---

Installing CentOS via Netboot in VMware · 4 May 2009, 22:28

A quick blurb as I always forget this:

Example:

No trailing slash on the domain and no leading or trailing slash on the URI path

— Max Schubert


---

Getting ruby 1.8.7 and newer to compile with readline support on Red Hat Enterprise Linux (RHEL4 and RHEL5) · 4 April 2009, 10:20

Paraphrased from http://www.sanft.com/2008/12/01/upgrading-to-ruby-186-on-red-hat/

First, ensure you have the following packages installed:

Then make sure you remove the system ruby and ruby-devel packages, otherwise gems and other extensions might find the wrong version of ruby when they look for compile flags etc:

After unpacking the source for ruby, do the usual:

./configure --prefix=/usr/local
make all
sudo make install

Now do the following from the ruby source directory:

cd ext/readline
/usr/local/bin/ruby extconf.rb
make
sudo make install

To ensure that ruby now has readline support, run

/usr/local/bin/ruby -rreadline -e 1

If you get no output (which should be the result), voila, readline support is now active.

— Max Schubert


---

Getting ruby gem mysql native extension to install on RHEL5 / CentOS 5 · 2 April 2009, 12:04

From

http://www.wzzrd.com/2008_02_01_archive.html

If you are on a 32-bit platform:

gem install mysql -- --with-mysql-conf=/usr/bin/mysql_config --with-mysql-lib=/usr/lib/mysql

If you are on a 64-bit platform:

gem install mysql -- --with-mysql-conf=/usr/bin/mysql_config --with-mysql-lib=/usr/lib64/mysql
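If you install on a mix of machines, the choice between the two commands above can be scripted. This is a small sketch: the function names are made up for illustration, the flags are copied from the commands above, and it assumes x86_64 is the only 64-bit architecture you care about.

```shell
#!/bin/sh
# Map a machine architecture (as printed by `uname -m`) to the MySQL
# client library directory used in the commands above.
# Assumption: only x86_64 is treated as 64-bit here.
mysql_lib_dir() {
    case "$1" in
        x86_64) echo /usr/lib64/mysql ;;
        *)      echo /usr/lib/mysql ;;
    esac
}

# Print the gem command for the given architecture without running it.
mysql_gem_command() {
    echo "gem install mysql -- --with-mysql-conf=/usr/bin/mysql_config --with-mysql-lib=$(mysql_lib_dir "$1")"
}
```

Running mysql_gem_command "$(uname -m)" prints the command for the current host, so you can review it before executing it.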

— Max Schubert


---
