Nagios Performance Tuning: Early Lessons Learned, Lessons Shared: Part 2 · 31 October 2008, 17:29

One of the first questions to ask your customer when designing a Nagios implementation should be “how many devices and services will we be monitoring?” It is important to ask this question early in the process, as the answer will affect how you design your Nagios-based system.

Another important question to ask is whether the system will be used to gather long-term (months/years) trending information. If it will be used as an ingest system for long-term trending data, then timing becomes important: making sure your service check intervals stay consistent over time is critical if the metrics the system gathers are to have value for your organization or customer.

Why? Isn’t 5 minutes always 5 minutes to Nagios? Imagine you have a metric collected every 5 minutes. If, over time, that metric’s scheduling constantly slips forward or backward from the original 5 minute intervals you scheduled it for because of configuration decisions that cause Nagios to pause or ‘fall behind,’ you will end up with gaps in the metrics and intervals that are hard to compare against each other. For example, if your original schedule for a metric is

0 5 10 15 20

and then, over the course of time, it slips to

8 13 18 24 28

then your hour-to-hour comparisons are skewed, and if the scheduling skew continues, you will eventually have gaps in the metrics.
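For context, a “5 minute metric” in Nagios is just a service whose check interval is five time units, with the size of a time unit set in nagios.cfg. A minimal sketch, using Nagios 3 style directives and made-up host and file names (older releases call the directive normal_check_interval):

# nagios.cfg
# one "time unit" is 60 seconds, so check_interval=5 means every 5 minutes
interval_length=60

# services.cfg (hypothetical object definition)
define service {
    use                   generic-service
    host_name             web01               ; hypothetical host
    service_description   HTTP
    check_command         check_http
    check_interval        5                   ; 5 x interval_length = 5 minutes
    retry_interval        1
    max_check_attempts    3
}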

So, given the above two questions:

You have some early architecture decisions to make, so your first priority should be to spend generous amounts of time reading and understanding the comprehensive online Nagios documentation. The Nagios documentation includes useful information on how to prepare for a larger installation, architecture patterns to follow when designing your systems, and very good information on Nagios configuration parameters that will help keep your systems executing checks quickly without becoming overwhelmed.
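To give a flavor of what those parameters look like, here is a sketch of a few nagios.cfg directives that the tuning and large-installation sections of the documentation cover. The values shown are illustrative, not recommendations, so check the docs for what fits your installation:

# nagios.cfg (Nagios 3) -- illustrative values, not recommendations
# reduce per-check overhead on large installations
use_large_installation_tweaks=1
enable_environment_macros=0
# 0 = no limit on parallel service checks
max_concurrent_checks=0
# how often (in seconds) Nagios reaps finished check results
check_result_reaper_frequency=10
# let Nagios spread and interleave checks intelligently
service_inter_check_delay_method=s
service_interleave_factor=s
# schedule all initial service checks within this many minutes
max_service_check_spread=30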

If your Nagios system will be trending hundreds or thousands of devices with thousands of service checks, you should think about putting your Nagios poller and your Nagios reporting / graphing functions on separate servers. If you can, dedicate a second server to trending and notifications and, if you have the luxury, use a third server just for notifications. The less I/O strain you put on your master poller, the more likely it is to hit whatever performance expectations you have.

Using a second server to offload trending and reporting also helps ensure that all performance data designated for trending actually makes it into whatever graphing package you use and, from there, into graphs.
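One common way to do that handoff is the perfdata file mechanism in nagios.cfg: Nagios appends performance data to a spool file and periodically runs a command you define to process it. The sketch below assumes Nagios 3 directives; the spool path, template, and the ship_perfdata.sh script are placeholders for whatever your graphing setup expects:

# nagios.cfg -- write service performance data to a spool file
process_performance_data=1
service_perfdata_file=/var/spool/nagios/service-perfdata
service_perfdata_file_template=$TIMET$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICEPERFDATA$
service_perfdata_file_mode=a
# every 60 seconds, run the processing command against the spool file
service_perfdata_file_processing_interval=60
service_perfdata_file_processing_command=process-service-perfdata

# commands.cfg -- hypothetical command that ships the spool to the trending server
define command {
    command_name    process-service-perfdata
    command_line    /usr/local/nagios/libexec/ship_perfdata.sh
}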

On the other hand, if your system will only be used for fault management (not trending), you will be able to use less expensive hardware and will not necessarily need the expense or complexity of a multi-server setup. The same goes for cases in which you are monitoring only a few hundred services on 50-100 servers.

Nagios does a lot of fork()ing when it runs service checks, so your Nagios master poller should have generous RAM and at least two CPUs. I have not come up with sizing formulae yet, nor have I found a sizing calculator for Nagios, but when I find one or work out general rules of thumb I will post them.

Your reporting / trending server will experience high levels of disk I/O activity, so a generous amount of RAM and SCSI disks in RAID 1+0 (or RAID 6 or 0+1) are highly recommended.

Also, if you can avoid it, do NOT use VMware or other BIOS-emulating virtual machine technology for your Nagios instances. Virtual machines generally cannot keep up with the fast check processing a large Nagios installation requires, and some virtualization technologies have problems with time synchronization, which is a huge deal killer for Nagios’ scheduling.
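Since clock drift can quietly ruin scheduling, it is also worth having Nagios watch its own clock offset. Here is a sketch using the check_ntp_time plugin from the standard plugins package (older plugin releases ship check_ntp instead); the host name, NTP server, and thresholds are placeholders:

# commands.cfg -- warn at 0.5s offset, critical at 2s (thresholds in seconds)
define command {
    command_name    check_ntp_offset
    command_line    $USER1$/check_ntp_time -H pool.ntp.org -w 0.5 -c 2
}

# services.cfg -- run the offset check against the Nagios master itself
define service {
    use                   generic-service
    host_name             nagios-master       ; hypothetical host
    service_description   NTP Offset
    check_command         check_ntp_offset
}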

The next blog entry in this series will focus on the Nagios master polling server.

Special thanks to Mike Fischer, my manager at Comcast, for allowing me to share my experiences at work online; special thanks to Ryan Richins, my talented teammate, for his hard work with me on our Nagios system. We are looking for another developer to join our team in the NoVA area; write me at if you are interested.

— Max Schubert


