Easy to use ruby library for interacting with Confluence - confluence4r · 31 July 2009, 12:43
I added a gemspec for the package to the bottom of the page if you want to build it as a gem in-house.
— Max Schubert
Installing CentOS via Netboot in VMWare · 4 May 2009, 18:28
A quick blurb as I always forget this:
- Download and install VMWare Server 2.0 GPL edition (or other if you have the $$)
- Create a blank VM
- Download the netboot.iso from a CentOS web site
- Add the ISO as a virtual CD-ROM for the new virtual machine
- When prompted for installation media, choose HTTP, then enter
- In the site field – domain-name-of-mirror
- In the path field – URI to the base architecture directory
- Site – mirrors.xmission.com
- Path – centos/5.3/os/i386
No trailing slash on the domain and no leading or trailing slash on the URI path
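The slash rule is easy to get wrong, so here is a tiny sketch (Python; `netboot_url` is a hypothetical helper, not part of anaconda) that normalizes the two fields into the URL the installer will fetch:

```python
def netboot_url(site, path):
    """Combine the anaconda "site" and "path" prompts into a full
    mirror URL. Per the note above: no trailing slash on the site,
    no leading or trailing slash on the path."""
    site = site.strip().rstrip("/")
    path = path.strip().strip("/")
    return "http://%s/%s/" % (site, path)
```

With the example values above, `netboot_url("mirrors.xmission.com", "centos/5.3/os/i386")` yields `http://mirrors.xmission.com/centos/5.3/os/i386/`.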
— Max Schubert
Nagios Performance Tuning: Early Lessons Learned, Lessons Shared. Part 4 - Scalable Performance Data Graphing · 17 January 2009, 19:28
In the last three parts of this short series I have discussed the techniques my teammate and I have used at work to tune our Nagios poller so that it completes all configured checks within a five minute interval. I will now discuss how we are storing and graphing this data in a way that will scale as our installation continues to grow (we are currently graphing data from over 5000 checks every 5 minutes).
The Nagios Plugin API and Performance Data: sections of the online Nagios documentation discuss plugin development and current performance data format specifications in great detail; check out both links if you are not familiar with Nagios plugin performance data.
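As a concrete illustration of the format those docs describe, here is a minimal sketch (Python; `parse_perfdata` is a hypothetical helper, not part of Nagios) that splits a plugin's output line into its human-readable text and its perfdata metrics:

```python
def parse_perfdata(plugin_output):
    """Split a Nagios plugin output line into (text, metrics).

    Example line, per the Nagios perfdata format:
      "PING OK - rtt 0.2ms | rtt=0.2ms;100;500;0 pl=0%;20;60"
    Everything after the pipe is perfdata; each token is
    label=value[uom][;warn[;crit[;min[;max]]]].
    """
    text, _, perf = plugin_output.partition("|")
    metrics = {}
    for token in perf.split():
        label, _, value = token.partition("=")
        # keep just the value/uom, dropping warn/crit/min/max
        metrics[label] = value.split(";")[0]
    return text.strip(), metrics
```

This is only the happy path (it ignores quoted labels, for instance), but it shows the shape of the data every graphing framework below consumes.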
While Nagios does not come out of the box with a performance data graphing framework, it should come as no surprise that there are a number of ways to send performance data from Nagios to external graphing systems:
- Have Nagios write the performance data returned from a host or service check to an external file after each check returns
- Have Nagios call a program directly to process performance data after every host or service check returns
- Use a NEB (Nagios Event Broker) module to process service or host performance data after each service or host check is run
As with any other configuration choice in Nagios, each method has benefits and drawbacks in terms of its implementation difficulty and its effect on Nagios performance and resource utilization on the host running Nagios.
Since we are focusing on scalable graphing, our goals are as follows:
- Minimize the disk I/O impact on the Nagios polling server so that the majority of its processing power stays focused on completing all configured checks within a five minute interval.
- Minimize the amount of scheduling skew the additional performance data processing imposes on the Nagios poller.
- Maximize the number of performance data graphing requests that can be made within the polling interval.
- Have the actual storage and graphing of all performance data done on a server other than the Nagios poller, preferably one that can be dedicated to graphing.
There are a number of graphing frameworks available for Nagios; in this article I will focus on PNP – PNP Is Not Perfparse. It is a mostly well-documented, flexible framework. For Nagios administrators who currently use both Cacti and Nagios, I highly recommend considering PNP as an alternative. PNP eliminates the need to administer device and service configurations in two places and also means no double-polling to get both fault management and trending data.
PNP consists of four discrete components:
- A Perl script, process_perfdata.pl, that takes service or host check performance data and updates one or more RRD database files.
- A threaded C-based daemon (NPCD) that can be used to spawn one or more process_perfdata.pl script instances. The end user can configure the number of instances of process_perfdata.pl to spawn at a time and how often NPCD should spawn new process_perfdata.pl processes.
- An undocumented NEB module, modpnpsender.c, that sends an XML representation of perfdata from a host or service check over a TCP socket to a host and port you specify.
- A PHP-based framework for viewing graphs using the RRD files created by process_perfdata.pl and custom PHP templates created by the end user.
PNP can integrate with Nagios in a number of ways:
- process_perfdata.pl called directly by Nagios using the perfdata command options of Nagios in nagios.cfg
- NPCD is run on the Nagios poller, Nagios is configured to create :: delimited queue files in a queue directory on the server, NPCD then calls process_perfdata.pl
- Nagios runs modpnpsender.o. modpnpsender.o sends perfdata over a TCP socket to a server dedicated to graphing. process_perfdata.pl is called by inetd to directly update RRD files on the graphing server.
- Nagios runs modpnpsender.o. modpnpsender.o sends perfdata over a TCP socket to a server dedicated to graphing, a simple inetd script listens for incoming PNP connections and creates NPCD-formatted queue files in the NPCD queue directory, then NPCD calls process_perfdata.pl to update RRD files.
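For the second method, the relevant nagios.cfg settings look roughly like the following. This is a sketch based on the standard Nagios perfdata-file directives; the exact template string and processing command must match what your PNP version's documentation specifies:

```
process_performance_data=1
service_perfdata_file=/usr/local/nagios/var/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$
service_perfdata_file_mode=a
# hand the accumulated file off (e.g. into NPCD's spool dir) periodically
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata-file
```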
The first two methods above place all the disk I/O burden associated with RRD files on the Nagios poller; while this is perfectly fine for smaller installations, it is not good for a larger installation. Additionally, methods one and two cause Nagios to pause as it runs the perfdata processing commands. In our environment this caused check scheduling skew to happen at an unacceptable rate. With just 1800 services we were skewing by over a minute a day for checks, i.e. a check that was initially scheduled to run at minute 01 of the hour was running at minute 02 of the hour by day two.
modpnpsender.o is a NEB module that registers for service events; when a service event occurs within Nagios modpnpsender opens a TCP connection to a remote server, sends an XML representation of the event to the remote side, then closes the socket. This transaction does not take more than a second or two depending on where in the network your reporting host sits in comparison to your Nagios poller. We made a few minor modifications to the code (which we will release in the near future) to enhance the functionality of the NEB module.
Our first modification was to add in fork() code to the NEB module. While the Nagios documentation says never to fork in a NEB module, without the fork we found that our service check schedule was skewing almost as significantly with the NEB module in place as it had been calling process_perfdata.pl directly from Nagios via the process perfdata external command options in nagios.cfg. This occurred because Nagios waits for the NEB module to finish processing before it continues. With the fork() code in place, this skew disappeared completely; we have not seen any system instability due to the additional fork() calls.
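The idea behind that change, sketched here in Python rather than the module's C for brevity (`send` stands in for the module's XML-over-TCP send; this is an illustration of the pattern, not our actual code): the event callback forks, the parent returns to Nagios immediately, and the child does the slow network I/O.

```python
import os

def send_perfdata_async(payload, send):
    # Fork so the caller (in the real NEB module, the Nagios event
    # loop) gets control back immediately instead of blocking on
    # the TCP send.
    pid = os.fork()
    if pid == 0:
        # Child: do the slow network send, then _exit() so we
        # never fall back into the caller's event loop.
        try:
            send(payload)
        finally:
            os._exit(0)
    # Parent: return right away; the child must be reaped
    # (waitpid or a SIGCHLD handler) to avoid zombies.
    return pid
```

The trade-off is exactly the one the Nagios docs warn about: you are now spawning a process per event, so the reaping and error handling have to be solid.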
Our second modification was to make the XML buffer size in the modpnpsender.c source file a C #define parameter; the code had a hard-coded buffer size too small to accommodate the 4096 bytes of output that Nagios allows, so for checks with long perfdata output the buffer was overrun, causing Nagios to segfault.
The third PNP architecture works much better than the first two with our modpnpsender.c modifications in place; the NEB module opens a socket and sends XML to the report server; process_perfdata.pl then reads the data from the socket via inetd and updates the RRD files associated with each metric. The problem with this method is that the report server effectively experiences a denial of service attack every polling cycle if thousands of performance data records are sent to it at a time. In our case, thousands of perfdata records would arrive within two minutes, nearly knocking the server over for a few minutes each run.
My first attempt to ameliorate this problem was to have each process_perfdata.pl instance sleep for a random number of seconds, ranging from 15 to 60, before the RRD update processing occurred. While this helped, it still left the kernel tracking thousands of processes at once and did not lower the impact of each check cycle enough to be satisfactory.
The solution I found to this was the fourth option for PNP data processing listed above, which is a hybrid of the methods the PNP developers outline in the online documentation:
- The NEB module sends perfdata to the report server
- A small script reads the perfdata from the TCP socket via inetd and then writes out the performance data to a spool directory.
- NPCD then periodically processes the queue files to update the RRD check databases.
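A sketch of step two in Python (the file naming and the write-then-rename trick are my assumptions, not documented PNP behavior; check npcd.cfg for the spool filename pattern your NPCD expects):

```python
import os
import time

def spool_perfdata(data, spool_dir):
    """Write one perfdata payload into an NPCD-style spool directory.

    Write to a dot-prefixed temporary name first, then rename into
    place, so NPCD never picks up a half-written file.
    """
    name = "perfdata.%d.%d" % (int(time.time()), os.getpid())
    tmp = os.path.join(spool_dir, "." + name)
    final = os.path.join(spool_dir, name)
    with open(tmp, "wb") as f:
        f.write(data)
    os.rename(tmp, final)  # atomic on the same filesystem
    return final
```

In the real setup this function body is what the inetd-launched script does with whatever it read from the TCP socket; decoupling the write from the RRD update is what lets NPCD smooth the load out over time.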
So far this method is much more effective in our environment than the others are at keeping load averages and I/O wait times on the report server at reasonable levels. We are currently processing over 5000 checks on 1200 hosts in four minutes with a load average of 2 or less on the report server and I/O wait CPU percentages of 20% or less. All perfdata is ingested into RRD files within 4 minutes.
In addition to our PNP graphing, we also have a daemon running on the report server that reads the Nagios perfdata output and sends it to a corporate data warehouse for long term trending.
There it is: a scalable graphing architecture with Nagios and PNP that we believe will allow us to graph thousands more checks per five minute period than we are doing now without having to upgrade hardware.
In the next article in this series I will discuss how to use PNP to monitor the performance of your Nagios poller and report server.
Special thanks to Mike Fischer, my manager at Comcast, for allowing me to share my experiences at work online; special thanks to Ryan Richins, my talented teammate, for his hard work with me on our Nagios system. We are looking for another developer to join our team in the NoVA area; write me at email@example.com if you are interested.
— Max Schubert
Why you should not use the default HSQLDB backend with Confluence · 19 July 2008, 15:18
Confluence clearly states on their web site that users should not use Confluence with the default HSQLDB backend in production. However, they fail to list any reason why HSQLDB should not be used. This may lead some people (like me) into thinking that HSQLDB will be ok for the long term since they do not list any reasons for their statement.
So we stuck with the HSQLDB back end for our Wiki. It was bad. After a few months we did have some data integrity problems (fortunately, with records that had been deleted), but even this was enough to break the built-in Hibernate-based backups and force us to migrate.
The migration was hard too. Why:
- Atlassian does not publish decent tools on their site (unlike other companies that build their products on Confluence) to help you easily migrate away from HSQLDB to other databases
- For MySQL, Atlassian recommends using the MySQL migration tools to migrate from HSQLDB to MySQL; these tools work very well but have a java heap limit of 384 MB, so only smaller databases can be migrated using this toolkit.
- If any data integrity problems occur and a foreign key relationship is corrupted within the data in the database, the built-in Hibernate-based backups will no longer work as Hibernate will abort at the first referential integrity constraint it finds.
- Data integrity problems meant we had to:
- Create the MySQL database with no constraints
- Migrate from the embedded HSQLDB to a standalone instance so we could access the data with JDBC
- Write custom scripts to dump data from JDBC
- Munge the SQL INSERT data to convert from java Unicode to MySQL Unicode encoding
- Import all records back into the MySQL DB
- Re-add constraints one at a time
Some of the challenges we ran into:
- Our database well exceeded 384 MB, rendering the MySQL migration tools useless
- HSQLDB uses Java escape sequences for UTF-8 characters; we ended up using Java code to convert those sequences from the HSQLDB script file to binary for our MySQL import file, as MySQL wants either \x-encoded sequences or raw binary.
- We had to migrate from the embedded version of HSQLDB to the standalone version so that our hand-written migration code could access the HSQLDB instance via JDBC.
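The escape conversion mentioned above is small enough to sketch. This Python version (a stand-in for the Java code we actually used) rewrites Java-style \uXXXX sequences from an HSQLDB .script dump as raw UTF-8 bytes for a MySQL import file:

```python
import re

def java_unicode_to_utf8(sql_text):
    # Replace each \uXXXX escape with the character it encodes,
    # then emit raw UTF-8 bytes for the MySQL import file.
    def repl(match):
        return chr(int(match.group(1), 16))
    return re.sub(r"\\u([0-9a-fA-F]{4})", repl, sql_text).encode("utf-8")
```

Caveat: characters outside the Basic Multilingual Plane arrive as surrogate pairs and would need extra handling; this sketch covers the common case.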
I really wish Atlassian would list some of these potential problems on their site; had they given more than just ‘do not use it,’ we would have listened :).
— Max Schubert
The "E" word corrupts a Wiki · 9 August 2007, 18:47
Confluence, The Enterprise Wiki
Pretty UI, well-designed Java API, missing many features that GPL / OSS Wikis provide. The latest versions require Oracle to be used as a backend. At work I have been told we have to upgrade away from the HSQLDB version of this Wiki to the Oracle version. When I asked “Why?” the only answer I was given was that the Oracle version is more “robust” and “because we use this in production.” This Wiki instance is used for internal documentation and collaboration among a group of about 40 people.
Hmm .. so we change from a reliable, lightweight, zero-administration, mature database backend to a reliable, very heavyweight, complex database .. and we get more robustness? Doesn’t compute. Smells of licensing revenue to me.
I find Confluence to be a decent piece of groupware, but it is much less flexible and easy to use than many other Wikis I have used. The “E” word has corrupted Confluence.
— Max Schubert
Enterprise: Four-Letter Word · 9 June 2007, 16:11
The more I work with costly “Enterprise” software packages, the more I get a sense of FUD when I hear that term in the name of a piece of software.
My idea of what “Enterprise” is supposed to mean with regards to software is “able to work in a heterogeneous, networked environment and integrate with a wide variety of other software packages and SNMP monitoring systems.” The reality I am seeing with any package that uses the term “Enterprise” as part of its name is:
- Difficult to manage
- Licensing is expensive
- Managing licenses is a painful experience
- Support technicians are nice but generally suffer from the “we read scripts” support desk anti-pattern.
- Most packages do not have open APIs for integration with other products; instead they encourage the use of “professional support” a.k.a consulting time at hefty rates (for example, $2k/hour for an EMC consultant).
- The “Enterprise” software company maintains a wide range of software products that work together wonderfully, but surprise (!) .. third party products do not integrate as well. Why would that be? Hmm.
- Rollouts take an extra two months compared to an open source package or software from a smaller company
If I ever use the word “Enterprise” in a software package I develop or manage the development of I will know it is time to take me out into the field and shoot me.
— Max Schubert