UPDATE: Please see this updated post instead.
NOTE: These tests were carried out at the beginning of 2011, with the newest stable versions of CFEngine and Puppet available at the time (from the CFEngine web site and the SUSE Linux repository, respectively). New versions of both solutions have been released since.
CFEngine and Puppet are configuration management solutions that can help you automate IT infrastructure. Practical examples include adding local users, installing Apache, and making sure password-based authentication in sshd is turned off. The more servers and complexity you have, the more interesting such a solution becomes for you.
The companies and communities behind CFEngine and Puppet frequently claim that their solution “is scalable”. They point to users managing hundreds, thousands, even tens of thousands of servers with the tools. But what does all this mean? Are CFEngine and Puppet indistinguishable with respect to scalability?
In this post, we will highlight one aspect of scalability: how many clients a server can handle. As always, to find useful answers, we should take measurements and stop listening to the marketing departments.
Amazon EC2 was used to measure performance at the server while the number of clients was increased in steps every 30 minutes. We started with 25 clients, then went to 50, 100, 200 and 400.
CPU usage at the central server and client execution time were measured. The Amazon CloudWatch system was used to monitor resource usage.
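As a rough illustration of how client execution time can be captured, here is a minimal sketch that times one run of an arbitrary command. The post does not say how the timing was actually done, so the helper name `time_command` and the approach itself are assumptions; on a real client you would pass in the agent command line used on that host.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command once and return its wall-clock duration in seconds.

    Hypothetical helper: the original tests may have measured client
    execution time in a different way.
    """
    start = time.perf_counter()
    subprocess.run(argv, check=False)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # agent invocation used on the client.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"run took {elapsed:.2f} s")
```

Repeating the measurement at each client count and averaging the results gives one number per step that can be compared across the ramp.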
The following cloud instances were used,
- Clients: t1.micro. up to 2 ECUs, 1 core, 613MB memory, SuSE 11.1, 64-bit
- Server: m1.large. 4 ECUs, 2 cores, 7.5GB memory, SuSE 11.1, 64-bit
and these versions of each tool:
- CFEngine 3.1.5 (with Enterprise extensions)
- Puppet 0.24.8
Both tools were set to run every 5 minutes.
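To get a feel for what the ramp means for the server, the sketch below turns the step sizes and the 5-minute run interval into an average number of agent runs per second. It assumes client runs are spread evenly over the interval, which the tests do not guarantee; the function names are illustrative only.

```python
# Values from the test setup described above.
CLIENT_STEPS = [25, 50, 100, 200, 400]
STEP_DURATION_MIN = 30
RUN_INTERVAL_S = 5 * 60


def runs_per_second(clients, interval_s=RUN_INTERVAL_S):
    """Average agent runs per second arriving at the server.

    Assumes runs are spread evenly across the interval (an idealisation).
    """
    return clients / interval_s


def schedule():
    """Return (start_minute, clients, runs_per_second) for each ramp step."""
    return [
        (i * STEP_DURATION_MIN, clients, runs_per_second(clients))
        for i, clients in enumerate(CLIENT_STEPS)
    ]


if __name__ == "__main__":
    for start, clients, rate in schedule():
        print(f"t={start:3d} min  clients={clients:3d}  ~{rate:.2f} runs/s")
```

Even at the last step, the server sees on average fewer than two agent runs per second under this assumption, so the absolute request rate in these tests is modest.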
Neither CFEngine nor Puppet will perform any action unless you tell it to. To have something realistic to test, we used a policy (CFEngine) or manifest (Puppet) that copies 20 configuration files, totalling 140 kB, from the server.
File copying and templating are very common in configuration management, simply because Unix systems are configured largely through configuration files.
The CFEngine server was not much affected by the extra load of more clients. At 400 clients, the average server CPU usage was below 10%. Client execution time was pretty much constant at 7.85 seconds, independent of the number of clients.
For Puppet, the story looks different. The Puppet server worked up to 50 clients, but stopped responding to client requests at 100. Reducing the number of clients back down to 50 made the Puppet server start responding again.
Another thing to note is that the CPU graph for Puppet has many more spikes than the smoother graph for CFEngine.
We did not manage to find the maximum client count for the CFEngine server, but if you extrapolate the CPU graph, it should lie somewhere between 4000 and 5000 clients. The Puppet server started failing somewhere between 50 and 100 clients. The difference is mind-blowing!
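The extrapolation can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes server CPU usage grows in direct proportion to the number of clients and that the server is saturated at 100% CPU; both are simplifications. The 10% and 8% readings are illustrative placeholders consistent with "below 10% at 400 clients", not measured values.

```python
def clients_at_saturation(cpu_percent, clients, cpu_limit=100.0):
    """Linearly extrapolate the client count at which CPU reaches cpu_limit.

    Assumes CPU usage is proportional to the number of clients, which is
    only a rough model of real server behaviour.
    """
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    return clients * cpu_limit / cpu_percent


if __name__ == "__main__":
    # Hypothetical CPU readings at 400 clients, chosen for illustration.
    for cpu in (10.0, 8.0):
        limit = clients_at_saturation(cpu, 400)
        print(f"{cpu:.0f}% CPU at 400 clients -> about {limit:.0f} clients")
```

Under these assumptions, a reading of 10% corresponds to roughly 4000 clients and 8% to roughly 5000. In practice, memory, network, or disk limits may be reached before the CPU is saturated.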
We have only discussed the number of clients per server. Scalability is an abstract term, so there are definitely other interesting aspects to highlight. One example: how easy is it to scale to multiple servers?