Cfengine 3.1.4 and 3.1.3 extended change log

Highlights

First of all, thanks for the good feedback on the previous extended change log! It seems the Cfengine community is interested in this, so I will continue the series. This time we cover the changes in both Cfengine 3.1.3 and 3.1.4, since they were released close to each other (January 22nd and January 31st). Some rather annoying bugs were discovered by the Cfengine community in 3.1.3, so the 3.1.4 release was brought forward.

The new releases bring the following advancements.

  • major memory leaks removed – daemons now show no growth
  • allowing definition of good and bad return codes for commands
  • lock database purging
  • 30 second recv() timeout on Linux
  • two race conditions (causing segmentation faults) in cf-serverd fixed
  • Solaris global zone process list fixed
  • new function ip2host()
  • package architectures handled correctly

There are more details in the ChangeLog file and on the bug tracker.

Leak no more

Memory leaks occur when a program allocates memory but does not release it again later. This is not always a problem, because the operating system reclaims all of a process’ memory when it terminates. Releasing memory just before termination thus only results in unnecessary resource consumption (indeed, the GNU C library does not by default release memory on process termination).

However, in daemons and long-running programs, repeated memory leaks are clearly an issue. Memory leaks usually manifest themselves as an ever-increasing size of the process’ virtual memory. Unfortunately, such bugs are extremely hard to track down, because it is not always clear where the leak happens or when a certain memory segment should be released (if it is released too soon, the process will crash). In Cfengine, this is further complicated by the fact that certain policies may cause more severe leaks than others, because different execution paths are followed when running them.

But since multiple reports of severe leaks started to come in from the Cfengine community, a lot of effort was put into debugging them (see the report on the bug tracker). It took one month (!) of iteration before all the leaks were tracked down, even with much help with testing from the community, especially Jonathan. On the positive side, these leaks (or anything like them) are very unlikely to reappear. But don’t hesitate to create a report if you think you have found a leak some day.

The main source of leaks turned out to be an error when releasing lists (struct Rlist). Also, when the policy was re-read, parts of the old one were never released. Since only cf-monitord and cf-serverd re-read the policy, they were the most affected components.

Jonathan from the Cfengine community provided a policy that caused a lot of leakage, which made debugging easier. We can also use it to illustrate the difference between Cfengine 3.1.2 and Cfengine 3.1.4. The graphs below show the resident set size (RSS) of the three Cfengine daemons, measured over one day. They pretty much speak for themselves. Thanks to the community members who helped fix this issue!

[Figure: Cfengine 3.1.2 daemon leakage]

[Figure: Cfengine 3.1.4 daemon memory usage]

download all graphs (including virtual sizes)
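A measurement like the one behind these graphs can be approximated with standard tools. The following is only a sketch, not the method used to produce the graphs: the log format and the helper names `sample_rss` and `rss_growth` are made up for illustration, and it assumes a `ps` that understands `-o rss=` (as on Linux, the BSDs and Solaris).

```shell
#!/bin/sh
# Sketch: log the resident set size (RSS) of a process and report its growth.
# The log format "<epoch seconds> <rss in kB>" is an assumption of this sketch.

# sample_rss PID
# Print one sample line for the given process, or nothing if it cannot be read.
sample_rss() {
    rss=$(ps -o rss= -p "$1" 2>/dev/null | tr -d ' ') || rss=""
    if [ -n "$rss" ]; then
        echo "$(date +%s) $rss"
    fi
}

# rss_growth
# Read sample lines on stdin and print the RSS difference (last - first) in kB.
rss_growth() {
    awk 'NR == 1 { first = $2 } { last = $2 } END { if (NR > 0) print last - first }'
}

# Example with made-up numbers: a process growing from 5000 kB to 5300 kB.
printf '%s\n' "1296000000 5000" "1296003600 5120" "1296007200 5300" | rss_growth
# prints 300
```

To collect real data, `sample_rss` could be run from cron every few minutes with the PID of cf-serverd, cf-monitord or cf-execd, appending to a log file of your choice, and the log fed to `rss_growth` after a day. A daemon that no longer leaks should show a growth close to zero.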

Command return codes

As of Cfengine 3.1.0, promises of type commands were flagged as repaired if the command returned zero, and as not kept otherwise. This made it possible to define a class in either case and run follow-up promises. In Cfengine 3.1.4, a much more flexible framework has been introduced, and it also covers the commands run by packages promises and the transformer in files promises. Cfengine users can now specify lists of return codes for which one of these promises should be considered kept, repaired or not kept. It’s often easier to understand by example, so let’s do just that. First, let’s start with a simple shell script.

#!/bin/sh
# saved to /tmp/retarg
exit $1

So this script just exits with the code given as the first parameter, which must be from 0 to 255 on Unix. We will use this little script to demonstrate the new return code functionality in the following snippet.

bundle agent commands_retcode
{
commands:
  "/tmp/retarg 0"
  classes => define_retcodes;
}
###
body classes define_retcodes
{
kept_returncodes => { "5" };
repaired_returncodes => { "0" };
failed_returncodes => { "1", "2", "3", "4" };
promise_kept => { "waskept" };
promise_repaired => { "wasrepaired" };
repair_failed => { "wasfailed" };
}

By running /tmp/retarg with arguments 0, 1 and 5, we see that the classes wasrepaired, wasfailed and waskept get defined, respectively. We may also use overlapping return codes in the *_returncodes lists, which can result in the promise getting multiple statuses (e.g. both repaired and failed). This might seem a bit strange, but it gives the user total control. If the return code is not found in any of the lists, the promise does not get a status at all. When none of the lists are defined, Cfengine falls back to the default: zero means promise repaired, and anything else means promise failed.
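The matching rules can be tried out without Cfengine. The following shell sketch mimics them as described here; it is not Cfengine's actual code. The helper names (`in_list`, `classify_rc`, `run_and_classify`) are made up, and the script is recreated in a temporary directory rather than in /tmp/retarg.

```shell
#!/bin/sh
# Sketch of the return code matching described in the text; not Cfengine's code.
# Each list is passed as a space-separated string of return codes.

# Recreate the article's script in a temporary directory (hypothetical location).
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
script="$dir/retarg"
printf '%s\n' '#!/bin/sh' 'exit $1' > "$script"

# in_list CODE LIST: succeed if CODE appears in the space-separated LIST.
in_list() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# classify_rc CODE KEPT_LIST REPAIRED_LIST FAILED_LIST
# Print every status that applies; an empty line means "no status".
classify_rc() {
    code=$1; kept=$2; repaired=$3; failed=$4
    if [ -z "$kept$repaired$failed" ]; then
        # No lists defined: zero is repaired, anything else is failed.
        if [ "$code" -eq 0 ]; then echo "repaired"; else echo "failed"; fi
        return 0
    fi
    result=""
    if in_list "$code" "$kept"; then result="$result kept"; fi
    if in_list "$code" "$repaired"; then result="$result repaired"; fi
    if in_list "$code" "$failed"; then result="$result failed"; fi
    echo "${result# }"
}

# run_and_classify ARG KEPT_LIST REPAIRED_LIST FAILED_LIST
# Run the script with ARG and classify the exit status the caller sees.
run_and_classify() {
    rc=0
    sh "$script" "$1" || rc=$?
    echo "exit status $rc: $(classify_rc "$rc" "$2" "$3" "$4")"
}

# The lists from define_retcodes.
for arg in 0 1 5 7; do
    run_and_classify "$arg" "5" "0" "1 2 3 4"
done
# Overlapping lists: return code 1 is both repaired and failed.
run_and_classify 1 "5" "0 1" "1 2 3 4"
# No lists at all: fall back to the default.
run_and_classify 3 "" "" ""
```

With the lists from define_retcodes, arguments 0, 1 and 5 come out as repaired, failed and kept, while 7 matches no list and gets no status. The last two calls show the overlapping case and the fallback to the default.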

This flexibility is also allowed in packages-promises, as demonstrated in the following.

bundle agent packages_retcode
{
packages:
  "aatv"
  package_policy => "add",
  package_method => generic,
  classes => define_retcodes;
}

Lastly, in files-promises, the return value of the transformer command is considered, as shown below.

bundle agent transformer_retcode
{
files:
  "/tmp"
  classes => define_retcodes,
  transformer => "/tmp/retarg 0";
}

A complete self-contained policy demonstrating the new framework in all three of the above promise types is available for download here. In the reference manual, this is documented as part of the classes body.

Lock purging

As you probably know, Cfengine has a concept of locks. Locks ensure that promises are not checked too often, but also that repairing each promise does not take too long. These parameters are configurable through the ifelapsed and expireafter policy settings, available at a global and promise level. Since information about these locks needs to persist between runs of cf-agent, Cfengine keeps track of them in a database stored in /var/cfengine/state/cf_lock.* (the suffix depends on the dbm used). A hash of the promise attributes is used as the key for this database.
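The two time checks behind a lock can be written down in a few lines of shell. This is only a sketch of the idea, not Cfengine's implementation: the function names are hypothetical, times are epoch seconds, and ifelapsed and expireafter are taken to be in minutes.

```shell
#!/bin/sh
# Sketch of the two time checks behind a lock; not Cfengine's actual code.
# Times are epoch seconds; ifelapsed and expireafter are given in minutes.

# may_check NOW LAST_CHECK IFELAPSED
# Succeed if at least IFELAPSED minutes have passed since the last check.
may_check() {
    [ $(( $1 - $2 )) -ge $(( $3 * 60 )) ]
}

# has_expired NOW STARTED EXPIREAFTER
# Succeed if a repair that began at STARTED has run for more than EXPIREAFTER minutes.
has_expired() {
    [ $(( $1 - $2 )) -gt $(( $3 * 60 )) ]
}

now=1296432000

# Last checked 30 seconds ago with ifelapsed = 1 minute: too soon.
if may_check "$now" $((now - 30)) 1; then
    echo "check the promise"
else
    echo "skip: checked too recently"
fi

# Repair started two hours ago with expireafter = 60 minutes: overdue.
if has_expired "$now" $((now - 7200)) 60; then
    echo "repair is overdue"
fi
```

Both checks need the time of the previous run, which is why the lock information has to be stored between runs of cf-agent.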

The problem with this is that sometimes the attributes change even though the promise really is the same. For example, if you have a commands promiser “/bin/echo $(date)”, the promiser appears to change each time cf-agent runs. As another example, you may want to delete files in /tmp that are more than 3 days old. Many of these files will never reappear (though some might), so keeping an entry for each of them in the lock database just increases its size for no reason. This causes the lock database to grow indefinitely, albeit very slowly (if you are not yet on Cfengine 3.1.4, check the size of yours).

Trying to decide heuristically whether a given promise should be in the lock database would surely end in unexpected behaviour for some of Cfengine’s huge user base. A less risky approach, introduced in Cfengine 3.1.4, is to purge old locks automatically: cf-agent runs a lock-purging algorithm every month, deleting locks that are more than one month old. This should take care of the (slow) growth of the lock database without risking unexpected behaviour.
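The effect of age-based purging can be illustrated with a small shell sketch. It is not Cfengine's implementation: the input format, the lock names and the function name `purge_old_locks` are invented for the example, and one month is approximated as 30 days.

```shell
#!/bin/sh
# Sketch of age-based purging; not Cfengine's actual implementation.
# Input lines are "<lock key> <last-used epoch seconds>", a made-up format.

# purge_old_locks NOW MAX_AGE_SECONDS
# Copy stdin to stdout, dropping entries older than MAX_AGE_SECONDS.
purge_old_locks() {
    awk -v now="$1" -v max_age="$2" 'now - $2 <= max_age { print }'
}

# One month is approximated as 30 days here (an assumption).
MONTH=$((30 * 24 * 60 * 60))
NOW=1296432000

# Only the entry last used more than a month before NOW (lock.files.bbb) is dropped.
printf '%s\n' \
    "lock.commands.aaa 1296400000" \
    "lock.files.bbb 1290000000" \
    "lock.commands.ccc 1295000000" | purge_old_locks "$NOW" "$MONTH"
```

To see how large your own lock database has become, `ls -l /var/cfengine/state/cf_lock.*` is enough.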

Other improvements

A 30 second timeout on the recv() system call has been introduced on Linux hosts. This means that any connection that waits to receive data will time out if no data is received within 30 seconds. The reason for introducing this is to prevent a remote system from causing components of Cfengine to hang indefinitely. A remote system may become unresponsive for a number of reasons, including network unreliability, high load, deadlocks (e.g. when trying to open a database that was uncleanly shut down), and kernel or driver bugs, most of which are outside of Cfengine’s control. Introducing a mechanism to back off after a certain time has elapsed is the only way we can protect ourselves from all these scenarios, while still allowing for self-healing when Cfengine retries the operation later. As the details of the socket API differ between operating systems, Linux is the first one to get this support.

A new function, ip2host(), that does reverse DNS lookups has been introduced. Note that DNS is often quite unreliable, so the lookup can cause cf-agent to hang for a while.

bundle agent reverse_lookup
{
vars:
 "local4" string => ip2host("127.0.0.1");
 "local6" string => ip2host("::1");

reports:
cfengine_3::
  "local4 is $(local4)";
  "local6 is $(local6)";
}
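Before relying on ip2host() in a policy, it can be useful to check from the shell what the resolver returns for an address and how long it takes. The sketch below assumes that getent is available (as on glibc-based Linux systems); the function name `reverse_lookup` is made up, and this is not how ip2host() itself is implemented.

```shell
#!/bin/sh
# Sketch: do a reverse lookup from the shell to see what the resolver returns.
# Assumes getent is available; this is not how ip2host() is implemented.

# reverse_lookup ADDRESS
# Print the first name the resolver returns for ADDRESS, or ADDRESS itself
# if the lookup yields nothing.
reverse_lookup() {
    name=$(getent hosts "$1" 2>/dev/null | awk 'NR == 1 { print $2 }') || name=""
    if [ -n "$name" ]; then
        echo "$name"
    else
        echo "$1"
    fi
}

for ip in 127.0.0.1 ::1; do
    echo "$ip is $(reverse_lookup "$ip")"
done
```

Running the script under `time` gives a rough idea of how long cf-agent might wait for the same lookups.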

Cfengine community members discovered that Cfengine sometimes ignored the architecture when considering packages-promises. This could cause Cfengine to believe that a given package was installed for all architectures, even though it was installed only for one. With Cfengine 3.1.4, this is handled correctly.

Two important issues causing segmentation faults in cf-serverd have been fixed. They were caused by race conditions in cf-serverd and thus appeared only on busy servers.

On Solaris global zones, Cfengine can now distinguish processes based on the zone they run in (previously a Nova feature). This means that a process restart promise in the global zone should not kill processes in other zones. However, a bug prevented this from working properly, so processes in all zones were in fact killed. This is resolved in Cfengine 3.1.4.

Get it!

As usual, Cfengine 3.1.4 is provided not only as a source tarball but also prepackaged for the most popular Linux distributions; the packages are available by logging into the Engine Room (free registration required). Users of the following distributions enjoy free packages, in both 32- and 64-bit versions.

  • CentOS 5
  • Debian 4, 5 and 6
  • Fedora 14
  • Red Hat Enterprise Linux 3, 4, 5 and 6
  • Suse 9, 10 and 11
  • Ubuntu 8, 9 and 10

Note that most distributions also maintain a Cfengine 3 package, but this is usually older and may not be built in a uniform way.

Thanks to the feedback on the last post, self-contained policies are now available for download alongside the snippets. Please do not hesitate to leave a comment if you found this useful or have more suggestions for improvements.

Enjoy!

5 Responses to Cfengine 3.1.4 and 3.1.3 extended change log

  2. Thanks for doing this again.

  3. Thanks for the detailed changelog. I missed one addition: you can control how often the ps command is run, and for which bundles it has to reinitialize the state of the processes.

    ##
    # To avoid a lot of ps commands, wait for cf 3.1.3
    #
    refresh_processes => { "none" };
