[Pacemaker] Occasional nonsensical resource agent errors since Debian 3.2.57-3+deb7u1 kernel update

Sat Jul 12 09:42:57 EDT 2014

Hi,

We run multiple deployments of corosync+pacemaker on Debian "wheezy" for 
high-availability of various resources. The configurations are unchanged 
and ran without any issues for many months. However, since we applied 
the Debian 3.2.57-3+deb7u1 kernel update in May, we have been getting 
resource agent errors on rare occasions, with error messages that are 
clearly incorrect.

The incidents have happened four times on two unrelated clusters:

* Our cluster hosts "talos" and "pomona" use pacemaker to manage a few 
virtual IP addresses using the ocf:heartbeat:IPaddr2 resource agent. This 
cluster has had two incidents. The first incident began with this error:

Jun  2 17:30:16 pomona lrmd: [2145]: info: RA output: 
(ldap-ip:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: 1: 
/usr/lib/ocf/resource.d//heartbeat/IPaddr2: : Permission denied

The second incident began with this error:

Jul 12 08:36:15 talos IPaddr2[21294]: ERROR: Setup problem: couldn't 
find command: ip

I can confidently say that neither the permissions of IPaddr2 nor the 
location of the "ip" command changed at any point!

* Our cluster hosts "aries" and "taurus" use pacemaker in a more 
complicated setup, managing Xen virtual machines on shared storage 
utilizing DRBD and CLVM, using the resource agents 
ocf:pacemaker:controld, ocf:gleim:clvmd (which is the stock clvmd 
resource agent from a later pacemaker version than is included in 
wheezy), ocf:heartbeat:LVM, ocf:linbit:drbd, and ocf:gleim:Xen (which is 
the stock Xen resource agent with a trivial one-line change for a local 
workaround).

This cluster has also had two incidents:

* The first began with:

Jun 16 10:38:15 aries lrmd: [3646]: info: RA output: 
(jabber:monitor:stderr) /usr/lib/ocf/resource.d//gleim/Xen: 71: local: 
en-list: bad variable name

There is no variable "en-list" in the resource agent; the closest string 
in the file is "xen-list", which is a binary, not a variable, and is used 
like this:

   ...
   if have_binary xen-list; then
      xen-list $1 2>/dev/null | grep -qs "State.*[-r][-b][-p]--" 2>/dev/null
      ...

* The second began with:

Jun 21 11:58:58 taurus Xen[9052]: ERROR: Setup problem: couldn't find 
command: awk

Again, the location of "awk" has not changed.
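
For anyone who wants to rule out the obvious causes on their own nodes, 
here is a minimal sketch of the kind of check that backs the statements 
above: it confirms that a resource agent file is executable and that the 
commands it needs can be found in the current PATH. The function name 
check_ra_env is hypothetical and is not part of any resource agent or of 
ocf-shellfuncs. Note also that the PATH seen by lrmd may differ from that 
of a login shell, so a clean result here is only a hint.

```shell
#!/bin/sh
# check_ra_env RA_PATH CMD...
# Hypothetical helper (not part of pacemaker or resource-agents).
# Reports if RA_PATH is not executable or if any CMD is not found in PATH.
# Returns 0 when everything checks out, 1 otherwise.
check_ra_env() {
    ra_path=$1
    shift
    rc=0
    if [ ! -x "$ra_path" ]; then
        echo "not executable: $ra_path"
        rc=1
    fi
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "not found in PATH ($PATH): $cmd"
            rc=1
        fi
    done
    return $rc
}
```

Usage on one of the IP clusters would look something like 
"check_ra_env /usr/lib/ocf/resource.d/heartbeat/IPaddr2 ip", and on the 
Xen cluster "check_ra_env /usr/lib/ocf/resource.d/gleim/Xen awk".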

We have no reason to suspect the kernel update other than timing, and 
the fact that the incidents occur on unrelated clusters. We have since 
upgraded to Debian's next update, 3.2.57-3+deb7u2, but the most recent 
incident occurred after that. The original update included fixes for 
these issues:

CVE-2014-0196

     Jiri Slaby discovered a race condition in the pty layer, which could
     lead to denial of service or privilege escalation.

CVE-2014-1737 / CVE-2014-1738

     Matthew Daley discovered that missing input sanitising in the
     FDRAWCMD ioctl and an information leak could result in privilege
     escalation.

CVE-2014-2851

     Incorrect reference counting in the ping_init_sock() function allows
     denial of service or privilege escalation.

CVE-2014-3122

     Incorrect locking of memory can result in local denial of service.

Given the odd error messages from the resource agent, I suspect it's a 
memory corruption error of some sort. We've been unable to find anything 
else useful in the logs, and we'll probably end up reverting to the 
prior kernel version. But given the rarity of the issue, it would be a 
long while before we could be confident that the revert fixed it.
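
Because the incidents are so rare, one way to keep track of them is to 
scan the logs for the message patterns seen so far. The following is only 
a sketch: the function name scan_ra_errors is made up, the patterns are 
taken from the four messages quoted above, and other failures of the same 
kind could well produce different text.

```shell
#!/bin/sh
# scan_ra_errors LOGFILE...
# Hypothetical helper: print log lines that resemble the four incidents
# quoted in this post. The patterns are assumptions based on those
# messages only, so this can miss differently worded errors.
scan_ra_errors() {
    grep -E "RA output: .*(Permission denied|bad variable name)|Setup problem: couldn't find command" "$@"
}
```

It would be run against whichever file receives the lrmd and resource 
agent messages, for example "scan_ra_errors /var/log/syslog" (that path 
is an assumption and depends on the local syslog configuration).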

Is anyone else running pacemaker on Debian with 3.2.57-3+deb7u1 kernel 
or later? Has anyone had any similar issues?

-- Ken Gaillot <kjgaillo at gleim.com>
    Gleim NOC