[Pacemaker] Occasional nonsensical resource agent errors

Ken Gaillot kjgaillo at gleim.com
Tue Jul 15 15:36:40 EDT 2014


On 07/15/2014 02:31 PM, Andrew Daugherity wrote:
>> Message: 1
>> Date: Sat, 12 Jul 2014 09:42:57 -0400
>> From: Ken Gaillot <kjgaillo at gleim.com>
>> To: pacemaker at oss.clusterlabs.org
>> Subject: [Pacemaker] Occasional nonsensical resource agent errors
>> 	since Debian 3.2.57-3+deb7u1 kernel update
>>
>> Hi,
>>
>> We run multiple deployments of corosync+pacemaker on Debian "wheezy" for
>> high availability of various resources. The configurations are unchanged
>> and had run without any issues for many months. However, since we applied
>> the Debian 3.2.57-3+deb7u1 kernel update in May, we have been getting
>> resource agent errors on rare occasions, with error messages that are
>> clearly incorrect.
>>
>>
>> [....]
>>
>> Given the odd error messages from the resource agent, I suspect memory
>> corruption of some sort. We've been unable to find anything else useful
>> in the logs, and we'll probably end up reverting to the prior kernel
>> version. But given the rarity of the issue, it would be a long while
>> before we could be confident that the revert fixed it.
>>
>> Is anyone else running pacemaker on Debian with 3.2.57-3+deb7u1 kernel
>> or later? Has anyone had any similar issues?
>
> Just curious, I see you're running Xen; are you setting dom0_mem?  I had similar issues with SLES 11 SP2 and SP3 (but not <= SP1) that were apparently caused by random memory corruption due to a kernel bug.  The failures were mostly random, but I did eventually find a repeatable test case: checksum verification of a kernel build tree with mtree; on affected systems there would usually be a few files that failed to verify.
>
> I had been setting dom0_mem=768M, as that was a good balance between maximizing memory available for VMs and keeping enough for services in Dom0 (including pacemaker/corosync), and I set node attributes for pacemaker utilization to 1GB less than physical RAM, leaving 256M available for Xen overhead, etc.  Raising dom0_mem to 2048M (or not setting it at all) was a sufficient workaround to avoid the bug, but I have finally received a fixed kernel from Novell support.
>
> Note: this fix has not yet made it into any official updates for SLES 11 -- Novell/SUSE say it will be in the next kernel version, whenever that happens.  Recent openSUSE kernels are also affected (and have yet to be fixed).
>
> -Andrew

Hi Andrew,

Thanks for the feedback!

Our "aries/taurus" cluster are Xen dom0s, and we pin dom0_mem so there's 
at least 1GB RAM reported in the dom0 OS. (The version of Xen+Linux 
kernel in wheezy has an issue where the reported RAM is less than the 
dom0_mem value, so dom0_mem is actually higher.)
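
(If anyone wants to see the same discrepancy on their own dom0s, a quick 
check is to compare what the dom0 kernel reports -- MemTotal in 
/proc/meminfo -- against the amount you intended to give it. The rough 
Python sketch below does just that; the 1GB target in it is only an 
example, not our exact setting.)

#!/usr/bin/env python
# Rough sketch only: compare the RAM the dom0 kernel actually reports
# against a target value.  The 1GB target below is an example, not our
# exact setting.

TARGET_KB = 1024 * 1024  # 1GB in kB, since /proc/meminfo reports kB

def reported_mem_kb():
    """Return MemTotal from /proc/meminfo, in kB."""
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemTotal:'):
                return int(line.split()[1])
    raise RuntimeError('MemTotal not found in /proc/meminfo')

if __name__ == '__main__':
    total = reported_mem_kb()
    print('dom0 reports %d kB (~%d MB)' % (total, total // 1024))
    if total < TARGET_KB:
        print('Below the 1GB target; dom0_mem may need to be set higher '
              'than the target to compensate.')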

However, we are also seeing the issue on our "talos/pomona" cluster, 
whose nodes are not dom0s, so I don't suspect Xen itself. It could still 
be the same underlying kernel issue, though.

mtree isn't packaged for Debian, and I'm not familiar with it, although 
I did see a Linux port on Google Code. How do you use it for your test 
case? What do the detected differences signify?
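
In case it helps to say what I imagine the test to be: my guess is that 
it boils down to recording a checksum of every file in the kernel build 
tree and re-verifying the tree later, so that silent corruption shows up 
as a mismatch. The Python sketch below is only my rough equivalent of 
that idea, not actual mtree usage (the script name, manifest format and 
paths are made up), so please correct me if the real workflow differs.

#!/usr/bin/env python
# Rough, hypothetical equivalent of the mtree test case (not mtree itself):
# hash every file under a tree into a manifest, then re-verify the tree
# later and report any files whose contents changed.
import hashlib
import os
import sys

def file_hashes(root):
    """Yield (relative path, sha256 hex digest) for regular files under root."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path) or os.path.islink(path):
                continue
            h = hashlib.sha256()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(65536), b''):
                    h.update(chunk)
            yield os.path.relpath(path, root), h.hexdigest()

if __name__ == '__main__':
    # Hypothetical usage:  checktree.py create /usr/src/linux spec.txt
    #                      checktree.py verify /usr/src/linux spec.txt
    action, root, manifest = sys.argv[1], sys.argv[2], sys.argv[3]
    if action == 'create':
        with open(manifest, 'w') as out:
            for relpath, digest in file_hashes(root):
                out.write('%s  %s\n' % (digest, relpath))
    else:  # verify
        expected = {}
        with open(manifest) as f:
            for line in f:
                digest, relpath = line.rstrip('\n').split('  ', 1)
                expected[relpath] = digest
        bad = [p for p, d in file_hashes(root)
               if p in expected and expected[p] != d]
        for p in bad:
            print('MISMATCH: %s' % p)
        sys.exit(1 if bad else 0)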

Do you know what kernel and Xen versions were in SP2/3, and what 
specifically was fixed in the kernel they gave you?

-- Ken Gaillot <kjgaillo at gleim.com>
Network Operations Center, Gleim Publications



