[Pacemaker] Occasional nonsensical resource agent errors

Andrew Daugherity adaugherity at tamu.edu
Thu Jul 17 21:07:02 EDT 2014


> On Tue, 15 Jul 2014 15:36:40 -0400, Ken Gaillot <kjgaillo at gleim.com> wrote:
> Subject: Re: [Pacemaker] Occasional nonsensical resource agent errors
> 
> Hi Andrew,
> 
> Thanks for the feedback!
> 
> Our "aries/taurus" cluster nodes are Xen dom0s, and we pin dom0_mem so there's 
> at least 1GB RAM reported in the dom0 OS. (The version of Xen+Linux 
> kernel in wheezy has an issue where the reported RAM is less than the 
> dom0_mem value, so dom0_mem is actually higher.)
> 
> However, we are also seeing the issue on our "talos/pomona" cluster, 
> whose nodes are not dom0s, so I don't suspect Xen itself. But it could be 
> the same kernel issue.
> 
> mtree isn't packaged for Debian, and I'm not familiar with it, although 
> I did see a Linux port on Google code. How do you use it for your test 
> case? What do the detected differences signify?
That mtree port from Google Code is what I used; fortunately for me, it was already packaged in the OBS: http://software.opensuse.org/package/mtree
It looks like its only build dependency is openssl-devel, so it shouldn't be too hard to build.  I'm sure there are other utilities that accomplish the same thing (e.g. tripwire), but I was familiar with mtree from BSD-land, so that's what I used.

Backtracking a bit: when I saw these strange errors, running 'rpm -Va' (verify installed files from all packages; there's probably a dpkg equivalent, but I don't know it offhand) would sometimes, but not consistently, report errors.

I decided that I needed a bigger dataset.  I had been playing with zfsonlinux on another box, which had several kernel trees extracted for that, so I tarballed the build dirs (2.6GB, 171k files) and checksummed them with mtree, then copied the tarball and checksum file to the boxen with problems and verified it there.  I actually had to boot into a known good kernel (in my case, kernel-default rather than kernel-xen) to get a clean untar.

Under the problematic kernels, a small number of files would fail to verify (which files failed tended to change between runs, but I would almost always get some errors).  Occasionally the filesystem would also report I/O errors (much more likely under btrfs than xfs or ext3), but after rebooting and running fsck/xfs_repair/btrfs scrub etc., the FS would check out clean.

Basic mtree usage--
  Generate checksum file:
    1) cd /path/to/testroot
    2) mtree -c -K sha256digest > /path/to/checksumfile  [outside testroot]
  Verify:
    1) cd /path/to/testroot
    2) mtree -f /path/to/checksumfile
Like diff, only differences (in file size/mode/checksum/etc.) are reported and no output means everything verifies.
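
Where mtree isn't available, roughly the same check can be sketched with find and sha256sum.  This is not the procedure used above, only an approximation: it assumes GNU findutils/coreutils (for 'sort -z', 'xargs -0 -r' and 'sha256sum --quiet'), it compares file contents only (not size/mode/ownership as mtree does), and it will not notice extra files.  The function names and paths are placeholders.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of mtree.

# make_manifest TESTROOT MANIFEST
# Record a SHA-256 checksum for every regular file under TESTROOT.
# Keep MANIFEST outside TESTROOT so it is not checksummed itself.
make_manifest() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$2"
}

# verify_manifest TESTROOT MANIFEST
# Re-checksum the files listed in MANIFEST; only failures are printed.
# Returns non-zero if any file is missing or its checksum differs.
verify_manifest() {
    ( cd "$1" && sha256sum --quiet -c - ) < "$2"
}
```

Run make_manifest /path/to/testroot /path/to/checksumfile on the known good box and verify_manifest with the same arguments on the suspect one.  As with mtree, no output means everything verified; since the failures here were intermittent, it may take several runs to see one.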

> Do you know what kernel and Xen versions were in SP2/3, and what 
> specifically was fixed in the kernel they gave you?
SLES 11 SP2 and SP3 seem to be based on the same 3.0.x kernel tree (SP1, which was unaffected, was 2.6.32.x).  While SP2 was still supported (it has since dropped out of support), the SP2 and SP3 kernel versions tended to track closely but not exactly.  Xen in SP3 is 4.2.4; SP2 was 4.1.x.  By fortuitous timing, the official kernel update for SLES 11 SP3 was released yesterday; the version with the fix is 3.0.101-0.35.1.  The relevant changelog entry is this:
====
* Thu Jun 05 2014 jbeulich at suse.com
- swiotlb: don't assume PA 0 is invalid (bnc#865882).
====
Unfortunately that bug is private, even to me, but the git tree is public:
http://kernel.opensuse.org/cgit/kernel-source/commit/?id=0a9fc1a8654e9f62d7a8173fef83c6949ed67e35
http://kernel.opensuse.org/cgit/kernel-source/commit/?h=SLE11-SP3&id=4461f4df6e363235e2ef3b61c41617f7c22dc510

The master (aka opensuse-factory) branch is on 3.16 (it was on 3.15 at the time of this commit), while SLE11-SP3 remains on 3.0.x with backported fixes.  This may not be the bug you're hitting, but if you can find a reproducible test case, that's half the battle.

-Andrew




