[Pacemaker] CIB write-to-disk bug?

Alan Robertson alanr at unix.sh
Thu Apr 1 10:27:02 EDT 2010


Lars Ellenberg wrote:
> On Thu, Apr 01, 2010 at 12:12:47AM -0600, Alan Robertson wrote:
>> OK....
>>
>> Since there was no ssh-as-root between the cluster nodes, I didn't
>> send all the logs along from every node in the cluster - and it
>> didn't occur to me to look at all of them.
>>
>> However, the problem has gotten curiouser and curiouser - because ALL
>> the nodes in the cluster reported the same problem at the same
>> time...
>>
>> That makes it a lot less likely to be a race condition with the disk
>> writing infrastructure...
>>
>> I've attached the relevant lines from the various machines -
>> slightly processed (date stamp format changed and a few other minor
>> things).
>>
>> Let me know if you want me to send all the system logs along...
> 
> There should be core files.
> You should be able to get some interesting information out of them,
> especially "the_cib" and "digest" at the point of abort().
> 
>>>
>>> Also, for my reference - what method are you using to compute the
>>> digest of the file?  That is, what command should I execute to get
>>> the same results?
> 
> It's an md5sum over the xml tree -- not over the formatted ascii buffer,
> though, so "md5sum cib.xml" won't do.
> I think it is the same as
>  echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' cib.whatever)" | md5sum
> But there is "cibadmin --md5-sum -x cib.xml",
> to use the exact same code path.

This is a change from how it used to be (the last time I looked - at 
least according to my not-always-reliable memory).  Thanks for the update.
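
For reference, the one-liner Lars quotes can be wrapped in a small shell function. This is only a sketch: approx_cib_digest is a made-up helper name, and since Lars himself only says "I think it is the same", the result may not match what the cib daemon computes. The cibadmin invocation he mentions remains the way to exercise the real code path.

```shell
# Sketch only: approx_cib_digest is a hypothetical helper, not a Pacemaker tool.
# It wraps the perl/md5sum one-liner quoted above without changing it.
approx_cib_digest() {
    # Strip leading and trailing whitespace (including the newline) from
    # every line, prepend a single space, and hash the result.
    echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' "$1")" | md5sum | cut -d' ' -f1
}
```

Because indentation and line breaks are dropped before hashing, two copies of a file that differ only in formatting give the same value here, while a plain "md5sum cib.xml" on each would differ.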


>> 2010/03/31_19:02:52	vhost0384	[13294]: ERROR: crm_abort:
>> write_cib_contents: Triggered fatal assert at io.c:624 :
>> retrieveCib(tmp1, tmp2, FALSE) != NULL
> 
> So it did not verify right after it was written.
> Can you reproduce?

I have no idea.  I didn't do anything much.  Hopefully the test suite 
does a lot more strenuous things...

> The core files may actually contain some hints,
> so have a look there.

None of them verified.  All the nodes in the cluster failed the test at 
the same time - and now I have no official CIBs on disk - on any cluster 
nodes...  I sent Andrew all the CIBs, and all the core files, and 
basically everything under /var/lib/heartbeat/ from one machine. 
They're from the latest official release - so the binaries that match 
them are readily available.

	Thanks Lars!


-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce



