[Pacemaker] CIB write-to-disk bug?

Fri Apr 2 14:16:32 UTC 2010

Lars Ellenberg wrote:
> On Thu, Apr 01, 2010 at 08:27:02AM -0600, Alan Robertson wrote:
>> Lars Ellenberg wrote:
>>> On Thu, Apr 01, 2010 at 12:12:47AM -0600, Alan Robertson wrote:
>>>> OK....
>>>>
>>>> Since there was no ssh-as-root between the cluster nodes, I didn't
>>>> send all the logs along from every node in the cluster - and it
>>>> didn't occur to me to look at all of them.
>>>>
>>>> However, the problem has gotten curiouser and curiouser - because ALL
>>>> the nodes in the cluster reported the same problem at the same
>>>> time...
>>>>
>>>> That makes it a lot less likely to be a race condition with the disk
>>>> writing infrastructure...
>>>>
>>>> I've attached the relevant lines from the various machines -
>>>> slightly processed (date stamp format changed and a few other minor
>>>> things).
>>>>
>>>> Let me know if you want me to send all the system logs along...
>>> There should be core files.
>>> You should be able to get some interesting information out of them,
>>> especially "the_cib" and "digest" at the point of abort().
>>>
>>>>> Also, for my reference - what method are you using to compute the
>>>>> digest of the file?  That is, what command should I execute to get
>>>>> the same results?
>>> It's an md5sum over the xml tree -- not over the formatted ASCII buffer,
>>> though, so "md5sum cib.xml" won't do.
>>> I think it is the same as
>>> echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' cib.whatever)" | md5sum
>>> But there is "cibadmin --md5-sum -x cib.xml",
>>> to use the exact same code path.
>> This is a change from how it used to be (the last time I looked - at
>> least according to my not-always-reliable memory).  Thanks for the
>> update.
>>
>>
>>>> 2010/03/31_19:02:52	vhost0384	[13294]: ERROR: crm_abort:
>>>> write_cib_contents: Triggered fatal assert at io.c:624 :
>>>> retrieveCib(tmp1, tmp2, FALSE) != NULL
>>> So it did not verify right after it was written.
>>> Can you reproduce?
>> I have no idea.  I didn't do anything much.  Hopefully the test
>> suite does a lot more strenuous things...
>>
>>> The core files may actually contain some hints,
>>> so have a look there.
>> None of them verified.  All the nodes in the cluster failed the test
>> at the same time - and now I have no official CIBs on disk - on any
>> cluster nodes...  I sent Andrew all the CIBs, and all the core
> 
> Well, Andrew is on vacation right now... you will have noticed.
> 
>> files, and basically everything under /var/lib/heartbeat/ from one
>> machine. They're from the latest official release - so the binaries
>> that match them are readily available.
> 
> The strange thing is that your "corrupt" cib.uHFtAW
> contains a <status/> section.  It should not.
> No other cib*.raw or cib.xml does.
> 
> Because <status/> is explicitly filtered out in write_cib_contents:
>  free_xml_from_parent(the_cib, cib_status_root);
> before
>  write_xml_file(the_cib, tmp1, FALSE),
> so that should never have made it in there.
> 
> Something is very wrong somewhere...
> 
> Did you manage to get two status sections in there, somehow?
> Did you try anything funky with the cib as the last action before this failed?

Not that I recall...
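
For anyone who wants to check a saved CIB file for the stray or duplicated
<status/> section discussed above, here is a minimal sketch.  It is not part
of Pacemaker; the function name count_status_sections and the sample document
are made up for illustration, and it only counts elements named "status"
anywhere in the tree.

```python
import xml.etree.ElementTree as ET


def count_status_sections(xml_text: str) -> int:
    """Count elements named 'status' anywhere in the parsed XML tree.

    Hypothetical helper for illustration; not a Pacemaker API.
    """
    root = ET.fromstring(xml_text)
    # Element.iter() walks the root and all of its descendants.
    return sum(1 for _ in root.iter("status"))


if __name__ == "__main__":
    # Illustrative sample only, not taken from a real cluster.
    sample = "<cib><configuration/><status/></cib>"
    print(count_status_sections(sample))
```

By the description of write_cib_contents above, an on-disk CIB should give 0
here; anything else would match what was seen in cib.uHFtAW.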

> Do it again, with higher log level.  Sorry, no time right now to rebuild
> your exact thing with your exact gcc and stuff to look at your core file.

You can just download the RPM and extract the objects.  That's what I used.
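
On the digest question quoted earlier: the sketch below mirrors the perl
one-liner from that message in Python, to show why a plain "md5sum cib.xml"
gives a different answer once indentation is involved.  The function names
are made up, and whether the result matches what Pacemaker computes is
unverified (the one-liner itself was only offered as "I think it is the
same"); "cibadmin --md5-sum -x cib.xml" remains the way to use the exact
code path.

```python
import hashlib


def raw_md5(text: str) -> str:
    """MD5 of the file contents as-is, like running md5sum on the file."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def stripped_md5(text: str) -> str:
    """Approximation of the quoted perl one-liner (an assumption, not
    Pacemaker's code): strip leading and trailing whitespace from every
    line, join the lines, prepend one space and append one newline, then
    take the MD5 of the result."""
    joined = "".join(line.strip() for line in text.split("\n"))
    return hashlib.md5((" " + joined + "\n").encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Two illustrative serializations of the same tree.
    pretty = "<cib>\n  <configuration/>\n</cib>\n"
    compact = "<cib><configuration/></cib>"
    print(raw_md5(pretty) == raw_md5(compact))
    print(stripped_md5(pretty) == stripped_md5(compact))
```

The two serializations differ byte for byte, so the raw checksums differ,
while the whitespace-stripped form is identical for both.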

-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce