[Pacemaker] CIB write-to-disk bug?

Thu Apr 1 02:12:47 EDT 2010

OK....

Since there was no ssh-as-root between the cluster nodes, I didn't send 
all the logs along from every node in the cluster - and it didn't occur 
to me to look at all of them.

However, the problem has gotten curiouser and curiouser - because ALL the 
nodes in the cluster reported the same problem at the same time...

That makes it a lot less likely to be a race condition with the disk 
writing infrastructure...

I've attached the relevant lines from the various machines - slightly 
processed (date stamp format changed and a few other minor things).

Let me know if you want me to send all the system logs along...

Alan Robertson wrote:
> Hi,
> 
> I've run into what looks at first blush to be a CIB bug in writing to disk.
> 
> The key messages from this incident are these:
> 
> 
> Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: validate_cib_digest: 
> Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf 
> (/var/lib/heartbeat/crm/cib.GUdD9T), calculated 
> 0bac3440f5c42f0f37d22ea7dfe433e8
> Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: retrieveCib: Checksum of 
> /var/lib/heartbeat/crm/cib.uHFtAW failed!  Configuration contents ignored!
> Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: retrieveCib: Usually this 
> is caused by manual changes, please refer to 
> http://clusterlabs.org/wiki/FAQ#cib_changes_detected
> Mar 31 19:02:52 vhost0384 cib: [13294]: WARN: retrieveCib: Continuing 
> but /var/lib/heartbeat/crm/cib.uHFtAW will NOT used.
> 
> 
> I did not make manual changes on a running CIB. I was using the cluster 
> shell at the time.   The CIB it is complaining about appears to be an 
> intact, valid CIB with contents approximately like they should have been 
> at the time.  By the way, I have a report from another IBMer that they 
> have seen systems that stop writing to their local CIBs.  I'll contact him.
> 
> Here are some relevant facts:
>   These machines are virtual guests in a cloud somewhere - operations
>     have somewhat unpredictable latency.  But, nothing too egregious
>     was happening at the time or Heartbeat would have bitched.
>   I was doing some testing at the time.  I was putting on and
>     taking off constraints using the cluster shell
>     migrate and unmigrate operations.
> 
> Given that the file looks intact, and I know how the CIB is written to 
> disk (since I originally wrote that code), I wonder if it isn't a 
> versioning issue / race condition.  That is, the code for writing to 
> disk does NOT guarantee when it gets done (assuming you're still using 
> it).  It would be easy to checksum a different version from the one you 
> expected (or to checksum the file before the write had completed).
> 
> Andrew:  You should have already received all the relevant logs in a 
> separate email.
> 
> Also, for my reference - what method are you using to compute the digest 
> of the file?  That is, what command should I execute to get the same 
> results?
> 
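
For anyone who wants to check the digest question by hand, here is a 
minimal sketch. The digests in the log are 32 hex characters, which is 
the length of an MD5 digest, so the sketch ASSUMES MD5 computed over the 
raw bytes of the file. That is only an assumption: if the cib hashes a 
serialized form of the XML rather than the file bytes, the values will 
not match. The function names (md5_hex, write_atomically, verify) are 
made up for illustration and are not part of Pacemaker. The second 
function shows one common way to avoid the "checksum before the write 
completed" race described above: write to a temporary file, fsync, then 
rename into place.

```python
import hashlib
import os
import tempfile


def md5_hex(path):
    """Return the MD5 hex digest of the raw bytes stored in `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_atomically(path, data):
    """Write `data` (bytes) to `path` via temp file + fsync + rename.

    Returns the MD5 hex digest of exactly the bytes that were written,
    so the digest and the file on disk refer to the same version.
    Hypothetical helper, not the Pacemaker implementation.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="cib.")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # force the data to disk first
        os.replace(tmp_path, path)     # readers see old or new, never partial
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return hashlib.md5(data).hexdigest()


def verify(path, expected_hex):
    """True if the file's MD5 matches the expected hex digest."""
    return md5_hex(path) == expected_hex.strip().lower()
```

With this, verify("/var/lib/heartbeat/crm/cib.xml", expected) would tell 
you whether the bytes on disk match a recorded digest, under the 
assumptions above (the cib.xml file name is also an assumption here).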

-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: log.excerpt
URL: <http://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100401/b63bbe66/attachment-0001.ksh>