[ClusterLabs] No Cluster fun (split brain)

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue May 19 11:50:24 EDT 2015


I just wanted to tell you that two nodes in our three-node cluster (SLES11 SP3) went mad when the third node was cleanly rebooted (i.e. after rcopenais stop). Going mad means both nodes built up a "retransmit list" and each decided to be DC for the cluster. When the third node came back, the communication problems went away, but the cluster was unable to continue running resources. It needed a reboot of one of the mad nodes and later a cleanup of resources for the other mad node.

While the nodes complained that they couldn't talk to each other, I was watching their syslogs via tail -f over SSH, using the same NIC the cluster uses to communicate. So if there was a communication error, it was in the cluster's brain. I was already running the latest patches for SLES11, because we had the same problem when we did the same thing before, and support said that everything looked OK and that such a thing shouldn't happen.

Now if the corosync guys would add comments to their C code, or at least respond to inquiries on their advertised mailing list...

I have the feeling that using DLM or O2CB increases the "ring faulty" messages drastically, without a reason obvious to me (other than programming errors). Just a side-note...

Yes, put your heads in the sand, hoping the problems will go away by themselves...

