[ClusterLabs] Too quick node reboot leads to failed corosync assert on other node(s)

Jan Friesse jfriesse at redhat.com
Fri Feb 19 08:18:22 UTC 2016


Michal Koutný wrote:
> On 02/18/2016 10:40 AM, Christine Caulfield wrote:
>> I definitely remember looking into this, or something very like it, ages
>> ago. I can't find anything in the commit logs for either corosync or
>> cman that looks relevant though. If you're seeing it on recent builds
>> then it's obviously still a problem anyway and we ought to look into it!
> Thanks for your replies.
>
> So far this happened only once and we've done only "post mortem", alas
> no available reproducer. If I have time, I'll try to reproduce it

Ok. Actually I tried to reproduce this and was not successful
(current master). Steps I used:
- 2 nodes, token set to 30 sec
- execute cpgbench on node2
- pause node1 corosync (ctrl+z), kill node1 corosync (kill -9 %1)
- wait until corosync on node2 moves into "entering GATHER state from..."
- execute corosync on node1
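
In case anyone wants to script this, the steps above could be sketched roughly as below. This is a hypothetical sketch only: the hostnames, the log path, and passwordless ssh are assumptions, the token value (30 s) must already be set in corosync.conf on both nodes, and it should only ever be run against a throwaway test cluster.

```shell
#!/bin/sh
# Sketch of the manual reproduction steps, wrapped in a function.
reproduce() {
    node1="$1"
    node2="$2"
    if [ -z "$node1" ] || [ -z "$node2" ]; then
        echo "usage: reproduce <node1> <node2>" >&2
        return 1
    fi

    # Generate CPG traffic on node2 (cpgbench ships with the corosync tests).
    ssh "$node2" cpgbench &

    # Pause corosync on node1 (SIGSTOP, like ctrl+z), then kill it hard.
    ssh "$node1" 'pkill -STOP corosync; sleep 1; pkill -KILL corosync'

    # Wait until node2 hits the token timeout and begins forming a new
    # membership ("entering GATHER state from ..." in its log; log path
    # is an assumption, adjust for your logging setup).
    ssh "$node2" 'until grep -q "entering GATHER state from" \
        /var/log/cluster/corosync.log; do sleep 1; done'

    # Restart corosync on node1 while node2 is still in recovery.
    ssh "$node1" corosync
}
```
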

Basically, during recovery the new node's trans list was never sent 
(and/or was ignored by node2).

I'm going to try testing v1.4.7, but it's also possible that the bug is 
fixed by other commits (my favorites are cfbb021e130337603fe5b545d1e377296ecb92ea,
4ee84c51fa73c4ec7cbee922111a140a3aaf75df, 
f135b680967aaef1d466f40170c75ae3e470e147).

Regards,
   Honza

> locally and check whether it exists in the current version.
>
> Michal
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
