[ClusterLabs] corosync - CS_ERR_BAD_HANDLE when multiple nodes are starting up

Tue Oct 6 10:09:57 CEST 2015

Thomas,

Thomas Lamprecht wrote:
> Hi,
>
> thanks for the response!
> I added some information and clarification below.
>
> On 10/01/2015 09:23 AM, Jan Friesse wrote:
>> Hi,
>>
>> Thomas Lamprecht wrote:
>>> Hello,
>>>
>>> we are using corosync version needle (2.3.5) for our cluster filesystem
>>> (pmxcfs).
>>> The situation is the following. First we start up pmxcfs, which is
>>> a FUSE filesystem. If there is a cluster configuration, we also start
>>> corosync.
>>> This allows the filesystem to exist on single-node 'clusters' or to be
>>> forced into a local mode. We use CPG to send our messages to all members;
>>> the filesystem lives in RAM and all fs operations are sent 'over the
>>> wire'.
>>>
>>> The problem is now the following:
>>> When we restart all nodes (3 in my test case) at the same time, in
>>> about 1 out of 10 cases I get only CS_ERR_BAD_HANDLE back when calling
>>
>> I'm really unsure how to understand what you are doing. You are
>> restarting all nodes and get CS_ERR_BAD_HANDLE? I mean, if you are
>> restarting all nodes, which node returns CS_ERR_BAD_HANDLE? Or are you
>> restarting just pmxcfs? Or just corosync?
> To clarify, sorry, I was a bit unspecific. I can see the faulty behaviour
> in two cases:
> 1) I restart three physical hosts (= nodes) at the same time. One of
> them, normally the last one to come up again, successfully joins the
> corosync cluster and the filesystem (pmxcfs) notices that, but then
> cpg_mcast_joined returns only CS_ERR_BAD_HANDLE errors.

OK, that is weird. Are you able to reproduce the same behavior by just
restarting pmxcfs? Or is a membership change (= restart of a node) really
needed? Also, are you sure the network interface is up when corosync starts?

The corosync.log of the failing node may be interesting.

>
> 2) I disconnect the network interface on which corosync runs, and
> reconnect it a bit later. This triggers the same error as above, but again
> not every time.

Just to make sure: don't do ifdown. Corosync reacts to ifdown pretty
badly. Also, NetworkManager does an ifdown on cable unplug if it is not
configured in server mode. If you want to test a network split, either use
iptables (make sure to block all traffic needed by corosync, so if you are
using multicast, make sure to block both unicast and multicast packets on
input and output, see
https://github.com/jfriesse/csts/blob/master/tests/inc/fw.sh), or use
blocking on the switch.

>
> Currently I'm trying to get a somewhat reproducible test, and to try it
> on bigger setups and with other possible causes as well. I need to do a
> bit more homework here and will report back later.

Actually, smaller clusters are better for debugging, but yes, a larger
setup may show the problem faster.

>>
>>> cpg_mcast_joined to send out the data, but only on one node.
>>> corosync-quorumtool shows that we have quorum, and the logs also
>>> show a healthy connection to the corosync cluster. The failing handle is
>>> initialized once at the initialization of our filesystem. Should it be
>>> reinitialized on every reconnect?
>>
>> Again, I'm unsure what you mean by reconnect. On Corosync shutdown you
>> have to reconnect (I believe this is not the case because you are
>> getting the error only with 10% probability).
> Yes, we reconnect to Corosync, and it's not a corosync shutdown: the
> whole host reboots, or the network interface goes down and then comes up
> again a bit later. The probability is just an estimate, but the main
> problem is that I cannot reproduce it reliably.
>>
>>> Restarting the filesystem solves this problem. The strange thing is that
>>> it isn't clearly reproducible and often works just fine.
>>>
>>> Are there some known problems or steps we should look for?
>>
>> Hard to tell but generally:
>> - Make sure cpg_init really returns CS_OK. If not, the returned handle is
>> invalid.
>> - Make sure there is no memory corruption and the handle is really valid
>> (valgrind may be helpful).
> cpg_init checks are in place and should be OK.
> Yes, I will use Valgrind, but one question ahead:
>
> Can the handle get lost somehow? Is there a need to reinitialize the cpg
> with cpg_initialize/cpg_model_initialize after we left and later
> rejoined the cluster?

I'm still unsure what you mean by "after we left and later rejoined". As
long as corosync is running, the client application doesn't need to care
about membership changes. That's corosync's problem. So if a network split
happens, you don't have to call cpg_initialize. The only places where
cpg_initialize is needed are the initial connection and the reconnection
after the corosync main process exits.
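
To illustrate the rule, here is a rough sketch of a send path that only
reconnects when the connection to corosync itself looks lost. It is not
taken from pmxcfs or corosync. The sk_* names and the ops struct are
hypothetical, and the SK_* codes are stand-ins for CS_OK, CS_ERR_TRY_AGAIN,
CS_ERR_LIBRARY and CS_ERR_BAD_HANDLE so that the logic can be compiled
without the corosync headers. Treating CS_ERR_LIBRARY and CS_ERR_BAD_HANDLE
as "connection lost" is an assumption of this sketch.

```c
#include <stddef.h>

/* Stand-ins for CS_OK, CS_ERR_TRY_AGAIN, CS_ERR_LIBRARY and
 * CS_ERR_BAD_HANDLE. Real code would include <corosync/cpg.h> and use
 * the cs_error_t constants directly. */
typedef enum { SK_OK, SK_TRY_AGAIN, SK_LIBRARY, SK_BAD_HANDLE, SK_OTHER } sk_err_t;

typedef enum { ACT_DONE, ACT_RETRY, ACT_REINIT, ACT_FAIL } sk_action_t;

/* Map the result of a send to the action the caller should take.
 * Assumption: LIBRARY and BAD_HANDLE mean the connection is gone. */
sk_action_t sk_classify(sk_err_t err)
{
    switch (err) {
    case SK_OK:         return ACT_DONE;
    case SK_TRY_AGAIN:  return ACT_RETRY;   /* transient, same handle */
    case SK_LIBRARY:
    case SK_BAD_HANDLE: return ACT_REINIT;  /* connection to corosync lost */
    default:            return ACT_FAIL;
    }
}

/* Hypothetical wrappers around the real API, kept as function pointers
 * so that the policy can be exercised without a running corosync. */
struct sk_ops {
    sk_err_t (*connect)(void *ctx);     /* cpg_initialize + cpg_join */
    void     (*disconnect)(void *ctx);  /* cpg_finalize */
    sk_err_t (*send)(void *ctx, const void *buf, size_t len); /* cpg_mcast_joined */
};

/* Send one message, making at most max_attempts send calls. Membership
 * changes never reach this code; only a lost connection triggers a
 * reconnect. Real code would sleep briefly before each retry. */
sk_err_t sk_send(const struct sk_ops *ops, void *ctx,
                 const void *buf, size_t len, int max_attempts)
{
    sk_err_t err = SK_OTHER;

    for (int i = 0; i < max_attempts; i++) {
        err = ops->send(ctx, buf, len);
        switch (sk_classify(err)) {
        case ACT_DONE:
            return SK_OK;
        case ACT_RETRY:
            continue;                   /* next loop iteration */
        case ACT_REINIT:
            ops->disconnect(ctx);       /* drop the stale handle */
            err = ops->connect(ctx);    /* get a fresh one and rejoin */
            if (err != SK_OK)
                return err;
            continue;
        case ACT_FAIL:
            return err;
        }
    }
    return err;
}
```

If the reconnect branch fires while corosync was running the whole time,
that is worth debugging on its own rather than hiding behind the retry.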

Regards,
   Honza

>>
>> Regards,
>>   Honza
>>
>>>
>>>
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org
>>> http://clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org