[ClusterLabs] corosync - CS_ERR_BAD_HANDLE when multiple nodes are starting up

Thomas Lamprecht t.lamprecht at proxmox.com
Fri Oct 23 10:26:06 UTC 2015


ping

On 10/14/2015 02:10 PM, Thomas Lamprecht wrote:
> Hi,
>
> On 10/08/2015 10:57 AM, Jan Friesse wrote:
>> Hi,
>>
>> Thomas Lamprecht wrote:
>>> [snip]
>>>>>>> Hello,
>>>>>>>
>>>>>>> we are using corosync version needle (2.3.5) for our cluster
>>>>>>> filesystem (pmxcfs).
>>>>>>> The situation is the following. First we start up pmxcfs, which is
>>>>>>> a FUSE filesystem. If there is a cluster configuration, we also
>>>>>>> start corosync.
>>>>>>> This allows the filesystem to exist on one-node 'clusters' or to be
>>>>>>> forced into a local mode. We use CPG to send our messages to all
>>>>>>> members; the filesystem lives in RAM and all fs operations are sent
>>>>>>> 'over the wire'.
>>>>>>>
>>>>>>> The problem is now the following:
>>>>>>> When we're restarting all (in my test case 3) nodes at the same
>>>>>>> time, in about 1 out of 10 cases I get only CS_ERR_BAD_HANDLE back
>>>>>>> when calling
>>>>>>
>>>>>> I'm really unsure how to understand what you are doing. You are
>>>>>> restarting all nodes and get CS_ERR_BAD_HANDLE? I mean, if you are
>>>>>> restarting all nodes, which node returns CS_ERR_BAD_HANDLE? Or are
>>>>>> you restarting just pmxcfs? Or just corosync?
>>>>> Clarification, sorry, I was a bit unspecific. I can see the error
>>>>> behaviour in two cases:
>>>>> 1) I restart three physical hosts (= nodes) at the same time; one of
>>>>> them - normally the last one coming up again - joins the corosync
>>>>> cluster successfully, the filesystem (pmxcfs) notices that, but then
>>>>> cpg_mcast_joined returns only CS_ERR_BAD_HANDLE errors.
>>>>
>>>> Ok, that is weird. Are you able to reproduce the same behavior by
>>>> restarting pmxcfs? Or is a membership change (= restart of a node)
>>>> really needed?
>>>> Also, are you sure the network interface is up when corosync starts?
>>> No, I tried quite a few times to restart pmxcfs, but that didn't
>>> trigger the problem yet. But I could trigger it once by restarting
>>> only one node, so restarting all of them only makes the problem more
>>> likely but isn't needed in the first place.
>>
>> Ok. So let's assume a change of membership comes into play.
>>
>> Do you think you can try to test (in a cycle):
>> - start corosync
>> - start pmxcfs
>> - stop pmxcfs
>> - stop corosync
>>
>> on one node? Because if the problem appears, we will at least have a
>> reproducer.
>>
> Hmm, yeah, I can do that, but you have to know that we start pmxcfs
> _before_ corosync, as we want to access the data when quorum is lost
> or when it's only a one-node cluster and thus corosync is not running.
>
> So the cycle to replicate this problem would be:
> - start pmxcfs
> - start corosync
> - stop corosync
> - stop pmxcfs
>
> if I'm not mistaken.
>
>>>
>>>>
>>>> The corosync.log of the failing node may be interesting.
>>>>
>>> My nodes' hostnames are [one, two, three]; this time they came up in
>>> the order they're named.
>>> This time it hit two nodes, the first and second ones to come up
>>> again.
>>> The corosync log seems normal, although I didn't have debug mode
>>> enabled; I don't know what difference that makes when no errors show
>>> up in the normal log.
>>>> Oct 07 09:06:36 [1335] two corosync notice [MAIN  ] Corosync Cluster
>>>> Engine ('2.3.5'): started and ready to provide service.
>>>> Oct 07 09:06:36 [1335] two corosync info    [MAIN  ] Corosync built-in
>>>> features: augeas systemd pie relro bindnow
>>>> Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] Initializing
>>>> transport (UDP/IP Multicast).
>>>> Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] Initializing
>>>> transmit/receive security (NSS) crypto: aes256 hash: sha1
>>>> Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] The network
>>>> interface [10.10.1.152] is now up.
>>>> Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
>>>> loaded: corosync configuration map access [0]
>>>> Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: cmap
>>>> Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
>>>> loaded: corosync configuration service [1]
>>>> Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: cfg
>>>> Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
>>>> loaded: corosync cluster closed process group service v1.01 [2]
>>>> Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: cpg
>>>> Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
>>>> loaded: corosync profile loading service [4]
>>>> Oct 07 09:06:36 [1335] two corosync notice  [QUORUM] Using quorum
>>>> provider corosync_votequorum
>>>> Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
>>>> loaded: corosync vote quorum service v1.0 [5]
>>>> Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name:
>>>> votequorum
>>>> Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
>>>> loaded: corosync cluster quorum service v0.1 [3]
>>>> Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: 
>>>> quorum
>>>> Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] A new membership
>>>> (10.10.1.152:92) was formed. Members joined: 2
>>>> Oct 07 09:06:36 [1335] two corosync notice  [QUORUM] Members[1]: 2
>>>> Oct 07 09:06:36 [1335] two corosync notice  [MAIN  ] Completed service
>>>> synchronization, ready to provide service.
>>
>> Looks good
>>
>>> Then pmxcfs results in:
>>>> Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message
>>>> failed: 9
>>>> Oct 07 09:06:38 two pmxcfs[952]: [status] notice: Bad handle 0
>>>> Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message
>>>> failed: 9
>>>> Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message
>>>> failed: 9
>>>> Oct 07 09:06:38 two pmxcfs[952]: [status] notice: Bad handle 0
>>>> Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message
>>>> failed: 9
>>
>> Just to be sure, the "Bad handle 0" is logged by pmxcfs?
>
> Yes, exactly. This is from additional debug code I inserted; it only
> outputs the value of the handle.
>
>>
>>>
>>> After that the third node joins, but the CS_ERR_BAD_HANDLE errors
>>> stay.
>>>
>>>> Oct 07 09:06:41 [1335] two corosync notice [TOTEM ] A new membership
>>>> (10.10.1.151:100) was formed. Members joined: 1
>>>> Oct 07 09:06:41 [1335] two corosync notice  [QUORUM] This node is
>>>> within the primary component and will provide service.
>>>> Oct 07 09:06:41 [1335] two corosync notice  [QUORUM] Members[2]: 1 2
>>>> Oct 07 09:06:41 [1335] two corosync notice  [MAIN  ] Completed service
>>>> synchronization, ready to provide service.
>>>> Oct 07 09:06:52 [1335] two corosync notice  [TOTEM ] A new membership
>>>> (10.10.1.151:104) was formed. Members joined: 3
>>>> Oct 07 09:06:52 [1335] two corosync notice  [QUORUM] Members[3]: 1 2 3
>>>> Oct 07 09:06:52 [1335] two corosync notice  [MAIN  ] Completed service
>>>> synchronization, ready to provide service.
>>>
>>
>> Again, looks good.
>>
>>> Funnily enough, the third node also has some initial problems, but is
>>> able to recover from them:
>>>> Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message
>>>> failed: 9
>>>> Oct 07 09:06:53 three pmxcfs[943]: [status] notice: Bad handle 0
>>>> Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message
>>>> failed: 9
>>>> Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message
>>>> failed: 9
>>>> Oct 07 09:06:53 three pmxcfs[943]: [status] notice: Bad handle 0
>>>> Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message
>>>> failed: 9
>>>> [...]
>>>> Oct 07 09:06:56 three pmxcfs[943]: [status] notice: update cluster
>>>> info (cluster name  teachings, version = 15)
>>>> Oct 07 09:06:56 three pmxcfs[943]: [status] notice: node has quorum
>>>> Oct 07 09:06:56 three pmxcfs[943]: [dcdb] notice: members: 3/943
>>>> Oct 07 09:06:56 three pmxcfs[943]: [dcdb] notice: all data is up to 
>>>> date
>>>> Oct 07 09:06:56 three pmxcfs[943]: [status] notice: members: 3/943
>>>> Oct 07 09:06:56 three pmxcfs[943]: [status] notice: all data is up to
>>>> date
>>>
>>
>> Even weirder. I mean ... CS_ERR_BAD_HANDLE is not the kind of error
>> which fixes itself automatically.
>
> Yes, weird is a good word for that.
>
>>
>>>
>>>
>>>>
>>>>>
>>>>> 2) I disconnect the network interface on which corosync runs, and
>>>>> reconnect it a bit later. This triggers the same behaviour as above,
>>>>> but again not every time.
>>>>
>>>> Just to make sure: don't do ifdown. Corosync reacts to ifdown pretty
>>>> badly. Also, NetworkManager does ifdown on cable unplug if it is not
>>>> configured in server mode. If you want to test a network split,
>>>> either use iptables (make sure to block all traffic needed by
>>>> corosync, so if you are using multicast make sure to block both
>>>> unicast and multicast packets on input and output -
>>>> https://github.com/jfriesse/csts/blob/master/tests/inc/fw.sh), or
>>>> block the traffic on the switch.
>>>>
>>> No, I'm aware of the ifdown problems and we do not use NetworkManager
>>> in Proxmox VE. I often used a 'virtual' cluster, i.e. three virtual
>>> machines (nested virtualization) in KVM/qemu, and then 'disconnected'
>>> the network interface through qemu, which should be equivalent to a
>>> physical unplug.
>>
>> Sounds good.
>>
>>> Corosync also recovers when 'connecting' the interface again through 
>>> qemu.
>>> Thanks for the script hint, will do some more physical tests involving
>>> the network.
>>>>>
>>>>> Currently I'm trying to get a somewhat reproducible test and to try
>>>>> it also on bigger setups and other possible causes; I need to do a
>>>>> bit more homework here and will report back later.
>>>>
>>>> Actually, smaller clusters are better for debugging, but yes, a
>>>> larger setup may show the problem faster.
>>> Ok, good to know.
>>>>
>>>>>>
>>>>>>> cpg_mcast_joined to send out the data, but only on one node.
>>>>>>> corosync-quorumtool shows that we have quorum, and the logs also
>>>>>>> show a healthy connection to the corosync cluster. The failing
>>>>>>> handle is initialized once at the initialization of our
>>>>>>> filesystem. Should it be reinitialized on every reconnect?
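>>>>>>>
>>>>>>> For reference, our send path boils down to something like this
>>>>>>> (heavily simplified, not the literal pmxcfs code):
>>>>>>>
>>>>>>>   #include <sys/uio.h>
>>>>>>>   #include <corosync/cpg.h>
>>>>>>>
>>>>>>>   /* filled by cpg_initialize() once at filesystem startup */
>>>>>>>   static cpg_handle_t cpg_handle;
>>>>>>>
>>>>>>>   static cs_error_t send_message(const void *data, size_t len)
>>>>>>>   {
>>>>>>>           struct iovec iov;
>>>>>>>           cs_error_t res;
>>>>>>>
>>>>>>>           iov.iov_base = (void *)data;
>>>>>>>           iov.iov_len = len;
>>>>>>>
>>>>>>>           res = cpg_mcast_joined(cpg_handle, CPG_TYPE_AGREED, &iov, 1);
>>>>>>>           if (res != CS_OK && res != CS_ERR_TRY_AGAIN) {
>>>>>>>                   /* this is where "cpg_send_message failed: 9"
>>>>>>>                    * (9 == CS_ERR_BAD_HANDLE) gets logged */
>>>>>>>           }
>>>>>>>           return res;
>>>>>>>   }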
>>>>>>
>>>>>> Again, I'm unsure what you mean by reconnect. On corosync shutdown
>>>>>> you have to reconnect (I believe this is not the case, because you
>>>>>> are getting the error only with 10% probability).
>>>>> Yes, we reconnect to corosync, and it's not a corosync shutdown; the
>>>>> whole host reboots, or the network interface goes down and comes up
>>>>> again a bit later. The probability is just an estimate, but the main
>>>>> problem is that I cannot reproduce it every time.
>>>>>>
>>>>>>> Restarting the filesystem solves this problem; the strange thing
>>>>>>> is that it isn't clearly reproducible and often everything works
>>>>>>> just fine.
>>>>>>>
>>>>>>> Are there any known problems or steps we should look out for?
>>>>>>
>>>>>> Hard to tell, but generally:
>>>>>> - Make sure cpg_init really returns CS_OK. If not, the returned
>>>>>> handle is invalid.
>>>>>> - Make sure there is no memory corruption and the handle is really
>>>>>> valid (valgrind may be helpful).
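>>>>>>
>>>>>> What I mean by the first point is roughly this pattern (just a
>>>>>> sketch, not your actual pmxcfs code):
>>>>>>
>>>>>>   #include <corosync/corotypes.h>
>>>>>>   #include <corosync/cpg.h>
>>>>>>
>>>>>>   static cpg_handle_t cpg_handle;        /* 0 == "not connected" */
>>>>>>   static cpg_callbacks_t cpg_callbacks;  /* deliver/confchg set elsewhere */
>>>>>>
>>>>>>   static int my_cpg_connect(void)
>>>>>>   {
>>>>>>           cs_error_t res;
>>>>>>
>>>>>>           res = cpg_initialize(&cpg_handle, &cpg_callbacks);
>>>>>>           if (res != CS_OK) {
>>>>>>                   /* handle content is undefined here - never use it */
>>>>>>                   cpg_handle = 0;
>>>>>>                   return -1;
>>>>>>           }
>>>>>>           return 0;
>>>>>>   }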
>>>>> cpg_init checks are in place and should be OK.
>>>>> Yes, I will use valgrind, but one question ahead:
>>>>>
>>>>> Can the handle get lost somehow? Is there a need to reinitialize the
>>>>> CPG with cpg_initialize/cpg_model_initialize after we left and later
>>>>> rejoined the cluster?
>>>>
>>>> I'm still unsure what you mean by "after we left and later rejoined".
>>>> As long as corosync is running, the client application "doesn't need
>>>> to care about" membership changes. That's corosync's problem. So if a
>>>> network split happens, you don't have to call cpg_initialize. The
>>>> only place where cpg_initialize is needed is the initial connection
>>>> and reconnection after the corosync main process exits.
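>>>>
>>>> In the dispatch loop that means you only reconnect when the library
>>>> connection itself is gone, roughly like this (a sketch; I'm writing
>>>> the error handling from memory, so double-check the exact codes):
>>>>
>>>>   #include <corosync/cpg.h>
>>>>
>>>>   static void dispatch_loop(cpg_handle_t handle)
>>>>   {
>>>>           for (;;) {
>>>>                   cs_error_t res = cpg_dispatch(handle, CS_DISPATCH_ONE);
>>>>
>>>>                   if (res == CS_OK || res == CS_ERR_TRY_AGAIN)
>>>>                           continue;
>>>>
>>>>                   /* e.g. CS_ERR_LIBRARY: the corosync main process went
>>>>                    * away - only now finalize and later do
>>>>                    * cpg_initialize + cpg_join again */
>>>>                   cpg_finalize(handle);
>>>>                   return;
>>>>           }
>>>>   }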
>>> What I mean is leaving the CPG group after losing and regaining
>>> quorum (unplugging and replugging the NIC).
>>>
>>> I'm a bit confused by two things:
>>>
>>> -) Restarting pmxcfs, and thus doing a clean new initialization,
>>> solves the problem, but reinitializing isn't intended, as you also
>>> stated. If I'm not mistaken, corosync even has problems when
>>> cpg_initialize is called twice from the same PID, with a finalize in
>>> between.
>>
>> It shouldn't. cpg_join may be a problem, because of the following
>> scenario:
>> - the client calls cpg_finalize
>> - it is delivered to the corosync main process; corosync needs to
>> inform other clients on both the local node and the other cluster
>> nodes
>> - cpg_finalize in the client API returns without error
>> - the client calls cpg_init + cpg_join again
>> - the server hasn't (yet) had time to process the leave message
>>
>> -> an error is returned because there would be two processes in the
>> same group with the same pid.
>
> Ok, is there then a way to ensure that the leave message was processed?
>
> I'll look into that, thanks for the hint; currently I'm checking some
> things with gdb.
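>
> My first idea would be to just retry cpg_join with a short sleep until
> the stale entry is gone, something like this (a rough sketch; the
> CS_ERR_EXIST code for the duplicate-pid case is only a guess on my
> part, and the group name is just an example):
>
>   #include <string.h>
>   #include <unistd.h>
>   #include <corosync/cpg.h>
>
>   static cs_error_t my_cpg_join_retry(cpg_handle_t handle)
>   {
>           struct cpg_name group;
>           cs_error_t res;
>           int retries = 0;
>
>           strcpy(group.value, "example_group_v1");
>           group.length = strlen(group.value);
>
>           do {
>                   res = cpg_join(handle, &group);
>                   if (res == CS_ERR_TRY_AGAIN || res == CS_ERR_EXIST)
>                           usleep(100000); /* give the old leave time to be processed */
>           } while ((res == CS_ERR_TRY_AGAIN || res == CS_ERR_EXIST) &&
>                    ++retries < 100);
>
>           return res;
>   }
>
> Not sure if that is a sane approach, though.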
>
>>
>>
>>>
>>> -) the handle can only be changed by cpg_initialize, as that is the
>>> only place where it is passed by reference, so where does it get set
>>> to zero? I assume it
>>
>> This is a question I'm asking myself too. I have 3 possible
>> explanations:
>
> That's what I suspect but, naturally, not what I hope for :)
>
>> - bug in corosync libraries
>> - bug in libqb
>>
>> Just to be sure: are you using the newest corosync (2.3.5) and the
>> newest libqb (0.17.2)?
>
> Yes.
>
>>> wasn't zero from the beginning, as then cpg_initialize would have
>>> failed?
>>
>> I'm not exactly sure what you meant. cpg_initialize doesn't really
>> care what value was in the handle before it was called; it just
>> overwrites it.
>>
>>>
>>> Is there an intended way to handle a CS_ERR_BAD_HANDLE?
>>>
>>
>> No. CS_ERR_BAD_HANDLE is the kind of error which simply shouldn't
>> happen as long as the handle is not changed in the caller app.
>>
> Ok, another question regarding cpg_initialize. The situation is:
> pmxcfs and corosync are initialized, then I kill corosync; now pmxcfs
> calls cpg_finalize and then tries to call cpg_initialize again. I
> naturally receive an error (CS_ERR_LIBRARY) but get a valid-looking
> handle (e.g. 4578767460630528000). Should I act upon this handle
> (finalize it or the like), or is it merely a random value?
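>
> For reference, my current plan in that case is to simply treat the
> handle as unset whenever cpg_initialize does not return CS_OK, roughly
> (sketch):
>
>   cs_error_t res;
>
>   res = cpg_initialize(&cpg_handle, &cpg_callbacks);
>   if (res != CS_OK) {
>           /* don't trust (or finalize) whatever ended up in cpg_handle */
>           cpg_handle = 0;
>           /* schedule a retry of the whole cpg_initialize later */
>   }
>
> But I'm not sure whether that can leak anything, hence the question.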
>
> Further, I can raise the chance of hitting this bug when I unplug the
> corosync interface: after 60 seconds without quorum the watchdog
> triggers, and when the node comes up again the "cpg_send_message
> failed: 9" error message is thrown constantly. The same behaviour also
> shows up when reconnecting the interface (corosync "recovers" and
> successfully becomes quorate again, while pmxcfs stays in this error
> state until restarted). It's not always reproducible, but it
> definitely happens more often this way.
>>
>> Regards,
>>   Honza
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>





