[ClusterLabs] corosync - CS_ERR_BAD_HANDLE when multiple nodes are starting up

Jan Friesse jfriesse at redhat.com
Thu Oct 8 04:57:59 EDT 2015


Hi,

Thomas Lamprecht wrote:
> Hi,
> again thanks for the response!
>> Thomas,
>>
>> Thomas Lamprecht wrote:
>>> Hi,
>>>
>>> thanks for the response!
>>> I added some information and clarification below.
>>>
>>> On 10/01/2015 09:23 AM, Jan Friesse wrote:
>>>> Hi,
>>>>
>>>> Thomas Lamprecht wrote:
>>>>> Hello,
>>>>>
>>>>> we are using the corosync needle version (2.3.5) for our cluster
>>>>> filesystem (pmxcfs).
>>>>> The situation is the following. First we start up pmxcfs, which is
>>>>> a FUSE filesystem. Then, if there is a cluster configuration, we also
>>>>> start corosync.
>>>>> This allows the filesystem to exist on a one-node 'cluster' or to be
>>>>> forced into a local mode. We use CPG to send our messages to all
>>>>> members; the filesystem lives in RAM and all fs operations are sent
>>>>> 'over the wire'.
>>>>>
>>>>> The problem is now the following:
>>>>> When we restart all (in my test case 3) nodes at the same time, in
>>>>> about 1 out of 10 cases I get only CS_ERR_BAD_HANDLE back when calling
>>>>
>>>> I'm really unsure how to understand what you are doing. You are
>>>> restarting all nodes and get CS_ERR_BAD_HANDLE? I mean, if you are
>>>> restarting all nodes, which node returns CS_ERR_BAD_HANDLE? Or are you
>>>> restarting just pmxcfs? Or just corosync?
>>> Clarification, sorry, I was a bit unspecific. I can see the error
>>> behaviour in two cases:
>>> 1) I restart three physical hosts (= nodes) at the same time; one of
>>> them - normally the last one coming up again - successfully joins the
>>> corosync cluster, the filesystem (pmxcfs) notices that, but then
>>> cpg_mcast_joined receives only CS_ERR_BAD_HANDLE errors.
>>
>> Ok, that is weird. Are you able to reproduce the same behaviour by
>> restarting only pmxcfs? Or is a membership change (= restart of a node)
>> really needed?
>> Also, are you sure the network interface is up when corosync starts?
> No, I tried quite a few times to restart pmxcfs, but that didn't trigger
> the problem yet. I could, however, trigger it once by restarting only one
> node, so restarting all of them only makes the problem worse but isn't
> needed in the first place.

Ok. So let's assume a change of membership comes into play.

Do you think you can try to test (in a cycle):
- start corosync
- start pmxcfs
- stop pmxcfs
- stop corosync

on one node? Because if the problem appears, we will at least have a
reproducer.
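
For reference, a minimal standalone CPG client that walks through the same
connect/join/send/leave/finalize path pmxcfs uses might look roughly like the
sketch below (the group name "testgrp" and the payload are made up for
illustration, and the -lcpg link flag is an assumption about your build
environment):

/* Minimal CPG connect/join/send/leave/finalize cycle (sketch only).
 * Build assumption: gcc -o cpgcycle cpgcycle.c -lcpg
 */
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

int main(void)
{
    cpg_handle_t handle;
    cpg_callbacks_t callbacks = { NULL, NULL }; /* no deliver/confchg callbacks here */
    struct cpg_name group;
    struct iovec iov;
    cs_error_t err;

    err = cpg_initialize(&handle, &callbacks);
    if (err != CS_OK) {
        fprintf(stderr, "cpg_initialize failed: %d\n", err);
        return 1; /* never use the handle after a failed init */
    }

    strcpy(group.value, "testgrp"); /* hypothetical group name */
    group.length = strlen(group.value);

    err = cpg_join(handle, &group);
    if (err != CS_OK) {
        fprintf(stderr, "cpg_join failed: %d\n", err);
        cpg_finalize(handle);
        return 1;
    }

    iov.iov_base = "hello";
    iov.iov_len = 5;
    err = cpg_mcast_joined(handle, CPG_TYPE_AGREED, &iov, 1);
    if (err != CS_OK)
        fprintf(stderr, "cpg_mcast_joined failed: %d\n", err); /* 9 == CS_ERR_BAD_HANDLE */

    cpg_leave(handle, &group);
    cpg_finalize(handle);
    return 0;
}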

>
>>
>> The corosync.log of the failing node may be interesting.
>>
> My nodes' hostnames are [one, two, three]; this time they came up in the
> order they're named.
> This time it was on two nodes, the first and second ones coming up again.
> The corosync log seems normal, although I didn't have debug mode enabled;
> I don't know what difference that makes when no errors show up in the
> normal log.
>> Oct 07 09:06:36 [1335] two corosync notice [MAIN  ] Corosync Cluster
>> Engine ('2.3.5'): started and ready to provide service.
>> Oct 07 09:06:36 [1335] two corosync info    [MAIN  ] Corosync built-in
>> features: augeas systemd pie relro bindnow
>> Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] Initializing
>> transport (UDP/IP Multicast).
>> Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] Initializing
>> transmit/receive security (NSS) crypto: aes256 hash: sha1
>> Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] The network
>> interface [10.10.1.152] is now up.
>> Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
>> loaded: corosync configuration map access [0]
>> Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: cmap
>> Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
>> loaded: corosync configuration service [1]
>> Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: cfg
>> Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
>> loaded: corosync cluster closed process group service v1.01 [2]
>> Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: cpg
>> Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
>> loaded: corosync profile loading service [4]
>> Oct 07 09:06:36 [1335] two corosync notice  [QUORUM] Using quorum
>> provider corosync_votequorum
>> Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
>> loaded: corosync vote quorum service v1.0 [5]
>> Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name:
>> votequorum
>> Oct 07 09:06:36 [1335] two corosync notice  [SERV  ] Service engine
>> loaded: corosync cluster quorum service v0.1 [3]
>> Oct 07 09:06:36 [1335] two corosync info    [QB    ] server name: quorum
>> Oct 07 09:06:36 [1335] two corosync notice  [TOTEM ] A new membership
>> (10.10.1.152:92) was formed. Members joined: 2
>> Oct 07 09:06:36 [1335] two corosync notice  [QUORUM] Members[1]: 2
>> Oct 07 09:06:36 [1335] two corosync notice  [MAIN  ] Completed service
>> synchronization, ready to provide service.

Looks good.

> Then pmxcfs results in:
>> Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message
>> failed: 9
>> Oct 07 09:06:38 two pmxcfs[952]: [status] notice: Bad handle 0
>> Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message
>> failed: 9
>> Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message
>> failed: 9
>> Oct 07 09:06:38 two pmxcfs[952]: [status] notice: Bad handle 0
>> Oct 07 09:06:38 two pmxcfs[952]: [status] crit: cpg_send_message
>> failed: 9

Just to be sure: is the "Bad handle 0" message logged by pmxcfs itself?

>
> After that the third node joins, but the CS_ERR_BAD_HANDLE stays.
>
>> Oct 07 09:06:41 [1335] two corosync notice [TOTEM ] A new membership
>> (10.10.1.151:100) was formed. Members joined: 1
>> Oct 07 09:06:41 [1335] two corosync notice  [QUORUM] This node is
>> within the primary component and will provide service.
>> Oct 07 09:06:41 [1335] two corosync notice  [QUORUM] Members[2]: 1 2
>> Oct 07 09:06:41 [1335] two corosync notice  [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Oct 07 09:06:52 [1335] two corosync notice  [TOTEM ] A new membership
>> (10.10.1.151:104) was formed. Members joined: 3
>> Oct 07 09:06:52 [1335] two corosync notice  [QUORUM] Members[3]: 1 2 3
>> Oct 07 09:06:52 [1335] two corosync notice  [MAIN  ] Completed service
>> synchronization, ready to provide service.
>

Again, looks good.

> Funnily enough, the third node also has some initial problems, but is able
> to correct them:
>> Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message
>> failed: 9
>> Oct 07 09:06:53 three pmxcfs[943]: [status] notice: Bad handle 0
>> Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message
>> failed: 9
>> Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message
>> failed: 9
>> Oct 07 09:06:53 three pmxcfs[943]: [status] notice: Bad handle 0
>> Oct 07 09:06:53 three pmxcfs[943]: [status] crit: cpg_send_message
>> failed: 9
>> [...]
>> Oct 07 09:06:56 three pmxcfs[943]: [status] notice: update cluster
>> info (cluster name  teachings, version = 15)
>> Oct 07 09:06:56 three pmxcfs[943]: [status] notice: node has quorum
>> Oct 07 09:06:56 three pmxcfs[943]: [dcdb] notice: members: 3/943
>> Oct 07 09:06:56 three pmxcfs[943]: [dcdb] notice: all data is up to date
>> Oct 07 09:06:56 three pmxcfs[943]: [status] notice: members: 3/943
>> Oct 07 09:06:56 three pmxcfs[943]: [status] notice: all data is up to
>> date
>

Even weirder. I mean, CS_ERR_BAD_HANDLE is not the kind of error that
fixes itself automatically.

>
>
>>
>>>
>>> 2) I disconnect the network interface on which corosync runs, and
>>> reconnect it a bit later. This triggers the same as above, but again not
>>> every time.
>>
>> Just to make sure: don't do ifdown. Corosync reacts to ifdown pretty
>> badly. Also, NetworkManager does ifdown on cable unplug if it is not
>> configured in server mode. If you want to test a network split, either
>> use iptables (make sure to block all traffic needed by corosync, so if
>> you are using multicast make sure to block both unicast and multicast
>> packets on input and output -
>> https://github.com/jfriesse/csts/blob/master/tests/inc/fw.sh), or use
>> blocking on the switch.
>>
> No, I'm aware of the ifdown problems and we do not use NetworkManager in
> Proxmox VE. I often used a 'virtual' cluster, i.e. three virtual
> machines (nested virtualization) in KVM/qemu, and then 'disconnected'
> the network interface through qemu, which should be equivalent to a
> physical unplug.

Sounds good.

> Corosync also recovers when 'connecting' the interface again through qemu.
> Thanks for the script hint; I will do some more physical tests involving
> the network.
>>>
>>> Currently I'm trying to get a somewhat reproducible test and to try it
>>> also on bigger setups and other possible causes; I need to do a bit more
>>> homework here and will report back later.
>>
>> Actually, smaller clusters are better for debugging, but yes, a larger
>> setup may show the problem faster.
> Ok, good to know.
>>
>>>>
>>>>> cpg_mcast_joined to send out the data, but only on one node.
>>>>> corosync-quorumtool shows that we have quorum, and the logs also
>>>>> show a healthy connection to the corosync cluster. The failing
>>>>> handle is initialized once at the initialization of our filesystem.
>>>>> Should it be reinitialized on every reconnect?
>>>>
>>>> Again, I'm unsure what you mean by reconnect. On corosync shutdown you
>>>> have to reconnect (I believe this is not the case because you are
>>>> getting the error only with 10% probability).
>>> Yes, we reconnect to corosync, and it's not a corosync shutdown; the
>>> whole host reboots, or the network interface goes down and then a bit
>>> later comes up again. The probability is just an estimate, but the main
>>> problem is that I cannot reproduce it all the time.
>>>>
>>>>> Restarting the filesystem solves this problem; the strange thing is
>>>>> that it isn't clearly reproducible and often works just fine.
>>>>>
>>>>> Are there some known problems or steps we should look for?
>>>>
>>>> Hard to tell, but generally:
>>>> - Make sure cpg_init really returns CS_OK. If not, the returned handle
>>>> is invalid.
>>>> - Make sure there is no memory corruption and the handle is really
>>>> valid (valgrind may be helpful).
>>> cpg_init checks are in place and should be OK.
>>> Yes, I will use Valgrind, but one question ahead:
>>>
>>> Can the handle get lost somehow? Is there a need to reinitialize the cpg
>>> with cpg_initialize/cpg_model_initialize after we left and later
>>> rejoined the cluster?
>>
>> I'm still unsure what you mean by "after we left and later rejoined". As
>> long as corosync is running, the client application "doesn't need to care
>> about" membership changes. It's corosync's problem. So if a network split
>> happens, you don't have to call cpg_initialize. The only place where
>> cpg_initialize is needed is the initial connection and reconnection after
>> the corosync main process exits.
> What I mean is leaving the cpg group after quorum was lost and regained
> (unplug and replug the NIC).
>
> I'm a bit confused by two things:
>
> -) Restarting pmxcfs, and thus making a clean new initialization, solves
> the problem, but reinitializing isn't intended, as you also stated. If I'm
> not mistaken, corosync even has problems when cpg_initialize is called
> twice from the same PID, with a finalize in between.

It shouldn't. cpg_join may be a problem, because of the following scenario:
- the client calls cpg_finalize
- this is delivered to the corosync main process; corosync needs to inform
other clients on both the local node and the other cluster nodes
- cpg_finalize in the client API returns without error
- the client calls a new cpg_init + cpg_join
- the server hasn't yet had time to process the leave message

-> an error is returned because there are two processes in the same group
with the same pid.
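
If pmxcfs ever does such a quick finalize followed by a new init + join, one
defensive option is to retry the join for a bounded time until the server has
processed the old leave. A rough sketch (the retry count and sleep are
arbitrary, and I'm not claiming this race is what you are actually hitting):

/* Sketch: re-establish a CPG connection and retry cpg_join until the
 * server has processed the previous leave (bounded retry; the numbers
 * are arbitrary). */
#include <unistd.h>
#include <corosync/cpg.h>

static cs_error_t reconnect_cpg(cpg_handle_t *handle, cpg_callbacks_t *cb,
                                const struct cpg_name *group)
{
    cs_error_t err;
    int retries;

    err = cpg_initialize(handle, cb);
    if (err != CS_OK)
        return err;

    for (retries = 0; retries < 10; retries++) {
        err = cpg_join(*handle, group);
        if (err == CS_OK)
            return CS_OK;
        /* our old (pid, group) entry may still be around, or the
         * server is busy - wait a bit and try again */
        usleep(100000);
    }

    cpg_finalize(*handle); /* give up; caller must not use the handle */
    return err;
}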


>
> -) the handle can only be changed by cpg_initialize, as it's the only
> place where it is passed by reference, so where does it get set to zero?
> I assume it

This is a question I'm asking myself too. I have 3 possible explanations:
- a bug in pmxcfs
- a bug in the corosync libraries
- a bug in libqb

Just to be sure: are you using the newest corosync (2.3.5) and the newest
libqb (0.17.2)?


> wasn't zero from the beginning, as then cpg_initialize should fail?

I'm not exactly sure what you meant. cpg_initialize doesn't really care
what value was in the handle before it was called; it just overwrites it.
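
One cheap way to narrow this down is to log the raw handle value right after
cpg_initialize and again right before every cpg_mcast_joined; if the value
printed at send time differs from the one printed at init time, something
inside pmxcfs overwrote it. A sketch of such a wrapper (the function name is
made up):

/* Sketch: log the raw handle value to see whether it changes between
 * cpg_initialize and later cpg_mcast_joined calls. */
#include <inttypes.h>
#include <stdio.h>
#include <sys/uio.h>
#include <corosync/cpg.h>

static cs_error_t send_logged(cpg_handle_t handle, const struct iovec *iov,
                              unsigned int iov_len)
{
    cs_error_t err;

    /* cpg_handle_t is a 64-bit value */
    fprintf(stderr, "cpg handle before send: 0x%" PRIx64 "\n",
            (uint64_t)handle);

    err = cpg_mcast_joined(handle, CPG_TYPE_AGREED, iov, iov_len);
    if (err == CS_ERR_BAD_HANDLE)
        fprintf(stderr, "send failed with CS_ERR_BAD_HANDLE, handle was "
                "0x%" PRIx64 "\n", (uint64_t)handle);
    return err;
}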

>
> Is there an intended way to handle a CS_ERR_BAD_HANDLE?
>

No. CS_ERR_BAD_HANDLE is the kind of error which simply shouldn't happen
as long as the handle is not changed in the caller application.
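
For completeness: the one case where re-initialization is legitimate - the
corosync main process going away - can be detected in the application's
dispatch loop. A rough sketch follows; treating everything other than CS_OK
and CS_ERR_TRY_AGAIN as "connection lost" is my assumption here, so adjust it
to whatever your dispatch loop actually reports:

/* Sketch: re-initialize only when the library says the connection to the
 * corosync main process is gone (error-code handling is an assumption). */
#include <corosync/cpg.h>

static int dispatch_once(cpg_handle_t *handle, cpg_callbacks_t *cb,
                         const struct cpg_name *group)
{
    cs_error_t err = cpg_dispatch(*handle, CS_DISPATCH_ALL);

    if (err == CS_OK || err == CS_ERR_TRY_AGAIN)
        return 0; /* connection is still fine */

    /* connection to corosync is presumably gone: full re-init + re-join */
    cpg_finalize(*handle);
    if (cpg_initialize(handle, cb) != CS_OK)
        return -1;
    if (cpg_join(*handle, group) != CS_OK) {
        cpg_finalize(*handle);
        return -1;
    }
    return 0;
}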


Regards,
   Honza

> Thanks for your time.
>
> Regards,
> Thomas
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




