[ClusterLabs] corosync 2.4 CPG config change callback

Jan Friesse jfriesse at redhat.com
Wed Mar 7 05:16:30 EST 2018


Thomas,


> Hi,
>
> first some background info for the questions I'm going to ask:
> We use corosync as a basis for our distributed realtime configuration
> file system (pmxcfs)[1].

nice

>
> We got some reports of a completely hanging FS, with the only
> correlations being high load, often I/O, and usually a message that
> corosync was not scheduled for longer than the token timeout.
>
> See this example from a three node cluster, first:
>
>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [MAIN  ] Corosync main process was not scheduled for 3767.3159 ms (threshold is 1320.0000 ms). Consider token timeout increase.
>
> then we get a few log entries saying that JOIN or LEAVE messages were
> thrown away (understandable for this event):
>
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [MAIN  ] Corosync main process was not scheduled for 3767.3159 ms (threshold is 1320.0000 ms). Consider token timeout increase.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [TOTEM ] Failed to receive the leave message. failed: 2 3
> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] Failed to receive the leave message. failed: 2 3
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [QUORUM] Members[3]: 1 2 3
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [QUORUM] Members[3]: 1 2 3
> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [MAIN  ] Completed service synchronization, ready to provide service.
>
> Until recently we were really stepping in the dark and considered
> everything from kernel bugs to our filesystem logic as a possible
> cause...  But then we had the luck to trigger this on our test systems
> and went to town with gdb on the core dump, finding that we can trigger
> this by pausing the leader (from our FS POV) for a short moment (which
> may be shorter than the token timeout), so that a new leader gets
> elected, and then resuming our leader node VM again.
>
> The problem I saw was that while the leader had a log entry which
> proved that it noticed its blackout:
>> [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3

I know it looks weird but it's perfectly fine and expected.

>
> our FS cpg_confchg_fn callback[2] was never called, thus it thought it

That shouldn't happen

> was still in sync and nothing ever happened, until another member
> triggered this callback, by either leaving or (re-)joining.
>
> Looking into the cpg.c code I saw that there's another callback, namely
> cpg_totem_confchg_fn, which seemed a bit odd as we did not set that

This callback is not necessary as long as the information about the cpg 
group is enough for you. cpg_totem_confchg_fn carries information about 
all processors (nodes), not only about the members of the cpg group.
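
For reference, wiring up both callbacks through cpg_model_initialize looks 
roughly like the following minimal sketch (compile with -lcpg; the callback 
bodies and the group name are placeholders, not pmxcfs code, and error 
handling is reduced to the bare minimum):

#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <corosync/cpg.h>

/* Placeholder callbacks -- a real application would do its own bookkeeping here. */
static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
        uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
    (void)handle; (void)group; (void)msg;
    printf("message from nodeid %u (pid %u), %zu bytes\n", nodeid, pid, msg_len);
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
        const struct cpg_address *members, size_t n_members,
        const struct cpg_address *left, size_t n_left,
        const struct cpg_address *joined, size_t n_joined)
{
    (void)handle; (void)group; (void)members; (void)left; (void)joined;
    printf("cpg confchg: %zu members, %zu left, %zu joined\n",
        n_members, n_left, n_joined);
}

static void totem_confchg_cb(cpg_handle_t handle, struct cpg_ring_id ring_id,
        uint32_t n_members, const uint32_t *members)
{
    (void)handle; (void)members;
    /* Membership of all processors (nodes), independent of the cpg group. */
    printf("totem confchg: ring %u.%" PRIu64 ", %u nodes\n",
        ring_id.nodeid, ring_id.seq, n_members);
}

int main(void)
{
    cpg_handle_t handle;
    cpg_model_v1_data_t model_data = {
        .cpg_deliver_fn       = deliver_cb,
        .cpg_confchg_fn       = confchg_cb,
        .cpg_totem_confchg_fn = totem_confchg_cb,
        /* also deliver the initial totem membership right after joining */
        .flags                = CPG_MODEL_V1_DELIVER_INITIAL_TOTEM_CONF,
    };
    struct cpg_name group;

    if (cpg_model_initialize(&handle, CPG_MODEL_V1,
            (cpg_model_data_t *)&model_data, NULL) != CS_OK)
        return 1;

    strcpy(group.value, "example_group");   /* placeholder group name */
    group.length = strlen(group.value);
    if (cpg_join(handle, &group) != CS_OK)
        return 1;

    /* Dispatch callbacks; a real app would hook cpg_fd_get() into its own loop. */
    cpg_dispatch(handle, CS_DISPATCH_BLOCKING);
    cpg_finalize(handle);
    return 0;
}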

> one... (I'm not the original author of the FS, and it dates back to at
> least 2010, so maybe cpg_initialize was not yet deprecated then, and
> thus model_initialize was not used)

>
> I switched over to using cpg_model_initialize and set the totem_confchg
> callback, but for the "blacked-out node" it gets called twice after the
> event, and always shows all members...
>
> So to finally get to my questions:
>
> * Why doesn't the cpg_confchg_fn CB get called when a node has a short
>   blackout (i.e., corosync not being scheduled for a bit of time),
>   with all other nodes in its leave and join lists, as the log
>   would suggest ("Members joined: 2 3 left: 2 3")?

I believe it was called, just not while corosync was paused.

>
> * If that doesn't seem like a good idea, what can we use to really
>   detect such a node blackout?

It's not possible to detect this from the affected node itself; it has 
to be detected from the other nodes.
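
On the other nodes this shows up through the normal cpg_confchg_fn: the 
paused node appears in left_list (typically with a reason other than a 
clean CPG_REASON_LEAVE) and later in joined_list when it comes back. A 
rough sketch of such a handler follows; the app_* hooks are hypothetical 
placeholders, not part of the cpg API:

#include <stdio.h>
#include <corosync/cpg.h>

/* Hypothetical application hooks -- placeholders, not part of the cpg API. */
static void app_mark_node_unsynced(uint32_t nodeid) { printf("node %u out of sync\n", nodeid); }
static void app_schedule_resync_to(uint32_t nodeid) { printf("resync node %u\n", nodeid); }

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
        const struct cpg_address *members, size_t n_members,
        const struct cpg_address *left, size_t n_left,
        const struct cpg_address *joined, size_t n_joined)
{
    size_t i;

    (void)handle; (void)group; (void)members; (void)n_members;

    for (i = 0; i < n_left; i++) {
        /* A member that vanished without a clean cpg_leave() typically has a
         * reason such as CPG_REASON_NODEDOWN or CPG_REASON_PROCDOWN. */
        if (left[i].reason != CPG_REASON_LEAVE)
            printf("node %u dropped out unexpectedly\n", left[i].nodeid);
        app_mark_node_unsynced(left[i].nodeid);
    }

    for (i = 0; i < n_joined; i++)
        app_schedule_resync_to(joined[i].nodeid);  /* push state once it rejoins */
}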

>
> As a workaround I added logic for the case where a config change shows
> that a node with a lower ID joined: the node which was leader until then
> triggers a CPG leave, enforcing a cluster-wide config change event, which
> this time the blacked-out node also receives, so it syncs up again.
> This works, but it does not feel really nice, IMO...

Ok, let me explain what exactly happened in your test, simplified to 
two nodes:

- Nodes A and B are running corosync and the same cpg application
- Node A is paused for a time > token_timeout
- Node B detects that node A is not responding
   - Creates a new membership (with only node B)
   - Sends a notification to its cpg app about node A's leave
- Node A is still paused, so it cannot deliver a notification to its cpg app
- Node A is unpaused
   - The pause is detected and a new membership starts forming
   - From node A's point of view, node B left - simply because a node 
never considers itself to have left
   - Node A is able to contact node B, so the final membership is A, B.
   - Node A can finally deliver two cpg conf changes to the cpg 
application: one about node B leaving and a second one about node B 
joining (see the sketch below).
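
So from node A's own cpg_confchg_fn the blackout is only visible after the 
fact, as "everybody else left, then everybody else joined again". One 
possible heuristic on top of that sequence is sketched below; note that it 
also fires if all peers genuinely die at once, and the app_* hook is a 
hypothetical placeholder:

#include <stdio.h>
#include <corosync/cpg.h>

/* Hypothetical hook: drop our own (possibly stale) state and ask the
 * surviving members for a full resync. Not part of the cpg API. */
static void app_discard_local_state_and_resync(void) { printf("resync needed\n"); }

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
        const struct cpg_address *members, size_t n_members,
        const struct cpg_address *left, size_t n_left,
        const struct cpg_address *joined, size_t n_joined)
{
    unsigned int local_nodeid = 0;

    (void)group; (void)left; (void)joined; (void)n_joined;

    if (cpg_local_get(handle, &local_nodeid) != CS_OK)
        return;

    /* After the unpause, the first conf change reports all peers as left while
     * we are the only remaining group member: treat our state as suspect. */
    if (n_left > 0 && n_members == 1 && members[0].nodeid == local_nodeid) {
        printf("all peers left at once -- we were probably paused or isolated\n");
        app_discard_local_state_and_resync();
    }
}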

Now it's really the cpg application's problem to synchronize its data. 
Many applications (usually filesystems) use quorum together with fencing 
to find out which cluster partition is quorate and to clean up the 
inquorate one.
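
For the quorum side, a minimal sketch of querying corosync 2.x's libquorum 
(link with -lquorum); fencing of the inquorate partition is left to 
pacemaker or the application, this only asks whether the local partition 
is quorate:

#include <corosync/quorum.h>

/* Ask corosync's quorum service whether this partition is quorate
 * before trusting (or serving) the local state. */
int partition_is_quorate(void)
{
    quorum_handle_t handle;
    uint32_t quorum_type;
    int quorate = 0;

    if (quorum_initialize(&handle, NULL, &quorum_type) != CS_OK)
        return -1;
    if (quorum_getquorate(handle, &quorate) != CS_OK)
        quorate = -1;
    quorum_finalize(handle);

    return quorate;  /* 1 = quorate, 0 = not quorate, -1 = error */
}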

Hopefully my explanation helps you; feel free to ask more questions!

Regards,
   Honza

>
> help would be appreciated, much thanks!
>
> cheers,
> Thomas
>
> [1]: https://git.proxmox.com/?p=pve-cluster.git;a=tree;f=data/src;h=e5493468b456ba9fe3f681f387b4cd5b86e7ca08;hb=HEAD
> [2]: https://git.proxmox.com/?p=pve-cluster.git;a=blob;f=data/src/dfsm.c;h=cdf473e8226ab9706d693a457ae70c0809afa0fa;hb=HEAD#l1096