[ClusterLabs] corosync 2.4 CPG config change callback

Thomas Lamprecht t.lamprecht at proxmox.com
Wed Mar 7 15:59:56 UTC 2018


Hi,

On 3/7/18 1:41 PM, Jan Friesse wrote:
> Thomas,
> 
>> First thanks for your answer!
>>
>> On 3/7/18 11:16 AM, Jan Friesse wrote:
>>>> first some background info for my questions I'm going to ask:
>>>> We use corosync as a basis for our distributed realtime configuration
>>>> file system (pmxcfs)[1].
>>>
>>> nice
>>>
>>>>
>>>> We got some reports of a completely hanging FS, with the only
>>>> correlations being high load, often IO, and most times a message that
>>>> corosync did not get scheduled for longer than the token timeout.
>>>>
>>>> See this example from a three node cluster, first:
>>>>
>>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [MAIN  ] Corosync main process was not scheduled for 3767.3159 ms (threshold is 1320.0000 ms). Consider token timeout increase.
>>>>
>>>> then we get a few logs that JOIN or LEAVE messages were thrown away
>>>> (understandable for this event):
>>>>
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [MAIN  ] Corosync main process was not scheduled for 3767.3159 ms (threshold is 1320.0000 ms). Consider token timeout increase.
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [TOTEM ] Failed to receive the leave message. failed: 2 3
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [TOTEM ] Failed to receive the leave message. failed: 2 3
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [QUORUM] Members[3]: 1 2 3
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [QUORUM] Members[3]: 1 2 3
>>>> Mar 01 13:07:56 ceph05-01-public corosync[1638]:  [MAIN  ] Completed service synchronization, ready to provide service.
>>>>
>>>> Until recently we were really in the dark and had everything from
>>>> kernel bugs to our filesystem logic in mind as a possible cause...  But
>>>> then we had the luck to trigger this in our test systems and went to
>>>> town with gdb on the core dump, finding that we can trigger this by
>>>> pausing the leader (from our FS POV) for a short moment (may be shorter
>>>> than the token timeout), so that a new leader gets elected, and then
>>>> resuming our leader node VM again.
>>>>
>>>> The problem I saw was that while the leader had a log entry which
>>>> proved that it noticed its blackout:
>>>>> [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
>>>
>>> I know it looks weird but it's perfectly fine and expected.
>>>
>>
>> It seemed OK, from this node's POV, just the missing config change CB
>> was a bit odd to us.
>>
>>>>
>>>> our FS cpg_confchg_fn callback[2] was never called, thus it thought it
>>>
>>> That shouldn't happen
>>>
>>
>> So we really should get a config change CB on the paused node after
>> unpausing, with all other (online) nodes in both leave and join member
>> list?
> 
> Nope, cpg will take care to send two messages (one about the node leaving and a second about the node joining).
> 
>> Just asking again to confirm my thinking and that I did not misunderstand
>> you. :)
>>
>>>> was still in sync and nothing ever happened, until another member
>>>> triggered this callback, by either leaving or (re-)joining.
>>>>
>>>> Looking in the cpg.c code I saw that there's another callback, namely
>>>> cpg_totem_confchg_fn, which seemed a bit odd as we did not set that
>>>
>>> This callback is not necessary as long as the information about the cpg group is enough. cpg_totem_confchg_fn contains information about all processors (nodes).
>>>
>>
>> OK, makes sense.
>>
>>>> one... (I'm not the original author of the FS and it dates back to at
>>>> least 2010, so maybe cpg_initialize was not yet deprecated there, and
>>>> thus model_initialize was not used then)
>>>
>>>>
>>>> I switched over to using cpg_model_initialize and set the totem_confchg
>>>> callback, but for the "blacked out node" it gets called twice after the
>>>> event, but always shows all members...
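>>>>
>>>> For context, the registration now looks roughly like this (a minimal
>>>> sketch along the lines of testcpg, not our exact pmxcfs code; the
>>>> my_* callback names are just placeholders):
>>>>
>>>>   #include <corosync/corotypes.h>
>>>>   #include <corosync/cpg.h>
>>>>
>>>>   static void my_deliver_fn(cpg_handle_t handle,
>>>>       const struct cpg_name *group_name,
>>>>       uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
>>>>   {
>>>>       /* ordered message delivery, not relevant here */
>>>>   }
>>>>
>>>>   static void my_confchg_fn(cpg_handle_t handle,
>>>>       const struct cpg_name *group_name,
>>>>       const struct cpg_address *member_list, size_t member_list_entries,
>>>>       const struct cpg_address *left_list, size_t left_list_entries,
>>>>       const struct cpg_address *joined_list, size_t joined_list_entries)
>>>>   {
>>>>       /* group-level membership changes (joins/leaves in the cpg group) */
>>>>   }
>>>>
>>>>   static void my_totem_confchg_fn(cpg_handle_t handle,
>>>>       struct cpg_ring_id ring_id,
>>>>       uint32_t member_list_entries, const uint32_t *member_list)
>>>>   {
>>>>       /* totem-level ring changes; member_list holds node ids only */
>>>>   }
>>>>
>>>>   static int setup_cpg(cpg_handle_t *handle)
>>>>   {
>>>>       cpg_model_v1_data_t model_data = {
>>>>           .cpg_deliver_fn       = my_deliver_fn,
>>>>           .cpg_confchg_fn       = my_confchg_fn,
>>>>           .cpg_totem_confchg_fn = my_totem_confchg_fn,
>>>>           /* testcpg additionally sets
>>>>            * CPG_MODEL_V1_DELIVER_INITIAL_TOTEM_CONF here */
>>>>           .flags                = 0,
>>>>       };
>>>>       return cpg_model_initialize(handle, CPG_MODEL_V1,
>>>>           (cpg_model_data_t *)&model_data, NULL) == CS_OK ? 0 : -1;
>>>>   }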
>>>>
>>>> So to finally get to my questions:
>>>>
>>>> * why doesn't the cpg_confchg_fn CB get called when a node has a short
>>>>   blackout (i.e., corosync not being scheduled for a bit of time)?
>>>>   having all other nodes in its leave and join lists, as the log
>>>>   would suggest ("Members joined: 2 3 left: 2 3")
>>>
>>> I believe it was called but not when corosync was paused.
>>>
>>
>> But I never see my registered callback called after unpausing and
>> also when printing out the dispatch_data->id inside cpg_dispatch
>> I do not get one with "MESSAGE_RES_CPG_CONFCHG_CALLBACK" after
>> unpausing only a "MESSAGE_RES_CPG_TOTEM_CONFCHG_CALLBACK"
>>
>>>>
>>>> * If that doesn't seem like a good idea, what can we use to really
>>>>   detect such a node blackout?
>>>
>>> It's not possible to detect from the affected node, but it must be detected from other nodes.
>>>
>>
>> You mean not directly, as in: we cannot differentiate between a legitimate
>> fall-out and such a pause, caused artificially or by real high IO.
>> That makes sense.
>>
>>>>
>>>> As a workaround I added logic for when a node with a lower ID joins
>>>> through a config change. The node which was leader until then triggers
>>>> a CPG leave, enforcing a cluster-wide config change event, which this
>>>> time the blacked-out node also gets and then syncs up again.
>>>> This works, but does not feel really nice, IMO...
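>>>>
>>>> In (simplified) code the idea is roughly the following sketch; note that
>>>> we_are_leader and our_nodeid are placeholders for the real dfsm state,
>>>> not actual identifiers from our code:
>>>>
>>>>   static void dfsm_confchg_fn(cpg_handle_t handle,
>>>>       const struct cpg_name *group_name,
>>>>       const struct cpg_address *member_list, size_t member_list_entries,
>>>>       const struct cpg_address *left_list, size_t left_list_entries,
>>>>       const struct cpg_address *joined_list, size_t joined_list_entries)
>>>>   {
>>>>       for (size_t i = 0; i < joined_list_entries; i++) {
>>>>           if (we_are_leader && joined_list[i].nodeid < our_nodeid) {
>>>>               /* leaving forces a cluster-wide config change which the
>>>>                * previously blacked-out node receives as well; we rejoin
>>>>                * the group afterwards via cpg_join() */
>>>>               cpg_leave(handle, group_name);
>>>>               break;
>>>>           }
>>>>       }
>>>>       /* ... normal membership and state handling ... */
>>>>   }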
>>>
>>> Ok let me explain what exactly happened in your test and simplify it to two nodes:
>>>
>>> - Node A and B are running corosync and same cpg application
>>> - Node A is paused for time > token_timeout
>>> - Node B detects that node A is not responding
>>>   - Creates new membership (with only node B)
>>>   - Sends notification to cpg app about Node A leave
>>> - Node A is still paused so it cannot send notification into cpg app
>>> - Node A is unpaused
>>>   - Pause is detected and start forming new membership
>>>   - From node A's point of view, node B left - simply because a node never considers itself as having left
>>>   - Node A is able to contact node B so final membership is A, B.
>>>   - Node A can finally deliver two cpg conf changes into the cpg application. One about node B leaving and a second one about node B joining.
>>>
>>
>> Exactly the last point is missing here, and I'd also expect it
>> to happen... I only get the config change on Node B for the Node
>> A join, but Node A logs (as stated)... :/
> 
> So if the cpg events really are like this:
>  - node B sees node A as dead and then alive
>  - and Node A never gets any confchg
> 

Yes.

> then corosync has a bug (super serious one) somewhere.
> 
> Could you please try to compile the latest needle, and try to reproduce the problem (ideally using testcpg (in the test directory))?
> 

OK, I tested. It differs between a 2-node cluster, where it works flawlessly,
and a three-node cluster, where it does not work. I have a guess why that
may be: on a 2-node cluster you'll always get a quorum loss and thus force
a real resync after the paused node comes back from the dead.
In a three-node (and possibly bigger) cluster the remaining nodes are still
quorate, and thus the partition sees no quorum change when the paused node is
active again; at least that's mainly the difference between both tests.
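
As a side note, the quorum change itself can be watched via corosync's quorum
service. A minimal tracking sketch (nothing pmxcfs specific, just to
illustrate what the 2-node case sees and the still-quorate 3-node partition
does not):

  #include <inttypes.h>
  #include <stdio.h>
  #include <corosync/corotypes.h>
  #include <corosync/quorum.h>

  /* fires on every quorum change; per the guess above this happens around
   * the pause in the 2-node case, while the remaining quorate partition in
   * the 3-node case sees no such change */
  static void quorum_notify_fn(quorum_handle_t handle, uint32_t quorate,
      uint64_t ring_seq, uint32_t view_list_entries, uint32_t *view_list)
  {
      printf("quorate=%u ring_seq=%" PRIu64 " members=%u\n",
             (unsigned)quorate, ring_seq, (unsigned)view_list_entries);
  }

  static int track_quorum(quorum_handle_t *handle)
  {
      quorum_callbacks_t callbacks = { .quorum_notify_fn = quorum_notify_fn };
      uint32_t quorum_type;

      if (quorum_initialize(handle, &callbacks, &quorum_type) != CS_OK)
          return -1;
      return quorum_trackstart(*handle, CS_TRACK_CHANGES) == CS_OK ? 0 : -1;
  }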

I used a two/three-node cluster in VMs (Debian-based distro: Proxmox VE) with
corosync 2.4.3 and also built the latest needle from git. Earlier I was still
on 2.4.2, one minor release behind; all three show similar behaviour.

Below is the output of testcpg from the same build; I started and stopped the
instances at the same time, using synchronized tmux panes.
I annotated where I did the PAUSE/UNPAUSE cycle. You can see that Node A, the
node which gets paused, only receives a totem change callback in the 3-node
case, but receives all changes in the 2-node one.

= 2 NODE =

== Node A (seppl) ==

root at seppl:~# ./testcpg
Type EXIT to finish
Local node id is 1
membership list
node id 1 pid 29077

seppl:29303 2018-03-07 16:07:39.842
TotemConfchgCallback: ringid (1.1364)
active processors 2: 1 2
seppl:29303 2018-03-07 16:07:39.842
ConfchgCallback: group 'GROUP'
Local node id is 1.0.0.0/1 result 1
joined node/pid 1/29303 reason: 1
nodes in group now 3
node/pid 1/29077
node/pid 1/29303
node/pid 2/22396

# PAUSE / UNPAUSE lois

seppl:29303 2018-03-07 16:07:44.840
ConfchgCallback: group 'GROUP'
Local node id is 1.0.0.0/1 result 1
left node/pid 2/22396 reason: 3
nodes in group now 2
node/pid 1/29077
node/pid 1/29303
seppl:29303 2018-03-07 16:07:44.842
ConfchgCallback: group 'GROUP'
Local node id is 1.0.0.0/1 result 1
joined node/pid 2/22396 reason: 4
nodes in group now 3
node/pid 1/29077
node/pid 1/29303
node/pid 2/22396

seppl:29303 2018-03-07 16:07:44.842
TotemConfchgCallback: ringid (1.1376)
active processors 2: 1 2
EXIT
Finalize  result is 1 (should be 1)


== Node B (lois) ==

root at lois:~# ./testcpg
Type EXIT to finish
Local node id is 2
membership list
node id 1 pid 29077
lois:22396 2018-03-07 16:08:06.705
ConfchgCallback: group 'GROUP'
Local node id is 2.0.0.0/2 result 1
joined node/pid 2/22396 reason: 1
nodes in group now 2
node/pid 1/29077
node/pid 2/22396

lois:22396 2018-03-07 16:08:06.706
TotemConfchgCallback: ringid (1.1364)
active processors 2: 1 2
lois:22396 2018-03-07 16:08:06.706
ConfchgCallback: group 'GROUP'
Local node id is 2.0.0.0/2 result 1
joined node/pid 1/29303 reason: 1
nodes in group now 3
node/pid 1/29077
node/pid 1/29303
node/pid 2/22396

# PAUSE lois

lois:22396 2018-03-07 16:08:13.769
ConfchgCallback: group 'GROUP'
Local node id is 2.0.0.0/2 result 1
left node/pid 1/29077 reason: 3
left node/pid 1/29303 reason: 3
nodes in group now 1
node/pid 2/22396

lois:22396 2018-03-07 16:08:13.770
TotemConfchgCallback: ringid (2.1368)
active processors 1: 2

# UNPAUSE lois

lois:22396 2018-03-07 16:08:15.502
ConfchgCallback: group 'GROUP'
Local node id is 2.0.0.0/2 result 1
joined node/pid 1/29303 reason: 4
nodes in group now 2
node/pid 1/29303
node/pid 2/22396
lois:22396 2018-03-07 16:08:15.503
ConfchgCallback: group 'GROUP'
Local node id is 2.0.0.0/2 result 1
joined node/pid 1/29077 reason: 4
nodes in group now 3
node/pid 1/29077
node/pid 1/29303
node/pid 2/22396

lois:22396 2018-03-07 16:08:15.504
TotemConfchgCallback: ringid (1.1376)
active processors 2: 1 2
EXIT
Finalize  result is 1 (should be 1)


* result => works as expected


= 3 NODE =

= Node A (seppl) =

root at seppl:~# ./testcpg
Type EXIT to finish
Local node id is 1
membership list

seppl:1953 2018-03-07 16:41:05.788
TotemConfchgCallback: ringid (1.1424)
active processors 3: 1 2 3
seppl:1953 2018-03-07 16:41:05.788
ConfchgCallback: group 'GROUP'
Local node id is 1.0.0.0/1 result 1
joined node/pid 1/1953 reason: 1
nodes in group now 3
node/pid 1/1953
node/pid 2/2054
node/pid 3/15085

# PAUSE / UNPAUSE seppl

seppl:1953 2018-03-07 16:41:11.645
TotemConfchgCallback: ringid (1.1436)
active processors 3: 1 2 3
EXIT
Finalize  result is 1 (should be 1)


= Node B (lois) =
root at lois:~# ./testcpg
Type EXIT to finish
Local node id is 2
membership list
lois:2054 2018-03-07 16:41:05.794
ConfchgCallback: group 'GROUP'
Local node id is 2.0.0.0/2 result 1
joined node/pid 2/2054 reason: 1
nodes in group now 1
node/pid 2/2054

lois:2054 2018-03-07 16:41:05.794
TotemConfchgCallback: ringid (1.1424)
active processors 3: 1 2 3
lois:2054 2018-03-07 16:41:05.796
ConfchgCallback: group 'GROUP'
Local node id is 2.0.0.0/2 result 1
joined node/pid 3/15085 reason: 1
nodes in group now 2
node/pid 2/2054
node/pid 3/15085
lois:2054 2018-03-07 16:41:05.796
ConfchgCallback: group 'GROUP'
Local node id is 2.0.0.0/2 result 1
joined node/pid 1/1953 reason: 1
nodes in group now 3
node/pid 1/1953
node/pid 2/2054
node/pid 3/15085

# PAUSE seppl

lois:2054 2018-03-07 16:41:13.773
ConfchgCallback: group 'GROUP'
Local node id is 2.0.0.0/2 result 1
left node/pid 1/1953 reason: 3
nodes in group now 2
node/pid 2/2054
node/pid 3/15085

lois:2054 2018-03-07 16:41:13.774
TotemConfchgCallback: ringid (2.1428)
active processors 2: 2 3

# UNPAUSE seppl

lois:2054 2018-03-07 16:41:17.755
ConfchgCallback: group 'GROUP'
Local node id is 2.0.0.0/2 result 1
joined node/pid 1/1953 reason: 4
nodes in group now 3
node/pid 1/1953
node/pid 2/2054
node/pid 3/15085

lois:2054 2018-03-07 16:41:17.758
TotemConfchgCallback: ringid (1.1436)
active processors 3: 1 2 3
EXIT
Finalize  result is 1 (should be 1)



= Node C (franz) =

root at franz:~# ./testcpg
Type EXIT to finish
Local node id is 3
membership list

franz:15085 2018-03-07 16:41:05.789
TotemConfchgCallback: ringid (1.1424)
active processors 3: 1 2 3
franz:15085 2018-03-07 16:41:05.789
ConfchgCallback: group 'GROUP'
Local node id is 3.0.0.0/3 result 1
joined node/pid 3/15085 reason: 1
nodes in group now 2
node/pid 2/2054
node/pid 3/15085
franz:15085 2018-03-07 16:41:05.789
ConfchgCallback: group 'GROUP'
Local node id is 3.0.0.0/3 result 1
joined node/pid 1/1953 reason: 1
nodes in group now 3
node/pid 1/1953
node/pid 2/2054
node/pid 3/15085

# PAUSE seppl

franz:15085 2018-03-07 16:41:13.766
ConfchgCallback: group 'GROUP'
Local node id is 3.0.0.0/3 result 1
left node/pid 1/1953 reason: 3
nodes in group now 2
node/pid 2/2054
node/pid 3/15085

franz:15085 2018-03-07 16:41:13.769
TotemConfchgCallback: ringid (2.1428)
active processors 2: 2 3

# UNPAUSE seppl

franz:15085 2018-03-07 16:41:17.749
ConfchgCallback: group 'GROUP'
Local node id is 3.0.0.0/3 result 1
joined node/pid 1/1953 reason: 4
nodes in group now 3
node/pid 1/1953
node/pid 2/2054
node/pid 3/15085

franz:15085 2018-03-07 16:41:17.751
TotemConfchgCallback: ringid (1.1436)
active processors 3: 1 2 3
EXIT
Finalize  result is 1 (should be 1)


Hope I did both tests right, but as it reproduces multiple times
with testcpg as well as with our filesystem's cpg usage, this seems
like a valid test result, not just a single occurrence.

cheers,
Thomas

>>
>>> Now it's really the cpg application's problem to synchronize its data. Many applications (usually FSes) use quorum together with fencing to find out which cluster partition is quorate and to clean the inquorate one.
>>>
>>> Hopefully my explanation helps you; feel free to ask more questions!
>>>
>>
>> They help, but I'm still a bit unsure about why the CB could not happen here;
>> I may need to dive a bit deeper into corosync :)
>>
>>> Regards,
>>>   Honza
>>>
>>>>
>>>> help would be appreciated, much thanks!
>>>>
>>>> cheers,
>>>> Thomas
>>>>
>>>> [1]: https://git.proxmox.com/?p=pve-cluster.git;a=tree;f=data/src;h=e5493468b456ba9fe3f681f387b4cd5b86e7ca08;hb=HEAD
>>>> [2]: https://git.proxmox.com/?p=pve-cluster.git;a=blob;f=data/src/dfsm.c;h=cdf473e8226ab9706d693a457ae70c0809afa0fa;hb=HEAD#l1096
>>>>
>>
>>
>>
> 
> 




