[ClusterLabs] corosync 2.4 CPG config change callback

Jan Friesse jfriesse at redhat.com
Fri Mar 9 11:26:44 EST 2018


Thomas,

> Hi,
> 
> On 3/7/18 1:41 PM, Jan Friesse wrote:
>> Thomas,
>>
>>> First thanks for your answer!
>>>
>>> On 3/7/18 11:16 AM, Jan Friesse wrote:

...

> TotemConfchgCallback: ringid (1.1436)
> active processors 3: 1 2 3
> EXIT
> Finalize  result is 1 (should be 1)
> 
> 
> Hope I did both tests right, but as it reproduces multiple times
> with testcpg and with our cpg usage in our filesystem, this seems
> validly tested, not just a single occurrence.

I've tested it too and yes, you are 100% right. The bug is there, and
it's pretty easy to reproduce when the node with the lowest nodeid is
paused. It's slightly harder when a node with a higher nodeid is paused.

Most clusters use power fencing, so they simply never see this
problem. That may also be the reason why it wasn't reported long ago
(this bug has existed more or less since at least OpenAIS Whitetank).
So really nice work finding this bug.

What I'm not entirely sure about is the best way to solve this
problem. What I am sure of is that it's going to be "fun" :(

Let's start with a very high-level list of possible solutions:
- "Ignore the problem". CPG behaves more or less correctly. The "current"
membership really didn't change, so it doesn't make much sense to
inform about a change. It's possible to use cpg_totem_confchg_fn_t to
find out when the ringid changes. I'm adding this solution just for
completeness, because I don't prefer it at all.
- cpg_confchg_fn_t adds all nodes that left and then joined back to the
left/joined lists.
- cpg sends an extra cpg_confchg_fn_t call about the left and joined
nodes. I would prefer this solution simply because it makes cpg behavior
the same in all situations.
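
To make the first option concrete, here is a minimal sketch in C of how an
application could notice that the ring was re-formed even though the member
list looks the same. It is only an illustration under stated assumptions:
`struct ring_id` is a local stand-in for corosync's `struct cpg_ring_id` so
that the sketch compiles without libcpg, and `ring_tracker`, `ring_event` and
`ring_tracker_update()` are hypothetical names, not part of corosync. It also
assumes the member list arrives in the same order on every callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for corosync's struct cpg_ring_id (nodeid + seq). */
struct ring_id {
    uint32_t nodeid;
    uint64_t seq;
};

#define MAX_MEMBERS 32

/* Hypothetical application state: last ring id and member list seen. */
struct ring_tracker {
    bool have_ring;
    struct ring_id last;
    uint32_t members[MAX_MEMBERS];
    uint32_t member_count;
};

enum ring_event {
    RING_FIRST,                 /* first callback, nothing to compare with */
    RING_UNCHANGED,             /* same ring id, same members */
    RING_MEMBERS_CHANGED,       /* member list differs from the last one */
    RING_REFORMED_SAME_MEMBERS  /* new ring id, identical member list */
};

/* Record the new ring id and member list and report what changed.
 * Assumption: the lists are delivered in a stable order, so a plain
 * memcmp() is enough to compare them. */
static enum ring_event ring_tracker_update(struct ring_tracker *t,
                                           struct ring_id ring,
                                           const uint32_t *members,
                                           uint32_t count)
{
    assert(t != NULL && count <= MAX_MEMBERS);

    bool first = !t->have_ring;
    bool ring_changed = t->last.nodeid != ring.nodeid ||
                        t->last.seq != ring.seq;
    bool members_changed = t->member_count != count ||
        (count > 0 &&
         memcmp(t->members, members, count * sizeof(members[0])) != 0);

    /* Remember the new state for the next comparison. */
    t->have_ring = true;
    t->last = ring;
    t->member_count = count;
    if (count > 0)
        memcpy(t->members, members, count * sizeof(members[0]));

    if (first)
        return RING_FIRST;
    if (members_changed)
        return RING_MEMBERS_CHANGED;
    if (ring_changed)
        return RING_REFORMED_SAME_MEMBERS;
    return RING_UNCHANGED;
}
```

In a real application the update function would be called from the
cpg_totem_confchg_fn_t callback registered through cpg_model_initialize(),
and RING_REFORMED_SAME_MEMBERS would be the hint that a node was away and
came back, so the application data may need to be resynchronized.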

Which of the options would you prefer? Same question for @Ken (->
what would you prefer for PCMK) and @Chrissie.

Regards,
   Honza


> 
> cheers,
> Thomas
> 
>>>
>>>> Now it's really the cpg application's problem to synchronize its data. Many applications (usually FS) use quorum together with fencing to find out which cluster partition is quorate and to clean up the inquorate one.
>>>>
>>>> Hopefully my explanation helps you, and feel free to ask more questions!
>>>>
>>>
>>> They help, but I'm still a bit unsure about why the CB could not happen here,
>>> may need to dive a bit deeper into corosync :)
>>>
>>>> Regards,
>>>>    Honza
>>>>
>>>>>
>>>>> help would be appreciated, much thanks!
>>>>>
>>>>> cheers,
>>>>> Thomas
>>>>>
>>>>> [1]: https://git.proxmox.com/?p=pve-cluster.git;a=tree;f=data/src;h=e5493468b456ba9fe3f681f387b4cd5b86e7ca08;hb=HEAD
>>>>> [2]: https://git.proxmox.com/?p=pve-cluster.git;a=blob;f=data/src/dfsm.c;h=cdf473e8226ab9706d693a457ae70c0809afa0fa;hb=HEAD#l1096
>>>>>
>>>
>>>
>>>
>>
>>
> 
> 
> 



