[ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

Thu Apr 18 12:12:11 EDT 2024

On Thu, Apr 18, 2024 at 6:09 PM Klaus Wenninger <kwenning at redhat.com> wrote:

>
>
> On Thu, Apr 18, 2024 at 6:06 PM NOLIBOS Christophe <
> christophe.nolibos at thalesgroup.com> wrote:
>
>> Classified as: {OPEN}
>>
>>
>>
>> Well… why do you say "Well, if corosync isn't there then this is to be
>> expected, and pacemaker won't recover corosync."?
>>
>> In my mind, Corosync is managed by Pacemaker like any other cluster
>> resource, and the "pacemakerd: recover properly from Corosync crash" fix
>> implemented in version 2.1.2 seems to confirm that.
>>
>
> Nope. Startup of the stack is done by systemd. Pacemaker is only
> started after corosync is up, and systemd should be responsible for
> keeping the stack up.
> For completeness: if you have sbd in the mix, that is started by systemd
> as well, more or less in parallel with corosync, as part of it (systemd
> terminology).
>

The "recover" in that fix refers to pacemaker itself recovering when
corosync goes away and comes back. It does not mean that pacemaker
restarts corosync.
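
As a minimal sketch of letting systemd bring corosync back after it is
killed: a drop-in like the one below asks systemd to restart the service on
failure. It assumes the packaged corosync.service does not already set
Restart=. The drop-in file name and the RestartSec value are illustrative
assumptions, not something taken from this thread.

```ini
# /etc/systemd/system/corosync.service.d/restart.conf
# Hypothetical drop-in: the file name and the values are assumptions.
[Service]
# Restart corosync when it exits uncleanly, e.g. when killed by SIGKILL.
Restart=on-failure
# Delay before restarting; pick a value that suits your environment.
RestartSec=5
```

Run "systemctl daemon-reload" afterwards so systemd picks up the drop-in,
and test how the restart interacts with fencing or sbd before relying on it.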

>
> Klaus
>
>>
>> *From:* NOLIBOS Christophe
>> *Sent:* Thursday, April 18, 2024 17:56
>> *To:* 'Klaus Wenninger' <kwenning at redhat.com>; Cluster Labs - All topics
>> related to open-source clustering welcomed <users at clusterlabs.org>
>> *Cc:* Ken Gaillot <kgaillot at redhat.com>
>> *Subject:* RE: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>> [~]$ systemctl status corosync
>> ● corosync.service - Corosync Cluster Engine
>>    Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
>>    Active: failed (Result: signal) since Thu 2024-04-18 14:58:42 UTC; 53min ago
>>      Docs: man:corosync
>>            man:corosync.conf
>>            man:corosync_overview
>>   Process: 2027251 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=0/SUCCESS)
>>   Process: 1324906 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=KILL)
>>  Main PID: 1324906 (code=killed, signal=KILL)
>>
>> Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Sync joined[1]: 1
>> Apr 18 13:16:04 - corosync[1324906]:   [TOTEM ] A new membership (1.1c8) was formed. Members joined: 1
>> Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
>> Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
>> Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
>> Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Members[1]: 1
>> Apr 18 13:16:04 - corosync[1324906]:   [MAIN  ] Completed service synchronization, ready to provide service.
>> Apr 18 13:16:04 - systemd[1]: Started Corosync Cluster Engine.
>> Apr 18 14:58:42 - systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL
>> Apr 18 14:58:42 - systemd[1]: corosync.service: Failed with result 'signal'.
>> [~]$
>>
>>
>>
>>
>>
>> *From:* Klaus Wenninger <kwenning at redhat.com>
>> *Sent:* Thursday, April 18, 2024 17:43
>> *To:* Cluster Labs - All topics related to open-source clustering
>> welcomed <users at clusterlabs.org>
>> *Cc:* Ken Gaillot <kgaillot at redhat.com>; NOLIBOS Christophe <
>> christophe.nolibos at thalesgroup.com>
>> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Apr 18, 2024 at 5:07 PM NOLIBOS Christophe via Users <
>> users at clusterlabs.org> wrote:
>>
>> I'm using Red Hat Enterprise Linux 8.8 (4.18.0-477.21.1.el8_8.x86_64).
>> When I kill Corosync, no new corosync process is created and pacemaker
>> stays in a failing state.
>> The only solution is to restart the pacemaker service.
>>
>> [~]$ pcs status
>> Error: unable to get cib
>> [~]$
>>
>> [~]$ systemctl status pacemaker
>> ● pacemaker.service - Pacemaker High Availability Cluster Manager
>>    Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
>>    Active: active (running) since Thu 2024-04-18 13:16:04 UTC; 1h 43min ago
>>      Docs: man:pacemakerd
>>            https://clusterlabs.org/pacemaker/doc/
>>  Main PID: 1324923 (pacemakerd)
>>     Tasks: 91
>>    Memory: 132.1M
>>    CGroup: /system.slice/pacemaker.service
>> ...
>> Apr 18 14:59:02 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:03 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:04 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:05 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:06 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:07 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:08 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:09 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:10 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:11 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> [~]$
>>
>> Well, if corosync isn't there then this is to be expected, and pacemaker
>> won't recover corosync.
>>
>> Can you check what systemd thinks about corosync (status/journal)?
>>
>>
>>
>> Klaus
>>
>>
>> -----Original Message-----
>> From: Ken Gaillot <kgaillot at redhat.com>
>> Sent: Thursday, April 18, 2024 16:40
>> To: Cluster Labs - All topics related to open-source clustering welcomed <
>> users at clusterlabs.org>
>> Cc: NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
>> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>> What OS are you using? Does it use systemd?
>>
>> What happens when you kill Corosync?
>>
>> On Thu, 2024-04-18 at 13:13 +0000, NOLIBOS Christophe via Users wrote:
>> > Dear All,
>> >
>> > I have a question about the "pacemakerd: recover properly from
>> > Corosync crash" fix implemented in version 2.1.2.
>> > I observed the issue when testing pacemaker version 2.0.5, just
>> > by killing the 'corosync' process: Corosync was not recovered.
>> >
>> > I am now using pacemaker version 2.1.5-8.
>> > Doing the same test, I get the same result: Corosync is still not
>> > recovered.
>> >
>> > Please confirm whether the "pacemakerd: recover properly from Corosync
>> > crash" fix implemented in version 2.1.2 covers this scenario.
>> > If it does, did I miss something in the configuration of my cluster?
>> >
>> > Best regards,
>> >
>> > Christophe.
>> >
>> > _______________________________________________
>> > Manage your subscription:
>> > https://lists.clusterlabs.org/mailman/listinfo/users
>> >
>> > ClusterLabs home: https://www.clusterlabs.org/
>> --
>> Ken Gaillot <kgaillot at redhat.com>