[ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

Tue Apr 23 10:37:13 EDT 2024

On Tue, Apr 23, 2024 at 10:34 AM Klaus Wenninger <kwenning at redhat.com>
wrote:

>
>
> On Tue, Apr 23, 2024 at 9:53 AM NOLIBOS Christophe <
> christophe.nolibos at thalesgroup.com> wrote:
>
>> Classified as: {OPEN}
>>
>>
>>
>> Another strange thing:
>>
>> On RHEL 7, corosync is restarted even though the "Restart=on-failure" line
>> is commented out.
>>
>> So I also think that something changed in pacemaker's behavior, or
>> somewhere else.
>>
>
> That is how it worked before reconnection to corosync was introduced.
> Previously pacemaker would fail and systemd would restart it, checking the
> services pacemaker depends on. Finding corosync not running, systemd would
> restart corosync as well.
>

From what I've read, there was also a change a while back in how systemd
handles restarts of dependent services, so the changed behavior may come
from that as well. Just for completeness ...
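
For anyone who wants to try the Restart=on-failure setting discussed further
down without editing the packaged unit file, here is a minimal sketch using a
systemd drop-in. The function name write_corosync_dropin and the file name
restart.conf are hypothetical names chosen for illustration; the default
directory is the usual drop-in location for corosync.service. RestartSec is
deliberately left out, since a suitable value depends on the watchdog setup.

```shell
#!/bin/sh
# Sketch only: write a systemd drop-in that enables Restart=on-failure
# for corosync. "restart.conf" is a hypothetical file name; systemd reads
# any *.conf file in the drop-in directory.
write_corosync_dropin() {
    # Optional first argument: target directory (handy for a dry run).
    dir="${1:-/etc/systemd/system/corosync.service.d}"
    mkdir -p "$dir" || return 1
    printf '[Service]\nRestart=on-failure\n' > "$dir/restart.conf"
}

# On a real node, as root, one would then run:
#   write_corosync_dropin
#   systemctl daemon-reload
#   systemctl show -p Restart corosync.service
```

Passing a scratch directory as the first argument lets you inspect the
generated file before touching /etc. Whether restarting corosync this way is
preferable to letting systemd restart pacemaker is exactly the open question
in this thread.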

Klaus

>
> Klaus
>
>
>>
>>
>> *From:* Klaus Wenninger <kwenning at redhat.com>
>> *Sent:* Monday, April 22, 2024 12:41
>> *To:* NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
>> *Cc:* Cluster Labs - All topics related to open-source clustering
>> welcomed <users at clusterlabs.org>
>> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Apr 22, 2024 at 12:32 PM NOLIBOS Christophe <
>> christophe.nolibos at thalesgroup.com> wrote:
>>
>> Classified as: {OPEN}
>>
>>
>>
>> You are right: the "Restart=on-failure" line is commented out and
>> therefore disabled by default.
>>
>> Uncommenting it resolves my issue.
>>
>>
>>
>> Maybe pacemaker changed its behavior here without aligning enough with
>> corosync's behavior.
>>
>> We'll look into which approach is better: restart corosync on failure, or
>> have pacemaker restarted by systemd, which should in turn restart corosync
>> as well.
>>
>>
>>
>> Klaus
>>
>>
>>
>> Thanks a lot.
>>
>> Christophe.
>>
>>
>>
>> *From:* Klaus Wenninger <kwenning at redhat.com>
>> *Sent:* Monday, April 22, 2024 11:06
>> *To:* NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
>> *Cc:* Cluster Labs - All topics related to open-source clustering
>> welcomed <users at clusterlabs.org>
>> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Apr 22, 2024 at 9:51 AM NOLIBOS Christophe <
>> christophe.nolibos at thalesgroup.com> wrote:
>>
>> Classified as: {OPEN}
>>
>>
>>
>> I used the 'kill -9' command.
>>
>> Does that count as a graceful exit?
>>
>>
>>
>> It looks as if the corosync unit file has Restart=on-failure disabled by
>> default. I'm not aware of another mechanism that would restart corosync,
>> and I think the default behavior is not to restart.
>>
>> The comments suggest enabling it only when using a watchdog, but that might
>> just refer to RestartSec, which is meant to provoke a watchdog reboot
>> instead of a restart via systemd.
>>
>> Any signal that the process doesn't handle should be fine, since an
>> unhandled signal gives the process no chance to exit with code 0.
>>
>>
>>
>> Klaus
>>
>>
>>
>>
>>
>> *From:* Klaus Wenninger <kwenning at redhat.com>
>> *Sent:* Thursday, April 18, 2024 20:17
>> *To:* NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
>> *Cc:* Cluster Labs - All topics related to open-source clustering
>> welcomed <users at clusterlabs.org>
>> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>>
>>
>>
>>
>> NOLIBOS Christophe <christophe.nolibos at thalesgroup.com> wrote on Thu,
>> Apr 18, 2024, 19:01:
>>
>> Classified as: {OPEN}
>>
>>
>>
>> Hmm… my RHEL 8.8 OS has been hardened.
>>
>> I am wondering whether the problem comes from that.
>>
>>
>>
>> On the other hand, I get the same issue (i.e. corosync not restarted by
>> the system) with Pacemaker 2.1.5-8 deployed on RHEL 8.4 (not hardened).
>>
>>
>>
>> I’m checking.
>>
>>
>>
>> How did you kill corosync? If it exits gracefully, it might not be
>> restarted. Check the journal. Sorry, I can't try it myself, I'm on my
>> mobile at the moment. Klaus
>>
>>
>>
>>
>>
>> *From:* Users <users-bounces at clusterlabs.org> *On Behalf Of* NOLIBOS
>> Christophe via Users
>> *Sent:* Thursday, April 18, 2024 18:34
>> *To:* Klaus Wenninger <kwenning at redhat.com>; Cluster Labs - All topics
>> related to open-source clustering welcomed <users at clusterlabs.org>
>> *Cc:* NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
>> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>>
>>
>> Classified as: {OPEN}
>>
>>
>>
>> So, is the issue on the systemd side?
>>
>>
>>
>> If I run the same test on RHEL 7 (3.10.0-693.11.1.el7) with pacemaker
>> 1.1.13-10, corosync is correctly restarted by systemd.
>>
>>
>>
>> [RHEL7 ~]# journalctl -f
>> -- Logs begin at Wed 2024-01-03 13:15:41 UTC. --
>> Apr 18 16:26:55 - systemd[1]: corosync.service failed.
>> Apr 18 16:26:55 - systemd[1]: pacemaker.service holdoff time over, scheduling restart.
>> Apr 18 16:26:55 - systemd[1]: Starting Corosync Cluster Engine...
>> Apr 18 16:26:55 - corosync[12179]: Starting Corosync Cluster Engine (corosync): [  OK  ]
>> Apr 18 16:26:55 - systemd[1]: Started Corosync Cluster Engine.
>> Apr 18 16:26:55 - systemd[1]: Started Pacemaker High Availability Cluster Manager.
>> Apr 18 16:26:55 - systemd[1]: Starting Pacemaker High Availability Cluster Manager...
>> Apr 18 16:26:55 - pacemakerd[12192]:   notice: Additional logging available in /var/log/pacemaker.log
>> Apr 18 16:26:55 - pacemakerd[12192]:   notice: Switching to /var/log/cluster/corosync.log
>> Apr 18 16:26:55 - pacemakerd[12192]:   notice: Additional logging available in /var/log/cluster/corosync.log
>>
>>
>>
>> *From:* Klaus Wenninger <kwenning at redhat.com>
>> *Sent:* Thursday, April 18, 2024 18:12
>> *To:* NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>; Cluster
>> Labs - All topics related to open-source clustering welcomed <
>> users at clusterlabs.org>
>> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Apr 18, 2024 at 6:09 PM Klaus Wenninger <kwenning at redhat.com>
>> wrote:
>>
>>
>>
>>
>>
>> On Thu, Apr 18, 2024 at 6:06 PM NOLIBOS Christophe <
>> christophe.nolibos at thalesgroup.com> wrote:
>>
>> Classified as: {OPEN}
>>
>>
>>
>> Well… why do you say that "Well, if corosync isn't there, then this is to
>> be expected, and pacemaker won't recover corosync."?
>>
>> In my mind, Corosync is managed by Pacemaker like any other cluster
>> resource, and the "pacemakerd: recover properly from Corosync crash" fix
>> implemented in version 2.1.2 seems to confirm that.
>>
>>
>>
>> Nope. Startup of the stack is done by systemd. Pacemaker is simply started
>> after corosync is up, and systemd should be responsible for keeping the
>> stack up.
>>
>> For completeness: if you have sbd in the mix, that is started by systemd as
>> well, but more or less in parallel with corosync, as part of it (systemd
>> terminology).
>>
>>
>>
>> The "recover" above refers to pacemaker recovering from corosync going
>> away and coming back.
>>
>>
>>
>>
>>
>> Klaus
>>
>>
>>
>>
>>
>>
>> *From:* NOLIBOS Christophe
>> *Sent:* Thursday, April 18, 2024 17:56
>> *To:* 'Klaus Wenninger' <kwenning at redhat.com>; Cluster Labs - All topics
>> related to open-source clustering welcomed <users at clusterlabs.org>
>> *Cc:* Ken Gaillot <kgaillot at redhat.com>
>> *Subject:* RE: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>>
>>
>> Classified as: {OPEN}
>>
>>
>>
>>
>>
>> [~]$ systemctl status corosync
>> ● corosync.service - Corosync Cluster Engine
>>    Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
>>    Active: failed (Result: signal) since Thu 2024-04-18 14:58:42 UTC; 53min ago
>>      Docs: man:corosync
>>            man:corosync.conf
>>            man:corosync_overview
>>   Process: 2027251 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=0/SUCCESS)
>>   Process: 1324906 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=KILL)
>>  Main PID: 1324906 (code=killed, signal=KILL)
>>
>> Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Sync joined[1]: 1
>> Apr 18 13:16:04 - corosync[1324906]:   [TOTEM ] A new membership (1.1c8) was formed. Members joined: 1
>> Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
>> Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
>> Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
>> Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Members[1]: 1
>> Apr 18 13:16:04 - corosync[1324906]:   [MAIN  ] Completed service synchronization, ready to provide service.
>> Apr 18 13:16:04 - systemd[1]: Started Corosync Cluster Engine.
>> Apr 18 14:58:42 - systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL
>> Apr 18 14:58:42 - systemd[1]: corosync.service: Failed with result 'signal'.
>> [~]$
>>
>>
>>
>>
>>
>> *From:* Klaus Wenninger <kwenning at redhat.com>
>> *Sent:* Thursday, April 18, 2024 17:43
>> *To:* Cluster Labs - All topics related to open-source clustering
>> welcomed <users at clusterlabs.org>
>> *Cc:* Ken Gaillot <kgaillot at redhat.com>; NOLIBOS Christophe <
>> christophe.nolibos at thalesgroup.com>
>> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Apr 18, 2024 at 5:07 PM NOLIBOS Christophe via Users <
>> users at clusterlabs.org> wrote:
>>
>> Classified as: {OPEN}
>>
>> I'm using RHEL 8.8 (4.18.0-477.21.1.el8_8.x86_64).
>> When I kill Corosync, no new corosync process is created and pacemaker is
>> left in a failed state.
>> The only solution is to restart the pacemaker service.
>>
>> [~]$ pcs status
>> Error: unable to get cib
>> [~]$
>>
>> [~]$ systemctl status pacemaker
>> ● pacemaker.service - Pacemaker High Availability Cluster Manager
>>    Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
>>    Active: active (running) since Thu 2024-04-18 13:16:04 UTC; 1h 43min ago
>>      Docs: man:pacemakerd
>>            https://clusterlabs.org/pacemaker/doc/
>>  Main PID: 1324923 (pacemakerd)
>>     Tasks: 91
>>    Memory: 132.1M
>>    CGroup: /system.slice/pacemaker.service
>> ...
>> Apr 18 14:59:02 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:03 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:04 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:05 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:06 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:07 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:08 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:09 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:10 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> Apr 18 14:59:11 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
>> [~]$
>>
>> Well, if corosync isn't there, then this is to be expected, and pacemaker
>> won't recover corosync.
>>
>> Can you check what systemd thinks about corosync (status/journal)?
>>
>>
>>
>> Klaus
>>
>>
>>
>> -----Original Message-----
>> From: Ken Gaillot <kgaillot at redhat.com>
>> Sent: Thursday, April 18, 2024 16:40
>> To: Cluster Labs - All topics related to open-source clustering welcomed <
>> users at clusterlabs.org>
>> Cc: NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
>> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
>> crash" fix
>>
>> What OS are you using? Does it use systemd?
>>
>> What happens when you kill Corosync?
>>
>> On Thu, 2024-04-18 at 13:13 +0000, NOLIBOS Christophe via Users wrote:
>> > Classified as: {OPEN}
>> >
>> > Dear All,
>> >
>> > I have a question about the "pacemakerd: recover properly from
>> > Corosync crash" fix implemented in version 2.1.2.
>> > I have observed the issue when testing pacemaker version 2.0.5, just
>> > by killing the ‘corosync’ process: Corosync was not recovered.
>> >
>> > I am now using pacemaker version 2.1.5-8.
>> > Doing the same test, I get the same result: Corosync is still not
>> > recovered.
>> >
>> > Please confirm whether the "pacemakerd: recover properly from Corosync
>> > crash" fix implemented in version 2.1.2 covers this scenario.
>> > If it does, did I miss something in the configuration of my cluster?
>> >
>> > Best regards.
>> >
>> > Christophe.
>> >
>> >
>> >
>> > _______________________________________________
>> > Manage your subscription:
>> > https://lists.clusterlabs.org/mailman/listinfo/users
>> >
>> > ClusterLabs home: https://www.clusterlabs.org/
>> --
>> Ken Gaillot <kgaillot at redhat.com>