[ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

Mon Apr 22 06:40:38 EDT 2024

On Mon, Apr 22, 2024 at 12:32 PM NOLIBOS Christophe <
christophe.nolibos at thalesgroup.com> wrote:

> Classified as: {OPEN}
>
>
>
> You are right: the “Restart=on-failure” line is commented out and so
> disabled by default.
>
> Uncommenting it resolves my issue.
>

Maybe pacemaker changed its behavior here without coordinating closely enough
with corosync.
We'll look into which approach is better: having systemd restart corosync on
failure, or having systemd restart pacemaker, which should in turn restart
corosync as well.

Klaus
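
As a minimal sketch of the first option (systemd restarting corosync on
failure), the setting could be enabled through a drop-in instead of editing
the packaged unit file. The helper name write_restart_dropin and the file name
restart.conf are hypothetical, chosen only for illustration. RestartSec, which
is also mentioned in this thread, is left out because no value is given here.

```shell
# Hypothetical helper: writes a systemd drop-in that sets Restart=on-failure
# for the given unit. The name "restart.conf" is an illustrative assumption.
write_restart_dropin() {
    unit="$1"                         # e.g. corosync.service
    base="${2:-/etc/systemd/system}"  # drop-in root; override for a dry run
    dir="$base/$unit.d"
    mkdir -p "$dir" || return 1
    printf '[Service]\nRestart=on-failure\n' > "$dir/restart.conf"
}

# On a real node (as root), reload systemd afterwards and check the result:
#   write_restart_dropin corosync.service
#   systemctl daemon-reload
#   systemctl show corosync.service -p Restart
```

Passing a scratch directory as the second argument allows inspecting the
generated file before touching /etc/systemd/system.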

>
>
> Thanks a lot.
>
> Christophe.
>
>
>
> *From:* Klaus Wenninger <kwenning at redhat.com>
> *Sent:* Monday, 22 April 2024 11:06
> *To:* NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
> *Cc:* Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
>
>
>
>
>
>
> On Mon, Apr 22, 2024 at 9:51 AM NOLIBOS Christophe <
> christophe.nolibos at thalesgroup.com> wrote:
>
>
>
>
> I used the ‘kill -9’ command.
>
> Does that count as a graceful exit?
>
>
>
> It looks as if the corosync unit file has Restart=on-failure disabled by
> default.
>
> I'm not aware of another mechanism that would restart corosync, and I
> think the default behavior is not to restart.
>
> The comments suggest enabling it only when using a watchdog, but that might
> just refer to RestartSec, which is meant to provoke a watchdog reboot
> instead of a restart via systemd.
>
> Any signal that isn't handled by the process (a handler could set the
> exit code to 0) should be fine.
>
>
>
> Klaus
>
>
>
>
>
> *From:* Klaus Wenninger <kwenning at redhat.com>
> *Sent:* Thursday, 18 April 2024 20:17
> *To:* NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
> *Cc:* Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
>
>
>
>
> NOLIBOS Christophe <christophe.nolibos at thalesgroup.com> wrote on Thu,
> 18 Apr 2024, 19:01:
>
>
>
>
> Hmm… my RHEL 8.8 OS has been hardened.
>
> I wonder whether the problem comes from that.
>
> On the other hand, I get the same issue (i.e. corosync not restarted by
> the system) with Pacemaker 2.1.5-8 deployed on RHEL 8.4 (not hardened).
>
> I’m checking.
>
>
>
> How did you kill corosync? If it exits gracefully, it might not be restarted.
> Check the journal. Sorry, I can't try it myself, I am on my mobile at the
> moment. Klaus
>
>
>
>
>
> *From:* Users <users-bounces at clusterlabs.org> *On behalf of* NOLIBOS
> Christophe via Users
> *Sent:* Thursday, 18 April 2024 18:34
> *To:* Klaus Wenninger <kwenning at redhat.com>; Cluster Labs - All topics
> related to open-source clustering welcomed <users at clusterlabs.org>
> *Cc:* NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
>
>
>
>
>
> So, is the issue with systemd?
>
>
>
> If I run the same test on RHEL 7 (3.10.0-693.11.1.el7) with pacemaker
> 1.1.13-10, corosync is correctly restarted by systemd.
>
>
>
> [RHEL7 ~]# journalctl -f
> -- Logs begin at Wed 2024-01-03 13:15:41 UTC. --
> Apr 18 16:26:55 - systemd[1]: corosync.service failed.
> Apr 18 16:26:55 - systemd[1]: pacemaker.service holdoff time over, scheduling restart.
> Apr 18 16:26:55 - systemd[1]: Starting Corosync Cluster Engine...
> Apr 18 16:26:55 - corosync[12179]: Starting Corosync Cluster Engine (corosync): [  OK  ]
> Apr 18 16:26:55 - systemd[1]: Started Corosync Cluster Engine.
> Apr 18 16:26:55 - systemd[1]: Started Pacemaker High Availability Cluster Manager.
> Apr 18 16:26:55 - systemd[1]: Starting Pacemaker High Availability Cluster Manager...
> Apr 18 16:26:55 - pacemakerd[12192]:   notice: Additional logging available in /var/log/pacemaker.log
> Apr 18 16:26:55 - pacemakerd[12192]:   notice: Switching to /var/log/cluster/corosync.log
> Apr 18 16:26:55 - pacemakerd[12192]:   notice: Additional logging available in /var/log/cluster/corosync.log
>
>
>
> *From:* Klaus Wenninger <kwenning at redhat.com>
> *Sent:* Thursday, 18 April 2024 18:12
> *To:* NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>; Cluster
> Labs - All topics related to open-source clustering welcomed <
> users at clusterlabs.org>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
>
>
>
>
>
>
> On Thu, Apr 18, 2024 at 6:09 PM Klaus Wenninger <kwenning at redhat.com>
> wrote:
>
>
>
>
>
> On Thu, Apr 18, 2024 at 6:06 PM NOLIBOS Christophe <
> christophe.nolibos at thalesgroup.com> wrote:
>
>
>
>
> Well… why do you say that “Well, if corosync isn't there, then this is to be
> expected, and pacemaker won't recover corosync”?
>
> In my mind, Corosync is managed by Pacemaker like any other cluster resource,
> and the "pacemakerd: recover properly from Corosync crash" fix
> implemented in version 2.1.2 seems to confirm that.
>
>
>
> Nope. Startup of the stack is done by systemd. Pacemaker is just
> started after corosync is up, and systemd should be responsible for
> keeping the stack up.
>
> For completeness: if you have sbd in the mix, that is started by systemd
> as well, but more or less in parallel with corosync, as "part of" it
> (systemd terminology).
>
> The "recover" above refers to pacemaker recovering from corosync
> going away and coming back.
>
>
>
>
>
> Klaus
>
>
>
>
>
> *From:* NOLIBOS Christophe
> *Sent:* Thursday, 18 April 2024 17:56
> *To:* 'Klaus Wenninger' <kwenning at redhat.com>; Cluster Labs - All topics
> related to open-source clustering welcomed <users at clusterlabs.org>
> *Cc:* Ken Gaillot <kgaillot at redhat.com>
> *Subject:* RE: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
>
>
>
>
>
>
>
> [~]$ systemctl status corosync
> ● corosync.service - Corosync Cluster Engine
>    Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
>    Active: failed (Result: signal) since Thu 2024-04-18 14:58:42 UTC; 53min ago
>      Docs: man:corosync
>            man:corosync.conf
>            man:corosync_overview
>   Process: 2027251 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=0/SUCCESS)
>   Process: 1324906 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=KILL)
>  Main PID: 1324906 (code=killed, signal=KILL)
>
> Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Sync joined[1]: 1
> Apr 18 13:16:04 - corosync[1324906]:   [TOTEM ] A new membership (1.1c8) was formed. Members joined: 1
> Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
> Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Members[1]: 1
> Apr 18 13:16:04 - corosync[1324906]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Apr 18 13:16:04 - systemd[1]: Started Corosync Cluster Engine.
> Apr 18 14:58:42 - systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL
> Apr 18 14:58:42 - systemd[1]: corosync.service: Failed with result 'signal'.
> [~]$
>
>
>
>
>
> *From:* Klaus Wenninger <kwenning at redhat.com>
> *Sent:* Thursday, 18 April 2024 17:43
> *To:* Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> *Cc:* Ken Gaillot <kgaillot at redhat.com>; NOLIBOS Christophe <
> christophe.nolibos at thalesgroup.com>
> *Subject:* Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
>
>
>
>
>
>
> On Thu, Apr 18, 2024 at 5:07 PM NOLIBOS Christophe via Users <
> users at clusterlabs.org> wrote:
>
>
> I'm using Red Hat 8.8 (4.18.0-477.21.1.el8_8.x86_64).
> When I kill Corosync, no new corosync process is created and pacemaker is
> left in a failed state.
> The only solution is to restart the pacemaker service.
>
> [~]$ pcs status
> Error: unable to get cib
> [~]$
>
> [~]$systemctl status pacemaker
> ● pacemaker.service - Pacemaker High Availability Cluster Manager
>    Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled;
> vendor preset: disabled)
>    Active: active (running) since Thu 2024-04-18 13:16:04 UTC; 1h 43min ago
>      Docs: man:pacemakerd
>            https://clusterlabs.org/pacemaker/doc/
>  Main PID: 1324923 (pacemakerd)
>     Tasks: 91
>    Memory: 132.1M
>    CGroup: /system.slice/pacemaker.service
> ...
> Apr 18 14:59:02 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:03 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:04 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:05 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:06 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:07 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:08 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:09 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:10 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> Apr 18 14:59:11 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
> [~]$
>
> Well, if corosync isn't there, then this is to be expected, and pacemaker
> won't recover corosync.
>
> Can you check what systemd thinks about corosync (status/journal)?
>
>
>
> Klaus
>
>
>
> -----Original Message-----
> From: Ken Gaillot <kgaillot at redhat.com>
> Sent: Thursday, 18 April 2024 16:40
> To: Cluster Labs - All topics related to open-source clustering welcomed <
> users at clusterlabs.org>
> Cc: NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
> Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync
> crash" fix
>
> What OS are you using? Does it use systemd?
>
> What happens when you kill Corosync?
>
> On Thu, 2024-04-18 at 13:13 +0000, NOLIBOS Christophe via Users wrote:
> >
> > Dear All,
> >
> > I have a question about the "pacemakerd: recover properly from
> > Corosync crash" fix implemented in version 2.1.2.
> > I observed the issue when testing pacemaker version 2.0.5, just
> > by killing the ‘corosync’ process: Corosync was not recovered.
> >
> > I am now using pacemaker version 2.1.5-8.
> > Doing the same test, I get the same result: Corosync is still not
> > recovered.
> >
> > Please confirm whether the "pacemakerd: recover properly from Corosync
> > crash" fix implemented in version 2.1.2 covers this scenario.
> > If it does, did I miss something in the configuration of my cluster?
> >
> > Best regards.
> >
> > Christophe.
> >
> >
> >
> > _______________________________________________
> > Manage your subscription:
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > ClusterLabs home: https://www.clusterlabs.org/
> --
> Ken Gaillot <kgaillot at redhat.com>