[ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

NOLIBOS Christophe christophe.nolibos at thalesgroup.com
Tue Apr 23 03:53:33 EDT 2024


Classified as: {OPEN}

 

Another strange thing.

On RHEL 7, corosync is restarted even though the "Restart=on-failure" line is commented out.

I also think that something changed in the pacemaker behavior, or somewhere else.

 

From: Klaus Wenninger <kwenning at redhat.com>
Sent: Monday, April 22, 2024, 12:41
To: NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
Cc: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

 

 

 

On Mon, Apr 22, 2024 at 12:32 PM NOLIBOS Christophe <christophe.nolibos at thalesgroup.com> wrote:


 

You are right: the "Restart=on-failure" line is commented out and so disabled by default.

Uncommenting it resolves my issue.

 

Maybe pacemaker changed behavior here without syncing enough with corosync behavior.

We'll look into that to see which approach is better: restart corosync on failure, or have pacemaker be restarted by systemd, which should in turn restart corosync as well.
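The change Christophe describes (uncommenting Restart=on-failure) can also be applied without editing the packaged unit file, via a systemd drop-in. The following is a hypothetical sketch, not the packaged unit's actual contents; it stages the drop-in in a local directory for illustration, whereas on a real node it would go under /etc/systemd/system/corosync.service.d/ (requires root) followed by a daemon-reload. The RestartSec value is an arbitrary example.

```shell
# Sketch: enable automatic restart of corosync on failure via a systemd
# drop-in, leaving the vendor unit file untouched. DROPIN_DIR is a local
# staging path for illustration only; in real use it would be
# /etc/systemd/system/corosync.service.d/.
DROPIN_DIR="./corosync.service.d"
mkdir -p "$DROPIN_DIR"
cat > "$DROPIN_DIR/restart.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=5
EOF
cat "$DROPIN_DIR/restart.conf"
# On a real node, follow with: systemctl daemon-reload
```

Whether enabling this is appropriate depends on the watchdog/sbd setup, as discussed later in the thread.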

 

Klaus 

 

Thanks a lot.

Christophe.

 

From: Klaus Wenninger <kwenning at redhat.com>
Sent: Monday, April 22, 2024, 11:06
To: NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
Cc: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

 

 

 

On Mon, Apr 22, 2024 at 9:51 AM NOLIBOS Christophe <christophe.nolibos at thalesgroup.com> wrote:


 

I used the ‘kill -9’ command.

Is that a graceful exit?

 

Looking as if the corosync unit file has Restart=on-failure disabled per default.

I'm not aware of another mechanism that would restart corosync, and I think the default behavior is not to restart.

The comments suggest enabling it only when using a watchdog, but that might just reference the RestartSec needed to provoke a watchdog reboot instead of a restart via systemd.

Any signal that isn't handled by the process (if it were handled, the exit code could be set to 0) should be fine.
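Klaus's point about unhandled signals can be illustrated with a shell one-liner (a generic sketch, not tied to corosync itself): a process terminated by a signal it does not catch exits with status 128 plus the signal number, which systemd's Restart=on-failure counts as a failure.

```shell
# A process killed by an uncaught signal cannot exit with status 0.
# SIGKILL (kill -9) cannot be caught at all, so the parent shell reports
# 128 + 9 = 137, which Restart=on-failure treats as a failure.
sh -c 'kill -KILL $$'
status=$?
echo "exit status after SIGKILL: $status"   # 137
```

This is why a `kill -9` test only restarts corosync when Restart=on-failure is actually enabled, while a clean exit (status 0) would not trigger it.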

 

Klaus

 

 

From: Klaus Wenninger <kwenning at redhat.com>
Sent: Thursday, April 18, 2024, 20:17
To: NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
Cc: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

 

 

NOLIBOS Christophe <christophe.nolibos at thalesgroup.com> wrote on Thu, Apr 18, 2024, 19:01:


 

Hmmm… my RHEL 8.8 OS has been hardened.

I am wondering if the problem comes from that.

 

On the other hand, I get the same issue (i.e. corosync not restarted by systemd) with Pacemaker 2.1.5-8 deployed on RHEL 8.4 (not hardened).

 

I’m checking.

 

How did you kill corosync? If it exits gracefully, it might not be restarted. Check the journal. Sorry, can't try it; I'm on my mobile at the moment. Klaus

 

 


From: Users <users-bounces at clusterlabs.org> On Behalf Of NOLIBOS Christophe via Users
Sent: Thursday, April 18, 2024, 18:34
To: Klaus Wenninger <kwenning at redhat.com>; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Cc: NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

 


 

So, the issue is with systemd?

 

If I run the same test on RHEL 7 (3.10.0-693.11.1.el7) with pacemaker 1.1.13-10, corosync is correctly restarted by systemd.

 

[RHEL7 ~]# journalctl -f

-- Logs begin at Wed 2024-01-03 13:15:41 UTC. --

Apr 18 16:26:55 - systemd[1]: corosync.service failed.

Apr 18 16:26:55 - systemd[1]: pacemaker.service holdoff time over, scheduling restart.

Apr 18 16:26:55 - systemd[1]: Starting Corosync Cluster Engine...

Apr 18 16:26:55 - corosync[12179]: Starting Corosync Cluster Engine (corosync): [  OK  ]

Apr 18 16:26:55 - systemd[1]: Started Corosync Cluster Engine.

Apr 18 16:26:55 - systemd[1]: Started Pacemaker High Availability Cluster Manager.

Apr 18 16:26:55 - systemd[1]: Starting Pacemaker High Availability Cluster Manager...

Apr 18 16:26:55 - pacemakerd[12192]:   notice: Additional logging available in /var/log/pacemaker.log

Apr 18 16:26:55 - pacemakerd[12192]:   notice: Switching to /var/log/cluster/corosync.log

Apr 18 16:26:55 - pacemakerd[12192]:   notice: Additional logging available in /var/log/cluster/corosync.log

 

From: Klaus Wenninger <kwenning at redhat.com>
Sent: Thursday, April 18, 2024, 18:12
To: NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

 

 

 

On Thu, Apr 18, 2024 at 6:09 PM Klaus Wenninger <kwenning at redhat.com> wrote:

 

 

On Thu, Apr 18, 2024 at 6:06 PM NOLIBOS Christophe <christophe.nolibos at thalesgroup.com> wrote:


 

Well… why do you say that "if corosync isn't there then this is to be expected and pacemaker won't recover corosync"?

In my mind, Corosync is managed by Pacemaker like any other cluster resource, and the "pacemakerd: recover properly from Corosync crash" fix implemented in version 2.1.2 seems to confirm that.

 

Nope. Startup of the stack is done by systemd. Pacemaker is just started after corosync is up, and systemd should be responsible for keeping the stack up.

For completeness: if you have sbd in the mix, that is also started by systemd, but kind of in parallel with corosync, as part of it (in systemd terminology).

The "recover" above refers to pacemaker recovering from corosync going away and coming back.

 

 

Klaus 

 

 


From: NOLIBOS Christophe
Sent: Thursday, April 18, 2024, 17:56
To: 'Klaus Wenninger' <kwenning at redhat.com>; Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Cc: Ken Gaillot <kgaillot at redhat.com>
Subject: RE: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

 


 

 

[~]$ systemctl status corosync

● corosync.service - Corosync Cluster Engine

   Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)

   Active: failed (Result: signal) since Thu 2024-04-18 14:58:42 UTC; 53min ago

     Docs: man:corosync

           man:corosync.conf

           man:corosync_overview

  Process: 2027251 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=0/SUCCESS)

  Process: 1324906 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=KILL)

Main PID: 1324906 (code=killed, signal=KILL)

 

Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Sync joined[1]: 1

Apr 18 13:16:04 - corosync[1324906]:   [TOTEM ] A new membership (1.1c8) was formed. Members joined: 1

Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2

Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2

Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2

Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Members[1]: 1

Apr 18 13:16:04 - corosync[1324906]:   [MAIN  ] Completed service synchronization, ready to provide service.

Apr 18 13:16:04 - systemd[1]: Started Corosync Cluster Engine.

Apr 18 14:58:42 - systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL

Apr 18 14:58:42 - systemd[1]: corosync.service: Failed with result 'signal'.

[~]$

 

 

From: Klaus Wenninger <kwenning at redhat.com>
Sent: Thursday, April 18, 2024, 17:43
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Cc: Ken Gaillot <kgaillot at redhat.com>; NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

 

 

 

On Thu, Apr 18, 2024 at 5:07 PM NOLIBOS Christophe via Users <users at clusterlabs.org> wrote:


I'm using RedHat 8.8 (4.18.0-477.21.1.el8_8.x86_64).
When I kill Corosync, no new corosync process is created and pacemaker is in failure.
The only solution is to restart the pacemaker service.

[~]$ pcs status
Error: unable to get cib
[~]$

[~]$systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2024-04-18 13:16:04 UTC; 1h 43min ago
     Docs: man:pacemakerd
           https://clusterlabs.org/pacemaker/doc/
 Main PID: 1324923 (pacemakerd)
    Tasks: 91
   Memory: 132.1M
   CGroup: /system.slice/pacemaker.service
...
Apr 18 14:59:02 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:03 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:04 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:05 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:06 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:07 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:08 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:09 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:10 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:11 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
[~]$

Well, if corosync isn't there then this is to be expected, and pacemaker won't recover corosync.

Can you check what systemd thinks about corosync (status/journal)?

 

Klaus



-----Original Message-----
From: Ken Gaillot <kgaillot at redhat.com>
Sent: Thursday, April 18, 2024, 16:40
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Cc: NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

What OS are you using? Does it use systemd?

What does happen when you kill Corosync?

On Thu, 2024-04-18 at 13:13 +0000, NOLIBOS Christophe via Users wrote:
> 
> Dear All,
>  
> I have a question about the "pacemakerd: recover properly from 
> Corosync crash" fix implemented in version 2.1.2.
> I have observed the issue when testing pacemaker version 2.0.5, just 
> by killing the ‘corosync’ process: Corosync was not recovered.
>  
> I am using now pacemaker version 2.1.5-8.
> Doing the same test, I have the same result: Corosync is still not 
> recovered.
>  
> Please confirm that the "pacemakerd: recover properly from Corosync crash"
> fix implemented in version 2.1.2 covers this scenario.
> If it does, did I miss something in the configuration of my cluster?
>  
> Best Regards.
>  
> Christophe.
>   
>  
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
--
Ken Gaillot <kgaillot at redhat.com>
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

 



