[ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

NOLIBOS Christophe christophe.nolibos at thalesgroup.com
Thu Apr 18 11:55:50 EDT 2024


Classified as: {OPEN}

 

 

[~]$ systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since Thu 2024-04-18 14:58:42 UTC; 53min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 2027251 ExecStop=/usr/sbin/corosync-cfgtool -H --force (code=exited, status=0/SUCCESS)
  Process: 1324906 ExecStart=/usr/sbin/corosync -f $COROSYNC_OPTIONS (code=killed, signal=KILL)
 Main PID: 1324906 (code=killed, signal=KILL)

Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Sync joined[1]: 1
Apr 18 13:16:04 - corosync[1324906]:   [TOTEM ] A new membership (1.1c8) was formed. Members joined: 1
Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Apr 18 13:16:04 - corosync[1324906]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Apr 18 13:16:04 - corosync[1324906]:   [QUORUM] Members[1]: 1
Apr 18 13:16:04 - corosync[1324906]:   [MAIN  ] Completed service synchronization, ready to provide service.
Apr 18 13:16:04 - systemd[1]: Started Corosync Cluster Engine.
Apr 18 14:58:42 - systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL
Apr 18 14:58:42 - systemd[1]: corosync.service: Failed with result 'signal'.
[~]$

 

 

From: Klaus Wenninger <kwenning at redhat.com>
Sent: Thursday, April 18, 2024 17:43
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Cc: Ken Gaillot <kgaillot at redhat.com>; NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

 

 

 

On Thu, Apr 18, 2024 at 5:07 PM NOLIBOS Christophe via Users <users at clusterlabs.org> wrote:

Classified as: {OPEN}

I'm using Red Hat 8.8 (kernel 4.18.0-477.21.1.el8_8.x86_64).
When I kill Corosync, no new corosync process is created and Pacemaker is left in a failed state.
The only way to recover is to restart the pacemaker service.

[~]$ pcs status
Error: unable to get cib
[~]$

[~]$ systemctl status pacemaker
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2024-04-18 13:16:04 UTC; 1h 43min ago
     Docs: man:pacemakerd
           https://clusterlabs.org/pacemaker/doc/
 Main PID: 1324923 (pacemakerd)
    Tasks: 91
   Memory: 132.1M
   CGroup: /system.slice/pacemaker.service
...
Apr 18 14:59:02 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:03 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:04 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:05 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:06 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:07 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:08 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:09 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:10 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
Apr 18 14:59:11 - pacemakerd[1324923]:  crit: Could not connect to Corosync CFG: CS_ERR_LIBRARY
[~]$

Well, if corosync isn't there, then this is to be expected, and pacemaker won't recover corosync.

Can you check what systemd thinks about corosync (status/journal)?
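
For reference, a minimal sketch of that check, assuming a systemd host with the standard `systemctl` and `journalctl` tools. The helper name `show_unit_state` is hypothetical and not something from this thread; it only wraps the two queries.

```shell
# Hypothetical helper (the name is an assumption, not from the thread):
# print what systemd knows about a unit, then that unit's journal entries.
show_unit_state() {
    unit="${1:-corosync}"   # default to the corosync unit
    # Current state as seen by systemd (active/failed, exit code or signal).
    systemctl status "$unit" --no-pager
    # Journal entries for this unit since the current boot.
    journalctl -u "$unit" -b --no-pager
}
```

Calling `show_unit_state corosync` prints the unit status followed by its journal for the current boot.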

 

Klaus


{OPEN}

-----Original Message-----
From: Ken Gaillot <kgaillot at redhat.com>
Sent: Thursday, April 18, 2024 16:40
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Cc: NOLIBOS Christophe <christophe.nolibos at thalesgroup.com>
Subject: Re: [ClusterLabs] "pacemakerd: recover properly from Corosync crash" fix

What OS are you using? Does it use systemd?

What does happen when you kill Corosync?

On Thu, 2024-04-18 at 13:13 +0000, NOLIBOS Christophe via Users wrote:
> Classified as: {OPEN}
> 
> Dear All,
>  
> I have a question about the "pacemakerd: recover properly from 
> Corosync crash" fix implemented in version 2.1.2.
> I have observed the issue when testing pacemaker version 2.0.5, just 
> by killing the ‘corosync’ process: Corosync was not recovered.
>  
> I am now using pacemaker version 2.1.5-8.
> Running the same test, I get the same result: Corosync is still not 
> recovered.
>  
> Please confirm whether the "pacemakerd: recover properly from Corosync crash"
> fix implemented in version 2.1.2 covers this scenario.
> If it does, did I miss something in the configuration of my cluster?
>  
> Best regards,
>  
> Christophe.
>   
>  
> 
> {OPEN}
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
--
Ken Gaillot <kgaillot at redhat.com>

 
