[ClusterLabs] corosync service stopping

Reid Wahl nwahl at redhat.com
Thu Apr 25 23:54:19 EDT 2024


Any logs from Pacemaker?
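(A sketch, assuming a standard systemd-based installation; something like this
should pull the relevant window:

journalctl -u pacemaker --since "2024-04-23 11:00" --until "2024-04-23 11:10"

or check /var/log/pacemaker/pacemaker.log on the affected node.)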

On Thu, Apr 25, 2024 at 3:46 AM Alexander Eastwood via Users
<users at clusterlabs.org> wrote:
>
> Hi all,
>
> I’m trying to get a better understanding of why our cluster - or specifically corosync.service - entered a failed state. Here are all of the relevant corosync logs from this event, with the last line showing when I manually started the service again:
>
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [CFG   ] Node 1 was shut down by sysadmin
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Unloading all Corosync service engines.
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync configuration map access
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync configuration service
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync profile loading service
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync resource monitoring service
> Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync watchdog service
> Apr 23 11:06:11 [1295854] testcluster-c1 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
> Apr 23 11:06:11 [1295854] testcluster-c1 corosync warning [KNET  ] host: host: 1 has no active links
> Apr 23 11:06:11 [1295854] testcluster-c1 corosync notice  [MAIN  ] Corosync Cluster Engine exiting normally
> Apr 23 13:18:36 [796246] testcluster-c1 corosync notice  [MAIN  ] Corosync Cluster Engine 3.1.6 starting up
>
> The first line suggests a manual shutdown of one of the cluster nodes; however, neither I nor any of my colleagues did this. The ‘sysadmin’ surely must mean a person logging on to the server and running some command, as opposed to a system process?
>
> Then, in the third row from the bottom, there is the warning “host: host: 1 has no active links”, which is followed by “Corosync Cluster Engine exiting normally”. Does this mean that the Cluster Engine exited because there are no active links?
>
> Finally, I am considering adding a systemd override file for the corosync service with the following content:
>
> [Service]
> Restart=on-failure
>
> Is there any reason not to do this? And, given that the process exited normally, would I need to use Restart=always instead?
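>
> For completeness, here is a sketch of how I would apply the override (assuming the unit is named corosync.service, as on our nodes):
>
> # opens an editor on /etc/systemd/system/corosync.service.d/override.conf
> systemctl edit corosync.service
>
> # if the file is instead created by hand, reload unit files manually
> systemctl daemon-reload
>
> # confirm the drop-in is picked up
> systemctl cat corosync.service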
>
> Many thanks
>
> Alex
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/



-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker


