[ClusterLabs] corosync service stopping

Thu Apr 25 06:45:40 EDT 2024

Hi all,

I’m trying to get a better understanding of why our cluster - or specifically corosync.service - entered a failed state. Here are all of the relevant corosync logs from this event, with the last line showing when I manually started the service again:

Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [CFG   ] Node 1 was shut down by sysadmin
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Unloading all Corosync service engines.
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync configuration map access
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync configuration service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Apr 23 11:06:10 [1295854] testcluster-c1 corosync info    [QB    ] withdrawing server sockets
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync profile loading service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync resource monitoring service
Apr 23 11:06:10 [1295854] testcluster-c1 corosync notice  [SERV  ] Service engine unloaded: corosync watchdog service
Apr 23 11:06:11 [1295854] testcluster-c1 corosync info    [KNET  ] host: host: 1 (passive) best link: 0 (pri: 0)
Apr 23 11:06:11 [1295854] testcluster-c1 corosync warning [KNET  ] host: host: 1 has no active links
Apr 23 11:06:11 [1295854] testcluster-c1 corosync notice  [MAIN  ] Corosync Cluster Engine exiting normally
Apr 23 13:18:36 [796246] testcluster-c1 corosync notice  [MAIN  ] Corosync Cluster Engine 3.1.6 starting up

The first line suggests a manual shutdown of one of the cluster nodes; however, neither I nor any of my colleagues did this. Surely ‘sysadmin’ must mean a person logging on to the server and running some command, as opposed to a system process?

Then, in the third line from the bottom, there is the warning “host: host: 1 has no active links”, which is followed by “Corosync Cluster Engine exiting normally”. Does this mean that the Cluster Engine exited because there were no active links?

Finally, I am considering adding a systemd override file for the corosync service with the following content:

[Service]
Restart=on-failure

Is there any reason not to do this? And, given that the process exited normally, would I need to use Restart=always instead?
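
For concreteness, this is roughly the drop-in I have in mind. It is only a sketch: the path assumes the unit is named corosync.service and uses the usual drop-in directory, and override.conf is simply the file name that "systemctl edit" picks by default. The commented-out line is the alternative I am asking about.

```ini
# /etc/systemd/system/corosync.service.d/override.conf  (assumed path)
[Service]
# As I read systemd.service(5), on-failure does not restart the service
# after a clean exit (exit status 0).
Restart=on-failure
# Alternative: also restarts after a clean exit.
#Restart=always
```

If I create the file by hand rather than via "systemctl edit", I would run "systemctl daemon-reload" afterwards so that systemd picks it up.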

Many thanks

Alex