[ClusterLabs] Should pacemaker pursue its own and corosync's instant resurrection if either dies? (Was: Is corosync supposed to be restarted if it dies?)

Andrei Borzenkov arvidjaar at gmail.com
Sat Dec 2 12:40:17 EST 2017


02.12.2017 16:30, Jan Pokorný wrote:
> 
> In race-condition free situation, such a BindsTo-incurred stopping (or
> at least scheduled to since 235?) of the service is then not a subject
> of auto-restarting, from what I've observed, and documentation agrees:
> 
>   Restart= [...] When the death of the process is a result of systemd
>   operation (e.g. service stop or restart), the service will not be
>   restarted
> 

Yes, if systemd has a chance to explicitly queue a Stop action, that's
correct.
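
A toy model of that rule, only to make the ordering dependency explicit.
This is not how systemd's job engine is implemented; the event names
"stop_job_queued" and "pacemaker_exit_seen" and the function
pacemaker_outcome() are made up for illustration.

```python
def pacemaker_outcome(events):
    """Toy model: what happens to pacemaker.service for a given event order.

    events is a list of hypothetical event names in the order systemd
    handles them:
      "stop_job_queued"     - stop job queued for pacemaker (BindsTo=)
      "pacemaker_exit_seen" - systemd notices pacemaker's main process exit
    """
    stop_queued = False
    for event in events:
        if event == "stop_job_queued":
            stop_queued = True
        elif event == "pacemaker_exit_seen":
            # An exit that happens under a queued stop job counts as the
            # result of a systemd operation, so Restart= does not apply.
            # Otherwise the exit looks like an ordinary failure; a stop
            # job arriving later is ignored by this model.
            return "stopped" if stop_queued else "restart scheduled"
    return "still running"


if __name__ == "__main__":
    print(pacemaker_outcome(["stop_job_queued", "pacemaker_exit_seen"]))
    print(pacemaker_outcome(["pacemaker_exit_seen", "stop_job_queued"]))
```

The two calls differ only in the order of the same two events, which is
the whole point: the result is decided by which one systemd handles first.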

>>>> (FTR, I tried with systemd 235).
>>>>
>>
>> Well ... what we have here is race condition. We have two events -
>> corosync.service and pacemaker.service *independent* failures
>> and two (re-)actions - stop pacemaker.service in response to the
>> former (due to BindsTo) and restart pacemaker.service in response to
>> the latter (due to Restart=on-failure). The final result depends on
>> the order in which systemd gets those events and schedules actions
>> (and relative timing when those actions complete) and this is not
>> deterministic.
> 
> Coming to similar conclusion.
> 

To illustrate, here are two logs from the same system, from two
consecutive runs of "systemctl start pacemaker; pkill -9 corosync":

Number 1:


Dec 02 20:03:17 ha1 sbd[1462]:    cluster:    error: pcmk_cpg_dispatch:
Connection to the CPG API failed: Library error (2)
Dec 02 20:03:17 ha1 systemd[1]: corosync.service: Main process exited,
code=killed, status=9/KILL
Dec 02 20:03:17 ha1 sbd[1462]:    cluster:  warning:
sbd_membership_destroy: Lost connection to corosync
Dec 02 20:03:17 ha1 systemd[1]: corosync.service: Unit entered failed state.
Dec 02 20:03:17 ha1 sbd[1462]:    cluster:    error: set_servant_health:
Cluster connection terminated
Dec 02 20:03:17 ha1 systemd[1]: corosync.service: Failed with result
'signal'.
Dec 02 20:03:17 ha1 sbd[1462]:    cluster:    error:
cluster_connect_cpg: Could not connect to the Cluster Process Group API: 2
Dec 02 20:03:17 ha1 systemd[1]: Stopping Pacemaker High Availability
Cluster Manager...
Dec 02 20:03:17 ha1 sbd[1455]:  warning: inquisitor_child: cluster
health check: UNHEALTHY
Dec 02 20:03:17 ha1 sbd[1455]:  warning: inquisitor_child: Servant
cluster is outdated (age: 170)
Dec 02 20:03:17 ha1 cib[1568]:    error: Connection to the CPG API
failed: Library error (2)
Dec 02 20:03:17 ha1 attrd[1571]:    error: Connection to the CPG API
failed: Library error (2)
Dec 02 20:03:17 ha1 attrd[1571]:   notice: Disconnecting client
0x55590e1d1190, pid=1573...
Dec 02 20:03:17 ha1 stonith-ng[1569]:    error: Connection to the CPG
API failed: Library error (2)
Dec 02 20:03:17 ha1 lrmd[1570]:    error: Connection to stonith-ng failed
Dec 02 20:03:17 ha1 lrmd[1570]:    error: Connection to
stonith-ng[0x558bf889f300] closed (I/O condition=17)
Dec 02 20:03:17 ha1 pacemakerd[1566]:   notice: Caught 'Terminated' signal
Dec 02 20:03:17 ha1 pacemakerd[1566]:    error: Connection to the CPG
API failed: Library error (2)
Dec 02 20:03:17 ha1 systemd[1]: pacemaker.service: Main process exited,
code=exited, status=107/n/a
Dec 02 20:03:17 ha1 systemd[1]: Stopped Pacemaker High Availability
Cluster Manager.

The key line is "Stopping Pacemaker", which indicates a voluntary action
on systemd's side.

Number 2:

Dec 02 20:07:33 ha1 sbd[1462]:    cluster:    error: pcmk_cpg_dispatch:
Connection to the CPG API failed: Library error (2)
Dec 02 20:07:33 ha1 systemd[1]: corosync.service: Main process exited,
code=killed, status=9/KILL
Dec 02 20:07:33 ha1 sbd[1462]:    cluster:  warning:
sbd_membership_destroy: Lost connection to corosync
Dec 02 20:07:33 ha1 systemd[1]: corosync.service: Unit entered failed state.
Dec 02 20:07:33 ha1 sbd[1462]:    cluster:    error: set_servant_health:
Cluster connection terminated
Dec 02 20:07:33 ha1 systemd[1]: corosync.service: Failed with result
'signal'.
Dec 02 20:07:33 ha1 sbd[1462]:    cluster:    error:
cluster_connect_cpg: Could not connect to the Cluster Process Group API: 2
Dec 02 20:07:33 ha1 systemd[1]: pacemaker.service: Main process exited,
code=exited, status=107/n/a
Dec 02 20:07:33 ha1 sbd[1455]:  warning: inquisitor_child: cluster
health check: UNHEALTHY
Dec 02 20:07:33 ha1 systemd[1]: Stopped Pacemaker High Availability
Cluster Manager.
...
Dec 02 20:07:33 ha1 systemd[1]: pacemaker.service: Service hold-off time
over, scheduling restart.

Here there is no "Stopping Pacemaker" line; from systemd's point of view
pacemaker failed and should be restarted.

Note that it is quite possible that in the second case systemd still
attempts to stop pacemaker because of the BindsTo directive, but that job
is dropped as redundant, so we never actually see it. And as soon as you
enable debug output, the timing is skewed and you cannot reproduce it
anymore.
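
Since the race is hard to catch with debug output enabled, one way to
tell the two cases apart over many runs is to check the plain journal
text for the marker discussed above. A minimal sketch, assuming the
message wording shown in the two logs; classify() and the two sample
strings are illustrative only, and the samples are shortened excerpts
of those logs.

```python
import re

# Message fragments taken from the journal excerpts in this mail.
STOPPING = re.compile(r"systemd\[1\]: Stopping Pacemaker")
EXITED = re.compile(r"systemd\[1\]: pacemaker\.service: Main process exited")


def classify(journal_text):
    """Say how systemd treated pacemaker's exit in a journal excerpt."""
    # Collapse whitespace so mail-wrapped log lines still match.
    text = " ".join(journal_text.split())
    exited = EXITED.search(text)
    if exited is None:
        return "no pacemaker exit found"
    stopping = STOPPING.search(text)
    if stopping is not None and stopping.start() < exited.start():
        return "stopped by systemd"
    return "treated as failure"


LOG1 = """
Dec 02 20:03:17 ha1 systemd[1]: Stopping Pacemaker High Availability
Cluster Manager...
Dec 02 20:03:17 ha1 systemd[1]: pacemaker.service: Main process exited,
code=exited, status=107/n/a
"""

LOG2 = """
Dec 02 20:07:33 ha1 systemd[1]: pacemaker.service: Main process exited,
code=exited, status=107/n/a
Dec 02 20:07:33 ha1 systemd[1]: pacemaker.service: Service hold-off time
over, scheduling restart.
"""

if __name__ == "__main__":
    print(classify(LOG1))  # stopped by systemd
    print(classify(LOG2))  # treated as failure
```

It only looks at the order of the two messages in the text, so it
inherits whatever ordering the journal recorded.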
