[ClusterLabs] Antw: Re: Restoring network connection breaks cluster services

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Aug 13 03:33:30 EDT 2019


>>> Jan Pokorný <jpokorny at redhat.com> wrote on 12.08.2019 at 22:30 in
message
<20190812203037.GM25546 at redhat.com>:

[...]
> Is it OK for lower level components to do autonomous decisions
> without at least informing the higher level wrt. what exactly is
> going on, as we could observe here?
[...]

Excuse me for throwing in another comparison with old HP-UX ServiceGuard: As
far as I understood it, the HP-UX kernel had a hardware-based watchdog that the
process corresponding to crmd enabled during start and periodically "fed" to
avoid a "TOC" (Transfer Of Control, resulting in a kernel panic, crash dump and
reboot).
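
The feed-or-reboot idea can be sketched roughly as follows. This is only a
simulation in Python under stated assumptions, not ServiceGuard or HP-UX code:
the class SimulatedWatchdog, the function run_node, and the 10-second timeout
are hypothetical names and values chosen for illustration, and time is
modelled as whole-second ticks.

```python
class SimulatedWatchdog:
    """Software stand-in for a hardware watchdog timer (illustration only)."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds allowed between two feeds
        self.last_fed = None     # None means the watchdog is not armed yet

    def enable(self, now):
        # Arming the watchdog starts the countdown.
        self.last_fed = now

    def feed(self, now):
        # Each feed restarts the countdown.
        if self.last_fed is None:
            raise RuntimeError("watchdog not enabled")
        self.last_fed = now

    def expired(self, now):
        # True means real hardware would now force a TOC (panic, dump, reboot).
        return self.last_fed is not None and now - self.last_fed > self.timeout


def run_node(feed_times, timeout, end_time):
    """Return the tick at which the node would be reset, or None if it survives.

    feed_times is the set of ticks at which the cluster process manages to
    feed the watchdog; a process that died simply stops appearing in it.
    """
    wd = SimulatedWatchdog(timeout)
    wd.enable(0)
    for now in range(1, end_time + 1):
        if wd.expired(now):
            return now
        if now in feed_times:
            wd.feed(now)
    return None
```

With a 10-second timeout, a process that feeds at ticks 5, 10 and 15 and then
dies leads to a reset once more than 10 seconds have passed since the last
feed; a process that keeps feeding every 5 seconds is never reset.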

That was all: when crmd died or failed to feed the watchdog, the node
rebooted and the other node took care of the resources (if possible). Fencing
was basically network-based with a disk as tie-breaker: if there was a network
outage, both nodes (in the 2-node cluster case) tried to control the cluster,
racing for a SCSI lock on the "lock disk" (requiring a multi-initiator SCSI
setup for shared disks). The winner wrote its node name to the disk's slot so
that the other node(s) could read it and tell who the winner of the race was.
The losers then all committed suicide (an exit of the main cluster process
would be enough to trigger the watchdog, but they did an explicit TOC).
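
The lock-disk race can be illustrated with a small Python sketch. It is an
assumption-laden model, not the ServiceGuard implementation: the SCSI lock is
replaced by an in-process threading.Lock, and LockDisk, try_acquire and
resolve_split are hypothetical names introduced for this example.

```python
import threading


class LockDisk:
    """Stand-in for the shared lock disk; the SCSI lock is modelled by a mutex."""

    def __init__(self):
        self._guard = threading.Lock()
        self.winner = None   # the "slot" holding the winner's node name

    def try_acquire(self, node):
        # Atomic test-and-set: only the first caller gets to write its name.
        with self._guard:
            if self.winner is None:
                self.winner = node
                return True
            return False


def resolve_split(nodes):
    """Let all nodes race for the lock disk; return (winner, per-node outcome)."""
    disk = LockDisk()
    outcome = {}

    def race(node):
        # The winner keeps running; a loser would trigger an explicit TOC.
        outcome[node] = "survive" if disk.try_acquire(node) else "TOC"

    threads = [threading.Thread(target=race, args=(n,)) for n in nodes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return disk.winner, outcome


if __name__ == "__main__":
    winner, outcome = resolve_split(["node-a", "node-b"])
    print(winner, outcome)
```

Which node wins is not deterministic, but exactly one node survives each
race, and every other node can read the winner's name from the slot.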

Compared with that, Pacemaker, Corosync and fencing all seem unnecessarily
complex to me, at least if you have some shared storage.

The other nice thing was network traffic in ServiceGuard: The heartbeat
interval was configurable (like every 7 seconds), and when there was nothing
"interesting" happening in the cluster there was no traffic other than the
heartbeat (missing a configurable number of heartbeats declared a split brain,
and the machinery really started).  I think pacemaker is creating way too much
network traffic.
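
The missed-heartbeat rule can be written down in a few lines of Python. This
is a sketch under assumptions: the 7-second interval is the example value
mentioned above, the threshold of 3 missed heartbeats is an arbitrary
illustrative choice, and missed_heartbeats and peer_state are hypothetical
names.

```python
def missed_heartbeats(last_seen, now, interval):
    """Number of whole heartbeat intervals that passed without a heartbeat."""
    return max(0, int((now - last_seen) // interval))


def peer_state(last_seen, now, interval=7, max_missed=3):
    """Classify a peer from the time its last heartbeat was received.

    interval is the configured heartbeat period in seconds; max_missed is the
    configured number of missed heartbeats that declares a split brain.
    """
    if missed_heartbeats(last_seen, now, interval) >= max_missed:
        return "split-brain suspected"
    return "ok"
```

With these values a peer last heard from at second 0 is still "ok" at second
20 (two intervals missed) and becomes "split-brain suspected" at second 21.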

So I think sbd should not decide by itself whether to reboot a node or not.
Maybe even sbd should not use the watchdog, but the crmd should...

Regards,
Ulrich


