[ClusterLabs] fencing configuration

Tue Jun 7 10:51:36 EDT 2022

On 07.06.2022 11:50, Klaus Wenninger wrote:
>>
>> From the documentation it is not clear to me whether this would be:
>> a) multiple fencing where ipmi would be first level and sbd would be a second level fencing (where sbd always succeeds)
>> b) or this is considered a single level fencing with a timeout
> 
> With b) falling back to watchdog-fencing wouldn't work properly,
> although I remember some recent change that might make it fall back
> without issues.

b) works here:

Jun 07 17:35:50 ha2 pacemaker-controld[7069]:  notice: Requesting
fencing (reboot) of node qnetd

Jun 07 17:35:50 ha2 pacemaker-fenced[7065]:  notice: Client
pacemaker-controld.7069 wants to fence (reboot) qnetd using any device

Jun 07 17:35:50 ha2 pacemaker-fenced[7065]:  notice: Requesting peer
fencing (reboot) targeting qnetd

Jun 07 17:35:50 ha2 pacemaker-fenced[7065]:  notice: watchdog is not
eligible to fence (reboot) qnetd: static-list

Jun 07 17:35:50 ha2 pacemaker-schedulerd[7068]:  warning: Calculated
transition 14 (with warnings), saving inputs in
/var/lib/pacemaker/pengine/pe-warn-95.bz2

Jun 07 17:35:50 ha2 pacemaker-fenced[7065]:  notice: Requesting that ha1
perform 'reboot' action targeting qnetd

Jun 07 17:35:53 ha2 pacemaker-fenced[7065]:  notice: Requesting that ha2
perform 'reboot' action targeting qnetd

Jun 07 17:35:53 ha2 pacemaker-fenced[7065]:  notice: watchdog is not
eligible to fence (reboot) qnetd: static-list

Jun 07 17:35:55 ha2 stonith[11138]: external_reset_req: '_dummy reset'
for host qnetd failed with rc 1

Jun 07 17:35:57 ha2 stonith[11142]: external_reset_req: '_dummy reset'
for host qnetd failed with rc 1

Jun 07 17:35:57 ha2 pacemaker-fenced[7065]:  error: Operation 'reboot'
[11141] targeting qnetd using dummy_stonith returned 1

Jun 07 17:35:57 ha2 pacemaker-fenced[7065]:  warning:
dummy_stonith[11141] [ Performing: stonith -t external/_dummy -E -T
reset qnetd ]

Jun 07 17:35:57 ha2 pacemaker-fenced[7065]:  warning:
dummy_stonith[11141] [ failed: qnetd 5 ]

Jun 07 17:35:57 ha2 pacemaker-fenced[7065]:  notice: Couldn't find
anyone to fence (reboot) qnetd using any device

Jun 07 17:35:57 ha2 pacemaker-fenced[7065]:  notice: Waiting 10s for
qnetd to self-fence (reboot) for client pacemaker-controld.7069

Jun 07 17:36:07 ha2 pacemaker-fenced[7065]:  notice: Self-fencing
(reboot) by qnetd for pacemaker-controld.7069 assumed complete

Jun 07 17:36:07 ha2 pacemaker-fenced[7065]:  notice: Operation 'reboot'
targeting qnetd by ha2 for pacemaker-controld.7069 at ha2: OK (complete)

Jun 07 17:36:07 ha2 pacemaker-controld[7069]:  notice: Fence operation 7
for qnetd passed

Jun 07 17:36:07 ha2 pacemaker-controld[7069]:  notice: Transition 14
(Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-warn-95.bz2): Complete

Jun 07 17:36:07 ha2 pacemaker-controld[7069]:  notice: State transition
S_TRANSITION_ENGINE -> S_IDLE

Jun 07 17:36:07 ha2 pacemaker-controld[7069]:  notice: Peer qnetd was
terminated (reboot) by ha2 on behalf of pacemaker-controld.7069 at ha2: OK

The only gotcha is this stray error, logged after everything has already completed:

Jun 07 17:37:05 ha2 pacemaker-fenced[7065]:  notice: Peer's 'reboot'
action targeting qnetd for client pacemaker-controld.7069 timed out

Jun 07 17:37:05 ha2 pacemaker-fenced[7065]:  notice: Couldn't find
anyone to fence (reboot) qnetd using any device

Jun 07 17:37:05 ha2 pacemaker-fenced[7065]:  error:
request_peer_fencing: Triggered fatal assertion at fenced_remote.c:1799
: op->state < st_done

> I would try to go for a), as with a reasonably current pacemaker-version
> (iirc 2.1.0 and above) you should be able to make the
> watchdog-fencing-device visible like other fencing-devices

Yep.

dummy_stonith
watchdog
2 fence devices found

> (just use fence_watchdog as the fence-agent; it is still implemented
> inside pacemaker, the fence_watchdog binary actually just provides the
> meta-data).
> Like this you can limit watchdog-fencing to certain nodes that
> actually provide a proper hardware-watchdog, and you can add it to a
> topology.
> 

Well, as can be seen above, even though "watchdog" is not eligible to
fence qnetd, pacemaker still falls back to it. So I am not sure it will work.
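
For reference, option a) could be set up roughly as sketched below. This is only a sketch under assumptions that are not from this thread: pcs is used as the configuration shell, ha1 and ha2 both have a hardware watchdog and run sbd, and ipmi_ha1 / ipmi_ha2 are hypothetical IPMI fence devices that already exist. The 10s value mirrors the self-fence wait seen in the log above. The run wrapper is a helper introduced here for illustration; it only prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set APPLY=1 to run them against a real cluster.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

# Watchdog fencing needs stonith-watchdog-timeout to be set
# (10s matches the "Waiting 10s ... to self-fence" log line).
run pcs property set stonith-watchdog-timeout=10s

# Make watchdog fencing an explicit device, limited to the nodes that
# are assumed to have a hardware watchdog. The device id "watchdog"
# matches the name shown in the device list above.
run pcs stonith create watchdog fence_watchdog pcmk_host_list="ha1 ha2"

# Topology: try IPMI first (hypothetical device ids), then the watchdog.
for node in ha1 ha2; do
    run pcs stonith level add 1 "$node" "ipmi_$node"
    run pcs stonith level add 2 "$node" watchdog
done
```

With such a topology the watchdog is only tried as level 2 for ha1 and ha2. Whether the implicit fallback seen in the log above is also suppressed for nodes outside pcmk_host_list is exactly the open question.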