[ClusterLabs] Antw: Re: Antw: [EXT] delaying start of a resource

Thu Dec 17 06:02:27 EST 2020

>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 17.12.2020 at 09:50 in
message
<CAA91j0VUv4nMtEtCPQNiMF-XrRv_9KqkCnPvmAn4XBoNBQpGTA at mail.gmail.com>:

...
> According to logs from xstha1, it started to activate resources only
> after stonith was confirmed
> 
> Dec 16 15:08:12 [708] stonith-ng:   notice: log_operation:
> Operation 'off' [1273] (call 4 from crmd.712) for host 'xstha2' with
> device 'xstha2-stonith' returned: 0 (OK)
> Dec 16 15:08:12 [708] stonith-ng:   notice: remote_op_done:
> Operation 'off' targeting xstha2 on xstha1 for
> crmd.712 at xstha1.e487e7cc: OK
> 
> It is possible that your IPMI/BMC/whatever implementation responds
> with success before it actually completes this action. I have seen at

Shouldn't a reasonable "stonith-timeout=180" do? Even sbd needs one, because
after the fence command is sent, the target still has to read and process it.

For example, this is what I see in the DC logs here around fencing:
Nov 30 11:31:56 h18 pacemaker-fenced[49409]:  notice: prm_stonith_sbd is
eligible to fence (reboot) h16: dynamic-list
Nov 30 11:32:03 h18 corosync[49399]:   [TOTEM ] A processor failed, forming
new configuration.
Nov 30 11:32:09 h18 corosync[49399]:   [TOTEM ] A new membership
(172.20.16.18:42032) was formed. Members left: 116
...
Nov 30 11:32:09 h18 pacemaker-controld[49413]:  notice: Our peer on the DC
(h16) is dead
...
Nov 30 11:33:57 h18 pacemaker-controld[49413]:  notice: Peer h16 was
terminated (reboot) by h18 on behalf of pacemaker-controld.69600: OK
...note the delay between the node being reported dead and the fencing confirmation...
Nov 30 11:36:05 h18 corosync[49399]:   [TOTEM ] A new membership
(172.20.16.16:42036) was formed. Members joined: 116
...the node rejoined the cluster after being fenced
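
To put a number on that delay, here is a minimal Python sketch that reads
the two relevant lines from the excerpt above and measures the time between
"is dead" and the "was terminated" confirmation. The helper names (stamp,
fencing_delay) and the year 2020 are assumptions for illustration only;
syslog timestamps carry no year. The 180 is the stonith-timeout value
mentioned above.

```python
from datetime import datetime

# Two lines copied (unwrapped) from the DC log excerpt above.
LOG = """\
Nov 30 11:32:09 h18 pacemaker-controld[49413]:  notice: Our peer on the DC (h16) is dead
Nov 30 11:33:57 h18 pacemaker-controld[49413]:  notice: Peer h16 was terminated (reboot) by h18 on behalf of pacemaker-controld.69600: OK
"""

STONITH_TIMEOUT = 180  # seconds, the value suggested in this mail


def stamp(line, year=2020):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp (year assumed)."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def fencing_delay(log):
    """Seconds between the 'is dead' line and the 'was terminated' line."""
    dead = confirmed = None
    for line in log.splitlines():
        if dead is None and "is dead" in line:
            dead = stamp(line)
        elif dead is not None and "was terminated" in line:
            confirmed = stamp(line)
            break
    if dead is None or confirmed is None:
        raise ValueError("log does not contain both events")
    return (confirmed - dead).total_seconds()


delay = fencing_delay(LOG)
print(f"fencing confirmed {delay:.0f} s after the peer was reported dead")
print("within stonith-timeout" if delay <= STONITH_TIMEOUT else "exceeds stonith-timeout")
```

For this excerpt the confirmation arrives 108 seconds after the peer was
reported dead, which fits within a 180 second timeout.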

> least some delays in the past. There is not really much that can be
> done here except adding artificial delay to stonith resource agent.
> You need to test IPMI functionality before using it in pacemaker.

Another example:
Dec 16 14:34:35 h18 pacemaker-controld[4478]:  notice: Requesting fencing
(reboot) of node h18
...
Dec 16 14:34:38 h18 pacemaker-fenced[4474]:  notice: Requesting that h16
perform 'reboot' action targeting h18
...
Dec 16 14:34:40 h18 sbd[3717]: /dev/disk/by-id/dm-name-SBD_1-3P2:   notice:
servant_md: Received command reset from h16 on disk...
...
Dec 16 14:34:40 h18 sbd[3697]:  warning: inquisitor_child:
/dev/disk/by-id/dm-name-SBD_1-3P2 requested a reset
Dec 16 14:34:40 h18 sbd[3697]:    emerg: do_exit: Rebooting system: reboot
...
Dec 16 14:34:45 h16 corosync[3617]:   [TOTEM ] A processor failed, forming new
configuration.
...
Dec 16 14:35:50 h16 dlm_controld[4802]: 170858
91E73809FE224F2495FE617D556E1800 wait for fencing
...
Dec 16 14:36:39 h16 pacemaker-controld[4527]:  notice: Peer h18 was terminated
(reboot) by h16 on behalf of pacemaker-controld.4478: OK
...
Dec 16 14:38:55 h16 corosync[3617]:   [TOTEM ] A new membership
(172.20.16.16:42128) was formed. Members joined: 118

The timeout (3 min) may be excessive here, but it shows what's going on.
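
As a rough illustration of the sequence in the second excerpt, the sketch
below lists each logged event as an offset from the initial fencing request.
The event descriptions are my own summaries of the log lines, the function
name is made up, and it assumes the clocks on h16 and h18 are in sync.

```python
from datetime import datetime

# (time, logging node, summary) taken from the second log excerpt above.
EVENTS = [
    ("14:34:35", "h18", "controld requests fencing (reboot) of h18"),
    ("14:34:38", "h18", "fenced asks h16 to perform the reboot"),
    ("14:34:40", "h18", "sbd receives reset command and reboots"),
    ("14:34:45", "h16", "corosync reports a failed processor"),
    ("14:36:39", "h16", "fencing of h18 confirmed OK"),
    ("14:38:55", "h16", "node 118 joins the membership"),
]


def offsets(events):
    """Return (seconds since the first event, node, summary) per event."""
    def parse(text):
        return datetime.strptime(text, "%H:%M:%S")

    start = parse(events[0][0])
    return [(int((parse(t) - start).total_seconds()), node, summary)
            for t, node, summary in events]


for secs, node, summary in offsets(EVENTS):
    print(f"+{secs:3d}s  {node}  {summary}")
```

The node resets itself about 5 seconds after the request, while the
confirmation is only logged at about +124 seconds.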

Regards,
Ulrich