[ClusterLabs] Antw: Re: Antw: [EXT] delaying start of a resource
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Thu Dec 17 06:02:27 EST 2020
>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 17.12.2020 at 09:50 in
message
<CAA91j0VUv4nMtEtCPQNiMF-XrRv_9KqkCnPvmAn4XBoNBQpGTA at mail.gmail.com>:
...
> According to logs from xstha1, it started to activate resources only
> after stonith was confirmed
>
> Dec 16 15:08:12 [708] stonith-ng: notice: log_operation:
> Operation 'off' [1273] (call 4 from crmd.712) for host 'xstha2' with
> device 'xstha2-stonith' returned: 0 (OK)
> Dec 16 15:08:12 [708] stonith-ng: notice: remote_op_done:
> Operation 'off' targeting xstha2 on xstha1 for
> crmd.712 at xstha1.e487e7cc: OK
>
> It is possible that your IPMI/BMC/whatever implementation responds
> with success before it actually completes this action. I have seen at
Wouldn't a reasonable "stonith-timeout=180" be enough? Even sbd needs one, because
after the fence command has been sent, it still has to be read and processed.
For example, this is what I see in the DC logs around fencing here:
Nov 30 11:31:56 h18 pacemaker-fenced[49409]: notice: prm_stonith_sbd is
eligible to fence (reboot) h16: dynamic-list
Nov 30 11:32:03 h18 corosync[49399]: [TOTEM ] A processor failed, forming
new configuration.
Nov 30 11:32:09 h18 corosync[49399]: [TOTEM ] A new membership
(172.20.16.18:42032) was formed. Members left: 116
...
Nov 30 11:32:09 h18 pacemaker-controld[49413]: notice: Our peer on the DC
(h16) is dead
...
Nov 30 11:33:57 h18 pacemaker-controld[49413]: notice: Peer h16 was
terminated (reboot) by h18 on behalf of pacemaker-controld.69600: OK
...note the delay between node being dead and confirmation...
Nov 30 11:36:05 h18 corosync[49399]: [TOTEM ] A new membership
(172.20.16.16:42036) was formed. Members joined: 116
...node re-joined cluster after being fenced
> least some delays in the past. There is not really much that can be
> done here except adding artificial delay to stonith resource agent.
> You need to test IPMI functionality before using it in pacemaker.
Another example:
Dec 16 14:34:35 h18 pacemaker-controld[4478]: notice: Requesting fencing
(reboot) of node h18
...
Dec 16 14:34:38 h18 pacemaker-fenced[4474]: notice: Requesting that h16
perform 'reboot' action targeting h18
...
Dec 16 14:34:40 h18 sbd[3717]: /dev/disk/by-id/dm-name-SBD_1-3P2: notice:
servant_md: Received command reset from h16 on disk...
...
Dec 16 14:34:40 h18 sbd[3697]: warning: inquisitor_child:
/dev/disk/by-id/dm-name-SBD_1-3P2 requested a reset
Dec 16 14:34:40 h18 sbd[3697]: emerg: do_exit: Rebooting system: reboot
...
Dec 16 14:34:45 h16 corosync[3617]: [TOTEM ] A processor failed, forming new
configuration.
...
Dec 16 14:35:50 h16 dlm_controld[4802]: 170858
91E73809FE224F2495FE617D556E1800 wait for fencing
...
Dec 16 14:36:39 h16 pacemaker-controld[4527]: notice: Peer h18 was terminated
(reboot) by h16 on behalf of pacemaker-controld.4478: OK
...
Dec 16 14:38:55 h16 corosync[3617]: [TOTEM ] A new membership
(172.20.16.16:42128) was formed. Members joined: 118
The timeout (3 min) may be excessive here, but it shows what's going on.
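To put numbers on the delays in the two excerpts above, here is a minimal Python sketch that subtracts the syslog timestamps. The helper names (parse_ts, delay_seconds) are made up for illustration, the log lines are rejoined where the mail wrapped them, the year is supplied by hand because syslog timestamps do not carry one, and month names are assumed to parse in an English/C locale. For the second example it also assumes the clocks on h16 and h18 are in sync.

```python
from datetime import datetime


def parse_ts(line, year=2020):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp of a log line."""
    # The first three whitespace-separated fields are month, day and time.
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def delay_seconds(start_line, end_line, year=2020):
    """Seconds elapsed between the timestamps of two log lines."""
    return (parse_ts(end_line, year) - parse_ts(start_line, year)).total_seconds()


# First example: peer declared dead vs. fencing confirmed (both logged on h18).
peer_dead = ("Nov 30 11:32:09 h18 pacemaker-controld[49413]: notice: "
             "Our peer on the DC (h16) is dead")
fence_ok = ("Nov 30 11:33:57 h18 pacemaker-controld[49413]: notice: "
            "Peer h16 was terminated (reboot) by h18 on behalf of "
            "pacemaker-controld.69600: OK")
first_delay = delay_seconds(peer_dead, fence_ok)

# Second example: fencing requested on h18 vs. confirmation logged on h16.
fence_requested = ("Dec 16 14:34:35 h18 pacemaker-controld[4478]: notice: "
                   "Requesting fencing (reboot) of node h18")
fence_confirmed = ("Dec 16 14:36:39 h16 pacemaker-controld[4527]: notice: "
                   "Peer h18 was terminated (reboot) by h16 on behalf of "
                   "pacemaker-controld.4478: OK")
second_delay = delay_seconds(fence_requested, fence_confirmed)

print(f"dead -> confirmed:      {first_delay:.0f} s")
print(f"requested -> confirmed: {second_delay:.0f} s")
```

In both cases the confirmation arrives well over a minute after the first event, which is the waiting period a stonith timeout is meant to cover.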
Regards,
Ulrich