[ClusterLabs] Antw: Re: Antw: [EXT] delaying start of a resource

Thu Dec 17 13:30:14 EST 2020

On Thu, 2020-12-17 at 19:13 +0300, Andrei Borzenkov wrote:
> 17.12.2020 14:02, Ulrich Windl wrote:
> > > > > Andrei Borzenkov <arvidjaar at gmail.com> wrote on 17.12.2020
> > > > > at 09:50 in
> > 
> > message
> > <CAA91j0VUv4nMtEtCPQNiMF-XrRv_9KqkCnPvmAn4XBoNBQpGTA at mail.gmail.com
> > >:
> > 
> > ...
> > > According to logs from xstha1, it started to activate resources
> > > only
> > > after stonith was confirmed
> > > 
> > > Dec 16 15:08:12 [708] stonith-ng:   notice: log_operation:
> > > Operation 'off' [1273] (call 4 from crmd.712) for host 'xstha2'
> > > with
> > > device 'xstha2-stonith' returned: 0 (OK)
> > > Dec 16 15:08:12 [708] stonith-ng:   notice: remote_op_done:
> > > Operation 'off' targeting xstha2 on xstha1 for
> > > crmd.712 at xstha1.e487e7cc: OK
> > > 
> > > It is possible that your IPMI/BMC/whatever implementation
> > > responds
> > > with success before it actually completes this action. I have
> > > seen at
> > 
> > Shouldn't a reasonable "stonith-timeout=180" do? 
> 
> This is the maximum time to wait for a successful stonith. In this case
> stonith *was* successful - at least from the pacemaker point of view.

This reminded me that some IPMI implementations return "success" for
commands before they've actually been completed. This is why
fence_ipmilan has a "power_wait" parameter that defaults to 2 seconds.

The best thing would be to do some manual testing, using ipmitool or
whatnot to turn off the power, and observe how long it takes between
when the command returns and when the server is actually powered down.
Then set power_wait to a comfortable margin above that. Or just keep
raising power_wait until the problem goes away :)
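
For what it's worth, here is a rough sketch of how that measurement
could be scripted. It is only an illustration: the helper names
(ipmi, measure_power_off_lag) are made up, it assumes the BMC is
reachable over the lanplus interface, and it assumes that "chassis
power status" reports the real power state, which may not hold on a
BMC that answers early. Watching the console or pinging the host is
a more trustworthy check.

```python
import subprocess
import time


def ipmi(host, user, password, *args):
    """Run one ipmitool command against a BMC and return its stdout.

    Assumes ipmitool is installed and the BMC speaks lanplus.
    """
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", user, "-P", password, *args]
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout


def measure_power_off_lag(run, poll_interval=0.5, timeout=120.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Return seconds between 'power off' returning and status reading off.

    `run` is any callable taking ipmitool sub-command words and
    returning the command output as a string.
    """
    run("chassis", "power", "off")
    returned_at = clock()  # the moment the BMC claimed success
    while clock() - returned_at < timeout:
        # ipmitool prints "Chassis Power is on" or "Chassis Power is off"
        if "is off" in run("chassis", "power", "status").lower():
            return clock() - returned_at
        sleep(poll_interval)
    raise TimeoutError("power status never reported off")
```

With a test node (never a production one), something like
measure_power_off_lag(lambda *a: ipmi("bmc-host", "user", "secret", *a))
would give a number to compare against the current power_wait; the
host and credentials there are placeholders.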
-- 
Ken Gaillot <kgaillot at redhat.com>