[ClusterLabs] Antw: [EXT] Re: Q: wrong "unexpected shutdown of DC" detected

Ken Gaillot kgaillot at redhat.com
Thu Jan 28 10:03:18 EST 2021


On Thu, 2021-01-28 at 11:23 +0100, Ulrich Windl wrote:
> Ken,
> 
> thanks for analyzing the logs! See comments inline...
> 
> > > > Ken Gaillot <kgaillot at redhat.com> wrote on 27.01.2021 at
> > > > 19:55 in message
> <644fc719a2e8870c332db859bcdef275d986249a.camel at redhat.com>:
> > On Wed, 2021-01-27 at 12:36 +0100, Ulrich Windl wrote:
> 
> ...
> > > Jan 27 10:43:48 h16 pacemaker-execd[25960]:  warning:
> > > prm_CFS_VMI_stop_0[11502] timed out after 90000ms
> > > Jan 27 10:43:48 h16 pacemaker-execd[25960]:  notice: prm_CFS_VMI
> > > stop
> > > (call 129, PID 11502) exited with status 1 (execution time
> > > 90007ms,
> > > queue time 0ms)
> > > Jan 27 10:43:48 h16 pacemaker-controld[25963]:  error: Result of
> > > stop
> > > operation for prm_CFS_VMI on h16: Timed Out
> > 
> > This stop timeout is why h16 correctly needs to be fenced. The only
> > question is why the stop timed out.
> 
> The resource is OCFS2, needing DLM. DLM in turn wants a quorum,
> right?
> So: No quorum, no action -> timeout. Is that right?
> 
> ...
> > > Finally: ;-)
> > > 
> > > Jan 27 11:35:14 h19 pacemaker-fenced[2099]:  notice: Versions did
> > > not
> > > change in patch 0.250.39
> > > Jan 27 11:36:43 h19 pacemaker-fenced[2099]:  notice: Operation
> > > 'reboot' targeting h18 on h16 for 
> > > pacemaker-controld.7467 at h16.46c6f6cc: OK
> > > Jan 27 11:36:43 h19 pacemaker-fenced[2099]:  error:
> > > stonith_construct_reply: Triggered assert at
> > > fenced_commands.c:2363 :
> > > request != NULL
> 
> You did not comment on that; is that expected behavior? ;-)

Sort of ;)

This was changed to a more reasonable log warning in the 2.0.5 release:

  Missing request information for client notifications for operation with result <N> (initiated before we came up?)

It can happen (and is perfectly OK) when a node is coming up while some
fencing operation is already in flight. Ideally we'd synchronize
in-flight operation information when a node comes up, but that wouldn't
really change the outcome; it would just let us distinguish that
situation from an actual error when this message appears.
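
As a rough illustration of the situation (this is not the actual
pacemaker-fenced code; the class, method, and variable names below are
made up for the sketch), a daemon that only knows about requests it has
seen since it started can downgrade "result for an unknown request" to
a warning like this:

```python
import logging

log = logging.getLogger("fencer_sketch")


class FencerSketch:
    """Toy model of a fencing daemon that may have joined the cluster late.

    Hypothetical names throughout; this only mirrors the idea described
    above, not Pacemaker's real data structures.
    """

    def __init__(self):
        # Requests this node has seen since it came up, keyed by operation ID.
        self.known_requests = {}

    def record_request(self, op_id, client):
        """Remember which client asked for a fencing operation."""
        self.known_requests[op_id] = client

    def handle_result(self, op_id, result):
        """Return (client, result) to notify, or None if the request is unknown."""
        client = self.known_requests.pop(op_id, None)
        if client is None:
            # The operation was probably started before this node came up,
            # so there is nobody local to notify. Warn instead of asserting.
            log.warning(
                "Missing request information for client notifications "
                "for operation with result %s (initiated before we came up?)",
                result,
            )
            return None
        return (client, result)
```

With this shape, a result for an operation the node never saw is logged
and dropped, while results for locally known requests are delivered as
usual. It still cannot tell a late join from a genuine bookkeeping
error, which is the limitation mentioned above.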

> > > Jan 27 11:36:43 h19 pacemaker-fenced[2099]:  warning: Can't
> > > create a
> > > sane reply
> > > 
> > > Regards,
> > > Ulrich

-- 
Ken Gaillot <kgaillot at redhat.com>


