[ClusterLabs] Antw: Re: Antw: [EXT] Re: Why not retry a monitor (pacemaker‑execd) that got a segmentation fault?

Klaus Wenninger kwenning at redhat.com
Wed Jun 15 07:22:35 EDT 2022


On Wed, Jun 15, 2022 at 10:33 AM Ulrich Windl
<Ulrich.Windl at rz.uni-regensburg.de> wrote:
>
> >>> Klaus Wenninger <kwenning at redhat.com> wrote on 15.06.2022 at 10:00 in message
> <CALrDAo2X2-40Yu=23-jo0uYM9qk4ouE=37Uiv1jWFN6T29v5mQ at mail.gmail.com>:
> > On Wed, Jun 15, 2022 at 8:32 AM Ulrich Windl
> > <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> >>
> >> >>> Ulrich Windl wrote on 14.06.2022 at 15:53 in message <62A892F0.174 : 161 : 60728>:
> >>
> >> ...
> >> > Yes it's odd, but isn't the cluster just to protect us from odd situations? ;-)
> >>
> >> I have more odd stuff:
> >> Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]:  warning: prm_lockspace_ocfs2_monitor_120000 process (PID 30234) timed out
> >> ...
> >> Jun 14 20:40:14 h18 pacemaker-execd[7020]:  crit: prm_lockspace_ocfs2_monitor_120000 process (PID 30234) will not die!
> >> ...
> >> Jun 14 20:40:53 h18 pacemaker-controld[7026]:  warning: lrmd IPC request 525 failed: Connection timed out after 5000ms
> >> Jun 14 20:40:53 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
> >> ...
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_exec operation (timeout=90000): -114: Connection timed out (110)
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Operation stop on prm_lockspace_ocfs2 failed: -70
> >> ...
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  notice: State transition S_NOT_DC -> S_RECOVERY
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  warning: Fast-tracking shutdown in response to errors
> >> Jun 14 20:42:23 h18 pacemaker-controld[7026]:  error: Input I_TERMINATE received in state S_RECOVERY from do_recover
> >> Jun 14 20:42:28 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd disabled until pending reply received
> >> Jun 14 20:42:28 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  warning: Sending IPC to lrmd disabled until pending reply received
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Stopped 2 recurring operations at shutdown (0 remaining)
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: 3 resources were active at shutdown
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the executor
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from Corosync
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  notice: Disconnected from the CIB manager
> >> Jun 14 20:42:33 h18 pacemaker-controld[7026]:  error: Could not recover from internal error
> >> Jun 14 20:42:33 h18 pacemakerd[7003]:  error: pacemaker-controld[7026] exited with status 1 (Error occurred)
> >> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-schedulerd
> >> Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]:  notice: Caught 'Terminated' signal
> >> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-attrd
> >> Jun 14 20:42:33 h18 pacemaker-attrd[7022]:  notice: Caught 'Terminated' signal
> >> Jun 14 20:42:33 h18 pacemakerd[7003]:  notice: Stopping pacemaker-execd
> >> Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: pcmk health check: UNHEALTHY
> >> Jun 14 20:42:34 h18 sbd[6856]:  warning: inquisitor_child: Servant pcmk is outdated (age: 41877)
> >> (SBD Fencing)
> >>
> >
> > Rolling it up from the back: I think the reaction to self-fence, in case
> > pacemaker says it doesn't know - and isn't able to find out - the state
> > of the resources, is basically correct.
> >
> > Seeing the issue with the fake age being printed - possibly causing
> > confusion - reminds me that this should be addressed. I thought we had
> > fixed that already, but that is obviously a false memory.
>
> Hi Klaus and others!
>
> Well that is the current update state of SLES15 SP3; maybe upstream updates
> did not make it into SLES yet; I don't know.
>
> >
> > It would be interesting whether pacemaker would recover the sub-processes
> > without sbd around, and whether other ways of fencing - which should kick
> > in in a similar way - would need a significant amount of time.
> > As pacemakerd recently started to ping the sub-daemons via IPC - instead
> > of just listening for signals - it would be interesting whether the logs
> > we are seeing are already from that code.
>
> The "code" probably is:
> pacemaker-2.0.5+20201202.ba59be712-150300.4.21.1.x86_64
>
> >
> > That whatever is happening with the monitor process kicked off by execd
> > seems to hog the IPC for a significant time might be an issue to look into.
>
> I don't know the details (even support at SUSE doesn't know what's going on in
> the kernel, it seems),
> but it looks as if one "stalled" monitor process can cause the node to be
> fenced.
>
> I had been considering this extremely paranoid idea:
> What if you could configure three (different) monitor operations for a
> resource, and an action would be triggered only if at least two of the
> three monitors agree on the status of the resource? I think such a
> mechanism is not uncommon in mission-critical systems...
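
Purely as an illustration of that 2-of-3 idea - nothing like this exists in
Pacemaker today - a wrapper could run several independent probes and report a
status only when a majority agrees. A minimal sketch in Python, with three
hypothetical placeholder probe commands:

#!/usr/bin/env python3
"""Hypothetical sketch of 2-of-3 monitoring: run three independent probes
and report a status only when at least two of them agree.
The probe commands below are placeholders, not real resource agents."""
import subprocess
from collections import Counter

# Placeholder probes; each is assumed to exit 0 (running) or non-zero (failed).
CHECKS = [
    ["/usr/local/bin/check_fs_mounted"],   # hypothetical
    ["/usr/local/bin/check_lock_held"],    # hypothetical
    ["/usr/local/bin/check_io_alive"],     # hypothetical
]

def run_probe(cmd, timeout=30):
    """Run one probe, bounded by its own timeout."""
    try:
        return subprocess.run(cmd, timeout=timeout).returncode
    except (subprocess.TimeoutExpired, OSError):
        return None  # a hung or missing probe simply loses its vote

def vote(results):
    """Return the majority exit code, or None if no code got at least two votes."""
    counted = Counter(r for r in results if r is not None)
    if not counted:
        return None
    code, votes = counted.most_common(1)[0]
    return code if votes >= 2 else None

if __name__ == "__main__":
    verdict = vote(run_probe(c) for c in CHECKS)
    # Without a 2-of-3 majority, deliberately report a generic error (1)
    # rather than guessing the resource status.
    raise SystemExit(1 if verdict is None else verdict)

Giving each probe its own timeout means one stalled check merely loses its
vote instead of stalling the whole monitor.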
>
> > Although the new implementation in pacemakerd might kick in and recover
> > execd - for what that is worth in the end.
> >
> > This all seems to be kicked off by an RA that might not be robust enough,
> > or by a node that is in a state that just doesn't allow a better answer.
>
> Well, I think it's the kernel: in the past I have had processes that could
> not be killed even with "kill -9". (On non-Linux systems this is rather
> common when a process is blocked on I/O, for example.)
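
For what it's worth, such unkillable processes are usually stuck in
uninterruptible sleep (state 'D' in /proc), which even SIGKILL typically
cannot interrupt until the blocking I/O completes. A rough sketch of how to
check, using the PID from the log above:

#!/usr/bin/env python3
"""Rough sketch: report whether a process is in uninterruptible sleep ('D'),
the state in which even SIGKILL has no immediate effect."""
import sys

def proc_state(pid: int) -> str:
    """Read the one-letter state field from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        # The state is the field right after the parenthesised command name,
        # which may itself contain spaces or parentheses.
        return f.read().rsplit(")", 1)[1].split()[0]

if __name__ == "__main__":
    # Defaults to the PID from the log excerpt above; pass another PID to check.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else 30234
    try:
        state = proc_state(pid)
    except FileNotFoundError:
        print(f"PID {pid}: no such process")
    else:
        note = " (uninterruptible sleep; SIGKILL will not help)" if state == "D" else ""
        print(f"PID {pid}: state {state}{note}")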
>
> > Guess the timeouts and retries required to give a timely answer about the
> > state of a resource should be taken care of inside the RA.
> > Guess the last two are at least something totally different from the fork
> > segfaulting, although that might as well be a sign that there is something
> > really wrong with the node.
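
To illustrate the point about handling timeouts and retries inside the RA: a
monitor can keep each probe attempt short and retry within an overall deadline
that stays below the cluster-side operation timeout, so execd always gets an
answer instead of having to kill a hung child. A minimal sketch, assuming a
hypothetical probe command:

#!/usr/bin/env python3
"""Sketch: retry a short probe a few times inside the monitor action, but
always answer before the operation timeout configured in the cluster."""
import subprocess
import time

def monitor(probe, attempt_timeout=10, retries=3, deadline=60):
    """Return the probe's exit code, retrying briefly, never exceeding `deadline` seconds."""
    start = time.monotonic()
    for _ in range(retries):
        remaining = deadline - (time.monotonic() - start)
        if remaining <= 0:
            break
        try:
            return subprocess.run(
                probe, timeout=min(attempt_timeout, remaining)).returncode
        except subprocess.TimeoutExpired:
            continue  # retry while there is still time budget left
    return 1  # report an error in time rather than leaving the executor hanging

if __name__ == "__main__":
    # "/usr/local/bin/probe_lockspace" is a hypothetical placeholder probe.
    raise SystemExit(monitor(["/usr/local/bin/probe_lockspace"]))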
>
> (As said above, it may be some RAM corruption where SMIs (system management
> interrupts, or so) play a role, but Dell says the hardware is OK, and since
> we run SLES we don't have software support with Dell, so they won't even
> consider that possibility.)

That happens inside VMs, right? I mean the nodes being VMs.
A couple of years back I had an issue running protected mode inside
KVM virtual machines on Lenovo laptops.
That really was an SMI issue (apparently problems when an SMI interrupt
fired while the CPU was in protected mode) that went away after
disabling SMI interrupts.
I have no idea whether that is still possible with current chipsets, and I'm
not telling you to do that in production, but it might be interesting for
narrowing the issue down. One might run into thermal issues and other
things SMI takes care of on that hardware.

Klaus
>
> But actually I'm starting to believe that such a system is a good playground
> for any HA solution ;-)
> Unfortunately, here it's much more production than playground...
>
> Regards,
> Ulrich
>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/


