[ClusterLabs] Antw: Re: Antw: [EXT] Re: Why not retry a monitor (pacemaker‑execd) that got a segmentation fault?
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Wed Jun 15 04:33:39 EDT 2022
>>> Klaus Wenninger <kwenning at redhat.com> wrote on 15.06.2022 at 10:00 in
message <CALrDAo2X2-40Yu=23-jo0uYM9qk4ouE=37Uiv1jWFN6T29v5mQ at mail.gmail.com>:
> On Wed, Jun 15, 2022 at 8:32 AM Ulrich Windl
> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
>>
>> >>> Ulrich Windl wrote on 14.06.2022 at 15:53 in message
>> <62A892F0.174 : 161 : 60728>:
>>
>> ...
>> > Yes it's odd, but isn't the cluster just to protect us from odd
>> > situations? ;-)
>>
>> I have more odd stuff:
>> Jun 14 20:40:09 rksaph18 pacemaker-execd[7020]: warning: prm_lockspace_ocfs2_monitor_120000 process (PID 30234) timed out
>> ...
>> Jun 14 20:40:14 h18 pacemaker-execd[7020]: crit: prm_lockspace_ocfs2_monitor_120000 process (PID 30234) will not die!
>> ...
>> Jun 14 20:40:53 h18 pacemaker-controld[7026]: warning: lrmd IPC request 525 failed: Connection timed out after 5000ms
>> Jun 14 20:40:53 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -110: Connection timed out (110)
>> ...
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_exec operation (timeout=90000): -114: Connection timed out (110)
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Operation stop on prm_lockspace_ocfs2 failed: -70
>> ...
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]: notice: State transition S_NOT_DC -> S_RECOVERY
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]: warning: Fast-tracking shutdown in response to errors
>> Jun 14 20:42:23 h18 pacemaker-controld[7026]: error: Input I_TERMINATE received in state S_RECOVERY from do_recover
>> Jun 14 20:42:28 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
>> Jun 14 20:42:28 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: warning: Sending IPC to lrmd disabled until pending reply received
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Couldn't perform lrmd_rsc_cancel operation (timeout=0): -114: Connection timed out (110)
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Stopped 2 recurring operations at shutdown (0 remaining)
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: 3 resources were active at shutdown
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the executor
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from Corosync
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: notice: Disconnected from the CIB manager
>> Jun 14 20:42:33 h18 pacemaker-controld[7026]: error: Could not recover from internal error
>> Jun 14 20:42:33 h18 pacemakerd[7003]: error: pacemaker-controld[7026] exited with status 1 (Error occurred)
>> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-schedulerd
>> Jun 14 20:42:33 h18 pacemaker-schedulerd[7024]: notice: Caught 'Terminated' signal
>> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-attrd
>> Jun 14 20:42:33 h18 pacemaker-attrd[7022]: notice: Caught 'Terminated' signal
>> Jun 14 20:42:33 h18 pacemakerd[7003]: notice: Stopping pacemaker-execd
>> Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: pcmk health check: UNHEALTHY
>> Jun 14 20:42:34 h18 sbd[6856]: warning: inquisitor_child: Servant pcmk is outdated (age: 41877)
>> (SBD Fencing)
>>
>
> Rolling it up from the back I guess the reaction to self-fence in case
> pacemaker is telling it doesn't know - and isn't able to find out - about
> the state of the resources is basically correct.
>
> Seeing the issue with the fake-age being printed - possibly causing
> confusion - it reminds me that this should be addressed. Thought we had
> already but obviously a false memory.
Hi Klaus and others!
Well, that is the current patch level of SLES15 SP3; maybe upstream fixes
have not made it into SLES yet; I don't know.
>
> Would be interesting if pacemaker would recover the sub-processes without
> sbd around and other ways of fencing - that should kick in in a similar
> way - would need a significant time.
> As pacemakerd recently started to ping the sub-daemons via ipc - instead
> of just listening for signals - it would be interesting if logs we are
> seeing are already from that code.
The "code" probably is:
pacemaker-2.0.5+20201202.ba59be712-150300.4.21.1.x86_64
>
> That what is happening with the monitor-process kicked off by execd seems
> to hog the ipc for a significant time might be an issue to look after.
I don't know the details (even support at SUSE doesn't know what's going on in
the kernel, it seems),
but it looks as if one "stalled" monitor process can cause the node to be
fenced.
I had been considering this extremely paranoid idea:
What if you could configure three (different) monitor operations for a
resource, and an action would be triggered only if at least two of the three
monitors agree on the status of the resource? I think such a mechanism is not
uncommon in mission-critical systems...
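Just to illustrate the idea (this is not an existing Pacemaker feature), here
is a rough sketch of a wrapper monitor in Python that runs three independent
checks and only trusts a status that at least two of them agree on; the check
commands and the return-code handling are made up for the example:

#!/usr/bin/env python3
"""Hypothetical 2-out-of-3 voting monitor wrapper (illustration only)."""
import subprocess
from collections import Counter

OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7
OCF_ERR_GENERIC = 1

# Three *different* ways of checking the same resource (placeholder paths).
CHECKS = [
    ["/usr/local/bin/check_lockspace_pid"],
    ["/usr/local/bin/check_lockspace_mount"],
    ["/usr/local/bin/check_lockspace_io"],
]

def run_check(cmd, timeout=30):
    """Run one check; map a non-zero exit, crash or timeout to 'not running'."""
    try:
        rc = subprocess.run(cmd, timeout=timeout).returncode
        return OCF_SUCCESS if rc == 0 else OCF_NOT_RUNNING
    except (subprocess.TimeoutExpired, OSError):
        return OCF_NOT_RUNNING

def voted_status():
    """Return the status at least two of the three checks agree on."""
    votes = Counter(run_check(cmd) for cmd in CHECKS)
    status, count = votes.most_common(1)[0]
    return status if count >= 2 else OCF_ERR_GENERIC

if __name__ == "__main__":
    raise SystemExit(voted_status())

Of course a check stuck in uninterruptible sleep would still hang the wrapper,
so in practice the three checks would have to run asynchronously; the sketch
is only meant to show the voting itself.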
> Although the new implementation in pacemakerd might kick in and recover
> execd - for what that is worth in the end.
>
> This all seems to be kicked off by an RA that might not be robust enough or
> the node is in a state that just doesn't allow a better answer.
Well, I think it's the kernel: in the past I have had processes that could
not be killed even with "kill -9". (On non-Linux systems this is rather
common if the process is blocked on I/O, for example.)
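For what it's worth, such processes usually sit in state "D" (uninterruptible
sleep in the kernel), which is why even SIGKILL has no effect. A quick sketch
that spots them, just reading the standard /proc/<pid>/stat format:

#!/usr/bin/env python3
"""List processes in uninterruptible sleep ('D'), which ignore SIGKILL."""
import os

def proc_state(pid):
    """Return (state, comm) from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    # comm is enclosed in parentheses and may contain spaces; state follows it.
    comm = data[data.index("(") + 1:data.rindex(")")]
    state = data[data.rindex(")") + 2]
    return state, comm

if __name__ == "__main__":
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            info = proc_state(entry)
            if info and info[0] == "D":
                print(f"PID {entry} ({info[1]}) is in uninterruptible sleep")

The same information is of course visible with "ps axo pid,stat,comm".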
> Guess timeouts and retries required to give a timely answer about the state
> of a resource should be taken care of inside the RA.
> Guess the last 2 are at least something totally different than fork
> segfaulting, although that might as well be a sign that there is something
> really wrong with the node.
(As said above, it may be some RAM corruption where SMIs (system management
interrupts) play a role, but Dell says the hardware is OK, and since we run
SLES we don't have software support from Dell, so they won't even consider
that possibility.)
But actually I'm starting to believe such a system is a good playground for
any HA solution ;-)
Unfortunately, here it's much more production than playground...
Regards,
Ulrich