[ClusterLabs] Antw: [EXT] Re: Why not retry a monitor (pacemaker‑execd) that got a segmentation fault?

Tue Jun 14 09:53:52 EDT 2022

>>> Ken Gaillot <kgaillot at redhat.com> wrote on 14.06.2022 at 15:49 in
message
<dfba7c2f630deb546cc2bfe9c584b8195eb4a94a.camel at redhat.com>:
> On Tue, 2022‑06‑14 at 14:36 +0200, Ulrich Windl wrote:
>> Hi!
>> 
>> I had a case where a VirtualDomain monitor operation ended in a core
>> dump (actually it was pacemaker-execd, but it counted as a "monitor"
>> operation), and the cluster decided to restart the VM. Wouldn't it be
>> worth retrying the monitor operation first?
> 
> It counts like any other monitor failure.
> 
>> Chances are that a retried monitor operation returns a better status
>> than segmentation fault.
>> Or does the logic just ignore processes dying on signals?
>> 
>> 20201202.ba59be712-150300.4.21.1.x86_64 (SLES15 SP3)
>> 
>> Jun 14 14:09:16 h19 systemd-coredump[28788]: Process 28786
>> (pacemaker-execd) of user 0 dumped core.
>> Jun 14 14:09:16 h19 pacemaker-execd[7440]:  warning:
>> prm_xen_v04_monitor_600000[28786] terminated with signal:
>> Segmentation fault
> 
> This means that the child process forked to execute the resource agent
> segfaulted, which is odd.
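
To illustrate what "terminated with signal" means in the log above, here is a minimal Python sketch (not Pacemaker's actual code) of a parent that forks a child and then checks whether the child exited normally or was killed by a signal. The names run_and_classify and crash_with_segv are hypothetical, and the child sends SIGSEGV to itself only to simulate the crash.

```python
import os
import signal


def run_and_classify(child_body):
    """Fork, run child_body in the child, and report how the child ended.

    Returns ("exited", exit_code) or ("signaled", signal_name).
    Hypothetical helper for illustration only.
    """
    pid = os.fork()
    if pid == 0:
        # Child process: never return into the parent's code path.
        try:
            child_body()
            os._exit(0)
        except BaseException:
            os._exit(1)

    # Parent process: wait for the child and inspect its wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        # The child was killed by a signal, so there is no exit code at all.
        return ("signaled", signal.Signals(os.WTERMSIG(status)).name)
    return ("exited", os.WEXITSTATUS(status))


def crash_with_segv():
    """Simulate a segfault in the child (assumption: default signal action)."""
    signal.signal(signal.SIGSEGV, signal.SIG_DFL)
    os.kill(os.getpid(), signal.SIGSEGV)


if __name__ == "__main__":
    print(run_and_classify(lambda: None))
    print(run_and_classify(crash_with_segv))
```

A supervisor that sees a "signaled" result has no agent exit code to interpret, which would be consistent with the controller log reporting a generic 'error' for the monitor.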

Yes, it's odd, but isn't the cluster there precisely to protect us from odd
situations? ;-)

> 
> Is the agent a compiled program? If not, it's possible the tiny amount
> of pacemaker code that executes the agent is what segfaulted. Do you
> have the actual core, and can you do a backtrace?

Believe me, it's just "odd":
Stack trace of thread 28786:
#0  0x00007f85589e0bcf __libc_fork (/lib64/libc-2.31.so + 0xe1bcf)
#1  0x00007f855949b85d n/a (/usr/lib64/libcrmservice.so.28.2.2 + 0x785d)
#2  0x00007f855949a5e3 n/a (/usr/lib64/libcrmservice.so.28.2.2 + 0x65e3)
#3  0x00007f8558d470ed n/a (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x530ed)
#4  0x00007f8558d46624 g_main_context_dispatch (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x52624)
#5  0x00007f8558d469c0 n/a (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x529c0)
#6  0x00007f8558d46c82 g_main_loop_run (/usr/lib64/libglib-2.0.so.0.6200.6 + 0x52c82)
#7  0x0000558c0765930b n/a (/usr/lib/pacemaker/pacemaker-execd + 0x330b)
#8  0x00007f85589342bd __libc_start_main (/lib64/libc-2.31.so + 0x352bd)
#9  0x0000558c076593da n/a (/usr/lib/pacemaker/pacemaker-execd + 0x33da)

Rumors say it's Dell's dcdbas module combined with Xen and an AMD CPU plus
some software bugs ;-)

Regards,
Ulrich

> 
>> Jun 14 14:09:16 h19 pacemaker-controld[7443]:  error: Result of
>> monitor operation for prm_xen_v04 on h19: Error
>> Jun 14 14:09:16 h19 pacemaker-controld[7443]:  notice: Transition 9
>> action 107 (prm_xen_v04_monitor_600000 on h19): expected 'ok' but got
>> 'error'
>> ...
>> Jun 14 14:09:16 h19 pacemaker-schedulerd[7442]:  notice:  *
>> Recover    prm_xen_v04              (             h19 )
>> 
>> Regards,
>> ulrich
>> 
>> 
>> 
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> ClusterLabs home: https://www.clusterlabs.org/ 
>> 
> ‑‑ 
> Ken Gaillot <kgaillot at redhat.com>