[ClusterLabs] Why not retry a monitor (pacemaker-execd) that got a segmentation fault?
Ken Gaillot
kgaillot at redhat.com
Tue Jun 14 09:49:16 EDT 2022
On Tue, 2022-06-14 at 14:36 +0200, Ulrich Windl wrote:
> Hi!
>
> I had a case where a VirtualDomain monitor operation ended in a core
> dump (actually it was pacemaker-execd, but it counted as "monitor"
> operation), and the cluster decided to restart the VM. Wouldn't it be
> worth retrying the monitor operation first?
It counts like any other monitor failure.
> Chances are that a re-tried monitor operation returns a better status
> than segmentation fault.
> Or does the logic just ignore processes dying on signals?
>
> 20201202.ba59be712-150300.4.21.1.x86_64 (SLES15 SP3)
>
> Jun 14 14:09:16 h19 systemd-coredump[28788]: Process 28786
> (pacemaker-execd) of user 0 dumped core.
> Jun 14 14:09:16 h19 pacemaker-execd[7440]: warning:
> prm_xen_v04_monitor_600000[28786] terminated with signal:
> Segmentation fault
This means that the child process forked to execute the resource agent
segfaulted, which is odd.
Is the agent a compiled program? If not, it's possible the tiny amount
of pacemaker code that executes the agent is what segfaulted. Do you
have the actual core, and can you do a backtrace?
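
To illustrate the distinction above, here is a minimal Python sketch of the generic fork/exec/wait pattern. It is not Pacemaker's actual code (pacemaker-execd is written in C), and the function name run_child is hypothetical. It shows how a parent can tell a child that was killed by a signal apart from one that exited with a status code, and why a crash between fork and exec would still be reported under the parent program's name.

```python
import os
import signal
import sys


def run_child(argv):
    """Fork, exec argv in the child, and report how the child ended.

    Returns ("signal", signum) if the child was killed by a signal,
    otherwise ("exit", status).
    """
    pid = os.fork()
    if pid == 0:
        # Child: until exec succeeds, this process is still running the
        # parent's code, so a crash here is attributed to the parent binary.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; never fall back into parent logic.
            os._exit(127)

    # Parent: wait for the child and decode its wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return ("signal", os.WTERMSIG(status))
    return ("exit", os.WEXITSTATUS(status))


if __name__ == "__main__":
    # Hypothetical stand-in for an agent that dies on SIGSEGV.
    crash = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
    kind, value = run_child([sys.executable, "-c", crash])
    if kind == "signal":
        print("terminated with signal:", signal.Signals(value).name)
    else:
        print("exited with status:", value)
```

In this sketch a signal death is simply a distinct result for the caller; whether to treat it as an ordinary failure or to retry is a policy decision made elsewhere.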
> Jun 14 14:09:16 h19 pacemaker-controld[7443]: error: Result of
> monitor operation for prm_xen_v04 on h19: Error
> Jun 14 14:09:16 h19 pacemaker-controld[7443]: notice: Transition 9
> action 107 (prm_xen_v04_monitor_600000 on h19): expected 'ok' but got
> 'error'
> ...
> Jun 14 14:09:16 h19 pacemaker-schedulerd[7442]: notice: *
> Recover prm_xen_v04 ( h19 )
>
> Regards,
> ulrich
--
Ken Gaillot <kgaillot at redhat.com>