[ClusterLabs] IPaddr2 resource times out and can't be killed

Fri Jul 29 16:20:52 EDT 2022

On Fri, Jul 29, 2022 at 1:02 PM Reid Wahl <nwahl at redhat.com> wrote:
>
> On Fri, Jul 29, 2022 at 12:52 PM Ross Sponholtz <rsponholtz at hotmail.com> wrote:
> >
> > I’m running a RHEL pacemaker cluster on Azure, and I’ve gotten a failure & fencing where I get these messages in the log file:
> >
> >
> > warning: vip_ABC_30_monitor_10000 process (PID 1779737) timed out
> > crit: vip_ABC_30_monitor_10000 process (PID 1779737) will not die!
> >
> >
> >
> > This resource uses the IPaddr2 resource agent.  I’ve looked at the agent code, and I can’t pinpoint any reason it would hang, and since the node gets fenced, I can’t tell why this happens. Any ideas on what kinds of failures could cause this problem?
> >
> >
> >
> > Thanks,
> >
> > Ross
> >
>
> Are you able to reproduce this? I suggest adding `trace_ra=1` to the
> resource configuration in order to determine where it's hanging.
>
> # pcs resource update vip_ABC trace_ra=1
>
> This will produce a shell trace of each operation in
> /var/lib/heartbeat/trace_ra/IPaddr2. This is naturally quite a lot of
> logging, so remove the option when you've gotten what you need.
>
> # pcs resource update vip_ABC trace_ra=
>
> Also discussed in this article (you should have access if you're on RHEL):
> - How can I determine exactly what is happening with every operation
> on a resource in Pacemaker?
> (https://access.redhat.com/solutions/3182931)

You may also want to set on-fail=block for the stop operation to
prevent the node from getting fenced while you troubleshoot this.

# pcs resource update vip_ABC op stop interval=0s \
    timeout=<whatever_the_current_timeout_is> on-fail=block

Other than that, trace_ra=1 will generally tell us quite a lot -- I
just hope that it _does_ get written, given that the child process
becomes unkillable.
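
If the trace does get written, the newest file for the resource is the
one to read first. Here is a minimal sketch for finding it. The helper
name `latest_trace` is hypothetical, and it assumes that trace file
names begin with the resource ID followed by a dot, which is worth
confirming with `ls` on your system.

```shell
#!/usr/bin/env bash
# Print the most recently modified trace file for one resource.
#   $1: resource ID, e.g. vip_ABC
#   $2: trace directory (defaults to the IPaddr2 path mentioned earlier)
# Assumption: trace files are named "<resource ID>.<something>".
latest_trace() {
    local rsc=$1 dir=${2:-/var/lib/heartbeat/trace_ra/IPaddr2}
    local f newest=
    for f in "$dir/$rsc".*; do
        [ -f "$f" ] || continue          # skips the literal pattern when nothing matches
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest=$f
        fi
    done
    [ -n "$newest" ] || return 1         # no trace file found
    printf '%s\n' "$newest"
}
```

For example, `tail -n 20 "$(latest_trace vip_ABC)"` prints the end of
the newest trace. For an operation that hung, the last traced line
should be at or near the command that never returned.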

The IPaddr2 resource agent doesn't do all that much. It runs a few
`ip` commands and sends an ARP refresh. That's about it. I generally
would not expect any of those to hang unless there's a deeper issue.
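
One way to look for that deeper issue, if you can get onto the node
before it is fenced: a process that "will not die" is often stuck in
uninterruptible sleep (state D) inside the kernel, where signals are
not delivered until the blocking call returns. The sketch below reads
the state from /proc on Linux. The function names `proc_state` and
`list_d_state` are hypothetical helpers, not part of Pacemaker or the
resource agents.

```shell
#!/usr/bin/env bash
# Print the one-letter kernel state of a PID (R, S, D, Z, T, ...),
# taken from the "State:" line of /proc/<pid>/status.
proc_state() {
    local pid=$1 key state _
    [ -r "/proc/$pid/status" ] || return 1
    while read -r key state _; do
        if [ "$key" = "State:" ]; then
            printf '%s\n' "$state"
            return 0
        fi
    done < "/proc/$pid/status"
    return 1
}

# List every process currently in uninterruptible sleep (state D),
# one per line: PID followed by its command line (empty for kernel threads).
list_d_state() {
    local dir pid
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        if [ "$(proc_state "$pid" 2>/dev/null)" = "D" ]; then
            printf '%s %s\n' "$pid" "$(tr '\0' ' ' 2>/dev/null < "$dir/cmdline")"
        fi
    done
}
```

With the PID from the log message, `proc_state 1779737` would show
whether the monitor process is in state D. If it is, reading
/proc/1779737/stack as root (where the kernel exposes that file) may
show which kernel call it is blocked in.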

-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker