[ClusterLabs] IPaddr2 resource times out and can't be killed

Mon Aug 1 14:34:57 EDT 2022

On Mon, Aug 1, 2022 at 8:34 AM Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
>
> In clouds you can't just use VIPs.
> Use azure-lb resource instead.

It's not either/or. IPaddr2 is the correct resource agent to create a
VIP on an Azure VM. azure-lb creates an ncat listener to answer Azure
Load Balancer health probe requests.

Azure VMs as cluster members are not quite like AWS and GCP VMs, which
require specialized resource agents for managing VIPs.
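
As a rough sketch of how the two agents usually sit side by side, something
like the following puts both in one group so they always run on the same
node. The IP address, netmask, probe port, and the resource and group names
are placeholders I made up, not values from this thread, and the exact pcs
syntax for placing a resource in a group can differ between pcs versions.
The script only prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Sketch only: every value below is a hypothetical placeholder.
VIP=10.0.0.10        # assumed: frontend IP of the Azure Load Balancer
NETMASK=24           # assumed netmask
PROBE_PORT=61000     # assumed: must match the health probe port set in Azure

# Print each command by default; set APPLY=1 to actually run it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# IPaddr2 puts the address on the active node's interface.
run pcs resource create vip_example IPaddr2 ip="$VIP" \
    cidr_netmask="$NETMASK" --group g_vip_example
# azure-lb answers the load balancer's health probe on that same node.
run pcs resource create lb_example azure-lb port="$PROBE_PORT" \
    --group g_vip_example
```

Grouping them means the health probe listener follows the VIP on failover,
so the load balancer sends traffic to whichever node currently holds the
address.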

>
> Best Regards,
> Strahil Nikolov
>
> On Fri, Jul 29, 2022 at 23:21, Reid Wahl
> <nwahl at redhat.com> wrote:
> On Fri, Jul 29, 2022 at 1:02 PM Reid Wahl <nwahl at redhat.com> wrote:
> >
> > On Fri, Jul 29, 2022 at 12:52 PM Ross Sponholtz <rsponholtz at hotmail.com> wrote:
> > >
> > > I’m running a RHEL Pacemaker cluster on Azure, and I hit a failure followed by fencing, with these messages in the log file:
> > >
> > >
> > > warning: vip_ABC_30_monitor_10000 process (PID 1779737) timed out
> > > crit: vip_ABC_30_monitor_10000 process (PID 1779737) will not die!
> > >
> > >
> > >
> > > This resource uses the IPaddr2 resource agent. I’ve looked at the agent code and can’t pinpoint any reason it would hang, and since the node gets fenced, I can’t tell why this happens. Any ideas on what kinds of failures could cause this problem?
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Ross
> > >
> >
> > Are you able to reproduce this? I suggest adding `trace_ra=1` to the
> > resource configuration in order to determine where it's hanging.
> >
> > # pcs resource update vip_ABC trace_ra=1
> >
> > This will produce a shell trace of each operation in
> > /var/lib/heartbeat/trace_ra/IPaddr2. This is naturally quite a lot of
> > logging, so remove the option when you've gotten what you need.
> >
> > # pcs resource update vip_ABC trace_ra=
> >
> > Also discussed in this article (you should have access if you're on RHEL):
> > - How can I determine exactly what is happening with every operation
> > on a resource in Pacemaker?
> > (https://access.redhat.com/solutions/3182931)
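
Once tracing is on, a quick way to see where an operation stopped is to
look at the tail of the newest trace file. This is only a sketch: it picks
the most recently modified file in the directory mentioned above, which may
belong to a different operation than the one that hung, so check the file
name before drawing conclusions. TRACE_DIR is a variable I added so the
directory can be overridden.

```shell
#!/bin/sh
# Directory named in this thread; override TRACE_DIR to look elsewhere.
TRACE_DIR=${TRACE_DIR:-/var/lib/heartbeat/trace_ra/IPaddr2}

# Print the name of the most recently modified file in a directory.
newest_trace() {
    ls -t "$1" 2>/dev/null | head -n 1
}

latest=$(newest_trace "$TRACE_DIR")
if [ -n "$latest" ]; then
    echo "== $TRACE_DIR/$latest"
    # For a hung operation, the last traced command is the likely suspect.
    tail -n 20 "$TRACE_DIR/$latest"
else
    echo "no trace files in $TRACE_DIR" >&2
fi
```
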
>
> You may also want to set on-fail=block for the stop operation to
> prevent the node from getting fenced while you troubleshoot this.
>
> # pcs resource update vip_ABC op stop interval=0s \
>     timeout=<whatever_the_current_timeout_is> on-fail=block
>
> Other than that, trace_ra=1 will generally tell us quite a lot -- I
> just hope that it _does_ get written, given that the child process
> becomes unkillable.
>
> The IPaddr2 resource agent doesn't do all that much. It runs a few
> `ip` commands and sends an ARP refresh. That's about it. I generally
> would not expect any of those to hang unless there's a deeper issue.
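
One thing worth checking if you catch it while the node is still up: a
process that survives SIGKILL is often blocked inside the kernel in
uninterruptible sleep, which `ps` shows as state D. That is an assumption
about this case, not something the logs above confirm. A minimal sketch for
listing such processes along with the kernel function they are waiting in
(the helper name d_state is mine):

```shell
#!/bin/sh
# Keep only the lines whose second column (process state) starts with D,
# i.e. uninterruptible sleep. The header line is dropped because "STAT"
# does not start with D.
d_state() {
    awk '$2 ~ /^D/'
}

# PID, state, kernel wait channel, and full command line of every process.
ps -eo pid,stat,wchan:32,args | d_state
```

If the monitor's PID, or an `ip` child of it, shows up here, the wait
channel column is a hint about which kernel path it is stuck in.
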
>
> >
> > > _______________________________________________
> > > Manage your subscription:
> > > https://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > > ClusterLabs home: https://www.clusterlabs.org/
> >
> >
> >
> > --
> > Regards,
> >
> > Reid Wahl (He/Him)
> > Senior Software Engineer, Red Hat
> > RHEL High Availability - Pacemaker

-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker