[ClusterLabs] regression in crm_node in 2.1.7/upstream?

Reid Wahl nwahl at redhat.com
Thu Jul 11 10:34:23 UTC 2024


On Thu, Jul 11, 2024 at 3:27 AM Reid Wahl <nwahl at redhat.com> wrote:
>
> On Thu, Jul 11, 2024 at 2:56 AM Reid Wahl <nwahl at redhat.com> wrote:
> >
> > On Fri, Jul 5, 2024 at 3:48 AM Artur Novik <freishutz at gmail.com> wrote:
> > >
> > > >On Thu, Jul 4, 2024 at 5:03 AM Artur Novik <freishutz at gmail.com> wrote:
> > >
> > > >> Hi everybody,
> > > >> I ran into some strange behavior, and since there was a lot of activity
> > > >> around crm_node structs in 2.1.7, I want to believe that it's a regression
> > > >> rather than a new default behavior.
> > > >>
> > > >> "crm_node -i" occasionally, but very often, returns "*exit code 68* : Node
> > > >> is not known to cluster".
> > > >>
> > > >> A quick test is shown below (taken from two different clusters, with pacemaker
> > > >> 2.1.7 and 2.1.8):
> > > >>
> > > >> ```
> > > >>
> > > >> [root at node1 ~]# crm_node -i
> > > >> Node is not known to cluster
> > > >> [root at node1 ~]# crm_node -i
> > > >> 1
> > > >> [root at node1 ~]# crm_node -i
> > > >> 1
> > > >> [root at node1 ~]# crm_node -i
> > > >> Node is not known to cluster
> > > >> [root at node1 ~]# for i in 1 2 3 4 5 6 7; do ssh node$i crm_node -i; done
> > > >> 1
> > > >> 2
> > > >> Node is not known to cluster
> > > >> Node is not known to cluster
> > > >> 5
> > > >> Node is not known to cluster
> > > >> 7
> > > >> [root at node1 ~]# for i in 1 2 3 4 5 6 7; do sleep 1; ssh node$i crm_node -i ; done
> > > >> Node is not known to cluster
> > > >> Node is not known to cluster
> > > >> Node is not known to cluster
> > > >> Node is not known to cluster
> > > >> Node is not known to cluster
> > > >> 6
> > > >> 7
> > > >>
> > > >>
> > > >> [root at es-brick2 ~]# crm_node -i
> > > >> 2
> > > >> [root at es-brick2 ~]# crm_node -i
> > > >> 2
> > > >> [root at es-brick2 ~]# crm_node -i
> > > >> Node is not known to cluster
> > > >> [root at es-brick2 ~]# crm_node -i
> > > >> 2
> > > >> [root at es-brick2 ~]# rpm -qa | grep pacemaker | sort
> > > >> pacemaker-2.1.8.rc2-1.el8_10.x86_64
> > > >> pacemaker-cli-2.1.8.rc2-1.el8_10.x86_64
> > > >> pacemaker-cluster-libs-2.1.8.rc2-1.el8_10.x86_64
> > > >> pacemaker-libs-2.1.8.rc2-1.el8_10.x86_64
> > > >> pacemaker-remote-2.1.8.rc2-1.el8_10.x86_64
> > > >> pacemaker-schemas-2.1.8.rc2-1.el8_10.noarch
> > > >>
> > > >> ```
> > > >>
> > > >> I checked the following versions (all packages except the last one were taken
> > > >> from Rocky Linux and rebuilt against corosync 3.1.8 from Rocky 8.10; the distro
> > > >> itself is Rocky Linux 8.10 too):
> > > >> Pacemaker version     Status
> > > >> 2.1.5 (8.8)           OK
> > > >> 2.1.6 (8.9)           OK
> > > >> 2.1.7 (8.10)          Broken
> > > >> 2.1.8-RC2 (upstream)  Broken
> > > >>
> > > >> I'm not attaching logs for now, since I believe this can be reproduced
> > > >> on absolutely any installation.
> > > >>
> > >
> > > > Hi, thanks for the report. I can try to reproduce on 2.1.8 later, but so
> > > > far I'm unable to reproduce on the current upstream main branch. I don't
> > > > believe there are any major differences in the relevant code between main
> > > > and 2.1.8-rc2.
> > >
> > > > I wonder if it's an issue where the controller is busy with a synchronous
> > > > request when you run `crm_node -i` (which would be a bug). Can you share
> > > > logs and your config?
> > >
> > > The logs can be downloaded from Google Drive since they are too large to attach:
> > > https://drive.google.com/file/d/1MLgjYncHXrQlZQ2FAmoGp9blvDtS-8RG/view?usp=drive_link  (~65MB with all nodes)
> > > https://drive.google.com/drive/folders/13YYhAtS6zlDjoOOf8ZZQSyfTP_wzLbG_?usp=drive_link (the directory with logs)
> > >
> > > The timestamp and node:
> > > [root at es-brick1 ~]# date
> > > Fri Jul  5 10:02:35 UTC 2024
> > >
> > > Since this reproduces on multiple KVMs (RHEL 8, 9, and Fedora 40), I attached some info from the hypervisor side too.
> >
> > Thank you for the additional info. We've been looking into this, and
> > so far I'm still unable to reproduce it on my machine. However, I have
> > an idea that it's related to passing a pointer to an uninitialized
> > `nodeid` variable in `print_node_id()` within crm_node.c.
> >
> > Can you run `crm_node -i -VVVVVV` and share the output from a
> > successful run and from a failed run?
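(For anyone who wants to gather this kind of trace themselves, a minimal
sketch; the loop count and log path are arbitrary choices, not part of the
request above:)

```
for i in $(seq 1 20); do
    crm_node -i -VVVVVV > /tmp/crm_node-run-$i.log 2>&1
    echo "run $i exited with $?"
done
```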
>
> Disregard. I can't reproduce it when I build from source, but I can
> reproduce it after I install the pacemaker package from the fedora
> repo via dnf.

The problem is indeed the uninitialized `uint32_t nodeid`. When the
garbage value is less than INT_MAX, it looks like a plausible node ID
that no node actually has, so we get the "not known to cluster" error.
When the garbage value is greater than INT_MAX, it becomes negative once
the controller reads it as a signed int; a negative value is invalid as
an ID, so the controller searches for the local node name instead and
finds the correct result.

We're working on a fix.
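
To make the mechanism above concrete, here is a minimal, self-contained C
sketch of the two regimes. It is not the actual crm_node.c or controller
code; in particular, treating 0 as "unset, resolve the local node" is an
assumption made for the illustration. The point is only how an arbitrary
leftover stack value behaves depending on whether it still fits in a
signed int:

```
#include <inttypes.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>

/* Classify one candidate "leftover" value the way described above:
 * nonzero values up to INT_MAX look like plausible node IDs, while values
 * above INT_MAX turn negative when read back as a signed int. */
static void classify(uint32_t nodeid)
{
    if (nodeid == 0) {
        printf("%10" PRIu32 " -> treated as unset; local node is resolved"
               " (assumed semantics)\n", nodeid);
    } else if (nodeid <= (uint32_t) INT_MAX) {
        printf("%10" PRIu32 " -> looks like a node ID no node has: exit 68,"
               " \"Node is not known to cluster\"\n", nodeid);
    } else {
        printf("%10" PRIu32 " -> negative as a signed int, rejected as an ID;"
               " falls back to a name lookup and succeeds\n", nodeid);
    }
}

int main(void)
{
    classify(0);            /* what an initialized value would give      */
    classify(123456u);      /* garbage <= INT_MAX: spurious exit code 68 */
    classify(3000000000u);  /* garbage >  INT_MAX: works by accident     */
    return 0;
}
```

The middle case is the one producing the intermittent exit code 68 seen in
the reports above; initializing the variable removes both garbage regimes.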




--
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker


