[ClusterLabs] regression in crm_node in 2.1.7/upstream?

Reid Wahl nwahl at redhat.com
Thu Jul 11 09:56:37 UTC 2024


On Fri, Jul 5, 2024 at 3:48 AM Artur Novik <freishutz at gmail.com> wrote:
>
> >On Thu, Jul 4, 2024 at 5:03 AM Artur Novik <freishutz at gmail.com> wrote:
>
> >> Hi everybody,
> >> I faced a strange behavior, and since there was a lot of activity
> >> around the crm_node structs in 2.1.7, I want to believe that it's a
> >> regression rather than new default behavior.
> >>
> >> "crm_node -i" occasionally, but very often, returns "*exit code 68* : Node
> >> is not known to cluster".
> >>
> >> The quick tests below were taken from two different clusters running
> >> pacemaker 2.1.7 and 2.1.8:
> >>
> >> ```
> >>
> >> [root at node1 ~]# crm_node -i
> >> Node is not known to cluster
> >> [root at node1 ~]# crm_node -i
> >> 1
> >> [root at node1 ~]# crm_node -i
> >> 1
> >> [root at node1 ~]# crm_node -i
> >> Node is not known to cluster
> >> [root at node1 ~]# for i in 1 2 3 4 5 6 7; do ssh node$i crm_node -i; done
> >> 1
> >> 2
> >> Node is not known to cluster
> >> Node is not known to cluster
> >> 5
> >> Node is not known to cluster
> >> 7
> >> [root at node1 ~]# for i in 1 2 3 4 5 6 7; do sleep 1; ssh node$i crm_node -i ; done
> >> Node is not known to cluster
> >> Node is not known to cluster
> >> Node is not known to cluster
> >> Node is not known to cluster
> >> Node is not known to cluster
> >> 6
> >> 7
> >>
> >>
> >> [root at es-brick2 ~]# crm_node -i
> >> 2
> >> [root at es-brick2 ~]# crm_node -i
> >> 2
> >> [root at es-brick2 ~]# crm_node -i
> >> Node is not known to cluster
> >> [root at es-brick2 ~]# crm_node -i
> >> 2
> >> [root at es-brick2 ~]# rpm -qa | grep pacemaker | sort
> >> pacemaker-2.1.8.rc2-1.el8_10.x86_64
> >> pacemaker-cli-2.1.8.rc2-1.el8_10.x86_64
> >> pacemaker-cluster-libs-2.1.8.rc2-1.el8_10.x86_64
> >> pacemaker-libs-2.1.8.rc2-1.el8_10.x86_64
> >> pacemaker-remote-2.1.8.rc2-1.el8_10.x86_64
> >> pacemaker-schemas-2.1.8.rc2-1.el8_10.noarch
> >>
> >> ```
> >>
> >> I checked the following versions (all packages except the last one were
> >> taken from Rocky Linux and rebuilt against corosync 3.1.8 from Rocky
> >> 8.10; the distro itself is Rocky Linux 8.10 too):
> >> Pacemaker version      Status
> >> 2.1.5 (8.8)            OK
> >> 2.1.6 (8.9)            OK
> >> 2.1.7 (8.10)           Broken
> >> 2.1.8-RC2 (upstream)   Broken
> >>
> >> I'm not attaching logs for now since I believe it can be reproduced on
> >> absolutely any installation.
> >>
>
> > Hi, thanks for the report. I can try to reproduce on 2.1.8 later, but so
> > far I'm unable to reproduce on the current upstream main branch. I don't
> > believe there are any major differences in the relevant code between main
> > and 2.1.8-rc2.
>
> > I wonder if it's an issue where the controller is busy with a synchronous
> > request when you run `crm_node -i` (which would be a bug). Can you share
> > logs and your config?
>
> The logs can be downloaded from Google Drive since they are too large to attach:
> https://drive.google.com/file/d/1MLgjYncHXrQlZQ2FAmoGp9blvDtS-8RG/view?usp=drive_link  (~65MB with all nodes)
> https://drive.google.com/drive/folders/13YYhAtS6zlDjoOOf8ZZQSyfTP_wzLbG_?usp=drive_link (the directory with logs)
>
> The timestamp and node:
> [root at es-brick1 ~]# date
> Fri Jul  5 10:02:35 UTC 2024
>
> Since this reproduces on multiple KVM guests (RHEL 8, RHEL 9, and Fedora 40), I attached some info from the hypervisor side too.

Thank you for the additional info. We've been looking into this, and
so far I'm still unable to reproduce it on my machine. However, I
suspect it's related to passing a pointer to an uninitialized
`nodeid` variable in `print_node_id()` within crm_node.c.
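To illustrate the failure mode I have in mind, here's a minimal,
self-contained sketch. This is not the actual crm_node.c code:
`query_local_node_id()` is a hypothetical stand-in for the controller
query, and I'm assuming for the example that the caller treats an ID
of 0 as unknown:

```c
/*
 * Hypothetical sketch of the suspected pattern -- NOT the actual
 * crm_node.c code.  query_local_node_id() stands in for the
 * controller IPC query.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical query: on some paths it reports success without ever
 * writing to *id, which mimics the suspected reply handling. */
static int
query_local_node_id(uint32_t *id)
{
    if (rand() % 2) {
        *id = 1;    /* normal path: the reply fills in the node ID */
    }
    /* buggy path: "success" is returned, but *id was never assigned */
    return 0;
}

int
main(void)
{
    /* BUG: uninitialized.  If the query "succeeds" without writing to
     * it, the caller reads indeterminate stack contents, so the result
     * can flip between a valid-looking ID and "unknown" from run to
     * run.  Initializing it (uint32_t nodeid = 0;) would at least make
     * the failure deterministic. */
    uint32_t nodeid;

    srand((unsigned int) time(NULL));

    if ((query_local_node_id(&nodeid) != 0) || (nodeid == 0)) {
        fprintf(stderr, "Node is not known to cluster\n");
        return 68;  /* the exit code you're seeing */
    }
    printf("%" PRIu32 "\n", nodeid);
    return 0;
}
```

Whether the uninitialized value happens to look like a valid ID or
like "unknown" would depend on whatever was left on the stack, which
would fit the intermittent behavior you're seeing.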

Can you run `crm_node -i -VVVVVV` and share the output from a
successful run and from a failed run?

>
> Thanks,
> A



-- 
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker


