[ClusterLabs] regression in crm_node in 2.1.7/upstream?

Fri Jul 5 10:48:04 UTC 2024

> On Thu, Jul 4, 2024 at 5:03 AM Artur Novik <freishutz at gmail.com> wrote:

>> Hi everybody,
>> I'm facing strange behavior, and since there was a lot of activity
>> around crm_node structs in 2.1.7, I want to believe that it's a regression
>> rather than a new default behavior.
>>
>> "crm_node -i" occasionally, but very often, returns "*exit code 68*: Node
>> is not known to cluster".
>>
>> The quick test below (taken from two different clusters with pacemaker
>> 2.1.7 and 2.1.8):
>>
>> ```
>> [root@node1 ~]# crm_node -i
>> Node is not known to cluster
>> [root@node1 ~]# crm_node -i
>> 1
>> [root@node1 ~]# crm_node -i
>> 1
>> [root@node1 ~]# crm_node -i
>> Node is not known to cluster
>> [root@node1 ~]# for i in 1 2 3 4 5 6 7; do ssh node$i crm_node -i; done
>> 1
>> 2
>> Node is not known to cluster
>> Node is not known to cluster
>> 5
>> Node is not known to cluster
>> 7
>> [root@node1 ~]# for i in 1 2 3 4 5 6 7; do sleep 1; ssh node$i crm_node -i ; done
>> Node is not known to cluster
>> Node is not known to cluster
>> Node is not known to cluster
>> Node is not known to cluster
>> Node is not known to cluster
>> 6
>> 7
>>
>> [root@es-brick2 ~]# crm_node -i
>> 2
>> [root@es-brick2 ~]# crm_node -i
>> 2
>> [root@es-brick2 ~]# crm_node -i
>> Node is not known to cluster
>> [root@es-brick2 ~]# crm_node -i
>> 2
>> [root@es-brick2 ~]# rpm -qa | grep pacemaker | sort
>> pacemaker-2.1.8.rc2-1.el8_10.x86_64
>> pacemaker-cli-2.1.8.rc2-1.el8_10.x86_64
>> pacemaker-cluster-libs-2.1.8.rc2-1.el8_10.x86_64
>> pacemaker-libs-2.1.8.rc2-1.el8_10.x86_64
>> pacemaker-remote-2.1.8.rc2-1.el8_10.x86_64
>> pacemaker-schemas-2.1.8.rc2-1.el8_10.noarch
>> ```
>>
>> I checked the following versions (all packages except the last one were
>> taken from Rocky Linux and rebuilt against corosync 3.1.8 from Rocky 8.10;
>> the distro itself is Rocky Linux 8.10 too):
>>
>> Pacemaker version     Status
>> 2.1.5 (8.8)           OK
>> 2.1.6 (8.9)           OK
>> 2.1.7 (8.10)          Broken
>> 2.1.8-RC2 (upstream)  Broken
>>
>> I'm not attaching logs for now since I believe it can be reproduced
>> on absolutely any installation.
>>
> Hi, thanks for the report. I can try to reproduce on 2.1.8 later, but so
> far I'm unable to reproduce on the current upstream main branch. I don't
> believe there are any major differences in the relevant code between main
> and 2.1.8-rc2.

> I wonder if it's an issue where the controller is busy with a synchronous
> request when you run `crm_node -i` (which would be a bug). Can you share
> logs and your config?

The logs are available on Google Drive, since they are too large to attach:
https://drive.google.com/file/d/1MLgjYncHXrQlZQ2FAmoGp9blvDtS-8RG/view?usp=drive_link   (~65MB with all nodes)
https://drive.google.com/drive/folders/13YYhAtS6zlDjoOOf8ZZQSyfTP_wzLbG_?usp=drive_link  (the directory with logs)

The timestamp and node:
[root@es-brick1 ~]# date
Fri Jul  5 10:02:35 UTC 2024

Since this reproduces on multiple KVMs (RHEL 8, RHEL 9, and Fedora 40), I have
also attached some info from the hypervisor side.
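
To put a number on how often `crm_node -i` fails on a given node, a small loop like the one below could be used. This is only a sketch: `count_failures` is a hypothetical helper name, not something shipped with Pacemaker, and it treats any non-zero exit status (including the exit code 68 seen above) as a failure.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pacemaker): run a command N times and
# report how many of those runs exited with a non-zero status.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard output; only the exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures/$runs failed"
}

# Example usage on a cluster node:
#   count_failures 100 crm_node -i
```

Running it with the same count on each node would make it easier to compare failure rates across nodes and Pacemaker versions.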
> Thanks,
> A
> -- 
> Regards,
>
> Reid Wahl (He/Him)
> Senior Software Engineer, Red Hat
> RHEL High Availability - Pacemaker