[ClusterLabs] crm node stays online after issuing node standby command

Ayush Siddarath aayush23595 at gmail.com
Wed Mar 15 06:21:56 EDT 2023


Hi All,

We are seeing an issue during crm maintenance operations. As part of our
upgrade process, all cluster nodes are put into standby mode, but occasionally
one of the nodes fails to actually enter standby even though "crm node
standby" returns success.
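
For reference, the standby state can also be read back directly as a node
attribute; a minimal check on the node that stayed online (FILE-2 in the
output below) would look roughly like this (a sketch; crm_attribute ships
with the Pacemaker CLI tools):

  # query the standby node attribute for FILE-2 straight from the CIB
  crm_attribute --node FILE-2 --name standby --query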

Commands issued to put the nodes into standby for maintenance:

[2023-03-15 06:07:08 +0000] [468] [INFO] changed: [FILE-1] => {"changed":
true, "cmd": "/usr/sbin/crm node standby FILE-1", "delta":
"0:00:00.442615", "end": "2023-03-15 06:07:08.150375", "rc": 0, "start":
"2023-03-15 06:07:07.707760", "stderr": "", "stderr_lines": [], "stdout":
"\u001b[32mINFO\u001b[0m: standby node FILE-1", "stdout_lines":
["\u001b[32mINFO\u001b[0m: standby node FILE-1"]}

.

[2023-03-15 06:07:08 +0000] [468] [INFO] changed: [FILE-2] => {"changed":
true, "cmd": "/usr/sbin/crm node standby FILE-2", "delta":
"0:00:00.459407", "end": "2023-03-15 06:07:08.223749", "rc": 0, "start":
"2023-03-15 06:07:07.764342", "stderr": "", "stderr_lines": [], "stdout":
"\u001b[32mINFO\u001b[0m: standby node FILE-2", "stdout_lines":
["\u001b[32mINFO\u001b[0m: standby node FILE-2"]}

      ........

crm status output after the above commands were executed:

FILE-2:/var/log # crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: FILE-1 (version
2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition
with quorum
  * Last updated: Wed Mar 15 08:32:27 2023
  * Last change:  Wed Mar 15 06:07:08 2023 by root via cibadmin on FILE-4
  * 4 nodes configured
  * 11 resource instances configured (5 DISABLED)


Node List:
  * Node FILE-1: standby (with active resources)
  * Node FILE-3: standby (with active resources)
  * Node FILE-4: standby (with active resources)
  * Online: [ FILE-2 ]
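
For comparison, crmsh's own view of the node definition (including any
standby attribute) can be dumped as well; something like this (a sketch):

  # show the node definition and its attributes as crmsh sees them
  crm node show FILE-2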


The Pacemaker log on FILE-2 shows that the CIB updates setting standby were
received for all four nodes, including FILE-2 itself:

FILE-2:/var/log # grep standby /var/log/pacemaker/pacemaker.log
Mar 15 06:07:08.098 FILE-2 pacemaker-based     [8635] (cib_perform_op)
 info: ++                                            <nvpair
id="num-1-instance_attributes-standby" name="standby" value="on"/>
Mar 15 06:07:08.166 FILE-2 pacemaker-based     [8635] (cib_perform_op)
 info: ++                                            <nvpair
id="num-3-instance_attributes-standby" name="standby" value="on"/>
Mar 15 06:07:08.170 FILE-2 pacemaker-based     [8635] (cib_perform_op)
 info: ++                                            <nvpair
id="num-2-instance_attributes-standby" name="standby" value="on"/>
Mar 15 06:07:08.230 FILE-2 pacemaker-based     [8635] (cib_perform_op)
 info: ++                                            <nvpair
id="num-4-instance_attributes-standby" name="standby" value="on"/>



The issue is quite intermittent and has been observed on other nodes as well.
We have seen a similar problem when taking nodes out of standby mode (using
"crm node online"): one or more nodes occasionally fail to leave standby.

We suspect it could be related to executing the node standby/online commands
for all nodes in parallel, but the issue was not observed with the Pacemaker
packaged with SLES15 SP2.
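
If parallel execution is indeed the trigger, an obvious (untested) workaround
would be to serialize the calls from a single node, roughly:

  # sketch: put all nodes into standby one after another from one host
  for n in FILE-1 FILE-2 FILE-3 FILE-4; do
      /usr/sbin/crm node standby "$n"
  done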

I'm attaching the pacemaker.log from FILE-2 for analysis. Let us know if
any additional information is required.

OS: SLES15 SP4
Pacemaker version (crmadmin --version):
Pacemaker 2.1.2+20211124.ada5c3b36-150400.2.43

Thanks,
Ayush
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pacemaker.log
Type: application/octet-stream
Size: 253338 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20230315/5baf97da/attachment-0001.obj>

