[ClusterLabs] HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.

Vitaly Zolotusky vitaly at unitc.com
Mon Dec 17 15:43:48 EST 2018


Hello,
I have a two-node cluster with STONITH configured for both SBD and fence_ipmilan.
The fence_ipmilan device for node 1 is configured with no delay and the one for node 2 with a 30-second delay, so that the nodes do not start killing each other during startup.
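For reference, the stonith configuration looks roughly like the sketch below (pcs syntax; the login, password, and the 172.16.1.1 address for ipmi-81 are placeholders, while the ipmi-82 address 172.16.1.2 and the device names are taken from the logs):

  # ipmi-81 fences node 1 with no delay; ipmi-82 fences node 2 with a 30 s
  # delay so that simultaneous startup does not end in mutual fencing.
  pcs stonith create ipmi-81 fence_ipmilan \
      ipaddr=172.16.1.1 login=admin passwd=XXXX lanplus=1 delay=0 \
      pcmk_host_list=rhino66-right.lab.archivas.com
  pcs stonith create ipmi-82 fence_ipmilan \
      ipaddr=172.16.1.2 login=admin passwd=XXXX lanplus=1 delay=30 \
      pcmk_host_list=rhino66-left.lab.archivas.com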
In some cases (usually right after installation, when node 1 comes up first and node 2 second), the node that comes up first (node 1) declares node 2 unclean, but cannot fence it until quorum is reached.
Then, as soon as quorum is reached after corosync starts on node 2, node 1 sends a fence request for node 2.
fence_ipmilan enters its 30-second delay.
Pacemaker starts on node 2.
While fence_ipmilan is still waiting out the delay, crmd on node 1 aborts the transition that requested the fence.
Even though the transition was aborted, node 2 is still fenced when the delay expires.
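For context, the "Waiting for all cluster members ... expected_votes: 2" messages in the logs are what votequorum prints while wait_for_all is in effect; a typical two-node quorum section that enables this implicitly would be (illustrative, not copied from our actual corosync.conf):

  quorum {
      provider: corosync_votequorum
      two_node: 1    # two_node implicitly enables wait_for_all
  }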
Excerpts from the messages are below; I have also attached the full messages from both nodes and the pe-input files from node 1.
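In case it helps, the saved scheduler inputs can be replayed offline with crm_simulate, e.g. (file name taken from the pe-warn message in the log below):

  # Simulate the transition from the saved input and show the resulting status.
  crm_simulate -S -x /var/lib/pacemaker/pengine/pe-warn-9.bz2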
Any suggestions would be appreciated.
Thank you very much for your help!
Vitaly Zolotusky

Here are excerpts from the messages:

Node 1 - controller - rhino66-right 172.18.51.81 - came up first  *****************

Nov 29 16:47:54 rhino66-right pengine[22183]:  warning: Fencing and resource management disabled due to lack of quorum
Nov 29 16:47:54 rhino66-right pengine[22183]:  warning: Node rhino66-left.lab.archivas.com is unclean!
Nov 29 16:47:54 rhino66-right pengine[22183]:   notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
.....
Nov 29 16:48:38 rhino66-right corosync[6677]:   [TOTEM ] A new membership (172.16.1.81:60) was formed. Members joined: 2
Nov 29 16:48:38 rhino66-right corosync[6677]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Nov 29 16:48:38 rhino66-right corosync[6677]:   [QUORUM] This node is within the primary component and will provide service.
Nov 29 16:48:38 rhino66-right corosync[6677]:   [QUORUM] Members[2]: 1 2
Nov 29 16:48:38 rhino66-right corosync[6677]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Quorum acquired
Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Quorum acquired
Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Node (null) state is now member
Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Node (null) state is now member
Nov 29 16:48:54 rhino66-right crmd[22184]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
Nov 29 16:48:54 rhino66-right pengine[22183]:   notice: Watchdog will be used via SBD if fencing is required
Nov 29 16:48:54 rhino66-right pengine[22183]:  warning: Scheduling Node rhino66-left.lab.archivas.com for STONITH
Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Fence (reboot) rhino66-left.lab.archivas.com 'node is unclean'
Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      fence_sbd             ( rhino66-right.lab.archivas.com )
Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      ipmi-82               ( rhino66-right.lab.archivas.com )
Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      S_IP                  ( rhino66-right.lab.archivas.com )
Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      postgres:0            ( rhino66-right.lab.archivas.com )
Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      ethmonitor:0          ( rhino66-right.lab.archivas.com )
Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      fs_monitor:0          ( rhino66-right.lab.archivas.com )   due to unrunnable DBMaster running (blocked)
Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      tomcat-instance:0     ( rhino66-right.lab.archivas.com )   due to unrunnable DBMaster running (blocked)
Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      ClusterMonitor:0      ( rhino66-right.lab.archivas.com )   due to unrunnable DBMaster running (blocked)
Nov 29 16:48:54 rhino66-right pengine[22183]:  warning: Calculated transition 5 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-9.bz2

Nov 29 16:48:54 rhino66-right crmd[22184]:   notice: Requesting fencing (reboot) of node rhino66-left.lab.archivas.com
Nov 29 16:48:54 rhino66-right stonith-ng[22178]:   notice: Client crmd.22184.4aa20c13 wants to fence (reboot) 'rhino66-left.lab.archivas.com' with device '(any)'
Nov 29 16:48:54 rhino66-right stonith-ng[22178]:   notice: Requesting peer fencing (reboot) of rhino66-left.lab.archivas.com
Nov 29 16:48:54 rhino66-right stonith-ng[22178]:   notice: fence_sbd can fence (reboot) rhino66-left.lab.archivas.com: static-list
Nov 29 16:48:54 rhino66-right stonith-ng[22178]:   notice: ipmi-82 can fence (reboot) rhino66-left.lab.archivas.com: static-list
Nov 29 16:48:54 rhino66-right /fence_ipmilan: Delay 30 second(s) before logging in to the fence device
.....
Nov 29 16:49:14 rhino66-right cib[22177]:   notice: Node (null) state is now member
Nov 29 16:49:15 rhino66-right crmd[22184]:   notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:49:16 rhino66-right crmd[22184]:   notice: Transition aborted: Node join

Nov 29 16:49:20 rhino66-right cib[22177]:   notice: Local CIB 1.33.2.abc5436abfebbac946f69a2776e7a73a differs from rhino66-left.lab.archivas.com: 1.31.0.62aaa721579d5d8189b01b400534dc05 0x5568eb66c590
Nov 29 16:49:25 rhino66-right /fence_ipmilan: Executing: /usr/bin/ipmitool -I lanplus -H 172.16.1.2 -p 623 -U admin -P [set] -L ADMINISTRATOR chassis power off

Node 2 - rhino66-left 172.18.51.82 - came up second ********

Nov 29 16:48:38 rhino66-left systemd[1]: Starting Corosync Cluster Engine...
Nov 29 16:48:38 rhino66-left corosync[6217]:   [MAIN  ] Corosync Cluster Engine ('2.4.4'): started and ready to provide service.
Nov 29 16:48:38 rhino66-left corosync[6217]:   [MAIN  ] Corosync built-in features: dbus rdma systemd xmlconf qdevices qnetd snmp libcgroup pie relro bindnow
Nov 29 16:48:38 rhino66-left corosync[6217]:   [MAIN  ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Nov 29 16:48:38 rhino66-left corosync[6217]:   [MAIN  ] Please migrate config file to nodelist.
Nov 29 16:48:38 rhino66-left corosync[6218]:   [TOTEM ] Initializing transport (UDP/IP Unicast).
Nov 29 16:48:38 rhino66-left corosync[6218]:   [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha256
Nov 29 16:48:38 rhino66-left corosync[6218]:   [TOTEM ] The network interface [172.16.1.82] is now up.
Nov 29 16:48:38 rhino66-left corosync[6218]:   [SERV  ] Service engine loaded: corosync configuration map access [0]
Nov 29 16:48:38 rhino66-left corosync[6218]:   [QB    ] server name: cmap
Nov 29 16:48:38 rhino66-left corosync[6218]:   [SERV  ] Service engine loaded: corosync configuration service [1]
Nov 29 16:48:38 rhino66-left corosync[6218]:   [QB    ] server name: cfg
Nov 29 16:48:38 rhino66-left corosync[6218]:   [SERV  ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Nov 29 16:48:38 rhino66-left corosync[6218]:   [QB    ] server name: cpg
Nov 29 16:48:38 rhino66-left corosync[6218]:   [SERV  ] Service engine loaded: corosync profile loading service [4]
Nov 29 16:48:38 rhino66-left corosync[6218]:   [QUORUM] Using quorum provider corosync_votequorum
Nov 29 16:48:38 rhino66-left corosync[6218]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Nov 29 16:48:38 rhino66-left corosync[6218]:   [SERV  ] Service engine loaded: corosync vote quorum service v1.0 [5]
Nov 29 16:48:38 rhino66-left corosync[6218]:   [QB    ] server name: votequorum
Nov 29 16:48:38 rhino66-left corosync[6218]:   [SERV  ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Nov 29 16:48:38 rhino66-left corosync[6218]:   [QB    ] server name: quorum
Nov 29 16:48:38 rhino66-left corosync[6218]:   [TOTEM ] adding new UDPU member {172.16.1.81}
Nov 29 16:48:38 rhino66-left corosync[6218]:   [TOTEM ] adding new UDPU member {172.16.1.82}
Nov 29 16:48:38 rhino66-left corosync[6218]:   [TOTEM ] A new membership (172.16.1.82:56) was formed. Members joined: 2
Nov 29 16:48:38 rhino66-left corosync[6218]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Nov 29 16:48:38 rhino66-left corosync[6218]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Nov 29 16:48:38 rhino66-left corosync[6218]:   [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Nov 29 16:48:38 rhino66-left corosync[6218]:   [QUORUM] Members[1]: 2
Nov 29 16:48:38 rhino66-left corosync[6218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 29 16:48:38 rhino66-left corosync[6218]:   [TOTEM ] A new membership (172.16.1.81:60) was formed. Members joined: 1
Nov 29 16:48:38 rhino66-left corosync[6218]:   [QUORUM] This node is within the primary component and will provide service.
Nov 29 16:48:38 rhino66-left corosync[6218]:   [QUORUM] Members[2]: 1 2
Nov 29 16:48:38 rhino66-left corosync[6218]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 29 16:48:38 rhino66-left corosync[6207]: Starting Corosync Cluster Engine (corosync): [  OK  ]
Nov 29 16:48:38 rhino66-left systemd[1]: Started Corosync Cluster Engine.
......
Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Starting Pacemaker 1.1.18-2.fc28.1
Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Quorum acquired
Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Defaulting to uname -n for the local corosync node name
Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Node (null) state is now member
Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Node rhino66-left.lab.archivas.com state is now member
Nov 29 16:49:14 rhino66-left stonith-ng[21176]:   notice: Connecting to cluster infrastructure: corosync
Nov 29 16:49:14 rhino66-left attrd[21178]:   notice: Connecting to cluster infrastructure: corosync
Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:14 rhino66-left cib[21175]:   notice: Connecting to cluster infrastructure: corosync
Nov 29 16:49:14 rhino66-left stonith-ng[21176]:   notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:49:14 rhino66-left stonith-ng[21176]:   notice: Node (null) state is now member
Nov 29 16:49:14 rhino66-left covermon.sh[21154]: 2018-11-29 16:49:14,464 INFO     at covermon.py line 49 [MainThread:21154] New enclosure discovered: sgDev=/dev/sg7 serial=SGFTJ18263C8B07
Nov 29 16:49:14 rhino66-left attrd[21178]:   notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:49:14 rhino66-left attrd[21178]:   notice: Node (null) state is now member
......
Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Connecting to cluster infrastructure: corosync
Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Defaulting to uname -n for the local corosync node name
Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Quorum acquired
Nov 29 16:49:15 rhino66-left cib[21175]:   notice: Defaulting to uname -n for the local corosync node name
Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:15 rhino66-left attrd[21178]:   notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:15 rhino66-left stonith-ng[21176]:   notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:15 rhino66-left attrd[21178]:   notice: Node (null) state is now member
Nov 29 16:49:15 rhino66-left stonith-ng[21176]:   notice: Node (null) state is now member
Nov 29 16:49:15 rhino66-left stonith-ng[21176]:   notice: Watchdog will be used via SBD if fencing is required
Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Node (null) state is now member
Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Node rhino66-left.lab.archivas.com state is now member
Nov 29 16:49:15 rhino66-left attrd[21178]:   notice: Defaulting to uname -n for the local corosync node name
Nov 29 16:49:15 rhino66-left attrd[21178]:   notice: Recorded attribute writer: rhino66-right.lab.archivas.com
Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Defaulting to uname -n for the local corosync node name
Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: The local CRM is operational
Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: State transition S_STARTING -> S_PENDING
Nov 29 16:49:16 rhino66-left stonith-ng[21176]:   notice: Added 'fence_sbd' to the device list (1 active devices)
Nov 29 16:49:16 rhino66-left crmd[21180]:   notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:17 rhino66-left stonith-ng[21176]:   notice: Added 'ipmi-81' to the device list (2 active devices)
Nov 29 16:49:20 rhino66-left attrd[21178]:   notice: Updating all attributes after cib_refresh_notify event

......

Nov 29 16:49:27 rhino66-left systemd-logind[1539]: System is powering down.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.tar.bz2
Type: application/x-bzip
Size: 125809 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20181217/eedafa55/attachment-0001.bin>

