[ClusterLabs] HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.
Vitaly Zolotusky
vitaly at unitc.com
Mon Dec 17 15:43:48 EST 2018
Hello,
I have a 2 node cluster and stonith is configured for SBD and fence_ipmilan.
fence_ipmilan for node 1 is configured for 0 delay and for node 2 for 30 sec delay so that nodes do not start killing each other during startup.
In some cases (usually right after installation and when node 1 comes up first and node 2 second) the node that comes up first (node 1) states that node 2 is unclean, but can't fence it until quorum reached.
Then as soon as quorum is reached after startup of corosync on node 2 it sends a fence request for node 2.
Fence_ipmilan gets into 30 sec delay.
Pacemaker gets started on node 2.
While fence_ipmilan is still waiting for the delay node 1 crmd aborts transition that requested the fence.
Even though the transition was aborted, when delay time expires node 2 gets fenced.
Excerpts from messages are below. I also attached messages from both nodes and pe-input files from node 1.
Any suggestions would be appreciated.
Thank you very much for your help!
Vitaly Zolotusky
Here are excerpts from the messages:
Node 1 - controller - rhino66-right 172.18.51.81 - came up first *****************
Nov 29 16:47:54 rhino66-right pengine[22183]: warning: Fencing and resource management disabled due to lack of quorum
Nov 29 16:47:54 rhino66-right pengine[22183]: warning: Node rhino66-left.lab.archivas.com is unclean!
Nov 29 16:47:54 rhino66-right pengine[22183]: notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
.....
Nov 29 16:48:38 rhino66-right corosync[6677]: [TOTEM ] A new membership (172.16.1.81:60) was formed. Members joined: 2
Nov 29 16:48:38 rhino66-right corosync[6677]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Nov 29 16:48:38 rhino66-right corosync[6677]: [QUORUM] This node is within the primary component and will provide service.
Nov 29 16:48:38 rhino66-right corosync[6677]: [QUORUM] Members[2]: 1 2
Nov 29 16:48:38 rhino66-right corosync[6677]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Quorum acquired
Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Quorum acquired
Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:48:38 rhino66-right crmd[22184]: notice: Node (null) state is now member
Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:48:38 rhino66-right pacemakerd[22152]: notice: Node (null) state is now member
Nov 29 16:48:54 rhino66-right crmd[22184]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Nov 29 16:48:54 rhino66-right pengine[22183]: notice: Watchdog will be used via SBD if fencing is required
Nov 29 16:48:54 rhino66-right pengine[22183]: warning: Scheduling Node rhino66-left.lab.archivas.com for STONITH
Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Fence (reboot) rhino66-left.lab.archivas.com 'node is unclean'
Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start fence_sbd ( rhino66-right.lab.archivas.com )
Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start ipmi-82 ( rhino66-right.lab.archivas.com )
Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start S_IP ( rhino66-right.lab.archivas.com )
Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start postgres:0 ( rhino66-right.lab.archivas.com )
Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start ethmonitor:0 ( rhino66-right.lab.archivas.com )
Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start fs_monitor:0 ( rhino66-right.lab.archivas.com ) due to unrunnable DBMaster running (blocked)
Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start tomcat-instance:0 ( rhino66-right.lab.archivas.com ) due to unrunnable DBMaster running (blocked)
Nov 29 16:48:54 rhino66-right pengine[22183]: notice: * Start ClusterMonitor:0 ( rhino66-right.lab.archivas.com ) due to unrunnable DBMaster running (blocked)
Nov 29 16:48:54 rhino66-right pengine[22183]: warning: Calculated transition 5 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-9.bz2
Nov 29 16:48:54 rhino66-right crmd[22184]: notice: Requesting fencing (reboot) of node rhino66-left.lab.archivas.com
Nov 29 16:48:54 rhino66-right stonith-ng[22178]: notice: Client crmd.22184.4aa20c13 wants to fence (reboot) 'rhino66-left.lab.archivas.com' with device '(any)'
Nov 29 16:48:54 rhino66-right stonith-ng[22178]: notice: Requesting peer fencing (reboot) of rhino66-left.lab.archivas.com
Nov 29 16:48:54 rhino66-right stonith-ng[22178]: notice: fence_sbd can fence (reboot) rhino66-left.lab.archivas.com: static-list
Nov 29 16:48:54 rhino66-right stonith-ng[22178]: notice: ipmi-82 can fence (reboot) rhino66-left.lab.archivas.com: static-list
Nov 29 16:48:54 rhino66-right /fence_ipmilan: Delay 30 second(s) before logging in to the fence device
.....
Nov 29 16:49:14 rhino66-right cib[22177]: notice: Node (null) state is now member
Nov 29 16:49:15 rhino66-right crmd[22184]: notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:49:16 rhino66-right crmd[22184]: notice: Transition aborted: Node join
Nov 29 16:49:20 rhino66-right cib[22177]: notice: Local CIB 1.33.2.abc5436abfebbac946f69a2776e7a73a differs from rhino66-left.lab.archivas.com: 1.31.0.62aaa721579d5d8189b01b400534dc05 0x5568eb66c590
Nov 29 16:49:25 rhino66-right /fence_ipmilan: Executing: /usr/bin/ipmitool -I lanplus -H 172.16.1.2 -p 623 -U admin -P [set] -L ADMINISTRATOR chassis power off
Node 2 - rhino66-left 172.18.51.82 - comes up second ********
Nov 29 16:48:38 rhino66-left systemd[1]: Starting Corosync Cluster Engine...
Nov 29 16:48:38 rhino66-left corosync[6217]: [MAIN ] Corosync Cluster Engine ('2.4.4'): started and ready to provide service.
Nov 29 16:48:38 rhino66-left corosync[6217]: [MAIN ] Corosync built-in features: dbus rdma systemd xmlconf qdevices qnetd snmp libcgroup pie relro bindnow
Nov 29 16:48:38 rhino66-left corosync[6217]: [MAIN ] interface section bindnetaddr is used together with nodelist. Nodelist one is going to be used.
Nov 29 16:48:38 rhino66-left corosync[6217]: [MAIN ] Please migrate config file to nodelist.
Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] Initializing transmit/receive security (NSS) crypto: aes256 hash: sha256
Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] The network interface [172.16.1.82] is now up.
Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync configuration map access [0]
Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: cmap
Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync configuration service [1]
Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: cfg
Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01 [2]
Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: cpg
Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync profile loading service [4]
Nov 29 16:48:38 rhino66-left corosync[6218]: [QUORUM] Using quorum provider corosync_votequorum
Nov 29 16:48:38 rhino66-left corosync[6218]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: votequorum
Nov 29 16:48:38 rhino66-left corosync[6218]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1 [3]
Nov 29 16:48:38 rhino66-left corosync[6218]: [QB ] server name: quorum
Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] adding new UDPU member {172.16.1.81}
Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] adding new UDPU member {172.16.1.82}
Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] A new membership (172.16.1.82:56) was formed. Members joined: 2
Nov 29 16:48:38 rhino66-left corosync[6218]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Nov 29 16:48:38 rhino66-left corosync[6218]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Nov 29 16:48:38 rhino66-left corosync[6218]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Nov 29 16:48:38 rhino66-left corosync[6218]: [QUORUM] Members[1]: 2
Nov 29 16:48:38 rhino66-left corosync[6218]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 29 16:48:38 rhino66-left corosync[6218]: [TOTEM ] A new membership (172.16.1.81:60) was formed. Members joined: 1
Nov 29 16:48:38 rhino66-left corosync[6218]: [QUORUM] This node is within the primary component and will provide service.
Nov 29 16:48:38 rhino66-left corosync[6218]: [QUORUM] Members[2]: 1 2
Nov 29 16:48:38 rhino66-left corosync[6218]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 29 16:48:38 rhino66-left corosync[6207]: Starting Corosync Cluster Engine (corosync): [ OK ]
Nov 29 16:48:38 rhino66-left systemd[1]: Started Corosync Cluster Engine.
......
Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Starting Pacemaker 1.1.18-2.fc28.1
Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Quorum acquired
Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Defaulting to uname -n for the local corosync node name
Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Node (null) state is now member
Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Node rhino66-left.lab.archivas.com state is now member
Nov 29 16:49:14 rhino66-left stonith-ng[21176]: notice: Connecting to cluster infrastructure: corosync
Nov 29 16:49:14 rhino66-left attrd[21178]: notice: Connecting to cluster infrastructure: corosync
Nov 29 16:49:14 rhino66-left pacemakerd[21153]: notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:14 rhino66-left cib[21175]: notice: Connecting to cluster infrastructure: corosync
Nov 29 16:49:14 rhino66-left stonith-ng[21176]: notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:49:14 rhino66-left stonith-ng[21176]: notice: Node (null) state is now member
Nov 29 16:49:14 rhino66-left covermon.sh[21154]: 2018-11-29 16:49:14,464 INFO at covermon.py line 49 [MainThread:21154] New enclosure discovered: sgDev=/dev/sg7 serial=SGFTJ18263C8B07
Nov 29 16:49:14 rhino66-left attrd[21178]: notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:49:14 rhino66-left attrd[21178]: notice: Node (null) state is now member
......
Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Connecting to cluster infrastructure: corosync
Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Could not obtain a node name for corosync nodeid 2
Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Defaulting to uname -n for the local corosync node name
Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Quorum acquired
Nov 29 16:49:15 rhino66-left cib[21175]: notice: Defaulting to uname -n for the local corosync node name
Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:15 rhino66-left attrd[21178]: notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:15 rhino66-left stonith-ng[21176]: notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:15 rhino66-left attrd[21178]: notice: Node (null) state is now member
Nov 29 16:49:15 rhino66-left stonith-ng[21176]: notice: Node (null) state is now member
Nov 29 16:49:15 rhino66-left stonith-ng[21176]: notice: Watchdog will be used via SBD if fencing is required
Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Node (null) state is now member
Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Node rhino66-left.lab.archivas.com state is now member
Nov 29 16:49:15 rhino66-left attrd[21178]: notice: Defaulting to uname -n for the local corosync node name
Nov 29 16:49:15 rhino66-left attrd[21178]: notice: Recorded attribute writer: rhino66-right.lab.archivas.com
Nov 29 16:49:15 rhino66-left crmd[21180]: notice: Defaulting to uname -n for the local corosync node name
Nov 29 16:49:15 rhino66-left crmd[21180]: notice: The local CRM is operational
Nov 29 16:49:15 rhino66-left crmd[21180]: notice: State transition S_STARTING -> S_PENDING
Nov 29 16:49:16 rhino66-left stonith-ng[21176]: notice: Added 'fence_sbd' to the device list (1 active devices)
Nov 29 16:49:16 rhino66-left crmd[21180]: notice: Could not obtain a node name for corosync nodeid 1
Nov 29 16:49:17 rhino66-left stonith-ng[21176]: notice: Added 'ipmi-81' to the device list (2 active devices)
Nov 29 16:49:20 rhino66-left attrd[21178]: notice: Updating all attributes after cib_refresh_notify event
......
Nov 29 16:49:27 rhino66-left systemd-logind[1539]: System is powering down.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.tar.bz2
Type: application/x-bzip
Size: 125809 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20181217/eedafa55/attachment-0001.bin>
More information about the Users
mailing list