[ClusterLabs] Antw: HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Dec 18 01:47:45 EST 2018


>>> Vitaly Zolotusky <vitaly at unitc.com> wrote on 17.12.2018 at 21:43 in message
<1782126841.215210.1545079428693 at webmail6.networksolutionsemail.com>:
> Hello,
> I have a 2 node cluster and stonith is configured for SBD and fence_ipmilan.
> fence_ipmilan for node 1 is configured for 0 delay and for node 2 for 30 sec 
> delay so that nodes do not start killing each other during startup.
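[Editor's note: a static per-node fencing delay like the one described is typically set through the fence agent's `delay` parameter on each stonith device. A sketch only; the device names, IPs and credentials below are assumed, not taken from this cluster:]

```shell
# Hypothetical pcs commands -- adjust names, IPs and credentials.
# Node 1's fence device fires immediately; node 2's waits 30 s,
# so in a startup fence race node 1 always wins.
pcs stonith create ipmi-81 fence_ipmilan \
    ip=172.16.1.1 username=admin password=secret \
    pcmk_host_list=node1 delay=0
pcs stonith create ipmi-82 fence_ipmilan \
    ip=172.16.1.2 username=admin password=secret \
    pcmk_host_list=node2 delay=30
```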
> In some cases (usually right after installation, when node 1 comes up
> first and node 2 second), the node that comes up first (node 1) declares
> node 2 unclean, but cannot fence it until quorum is reached.
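[Editor's note: the "cannot fence until quorum is attained" behaviour matches corosync votequorum in a two-node setup. For reference, such a setup is usually declared along these lines; this is a sketch, not the poster's actual corosync.conf:]

```
quorum {
    provider: corosync_votequorum
    two_node: 1
    # two_node implies wait_for_all: a freshly booted node must see
    # its peer at least once before it can claim quorum, which is why
    # the logs below show "Waiting for all cluster members" until
    # node 2 joins the membership.
}
```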

I'd concentrate on examining why node2 is considered unclean. Of course that doesn't fix the issue, but if fixing it takes some time, you'll have a work-around ;-)
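To dig into that, the attached pe-input files can be replayed offline with crm_simulate; something like the following, with the file name taken from the node 1 log excerpt below:

```shell
# Replay the scheduler input that scheduled the fencing and show the
# allocation scores, to see why node 2 was considered unclean.
crm_simulate --simulate --show-scores \
    --xml-file /var/lib/pacemaker/pengine/pe-warn-9.bz2
```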

> Then, as soon as quorum is reached after corosync starts on node 2, it
> sends a fence request for node 2.
> fence_ipmilan enters its 30-second delay.
> Pacemaker gets started on node 2.
> While fence_ipmilan is still waiting out the delay, the crmd on node 1
> aborts the transition that requested the fence.
> Even though the transition was aborted, node 2 still gets fenced when
> the delay expires.
> Excerpts from messages are below. I also attached messages from both nodes 
> and pe-input files from node 1.
> Any suggestions would be appreciated.
> Thank you very much for your help!
> Vitaly Zolotusky
> 
> Here are excerpts from the messages:
> 
> Node 1 - controller - rhino66-right 172.18.51.81 - came up first  
> *****************
> 
> Nov 29 16:47:54 rhino66-right pengine[22183]:  warning: Fencing and resource 
> management disabled due to lack of quorum
> Nov 29 16:47:54 rhino66-right pengine[22183]:  warning: Node 
> rhino66-left.lab.archivas.com is unclean!
> Nov 29 16:47:54 rhino66-right pengine[22183]:   notice: Cannot fence unclean 
> nodes until quorum is attained (or no-quorum-policy is set to ignore)
> .....
> Nov 29 16:48:38 rhino66-right corosync[6677]:   [TOTEM ] A new membership 
> (172.16.1.81:60) was formed. Members joined: 2
> Nov 29 16:48:38 rhino66-right corosync[6677]:   [VOTEQ ] Waiting for all 
> cluster members. Current votes: 1 expected_votes: 2
> Nov 29 16:48:38 rhino66-right corosync[6677]:   [QUORUM] This node is within 
> the primary component and will provide service.
> Nov 29 16:48:38 rhino66-right corosync[6677]:   [QUORUM] Members[2]: 1 2
> Nov 29 16:48:38 rhino66-right corosync[6677]:   [MAIN  ] Completed service 
> synchronization, ready to provide service.
> Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Quorum acquired
> Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Quorum acquired
> Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Could not obtain a node 
> name for corosync nodeid 2
> Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Could not obtain 
> a node name for corosync nodeid 2
> Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Could not obtain a node 
> name for corosync nodeid 2
> Nov 29 16:48:38 rhino66-right crmd[22184]:   notice: Node (null) state is 
> now member
> Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Could not obtain 
> a node name for corosync nodeid 2
> Nov 29 16:48:38 rhino66-right pacemakerd[22152]:   notice: Node (null) state 
> is now member
> Nov 29 16:48:54 rhino66-right crmd[22184]:   notice: State transition S_IDLE 
>-> S_POLICY_ENGINE
> Nov 29 16:48:54 rhino66-right pengine[22183]:   notice: Watchdog will be 
> used via SBD if fencing is required
> Nov 29 16:48:54 rhino66-right pengine[22183]:  warning: Scheduling Node 
> rhino66-left.lab.archivas.com for STONITH
> Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Fence (reboot) 
> rhino66-left.lab.archivas.com 'node is unclean'
> Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      
> fence_sbd             ( rhino66-right.lab.archivas.com )
> Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      
> ipmi-82               ( rhino66-right.lab.archivas.com )
> Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      S_IP   
>                ( rhino66-right.lab.archivas.com )
> Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      
> postgres:0            ( rhino66-right.lab.archivas.com )
> Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      
> ethmonitor:0          ( rhino66-right.lab.archivas.com )
> Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      
> fs_monitor:0          ( rhino66-right.lab.archivas.com )   due to unrunnable 
> DBMaster running (blocked)
> Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      
> tomcat-instance:0     ( rhino66-right.lab.archivas.com )   due to unrunnable 
> DBMaster running (blocked)
> Nov 29 16:48:54 rhino66-right pengine[22183]:   notice:  * Start      
> ClusterMonitor:0      ( rhino66-right.lab.archivas.com )   due to unrunnable 
> DBMaster running (blocked)
> Nov 29 16:48:54 rhino66-right pengine[22183]:  warning: Calculated 
> transition 5 (with warnings), saving inputs in 
> /var/lib/pacemaker/pengine/pe-warn-9.bz2
> 
> Nov 29 16:48:54 rhino66-right crmd[22184]:   notice: Requesting fencing 
> (reboot) of node rhino66-left.lab.archivas.com
> Nov 29 16:48:54 rhino66-right stonith-ng[22178]:   notice: Client 
> crmd.22184.4aa20c13 wants to fence (reboot) 'rhino66-left.lab.archivas.com' 
> with device '(any)'
> Nov 29 16:48:54 rhino66-right stonith-ng[22178]:   notice: Requesting peer 
> fencing (reboot) of rhino66-left.lab.archivas.com
> Nov 29 16:48:54 rhino66-right stonith-ng[22178]:   notice: fence_sbd can 
> fence (reboot) rhino66-left.lab.archivas.com: static-list
> Nov 29 16:48:54 rhino66-right stonith-ng[22178]:   notice: ipmi-82 can fence 
> (reboot) rhino66-left.lab.archivas.com: static-list
> Nov 29 16:48:54 rhino66-right /fence_ipmilan: Delay 30 second(s) before 
> logging in to the fence device
> .....
> Nov 29 16:49:14 rhino66-right cib[22177]:   notice: Node (null) state is now 
> member
> Nov 29 16:49:15 rhino66-right crmd[22184]:   notice: Could not obtain a node 
> name for corosync nodeid 2
> Nov 29 16:49:16 rhino66-right crmd[22184]:   notice: Transition aborted: 
> Node join
> 
> Nov 29 16:49:20 rhino66-right cib[22177]:   notice: Local CIB 
> 1.33.2.abc5436abfebbac946f69a2776e7a73a differs from 
> rhino66-left.lab.archivas.com: 1.31.0.62aaa721579d5d8189b01b400534dc05 
> 0x5568eb66c590
> Nov 29 16:49:25 rhino66-right /fence_ipmilan: Executing: /usr/bin/ipmitool 
> -I lanplus -H 172.16.1.2 -p 623 -U admin -P [set] -L ADMINISTRATOR chassis 
> power off
> 
> Node 2 - rhino66-left 172.18.51.82 - comes up second ********
> 
> Nov 29 16:48:38 rhino66-left systemd[1]: Starting Corosync Cluster Engine...
> Nov 29 16:48:38 rhino66-left corosync[6217]:   [MAIN  ] Corosync Cluster 
> Engine ('2.4.4'): started and ready to provide service.
> Nov 29 16:48:38 rhino66-left corosync[6217]:   [MAIN  ] Corosync built-in 
> features: dbus rdma systemd xmlconf qdevices qnetd snmp libcgroup pie relro 
> bindnow
> Nov 29 16:48:38 rhino66-left corosync[6217]:   [MAIN  ] interface section 
> bindnetaddr is used together with nodelist. Nodelist one is going to be used.
> Nov 29 16:48:38 rhino66-left corosync[6217]:   [MAIN  ] Please migrate 
> config file to nodelist.
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [TOTEM ] Initializing 
> transport (UDP/IP Unicast).
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [TOTEM ] Initializing 
> transmit/receive security (NSS) crypto: aes256 hash: sha256
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [TOTEM ] The network 
> interface [172.16.1.82] is now up.
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [SERV  ] Service engine 
> loaded: corosync configuration map access [0]
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [QB    ] server name: cmap
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [SERV  ] Service engine 
> loaded: corosync configuration service [1]
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [QB    ] server name: cfg
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [SERV  ] Service engine 
> loaded: corosync cluster closed process group service v1.01 [2]
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [QB    ] server name: cpg
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [SERV  ] Service engine 
> loaded: corosync profile loading service [4]
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [QUORUM] Using quorum 
> provider corosync_votequorum
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [VOTEQ ] Waiting for all 
> cluster members. Current votes: 1 expected_votes: 2
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [SERV  ] Service engine 
> loaded: corosync vote quorum service v1.0 [5]
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [QB    ] server name: 
> votequorum
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [SERV  ] Service engine 
> loaded: corosync cluster quorum service v0.1 [3]
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [QB    ] server name: quorum
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [TOTEM ] adding new UDPU 
> member {172.16.1.81}
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [TOTEM ] adding new UDPU 
> member {172.16.1.82}
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [TOTEM ] A new membership 
> (172.16.1.82:56) was formed. Members joined: 2
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [VOTEQ ] Waiting for all 
> cluster members. Current votes: 1 expected_votes: 2
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [VOTEQ ] Waiting for all 
> cluster members. Current votes: 1 expected_votes: 2
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [VOTEQ ] Waiting for all 
> cluster members. Current votes: 1 expected_votes: 2
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [QUORUM] Members[1]: 2
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [MAIN  ] Completed service 
> synchronization, ready to provide service.
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [TOTEM ] A new membership 
> (172.16.1.81:60) was formed. Members joined: 1
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [QUORUM] This node is within 
> the primary component and will provide service.
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [QUORUM] Members[2]: 1 2
> Nov 29 16:48:38 rhino66-left corosync[6218]:   [MAIN  ] Completed service 
> synchronization, ready to provide service.
> Nov 29 16:48:38 rhino66-left corosync[6207]: Starting Corosync Cluster 
> Engine (corosync): [  OK  ]
> Nov 29 16:48:38 rhino66-left systemd[1]: Started Corosync Cluster Engine.
> ......
> Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Starting Pacemaker 
> 1.1.18-2.fc28.1
> Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Could not obtain a 
> node name for corosync nodeid 2
> Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Quorum acquired
> Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Defaulting to 
> uname -n for the local corosync node name
> Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Could not obtain a 
> node name for corosync nodeid 1
> Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Could not obtain a 
> node name for corosync nodeid 1
> Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Node (null) state 
> is now member
> Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Node 
> rhino66-left.lab.archivas.com state is now member
> Nov 29 16:49:14 rhino66-left stonith-ng[21176]:   notice: Connecting to 
> cluster infrastructure: corosync
> Nov 29 16:49:14 rhino66-left attrd[21178]:   notice: Connecting to cluster 
> infrastructure: corosync
> Nov 29 16:49:14 rhino66-left pacemakerd[21153]:   notice: Could not obtain a 
> node name for corosync nodeid 1
> Nov 29 16:49:14 rhino66-left cib[21175]:   notice: Connecting to cluster 
> infrastructure: corosync
> Nov 29 16:49:14 rhino66-left stonith-ng[21176]:   notice: Could not obtain a 
> node name for corosync nodeid 2
> Nov 29 16:49:14 rhino66-left stonith-ng[21176]:   notice: Node (null) state 
> is now member
> Nov 29 16:49:14 rhino66-left covermon.sh[21154]: 2018-11-29 16:49:14,464 
> INFO     at covermon.py line 49 [MainThread:21154] New enclosure discovered: 
> sgDev=/dev/sg7 serial=SGFTJ18263C8B07
> Nov 29 16:49:14 rhino66-left attrd[21178]:   notice: Could not obtain a node 
> name for corosync nodeid 2
> Nov 29 16:49:14 rhino66-left attrd[21178]:   notice: Node (null) state is 
> now member
> ......
> Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Connecting to cluster 
> infrastructure: corosync
> Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Could not obtain a node 
> name for corosync nodeid 2
> Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Defaulting to uname -n 
> for the local corosync node name
> Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Quorum acquired
> Nov 29 16:49:15 rhino66-left cib[21175]:   notice: Defaulting to uname -n 
> for the local corosync node name
> Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Could not obtain a node 
> name for corosync nodeid 1
> Nov 29 16:49:15 rhino66-left attrd[21178]:   notice: Could not obtain a node 
> name for corosync nodeid 1
> Nov 29 16:49:15 rhino66-left stonith-ng[21176]:   notice: Could not obtain a 
> node name for corosync nodeid 1
> Nov 29 16:49:15 rhino66-left attrd[21178]:   notice: Node (null) state is 
> now member
> Nov 29 16:49:15 rhino66-left stonith-ng[21176]:   notice: Node (null) state 
> is now member
> Nov 29 16:49:15 rhino66-left stonith-ng[21176]:   notice: Watchdog will be 
> used via SBD if fencing is required
> Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Could not obtain a node 
> name for corosync nodeid 1
> Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Node (null) state is now 
> member
> Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Node 
> rhino66-left.lab.archivas.com state is now member
> Nov 29 16:49:15 rhino66-left attrd[21178]:   notice: Defaulting to uname -n 
> for the local corosync node name
> Nov 29 16:49:15 rhino66-left attrd[21178]:   notice: Recorded attribute 
> writer: rhino66-right.lab.archivas.com
> Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: Defaulting to uname -n 
> for the local corosync node name
> Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: The local CRM is 
> operational
> Nov 29 16:49:15 rhino66-left crmd[21180]:   notice: State transition 
> S_STARTING -> S_PENDING
> Nov 29 16:49:16 rhino66-left stonith-ng[21176]:   notice: Added 'fence_sbd' 
> to the device list (1 active devices)
> Nov 29 16:49:16 rhino66-left crmd[21180]:   notice: Could not obtain a node 
> name for corosync nodeid 1
> Nov 29 16:49:17 rhino66-left stonith-ng[21176]:   notice: Added 'ipmi-81' to 
> the device list (2 active devices)
> Nov 29 16:49:20 rhino66-left attrd[21178]:   notice: Updating all attributes 
> after cib_refresh_notify event
> 
> ......
> 
> Nov 29 16:49:27 rhino66-left systemd-logind[1539]: System is powering down.
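
[Editor's note: after an event like this, the fencing operation can also be confirmed from the surviving node's fencer; a sketch, assuming the node name from the logs:]

```shell
# Query stonith-ng's fencing history for the fenced node
# (run on the surviving node, rhino66-right).
stonith_admin --history rhino66-left.lab.archivas.com
```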
