[ClusterLabs] fencing by node name or by node ID

Ferenc Wágner wferi at niif.hu
Sun Feb 21 19:19:26 EST 2016


Hi,

Last night a node in our cluster (Corosync 2.3.5, Pacemaker 1.1.14)
experienced some failure and fell out of the cluster:

Feb 21 22:11:12 vhbl06 corosync[3603]:   [TOTEM ] A new membership (10.0.6.9:612) was formed. Members left: 167773709
Feb 21 22:11:12 vhbl06 corosync[3603]:   [TOTEM ] Failed to receive the leave message. failed: 167773709
Feb 21 22:11:12 vhbl06 attrd[8307]:   notice: crm_update_peer_proc: Node vhbl07[167773709] - state is now lost (was member)
Feb 21 22:11:12 vhbl06 cib[8304]:   notice: crm_update_peer_proc: Node vhbl07[167773709] - state is now lost (was member)
Feb 21 22:11:12 vhbl06 attrd[8307]:   notice: Removing all vhbl07 attributes for attrd_peer_change_cb
Feb 21 22:11:12 vhbl06 cib[8304]:   notice: Removing vhbl07/167773709 from the membership list
Feb 21 22:11:12 vhbl06 cib[8304]:   notice: Purged 1 peers with id=167773709 and/or uname=vhbl07 from the membership cache
Feb 21 22:11:12 vhbl06 attrd[8307]:   notice: Lost attribute writer vhbl07
Feb 21 22:11:12 vhbl06 attrd[8307]:   notice: Removing vhbl07/167773709 from the membership list
Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: crm_update_peer_proc: Node vhbl07[167773709] - state is now lost (was member)
Feb 21 22:11:12 vhbl06 attrd[8307]:   notice: Purged 1 peers with id=167773709 and/or uname=vhbl07 from the membership cache
Feb 21 22:11:12 vhbl06 crmd[8309]:   notice: Our peer on the DC (vhbl07) is dead
Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: Removing vhbl07/167773709 from the membership list
Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: Purged 1 peers with id=167773709 and/or uname=vhbl07 from the membership cache
Feb 21 22:11:12 vhbl06 crmd[8309]:   notice: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Feb 21 22:11:12 vhbl06 corosync[3603]:   [QUORUM] Members[4]: 167773705 167773706 167773707 167773708
Feb 21 22:11:12 vhbl06 corosync[3603]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 21 22:11:12 vhbl06 crmd[8309]:   notice: crm_reap_unseen_nodes: Node vhbl07[167773709] - state is now lost (was member)
Feb 21 22:11:12 vhbl06 pacemakerd[8261]:   notice: crm_reap_unseen_nodes: Node vhbl07[167773709] - state is now lost (was member)
Feb 21 22:11:12 vhbl06 kernel: [343490.563365] dlm: closing connection to node 167773709
Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl05 can not fence (reboot) 167773709: static-list
Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl07 can not fence (reboot) 167773709: static-list
Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl01 can not fence (reboot) 167773709: static-list
Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl02 can not fence (reboot) 167773709: static-list
Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl03 can not fence (reboot) 167773709: static-list
Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl04 can not fence (reboot) 167773709: static-list
Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: Operation reboot of 167773709 by <no-one> for stonith-api.20937 at vhbl03.9c470723: No such device
Feb 21 22:11:12 vhbl06 crmd[8309]:   notice: Peer 167773709 was not terminated (reboot) by <anyone> for vhbl03: No such device (ref=9c470723-d318-4c7e-a705-ce9ee5c7ffe5) by client stonith-api.20937
Feb 21 22:11:12 vhbl06 dlm_controld[3641]: 343352 tell corosync to remove nodeid 167773705 from cluster
Feb 21 22:11:15 vhbl06 corosync[3603]:   [TOTEM ] A processor failed, forming new configuration.
Feb 21 22:11:19 vhbl06 corosync[3603]:   [TOTEM ] A new membership (10.0.6.10:616) was formed. Members left: 167773705
Feb 21 22:11:19 vhbl06 corosync[3603]:   [TOTEM ] Failed to receive the leave message. failed: 167773705

However, no fencing device reported the ability to fence the failing
node (vhbl07): stonith-ng wasn't looking it up by name but by numeric
node ID (at least that's what the logs suggest to me), while the
pcmk_host_list attributes contain node names like vhbl07.
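
For reference, the fencing devices are configured with static host
lists, roughly like this (a trimmed sketch in crm shell syntax;
fence_ipmilan and its connection parameters are only placeholders, the
pcmk_host_list part is what matters):

   primitive fencing-vhbl07 stonith:fence_ipmilan \
           params pcmk_host_list="vhbl07" pcmk_host_check="static-list" \
           ipaddr="..." login="..." passwd="..."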

1. Was it dlm_controld that requested the fencing?

   I suspect so because of the "dlm: closing connection to node
   167773709" kernel message right before the stonith-ng logs.  And
   dlm_controld really has nothing to use but the corosync node ID.

2. Shouldn't some component translate between node IDs and node names?
   Is this a configuration error in our setup?  Should I include both in
   pcmk_host_list (see the sketch after these questions)?

3. After the failed fence, why was 167773705 (vhbl03) removed from the
   cluster?  Because it was chosen to execute the fencing operation, but
   failed?
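
Regarding question 2: if listing both identifiers is the way to go, I
suppose the change would look something like this (again a sketch, with
the same placeholder agent as above):

   primitive fencing-vhbl07 stonith:fence_ipmilan \
           params pcmk_host_list="vhbl07 167773709" \
           ...

though I'd much prefer some component translating between the two.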

The logs continue like this:

Feb 21 22:11:19 vhbl06 attrd[8307]:   notice: crm_update_peer_proc: Node vhbl03[167773705] - state is now lost (was member)
Feb 21 22:11:19 vhbl06 attrd[8307]:   notice: Removing all vhbl03 attributes for attrd_peer_change_cb
Feb 21 22:11:19 vhbl06 attrd[8307]:   notice: Removing vhbl03/167773705 from the membership list
Feb 21 22:11:19 vhbl06 attrd[8307]:   notice: Purged 1 peers with id=167773705 and/or uname=vhbl03 from the membership cache
Feb 21 22:11:19 vhbl06 corosync[3603]:   [QUORUM] Members[3]: 167773706 167773707 167773708
Feb 21 22:11:19 vhbl06 corosync[3603]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 21 22:11:19 vhbl06 crmd[8309]:   notice: crm_reap_unseen_nodes: Node vhbl03[167773705] - state is now lost (was member)
Feb 21 22:11:19 vhbl06 crmd[8309]:   notice: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
Feb 21 22:11:19 vhbl06 pacemakerd[8261]:   notice: crm_reap_unseen_nodes: Node vhbl03[167773705] - state is now lost (was member)
Feb 21 22:11:19 vhbl06 cib[8304]:   notice: crm_update_peer_proc: Node vhbl03[167773705] - state is now lost (was member)
Feb 21 22:11:19 vhbl06 cib[8304]:   notice: Removing vhbl03/167773705 from the membership list
Feb 21 22:11:19 vhbl06 cib[8304]:   notice: Purged 1 peers with id=167773705 and/or uname=vhbl03 from the membership cache
Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: crm_update_peer_proc: Node vhbl03[167773705] - state is now lost (was member)
Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: Removing vhbl03/167773705 from the membership list
Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: Purged 1 peers with id=167773705 and/or uname=vhbl03 from the membership cache
Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl05 can not fence (reboot) 167773709: static-list
Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl07 can not fence (reboot) 167773709: static-list
Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl01 can not fence (reboot) 167773709: static-list
Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl02 can not fence (reboot) 167773709: static-list
Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl03 can not fence (reboot) 167773709: static-list
Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl04 can not fence (reboot) 167773709: static-list
Feb 21 22:11:19 vhbl06 kernel: [343497.392381] dlm: closing connection to node 167773705

4. Why can't I see any action above to fence 167773705 (vhbl03)?

Feb 21 22:11:19 vhbl06 crmd[8309]:  warning: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: Operation reboot of 167773709 by <no-one> for stonith-api.17462 at vhbl04.0cd1625d: No such device
Feb 21 22:11:19 vhbl06 crmd[8309]:   notice: Peer 167773709 was not terminated (reboot) by <anyone> for vhbl04: No such device (ref=0cd1625d-a61e-4f94-930d-bb80a10b89da) by client stonith-api.17462
Feb 21 22:11:19 vhbl06 dlm_controld[3641]: 343359 tell corosync to remove nodeid 167773706 from cluster
Feb 21 22:11:22 vhbl06 corosync[3603]:   [TOTEM ] A processor failed, forming new configuration.
Feb 21 22:11:26 vhbl06 corosync[3603]:   [TOTEM ] A new membership (10.0.6.11:620) was formed. Members left: 167773706
Feb 21 22:11:26 vhbl06 corosync[3603]:   [TOTEM ] Failed to receive the leave message. failed: 167773706

Looks like vhbl04 took over the job of fencing vhbl07 from vhbl03, and
of course failed the exact same way.  So it was expelled, too.

Feb 21 22:11:26 vhbl06 attrd[8307]:   notice: crm_update_peer_proc: Node vhbl04[167773706] - state is now lost (was member)
Feb 21 22:11:26 vhbl06 cib[8304]:   notice: crm_update_peer_proc: Node vhbl04[167773706] - state is now lost (was member)
Feb 21 22:11:26 vhbl06 attrd[8307]:   notice: Removing all vhbl04 attributes for attrd_peer_change_cb
Feb 21 22:11:26 vhbl06 cib[8304]:   notice: Removing vhbl04/167773706 from the membership list
Feb 21 22:11:26 vhbl06 cib[8304]:   notice: Purged 1 peers with id=167773706 and/or uname=vhbl04 from the membership cache
Feb 21 22:11:26 vhbl06 attrd[8307]:   notice: Removing vhbl04/167773706 from the membership list
Feb 21 22:11:26 vhbl06 attrd[8307]:   notice: Purged 1 peers with id=167773706 and/or uname=vhbl04 from the membership cache
Feb 21 22:11:26 vhbl06 stonith-ng[8305]:   notice: crm_update_peer_proc: Node vhbl04[167773706] - state is now lost (was member)
Feb 21 22:11:26 vhbl06 stonith-ng[8305]:   notice: Removing vhbl04/167773706 from the membership list
Feb 21 22:11:26 vhbl06 crmd[8309]:  warning: No match for shutdown action on 167773706
Feb 21 22:11:26 vhbl06 stonith-ng[8305]:   notice: Purged 1 peers with id=167773706 and/or uname=vhbl04 from the membership cache
Feb 21 22:11:26 vhbl06 crmd[8309]:   notice: Stonith/shutdown of vhbl04 not matched
Feb 21 22:11:26 vhbl06 corosync[3603]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
Feb 21 22:11:26 vhbl06 pacemakerd[8261]:   notice: Membership 620: quorum lost (2)
Feb 21 22:11:26 vhbl06 crmd[8309]:   notice: Membership 620: quorum lost (2)
Feb 21 22:11:26 vhbl06 corosync[3603]:   [QUORUM] Members[2]: 167773707 167773708

That, finally, was enough to lose quorum and paralyze the cluster.
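
If I count right, the quorum arithmetic (assuming the five votes
present at the start of these logs) is

   quorum = floor(expected_votes / 2) + 1 = floor(5 / 2) + 1 = 3

so the two surviving members were bound to fall below it.
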
Later, vhbl07 was rebooted by the hardware watchdog and came back for a
cold welcome:

Feb 21 22:24:53 vhbl06 corosync[3603]:   [TOTEM ] A new membership (10.0.6.12:628) was formed. Members joined: 167773709
Feb 21 22:24:53 vhbl06 corosync[3603]:   [QUORUM] Members[2]: 167773708 167773709
Feb 21 22:24:53 vhbl06 corosync[3603]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 21 22:24:53 vhbl06 crmd[8309]:   notice: pcmk_quorum_notification: Node vhbl07[167773709] - state is now member (was lost)
Feb 21 22:24:53 vhbl06 pacemakerd[8261]:   notice: pcmk_quorum_notification: Node vhbl07[167773709] - state is now member (was lost)
Feb 21 22:24:53 vhbl06 dlm_controld[3641]: 344173 daemon joined 167773709 needs fencing
Feb 21 22:25:47 vhbl06 dlm_controld[3641]: 344226 clvmd wait for quorum
Feb 21 22:29:26 vhbl06 crmd[8309]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Feb 21 22:29:27 vhbl06 pengine[8308]:   notice: We do not have quorum - fencing and resource management disabled
Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl04 is unclean because the node is no longer part of the cluster
Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl04 is unclean
Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl05 is unclean because the node is no longer part of the cluster
Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl05 is unclean
Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl07 is unclean because our peer process is no longer available
Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl07 is unclean
Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl03 is unclean because vm-niifdc is thought to be active there
Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Action vm-dogwood_stop_0 on vhbl03 is unrunnable (offline)
Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Action vm-niifidp_stop_0 on vhbl03 is unrunnable (offline)
[...]
Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl03 is unclean!
Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl04 is unclean!
Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl05 is unclean!
Feb 21 22:29:27 vhbl06 pengine[8308]:   notice: We can fence vhbl07 without quorum because they're in our membership
Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Scheduling Node vhbl07 for STONITH
Feb 21 22:29:27 vhbl06 pengine[8308]:   notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
[...]
Feb 21 22:29:27 vhbl06 crmd[8309]:   notice: Executing reboot fencing operation (212) on vhbl07 (timeout=60000)
Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: Client crmd.8309.09cea2e7 wants to fence (reboot) 'vhbl07' with device '(any)'
Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: Initiating remote operation reboot for vhbl07: 31b2023d-3fc5-419e-8490-91eb81254497 (0)
Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl05 can not fence (reboot) vhbl07: static-list
Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl07 can fence (reboot) vhbl07: static-list
Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl01 can not fence (reboot) vhbl07: static-list
Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl02 can not fence (reboot) vhbl07: static-list
Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl03 can not fence (reboot) vhbl07: static-list
Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl04 can not fence (reboot) vhbl07: static-list
Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl05 can not fence (reboot) vhbl07: static-list
Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl07 can fence (reboot) vhbl07: static-list
Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl01 can not fence (reboot) vhbl07: static-list
Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl02 can not fence (reboot) vhbl07: static-list
Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl03 can not fence (reboot) vhbl07: static-list
Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl04 can not fence (reboot) vhbl07: static-list
Feb 21 22:29:27 vhbl06 dlm_controld[3641]: 344447 daemon remove 167773709 already needs fencing
Feb 21 22:29:27 vhbl06 dlm_controld[3641]: 344447 tell corosync to remove nodeid 167773709 from cluster
Feb 21 22:29:30 vhbl06 corosync[3603]:   [TOTEM ] A processor failed, forming new configuration.
Feb 21 22:29:34 vhbl06 corosync[3603]:   [TOTEM ] A new membership (10.0.6.12:632) was formed. Members left: 167773709
Feb 21 22:29:34 vhbl06 corosync[3603]:   [TOTEM ] Failed to receive the leave message. failed: 167773709
Feb 21 22:29:34 vhbl06 corosync[3603]:   [QUORUM] Members[1]: 167773708
Feb 21 22:29:34 vhbl06 corosync[3603]:   [MAIN  ] Completed service synchronization, ready to provide service.
Feb 21 22:29:34 vhbl06 pacemakerd[8261]:   notice: crm_reap_unseen_nodes: Node vhbl07[167773709] - state is now lost (was member)
Feb 21 22:29:34 vhbl06 crmd[8309]:   notice: crm_reap_unseen_nodes: Node vhbl07[167773709] - state is now lost (was member)
Feb 21 22:29:34 vhbl06 kernel: [344592.424938] dlm: closing connection to node 167773709
Feb 21 22:29:42 vhbl06 stonith-ng[8305]:   notice: Operation 'reboot' [5533] (call 2 from crmd.8309) for host 'vhbl07' with device 'fencing-vhbl07' returned: 0 (OK)
Feb 21 22:29:42 vhbl06 stonith-ng[8305]:   notice: Operation reboot of vhbl07 by vhbl06 for crmd.8309 at vhbl06.31b2023d: OK
Feb 21 22:29:42 vhbl06 crmd[8309]:   notice: Stonith operation 2/212:1:0:d06e9743-b452-4b6a-b3a9-d352a4454269: OK (0)
Feb 21 22:29:42 vhbl06 crmd[8309]:   notice: Peer vhbl07 was terminated (reboot) by vhbl06 for vhbl06: OK (ref=31b2023d-3fc5-419e-8490-91eb81254497) by client crmd.8309

That is, fencing by node name worked all right.
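
If it helps, I guess the two lookup paths could be compared directly
with stonith_admin (untested, assuming the usual 1.1 option names):

   stonith_admin --list vhbl07       # by name: should find fencing-vhbl07
   stonith_admin --list 167773709    # by node ID: presumably finds nothing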

I wonder whether I've understood the issue correctly, and what the best
way to avoid it in the future would be.  Please advise.
-- 
Thanks,
Feri.



