[ClusterLabs] fencing by node name or by node ID

Ken Gaillot kgaillot at redhat.com
Mon Feb 22 17:10:57 UTC 2016


On 02/21/2016 06:19 PM, Ferenc Wágner wrote:
> Hi,
> 
> Last night a node in our cluster (Corosync 2.3.5, Pacemaker 1.1.14)
> experienced some failure and fell out of the cluster:
> 
> Feb 21 22:11:12 vhbl06 corosync[3603]:   [TOTEM ] A new membership (10.0.6.9:612) was formed. Members left: 167773709
> Feb 21 22:11:12 vhbl06 corosync[3603]:   [TOTEM ] Failed to receive the leave message. failed: 167773709
> Feb 21 22:11:12 vhbl06 attrd[8307]:   notice: crm_update_peer_proc: Node vhbl07[167773709] - state is now lost (was member)
> Feb 21 22:11:12 vhbl06 cib[8304]:   notice: crm_update_peer_proc: Node vhbl07[167773709] - state is now lost (was member)
> Feb 21 22:11:12 vhbl06 attrd[8307]:   notice: Removing all vhbl07 attributes for attrd_peer_change_cb
> Feb 21 22:11:12 vhbl06 cib[8304]:   notice: Removing vhbl07/167773709 from the membership list
> Feb 21 22:11:12 vhbl06 cib[8304]:   notice: Purged 1 peers with id=167773709 and/or uname=vhbl07 from the membership cache
> Feb 21 22:11:12 vhbl06 attrd[8307]:   notice: Lost attribute writer vhbl07
> Feb 21 22:11:12 vhbl06 attrd[8307]:   notice: Removing vhbl07/167773709 from the membership list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: crm_update_peer_proc: Node vhbl07[167773709] - state is now lost (was member)
> Feb 21 22:11:12 vhbl06 attrd[8307]:   notice: Purged 1 peers with id=167773709 and/or uname=vhbl07 from the membership cache
> Feb 21 22:11:12 vhbl06 crmd[8309]:   notice: Our peer on the DC (vhbl07) is dead
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: Removing vhbl07/167773709 from the membership list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: Purged 1 peers with id=167773709 and/or uname=vhbl07 from the membership cache
> Feb 21 22:11:12 vhbl06 crmd[8309]:   notice: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> Feb 21 22:11:12 vhbl06 corosync[3603]:   [QUORUM] Members[4]: 167773705 167773706 167773707 167773708
> Feb 21 22:11:12 vhbl06 corosync[3603]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 21 22:11:12 vhbl06 crmd[8309]:   notice: crm_reap_unseen_nodes: Node vhbl07[167773709] - state is now lost (was member)
> Feb 21 22:11:12 vhbl06 pacemakerd[8261]:   notice: crm_reap_unseen_nodes: Node vhbl07[167773709] - state is now lost (was member)
> Feb 21 22:11:12 vhbl06 kernel: [343490.563365] dlm: closing connection to node 167773709
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl05 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl07 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl01 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl02 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl03 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl04 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:12 vhbl06 stonith-ng[8305]:   notice: Operation reboot of 167773709 by <no-one> for stonith-api.20937 at vhbl03.9c470723: No such device
> Feb 21 22:11:12 vhbl06 crmd[8309]:   notice: Peer 167773709 was not terminated (reboot) by <anyone> for vhbl03: No such device (ref=9c470723-d318-4c7e-a705-ce9ee5c7ffe5) by client stonith-api.20937
> Feb 21 22:11:12 vhbl06 dlm_controld[3641]: 343352 tell corosync to remove nodeid 167773705 from cluster
> Feb 21 22:11:15 vhbl06 corosync[3603]:   [TOTEM ] A processor failed, forming new configuration.
> Feb 21 22:11:19 vhbl06 corosync[3603]:   [TOTEM ] A new membership (10.0.6.10:616) was formed. Members left: 167773705
> Feb 21 22:11:19 vhbl06 corosync[3603]:   [TOTEM ] Failed to receive the leave message. failed: 167773705
> 
> However, no fencing agent reported the ability to fence the failing node
> (vhbl07), because stonith-ng wasn't looking it up by name but by
> numeric ID (at least that's what the logs suggest to me), while the
> pcmk_host_list attributes contain node names such as vhbl07.
> 
> 1. Was it dlm_controld who requested the fencing?
> 
>    I suspect it because of the "dlm: closing connection to node
>    167773709" kernel message right before the stonith-ng logs.  And
>    dlm_controld really hasn't got anything to use but the corosync node
>    ID.

Not based on these logs; DLM would log its own messages about fencing,
matching "dlm_controld.*fence request".

However, it looks like these logs are not from the DC, whose logs will
say which process requested the fencing. It may have been DLM or
something else. Also, DLM on any node might initiate fencing, so it's
worth looking at all the nodes' logs around this time.
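
For example, something like this on each node should show whether DLM
asked for the fencing (the log path is just a guess; use whatever file
or journalctl query your syslog writes cluster messages to):

    # on every node: did dlm_controld request a fence?
    grep -E "dlm_controld.*fence request" /var/log/syslog

    # on every node: which client asked the local stonithd to fence?
    grep "wants to fence" /var/log/syslog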

> 2. Shouldn't some component translate between node IDs and node names?
>    Is this a configuration error in our setup?  Should I include both in
>    pcmk_host_list?

Yes, stonithd's create_remote_stonith_op() function will do the
translation if the st_opt_cs_nodeid call option is set in the request
XML. If that fails, you'll see a "Could not expand nodeid" warning in
the log. That option is set by the kick() stonith API used by DLM, so it
should happen automatically.

I'm not sure why it appears not to have worked here; logs from other
nodes might help. Do corosync and pacemaker know the same node names?
That would be necessary to get the node name from corosync.
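
A quick way to compare the two views (just a sketch; the exact cmap keys
depend on how your nodelist is configured in corosync.conf):

    # corosync's nodelist: node IDs and names/addresses
    corosync-cmapctl | grep nodelist.node

    # pacemaker's view: node ID / node name pairs
    crm_node -l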

Have you tested fencing vhbl07 from the command line with stonith_admin
to make sure fencing is configured correctly?
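
For example (this assumes the stonith_admin shipped with 1.1.14; check
stonith_admin --help for the exact option names before rebooting a
production node):

    # which devices does stonithd think can fence vhbl07?
    stonith_admin --list vhbl07

    # actually fence it (reboot) through the configured devices
    stonith_admin --reboot vhbl07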

> 3. After the failed fence, why was 167773705 (vhbl03) removed from the
>    cluster?  Because it was chosen to execute the fencing operation, but
>    failed?

dlm_controld explicitly requested it. I'm not familiar enough with DLM
to know why. It doesn't sound like a good idea to me.

> The logs continue like this:
> 
> Feb 21 22:11:19 vhbl06 attrd[8307]:   notice: crm_update_peer_proc: Node vhbl03[167773705] - state is now lost (was member)
> Feb 21 22:11:19 vhbl06 attrd[8307]:   notice: Removing all vhbl03 attributes for attrd_peer_change_cb
> Feb 21 22:11:19 vhbl06 attrd[8307]:   notice: Removing vhbl03/167773705 from the membership list
> Feb 21 22:11:19 vhbl06 attrd[8307]:   notice: Purged 1 peers with id=167773705 and/or uname=vhbl03 from the membership cache
> Feb 21 22:11:19 vhbl06 corosync[3603]:   [QUORUM] Members[3]: 167773706 167773707 167773708
> Feb 21 22:11:19 vhbl06 corosync[3603]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 21 22:11:19 vhbl06 crmd[8309]:   notice: crm_reap_unseen_nodes: Node vhbl03[167773705] - state is now lost (was member)
> Feb 21 22:11:19 vhbl06 crmd[8309]:   notice: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
> Feb 21 22:11:19 vhbl06 pacemakerd[8261]:   notice: crm_reap_unseen_nodes: Node vhbl03[167773705] - state is now lost (was member)
> Feb 21 22:11:19 vhbl06 cib[8304]:   notice: crm_update_peer_proc: Node vhbl03[167773705] - state is now lost (was member)
> Feb 21 22:11:19 vhbl06 cib[8304]:   notice: Removing vhbl03/167773705 from the membership list
> Feb 21 22:11:19 vhbl06 cib[8304]:   notice: Purged 1 peers with id=167773705 and/or uname=vhbl03 from the membership cache
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: crm_update_peer_proc: Node vhbl03[167773705] - state is now lost (was member)
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: Removing vhbl03/167773705 from the membership list
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: Purged 1 peers with id=167773705 and/or uname=vhbl03 from the membership cache
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl05 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl07 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl01 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl02 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl03 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl04 can not fence (reboot) 167773709: static-list
> Feb 21 22:11:19 vhbl06 kernel: [343497.392381] dlm: closing connection to node 167773705
> 
> 4. Why can't I see any action above to fence 167773705 (vhbl03)?

Only the DC and the node that executes the fence will have those logs.
The other nodes will just have the query results ("can/can not fence")
and the final stonith result.
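
To piece together the whole sequence, grep every node's log for the
overall fencing results (same message format as in your excerpts above;
again, the log path is only an assumption):

    # the executing node and the DC will have the most detail
    grep "Operation reboot of" /var/log/syslog
    grep "was not terminated" /var/log/syslog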

> Feb 21 22:11:19 vhbl06 crmd[8309]:  warning: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
> Feb 21 22:11:19 vhbl06 stonith-ng[8305]:   notice: Operation reboot of 167773709 by <no-one> for stonith-api.17462 at vhbl04.0cd1625d: No such device
> Feb 21 22:11:19 vhbl06 crmd[8309]:   notice: Peer 167773709 was not terminated (reboot) by <anyone> for vhbl04: No such device (ref=0cd1625d-a61e-4f94-930d-bb80a10b89da) by client stonith-api.17462
> Feb 21 22:11:19 vhbl06 dlm_controld[3641]: 343359 tell corosync to remove nodeid 167773706 from cluster
> Feb 21 22:11:22 vhbl06 corosync[3603]:   [TOTEM ] A processor failed, forming new configuration.
> Feb 21 22:11:26 vhbl06 corosync[3603]:   [TOTEM ] A new membership (10.0.6.11:620) was formed. Members left: 167773706
> Feb 21 22:11:26 vhbl06 corosync[3603]:   [TOTEM ] Failed to receive the leave message. failed: 167773706
> 
> Looks like vhbl04 took over the job of fencing vhbl07 from vhbl03, and
> of course failed the exact same way.  So it was expelled, too.
> 
> Feb 21 22:11:26 vhbl06 attrd[8307]:   notice: crm_update_peer_proc: Node vhbl04[167773706] - state is now lost (was member)
> Feb 21 22:11:26 vhbl06 cib[8304]:   notice: crm_update_peer_proc: Node vhbl04[167773706] - state is now lost (was member)
> Feb 21 22:11:26 vhbl06 attrd[8307]:   notice: Removing all vhbl04 attributes for attrd_peer_change_cb
> Feb 21 22:11:26 vhbl06 cib[8304]:   notice: Removing vhbl04/167773706 from the membership list
> Feb 21 22:11:26 vhbl06 cib[8304]:   notice: Purged 1 peers with id=167773706 and/or uname=vhbl04 from the membership cache
> Feb 21 22:11:26 vhbl06 attrd[8307]:   notice: Removing vhbl04/167773706 from the membership list
> Feb 21 22:11:26 vhbl06 attrd[8307]:   notice: Purged 1 peers with id=167773706 and/or uname=vhbl04 from the membership cache
> Feb 21 22:11:26 vhbl06 stonith-ng[8305]:   notice: crm_update_peer_proc: Node vhbl04[167773706] - state is now lost (was member)
> Feb 21 22:11:26 vhbl06 stonith-ng[8305]:   notice: Removing vhbl04/167773706 from the membership list
> Feb 21 22:11:26 vhbl06 crmd[8309]:  warning: No match for shutdown action on 167773706
> Feb 21 22:11:26 vhbl06 stonith-ng[8305]:   notice: Purged 1 peers with id=167773706 and/or uname=vhbl04 from the membership cache
> Feb 21 22:11:26 vhbl06 crmd[8309]:   notice: Stonith/shutdown of vhbl04 not matched
> Feb 21 22:11:26 vhbl06 corosync[3603]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Feb 21 22:11:26 vhbl06 pacemakerd[8261]:   notice: Membership 620: quorum lost (2)
> Feb 21 22:11:26 vhbl06 crmd[8309]:   notice: Membership 620: quorum lost (2)
> Feb 21 22:11:26 vhbl06 corosync[3603]:   [QUORUM] Members[2]: 167773707 167773708
> 
> That, finally, was enough to lose quorum and paralyze the cluster.
> Later, vhbl07 was rebooted by the hardware watchdog and came back for a
> cold welcome:
> 
> Feb 21 22:24:53 vhbl06 corosync[3603]:   [TOTEM ] A new membership (10.0.6.12:628) was formed. Members joined: 167773709
> Feb 21 22:24:53 vhbl06 corosync[3603]:   [QUORUM] Members[2]: 167773708 167773709
> Feb 21 22:24:53 vhbl06 corosync[3603]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 21 22:24:53 vhbl06 crmd[8309]:   notice: pcmk_quorum_notification: Node vhbl07[167773709] - state is now member (was lost)
> Feb 21 22:24:53 vhbl06 pacemakerd[8261]:   notice: pcmk_quorum_notification: Node vhbl07[167773709] - state is now member (was lost)
> Feb 21 22:24:53 vhbl06 dlm_controld[3641]: 344173 daemon joined 167773709 needs fencing
> Feb 21 22:25:47 vhbl06 dlm_controld[3641]: 344226 clvmd wait for quorum
> Feb 21 22:29:26 vhbl06 crmd[8309]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Feb 21 22:29:27 vhbl06 pengine[8308]:   notice: We do not have quorum - fencing and resource management disabled
> Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl04 is unclean because the node is no longer part of the cluster
> Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl04 is unclean
> Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl05 is unclean because the node is no longer part of the cluster
> Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl05 is unclean
> Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl07 is unclean because our peer process is no longer available
> Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl07 is unclean
> Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl03 is unclean because vm-niifdc is thought to be active there
> Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Action vm-dogwood_stop_0 on vhbl03 is unrunnable (offline)
> Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Action vm-niifidp_stop_0 on vhbl03 is unrunnable (offline)
> [...]
> Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl03 is unclean!
> Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl04 is unclean!
> Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Node vhbl05 is unclean!
> Feb 21 22:29:27 vhbl06 pengine[8308]:   notice: We can fence vhbl07 without quorum because they're in our membership
> Feb 21 22:29:27 vhbl06 pengine[8308]:  warning: Scheduling Node vhbl07 for STONITH
> Feb 21 22:29:27 vhbl06 pengine[8308]:   notice: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)
> [...]
> Feb 21 22:29:27 vhbl06 crmd[8309]:   notice: Executing reboot fencing operation (212) on vhbl07 (timeout=60000)
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: Client crmd.8309.09cea2e7 wants to fence (reboot) 'vhbl07' with device '(any)'
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: Initiating remote operation reboot for vhbl07: 31b2023d-3fc5-419e-8490-91eb81254497 (0)
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl05 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl07 can fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl01 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl02 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl03 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl04 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl05 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl07 can fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl01 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl02 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl03 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 stonith-ng[8305]:   notice: fencing-vhbl04 can not fence (reboot) vhbl07: static-list
> Feb 21 22:29:27 vhbl06 dlm_controld[3641]: 344447 daemon remove 167773709 already needs fencing
> Feb 21 22:29:27 vhbl06 dlm_controld[3641]: 344447 tell corosync to remove nodeid 167773709 from cluster
> Feb 21 22:29:30 vhbl06 corosync[3603]:   [TOTEM ] A processor failed, forming new configuration.
> Feb 21 22:29:34 vhbl06 corosync[3603]:   [TOTEM ] A new membership (10.0.6.12:632) was formed. Members left: 167773709
> Feb 21 22:29:34 vhbl06 corosync[3603]:   [TOTEM ] Failed to receive the leave message. failed: 167773709
> Feb 21 22:29:34 vhbl06 corosync[3603]:   [QUORUM] Members[1]: 167773708
> Feb 21 22:29:34 vhbl06 corosync[3603]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Feb 21 22:29:34 vhbl06 pacemakerd[8261]:   notice: crm_reap_unseen_nodes: Node vhbl07[167773709] - state is now lost (was member)
> Feb 21 22:29:34 vhbl06 crmd[8309]:   notice: crm_reap_unseen_nodes: Node vhbl07[167773709] - state is now lost (was member)
> Feb 21 22:29:34 vhbl06 kernel: [344592.424938] dlm: closing connection to node 167773709
> Feb 21 22:29:42 vhbl06 stonith-ng[8305]:   notice: Operation 'reboot' [5533] (call 2 from crmd.8309) for host 'vhbl07' with device 'fencing-vhbl07' returned: 0 (OK)
> Feb 21 22:29:42 vhbl06 stonith-ng[8305]:   notice: Operation reboot of vhbl07 by vhbl06 for crmd.8309 at vhbl06.31b2023d: OK
> Feb 21 22:29:42 vhbl06 crmd[8309]:   notice: Stonith operation 2/212:1:0:d06e9743-b452-4b6a-b3a9-d352a4454269: OK (0)
> Feb 21 22:29:42 vhbl06 crmd[8309]:   notice: Peer vhbl07 was terminated (reboot) by vhbl06 for vhbl06: OK (ref=31b2023d-3fc5-419e-8490-91eb81254497) by client crmd.8309
> 
> That is, fencing by node name worked all right.
> 
> I wonder if I understood the issue right and what would be the best way
> to avoid it in the future.  Please advise.
> 
