[ClusterLabs] fencing by node name or by node ID

Ken Gaillot kgaillot at redhat.com
Tue Feb 23 10:22:07 EST 2016


On 02/22/2016 06:56 PM, Ferenc Wágner wrote:
> Ken Gaillot <kgaillot at redhat.com> writes:
> 
>> On 02/21/2016 06:19 PM, Ferenc Wágner wrote:
>>
>>> Last night a node in our cluster (Corosync 2.3.5, Pacemaker 1.1.14)
>>> experienced some failure and fell out of the cluster: [...]
>>>
>>> However, no fencing agent reported ability to fence the failing node
>>> (vhbl07), because stonith-ng wasn't looking it up by name, but by
>>> numeric ID (at least that's what the logs suggest to me), and the
>>> pcmk_host_list attributes contained strings like vhbl07.
>>>
>>> 1. Was it dlm_controld who requested the fencing?
>>>
>>>    I suspect it because of the "dlm: closing connection to node
>>>    167773709" kernel message right before the stonith-ng logs.  And
>>>    dlm_controld really hasn't got anything to use but the corosync node
>>>    ID.
>>
>> Not based on this; dlm would print messages about fencing, with
>> "dlm_controld.*fence request".
>>
>> However it looks like these logs are not from the DC, which will say
>> what process requested the fencing. It may be DLM or something else.
>> Also, DLM on any node might initiate fencing, so it's worth looking at
>> all the nodes' logs around this time.
> 
> What's a good way to determine the DC node from the logs?  Messages like
> the following make me think it was the failing node, vhbl07:
> 
> 22:11:12 vhbl03 crmd[7956]:   notice: Our peer on the DC (vhbl07) is dead

That is correct. So that's an exception :) and the other nodes' logs
will probably be more interesting.

The easiest way to tell the DC at any time is that its logs will have
lots of "pengine:" messages.
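
A rough way to check that (assuming cluster logs land in
/var/log/messages, which varies by distribution) is to count those
messages on each node; the DC's count will be far higher than the others':

    # run on each node; the DC logs many pengine entries, the others few or none
    grep -c 'pengine\[' /var/log/messages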

> The local disks contained no usable logs besides what I've already
> shown, but the remote log server had more to say.  Unfortunately, it
> stored the various facilities in different files with low resolution
> time stamps, so we've got partial ordering info only.
> 
> To rehash: vhbl03 - 167773705
>            vhbl04 - 167773706
>            vhbl05 - 167773707
>            vhbl06 - 167773708
>            vhbl07 - 167773709 (the failed node)
> 
> There are DLM fence requests for vhbl07 on vhbl0[34], and later on
> vhbl05:

OK, we can be pretty sure DLM initiated the fencing. (It's possible both
DLM and the cluster initiated it.)

> 22:11:12 vhbl03 dlm_controld[3644]: 349002 fence request 167773709 pid 20937 nodedown time 1456089072 fence_all dlm_stonith
> 22:11:12 vhbl03 dlm_controld[3644]: 349002 abandoned lockspace clvmd
> 22:11:12 vhbl04 dlm_controld[3899]: 330220 fence request 167773709 pid 17462 nodedown time 1456089072 fence_all dlm_stonith
> 22:11:12 vhbl04 dlm_controld[3899]: 330220 tell corosync to remove nodeid 167773705 from cluster
> 22:11:12 vhbl05 dlm_controld[4068]: 344431 tell corosync to remove nodeid 167773705 from cluster
> 22:11:19 vhbl04 dlm_controld[3899]: 330227 abandoned lockspace clvmd
> 22:11:19 vhbl05 dlm_controld[4068]: 344438 fence request 167773709 pid 26716 nodedown time 1456089072 fence_all dlm_stonith
> 22:11:19 vhbl05 dlm_controld[4068]: 344438 tell corosync to remove nodeid 167773706 from cluster
> 22:11:26 vhbl05 dlm_controld[4068]: 344445 abandoned lockspace clvmd
> 
>>> 2. Shouldn't some component translate between node IDs and node names?
>>>    Is this a configuration error in our setup?  Should I include both in
>>>    pcmk_host_list?
>>
>> Yes, stonithd's create_remote_stonith_op() function will do the
>> translation if the st_opt_cs_nodeid call option is set in the request
>> XML. If that fails, you'll see a "Could not expand nodeid" warning in
>> the log. That option is set by the kick() stonith API used by DLM, so it
>> should happen automatically.
> 
> After vhbl07 failed, the winner of the election might have been vhbl03,
> as its stonith daemon logged extra lines before the 'can not fence' one,
> as you predicted:
> 
> 22:11:12 vhbl03 stonith-ng[7952]:   notice: Could not obtain a node name for corosync nodeid 167773709
> 22:11:12 vhbl03 stonith-ng[7952]:   notice: Client stonith-api.20937.f3087e02 wants to fence (reboot) '167773709' with device '(any)'
> 22:11:12 vhbl03 stonith-ng[7952]:   notice: Could not obtain a node name for corosync nodeid 167773709
> 22:11:12 vhbl03 stonith-ng[7952]:  warning: Could not expand nodeid '167773709' into a host name (0x7f509ff1d790)
> 22:11:12 vhbl03 stonith-ng[7952]:   notice: Initiating remote operation reboot for 167773709: 9c470723-d318-4c7e-a705-ce9ee5c7ffe5 (0)
> 22:11:12 vhbl03 stonith-ng[7952]:   notice: fencing-vhbl05 can not fence (reboot) 167773709: static-list
> [...]
> 
>> I'm not sure why it appears not to have worked here; logs from other
>> nodes might help. Do corosync and pacemaker know the same node names?
>> That would be necessary to get the node name from corosync.
> 
> I haven't defined nodeids in corosync.conf.  What are "node names" in
> corosync at all?  Host names reverse-resolved from the ring0 address?

I believe this is the issue.  By node names, for corosync I mean the
ring0_addr values in the node{} entries of nodelist{}. When you don't
specify those, pacemaker will assume "uname -n" for the node names, and
usually everything is fine.
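
For illustration only (the addresses below are placeholders for your real
ring0 addresses, and as noted further down changing this requires a
corosync restart), an explicit nodelist would give pacemaker names to
resolve the IDs against, via ring0_addr or the nodelist.node.name key you
mention below:

    nodelist {
        node {
            ring0_addr: 10.0.6.9     # placeholder ring0 address
            name: vhbl03             # node name for pacemaker to use
            nodeid: 167773705
        }
        node {
            ring0_addr: 10.0.6.13    # placeholder ring0 address
            name: vhbl07
            nodeid: 167773709
        }
        # ... one node{} entry per cluster member ...
    }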

However, it appears that DLM may require explicit node names in order to
work reliably, because DLM only knows the corosync node IDs, and
pacemaker needs to map those to its node names.
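
As a quick check of what corosync itself knows (a sketch, assuming the
corosync 2.x tooling is installed), dumping the cmap nodelist keys on any
node shows whether names are available for that mapping at all:

    # prints nodelist.node.X.nodeid / ring0_addr / name keys, if any were configured
    corosync-cmapctl | grep nodelist.node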

There is a bit of irony here in that pacemaker did know the node names
(it had stored them in its peer cache), but it purged the cache entry
when the node went away, just before DLM's fencing request arrived.

> But you certainly have a point here.  On startup, I get messages like
> 
> Feb 21 22:44:31 vhbl03 pacemakerd[8521]: notice: Could not obtain a node name for corosync nodeid 167773705
> Feb 21 22:44:31 vhbl03 pacemakerd[8521]: notice: Defaulting to uname -n for the local corosync node name
> Feb 21 22:44:31 vhbl03 pacemakerd[8521]: notice: pcmk_quorum_notification: Node vhbl03[167773705] - state is now member (was (null))
> Feb 21 22:44:31 vhbl03 pacemakerd[8521]: notice: Could not obtain a node name for corosync nodeid 167773706
> Feb 21 22:44:31 vhbl03 pacemakerd[8521]: notice: Could not obtain a node name for corosync nodeid 167773706
> Feb 21 22:44:31 vhbl03 pacemakerd[8521]: notice: pcmk_quorum_notification: Node (null)[167773706] - state is now member (was (null))
> 
> I guess the (null) node name in the last line gets filled in later when
> that other node also defaults its own name to uname -n.  So this has to
> be fixed ASAP.  How could I fix this up in a running cluster?  If that's
> not readily possible, is adding the node IDs to pcmk_host_lists a good
> idea?

Changing them would require a restart of corosync.

However, you're right, you could do it via the fencing configuration. It
wouldn't work in pcmk_host_list because the fence agent wouldn't
recognize the ID, but I think you could use
pcmk_host_map="vhbl07:vhbl07;167773709:vhbl07;..." as a workaround.
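
A minimal sketch of applying that, assuming the device fencing vhbl07 is
named fencing-vhbl07 and is managed with pcs (adjust the device name and
tooling to your setup):

    # map both the node name and the corosync nodeid to the same host
    pcs stonith update fencing-vhbl07 \
        pcmk_host_map="vhbl07:vhbl07;167773709:vhbl07"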

> There are also documentation issues here in my opinion.
> 
> * http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-node-name.html
>   should mention why node names can be critically important.

Normally everything works behind the scenes, but DLM appears to be a
special case, which is definitely worth documenting somewhere.

> * the corosync manual does not mention nodelist.node.name
> 
> * https://bugzilla.redhat.com/show_bug.cgi?id=831737 contains good
>   information with explanations, but one doesn't find it until late :)
> 
> All the above still leaves the question of "best practice" open for me.
> 
>> Have you tested fencing vhbl07 from the command line with stonith_admin
>> to make sure fencing is configured correctly?
> 
> The later logs I included show a successful fencing of vhbl07.  As soon
> as stonith-ng tried it with the name instead of the ID, it worked.
> 
>>> 3. After the failed fence, why was 167773705 (vhbl03) removed from the
>>>    cluster?  Because it was chosen to execute the fencing operation, but
>>>    failed?
>>
>> dlm_controld explicitly requested it. I'm not familiar enough with DLM
>> to know why. It doesn't sound like a good idea to me.
> 
> It's really hard to get authoritative information on DLM..:(  I've Cc-ed
> David Teigland, he can probably shed some light on this.
> 
>>> 4. Why can't I see any action above to fence 167773705 (vhbl03)?
>>
>> Only the DC and the node that executes the fence will have those logs.
>> The other nodes will just have the query results ("can/can not fence")
>> and the final stonith result.
> 
> Even the above "can not fence" lines are for vhbl07, not vhbl03.  I
> can't find such logs on any node at all.  Maybe it's queued after the
> fencing of vhbl07?  Is there such a thing?






