[ClusterLabs] fencing by node name or by node ID

Ferenc Wágner wferi at niif.hu
Tue Feb 23 00:56:16 UTC 2016


Ken Gaillot <kgaillot at redhat.com> writes:

> On 02/21/2016 06:19 PM, Ferenc Wágner wrote:
> 
>> Last night a node in our cluster (Corosync 2.3.5, Pacemaker 1.1.14)
>> experienced some failure and fell out of the cluster: [...]
>> 
>> However, no fencing agent reported ability to fence the failing node
>> (vhbl07), because stonith-ng wasn't looking it up by name, but by
>> numeric ID (at least that's what the logs suggest to me), and the
>> pcmk_host_list attributes contained strings like vhbl07.
>> 
>> 1. Was it dlm_controld who requested the fencing?
>> 
>>    I suspect it because of the "dlm: closing connection to node
>>    167773709" kernel message right before the stonith-ng logs.  And
>>    dlm_controld really hasn't got anything to use but the corosync node
>>    ID.
>
> Not based on this; dlm would print messages about fencing, with
> "dlm_controld.*fence request".
>
> However it looks like these logs are not from the DC, which will say
> what process requested the fencing. It may be DLM or something else.
> Also, DLM on any node might initiate fencing, so it's worth looking at
> all the nodes' logs around this time.

What's a good way to determine the DC node from the logs?  Messages like
the following make me think it was the failing node, vhbl07:

22:11:12 vhbl03 crmd[7956]:   notice: Our peer on the DC (vhbl07) is dead
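
(For lack of a better idea, I grepped what logs we have for crmd's DC
messages, something like the following, with the file names obviously
depending on where the logs ended up:

   grep -h 'Set DC to\|on the DC' <each node's syslog>

but the message above was the clearest hint.)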

The local disks contained no usable logs besides what I've already
shown, but the remote log server had more to say.  Unfortunately, it
stored the various facilities in different files with low-resolution
time stamps, so we've only got partial ordering information.

To rehash: vhbl03 - 167773705
           vhbl04 - 167773706
           vhbl05 - 167773707
           vhbl06 - 167773708
           vhbl07 - 167773709 (the failed node)
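
Side note: since we never configured explicit nodeids, I suspect these
are simply the ring0 IPv4 addresses packed into 32 bits (the IPv4
default according to corosync.conf(5)).  Decoding vhbl07's ID as an
illustration only, not verified against our actual addresses:

   $ printf '%d.%d.%d.%d\n' \
         $(( (167773709 >> 24) & 255 )) $(( (167773709 >> 16) & 255 )) \
         $(( (167773709 >>  8) & 255 )) $((  167773709         & 255 ))
   10.0.6.13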

There are DLM fence requests for vhbl07 on vhbl0[34], and later on
vhbl05:

22:11:12 vhbl03 dlm_controld[3644]: 349002 fence request 167773709 pid 20937 nodedown time 1456089072 fence_all dlm_stonith
22:11:12 vhbl03 dlm_controld[3644]: 349002 abandoned lockspace clvmd
22:11:12 vhbl04 dlm_controld[3899]: 330220 fence request 167773709 pid 17462 nodedown time 1456089072 fence_all dlm_stonith
22:11:12 vhbl04 dlm_controld[3899]: 330220 tell corosync to remove nodeid 167773705 from cluster
22:11:12 vhbl05 dlm_controld[4068]: 344431 tell corosync to remove nodeid 167773705 from cluster
22:11:19 vhbl04 dlm_controld[3899]: 330227 abandoned lockspace clvmd
22:11:19 vhbl05 dlm_controld[4068]: 344438 fence request 167773709 pid 26716 nodedown time 1456089072 fence_all dlm_stonith
22:11:19 vhbl05 dlm_controld[4068]: 344438 tell corosync to remove nodeid 167773706 from cluster
22:11:26 vhbl05 dlm_controld[4068]: 344445 abandoned lockspace clvmd

>> 2. Shouldn't some component translate between node IDs and node names?
>>    Is this a configuration error in our setup?  Should I include both in
>>    pcmk_host_list?
>
> Yes, stonithd's create_remote_stonith_op() function will do the
> translation if the st_opt_cs_nodeid call option is set in the request
> XML. If that fails, you'll see a "Could not expand nodeid" warning in
> the log. That option is set by the kick() stonith API used by DLM, so it
> should happen automatically.

After vhbl07 failed, the winner of the election might have been vhbl03,
as its stonith daemon logged extra lines before the 'can not fence' one,
as you predicted:

22:11:12 vhbl03 stonith-ng[7952]:   notice: Could not obtain a node name for corosync nodeid 167773709
22:11:12 vhbl03 stonith-ng[7952]:   notice: Client stonith-api.20937.f3087e02 wants to fence (reboot) '167773709' with device '(any)'
22:11:12 vhbl03 stonith-ng[7952]:   notice: Could not obtain a node name for corosync nodeid 167773709
22:11:12 vhbl03 stonith-ng[7952]:  warning: Could not expand nodeid '167773709' into a host name (0x7f509ff1d790)
22:11:12 vhbl03 stonith-ng[7952]:   notice: Initiating remote operation reboot for 167773709: 9c470723-d318-4c7e-a705-ce9ee5c7ffe5 (0)
22:11:12 vhbl03 stonith-ng[7952]:   notice: fencing-vhbl05 can not fence (reboot) 167773709: static-list
[...]

> I'm not sure why it appears not to have worked here; logs from other
> nodes might help. Do corosync and pacemaker know the same node names?
> That would be necessary to get the node name from corosync.

I haven't defined nodeids in corosync.conf.  What are "node names" in
corosync at all?  Host names reverse-resolved from the ring0 address?
But you certainly have a point here.  On startup, I get messages like

Feb 21 22:44:31 vhbl03 pacemakerd[8521]: notice: Could not obtain a node name for corosync nodeid 167773705
Feb 21 22:44:31 vhbl03 pacemakerd[8521]: notice: Defaulting to uname -n for the local corosync node name
Feb 21 22:44:31 vhbl03 pacemakerd[8521]: notice: pcmk_quorum_notification: Node vhbl03[167773705] - state is now member (was (null))
Feb 21 22:44:31 vhbl03 pacemakerd[8521]: notice: Could not obtain a node name for corosync nodeid 167773706
Feb 21 22:44:31 vhbl03 pacemakerd[8521]: notice: Could not obtain a node name for corosync nodeid 167773706
Feb 21 22:44:31 vhbl03 pacemakerd[8521]: notice: pcmk_quorum_notification: Node (null)[167773706] - state is now member (was (null))

I guess the (null) node name in the last line gets filled in later when
that other node also defaults its own name to uname -n.  So this has to
be fixed ASAP.  How could I fix this up in a running cluster?  If that's
not readily possible, is adding the node IDs to the pcmk_host_list
attributes a good idea?
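
To be concrete about the latter: I mean something along these lines
(untested; fencing-vhbl07 is only my guess at the resource name,
following the fencing-vhbl05 pattern above):

   # sketch only: list the node both by name and by corosync nodeid,
   # so a static-list lookup matches either form
   crm_resource --resource fencing-vhbl07 \
                --set-parameter pcmk_host_list \
                --parameter-value "vhbl07 167773709"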

There are also documentation issues here in my opinion.

* http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-node-name.html
  should mention why node names can be critically important.

* the corosync manual does not mention nodelist.node.name (see the
  sketch below this list)

* https://bugzilla.redhat.com/show_bug.cgi?id=831737 contains good
  information with explanations, but one doesn't find it until late :)
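
About nodelist.node.name: the sketch I have in mind is an explicit
nodelist carrying the name and the current nodeid of each node, roughly
as below (the addresses are placeholders, and whether a running corosync
can pick this up without a restart is exactly what I don't know):

   nodelist {
       node {
           ring0_addr: <vhbl03's ring0 address>
           nodeid: 167773705
           name: vhbl03
       }
       # ... one entry per node, down to:
       node {
           ring0_addr: <vhbl07's ring0 address>
           nodeid: 167773709
           name: vhbl07
       }
   }

Afterwards corosync-cmapctl | grep ^nodelist should show the names, and
Pacemaker should hopefully stop falling back to uname -n.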

All the above still leaves the question of "best practice" open for me.

> Have you tested fencing vhbl07 from the command line with stonith_admin
> to make sure fencing is configured correctly?

The later logs I included show a successful fencing of vhbl07.  As soon
as stonith-ng tried it with the name instead of the ID, it worked.
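
For next time, I suppose the command line check would be something like
the following, though I'm not sure stonith_admin requests the nodeid
translation the way dlm_stonith does:

   # which devices claim they can fence the node, by name and by ID?
   stonith_admin --list vhbl07
   stonith_admin --list 167773709
   # and a live test:
   stonith_admin --reboot vhbl07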

>> 3. After the failed fence, why was 167773705 (vhbl03) removed from the
>>    cluster?  Because it was chosen to execute the fencing operation, but
>>    failed?
>
> dlm_controld explicitly requested it. I'm not familiar enough with DLM
> to know why. It doesn't sound like a good idea to me.

It's really hard to get authoritative information on DLM... :(  I've
Cc-ed David Teigland; he can probably shed some light on this.

>> 4. Why can't I see any action above to fence 167773705 (vhbl03)?
>
> Only the DC and the node that executes the fence will have those logs.
> The other nodes will just have the query results ("can/can not fence")
> and the final stonith result.

Even the above "can not fence" lines are for vhbl07, not vhbl03.  I
can't find such logs on any node at all.  Maybe it's queued after the
fencing of vhbl07?  Is there such a thing?
-- 
Thanks a lot,
Feri.



