[Pacemaker] wrong device in stonith_admin -l
laurent+pacemaker at u-picardie.fr
Fri Dec 14 16:45:49 EST 2012
Andrew Beekhof <andrew at beekhof.net> writes:
> On Wed, Dec 12, 2012 at 11:51 AM, <laurent+pacemaker at u-picardie.fr> wrote:
>>
>> Hi,
>>
>> I've just observed something weird.
>> A node is running a stonith resource for which gethosts gives an empty
>> node list. The result of stonith_admin -l does include it in the
>> device list !
>>
>> result of "stonith_admin -l elasticsearch-05" run from
>> elasticsearch-06 :
>> stonith-xen-peatbull
>> stonith-xen-eddu
>> 2 devices found
>>
>> stonith-xen-peatbull is a correct fencing device
>> stonith-xen-eddu is a fencing device with an empty hostlist
>>
>> running "my-xen0 gethosts" with the stonith-xen-eddu params by hand
>> doesn't return any host, and it does exit with 0 (is that correct to
>> return 0 with an empty host list ?)
>>
>> logs :
>> Dec 12 01:09:10 elasticsearch-06 stonith-ng[18181]: notice: stonith_device_register: Added 'stonith-cluster-xen' to the device list (6 active devices)
>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]: notice: attrd_perform_update: Sent update 5: probe_complete=true
>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]: notice: stonith_device_register: Added 'stonith-xen-eddu' to the device list (6 active devices)
>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]: notice: stonith_device_register: Added 'stonith-xen-peatbull' to the device list (6 active devices)
>> Dec 12 01:09:12 elasticsearch-06 stonith: [18434]: info: external/my-xen0-ha device OK.
>> Dec 12 01:09:12 elasticsearch-06 crmd[18185]: notice: process_lrm_event: LRM operation stonith-cluster-xen_start_0 (call=61,rc=0, cib-update=27, confirmed=true) ok
>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-05
>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-06
>> Dec 12 01:09:15 elasticsearch-06 stonith: [18465]: info: external/my-xen0 device OK.
>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]: notice: process_lrm_event: LRM operation stonith-xen-peatbull_start_0 (call=68, rc=0, cib-update=28, confirmed=true) ok
>> Dec 12 01:09:15 elasticsearch-06 stonith: [18458]: info: external/my-xen0 device OK.
>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]: notice: process_lrm_event: LRM operation stonith-xen-eddu_start_0 (call=66, rc=0, cib-update=29, confirmed=true) ok
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-kornog (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-nikka (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-yoichi (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: CRIT: external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: ERROR: Could not list hosts for external/my-xen0.
>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: CRIT: external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: ERROR: Could not list hosts for external/my-xen0.
>> Dec 12 01:12:37 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-eddu (1): failed: 255
>>
>> David, I mentioned a node being wrongly fenced in the "stonith-timeout
>> duration 0 is too low" bug, could it be related ?
Hi,
> Doubtful, what does your config look like?
I've restarted from scratch with a simpler setup:
primitive dummy_01 ocf:heartbeat:Dummy \
        meta allow-migrate="true" \
        op monitor interval="180" timeout="20"
primitive stonith-xen-eddu stonith:external/my-xen0 \
        params hostlist="elasticsearch-01 elasticsearch-02 elasticsearch-03 elasticsearch-04" dom0="eddu"
clone clone-stonith-xen-eddu stonith-xen-eddu \
        meta clone-max="3" clone-node-max="1"
location clone-stonith-xen-eddu-location-01 clone-stonith-xen-eddu \
        rule $id="clone-stonith-xen-eddu-location-01-rule" inf: defined #uname
location dummy_01-location-01 dummy_01 \
        rule $id="dummy_01-location-01-rule" inf: defined #uname
property $id="cib-bootstrap-options" \
        dc-version="1.1.8-56429db" \
        cluster-infrastructure="corosync" \
        stonith-timeout="120" \
        symmetric-cluster="false" \
        no-quorum-policy="stop" \
        stonith-enabled="true"
There are 6 nodes: elasticsearch-01 ... 06
AFAIK pcmk_host_check defaults to "dynamic-list".
When the external stonith agent is called with "gethosts", it checks whether
any of the guests are running on eddu (the Xen dom0/host).
In this case none of them are running on eddu, so it returns
an empty hostlist.
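
To make the "gethosts" behaviour described above concrete, here is a minimal sketch in shell. It is not the actual my-xen0 agent: the helper running_guests_on is hypothetical and stubbed out, and the argument passing is simplified (a real external plugin receives its parameters through environment variables). It only illustrates how intersecting the configured hostlist with the guests currently on the dom0 can yield an empty list together with exit code 0.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of any real agent): prints the
# names of the guests currently running on the given dom0, one per line.
# A real agent would query the Xen host here; this stub reports none.
running_guests_on() {
    # $1 = dom0 name
    :
}

# Print every name from the configured hostlist ($1, space separated) that is
# also reported as running on the dom0 ($2). Returns 0 even when nothing
# matches, which mirrors the behaviour described in the post.
gethosts() {
    hostlist=$1
    dom0=$2
    running=$(running_guests_on "$dom0")
    for h in $hostlist; do
        # -x: match the whole line, -F: fixed string, -q: no output
        if printf '%s\n' "$running" | grep -qxF -- "$h"; then
            printf '%s\n' "$h"
        fi
    done
    return 0
}
```

With the stub as written, calling gethosts "elasticsearch-01 elasticsearch-02" eddu prints nothing and returns 0, which is the situation that produces the "returned an empty hostlist" message in the logs quoted earlier.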
Looking at the logs there's a critical message concerning the empty
hostlist.
So I guess it's not valid for a stonith primitive to temporarily have
no hosts to fence.
Still, I would certainly not expect that device to appear in the
output of "stonith_admin -l nodename".
And it does! :)
I've just reproduced it again starting a new cluster from scratch and
using the above config.
Let's say the stonith agent runs on nodes 02, 03 and 04.
The first time I run stonith_admin -l "elasticsearch-01" on node 02,
03 or 04 it returns "No devices found". From the second attempt on it
lists "stonith-xen-eddu" as a valid device.
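
The two consecutive queries described above can be scripted so the outputs are easy to compare. This is only a sketch: the STONITH_ADMIN variable and the list_twice function are names introduced here for illustration, and the only stonith_admin invocation used is the "-l <node>" form already shown in this thread.

```shell
#!/bin/sh
# Command to run; overridable (an assumption made for this sketch) so the
# loop can be tried on a machine without a cluster.
STONITH_ADMIN=${STONITH_ADMIN:-stonith_admin}

# Query the fencing devices for node $1 twice and label each attempt, so a
# difference between the first and second answer is easy to spot.
list_twice() {
    node=$1
    for attempt in 1 2; do
        echo "attempt $attempt:"
        "$STONITH_ADMIN" -l "$node"
    done
}
```

Running list_twice elasticsearch-01 on one of the nodes hosting the stonith clone should show whether the device list changes between the first and second query.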
That's the same behavior I observed with the "stonith-timeout duration 0
is too low" bug.
I wouldn't be surprised if it was related: in case of a timeout or in
case of an empty hostlist the stonith device is wrongly reported as
a valid fencing device instead of being blacklisted/disabled.
I hope it's a bit clearer now. If not, I'll have to try to learn how to
write a test case for it. (That would definitely make it clearer!)
:-)
> IIRC, these agents want to be told which machines they can fence
I'd say that's true for the IPMI agent.
But a Xen guest might be migrated from one host to another.
--
Laurent