[Pacemaker] wrong device in stonith_admin -l

Andrew Beekhof andrew at beekhof.net
Mon Dec 17 02:38:30 UTC 2012


On Sat, Dec 15, 2012 at 8:45 AM,  <laurent+pacemaker at u-picardie.fr> wrote:
> Andrew Beekhof <andrew at beekhof.net> writes:
>
>> On Wed, Dec 12, 2012 at 11:51 AM,  <laurent+pacemaker at u-picardie.fr> wrote:
>>>
>>> Hi,
>>>
>>> I've just observed something weird.
>>> A node is running a stonith resource for which gethosts gives an empty
>>> node list. The result of stonith_admin -l does include it in the
>>> device list !
>>>
>>> result of "stonith_admin -l elasticsearch-05" run from
>>> elasticsearch-06 :
>>>  stonith-xen-peatbull
>>>  stonith-xen-eddu
>>> 2 devices found
>>>
>>> stonith-xen-peatbull is a correct fencing device
>>> stonith-xen-eddu is a fencing device with an empty hostlist
>>>
>>> running "my-xen0 gethosts" with the stonith-xen-eddu params by hand
>>> doesn't return any host, and it does exit with 0 (is that correct to
>>> return 0 with an empty host list ?)
>>>
>>> logs :
>>> Dec 12 01:09:10 elasticsearch-06 stonith-ng[18181]:   notice: stonith_device_register: Added 'stonith-cluster-xen' to the device list (6 active devices)
>>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
>>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]:   notice: attrd_perform_update: Sent update 5: probe_complete=true
>>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]:   notice: stonith_device_register: Added 'stonith-xen-eddu' to the device list (6 active devices)
>>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]:   notice: stonith_device_register: Added 'stonith-xen-peatbull' to the device list (6 active devices)
>>> Dec 12 01:09:12 elasticsearch-06 stonith: [18434]: info: external/my-xen0-ha device OK.
>>> Dec 12 01:09:12 elasticsearch-06 crmd[18185]:   notice: process_lrm_event: LRM operation stonith-cluster-xen_start_0 (call=61,rc=0, cib-update=27, confirmed=true) ok
>>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-05
>>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-06
>>> Dec 12 01:09:15 elasticsearch-06 stonith: [18465]: info: external/my-xen0 device OK.
>>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]:   notice: process_lrm_event: LRM operation stonith-xen-peatbull_start_0 (call=68, rc=0, cib-update=28, confirmed=true) ok
>>> Dec 12 01:09:15 elasticsearch-06 stonith: [18458]: info: external/my-xen0 device OK.
>>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]:   notice: process_lrm_event: LRM operation stonith-xen-eddu_start_0 (call=66, rc=0, cib-update=29, confirmed=true) ok
>>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-kornog (1): (null)
>>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-nikka (1): (null)
>>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-yoichi (1): (null)
>>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: CRIT: external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
>>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: ERROR: Could not list hosts for external/my-xen0.
>>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: CRIT: external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
>>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: ERROR: Could not list hosts for external/my-xen0.
>>> Dec 12 01:12:37 elasticsearch-06 stonith-ng[18181]:   notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-eddu (1): failed:  255
>>>
>>> David, I mentioned a node being wrongly fenced in the "stonith-timeout
>>> duration 0 is too low" bug, could it be related ?
>
> Hi,
>
>> Doubtful, what does your config look like?
>
> I've restarted from scratch with a simpler setup:
> primitive dummy_01 ocf:heartbeat:Dummy \
>         meta allow-migrate="true" \
>         op monitor interval="180" timeout="20"
> primitive stonith-xen-eddu stonith:external/my-xen0 \
>         params
>         hostlist="elasticsearch-01 elasticsearch-02 elasticsearch-03 elasticsearch-04" dom0="eddu"
> clone clone-stonith-xen-eddu stonith-xen-eddu \
>         meta clone-max="3" clone-node-max="1"
> location clone-stonith-xen-eddu-location-01 clone-stonith-xen-eddu \
>         rule $id="clone-stonith-xen-eddu-location-01-rule" inf: defined #uname
> location dummy_01-location-01 dummy_01 \
>         rule $id="dummy_01-location-01-rule" inf: defined #uname
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.8-56429db" \
>         cluster-infrastructure="corosync" \
>         stonith-timeout="120" \
>         symmetric-cluster="false" \
>         no-quorum-policy="stop" \
>         stonith-enabled="true"
>
> There are 6 nodes: elasticsearch-01 ... 06.
> AFAIK pcmk_host_check defaults to "dynamic-list".
>
> When the external stonith agent is called with "gethosts", it checks
> whether any of the guests are running on eddu (the Xen dom0/host).
> In this case none of them are running on eddu, so it returns
> an empty hostlist.
> Looking at the logs, there's a critical message concerning the empty
> hostlist.
> So I guess it's not valid for a stonith primitive to temporarily
> have no hosts to fence.

Just to be clear, that's the cluster-glue stonith binary complaining.
Not pacemaker.
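
For reference, the gethosts behaviour described above could be sketched
roughly like this. It is only an illustration, not the actual my-xen0
agent: running_guests and the RUNNING_GUESTS variable are hypothetical
stand-ins for the real query against the dom0, and returning non-zero
on an empty list is an assumption on my part, not documented
cluster-glue behaviour. It does assume the plugin gets its parameters
(hostlist here) as environment variables.

```shell
#!/bin/sh
# Sketch of a "gethosts" handler for an external stonith plugin.
# Assumption: parameters such as hostlist arrive as environment variables.

# Hypothetical stand-in for the real query against the Xen dom0.
# Prints the names of the guests currently running there, one per line.
running_guests() {
    printf '%s\n' $RUNNING_GUESTS
}

# Print every host from $hostlist that is currently running on the dom0.
# Returns 1 when nothing matched, so the caller sees a failure instead of
# a successful but empty answer (an assumption, see the note above).
gethosts() {
    found=1
    guests=$(running_guests)
    for host in $hostlist; do
        for guest in $guests; do
            if [ "$host" = "$guest" ]; then
                echo "$host"
                found=0
            fi
        done
    done
    return $found
}
```

With hostlist="elasticsearch-01 elasticsearch-02" and only
elasticsearch-02 running on the dom0, gethosts prints elasticsearch-02
and returns 0. With no matching guest it prints nothing and returns 1.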

>
> It's just that I would certainly not expect that device to appear in
> the result of "stonith_admin -l nodename".
> And it does ! :)

Might be time to create a bug and attach logs.

> I've just reproduced it again starting a new cluster from scratch and
> using the above config.
> Let's say the stonith agent runs on nodes 02, 03 and 04.
> The first time I run stonith_admin -l "elasticsearch-01" on node 02,
> 03 or 04 it returns "No devices found". From the second attempt on, it
> does list "stonith-xen-eddu" as a valid device.
>
> That's a behavior I did observe with the "stonith-timeout duration 0
> is too low" bug.
> I wouldn't be surprised if it was related: in case of a timeout or in
> case of an empty hostlist the stonith device is wrongly reported as
> a valid fencing device instead of being blacklisted/disabled.
>
> I hope it's a bit clearer now. If not, I'll have to try to learn how to
> write a test case for it. (that would definitely make it clearer !)
> :-)
>
>
>> IIRC, these agents want to be told which machines they can fence
>
> I'd say that's true for the ipmi agent.
> But a xen guest might be migrated from one host to another.

Agreed. But I believe that's how most of them are written.



