[Pacemaker] stonith in a virtual cluster

Andreas Kurz andreas at hastexo.com
Tue Mar 6 08:38:15 EST 2012


Hello,

On 03/01/2012 09:13 PM, Jean-Francois Malouin wrote:
> * Florian Haas <florian at hastexo.com> [20120229 08:12]:
>> Jean-François,
>>
>> I realize I'm late to this discussion, however allow me to chime in here anyhow:
>>
>> On Mon, Feb 27, 2012 at 11:45 PM, Jean-Francois Malouin
>> <Jean-Francois.Malouin at bic.mni.mcgill.ca> wrote:
>>>> Have you looked at fence_virt? http://www.clusterlabs.org/wiki/Guest_Fencing
>>>
>>> Yes I did.
>>>
>>> I had a quick go last week at compiling it on Debian/Squeeze with
>>> backports but with no luck.
>>
>> Seeing as you're on Debian, there really is no need to use fence_virt.
>> Instead, you should just be able to use the "external/libvirt" STONITH
>> plugin that ships with cluster-glue (in squeeze-backports). That
>> plugin works like a charm and I've used it in testing many times. No
>> need to compile anything.
>>
>> http://www.hastexo.com/resources/hints-and-kinks/fencing-virtual-cluster-nodes
>> may be a helpful resource.
> 
> Thanks Florian! Exactly what I needed!
> 
> I set it up as you explained above. I can virsh from the guests to the
> physical host but I'm experiencing a few oddities...
> 
> If I manually stonith node1 from node2 (or killall -9 corosync on
> node1) I get repeated console messages:
> 
> node2 stonith: [31734]: CRIT: external_reset_req: 'libvirt reset' for host node1 failed with rc 1
> 
> and syslog shows:
> 
> Mar  1 14:00:51 node2 pengine: [991]: WARN: pe_fence_node: Node node1 will be fenced because it is un-expectedly down
> Mar  1 14:00:51 node2 pengine: [991]: WARN: determine_online_status: Node node1 is unclean
> Mar  1 14:00:51 node2 pengine: [991]: notice: unpack_rsc_op: Operation fence_node1_last_failure_0 found resource fence_node1 active on node2
> Mar  1 14:00:51 node2 pengine: [991]: notice: unpack_rsc_op: Operation fence_node2_last_failure_0 found resource fence_node2 active on node1
> Mar  1 14:00:51 node2 pengine: [991]: WARN: custom_action: Action resPing:0_stop_0 on node1 is unrunnable (offline)
> Mar  1 14:00:51 node2 pengine: [991]: WARN: custom_action: Marking node node1 unclean
> Mar  1 14:00:51 node2 pengine: [991]: WARN: custom_action: Action fence_node2_stop_0 on node1 is unrunnable (offline)
> Mar  1 14:00:51 node2 pengine: [991]: WARN: custom_action: Marking node node1 unclean
> Mar  1 14:00:51 node2 pengine: [991]: WARN: stage6: Scheduling Node node1 for STONITH
> ...
> Mar  1 14:00:52 node2 stonith-ng: [987]: info: initiate_remote_stonith_op: Initiating remote operation reboot for node1: 339d69d4-7d46-46a0-8256-e2c9a6637f09
> Mar  1 14:00:52 node2 stonith-ng: [987]: info: can_fence_host_with_device: Refreshing port list for fence_node1
> Mar  1 14:00:52 node2 stonith-ng: [987]: WARN: parse_host_line: Could not parse (0 0):
> Mar  1 14:00:52 node2 stonith-ng: [987]: info: can_fence_host_with_device: fence_node1 can fence node1: dynamic-list
> Mar  1 14:00:52 node2 stonith-ng: [987]: info: call_remote_stonith: Requesting that node2 perform op reboot node1
> Mar  1 14:00:52 node2 stonith-ng: [987]: info: stonith_fence: Exec <stonith_command t="stonith-ng" st_async_id="339d69d4-7d46-46a0-8256-e2c9a6637f09" st_op="st_fence" st_callid="0" st_callopt="0" st_remote_op="339d69d4-7d46-46a0-8256-e2c9a6637f09" st_target="node1" st_device_action="reboot" st_timeout="54000" src="node2" seq="3" />
> Mar  1 14:00:52 node2 stonith-ng: [987]: info: can_fence_host_with_device: fence_node1 can fence node1: dynamic-list
> Mar  1 14:00:52 node2 stonith-ng: [987]: info: stonith_fence: Found 1 matching devices for 'node1'
> ...
> Mar  1 14:00:52 node2 stonith-ng: [987]: info: stonith_command: Processed st_fence from node2: rc=-1
> Mar  1 14:00:52 node2 stonith-ng: [987]: info: make_args: reboot-ing node 'node1' as 'port=node1'
> Mar  1 14:00:52 node2 pengine: [991]: WARN: process_pe_message: Transition 1: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-8.bz2
> Mar  1 14:00:52 node2 pengine: [991]: notice: process_pe_message: Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" to identify issues.
> Mar  1 14:00:57 node2 external/libvirt[31741]: [31769]: notice: Domain node1 was stopped
> Mar  1 14:01:02 node2 external/libvirt[31741]: [31783]: ERROR: Failed to start domain node1
> Mar  1 14:01:02 node2 external/libvirt[31741]: [31789]: ERROR: error: failed to get domain 'node1'
> Mar  1 14:01:02 node2 external/libvirt[31741]: [31789]: error: Domain not found: xenUnifiedDomainLookupByName

Do you already use libvirt to manage your Xen VMs? Or is there a chance
you manage them only with Xen's native "xm" command, so that you only
have xm config files stored in /etc/xen/ and no libvirt XML definition
files for the VMs in /etc/libvirt/xen/?

Without those XML definition files, libvirt won't be able to start Xen
VMs. Have a look at http://libvirt.org/drvxen.html#xmlimport if you want
to create them easily.
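As a rough sketch (untested here; the guest name "node1" and the config
path are just examples from your logs, adjust them to your setup), the
import boils down to:

    # Convert a native xm config file to libvirt domain XML:
    virsh domxml-from-native xen-xm /etc/xen/node1 > /tmp/node1.xml

    # Review the generated XML, then make the domain persistent:
    virsh define /tmp/node1.xml

    # The guest should now be listed even while shut off:
    virsh list --all

Once the domain is defined persistently, "failed to get domain 'node1'"
after a destroy should go away, because libvirt can still find the
definition to boot it again.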

You could also try the external/xen0 STONITH resource agent.
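For reference, a minimal external/libvirt STONITH configuration in crm
shell syntax might look like the following (hostnames and the hypervisor
URI are examples, not taken from your config; treat it as a sketch):

    primitive fence_node1 stonith:external/libvirt \
        params hostlist="node1" hypervisor_uri="xen:///" \
        op monitor interval="60s"
    # Never let a node run the device that is meant to shoot it:
    location l-fence_node1 fence_node1 -inf: node1

external/xen0 is configured analogously, but talks to xm directly
instead of going through libvirt.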

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


> 
> 
> At this point I can't restart the stonith'ed node1, the CIB lists it as
> UNCLEAN: first I have to manually wipe it clean with 
> 
> 'crm node clearstate node1' 
> 
> as otherwise the surviving node2 just keeps shooting it, and some dummy
> resources (and an IP resource colocated with a ping to the
> hypervisor) don't restart properly by themselves.
> 
> Must be something simple that I overlooked...
> 
> Any ideas?
> 
> jf
> 
>>
>> Cheers,
>> Florian
>>
>> -- 
>> Need help with High Availability?
>> http://www.hastexo.com/now
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 

