[Pacemaker] stonith in a virtual cluster

Jean-Francois Malouin Jean-Francois.Malouin at bic.mni.mcgill.ca
Thu Mar 1 15:13:41 EST 2012


* Florian Haas <florian at hastexo.com> [20120229 08:12]:
> Jean-François,
> 
> I realize I'm late to this discussion, however allow me to chime in here anyhow:
> 
> On Mon, Feb 27, 2012 at 11:45 PM, Jean-Francois Malouin
> <Jean-Francois.Malouin at bic.mni.mcgill.ca> wrote:
> >> Have you looked at fence_virt? http://www.clusterlabs.org/wiki/Guest_Fencing
> >
> > Yes I did.
> >
> > I had a quick go last week at compiling it on Debian/Squeeze with
> > backports but with no luck.
> 
> Seeing as you're on Debian, there really is no need to use fence_virt.
> Instead, you should just be able to use the "external/libvirt" STONITH
> plugin that ships with cluster-glue (in squeeze-backports). That
> plugin works like a charm and I've used it in testing many times. No
> need to compile anything.
> 
> http://www.hastexo.com/resources/hints-and-kinks/fencing-virtual-cluster-nodes
> may be a helpful resource.

Thanks Florian! Exactly what I needed!

I set it up as you explained above. I can virsh from the guests to the
physical host but I'm experiencing a few oddities...
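
For reference, the fencing resources are configured roughly as in the
sketch below (the hypervisor URI is only a placeholder for the real Xen
host, and the exact parameter values may differ slightly):

  primitive fence_node1 stonith:external/libvirt \
          params hostlist="node1" \
                 hypervisor_uri="xen+ssh://hypervisor.example.com/" \
          op monitor interval="60s"
  primitive fence_node2 stonith:external/libvirt \
          params hostlist="node2" \
                 hypervisor_uri="xen+ssh://hypervisor.example.com/" \
          op monitor interval="60s"
  # keep each fencing resource off the node it is supposed to shoot
  location loc-fence-node1 fence_node1 -inf: node1
  location loc-fence-node2 fence_node2 -inf: node2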

If I manually stonith node1 from node2 (or killall -9 corosync on
node1), I get repeated console messages:

node2 stonith: [31734]: CRIT: external_reset_req: 'libvirt reset' for host node1 failed with rc 1

and syslog shows:

Mar  1 14:00:51 node2 pengine: [991]: WARN: pe_fence_node: Node node1 will be fenced because it is un-expectedly down
Mar  1 14:00:51 node2 pengine: [991]: WARN: determine_online_status: Node node1 is unclean
Mar  1 14:00:51 node2 pengine: [991]: notice: unpack_rsc_op: Operation fence_node1_last_failure_0 found resource fence_node1 active on node2
Mar  1 14:00:51 node2 pengine: [991]: notice: unpack_rsc_op: Operation fence_node2_last_failure_0 found resource fence_node2 active on node1
Mar  1 14:00:51 node2 pengine: [991]: WARN: custom_action: Action resPing:0_stop_0 on node1 is unrunnable (offline)
Mar  1 14:00:51 node2 pengine: [991]: WARN: custom_action: Marking node node1 unclean
Mar  1 14:00:51 node2 pengine: [991]: WARN: custom_action: Action fence_node2_stop_0 on node1 is unrunnable (offline)
Mar  1 14:00:51 node2 pengine: [991]: WARN: custom_action: Marking node node1 unclean
Mar  1 14:00:51 node2 pengine: [991]: WARN: stage6: Scheduling Node node1 for STONITH
...
Mar  1 14:00:52 node2 stonith-ng: [987]: info: initiate_remote_stonith_op: Initiating remote operation reboot for node1: 339d69d4-7d46-46a0-8256-e2c9a6637f09
Mar  1 14:00:52 node2 stonith-ng: [987]: info: can_fence_host_with_device: Refreshing port list for fence_node1
Mar  1 14:00:52 node2 stonith-ng: [987]: WARN: parse_host_line: Could not parse (0 0):
Mar  1 14:00:52 node2 stonith-ng: [987]: info: can_fence_host_with_device: fence_node1 can fence node1: dynamic-list
Mar  1 14:00:52 node2 stonith-ng: [987]: info: call_remote_stonith: Requesting that node2 perform op reboot node1
Mar  1 14:00:52 node2 stonith-ng: [987]: info: stonith_fence: Exec <stonith_command t="stonith-ng" st_async_id="339d69d4-7d46-46a0-8256-e2c9a6637f09" st_op="st_fence" st_callid="0" st_callopt="0" st_remote_op="339d69d4-7d46-46a0-8256-e2c9a6637f09" st_target="node1" st_device_action="reboot" st_timeout="54000" src="node2" seq="3" />
Mar  1 14:00:52 node2 stonith-ng: [987]: info: can_fence_host_with_device: fence_node1 can fence node1: dynamic-list
Mar  1 14:00:52 node2 stonith-ng: [987]: info: stonith_fence: Found 1 matching devices for 'node1'
...
Mar  1 14:00:52 node2 stonith-ng: [987]: info: stonith_command: Processed st_fence from node2: rc=-1
Mar  1 14:00:52 node2 stonith-ng: [987]: info: make_args: reboot-ing node 'node1' as 'port=node1'
Mar  1 14:00:52 node2 pengine: [991]: WARN: process_pe_message: Transition 1: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-8.bz2
Mar  1 14:00:52 node2 pengine: [991]: notice: process_pe_message: Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" to identify issues.
Mar  1 14:00:57 node2 external/libvirt[31741]: [31769]: notice: Domain node1 was stopped
Mar  1 14:01:02 node2 external/libvirt[31741]: [31783]: ERROR: Failed to start domain node1
Mar  1 14:01:02 node2 external/libvirt[31741]: [31789]: ERROR: error: failed to get domain 'node1'
Mar  1 14:01:02 node2 external/libvirt[31741]: [31789]: error: Domain not found: xenUnifiedDomainLookupByName
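
For what it's worth, the same failure can presumably be reproduced
outside Pacemaker by driving the plugin directly with the cluster-glue
stonith(8) tool and then asking the hypervisor what it still knows
about the domain (the URI below is again a placeholder):

  # exercise the same plugin reset that stonith-ng is calling
  stonith -t external/libvirt -T reset \
          hostlist="node1" hypervisor_uri="xen+ssh://hypervisor.example.com/" \
          node1

  # check whether the domain is still known to the hypervisor afterwards;
  # a transient domain disappears once it is destroyed, which would match
  # the "Domain not found" on the restart attempt
  virsh -c xen+ssh://hypervisor.example.com/ list --all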


At this point I can't restart the stonith'ed node1 as the cib lists it
as UNCLEAN: I first have to manually wipe it clean with

'crm node clearstate node1'

as otherwise the surviving node2 just keeps shooting it, and some dummy
resources (and an IP resource whose location depends on a ping to the
hypervisor) don't restart properly by themselves.
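
In case it matters, the ping/IP part looks approximately like this (the
IP resource name, address and netmask below are placeholders; resPing
is the clone instance that shows up in the log above):

  primitive resPing ocf:pacemaker:ping \
          params host_list="192.168.0.1" multiplier="100" \
          op monitor interval="15s"
  clone cloPing resPing
  primitive resIP ocf:heartbeat:IPaddr2 \
          params ip="192.168.0.10" cidr_netmask="24" \
          op monitor interval="30s"
  # only run the IP where the ping attribute says the hypervisor is reachable
  location loc-ip-on-ping resIP \
          rule -inf: not_defined pingd or pingd lte 0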

Must be something simple that I overlooked...

Any ideas?

jf

> 
> Cheers,
> Florian
> 
> -- 
> Need help with High Availability?
> http://www.hastexo.com/now
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



