[Pacemaker] question about stonith:external/libvirt

Florian Haas florian at hastexo.com
Mon May 21 05:43:22 EDT 2012


On Sun, May 20, 2012 at 6:40 AM, Matthew O'Connor <matt at ecsorl.com> wrote:
> After using the tutorial on the Hastexo site for setting up stonith via
> libvirt, I believe I have it working correctly...but...some strange things
> are happening.  I have two nodes, with shared storage provided by a
> dual-primary DRBD resource and OCFS2.  Here is one of my stonith primitives:
>
> primitive p_fence-l2 stonith:external/libvirt \
>        params hostlist="l2:l2.sandbox" \
>               hypervisor_uri="qemu+ssh://matt@hv01/system" \
>               stonith-timeout="30" pcmk_host_check="none" \
>        op start interval="0" timeout="15" \
>        op stop interval="0" timeout="15" \
>        op monitor interval="60" \
>        meta target-role="Started"
>
> This cluster has stonith-enabled="true" in the cluster options, plus the
> necessary location statements in the cib.
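
As an aside, those "necessary location statements" usually just keep each
fence device off the node it is meant to kill; a minimal sketch, assuming
a matching p_fence-l3 primitive exists for the other node:

  location l_fence-l2-not-on-l2 p_fence-l2 -inf: l2
  location l_fence-l3-not-on-l3 p_fence-l3 -inf: l3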

Does it have "fencing resource-and-stonith" in the DRBD configuration,
and stonith_admin-fence-peer.sh as its fence-peer handler?
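
For the record, that would look something along these lines (the handler
path below is just a placeholder; point it at wherever
stonith_admin-fence-peer.sh is installed on your nodes):

  resource <resource> {
    disk {
      fencing resource-and-stonith;
    }
    handlers {
      fence-peer "/path/to/stonith_admin-fence-peer.sh";
    }
    ...
  }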

> To watch the DLM, I run dbench on the shared storage on the node I let live.
>  While it's running, I creatively nuke the other node.  If I just "killall
> pacemakerd" on l2 for instance, the DLM seems unaffected and the fence takes
> place, rebooting the now "failed" node l2.  No real interruption of service
> on the surviving node, l3.  Yet, if I "halt -f -n" on l2, the fence still
> takes place but the surviving node's (l3's) DLM hangs and won't come back
> until I bring the failed node back online.

A DLM that hangs while fencing is pending is expected, and DLM recovery
after the failed node comes back is fine too, but the DLM should also
recover as soon as it is satisfied that the offending node has been
properly fenced, without waiting for that node to return. Any logs from
stonith-ng on l3?
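
Something like this on l3 should do (the exact log file depends on your
syslog setup, and --history needs a reasonably recent stonith_admin):

  # what stonith-ng logged around the time of the failure
  grep stonith-ng /var/log/messages
  # fencing history for l2, as seen by the cluster
  stonith_admin --history l2 --verbose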

Cheers,
Florian

-- 
Need help with High Availability?
http://www.hastexo.com/now



