Oualid Nouri o.nouri at computer-lan.de
Mon Sep 5 05:52:39 EDT 2011

Hi to all,
I have set up a DRBD-based dual-primary two-node cluster with Pacemaker on openSUSE 11.4 for testing.
I have also set up drbd=>controld=>clvm=>lvm=>ocfs2 resources (all clones) and a samba+IP resource (primitive). Fencing is done via UPS with two apcsmart resources.
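To illustrate the dependency chain above, here is a minimal Python sketch that prints crm shell order and colocation constraints for each consecutive pair of resources. The actual configuration is not shown in this post, so the resource names (ms_drbd, cl_controld, cl_clvm, cl_lvm, cl_ocfs2) are hypothetical, and the sketch assumes DRBD is a master/slave resource that must be promoted before its dependents start.

```python
# Hypothetical resource names; the real configuration is not shown in this post.
CHAIN = ["ms_drbd", "cl_controld", "cl_clvm", "cl_lvm", "cl_ocfs2"]


def crm_constraints(chain):
    """Return crm shell order/colocation lines for each consecutive pair."""
    lines = []
    for first, then in zip(chain, chain[1:]):
        is_ms = first.startswith("ms_")  # assumption: "ms_" marks a master/slave set
        # A master/slave resource has to be promoted before dependents start.
        first_ref = f"{first}:promote" if is_ms else first
        then_ref = f"{then}:start" if is_ms else then
        lines.append(f"order o_{first}_{then} inf: {first_ref} {then_ref}")
        # In a colocation constraint the dependent resource is listed first.
        with_ref = f"{first}:Master" if is_ms else first
        lines.append(f"colocation c_{then}_{first} inf: {then} {with_ref}")
    return lines


if __name__ == "__main__":
    print("\n".join(crm_constraints(CHAIN)))
```

The generated lines are only a starting point for comparison with a real CIB; clone options such as interleave are not covered here.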
So far it seems to work. All resources come up and I can access the samba share.
Putting one of the nodes into standby, shutting it down and restarting it all worked as expected.
After this test I started testing failover functionality by powering off one node.
After I powered off one node by pulling the power cable, the resources it hosted failed over to the remaining node (the failover node), as expected.
The "failing" node got fenced by powering off the UPS, as expected.

So far so good.....

But when the "failing" node comes back online, the drbd and controld resources come up, but the resources that depend on controld (clvm=>lvm=>ocfs2 etc.) get stuck on both the failed node and the failover node, and the lvm resource ends up in failed status. None of the resources that depend on clvm come up, and the previously working resources on the failover node are no longer accessible.
Checking the status on the command line (on the failover node) shows that all LVM-specific commands hang once the failed node tries to rejoin the cluster.
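One way to make the "command hangs" observation repeatable is to run each command with a timeout. The following Python sketch assumes the standard LVM tools vgs and lvs are installed; the helper name hangs and the 10-second limit are arbitrary choices for illustration.

```python
import subprocess


def hangs(cmd, timeout=10.0):
    """Return True if cmd does not finish within `timeout` seconds."""
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout; a process stuck in
        # uninterruptible I/O may not die, in which case this call blocks too.
        return True
    return False


if __name__ == "__main__":
    for cmd in (["vgs"], ["lvs"]):
        state = "hangs" if hangs(cmd) else "returns"
        print(" ".join(cmd), state)
```

Running this on the failover node before and after the failed node rejoins would show at which point the LVM commands stop returning.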

There are many howtos, and my setup is nearly identical to them.
I have searched the web but did not find any hints.
Is this behavior caused by wrong parameters?
Is it caused by the combination of cluster components used?

Any help appreciated, thank you!

Software used:
openSUSE 11.4 x86_64
Pacemaker 1.1.5
Corosync 1.3.0
lvm2-clvm 2.02.67

