[Pacemaker] SBD kills both nodes in a two node cluster.

Tue Apr 19 06:04:31 EDT 2011

I have two nodes with shared storage and multipathing, but the SBD device doesn't work as expected.
My idea was that in case of a split brain, one node kills the other and one node survives.
But in my case I get a double kill: both nodes are killed at the same time.
I simulated the split brain with "ip link set down eth0" on one node and tested it several times.

The sbd daemon is running on both nodes.
My configuration:
primitive stonith_sbd stonith:external/sbd params sbd_device="/dev/disk/by-id/scsi-36..."
clone stonith_sbd-clone stonith_sbd

/var/log/messages:
Node A:
Apr 19 10:37:09 nodeA crmd: [7690]: info: te_fence_node: Executing reboot fencing operation (17) on nodeB (timeout=180000)
Apr 19 10:37:09 nodeA stonith-ng: [7685]: info: initiate_remote_stonith_op: Initiating remote operation reboot for nodeB: d4226746-fef1-4d29-bc85-2d33e9bf7f94
Apr 19 10:37:09 nodeA stonith-ng: [7685]: info: stonith_queryQuery <stonith_command t="stonith-ng" st_async_id="d4226746-fef1-4d29-bc85-2d33e9bf7f94" st_op="st_query" st_callid="0" st_callopt="0" st_remote_op="d4226746-fef1-4d29-bc85-2d33e9bf7f94" st_target="nodeB" st_device_action="reboot" st_clientid="3b1b3feb-5e4e-4a3c-ae8e-2131ea2ae588" st_timeout="18000" src="nodeA" seq="1" />

Node B:
Apr 19 10:37:09 nodeB crmd: [7851]: info: te_fence_node: Executing reboot fencing operation (17) on nodeA (timeout=180000)
Apr 19 10:37:09 nodeB stonith-ng: [7846]: info: initiate_remote_stonith_op: Initiating remote operation reboot for nodeA: e361b3b6-2890-474d-8671-b73eea62d1ab
Apr 19 10:37:09 nodeB stonith-ng: [7846]: info: stonith_queryQuery <stonith_command t="stonith-ng" st_async_id="e361b3b6-2890-474d-8671-b73eea62d1ab" st_op="st_query" st_callid="0" st_callopt="0" st_remote_op="e361b3b6-2890-474d-8671-b73eea62d1ab" st_target="nodeA" st_device_action="reboot" st_clientid="a0d67d7e-5e30-44fe-bc88-e733019e594d" st_timeout="18000" src="nodeB" seq="1" />

On both nodes I ran "sbd -d /dev/disk/by-id/scsi-36... list" in an endless loop; these are the last SBD entries I got before the nodes went down.
As you can see, both nodes request a reset of the other at the same time and both requests succeed => double kill.
Node A:
0       nodeB clear
1       nodeA clear
0       nodeB clear
1       nodeA reset   nodeB
0       nodeB reset   nodeA
1       nodeA reset   nodeB

Node B:
0       nodeB clear
1       nodeA reset   nodeB
0       nodeB clear
1       nodeA reset   nodeB
0       nodeB clear
1       nodeA reset   nodeB
0       nodeB reset   nodeA
1       nodeA reset   nodeB
0       nodeB reset   nodeA
1       nodeA reset   nodeB
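
The race visible in the listings above can be reproduced in a toy model. The following is only a minimal Python sketch, not the real sbd code. It assumes that each node owns one message slot on the shared device, that fencing a node means writing "reset" into that node's slot, and that each sbd daemon polls its own slot and resets itself when it reads "reset". The tick granularity, the function name simulate() and the fence_tick parameter are hypothetical and exist only for this illustration.

```python
# Toy model of the SBD message slots; NOT the real sbd implementation.
# Assumption: a node is fenced by writing "reset" into its slot, and each
# node's daemon polls its own slot once per tick and self-fences on "reset".
PEER = {"nodeA": "nodeB", "nodeB": "nodeA"}


def simulate(fence_tick, ticks=10):
    """Return which nodes are still alive after `ticks` polling rounds.

    fence_tick[node] is the (hypothetical) tick at which `node` writes a
    "reset" message into its peer's slot.
    """
    slots = {node: "clear" for node in PEER}
    alive = {node: True for node in PEER}
    for tick in range(ticks):
        # Phase 1: every live node whose fencing request is due writes it.
        for node, peer in PEER.items():
            if alive[node] and fence_tick[node] == tick:
                slots[peer] = "reset"
        # Phase 2: every live node polls its own slot and self-fences.
        for node in PEER:
            if alive[node] and slots[node] == "reset":
                alive[node] = False
    return alive


if __name__ == "__main__":
    # Both nodes fence in the same tick, as in the logs (10:37:09 on both).
    print(simulate({"nodeA": 2, "nodeB": 2}))
    # nodeA fences a few ticks before nodeB would.
    print(simulate({"nodeA": 2, "nodeB": 5}))
```

In this model, when both nodes write in the same tick, each finds "reset" in its own slot at the next poll and both go down. When one node writes earlier, its peer is already dead before it can write its own request, so only one node is lost.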

Cheers,
Ulf