[Pacemaker] DRBD 2 node cluster and STONITH configuration help required.

Thu Feb 4 05:43:05 EST 2010

Hi there,

I have successfully configured a 2 node DRBD pacemaker cluster using the
instructions provided by LINBIT here:
http://www.drbd.org/users-guide-emb/ch-pacemaker.html.  The cluster works
perfectly and I can migrate the resources back and forth between the two
nodes without a problem.  However, when simulating certain cluster
communication failures, I am having problems preventing the DRBD cluster
from entering a split brain state.  I have been led to believe that STONITH
will help prevent split brain situations, but the LINBIT instructions do not
provide any guidance on how to conifgure STONITH in the pacemaker cluster.
The only thing I can find in LINBITs documentation is where it talks about
the resource fencing options within the /etc/drbd.conf of which I have
configured:

resource r0 {
  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }

I'm still at a loss to understand what actually triggers DRBD to run the
above fencing scripts or how to tell if it has run them.

I've searched the internet high and low for example pacemaker configs that
show you how to configure STONITH resources for DRBD, but I can't find
anything useful.

Whilst hunting the Internet I did find this howto: (
http://www.howtoforge.com/installation-and-setup-guide-for-drbd-openais-pacemaker-xen-on-opensuse-11.1)
 that spells out how to configure a DRBD pacemaker cluster and even
states
the following: "STONITH is disabled in this [example] configuration though
it is highly-recommended in any production environment to eliminate the risk
of divergent data." Infuriatingly it doesn't tell you how to configure
STONITH!

Could someone you please, please, please give me some pointers or some
helpful examples on how I go about configuring STONITH and or modifying my
pacemaker configuration in any other ways to get it into a production ready
state?  My current configuration is listed below:

The cluster is built on 2 Redhat EL 5.3 servers running the following
software versions:
drbd-8.3.6-1
pacemaker-1.0.5-4.1
openais-0.80.5-15.1

root at mq001:~# crm configure show
node mq001.back.live.cwwtf.local
node mq002.back.live.cwwtf.local
primitive activemq-emp lsb:bbc-activemq-emp
primitive activemq-forge-services lsb:bbc-activemq-forge-
services
primitive activemq-social lsb:activemq-social
primitive drbd_activemq ocf:linbit:drbd \
    params drbd_resource="r0" \
    op monitor interval="15s"
primitive fs_activemq ocf:heartbeat:Filesystem \
    params device="/dev/drbd1" directory="/drbd" fstype="ext3"
primitive ip_activemq ocf:heartbeat:IPaddr2 \
    params ip="172.23.8.71" nic="eth0"
group activemq fs_activemq ip_activemq activemq-forge-services activemq-emp
activemq-social
ms ms_drbd_activemq drbd_activemq \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
notify="true"
colocation activemq_on_drbd inf: activemq ms_drbd_activemq:Master
order activemq_after_drbd inf: ms_drbd_activemq:promote activemq:start
property $id="cib-bootstrap-options" \
    dc-version="1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    no-quorum-policy="ignore" \
    last-lrm-refresh="1260809203"

/etc/drbd.conf

global {
  usage-count no;
}
common {
  protocol C;
}
resource r0 {
  disk {
    fencing resource-only;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.
sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  syncer {
    rate 40M;
  }
  on mq001.back.live.cwwtf.local {
    device    /dev/drbd1;
    disk      /dev/cciss/c0d0p1;
    address   172.23.8.69:7789;
    meta-disk internal;
  }
  on mq002.back.live.cwwtf.local {
    device    /dev/drbd1;
    disk      /dev/cciss/c0d0p1;
    address   172.23.8.70:7789;
    meta-disk internal;
  }
}

root at mq001:~# cat /etc/ais/openais.conf
totem {
  version: 2
  token: 3000
  token_retransmits_before_loss_const: 10
  join: 60
  consensus: 1500
  vsftype: none
  max_messages: 20
  clear_node_high_bit: yes
  secauth: on
  threads: 0
  rrp_mode: passive
  interface {
    ringnumber: 0
    bindnetaddr: 172.59.60.0
    mcastaddr: 239.94.1.1
    mcastport: 5405
  }
  interface {
    ringnumber: 1
    bindnetaddr: 172.23.8.0
    mcastaddr: 239.94.2.1
    mcastport: 5405
  }
}
logging {
  to_stderr: yes
  debug: on
  timestamp: on
  to_file: no
  to_syslog: yes
  syslog_facility: daemon
}
amf {
  mode: disabled
}
service {
  ver:       0
  name:      pacemaker
  use_mgmtd: yes
}
aisexec {
  user:   root
  group:  root
}

Many Thanks,
Tom
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100204/f6101e34/attachment.html>