[ClusterLabs] Two node cluster VMware drbd

Tue Mar 12 10:46:54 EDT 2019

Hello,

I’m planning to set up a two-node (active-passive) HA cluster consisting of
pacemaker, corosync and DRBD. The two nodes will run as VMware VMs and
connect to a single DB server (which, unfortunately, for various reasons is
not included in the cluster).

Resources:

Resource Group: clusterd_services
     otrs_articel_fs    (ocf::heartbeat:Filesystem):    Started srv2
     vip        (ocf::heartbeat:IPaddr2):       Started srv2
     Apache     (systemd:httpd):        Started srv2
     OTRS       (systemd:otrs): Started srv2
Master/Slave Set: articel_ms [articel_drbd]
     Masters: [ srv2 ]
     Slaves: [ srv1 ]
my_vcentre-fence       (stonith:fence_vmware_soap):    Started srv1

Ultimately, this is what I plan to do:

-       Each VM will run on a separate ESXi host to provide at least
some protection against hardware failure;

-       Redundant communication paths between the two nodes for DRBD
replication and cluster communication to prevent split-brain scenarios;

-       fence_vmware_soap for VM fencing;

-       pacemaker, corosync and pcsd not configured to start at boot on
either node, so that after a fence event the fenced node does not rejoin
the cluster automatically, which leaves room to investigate why it was
fenced in the first place;

-       /usr/lib/drbd/crm-fence-peer.9.sh  and
/usr/lib/drbd/crm-unfence-peer.9.sh for DRBD resource level fencing
(if the DRBD replication link becomes disconnected, the
crm-fence-peer.9.sh script contacts the cluster manager, determines
the Pacemaker Master/Slave resource associated with this DRBD
resource, and ensures that the Master/Slave resource no longer gets
promoted on any node other than the currently active one);

What I’m wondering about is this: if for whatever reason the communication
paths between the two nodes are interrupted, each node will think that the
other one is gone and each will try to fence the other, resulting in a
fence race. I have read that you could introduce a delay into the
secondary’s fencing, for example 30-60 seconds, so that during that delay,
ASSUMING the primary is functioning well, the primary will fence the
secondary. That doesn’t sound like a reliable solution to me, though: how
can I know in advance which node will be primary and which one will be the
one suffering problems?
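
To make the delay idea concrete, here is a minimal Python sketch of the
race as I understand it. It is only a toy model under my own assumptions:
the Node class, the fence_race function, the 5-second fence duration and
the rule that a node powered off before its delay expires never sends its
fence request are all hypothetical and not taken from Pacemaker or
fence_vmware_soap.

```python
from dataclasses import dataclass
from typing import List

INF = float("inf")


@dataclass
class Node:
    name: str
    fence_delay: float    # seconds the node waits before fencing its peer (assumed value)
    healthy: bool = True  # False: the node cannot issue a fence request at all


def fence_race(a: Node, b: Node, fence_duration: float = 5.0) -> List[str]:
    """Return the sorted names of the nodes that end up powered off.

    Toy model (assumption): a healthy node issues its fence request at
    t = fence_delay, and the peer is powered off fence_duration seconds later.
    A node that is powered off before its own delay expires never issues.
    """
    # Time at which each node would issue its request; never if unhealthy.
    issue = {n.name: (n.fence_delay if n.healthy else INF) for n in (a, b)}
    first, second = sorted((a, b), key=lambda n: issue[n.name])

    if issue[first.name] == INF:
        return []  # neither node is able to fence

    fenced = [second.name]  # the earlier request always lands
    # The later node still fires if it issues before it is powered off.
    if issue[second.name] < issue[first.name] + fence_duration:
        fenced.append(first.name)
    return sorted(fenced)


if __name__ == "__main__":
    # No delay on either side: both requests go out and both nodes die.
    print(fence_race(Node("srv1", 0.0), Node("srv2", 0.0)))
    # Delay on the secondary (srv1): the primary (srv2) wins the race.
    print(fence_race(Node("srv1", 30.0), Node("srv2", 0.0)))
    # Same delay, but the primary is the broken one: srv1 fences it after 30 s.
    print(fence_race(Node("srv1", 30.0), Node("srv2", 0.0, healthy=False)))
```

In this model the delay only decides who wins when both nodes are able to
fence; if the primary is the node that is actually broken, the secondary
still fences it, just 30 seconds later.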

Any suggestions?

I highly appreciate your effort!

BR,

Adam