[ClusterLabs] [DRBD-user] DRBD fencing issue on failover causes resource failure
Tim Walberg
twalberg at gmail.com
Wed Mar 16 17:51:16 UTC 2016
Is there a way to make this work properly without STONITH? I forgot to mention
that both nodes are virtual machines (QEMU/KVM), which makes STONITH a minor
challenge. Also, since these symptoms occur even under "pcs cluster standby",
where STONITH *shouldn't* be invoked, I'm not sure if that's the entire answer.
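For QEMU/KVM guests the usual answer is fence_xvm (the fence-virt agent on the
guests, talking to fence_virtd on the hypervisor) or fence_virsh over SSH to the
host. Below is a minimal sketch of the fence_xvm plumbing, assuming a single
hypervisor, the stock CentOS packages, and the default key path (all assumptions,
not taken from this setup):

    # --- on the hypervisor ---
    # install fence-virtd, fence-virtd-libvirt and fence-virtd-multicast,
    # generate the shared key and configure the multicast listener
    mkdir -p /etc/cluster
    dd if=/dev/urandom of=/etc/cluster/fence_xvm.key bs=512 count=1
    fence_virtd -c                    # interactive: multicast listener, libvirt backend
    systemctl enable --now fence_virtd

    # --- on each guest (nfsnode01 / nfsnode02) ---
    # install fence-virt, copy /etc/cluster/fence_xvm.key from the host,
    # open the fence_virt port (1229 by default), then verify the host answers:
    fence_xvm -o list

That only provides the fencing transport; the Pacemaker side still has to be
configured and tested, as noted further down.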
On 03/16/2016 13:34 -0400, Digimer wrote:
>> On 16/03/16 01:17 PM, Tim Walberg wrote:
>> > Having an issue on a newly built CentOS 7.2.1511 NFS cluster with DRBD
>> > (drbd84-utils-8.9.5-1 with kmod-drbd84-8.4.7-1_1). At this point, the
>> > resources consist of a cluster address, a DRBD device mirroring between
>> > the two cluster nodes, the file system, and the nfs-server resource. The
>> > resources all behave properly until an extended failover or outage.
>> >
>> > I have tested failover in several ways ("pcs cluster standby", "pcs
>> > cluster stop", "init 0", "init 6", "echo b > /proc/sysrq-trigger", etc.)
>> > and the symptom is that, until the killed node is brought back into
>> > the cluster, failover never seems to complete. The DRBD device on the
>> > remaining node sits in a "Secondary/Unknown" state, and the resources
>> > end up looking like this:
>> >
>> > # pcs status
>> > Cluster name: nfscluster
>> > Last updated: Wed Mar 16 12:05:33 2016
>> > Last change: Wed Mar 16 12:04:46 2016 by root via cibadmin on nfsnode01
>> > Stack: corosync
>> > Current DC: nfsnode01 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
>> > 2 nodes and 5 resources configured
>> >
>> > Online: [ nfsnode01 ]
>> > OFFLINE: [ nfsnode02 ]
>> >
>> > Full list of resources:
>> >
>> > nfsVIP (ocf::heartbeat:IPaddr2): Started nfsnode01
>> > nfs-server (systemd:nfs-server): Stopped
>> > Master/Slave Set: drbd_master [drbd_dev]
>> > Slaves: [ nfsnode01 ]
>> > Stopped: [ nfsnode02 ]
>> > drbd_fs (ocf::heartbeat:Filesystem): Stopped
>> >
>> > PCSD Status:
>> > nfsnode01: Online
>> > nfsnode02: Online
>> >
>> > Daemon Status:
>> > corosync: active/enabled
>> > pacemaker: active/enabled
>> > pcsd: active/enabled
>> >
>> > As soon as I bring the second node back online, the failover completes.
>> > But this is obviously not a good state, as an extended outage for any
>> > reason on one node essentially kills the cluster services. There's
>> > clearly something I've missed in configuring the resources, but I
>> > haven't been able to pinpoint it yet.
>> >
>> > Perusing the logs, it appears that, upon the initial failure, Pacemaker
>> > does in fact promote the drbd_master resource, but immediately after
>> > that pengine calls for it to be demoted. I haven't been able to
>> > determine why yet, but it seems to be tied to the fencing
>> > configuration. I can see that the crm-fence-peer.sh script is called,
>> > but it almost seems like it's fencing the wrong node... Indeed, I do
>> > see that it adds a -INFINITY location constraint against the surviving
>> > node, which would explain the decision to demote the DRBD master.
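The constraint the handler adds can be inspected directly, which also shows the
#uname value it keyed on (worth comparing against the corosync node names). The
constraint id below follows the usual drbd-fence-by-handler-<resource>-<master>
pattern and is an assumption; check the real id first:

    # list all constraints with their ids
    pcs constraint --full

    # or pull the raw XML the handler wrote into the CIB
    cibadmin --query --xpath "//rsc_location[contains(@id,'drbd-fence-by-handler')]"

    # normally crm-unfence-peer.sh removes this after a successful resync, but it
    # can be cleared by hand once the local data is known to be up to date
    pcs constraint remove drbd-fence-by-handler-drbd0-drbd_master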
>> >
>> > My DRBD resource looks like this:
>> >
>> > # cat /etc/drbd.d/drbd0.res
>> > resource drbd0 {
>> >
>> > protocol C;
>> > startup { wfc-timeout 0; degr-wfc-timeout 120; }
>> >
>> > disk {
>> > on-io-error detach;
>> > fencing resource-only;
>>
>> This should be 'resource-and-stonith;', but that alone won't do anything
>> until pacemaker's stonith is working.
>>
>> > }
>> >
>> > handlers {
>> > fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>> > after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>> > }
>> >
>> > on nfsnode01 {
>> > device /dev/drbd0;
>> > disk /dev/vg_nfs/lv_drbd0;
>> > meta-disk internal;
>> > address 10.0.0.2:7788;
>> > }
>> >
>> > on nfsnode02 {
>> > device /dev/drbd0;
>> > disk /dev/vg_nfs/lv_drbd0;
>> > meta-disk internal;
>> > address 10.0.0.3:7788;
>> > }
>> > }
>> >
>> > If I comment out the three lines having to do with fencing, the failover
>> > works properly. But I'd prefer to keep the fencing there on the off
>> > chance that we end up with a split brain instead of just a node outage...
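For reference, a sketch of what the disk and handlers sections might look like
with the 'resource-and-stonith' policy suggested above; the rest of the resource
stays as it is, and the setting is only meaningful once Pacemaker has a working,
tested fence device:

    disk {
        on-io-error detach;
        # suspend I/O on a fencing event and rely on pacemaker to actually
        # shoot the peer; requires stonith-enabled=true plus a tested agent
        fencing resource-and-stonith;
    }

    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }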
>> >
>> > And, here's "pcs config --full":
>> >
>> > # pcs config --full
>> > Cluster Name: nfscluster
>> > Corosync Nodes:
>> > nfsnode01 nfsnode02
>> > Pacemaker Nodes:
>> > nfsnode01 nfsnode02
>> >
>> > Resources:
>> > Resource: nfsVIP (class=ocf provider=heartbeat type=IPaddr2)
>> > Attributes: ip=10.0.0.1 cidr_netmask=24
>> > Operations: start interval=0s timeout=20s (nfsVIP-start-interval-0s)
>> > stop interval=0s timeout=20s (nfsVIP-stop-interval-0s)
>> > monitor interval=15s (nfsVIP-monitor-interval-15s)
>> > Resource: nfs-server (class=systemd type=nfs-server)
>> > Operations: monitor interval=60s (nfs-server-monitor-interval-60s)
>> > Master: drbd_master
>> > Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>> > Resource: drbd_dev (class=ocf provider=linbit type=drbd)
>> > Attributes: drbd_resource=drbd0
>> > Operations: start interval=0s timeout=240 (drbd_dev-start-interval-0s)
>> > promote interval=0s timeout=90 (drbd_dev-promote-interval-0s)
>> > demote interval=0s timeout=90 (drbd_dev-demote-interval-0s)
>> > stop interval=0s timeout=100 (drbd_dev-stop-interval-0s)
>> > monitor interval=29s role=Master (drbd_dev-monitor-interval-29s)
>> > monitor interval=31s role=Slave (drbd_dev-monitor-interval-31s)
>> > Resource: drbd_fs (class=ocf provider=heartbeat type=Filesystem)
>> > Attributes: device=/dev/drbd0 directory=/exports/drbd0 fstype=xfs
>> > Operations: start interval=0s timeout=60 (drbd_fs-start-interval-0s)
>> > stop interval=0s timeout=60 (drbd_fs-stop-interval-0s)
>> > monitor interval=20 timeout=40 (drbd_fs-monitor-interval-20)
>> >
>> > Stonith Devices:
>> > Fencing Levels:
>> >
>> > Location Constraints:
>> > Ordering Constraints:
>> > start nfsVIP then start nfs-server (kind:Mandatory) (id:order-nfsVIP-nfs-server-mandatory)
>> > start drbd_fs then start nfs-server (kind:Mandatory) (id:order-drbd_fs-nfs-server-mandatory)
>> > promote drbd_master then start drbd_fs (kind:Mandatory) (id:order-drbd_master-drbd_fs-mandatory)
>> > Colocation Constraints:
>> > nfs-server with nfsVIP (score:INFINITY) (id:colocation-nfs-server-nfsVIP-INFINITY)
>> > nfs-server with drbd_fs (score:INFINITY) (id:colocation-nfs-server-drbd_fs-INFINITY)
>> > drbd_fs with drbd_master (score:INFINITY) (with-rsc-role:Master) (id:colocation-drbd_fs-drbd_master-INFINITY)
>> >
>> > Resources Defaults:
>> > resource-stickiness: 100
>> > failure-timeout: 60
>> > Operations Defaults:
>> > No defaults set
>> >
>> > Cluster Properties:
>> > cluster-infrastructure: corosync
>> > cluster-name: nfscluster
>> > dc-version: 1.1.13-10.el7_2.2-44eb2dd
>> > have-watchdog: false
>> > maintenance-mode: false
>> > stonith-enabled: false
>>
>> Configure *and test* stonith in pacemaker first, then DRBD will hook
>> into it and use it properly. DRBD simply asks pacemaker to do the fence,
>> but you currently don't have it set up.
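On the Pacemaker side, something along these lines would be a starting point once
a fence agent such as fence_xvm is reachable from both guests; the stonith device
names, and the assumption that the libvirt domain names match the hostnames, are
placeholders:

    # one stonith device per VM; port= is the libvirt domain name
    pcs stonith create fence_nfsnode01 fence_xvm port="nfsnode01" pcmk_host_list="nfsnode01"
    pcs stonith create fence_nfsnode02 fence_xvm port="nfsnode02" pcmk_host_list="nfsnode02"
    pcs property set stonith-enabled=true

    # and actually test it before trusting it
    pcs stonith fence nfsnode02
    stonith_admin --reboot nfsnode02    # equivalent lower-level check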
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.ca/w/
>> What if the cure for cancer is trapped in the mind of a person without
>> access to education?
>> _______________________________________________
>> drbd-user mailing list
>> drbd-user at lists.linbit.com
>> http://lists.linbit.com/mailman/listinfo/drbd-user
End of included message
--
twalberg at gmail.com