[ClusterLabs] Pacemaker/pcs & DRBD not demoting secondary node to Slave (always Stopped)

Thu Sep 17 19:02:48 EDT 2015

I have a simple DRBD + filesystem + NFS configuration that works properly when I start and stop DRBD manually, but Pacemaker will not start the DRBD slave instance on failover or recovery.  The second instance in the Master/Slave set never shows anything but 'Stopped'.  I am running CentOS 7.1 with the latest packages as of today:

[root@fx201-1a log]# rpm -qa | grep -e pcs -e pacemaker -e drbd
pacemaker-cluster-libs-1.1.12-22.el7_1.4.x86_64
pacemaker-1.1.12-22.el7_1.4.x86_64
pcs-0.9.137-13.el7_1.4.x86_64
pacemaker-libs-1.1.12-22.el7_1.4.x86_64
drbd84-utils-8.9.3-1.1.el7.elrepo.x86_64
pacemaker-cli-1.1.12-22.el7_1.4.x86_64
kmod-drbd84-8.4.6-1.el7.elrepo.x86_64

Here is my pcs config output:

[root@fx201-1a log]# pcs config
Cluster Name: fx201-vmcl
Corosync Nodes:
 fx201-1a.ams fx201-1b.ams
Pacemaker Nodes:
 fx201-1a.ams fx201-1b.ams

Resources:
 Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.XX.XX.XX cidr_netmask=24
  Operations: start interval=0s timeout=20s (ClusterIP-start-timeout-20s)
              stop interval=0s timeout=20s (ClusterIP-stop-timeout-20s)
              monitor interval=15s (ClusterIP-monitor-interval-15s)
 Master: ms_drbd_vmfs
  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  Resource: drbd_vmfs (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=vmfs
   Operations: start interval=0s timeout=240 (drbd_vmfs-start-timeout-240)
               promote interval=0s timeout=90 (drbd_vmfs-promote-timeout-90)
               demote interval=0s timeout=90 (drbd_vmfs-demote-timeout-90)
               stop interval=0s timeout=100 (drbd_vmfs-stop-timeout-100)
               monitor interval=30s (drbd_vmfs-monitor-interval-30s)
 Resource: vmfsFS (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/drbd0 directory=/exports/vmfs fstype=xfs
  Operations: start interval=0s timeout=60 (vmfsFS-start-timeout-60)
              stop interval=0s timeout=60 (vmfsFS-stop-timeout-60)
              monitor interval=20 timeout=40 (vmfsFS-monitor-interval-20)
 Resource: nfs-server (class=systemd type=nfs-server)
  Operations: monitor interval=60s (nfs-server-monitor-interval-60s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  promote ms_drbd_vmfs then start vmfsFS (kind:Mandatory) (id:order-ms_drbd_vmfs-vmfsFS-mandatory)
  start vmfsFS then start nfs-server (kind:Mandatory) (id:order-vmfsFS-nfs-server-mandatory)
  start ClusterIP then start nfs-server (kind:Mandatory) (id:order-ClusterIP-nfs-server-mandatory)
Colocation Constraints:
  ms_drbd_vmfs with ClusterIP (score:INFINITY) (id:colocation-ms_drbd_vmfs-ClusterIP-INFINITY)
  vmfsFS with ms_drbd_vmfs (score:INFINITY) (with-rsc-role:Master) (id:colocation-vmfsFS-ms_drbd_vmfs-INFINITY)
  nfs-server with vmfsFS (score:INFINITY) (id:colocation-nfs-server-vmfsFS-INFINITY)

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: fx201-vmcl
 dc-version: 1.1.13-a14efad
 have-watchdog: false
 last-lrm-refresh: 1442528181
 stonith-enabled: false
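
To make the colocation constraints above easier to compare side by side, here is a minimal Python sketch that parses those three lines into dependent resource, target resource, and target role. The constraint text is copied from the pcs config output; the names parse_colocations, LINE_RE, and OPTION_RE are hypothetical helpers written for this post and are not part of pcs or Pacemaker.

```python
import re

# Colocation lines copied from the `pcs config` output above.
COLOCATIONS = """\
ms_drbd_vmfs with ClusterIP (score:INFINITY) (id:colocation-ms_drbd_vmfs-ClusterIP-INFINITY)
vmfsFS with ms_drbd_vmfs (score:INFINITY) (with-rsc-role:Master) (id:colocation-vmfsFS-ms_drbd_vmfs-INFINITY)
nfs-server with vmfsFS (score:INFINITY) (id:colocation-nfs-server-vmfsFS-INFINITY)
"""

# "<dependent> with <target> (key:value) (key:value) ..."
LINE_RE = re.compile(r"^\s*(?P<dependent>\S+) with (?P<target>\S+)\s+(?P<options>.*)$")
OPTION_RE = re.compile(r"\(([^:()]+):([^()]*)\)")


def parse_colocations(text):
    """Return one dict per colocation line; target_role is None when no role is given."""
    constraints = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        options = dict(OPTION_RE.findall(match.group("options")))
        constraints.append({
            "dependent": match.group("dependent"),
            "target": match.group("target"),
            "score": options.get("score"),
            "target_role": options.get("with-rsc-role"),
        })
    return constraints


constraints = parse_colocations(COLOCATIONS)
for c in constraints:
    role = c["target_role"] or "any role"
    print(f'{c["dependent"]} colocated with {c["target"]} ({role}), score {c["score"]}')
```

Laid out this way, only the vmfsFS constraint carries a role qualifier (with-rsc-role:Master); the ms_drbd_vmfs constraint names the whole Master/Slave set with no role, which may be worth a second look.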

And status:

[root@fx201-1a log]# pcs status --full
Cluster name: fx201-vmcl
Last updated: Thu Sep 17 17:55:56 2015 Last change: Thu Sep 17 17:18:10 2015 by root via crm_attribute on fx201-1b.ams
Stack: corosync
Current DC: fx201-1b.ams (2) (version 1.1.13-a14efad) - partition with quorum
2 nodes and 5 resources configured

Online: [ fx201-1a.ams (1) fx201-1b.ams (2) ]

Full list of resources:

 ClusterIP (ocf::heartbeat:IPaddr2): Started fx201-1a.ams
 Master/Slave Set: ms_drbd_vmfs [drbd_vmfs]
     drbd_vmfs (ocf::linbit:drbd): Master fx201-1a.ams
     drbd_vmfs (ocf::linbit:drbd): Stopped
     Masters: [ fx201-1a.ams ]
     Stopped: [ fx201-1b.ams ]
 vmfsFS (ocf::heartbeat:Filesystem): Started fx201-1a.ams
 nfs-server (systemd:nfs-server): Started fx201-1a.ams

PCSD Status:
  fx201-1a.ams: Online
  fx201-1b.ams: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

If I do a failover after manually confirming that the DRBD data is completely synchronized, the failover itself works, but the cluster never reconnects the secondary side.  To get the resource synchronized again I have to correct it manually, every single time.  I have tried standby/unstandby, pcs resource debug-start (with undesirable results), and so on.

Here are some relevant log messages from pacemaker.log:

Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: process_pe_message: Input has not changed since last time, not saving to disk
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: determine_online_status: Node fx201-1b.ams is online
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: determine_online_status: Node fx201-1a.ams is online
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: determine_op_status: Operation monitor found resource drbd_vmfs:0 active in master mode on fx201-1b.ams
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: determine_op_status: Operation monitor found resource drbd_vmfs:0 active on fx201-1a.ams
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: native_print: ClusterIP (ocf::heartbeat:IPaddr2): Started fx201-1a.ams
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: clone_print: Master/Slave Set: ms_drbd_vmfs [drbd_vmfs]
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: short_print:     Masters: [ fx201-1a.ams ]
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: short_print:     Stopped: [ fx201-1b.ams ]
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: native_print: vmfsFS (ocf::heartbeat:Filesystem): Started fx201-1a.ams
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: native_print: nfs-server (systemd:nfs-server): Started fx201-1a.ams
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: native_color: Resource drbd_vmfs:1 cannot run anywhere
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: master_color: Promoting drbd_vmfs:0 (Master fx201-1a.ams)
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: master_color: ms_drbd_vmfs: Promoted 1 instances of a possible 1 to master
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   ClusterIP (Started fx201-1a.ams)
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   drbd_vmfs:0 (Master fx201-1a.ams)
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   drbd_vmfs:1 (Stopped)
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   vmfsFS (Started fx201-1a.ams)
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   nfs-server (Started fx201-1a.ams)
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:   notice: process_pe_message: Calculated Transition 16: /var/lib/pacemaker/pengine/pe-input-61.bz2
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: do_te_invoke: Processing graph 16 (ref=pe_calc-dc-1442530090-97) derived from /var/lib/pacemaker/pengine/pe-input-61.bz2
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:   notice: run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-61.bz2): Complete
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
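
Since most of the excerpt above is routine state-transition chatter, here is a small Python sketch that keeps only the pengine placement messages for a single resource instance. The sample lines are copied from the log above; the function name scheduler_decisions and the regular expression are hypothetical helpers for reading this log format, not anything shipped with Pacemaker.

```python
import re

# Lines copied from the pacemaker.log excerpt above.
SAMPLE_LOG = """\
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: determine_op_status: Operation monitor found resource drbd_vmfs:0 active on fx201-1a.ams
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: native_color: Resource drbd_vmfs:1 cannot run anywhere
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: master_color: Promoting drbd_vmfs:0 (Master fx201-1a.ams)
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   drbd_vmfs:0 (Master fx201-1a.ams)
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   drbd_vmfs:1 (Stopped)
"""

# timestamp [pid] host daemon: level: function: message
LOG_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ [\d:]+) \[(?P<pid>\d+)\] (?P<host>\S+)\s+"
    r"(?P<daemon>\w+):\s+(?P<level>\w+): (?P<func>\w+):\s*(?P<msg>.*)$"
)

# Functions whose messages describe where an instance was (or was not) placed.
PLACEMENT_FUNCS = {"native_color", "master_color", "LogActions"}


def scheduler_decisions(log_text, resource):
    """Return (function, message) pairs from pengine that mention the resource."""
    decisions = []
    for line in log_text.splitlines():
        match = LOG_RE.match(line)
        if not match or match.group("daemon") != "pengine":
            continue
        if match.group("func") not in PLACEMENT_FUNCS:
            continue
        message = " ".join(match.group("msg").split())  # collapse padding spaces
        if resource in message:
            decisions.append((match.group("func"), message))
    return decisions


decisions = scheduler_decisions(SAMPLE_LOG, "drbd_vmfs:1")
for func, message in decisions:
    print(f"{func}: {message}")
```

For drbd_vmfs:1 this leaves two messages: the native_color line saying the instance cannot run anywhere, and the LogActions line leaving it stopped.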

Thank you all for your help,

Jason
