[ClusterLabs] Pacemaker/pcs & DRBD not demoting secondary node to Slave (always Stopped)

Luke Pascoe luke at osnz.co.nz
Thu Sep 17 19:29:45 EDT 2015


I may be wrong, but shouldn't "clone-node-max" be 2 on the ms_drbd_vmfs
resource?
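
If that turns out to be it, something along these lines should apply the
change (untested sketch -- please double-check against your pcs version
before running it):

    # raise the per-node instance limit on the master/slave set
    pcs resource meta ms_drbd_vmfs clone-node-max=2

    # then clear any stale state so Pacemaker re-evaluates the clone
    pcs resource cleanup drbd_vmfs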

Luke Pascoe



E: luke at osnz.co.nz
P: +64 (9) 296 2961
M: +64 (27) 426 6649
W: www.osnz.co.nz

24 Wellington St
Papakura
Auckland, 2110
New Zealand

On 18 September 2015 at 11:02, Jason Gress <jgress at accertify.com> wrote:

> I have a simple DRBD + filesystem + NFS configuration that works properly
> when I manually start/stop DRBD, but will not start the DRBD slave resource
> properly on failover or recovery.  I cannot ever get the Master/Slave set
> to say anything but 'Stopped'.  I am running CentOS 7.1 with the latest
> packages as of today:
>
> [root at fx201-1a log]# rpm -qa | grep -e pcs -e pacemaker -e drbd
> pacemaker-cluster-libs-1.1.12-22.el7_1.4.x86_64
> pacemaker-1.1.12-22.el7_1.4.x86_64
> pcs-0.9.137-13.el7_1.4.x86_64
> pacemaker-libs-1.1.12-22.el7_1.4.x86_64
> drbd84-utils-8.9.3-1.1.el7.elrepo.x86_64
> pacemaker-cli-1.1.12-22.el7_1.4.x86_64
> kmod-drbd84-8.4.6-1.el7.elrepo.x86_64
>
> Here is my pcs config output:
>
> [root at fx201-1a log]# pcs config
> Cluster Name: fx201-vmcl
> Corosync Nodes:
>  fx201-1a.ams fx201-1b.ams
> Pacemaker Nodes:
>  fx201-1a.ams fx201-1b.ams
>
> Resources:
>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: ip=10.XX.XX.XX cidr_netmask=24
>   Operations: start interval=0s timeout=20s (ClusterIP-start-timeout-20s)
>               stop interval=0s timeout=20s (ClusterIP-stop-timeout-20s)
>               monitor interval=15s (ClusterIP-monitor-interval-15s)
>  Master: ms_drbd_vmfs
>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
> notify=true
>   Resource: drbd_vmfs (class=ocf provider=linbit type=drbd)
>    Attributes: drbd_resource=vmfs
>    Operations: start interval=0s timeout=240 (drbd_vmfs-start-timeout-240)
>                promote interval=0s timeout=90
> (drbd_vmfs-promote-timeout-90)
>                demote interval=0s timeout=90 (drbd_vmfs-demote-timeout-90)
>                stop interval=0s timeout=100 (drbd_vmfs-stop-timeout-100)
>                monitor interval=30s (drbd_vmfs-monitor-interval-30s)
>  Resource: vmfsFS (class=ocf provider=heartbeat type=Filesystem)
>   Attributes: device=/dev/drbd0 directory=/exports/vmfs fstype=xfs
>   Operations: start interval=0s timeout=60 (vmfsFS-start-timeout-60)
>               stop interval=0s timeout=60 (vmfsFS-stop-timeout-60)
>               monitor interval=20 timeout=40 (vmfsFS-monitor-interval-20)
>  Resource: nfs-server (class=systemd type=nfs-server)
>   Operations: monitor interval=60s (nfs-server-monitor-interval-60s)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
> Ordering Constraints:
>   promote ms_drbd_vmfs then start vmfsFS (kind:Mandatory)
> (id:order-ms_drbd_vmfs-vmfsFS-mandatory)
>   start vmfsFS then start nfs-server (kind:Mandatory)
> (id:order-vmfsFS-nfs-server-mandatory)
>   start ClusterIP then start nfs-server (kind:Mandatory)
> (id:order-ClusterIP-nfs-server-mandatory)
> Colocation Constraints:
>   ms_drbd_vmfs with ClusterIP (score:INFINITY)
> (id:colocation-ms_drbd_vmfs-ClusterIP-INFINITY)
>   vmfsFS with ms_drbd_vmfs (score:INFINITY) (with-rsc-role:Master)
> (id:colocation-vmfsFS-ms_drbd_vmfs-INFINITY)
>   nfs-server with vmfsFS (score:INFINITY)
> (id:colocation-nfs-server-vmfsFS-INFINITY)
>
> Cluster Properties:
>  cluster-infrastructure: corosync
>  cluster-name: fx201-vmcl
>  dc-version: 1.1.13-a14efad
>  have-watchdog: false
>  last-lrm-refresh: 1442528181
>  stonith-enabled: false
>
> And status:
>
> [root at fx201-1a log]# pcs status --full
> Cluster name: fx201-vmcl
> Last updated: Thu Sep 17 17:55:56 2015 Last change: Thu Sep 17 17:18:10
> 2015 by root via crm_attribute on fx201-1b.ams
> Stack: corosync
> Current DC: fx201-1b.ams (2) (version 1.1.13-a14efad) - partition with
> quorum
> 2 nodes and 5 resources configured
>
> Online: [ fx201-1a.ams (1) fx201-1b.ams (2) ]
>
> Full list of resources:
>
>  ClusterIP (ocf::heartbeat:IPaddr2): Started fx201-1a.ams
>  Master/Slave Set: ms_drbd_vmfs [drbd_vmfs]
>      drbd_vmfs (ocf::linbit:drbd): Master fx201-1a.ams
>      drbd_vmfs (ocf::linbit:drbd): Stopped
>      Masters: [ fx201-1a.ams ]
>      Stopped: [ fx201-1b.ams ]
>  vmfsFS (ocf::heartbeat:Filesystem): Started fx201-1a.ams
>  nfs-server (systemd:nfs-server): Started fx201-1a.ams
>
> PCSD Status:
>   fx201-1a.ams: Online
>   fx201-1b.ams: Online
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
>
> If I do a failover, after manually confirming that the DRBD data is
> fully synchronized, it does work, but the secondary side never
> reconnects, and to get the resource back in sync I have to correct it
> by hand, ad infinitum.  I have tried standby/unstandby, pcs resource
> debug-start (with undesirable results), and so on.
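>
> For reference, the manual correction typically looks something like the
> following on the stopped node (illustrative only -- the exact steps
> depend on the state DRBD is left in):
>
> [root at fx201-1b ~]# cat /proc/drbd                  # check connection/sync state
> [root at fx201-1b ~]# drbdadm connect vmfs            # reconnect the vmfs resource to its peer
> [root at fx201-1b ~]# pcs resource cleanup drbd_vmfs  # clear the failure so Pacemaker retries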
>
> Here are some relevant log messages from pacemaker.log:
>
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info:
> crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:   notice:
> do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [
> input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info:
> do_state_transition: Progressed to state S_POLICY_ENGINE after
> C_TIMER_POPPED
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> process_pe_message: Input has not changed since last time, not saving to
> disk
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> determine_online_status: Node fx201-1b.ams is online
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> determine_online_status: Node fx201-1a.ams is online
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> determine_op_status: Operation monitor found resource drbd_vmfs:0 active
> in master mode on fx201-1b.ams
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> determine_op_status: Operation monitor found resource drbd_vmfs:0 active
> on fx201-1a.ams
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> native_print: ClusterIP (ocf::heartbeat:IPaddr2): Started fx201-1a.ams
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> clone_print: Master/Slave Set: ms_drbd_vmfs [drbd_vmfs]
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> short_print:     Masters: [ fx201-1a.ams ]
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> short_print:     Stopped: [ fx201-1b.ams ]
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> native_print: vmfsFS (ocf::heartbeat:Filesystem): Started fx201-1a.ams
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> native_print: nfs-server (systemd:nfs-server): Started fx201-1a.ams
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> native_color: Resource drbd_vmfs:1 cannot run anywhere
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> master_color: Promoting drbd_vmfs:0 (Master fx201-1a.ams)
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> master_color: ms_drbd_vmfs: Promoted 1 instances of a possible 1 to master
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> LogActions: Leave   ClusterIP (Started fx201-1a.ams)
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> LogActions: Leave   drbd_vmfs:0 (Master fx201-1a.ams)
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> LogActions: Leave   drbd_vmfs:1 (Stopped)
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> LogActions: Leave   vmfsFS (Started fx201-1a.ams)
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info:
> LogActions: Leave   nfs-server (Started fx201-1a.ams)
> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:   notice:
> process_pe_message: Calculated Transition 16:
> /var/lib/pacemaker/pengine/pe-input-61.bz2
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info:
> do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response ]
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info:
> do_te_invoke: Processing graph 16 (ref=pe_calc-dc-1442530090-97) derived
> from /var/lib/pacemaker/pengine/pe-input-61.bz2
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:   notice:
> run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0,
> Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-61.bz2): Complete
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info:
> do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state
> S_TRANSITION_ENGINE
> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:   notice:
> do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [
> input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>
> Thank you all for your help,
>
> Jason
>