[ClusterLabs] Pacemaker/pcs & DRBD not demoting secondary node to Slave (always Stopped)

Thu Sep 17 23:54:58 UTC 2015

The only difference I can see between your DRBD resource and mine is the
monitor operations: mine has a separate monitor per role (Master and Slave),
yours has a single monitor. Mine works nicely, but it is on CentOS 6.
Here's mine:

Master: ms_drbd_iscsicg0
  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  Resource: drbd_iscsivg0 (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=iscsivg0
   Operations: start interval=0s timeout=240 (drbd_iscsivg0-start-timeout-240)
               promote interval=0s timeout=90 (drbd_iscsivg0-promote-timeout-90)
               demote interval=0s timeout=90 (drbd_iscsivg0-demote-timeout-90)
               stop interval=0s timeout=100 (drbd_iscsivg0-stop-timeout-100)
               monitor interval=29s role=Master (drbd_iscsivg0-monitor-interval-29s-role-Master)
               monitor interval=31s role=Slave (drbd_iscsivg0-monitor-interval-31s-role-Slave)
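
If you want to try the same per-role monitors on your resource, something
like the following should do it. This is only a sketch: I am assuming the
"pcs resource op add" syntax from pcs 0.9.x and your resource name
drbd_vmfs. The helper name set_role_monitors and the PCS variable (which
makes a dry run possible) are made up for illustration, and I have not
confirmed that the monitors are what is keeping your slave stopped.

```shell
# Sketch: add one monitor operation per role to a DRBD primitive.
# PCS defaults to the real pcs binary; set PCS=echo to print the
# commands instead of running them (hypothetical dry-run switch).
PCS="${PCS:-pcs}"

set_role_monitors() {
    rsc="$1"   # primitive id, e.g. drbd_vmfs (assumed from the quoted config)
    # The two intervals must differ; 29s/31s mirror my working config.
    $PCS resource op add "$rsc" monitor interval=29s role=Master
    $PCS resource op add "$rsc" monitor interval=31s role=Slave
}
```

Run it as "set_role_monitors drbd_vmfs", or with PCS=echo first to see
what would be executed. Whether to also drop your existing role-less
30s monitor is a separate decision that I have left out here.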

What mechanism are you using to fail over? Check your constraints after the
failover and make sure it hasn't added a location constraint that stops the
slave clone from starting on the "failed" node.
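
One way to do that check, as a rough sketch: filter the output of
"pcs constraint --full" for ids beginning with cli-ban- or cli-prefer-,
which is the naming I would expect for constraints left behind by a
resource move or ban. The function name leftover_cli_constraints is
made up for illustration, and the id prefixes are my assumption about
your pcs/Pacemaker versions, so treat an empty result as a hint rather
than proof.

```shell
# Sketch: print constraint lines whose id looks like one created by a
# resource move/ban. Reads "pcs constraint --full" output on stdin.
leftover_cli_constraints() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E 'id:cli-(ban|prefer)-' || true
}
```

Usage would be "pcs constraint --full | leftover_cli_constraints". Any
line it prints can then be reviewed and, if it is not wanted, removed
with "pcs constraint remove" followed by the constraint id.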

Luke Pascoe

*E* luke at osnz.co.nz
*P* +64 (9) 296 2961
*M* +64 (27) 426 6649
*W* www.osnz.co.nz

24 Wellington St
Papakura
Auckland, 2110
New Zealand

On 18 September 2015 at 11:40, Jason Gress <jgress at accertify.com> wrote:

> Looking more closely, according to page 64 (
> http://clusterlabs.org/doc/Cluster_from_Scratch.pdf) it does indeed
> appear that 1 is the correct number.  (I just realized that it's page 64 of
> the "book", but page 76 of the pdf.)
>
> Thank you again,
>
> Jason
>
> From: Jason Gress <jgress at accertify.com>
> Reply-To: Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> Date: Thursday, September 17, 2015 at 6:36 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed <
> users at clusterlabs.org>
> Subject: Re: [ClusterLabs] Pacemaker/pcs & DRBD not demoting secondary
> node to Slave (always Stopped)
>
> I can't say whether you are right (you may be!), but I followed the
> Clusters from Scratch tutorial closely, and it only had
> clone-node-max=1 there.  (Page 106 of the pdf, for the curious.)
>
> Thanks,
>
> Jason
>
> From: Luke Pascoe <luke at osnz.co.nz>
> Reply-To: Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> Date: Thursday, September 17, 2015 at 6:29 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed <
> users at clusterlabs.org>
> Subject: Re: [ClusterLabs] Pacemaker/pcs & DRBD not demoting secondary
> node to Slave (always Stopped)
>
> I may be wrong, but shouldn't "clone-node-max" be 2 on the ms_drbd_vmfs
> resource?
>
> Luke Pascoe
>
> On 18 September 2015 at 11:02, Jason Gress <jgress at accertify.com> wrote:
>
>> I have a simple DRBD + filesystem + NFS configuration that works properly
>> when I manually start/stop DRBD, but will not start the DRBD slave resource
>> properly on failover or recovery.  I cannot ever get the Master/Slave set
>> to say anything but 'Stopped'.  I am running CentOS 7.1 with the latest
>> packages as of today:
>>
>> [root@fx201-1a log]# rpm -qa | grep -e pcs -e pacemaker -e drbd
>> pacemaker-cluster-libs-1.1.12-22.el7_1.4.x86_64
>> pacemaker-1.1.12-22.el7_1.4.x86_64
>> pcs-0.9.137-13.el7_1.4.x86_64
>> pacemaker-libs-1.1.12-22.el7_1.4.x86_64
>> drbd84-utils-8.9.3-1.1.el7.elrepo.x86_64
>> pacemaker-cli-1.1.12-22.el7_1.4.x86_64
>> kmod-drbd84-8.4.6-1.el7.elrepo.x86_64
>>
>> Here is my pcs config output:
>>
>> [root@fx201-1a log]# pcs config
>> Cluster Name: fx201-vmcl
>> Corosync Nodes:
>>  fx201-1a.ams fx201-1b.ams
>> Pacemaker Nodes:
>>  fx201-1a.ams fx201-1b.ams
>>
>> Resources:
>>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>   Attributes: ip=10.XX.XX.XX cidr_netmask=24
>>   Operations: start interval=0s timeout=20s (ClusterIP-start-timeout-20s)
>>               stop interval=0s timeout=20s (ClusterIP-stop-timeout-20s)
>>               monitor interval=15s (ClusterIP-monitor-interval-15s)
>>  Master: ms_drbd_vmfs
>>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>>   Resource: drbd_vmfs (class=ocf provider=linbit type=drbd)
>>    Attributes: drbd_resource=vmfs
>>    Operations: start interval=0s timeout=240 (drbd_vmfs-start-timeout-240)
>>                promote interval=0s timeout=90 (drbd_vmfs-promote-timeout-90)
>>                demote interval=0s timeout=90 (drbd_vmfs-demote-timeout-90)
>>                stop interval=0s timeout=100 (drbd_vmfs-stop-timeout-100)
>>                monitor interval=30s (drbd_vmfs-monitor-interval-30s)
>>  Resource: vmfsFS (class=ocf provider=heartbeat type=Filesystem)
>>   Attributes: device=/dev/drbd0 directory=/exports/vmfs fstype=xfs
>>   Operations: start interval=0s timeout=60 (vmfsFS-start-timeout-60)
>>               stop interval=0s timeout=60 (vmfsFS-stop-timeout-60)
>>               monitor interval=20 timeout=40 (vmfsFS-monitor-interval-20)
>>  Resource: nfs-server (class=systemd type=nfs-server)
>>   Operations: monitor interval=60s (nfs-server-monitor-interval-60s)
>>
>> Stonith Devices:
>> Fencing Levels:
>>
>> Location Constraints:
>> Ordering Constraints:
>>   promote ms_drbd_vmfs then start vmfsFS (kind:Mandatory) (id:order-ms_drbd_vmfs-vmfsFS-mandatory)
>>   start vmfsFS then start nfs-server (kind:Mandatory) (id:order-vmfsFS-nfs-server-mandatory)
>>   start ClusterIP then start nfs-server (kind:Mandatory) (id:order-ClusterIP-nfs-server-mandatory)
>> Colocation Constraints:
>>   ms_drbd_vmfs with ClusterIP (score:INFINITY) (id:colocation-ms_drbd_vmfs-ClusterIP-INFINITY)
>>   vmfsFS with ms_drbd_vmfs (score:INFINITY) (with-rsc-role:Master) (id:colocation-vmfsFS-ms_drbd_vmfs-INFINITY)
>>   nfs-server with vmfsFS (score:INFINITY) (id:colocation-nfs-server-vmfsFS-INFINITY)
>>
>> Cluster Properties:
>>  cluster-infrastructure: corosync
>>  cluster-name: fx201-vmcl
>>  dc-version: 1.1.13-a14efad
>>  have-watchdog: false
>>  last-lrm-refresh: 1442528181
>>  stonith-enabled: false
>>
>> And status:
>>
>> [root@fx201-1a log]# pcs status --full
>> Cluster name: fx201-vmcl
>> Last updated: Thu Sep 17 17:55:56 2015 Last change: Thu Sep 17 17:18:10 2015 by root via crm_attribute on fx201-1b.ams
>> Stack: corosync
>> Current DC: fx201-1b.ams (2) (version 1.1.13-a14efad) - partition with quorum
>> 2 nodes and 5 resources configured
>>
>> Online: [ fx201-1a.ams (1) fx201-1b.ams (2) ]
>>
>> Full list of resources:
>>
>>  ClusterIP (ocf::heartbeat:IPaddr2): Started fx201-1a.ams
>>  Master/Slave Set: ms_drbd_vmfs [drbd_vmfs]
>>      drbd_vmfs (ocf::linbit:drbd): Master fx201-1a.ams
>>      drbd_vmfs (ocf::linbit:drbd): Stopped
>>      Masters: [ fx201-1a.ams ]
>>      Stopped: [ fx201-1b.ams ]
>>  vmfsFS (ocf::heartbeat:Filesystem): Started fx201-1a.ams
>>  nfs-server (systemd:nfs-server): Started fx201-1a.ams
>>
>> PCSD Status:
>>   fx201-1a.ams: Online
>>   fx201-1b.ams: Online
>>
>> Daemon Status:
>>   corosync: active/enabled
>>   pacemaker: active/enabled
>>   pcsd: active/enabled
>>
>> If I do a failover, after manually confirming that the DRBD data is
>> synchronized completely, it does work, but then never reconnects the
>> secondary side, and in order to get the resource synchronized again, I have
>> to manually correct it, ad infinitum.  I have tried standby/unstandby, pcs
>> resource debug-start (with undesirable results), and so on.
>>
>> Here are some relevant log messages from pacemaker.log:
>>
>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: process_pe_message: Input has not changed since last time, not saving to disk
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: determine_online_status: Node fx201-1b.ams is online
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: determine_online_status: Node fx201-1a.ams is online
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: determine_op_status: Operation monitor found resource drbd_vmfs:0 active in master mode on fx201-1b.ams
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: determine_op_status: Operation monitor found resource drbd_vmfs:0 active on fx201-1a.ams
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: native_print: ClusterIP (ocf::heartbeat:IPaddr2): Started fx201-1a.ams
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: clone_print: Master/Slave Set: ms_drbd_vmfs [drbd_vmfs]
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: short_print:    Masters: [ fx201-1a.ams ]
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: short_print:    Stopped: [ fx201-1b.ams ]
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: native_print: vmfsFS (ocf::heartbeat:Filesystem): Started fx201-1a.ams
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: native_print: nfs-server (systemd:nfs-server): Started fx201-1a.ams
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: native_color: Resource drbd_vmfs:1 cannot run anywhere
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: master_color: Promoting drbd_vmfs:0 (Master fx201-1a.ams)
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: master_color: ms_drbd_vmfs: Promoted 1 instances of a possible 1 to master
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   ClusterIP (Started fx201-1a.ams)
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   drbd_vmfs:0 (Master fx201-1a.ams)
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   drbd_vmfs:1 (Stopped)
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   vmfsFS (Started fx201-1a.ams)
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   nfs-server (Started fx201-1a.ams)
>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:   notice: process_pe_message: Calculated Transition 16: /var/lib/pacemaker/pengine/pe-input-61.bz2
>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: do_te_invoke: Processing graph 16 (ref=pe_calc-dc-1442530090-97) derived from /var/lib/pacemaker/pengine/pe-input-61.bz2
>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:   notice: run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-61.bz2): Complete
>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>>
>> Thank you all for your help,
>>
>> Jason
>>
>> "This message and any attachments may contain confidential information. If you
>> have received this  message in error, any use or distribution is prohibited.
>> Please notify us by reply e-mail if you have mistakenly received this message,
>> and immediately and permanently delete it and any attachments. Thank you."
>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>>