[ClusterLabs] Pacemaker/pcs & DRBD not demoting secondary node to Slave (always Stopped)

Fri Sep 18 00:08:22 UTC 2015

pcs resource create drbd_iscsivg0 ocf:linbit:drbd drbd_resource=iscsivg0 \
    op monitor interval="29s" role="Master" \
    op monitor interval="31s" role="Slave"
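
For a resource that already exists, such as the drbd_vmfs resource quoted below, recreating it may not be necessary. The following is only a minimal sketch, assuming pcs 0.9.x and its `pcs resource op add` and `pcs resource op remove` subcommands. The function name add_role_monitors is hypothetical, and the 29s/31s intervals are simply copied from the command above:

```shell
#!/bin/sh
# Sketch (assumption: pcs 0.9.x "resource op add/remove" subcommands).
# Adds one monitor per role, then drops the old role-less monitor.
# add_role_monitors is a hypothetical helper name, not part of pcs.
add_role_monitors() {
    rsc="$1"           # resource id, e.g. drbd_vmfs
    old_interval="$2"  # interval of the existing monitor, e.g. 30s

    # The two intervals differ so the operations stay distinct.
    pcs resource op add "$rsc" monitor interval=29s role=Master &&
    pcs resource op add "$rsc" monitor interval=31s role=Slave &&
    pcs resource op remove "$rsc" monitor interval="$old_interval"
}

# Usage (run on one cluster node as root):
#   add_role_monitors drbd_vmfs 30s
```

Running `pcs config` afterwards should list both monitor operations, in the same form as the drbd_iscsivg0 output quoted further down.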

Luke Pascoe

*E* luke at osnz.co.nz
*P* +64 (9) 296 2961
*M* +64 (27) 426 6649
*W* www.osnz.co.nz

24 Wellington St
Papakura
Auckland, 2110
New Zealand

On 18 September 2015 at 12:02, Jason Gress <jgress at accertify.com> wrote:

> That may very well be it.  Would you be so kind as to show me the pcs
> command to create that config?  I generated my configuration with these
> commands, and I'm not sure how to get the additional monitor options in
> there:
>
> pcs resource create drbd_vmfs ocf:linbit:drbd drbd_resource=vmfs \
>     op monitor interval=30s
> pcs resource master ms_drbd_vmfs drbd_vmfs master-max=1 master-node-max=1 \
>     clone-max=2 clone-node-max=1 notify=true
>
> Thank you very much for your help, and sorry for the newbie question!
>
> Jason
>
> From: Luke Pascoe <luke at osnz.co.nz>
> Reply-To: Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> Date: Thursday, September 17, 2015 at 6:54 PM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users at clusterlabs.org>
> Subject: Re: [ClusterLabs] Pacemaker/pcs & DRBD not demoting secondary
> node to Slave (always Stopped)
>
> The only difference I can see between your DRBD resource and mine is the
> monitoring parameters (mine works nicely, but is on CentOS 6).
> Here's mine:
>
> Master: ms_drbd_iscsicg0
>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>   Resource: drbd_iscsivg0 (class=ocf provider=linbit type=drbd)
>    Attributes: drbd_resource=iscsivg0
>    Operations: start interval=0s timeout=240 (drbd_iscsivg0-start-timeout-240)
>                promote interval=0s timeout=90 (drbd_iscsivg0-promote-timeout-90)
>                demote interval=0s timeout=90 (drbd_iscsivg0-demote-timeout-90)
>                stop interval=0s timeout=100 (drbd_iscsivg0-stop-timeout-100)
>                monitor interval=29s role=Master (drbd_iscsivg0-monitor-interval-29s-role-Master)
>                monitor interval=31s role=Slave (drbd_iscsivg0-monitor-interval-31s-role-Slave)
>
> What mechanism are you using to fail over? Check your constraints after
> you fail over and make sure the failover hasn't added one that stops the
> slave clone from starting on the "failed" node.
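
The constraint check can be sketched as follows. It assumes the failover was done with `pcs resource move` or `pcs resource ban`, which leave behind location constraints whose ids start with cli- (cli-prefer-... or cli-ban-...). The helper name list_cli_constraints is hypothetical; it only filters the text of `pcs constraint --full` read from stdin:

```shell
#!/bin/sh
# Sketch: print the ids of constraints that look CLI-generated.
# Assumption: such constraints carry ids beginning with "cli-", and
# `pcs constraint --full` prints each id as "(id:...)".
# list_cli_constraints is a hypothetical helper name, not part of pcs.
list_cli_constraints() {
    # Keep only "id:cli-..." tokens, then strip the "id:" prefix.
    grep -o 'id:cli-[^)]*' | sed 's/^id://'
}

# Usage (on a cluster node):
#   pcs constraint --full | list_cli_constraints
```

Any id it prints can then be removed with `pcs constraint remove <id>`. Empty output means no leftover constraint of this kind, so the cause would have to be looked for elsewhere.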
>
> On 18 September 2015 at 11:40, Jason Gress <jgress at accertify.com> wrote:
>
>> Looking more closely, according to page 64 (
>> http://clusterlabs.org/doc/Cluster_from_Scratch.pdf) it does indeed
>> appear that 1 is the correct number.  (I just realized that it's page 64 of
>> the "book", but page 76 of the pdf.)
>>
>> Thank you again,
>>
>> Jason
>>
>> From: Jason Gress <jgress at accertify.com>
>> Reply-To: Cluster Labs - All topics related to open-source clustering
>> welcomed <users at clusterlabs.org>
>> Date: Thursday, September 17, 2015 at 6:36 PM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> <users at clusterlabs.org>
>> Subject: Re: [ClusterLabs] Pacemaker/pcs & DRBD not demoting secondary
>> node to Slave (always Stopped)
>>
>> I can't say whether you are right (you may be!), but I followed the
>> Cluster From Scratch tutorial closely, and it only sets
>> clone-node-max=1 there.  (Page 106 of the pdf, for the curious.)
>>
>> Thanks,
>>
>> Jason
>>
>> From: Luke Pascoe <luke at osnz.co.nz>
>> Reply-To: Cluster Labs - All topics related to open-source clustering
>> welcomed <users at clusterlabs.org>
>> Date: Thursday, September 17, 2015 at 6:29 PM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> <users at clusterlabs.org>
>> Subject: Re: [ClusterLabs] Pacemaker/pcs & DRBD not demoting secondary
>> node to Slave (always Stopped)
>>
>> I may be wrong, but shouldn't "clone-node-max" be 2 on the ms_drbd_vmfs
>> resource?
>>
>> On 18 September 2015 at 11:02, Jason Gress <jgress at accertify.com> wrote:
>>
>>> I have a simple DRBD + filesystem + NFS configuration that works
>>> properly when I manually start/stop DRBD, but will not start the DRBD slave
>>> resource properly on failover or recovery.  I cannot ever get the
>>> Master/Slave set to say anything but 'Stopped'.  I am running CentOS 7.1
>>> with the latest packages as of today:
>>>
>>> [root@fx201-1a log]# rpm -qa | grep -e pcs -e pacemaker -e drbd
>>> pacemaker-cluster-libs-1.1.12-22.el7_1.4.x86_64
>>> pacemaker-1.1.12-22.el7_1.4.x86_64
>>> pcs-0.9.137-13.el7_1.4.x86_64
>>> pacemaker-libs-1.1.12-22.el7_1.4.x86_64
>>> drbd84-utils-8.9.3-1.1.el7.elrepo.x86_64
>>> pacemaker-cli-1.1.12-22.el7_1.4.x86_64
>>> kmod-drbd84-8.4.6-1.el7.elrepo.x86_64
>>>
>>> Here is my pcs config output:
>>>
>>> [root@fx201-1a log]# pcs config
>>> Cluster Name: fx201-vmcl
>>> Corosync Nodes:
>>>  fx201-1a.ams fx201-1b.ams
>>> Pacemaker Nodes:
>>>  fx201-1a.ams fx201-1b.ams
>>>
>>> Resources:
>>>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>>   Attributes: ip=10.XX.XX.XX cidr_netmask=24
>>>   Operations: start interval=0s timeout=20s (ClusterIP-start-timeout-20s)
>>>               stop interval=0s timeout=20s (ClusterIP-stop-timeout-20s)
>>>               monitor interval=15s (ClusterIP-monitor-interval-15s)
>>>  Master: ms_drbd_vmfs
>>>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>>>   Resource: drbd_vmfs (class=ocf provider=linbit type=drbd)
>>>    Attributes: drbd_resource=vmfs
>>>    Operations: start interval=0s timeout=240 (drbd_vmfs-start-timeout-240)
>>>                promote interval=0s timeout=90 (drbd_vmfs-promote-timeout-90)
>>>                demote interval=0s timeout=90 (drbd_vmfs-demote-timeout-90)
>>>                stop interval=0s timeout=100 (drbd_vmfs-stop-timeout-100)
>>>                monitor interval=30s (drbd_vmfs-monitor-interval-30s)
>>>  Resource: vmfsFS (class=ocf provider=heartbeat type=Filesystem)
>>>   Attributes: device=/dev/drbd0 directory=/exports/vmfs fstype=xfs
>>>   Operations: start interval=0s timeout=60 (vmfsFS-start-timeout-60)
>>>               stop interval=0s timeout=60 (vmfsFS-stop-timeout-60)
>>>               monitor interval=20 timeout=40 (vmfsFS-monitor-interval-20)
>>>  Resource: nfs-server (class=systemd type=nfs-server)
>>>   Operations: monitor interval=60s (nfs-server-monitor-interval-60s)
>>>
>>> Stonith Devices:
>>> Fencing Levels:
>>>
>>> Location Constraints:
>>> Ordering Constraints:
>>>   promote ms_drbd_vmfs then start vmfsFS (kind:Mandatory) (id:order-ms_drbd_vmfs-vmfsFS-mandatory)
>>>   start vmfsFS then start nfs-server (kind:Mandatory) (id:order-vmfsFS-nfs-server-mandatory)
>>>   start ClusterIP then start nfs-server (kind:Mandatory) (id:order-ClusterIP-nfs-server-mandatory)
>>> Colocation Constraints:
>>>   ms_drbd_vmfs with ClusterIP (score:INFINITY) (id:colocation-ms_drbd_vmfs-ClusterIP-INFINITY)
>>>   vmfsFS with ms_drbd_vmfs (score:INFINITY) (with-rsc-role:Master) (id:colocation-vmfsFS-ms_drbd_vmfs-INFINITY)
>>>   nfs-server with vmfsFS (score:INFINITY) (id:colocation-nfs-server-vmfsFS-INFINITY)
>>>
>>> Cluster Properties:
>>>  cluster-infrastructure: corosync
>>>  cluster-name: fx201-vmcl
>>>  dc-version: 1.1.13-a14efad
>>>  have-watchdog: false
>>>  last-lrm-refresh: 1442528181
>>>  stonith-enabled: false
>>>
>>> And status:
>>>
>>> [root@fx201-1a log]# pcs status --full
>>> Cluster name: fx201-vmcl
>>> Last updated: Thu Sep 17 17:55:56 2015 Last change: Thu Sep 17 17:18:10 2015 by root via crm_attribute on fx201-1b.ams
>>> Stack: corosync
>>> Current DC: fx201-1b.ams (2) (version 1.1.13-a14efad) - partition with quorum
>>> 2 nodes and 5 resources configured
>>>
>>> Online: [ fx201-1a.ams (1) fx201-1b.ams (2) ]
>>>
>>> Full list of resources:
>>>
>>>  ClusterIP (ocf::heartbeat:IPaddr2): Started fx201-1a.ams
>>>  Master/Slave Set: ms_drbd_vmfs [drbd_vmfs]
>>>      drbd_vmfs (ocf::linbit:drbd): Master fx201-1a.ams
>>>      drbd_vmfs (ocf::linbit:drbd): Stopped
>>>      Masters: [ fx201-1a.ams ]
>>>      Stopped: [ fx201-1b.ams ]
>>>  vmfsFS (ocf::heartbeat:Filesystem): Started fx201-1a.ams
>>>  nfs-server (systemd:nfs-server): Started fx201-1a.ams
>>>
>>> PCSD Status:
>>>   fx201-1a.ams: Online
>>>   fx201-1b.ams: Online
>>>
>>> Daemon Status:
>>>   corosync: active/enabled
>>>   pacemaker: active/enabled
>>>   pcsd: active/enabled
>>>
>>> If I do a failover, after manually confirming that the DRBD data is
>>> synchronized completely, it does work, but then never reconnects the
>>> secondary side, and in order to get the resource synchronized again, I have
>>> to manually correct it, ad infinitum.  I have tried standby/unstandby, pcs
>>> resource debug-start (with undesirable results), and so on.
>>>
>>> Here are some relevant log messages from pacemaker.log:
>>>
>>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
>>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
>>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: process_pe_message: Input has not changed since last time, not saving to disk
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: determine_online_status: Node fx201-1b.ams is online
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: determine_online_status: Node fx201-1a.ams is online
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: determine_op_status: Operation monitor found resource drbd_vmfs:0 active in master mode on fx201-1b.ams
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: determine_op_status: Operation monitor found resource drbd_vmfs:0 active on fx201-1a.ams
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: native_print: ClusterIP(ocf::heartbeat:IPaddr2):Started fx201-1a.ams
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: clone_print: Master/Slave Set: ms_drbd_vmfs [drbd_vmfs]
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: short_print:     Masters: [ fx201-1a.ams ]
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: short_print:     Stopped: [ fx201-1b.ams ]
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: native_print: vmfsFS(ocf::heartbeat:Filesystem):Started fx201-1a.ams
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: native_print: nfs-server(systemd:nfs-server):Started fx201-1a.ams
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: native_color: Resource drbd_vmfs:1 cannot run anywhere
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: master_color: Promoting drbd_vmfs:0 (Master fx201-1a.ams)
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: master_color: ms_drbd_vmfs: Promoted 1 instances of a possible 1 to master
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   ClusterIP(Started fx201-1a.ams)
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   drbd_vmfs:0(Master fx201-1a.ams)
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   drbd_vmfs:1(Stopped)
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   vmfsFS(Started fx201-1a.ams)
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:     info: LogActions: Leave   nfs-server(Started fx201-1a.ams)
>>> Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net    pengine:   notice: process_pe_message: Calculated Transition 16: /var/lib/pacemaker/pengine/pe-input-61.bz2
>>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
>>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: do_te_invoke: Processing graph 16 (ref=pe_calc-dc-1442530090-97) derived from /var/lib/pacemaker/pengine/pe-input-61.bz2
>>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:   notice: run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-61.bz2): Complete
>>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:     info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
>>> Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net       crmd:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>>>
>>> Thank you all for your help,
>>>
>>> Jason
>>>
>>> "This message and any attachments may contain confidential information. If you
>>> have received this  message in error, any use or distribution is prohibited.
>>> Please notify us by reply e-mail if you have mistakenly received this message,
>>> and immediately and permanently delete it and any attachments. Thank you."
>>>
>>>
>>> _______________________________________________
>>> Users mailing list: Users at clusterlabs.org
>>> http://clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>