[ClusterLabs] Pacemaker/pcs & DRBD not demoting secondary node to Slave (always Stopped)
Jason Gress
jgress at accertify.com
Fri Sep 18 00:02:01 UTC 2015
That may very well be it. Would you be so kind as to show me the pcs command to create that config? I generated my configuration with these commands, and I'm not sure how to get the additional monitor options in there:
pcs resource create drbd_vmfs ocf:linbit:drbd drbd_resource=vmfs op monitor interval=30s
pcs resource master ms_drbd_vmfs drbd_vmfs master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
Thank you very much for your help, and sorry for the newbie question!
Jason
From: Luke Pascoe <luke at osnz.co.nz>
Reply-To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Date: Thursday, September 17, 2015 at 6:54 PM
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Subject: Re: [ClusterLabs] Pacemaker/pcs & DRBD not demoting secondary node to Slave (always Stopped)
The only difference I can see between your DRBD resource and mine is the monitor operations (mine works nicely, but it runs on CentOS 6). Here's mine:
 Master: ms_drbd_iscsicg0
  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  Resource: drbd_iscsivg0 (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=iscsivg0
   Operations: start interval=0s timeout=240 (drbd_iscsivg0-start-timeout-240)
               promote interval=0s timeout=90 (drbd_iscsivg0-promote-timeout-90)
               demote interval=0s timeout=90 (drbd_iscsivg0-demote-timeout-90)
               stop interval=0s timeout=100 (drbd_iscsivg0-stop-timeout-100)
               monitor interval=29s role=Master (drbd_iscsivg0-monitor-interval-29s-role-Master)
               monitor interval=31s role=Slave (drbd_iscsivg0-monitor-interval-31s-role-Slave)
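For anyone wondering how role-specific monitor operations like the two in this listing are expressed on the pcs command line, here is a minimal sketch rather than a tested recipe. It assumes the pcs 0.9 series syntax (`pcs resource op add` / `pcs resource op remove`) and the resource name `drbd_vmfs` from the original post. The `run` helper and the `DRY_RUN` variable are hypothetical additions so that the script only prints the commands unless you opt in.

```shell
#!/bin/sh
# Sketch only: replace the single 30s monitor on a DRBD primitive with one
# monitor per role. Nothing is changed unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"        # dry run: just show the command
    else
        "$@"             # really execute it
    fi
}

set_role_monitors() {
    rsc="$1"
    # Different intervals per role, mirroring the 29s/31s values quoted above.
    run pcs resource op add "$rsc" monitor interval=29s role=Master
    run pcs resource op add "$rsc" monitor interval=31s role=Slave
    # Drop the generic monitor that was created with "op monitor interval=30s".
    run pcs resource op remove "$rsc" monitor interval=30s
}

set_role_monitors drbd_vmfs

# At creation time the same operations can be given directly, for example:
#   pcs resource create drbd_vmfs ocf:linbit:drbd drbd_resource=vmfs \
#       op monitor interval=29s role=Master op monitor interval=31s role=Slave
```

Run it once as-is to review the three commands, then again with `DRY_RUN=0` on a cluster node. Afterwards `pcs config` should list both monitors under the resource's Operations.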
What mechanism are you using to fail over? Check your constraints after you do it and make sure it hasn't added one which stops the slave clone from starting on the "failed" node.
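A rough way to do that constraint check, offered as a sketch: constraints left behind by move/ban style commands normally get ids beginning with `cli-prefer-` or `cli-ban-`, so filtering the full constraint listing for those prefixes shows whether a failover command added one. The function name `find_cli_constraints` is made up for this example.

```shell
#!/bin/sh
# Sketch only: show constraints that look like leftovers from a manual
# move/ban (ids starting with cli-prefer- or cli-ban-).
find_cli_constraints() {
    # Reads "pcs constraint --full" style text on stdin.
    grep -E 'cli-(ban|prefer)-' || true
}

# Only query the cluster when pcs is actually installed on this host.
if command -v pcs >/dev/null 2>&1; then
    pcs constraint --full | find_cli_constraints
fi

# If one shows up for the DRBD set, it can be dropped with, for example:
#   pcs resource clear ms_drbd_vmfs
# or by id:
#   pcs constraint remove <constraint id from the listing>
```

No output means no such leftover constraint was found, in which case the cause is more likely in the constraints that were configured on purpose.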
Luke Pascoe
E luke at osnz.co.nz
P +64 (9) 296 2961
M +64 (27) 426 6649
W www.osnz.co.nz
24 Wellington St
Papakura
Auckland, 2110
New Zealand
On 18 September 2015 at 11:40, Jason Gress <jgress at accertify.com> wrote:
Looking more closely, according to page 64 (http://clusterlabs.org/doc/Cluster_from_Scratch.pdf) it does indeed appear that 1 is the correct number. (I just realized that it's page 64 of the "book", but page 76 of the pdf.)
Thank you again,
Jason
From: Jason Gress <jgress at accertify.com>
Reply-To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Date: Thursday, September 17, 2015 at 6:36 PM
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Subject: Re: [ClusterLabs] Pacemaker/pcs & DRBD not demoting secondary node to Slave (always Stopped)
I can't say whether you are right or wrong (you may be right!), but I followed the Cluster From Scratch tutorial closely, and it only had clone-node-max=1 there. (Page 106 of the pdf, for the curious.)
Thanks,
Jason
From: Luke Pascoe <luke at osnz.co.nz>
Reply-To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Date: Thursday, September 17, 2015 at 6:29 PM
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Subject: Re: [ClusterLabs] Pacemaker/pcs & DRBD not demoting secondary node to Slave (always Stopped)
I may be wrong, but shouldn't "clone-node-max" be 2 on the ms_drbd_vmfs resource?
Luke Pascoe
On 18 September 2015 at 11:02, Jason Gress <jgress at accertify.com> wrote:
I have a simple DRBD + filesystem + NFS configuration that works properly when I manually start/stop DRBD, but will not start the DRBD slave resource properly on failover or recovery. I cannot ever get the Master/Slave set to say anything but 'Stopped'. I am running CentOS 7.1 with the latest packages as of today:
[root@fx201-1a log]# rpm -qa | grep -e pcs -e pacemaker -e drbd
pacemaker-cluster-libs-1.1.12-22.el7_1.4.x86_64
pacemaker-1.1.12-22.el7_1.4.x86_64
pcs-0.9.137-13.el7_1.4.x86_64
pacemaker-libs-1.1.12-22.el7_1.4.x86_64
drbd84-utils-8.9.3-1.1.el7.elrepo.x86_64
pacemaker-cli-1.1.12-22.el7_1.4.x86_64
kmod-drbd84-8.4.6-1.el7.elrepo.x86_64
Here is my pcs config output:
[root@fx201-1a log]# pcs config
Cluster Name: fx201-vmcl
Corosync Nodes:
 fx201-1a.ams fx201-1b.ams
Pacemaker Nodes:
 fx201-1a.ams fx201-1b.ams
Resources:
 Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.XX.XX.XX cidr_netmask=24
  Operations: start interval=0s timeout=20s (ClusterIP-start-timeout-20s)
              stop interval=0s timeout=20s (ClusterIP-stop-timeout-20s)
              monitor interval=15s (ClusterIP-monitor-interval-15s)
 Master: ms_drbd_vmfs
  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  Resource: drbd_vmfs (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=vmfs
   Operations: start interval=0s timeout=240 (drbd_vmfs-start-timeout-240)
               promote interval=0s timeout=90 (drbd_vmfs-promote-timeout-90)
               demote interval=0s timeout=90 (drbd_vmfs-demote-timeout-90)
               stop interval=0s timeout=100 (drbd_vmfs-stop-timeout-100)
               monitor interval=30s (drbd_vmfs-monitor-interval-30s)
 Resource: vmfsFS (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/drbd0 directory=/exports/vmfs fstype=xfs
  Operations: start interval=0s timeout=60 (vmfsFS-start-timeout-60)
              stop interval=0s timeout=60 (vmfsFS-stop-timeout-60)
              monitor interval=20 timeout=40 (vmfsFS-monitor-interval-20)
 Resource: nfs-server (class=systemd type=nfs-server)
  Operations: monitor interval=60s (nfs-server-monitor-interval-60s)
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
  promote ms_drbd_vmfs then start vmfsFS (kind:Mandatory) (id:order-ms_drbd_vmfs-vmfsFS-mandatory)
  start vmfsFS then start nfs-server (kind:Mandatory) (id:order-vmfsFS-nfs-server-mandatory)
  start ClusterIP then start nfs-server (kind:Mandatory) (id:order-ClusterIP-nfs-server-mandatory)
Colocation Constraints:
  ms_drbd_vmfs with ClusterIP (score:INFINITY) (id:colocation-ms_drbd_vmfs-ClusterIP-INFINITY)
  vmfsFS with ms_drbd_vmfs (score:INFINITY) (with-rsc-role:Master) (id:colocation-vmfsFS-ms_drbd_vmfs-INFINITY)
  nfs-server with vmfsFS (score:INFINITY) (id:colocation-nfs-server-vmfsFS-INFINITY)
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: fx201-vmcl
 dc-version: 1.1.13-a14efad
 have-watchdog: false
 last-lrm-refresh: 1442528181
 stonith-enabled: false
And status:
[root@fx201-1a log]# pcs status --full
Cluster name: fx201-vmcl
Last updated: Thu Sep 17 17:55:56 2015 Last change: Thu Sep 17 17:18:10 2015 by root via crm_attribute on fx201-1b.ams
Stack: corosync
Current DC: fx201-1b.ams (2) (version 1.1.13-a14efad) - partition with quorum
2 nodes and 5 resources configured
Online: [ fx201-1a.ams (1) fx201-1b.ams (2) ]
Full list of resources:
 ClusterIP (ocf::heartbeat:IPaddr2): Started fx201-1a.ams
 Master/Slave Set: ms_drbd_vmfs [drbd_vmfs]
     drbd_vmfs (ocf::linbit:drbd): Master fx201-1a.ams
     drbd_vmfs (ocf::linbit:drbd): Stopped
     Masters: [ fx201-1a.ams ]
     Stopped: [ fx201-1b.ams ]
 vmfsFS (ocf::heartbeat:Filesystem): Started fx201-1a.ams
 nfs-server (systemd:nfs-server): Started fx201-1a.ams
PCSD Status:
  fx201-1a.ams: Online
  fx201-1b.ams: Online
Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
If I fail over after manually confirming that the DRBD data is completely synchronized, the failover itself works, but the secondary side never reconnects. To get the resource synchronized again I have to correct it manually, ad infinitum. I have tried standby/unstandby, pcs resource debug-start (with undesirable results), and so on.
Here are some relevant log messages from pacemaker.log:
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: process_pe_message: Input has not changed since last time, not saving to disk
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: determine_online_status: Node fx201-1b.ams is online
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: determine_online_status: Node fx201-1a.ams is online
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: determine_op_status: Operation monitor found resource drbd_vmfs:0 active in master mode on fx201-1b.ams
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: determine_op_status: Operation monitor found resource drbd_vmfs:0 active on fx201-1a.ams
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: native_print: ClusterIP (ocf::heartbeat:IPaddr2): Started fx201-1a.ams
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: clone_print: Master/Slave Set: ms_drbd_vmfs [drbd_vmfs]
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: short_print: Masters: [ fx201-1a.ams ]
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: short_print: Stopped: [ fx201-1b.ams ]
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: native_print: vmfsFS (ocf::heartbeat:Filesystem): Started fx201-1a.ams
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: native_print: nfs-server (systemd:nfs-server): Started fx201-1a.ams
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: native_color: Resource drbd_vmfs:1 cannot run anywhere
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: master_color: Promoting drbd_vmfs:0 (Master fx201-1a.ams)
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: master_color: ms_drbd_vmfs: Promoted 1 instances of a possible 1 to master
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: LogActions: Leave ClusterIP (Started fx201-1a.ams)
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: LogActions: Leave drbd_vmfs:0 (Master fx201-1a.ams)
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: LogActions: Leave drbd_vmfs:1 (Stopped)
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: LogActions: Leave vmfsFS (Started fx201-1a.ams)
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: info: LogActions: Leave nfs-server (Started fx201-1a.ams)
Sep 17 17:48:10 [5662] fx201-1b.ams.accertify.net pengine: notice: process_pe_message: Calculated Transition 16: /var/lib/pacemaker/pengine/pe-input-61.bz2
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: info: do_te_invoke: Processing graph 16 (ref=pe_calc-dc-1442530090-97) derived from /var/lib/pacemaker/pengine/pe-input-61.bz2
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: notice: run_graph: Transition 16 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-61.bz2): Complete
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: info: do_log: FSA: Input I_TE_SUCCESS from notify_crmd() received in state S_TRANSITION_ENGINE
Sep 17 17:48:10 [13954] fx201-1b.ams.accertify.net crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
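The line that stands out in the log above is "Resource drbd_vmfs:1 cannot run anywhere". The following is a small sketch for pulling such lines out of a longer log and then looking at the policy engine's reasoning. The log path is an assumption (adjust it to wherever pacemaker.log lives on your nodes), and the function name `unplaced_resources` is made up for this example.

```shell
#!/bin/sh
# Sketch only: list the resource instances the policy engine could not place.
unplaced_resources() {
    # Reads pacemaker.log text on stdin; prints each instance name once.
    sed -n 's/.*Resource \([^ ]*\) cannot run anywhere.*/\1/p' | sort -u
}

LOG="${LOG:-/var/log/pacemaker.log}"   # assumed location, not from the post
if [ -r "$LOG" ]; then
    unplaced_resources < "$LOG"
fi

# To see why an instance was not placed, the saved transition input named in
# the log can be replayed read-only with allocation scores shown:
#   crm_simulate -s -x /var/lib/pacemaker/pengine/pe-input-61.bz2
```

In the score output, a -INFINITY score for the stopped instance on the remaining node usually points at a location or colocation constraint rather than at DRBD itself.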
Thank you all for your help,
Jason
"This message and any attachments may contain confidential information. If you
have received this message in error, any use or distribution is prohibited.
Please notify us by reply e-mail if you have mistakenly received this message,
and immediately and permanently delete it and any attachments. Thank you."
_______________________________________________
Users mailing list: Users at clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org