[ClusterLabs] Non-cloned resource moves before cloned resource startup on unstandby
Andrei Borzenkov
arvidjaar at gmail.com
Tue Sep 11 01:59:35 EDT 2018
On 07.09.2018 23:07, Dan Ragle wrote:
> On an active-active two node cluster with DRBD, dlm, filesystem mounts,
> a Web Server, and some crons I can't figure out how to have the crons
> jump from node to node in the correct order. Specifically, I have two
> crontabs (managed via symlink creation/deletion) which normally will run
> one on node1 and the other on node2. When a node goes down, I want both
> to run on the remaining node until the original node comes back up, at
> which time they should split the nodes again. However, when returning to
> the original node the crontab that is being moved must wait until the
> underlying FS mount is done on the original node before jumping.
>
> DRBD, dlm, the filesystem mounts and the Web Server are all working as
> expected; when I mark the second node as standby Apache stops, the FS
> unmounts, dlm stops, and DRBD stops on the node; and when I mark that
> same node unstandby the reverse happens as expected. All three of those
> are cloned resources.
>
> The crontab resources are not cloned and create symlinks, one resource
> preferring the first node and the other preferring the second. Each is
> colocated and order dependent on the filesystem mounts (which in turn
> are colocated and dependent on dlm, which in turn is colocated and
> dependent on DRBD promotion). I thought this would be sufficient, but
> when the original node is marked unstandby the crontab that prefers to
> be on that node attempts to jump over immediately before the FS is
> mounted on that node. Of course the crontab link fails because the
> underlying filesystem hasn't been mounted yet.
>
> pcs version is 0.9.162.
>
> Here's the obfuscated detailed list of commands for the config. I'm
> still trying to set it up so it's not production-ready yet, but want to
> get this much sorted before I add too much more.
>
> # pcs config export pcs-commands
> #!/usr/bin/sh
> # sequence generated on 2018-09-07 15:21:15 with: clufter 0.77.0
> # invoked as: ['/usr/sbin/pcs', 'config', 'export', 'pcs-commands']
> # targeting system: ('linux', 'centos', '7.5.1804', 'Core')
> # using interpreter: CPython 2.7.5
> pcs cluster auth node1.mydomain.com node2.mydomain.com <> /dev/tty
> pcs cluster setup --name MyCluster \
> node1.mydomain.com node2.mydomain.com --transport udpu
> pcs cluster start --all --wait=60
> pcs cluster cib tmp-cib.xml
> cp tmp-cib.xml tmp-cib.xml.deltasrc
> pcs -f tmp-cib.xml property set stonith-enabled=false
> pcs -f tmp-cib.xml property set no-quorum-policy=freeze
> pcs -f tmp-cib.xml resource defaults resource-stickiness=100
> pcs -f tmp-cib.xml resource create DRBD ocf:linbit:drbd drbd_resource=r0 \
> op demote interval=0s timeout=90 monitor interval=60s notify interval=0s \
> timeout=90 promote interval=0s timeout=90 reload interval=0s timeout=30 \
> start interval=0s timeout=240 stop interval=0s timeout=100
> pcs -f tmp-cib.xml resource create dlm ocf:pacemaker:controld \
> allow_stonith_disabled=1 \
> op monitor interval=60s start interval=0s timeout=90 stop interval=0s \
> timeout=100
> pcs -f tmp-cib.xml resource create WWWMount ocf:heartbeat:Filesystem \
> device=/dev/drbd1 directory=/var/www fstype=gfs2 \
> options=_netdev,nodiratime,noatime \
> op monitor interval=20 timeout=40 notify interval=0s timeout=60 start \
> interval=0s timeout=120s stop interval=0s timeout=120s
> pcs -f tmp-cib.xml resource create WebServer ocf:heartbeat:apache \
> configfile=/etc/httpd/conf/httpd.conf \
> statusurl=http://localhost/server-status \
> op monitor interval=1min start interval=0s timeout=40s stop interval=0s \
> timeout=60s
> pcs -f tmp-cib.xml resource create SharedRootCrons ocf:heartbeat:symlink \
> link=/etc/cron.d/root-shared target=/var/www/crons/root-shared \
> op monitor interval=60 timeout=15 start interval=0s timeout=15 stop \
> interval=0s timeout=15
> pcs -f tmp-cib.xml resource create SharedUserCrons ocf:heartbeat:symlink \
> link=/etc/cron.d/User-shared target=/var/www/crons/User-shared \
> op monitor interval=60 timeout=15 start interval=0s timeout=15 stop \
> interval=0s timeout=15
> pcs -f tmp-cib.xml resource create PrimaryUserCrons ocf:heartbeat:symlink \
> link=/etc/cron.d/User-server1 target=/var/www/crons/User-server1 \
> op monitor interval=60 timeout=15 start interval=0s timeout=15 stop \
> interval=0s timeout=15 meta resource-stickiness=0
> pcs -f tmp-cib.xml \
> resource create SecondaryUserCrons ocf:heartbeat:symlink \
> link=/etc/cron.d/User-server2 target=/var/www/crons/User-server2 \
> op monitor interval=60 timeout=15 start interval=0s timeout=15 stop \
> interval=0s timeout=15 meta resource-stickiness=0
> pcs -f tmp-cib.xml \
> resource clone dlm clone-max=2 clone-node-max=1 interleave=true
> pcs -f tmp-cib.xml resource clone WWWMount interleave=true
> pcs -f tmp-cib.xml resource clone WebServer interleave=true
> pcs -f tmp-cib.xml resource clone SharedRootCrons interleave=true
> pcs -f tmp-cib.xml resource clone SharedUserCrons interleave=true
> pcs -f tmp-cib.xml \
> resource master DRBDClone DRBD master-node-max=1 clone-max=2 master-max=2 \
> interleave=true notify=true clone-node-max=1
> pcs -f tmp-cib.xml \
> constraint colocation add dlm-clone with DRBDClone \
> id=colocation-dlm-clone-DRBDClone-INFINITY
> pcs -f tmp-cib.xml constraint order promote DRBDClone \
> then dlm-clone id=order-DRBDClone-dlm-clone-mandatory
> pcs -f tmp-cib.xml \
> constraint colocation add WWWMount-clone with dlm-clone \
> id=colocation-WWWMount-clone-dlm-clone-INFINITY
> pcs -f tmp-cib.xml constraint order dlm-clone \
> then WWWMount-clone id=order-dlm-clone-WWWMount-clone-mandatory
> pcs -f tmp-cib.xml \
> constraint colocation add WebServer-clone with WWWMount-clone \
> id=colocation-WebServer-clone-WWWMount-clone-INFINITY
> pcs -f tmp-cib.xml constraint order WWWMount-clone \
> then WebServer-clone id=order-WWWMount-clone-WebServer-clone-mandatory
> pcs -f tmp-cib.xml \
> constraint colocation add SharedRootCrons-clone with WWWMount-clone \
> id=colocation-SharedRootCrons-clone-WWWMount-clone-INFINITY
> pcs -f tmp-cib.xml \
> constraint colocation add SharedUserCrons-clone with WWWMount-clone \
> id=colocation-SharedUserCrons-clone-WWWMount-clone-INFINITY
> pcs -f tmp-cib.xml constraint order WWWMount-clone \
> then SharedRootCrons-clone \
> id=order-WWWMount-clone-SharedRootCrons-clone-mandatory
> pcs -f tmp-cib.xml constraint order WWWMount-clone \
> then SharedUserCrons-clone \
> id=order-WWWMount-clone-SharedUserCrons-clone-mandatory
> pcs -f tmp-cib.xml \
> constraint location PrimaryUserCrons prefers node1.mydomain.com=500
> pcs -f tmp-cib.xml \
> constraint colocation add PrimaryUserCrons with WWWMount-clone \
> id=colocation-PrimaryUserCrons-WWWMount-clone-INFINITY
> pcs -f tmp-cib.xml constraint order WWWMount-clone \
> then PrimaryUserCrons \
> id=order-WWWMount-clone-PrimaryUserCrons-mandatory
> pcs -f tmp-cib.xml \
> constraint location SecondaryUserCrons prefers node2.mydomain.com=500
I can't answer your question, but one observation: it appears that only
the resources with explicit location preferences misbehave. As a
workaround, is it possible not to use them?
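Concretely, that workaround might look something like the sketch below.
The constraint ids are guesses based on pcs's default naming scheme and
may differ in the real CIB; the trade-off is that without the location
preferences the two cron resources will no longer automatically split
back across the nodes after a failback, since resource-stickiness keeps
them where they are.

```shell
# Hypothetical sketch of the suggested workaround: drop the explicit
# location preferences so the symlink resources are placed only by
# their colocation with the mounted filesystem.
# The ids below assume pcs's default "location-<resource>-<node>-<score>"
# naming; verify the actual ids first with:
#   pcs constraint --full
pcs constraint remove location-PrimaryUserCrons-node1.mydomain.com-500
pcs constraint remove location-SecondaryUserCrons-node2.mydomain.com-500
```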
> pcs -f tmp-cib.xml \
> constraint colocation add SecondaryUserCrons with WWWMount-clone \
> id=colocation-SecondaryUserCrons-WWWMount-clone-INFINITY
> pcs -f tmp-cib.xml constraint order WWWMount-clone \
> then SecondaryUserCrons \
> id=order-WWWMount-clone-SecondaryUserCrons-mandatory
> pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.xml.deltasrc
>
> When I standby node2, the SecondaryUserCrons bounces over to node1 as
> expected. When I unstandby node2, it bounces back to node2 immediately,
> before WWWMount is performed, and thus it fails. What am I missing? Here
> are the log messages from the unstandby operation:
>
> Sep 7 15:02:28 node2 crmd[58188]: notice: State transition S_IDLE -> S_POLICY_ENGINE
> Sep 7 15:02:28 node2 pengine[58187]: notice: * Start DRBD:1 ( node2.mydomain.com )
> Sep 7 15:02:28 node2 pengine[58187]: notice: * Start dlm:1 ( node2.mydomain.com ) due to unrunnable DRBD:1 promote (blocked)
> Sep 7 15:02:28 node2 pengine[58187]: notice: * Start WWWMount:1 ( node2.mydomain.com ) due to unrunnable dlm:1 start (blocked)
> Sep 7 15:02:28 node2 pengine[58187]: notice: * Start WebServer:1 ( node2.mydomain.com ) due to unrunnable WWWMount:1 start (blocked)
> Sep 7 15:02:28 node2 pengine[58187]: notice: * Start SharedRootCrons:1 ( node2.mydomain.com ) due to unrunnable WWWMount:1 start (blocked)
> Sep 7 15:02:28 node2 pengine[58187]: notice: * Start SharedUserCrons:1 ( node2.mydomain.com ) due to unrunnable WWWMount:1 start (blocked)
> Sep 7 15:02:28 node2 pengine[58187]: notice: * Move SecondaryUserCrons ( node1.mydomain.com -> node2.mydomain.com )
> Sep 7 15:02:28 node2 pengine[58187]: notice: Calculated transition 129, saving inputs in /var/lib/pacemaker/pengine/pe-input-2795.bz2
This file would be useful to have.
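For reference, a saved transition input like that can usually be
replayed offline with crm_simulate (shipped with Pacemaker); the path
below is taken from the log above, and exact option spellings can vary
between Pacemaker versions:

```shell
# Replay the saved policy-engine input and show the resulting
# transition. -S runs the scheduler over the input; -x loads a
# saved CIB/pe-input file (bzip2-compressed files are handled).
crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-2795.bz2
```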