[ClusterLabs] Non-cloned resource moves before cloned resource startup on unstandby

Tue Sep 11 05:59:35 UTC 2018

07.09.2018 23:07, Dan Ragle wrote:
> On an active-active two node cluster with DRBD, dlm, filesystem mounts,
> a Web Server, and some crons I can't figure out how to have the crons
> jump from node to node in the correct order. Specifically, I have two
> crontabs (managed via symlink creation/deletion) which normally will run
> one on node1 and the other on node2. When a node goes down, I want both
> to run on the remaining node until the original node comes back up, at
> which time they should split the nodes again. However, when returning to
> the original node the crontab that is being moved must wait until the
> underlying FS mount is done on the original node before jumping.
> 
> DRBD, dlm, the filesystem mounts and the Web Server are all working as
> expected; when I mark the second node as standby Apache stops, the FS
> unmounts, dlm stops, and DRBD stops on the node; and when I mark that
> same node unstandby the reverse happens as expected. All three of those
> are cloned resources.
> 
> The crontab resources are not cloned and create symlinks, one resource
> preferring the first node and the other preferring the second. Each is
> colocated and order dependent on the filesystem mounts (which in turn
> are colocated and dependent on dlm, which in turn is colocated and
> dependent on DRBD promotion). I thought this would be sufficient, but
> when the original node is marked unstandby the crontab that prefers to
> be on that node attempts to jump over immediately before the FS is
> mounted on that node. Of course the crontab link fails because the
> underlying filesystem hasn't been mounted yet.
> 
> pcs version is 0.9.162.
> 
> Here's the obfuscated detailed list of commands for the config. I'm
> still trying to set it up so it's not production-ready yet, but want to
> get this much sorted before I add too much more.
> 
> # pcs config export pcs-commands
> #!/usr/bin/sh
> # sequence generated on 2018-09-07 15:21:15 with: clufter 0.77.0
> # invoked as: ['/usr/sbin/pcs', 'config', 'export', 'pcs-commands']
> # targeting system: ('linux', 'centos', '7.5.1804', 'Core')
> # using interpreter: CPython 2.7.5
> pcs cluster auth node1.mydomain.com node2.mydomain.com <> /dev/tty
> pcs cluster setup --name MyCluster \
>   node1.mydomain.com node2.mydomain.com --transport udpu
> pcs cluster start --all --wait=60
> pcs cluster cib tmp-cib.xml
> cp tmp-cib.xml tmp-cib.xml.deltasrc
> pcs -f tmp-cib.xml property set stonith-enabled=false
> pcs -f tmp-cib.xml property set no-quorum-policy=freeze
> pcs -f tmp-cib.xml resource defaults resource-stickiness=100
> pcs -f tmp-cib.xml resource create DRBD ocf:linbit:drbd drbd_resource=r0 \
>   op demote interval=0s timeout=90 monitor interval=60s notify interval=0s \
>   timeout=90 promote interval=0s timeout=90 reload interval=0s timeout=30 \
>   start interval=0s timeout=240 stop interval=0s timeout=100
> pcs -f tmp-cib.xml resource create dlm ocf:pacemaker:controld \
>   allow_stonith_disabled=1 \
>   op monitor interval=60s start interval=0s timeout=90 stop interval=0s \
>   timeout=100
> pcs -f tmp-cib.xml resource create WWWMount ocf:heartbeat:Filesystem \
>   device=/dev/drbd1 directory=/var/www fstype=gfs2 \
>   options=_netdev,nodiratime,noatime \
>   op monitor interval=20 timeout=40 notify interval=0s timeout=60 start \
>   interval=0s timeout=120s stop interval=0s timeout=120s
> pcs -f tmp-cib.xml resource create WebServer ocf:heartbeat:apache \
>   configfile=/etc/httpd/conf/httpd.conf \
>   statusurl=http://localhost/server-status \
>   op monitor interval=1min start interval=0s timeout=40s stop interval=0s \
>   timeout=60s
> pcs -f tmp-cib.xml resource create SharedRootCrons ocf:heartbeat:symlink \
>   link=/etc/cron.d/root-shared target=/var/www/crons/root-shared \
>   op monitor interval=60 timeout=15 start interval=0s timeout=15 stop \
>   interval=0s timeout=15
> pcs -f tmp-cib.xml resource create SharedUserCrons ocf:heartbeat:symlink \
>   link=/etc/cron.d/User-shared target=/var/www/crons/User-shared \
>   op monitor interval=60 timeout=15 start interval=0s timeout=15 stop \
>   interval=0s timeout=15
> pcs -f tmp-cib.xml resource create PrimaryUserCrons ocf:heartbeat:symlink \
>   link=/etc/cron.d/User-server1 target=/var/www/crons/User-server1 \
>   op monitor interval=60 timeout=15 start interval=0s timeout=15 stop \
>   interval=0s timeout=15 meta resource-stickiness=0
> pcs -f tmp-cib.xml \
>   resource create SecondaryUserCrons ocf:heartbeat:symlink \
>   link=/etc/cron.d/User-server2 target=/var/www/crons/User-server2 \
>   op monitor interval=60 timeout=15 start interval=0s timeout=15 stop \
>   interval=0s timeout=15 meta resource-stickiness=0
> pcs -f tmp-cib.xml \
>   resource clone dlm clone-max=2 clone-node-max=1 interleave=true
> pcs -f tmp-cib.xml resource clone WWWMount interleave=true
> pcs -f tmp-cib.xml resource clone WebServer interleave=true
> pcs -f tmp-cib.xml resource clone SharedRootCrons interleave=true
> pcs -f tmp-cib.xml resource clone SharedUserCrons interleave=true
> pcs -f tmp-cib.xml \
>   resource master DRBDClone DRBD master-node-max=1 clone-max=2 \
>   master-max=2 interleave=true notify=true clone-node-max=1
> pcs -f tmp-cib.xml \
>   constraint colocation add dlm-clone with DRBDClone \
>   id=colocation-dlm-clone-DRBDClone-INFINITY
> pcs -f tmp-cib.xml constraint order promote DRBDClone \
>   then dlm-clone id=order-DRBDClone-dlm-clone-mandatory
> pcs -f tmp-cib.xml \
>   constraint colocation add WWWMount-clone with dlm-clone \
>   id=colocation-WWWMount-clone-dlm-clone-INFINITY
> pcs -f tmp-cib.xml constraint order dlm-clone \
>   then WWWMount-clone id=order-dlm-clone-WWWMount-clone-mandatory
> pcs -f tmp-cib.xml \
>   constraint colocation add WebServer-clone with WWWMount-clone \
>   id=colocation-WebServer-clone-WWWMount-clone-INFINITY
> pcs -f tmp-cib.xml constraint order WWWMount-clone \
>   then WebServer-clone id=order-WWWMount-clone-WebServer-clone-mandatory
> pcs -f tmp-cib.xml \
>   constraint colocation add SharedRootCrons-clone with WWWMount-clone \
>   id=colocation-SharedRootCrons-clone-WWWMount-clone-INFINITY
> pcs -f tmp-cib.xml \
>   constraint colocation add SharedUserCrons-clone with WWWMount-clone \
>   id=colocation-SharedUserCrons-clone-WWWMount-clone-INFINITY
> pcs -f tmp-cib.xml constraint order WWWMount-clone \
>   then SharedRootCrons-clone \
>   id=order-WWWMount-clone-SharedRootCrons-clone-mandatory
> pcs -f tmp-cib.xml constraint order WWWMount-clone \
>   then SharedUserCrons-clone \
>   id=order-WWWMount-clone-SharedUserCrons-clone-mandatory
> pcs -f tmp-cib.xml \
>   constraint location PrimaryUserCrons prefers node1.mydomain.com=500
> pcs -f tmp-cib.xml \
>   constraint colocation add PrimaryUserCrons with WWWMount-clone \
>   id=colocation-PrimaryUserCrons-WWWMount-clone-INFINITY
> pcs -f tmp-cib.xml constraint order WWWMount-clone \
>   then PrimaryUserCrons \
>   id=order-WWWMount-clone-PrimaryUserCrons-mandatory
> pcs -f tmp-cib.xml \
>   constraint location SecondaryUserCrons prefers node2.mydomain.com=500

I can't answer your question, but one observation: it appears that only
the resources with explicit location preferences misbehave. As a
workaround, is it possible to not use them?

> pcs -f tmp-cib.xml \
>   constraint colocation add SecondaryUserCrons with WWWMount-clone \
>   id=colocation-SecondaryUserCrons-WWWMount-clone-INFINITY
> pcs -f tmp-cib.xml constraint order WWWMount-clone \
>   then SecondaryUserCrons \
>   id=order-WWWMount-clone-SecondaryUserCrons-mandatory
> pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.xml.deltasrc
> 
> When I standby node2, the SecondaryUserCrons bounces over to node1 as
> expected. When I unstandby node2, it bounces back to node2 immediately,
> before WWWMount is performed, and thus it fails. What am I missing? Here
> are the log messages from the unstandby operation:
> 
> Sep  7 15:02:28 node2 crmd[58188]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      DRBD:1                 (                        node2.mydomain.com )
> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      dlm:1                  (                        node2.mydomain.com ) due to unrunnable DRBD:1 promote (blocked)
> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      WWWMount:1             (                        node2.mydomain.com ) due to unrunnable dlm:1 start (blocked)
> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      WebServer:1            (                        node2.mydomain.com ) due to unrunnable WWWMount:1 start (blocked)
> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      SharedRootCrons:1      (                        node2.mydomain.com ) due to unrunnable WWWMount:1 start (blocked)
> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      SharedUserCrons:1      (                        node2.mydomain.com ) due to unrunnable WWWMount:1 start (blocked)
> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Move       SecondaryUserCrons     ( node1.mydomain.com -> node2.mydomain.com )
> Sep  7 15:02:28 node2 pengine[58187]:   notice: Calculated transition 129, saving inputs in /var/lib/pacemaker/pengine/pe-input-2795.bz2

This file would be useful to have.
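
If it helps, the saved input can also be replayed locally with
crm_simulate to see which actions and scores the policy engine comes up
with. The following is only a rough sketch: the path is the one from the
log above, the PE_INPUT variable and the replay_cmd helper are names made
up for this example, and the script merely prints the command when
crm_simulate or the file is not available.

```shell
#!/bin/sh
# Sketch: replay a saved policy-engine input file.
# PE_INPUT is a hypothetical variable for this example; it defaults to the
# file named in the log above.
PE_INPUT="${PE_INPUT:-/var/lib/pacemaker/pengine/pe-input-2795.bz2}"

# replay_cmd (hypothetical helper) prints the command line to run.
#   -x FILE  read the cluster state from FILE
#   -S       simulate the transition and show the resulting actions
#   -s       show the allocation scores
replay_cmd() {
    printf 'crm_simulate -S -s -x %s\n' "$1"
}

if command -v crm_simulate >/dev/null 2>&1 && [ -r "$PE_INPUT" ]; then
    crm_simulate -S -s -x "$PE_INPUT"
else
    echo "would run: $(replay_cmd "$PE_INPUT")"
fi
```

The scores shown for SecondaryUserCrons on each node might indicate why
the move is scheduled in the same transition as the clone starts.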