[ClusterLabs] Non-cloned resource moves before cloned resource startup on unstandby

Daniel Ragle daniel at Biblestuph.com
Wed Sep 12 13:16:29 EDT 2018


Thanks for the comments. Replies within.

On 9/11/2018 1:52 PM, Ken Gaillot wrote:
> On Fri, 2018-09-07 at 16:07 -0400, Dan Ragle wrote:
>> On an active-active two node cluster with DRBD, dlm, filesystem
>> mounts, a Web Server, and some crons I can't figure out how to have
>> the crons jump from node to node in the correct order. Specifically,
>> I have two crontabs (managed via symlink creation/deletion)
>> which normally will run one on node1 and the other on node2. When a
>> node goes down, I want both to run on the remaining node until
>> the original node comes back up, at which time they should split the
>> nodes again. However, when returning to the original node the
>> crontab that is being moved must wait until the underlying FS mount
>> is done on the original node before jumping.
>>
>> DRBD, dlm, the filesystem mounts and the Web Server are all working
>> as expected; when I mark the second node as standby Apache
>> stops, the FS unmounts, dlm stops, and DRBD stops on the node; and
>> when I mark that same node unstandby the reverse happens as
>> expected. All three of those are cloned resources.
>>
>> The crontab resources are not cloned and create symlinks, one
>> resource preferring the first node and the other preferring the
>> second. Each is colocated and order dependent on the filesystem
>> mounts (which in turn are colocated and dependent on dlm, which in
>> turn is colocated and dependent on DRBD promotion). I thought this
>> would be sufficient, but when the original node is marked
>> unstandby the crontab that prefers to be on that node attempts to
>> jump over immediately before the FS is mounted on that node. Of
>> course the crontab link fails because the underlying filesystem
>> hasn't been mounted yet.
>>
>> pcs version is 0.9.162.
>>
>> Here's the obfuscated detailed list of commands for the config. I'm
>> still setting it up, so it's not production-ready yet, but I want to
>> get this much sorted before I add too much more.
>>
>> # pcs config export pcs-commands
>> #!/usr/bin/sh
>> # sequence generated on 2018-09-07 15:21:15 with: clufter 0.77.0
>> # invoked as: ['/usr/sbin/pcs', 'config', 'export', 'pcs-commands']
>> # targeting system: ('linux', 'centos', '7.5.1804', 'Core')
>> # using interpreter: CPython 2.7.5
>> pcs cluster auth node1.mydomain.com node2.mydomain.com <> /dev/tty
>> pcs cluster setup --name MyCluster \
>>     node1.mydomain.com node2.mydomain.com --transport udpu
>> pcs cluster start --all --wait=60
>> pcs cluster cib tmp-cib.xml
>> cp tmp-cib.xml tmp-cib.xml.deltasrc
>> pcs -f tmp-cib.xml property set stonith-enabled=false
>> pcs -f tmp-cib.xml property set no-quorum-policy=freeze
>> pcs -f tmp-cib.xml resource defaults resource-stickiness=100
> 
> Just a note, scores are all added together, and highest wins. For
> example, if resource-stickiness + location preference for the current
> node is greater than the colocation score with a resource on a
> different node, then the colocation will be ignored.

I don't think that's what's happening here, as the resource is moving 
*where* I want/expect it to, just not in the right order.

> 
>> pcs -f tmp-cib.xml resource create DRBD ocf:linbit:drbd \
>>     drbd_resource=r0 \
>>     op demote interval=0s timeout=90 monitor interval=60s \
>>     notify interval=0s timeout=90 promote interval=0s timeout=90 \
>>     reload interval=0s timeout=30 \
>>     start interval=0s timeout=240 stop interval=0s timeout=100
>> pcs -f tmp-cib.xml resource create dlm ocf:pacemaker:controld \
>>     allow_stonith_disabled=1 \
>>     op monitor interval=60s start interval=0s timeout=90 \
>>     stop interval=0s timeout=100
>> pcs -f tmp-cib.xml resource create WWWMount ocf:heartbeat:Filesystem \
>>     device=/dev/drbd1 directory=/var/www fstype=gfs2 \
>>     options=_netdev,nodiratime,noatime \
>>     op monitor interval=20 timeout=40 notify interval=0s timeout=60 \
>>     start interval=0s timeout=120s stop interval=0s timeout=120s
>> pcs -f tmp-cib.xml resource create WebServer ocf:heartbeat:apache \
>>     configfile=/etc/httpd/conf/httpd.conf \
>>     statusurl=http://localhost/server-status \
>>     op monitor interval=1min start interval=0s timeout=40s \
>>     stop interval=0s timeout=60s
>> pcs -f tmp-cib.xml resource create SharedRootCrons ocf:heartbeat:symlink \
>>     link=/etc/cron.d/root-shared target=/var/www/crons/root-shared \
>>     op monitor interval=60 timeout=15 start interval=0s timeout=15 \
>>     stop interval=0s timeout=15
> 
> Another note, I seem to remember some implementations of the cron
> daemon refuse to work from symlinks, and some require a restart when a
> cron is changed outside of the crontab command. That may or may not
> apply in your situation; the system or cron daemon logs should show
> whether the change took effect when the resource is started/stopped.
> 

Yup. We're actually already doing this much in production, just not 
with any kind of cluster-based management, and it's working well. 
Creating and deleting the symlinks works fine (crond picks up the 
change without a problem). When updating the underlying cron 
definitions you do, however, need to touch -h the symlink file itself; 
just updating the underlying file isn't enough for crond to notice. We 
issue that touch command as part of our roll-in tools whenever we 
update the underlying cron definitions.
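
For reference, the roll-in step is nothing more exotic than this (the 
path here is just an example):

    # refresh the symlink's own mtime so crond re-reads the crontab
    touch -h /etc/cron.d/User-shared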

> An alternative design for working around those issues is to have all
> the crons always active (on host storage) on both nodes, but the cron
> jobs check somehow whether they're on the active node or not and exit
> when not where they need to be.
> 
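
For the record, a minimal sketch of that approach, assuming "on the 
active node" can be tested as "the clustered filesystem is mounted 
here" (the wrapper path and job name below are made up, and the split 
Primary/Secondary crontabs would additionally need some kind of 
node-identity test):

    #!/bin/sh
    # /usr/local/bin/if-active: run the given command only on a node
    # where the shared GFS2 filesystem is currently mounted.
    if mountpoint -q /var/www; then
        exec "$@"
    fi
    exit 0    # not an active node for this job; exit quietly

Each crontab line would then call the wrapper, e.g.
"* * * * * root /usr/local/bin/if-active /usr/local/bin/nightly-job".
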
>> pcs -f tmp-cib.xml resource create SharedUserCrons ocf:heartbeat:symlink \
>>     link=/etc/cron.d/User-shared target=/var/www/crons/User-shared \
>>     op monitor interval=60 timeout=15 start interval=0s timeout=15 \
>>     stop interval=0s timeout=15
>> pcs -f tmp-cib.xml resource create PrimaryUserCrons ocf:heartbeat:symlink \
>>     link=/etc/cron.d/User-server1 target=/var/www/crons/User-server1 \
>>     op monitor interval=60 timeout=15 start interval=0s timeout=15 \
>>     stop interval=0s timeout=15 meta resource-stickiness=0
>> pcs -f tmp-cib.xml \
>>     resource create SecondaryUserCrons ocf:heartbeat:symlink \
>>     link=/etc/cron.d/User-server2 target=/var/www/crons/User-server2 \
>>     op monitor interval=60 timeout=15 start interval=0s timeout=15 \
>>     stop interval=0s timeout=15 meta resource-stickiness=0
>> pcs -f tmp-cib.xml \
>>     resource clone dlm clone-max=2 clone-node-max=1 interleave=true
>> pcs -f tmp-cib.xml resource clone WWWMount interleave=true
>> pcs -f tmp-cib.xml resource clone WebServer interleave=true
>> pcs -f tmp-cib.xml resource clone SharedRootCrons interleave=true
>> pcs -f tmp-cib.xml resource clone SharedUserCrons interleave=true
>> pcs -f tmp-cib.xml \
>>     resource master DRBDClone DRBD master-node-max=1 clone-max=2 \
>>     master-max=2 interleave=true notify=true clone-node-max=1
>> pcs -f tmp-cib.xml \
>>     constraint colocation add dlm-clone with DRBDClone \
>>     id=colocation-dlm-clone-DRBDClone-INFINITY
> 
> Even though your DRBD is multi-master, that doesn't mean it will
> *always* be in primary mode (e.g. it will start in secondary and then
> be promoted to primary, or the promotion may fail). I think you want to
> colocate DLM with the DRBD master role, so DLM (and further
> dependencies) don't run if DRBD is in secondary mode.
> 

I'll be buggered. That fixed it, though I don't understand why.

     pcs constraint remove colocation-dlm-clone-DRBDClone-INFINITY
     pcs constraint colocation add dlm-clone with DRBDClone \
         with-rsc-role=Master INFINITY
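
(If I'm reading the pcs man page right, the role can also be given as 
a keyword, which would create the same constraint:

    pcs constraint colocation add dlm-clone with master DRBDClone INFINITY

Either way the CIB ends up with with-rsc-role=Master on the 
rsc_colocation.)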

And all is well: the symlink resource is now behaving as expected. I 
would never have thought of that as part of the problem, because DRBD, 
dlm, and the FS mount were working as expected all along; i.e., dlm 
waited for the DRBD promotion (see the ordering constraint on the next 
command) before starting, and WWWMount waited for dlm to start.

With the above in place, I now see a new transition sequence: DRBD is 
started (by itself, the only operation in the first transition), and 
then the next transition promotes DRBD, starts dlm, starts WWWMount, 
and *then* moves the symlink.

>> pcs -f tmp-cib.xml constraint order promote DRBDClone \
>>     then dlm-clone id=order-DRBDClone-dlm-clone-mandatory
>> pcs -f tmp-cib.xml \
>>     constraint colocation add WWWMount-clone with dlm-clone \
>>     id=colocation-WWWMount-clone-dlm-clone-INFINITY
>> pcs -f tmp-cib.xml constraint order dlm-clone \
>>     then WWWMount-clone id=order-dlm-clone-WWWMount-clone-mandatory
>> pcs -f tmp-cib.xml \
>>     constraint colocation add WebServer-clone with WWWMount-clone \
>>     id=colocation-WebServer-clone-WWWMount-clone-INFINITY
>> pcs -f tmp-cib.xml constraint order WWWMount-clone \
>>     then WebServer-clone \
>>     id=order-WWWMount-clone-WebServer-clone-mandatory
> 
> Yet another side note: you can clone a group, so it might simplify
> slightly to clone a group of DLM + WWWMount + WebServer, then
> colocate/order the cloned group relative to DRBD master.

Cool. Yah, hadn't even gotten as far as groups yet.
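
For anyone following along later, I gather the grouped variant would 
look roughly like this (the group name is made up, and dlm, WWWMount, 
and WebServer would be created ungrouped first rather than cloned 
individually):

    pcs -f tmp-cib.xml resource group add WebStack dlm WWWMount WebServer
    pcs -f tmp-cib.xml resource clone WebStack interleave=true
    pcs -f tmp-cib.xml constraint colocation add WebStack-clone \
        with master DRBDClone INFINITY
    pcs -f tmp-cib.xml constraint order promote DRBDClone then WebStack-clone

One colocation and one ordering constraint instead of three of each.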

> 
>> pcs -f tmp-cib.xml \
>>     constraint colocation add SharedRootCrons-clone with WWWMount-clone \
>>     id=colocation-SharedRootCrons-clone-WWWMount-clone-INFINITY
>> pcs -f tmp-cib.xml \
>>     constraint colocation add SharedUserCrons-clone with WWWMount-clone \
>>     id=colocation-SharedUserCrons-clone-WWWMount-clone-INFINITY
>> pcs -f tmp-cib.xml constraint order WWWMount-clone \
>>     then SharedRootCrons-clone \
>>     id=order-WWWMount-clone-SharedRootCrons-clone-mandatory
>> pcs -f tmp-cib.xml constraint order WWWMount-clone \
>>     then SharedUserCrons-clone \
>>     id=order-WWWMount-clone-SharedUserCrons-clone-mandatory
>> pcs -f tmp-cib.xml \
>>     constraint location PrimaryUserCrons prefers node1.mydomain.com=500
> 
> This score is higher than stickiness, but I think that was intentional,
> so it would move back if a node is lost and recovered. Another way to
> do that would be to set resource-stickiness=0 on these resources to
> override the default, ideally with a small anti-colocation as you tried
> later.

Yup, just trying to make sure each crontab really likes its preferred 
node as long as that node is available and all the other dependencies 
are met.
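
(If we do switch to the anti-colocation style, I assume it would be a 
small negative score between the two crontab resources, presumably 
still combined with modest location preferences so each lands on its 
usual node:

    pcs -f tmp-cib.xml constraint colocation add PrimaryUserCrons \
        with SecondaryUserCrons -100

I haven't tried that yet.)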

> 
>> pcs -f tmp-cib.xml \
>>     constraint colocation add PrimaryUserCrons with WWWMount-clone \
>>     id=colocation-PrimaryUserCrons-WWWMount-clone-INFINITY
>> pcs -f tmp-cib.xml constraint order WWWMount-clone \
>>     then PrimaryUserCrons \
>>     id=order-WWWMount-clone-PrimaryUserCrons-mandatory
>> pcs -f tmp-cib.xml \
>>     constraint location SecondaryUserCrons prefers node2.mydomain.com=500
>> pcs -f tmp-cib.xml \
>>     constraint colocation add SecondaryUserCrons with WWWMount-clone \
>>     id=colocation-SecondaryUserCrons-WWWMount-clone-INFINITY
>> pcs -f tmp-cib.xml constraint order WWWMount-clone \
>>     then SecondaryUserCrons \
>>     id=order-WWWMount-clone-SecondaryUserCrons-mandatory
>> pcs cluster cib-push tmp-cib.xml diff-against=tmp-cib.xml.deltasrc
>>
>> When I standby node2, the SecondaryUserCrons bounces over to node1 as
>> expected. When I unstandby node2, it bounces back to node2
>> immediately, before WWWMount is performed, and thus it fails. What am
>> I missing? Here are the log messages from the unstandby operation:
>>
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
>> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      DRBD:1                 ( node2.mydomain.com )
>> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      dlm:1                  ( node2.mydomain.com )  due to unrunnable DRBD:1 promote (blocked)
>> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      WWWMount:1             ( node2.mydomain.com )  due to unrunnable dlm:1 start (blocked)
>> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      WebServer:1            ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
>> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      SharedRootCrons:1      ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
>> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      SharedUserCrons:1      ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
>> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Move       SecondaryUserCrons     ( node1.mydomain.com -> node2.mydomain.com )
>> Sep  7 15:02:28 node2 pengine[58187]:   notice: Calculated transition 129, saving inputs in /var/lib/pacemaker/pengine/pe-input-2795.bz2
> 
> Please open a bug at bugs.clusterlabs.org; this is definitely broken.
> 

Filed as https://bugs.clusterlabs.org/show_bug.cgi?id=5368 (I had done 
so before realizing the fix above). It might still be worth looking 
into, however; I'll leave it there and let you guys decide what you 
want to do with it. I'll add a note to the bug about the dlm -> DRBD 
master-role fix above.

Dan

>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating stop
>> operation SecondaryUserCrons_stop_0 on node1.mydomain.com
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating notify
>> operation DRBD_pre_notify_start_0 on node1.mydomain.com
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating start
>> operation SecondaryUserCrons_start_0 locally on node2.mydomain.com
>> Sep  7 15:02:28 node2 symlink(SecondaryUserCrons)[52196]: WARNING: /var/www/crons/User-server2 does not exist!
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating start operation DRBD_start_0 locally on node2.mydomain.com
>> Sep  7 15:02:28 node2 symlink(SecondaryUserCrons)[52196]: INFO: '/etc/cron.d/User-server2' -> '/var/www/crons/User-server2'
>> Sep  7 15:02:28 node2 symlink(SecondaryUserCrons)[52196]: ERROR: /etc/cron.d/User-server2 does not point to /var/www/crons/User-server2!
>> Sep  7 15:02:28 node2 lrmd[58185]:   notice: SecondaryUserCrons_start_0:52196:stderr [ ocf-exit-reason:/etc/cron.d/User-server2 does not point to /var/www/crons/User-server2! ]
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Result of start operation for SecondaryUserCrons on node2.mydomain.com: 5 (not installed)
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: node2.mydomain.com-SecondaryUserCrons_start_0:390 [ ocf-exit-reason:/etc/cron.d/User-server2 does not point to /var/www/crons/User-server2!\n ]
>> Sep  7 15:02:28 node2 crmd[58188]:  warning: Action 109
>> (SecondaryUserCrons_start_0) on node2.mydomain.com failed (target: 0
>> vs. rc:
>> 5): Error
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Transition aborted by
>> operation SecondaryUserCrons_start_0 'modify' on
>> node2.mydomain.com: Event failed
>> Sep  7 15:02:28 node2 crmd[58188]:  warning: Action 109
>> (SecondaryUserCrons_start_0) on node2.mydomain.com failed (target: 0
>> vs. rc:
>> 5): Error
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Transition aborted by
>> status-2-fail-count-SecondaryUserCrons.start_0 doing create
>> fail-count-SecondaryUserCrons#start_0=INFINITY: Transient attribute
>> change
>> Sep  7 15:02:28 node2 kernel: drbd r0: Starting worker thread (from
>> drbdsetup [52264])
>> Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: disk( Diskless ->
>> Attaching )
>> Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: Maximum number of peer
>> devices = 1
>> Sep  7 15:02:28 node2 kernel: drbd r0: Method to ensure write
>> ordering: drain
>> Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: drbd_bm_resize called
>> with capacity == 1048543928
>> Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: resync bitmap:
>> bits=131067991 words=2047938 pages=4000
>> Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: size = 500 GB
>> (524271964 KB)
>> Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: size = 500 GB
>> (524271964 KB)
>> Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: recounting of set bits
>> took additional 13ms
>> Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: disk( Attaching ->
>> Outdated )
>> Sep  7 15:02:28 node2 kernel: drbd r0/0 drbd1: attached to current
>> UUID: A2457506F4D44F1C
>> Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: disk( Diskless ->
>> Attaching )
>> Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: Maximum number of peer
>> devices = 1
>> Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: drbd_bm_resize called
>> with capacity == 2097016
>> Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: resync bitmap:
>> bits=262127 words=4096 pages=8
>> Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: size = 1024 MB
>> (1048508 KB)
>> Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: size = 1024 MB
>> (1048508 KB)
>> Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: recounting of set bits
>> took additional 0ms
>> Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: disk( Attaching ->
>> Outdated )
>> Sep  7 15:02:28 node2 kernel: drbd r0/1 drbd2: attached to current
>> UUID: 0EC5D56AEE53C6B6
>> Sep  7 15:02:28 node2 kernel: drbd r0 node1.mydomain.com: Starting
>> sender thread (from drbdsetup [52291])
>> Sep  7 15:02:28 node2 kernel: drbd r0 node1.mydomain.com: conn(
>> StandAlone -> Unconnected )
>> Sep  7 15:02:28 node2 kernel: drbd r0 node1.mydomain.com: Starting
>> receiver thread (from drbd_w_r0 [52265])
>> Sep  7 15:02:28 node2 kernel: drbd r0 node1.mydomain.com: conn(
>> Unconnected -> Connecting )
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Result of start
>> operation for DRBD on node2.mydomain.com: 0 (ok)
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating notify
>> operation DRBD_post_notify_start_0 on node1.mydomain.com
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating notify
>> operation DRBD_post_notify_start_0 locally on node2.mydomain.com
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Result of notify
>> operation for DRBD on node2.mydomain.com: 0 (ok)
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Transition 129
>> (Complete=29, Pending=0, Fired=0, Skipped=1, Incomplete=7,
>> Source=/var/lib/pacemaker/pengine/pe-input-2795.bz2): Stopped
>> Sep  7 15:02:28 node2 pengine[58187]:  warning: Processing failed op
>> start for SecondaryUserCrons on node2.mydomain.com: not
>> installed (5)
>> Sep  7 15:02:28 node2 pengine[58187]:   notice: Preventing
>> SecondaryUserCrons from re-starting on node2.mydomain.com: operation
>> start failed 'not installed' (5)
>> Sep  7 15:02:28 node2 pengine[58187]:  warning: Processing failed op
>> start for SecondaryUserCrons on node2.mydomain.com: not
>> installed (5)
>> Sep  7 15:02:28 node2 pengine[58187]:   notice: Preventing
>> SecondaryUserCrons from re-starting on node2.mydomain.com: operation
>> start failed 'not installed' (5)
>> Sep  7 15:02:28 node2 pengine[58187]:  warning: Forcing SecondaryUserCrons away from node2.mydomain.com after 1000000 failures (max=1000000)
>> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      dlm:1                  ( node2.mydomain.com )  due to unrunnable DRBD:1 promote (blocked)
>> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      WWWMount:1             ( node2.mydomain.com )  due to unrunnable dlm:1 start (blocked)
>> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      WebServer:1            ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
>> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      SharedRootCrons:1      ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
>> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Start      SharedUserCrons:1      ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
>> Sep  7 15:02:28 node2 pengine[58187]:   notice:  * Recover    SecondaryUserCrons     ( node2.mydomain.com -> node1.mydomain.com )
>> Sep  7 15:02:28 node2 pengine[58187]:   notice: Calculated transition 130, saving inputs in /var/lib/pacemaker/pengine/pe-input-2796.bz2
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating monitor
>> operation DRBD_monitor_60000 locally on node2.mydomain.com
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Initiating stop
>> operation SecondaryUserCrons_stop_0 locally on node2.mydomain.com
>> Sep  7 15:02:28 node2 symlink(SecondaryUserCrons)[52329]: WARNING: /var/www/crons/User-server2 does not exist!
>> Sep  7 15:02:28 node2 symlink(SecondaryUserCrons)[52329]: ERROR: /etc/cron.d/User-server2 does not point to /var/www/crons/User-server2!
>> Sep  7 15:02:28 node2 lrmd[58185]:   notice: SecondaryUserCrons_stop_0:52329:stderr [ ocf-exit-reason:/etc/cron.d/User-server2 does not point to /var/www/crons/User-server2! ]
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Result of stop operation for SecondaryUserCrons on node2.mydomain.com: 5 (not installed)
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: node2.mydomain.com-SecondaryUserCrons_stop_0:394 [ ocf-exit-reason:/etc/cron.d/User-server2 does not point to /var/www/crons/User-server2!\n ]
>> Sep  7 15:02:28 node2 crmd[58188]:  warning: Action 10
>> (SecondaryUserCrons_stop_0) on node2.mydomain.com failed (target: 0
>> vs. rc:
>> 5): Error
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Transition aborted by
>> operation SecondaryUserCrons_stop_0 'modify' on
>> node2.mydomain.com: Event failed
>> Sep  7 15:02:28 node2 crmd[58188]:  warning: Action 10
>> (SecondaryUserCrons_stop_0) on node2.mydomain.com failed (target: 0
>> vs. rc:
>> 5): Error
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Transition aborted by
>> status-2-fail-count-SecondaryUserCrons.stop_0 doing create
>> fail-count-SecondaryUserCrons#stop_0=INFINITY: Transient attribute
>> change
>> Sep  7 15:02:28 node2 crmd[58188]:   notice: Transition 130
>> (Complete=18, Pending=0, Fired=0, Skipped=0, Incomplete=8,
>> Source=/var/lib/pacemaker/pengine/pe-input-2796.bz2): Complete
>> Sep  7 15:02:29 node2 pengine[58187]:    error: No further recovery
>> can be attempted for SecondaryUserCrons: stop action failed with
>> 'not installed' (5)
>> Sep  7 15:02:29 node2 pengine[58187]:  warning: Processing failed op
>> stop for SecondaryUserCrons on node2.mydomain.com: not
>> installed (5)
>> Sep  7 15:02:29 node2 pengine[58187]:   notice: Preventing
>> SecondaryUserCrons from re-starting on node2.mydomain.com: operation
>> stop
>> failed 'not installed' (5)
>> Sep  7 15:02:29 node2 pengine[58187]:    error: No further recovery
>> can be attempted for SecondaryUserCrons: stop action failed with
>> 'not installed' (5)
>> Sep  7 15:02:29 node2 pengine[58187]:  warning: Processing failed op
>> stop for SecondaryUserCrons on node2.mydomain.com: not
>> installed (5)
>> Sep  7 15:02:29 node2 pengine[58187]:   notice: Preventing
>> SecondaryUserCrons from re-starting on node2.mydomain.com: operation
>> stop
>> failed 'not installed' (5)
>> Sep  7 15:02:29 node2 pengine[58187]:  warning: Forcing SecondaryUserCrons away from node2.mydomain.com after 1000000 failures (max=1000000)
>> Sep  7 15:02:29 node2 pengine[58187]:   notice:  * Start      dlm:1                  ( node2.mydomain.com )  due to unrunnable DRBD:1 promote (blocked)
>> Sep  7 15:02:29 node2 pengine[58187]:   notice:  * Start      WWWMount:1             ( node2.mydomain.com )  due to unrunnable dlm:1 start (blocked)
>> Sep  7 15:02:29 node2 pengine[58187]:   notice:  * Start      WebServer:1            ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
>> Sep  7 15:02:29 node2 pengine[58187]:   notice:  * Start      SharedRootCrons:1      ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
>> Sep  7 15:02:29 node2 pengine[58187]:   notice:  * Start      SharedUserCrons:1      ( node2.mydomain.com )  due to unrunnable WWWMount:1 start (blocked)
>> Sep  7 15:02:29 node2 pengine[58187]:    error: Calculated transition 131 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-26.bz2
>> Sep  7 15:02:29 node2 crmd[58188]:  warning: Transition 131
>> (Complete=16, Pending=0, Fired=0, Skipped=0, Incomplete=5,
>> Source=/var/lib/pacemaker/pengine/pe-error-26.bz2): Terminated
>> Sep  7 15:02:29 node2 crmd[58188]:  warning: Transition failed:
>> terminated
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: Graph 131 with 21
>> actions: batch-limit=0 jobs, network-delay=60000ms
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   47]: Completed
>> pseudo op dlm-clone_running_0            on N/A (priority:
>> 1000000, waiting: none)
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   46]: Completed
>> pseudo op dlm-clone_start_0              on N/A (priority: 0,
>> waiting: none)
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   55]: Completed
>> pseudo op WWWMount-clone_running_0       on N/A (priority:
>> 1000000, waiting: none)
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   54]: Completed
>> pseudo op WWWMount-clone_start_0         on N/A (priority: 0,
>> waiting: none)
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   69]: Pending
>> rsc op WebServer_monitor_60000             on node2.mydomain.com
>> (priority: 0, waiting: none)
>> Sep  7 15:02:29 node2 crmd[58188]:   notice:  * [Input 68]:
>> Unresolved dependency rsc op WebServer_start_0 on node2.mydomain.com
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   71]: Completed
>> pseudo op WebServer-clone_running_0        on N/A (priority:
>> 1000000, waiting: none)
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   70]: Completed
>> pseudo op WebServer-clone_start_0          on N/A (priority:
>> 0, waiting: none)
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   93]: Pending
>> rsc op SharedRootCrons_monitor_60000       on node2.mydomain.com
>> (priority: 0, waiting: none)
>> Sep  7 15:02:29 node2 crmd[58188]:   notice:  * [Input 92]:
>> Unresolved dependency rsc op SharedRootCrons_start_0 on
>> node2.mydomain.com
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   95]: Completed
>> pseudo op SharedRootCrons-clone_running_0 on N/A (priority:
>> 1000000, waiting: none)
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action   94]: Completed
>> pseudo op SharedRootCrons-clone_start_0  on N/A (priority: 0,
>> waiting: none)
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action  101]: Pending
>> rsc op SharedUserCrons_monitor_60000   on node2.mydomain.com
>> (priority: 0, waiting: none)
>> Sep  7 15:02:29 node2 crmd[58188]:   notice:  * [Input 100]:
>> Unresolved dependency rsc op SharedUserCrons_start_0 on
>> node2.mydomain.com
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action  103]: Completed
>> pseudo op SharedUserCrons-clone_running_0 on N/A (priority:
>> 1000000, waiting: none)
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: [Action  102]: Completed
>> pseudo op SharedUserCrons-clone_start_0 on N/A (priority: 0,
>> waiting: none)
>> Sep  7 15:02:29 node2 crmd[58188]:   notice: State transition
>> S_TRANSITION_ENGINE -> S_IDLE
>> Sep  7 15:02:29 node2 kernel: drbd r0 node1.mydomain.com: Handshake
>> to peer 0 successful: Agreed network protocol version 113
>> Sep  7 15:02:29 node2 kernel: drbd r0 node1.mydomain.com: Feature
>> flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME
>> WRITE_ZEROES.
>> Sep  7 15:02:29 node2 kernel: drbd r0 node1.mydomain.com: Starting
>> ack_recv thread (from drbd_r_r0 [52295])
>> Sep  7 15:02:29 node2 kernel: drbd r0 node1.mydomain.com: Preparing
>> remote state change 2019156377
>> Sep  7 15:02:29 node2 kernel: drbd r0 node1.mydomain.com: Committing
>> remote state change 2019156377 (primary_nodes=1)
>> Sep  7 15:02:29 node2 kernel: drbd r0 node1.mydomain.com: conn(
>> Connecting -> Connected ) peer( Unknown -> Primary )
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> drbd_sync_handshake:
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> self
>> A2457506F4D44F1C:0000000000000000:B13E5D392CF268C4:FE2F70857D64FB02
>> bits:0 flags:20
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> peer
>> D355B0F942665879:A2457506F4D44F1D:B13E5D392CF268C4:E56E164C51EEFAB0
>> bits:6 flags:120
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> uuid_compare()=-2 by rule 50
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> pdsk( DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
>> Sep  7 15:02:29 node2 kernel: drbd r0/1 drbd2 node1.mydomain.com:
>> drbd_sync_handshake:
>> Sep  7 15:02:29 node2 kernel: drbd r0/1 drbd2 node1.mydomain.com:
>> self
>> 0EC5D56AEE53C6B6:0000000000000000:0000000000000000:0000000000000000
>> bits:0 flags:20
>> Sep  7 15:02:29 node2 kernel: drbd r0/1 drbd2 node1.mydomain.com:
>> peer
>> 0EC5D56AEE53C6B6:0000000000000000:B62926494645765C:0000000000000000
>> bits:0 flags:120
>> Sep  7 15:02:29 node2 kernel: drbd r0/1 drbd2 node1.mydomain.com:
>> uuid_compare()=0 by rule 38
>> Sep  7 15:02:29 node2 kernel: drbd r0/1 drbd2: disk( Outdated ->
>> UpToDate )
>> Sep  7 15:02:29 node2 kernel: drbd r0/1 drbd2 node1.mydomain.com:
>> pdsk( DUnknown -> UpToDate ) repl( Off -> Established )
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 27(1),
>> total 27; compression: 100.0%
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> send bitmap stats [Bytes(packets)]: plain 0(0), RLE 27(1), total
>> 27; compression: 100.0%
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> helper command: /sbin/drbdadm before-resync-target
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> helper command: /sbin/drbdadm before-resync-target exit code 0 (0x0)
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1: disk( Outdated ->
>> Inconsistent )
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> repl( WFBitMapT -> SyncTarget )
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> Began resync as SyncTarget (will sync 24 KB [6 bits set]).
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> Resync done (total 1 sec; paused 0 sec; 24 K/sec)
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> updated UUIDs
>> D355B0F942665878:0000000000000000:A2457506F4D44F1C:E2BDB50A1BFBAE5E
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1: disk( Inconsistent ->
>> UpToDate )
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> repl( SyncTarget -> Established )
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> helper command: /sbin/drbdadm after-resync-target
>> Sep  7 15:02:29 node2 kernel: drbd r0/0 drbd1 node1.mydomain.com:
>> helper command: /sbin/drbdadm after-resync-target exit code 0 (0x0)
>> Sep  7 15:03:29 node2 crmd[58188]:   notice: State transition S_IDLE
>> -> S_POLICY_ENGINE
>> Sep  7 15:03:29 node2 pengine[58187]:    error: No further recovery
>> can be attempted for SecondaryUserCrons: stop action failed with
>> 'not installed' (5)
>> Sep  7 15:03:29 node2 pengine[58187]:  warning: Processing failed op
>> stop for SecondaryUserCrons on node2.mydomain.com: not
>> installed (5)
>> Sep  7 15:03:29 node2 pengine[58187]:   notice: Preventing
>> SecondaryUserCrons from re-starting on node2.mydomain.com: operation
>> stop
>> failed 'not installed' (5)
>> Sep  7 15:03:29 node2 pengine[58187]:    error: No further recovery
>> can be attempted for SecondaryUserCrons: stop action failed with
>> 'not installed' (5)
>> Sep  7 15:03:29 node2 pengine[58187]:  warning: Processing failed op
>> stop for SecondaryUserCrons on node2.mydomain.com: not
>> installed (5)
>> Sep  7 15:03:29 node2 pengine[58187]:   notice: Preventing
>> SecondaryUserCrons from re-starting on node2.mydomain.com: operation
>> stop
>> failed 'not installed' (5)
>> Sep  7 15:03:29 node2 pengine[58187]:  warning: Forcing SecondaryUserCrons away from node2.mydomain.com after 1000000 failures (max=1000000)
>> Sep  7 15:03:29 node2 pengine[58187]:   notice:  * Promote    DRBD:1                 ( Slave -> Master node2.mydomain.com )
>> Sep  7 15:03:29 node2 pengine[58187]:   notice:  * Start      dlm:1                  ( node2.mydomain.com )
>> Sep  7 15:03:29 node2 pengine[58187]:   notice:  * Start      WWWMount:1             ( node2.mydomain.com )
>> Sep  7 15:03:29 node2 pengine[58187]:   notice:  * Start      WebServer:1            ( node2.mydomain.com )
>> Sep  7 15:03:29 node2 pengine[58187]:   notice:  * Start      SharedRootCrons:1      ( node2.mydomain.com )
>> Sep  7 15:03:29 node2 pengine[58187]:   notice:  * Start      SharedUserCrons:1      ( node2.mydomain.com )
>> Sep  7 15:03:29 node2 pengine[58187]:    error: Calculated transition 132 (with errors), saving inputs in /var/lib/pacemaker/pengine/pe-error-27.bz2
>> Sep  7 15:03:29 node2 crmd[58188]:   notice: Initiating cancel
>> operation DRBD_monitor_60000 locally on node2.mydomain.com
>> Sep  7 15:03:29 node2 crmd[58188]:   notice: Initiating notify
>> operation DRBD_pre_notify_promote_0 on node1.mydomain.com
>> Sep  7 15:03:29 node2 crmd[58188]:   notice: Initiating notify
>> operation DRBD_pre_notify_promote_0 locally on node2.mydomain.com
>> Sep  7 15:03:29 node2 crmd[58188]:   notice: Result of notify
>> operation for DRBD on node2.mydomain.com: 0 (ok)
>> Sep  7 15:03:29 node2 crmd[58188]:   notice: Initiating promote
>> operation DRBD_promote_0 locally on node2.mydomain.com
>> Sep  7 15:03:29 node2 kernel: drbd r0: Preparing cluster-wide state
>> change 360863446 (1->-1 3/1)
>> Sep  7 15:03:29 node2 kernel: drbd r0: State change 360863446:
>> primary_nodes=3, weak_nodes=FFFFFFFFFFFFFFFC
>> Sep  7 15:03:29 node2 kernel: drbd r0: Committing cluster-wide state
>> change 360863446 (0ms)
>> Sep  7 15:03:29 node2 kernel: drbd r0: role( Secondary -> Primary )
>> Sep  7 15:03:29 node2 crmd[58188]:   notice: Result of promote
>> operation for DRBD on node2.mydomain.com: 0 (ok)
>> Sep  7 15:03:29 node2 crmd[58188]:   notice: Initiating notify
>> operation DRBD_post_notify_promote_0 on node1.mydomain.com
>> Sep  7 15:03:29 node2 crmd[58188]:   notice: Initiating notify
>> operation DRBD_post_notify_promote_0 locally on node2.mydomain.com
>> Sep  7 15:03:29 node2 crmd[58188]:   notice: Result of notify
>> operation for DRBD on node2.mydomain.com: 0 (ok)
>> Sep  7 15:03:29 node2 crmd[58188]:   notice: Initiating start
>> operation dlm_start_0 locally on node2.mydomain.com
>> Sep  7 15:03:29 node2 dlm_controld[53127]: 693403 dlm_controld 4.0.7
>> started
>> Sep  7 15:03:30 node2 crmd[58188]:   notice: Result of start
>> operation for dlm on node2.mydomain.com: 0 (ok)
>> Sep  7 15:03:30 node2 crmd[58188]:   notice: Initiating monitor
>> operation dlm_monitor_60000 locally on node2.mydomain.com
>> Sep  7 15:03:30 node2 crmd[58188]:   notice: Initiating start
>> operation WWWMount_start_0 locally on node2.mydomain.com
>> Sep  7 15:03:30 node2 Filesystem(WWWMount)[53154]: INFO: Running
>> start for /dev/drbd1 on /var/www
>> Sep  7 15:03:30 node2 kernel: dlm: Using TCP for communications
>> Sep  7 15:03:30 node2 kernel: GFS2: fsid=MyCluster:www: Trying to
>> join cluster "lock_dlm", "MyCluster:www"
>> Sep  7 15:03:30 node2 kernel: dlm: connecting to 1
>> Sep  7 15:03:30 node2 kernel: dlm: got connection from 1
>> Sep  7 15:03:31 node2 kernel: GFS2: fsid=MyCluster:www: Joined
>> cluster. Now mounting FS...
>> Sep  7 15:03:31 node2 kernel: GFS2: fsid=MyCluster:www.1: jid=1,
>> already locked for use
>> Sep  7 15:03:31 node2 kernel: GFS2: fsid=MyCluster:www.1: jid=1:
>> Looking at journal...
>> Sep  7 15:03:31 node2 kernel: GFS2: fsid=MyCluster:www.1: jid=1: Done
>> Sep  7 15:03:31 node2 crmd[58188]:   notice: Result of start
>> operation for WWWMount on node2.mydomain.com: 0 (ok)
>> Sep  7 15:03:31 node2 crmd[58188]:   notice: Initiating monitor
>> operation WWWMount_monitor_20000 locally on node2.mydomain.com
>> Sep  7 15:03:31 node2 crmd[58188]:   notice: Initiating start
>> operation WebServer_start_0 locally on node2.mydomain.com
>> Sep  7 15:03:31 node2 crmd[58188]:   notice: Initiating start
>> operation SharedRootCrons_start_0 locally on node2.mydomain.com
>> Sep  7 15:03:31 node2 crmd[58188]:   notice: Initiating start
>> operation SharedUserCrons_start_0 locally on node2.mydomain.com
>> Sep  7 15:03:31 node2 symlink(SharedRootCrons)[53328]: INFO:
>> '/etc/cron.d/root-shared' -> '/var/www/crons/root-shared'
>> Sep  7 15:03:31 node2 symlink(SharedUserCrons)[53329]: INFO:
>> '/etc/cron.d/User-shared' -> '/var/www/crons/User-shared'
>> Sep  7 15:03:31 node2 crmd[58188]:   notice: Result of start
>> operation for SharedRootCrons on node2.mydomain.com: 0 (ok)
>> Sep  7 15:03:31 node2 crmd[58188]:   notice: Result of start
>> operation for SharedUserCrons on node2.mydomain.com: 0 (ok)
>> Sep  7 15:03:31 node2 crmd[58188]:   notice: Initiating monitor
>> operation SharedRootCrons_monitor_60000 locally on node2.mydomain.com
>> Sep  7 15:03:31 node2 crmd[58188]:   notice: Initiating monitor
>> operation SharedUserCrons_monitor_60000 locally on node2.mydomain.com
>> Sep  7 15:03:31 node2 apache(WebServer)[53325]: INFO: apache not
>> running
>> Sep  7 15:03:31 node2 apache(WebServer)[53325]: INFO: waiting for
>> apache /etc/httpd/conf/httpd.conf to come up
>> Sep  7 15:03:32 node2 crmd[58188]:   notice: Result of start
>> operation for WebServer on node2.mydomain.com: 0 (ok)
>> Sep  7 15:03:32 node2 crmd[58188]:   notice: Initiating monitor
>> operation WebServer_monitor_60000 locally on node2.mydomain.com
>> Sep  7 15:03:33 node2 crmd[58188]:   notice: Transition 132
>> (Complete=44, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-error-27.bz2): Complete
>> Sep  7 15:03:33 node2 crmd[58188]:   notice: State transition
>> S_TRANSITION_ENGINE -> S_IDLE