[Pacemaker] Decreasing failover time when running DRBD+OCFS2+XEN in dual primary mode

kamal kishi kamal.kishi at gmail.com
Fri Jun 13 03:35:29 EDT 2014


Hi Andrew,
I checked the logs, and it looks like OCFS2 is what is taking time to recover.
Can anyone please look at my log and confirm whether I'm right?

And if OCFS2 is the reason for the delay in failover, is there a way to
reduce that delay?
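
For reference, the o2net/o2dlm messages in my syslog suggest most of the gap
is OCFS2 waiting out its 30-second idle timeout before declaring the peer dead
and starting journal recovery. With the stock o2cb stack (as used here), the
relevant timers live in /etc/default/o2cb on Ubuntu (/etc/sysconfig/o2cb on
other distributions). A sketch only -- the values shown are the usual defaults,
for illustration, not tested recommendations:

    # /etc/default/o2cb (fragment); restart the o2cb service after editing
    O2CB_ENABLED=true
    # disk heartbeat iterations (roughly 2s each) before a node is declared dead
    O2CB_HEARTBEAT_THRESHOLD=31
    # ms of network inactivity before the o2net connection is considered dead
    O2CB_IDLE_TIMEOUT_MS=30000
    # ms between keepalive packets and between reconnect attempts
    O2CB_KEEPALIVE_DELAY_MS=2000
    O2CB_RECONNECT_DELAY_MS=2000

Lowering O2CB_IDLE_TIMEOUT_MS (together with the heartbeat threshold) is what
would typically shrink the roughly 30-second waits visible in the syslog, at
the cost of being more sensitive to short network blips.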

Attached are my syslog and Pacemaker configuration.

Looking forward to a solution.


On Fri, Jun 13, 2014 at 8:55 AM, kamal kishi <kamal.kishi at gmail.com> wrote:

> Fine Andrew, I will check it out, but do the timeouts configured for
> Pacemaker affect this?
> Which part of the timing configuration does Pacemaker consider when
> deciding that the other node is actually down and that its resources
> should be taken over?
>
> And Alexis, I'm not facing any issue when putting a node into standby mode.
> I'm using DRBD 8.3.11 (apt-get install drbd8-utils=2:8.3.11-0ubuntu1).
> I had to force installation of that particular version, as the current
> package is not compatible with Pacemaker.
> You could also try installing 8.3.11 and check; all the best.
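
(On the membership question above: the "is the other node down?" decision is
made by the Corosync totem layer, not by the per-resource timeouts; Pacemaker
only reacts once Corosync reports the membership change. A minimal sketch of
the timers involved, assuming Corosync 1.x as in these logs, with illustrative
values only:

    # /etc/corosync/corosync.conf (fragment)
    totem {
            version: 2
            # ms to wait for the token before starting to form a new membership
            token: 3000
            # token retransmits allowed before a node is declared lost
            token_retransmits_before_loss_const: 10
            # ms to wait for consensus before starting a new membership round
            consensus: 3600
    }

In the syslog below this part is already fast -- about five seconds from the
link dropping to "quorum lost" -- so most of the 90 to 110 seconds is spent
after the membership change, in OCFS2/DLM recovery and the Xen start.)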
>
>
> On Fri, Jun 13, 2014 at 5:22 AM, Andrew Beekhof <andrew at beekhof.net>
> wrote:
>
>>
>> On 12 Jun 2014, at 9:15 pm, kamal kishi <kamal.kishi at gmail.com> wrote:
>>
>> > Hi All,
>> >
>> > This might be a basic question, but I'm not sure what is taking so long
>> > during failover switching.
>> > Hope someone can figure it out.
>>
>> How about looking in the logs and seeing when the various stop/start
>> actions occur and which ones take the longest?
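
(One quick way to do that against the syslog attached below, for example --
the command is only a sketch, adjust the log path to your setup:

    # when did the crmd initiate each action, and when did the lrmd confirm it?
    grep -E 'te_rsc_command: Initiating|process_lrm_event: LRM operation' \
        /var/log/syslog

Comparing those timestamps per resource shows which stop/start took longest;
in the log below, resXen1_start_0 is initiated at 17:48:57 but only completes
at 17:50:16.)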
>>
>> >
>> > Scenario -
>> > Pacemaker running DRBD (dual-primary mode) + OCFS2 + Xen for a virtual
>> > Windows machine
>> >
>> > Pacemaker startup order -
>> > DRBD -> OCFS2 -> Xen
>> > Suppose that on Server1, DRBD, OCFS2 (clone) and Xen are started,
>> >
>> > and on Server2, DRBD and OCFS2 (clone) are started.
>> >
>> > Now, if Server1's power goes off,
>> >
>> > the Xen resource that was running on Server1 should be failed over
>> > to Server2.
>> >
>> > In my case, it's taking almost 90 to 110 seconds to do this.
>> >
>> > Can anyone suggest ways to reduce this to within 30 to 40 seconds?
>> >
>> > My pacemaker configuration is -
>> > crm configure
>> > property no-quorum-policy=ignore
>> > property stonith-enabled=false
>> > property default-resource-stickiness=1000
>> >
>> > primitive resDRBDr1 ocf:linbit:drbd \
>> > params drbd_resource="r0" \
>> > op start interval="0" timeout="240s" \
>> > op stop interval="0" timeout="100s" \
>> > op monitor interval="20s" role="Master" timeout="240s" \
>> > op monitor interval="30s" role="Slave" timeout="240s" \
>> > meta migration-threshold="3" failure-timeout="60s"
>> > primitive resOCFS2r1 ocf:heartbeat:Filesystem \
>> > params device="/dev/drbd/by-res/r0" directory="/cluster" fstype="ocfs2"
>> \
>> > op monitor interval="10s" timeout="60s" \
>> > op start interval="0" timeout="90s" \
>> > op stop interval="0" timeout="60s" \
>> > meta migration-threshold="3" failure-timeout="60s"
>> > primitive resXen1 ocf:heartbeat:Xen \
>> > params xmfile="/home/cluster/xen/win7.cfg" name="xenwin7" \
>> > op monitor interval="20s" timeout="60s" \
>> > op start interval="0" timeout="90s" \
>> > op stop interval="0" timeout="60s" \
>> > op migrate_from interval="0" timeout="120s" \
>> > op migrate_to interval="0" timeout="120s" \
>> > meta allow-migrate="true" target-role="started"
>> >
>> > ms msDRBDr1 resDRBDr1 \
>> > meta notify="true" master-max="2" interleave="true"
>> target-role="Started"
>> > clone cloOCFS2r1 resOCFS2r1 \
>> > meta interleave="true" ordered="true" target-role="Started"
>> >
>> > colocation colOCFS12-with-DRBDrMaster inf: cloOCFS2r1 msDRBDr1:Master
>> > colocation colXen-with-OCFSr1 inf: resXen1 cloOCFS2r1
>> > order ordDRBD-before-OCFSr1 inf: msDRBDr1:promote cloOCFS2r1:start
>> > order ordOCFS2r1-before-Xen1 inf: cloOCFS2r1:start resXen1:start
>> >
>> > commit
>> > bye
>> >
>> > --
>> > Regards,
>> > Kamal Kishore B V
>> >
>>
>
>
> --
> Regards,
> Kamal Kishore B V
>



-- 
Regards,
Kamal Kishore B V
-------------- next part --------------
crm configure
property no-quorum-policy=ignore
property stonith-enabled=false
property default-resource-stickiness=1000
commit
bye

crm configure
primitive resDRBDr1 ocf:linbit:drbd \
params drbd_resource="r0" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="100s" \
op monitor interval="20s" role="Master" timeout="240s" \
op monitor interval="30s" role="Slave" timeout="240s" \
meta migration-threshold="3" failure-timeout="58s"
primitive resOCFS2r1 ocf:heartbeat:Filesystem \
params device="/dev/drbd/by-res/r0" directory="/cluster" fstype="ocfs2" \
op monitor interval="10s" timeout="60s" \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="60s" \
meta migration-threshold="3" failure-timeout="60s"
primitive resXen1 ocf:heartbeat:Xen \
params xmfile="/home/cluster/xen/win7.cfg" name="xenwin7" \
op monitor interval="20s" timeout="62s" \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="60s" \
op migrate_from interval="0" timeout="90s" \
op migrate_to interval="0" timeout="90s" \
meta allow-migrate="true" target-role="started"

ms msDRBDr1 resDRBDr1 \
meta notify="true" master-max="2" interleave="true" target-role="Started"
clone cloOCFS2r1 resOCFS2r1 \
meta interleave="true" ordered="true" target-role="Started"

colocation colOCFS12-with-DRBDrMaster inf: cloOCFS2r1 msDRBDr1:Master
colocation colXen-with-OCFSr1 inf: resXen1 cloOCFS2r1
order ordDRBD-before-OCFSr1 inf: msDRBDr1:promote cloOCFS2r1:start
order ordOCFS2r1-before-Xen1 inf: cloOCFS2r1:start resXen1:start

commit
bye
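
For context on the crm-fence-peer.sh lines in the syslog that follows: they
come from DRBD's resource-level fencing hook, which places the
drbd-fence-by-handler location constraint seen below. A sketch of the
drbd.conf stanza that typically drives this, assuming the handler scripts
shipped with DRBD 8.3 under /usr/lib/drbd (paths and policy may differ with
your packaging; with dual-primary and working STONITH, resource-and-stonith
is the usual choice instead of resource-only):

    # /etc/drbd.d/r0.res (fragment) -- fence the peer via a Pacemaker constraint
    resource r0 {
        disk {
            fencing resource-only;
        }
        handlers {
            fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
    }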
-------------- next part --------------
Jun 13 17:48:51 server2 NetworkManager[984]: <info> (eth6): carrier now OFF (device state 10)
Jun 13 17:48:51 server2 kernel: [154433.002364] bnx2 0000:01:00.0: eth6: NIC Copper Link is Down
Jun 13 17:48:51 server2 kernel: [154433.003052] xenbr0: port 1(eth6) entering forwarding state
Jun 13 17:48:52 server2 NetworkManager[984]: <info> (eth6): carrier now ON (device state 10)
Jun 13 17:48:52 server2 kernel: [154434.742214] bnx2 0000:01:00.0: eth6: NIC Copper Link is Up, 100 Mbps full duplex, receive & transmit flow control ON
Jun 13 17:48:52 server2 kernel: [154434.742731] xenbr0: port 1(eth6) entering forwarding state
Jun 13 17:48:52 server2 kernel: [154434.742750] xenbr0: port 1(eth6) entering forwarding state
Jun 13 17:48:53 server2 corosync[1762]:   [TOTEM ] A processor failed, forming new configuration.
Jun 13 17:48:54 server2 NetworkManager[984]: <info> (eth6): carrier now OFF (device state 10)
Jun 13 17:48:54 server2 kernel: [154435.786381] bnx2 0000:01:00.0: eth6: NIC Copper Link is Down
Jun 13 17:48:54 server2 kernel: [154435.786896] xenbr0: port 1(eth6) entering forwarding state
Jun 13 17:48:56 server2 cib: [1834]: notice: ais_dispatch_message: Membership 2076: quorum lost
Jun 13 17:48:56 server2 cib: [1834]: info: crm_update_peer: Node server1: id=16777226 state=lost (new) addr=r(0) ip(10.0.0.1)  votes=1 born=2072 seen=2072 proc=00000000000000000000000000111312
Jun 13 17:48:56 server2 corosync[1762]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 2076: memb=1, new=0, lost=1
Jun 13 17:48:56 server2 corosync[1762]:   [pcmk  ] info: pcmk_peer_update: memb: server2 33554442
Jun 13 17:48:56 server2 corosync[1762]:   [pcmk  ] info: pcmk_peer_update: lost: server1 16777226
Jun 13 17:48:56 server2 corosync[1762]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 2076: memb=1, new=0, lost=0
Jun 13 17:48:56 server2 corosync[1762]:   [pcmk  ] info: pcmk_peer_update: MEMB: server2 33554442
Jun 13 17:48:56 server2 corosync[1762]:   [pcmk  ] info: ais_mark_unseen_peer_dead: Node server1 was not seen in the previous transition
Jun 13 17:48:56 server2 corosync[1762]:   [pcmk  ] info: update_member: Node 16777226/server1 is now: lost
Jun 13 17:48:56 server2 corosync[1762]:   [pcmk  ] info: send_member_notification: Sending membership update 2076 to 2 children
Jun 13 17:48:56 server2 corosync[1762]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 13 17:48:56 server2 crmd: [1840]: notice: ais_dispatch_message: Membership 2076: quorum lost
Jun 13 17:48:56 server2 crmd: [1840]: info: ais_status_callback: status: server1 is now lost (was member)
Jun 13 17:48:56 server2 crmd: [1840]: info: crm_update_peer: Node server1: id=16777226 state=lost (new) addr=r(0) ip(10.0.0.1)  votes=1 born=2072 seen=2072 proc=00000000000000000000000000111312
Jun 13 17:48:56 server2 crmd: [1840]: info: erase_node_from_join: Removed node server1 from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
Jun 13 17:48:56 server2 crmd: [1840]: WARN: check_dead_member: Our DC node (server1) left the cluster
Jun 13 17:48:56 server2 crmd: [1840]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
Jun 13 17:48:56 server2 crmd: [1840]: info: update_dc: Unset DC server1
Jun 13 17:48:56 server2 corosync[1762]:   [CPG   ] chosen downlist: sender r(0) ip(10.0.0.2) ; members(old:2 left:1)
Jun 13 17:48:56 server2 crmd: [1840]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Jun 13 17:48:56 server2 crmd: [1840]: info: do_te_control: Registering TE UUID: 68dbc62f-5255-4a32-915b-fbb3954c6092
Jun 13 17:48:56 server2 crmd: [1840]: info: set_graph_functions: Setting custom graph functions
Jun 13 17:48:56 server2 crmd: [1840]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
Jun 13 17:48:56 server2 crmd: [1840]: info: do_dc_takeover: Taking over DC status for this partition
Jun 13 17:48:56 server2 corosync[1762]:   [MAIN  ] Completed service synchronization, ready to provide service.
Jun 13 17:48:56 server2 cib: [1834]: info: cib_process_readwrite: We are now in R/W mode
Jun 13 17:48:56 server2 cib: [1834]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/315, version=0.199.10): ok (rc=0)
Jun 13 17:48:56 server2 cib: [1834]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/316, version=0.199.11): ok (rc=0)
Jun 13 17:48:56 server2 cib: [1834]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/318, version=0.199.12): ok (rc=0)
Jun 13 17:48:56 server2 crmd: [1840]: info: join_make_offer: Making join offers based on membership 2076
Jun 13 17:48:56 server2 crmd: [1840]: info: do_dc_join_offer_all: join-6: Waiting on 1 outstanding join acks
Jun 13 17:48:56 server2 cib: [1834]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/320, version=0.199.13): ok (rc=0)
Jun 13 17:48:56 server2 crmd: [1840]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
Jun 13 17:48:56 server2 crmd: [1840]: info: config_query_callback: Checking for expired actions every 900000ms
Jun 13 17:48:56 server2 crmd: [1840]: info: config_query_callback: Sending expected-votes=2 to corosync
Jun 13 17:48:56 server2 crmd: [1840]: info: ais_dispatch_message: Membership 2076: quorum still lost
Jun 13 17:48:56 server2 crmd: [1840]: info: crmd_ais_dispatch: Setting expected votes to 2
Jun 13 17:48:56 server2 crmd: [1840]: info: ais_dispatch_message: Membership 2076: quorum still lost
Jun 13 17:48:56 server2 cib: [1834]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/323, version=0.199.14): ok (rc=0)
Jun 13 17:48:56 server2 crmd: [1840]: info: crmd_ais_dispatch: Setting expected votes to 2
Jun 13 17:48:56 server2 crmd: [1840]: info: update_dc: Set DC to server2 (3.0.5)
Jun 13 17:48:56 server2 cib: [1834]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/325, version=0.199.15): ok (rc=0)
Jun 13 17:48:56 server2 crmd: [1840]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Jun 13 17:48:56 server2 crmd: [1840]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
Jun 13 17:48:56 server2 crmd: [1840]: info: do_dc_join_finalize: join-6: Syncing the CIB from server2 to the rest of the cluster
Jun 13 17:48:56 server2 cib: [1834]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/327, version=0.199.15): ok (rc=0)
Jun 13 17:48:56 server2 cib: [1834]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/328, version=0.199.16): ok (rc=0)
Jun 13 17:48:57 server2 crmd: [1840]: info: do_dc_join_ack: join-6: Updating node state to member for server2
Jun 13 17:48:57 server2 cib: [1834]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='server2']/lrm (origin=local/crmd/329, version=0.199.17): ok (rc=0)
Jun 13 17:48:57 server2 crmd: [1840]: info: erase_xpath_callback: Deletion of "//node_state[@uname='server2']/lrm": ok (rc=0)
Jun 13 17:48:57 server2 crmd: [1840]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
Jun 13 17:48:57 server2 crmd: [1840]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Jun 13 17:48:57 server2 crmd: [1840]: info: do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
Jun 13 17:48:57 server2 crmd: [1840]: info: crm_update_quorum: Updating quorum status to false (call=333)
Jun 13 17:48:57 server2 crmd: [1840]: info: abort_transition_graph: do_te_invoke:167 - Triggered transition abort (complete=1) : Peer Cancelled
Jun 13 17:48:57 server2 crmd: [1840]: info: do_pe_invoke: Query 334: Requesting the current CIB: S_POLICY_ENGINE
Jun 13 17:48:57 server2 attrd: [1838]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Jun 13 17:48:57 server2 attrd: [1838]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-resDRBDr1:1 (1402659788)
Jun 13 17:48:57 server2 cib: [1834]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/331, version=0.199.19): ok (rc=0)
Jun 13 17:48:57 server2 crmd: [1840]: WARN: match_down_event: No match for shutdown action on server1
Jun 13 17:48:57 server2 crmd: [1840]: info: te_update_diff: Stonith/shutdown of server1 not matched
Jun 13 17:48:57 server2 crmd: [1840]: info: abort_transition_graph: te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state, id=server1, magic=NA, cib=0.199.20) : Node failure
Jun 13 17:48:57 server2 crmd: [1840]: info: do_pe_invoke: Query 335: Requesting the current CIB: S_POLICY_ENGINE
Jun 13 17:48:57 server2 cib: [1834]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/333, version=0.199.21): ok (rc=0)
Jun 13 17:48:57 server2 attrd: [1838]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-resDRBDr1:1 (10000)
Jun 13 17:48:57 server2 crmd: [1840]: info: do_pe_invoke_callback: Invoking the PE: query=335, ref=pe_calc-dc-1402661937-320, seq=2076, quorate=0
Jun 13 17:48:57 server2 pengine: [1839]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 13 17:48:57 server2 pengine: [1839]: notice: unpack_rsc_op: Operation resDRBDr1:1_last_failure_0 found resource resDRBDr1:1 active in master mode on server2
Jun 13 17:48:57 server2 pengine: [1839]: notice: RecurringOp:  Start recurring monitor (20s) for resXen1 on server2
Jun 13 17:48:57 server2 pengine: [1839]: notice: LogActions: Start   resXen1#011(server2)
Jun 13 17:48:57 server2 pengine: [1839]: notice: LogActions: Leave   resDRBDr1:0#011(Stopped)
Jun 13 17:48:57 server2 pengine: [1839]: notice: LogActions: Leave   resDRBDr1:1#011(Master server2)
Jun 13 17:48:57 server2 pengine: [1839]: notice: LogActions: Leave   resOCFS2r1:0#011(Stopped)
Jun 13 17:48:57 server2 pengine: [1839]: notice: LogActions: Leave   resOCFS2r1:1#011(Started server2)
Jun 13 17:48:57 server2 crmd: [1840]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 13 17:48:57 server2 crmd: [1840]: info: unpack_graph: Unpacked transition 111: 2 actions in 2 synapses
Jun 13 17:48:57 server2 crmd: [1840]: info: do_te_invoke: Processing graph 111 (ref=pe_calc-dc-1402661937-320) derived from /var/lib/pengine/pe-input-159.bz2
Jun 13 17:48:57 server2 crmd: [1840]: info: te_rsc_command: Initiating action 6: start resXen1_start_0 on server2 (local)
Jun 13 17:48:57 server2 attrd: [1838]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-resXen1 (1402588112)
Jun 13 17:48:57 server2 crmd: [1840]: info: do_lrm_rsc_op: Performing key=6:111:0:68dbc62f-5255-4a32-915b-fbb3954c6092 op=resXen1_start_0 )
Jun 13 17:48:57 server2 lrmd: [1837]: info: rsc:resXen1 start[115] (pid 20673)
Jun 13 17:48:57 server2 attrd: [1838]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jun 13 17:48:57 server2 pengine: [1839]: notice: process_pe_message: Transition 111: PEngine Input stored in: /var/lib/pengine/pe-input-159.bz2
Jun 13 17:49:00 server2 kernel: [154442.467686] block drbd0: PingAck did not arrive in time.
Jun 13 17:49:00 server2 kernel: [154442.467698] block drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 ) 
Jun 13 17:49:00 server2 kernel: [154442.477515] block drbd0: asender terminated
Jun 13 17:49:00 server2 kernel: [154442.477523] block drbd0: Terminating drbd0_asender
Jun 13 17:49:00 server2 kernel: [154442.477590] block drbd0: Connection closed
Jun 13 17:49:00 server2 kernel: [154442.477666] block drbd0: conn( NetworkFailure -> Unconnected ) 
Jun 13 17:49:00 server2 kernel: [154442.477675] block drbd0: receiver terminated
Jun 13 17:49:00 server2 kernel: [154442.477679] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0
Jun 13 17:49:00 server2 kernel: [154442.477687] block drbd0: Restarting drbd0_receiver
Jun 13 17:49:00 server2 kernel: [154442.477690] block drbd0: receiver (re)started
Jun 13 17:49:00 server2 kernel: [154442.477697] block drbd0: conn( Unconnected -> WFConnection ) 
Jun 13 17:49:00 server2 crm-fence-peer.sh[20742]: invoked for r0
Jun 13 17:49:01 server2 cib: [1834]: info: cib:diff: - <cib admin_epoch="0" epoch="199" num_updates="25" />
Jun 13 17:49:01 server2 cib: [1834]: info: cib:diff: + <cib epoch="200" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.5" update-origin="server2" update-client="crm_attribute" cib-last-written="Fri Jun 13 17:40:09 2014" have-quorum="0" dc-uuid="server2" >
Jun 13 17:49:01 server2 cib: [1834]: info: cib:diff: +   <configuration >
Jun 13 17:49:01 server2 cib: [1834]: info: cib:diff: +     <constraints >
Jun 13 17:49:01 server2 cib: [1834]: info: cib:diff: +       <rsc_location rsc="msDRBDr1" id="drbd-fence-by-handler-msDRBDr1" __crm_diff_marker__="added:top" >
Jun 13 17:49:01 server2 cib: [1834]: info: cib:diff: +         <rule role="Master" score="-INFINITY" id="drbd-fence-by-handler-rule-msDRBDr1" >
Jun 13 17:49:01 server2 cib: [1834]: info: cib:diff: +           <expression attribute="#uname" operation="ne" value="server2" id="drbd-fence-by-handler-expr-msDRBDr1" />
Jun 13 17:49:01 server2 cib: [1834]: info: cib:diff: +         </rule>
Jun 13 17:49:01 server2 cib: [1834]: info: cib:diff: +       </rsc_location>
Jun 13 17:49:01 server2 cib: [1834]: info: cib:diff: +     </constraints>
Jun 13 17:49:01 server2 cib: [1834]: info: cib:diff: +   </configuration>
Jun 13 17:49:01 server2 cib: [1834]: info: cib:diff: + </cib>
Jun 13 17:49:01 server2 cib: [1834]: info: cib_process_request: Operation complete: op cib_create for section constraints (origin=local/cibadmin/2, version=0.200.1): ok (rc=0)
Jun 13 17:49:01 server2 crmd: [1840]: info: abort_transition_graph: te_update_diff:124 - Triggered transition abort (complete=0, tag=diff, id=(null), magic=NA, cib=0.200.1) : Non-status change
Jun 13 17:49:01 server2 crm-fence-peer.sh[20742]: INFO peer is reachable, my disk is UpToDate: placed constraint 'drbd-fence-by-handler-msDRBDr1'
Jun 13 17:49:01 server2 crmd: [1840]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
Jun 13 17:49:01 server2 crmd: [1840]: info: update_abort_priority: Abort action done superceeded by restart
Jun 13 17:49:01 server2 kernel: [154443.626577] block drbd0: helper command: /sbin/drbdadm fence-peer minor-0 exit code 4 (0x400)
Jun 13 17:49:01 server2 kernel: [154443.626585] block drbd0: fence-peer helper returned 4 (peer was fenced)
Jun 13 17:49:01 server2 kernel: [154443.626595] block drbd0: pdsk( DUnknown -> Outdated ) 
Jun 13 17:49:01 server2 kernel: [154443.626661] block drbd0: new current UUID 94A8E78196FAA43D:C5DC1F452BF1147D:7F1601609656EF4E:7F1501609656EF4F
Jun 13 17:49:01 server2 kernel: [154443.629753] block drbd0: susp( 1 -> 0 ) 
Jun 13 17:49:18 server2 kernel: [154460.723689] o2net: Connection to node server1 (num 0) at 10.0.0.1:7777 has been idle for 30.57 secs, shutting it down.
Jun 13 17:49:18 server2 kernel: [154460.723731] o2net: No longer connected to node server1 (num 0) at 10.0.0.1:7777
Jun 13 17:49:18 server2 kernel: [154460.723772] (xend,20736,4):dlm_send_remote_convert_request:395 ERROR: Error -112 when sending message 504 (key 0x649b059e) to node 0
Jun 13 17:49:18 server2 kernel: [154460.723780] o2dlm: Waiting on the death of node 0 in domain F18CB82626444DD0913312B7AE741C5B
Jun 13 17:49:28 server2 kernel: [154470.731722] o2net: Connection to node server1 (num 0) at 10.0.0.1:7777 shutdown, state 7
Jun 13 17:49:31 server2 kernel: [154473.731713] o2net: Connection to node server1 (num 0) at 10.0.0.1:7777 shutdown, state 7
Jun 13 17:49:34 server2 kernel: [154476.731722] o2net: Connection to node server1 (num 0) at 10.0.0.1:7777 shutdown, state 7
Jun 13 17:49:37 server2 kernel: [154479.731715] o2net: Connection to node server1 (num 0) at 10.0.0.1:7777 shutdown, state 7
Jun 13 17:49:40 server2 kernel: [154482.731716] o2net: Connection to node server1 (num 0) at 10.0.0.1:7777 shutdown, state 7
Jun 13 17:49:43 server2 kernel: [154485.731731] o2net: Connection to node server1 (num 0) at 10.0.0.1:7777 shutdown, state 7
Jun 13 17:49:46 server2 kernel: [154488.731716] o2net: Connection to node server1 (num 0) at 10.0.0.1:7777 shutdown, state 7
Jun 13 17:49:48 server2 kernel: [154490.739677] o2net: No connection established with node 0 after 30.0 seconds, giving up.
Jun 13 17:49:48 server2 kernel: [154490.739710] (xend,20736,5):dlm_send_remote_convert_request:395 ERROR: Error -107 when sending message 504 (key 0x649b059e) to node 0
Jun 13 17:49:48 server2 kernel: [154490.739717] o2dlm: Waiting on the death of node 0 in domain F18CB82626444DD0913312B7AE741C5B
Jun 13 17:49:49 server2 kernel: [154491.731709] o2net: Connection to node server1 (num 0) at 10.0.0.1:7777 shutdown, state 7
Jun 13 17:49:53 server2 kernel: [154495.743719] o2net: Connection to node server1 (num 0) at 10.0.0.1:7777 shutdown, state 7
Jun 13 17:49:56 server2 kernel: [154498.743722] o2net: Connection to node server1 (num 0) at 10.0.0.1:7777 shutdown, state 7
Jun 13 17:49:59 server2 kernel: [154501.743726] o2net: Connection to node server1 (num 0) at 10.0.0.1:7777 shutdown, state 7
Jun 13 17:49:59 server2 kernel: [154501.748005] (xend,20736,4):dlm_send_remote_convert_request:395 ERROR: Error -107 when sending message 504 (key 0x649b059e) to node 0
Jun 13 17:49:59 server2 kernel: [154501.748011] o2dlm: Waiting on the death of node 0 in domain F18CB82626444DD0913312B7AE741C5B
Jun 13 17:49:59 server2 kernel: [154501.748028] o2cb: o2dlm has evicted node 0 from domain F18CB82626444DD0913312B7AE741C5B
Jun 13 17:50:01 server2 kernel: [154502.763673] o2dlm: Waiting on the recovery of node 0 in domain F18CB82626444DD0913312B7AE741C5B
Jun 13 17:50:02 server2 kernel: [154503.967692] o2dlm: Begin recovery on domain F18CB82626444DD0913312B7AE741C5B for node 0
Jun 13 17:50:02 server2 kernel: [154503.967718] o2dlm: Node 1 (me) is the Recovery Master for the dead node 0 in domain F18CB82626444DD0913312B7AE741C5B
Jun 13 17:50:02 server2 kernel: [154503.967813] o2dlm: End recovery on domain F18CB82626444DD0913312B7AE741C5B
Jun 13 17:50:12 server2 kernel: [154513.768024] ocfs2: Begin replay journal (node 0, slot 0) on device (147,0)
Jun 13 17:50:14 server2 kernel: [154515.953166] ocfs2: End replay journal (node 0, slot 0) on device (147,0)
Jun 13 17:50:14 server2 kernel: [154515.983124] ocfs2: Beginning quota recovery on device (147,0) for slot 0
Jun 13 17:50:14 server2 kernel: [154516.061294] ocfs2: Finishing quota recovery on device (147,0) for slot 0
Jun 13 17:50:14 server2 logger: /etc/xen/scripts/block: add XENBUS_PATH=backend/vbd/5/768
Jun 13 17:50:14 server2 logger: /etc/xen/scripts/block: add XENBUS_PATH=backend/vbd/5/5632
Jun 13 17:50:14 server2 kernel: [154516.304048] device tap5.0 entered promiscuous mode
Jun 13 17:50:14 server2 kernel: [154516.304206] xenbr1: port 2(tap5.0) entering forwarding state
Jun 13 17:50:14 server2 kernel: [154516.304223] xenbr1: port 2(tap5.0) entering forwarding state
Jun 13 17:50:14 server2 NetworkManager[984]:    SCPlugin-Ifupdown: devices added (path: /sys/devices/vif-5-0/net/vif5.0, iface: vif5.0)
Jun 13 17:50:14 server2 NetworkManager[984]:    SCPlugin-Ifupdown: device added (path: /sys/devices/vif-5-0/net/vif5.0, iface: vif5.0): no ifupdown configuration found.
Jun 13 17:50:14 server2 NetworkManager[984]: <warn> failed to allocate link cache: (-10) Operation not supported
Jun 13 17:50:14 server2 NetworkManager[984]: <info> (vif5.0): carrier is OFF
Jun 13 17:50:14 server2 NetworkManager[984]: <error> [1402662014.572738] [nm-device-ethernet.c:456] real_update_permanent_hw_address(): (vif5.0): unable to read permanent MAC address (error 0)
Jun 13 17:50:14 server2 NetworkManager[984]: <info> (vif5.0): new Ethernet device (driver: 'vif' ifindex: 16)
Jun 13 17:50:14 server2 NetworkManager[984]: <info> (vif5.0): exported as /org/freedesktop/NetworkManager/Devices/8
Jun 13 17:50:14 server2 NetworkManager[984]: <info> (vif5.0): now managed
Jun 13 17:50:14 server2 NetworkManager[984]: <info> (vif5.0): device state change: unmanaged -> unavailable (reason 'managed') [10 20 2]
Jun 13 17:50:14 server2 NetworkManager[984]: <info> (vif5.0): bringing up device.
Jun 13 17:50:14 server2 kernel: [154516.327450] ADDRCONF(NETDEV_UP): vif5.0: link is not ready
Jun 13 17:50:14 server2 NetworkManager[984]: <info> (vif5.0): preparing device.
Jun 13 17:50:14 server2 NetworkManager[984]: <info> (vif5.0): deactivating device (reason 'managed') [2]
Jun 13 17:50:14 server2 NetworkManager[984]: <info> Unmanaged Device found; state CONNECTED forced. (see http://bugs.launchpad.net/bugs/191889)
Jun 13 17:50:14 server2 NetworkManager[984]: <info> Unmanaged Device found; state CONNECTED forced. (see http://bugs.launchpad.net/bugs/191889)
Jun 13 17:50:14 server2 NetworkManager[984]: <info> Added default wired connection 'Wired connection 4' for /sys/devices/vif-5-0/net/vif5.0
Jun 13 17:50:14 server2 kernel: [154516.328863] ADDRCONF(NETDEV_UP): vif5.0: link is not ready
Jun 13 17:50:14 server2 logger: /etc/xen/scripts/vif-bridge: online type_if=vif XENBUS_PATH=backend/vif/5/0
Jun 13 17:50:14 server2 logger: /etc/xen/scripts/vif-bridge: add type_if=tap XENBUS_PATH=
Jun 13 17:50:14 server2 logger: /etc/xen/scripts/block: Writing backend/vbd/5/768/node /dev/loop0 to xenstore.
Jun 13 17:50:14 server2 logger: /etc/xen/scripts/block: Writing backend/vbd/5/768/physical-device 7:0 to xenstore.
Jun 13 17:50:14 server2 logger: /etc/xen/scripts/block: Writing backend/vbd/5/768/hotplug-status connected to xenstore.
Jun 13 17:50:14 server2 kernel: [154516.528459] device vif5.0 entered promiscuous mode
Jun 13 17:50:14 server2 kernel: [154516.532150] xenbr1: port 2(tap5.0) entering forwarding state
Jun 13 17:50:14 server2 kernel: [154516.534139] ADDRCONF(NETDEV_UP): vif5.0: link is not ready
Jun 13 17:50:14 server2 kernel: [154516.547443] xenbr1: port 2(tap5.0) entering forwarding state
Jun 13 17:50:14 server2 kernel: [154516.547456] xenbr1: port 2(tap5.0) entering forwarding state
Jun 13 17:50:14 server2 logger: /etc/xen/scripts/vif-bridge: Successful vif-bridge add for tap5.0, bridge xenbr1.
Jun 13 17:50:14 server2 NetworkManager[984]:    SCPlugin-Ifupdown: devices added (path: /sys/devices/virtual/net/tap5.0, iface: tap5.0)
Jun 13 17:50:14 server2 NetworkManager[984]:    SCPlugin-Ifupdown: device added (path: /sys/devices/virtual/net/tap5.0, iface: tap5.0): no ifupdown configuration found.
Jun 13 17:50:14 server2 NetworkManager[984]: <warn> /sys/devices/virtual/net/tap5.0: couldn't determine device driver; ignoring...
Jun 13 17:50:14 server2 logger: /etc/xen/scripts/vif-bridge: Successful vif-bridge online for vif5.0, bridge xenbr1.
Jun 13 17:50:14 server2 logger: /etc/xen/scripts/vif-bridge: Writing backend/vif/5/0/hotplug-status connected to xenstore.
Jun 13 17:50:14 server2 logger: /etc/xen/scripts/block: Writing backend/vbd/5/5632/node /dev/loop1 to xenstore.
Jun 13 17:50:14 server2 logger: /etc/xen/scripts/block: Writing backend/vbd/5/5632/physical-device 7:1 to xenstore.
Jun 13 17:50:14 server2 logger: /etc/xen/scripts/block: Writing backend/vbd/5/5632/hotplug-status connected to xenstore.
Jun 13 17:50:14 server2 lrmd: [1837]: info: RA output: (resXen1:start:stdout) Using config file "/home/cluster/xen/win7.cfg".#012Started domain xenwin7 (id=5)
Jun 13 17:50:16 server2 lrmd: [1837]: info: operation start[115] on resXen1 for client 1840: pid 20673 exited with return code 0
Jun 13 17:50:16 server2 crmd: [1840]: info: process_lrm_event: LRM operation resXen1_start_0 (call=115, rc=0, cib-update=336, confirmed=true) ok
Jun 13 17:50:16 server2 crmd: [1840]: info: match_graph_event: Action resXen1_start_0 (6) confirmed on server2 (rc=0)
Jun 13 17:50:16 server2 crmd: [1840]: info: run_graph: ====================================================
Jun 13 17:50:16 server2 crmd: [1840]: notice: run_graph: Transition 111 (Complete=1, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pengine/pe-input-159.bz2): Stopped
Jun 13 17:50:16 server2 crmd: [1840]: info: te_graph_trigger: Transition 111 is now complete
Jun 13 17:50:16 server2 crmd: [1840]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 13 17:50:16 server2 crmd: [1840]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Jun 13 17:50:16 server2 crmd: [1840]: info: do_pe_invoke: Query 337: Requesting the current CIB: S_POLICY_ENGINE
Jun 13 17:50:16 server2 crmd: [1840]: info: do_pe_invoke_callback: Invoking the PE: query=337, ref=pe_calc-dc-1402662016-322, seq=2076, quorate=0
Jun 13 17:50:16 server2 pengine: [1839]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 13 17:50:16 server2 pengine: [1839]: notice: unpack_rsc_op: Operation resDRBDr1:1_last_failure_0 found resource resDRBDr1:1 active in master mode on server2
Jun 13 17:50:16 server2 pengine: [1839]: notice: RecurringOp:  Start recurring monitor (20s) for resXen1 on server2
Jun 13 17:50:16 server2 pengine: [1839]: notice: LogActions: Leave   resXen1#011(Started server2)
Jun 13 17:50:16 server2 pengine: [1839]: notice: LogActions: Leave   resDRBDr1:0#011(Stopped)
Jun 13 17:50:16 server2 pengine: [1839]: notice: LogActions: Leave   resDRBDr1:1#011(Master server2)
Jun 13 17:50:16 server2 pengine: [1839]: notice: LogActions: Leave   resOCFS2r1:0#011(Stopped)
Jun 13 17:50:16 server2 pengine: [1839]: notice: LogActions: Leave   resOCFS2r1:1#011(Started server2)
Jun 13 17:50:16 server2 crmd: [1840]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 13 17:50:16 server2 crmd: [1840]: info: unpack_graph: Unpacked transition 112: 1 actions in 1 synapses
Jun 13 17:50:16 server2 crmd: [1840]: info: do_te_invoke: Processing graph 112 (ref=pe_calc-dc-1402662016-322) derived from /var/lib/pengine/pe-input-160.bz2
Jun 13 17:50:16 server2 crmd: [1840]: info: te_rsc_command: Initiating action 8: monitor resXen1_monitor_20000 on server2 (local)
Jun 13 17:50:16 server2 crmd: [1840]: info: do_lrm_rsc_op: Performing key=8:112:0:68dbc62f-5255-4a32-915b-fbb3954c6092 op=resXen1_monitor_20000 )
Jun 13 17:50:16 server2 lrmd: [1837]: info: rsc:resXen1 monitor[116] (pid 21465)
Jun 13 17:50:16 server2 avahi-daemon[1063]: Joining mDNS multicast group on interface tap5.0.IPv6 with address fe80::fcff:ffff:feff:ffff.
Jun 13 17:50:16 server2 avahi-daemon[1063]: New relevant interface tap5.0.IPv6 for mDNS.
Jun 13 17:50:16 server2 avahi-daemon[1063]: Registering new address record for fe80::fcff:ffff:feff:ffff on tap5.0.*.
Jun 13 17:50:16 server2 pengine: [1839]: notice: process_pe_message: Transition 112: PEngine Input stored in: /var/lib/pengine/pe-input-160.bz2
Jun 13 17:50:16 server2 lrmd: [1837]: info: operation monitor[116] on resXen1 for client 1840: pid 21465 exited with return code 0
Jun 13 17:50:16 server2 crmd: [1840]: info: process_lrm_event: LRM operation resXen1_monitor_20000 (call=116, rc=0, cib-update=338, confirmed=false) ok
Jun 13 17:50:16 server2 crmd: [1840]: info: match_graph_event: Action resXen1_monitor_20000 (8) confirmed on server2 (rc=0)
Jun 13 17:50:16 server2 crmd: [1840]: info: run_graph: ====================================================
Jun 13 17:50:16 server2 crmd: [1840]: notice: run_graph: Transition 112 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-160.bz2): Complete
Jun 13 17:50:16 server2 crmd: [1840]: info: te_graph_trigger: Transition 112 is now complete
Jun 13 17:50:16 server2 crmd: [1840]: info: notify_crmd: Transition 112 status: done - <null>
Jun 13 17:50:16 server2 crmd: [1840]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 13 17:50:16 server2 crmd: [1840]: info: do_state_transition: Starting PEngine Recheck Timer
Jun 13 17:50:25 server2 kernel: [154527.107675] tap5.0: no IPv6 routers present
Jun 13 17:50:39 server2 avahi-daemon[1063]: Interface tap5.0.IPv6 no longer relevant for mDNS.
Jun 13 17:50:39 server2 avahi-daemon[1063]: Leaving mDNS multicast group on interface tap5.0.IPv6 with address fe80::fcff:ffff:feff:ffff.
Jun 13 17:50:39 server2 avahi-daemon[1063]: Withdrawing address record for fe80::fcff:ffff:feff:ffff on tap5.0.
Jun 13 17:50:39 server2 avahi-daemon[1063]: Withdrawing workstation service for tap5.0.
Jun 13 17:50:39 server2 NetworkManager[984]:    SCPlugin-Ifupdown: devices removed (path: /sys/devices/virtual/net/tap5.0, iface: tap5.0)
Jun 13 17:50:39 server2 kernel: [154540.921762] xenbr1: port 2(tap5.0) entering forwarding state
Jun 13 17:50:39 server2 kernel: [154540.922525] xenbr1: port 2(tap5.0) entering disabled state
Jun 13 17:50:39 server2 kernel: [154540.922598] xenbr1: port 2(tap5.0) entering disabled state
Jun 13 17:50:40 server2 kernel: [154541.813244] xen-blkback:ring-ref 16383, event-channel 6, protocol 1 (unspecified, assuming native)
Jun 13 17:50:40 server2 kernel: [154541.940131] xen-blkback:ring-ref 16382, event-channel 7, protocol 1 (unspecified, assuming native)
Jun 13 17:50:46 server2 kernel: [154548.555492] ADDRCONF(NETDEV_CHANGE): vif5.0: link becomes ready
Jun 13 17:50:46 server2 kernel: [154548.555611] xenbr1: port 3(vif5.0) entering forwarding state
Jun 13 17:50:46 server2 kernel: [154548.555628] xenbr1: port 3(vif5.0) entering forwarding state
Jun 13 17:50:46 server2 NetworkManager[984]: <info> (vif5.0): carrier now ON (device state 20)
Jun 13 17:50:46 server2 NetworkManager[984]: <info> (vif5.0): device state change: unavailable -> disconnected (reason 'carrier-changed') [20 30 40]
Jun 13 17:50:46 server2 NetworkManager[984]: <info> Unmanaged Device found; state CONNECTED forced. (see http://bugs.launchpad.net/bugs/191889)
Jun 13 17:50:46 server2 NetworkManager[984]: <info> Unmanaged Device found; state CONNECTED forced. (see http://bugs.launchpad.net/bugs/191889)
Jun 13 17:50:46 server2 NetworkManager[984]: <info> Auto-activating connection 'Wired connection 4'.
Jun 13 17:50:46 server2 NetworkManager[984]: <info> Activation (vif5.0) starting connection 'Wired connection 4'
Jun 13 17:50:46 server2 NetworkManager[984]: <info> (vif5.0): device state change: disconnected -> prepare (reason 'none') [30 40 0]
Jun 13 17:50:46 server2 NetworkManager[984]: <info> Unmanaged Device found; state CONNECTED forced. (see http://bugs.launchpad.net/bugs/191889)
Jun 13 17:50:46  NetworkManager[984]: last message repeated 3 times
Jun 13 17:50:46 server2 NetworkManager[984]: <info> Activation (vif5.0) Stage 1 of 5 (Device Prepare) scheduled...
Jun 13 17:50:46 server2 NetworkManager[984]: <info> Activation (vif5.0) Stage 1 of 5 (Device Prepare) started...
Jun 13 17:50:46 server2 NetworkManager[984]: <info> Activation (vif5.0) Stage 2 of 5 (Device Configure) scheduled...
Jun 13 17:50:46 server2 NetworkManager[984]: <info> Activation (vif5.0) Stage 1 of 5 (Device Prepare) complete.
Jun 13 17:50:46 server2 NetworkManager[984]: <info> Activation (vif5.0) Stage 2 of 5 (Device Configure) starting...
Jun 13 17:50:46 server2 NetworkManager[984]: <info> (vif5.0): device state change: prepare -> config (reason 'none') [40 50 0]
Jun 13 17:50:46 server2 NetworkManager[984]: <info> Unmanaged Device found; state CONNECTED forced. (see http://bugs.launchpad.net/bugs/191889)
Jun 13 17:50:46 server2 NetworkManager[984]: <info> Unmanaged Device found; state CONNECTED forced. (see http://bugs.launchpad.net/bugs/191889)
Jun 13 17:50:46 server2 NetworkManager[984]: <info> Activation (vif5.0) Stage 2 of 5 (Device Configure) successful.

