[Pacemaker] Cluster Volume Group is stuck

Karl Rößmann K.Roessmann at fkf.mpg.de
Wed May 11 08:07:15 EDT 2011


Hi all

We have a three-node cluster with a clustered volume group, vgsmet.


After powering off one node, the volume group gets stuck.
One of the ERROR messages is:
May 11 10:50:32 multix244 crmd: [8086]: ERROR: process_lrm_event: LRM operation vgsmet:0_monitor_60000 (38) Timed Out (timeout=60000ms)


If we power the node back on, the cluster recovers.
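
While the node is off, the hang can be narrowed down by hand. A rough check (assuming the volume group is blocked behind clvmd/DLM, which in turn are waiting for fencing, and assuming dlm_tool is installed) would be:

  # any LVM command against the clustered VG hangs as long as clvmd is blocked
  vgs vgsmet

  # list the DLM lockspaces; while fencing is pending the clvmd lockspace should stay stopped
  dlm_tool ls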

--------------- This is the Configuration: ----------------------------

node multix244 \
         attributes standby="off"
node multix245 \
         attributes standby="off"
node multix246 \
         attributes standby="off"
primitive SBD_Stonith stonith:external/sbd \
         meta target-role="Started" \
         operations $id="SBD_Stonith-operations" \
         op monitor interval="15" timeout="15" start-delay="15" \
         params sbd_device="/dev/disk/by-id/scsi-3600a0b8000420d5a00001cf14dc3a9a2-part1"
primitive clvm ocf:lvm2:clvmd \
         params daemon_timeout="30" \
         op start interval="0" timeout="240s" \
         op stop interval="0" timeout="100s" \
         meta target-role="started"
primitive dlm ocf:pacemaker:controld \
         op monitor interval="120s" \
         op start interval="0" timeout="240s" \
         op stop interval="0" timeout="100s" \
         meta target-role="started"
primitive smetserv ocf:heartbeat:Xen \
         meta target-role="Started" \
         operations $id="smetserv-operations" \
         op monitor interval="10" timeout="30" \
         params xmfile="/etc/xen/vm/smetserv"
primitive vgsmet ocf:heartbeat:LVM \
         meta is-managed="true" \
         op monitor interval="60" timeout="60" \
         params volgrpname="vgsmet"
clone clvm_clone clvm \
         meta globally-unique="false" interleave="true" target-role="started"
clone dlm_clone dlm \
         meta globally-unique="false" interleave="true" target-role="started"
clone vgsmet_clone vgsmet \
         meta interleave="true" target-role="Started" is-managed="true"
colocation clvm_with_dlm inf: clvm_clone dlm_clone
colocation smetserv_with_vgsmet inf: smetserv vgsmet_clone
colocation vgsmet_with_clvm inf: vgsmet_clone clvm_clone
order clvm_before_vgsmet inf: clvm_clone vgsmet_clone
order dlm_before_clvm inf: dlm_clone clvm_clone
order vgsmet_before_smetserv inf: vgsmet_clone smetserv
property $id="cib-bootstrap-options" \
         dc-version="1.1.5-5ce2879aa0d5f43d01629bc20edc6868a9352002" \
         cluster-infrastructure="openais" \
         expected-quorum-votes="3" \
         last-lrm-refresh="1305101520" \
         default-action-timeout="60s"
rsc_defaults $id="rsc_defaults-options" \
         resource-stickiness="10"
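
Since the attached log shows the external/sbd reboot of multix246 timing out, a minimal way to test the fencing path by hand from a surviving node (assuming the usual sbd and cluster-glue command-line tools) might be:

  # check that the shared sbd device is readable and list the message slots
  sbd -d /dev/disk/by-id/scsi-3600a0b8000420d5a00001cf14dc3a9a2-part1 list

  # issue the same reset the cluster attempts, outside of Pacemaker
  stonith -t external/sbd sbd_device=/dev/disk/by-id/scsi-3600a0b8000420d5a00001cf14dc3a9a2-part1 -T reset multix246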

------------------ This is the normal output of crm_mon ---------------
============
Last updated: Wed May 11 10:14:25 2011
Stack: openais
Current DC: multix244 - partition with quorum
Version: 1.1.5-5ce2879aa0d5f43d01629bc20edc6868a9352002
3 Nodes configured, 3 expected votes
5 Resources configured.
============

Online: [ multix246 multix244 multix245 ]

  Clone Set: dlm_clone [dlm]
      Started: [ multix244 multix245 multix246 ]
  Clone Set: clvm_clone [clvm]
      Started: [ multix244 multix245 multix246 ]
  Clone Set: vgsmet_clone [vgsmet]
      Started: [ multix244 multix245 multix246 ]
  smetserv       (ocf::heartbeat:Xen):   Started multix246
  SBD_Stonith    (stonith:external/sbd): Started multix245

--------------- After powering off one node, we have: ------------------

============
Last updated: Wed May 11 10:24:11 2011
Stack: openais
Current DC: multix244 - partition with quorum
Version: 1.1.5-5ce2879aa0d5f43d01629bc20edc6868a9352002
3 Nodes configured, 3 expected votes
5 Resources configured.
============

Node multix246: UNCLEAN (offline)
Online: [ multix244 multix245 ]

  Clone Set: dlm_clone [dlm]
      Started: [ multix244 multix245 ]
      Stopped: [ dlm:2 ]
  Clone Set: clvm_clone [clvm]
      Started: [ multix244 multix245 ]
      Stopped: [ clvm:2 ]
  Clone Set: vgsmet_clone [vgsmet]
      vgsmet:0   (ocf::heartbeat:LVM):   Started multix244 FAILED
      vgsmet:1   (ocf::heartbeat:LVM):   Started multix245 FAILED
      Stopped: [ vgsmet:2 ]
  SBD_Stonith    (stonith:external/sbd): Started multix245

Failed actions:
     vgsmet:0_monitor_60000 (node=multix244, call=33, rc=-2, status=Timed Out): unknown exec error
     vgsmet:1_monitor_60000 (node=multix245, call=23, rc=-2, status=Timed Out): unknown exec error

------------------------------------------------------------------

The log files are attached below.

Karl

-- 
Karl Rößmann				Tel. +49-711-689-1657
Max-Planck-Institut FKF       		Fax. +49-711-689-1632
Postfach 800 665
70506 Stuttgart				email K.Roessmann at fkf.mpg.de
-------------- next part --------------
May 11 10:48:55 multix244 corosync[8060]:   [TOTEM ] A processor failed, forming new configuration.
May 11 10:48:59 multix244 corosync[8060]:   [CLM   ] CLM CONFIGURATION CHANGE
May 11 10:48:59 multix244 corosync[8060]:   [CLM   ] New Configuration:
May 11 10:48:59 multix244 corosync[8060]:   [CLM   ] 	r(0) ip(192.168.40.60) 
May 11 10:48:59 multix244 corosync[8060]:   [CLM   ] 	r(0) ip(192.168.40.61) 
May 11 10:48:59 multix244 corosync[8060]:   [CLM   ] Members Left:
May 11 10:48:59 multix244 corosync[8060]:   [CLM   ] 	r(0) ip(192.168.40.62) 
May 11 10:48:59 multix244 corosync[8060]:   [CLM   ] Members Joined:
May 11 10:48:59 multix244 corosync[8060]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 1152: memb=2, new=0, lost=1
May 11 10:48:59 multix244 corosync[8060]:   [pcmk  ] info: pcmk_peer_update: memb: multix244 1009297600
May 11 10:48:59 multix244 corosync[8060]:   [pcmk  ] info: pcmk_peer_update: memb: multix245 1026074816
May 11 10:48:59 multix244 corosync[8060]:   [pcmk  ] info: pcmk_peer_update: lost: multix246 1042852032
May 11 10:48:59 multix244 corosync[8060]:   [CLM   ] CLM CONFIGURATION CHANGE
May 11 10:48:59 multix244 corosync[8060]:   [CLM   ] New Configuration:
May 11 10:48:59 multix244 crmd: [8086]: info: ais_dispatch_message: Membership 1152: quorum retained
May 11 10:48:59 multix244 crmd: [8086]: info: ais_status_callback: status: multix246 is now lost (was member)
May 11 10:48:59 multix244 corosync[8060]:   [CLM   ] 	r(0) ip(192.168.40.60) 
May 11 10:48:59 multix244 cluster-dlm: [8194]: info: ais_dispatch_message: Membership 1152: quorum retained
May 11 10:48:59 multix244 crmd: [8086]: info: crm_update_peer: Node multix246: id=1042852032 state=lost (new) addr=r(0) ip(192.168.40.62)  votes=1 born=1148 seen=1148 proc=00000000000000000000000000151312
May 11 10:48:59 multix244 cib: [8082]: info: ais_dispatch_message: Membership 1152: quorum retained
May 11 10:48:59 multix244 corosync[8060]:   [CLM   ] 	r(0) ip(192.168.40.61) 
May 11 10:48:59 multix244 cluster-dlm: [8194]: info: crm_update_peer: Node multix246: id=1042852032 state=lost (new) addr=r(0) ip(192.168.40.62)  votes=1 born=1148 seen=1148 proc=00000000000000000000000000151312
May 11 10:48:59 multix244 cib: [8082]: info: crm_update_peer: Node multix246: id=1042852032 state=lost (new) addr=r(0) ip(192.168.40.62)  votes=1 born=1148 seen=1148 proc=00000000000000000000000000151312
May 11 10:48:59 multix244 crmd: [8086]: info: erase_node_from_join: Removed node multix246 from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
May 11 10:48:59 multix244 corosync[8060]:   [CLM   ] Members Left:
May 11 10:48:59 multix244 cluster-dlm: update_cluster: Processing membership 1152
May 11 10:48:59 multix244 corosync[8060]:   [CLM   ] Members Joined:
May 11 10:48:59 multix244 cluster-dlm: dlm_process_node: Skipped active node 1009297600: born-on=1104, last-seen=1152, this-event=1152, last-event=1148
May 11 10:48:59 multix244 corosync[8060]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 1152: memb=2, new=0, lost=0
May 11 10:48:59 multix244 cluster-dlm: dlm_process_node: Skipped active node 1026074816: born-on=1104, last-seen=1152, this-event=1152, last-event=1148
May 11 10:48:59 multix244 corosync[8060]:   [pcmk  ] info: pcmk_peer_update: MEMB: multix244 1009297600
May 11 10:48:59 multix244 cluster-dlm: del_configfs_node: del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/1042852032"
May 11 10:48:59 multix244 corosync[8060]:   [pcmk  ] info: pcmk_peer_update: MEMB: multix245 1026074816
May 11 10:48:59 multix244 cluster-dlm: dlm_process_node: Removed inactive node 1042852032: born-on=1148, last-seen=1148, this-event=1152, last-event=1148
May 11 10:48:59 multix244 corosync[8060]:   [pcmk  ] info: ais_mark_unseen_peer_dead: Node multix246 was not seen in the previous transition
May 11 10:48:59 multix244 corosync[8060]:   [pcmk  ] info: update_member: Node 1042852032/multix246 is now: lost
May 11 10:48:59 multix244 corosync[8060]:   [pcmk  ] info: send_member_notification: Sending membership update 1152 to 3 children
May 11 10:48:59 multix244 corosync[8060]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
May 11 10:48:59 multix244 cib: [8082]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/679, version=0.980.2): ok (rc=0)
May 11 10:48:59 multix244 crmd: [8086]: info: crmd_ais_dispatch: Setting expected votes to 3
May 11 10:48:59 multix244 kernel: [63203.672785] dlm: closing connection to node 1042852032
May 11 10:48:59 multix244 crmd: [8086]: WARN: match_down_event: No match for shutdown action on multix246
May 11 10:48:59 multix244 crmd: [8086]: info: te_update_diff: Stonith/shutdown of multix246 not matched
May 11 10:48:59 multix244 crmd: [8086]: info: abort_transition_graph: te_update_diff:200 - Triggered transition abort (complete=1, tag=node_state, id=multix246, magic=NA, cib=0.980.3) : Node failure
May 11 10:48:59 multix244 crmd: [8086]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
May 11 10:48:59 multix244 crmd: [8086]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
May 11 10:48:59 multix244 crmd: [8086]: info: do_pe_invoke: Query 683: Requesting the current CIB: S_POLICY_ENGINE
May 11 10:48:59 multix244 cib: [8082]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/682, version=0.980.4): ok (rc=0)
May 11 10:48:59 multix244 corosync[8060]:   [CPG   ] chosen downlist from node r(0) ip(192.168.40.60) 
May 11 10:48:59 multix244 cluster-dlm: log_config: dlm:controld conf 2 0 1 memb 1009297600 1026074816 join left 1042852032
May 11 10:48:59 multix244 cluster-dlm: log_config: dlm:ls:clvmd conf 2 0 1 memb 1009297600 1026074816 join left 1042852032
May 11 10:48:59 multix244 cluster-dlm: add_change: clvmd add_change cg 14 remove nodeid 1042852032 reason 3
May 11 10:48:59 multix244 cluster-dlm: add_change: clvmd add_change cg 14 counts member 2 joined 0 remove 1 failed 1
May 11 10:48:59 multix244 cluster-dlm: stop_kernel: clvmd stop_kernel cg 14
May 11 10:48:59 multix244 cluster-dlm: do_sysfs: write "0" to "/sys/kernel/dlm/clvmd/control"
May 11 10:48:59 multix244 cluster-dlm: fence_node_time: Node 1042852032/multix246 has not been shot yet
May 11 10:48:59 multix244 cluster-dlm: check_fencing_done: clvmd check_fencing 1042852032 not fenced add 1305103403 fence 0
May 11 10:48:59 multix244 corosync[8060]:   [MAIN  ] Completed service synchronization, ready to provide service.
May 11 10:48:59 multix244 crmd: [8086]: info: do_pe_invoke_callback: Invoking the PE: query=683, ref=pe_calc-dc-1305103739-542, seq=1152, quorate=1
May 11 10:48:59 multix244 pengine: [8085]: WARN: pe_fence_node: Node multix246 will be fenced because it is un-expectedly down
May 11 10:48:59 multix244 pengine: [8085]: WARN: determine_online_status: Node multix246 is unclean
May 11 10:48:59 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation vgsmet:0_monitor_0 found resource vgsmet:0 active on multix244
May 11 10:48:59 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation clvm:0_monitor_0 found resource clvm:0 active on multix244
May 11 10:48:59 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation dlm:0_monitor_0 found resource dlm:0 active on multix244
May 11 10:48:59 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation SBD_Stonith_monitor_0 found resource SBD_Stonith active on multix245
May 11 10:48:59 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation vgsmet:1_monitor_0 found resource vgsmet:1 active on multix245
May 11 10:48:59 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation clvm:1_monitor_0 found resource clvm:1 active on multix245
May 11 10:48:59 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation dlm:1_monitor_0 found resource dlm:1 active on multix245
May 11 10:48:59 multix244 pengine: [8085]: notice: clone_print:  Clone Set: dlm_clone [dlm]
May 11 10:48:59 multix244 pengine: [8085]: notice: short_print:      Started: [ multix244 multix245 ]
May 11 10:48:59 multix244 pengine: [8085]: notice: short_print:      Stopped: [ dlm:2 ]
May 11 10:48:59 multix244 pengine: [8085]: notice: clone_print:  Clone Set: clvm_clone [clvm]
May 11 10:48:59 multix244 pengine: [8085]: notice: short_print:      Started: [ multix244 multix245 ]
May 11 10:48:59 multix244 pengine: [8085]: notice: short_print:      Stopped: [ clvm:2 ]
May 11 10:48:59 multix244 pengine: [8085]: notice: clone_print:  Clone Set: vgsmet_clone [vgsmet]
May 11 10:48:59 multix244 pengine: [8085]: notice: short_print:      Started: [ multix244 multix245 ]
May 11 10:48:59 multix244 pengine: [8085]: notice: short_print:      Stopped: [ vgsmet:2 ]
May 11 10:48:59 multix244 pengine: [8085]: notice: native_print: smetserv	(ocf::heartbeat:Xen):	Started multix244
May 11 10:48:59 multix244 pengine: [8085]: notice: native_print: SBD_Stonith	(stonith:external/sbd):	Started multix245
May 11 10:48:59 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999995 more times on multix244 before being forced off
May 11 10:49:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999995 more times on multix244 before being forced off
May 11 10:49:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999995 more times on multix244 before being forced off
May 11 10:49:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999993 more times on multix245 before being forced off
May 11 10:49:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999993 more times on multix245 before being forced off
May 11 10:49:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999993 more times on multix245 before being forced off
May 11 10:49:00 multix244 pengine: [8085]: WARN: custom_action: Action dlm:2_stop_0 on multix246 is unrunnable (offline)
May 11 10:49:00 multix244 pengine: [8085]: WARN: custom_action: Marking node multix246 unclean
May 11 10:49:00 multix244 pengine: [8085]: WARN: custom_action: Action clvm:2_stop_0 on multix246 is unrunnable (offline)
May 11 10:49:00 multix244 pengine: [8085]: WARN: custom_action: Marking node multix246 unclean
May 11 10:49:00 multix244 pengine: [8085]: WARN: custom_action: Action vgsmet:2_stop_0 on multix246 is unrunnable (offline)
May 11 10:49:00 multix244 pengine: [8085]: WARN: custom_action: Marking node multix246 unclean
May 11 10:49:00 multix244 pengine: [8085]: WARN: stage6: Scheduling Node multix246 for STONITH
May 11 10:49:00 multix244 pengine: [8085]: notice: LogActions: Leave   dlm:0	(Started multix244)
May 11 10:49:00 multix244 pengine: [8085]: notice: LogActions: Leave   dlm:1	(Started multix245)
May 11 10:49:00 multix244 pengine: [8085]: notice: LogActions: Stop    dlm:2	(multix246)
May 11 10:49:00 multix244 pengine: [8085]: notice: LogActions: Leave   clvm:0	(Started multix244)
May 11 10:49:00 multix244 pengine: [8085]: notice: LogActions: Leave   clvm:1	(Started multix245)
May 11 10:49:00 multix244 pengine: [8085]: notice: LogActions: Stop    clvm:2	(multix246)
May 11 10:49:00 multix244 pengine: [8085]: notice: LogActions: Leave   vgsmet:0	(Started multix244)
May 11 10:49:00 multix244 pengine: [8085]: notice: LogActions: Leave   vgsmet:1	(Started multix245)
May 11 10:49:00 multix244 pengine: [8085]: notice: LogActions: Stop    vgsmet:2	(multix246)
May 11 10:49:00 multix244 pengine: [8085]: notice: LogActions: Leave   smetserv	(Started multix244)
May 11 10:49:00 multix244 pengine: [8085]: notice: LogActions: Leave   SBD_Stonith	(Started multix245)
May 11 10:49:00 multix244 crmd: [8086]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
May 11 10:49:00 multix244 crmd: [8086]: info: unpack_graph: Unpacked transition 211: 13 actions in 13 synapses
May 11 10:49:00 multix244 crmd: [8086]: info: do_te_invoke: Processing graph 211 (ref=pe_calc-dc-1305103739-542) derived from /var/lib/pengine/pe-warn-574.bz2
May 11 10:49:00 multix244 crmd: [8086]: info: te_pseudo_action: Pseudo action 42 fired and confirmed
May 11 10:49:00 multix244 pengine: [8085]: WARN: process_pe_message: Transition 211: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-574.bz2
May 11 10:49:00 multix244 crmd: [8086]: info: te_fence_node: Executing reboot fencing operation (44) on multix246 (timeout=60000)
May 11 10:49:00 multix244 stonith-ng: [8081]: info: initiate_remote_stonith_op: Initiating remote operation reboot for multix246: 0c1999d7-7b5c-40a0-964e-496aa504ed92
May 11 10:49:00 multix244 stonith-ng: [8081]: info: stonith_queryQuery <stonith_command t="stonith-ng" st_async_id="0c1999d7-7b5c-40a0-964e-496aa504ed92" st_op="st_query" st_callid="0" st_callopt="0" st_remote_op="0c1999d7-7b5c-40a0-964e-496aa504ed92" st_target="multix246" st_device_action="reboot" st_clientid="fb6c99d7-2b68-4a23-9339-716046bb442e" st_timeout="6000" src="multix244" seq="382" />
May 11 10:49:00 multix244 stonith-ng: [8081]: info: stonith_query: Found 0 matching devices for 'multix246'
May 11 10:49:00 multix244 stonith-ng: [8081]: info: stonith_command: Processed st_query from multix244: rc=0
May 11 10:49:00 multix244 stonith-ng: [8081]: info: call_remote_stonith: Requesting that multix245 perform op reboot multix246
May 11 10:49:00 multix244 mgmtd: [8087]: info: CIB query: cib
May 11 10:49:00 multix244 mgmtd: [8087]: info: CIB query: cib
May 11 10:49:05 multix244 sshd[26914]: Accepted publickey for root from 134.105.232.244 port 42136 ssh2
May 11 10:49:54 multix244 stonith-ng: [8081]: info: process_remote_stonith_execExecResult <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="st_notify" st_remote_op="0c1999d7-7b5c-40a0-964e-496aa504ed92" st_callid="0" st_callopt="0" st_rc="1" st_output="Performing: stonith -t external/sbd -T reset multix246 failed: multix246 0.05859375 " src="multix245" seq="256" />
May 11 10:50:00 multix244 stonith-ng: [8081]: ERROR: remote_op_timeout: Action reboot (0c1999d7-7b5c-40a0-964e-496aa504ed92) for multix246 timed out
May 11 10:50:00 multix244 stonith-ng: [8081]: info: remote_op_done: Notifing clients of 0c1999d7-7b5c-40a0-964e-496aa504ed92 (reboot of multix246 from fb6c99d7-2b68-4a23-9339-716046bb442e by (null)): 1, rc=-7
May 11 10:50:00 multix244 stonith-ng: [8081]: info: stonith_notify_client: Sending st_fence-notification to client 8086/19f1e5a5-744b-486b-9d2b-4dc3d33aea2c
May 11 10:50:00 multix244 crmd: [8086]: info: tengine_stonith_callbackStonithOp <remote-op state="1" st_target="multix246" st_op="reboot" />
May 11 10:50:00 multix244 crmd: [8086]: info: tengine_stonith_callback: Stonith operation 129/44:211:0:c8937c75-348a-453a-a120-3dfd7e55c4c4: Operation timed out (-7)
May 11 10:50:00 multix244 crmd: [8086]: ERROR: tengine_stonith_callback: Stonith of multix246 failed (-7)... aborting transition.
May 11 10:50:00 multix244 crmd: [8086]: info: abort_transition_graph: tengine_stonith_callback:408 - Triggered transition abort (complete=0) : Stonith failed
May 11 10:50:00 multix244 crmd: [8086]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
May 11 10:50:00 multix244 crmd: [8086]: info: update_abort_priority: Abort action done superceeded by restart
May 11 10:50:00 multix244 crmd: [8086]: ERROR: tengine_stonith_notify: Peer multix246 could not be terminated (reboot) by <anyone> for multix244 (ref=0c1999d7-7b5c-40a0-964e-496aa504ed92): Operation timed out
May 11 10:50:00 multix244 crmd: [8086]: info: run_graph: ====================================================
May 11 10:50:00 multix244 crmd: [8086]: notice: run_graph: Transition 211 (Complete=2, Pending=0, Fired=0, Skipped=8, Incomplete=3, Source=/var/lib/pengine/pe-warn-574.bz2): Stopped
May 11 10:50:00 multix244 crmd: [8086]: info: te_graph_trigger: Transition 211 is now complete
May 11 10:50:00 multix244 crmd: [8086]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
May 11 10:50:00 multix244 crmd: [8086]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
May 11 10:50:00 multix244 crmd: [8086]: info: do_pe_invoke: Query 684: Requesting the current CIB: S_POLICY_ENGINE
May 11 10:50:00 multix244 crmd: [8086]: info: do_pe_invoke_callback: Invoking the PE: query=684, ref=pe_calc-dc-1305103800-543, seq=1152, quorate=1
May 11 10:50:00 multix244 pengine: [8085]: WARN: pe_fence_node: Node multix246 will be fenced because it is un-expectedly down
May 11 10:50:00 multix244 pengine: [8085]: WARN: determine_online_status: Node multix246 is unclean
May 11 10:50:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation vgsmet:0_monitor_0 found resource vgsmet:0 active on multix244
May 11 10:50:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation clvm:0_monitor_0 found resource clvm:0 active on multix244
May 11 10:50:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation dlm:0_monitor_0 found resource dlm:0 active on multix244
May 11 10:50:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation SBD_Stonith_monitor_0 found resource SBD_Stonith active on multix245
May 11 10:50:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation vgsmet:1_monitor_0 found resource vgsmet:1 active on multix245
May 11 10:50:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation clvm:1_monitor_0 found resource clvm:1 active on multix245
May 11 10:50:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation dlm:1_monitor_0 found resource dlm:1 active on multix245
May 11 10:50:00 multix244 pengine: [8085]: notice: clone_print:  Clone Set: dlm_clone [dlm]
May 11 10:50:00 multix244 pengine: [8085]: notice: short_print:      Started: [ multix244 multix245 ]
May 11 10:50:00 multix244 pengine: [8085]: notice: short_print:      Stopped: [ dlm:2 ]
May 11 10:50:00 multix244 pengine: [8085]: notice: clone_print:  Clone Set: clvm_clone [clvm]
May 11 10:50:00 multix244 pengine: [8085]: notice: short_print:      Started: [ multix244 multix245 ]
May 11 10:50:00 multix244 pengine: [8085]: notice: short_print:      Stopped: [ clvm:2 ]
May 11 10:50:00 multix244 pengine: [8085]: notice: clone_print:  Clone Set: vgsmet_clone [vgsmet]
May 11 10:50:00 multix244 pengine: [8085]: notice: short_print:      Started: [ multix244 multix245 ]
May 11 10:50:00 multix244 pengine: [8085]: notice: short_print:      Stopped: [ vgsmet:2 ]
May 11 10:50:00 multix244 pengine: [8085]: notice: native_print: smetserv	(ocf::heartbeat:Xen):	Started multix244
May 11 10:50:00 multix244 pengine: [8085]: notice: native_print: SBD_Stonith	(stonith:external/sbd):	Started multix245
May 11 10:50:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999995 more times on multix244 before being forced off
May 11 10:50:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999995 more times on multix244 before being forced off
May 11 10:50:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999995 more times on multix244 before being forced off
May 11 10:50:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999993 more times on multix245 before being forced off
May 11 10:50:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999993 more times on multix245 before being forced off
May 11 10:50:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999993 more times on multix245 before being forced off
May 11 10:50:00 multix244 pengine: [8085]: WARN: custom_action: Action dlm:2_stop_0 on multix246 is unrunnable (offline)
May 11 10:50:00 multix244 pengine: [8085]: WARN: custom_action: Marking node multix246 unclean
May 11 10:50:00 multix244 pengine: [8085]: WARN: custom_action: Action clvm:2_stop_0 on multix246 is unrunnable (offline)
May 11 10:50:00 multix244 pengine: [8085]: WARN: custom_action: Marking node multix246 unclean
May 11 10:50:00 multix244 pengine: [8085]: WARN: custom_action: Action vgsmet:2_stop_0 on multix246 is unrunnable (offline)
May 11 10:50:00 multix244 pengine: [8085]: WARN: custom_action: Marking node multix246 unclean
May 11 10:50:00 multix244 pengine: [8085]: WARN: stage6: Scheduling Node multix246 for STONITH
May 11 10:50:00 multix244 pengine: [8085]: notice: LogActions: Leave   dlm:0	(Started multix244)
May 11 10:50:00 multix244 pengine: [8085]: notice: LogActions: Leave   dlm:1	(Started multix245)
May 11 10:50:00 multix244 pengine: [8085]: notice: LogActions: Stop    dlm:2	(multix246)
May 11 10:50:00 multix244 pengine: [8085]: notice: LogActions: Leave   clvm:0	(Started multix244)
May 11 10:50:00 multix244 pengine: [8085]: notice: LogActions: Leave   clvm:1	(Started multix245)
May 11 10:50:00 multix244 pengine: [8085]: notice: LogActions: Stop    clvm:2	(multix246)
May 11 10:50:00 multix244 pengine: [8085]: notice: LogActions: Leave   vgsmet:0	(Started multix244)
May 11 10:50:00 multix244 pengine: [8085]: notice: LogActions: Leave   vgsmet:1	(Started multix245)
May 11 10:50:00 multix244 pengine: [8085]: notice: LogActions: Stop    vgsmet:2	(multix246)
May 11 10:50:00 multix244 pengine: [8085]: notice: LogActions: Leave   smetserv	(Started multix244)
May 11 10:50:00 multix244 pengine: [8085]: notice: LogActions: Leave   SBD_Stonith	(Started multix245)
May 11 10:50:00 multix244 crmd: [8086]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
May 11 10:50:00 multix244 crmd: [8086]: info: unpack_graph: Unpacked transition 212: 13 actions in 13 synapses
May 11 10:50:00 multix244 crmd: [8086]: info: do_te_invoke: Processing graph 212 (ref=pe_calc-dc-1305103800-543) derived from /var/lib/pengine/pe-warn-575.bz2
May 11 10:50:00 multix244 crmd: [8086]: info: te_pseudo_action: Pseudo action 42 fired and confirmed
May 11 10:50:00 multix244 pengine: [8085]: WARN: process_pe_message: Transition 212: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-575.bz2
May 11 10:50:00 multix244 crmd: [8086]: info: te_fence_node: Executing reboot fencing operation (44) on multix246 (timeout=60000)
May 11 10:50:00 multix244 stonith-ng: [8081]: info: initiate_remote_stonith_op: Initiating remote operation reboot for multix246: f754d6f0-3ead-4da7-af3d-0523d9d39f1f
May 11 10:50:00 multix244 stonith-ng: [8081]: info: stonith_queryQuery <stonith_command t="stonith-ng" st_async_id="f754d6f0-3ead-4da7-af3d-0523d9d39f1f" st_op="st_query" st_callid="0" st_callopt="0" st_remote_op="f754d6f0-3ead-4da7-af3d-0523d9d39f1f" st_target="multix246" st_device_action="reboot" st_clientid="fb6c99d7-2b68-4a23-9339-716046bb442e" st_timeout="6000" src="multix244" seq="385" />
May 11 10:50:00 multix244 stonith-ng: [8081]: info: stonith_query: Found 0 matching devices for 'multix246'
May 11 10:50:00 multix244 stonith-ng: [8081]: info: stonith_command: Processed st_query from multix244: rc=0
May 11 10:50:00 multix244 stonith-ng: [8081]: info: call_remote_stonith: Requesting that multix245 perform op reboot multix246
May 11 10:50:32 multix244 lrmd: [8083]: WARN: vgsmet:0:monitor process (PID 27062) timed out (try 1).  Killing with signal SIGTERM (15).
May 11 10:50:32 multix244 lrmd: [8083]: WARN: operation monitor[38] on ocf::LVM::vgsmet:0 for client 8086, its parameters: CRM_meta_clone=[0] CRM_meta_clone_node_max=[1] CRM_meta_clone_max=[3] CRM_meta_notify=[false] crm_feature_set=[3.0.5] CRM_meta_globally_unique=[false] volgrpname=[vgsmet] CRM_meta_name=[monitor] CRM_meta_interval=[60000] CRM_meta_timeout=[60000] : pid [27062] timed out
May 11 10:50:32 multix244 crmd: [8086]: ERROR: process_lrm_event: LRM operation vgsmet:0_monitor_60000 (38) Timed Out (timeout=60000ms)
May 11 10:50:32 multix244 crmd: [8086]: info: process_graph_event: Detected action vgsmet:0_monitor_60000 from a different transition: 208 vs. 212
May 11 10:50:32 multix244 crmd: [8086]: info: abort_transition_graph: process_graph_event:473 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=vgsmet:0_monitor_60000, magic=2:-2;1:208:0:c8937c75-348a-453a-a120-3dfd7e55c4c4, cib=0.980.5) : Old event
May 11 10:50:32 multix244 crmd: [8086]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
May 11 10:50:32 multix244 crmd: [8086]: info: update_abort_priority: Abort action done superceeded by restart
May 11 10:50:32 multix244 crmd: [8086]: WARN: update_failcount: Updating failcount for vgsmet:0 on multix244 after failed monitor: rc=-2 (update=value++, time=1305103832)
May 11 10:50:32 multix244 attrd: [8084]: info: attrd_local_callback: Expanded fail-count-vgsmet:0=value++ to 6
May 11 10:50:32 multix244 attrd: [8084]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-vgsmet:0 (6)
May 11 10:50:32 multix244 attrd: [8084]: info: attrd_perform_update: Sent update 602: fail-count-vgsmet:0=6
May 11 10:50:32 multix244 attrd: [8084]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-vgsmet:0 (1305103832)
May 11 10:50:32 multix244 crmd: [8086]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=0, tag=nvpair, id=status-multix244-fail-count-vgsmet:0, magic=NA, cib=0.980.6) : Transient attribute: update
May 11 10:50:32 multix244 attrd: [8084]: info: attrd_perform_update: Sent update 604: last-failure-vgsmet:0=1305103832
May 11 10:50:32 multix244 crmd: [8086]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=0, tag=nvpair, id=status-multix244-last-failure-vgsmet:0, magic=NA, cib=0.980.7) : Transient attribute: update
May 11 10:50:32 multix244 crmd: [8086]: info: process_graph_event: Detected action vgsmet:1_monitor_60000 from a different transition: 208 vs. 212
May 11 10:50:32 multix244 crmd: [8086]: info: abort_transition_graph: process_graph_event:473 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=vgsmet:1_monitor_60000, magic=2:-2;5:208:0:c8937c75-348a-453a-a120-3dfd7e55c4c4, cib=0.980.8) : Old event
May 11 10:50:32 multix244 crmd: [8086]: WARN: update_failcount: Updating failcount for vgsmet:1 on multix245 after failed monitor: rc=-2 (update=value++, time=1305103832)
May 11 10:50:32 multix244 crmd: [8086]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=0, tag=nvpair, id=status-multix245-fail-count-vgsmet:1, magic=NA, cib=0.980.9) : Transient attribute: update
May 11 10:50:32 multix244 crmd: [8086]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=0, tag=nvpair, id=status-multix245-last-failure-vgsmet:1, magic=NA, cib=0.980.10) : Transient attribute: update
May 11 10:50:33 multix244 mgmtd: [8087]: info: CIB query: cib
May 11 10:50:33 multix244 mgmtd: [8087]: info: CIB query: cib
May 11 10:50:54 multix244 stonith-ng: [8081]: info: process_remote_stonith_execExecResult <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="st_notify" st_remote_op="f754d6f0-3ead-4da7-af3d-0523d9d39f1f" st_callid="0" st_callopt="0" st_rc="1" st_output="Performing: stonith -t external/sbd -T reset multix246 failed: multix246 0.05859375 " src="multix245" seq="258" />
May 11 10:51:00 multix244 stonith-ng: [8081]: ERROR: remote_op_timeout: Action reboot (f754d6f0-3ead-4da7-af3d-0523d9d39f1f) for multix246 timed out
May 11 10:51:00 multix244 stonith-ng: [8081]: info: remote_op_done: Notifing clients of f754d6f0-3ead-4da7-af3d-0523d9d39f1f (reboot of multix246 from fb6c99d7-2b68-4a23-9339-716046bb442e by (null)): 1, rc=-7
May 11 10:51:00 multix244 stonith-ng: [8081]: info: stonith_notify_client: Sending st_fence-notification to client 8086/19f1e5a5-744b-486b-9d2b-4dc3d33aea2c
May 11 10:51:00 multix244 crmd: [8086]: info: tengine_stonith_callbackStonithOp <remote-op state="1" st_target="multix246" st_op="reboot" />
May 11 10:51:00 multix244 crmd: [8086]: info: tengine_stonith_callback: Stonith operation 130/44:212:0:c8937c75-348a-453a-a120-3dfd7e55c4c4: Operation timed out (-7)
May 11 10:51:00 multix244 crmd: [8086]: ERROR: tengine_stonith_callback: Stonith of multix246 failed (-7)... aborting transition.
May 11 10:51:00 multix244 crmd: [8086]: info: abort_transition_graph: tengine_stonith_callback:408 - Triggered transition abort (complete=0) : Stonith failed
May 11 10:51:00 multix244 crmd: [8086]: ERROR: tengine_stonith_notify: Peer multix246 could not be terminated (reboot) by <anyone> for multix244 (ref=f754d6f0-3ead-4da7-af3d-0523d9d39f1f): Operation timed out
May 11 10:51:00 multix244 crmd: [8086]: info: run_graph: ====================================================
May 11 10:51:00 multix244 crmd: [8086]: notice: run_graph: Transition 212 (Complete=2, Pending=0, Fired=0, Skipped=8, Incomplete=3, Source=/var/lib/pengine/pe-warn-575.bz2): Stopped
May 11 10:51:00 multix244 crmd: [8086]: info: te_graph_trigger: Transition 212 is now complete
May 11 10:51:00 multix244 crmd: [8086]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
May 11 10:51:00 multix244 crmd: [8086]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
May 11 10:51:00 multix244 crmd: [8086]: info: do_pe_invoke: Query 686: Requesting the current CIB: S_POLICY_ENGINE
May 11 10:51:00 multix244 crmd: [8086]: info: do_pe_invoke_callback: Invoking the PE: query=686, ref=pe_calc-dc-1305103860-544, seq=1152, quorate=1
May 11 10:51:00 multix244 pengine: [8085]: WARN: pe_fence_node: Node multix246 will be fenced because it is un-expectedly down
May 11 10:51:00 multix244 pengine: [8085]: WARN: determine_online_status: Node multix246 is unclean
May 11 10:51:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation vgsmet:0_monitor_0 found resource vgsmet:0 active on multix244
May 11 10:51:00 multix244 pengine: [8085]: WARN: unpack_rsc_op: Processing failed op vgsmet:0_monitor_60000 on multix244: unknown exec error (-2)
May 11 10:51:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation clvm:0_monitor_0 found resource clvm:0 active on multix244
May 11 10:51:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation dlm:0_monitor_0 found resource dlm:0 active on multix244
May 11 10:51:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation SBD_Stonith_monitor_0 found resource SBD_Stonith active on multix245
May 11 10:51:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation vgsmet:1_monitor_0 found resource vgsmet:1 active on multix245
May 11 10:51:00 multix244 pengine: [8085]: WARN: unpack_rsc_op: Processing failed op vgsmet:1_monitor_60000 on multix245: unknown exec error (-2)
May 11 10:51:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation clvm:1_monitor_0 found resource clvm:1 active on multix245
May 11 10:51:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation dlm:1_monitor_0 found resource dlm:1 active on multix245
May 11 10:51:00 multix244 pengine: [8085]: notice: clone_print:  Clone Set: dlm_clone [dlm]
May 11 10:51:00 multix244 pengine: [8085]: notice: short_print:      Started: [ multix244 multix245 ]
May 11 10:51:00 multix244 pengine: [8085]: notice: short_print:      Stopped: [ dlm:2 ]
May 11 10:51:00 multix244 pengine: [8085]: notice: clone_print:  Clone Set: clvm_clone [clvm]
May 11 10:51:00 multix244 pengine: [8085]: notice: short_print:      Started: [ multix244 multix245 ]
May 11 10:51:00 multix244 pengine: [8085]: notice: short_print:      Stopped: [ clvm:2 ]
May 11 10:51:00 multix244 pengine: [8085]: notice: clone_print:  Clone Set: vgsmet_clone [vgsmet]
May 11 10:51:00 multix244 pengine: [8085]: notice: native_print:      vgsmet:0	(ocf::heartbeat:LVM):	Started multix244 FAILED
May 11 10:51:00 multix244 pengine: [8085]: notice: native_print:      vgsmet:1	(ocf::heartbeat:LVM):	Started multix245 FAILED
May 11 10:51:00 multix244 pengine: [8085]: notice: short_print:      Stopped: [ vgsmet:2 ]
May 11 10:51:00 multix244 pengine: [8085]: notice: native_print: smetserv	(ocf::heartbeat:Xen):	Started multix244
May 11 10:51:00 multix244 pengine: [8085]: notice: native_print: SBD_Stonith	(stonith:external/sbd):	Started multix245
May 11 10:51:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999994 more times on multix244 before being forced off
May 11 10:51:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999994 more times on multix244 before being forced off
May 11 10:51:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999994 more times on multix244 before being forced off
May 11 10:51:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999992 more times on multix245 before being forced off
May 11 10:51:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999992 more times on multix245 before being forced off
May 11 10:51:00 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999992 more times on multix245 before being forced off
May 11 10:51:00 multix244 pengine: [8085]: WARN: custom_action: Action dlm:2_stop_0 on multix246 is unrunnable (offline)
May 11 10:51:00 multix244 pengine: [8085]: WARN: custom_action: Marking node multix246 unclean
May 11 10:51:00 multix244 pengine: [8085]: WARN: custom_action: Action clvm:2_stop_0 on multix246 is unrunnable (offline)
May 11 10:51:00 multix244 pengine: [8085]: WARN: custom_action: Marking node multix246 unclean
May 11 10:51:00 multix244 pengine: [8085]: notice: RecurringOp:  Start recurring monitor (60s) for vgsmet:0 on multix244
May 11 10:51:00 multix244 pengine: [8085]: notice: RecurringOp:  Start recurring monitor (60s) for vgsmet:1 on multix245
May 11 10:51:00 multix244 pengine: [8085]: WARN: custom_action: Action vgsmet:2_stop_0 on multix246 is unrunnable (offline)
May 11 10:51:00 multix244 pengine: [8085]: WARN: custom_action: Marking node multix246 unclean
May 11 10:51:00 multix244 pengine: [8085]: WARN: stage6: Scheduling Node multix246 for STONITH
May 11 10:51:00 multix244 pengine: [8085]: notice: LogActions: Leave   dlm:0	(Started multix244)
May 11 10:51:00 multix244 pengine: [8085]: notice: LogActions: Leave   dlm:1	(Started multix245)
May 11 10:51:00 multix244 pengine: [8085]: notice: LogActions: Stop    dlm:2	(multix246)
May 11 10:51:00 multix244 pengine: [8085]: notice: LogActions: Leave   clvm:0	(Started multix244)
May 11 10:51:00 multix244 pengine: [8085]: notice: LogActions: Leave   clvm:1	(Started multix245)
May 11 10:51:00 multix244 pengine: [8085]: notice: LogActions: Stop    clvm:2	(multix246)
May 11 10:51:00 multix244 pengine: [8085]: notice: LogActions: Recover vgsmet:0	(Started multix244)
May 11 10:51:00 multix244 pengine: [8085]: notice: LogActions: Recover vgsmet:1	(Started multix245)
May 11 10:51:00 multix244 pengine: [8085]: notice: LogActions: Stop    vgsmet:2	(multix246)
May 11 10:51:00 multix244 pengine: [8085]: notice: LogActions: Restart smetserv	(Started multix244)
May 11 10:51:00 multix244 pengine: [8085]: notice: LogActions: Leave   SBD_Stonith	(Started multix245)
May 11 10:51:00 multix244 crmd: [8086]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
May 11 10:51:00 multix244 crmd: [8086]: info: unpack_graph: Unpacked transition 213: 24 actions in 24 synapses
May 11 10:51:00 multix244 pengine: [8085]: WARN: process_pe_message: Transition 213: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-576.bz2
May 11 10:51:00 multix244 crmd: [8086]: info: do_te_invoke: Processing graph 213 (ref=pe_calc-dc-1305103860-544) derived from /var/lib/pengine/pe-warn-576.bz2
May 11 10:51:00 multix244 crmd: [8086]: info: te_rsc_command: Initiating action 38: stop smetserv_stop_0 on multix244 (local)
May 11 10:51:00 multix244 lrmd: [8083]: info: cancel_op: operation monitor[40] on ocf::Xen::smetserv for client 8086, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.5] xmfile=[/etc/xen/vm/smetserv] CRM_meta_timeout=[30000] CRM_meta_interval=[10000]  cancelled
May 11 10:51:00 multix244 crmd: [8086]: info: do_lrm_rsc_op: Performing key=38:213:0:c8937c75-348a-453a-a120-3dfd7e55c4c4 op=smetserv_stop_0 )
May 11 10:51:00 multix244 lrmd: [8083]: info: rsc:smetserv:42: stop
May 11 10:51:00 multix244 crmd: [8086]: info: te_pseudo_action: Pseudo action 42 fired and confirmed
May 11 10:51:00 multix244 crmd: [8086]: info: process_lrm_event: LRM operation smetserv_monitor_10000 (call=40, status=1, cib-update=0, confirmed=true) Cancelled
May 11 10:51:00 multix244 crmd: [8086]: info: te_fence_node: Executing reboot fencing operation (44) on multix246 (timeout=60000)
May 11 10:51:00 multix244 stonith-ng: [8081]: info: initiate_remote_stonith_op: Initiating remote operation reboot for multix246: 2b8b0ede-8e18-4957-9fd2-6624c6e82a81
May 11 10:51:00 multix244 stonith-ng: [8081]: info: stonith_queryQuery <stonith_command t="stonith-ng" st_async_id="2b8b0ede-8e18-4957-9fd2-6624c6e82a81" st_op="st_query" st_callid="0" st_callopt="0" st_remote_op="2b8b0ede-8e18-4957-9fd2-6624c6e82a81" st_target="multix246" st_device_action="reboot" st_clientid="fb6c99d7-2b68-4a23-9339-716046bb442e" st_timeout="6000" src="multix244" seq="388" />
May 11 10:51:00 multix244 stonith-ng: [8081]: info: stonith_query: Found 0 matching devices for 'multix246'
May 11 10:51:00 multix244 stonith-ng: [8081]: info: stonith_command: Processed st_query from multix244: rc=0
May 11 10:51:00 multix244 Xen[27397]: INFO: Xen domain smetserv will be stopped (timeout: 13s)
May 11 10:51:00 multix244 stonith-ng: [8081]: info: call_remote_stonith: Requesting that multix245 perform op reboot multix246
May 11 10:51:12 multix244 logger: /etc/xen/scripts/vif-bridge: offline XENBUS_PATH=backend/vif/4/0
May 11 10:51:12 multix244 kernel: [63336.113657] br0: port 2(vif4.0) entering disabled state
May 11 10:51:12 multix244 kernel: [63336.129238] br0: port 2(vif4.0) entering disabled state
May 11 10:51:12 multix244 logger: /etc/xen/scripts/vif-bridge: brctl delif br0 vif4.0 failed
May 11 10:51:12 multix244 logger: /etc/xen/scripts/vif-bridge: ifconfig vif4.0 down failed
May 11 10:51:12 multix244 logger: /etc/xen/scripts/vif-bridge: Successful vif-bridge offline for vif4.0, bridge br0.
May 11 10:51:14 multix244 Xen[27397]: WARNING: Xen domain smetserv will be destroyed!
May 11 10:51:15 multix244 lrmd: [8083]: info: RA output: (smetserv:stop:stderr) Error: Domain 'smetserv' does not exist.
May 11 10:51:15 multix244 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vkbd/4/0
May 11 10:51:15 multix244 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/console/4/0
May 11 10:51:15 multix244 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vfb/4/0
May 11 10:51:15 multix244 logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/4/51712
May 11 10:51:15 multix244 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vif/4/0
May 11 10:51:15 multix244 logger: /etc/xen/scripts/block: remove XENBUS_PATH=backend/vbd/4/51728
May 11 10:51:15 multix244 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vbd/4/51712
May 11 10:51:15 multix244 logger: /etc/xen/scripts/xen-hotplug-cleanup: XENBUS_PATH=backend/vbd/4/51728
May 11 10:51:16 multix244 Xen[27397]: INFO: Xen domain smetserv stopped.
May 11 10:51:16 multix244 crmd: [8086]: info: process_lrm_event: LRM operation smetserv_stop_0 (call=42, rc=0, cib-update=687, confirmed=true) ok
May 11 10:51:16 multix244 crmd: [8086]: info: match_graph_event: Action smetserv_stop_0 (38) confirmed on multix244 (rc=0)
May 11 10:51:16 multix244 mgmtd: [8087]: info: CIB query: cib
May 11 10:51:16 multix244 mgmtd: [8087]: info: CIB query: cib
May 11 10:51:54 multix244 stonith-ng: [8081]: info: process_remote_stonith_execExecResult <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="st_notify" st_remote_op="2b8b0ede-8e18-4957-9fd2-6624c6e82a81" st_callid="0" st_callopt="0" st_rc="1" st_output="Performing: stonith -t external/sbd -T reset multix246 failed: multix246 0.05859375 " src="multix245" seq="260" />
May 11 10:52:00 multix244 stonith-ng: [8081]: ERROR: remote_op_timeout: Action reboot (2b8b0ede-8e18-4957-9fd2-6624c6e82a81) for multix246 timed out
May 11 10:52:00 multix244 stonith-ng: [8081]: info: remote_op_done: Notifing clients of 2b8b0ede-8e18-4957-9fd2-6624c6e82a81 (reboot of multix246 from fb6c99d7-2b68-4a23-9339-716046bb442e by (null)): 1, rc=-7
May 11 10:52:00 multix244 stonith-ng: [8081]: info: stonith_notify_client: Sending st_fence-notification to client 8086/19f1e5a5-744b-486b-9d2b-4dc3d33aea2c
May 11 10:52:00 multix244 crmd: [8086]: info: tengine_stonith_callbackStonithOp <remote-op state="1" st_target="multix246" st_op="reboot" />
May 11 10:52:00 multix244 crmd: [8086]: info: tengine_stonith_callback: Stonith operation 131/44:213:0:c8937c75-348a-453a-a120-3dfd7e55c4c4: Operation timed out (-7)
May 11 10:52:00 multix244 crmd: [8086]: ERROR: tengine_stonith_callback: Stonith of multix246 failed (-7)... aborting transition.
May 11 10:52:00 multix244 crmd: [8086]: info: abort_transition_graph: tengine_stonith_callback:408 - Triggered transition abort (complete=0) : Stonith failed
May 11 10:52:00 multix244 crmd: [8086]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
May 11 10:52:00 multix244 crmd: [8086]: info: update_abort_priority: Abort action done superceeded by restart
May 11 10:52:00 multix244 crmd: [8086]: ERROR: tengine_stonith_notify: Peer multix246 could not be terminated (reboot) by <anyone> for multix244 (ref=2b8b0ede-8e18-4957-9fd2-6624c6e82a81): Operation timed out
May 11 10:52:00 multix244 crmd: [8086]: info: run_graph: ====================================================
May 11 10:52:00 multix244 crmd: [8086]: notice: run_graph: Transition 213 (Complete=3, Pending=0, Fired=0, Skipped=17, Incomplete=4, Source=/var/lib/pengine/pe-warn-576.bz2): Stopped
May 11 10:52:00 multix244 crmd: [8086]: info: te_graph_trigger: Transition 213 is now complete
May 11 10:52:00 multix244 crmd: [8086]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
May 11 10:52:00 multix244 crmd: [8086]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
May 11 10:52:00 multix244 crmd: [8086]: info: do_pe_invoke: Query 688: Requesting the current CIB: S_POLICY_ENGINE
May 11 10:52:00 multix244 crmd: [8086]: info: do_pe_invoke_callback: Invoking the PE: query=688, ref=pe_calc-dc-1305103920-546, seq=1152, quorate=1
May 11 10:52:00 multix244 pengine: [8085]: WARN: pe_fence_node: Node multix246 will be fenced because it is un-expectedly down
May 11 10:52:00 multix244 pengine: [8085]: WARN: determine_online_status: Node multix246 is unclean
May 11 10:52:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation vgsmet:0_monitor_0 found resource vgsmet:0 active on multix244
May 11 10:52:00 multix244 pengine: [8085]: WARN: unpack_rsc_op: Processing failed op vgsmet:0_monitor_60000 on multix244: unknown exec error (-2)
May 11 10:52:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation clvm:0_monitor_0 found resource clvm:0 active on multix244
May 11 10:52:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation dlm:0_monitor_0 found resource dlm:0 active on multix244
May 11 10:52:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation SBD_Stonith_monitor_0 found resource SBD_Stonith active on multix245
May 11 10:52:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation vgsmet:1_monitor_0 found resource vgsmet:1 active on multix245
May 11 10:52:00 multix244 pengine: [8085]: WARN: unpack_rsc_op: Processing failed op vgsmet:1_monitor_60000 on multix245: unknown exec error (-2)
May 11 10:52:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation clvm:1_monitor_0 found resource clvm:1 active on multix245
May 11 10:52:00 multix244 pengine: [8085]: notice: unpack_rsc_op: Operation dlm:1_monitor_0 found resource dlm:1 active on multix245
May 11 10:52:00 multix244 pengine: [8085]: notice: clone_print:  Clone Set: dlm_clone [dlm]
May 11 10:52:00 multix244 pengine: [8085]: notice: short_print:      Started: [ multix244 multix245 ]
May 11 10:52:00 multix244 pengine: [8085]: notice: short_print:      Stopped: [ dlm:2 ]
May 11 10:52:00 multix244 pengine: [8085]: notice: clone_print:  Clone Set: clvm_clone [clvm]
May 11 10:52:00 multix244 pengine: [8085]: notice: short_print:      Started: [ multix244 multix245 ]
May 11 10:52:00 multix244 pengine: [8085]: notice: short_print:      Stopped: [ clvm:2 ]
May 11 10:52:00 multix244 pengine: [8085]: notice: clone_print:  Clone Set: vgsmet_clone [vgsmet]
May 11 10:52:01 multix244 pengine: [8085]: notice: native_print:      vgsmet:0	(ocf::heartbeat:LVM):	Started multix244 FAILED
May 11 10:52:01 multix244 pengine: [8085]: notice: native_print:      vgsmet:1	(ocf::heartbeat:LVM):	Started multix245 FAILED
May 11 10:52:01 multix244 pengine: [8085]: notice: short_print:      Stopped: [ vgsmet:2 ]
May 11 10:52:01 multix244 pengine: [8085]: notice: native_print: smetserv	(ocf::heartbeat:Xen):	Stopped 
May 11 10:52:01 multix244 pengine: [8085]: notice: native_print: SBD_Stonith	(stonith:external/sbd):	Started multix245
May 11 10:52:01 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999994 more times on multix244 before being forced off
May 11 10:52:01 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999994 more times on multix244 before being forced off
May 11 10:52:01 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999994 more times on multix244 before being forced off
May 11 10:52:01 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999992 more times on multix245 before being forced off
May 11 10:52:01 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999992 more times on multix245 before being forced off
May 11 10:52:01 multix244 pengine: [8085]: notice: common_apply_stickiness: vgsmet_clone can fail 999992 more times on multix245 before being forced off
May 11 10:52:01 multix244 pengine: [8085]: WARN: custom_action: Action dlm:2_stop_0 on multix246 is unrunnable (offline)
May 11 10:52:01 multix244 pengine: [8085]: WARN: custom_action: Marking node multix246 unclean
May 11 10:52:01 multix244 pengine: [8085]: WARN: custom_action: Action clvm:2_stop_0 on multix246 is unrunnable (offline)
May 11 10:52:01 multix244 pengine: [8085]: WARN: custom_action: Marking node multix246 unclean
May 11 10:52:01 multix244 pengine: [8085]: notice: RecurringOp:  Start recurring monitor (60s) for vgsmet:0 on multix244
May 11 10:52:01 multix244 pengine: [8085]: notice: RecurringOp:  Start recurring monitor (60s) for vgsmet:1 on multix245
May 11 10:52:01 multix244 pengine: [8085]: WARN: custom_action: Action vgsmet:2_stop_0 on multix246 is unrunnable (offline)
May 11 10:52:01 multix244 pengine: [8085]: WARN: custom_action: Marking node multix246 unclean
May 11 10:52:01 multix244 pengine: [8085]: notice: RecurringOp:  Start recurring monitor (10s) for smetserv on multix244
May 11 10:52:01 multix244 pengine: [8085]: WARN: stage6: Scheduling Node multix246 for STONITH
May 11 10:52:01 multix244 pengine: [8085]: notice: LogActions: Leave   dlm:0	(Started multix244)
May 11 10:52:01 multix244 pengine: [8085]: notice: LogActions: Leave   dlm:1	(Started multix245)
May 11 10:52:01 multix244 pengine: [8085]: notice: LogActions: Stop    dlm:2	(multix246)
May 11 10:52:01 multix244 pengine: [8085]: notice: LogActions: Leave   clvm:0	(Started multix244)
May 11 10:52:01 multix244 pengine: [8085]: notice: LogActions: Leave   clvm:1	(Started multix245)
May 11 10:52:01 multix244 pengine: [8085]: notice: LogActions: Stop    clvm:2	(multix246)
May 11 10:52:01 multix244 pengine: [8085]: notice: LogActions: Recover vgsmet:0	(Started multix244)
May 11 10:52:01 multix244 pengine: [8085]: notice: LogActions: Recover vgsmet:1	(Started multix245)
May 11 10:52:01 multix244 pengine: [8085]: notice: LogActions: Stop    vgsmet:2	(multix246)
May 11 10:52:01 multix244 pengine: [8085]: notice: LogActions: Start   smetserv	(multix244)
May 11 10:52:01 multix244 pengine: [8085]: notice: LogActions: Leave   SBD_Stonith	(Started multix245)
May 11 10:52:01 multix244 crmd: [8086]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
May 11 10:52:01 multix244 pengine: [8085]: WARN: process_pe_message: Transition 214: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-577.bz2
May 11 10:52:01 multix244 crmd: [8086]: info: unpack_graph: Unpacked transition 214: 23 actions in 23 synapses
May 11 10:52:01 multix244 crmd: [8086]: info: do_te_invoke: Processing graph 214 (ref=pe_calc-dc-1305103920-546) derived from /var/lib/pengine/pe-warn-577.bz2
May 11 10:52:01 multix244 crmd: [8086]: info: te_pseudo_action: Pseudo action 41 fired and confirmed
May 11 10:52:01 multix244 crmd: [8086]: info: te_fence_node: Executing reboot fencing operation (43) on multix246 (timeout=60000)
May 11 10:52:01 multix244 stonith-ng: [8081]: info: initiate_remote_stonith_op: Initiating remote operation reboot for multix246: fe35c9fe-8e29-4ff4-b503-68efc147c0b8
May 11 10:52:01 multix244 stonith-ng: [8081]: info: stonith_queryQuery <stonith_command t="stonith-ng" st_async_id="fe35c9fe-8e29-4ff4-b503-68efc147c0b8" st_op="st_query" st_callid="0" st_callopt="0" st_remote_op="fe35c9fe-8e29-4ff4-b503-68efc147c0b8" st_target="multix246" st_device_action="reboot" st_clientid="fb6c99d7-2b68-4a23-9339-716046bb442e" st_timeout="6000" src="multix244" seq="391" />
May 11 10:52:01 multix244 stonith-ng: [8081]: info: stonith_query: Found 0 matching devices for 'multix246'
May 11 10:52:01 multix244 stonith-ng: [8081]: info: stonith_command: Processed st_query from multix244: rc=0
May 11 10:52:01 multix244 stonith-ng: [8081]: info: call_remote_stonith: Requesting that multix245 perform op reboot multix246
May 11 10:52:04 multix244 ntpd[6194]: Deleting interface #13 vif4.0, fe80::fcff:ffff:feff:ffff#123, interface stats: received=0, sent=0, dropped=0, active_time=300 secs
May 11 10:52:32 multix244 lrmd: [8083]: WARN: vgsmet:0:monitor process (PID 27844) timed out (try 1).  Killing with signal SIGTERM (15).
May 11 10:52:32 multix244 lrmd: [8083]: WARN: operation monitor[38] on ocf::LVM::vgsmet:0 for client 8086, its parameters: CRM_meta_clone=[0] CRM_meta_clone_node_max=[1] CRM_meta_clone_max=[3] CRM_meta_notify=[false] crm_feature_set=[3.0.5] CRM_meta_globally_unique=[false] volgrpname=[vgsmet] CRM_meta_name=[monitor] CRM_meta_interval=[60000] CRM_meta_timeout=[60000] : pid [27844] timed out
May 11 10:52:32 multix244 crmd: [8086]: ERROR: process_lrm_event: LRM operation vgsmet:0_monitor_60000 (38) Timed Out (timeout=60000ms)
May 11 10:52:32 multix244 crmd: [8086]: info: process_graph_event: Detected action vgsmet:0_monitor_60000 from a different transition: 208 vs. 214
May 11 10:52:32 multix244 crmd: [8086]: info: abort_transition_graph: process_graph_event:473 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=vgsmet:0_monitor_60000, magic=2:-2;1:208:0:c8937c75-348a-453a-a120-3dfd7e55c4c4, cib=0.980.12) : Old event
May 11 10:52:32 multix244 crmd: [8086]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
May 11 10:52:32 multix244 crmd: [8086]: info: update_abort_priority: Abort action done superceeded by restart
May 11 10:52:32 multix244 crmd: [8086]: WARN: update_failcount: Updating failcount for vgsmet:0 on multix244 after failed monitor: rc=-2 (update=value++, time=1305103952)
May 11 10:52:32 multix244 attrd: [8084]: info: attrd_local_callback: Expanded fail-count-vgsmet:0=value++ to 7
May 11 10:52:32 multix244 attrd: [8084]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-vgsmet:0 (7)
May 11 10:52:32 multix244 attrd: [8084]: info: attrd_perform_update: Sent update 608: fail-count-vgsmet:0=7
May 11 10:52:32 multix244 attrd: [8084]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-vgsmet:0 (1305103952)
May 11 10:52:32 multix244 crmd: [8086]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=0, tag=nvpair, id=status-multix244-fail-count-vgsmet:0, magic=NA, cib=0.980.13) : Transient attribute: update
May 11 10:52:32 multix244 attrd: [8084]: info: attrd_perform_update: Sent update 610: last-failure-vgsmet:0=1305103952
May 11 10:52:32 multix244 crmd: [8086]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=0, tag=nvpair, id=status-multix244-last-failure-vgsmet:0, magic=NA, cib=0.980.14) : Transient attribute: update
May 11 10:52:32 multix244 crmd: [8086]: info: process_graph_event: Detected action vgsmet:1_monitor_60000 from a different transition: 208 vs. 214
May 11 10:52:32 multix244 crmd: [8086]: info: abort_transition_graph: process_graph_event:473 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=vgsmet:1_monitor_60000, magic=2:-2;5:208:0:c8937c75-348a-453a-a120-3dfd7e55c4c4, cib=0.980.15) : Old event
May 11 10:52:32 multix244 crmd: [8086]: WARN: update_failcount: Updating failcount for vgsmet:1 on multix245 after failed monitor: rc=-2 (update=value++, time=1305103952)
May 11 10:52:32 multix244 crmd: [8086]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=0, tag=nvpair, id=status-multix245-fail-count-vgsmet:1, magic=NA, cib=0.980.16) : Transient attribute: update
May 11 10:52:32 multix244 crmd: [8086]: info: abort_transition_graph: te_update_diff:149 - Triggered transition abort (complete=0, tag=nvpair, id=status-multix245-last-failure-vgsmet:1, magic=NA, cib=0.980.17) : Transient attribute: update
May 11 10:52:33 multix244 mgmtd: [8087]: info: CIB query: cib
May 11 10:52:33 multix244 mgmtd: [8087]: info: CIB query: cib
May 11 10:52:41 multix244 ntpd[6194]: kernel time sync status change 6001
May 11 10:52:55 multix244 stonith-ng: [8081]: info: process_remote_stonith_execExecResult <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="st_notify" st_remote_op="fe35c9fe-8e29-4ff4-b503-68efc147c0b8" st_callid="0" st_callopt="0" st_rc="1" st_output="Performing: stonith -t external/sbd -T reset multix246 failed: multix246 0.05859375 " src="multix245" seq="262" />
May 11 10:53:01 multix244 stonith-ng: [8081]: ERROR: remote_op_timeout: Action reboot (fe35c9fe-8e29-4ff4-b503-68efc147c0b8) for multix246 timed out
May 11 10:53:01 multix244 stonith-ng: [8081]: info: remote_op_done: Notifing clients of fe35c9fe-8e29-4ff4-b503-68efc147c0b8 (reboot of multix246 from fb6c99d7-2b68-4a23-9339-716046bb442e by (null)): 1, rc=-7
May 11 10:53:01 multix244 stonith-ng: [8081]: info: stonith_notify_client: Sending st_fence-notification to client 8086/19f1e5a5-744b-486b-9d2b-4dc3d33aea2c
May 11 10:53:01 multix244 crmd: [8086]: info: tengine_stonith_callbackStonithOp <remote-op state="1" st_target="multix246" st_op="reboot" />
May 11 10:53:01 multix244 crmd: [8086]: info: tengine_stonith_callback: Stonith operation 132/43:214:0:c8937c75-348a-453a-a120-3dfd7e55c4c4: Operation timed out (-7)
May 11 10:53:01 multix244 crmd: [8086]: ERROR: tengine_stonith_callback: Stonith of multix246 failed (-7)... aborting transition.
May 11 10:53:01 multix244 crmd: [8086]: info: abort_transition_graph: tengine_stonith_callback:408 - Triggered transition abort (complete=0) : Stonith failed
May 11 10:53:01 multix244 crmd: [8086]: ERROR: tengine_stonith_notify: Peer multix246 could not be terminated (reboot) by <anyone> for multix244 (ref=fe35c9fe-8e29-4ff4-b503-68efc147c0b8): Operation timed out
May 11 10:53:01 multix244 crmd: [8086]: info: run_graph: ====================================================
...
...
...

