[ClusterLabs] file system resource becomes inaccesible when any of the node goes down

Muhammad Sharfuddin M.Sharfuddin at nds.com.pk
Mon Jul 6 09:42:08 UTC 2015


yes, per the digimer(lists at alteeve.ca) recommendation now I have fencing 
handler configuration enabled, but
still ocfs2 file system resource was inaccessible on surviving/online 
node(node2).

node1 went offline at 14:18:02, and again /sharedata(ocfs2 file system 
resource) remained inaccessible
in between 14:18:16 and 14:19:00 on node2

/var/log/messages on node2:
Jul  6 14:18:06 node2 kernel: [ 4249.199514] drbd r0: PingAck did not 
arrive in time.
Jul  6 14:18:06 node2 kernel: [ 4249.199531] drbd r0: peer( Primary -> 
Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jul  6 14:18:06 node2 kernel: [ 4249.199712] block drbd0: new current 
UUID 5B915DB0ED4CBDBB:515A2C9604ACF569:3C10859F1F7EF431:3C0F859F1F7EF431
Jul  6 14:18:06 node2 kernel: [ 4249.199717] drbd r0: asender terminated
Jul  6 14:18:06 node2 kernel: [ 4249.199719] drbd r0: Terminating drbd_a_r0
Jul  6 14:18:06 node2 kernel: [ 4249.200032] drbd r0: Connection closed
Jul  6 14:18:06 node2 kernel: [ 4249.200062] drbd r0: conn( 
NetworkFailure -> Unconnected )
Jul  6 14:18:06 node2 kernel: [ 4249.200064] drbd r0: receiver terminated
Jul  6 14:18:06 node2 kernel: [ 4249.200065] drbd r0: Restarting 
receiver thread
Jul  6 14:18:06 node2 kernel: [ 4249.200066] drbd r0: receiver (re)started
Jul  6 14:18:06 node2 kernel: [ 4249.200069] drbd r0: conn( Unconnected 
-> WFConnection )
Jul  6 14:18:06 node2 kernel: [ 4249.200191] drbd r0: helper command: 
/sbin/drbdadm fence-peer r0
Jul  6 14:18:06 node2 crm-fence-peer.sh[8804]: invoked for r0
Jul  6 14:18:10 node2 crm-fence-peer.sh[8804]: No messages received in 3 
seconds.. aborting
Jul  6 14:18:10 node2 corosync[4034]:  [TOTEM ] A processor failed, 
forming new configuration.
Jul  6 14:18:15 node2 crm-fence-peer.sh[8804]: No messages received in 3 
seconds.. aborting
Jul  6 14:18:16 node2 corosync[4034]:  [CLM   ] CLM CONFIGURATION CHANGE
Jul  6 14:18:16 node2 corosync[4034]:  [CLM   ] New Configuration:
Jul  6 14:18:16 node2 corosync[4034]:  [CLM   ]     r(0) ip(172.16.241.132)
Jul  6 14:18:16 node2 corosync[4034]:  [CLM   ] Members Left:
Jul  6 14:18:16 node2 corosync[4034]:  [CLM   ]     r(0) ip(172.16.241.131)
Jul  6 14:18:16 node2 corosync[4034]:  [CLM   ] Members Joined:
Jul  6 14:18:16 node2 corosync[4034]:  [pcmk  ] notice: 
pcmk_peer_update: Transitional membership event on ring 296: memb=1, 
new=0, lost=1
Jul  6 14:18:16 node2 corosync[4034]:  [pcmk  ] info: pcmk_peer_update: 
memb: node2 739307908
Jul  6 14:18:16 node2 corosync[4034]:  [pcmk  ] info: pcmk_peer_update: 
lost: node1 739307907
Jul  6 14:18:16 node2 corosync[4034]:  [CLM   ] CLM CONFIGURATION CHANGE
Jul  6 14:18:16 node2 corosync[4034]:  [CLM   ] New Configuration:
Jul  6 14:18:16 node2 corosync[4034]:  [CLM   ]     r(0) ip(172.16.241.132)
Jul  6 14:18:16 node2 cluster-dlm[4386]:   notice: 
plugin_handle_membership: Membership 296: quorum lost
Jul  6 14:18:16 node2 ocfs2_controld[4516]:   notice: 
plugin_handle_membership: Membership 296: quorum lost
Jul  6 14:18:16 node2 corosync[4034]:  [CLM   ] Members Left:
Jul  6 14:18:16 node2 cluster-dlm[4386]:   notice: 
crm_update_peer_state: plugin_handle_membership: Node node1[739307907] - 
state is now lost (was member)
Jul  6 14:18:16 node2 cluster-dlm[4386]: update_cluster: Processing 
membership 296
Jul  6 14:18:16 node2 cluster-dlm[4386]: dlm_process_node: Skipped 
active node 739307908: born-on=288, last-seen=296, this-event=296, 
last-event=292
Jul  6 14:18:16 node2 cluster-dlm[4386]: del_configfs_node: 
del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/739307907"
Jul  6 14:18:16 node2 ocfs2_controld[4516]:   notice: 
crm_update_peer_state: plugin_handle_membership: Node node1[739307907] - 
state is now lost (was member)
Jul  6 14:18:16 node2 corosync[4034]:  [CLM   ] Members Joined:
Jul  6 14:18:16 node2 cluster-dlm[4386]: dlm_process_node: Removed 
inactive node 739307907: born-on=292, last-seen=292, this-event=296, 
last-event=292
Jul  6 14:18:16 node2 cib[4039]:   notice: plugin_handle_membership: 
Membership 296: quorum lost
Jul  6 14:18:16 node2 stonith-ng[4040]:   notice: 
plugin_handle_membership: Membership 296: quorum lost
Jul  6 14:18:16 node2 crmd[4044]:   notice: plugin_handle_membership: 
Membership 296: quorum lost
Jul  6 14:18:16 node2 ocfs2_controld[4516]: confchg called
Jul  6 14:18:16 node2 corosync[4034]:  [pcmk  ] notice: 
pcmk_peer_update: Stable membership event on ring 296: memb=1, new=0, lost=0
Jul  6 14:18:16 node2 cluster-dlm[4386]: log_config: dlm:controld conf 1 
0 1 memb 739307908 join left 739307907
Jul  6 14:18:16 node2 cib[4039]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node node1[739307907] - state is now lost (was 
member)
Jul  6 14:18:16 node2 stonith-ng[4040]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node node1[739307907] - state is now lost (was 
member)
Jul  6 14:18:16 node2 crmd[4044]:  warning: match_down_event: No match 
for shutdown action on node1
Jul  6 14:18:16 node2 ocfs2_controld[4516]: ocfs2_controld (group 
"ocfs2:controld") confchg: members 1, left 1, joined 0
Jul  6 14:18:16 node2 corosync[4034]:  [pcmk  ] info: pcmk_peer_update: 
MEMB: node2 739307908
Jul  6 14:18:16 node2 cluster-dlm[4386]: log_config: 
dlm:ls:12925001B70F4A529491DFF16F8B8352 conf 1 0 1 memb 739307908 join 
left 739307907
Jul  6 14:18:16 node2 crmd[4044]:   notice: peer_update_callback: 
Stonith/shutdown of node1 not matched
Jul  6 14:18:16 node2 ocfs2_controld[4516]: node daemon left 739307907
Jul  6 14:18:16 node2 corosync[4034]:  [pcmk  ] info: 
ais_mark_unseen_peer_dead: Node node1 was not seen in the previous 
transition
Jul  6 14:18:16 node2 cluster-dlm[4386]: node_history_fail: 
12925001B70F4A529491DFF16F8B8352 check_fs nodeid 739307907 set
Jul  6 14:18:16 node2 crmd[4044]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node node1[739307907] - state is now lost (was 
member)
Jul  6 14:18:16 node2 ocfs2_controld[4516]: node down 739307907
Jul  6 14:18:16 node2 ocfs2_controld[4516]: Node 739307907 has left 
mountgroup 12925001B70F4A529491DFF16F8B8352
Jul  6 14:18:16 node2 corosync[4034]:  [pcmk  ] info: update_member: 
Node 739307907/node1 is now: lost
Jul  6 14:18:16 node2 cluster-dlm[4386]: add_change: 
12925001B70F4A529491DFF16F8B8352 add_change cg 3 remove nodeid 739307907 
reason 3
Jul  6 14:18:16 node2 crmd[4044]:  warning: match_down_event: No match 
for shutdown action on node1
Jul  6 14:18:16 node2 corosync[4034]:  [pcmk  ] info: 
send_member_notification: Sending membership update 296 to 5 children
Jul  6 14:18:16 node2 corosync[4034]:  [TOTEM ] A processor joined or 
left the membership and a new membership was formed.
Jul  6 14:18:16 node2 cluster-dlm[4386]: add_change: 
12925001B70F4A529491DFF16F8B8352 add_change cg 3 counts member 1 joined 
0 remove 1 failed 1
Jul  6 14:18:16 node2 crmd[4044]:   notice: peer_update_callback: 
Stonith/shutdown of node1 not matched
Jul  6 14:18:16 node2 corosync[4034]:  [CPG   ] chosen downlist: sender 
r(0) ip(172.16.241.132) ; members(old:2 left:1)
Jul  6 14:18:16 node2 cluster-dlm[4386]: stop_kernel: 
12925001B70F4A529491DFF16F8B8352 stop_kernel cg 3
Jul  6 14:18:16 node2 cluster-dlm[4386]: do_sysfs: write "0" to 
"/sys/kernel/dlm/12925001B70F4A529491DFF16F8B8352/control"
Jul  6 14:18:16 node2 sbd: [4022]: WARN: CIB: We do NOT have quorum!
Jul  6 14:18:16 node2 sbd: [4020]: WARN: Pacemaker health check: UNHEALTHY
Jul  6 14:18:16 node2 corosync[4034]:  [MAIN  ] Completed service 
synchronization, ready to provide service.
Jul  6 14:18:16 node2 cluster-dlm[4386]: fence_node_time: Node 
739307907/node1 has not been shot yet
Jul  6 14:18:16 node2 cluster-dlm[4386]: check_fencing_done: 
12925001B70F4A529491DFF16F8B8352 check_fencing 739307907 wait add 
1436173156 fail 1436174296 last 0
Jul  6 14:18:16 node2 ocfs2_controld[4516]: Sending notification of node 
739307907 for "12925001B70F4A529491DFF16F8B8352"
Jul  6 14:18:16 node2 crmd[4044]:   notice: crm_update_quorum: Updating 
quorum status to false (call=88)
Jul  6 14:18:16 node2 ocfs2_controld[4516]: confchg called
Jul  6 14:18:16 node2 ocfs2_controld[4516]: group 
"ocfs2:12925001B70F4A529491DFF16F8B8352" confchg: members 1, left 1, 
joined 0
Jul  6 14:18:16 node2 cluster-dlm[4386]: set_fs_notified: 
12925001B70F4A529491DFF16F8B8352 set_fs_notified nodeid 739307907
Jul  6 14:18:16 node2 crmd[4044]:   notice: do_state_transition: State 
transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN 
cause=C_FSA_INTERNAL origin=check_join_state ]
Jul  6 14:18:16 node2 ocfs2_controld[4516]: message from dlmcontrol
Jul  6 14:18:16 node2 ocfs2_controld[4516]: Notified for 
"12925001B70F4A529491DFF16F8B8352", node 739307907, status 0
Jul  6 14:18:16 node2 crmd[4044]:   notice: crm_update_quorum: Updating 
quorum status to false (call=93)
Jul  6 14:18:16 node2 attrd[4042]:   notice: attrd_local_callback: 
Sending full refresh (origin=crmd)
Jul  6 14:18:16 node2 ocfs2_controld[4516]: Completing notification on 
"12925001B70F4A529491DFF16F8B8352" for node 739307907
Jul  6 14:18:16 node2 pengine[4043]:   notice: unpack_config: On loss of 
CCM Quorum: Ignore
Jul  6 14:18:16 node2 pengine[4043]:  warning: pe_fence_node: Node node1 
will be fenced because the node is no longer part of the cluster
Jul  6 14:18:16 node2 attrd[4042]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: shutdown (0)
Jul  6 14:18:16 node2 pengine[4043]:  warning: determine_online_status: 
Node node1 is unclean
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-drbd:1_demote_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 attrd[4042]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: master-p-drbd (10000)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-drbd:1_stop_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-drbd:1_demote_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-drbd:1_stop_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-drbd:1_demote_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-drbd:1_stop_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-drbd:1_demote_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-drbd:1_stop_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-dlm:1_stop_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-dlm:1_stop_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 attrd[4042]:   notice: attrd_trigger_update: 
Sending flush op to all hosts for: probe_complete (true)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-o2cb:1_stop_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-o2cb:1_stop_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-vg1:1_stop_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-vg1:1_stop_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-sharefs:1_stop_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 pengine[4043]:  warning: custom_action: Action 
p-sharefs:1_stop_0 on node1 is unrunnable (offline)
Jul  6 14:18:16 node2 pengine[4043]:  warning: stage6: Scheduling Node 
node1 for STONITH
Jul  6 14:18:16 node2 pengine[4043]:    error: rsc_expand_action: 
Couldn't expand cl-lock_demote_0
Jul  6 14:18:16 node2 pengine[4043]:    error: crm_abort: 
clone_update_actions_interleave: Triggered assert at clone.c:1260 : 
first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
Jul  6 14:18:16 node2 kernel: [ 4259.779734] dlm: closing connection to 
node 739307907
Jul  6 14:18:16 node2 pengine[4043]:    error: 
clone_update_actions_interleave: No action found for demote in g-lock:0 
(first)
Jul  6 14:18:16 node2 pengine[4043]:    error: crm_abort: 
clone_update_actions_interleave: Triggered assert at clone.c:1260 : 
first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
Jul  6 14:18:16 node2 pengine[4043]:    error: 
clone_update_actions_interleave: No action found for demote in g-lock:0 
(first)
Jul  6 14:18:16 node2 pengine[4043]:    error: crm_abort: 
clone_update_actions_interleave: Triggered assert at clone.c:1260 : 
first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
Jul  6 14:18:16 node2 pengine[4043]:    error: 
clone_update_actions_interleave: No action found for demote in g-lock:0 
(first)
Jul  6 14:18:16 node2 pengine[4043]:    error: crm_abort: 
clone_update_actions_interleave: Triggered assert at clone.c:1260 : 
first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
Jul  6 14:18:16 node2 pengine[4043]:    error: 
clone_update_actions_interleave: No action found for demote in g-lock:0 
(first)
Jul  6 14:18:16 node2 pengine[4043]:   notice: LogActions: Demote 
p-drbd:1    (Master -> Stopped node1)
Jul  6 14:18:16 node2 pengine[4043]:   notice: LogActions: Stop 
p-dlm:1    (node1)
Jul  6 14:18:16 node2 pengine[4043]:   notice: LogActions: Stop 
p-o2cb:1    (node1)
Jul  6 14:18:16 node2 pengine[4043]:   notice: LogActions: Stop 
p-vg1:1    (node1)
Jul  6 14:18:16 node2 pengine[4043]:   notice: LogActions: Stop 
p-sharefs:1    (node1)
Jul  6 14:18:16 node2 pengine[4043]:  warning: process_pe_message: 
Calculated Transition 9: /var/lib/pacemaker/pengine/pe-warn-12.bz2
Jul  6 14:18:16 node2 crmd[4044]:   notice: do_te_invoke: Processing 
graph 9 (ref=pe_calc-dc-1436174296-89) derived from 
/var/lib/pacemaker/pengine/pe-warn-12.bz2
Jul  6 14:18:16 node2 crmd[4044]:   notice: te_fence_node: Executing 
reboot fencing operation (79) on node1 (timeout=90000)
Jul  6 14:18:16 node2 crmd[4044]:   notice: te_rsc_command: Initiating 
action 95: notify p-drbd_pre_notify_demote_0 on node2 (local)
Jul  6 14:18:16 node2 stonith-ng[4040]:   notice: handle_request: Client 
crmd.4044.d27f51e3 wants to fence (reboot) 'node1' with device '(any)'
Jul  6 14:18:16 node2 stonith-ng[4040]:   notice: 
initiate_remote_stonith_op: Initiating remote operation reboot for 
node1: e5df84ee-ff0a-41de-90f8-36f5279a8a3a (0)
Jul  6 14:18:16 node2 crmd[4044]:   notice: process_lrm_event: LRM 
operation p-drbd_notify_0 (call=53, rc=0, cib-update=0, confirmed=true) ok
Jul  6 14:18:17 node2 stonith-ng[4040]:   notice: 
can_fence_host_with_device: sbd_stonith can fence node1: dynamic-list
Jul  6 14:18:17 node2 sbd: [8902]: info: Delivery process handling 
/dev/disk/by-id/scsi-1494554000000000035c71550e83586b70dfc77c3b382dfbd
Jul  6 14:18:17 node2 sbd: [8902]: info: Device UUID: 
4e398182-4894-4f11-b84e-6c6ad4c02614
Jul  6 14:18:17 node2 sbd: [8902]: info: Writing reset to node slot node1
Jul  6 14:18:17 node2 sbd: [8902]: info: Messaging delay: 40
Jul  6 14:18:57 node2 sbd: [8902]: info: reset successfully delivered to 
node1
Jul  6 14:18:57 node2 sbd: [8901]: info: Message successfully delivered.
Jul  6 14:18:58 node2 stonith-ng[4040]:   notice: log_operation: 
Operation 'reboot' [8890] (call 3 from crmd.4044) for host 'node1' with 
device 'sbd_stonith' returned: 0 (OK)
Jul  6 14:18:58 node2 stonith-ng[4040]:   notice: remote_op_done: 
Operation reboot of node1 by node2 for crmd.4044 at node2.e5df84ee: OK
Jul  6 14:18:58 node2 crmd[4044]:   notice: tengine_stonith_callback: 
Stonith operation 3/79:9:0:2a82d4bb-bfa9-4391-9214-8a030c1a9fbd: OK (0)
Jul  6 14:18:58 node2 crmd[4044]:   notice: tengine_stonith_notify: Peer 
node1 was terminated (reboot) by node2 for node2: OK 
(ref=e5df84ee-ff0a-41de-90f8-36f5279a8a3a) by client crmd.4044
Jul  6 14:18:58 node2 crmd[4044]:   notice: te_rsc_command: Initiating 
action 96: notify p-drbd_post_notify_demote_0 on node2 (local)
Jul  6 14:18:58 node2 crmd[4044]:   notice: process_lrm_event: LRM 
operation p-drbd_notify_0 (call=54, rc=0, cib-update=0, confirmed=true) ok
Jul  6 14:18:58 node2 crmd[4044]:   notice: run_graph: Transition 9 
(Complete=24, Pending=0, Fired=0, Skipped=6, Incomplete=5, 
Source=/var/lib/pacemaker/pengine/pe-warn-12.bz2): Stopped
Jul  6 14:18:58 node2 pengine[4043]:   notice: unpack_config: On loss of 
CCM Quorum: Ignore
Jul  6 14:18:58 node2 pengine[4043]:    error: rsc_expand_action: 
Couldn't expand cl-lock_demote_0
Jul  6 14:18:58 node2 pengine[4043]:    error: crm_abort: 
clone_update_actions_interleave: Triggered assert at clone.c:1260 : 
first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
Jul  6 14:18:58 node2 pengine[4043]:    error: 
clone_update_actions_interleave: No action found for demote in g-lock:0 
(first)
Jul  6 14:18:58 node2 pengine[4043]:    error: crm_abort: 
clone_update_actions_interleave: Triggered assert at clone.c:1260 : 
first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
Jul  6 14:18:58 node2 pengine[4043]:    error: 
clone_update_actions_interleave: No action found for demote in g-lock:0 
(first)
Jul  6 14:18:58 node2 pengine[4043]:    error: crm_abort: 
clone_update_actions_interleave: Triggered assert at clone.c:1260 : 
first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
Jul  6 14:18:58 node2 pengine[4043]:    error: 
clone_update_actions_interleave: No action found for demote in g-lock:0 
(first)
Jul  6 14:18:58 node2 pengine[4043]:    error: crm_abort: 
clone_update_actions_interleave: Triggered assert at clone.c:1260 : 
first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
Jul  6 14:18:58 node2 pengine[4043]:    error: 
clone_update_actions_interleave: No action found for demote in g-lock:0 
(first)
Jul  6 14:18:58 node2 pengine[4043]:   notice: process_pe_message: 
Calculated Transition 10: /var/lib/pacemaker/pengine/pe-input-90.bz2
Jul  6 14:18:58 node2 crmd[4044]:   notice: do_te_invoke: Processing 
graph 10 (ref=pe_calc-dc-1436174338-94) derived from 
/var/lib/pacemaker/pengine/pe-input-90.bz2
Jul  6 14:18:58 node2 crmd[4044]:   notice: run_graph: Transition 10 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-90.bz2): Complete
Jul  6 14:18:58 node2 crmd[4044]:   notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Jul  6 14:18:58 node2 cibadmin[9242]:   notice: crm_log_args: Invoked: 
cibadmin -C -o constraints -X <rsc_location rsc="ms-drbd" 
id="drbd-fence-by-handler-r0-ms-drbd">   <rule role="Master" 
score="-INFINITY" id="drbd-fence-by-handler-r0-rule-ms-drbd">     
<expression attribute="#uname" operation="ne" value="node2" 
id="drbd-fence-by-handler-r0-expr-ms-drbd"/>   </rule> </rsc_location>
Jul  6 14:18:58 node2 crmd[4044]:   notice: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC 
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jul  6 14:18:58 node2 stonith-ng[4040]:   notice: unpack_config: On loss 
of CCM Quorum: Ignore
Jul  6 14:18:58 node2 cib[4039]:   notice: cib:diff: Diff: --- 0.119.20
Jul  6 14:18:58 node2 cib[4039]:   notice: cib:diff: Diff: +++ 0.120.1 
e4369bcefc4d280545e8b56fd98b0042
Jul  6 14:18:58 node2 cib[4039]:   notice: cib:diff: -- <cib 
admin_epoch="0" epoch="119" num_updates="20"/>
Jul  6 14:18:58 node2 cib[4039]:   notice: cib:diff: ++ <rsc_location 
rsc="ms-drbd" id="drbd-fence-by-handler-r0-ms-drbd">
Jul  6 14:18:58 node2 cib[4039]:   notice: cib:diff: ++ <rule 
role="Master" score="-INFINITY" id="drbd-fence-by-handler-r0-rule-ms-drbd">
Jul  6 14:18:58 node2 cib[4039]:   notice: cib:diff: ++ <expression 
attribute="#uname" operation="ne" value="node2" 
id="drbd-fence-by-handler-r0-expr-ms-drbd"/>
Jul  6 14:18:58 node2 cib[4039]:   notice: cib:diff: ++ </rule>
Jul  6 14:18:58 node2 cib[4039]:   notice: cib:diff: ++ </rsc_location>
Jul  6 14:18:58 node2 crm-fence-peer.sh[8804]: INFO peer is fenced, my 
disk is UpToDate: placed constraint 'drbd-fence-by-handler-r0-ms-drbd'
Jul  6 14:18:58 node2 pengine[4043]:   notice: unpack_config: On loss of 
CCM Quorum: Ignore
Jul  6 14:18:58 node2 pengine[4043]:    error: rsc_expand_action: 
Couldn't expand cl-lock_demote_0
Jul  6 14:18:58 node2 pengine[4043]:    error: crm_abort: 
clone_update_actions_interleave: Triggered assert at clone.c:1260 : 
first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
Jul  6 14:18:58 node2 pengine[4043]:    error: 
clone_update_actions_interleave: No action found for demote in g-lock:0 
(first)
Jul  6 14:18:58 node2 pengine[4043]:    error: crm_abort: 
clone_update_actions_interleave: Triggered assert at clone.c:1260 : 
first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
Jul  6 14:18:58 node2 pengine[4043]:    error: 
clone_update_actions_interleave: No action found for demote in g-lock:0 
(first)
Jul  6 14:18:58 node2 pengine[4043]:    error: crm_abort: 
clone_update_actions_interleave: Triggered assert at clone.c:1260 : 
first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
Jul  6 14:18:58 node2 pengine[4043]:    error: 
clone_update_actions_interleave: No action found for demote in g-lock:0 
(first)
Jul  6 14:18:58 node2 pengine[4043]:    error: crm_abort: 
clone_update_actions_interleave: Triggered assert at clone.c:1260 : 
first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
Jul  6 14:18:58 node2 pengine[4043]:    error: 
clone_update_actions_interleave: No action found for demote in g-lock:0 
(first)
Jul  6 14:18:58 node2 pengine[4043]:   notice: process_pe_message: 
Calculated Transition 11: /var/lib/pacemaker/pengine/pe-input-91.bz2
Jul  6 14:18:58 node2 crmd[4044]:   notice: do_te_invoke: Processing 
graph 11 (ref=pe_calc-dc-1436174338-95) derived from 
/var/lib/pacemaker/pengine/pe-input-91.bz2
Jul  6 14:18:58 node2 crmd[4044]:   notice: run_graph: Transition 11 
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-91.bz2): Complete
Jul  6 14:18:58 node2 crmd[4044]:   notice: do_state_transition: State 
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
cause=C_FSA_INTERNAL origin=notify_crmd ]
Jul  6 14:18:58 node2 kernel: [ 4301.717745] drbd r0: helper command: 
/sbin/drbdadm fence-peer r0 exit code 7 (0x700)
Jul  6 14:18:58 node2 kernel: [ 4301.717748] drbd r0: fence-peer() = 7 
&& fencing != Stonith !!!
Jul  6 14:18:58 node2 kernel: [ 4301.717750] drbd r0: fence-peer helper 
returned 7 (peer was stonithed)
Jul  6 14:18:58 node2 kernel: [ 4301.717760] drbd r0: pdsk( DUnknown -> 
Outdated )
Jul  6 14:18:58 node2 cluster-dlm[4386]: fence_node_time: Node 
739307907/node1 was last shot 'now'
Jul  6 14:18:58 node2 cluster-dlm[4386]: check_fencing_done: 
12925001B70F4A529491DFF16F8B8352 check_fencing 739307907 done add 
1436173156 fail 1436174296 last 1436174338
Jul  6 14:18:58 node2 cluster-dlm[4386]: check_fencing_done: 
12925001B70F4A529491DFF16F8B8352 check_fencing done
Jul  6 14:18:58 node2 cluster-dlm[4386]: check_quorum_done: 
12925001B70F4A529491DFF16F8B8352 check_quorum disabled
Jul  6 14:18:58 node2 cluster-dlm[4386]: check_fs_done: 
12925001B70F4A529491DFF16F8B8352 check_fs nodeid 739307907 clear
Jul  6 14:18:58 node2 cluster-dlm[4386]: check_fs_done: 
12925001B70F4A529491DFF16F8B8352 check_fs done
Jul  6 14:18:58 node2 cluster-dlm[4386]: send_info: 
12925001B70F4A529491DFF16F8B8352 send_start cg 3 flags 2 data2 0 counts 
2 1 0 1 1
Jul  6 14:18:58 node2 cluster-dlm[4386]: receive_start: 
12925001B70F4A529491DFF16F8B8352 receive_start 739307908:3 len 76
Jul  6 14:18:58 node2 cluster-dlm[4386]: match_change: 
12925001B70F4A529491DFF16F8B8352 match_change 739307908:3 matches cg 3
Jul  6 14:18:58 node2 cluster-dlm[4386]: wait_messages_done: 
12925001B70F4A529491DFF16F8B8352 wait_messages cg 3 got all 1
Jul  6 14:18:58 node2 cluster-dlm[4386]: start_kernel: 
12925001B70F4A529491DFF16F8B8352 start_kernel cg 3 member_count 1
Jul  6 14:18:58 node2 cluster-dlm[4386]: update_dir_members: dir_member 
739307907
Jul  6 14:18:58 node2 cluster-dlm[4386]: update_dir_members: dir_member 
739307908
Jul  6 14:18:58 node2 cluster-dlm[4386]: set_configfs_members: 
set_members rmdir 
"/sys/kernel/config/dlm/cluster/spaces/12925001B70F4A529491DFF16F8B8352/nodes/739307907"
Jul  6 14:18:58 node2 cluster-dlm[4386]: do_sysfs: write "1" to 
"/sys/kernel/dlm/12925001B70F4A529491DFF16F8B8352/control"
Jul  6 14:18:58 node2 kernel: [ 4301.867106] 
(ocfs2rec,8858,0):ocfs2_replay_journal:1549 Recovering node 739307907 
from slot 1 on device (253,0)
Jul  6 14:18:59 node2 kernel: [ 4302.860366] 
(ocfs2rec,8858,0):ocfs2_begin_quota_recovery:407 Beginning quota 
recovery in slot 1
Jul  6 14:18:59 node2 kernel: [ 4302.862811] 
(kworker/u:2,8448,0):ocfs2_finish_quota_recovery:599 Finishing quota 
recovery in slot 1



--
Regards,

Muhammad Sharfuddin

On 07/06/2015 11:31 AM, emmanuel segura wrote:
> remember you need to configure the fencing handler in your drbd as
> somebody told you and other mail.
>
> 2015-07-06 7:23 GMT+02:00 Muhammad Sharfuddin <M.Sharfuddin at nds.com.pk>:
>> If I understand you, then the issue of inaccessibility of ocfs2 file system
>> on surviving/online nodes is expected
>> due to stonith msgwait timeouts, and the only possible solution to avoid the
>> delay/inaccessibility of ocfs2 file system
>> is to reduce the msgwait timeout values, which is generally not recommended
>> for the other clusters operations.
>>
>> Due to inaccessibility/delay issue, currently this file system cluster is
>> not perfect because a few services which I starts
>> manually(non-clustered) and runs on both the nodes obviously requires
>> /sharedata(ocfs2) availability to run/function.
>>
>> --
>> Regards,
>>
>> Muhammad Sharfuddin
>>
>> On 07/06/2015 02:05 AM, emmanuel segura wrote:
>>
>> when a node goes down, you will see the node in unclean state, how you
>> see in your logs, forming new configuration(corosync) -> stonith
>> reboot request -> and you are using sbd and the node become offline
>> after thet msgwait is expired, when msgwait is expired pacemaker knows
>> the node is dead and than the ocfs will ok, stonith->dlm->ocfs2, if
>> you need to reduce the msgwait you need to be careful about others
>> timeouts and cluster problems.
>>
>> 2015-07-05 18:13 GMT+02:00 Muhammad Sharfuddin <M.Sharfuddin at nds.com.pk>:
>>
>> SLES 11 SP 3 + online updates(pacemaker-1.1.11-0.8.11.70
>> openais-1.1.4-5.22.1.7)
>>
>> Its a dual primary drbd cluster, which mounts a file system resource on both
>> the cluster nodes simultaneously(file system type is ocfs2).
>>
>> Whenever any of the nodes goes down, the file system(/sharedata) become
>> inaccessible for exact 35 seconds on the other (surviving/online) node, and
>> then become available again on the online node.
>>
>> Please help me understand why the node which survives or remains online
>> unable to access the file system resource(/sharedata) for 35 seconds ? and
>> how can I fix the cluster so that file system remains accessible on the
>> surviving node without any interruption/delay(as in my case of about 35
>> seconds)
>>
>> By inaccessible, I meant to say that running "ls -l /sharedata" and "df
>> /sharedata" does not return any output and does not return the prompt back
>> on the online node for exact 35 seconds once the other node becomes offline.
>>
>> e.g "node1" got offline somewhere around  01:37:15, and then /sharedata file
>> system was inaccessible during 01:37:35 and 01:38:18 on the online node i.e
>> "node2".
>>
>>
>> /var/log/messages on node2, when node1 went offline:
>> Jul  5 01:37:26 node2 kernel: [  675.255865] drbd r0: PingAck did not arrive
>> in time.
>> Jul  5 01:37:26 node2 kernel: [  675.255886] drbd r0: peer( Primary ->
>> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
>> Jul  5 01:37:26 node2 kernel: [  675.256030] block drbd0: new current UUID
>> C23D1458962AD18D:A8DD404C9F563391:6A5F4A26F64BAF0B:6A5E4A26F64BAF0B
>> Jul  5 01:37:26 node2 kernel: [  675.256079] drbd r0: asender terminated
>> Jul  5 01:37:26 node2 kernel: [  675.256081] drbd r0: Terminating drbd_a_r0
>> Jul  5 01:37:26 node2 kernel: [  675.256306] drbd r0: Connection closed
>> Jul  5 01:37:26 node2 kernel: [  675.256338] drbd r0: conn( NetworkFailure
>> -> Unconnected )
>> Jul  5 01:37:26 node2 kernel: [  675.256339] drbd r0: receiver terminated
>> Jul  5 01:37:26 node2 kernel: [  675.256340] drbd r0: Restarting receiver
>> thread
>> Jul  5 01:37:26 node2 kernel: [  675.256341] drbd r0: receiver (re)started
>> Jul  5 01:37:26 node2 kernel: [  675.256344] drbd r0: conn( Unconnected ->
>> WFConnection )
>> Jul  5 01:37:29 node2 corosync[4040]:  [TOTEM ] A processor failed, forming
>> new configuration.
>> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] CLM CONFIGURATION CHANGE
>> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] New Configuration:
>> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ]     r(0) ip(172.16.241.132)
>> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] Members Left:
>> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ]     r(0) ip(172.16.241.131)
>> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] Members Joined:
>> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] notice: pcmk_peer_update:
>> Transitional membership event on ring 216: memb=1, new=0, lost=1
>> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] info: pcmk_peer_update:
>> memb: node2 739307908
>> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] info: pcmk_peer_update:
>> lost: node1 739307907
>> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] CLM CONFIGURATION CHANGE
>> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] New Configuration:
>> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ]     r(0) ip(172.16.241.132)
>> Jul  5 01:37:35 node2 cluster-dlm[4344]:   notice: plugin_handle_membership:
>> Membership 216: quorum lost
>> Jul  5 01:37:35 node2 ocfs2_controld[4473]:   notice:
>> plugin_handle_membership: Membership 216: quorum lost
>> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] Members Left:
>> Jul  5 01:37:35 node2 crmd[4050]:   notice: plugin_handle_membership:
>> Membership 216: quorum lost
>> Jul  5 01:37:35 node2 stonith-ng[4046]:   notice: plugin_handle_membership:
>> Membership 216: quorum lost
>> Jul  5 01:37:35 node2 cib[4045]:   notice: plugin_handle_membership:
>> Membership 216: quorum lost
>> Jul  5 01:37:35 node2 cluster-dlm[4344]:   notice: crm_update_peer_state:
>> plugin_handle_membership: Node node1[739307907] - state is now lost (was
>> member)
>> Jul  5 01:37:35 node2 ocfs2_controld[4473]:   notice: crm_update_peer_state:
>> plugin_handle_membership: Node node1[739307907] - state is now lost (was
>> member)
>> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] Members Joined:
>> Jul  5 01:37:35 node2 crmd[4050]:  warning: match_down_event: No match for
>> shutdown action on node1
>> Jul  5 01:37:35 node2 stonith-ng[4046]:   notice: crm_update_peer_state:
>> plugin_handle_membership: Node node1[739307907] - state is now lost (was
>> member)
>> Jul  5 01:37:35 node2 cib[4045]:   notice: crm_update_peer_state:
>> plugin_handle_membership: Node node1[739307907] - state is now lost (was
>> member)
>> Jul  5 01:37:35 node2 cluster-dlm[4344]: update_cluster: Processing
>> membership 216
>> Jul  5 01:37:35 node2 ocfs2_controld[4473]: confchg called
>> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] notice: pcmk_peer_update:
>> Stable membership event on ring 216: memb=1, new=0, lost=0
>> Jul  5 01:37:35 node2 crmd[4050]:   notice: peer_update_callback:
>> Stonith/shutdown of node1 not matched
>> Jul  5 01:37:35 node2 cluster-dlm[4344]: dlm_process_node: Skipped active
>> node 739307908: born-on=204, last-seen=216, this-event=216, last-event=212
>> Jul  5 01:37:35 node2 ocfs2_controld[4473]: ocfs2_controld (group
>> "ocfs2:controld") confchg: members 1, left 1, joined 0
>> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] info: pcmk_peer_update:
>> MEMB: node2 739307908
>> Jul  5 01:37:35 node2 crmd[4050]:   notice: crm_update_peer_state:
>> plugin_handle_membership: Node node1[739307907] - state is now lost (was
>> member)
>> Jul  5 01:37:35 node2 cluster-dlm[4344]: del_configfs_node:
>> del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/739307907"
>> Jul  5 01:37:35 node2 ocfs2_controld[4473]: node daemon left 739307907
>> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] info:
>> ais_mark_unseen_peer_dead: Node node1 was not seen in the previous
>> transition
>> Jul  5 01:37:35 node2 crmd[4050]:  warning: match_down_event: No match for
>> shutdown action on node1
>> Jul  5 01:37:35 node2 cluster-dlm[4344]: dlm_process_node: Removed inactive
>> node 739307907: born-on=212, last-seen=212, this-event=216, last-event=212
>> Jul  5 01:37:35 node2 ocfs2_controld[4473]: node down 739307907
>> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] info: update_member: Node
>> 739307907/node1 is now: lost
>> Jul  5 01:37:35 node2 crmd[4050]:   notice: peer_update_callback:
>> Stonith/shutdown of node1 not matched
>> Jul  5 01:37:35 node2 cluster-dlm[4344]: log_config: dlm:controld conf 1 0 1
>> memb 739307908 join left 739307907
>> Jul  5 01:37:35 node2 ocfs2_controld[4473]: Node 739307907 has left
>> mountgroup 12925001B70F4A529491DFF16F8B8352
>> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] info:
>> send_member_notification: Sending membership update 216 to 5 children
>> Jul  5 01:37:35 node2 sbd: [4028]: WARN: CIB: We do NOT have quorum!
>> Jul  5 01:37:35 node2 sbd: [4026]: WARN: Pacemaker health check: UNHEALTHY
>> Jul  5 01:37:35 node2 cluster-dlm[4344]: log_config:
>> dlm:ls:12925001B70F4A529491DFF16F8B8352 conf 1 0 1 memb 739307908 join left
>> 739307907
>> Jul  5 01:37:35 node2 corosync[4040]:  [TOTEM ] A processor joined or left
>> the membership and a new membership was formed.
>> Jul  5 01:37:35 node2 corosync[4040]:  [CPG   ] chosen downlist: sender r(0)
>> ip(172.16.241.132) ; members(old:2 left:1)
>> Jul  5 01:37:35 node2 cluster-dlm[4344]: node_history_fail:
>> 12925001B70F4A529491DFF16F8B8352 check_fs nodeid 739307907 set
>> Jul  5 01:37:35 node2 cluster-dlm[4344]: add_change:
>> 12925001B70F4A529491DFF16F8B8352 add_change cg 4 remove nodeid 739307907
>> reason 3
>> Jul  5 01:37:35 node2 corosync[4040]:  [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Jul  5 01:37:35 node2 cluster-dlm[4344]: add_change:
>> 12925001B70F4A529491DFF16F8B8352 add_change cg 4 counts member 1 joined 0
>> remove 1 failed 1
>> Jul  5 01:37:35 node2 cluster-dlm[4344]: stop_kernel:
>> 12925001B70F4A529491DFF16F8B8352 stop_kernel cg 4
>> Jul  5 01:37:35 node2 ocfs2_controld[4473]: Sending notification of node
>> 739307907 for "12925001B70F4A529491DFF16F8B8352"
>> Jul  5 01:37:35 node2 ocfs2_controld[4473]: confchg called
>> Jul  5 01:37:35 node2 cluster-dlm[4344]: do_sysfs: write "0" to
>> "/sys/kernel/dlm/12925001B70F4A529491DFF16F8B8352/control"
>> Jul  5 01:37:35 node2 ocfs2_controld[4473]: group
>> "ocfs2:12925001B70F4A529491DFF16F8B8352" confchg: members 1, left 1, joined
>> 0
>> Jul  5 01:37:35 node2 kernel: [  684.452042] dlm: closing connection to node
>> 739307907
>> Jul  5 01:37:35 node2 crmd[4050]:   notice: crm_update_quorum: Updating
>> quorum status to false (call=114)
>> Jul  5 01:37:35 node2 cluster-dlm[4344]: fence_node_time: Node
>> 739307907/node1 has not been shot yet
>> Jul  5 01:37:35 node2 crmd[4050]:   notice: do_state_transition: State
>> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
>> origin=check_join_state ]
>> Jul  5 01:37:35 node2 cluster-dlm[4344]: check_fencing_done:
>> 12925001B70F4A529491DFF16F8B8352 check_fencing 739307907 wait add 1436042043
>> fail 1436042255 last 0
>> Jul  5 01:37:35 node2 crmd[4050]:   notice: crm_update_quorum: Updating
>> quorum status to false (call=119)
>> Jul  5 01:37:35 node2 attrd[4048]:   notice: attrd_local_callback: Sending
>> full refresh (origin=crmd)
>> Jul  5 01:37:35 node2 attrd[4048]:   notice: attrd_trigger_update: Sending
>> flush op to all hosts for: shutdown (0)
>> Jul  5 01:37:35 node2 pengine[4049]:   notice: unpack_config: On loss of CCM
>> Quorum: Ignore
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: pe_fence_node: Node node1
>> will be fenced because the node is no longer part of the cluster
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: determine_online_status: Node
>> node1 is unclean
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-drbd:1_demote_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-drbd:1_stop_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-drbd:1_demote_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-drbd:1_stop_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 attrd[4048]:   notice: attrd_trigger_update: Sending
>> flush op to all hosts for: master-p-drbd (10000)
>> Jul  5 01:37:35 node2 cluster-dlm[4344]: set_fs_notified:
>> 12925001B70F4A529491DFF16F8B8352 set_fs_notified nodeid 739307907
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-drbd:1_demote_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 attrd[4048]:   notice: attrd_trigger_update: Sending
>> flush op to all hosts for: probe_complete (true)
>> Jul  5 01:37:35 node2 ocfs2_controld[4473]: message from dlmcontrol
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-drbd:1_stop_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 ocfs2_controld[4473]: Notified for
>> "12925001B70F4A529491DFF16F8B8352", node 739307907, status 0
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-drbd:1_demote_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 ocfs2_controld[4473]: Completing notification on
>> "12925001B70F4A529491DFF16F8B8352" for node 739307907
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-drbd:1_stop_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-dlm:1_stop_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-dlm:1_stop_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-o2cb:1_stop_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-o2cb:1_stop_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-vg1:1_stop_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-vg1:1_stop_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-sharefs:1_stop_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
>> p-sharefs:1_stop_0 on node1 is unrunnable (offline)
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: stage6: Scheduling Node node1
>> for STONITH
>> Jul  5 01:37:35 node2 pengine[4049]:    error: rsc_expand_action: Couldn't
>> expand cl-lock_demote_0
>> Jul  5 01:37:35 node2 pengine[4049]:    error: crm_abort:
>> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
>> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
>> Jul  5 01:37:35 node2 pengine[4049]:    error:
>> clone_update_actions_interleave: No action found for demote in g-lock:0
>> (first)
>> Jul  5 01:37:35 node2 pengine[4049]:    error: crm_abort:
>> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
>> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
>> Jul  5 01:37:35 node2 pengine[4049]:    error:
>> clone_update_actions_interleave: No action found for demote in g-lock:0
>> (first)
>> Jul  5 01:37:35 node2 pengine[4049]:    error: crm_abort:
>> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
>> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
>> Jul  5 01:37:35 node2 pengine[4049]:    error:
>> clone_update_actions_interleave: No action found for demote in g-lock:0
>> (first)
>> Jul  5 01:37:35 node2 pengine[4049]:    error: crm_abort:
>> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
>> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
>> Jul  5 01:37:35 node2 pengine[4049]:    error:
>> clone_update_actions_interleave: No action found for demote in g-lock:0
>> (first)
>> Jul  5 01:37:35 node2 pengine[4049]:   notice: LogActions: Demote p-drbd:1
>> (Master -> Stopped node1)
>> Jul  5 01:37:35 node2 pengine[4049]:   notice: LogActions: Stop p-dlm:1
>> (node1)
>> Jul  5 01:37:35 node2 pengine[4049]:   notice: LogActions: Stop p-o2cb:1
>> (node1)
>> Jul  5 01:37:35 node2 pengine[4049]:   notice: LogActions: Stop p-vg1:1
>> (node1)
>> Jul  5 01:37:35 node2 pengine[4049]:   notice: LogActions: Stop p-sharefs:1
>> (node1)
>> Jul  5 01:37:35 node2 pengine[4049]:  warning: process_pe_message:
>> Calculated Transition 6: /var/lib/pacemaker/pengine/pe-warn-5.bz2
>> Jul  5 01:37:35 node2 crmd[4050]:   notice: do_te_invoke: Processing graph 6
>> (ref=pe_calc-dc-1436042255-74) derived from
>> /var/lib/pacemaker/pengine/pe-warn-5.bz2
>> Jul  5 01:37:35 node2 crmd[4050]:   notice: te_fence_node: Executing reboot
>> fencing operation (79) on node1 (timeout=90000)
>> Jul  5 01:37:35 node2 crmd[4050]:   notice: te_rsc_command: Initiating
>> action 95: notify p-drbd_pre_notify_demote_0 on node2 (local)
>> Jul  5 01:37:35 node2 stonith-ng[4046]:   notice: handle_request: Client
>> crmd.4050.3c8ccdb0 wants to fence (reboot) 'node1' with device '(any)'
>> Jul  5 01:37:35 node2 stonith-ng[4046]:   notice:
>> initiate_remote_stonith_op: Initiating remote operation reboot for node1:
>> 2fce4697-65d5-40d2-b5b4-5c10be4d3d5a (0)
>> Jul  5 01:37:35 node2 crmd[4050]:   notice: process_lrm_event: LRM operation
>> p-drbd_notify_0 (call=54, rc=0, cib-update=0, confirmed=true) ok
>> Jul  5 01:37:35 node2 stonith-ng[4046]:   notice:
>> can_fence_host_with_device: sbd_stonith can fence node1: dynamic-list
>> Jul  5 01:37:35 node2 sbd: [6197]: info: Delivery process handling
>> /dev/disk/by-id/scsi-1494554000000000035c71550e83586b70dfc77c3b382dfbd
>> Jul  5 01:37:35 node2 sbd: [6197]: info: Device UUID:
>> 4e398182-4894-4f11-b84e-6c6ad4c02614
>> Jul  5 01:37:35 node2 sbd: [6197]: info: Writing reset to node slot node1
>> Jul  5 01:37:35 node2 sbd: [6197]: info: Messaging delay: 40
>> Jul  5 01:38:15 node2 sbd: [6197]: info: reset successfully delivered to
>> node1
>> Jul  5 01:38:15 node2 sbd: [6196]: info: Message successfully delivered.
>> Jul  5 01:38:16 node2 stonith-ng[4046]:   notice: log_operation: Operation
>> 'reboot' [6185] (call 3 from crmd.4050) for host 'node1' with device
>> 'sbd_stonith' returned: 0 (OK)
>> Jul  5 01:38:16 node2 stonith-ng[4046]:   notice: remote_op_done: Operation
>> reboot of node1 by node2 for crmd.4050 at node2.2fce4697: OK
>> Jul  5 01:38:16 node2 crmd[4050]:   notice: tengine_stonith_callback:
>> Stonith operation 3/79:6:0:34e9a1ca-ebc9-4114-b5b7-26e670dc6aa0: OK (0)
>> Jul  5 01:38:16 node2 crmd[4050]:   notice: tengine_stonith_notify: Peer
>> node1 was terminated (reboot) by node2 for node2: OK
>> (ref=2fce4697-65d5-40d2-b5b4-5c10be4d3d5a) by client crmd.4050
>> Jul  5 01:38:16 node2 crmd[4050]:   notice: te_rsc_command: Initiating
>> action 96: notify p-drbd_post_notify_demote_0 on node2 (local)
>> Jul  5 01:38:16 node2 crmd[4050]:   notice: process_lrm_event: LRM operation
>> p-drbd_notify_0 (call=55, rc=0, cib-update=0, confirmed=true) ok
>> Jul  5 01:38:16 node2 crmd[4050]:   notice: run_graph: Transition 6
>> (Complete=24, Pending=0, Fired=0, Skipped=6, Incomplete=5,
>> Source=/var/lib/pacemaker/pengine/pe-warn-5.bz2): Stopped
>> Jul  5 01:38:16 node2 pengine[4049]:   notice: unpack_config: On loss of CCM
>> Quorum: Ignore
>> Jul  5 01:38:16 node2 pengine[4049]:    error: rsc_expand_action: Couldn't
>> expand cl-lock_demote_0
>> Jul  5 01:38:16 node2 pengine[4049]:    error: crm_abort:
>> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
>> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
>> Jul  5 01:38:16 node2 pengine[4049]:    error:
>> clone_update_actions_interleave: No action found for demote in g-lock:0
>> (first)
>> Jul  5 01:38:16 node2 pengine[4049]:    error: crm_abort:
>> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
>> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
>> Jul  5 01:38:16 node2 pengine[4049]:    error:
>> clone_update_actions_interleave: No action found for demote in g-lock:0
>> (first)
>> Jul  5 01:38:16 node2 pengine[4049]:    error: crm_abort:
>> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
>> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
>> Jul  5 01:38:16 node2 pengine[4049]:    error:
>> clone_update_actions_interleave: No action found for demote in g-lock:0
>> (first)
>> Jul  5 01:38:16 node2 pengine[4049]:    error: crm_abort:
>> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
>> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
>> Jul  5 01:38:16 node2 pengine[4049]:    error:
>> clone_update_actions_interleave: No action found for demote in g-lock:0
>> (first)
>> Jul  5 01:38:16 node2 pengine[4049]:   notice: process_pe_message:
>> Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-30.bz2
>> Jul  5 01:38:16 node2 crmd[4050]:   notice: do_te_invoke: Processing graph 7
>> (ref=pe_calc-dc-1436042296-79) derived from
>> /var/lib/pacemaker/pengine/pe-input-30.bz2
>> Jul  5 01:38:16 node2 crmd[4050]:   notice: run_graph: Transition 7
>> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
>> Source=/var/lib/pacemaker/pengine/pe-input-30.bz2): Complete
>> Jul  5 01:38:16 node2 crmd[4050]:   notice: do_state_transition: State
>> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: fence_node_time: Node
>> 739307907/node1 was last shot 'now'
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: check_fencing_done:
>> 12925001B70F4A529491DFF16F8B8352 check_fencing 739307907 done add 1436042043
>> fail 1436042255 last 1436042297
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: check_fencing_done:
>> 12925001B70F4A529491DFF16F8B8352 check_fencing done
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: check_quorum_done:
>> 12925001B70F4A529491DFF16F8B8352 check_quorum disabled
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: check_fs_done:
>> 12925001B70F4A529491DFF16F8B8352 check_fs nodeid 739307907 clear
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: check_fs_done:
>> 12925001B70F4A529491DFF16F8B8352 check_fs done
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: send_info:
>> 12925001B70F4A529491DFF16F8B8352 send_start cg 4 flags 2 data2 0 counts 3 1
>> 0 1 1
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: receive_start:
>> 12925001B70F4A529491DFF16F8B8352 receive_start 739307908:4 len 76
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: match_change:
>> 12925001B70F4A529491DFF16F8B8352 match_change 739307908:4 matches cg 4
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: wait_messages_done:
>> 12925001B70F4A529491DFF16F8B8352 wait_messages cg 4 got all 1
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: start_kernel:
>> 12925001B70F4A529491DFF16F8B8352 start_kernel cg 4 member_count 1
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: update_dir_members: dir_member
>> 739307907
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: update_dir_members: dir_member
>> 739307908
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: set_configfs_members: set_members
>> rmdir
>> "/sys/kernel/config/dlm/cluster/spaces/12925001B70F4A529491DFF16F8B8352/nodes/739307907"
>> Jul  5 01:38:17 node2 cluster-dlm[4344]: do_sysfs: write "1" to
>> "/sys/kernel/dlm/12925001B70F4A529491DFF16F8B8352/control"
>> Jul  5 01:38:17 node2 kernel: [  726.576130]
>> (ocfs2rec,6152,0):ocfs2_replay_journal:1549 Recovering node 739307907 from
>> slot 0 on device (253,0)
>> Jul  5 01:38:18 node2 kernel: [  727.453597]
>> (ocfs2rec,6152,0):ocfs2_begin_quota_recovery:407 Beginning quota recovery in
>> slot 0
>> Jul  5 01:38:18 node2 kernel: [  727.455965]
>> (kworker/u:28,139,0):ocfs2_finish_quota_recovery:599 Finishing quota
>> recovery in slot 0
>>
>>
>> cib:
>> primitive p-dlm ocf:pacemaker:controld \
>>          op monitor interval="120" timeout="30" \
>>          op start interval="0" timeout="90" \
>>          op stop interval="0" timeout="100"
>> primitive p-drbd ocf:linbit:drbd \
>>          params drbd_resource="r0" \
>>          op monitor interval="50" role="Master" timeout="30" \
>>          op monitor interval="60" role="Slave" timeout="30" \
>>          op start interval="0" timeout="240" \
>>          op stop interval="0" timeout="100"
>> primitive p-o2cb ocf:ocfs2:o2cb \
>>          op monitor interval="120" timeout="30" \
>>          op start interval="0" timeout="90" \
>>          op stop interval="0" timeout="100"
>> primitive p-sharefs ocf:heartbeat:Filesystem \
>>          params device="/dev/vg1/sharelv" directory="/sharedata"
>> fstype="ocfs2" \
>>          op monitor interval="60s" timeout="60s" \
>>          op start interval="0" timeout="90s" \
>>          op stop interval="0" timeout="90s"
>> primitive p-vg1 ocf:heartbeat:LVM \
>>          params volgrpname="vg1" \
>>          op monitor interval="60s" timeout="40" \
>>          op start interval="0" timeout="40" \
>>          op stop interval="0" timeout="40"
>> primitive sbd_stonith stonith:external/sbd \
>>          meta target-role="Started" \
>>          op monitor interval="3000" timeout="20" \
>>          op start interval="0" timeout="20" \
>>          op stop interval="0" timeout="20" \
>>          params
>> sbd_device="/dev/disk/by-id/scsi-1494554000000000035c71550e83586b70dfc77c3b382dfbd"
>> group g-lock p-dlm p-o2cb
>> group g-sharedata p-vg1 p-sharefs
>> ms ms-drbd p-drbd \
>>          meta master-max="2" clone-max="2" notify="true" interleave="true"
>> clone cl-lock g-lock \
>>          meta globally-unique="false" interleave="true"
>> clone cl-sharedata g-sharedata \
>>          meta globally-unique="false" interleave="true"
>> colocation col-drbd-lock inf: cl-lock ms-drbd:Master
>> colocation col-lock-sharedata inf: cl-sharedata cl-lock
>> order ord-drbd-lock inf: ms-drbd:promote cl-lock
>> order ord-lock-sharedata inf: cl-lock cl-sharedata
>>          stonith-timeout="90s" \
>>          no-quorum-policy="ignore" \
>>          expected-quorum-votes="2"
>>
>>
>> --
>> Regards,
>>
>> Muhammad Sharfuddin
>> _______________________________________________
>> Linux-HA mailing list is closing down.
>> Please subscribe to users at clusterlabs.org instead.
>> http://clusterlabs.org/mailman/listinfo/users
>> _______________________________________________
>> Linux-HA at lists.linux-ha.org
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>
>>
>>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20150706/0fb40344/attachment.htm>


More information about the Users mailing list