[ClusterLabs] file system resource becomes inaccesible when any of the node goes down

Mon Jul 6 06:31:50 UTC 2015

remember you need to configure the fencing handler in your drbd as
somebody told you and other mail.

2015-07-06 7:23 GMT+02:00 Muhammad Sharfuddin <M.Sharfuddin at nds.com.pk>:
> If I understand you, then the issue of inaccessibility of ocfs2 file system
> on surviving/online nodes is expected
> due to stonith msgwait timeouts, and the only possible solution to avoid the
> delay/inaccessibility of ocfs2 file system
> is to reduce the msgwait timeout values, which is generally not recommended
> for the other clusters operations.
>
> Due to inaccessibility/delay issue, currently this file system cluster is
> not perfect because a few services which I starts
> manually(non-clustered) and runs on both the nodes obviously requires
> /sharedata(ocfs2) availability to run/function.
>
> --
> Regards,
>
> Muhammad Sharfuddin
>
> On 07/06/2015 02:05 AM, emmanuel segura wrote:
>
> when a node goes down, you will see the node in unclean state, how you
> see in your logs, forming new configuration(corosync) -> stonith
> reboot request -> and you are using sbd and the node become offline
> after thet msgwait is expired, when msgwait is expired pacemaker knows
> the node is dead and than the ocfs will ok, stonith->dlm->ocfs2, if
> you need to reduce the msgwait you need to be careful about others
> timeouts and cluster problems.
>
> 2015-07-05 18:13 GMT+02:00 Muhammad Sharfuddin <M.Sharfuddin at nds.com.pk>:
>
> SLES 11 SP 3 + online updates(pacemaker-1.1.11-0.8.11.70
> openais-1.1.4-5.22.1.7)
>
> Its a dual primary drbd cluster, which mounts a file system resource on both
> the cluster nodes simultaneously(file system type is ocfs2).
>
> Whenever any of the nodes goes down, the file system(/sharedata) become
> inaccessible for exact 35 seconds on the other (surviving/online) node, and
> then become available again on the online node.
>
> Please help me understand why the node which survives or remains online
> unable to access the file system resource(/sharedata) for 35 seconds ? and
> how can I fix the cluster so that file system remains accessible on the
> surviving node without any interruption/delay(as in my case of about 35
> seconds)
>
> By inaccessible, I meant to say that running "ls -l /sharedata" and "df
> /sharedata" does not return any output and does not return the prompt back
> on the online node for exact 35 seconds once the other node becomes offline.
>
> e.g "node1" got offline somewhere around  01:37:15, and then /sharedata file
> system was inaccessible during 01:37:35 and 01:38:18 on the online node i.e
> "node2".
>
>
> /var/log/messages on node2, when node1 went offline:
> Jul  5 01:37:26 node2 kernel: [  675.255865] drbd r0: PingAck did not arrive
> in time.
> Jul  5 01:37:26 node2 kernel: [  675.255886] drbd r0: peer( Primary ->
> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Jul  5 01:37:26 node2 kernel: [  675.256030] block drbd0: new current UUID
> C23D1458962AD18D:A8DD404C9F563391:6A5F4A26F64BAF0B:6A5E4A26F64BAF0B
> Jul  5 01:37:26 node2 kernel: [  675.256079] drbd r0: asender terminated
> Jul  5 01:37:26 node2 kernel: [  675.256081] drbd r0: Terminating drbd_a_r0
> Jul  5 01:37:26 node2 kernel: [  675.256306] drbd r0: Connection closed
> Jul  5 01:37:26 node2 kernel: [  675.256338] drbd r0: conn( NetworkFailure
> -> Unconnected )
> Jul  5 01:37:26 node2 kernel: [  675.256339] drbd r0: receiver terminated
> Jul  5 01:37:26 node2 kernel: [  675.256340] drbd r0: Restarting receiver
> thread
> Jul  5 01:37:26 node2 kernel: [  675.256341] drbd r0: receiver (re)started
> Jul  5 01:37:26 node2 kernel: [  675.256344] drbd r0: conn( Unconnected ->
> WFConnection )
> Jul  5 01:37:29 node2 corosync[4040]:  [TOTEM ] A processor failed, forming
> new configuration.
> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] CLM CONFIGURATION CHANGE
> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] New Configuration:
> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ]     r(0) ip(172.16.241.132)
> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] Members Left:
> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ]     r(0) ip(172.16.241.131)
> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] Members Joined:
> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] notice: pcmk_peer_update:
> Transitional membership event on ring 216: memb=1, new=0, lost=1
> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] info: pcmk_peer_update:
> memb: node2 739307908
> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] info: pcmk_peer_update:
> lost: node1 739307907
> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] CLM CONFIGURATION CHANGE
> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] New Configuration:
> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ]     r(0) ip(172.16.241.132)
> Jul  5 01:37:35 node2 cluster-dlm[4344]:   notice: plugin_handle_membership:
> Membership 216: quorum lost
> Jul  5 01:37:35 node2 ocfs2_controld[4473]:   notice:
> plugin_handle_membership: Membership 216: quorum lost
> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] Members Left:
> Jul  5 01:37:35 node2 crmd[4050]:   notice: plugin_handle_membership:
> Membership 216: quorum lost
> Jul  5 01:37:35 node2 stonith-ng[4046]:   notice: plugin_handle_membership:
> Membership 216: quorum lost
> Jul  5 01:37:35 node2 cib[4045]:   notice: plugin_handle_membership:
> Membership 216: quorum lost
> Jul  5 01:37:35 node2 cluster-dlm[4344]:   notice: crm_update_peer_state:
> plugin_handle_membership: Node node1[739307907] - state is now lost (was
> member)
> Jul  5 01:37:35 node2 ocfs2_controld[4473]:   notice: crm_update_peer_state:
> plugin_handle_membership: Node node1[739307907] - state is now lost (was
> member)
> Jul  5 01:37:35 node2 corosync[4040]:  [CLM   ] Members Joined:
> Jul  5 01:37:35 node2 crmd[4050]:  warning: match_down_event: No match for
> shutdown action on node1
> Jul  5 01:37:35 node2 stonith-ng[4046]:   notice: crm_update_peer_state:
> plugin_handle_membership: Node node1[739307907] - state is now lost (was
> member)
> Jul  5 01:37:35 node2 cib[4045]:   notice: crm_update_peer_state:
> plugin_handle_membership: Node node1[739307907] - state is now lost (was
> member)
> Jul  5 01:37:35 node2 cluster-dlm[4344]: update_cluster: Processing
> membership 216
> Jul  5 01:37:35 node2 ocfs2_controld[4473]: confchg called
> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] notice: pcmk_peer_update:
> Stable membership event on ring 216: memb=1, new=0, lost=0
> Jul  5 01:37:35 node2 crmd[4050]:   notice: peer_update_callback:
> Stonith/shutdown of node1 not matched
> Jul  5 01:37:35 node2 cluster-dlm[4344]: dlm_process_node: Skipped active
> node 739307908: born-on=204, last-seen=216, this-event=216, last-event=212
> Jul  5 01:37:35 node2 ocfs2_controld[4473]: ocfs2_controld (group
> "ocfs2:controld") confchg: members 1, left 1, joined 0
> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] info: pcmk_peer_update:
> MEMB: node2 739307908
> Jul  5 01:37:35 node2 crmd[4050]:   notice: crm_update_peer_state:
> plugin_handle_membership: Node node1[739307907] - state is now lost (was
> member)
> Jul  5 01:37:35 node2 cluster-dlm[4344]: del_configfs_node:
> del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/739307907"
> Jul  5 01:37:35 node2 ocfs2_controld[4473]: node daemon left 739307907
> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] info:
> ais_mark_unseen_peer_dead: Node node1 was not seen in the previous
> transition
> Jul  5 01:37:35 node2 crmd[4050]:  warning: match_down_event: No match for
> shutdown action on node1
> Jul  5 01:37:35 node2 cluster-dlm[4344]: dlm_process_node: Removed inactive
> node 739307907: born-on=212, last-seen=212, this-event=216, last-event=212
> Jul  5 01:37:35 node2 ocfs2_controld[4473]: node down 739307907
> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] info: update_member: Node
> 739307907/node1 is now: lost
> Jul  5 01:37:35 node2 crmd[4050]:   notice: peer_update_callback:
> Stonith/shutdown of node1 not matched
> Jul  5 01:37:35 node2 cluster-dlm[4344]: log_config: dlm:controld conf 1 0 1
> memb 739307908 join left 739307907
> Jul  5 01:37:35 node2 ocfs2_controld[4473]: Node 739307907 has left
> mountgroup 12925001B70F4A529491DFF16F8B8352
> Jul  5 01:37:35 node2 corosync[4040]:  [pcmk  ] info:
> send_member_notification: Sending membership update 216 to 5 children
> Jul  5 01:37:35 node2 sbd: [4028]: WARN: CIB: We do NOT have quorum!
> Jul  5 01:37:35 node2 sbd: [4026]: WARN: Pacemaker health check: UNHEALTHY
> Jul  5 01:37:35 node2 cluster-dlm[4344]: log_config:
> dlm:ls:12925001B70F4A529491DFF16F8B8352 conf 1 0 1 memb 739307908 join left
> 739307907
> Jul  5 01:37:35 node2 corosync[4040]:  [TOTEM ] A processor joined or left
> the membership and a new membership was formed.
> Jul  5 01:37:35 node2 corosync[4040]:  [CPG   ] chosen downlist: sender r(0)
> ip(172.16.241.132) ; members(old:2 left:1)
> Jul  5 01:37:35 node2 cluster-dlm[4344]: node_history_fail:
> 12925001B70F4A529491DFF16F8B8352 check_fs nodeid 739307907 set
> Jul  5 01:37:35 node2 cluster-dlm[4344]: add_change:
> 12925001B70F4A529491DFF16F8B8352 add_change cg 4 remove nodeid 739307907
> reason 3
> Jul  5 01:37:35 node2 corosync[4040]:  [MAIN  ] Completed service
> synchronization, ready to provide service.
> Jul  5 01:37:35 node2 cluster-dlm[4344]: add_change:
> 12925001B70F4A529491DFF16F8B8352 add_change cg 4 counts member 1 joined 0
> remove 1 failed 1
> Jul  5 01:37:35 node2 cluster-dlm[4344]: stop_kernel:
> 12925001B70F4A529491DFF16F8B8352 stop_kernel cg 4
> Jul  5 01:37:35 node2 ocfs2_controld[4473]: Sending notification of node
> 739307907 for "12925001B70F4A529491DFF16F8B8352"
> Jul  5 01:37:35 node2 ocfs2_controld[4473]: confchg called
> Jul  5 01:37:35 node2 cluster-dlm[4344]: do_sysfs: write "0" to
> "/sys/kernel/dlm/12925001B70F4A529491DFF16F8B8352/control"
> Jul  5 01:37:35 node2 ocfs2_controld[4473]: group
> "ocfs2:12925001B70F4A529491DFF16F8B8352" confchg: members 1, left 1, joined
> 0
> Jul  5 01:37:35 node2 kernel: [  684.452042] dlm: closing connection to node
> 739307907
> Jul  5 01:37:35 node2 crmd[4050]:   notice: crm_update_quorum: Updating
> quorum status to false (call=114)
> Jul  5 01:37:35 node2 cluster-dlm[4344]: fence_node_time: Node
> 739307907/node1 has not been shot yet
> Jul  5 01:37:35 node2 crmd[4050]:   notice: do_state_transition: State
> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
> origin=check_join_state ]
> Jul  5 01:37:35 node2 cluster-dlm[4344]: check_fencing_done:
> 12925001B70F4A529491DFF16F8B8352 check_fencing 739307907 wait add 1436042043
> fail 1436042255 last 0
> Jul  5 01:37:35 node2 crmd[4050]:   notice: crm_update_quorum: Updating
> quorum status to false (call=119)
> Jul  5 01:37:35 node2 attrd[4048]:   notice: attrd_local_callback: Sending
> full refresh (origin=crmd)
> Jul  5 01:37:35 node2 attrd[4048]:   notice: attrd_trigger_update: Sending
> flush op to all hosts for: shutdown (0)
> Jul  5 01:37:35 node2 pengine[4049]:   notice: unpack_config: On loss of CCM
> Quorum: Ignore
> Jul  5 01:37:35 node2 pengine[4049]:  warning: pe_fence_node: Node node1
> will be fenced because the node is no longer part of the cluster
> Jul  5 01:37:35 node2 pengine[4049]:  warning: determine_online_status: Node
> node1 is unclean
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-drbd:1_demote_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-drbd:1_stop_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-drbd:1_demote_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-drbd:1_stop_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 attrd[4048]:   notice: attrd_trigger_update: Sending
> flush op to all hosts for: master-p-drbd (10000)
> Jul  5 01:37:35 node2 cluster-dlm[4344]: set_fs_notified:
> 12925001B70F4A529491DFF16F8B8352 set_fs_notified nodeid 739307907
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-drbd:1_demote_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 attrd[4048]:   notice: attrd_trigger_update: Sending
> flush op to all hosts for: probe_complete (true)
> Jul  5 01:37:35 node2 ocfs2_controld[4473]: message from dlmcontrol
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-drbd:1_stop_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 ocfs2_controld[4473]: Notified for
> "12925001B70F4A529491DFF16F8B8352", node 739307907, status 0
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-drbd:1_demote_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 ocfs2_controld[4473]: Completing notification on
> "12925001B70F4A529491DFF16F8B8352" for node 739307907
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-drbd:1_stop_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-dlm:1_stop_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-dlm:1_stop_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-o2cb:1_stop_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-o2cb:1_stop_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-vg1:1_stop_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-vg1:1_stop_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-sharefs:1_stop_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 pengine[4049]:  warning: custom_action: Action
> p-sharefs:1_stop_0 on node1 is unrunnable (offline)
> Jul  5 01:37:35 node2 pengine[4049]:  warning: stage6: Scheduling Node node1
> for STONITH
> Jul  5 01:37:35 node2 pengine[4049]:    error: rsc_expand_action: Couldn't
> expand cl-lock_demote_0
> Jul  5 01:37:35 node2 pengine[4049]:    error: crm_abort:
> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
> Jul  5 01:37:35 node2 pengine[4049]:    error:
> clone_update_actions_interleave: No action found for demote in g-lock:0
> (first)
> Jul  5 01:37:35 node2 pengine[4049]:    error: crm_abort:
> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
> Jul  5 01:37:35 node2 pengine[4049]:    error:
> clone_update_actions_interleave: No action found for demote in g-lock:0
> (first)
> Jul  5 01:37:35 node2 pengine[4049]:    error: crm_abort:
> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
> Jul  5 01:37:35 node2 pengine[4049]:    error:
> clone_update_actions_interleave: No action found for demote in g-lock:0
> (first)
> Jul  5 01:37:35 node2 pengine[4049]:    error: crm_abort:
> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
> Jul  5 01:37:35 node2 pengine[4049]:    error:
> clone_update_actions_interleave: No action found for demote in g-lock:0
> (first)
> Jul  5 01:37:35 node2 pengine[4049]:   notice: LogActions: Demote p-drbd:1
> (Master -> Stopped node1)
> Jul  5 01:37:35 node2 pengine[4049]:   notice: LogActions: Stop p-dlm:1
> (node1)
> Jul  5 01:37:35 node2 pengine[4049]:   notice: LogActions: Stop p-o2cb:1
> (node1)
> Jul  5 01:37:35 node2 pengine[4049]:   notice: LogActions: Stop p-vg1:1
> (node1)
> Jul  5 01:37:35 node2 pengine[4049]:   notice: LogActions: Stop p-sharefs:1
> (node1)
> Jul  5 01:37:35 node2 pengine[4049]:  warning: process_pe_message:
> Calculated Transition 6: /var/lib/pacemaker/pengine/pe-warn-5.bz2
> Jul  5 01:37:35 node2 crmd[4050]:   notice: do_te_invoke: Processing graph 6
> (ref=pe_calc-dc-1436042255-74) derived from
> /var/lib/pacemaker/pengine/pe-warn-5.bz2
> Jul  5 01:37:35 node2 crmd[4050]:   notice: te_fence_node: Executing reboot
> fencing operation (79) on node1 (timeout=90000)
> Jul  5 01:37:35 node2 crmd[4050]:   notice: te_rsc_command: Initiating
> action 95: notify p-drbd_pre_notify_demote_0 on node2 (local)
> Jul  5 01:37:35 node2 stonith-ng[4046]:   notice: handle_request: Client
> crmd.4050.3c8ccdb0 wants to fence (reboot) 'node1' with device '(any)'
> Jul  5 01:37:35 node2 stonith-ng[4046]:   notice:
> initiate_remote_stonith_op: Initiating remote operation reboot for node1:
> 2fce4697-65d5-40d2-b5b4-5c10be4d3d5a (0)
> Jul  5 01:37:35 node2 crmd[4050]:   notice: process_lrm_event: LRM operation
> p-drbd_notify_0 (call=54, rc=0, cib-update=0, confirmed=true) ok
> Jul  5 01:37:35 node2 stonith-ng[4046]:   notice:
> can_fence_host_with_device: sbd_stonith can fence node1: dynamic-list
> Jul  5 01:37:35 node2 sbd: [6197]: info: Delivery process handling
> /dev/disk/by-id/scsi-1494554000000000035c71550e83586b70dfc77c3b382dfbd
> Jul  5 01:37:35 node2 sbd: [6197]: info: Device UUID:
> 4e398182-4894-4f11-b84e-6c6ad4c02614
> Jul  5 01:37:35 node2 sbd: [6197]: info: Writing reset to node slot node1
> Jul  5 01:37:35 node2 sbd: [6197]: info: Messaging delay: 40
> Jul  5 01:38:15 node2 sbd: [6197]: info: reset successfully delivered to
> node1
> Jul  5 01:38:15 node2 sbd: [6196]: info: Message successfully delivered.
> Jul  5 01:38:16 node2 stonith-ng[4046]:   notice: log_operation: Operation
> 'reboot' [6185] (call 3 from crmd.4050) for host 'node1' with device
> 'sbd_stonith' returned: 0 (OK)
> Jul  5 01:38:16 node2 stonith-ng[4046]:   notice: remote_op_done: Operation
> reboot of node1 by node2 for crmd.4050 at node2.2fce4697: OK
> Jul  5 01:38:16 node2 crmd[4050]:   notice: tengine_stonith_callback:
> Stonith operation 3/79:6:0:34e9a1ca-ebc9-4114-b5b7-26e670dc6aa0: OK (0)
> Jul  5 01:38:16 node2 crmd[4050]:   notice: tengine_stonith_notify: Peer
> node1 was terminated (reboot) by node2 for node2: OK
> (ref=2fce4697-65d5-40d2-b5b4-5c10be4d3d5a) by client crmd.4050
> Jul  5 01:38:16 node2 crmd[4050]:   notice: te_rsc_command: Initiating
> action 96: notify p-drbd_post_notify_demote_0 on node2 (local)
> Jul  5 01:38:16 node2 crmd[4050]:   notice: process_lrm_event: LRM operation
> p-drbd_notify_0 (call=55, rc=0, cib-update=0, confirmed=true) ok
> Jul  5 01:38:16 node2 crmd[4050]:   notice: run_graph: Transition 6
> (Complete=24, Pending=0, Fired=0, Skipped=6, Incomplete=5,
> Source=/var/lib/pacemaker/pengine/pe-warn-5.bz2): Stopped
> Jul  5 01:38:16 node2 pengine[4049]:   notice: unpack_config: On loss of CCM
> Quorum: Ignore
> Jul  5 01:38:16 node2 pengine[4049]:    error: rsc_expand_action: Couldn't
> expand cl-lock_demote_0
> Jul  5 01:38:16 node2 pengine[4049]:    error: crm_abort:
> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
> Jul  5 01:38:16 node2 pengine[4049]:    error:
> clone_update_actions_interleave: No action found for demote in g-lock:0
> (first)
> Jul  5 01:38:16 node2 pengine[4049]:    error: crm_abort:
> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
> Jul  5 01:38:16 node2 pengine[4049]:    error:
> clone_update_actions_interleave: No action found for demote in g-lock:0
> (first)
> Jul  5 01:38:16 node2 pengine[4049]:    error: crm_abort:
> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
> Jul  5 01:38:16 node2 pengine[4049]:    error:
> clone_update_actions_interleave: No action found for demote in g-lock:0
> (first)
> Jul  5 01:38:16 node2 pengine[4049]:    error: crm_abort:
> clone_update_actions_interleave: Triggered assert at clone.c:1260 :
> first_action != NULL || is_set(first_child->flags, pe_rsc_orphan)
> Jul  5 01:38:16 node2 pengine[4049]:    error:
> clone_update_actions_interleave: No action found for demote in g-lock:0
> (first)
> Jul  5 01:38:16 node2 pengine[4049]:   notice: process_pe_message:
> Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-30.bz2
> Jul  5 01:38:16 node2 crmd[4050]:   notice: do_te_invoke: Processing graph 7
> (ref=pe_calc-dc-1436042296-79) derived from
> /var/lib/pacemaker/pengine/pe-input-30.bz2
> Jul  5 01:38:16 node2 crmd[4050]:   notice: run_graph: Transition 7
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-30.bz2): Complete
> Jul  5 01:38:16 node2 crmd[4050]:   notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jul  5 01:38:17 node2 cluster-dlm[4344]: fence_node_time: Node
> 739307907/node1 was last shot 'now'
> Jul  5 01:38:17 node2 cluster-dlm[4344]: check_fencing_done:
> 12925001B70F4A529491DFF16F8B8352 check_fencing 739307907 done add 1436042043
> fail 1436042255 last 1436042297
> Jul  5 01:38:17 node2 cluster-dlm[4344]: check_fencing_done:
> 12925001B70F4A529491DFF16F8B8352 check_fencing done
> Jul  5 01:38:17 node2 cluster-dlm[4344]: check_quorum_done:
> 12925001B70F4A529491DFF16F8B8352 check_quorum disabled
> Jul  5 01:38:17 node2 cluster-dlm[4344]: check_fs_done:
> 12925001B70F4A529491DFF16F8B8352 check_fs nodeid 739307907 clear
> Jul  5 01:38:17 node2 cluster-dlm[4344]: check_fs_done:
> 12925001B70F4A529491DFF16F8B8352 check_fs done
> Jul  5 01:38:17 node2 cluster-dlm[4344]: send_info:
> 12925001B70F4A529491DFF16F8B8352 send_start cg 4 flags 2 data2 0 counts 3 1
> 0 1 1
> Jul  5 01:38:17 node2 cluster-dlm[4344]: receive_start:
> 12925001B70F4A529491DFF16F8B8352 receive_start 739307908:4 len 76
> Jul  5 01:38:17 node2 cluster-dlm[4344]: match_change:
> 12925001B70F4A529491DFF16F8B8352 match_change 739307908:4 matches cg 4
> Jul  5 01:38:17 node2 cluster-dlm[4344]: wait_messages_done:
> 12925001B70F4A529491DFF16F8B8352 wait_messages cg 4 got all 1
> Jul  5 01:38:17 node2 cluster-dlm[4344]: start_kernel:
> 12925001B70F4A529491DFF16F8B8352 start_kernel cg 4 member_count 1
> Jul  5 01:38:17 node2 cluster-dlm[4344]: update_dir_members: dir_member
> 739307907
> Jul  5 01:38:17 node2 cluster-dlm[4344]: update_dir_members: dir_member
> 739307908
> Jul  5 01:38:17 node2 cluster-dlm[4344]: set_configfs_members: set_members
> rmdir
> "/sys/kernel/config/dlm/cluster/spaces/12925001B70F4A529491DFF16F8B8352/nodes/739307907"
> Jul  5 01:38:17 node2 cluster-dlm[4344]: do_sysfs: write "1" to
> "/sys/kernel/dlm/12925001B70F4A529491DFF16F8B8352/control"
> Jul  5 01:38:17 node2 kernel: [  726.576130]
> (ocfs2rec,6152,0):ocfs2_replay_journal:1549 Recovering node 739307907 from
> slot 0 on device (253,0)
> Jul  5 01:38:18 node2 kernel: [  727.453597]
> (ocfs2rec,6152,0):ocfs2_begin_quota_recovery:407 Beginning quota recovery in
> slot 0
> Jul  5 01:38:18 node2 kernel: [  727.455965]
> (kworker/u:28,139,0):ocfs2_finish_quota_recovery:599 Finishing quota
> recovery in slot 0
>
>
> cib:
> primitive p-dlm ocf:pacemaker:controld \
>         op monitor interval="120" timeout="30" \
>         op start interval="0" timeout="90" \
>         op stop interval="0" timeout="100"
> primitive p-drbd ocf:linbit:drbd \
>         params drbd_resource="r0" \
>         op monitor interval="50" role="Master" timeout="30" \
>         op monitor interval="60" role="Slave" timeout="30" \
>         op start interval="0" timeout="240" \
>         op stop interval="0" timeout="100"
> primitive p-o2cb ocf:ocfs2:o2cb \
>         op monitor interval="120" timeout="30" \
>         op start interval="0" timeout="90" \
>         op stop interval="0" timeout="100"
> primitive p-sharefs ocf:heartbeat:Filesystem \
>         params device="/dev/vg1/sharelv" directory="/sharedata"
> fstype="ocfs2" \
>         op monitor interval="60s" timeout="60s" \
>         op start interval="0" timeout="90s" \
>         op stop interval="0" timeout="90s"
> primitive p-vg1 ocf:heartbeat:LVM \
>         params volgrpname="vg1" \
>         op monitor interval="60s" timeout="40" \
>         op start interval="0" timeout="40" \
>         op stop interval="0" timeout="40"
> primitive sbd_stonith stonith:external/sbd \
>         meta target-role="Started" \
>         op monitor interval="3000" timeout="20" \
>         op start interval="0" timeout="20" \
>         op stop interval="0" timeout="20" \
>         params
> sbd_device="/dev/disk/by-id/scsi-1494554000000000035c71550e83586b70dfc77c3b382dfbd"
> group g-lock p-dlm p-o2cb
> group g-sharedata p-vg1 p-sharefs
> ms ms-drbd p-drbd \
>         meta master-max="2" clone-max="2" notify="true" interleave="true"
> clone cl-lock g-lock \
>         meta globally-unique="false" interleave="true"
> clone cl-sharedata g-sharedata \
>         meta globally-unique="false" interleave="true"
> colocation col-drbd-lock inf: cl-lock ms-drbd:Master
> colocation col-lock-sharedata inf: cl-sharedata cl-lock
> order ord-drbd-lock inf: ms-drbd:promote cl-lock
> order ord-lock-sharedata inf: cl-lock cl-sharedata
>         stonith-timeout="90s" \
>         no-quorum-policy="ignore" \
>         expected-quorum-votes="2"
>
>
> --
> Regards,
>
> Muhammad Sharfuddin
> _______________________________________________
> Linux-HA mailing list is closing down.
> Please subscribe to users at clusterlabs.org instead.
> http://clusterlabs.org/mailman/listinfo/users
> _______________________________________________
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>
>
>

-- 
  .~.
  /V\
 //  \\
/(   )\
^`~'^