<div dir="ltr">Hi Vladislav and Andrew,<div><br></div><div>After adding fencing/stonith (resource-level) and fencing handlers on drbd, I am no longer getting monitor timeouts on drbd, but I am now experiencing a different problem. As I read the logs, node01 detected that node02 was disconnected (and moved the resources to itself), but crm_mon still shows the resources as started on node02, even though that node is down.</div><div><br></div><div>Node01:<br><br><div>node01 crmd[952]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]</div><div>node01 pengine[951]: notice: unpack_config: On loss of CCM Quorum: Ignore</div><div>node01 crmd[952]: notice: run_graph: Transition 260 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-78.bz2): Complete</div><div>node01 crmd[952]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]</div><div>node01 pengine[951]: notice: process_pe_message: Calculated Transition 260: /var/lib/pacemaker/pengine/pe-input-78.bz2</div><div>node01 corosync[917]: [TOTEM ] A processor failed, forming new configuration.</div><div>node01 corosync[917]: [TOTEM ] A new membership (<a href="http://10.2.131.20:352">10.2.131.20:352</a>) was formed. 
Members left: 167936789</div><div>node01 crmd[952]: warning: match_down_event: No match for shutdown action on 167936789</div><div>node01 crmd[952]: notice: peer_update_callback: Stonith/shutdown of node02 not matched</div><div>node01 crmd[952]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]</div><div>node01 pengine[951]: notice: unpack_config: On loss of CCM Quorum: Ignore</div><div>node01 pengine[951]: warning: pe_fence_node: Node node02 will be fenced because our peer process is no longer available</div><div>node01 pengine[951]: warning: determine_online_status: Node node02 is unclean</div><div>node01 pengine[951]: warning: stage6: Scheduling Node node02 for STONITH</div><div>node01 pengine[951]: notice: LogActions: Move fs_pg#011(Started node02 -> node01)</div><div>node01 pengine[951]: notice: LogActions: Move ip_pg#011(Started node02 -> node01)</div><div>node01 pengine[951]: notice: LogActions: Move lsb_pg#011(Started node02 -> node01)</div><div>node01 pengine[951]: notice: LogActions: Demote drbd_pg:0#011(Master -> Stopped node02)</div><div>node01 pengine[951]: notice: LogActions: Promote drbd_pg:1#011(Slave -> Master node01)</div><div>node01 pengine[951]: notice: LogActions: Stop p_fence:0#011(node02)</div><div>node01 crmd[952]: notice: te_rsc_command: Initiating action 2: cancel drbd_pg_cancel_31000 on node01 (local)</div><div>node01 crmd[952]: notice: te_fence_node: Executing reboot fencing operation (54) on node02 (timeout=60000)</div><div>node01 stonith-ng[948]: notice: handle_request: Client crmd.952.6d7ac808 wants to fence (reboot) 'node02' with device '(any)'</div><div>node01 stonith-ng[948]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node02: 96530c7b-1c80-42c4-82cf-840bf3d5bb5f (0)</div><div>node01 crmd[952]: notice: te_rsc_command: Initiating action 68: notify drbd_pg_pre_notify_demote_0 on node02</div><div>node01 
crmd[952]: notice: te_rsc_command: Initiating action 70: notify drbd_pg_pre_notify_demote_0 on node01 (local)</div><div>node01 pengine[951]: warning: process_pe_message: Calculated Transition 261: /var/lib/pacemaker/pengine/pe-warn-0.bz2</div><div>node01 crmd[952]: notice: process_lrm_event: LRM operation drbd_pg_notify_0 (call=63, rc=0, cib-update=0, confirmed=true) ok</div><div>node01 kernel: [230495.836024] d-con pg: PingAck did not arrive in time.</div><div>node01 kernel: [230495.836176] d-con pg: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) </div><div>node01 kernel: [230495.837204] d-con pg: asender terminated</div><div>node01 kernel: [230495.837216] d-con pg: Terminating drbd_a_pg</div><div>node01 kernel: [230495.837286] d-con pg: Connection closed</div><div>node01 kernel: [230495.837298] d-con pg: conn( NetworkFailure -> Unconnected ) </div><div>node01 kernel: [230495.837299] d-con pg: receiver terminated</div><div>node01 kernel: [230495.837300] d-con pg: Restarting receiver thread</div><div>node01 kernel: [230495.837304] d-con pg: receiver (re)started</div><div>node01 kernel: [230495.837314] d-con pg: conn( Unconnected -> WFConnection ) </div><div>node01 crmd[952]: warning: action_timer_callback: Timer popped (timeout=20000, abort_level=1000000, complete=false)</div><div>node01 crmd[952]: error: print_synapse: [Action 2]: Completed rsc op drbd_pg_cancel_31000 on node01 (priority: 0, waiting: none)</div><div>node01 crmd[952]: warning: action_timer_callback: Timer popped (timeout=20000, abort_level=1000000, complete=false)</div><div>node01 crmd[952]: error: print_synapse: [Action 68]: In-flight rsc op drbd_pg_pre_notify_demote_0 on node02 (priority: 0, waiting: none)</div><div>node01 crmd[952]: warning: cib_action_update: rsc_op 68: drbd_pg_pre_notify_demote_0 on node02 timed out</div><div>node01 crmd[952]: error: cib_action_updated: Update 297 FAILED: Timer expired</div><div>node01 crmd[952]: error: 
stonith_async_timeout_handler: Async call 2 timed out after 120000ms</div><div>node01 crmd[952]: notice: tengine_stonith_callback: Stonith operation 2/54:261:0:6978227d-ce2d-4dc6-955a-eb9313f112a5: Timer expired (-62)</div><div>node01 crmd[952]: notice: tengine_stonith_callback: Stonith operation 2 for node02 failed (Timer expired): aborting transition.</div><div>node01 crmd[952]: notice: run_graph: Transition 261 (Complete=6, Pending=0, Fired=0, Skipped=29, Incomplete=15, Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Stopped</div><div>node01 pengine[951]: notice: unpack_config: On loss of CCM Quorum: Ignore</div><div>node01 pengine[951]: warning: pe_fence_node: Node node02 will be fenced because our peer process is no longer available</div><div>node01 pengine[951]: warning: determine_online_status: Node node02 is unclean</div><div>node01 pengine[951]: warning: stage6: Scheduling Node node02 for STONITH</div><div>node01 pengine[951]: notice: LogActions: Move fs_pg#011(Started node02 -> node01)</div><div>node01 pengine[951]: notice: LogActions: Move ip_pg#011(Started node02 -> node01)</div><div>node01 pengine[951]: notice: LogActions: Move lsb_pg#011(Started node02 -> node01)</div><div>node01 pengine[951]: notice: LogActions: Demote drbd_pg:0#011(Master -> Stopped node02)</div><div>node01 pengine[951]: notice: LogActions: Promote drbd_pg:1#011(Slave -> Master node01)</div><div>node01 pengine[951]: notice: LogActions: Stop p_fence:0#011(node02)</div><div>node01 crmd[952]: notice: te_fence_node: Executing reboot fencing operation (53) on node02 (timeout=60000)</div><div>node01 stonith-ng[948]: notice: handle_request: Client crmd.952.6d7ac808 wants to fence (reboot) 'node02' with device '(any)'</div><div>node01 stonith-ng[948]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node02: a4fae8ce-3a6c-4fe5-a934-b5b83ae123cb (0)</div><div>node01 crmd[952]: notice: te_rsc_command: Initiating action 67: notify drbd_pg_pre_notify_demote_0 on 
node02</div><div>node01 crmd[952]: notice: te_rsc_command: Initiating action 69: notify drbd_pg_pre_notify_demote_0 on node01 (local)</div><div>node01 pengine[951]: warning: process_pe_message: Calculated Transition 262: /var/lib/pacemaker/pengine/pe-warn-1.bz2</div><div>node01 crmd[952]: notice: process_lrm_event: LRM operation drbd_pg_notify_0 (call=66, rc=0, cib-update=0, confirmed=true) ok</div><div><br></div><div>Last updated: Mon Sep 15 01:15:59 2014</div><div>Last change: Sat Sep 13 15:23:45 2014 via cibadmin on node01</div><div>Stack: corosync</div><div>Current DC: node01 (167936788) - partition with quorum</div><div>Version: 1.1.10-42f2063</div><div>2 Nodes configured</div><div>7 Resources configured</div><div><br></div><div><br></div><div>Node node02 (167936789): UNCLEAN (online)</div><div>Online: [ node01 ]</div><div><br></div><div> Resource Group: PGServer</div><div> fs_pg (ocf::heartbeat:Filesystem): Started node02</div><div> ip_pg (ocf::heartbeat:IPaddr2): Started node02</div><div> lsb_pg (lsb:postgresql): Started node02</div><div> Master/Slave Set: ms_drbd_pg [drbd_pg]</div><div> Masters: [ node02 ]</div><div> Slaves: [ node01 ]</div><div> Clone Set: cln_p_fence [p_fence]</div><div> Started: [ node01 node02 ]</div></div><div><br></div><div>Thank you,</div><div>Norbert</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Sep 12, 2014 at 12:06 PM, Vladislav Bogdanov <span dir="ltr"><<a href="mailto:bubble@hoster-ok.com" target="_blank">bubble@hoster-ok.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">12.09.2014 05:00, Norbert Kiam Maclang wrote:<br>
> Hi,<br>
><br>
> After adding resource-level fencing on drbd, I still ended up having<br>
> problems with timeouts on drbd. Are there recommended settings for<br>
> this? I followed what is written in the drbd documentation -<br>
> <a href="http://www.drbd.org/users-guide-emb/s-pacemaker-crm-drbd-backed-service.html" target="_blank">http://www.drbd.org/users-guide-emb/s-pacemaker-crm-drbd-backed-service.html</a>.<br>
> Another thing I can't understand is why, during initial tests, failover<br>
> works even if I reboot the vms several times. But after I let it soak for a<br>
> couple of hours (say 8 hours or more) and continue with the<br>
> tests, it will not fail over and I get a split brain. I confirmed<br>
> that everything was healthy before performing the reboot: disk<br>
> health and the network are good, drbd is synced, and the time on both servers is in sync.<br>
<br>
I recall seeing something similar about a year ago (around the time your<br>
pacemaker version is dated). I do not remember the exact cause, but I<br>
saw that the drbd RA times out because it is waiting for something<br>
(fencing) to complete in kernel space. drbd calls userspace scripts from<br>
within kernel space, and you'll see them in the process list with the<br>
drbd kernel thread as their parent.<br>
<br>
I'd also upgrade your corosync configuration from the "member" to the<br>
"nodelist" syntax, specifying the "name" parameter together with<br>
ring0_addr for each node (that parameter is not referenced in the<br>
corosync docs but should be somewhere in Pacemaker Explained; it is<br>
used only by pacemaker).<br>
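A nodelist stanza along those lines might look like this sketch (the addresses are the ones from the corosync.conf quoted below; the nodeids here are my assumption):<br>
<br>
```
nodelist {
    node {
        ring0_addr: 10.2.136.56
        name: node01
        nodeid: 1
    }
    node {
        ring0_addr: 10.2.136.57
        name: node02
        nodeid: 2
    }
}
```
<br>
The nodelist { } section replaces the member { } entries inside interface { }, and "name" is what pacemaker uses as the node name.<br>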
<br>
There is also trace_ra functionality in both pacemaker and crmsh (I<br>
cannot say whether it is supported in the versions you have, though<br>
probably yes), so you may want to play with that to get the exact<br>
picture from the resource agent.<br>
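If your crmsh supports it, enabling a trace might look like this (a sketch; verify the exact syntax against your crmsh version first):<br>
<br>
```shell
# Trace the monitor operation of the drbd_pg resource
# (check availability with `crm resource help trace`):
crm resource trace drbd_pg monitor

# Trace output typically lands under /var/lib/heartbeat/trace_ra/,
# one shell-trace file per operation run. Disable again with:
crm resource untrace drbd_pg monitor
```
<br>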
<br>
Anyway, upgrading to 1.1.12 and a more recent crmsh would be worthwhile,<br>
because you may simply be hitting a bug that was fixed and forgotten<br>
long ago.<br>
<br>
Concerning your<br>
> expected-quorum-votes="1"<br>
<br>
You need to configure votequorum in corosync with two_node: 1 instead of<br>
that line.<br>
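In corosync.conf that would be roughly (a sketch based on the quorum section quoted below):<br>
<br>
```
quorum {
    provider: corosync_votequorum
    two_node: 1
}
```
<br>
With two_node: 1, votequorum assumes expected_votes: 2 and enables wait_for_all by default, so the surviving node keeps quorum when its peer dies (see votequorum(5)).<br>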
<br>
><br>
> # Logs:<br>
> node01 lrmd[1036]: warning: child_timeout_callback:<br>
> drbd_pg_monitor_29000 process (PID 27744) timed out<br>
> node01 lrmd[1036]: warning: operation_finished:<br>
> drbd_pg_monitor_29000:27744 - timed out after 20000ms<br>
> node01 crmd[1039]: error: process_lrm_event: LRM operation<br>
> drbd_pg_monitor_29000 (69) Timed Out (timeout=20000ms)<br>
> node01 crmd[1039]: warning: update_failcount: Updating failcount for<br>
> drbd_pg on tyo1mqdb01p after failed monitor: rc=1 (update=value++,<br>
> time=1410486352)<br>
><br>
> Thanks,<br>
> Kiam<br>
><br>
> On Thu, Sep 11, 2014 at 6:58 PM, Norbert Kiam Maclang<br>
> <<a href="mailto:norbert.kiam.maclang@gmail.com">norbert.kiam.maclang@gmail.com</a> <mailto:<a href="mailto:norbert.kiam.maclang@gmail.com">norbert.kiam.maclang@gmail.com</a>>><br>
> wrote:<br>
><br>
> Thank you Vladislav.<br>
><br>
> I have configured resource-level fencing on drbd and removed<br>
> wfc-timeout and degr-wfc-timeout (is this required?). My drbd<br>
> configuration is now:<br>
><br>
> resource pg {<br>
> device /dev/drbd0;<br>
> disk /dev/vdb;<br>
> meta-disk internal;<br>
> disk {<br>
> fencing resource-only;<br>
> on-io-error detach;<br>
> resync-rate 40M;<br>
> }<br>
> handlers {<br>
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";<br>
> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";<br>
> split-brain "/usr/lib/drbd/notify-split-brain.sh nkbm";<br>
> }<br>
> on node01 {<br>
> address <a href="http://10.2.136.52:7789" target="_blank">10.2.136.52:7789</a> <<a href="http://10.2.136.52:7789" target="_blank">http://10.2.136.52:7789</a>>;<br>
> }<br>
> on node02 {<br>
> address <a href="http://10.2.136.55:7789" target="_blank">10.2.136.55:7789</a> <<a href="http://10.2.136.55:7789" target="_blank">http://10.2.136.55:7789</a>>;<br>
> }<br>
> net {<br>
> verify-alg md5;<br>
> after-sb-0pri discard-zero-changes;<br>
> after-sb-1pri discard-secondary;<br>
> after-sb-2pri disconnect;<br>
> }<br>
> }<br>
><br>
> Failover works on my initial test (restarting both nodes alternately<br>
> - this always works). Will wait for a couple of hours after doing a<br>
> failover test again (Which always fail on my previous setup).<br>
><br>
> Thank you!<br>
> Kiam<br>
><br>
> On Thu, Sep 11, 2014 at 2:14 PM, Vladislav Bogdanov<br>
> <<a href="mailto:bubble@hoster-ok.com">bubble@hoster-ok.com</a> <mailto:<a href="mailto:bubble@hoster-ok.com">bubble@hoster-ok.com</a>>> wrote:<br>
><br>
> 11.09.2014 05:57, Norbert Kiam Maclang wrote:<br>
> > Is this something to do with quorum? But I already set<br>
><br>
> You'd need to configure fencing at the drbd resources level.<br>
><br>
> <a href="http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib" target="_blank">http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib</a><br>
><br>
><br>
> ><br>
> > property no-quorum-policy="ignore" \<br>
> > expected-quorum-votes="1"<br>
> ><br>
> > Thanks in advance,<br>
> > Kiam<br>
> ><br>
> > On Thu, Sep 11, 2014 at 10:09 AM, Norbert Kiam Maclang<br>
> > <<a href="mailto:norbert.kiam.maclang@gmail.com">norbert.kiam.maclang@gmail.com</a><br>
> <mailto:<a href="mailto:norbert.kiam.maclang@gmail.com">norbert.kiam.maclang@gmail.com</a>><br>
> <mailto:<a href="mailto:norbert.kiam.maclang@gmail.com">norbert.kiam.maclang@gmail.com</a><br>
> <mailto:<a href="mailto:norbert.kiam.maclang@gmail.com">norbert.kiam.maclang@gmail.com</a>>>><br>
> > wrote:<br>
> ><br>
> > Hi,<br>
> ><br>
> > Please help me understand what is causing the problem. I<br>
> have a 2<br>
> > node cluster running on vms using KVM. Each vm (I am using<br>
> Ubuntu<br>
> > 14.04) runs on a separate hypervisor on separate machines.<br>
> > Everything worked well during testing (I restarted the vms alternately),<br>
> > but after a day, when I kill the other node, corosync and pacemaker<br>
> > always end up hanging on the surviving node. Date and time on<br>
> the vms are<br>
> > in sync, I use unicast, tcpdump shows both nodes exchanges,<br>
> > confirmed that DRBD is healthy and crm_mon show good<br>
> status before I<br>
> > kill the other node. Below are my configurations and<br>
> versions I used:<br>
> ><br>
> > corosync 2.3.3-1ubuntu1<br>
> > crmsh 1.2.5+hg1034-1ubuntu3<br>
> > drbd8-utils 2:8.4.4-1ubuntu1<br>
> > libcorosync-common4 2.3.3-1ubuntu1<br>
> > libcrmcluster4 1.1.10+git20130802-1ubuntu2<br>
> > libcrmcommon3 1.1.10+git20130802-1ubuntu2<br>
> > libcrmservice1 1.1.10+git20130802-1ubuntu2<br>
> > pacemaker 1.1.10+git20130802-1ubuntu2<br>
> > pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2<br>
> > postgresql-9.3 9.3.5-0ubuntu0.14.04.1<br>
> ><br>
> > # /etc/corosync/corosync:<br>
> > totem {<br>
> > version: 2<br>
> > token: 3000<br>
> > token_retransmits_before_loss_const: 10<br>
> > join: 60<br>
> > consensus: 3600<br>
> > vsftype: none<br>
> > max_messages: 20<br>
> > clear_node_high_bit: yes<br>
> > secauth: off<br>
> > threads: 0<br>
> > rrp_mode: none<br>
> > interface {<br>
> > member {<br>
> > memberaddr: 10.2.136.56<br>
> > }<br>
> > member {<br>
> > memberaddr: 10.2.136.57<br>
> > }<br>
> > ringnumber: 0<br>
> > bindnetaddr: 10.2.136.0<br>
> > mcastport: 5405<br>
> > }<br>
> > transport: udpu<br>
> > }<br>
> > amf {<br>
> > mode: disabled<br>
> > }<br>
> > quorum {<br>
> > provider: corosync_votequorum<br>
> > expected_votes: 1<br>
> > }<br>
> > aisexec {<br>
> > user: root<br>
> > group: root<br>
> > }<br>
> > logging {<br>
> > fileline: off<br>
> > to_stderr: yes<br>
> > to_logfile: no<br>
> > to_syslog: yes<br>
> > syslog_facility: daemon<br>
> > debug: off<br>
> > timestamp: on<br>
> > logger_subsys {<br>
> > subsys: AMF<br>
> > debug: off<br>
> > tags:<br>
> enter|leave|trace1|trace2|trace3|trace4|trace6<br>
> > }<br>
> > }<br>
> ><br>
> > # /etc/corosync/service.d/pcmk:<br>
> > service {<br>
> > name: pacemaker<br>
> > ver: 1<br>
> > }<br>
> ><br>
> > /etc/drbd.d/global_common.conf:<br>
> > global {<br>
> > usage-count no;<br>
> > }<br>
> ><br>
> > common {<br>
> > net {<br>
> > protocol C;<br>
> > }<br>
> > }<br>
> ><br>
> > # /etc/drbd.d/pg.res:<br>
> > resource pg {<br>
> > device /dev/drbd0;<br>
> > disk /dev/vdb;<br>
> > meta-disk internal;<br>
> > startup {<br>
> > wfc-timeout 15;<br>
> > degr-wfc-timeout 60;<br>
> > }<br>
> > disk {<br>
> > on-io-error detach;<br>
> > resync-rate 40M;<br>
> > }<br>
> > on node01 {<br>
> > address <a href="http://10.2.136.56:7789" target="_blank">10.2.136.56:7789</a> <<a href="http://10.2.136.56:7789" target="_blank">http://10.2.136.56:7789</a>><br>
> <<a href="http://10.2.136.56:7789" target="_blank">http://10.2.136.56:7789</a>>;<br>
> > }<br>
> > on node02 {<br>
> > address <a href="http://10.2.136.57:7789" target="_blank">10.2.136.57:7789</a> <<a href="http://10.2.136.57:7789" target="_blank">http://10.2.136.57:7789</a>><br>
> <<a href="http://10.2.136.57:7789" target="_blank">http://10.2.136.57:7789</a>>;<br>
> > }<br>
> > net {<br>
> > verify-alg md5;<br>
> > after-sb-0pri discard-zero-changes;<br>
> > after-sb-1pri discard-secondary;<br>
> > after-sb-2pri disconnect;<br>
> > }<br>
> > }<br>
> ><br>
> > # Pacemaker configuration:<br>
> > node $id="167938104" node01<br>
> > node $id="167938105" node02<br>
> > primitive drbd_pg ocf:linbit:drbd \<br>
> > params drbd_resource="pg" \<br>
> > op monitor interval="29s" role="Master" \<br>
> > op monitor interval="31s" role="Slave"<br>
> > primitive fs_pg ocf:heartbeat:Filesystem \<br>
> > params device="/dev/drbd0"<br>
> directory="/var/lib/postgresql/9.3/main"<br>
> > fstype="ext4"<br>
> > primitive ip_pg ocf:heartbeat:IPaddr2 \<br>
> > params ip="10.2.136.59" cidr_netmask="24" nic="eth0"<br>
> > primitive lsb_pg lsb:postgresql<br>
> > group PGServer fs_pg lsb_pg ip_pg<br>
> > ms ms_drbd_pg drbd_pg \<br>
> > meta master-max="1" master-node-max="1" clone-max="2"<br>
> > clone-node-max="1" notify="true"<br>
> > colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master<br>
> > order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start<br>
> > property $id="cib-bootstrap-options" \<br>
> > dc-version="1.1.10-42f2063" \<br>
> > cluster-infrastructure="corosync" \<br>
> > stonith-enabled="false" \<br>
> > no-quorum-policy="ignore"<br>
> > rsc_defaults $id="rsc-options" \<br>
> > resource-stickiness="100"<br>
> ><br>
> > # Logs on node01<br>
> > Sep 10 10:25:33 node01 crmd[1019]: notice:<br>
> peer_update_callback:<br>
> > Our peer on the DC is dead<br>
> > Sep 10 10:25:33 node01 crmd[1019]: notice:<br>
> do_state_transition:<br>
> > State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION<br>
> > cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]<br>
> > Sep 10 10:25:33 node01 crmd[1019]: notice:<br>
> do_state_transition:<br>
> > State transition S_ELECTION -> S_INTEGRATION [<br>
> input=I_ELECTION_DC<br>
> > cause=C_FSA_INTERNAL origin=do_election_check ]<br>
> > Sep 10 10:25:33 node01 corosync[940]: [TOTEM ] A new<br>
> membership<br>
> > (<a href="http://10.2.136.56:52" target="_blank">10.2.136.56:52</a> <<a href="http://10.2.136.56:52" target="_blank">http://10.2.136.56:52</a>><br>
> <<a href="http://10.2.136.56:52" target="_blank">http://10.2.136.56:52</a>>) was formed. Members left:<br>
> > 167938105<br>
> > Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg:<br>
> PingAck did<br>
> > not arrive in time.<br>
> > Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg: peer(<br>
> > Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk(<br>
> > UpToDate -> DUnknown )<br>
> > Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg:<br>
> asender<br>
> > terminated<br>
> > Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg:<br>
> Terminating<br>
> > drbd_a_pg<br>
> > Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg:<br>
> Connection<br>
> > closed<br>
> > Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg: conn(<br>
> > NetworkFailure -> Unconnected )<br>
> > Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg:<br>
> receiver<br>
> > terminated<br>
> > Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg:<br>
> Restarting<br>
> > receiver thread<br>
> > Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg:<br>
> receiver<br>
> > (re)started<br>
> > Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg: conn(<br>
> > Unconnected -> WFConnection )<br>
> > Sep 10 10:26:12 node01 lrmd[1016]: warning:<br>
> child_timeout_callback:<br>
> > drbd_pg_monitor_31000 process (PID 8445) timed out<br>
> > Sep 10 10:26:12 node01 lrmd[1016]: warning:<br>
> operation_finished:<br>
> > drbd_pg_monitor_31000:8445 - timed out after 20000ms<br>
> > Sep 10 10:26:12 node01 crmd[1019]: error:<br>
> process_lrm_event: LRM<br>
> > operation drbd_pg_monitor_31000 (30) Timed Out<br>
> (timeout=20000ms)<br>
> > Sep 10 10:26:32 node01 crmd[1019]: warning: cib_rsc_callback:<br>
> > Resource update 23 failed: (rc=-62) Timer expired<br>
> > Sep 10 10:27:03 node01 lrmd[1016]: warning:<br>
> child_timeout_callback:<br>
> > drbd_pg_monitor_31000 process (PID 8693) timed out<br>
> > Sep 10 10:27:03 node01 lrmd[1016]: warning:<br>
> operation_finished:<br>
> > drbd_pg_monitor_31000:8693 - timed out after 20000ms<br>
> > Sep 10 10:27:54 node01 lrmd[1016]: warning:<br>
> child_timeout_callback:<br>
> > drbd_pg_monitor_31000 process (PID 8938) timed out<br>
> > Sep 10 10:27:54 node01 lrmd[1016]: warning:<br>
> operation_finished:<br>
> > drbd_pg_monitor_31000:8938 - timed out after 20000ms<br>
> > Sep 10 10:28:33 node01 crmd[1019]: error: crm_timer_popped:<br>
> > Integration Timer (I_INTEGRATED) just popped in state<br>
> S_INTEGRATION!<br>
> > (180000ms)<br>
> > Sep 10 10:28:33 node01 crmd[1019]: warning:<br>
> do_state_transition:<br>
> > Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED<br>
> > Sep 10 10:28:33 node01 crmd[1019]: warning:<br>
> do_state_transition: 1<br>
> > cluster nodes failed to respond to the join offer.<br>
> > Sep 10 10:28:33 node01 crmd[1019]: notice:<br>
> crmd_join_phase_log:<br>
> > join-1: node02=none<br>
> > Sep 10 10:28:33 node01 crmd[1019]: notice:<br>
> crmd_join_phase_log:<br>
> > join-1: node01=welcomed<br>
> > Sep 10 10:28:45 node01 lrmd[1016]: warning:<br>
> child_timeout_callback:<br>
> > drbd_pg_monitor_31000 process (PID 9185) timed out<br>
> > Sep 10 10:28:45 node01 lrmd[1016]: warning:<br>
> operation_finished:<br>
> > drbd_pg_monitor_31000:9185 - timed out after 20000ms<br>
> > Sep 10 10:29:36 node01 lrmd[1016]: warning:<br>
> child_timeout_callback:<br>
> > drbd_pg_monitor_31000 process (PID 9432) timed out<br>
> > Sep 10 10:29:36 node01 lrmd[1016]: warning:<br>
> operation_finished:<br>
> > drbd_pg_monitor_31000:9432 - timed out after 20000ms<br>
> > Sep 10 10:30:27 node01 lrmd[1016]: warning:<br>
> child_timeout_callback:<br>
> > drbd_pg_monitor_31000 process (PID 9680) timed out<br>
> > Sep 10 10:30:27 node01 lrmd[1016]: warning:<br>
> operation_finished:<br>
> > drbd_pg_monitor_31000:9680 - timed out after 20000ms<br>
> > Sep 10 10:31:18 node01 lrmd[1016]: warning:<br>
> child_timeout_callback:<br>
> > drbd_pg_monitor_31000 process (PID 9927) timed out<br>
> > Sep 10 10:31:18 node01 lrmd[1016]: warning:<br>
> operation_finished:<br>
> > drbd_pg_monitor_31000:9927 - timed out after 20000ms<br>
> > Sep 10 10:32:09 node01 lrmd[1016]: warning:<br>
> child_timeout_callback:<br>
> > drbd_pg_monitor_31000 process (PID 10174) timed out<br>
> > Sep 10 10:32:09 node01 lrmd[1016]: warning:<br>
> operation_finished:<br>
> > drbd_pg_monitor_31000:10174 - timed out after 20000ms<br>
> ><br>
> > #crm_mon on node01 before I kill the other vm:<br>
> > Stack: corosync<br>
> > Current DC: node02 (167938104) - partition with quorum<br>
> > Version: 1.1.10-42f2063<br>
> > 2 Nodes configured<br>
> > 5 Resources configured<br>
> ><br>
> > Online: [ node01 node02 ]<br>
> ><br>
> > Resource Group: PGServer<br>
> > fs_pg (ocf::heartbeat:Filesystem): Started node02<br>
> > lsb_pg (lsb:postgresql): Started node02<br>
> > ip_pg (ocf::heartbeat:IPaddr2): Started node02<br>
> > Master/Slave Set: ms_drbd_pg [drbd_pg]<br>
> > Masters: [ node02 ]<br>
> > Slaves: [ node01 ]<br>
> ><br>
> > Thank you,<br>
> > Kiam<br>
> ><br>
> ><br>
> ><br>
> ><br>
> > _______________________________________________<br>
> > Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>
> <mailto:<a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a>><br>
> > <a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>
> ><br>
> > Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>
> > Getting started:<br>
> <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
> > Bugs: <a href="http://bugs.clusterlabs.org" target="_blank">http://bugs.clusterlabs.org</a><br>
> ><br>
><br>
><br>
><br>
><br>
><br>
><br>
><br>
><br>
<br>
<br>
</blockquote></div><br></div>