<div dir="ltr">Hi,<div><br></div><div>After adding resource level fencing on drbd, I still ended up having problems with timeouts on drbd. Is there a recommended settings for this? I followed what is written in the drbd documentation - <a href="http://www.drbd.org/users-guide-emb/s-pacemaker-crm-drbd-backed-service.html">http://www.drbd.org/users-guide-emb/s-pacemaker-crm-drbd-backed-service.html</a> , Another thing I can't understand is why during initial tests, even I reboot the vms several times, failover works. But after I soak it for a couple of hours (say for example 8 hours or more) and continue with the tests, it will not failover and experience split brain. I confirmed it though that everything is healthy before performing a reboot. Disk health and network is good, drbd is synced, time beetween servers is good.<br><br># Logs:<br><div>node01 lrmd[1036]: warning: child_timeout_callback: drbd_pg_monitor_29000 process (PID 27744) timed out</div><div>node01 lrmd[1036]: warning: operation_finished: drbd_pg_monitor_29000:27744 - timed out after 20000ms</div><div>node01 crmd[1039]: error: process_lrm_event: LRM operation drbd_pg_monitor_29000 (69) Timed Out (timeout=20000ms)</div><div>node01 crmd[1039]: warning: update_failcount: Updating failcount for drbd_pg on tyo1mqdb01p after failed monitor: rc=1 (update=value++, time=1410486352)</div></div><div><br></div><div>Thanks,</div><div>Kiam</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Sep 11, 2014 at 6:58 PM, Norbert Kiam Maclang <span dir="ltr"><<a href="mailto:norbert.kiam.maclang@gmail.com" target="_blank">norbert.kiam.maclang@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Thank you Vladislav.<div><br></div><div>I have configured resource level fencing on drbd and removed wfc-timeout and defr-wfc-timeout (is this required?). My drbd configuration is now:</div><div><br></div><div><div>resource pg {</div><div> device /dev/drbd0;</div><div> disk /dev/vdb;</div><div> meta-disk internal;</div><div> disk {</div><div> fencing resource-only;</div><div> on-io-error detach;</div><div> resync-rate 40M;</div><div> }</div><div> handlers {</div><div> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";</div><div> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";</div><div> split-brain "/usr/lib/drbd/notify-split-brain.sh nkbm";</div><div> }</div><div> on node01 {</div><div> address <a href="http://10.2.136.52:7789" target="_blank">10.2.136.52:7789</a>;</div><div> }</div><div> on node02 {</div><div> address <a href="http://10.2.136.55:7789" target="_blank">10.2.136.55:7789</a>;</div><div> }</div><div> net {</div><div> verify-alg md5;</div><div> after-sb-0pri discard-zero-changes;</div><div> after-sb-1pri discard-secondary;</div><div> after-sb-2pri disconnect;</div><div> }</div><div>}</div></div><div><br></div><div>Failover works on my initial test (restarting both nodes alternately - this always works). 
Thanks,
Kiam

On Thu, Sep 11, 2014 at 6:58 PM, Norbert Kiam Maclang <norbert.kiam.maclang@gmail.com> wrote:

Thank you Vladislav.

I have configured resource-level fencing on DRBD and removed wfc-timeout and degr-wfc-timeout (is this required?). My DRBD configuration is now:

resource pg {
        device /dev/drbd0;
        disk /dev/vdb;
        meta-disk internal;
        disk {
                fencing resource-only;
                on-io-error detach;
                resync-rate 40M;
        }
        handlers {
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
                split-brain "/usr/lib/drbd/notify-split-brain.sh nkbm";
        }
        on node01 {
                address 10.2.136.52:7789;
        }
        on node02 {
                address 10.2.136.55:7789;
        }
        net {
                verify-alg md5;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }
}

Failover works on my initial test (restarting both nodes alternately - this always works). I will wait a couple of hours before doing the failover test again (which always failed on my previous setup).

Thank you!
Kiam

On Thu, Sep 11, 2014 at 2:14 PM, Vladislav Bogdanov <bubble@hoster-ok.com> wrote:

11.09.2014 05:57, Norbert Kiam Maclang wrote:
> Is this something to do with quorum? But I already set

You'd need to configure fencing at the DRBD resource level.

http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib

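For what it's worth, with the resource names from your configuration below, the constraint that crm-fence-peer.sh places while the peer is unreachable would look roughly like this in crm shell syntax (a sketch only - the exact constraint id is generated by the handler, and the node name depends on which side stays Primary):

location drbd-fence-by-handler-pg ms_drbd_pg \
        rule $role="Master" -inf: #uname ne node01

crm-unfence-peer.sh removes the constraint again once the resync to the peer has completed.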
>
> property no-quorum-policy="ignore" \
>       expected-quorum-votes="1"
>
> Thanks in advance,
> Kiam
>
> On Thu, Sep 11, 2014 at 10:09 AM, Norbert Kiam Maclang
> <norbert.kiam.maclang@gmail.com> wrote:
>
> Hi,
>
> Please help me understand what is causing the problem. I have a 2-node
> cluster running on VMs using KVM. Each VM (I am using Ubuntu 14.04) runs
> on a separate hypervisor on a separate machine. Everything works well
> during testing (I restarted the VMs alternately), but after a day, when
> I kill the other node, corosync and pacemaker always end up hanging on
> the surviving node. Date and time on the VMs are in sync, I use unicast,
> tcpdump shows the two nodes exchanging traffic, and DRBD is healthy and
> crm_mon shows a good status before I kill the other node. Below are my
> configurations and the versions I use:
>
> corosync 2.3.3-1ubuntu1
> crmsh 1.2.5+hg1034-1ubuntu3
> drbd8-utils 2:8.4.4-1ubuntu1
> libcorosync-common4 2.3.3-1ubuntu1
> libcrmcluster4 1.1.10+git20130802-1ubuntu2
> libcrmcommon3 1.1.10+git20130802-1ubuntu2
> libcrmservice1 1.1.10+git20130802-1ubuntu2
> pacemaker 1.1.10+git20130802-1ubuntu2
> pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2
> postgresql-9.3 9.3.5-0ubuntu0.14.04.1
> # /etc/corosync/corosync:
> totem {
>         version: 2
>         token: 3000
>         token_retransmits_before_loss_const: 10
>         join: 60
>         consensus: 3600
>         vsftype: none
>         max_messages: 20
>         clear_node_high_bit: yes
>         secauth: off
>         threads: 0
>         rrp_mode: none
>         interface {
>                 member {
>                         memberaddr: 10.2.136.56
>                 }
>                 member {
>                         memberaddr: 10.2.136.57
>                 }
>                 ringnumber: 0
>                 bindnetaddr: 10.2.136.0
>                 mcastport: 5405
>         }
>         transport: udpu
> }
> amf {
>         mode: disabled
> }
> quorum {
>         provider: corosync_votequorum
>         expected_votes: 1
> }
> aisexec {
>         user: root
>         group: root
> }
> logging {
>         fileline: off
>         to_stderr: yes
>         to_logfile: no
>         to_syslog: yes
>         syslog_facility: daemon
>         debug: off
>         timestamp: on
>         logger_subsys {
>                 subsys: AMF
>                 debug: off
>                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>         }
> }
>
> # /etc/corosync/service.d/pcmk:
> service {
>         name: pacemaker
>         ver: 1
> }
>
> # /etc/drbd.d/global_common.conf:
> global {
>         usage-count no;
> }
>
> common {
>         net {
>                 protocol C;
>         }
> }
>
> # /etc/drbd.d/pg.res:
> resource pg {
>         device /dev/drbd0;
>         disk /dev/vdb;
>         meta-disk internal;
>         startup {
>                 wfc-timeout 15;
>                 degr-wfc-timeout 60;
>         }
>         disk {
>                 on-io-error detach;
>                 resync-rate 40M;
>         }
>         on node01 {
>                 address 10.2.136.56:7789;
>         }
>         on node02 {
>                 address 10.2.136.57:7789;
>         }
>         net {
>                 verify-alg md5;
>                 after-sb-0pri discard-zero-changes;
>                 after-sb-1pri discard-secondary;
>                 after-sb-2pri disconnect;
>         }
> }
>
> # Pacemaker configuration:
> node $id="167938104" node01
> node $id="167938105" node02
> primitive drbd_pg ocf:linbit:drbd \
>         params drbd_resource="pg" \
>         op monitor interval="29s" role="Master" \
>         op monitor interval="31s" role="Slave"
> primitive fs_pg ocf:heartbeat:Filesystem \
>         params device="/dev/drbd0" directory="/var/lib/postgresql/9.3/main" fstype="ext4"
> primitive ip_pg ocf:heartbeat:IPaddr2 \
>         params ip="10.2.136.59" cidr_netmask="24" nic="eth0"
> primitive lsb_pg lsb:postgresql
> group PGServer fs_pg lsb_pg ip_pg
> ms ms_drbd_pg drbd_pg \
>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master
> order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.10-42f2063" \
>         cluster-infrastructure="corosync" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="100"
>
> # Logs on node01
> Sep 10 10:25:33 node01 crmd[1019]: notice: peer_update_callback: Our peer on the DC is dead
> Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
> Sep 10 10:25:33 node01 corosync[940]: [TOTEM ] A new membership (10.2.136.56:52) was formed. Members left: 167938105
> Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg: PingAck did not arrive in time.
> Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg: asender terminated
> Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg: Terminating drbd_a_pg
> Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg: Connection closed
> Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg: conn( NetworkFailure -> Unconnected )
> Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg: receiver terminated
> Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg: Restarting receiver thread
> Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg: receiver (re)started
> Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg: conn( Unconnected -> WFConnection )
> Sep 10 10:26:12 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8445) timed out
> Sep 10 10:26:12 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8445 - timed out after 20000ms
> Sep 10 10:26:12 node01 crmd[1019]: error: process_lrm_event: LRM operation drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms)
> Sep 10 10:26:32 node01 crmd[1019]: warning: cib_rsc_callback: Resource update 23 failed: (rc=-62) Timer expired
> Sep 10 10:27:03 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8693) timed out
> Sep 10 10:27:03 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8693 - timed out after 20000ms
> Sep 10 10:27:54 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8938) timed out
> Sep 10 10:27:54 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:8938 - timed out after 20000ms
> Sep 10 10:28:33 node01 crmd[1019]: error: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
> Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
> Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition: 1 cluster nodes failed to respond to the join offer.
> Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log: join-1: node02=none
> Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log: join-1: node01=welcomed
> Sep 10 10:28:45 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9185) timed out
> Sep 10 10:28:45 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9185 - timed out after 20000ms
> Sep 10 10:29:36 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9432) timed out
> Sep 10 10:29:36 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9432 - timed out after 20000ms
> Sep 10 10:30:27 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9680) timed out
> Sep 10 10:30:27 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9680 - timed out after 20000ms
> Sep 10 10:31:18 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9927) timed out
> Sep 10 10:31:18 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:9927 - timed out after 20000ms
> Sep 10 10:32:09 node01 lrmd[1016]: warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 10174) timed out
> Sep 10 10:32:09 node01 lrmd[1016]: warning: operation_finished: drbd_pg_monitor_31000:10174 - timed out after 20000ms
>
> # crm_mon on node01 before I kill the other vm:
> Stack: corosync
> Current DC: node02 (167938104) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 5 Resources configured
>
> Online: [ node01 node02 ]
>
> Resource Group: PGServer
>     fs_pg   (ocf::heartbeat:Filesystem):    Started node02
>     lsb_pg  (lsb:postgresql):               Started node02
>     ip_pg   (ocf::heartbeat:IPaddr2):       Started node02
> Master/Slave Set: ms_drbd_pg [drbd_pg]
>     Masters: [ node02 ]
>     Slaves:  [ node01 ]
>
> Thank you,
> Kiam

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org