<div dir="ltr">Hi,<div><br></div><div>Please help me understand what is causing this problem. I have a 2-node cluster running on KVM virtual machines. Each VM (running Ubuntu 14.04) is hosted on a separate hypervisor on a separate physical machine. Everything works well during testing (I restarted the VMs alternately), but after a day, when I kill the other node, corosync and pacemaker always end up hanging on the surviving node. The date and time on the VMs are in sync, I use unicast, tcpdump shows both nodes exchanging traffic, and I confirmed that DRBD was healthy and crm_mon showed a good status before I killed the other node. Below are the package versions and configurations I used:</div><div><br><div>corosync             2.3.3-1ubuntu1             </div><div>crmsh                1.2.5+hg1034-1ubuntu3      </div><div>drbd8-utils          2:8.4.4-1ubuntu1           </div><div>libcorosync-common4  2.3.3-1ubuntu1             </div><div>libcrmcluster4       1.1.10+git20130802-1ubuntu2</div><div>libcrmcommon3        1.1.10+git20130802-1ubuntu2</div><div>libcrmservice1       1.1.10+git20130802-1ubuntu2</div><div>pacemaker            1.1.10+git20130802-1ubuntu2</div><div>pacemaker-cli-utils  1.1.10+git20130802-1ubuntu2</div><div>postgresql-9.3       9.3.5-0ubuntu0.14.04.1</div></div><div><br></div><div># /etc/corosync/corosync:<br><div>totem {</div><div><span class="" style="white-space:pre">   </span>version: 2</div><div><span class="" style="white-space:pre"> </span>token: 3000</div><div><span class="" style="white-space:pre">        </span>token_retransmits_before_loss_const: 10</div><div><span class="" style="white-space:pre">    </span>join: 60</div><div><span class="" style="white-space:pre">   </span>consensus: 3600</div><div><span class="" style="white-space:pre">    </span>vsftype: none</div><div><span class="" style="white-space:pre">      </span>max_messages: 20</div><div><span class="" style="white-space:pre">   </span>clear_node_high_bit: yes</div><div> <span class="" style="white-space:pre"> </span>secauth: 
off</div><div> <span class="" style="white-space:pre">     </span>threads: 0</div><div> <span class="" style="white-space:pre">       </span>rrp_mode: none</div><div> <span class="" style="white-space:pre">   </span>interface {</div><div>                member {</div><div>                        memberaddr: 10.2.136.56</div><div>                }</div><div>                member {</div><div>                        memberaddr: 10.2.136.57</div><div>                }</div><div>                ringnumber: 0</div><div>                bindnetaddr: 10.2.136.0</div><div>                mcastport: 5405</div><div>        }</div><div>        transport: udpu</div><div>}</div><div>amf {</div><div><span class="" style="white-space:pre">    </span>mode: disabled</div><div>}</div><div>quorum {</div><div><span class="" style="white-space:pre">      </span>provider: corosync_votequorum</div><div><span class="" style="white-space:pre">      </span>expected_votes: 1</div><div>}</div><div>aisexec {</div><div>        user:   root</div><div>        group:  root</div><div>}</div><div>logging {</div><div>        fileline: off</div><div>        to_stderr: yes</div><div>        to_logfile: no</div><div>        to_syslog: yes</div><div><span class="" style="white-space:pre">       </span>syslog_facility: daemon</div><div>        debug: off</div><div>        timestamp: on</div><div>        logger_subsys {</div><div>                subsys: AMF</div><div>                debug: off</div><div>                tags: enter|leave|trace1|trace2|trace3|trace4|trace6</div><div>        }</div><div>}</div></div><div><br></div><div># /etc/corosync/service.d/pcmk:<br></div><div><div>service {</div><div>  name: pacemaker</div><div>  ver: 1</div><div>}</div></div><div><br></div><div>/etc/drbd.d/global_common.conf:<br></div><div><div>global {</div><div><span class="" style="white-space:pre">   </span>usage-count no;</div><div>}</div><div><br></div><div>common {</div><div><span class="" style="white-space:pre"> 
   </span>net {</div><div>                protocol C;</div><div><span class="" style="white-space:pre">    </span>}</div><div>}</div></div><div><br></div><div># /etc/drbd.d/pg.res:</div><div><div>resource pg {</div><div>  device /dev/drbd0;</div><div>  disk /dev/vdb;</div><div>  meta-disk internal;</div><div>  startup {</div><div>    wfc-timeout 15;</div><div>    degr-wfc-timeout 60;</div><div>  }</div><div>  disk {</div><div>    on-io-error detach;</div><div>    resync-rate 40M;</div><div>  }</div><div>  on node01 {</div><div>    address 10.2.136.56:7789;</div><div>  }</div><div>  on node02 {</div><div>    address 10.2.136.57:7789;</div><div>  }</div><div>  net {</div><div>    verify-alg md5;</div><div>    after-sb-0pri discard-zero-changes;</div><div>    after-sb-1pri discard-secondary;</div><div>    after-sb-2pri disconnect;</div><div>  }</div><div>}</div></div><div><br></div><div># Pacemaker configuration:</div><div><div>node $id="167938104" node01</div><div>node $id="167938105" node02</div><div>primitive drbd_pg ocf:linbit:drbd \</div><div><span class="" style="white-space:pre">      </span>params drbd_resource="pg" \</div><div><span class="" style="white-space:pre">      </span>op monitor interval="29s" role="Master" \</div><div><span class="" style="white-space:pre">      </span>op monitor interval="31s" role="Slave"</div><div>primitive fs_pg ocf:heartbeat:Filesystem \</div><div><span class="" style="white-space:pre">        </span>params device="/dev/drbd0" directory="/var/lib/postgresql/9.3/main" fstype="ext4"</div><div>primitive ip_pg ocf:heartbeat:IPaddr2 \</div><div><span class="" style="white-space:pre">      </span>params ip="10.2.136.59" cidr_netmask="24" nic="eth0"</div><div>primitive lsb_pg lsb:postgresql</div><div>group PGServer fs_pg lsb_pg ip_pg</div><div>ms ms_drbd_pg drbd_pg \</div><div><span class="" style="white-space:pre">     </span>meta master-max="1" 
master-node-max="1" clone-max="2" clone-node-max="1" notify="true"</div><div>colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master</div><div>order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start</div><div>property $id="cib-bootstrap-options" \</div><div><span class="" style="white-space:pre">       </span>dc-version="1.1.10-42f2063" \</div><div><span class="" style="white-space:pre">    </span>cluster-infrastructure="corosync" \</div><div><span class="" style="white-space:pre">      </span>stonith-enabled="false" \</div><div><span class="" style="white-space:pre">        </span>no-quorum-policy="ignore"</div><div>rsc_defaults $id="rsc-options" \</div><div><span class="" style="white-space:pre">       </span>resource-stickiness="100"</div></div><div><br></div><div># Logs on node01</div><div><div>Sep 10 10:25:33 node01 crmd[1019]:   notice: peer_update_callback: Our peer on the DC is dead</div><div>Sep 10 10:25:33 node01 crmd[1019]:   notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]</div><div>Sep 10 10:25:33 node01 crmd[1019]:   notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]</div><div>Sep 10 10:25:33 node01 corosync[940]:   [TOTEM ] A new membership (10.2.136.56:52) was formed. 
Members left: 167938105</div><div>Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg: PingAck did not arrive in time.</div><div>Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) </div><div>Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg: asender terminated</div><div>Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg: Terminating drbd_a_pg</div><div>Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg: Connection closed</div><div>Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg: conn( NetworkFailure -> Unconnected ) </div><div>Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg: receiver terminated</div><div>Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg: Restarting receiver thread</div><div>Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg: receiver (re)started</div><div>Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg: conn( Unconnected -> WFConnection ) </div><div>Sep 10 10:26:12 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8445) timed out</div><div>Sep 10 10:26:12 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:8445 - timed out after 20000ms</div><div>Sep 10 10:26:12 node01 crmd[1019]:    error: process_lrm_event: LRM operation drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms)</div><div>Sep 10 10:26:32 node01 crmd[1019]:  warning: cib_rsc_callback: Resource update 23 failed: (rc=-62) Timer expired</div><div>Sep 10 10:27:03 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8693) timed out</div><div>Sep 10 10:27:03 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:8693 - timed out after 20000ms</div><div>Sep 10 10:27:54 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8938) timed out</div><div>Sep 10 10:27:54 node01 lrmd[1016]:  warning: operation_finished: 
drbd_pg_monitor_31000:8938 - timed out after 20000ms</div><div>Sep 10 10:28:33 node01 crmd[1019]:    error: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)</div><div>Sep 10 10:28:33 node01 crmd[1019]:  warning: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED</div><div>Sep 10 10:28:33 node01 crmd[1019]:  warning: do_state_transition: 1 cluster nodes failed to respond to the join offer.</div><div>Sep 10 10:28:33 node01 crmd[1019]:   notice: crmd_join_phase_log: join-1: node02=none</div><div>Sep 10 10:28:33 node01 crmd[1019]:   notice: crmd_join_phase_log: join-1: node01=welcomed</div><div>Sep 10 10:28:45 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9185) timed out</div><div>Sep 10 10:28:45 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:9185 - timed out after 20000ms</div><div>Sep 10 10:29:36 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9432) timed out</div><div>Sep 10 10:29:36 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:9432 - timed out after 20000ms</div><div>Sep 10 10:30:27 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9680) timed out</div><div>Sep 10 10:30:27 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:9680 - timed out after 20000ms</div><div>Sep 10 10:31:18 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9927) timed out</div><div>Sep 10 10:31:18 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:9927 - timed out after 20000ms</div><div>Sep 10 10:32:09 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 10174) timed out</div><div>Sep 10 10:32:09 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:10174 - timed out after 20000ms</div></div><div><br></div><div>#crm_mon on 
node01 before I kill the other vm:</div><div><div>Stack: corosync</div><div>Current DC: node02 (167938104) - partition with quorum</div><div>Version: 1.1.10-42f2063</div><div>2 Nodes configured</div><div>5 Resources configured</div><div><br></div><div>Online: [ node01 node02 ]</div><div><br></div><div> Resource Group: PGServer</div><div>     fs_pg      (ocf::heartbeat:Filesystem):    Started node02</div><div>     lsb_pg     (lsb:postgresql):       Started node02</div><div>     ip_pg      (ocf::heartbeat:IPaddr2):       Started node02</div><div> Master/Slave Set: ms_drbd_pg [drbd_pg]</div><div>     Masters: [ node02 ]</div><div>     Slaves: [ node01 ]</div></div><div><br></div><div>Thank you,</div><div>Kiam</div></div>