[Pacemaker] Corosync and Pacemaker Hangs

Vladislav Bogdanov bubble at hoster-ok.com
Thu Sep 11 06:14:22 UTC 2014


11.09.2014 05:57, Norbert Kiam Maclang wrote:
> Is this something to do with quorum? But I already set 

You'd need to configure fencing at the DRBD resource level.

http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib
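
For the DRBD 8.4 you are running, that roughly means enabling a fencing
policy and the crm-fence-peer handlers in the resource definition. A rough
sketch for your pg resource (handler paths may differ depending on where
drbd8-utils installs the scripts on Ubuntu):

resource pg {
  disk {
    fencing resource-only;   # refuse to promote while the peer's data is outdated/unknown
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";             # add a constraint to the CIB
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";  # drop it again after resync
  }
  ...
}

crm-fence-peer.sh works by placing a location constraint that keeps the
Master role away from the node whose data is outdated while the peer is
unreachable; crm-unfence-peer.sh removes that constraint once the resync
has completed. The page linked above walks through the details.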


> 
> property no-quorum-policy="ignore" \
> expected-quorum-votes="1"
> 
> Thanks in advance,
> Kiam
> 
> On Thu, Sep 11, 2014 at 10:09 AM, Norbert Kiam Maclang
> <norbert.kiam.maclang at gmail.com> wrote:
> 
>     Hi,
> 
>     Please help me understand what is causing the problem. I have a
>     2-node cluster running on VMs under KVM. Each VM (running Ubuntu
>     14.04) sits on a separate hypervisor on a separate physical machine.
>     Everything worked well during testing (I restarted the VMs
>     alternately), but after a day, when I kill the other node, corosync
>     and pacemaker always end up hanging on the surviving node. Date and
>     time on the VMs are in sync, I use unicast, tcpdump shows traffic
>     flowing between both nodes, DRBD is confirmed healthy, and crm_mon
>     shows a good status before I kill the other node. Below are my
>     configurations and the versions I used:
> 
>     corosync             2.3.3-1ubuntu1             
>     crmsh                1.2.5+hg1034-1ubuntu3      
>     drbd8-utils          2:8.4.4-1ubuntu1           
>     libcorosync-common4  2.3.3-1ubuntu1             
>     libcrmcluster4       1.1.10+git20130802-1ubuntu2
>     libcrmcommon3        1.1.10+git20130802-1ubuntu2
>     libcrmservice1       1.1.10+git20130802-1ubuntu2
>     pacemaker            1.1.10+git20130802-1ubuntu2
>     pacemaker-cli-utils  1.1.10+git20130802-1ubuntu2
>     postgresql-9.3       9.3.5-0ubuntu0.14.04.1
> 
>     # /etc/corosync/corosync.conf:
>     totem {
>             version: 2
>             token: 3000
>             token_retransmits_before_loss_const: 10
>             join: 60
>             consensus: 3600
>             vsftype: none
>             max_messages: 20
>             clear_node_high_bit: yes
>             secauth: off
>             threads: 0
>             rrp_mode: none
>             interface {
>                     member {
>                             memberaddr: 10.2.136.56
>                     }
>                     member {
>                             memberaddr: 10.2.136.57
>                     }
>                     ringnumber: 0
>                     bindnetaddr: 10.2.136.0
>                     mcastport: 5405
>             }
>             transport: udpu
>     }
>     amf {
>             mode: disabled
>     }
>     quorum {
>             provider: corosync_votequorum
>             expected_votes: 1
>     }
>     aisexec {
>             user: root
>             group: root
>     }
>     logging {
>             fileline: off
>             to_stderr: yes
>             to_logfile: no
>             to_syslog: yes
>             syslog_facility: daemon
>             debug: off
>             timestamp: on
>             logger_subsys {
>                     subsys: AMF
>                     debug: off
>                     tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>             }
>     }
> 
>     # /etc/corosync/service.d/pcmk:
>     service {
>       name: pacemaker
>       ver: 1
>     }
> 
>     # /etc/drbd.d/global_common.conf:
>     global {
>             usage-count no;
>     }
> 
>     common {
>             net {
>                     protocol C;
>             }
>     }
> 
>     # /etc/drbd.d/pg.res:
>     resource pg {
>       device /dev/drbd0;
>       disk /dev/vdb;
>       meta-disk internal;
>       startup {
>         wfc-timeout 15;
>         degr-wfc-timeout 60;
>       }
>       disk {
>         on-io-error detach;
>         resync-rate 40M;
>       }
>       on node01 {
>         address 10.2.136.56:7789;
>       }
>       on node02 {
>         address 10.2.136.57:7789;
>       }
>       net {
>         verify-alg md5;
>         after-sb-0pri discard-zero-changes;
>         after-sb-1pri discard-secondary;
>         after-sb-2pri disconnect;
>       }
>     }
> 
>     # Pacemaker configuration:
>     node $id="167938104" node01
>     node $id="167938105" node02
>     primitive drbd_pg ocf:linbit:drbd \
>     params drbd_resource="pg" \
>     op monitor interval="29s" role="Master" \
>     op monitor interval="31s" role="Slave"
>     primitive fs_pg ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/var/lib/postgresql/9.3/main" \
>     fstype="ext4"
>     primitive ip_pg ocf:heartbeat:IPaddr2 \
>     params ip="10.2.136.59" cidr_netmask="24" nic="eth0"
>     primitive lsb_pg lsb:postgresql
>     group PGServer fs_pg lsb_pg ip_pg
>     ms ms_drbd_pg drbd_pg \
>     meta master-max="1" master-node-max="1" clone-max="2" \
>     clone-node-max="1" notify="true"
>     colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master
>     order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start
>     property $id="cib-bootstrap-options" \
>     dc-version="1.1.10-42f2063" \
>     cluster-infrastructure="corosync" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore"
>     rsc_defaults $id="rsc-options" \
>     resource-stickiness="100"
> 
>     # Logs on node01
>     Sep 10 10:25:33 node01 crmd[1019]:   notice: peer_update_callback:
>     Our peer on the DC is dead
>     Sep 10 10:25:33 node01 crmd[1019]:   notice: do_state_transition:
>     State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
>     cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
>     Sep 10 10:25:33 node01 crmd[1019]:   notice: do_state_transition:
>     State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
>     cause=C_FSA_INTERNAL origin=do_election_check ]
>     Sep 10 10:25:33 node01 corosync[940]:   [TOTEM ] A new membership
>     (10.2.136.56:52) was formed. Members left: 167938105
>     Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg: PingAck did
>     not arrive in time.
>     Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg: peer(
>     Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk(
>     UpToDate -> DUnknown ) 
>     Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg: asender
>     terminated
>     Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg: Terminating
>     drbd_a_pg
>     Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg: Connection
>     closed
>     Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg: conn(
>     NetworkFailure -> Unconnected ) 
>     Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg: receiver
>     terminated
>     Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg: Restarting
>     receiver thread
>     Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg: receiver
>     (re)started
>     Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg: conn(
>     Unconnected -> WFConnection ) 
>     Sep 10 10:26:12 node01 lrmd[1016]:  warning: child_timeout_callback:
>     drbd_pg_monitor_31000 process (PID 8445) timed out
>     Sep 10 10:26:12 node01 lrmd[1016]:  warning: operation_finished:
>     drbd_pg_monitor_31000:8445 - timed out after 20000ms
>     Sep 10 10:26:12 node01 crmd[1019]:    error: process_lrm_event: LRM
>     operation drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms)
>     Sep 10 10:26:32 node01 crmd[1019]:  warning: cib_rsc_callback:
>     Resource update 23 failed: (rc=-62) Timer expired
>     Sep 10 10:27:03 node01 lrmd[1016]:  warning: child_timeout_callback:
>     drbd_pg_monitor_31000 process (PID 8693) timed out
>     Sep 10 10:27:03 node01 lrmd[1016]:  warning: operation_finished:
>     drbd_pg_monitor_31000:8693 - timed out after 20000ms
>     Sep 10 10:27:54 node01 lrmd[1016]:  warning: child_timeout_callback:
>     drbd_pg_monitor_31000 process (PID 8938) timed out
>     Sep 10 10:27:54 node01 lrmd[1016]:  warning: operation_finished:
>     drbd_pg_monitor_31000:8938 - timed out after 20000ms
>     Sep 10 10:28:33 node01 crmd[1019]:    error: crm_timer_popped:
>     Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION!
>     (180000ms)
>     Sep 10 10:28:33 node01 crmd[1019]:  warning: do_state_transition:
>     Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
>     Sep 10 10:28:33 node01 crmd[1019]:  warning: do_state_transition: 1
>     cluster nodes failed to respond to the join offer.
>     Sep 10 10:28:33 node01 crmd[1019]:   notice: crmd_join_phase_log:
>     join-1: node02=none
>     Sep 10 10:28:33 node01 crmd[1019]:   notice: crmd_join_phase_log:
>     join-1: node01=welcomed
>     Sep 10 10:28:45 node01 lrmd[1016]:  warning: child_timeout_callback:
>     drbd_pg_monitor_31000 process (PID 9185) timed out
>     Sep 10 10:28:45 node01 lrmd[1016]:  warning: operation_finished:
>     drbd_pg_monitor_31000:9185 - timed out after 20000ms
>     Sep 10 10:29:36 node01 lrmd[1016]:  warning: child_timeout_callback:
>     drbd_pg_monitor_31000 process (PID 9432) timed out
>     Sep 10 10:29:36 node01 lrmd[1016]:  warning: operation_finished:
>     drbd_pg_monitor_31000:9432 - timed out after 20000ms
>     Sep 10 10:30:27 node01 lrmd[1016]:  warning: child_timeout_callback:
>     drbd_pg_monitor_31000 process (PID 9680) timed out
>     Sep 10 10:30:27 node01 lrmd[1016]:  warning: operation_finished:
>     drbd_pg_monitor_31000:9680 - timed out after 20000ms
>     Sep 10 10:31:18 node01 lrmd[1016]:  warning: child_timeout_callback:
>     drbd_pg_monitor_31000 process (PID 9927) timed out
>     Sep 10 10:31:18 node01 lrmd[1016]:  warning: operation_finished:
>     drbd_pg_monitor_31000:9927 - timed out after 20000ms
>     Sep 10 10:32:09 node01 lrmd[1016]:  warning: child_timeout_callback:
>     drbd_pg_monitor_31000 process (PID 10174) timed out
>     Sep 10 10:32:09 node01 lrmd[1016]:  warning: operation_finished:
>     drbd_pg_monitor_31000:10174 - timed out after 20000ms
> 
>     # crm_mon on node01 before I kill the other VM:
>     Stack: corosync
>     Current DC: node02 (167938104) - partition with quorum
>     Version: 1.1.10-42f2063
>     2 Nodes configured
>     5 Resources configured
> 
>     Online: [ node01 node02 ]
> 
>      Resource Group: PGServer
>          fs_pg      (ocf::heartbeat:Filesystem):    Started node02
>          lsb_pg     (lsb:postgresql):       Started node02
>          ip_pg      (ocf::heartbeat:IPaddr2):       Started node02
>      Master/Slave Set: ms_drbd_pg [drbd_pg]
>          Masters: [ node02 ]
>          Slaves: [ node01 ]
> 
>     Thank you,
>     Kiam
> 
> 
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 




