[Pacemaker] Corosync and Pacemaker Hangs

Thu Sep 11 00:23:36 EDT 2014

On 11 Sep 2014, at 12:57 pm, Norbert Kiam Maclang <norbert.kiam.maclang at gmail.com> wrote:

> Is this something to do with quorum? But I already set 
> 
> property no-quorum-policy="ignore" \
> 	expected-quorum-votes="1"

Having no fencing (stonith-enabled="false") wouldn't be helping.
And it looks like drbd resources are hanging, not pacemaker/corosync.

> Sep 10 10:26:12 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8445) timed out
> Sep 10 10:26:12 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:8445 - timed out after 20000ms
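
As a rough illustration of the fencing point, here is a minimal sketch of DRBD resource-level fencing added to the pg resource quoted below. It assumes the stock handler scripts from drbd8-utils are installed under /usr/lib/drbd/ (check the path on your system), and it is not a tested fix for this particular hang. Only the "fencing" line and the "handlers" section are new; everything else is from the posted pg.res.

```
resource pg {
  disk {
    on-io-error detach;
    resync-rate 40M;
    # Assumption: resource-level fencing only. Once real STONITH is
    # configured in Pacemaker, "resource-and-stonith" is the usual choice.
    fencing resource-only;
  }
  handlers {
    # Assumed install path of the handler scripts shipped with drbd8-utils.
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  # startup, net and "on nodeNN" sections unchanged from the posted pg.res
}
```

This only covers the DRBD side. Pacemaker itself would still need a working stonith device and stonith-enabled="true" before it can fence a dead peer.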

> 
> Thanks in advance,
> Kiam
> 
> On Thu, Sep 11, 2014 at 10:09 AM, Norbert Kiam Maclang <norbert.kiam.maclang at gmail.com> wrote:
> Hi,
> 
> Please help me understand what is causing the problem. I have a 2-node cluster running on VMs under KVM. Each VM (I am using Ubuntu 14.04) runs on a separate hypervisor on a separate machine. Everything worked well during testing (I restarted the VMs alternately), but after a day, when I kill the other node, corosync and pacemaker always end up hanging on the surviving node. Date and time on the VMs are in sync, I use unicast, tcpdump shows both nodes exchanging packets, and I confirmed that DRBD is healthy and crm_mon shows good status before I kill the other node. Below are my configurations and the versions I used:
> 
> corosync             2.3.3-1ubuntu1             
> crmsh                1.2.5+hg1034-1ubuntu3      
> drbd8-utils          2:8.4.4-1ubuntu1           
> libcorosync-common4  2.3.3-1ubuntu1             
> libcrmcluster4       1.1.10+git20130802-1ubuntu2
> libcrmcommon3        1.1.10+git20130802-1ubuntu2
> libcrmservice1       1.1.10+git20130802-1ubuntu2
> pacemaker            1.1.10+git20130802-1ubuntu2
> pacemaker-cli-utils  1.1.10+git20130802-1ubuntu2
> postgresql-9.3       9.3.5-0ubuntu0.14.04.1
> 
> # /etc/corosync/corosync:
> totem {
> 	version: 2
> 	token: 3000
> 	token_retransmits_before_loss_const: 10
> 	join: 60
> 	consensus: 3600
> 	vsftype: none
> 	max_messages: 20
> 	clear_node_high_bit: yes
>  	secauth: off
>  	threads: 0
>  	rrp_mode: none
>  	interface {
>                 member {
>                         memberaddr: 10.2.136.56
>                 }
>                 member {
>                         memberaddr: 10.2.136.57
>                 }
>                 ringnumber: 0
>                 bindnetaddr: 10.2.136.0
>                 mcastport: 5405
>         }
>         transport: udpu
> }
> amf {
> 	mode: disabled
> }
> quorum {
> 	provider: corosync_votequorum
> 	expected_votes: 1
> }
> aisexec {
>         user:   root
>         group:  root
> }
> logging {
>         fileline: off
>         to_stderr: yes
>         to_logfile: no
>         to_syslog: yes
> 	syslog_facility: daemon
>         debug: off
>         timestamp: on
>         logger_subsys {
>                 subsys: AMF
>                 debug: off
>                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>         }
> }
> 
> # /etc/corosync/service.d/pcmk:
> service {
>   name: pacemaker
>   ver: 1
> }
> 
> # /etc/drbd.d/global_common.conf:
> global {
> 	usage-count no;
> }
> 
> common {
> 	net {
>                 protocol C;
> 	}
> }
> 
> # /etc/drbd.d/pg.res:
> resource pg {
>   device /dev/drbd0;
>   disk /dev/vdb;
>   meta-disk internal;
>   startup {
>     wfc-timeout 15;
>     degr-wfc-timeout 60;
>   }
>   disk {
>     on-io-error detach;
>     resync-rate 40M;
>   }
>   on node01 {
>     address 10.2.136.56:7789;
>   }
>   on node02 {
>     address 10.2.136.57:7789;
>   }
>   net {
>     verify-alg md5;
>     after-sb-0pri discard-zero-changes;
>     after-sb-1pri discard-secondary;
>     after-sb-2pri disconnect;
>   }
> }
> 
> # Pacemaker configuration:
> node $id="167938104" node01
> node $id="167938105" node02
> primitive drbd_pg ocf:linbit:drbd \
> 	params drbd_resource="pg" \
> 	op monitor interval="29s" role="Master" \
> 	op monitor interval="31s" role="Slave"
> primitive fs_pg ocf:heartbeat:Filesystem \
> 	params device="/dev/drbd0" directory="/var/lib/postgresql/9.3/main" fstype="ext4"
> primitive ip_pg ocf:heartbeat:IPaddr2 \
> 	params ip="10.2.136.59" cidr_netmask="24" nic="eth0"
> primitive lsb_pg lsb:postgresql
> group PGServer fs_pg lsb_pg ip_pg
> ms ms_drbd_pg drbd_pg \
> 	meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master
> order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start
> property $id="cib-bootstrap-options" \
> 	dc-version="1.1.10-42f2063" \
> 	cluster-infrastructure="corosync" \
> 	stonith-enabled="false" \
> 	no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
> 	resource-stickiness="100"
> 
> # Logs on node01
> Sep 10 10:25:33 node01 crmd[1019]:   notice: peer_update_callback: Our peer on the DC is dead
> Sep 10 10:25:33 node01 crmd[1019]:   notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> Sep 10 10:25:33 node01 crmd[1019]:   notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
> Sep 10 10:25:33 node01 corosync[940]:   [TOTEM ] A new membership (10.2.136.56:52) was formed. Members left: 167938105
> Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg: PingAck did not arrive in time.
> Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) 
> Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg: asender terminated
> Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg: Terminating drbd_a_pg
> Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg: Connection closed
> Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg: conn( NetworkFailure -> Unconnected ) 
> Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg: receiver terminated
> Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg: Restarting receiver thread
> Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg: receiver (re)started
> Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg: conn( Unconnected -> WFConnection ) 
> Sep 10 10:26:12 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8445) timed out
> Sep 10 10:26:12 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:8445 - timed out after 20000ms
> Sep 10 10:26:12 node01 crmd[1019]:    error: process_lrm_event: LRM operation drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms)
> Sep 10 10:26:32 node01 crmd[1019]:  warning: cib_rsc_callback: Resource update 23 failed: (rc=-62) Timer expired
> Sep 10 10:27:03 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8693) timed out
> Sep 10 10:27:03 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:8693 - timed out after 20000ms
> Sep 10 10:27:54 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 8938) timed out
> Sep 10 10:27:54 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:8938 - timed out after 20000ms
> Sep 10 10:28:33 node01 crmd[1019]:    error: crm_timer_popped: Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
> Sep 10 10:28:33 node01 crmd[1019]:  warning: do_state_transition: Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
> Sep 10 10:28:33 node01 crmd[1019]:  warning: do_state_transition: 1 cluster nodes failed to respond to the join offer.
> Sep 10 10:28:33 node01 crmd[1019]:   notice: crmd_join_phase_log: join-1: node02=none
> Sep 10 10:28:33 node01 crmd[1019]:   notice: crmd_join_phase_log: join-1: node01=welcomed
> Sep 10 10:28:45 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9185) timed out
> Sep 10 10:28:45 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:9185 - timed out after 20000ms
> Sep 10 10:29:36 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9432) timed out
> Sep 10 10:29:36 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:9432 - timed out after 20000ms
> Sep 10 10:30:27 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9680) timed out
> Sep 10 10:30:27 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:9680 - timed out after 20000ms
> Sep 10 10:31:18 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 9927) timed out
> Sep 10 10:31:18 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:9927 - timed out after 20000ms
> Sep 10 10:32:09 node01 lrmd[1016]:  warning: child_timeout_callback: drbd_pg_monitor_31000 process (PID 10174) timed out
> Sep 10 10:32:09 node01 lrmd[1016]:  warning: operation_finished: drbd_pg_monitor_31000:10174 - timed out after 20000ms
> 
> # crm_mon on node01 before I killed the other VM:
> Stack: corosync
> Current DC: node02 (167938104) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 5 Resources configured
> 
> Online: [ node01 node02 ]
> 
>  Resource Group: PGServer
>      fs_pg      (ocf::heartbeat:Filesystem):    Started node02
>      lsb_pg     (lsb:postgresql):       Started node02
>      ip_pg      (ocf::heartbeat:IPaddr2):       Started node02
>  Master/Slave Set: ms_drbd_pg [drbd_pg]
>      Masters: [ node02 ]
>      Slaves: [ node01 ]
> 
> Thank you,
> Kiam