[Pacemaker] Corosync and Pacemaker Hangs

Vladislav Bogdanov bubble at hoster-ok.com
Sun Sep 14 23:55:10 EDT 2014


15.09.2014 04:24, Norbert Kiam Maclang wrote:
> Hi Vladislav and Andrew,
> 
> After adding fencing/stonith (resource level) and fencing handlers on
> drbd, I am no longer getting monitor timeouts on drbd, but I am now
> experiencing a different problem. As I understand it, the logs on
> node01 show that it detected node02 as disconnected (and moved the
> resources to itself), but crm_mon shows the resources as still started
> on node02, which is not the case.

That is probably the root of your issues.
It _may_ be caused by the VMs not being scheduled onto the host CPUs
fairly enough. You'd need either to re-think the whole architecture of
your cluster or to somehow tune your cluster messaging layer to
tolerate that. Increasing 'totem.token' would be the first step. See
man corosync.conf and the cman docs.
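
For example, something like this in /etc/corosync/corosync.conf (a
sketch; the 10000 ms token value is only an illustration, tune it to
how long your VMs can realistically be starved of CPU):

    totem {
            version: 2
            # raised from your current 3000 ms so that short VM
            # scheduling stalls do not trigger a new membership
            token: 10000
            token_retransmits_before_loss_const: 10
            # (rest of your existing totem options unchanged)
    }

Remember to apply the change on both nodes and restart corosync.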

> 
> Node01:
> 
> node01 crmd[952]:   notice: do_state_transition: State transition S_IDLE
> -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
> origin=crm_timer_popped ]
> node01 pengine[951]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> node01 crmd[952]:   notice: run_graph: Transition 260 (Complete=0,
> Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-78.bz2): Complete
> node01 crmd[952]:   notice: do_state_transition: State transition
> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL
> origin=notify_crmd ]
> node01 pengine[951]:   notice: process_pe_message: Calculated Transition
> 260: /var/lib/pacemaker/pengine/pe-input-78.bz2
> node01 corosync[917]:   [TOTEM ] A processor failed, forming new
> configuration.
> node01 corosync[917]:   [TOTEM ] A new membership (10.2.131.20:352)
> was formed. Members left: 167936789
> node01 crmd[952]:  warning: match_down_event: No match for shutdown
> action on 167936789
> node01 crmd[952]:   notice: peer_update_callback: Stonith/shutdown of
> node02 not matched
> node01 crmd[952]:   notice: do_state_transition: State transition S_IDLE
> -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=abort_transition_graph ]
> node01 pengine[951]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> node01 pengine[951]:  warning: pe_fence_node: Node node02 will be fenced
> because our peer process is no longer available
> node01 pengine[951]:  warning: determine_online_status: Node node02 is
> unclean
> node01 pengine[951]:  warning: stage6: Scheduling Node node02 for STONITH
> node01 pengine[951]:   notice: LogActions: Move    fs_pg      (Started
> node02 -> node01)
> node01 pengine[951]:   notice: LogActions: Move    ip_pg      (Started
> node02 -> node01)
> node01 pengine[951]:   notice: LogActions: Move    lsb_pg     (Started
> node02 -> node01)
> node01 pengine[951]:   notice: LogActions: Demote  drbd_pg:0  (Master
> -> Stopped node02)
> node01 pengine[951]:   notice: LogActions: Promote drbd_pg:1  (Slave
> -> Master node01)
> node01 pengine[951]:   notice: LogActions: Stop    p_fence:0  (node02)
> node01 crmd[952]:   notice: te_rsc_command: Initiating action 2: cancel
> drbd_pg_cancel_31000 on node01 (local)
> node01 crmd[952]:   notice: te_fence_node: Executing reboot fencing
> operation (54) on node02 (timeout=60000)
> node01 stonith-ng[948]:   notice: handle_request: Client
> crmd.952.6d7ac808 wants to fence (reboot) 'node02' with device '(any)'
> node01 stonith-ng[948]:   notice: initiate_remote_stonith_op: Initiating
> remote operation reboot for node02: 96530c7b-1c80-42c4-82cf-840bf3d5bb5f (0)
> node01 crmd[952]:   notice: te_rsc_command: Initiating action 68: notify
> drbd_pg_pre_notify_demote_0 on node02
> node01 crmd[952]:   notice: te_rsc_command: Initiating action 70: notify
> drbd_pg_pre_notify_demote_0 on node01 (local)
> node01 pengine[951]:  warning: process_pe_message: Calculated Transition
> 261: /var/lib/pacemaker/pengine/pe-warn-0.bz2
> node01 crmd[952]:   notice: process_lrm_event: LRM operation
> drbd_pg_notify_0 (call=63, rc=0, cib-update=0, confirmed=true) ok
> node01 kernel: [230495.836024] d-con pg: PingAck did not arrive in time.
> node01 kernel: [230495.836176] d-con pg: peer( Primary -> Unknown )
> conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) 
> node01 kernel: [230495.837204] d-con pg: asender terminated
> node01 kernel: [230495.837216] d-con pg: Terminating drbd_a_pg
> node01 kernel: [230495.837286] d-con pg: Connection closed
> node01 kernel: [230495.837298] d-con pg: conn( NetworkFailure ->
> Unconnected ) 
> node01 kernel: [230495.837299] d-con pg: receiver terminated
> node01 kernel: [230495.837300] d-con pg: Restarting receiver thread
> node01 kernel: [230495.837304] d-con pg: receiver (re)started
> node01 kernel: [230495.837314] d-con pg: conn( Unconnected ->
> WFConnection ) 
> node01 crmd[952]:  warning: action_timer_callback: Timer popped
> (timeout=20000, abort_level=1000000, complete=false)
> node01 crmd[952]:    error: print_synapse: [Action    2]: Completed rsc
> op drbd_pg_cancel_31000              on node01 (priority: 0, waiting: none)
> node01 crmd[952]:  warning: action_timer_callback: Timer popped
> (timeout=20000, abort_level=1000000, complete=false)
> node01 crmd[952]:    error: print_synapse: [Action   68]: In-flight rsc
> op drbd_pg_pre_notify_demote_0       on node02 (priority: 0, waiting: none)
> node01 crmd[952]:  warning: cib_action_update: rsc_op 68:
> drbd_pg_pre_notify_demote_0 on node02 timed out
> node01 crmd[952]:    error: cib_action_updated: Update 297 FAILED: Timer
> expired
> node01 crmd[952]:    error: stonith_async_timeout_handler: Async call 2
> timed out after 120000ms
> node01 crmd[952]:   notice: tengine_stonith_callback: Stonith operation
> 2/54:261:0:6978227d-ce2d-4dc6-955a-eb9313f112a5: Timer expired (-62)
> node01 crmd[952]:   notice: tengine_stonith_callback: Stonith operation
> 2 for node02 failed (Timer expired): aborting transition.
> node01 crmd[952]:   notice: run_graph: Transition 261 (Complete=6,
> Pending=0, Fired=0, Skipped=29, Incomplete=15,
> Source=/var/lib/pacemaker/pengine/pe-warn-0.bz2): Stopped
> node01 pengine[951]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> node01 pengine[951]:  warning: pe_fence_node: Node node02 will be fenced
> because our peer process is no longer available
> node01 pengine[951]:  warning: determine_online_status: Node node02 is
> unclean
> node01 pengine[951]:  warning: stage6: Scheduling Node node02 for STONITH
> node01 pengine[951]:   notice: LogActions: Move    fs_pg      (Started
> node02 -> node01)
> node01 pengine[951]:   notice: LogActions: Move    ip_pg      (Started
> node02 -> node01)
> node01 pengine[951]:   notice: LogActions: Move    lsb_pg     (Started
> node02 -> node01)
> node01 pengine[951]:   notice: LogActions: Demote  drbd_pg:0  (Master
> -> Stopped node02)
> node01 pengine[951]:   notice: LogActions: Promote drbd_pg:1  (Slave
> -> Master node01)
> node01 pengine[951]:   notice: LogActions: Stop    p_fence:0  (node02)
> node01 crmd[952]:   notice: te_fence_node: Executing reboot fencing
> operation (53) on node02 (timeout=60000)
> node01 stonith-ng[948]:   notice: handle_request: Client
> crmd.952.6d7ac808 wants to fence (reboot) 'node02' with device '(any)'
> node01 stonith-ng[948]:   notice: initiate_remote_stonith_op: Initiating
> remote operation reboot for node02: a4fae8ce-3a6c-4fe5-a934-b5b83ae123cb (0)
> node01 crmd[952]:   notice: te_rsc_command: Initiating action 67: notify
> drbd_pg_pre_notify_demote_0 on node02
> node01 crmd[952]:   notice: te_rsc_command: Initiating action 69: notify
> drbd_pg_pre_notify_demote_0 on node01 (local)
> node01 pengine[951]:  warning: process_pe_message: Calculated Transition
> 262: /var/lib/pacemaker/pengine/pe-warn-1.bz2
> node01 crmd[952]:   notice: process_lrm_event: LRM operation
> drbd_pg_notify_0 (call=66, rc=0, cib-update=0, confirmed=true) ok
> 
> Last updated: Mon Sep 15 01:15:59 2014
> Last change: Sat Sep 13 15:23:45 2014 via cibadmin on node01
> Stack: corosync
> Current DC: node01 (167936788) - partition with quorum
> Version: 1.1.10-42f2063
> 2 Nodes configured
> 7 Resources configured
> 
> 
> Node node02 (167936789): UNCLEAN (online)
> Online: [ node01 ]
> 
>  Resource Group: PGServer
>      fs_pg      (ocf::heartbeat:Filesystem):    Started node02
>      ip_pg      (ocf::heartbeat:IPaddr2):       Started node02
>      lsb_pg     (lsb:postgresql):       Started node02
>  Master/Slave Set: ms_drbd_pg [drbd_pg]
>      Masters: [ node02 ]
>      Slaves: [ node01 ]
>  Clone Set: cln_p_fence [p_fence]
>      Started: [ node01 node02 ]
> 
> Thank you,
> Norbert
> 
> On Fri, Sep 12, 2014 at 12:06 PM, Vladislav Bogdanov
> <bubble at hoster-ok.com> wrote:
> 
>     12.09.2014 05:00, Norbert Kiam Maclang wrote:
>     > Hi,
>     >
>     > After adding resource-level fencing on drbd, I still ended up having
>     > problems with timeouts on drbd. Are there recommended settings for
>     > this? I followed what is written in the drbd documentation -
>     >
>     http://www.drbd.org/users-guide-emb/s-pacemaker-crm-drbd-backed-service.html
>     > Another thing I can't understand is why failover works during the
>     > initial tests, even if I reboot the VMs several times, yet after I
>     > let the cluster soak for a couple of hours (say 8 hours or more) and
>     > continue with the tests, it does not fail over and I get a split
>     > brain. I confirmed that everything was healthy before performing a
>     > reboot: disk health and network are good, drbd is synced, and time
>     > between the servers is in sync.
> 
>     I recall seeing something similar a year ago (around the time your
>     pacemaker version is dated). I do not remember the exact cause, but
>     I saw the drbd RA time out because it waits for something (fencing)
>     in kernel space to finish. drbd calls userspace scripts from within
>     kernel space, and you'll see them in the process list with the drbd
>     kernel thread as their parent.
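>
>     One way to check for that (a generic sketch, nothing drbd-specific)
>     is to look at the process tree while a monitor operation hangs:
>
>         ps -ef --forest | grep -B2 -A2 drbd
>
>     If you see a helper such as crm-fence-peer.sh sitting under a drbd
>     kernel thread, the RA is most likely blocked waiting for it.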
> 
>     I'd also upgrade your corosync configuration from the "member" to
>     the "nodelist" syntax, specifying the "name" parameter together with
>     ring0_addr for each node (that parameter is not referenced in the
>     corosync docs but should be somewhere in Pacemaker Explained - it is
>     used only by pacemaker).
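>
>     A sketch of that syntax (the nodeid values here are illustrative;
>     the addresses are the ones from your member {} blocks):
>
>         nodelist {
>                 node {
>                         ring0_addr: 10.2.136.56
>                         name: node01
>                         nodeid: 1
>                 }
>                 node {
>                         ring0_addr: 10.2.136.57
>                         name: node02
>                         nodeid: 2
>                 }
>         }
>
>     The nodelist {} section replaces the member {} entries inside
>     interface {}.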
> 
>     There is also trace_ra functionality in both pacemaker and crmsh
>     (I cannot say whether the versions you have support it, though
>     probably yes), so you may want to play with that to get the exact
>     picture from the resource agent.
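>
>     With a recent enough crmsh that would be something like (check the
>     help of your version first):
>
>         crm resource trace drbd_pg monitor
>
>     That sets trace_ra=1 on the monitor operation, so each run of the
>     agent leaves a shell trace you can inspect to see where the time
>     goes before the 20-second timeout fires.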
> 
>     Anyway, upgrading to 1.1.12 and a more recent crmsh would be worth
>     doing, because you may simply be hitting a long-since fixed and
>     forgotten bug.
> 
>     Concerning your
>     >       expected-quorum-votes="1"
> 
>     You need to configure votequorum in corosync with two_node: 1 instead of
>     that line.
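>
>     That is, in /etc/corosync/corosync.conf:
>
>         quorum {
>                 provider: corosync_votequorum
>                 two_node: 1
>         }
>
>     With two_node: 1, votequorum keeps the surviving node quorate when
>     its peer disappears, which is what expected-quorum-votes="1" was
>     trying to approximate in the CIB.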
> 
>     >
>     > # Logs:
>     > node01 lrmd[1036]:  warning: child_timeout_callback:
>     > drbd_pg_monitor_29000 process (PID 27744) timed out
>     > node01 lrmd[1036]:  warning: operation_finished:
>     > drbd_pg_monitor_29000:27744 - timed out after 20000ms
>     > node01 crmd[1039]:    error: process_lrm_event: LRM operation
>     > drbd_pg_monitor_29000 (69) Timed Out (timeout=20000ms)
>     > node01 crmd[1039]:  warning: update_failcount: Updating failcount for
>     > drbd_pg on tyo1mqdb01p after failed monitor: rc=1 (update=value++,
>     > time=1410486352)
>     >
>     > Thanks,
>     > Kiam
>     >
>     > On Thu, Sep 11, 2014 at 6:58 PM, Norbert Kiam Maclang
>     > <norbert.kiam.maclang at gmail.com>
>     > wrote:
>     >
>     >     Thank you Vladislav.
>     >
>     >     I have configured resource-level fencing on drbd and removed
>     >     wfc-timeout and degr-wfc-timeout (is this required?). My drbd
>     >     configuration is now:
>     >
>     >     resource pg {
>     >       device /dev/drbd0;
>     >       disk /dev/vdb;
>     >       meta-disk internal;
>     >       disk {
>     >         fencing resource-only;
>     >         on-io-error detach;
>     >         resync-rate 40M;
>     >       }
>     >       handlers {
>     >         fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>     >         after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>     >         split-brain "/usr/lib/drbd/notify-split-brain.sh nkbm";
>     >       }
>     >       on node01 {
>     >         address 10.2.136.52:7789;
>     >       }
>     >       on node02 {
>     >         address 10.2.136.55:7789;
>     >       }
>     >       net {
>     >         verify-alg md5;
>     >         after-sb-0pri discard-zero-changes;
>     >         after-sb-1pri discard-secondary;
>     >         after-sb-2pri disconnect;
>     >       }
>     >     }
>     >
>     >     Failover works on my initial test (restarting both nodes
>     >     alternately - this always works). I will wait a couple of hours
>     >     before doing the failover test again (which always failed on my
>     >     previous setup).
>     >
>     >     Thank you!
>     >     Kiam
>     >
>     >     On Thu, Sep 11, 2014 at 2:14 PM, Vladislav Bogdanov
>     >     <bubble at hoster-ok.com> wrote:
>     >
>     >         11.09.2014 05:57, Norbert Kiam Maclang wrote:
>     >         > Is this something to do with quorum? But I already set
>     >
>     >         You'd need to configure fencing at the drbd resource level.
>     >
>     >         http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib
>     >
>     >
>     >         >
>     >         > property no-quorum-policy="ignore" \
>     >         > expected-quorum-votes="1"
>     >         >
>     >         > Thanks in advance,
>     >         > Kiam
>     >         >
>     >         > On Thu, Sep 11, 2014 at 10:09 AM, Norbert Kiam Maclang
>     >         > <norbert.kiam.maclang at gmail.com>
>     >         > wrote:
>     >         >
>     >         >     Hi,
>     >         >
>     >         >     Please help me understand what is causing the
>     >         >     problem. I have a 2-node cluster running on VMs
>     >         >     under KVM. Each VM (I am using Ubuntu 14.04) runs on
>     >         >     a separate hypervisor on a separate machine. All
>     >         >     worked well during testing (I restarted the VMs
>     >         >     alternately), but after a day, when I kill the other
>     >         >     node, corosync and pacemaker always end up hanging
>     >         >     on the surviving node. Date and time on the VMs are
>     >         >     in sync, I use unicast, tcpdump shows traffic in
>     >         >     both directions, and I confirmed that DRBD was
>     >         >     healthy and crm_mon showed good status before I
>     >         >     killed the other node. Below are my configurations
>     >         >     and the versions I used:
>     >         >
>     >         >     corosync             2.3.3-1ubuntu1
>     >         >     crmsh                1.2.5+hg1034-1ubuntu3
>     >         >     drbd8-utils          2:8.4.4-1ubuntu1
>     >         >     libcorosync-common4  2.3.3-1ubuntu1
>     >         >     libcrmcluster4       1.1.10+git20130802-1ubuntu2
>     >         >     libcrmcommon3        1.1.10+git20130802-1ubuntu2
>     >         >     libcrmservice1       1.1.10+git20130802-1ubuntu2
>     >         >     pacemaker            1.1.10+git20130802-1ubuntu2
>     >         >     pacemaker-cli-utils  1.1.10+git20130802-1ubuntu2
>     >         >     postgresql-9.3       9.3.5-0ubuntu0.14.04.1
>     >         >
>     >         >     # /etc/corosync/corosync.conf:
>     >         >     totem {
>     >         >             version: 2
>     >         >             token: 3000
>     >         >             token_retransmits_before_loss_const: 10
>     >         >             join: 60
>     >         >             consensus: 3600
>     >         >             vsftype: none
>     >         >             max_messages: 20
>     >         >             clear_node_high_bit: yes
>     >         >             secauth: off
>     >         >             threads: 0
>     >         >             rrp_mode: none
>     >         >             interface {
>     >         >                     member {
>     >         >                             memberaddr: 10.2.136.56
>     >         >                     }
>     >         >                     member {
>     >         >                             memberaddr: 10.2.136.57
>     >         >                     }
>     >         >                     ringnumber: 0
>     >         >                     bindnetaddr: 10.2.136.0
>     >         >                     mcastport: 5405
>     >         >             }
>     >         >             transport: udpu
>     >         >     }
>     >         >     amf {
>     >         >             mode: disabled
>     >         >     }
>     >         >     quorum {
>     >         >             provider: corosync_votequorum
>     >         >             expected_votes: 1
>     >         >     }
>     >         >     aisexec {
>     >         >             user:   root
>     >         >             group:  root
>     >         >     }
>     >         >     logging {
>     >         >             fileline: off
>     >         >             to_stderr: yes
>     >         >             to_logfile: no
>     >         >             to_syslog: yes
>     >         >             syslog_facility: daemon
>     >         >             debug: off
>     >         >             timestamp: on
>     >         >             logger_subsys {
>     >         >                     subsys: AMF
>     >         >                     debug: off
>     >         >                     tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>     >         >             }
>     >         >     }
>     >         >
>     >         >     # /etc/corosync/service.d/pcmk:
>     >         >     service {
>     >         >       name: pacemaker
>     >         >       ver: 1
>     >         >     }
>     >         >
>     >         >     # /etc/drbd.d/global_common.conf:
>     >         >     global {
>     >         >             usage-count no;
>     >         >     }
>     >         >
>     >         >     common {
>     >         >             net {
>     >         >                     protocol C;
>     >         >             }
>     >         >     }
>     >         >
>     >         >     # /etc/drbd.d/pg.res:
>     >         >     resource pg {
>     >         >       device /dev/drbd0;
>     >         >       disk /dev/vdb;
>     >         >       meta-disk internal;
>     >         >       startup {
>     >         >         wfc-timeout 15;
>     >         >         degr-wfc-timeout 60;
>     >         >       }
>     >         >       disk {
>     >         >         on-io-error detach;
>     >         >         resync-rate 40M;
>     >         >       }
>     >         >       on node01 {
>     >         >         address 10.2.136.56:7789;
>     >         >       }
>     >         >       on node02 {
>     >         >         address 10.2.136.57:7789;
>     >         >       }
>     >         >       net {
>     >         >         verify-alg md5;
>     >         >         after-sb-0pri discard-zero-changes;
>     >         >         after-sb-1pri discard-secondary;
>     >         >         after-sb-2pri disconnect;
>     >         >       }
>     >         >     }
>     >         >
>     >         >     # Pacemaker configuration:
>     >         >     node $id="167938104" node01
>     >         >     node $id="167938105" node02
>     >         >     primitive drbd_pg ocf:linbit:drbd \
>     >         >             params drbd_resource="pg" \
>     >         >             op monitor interval="29s" role="Master" \
>     >         >             op monitor interval="31s" role="Slave"
>     >         >     primitive fs_pg ocf:heartbeat:Filesystem \
>     >         >             params device="/dev/drbd0" \
>     >         >             directory="/var/lib/postgresql/9.3/main" \
>     >         >             fstype="ext4"
>     >         >     primitive ip_pg ocf:heartbeat:IPaddr2 \
>     >         >             params ip="10.2.136.59" cidr_netmask="24" nic="eth0"
>     >         >     primitive lsb_pg lsb:postgresql
>     >         >     group PGServer fs_pg lsb_pg ip_pg
>     >         >     ms ms_drbd_pg drbd_pg \
>     >         >             meta master-max="1" master-node-max="1" clone-max="2" \
>     >         >             clone-node-max="1" notify="true"
>     >         >     colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master
>     >         >     order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start
>     >         >     property $id="cib-bootstrap-options" \
>     >         >             dc-version="1.1.10-42f2063" \
>     >         >             cluster-infrastructure="corosync" \
>     >         >             stonith-enabled="false" \
>     >         >             no-quorum-policy="ignore"
>     >         >     rsc_defaults $id="rsc-options" \
>     >         >             resource-stickiness="100"
>     >         >
>     >         >     # Logs on node01
>     >         >     Sep 10 10:25:33 node01 crmd[1019]:   notice:
>     >         >     peer_update_callback: Our peer on the DC is dead
>     >         >     Sep 10 10:25:33 node01 crmd[1019]:   notice:
>     >         >     do_state_transition: State transition S_NOT_DC ->
>     >         >     S_ELECTION [ input=I_ELECTION
>     >         >     cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
>     >         >     Sep 10 10:25:33 node01 crmd[1019]:   notice:
>     >         >     do_state_transition: State transition S_ELECTION ->
>     >         >     S_INTEGRATION [ input=I_ELECTION_DC
>     >         >     cause=C_FSA_INTERNAL origin=do_election_check ]
>     >         >     Sep 10 10:25:33 node01 corosync[940]:   [TOTEM ] A new
>     >         >     membership (10.2.136.56:52) was formed. Members left:
>     >         >     167938105
>     >         >     Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg:
>     >         >     PingAck did not arrive in time.
>     >         >     Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg:
>     >         >     peer( Primary -> Unknown ) conn( Connected ->
>     >         >     NetworkFailure ) pdsk( UpToDate -> DUnknown )
>     >         >     Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg:
>     >         >     asender terminated
>     >         >     Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg:
>     >         >     Terminating drbd_a_pg
>     >         >     Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg:
>     >         >     Connection closed
>     >         >     Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg:
>     >         >     conn( NetworkFailure -> Unconnected )
>     >         >     Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg:
>     >         >     receiver terminated
>     >         >     Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg:
>     >         >     Restarting receiver thread
>     >         >     Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg:
>     >         >     receiver (re)started
>     >         >     Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg:
>     >         >     conn( Unconnected -> WFConnection )
>     >         >     Sep 10 10:26:12 node01 lrmd[1016]:  warning:
>     >         >     child_timeout_callback: drbd_pg_monitor_31000
>     >         >     process (PID 8445) timed out
>     >         >     Sep 10 10:26:12 node01 lrmd[1016]:  warning:
>     >         >     operation_finished: drbd_pg_monitor_31000:8445 -
>     >         >     timed out after 20000ms
>     >         >     Sep 10 10:26:12 node01 crmd[1019]:    error:
>     >         >     process_lrm_event: LRM operation
>     >         >     drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms)
>     >         >     Sep 10 10:26:32 node01 crmd[1019]:  warning:
>     >         >     cib_rsc_callback: Resource update 23 failed:
>     >         >     (rc=-62) Timer expired
>     >         >     Sep 10 10:27:03 node01 lrmd[1016]:  warning:
>     >         >     child_timeout_callback: drbd_pg_monitor_31000
>     >         >     process (PID 8693) timed out
>     >         >     Sep 10 10:27:03 node01 lrmd[1016]:  warning:
>     >         >     operation_finished: drbd_pg_monitor_31000:8693 -
>     >         >     timed out after 20000ms
>     >         >     Sep 10 10:27:54 node01 lrmd[1016]:  warning:
>     >         >     child_timeout_callback: drbd_pg_monitor_31000
>     >         >     process (PID 8938) timed out
>     >         >     Sep 10 10:27:54 node01 lrmd[1016]:  warning:
>     >         >     operation_finished: drbd_pg_monitor_31000:8938 -
>     >         >     timed out after 20000ms
>     >         >     Sep 10 10:28:33 node01 crmd[1019]:    error:
>     >         >     crm_timer_popped: Integration Timer (I_INTEGRATED)
>     >         >     just popped in state S_INTEGRATION! (180000ms)
>     >         >     Sep 10 10:28:33 node01 crmd[1019]:  warning:
>     >         >     do_state_transition: Progressed to state
>     >         >     S_FINALIZE_JOIN after C_TIMER_POPPED
>     >         >     Sep 10 10:28:33 node01 crmd[1019]:  warning:
>     >         >     do_state_transition: 1 cluster nodes failed to
>     >         >     respond to the join offer.
>     >         >     Sep 10 10:28:33 node01 crmd[1019]:   notice:
>     >         >     crmd_join_phase_log: join-1: node02=none
>     >         >     Sep 10 10:28:33 node01 crmd[1019]:   notice:
>     >         >     crmd_join_phase_log: join-1: node01=welcomed
>     >         >     Sep 10 10:28:45 node01 lrmd[1016]:  warning:
>     >         >     child_timeout_callback: drbd_pg_monitor_31000
>     >         >     process (PID 9185) timed out
>     >         >     Sep 10 10:28:45 node01 lrmd[1016]:  warning:
>     >         >     operation_finished: drbd_pg_monitor_31000:9185 -
>     >         >     timed out after 20000ms
>     >         >     Sep 10 10:29:36 node01 lrmd[1016]:  warning:
>     >         >     child_timeout_callback: drbd_pg_monitor_31000
>     >         >     process (PID 9432) timed out
>     >         >     Sep 10 10:29:36 node01 lrmd[1016]:  warning:
>     >         >     operation_finished: drbd_pg_monitor_31000:9432 -
>     >         >     timed out after 20000ms
>     >         >     Sep 10 10:30:27 node01 lrmd[1016]:  warning:
>     >         >     child_timeout_callback: drbd_pg_monitor_31000
>     >         >     process (PID 9680) timed out
>     >         >     Sep 10 10:30:27 node01 lrmd[1016]:  warning:
>     >         >     operation_finished: drbd_pg_monitor_31000:9680 -
>     >         >     timed out after 20000ms
>     >         >     Sep 10 10:31:18 node01 lrmd[1016]:  warning:
>     >         >     child_timeout_callback: drbd_pg_monitor_31000
>     >         >     process (PID 9927) timed out
>     >         >     Sep 10 10:31:18 node01 lrmd[1016]:  warning:
>     >         >     operation_finished: drbd_pg_monitor_31000:9927 -
>     >         >     timed out after 20000ms
>     >         >     Sep 10 10:32:09 node01 lrmd[1016]:  warning:
>     >         >     child_timeout_callback: drbd_pg_monitor_31000
>     >         >     process (PID 10174) timed out
>     >         >     Sep 10 10:32:09 node01 lrmd[1016]:  warning:
>     >         >     operation_finished: drbd_pg_monitor_31000:10174 -
>     >         >     timed out after 20000ms
>     >         >
>     >         >     #crm_mon on node01 before I kill the other vm:
>     >         >     Stack: corosync
>     >         >     Current DC: node02 (167938104) - partition with quorum
>     >         >     Version: 1.1.10-42f2063
>     >         >     2 Nodes configured
>     >         >     5 Resources configured
>     >         >
>     >         >     Online: [ node01 node02 ]
>     >         >
>     >         >      Resource Group: PGServer
>     >         >          fs_pg      (ocf::heartbeat:Filesystem):    Started node02
>     >         >          lsb_pg     (lsb:postgresql):               Started node02
>     >         >          ip_pg      (ocf::heartbeat:IPaddr2):       Started node02
>     >         >      Master/Slave Set: ms_drbd_pg [drbd_pg]
>     >         >          Masters: [ node02 ]
>     >         >          Slaves: [ node01 ]
>     >         >
>     >         >     Thank you,
>     >         >     Kiam
>     >         >
>     >         >
>     >         >
>     >         >
>     >
>     >
>     >
>     >
>     >
>     >
>     >
> 
> 
> 
> 
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 




