[Pacemaker] One node thinks everyone is online, but the other node doesn't think so

Andrew Beekhof andrew at beekhof.net
Mon Feb 23 23:07:02 UTC 2015


> On 26 Jan 2015, at 10:53 am, Dmitry Koterov <dmitry.koterov at gmail.com> wrote:
> 
> Hello.
> 
> I have a 3-node cluster where node1 and node2 are running corosync+pacemaker and node3 is running corosync only (for quorum). Corosync 2.3.3, pacemaker 1.1.10. Everything worked fine the first couple of days.
> 
> At some point I discovered the following situation: node2 thinks that both node1 and node2 are online, but node1 thinks that node2 is down. Could you please tell me how this could happen? There are no connectivity problems between the nodes at the moment (maybe there were earlier, but then why hasn't the system recovered?).

The logs do show connectivity problems occurring, so no doubt there.
As to why it hasn't recovered: first check corosync - if it does not have a consistent view of the world, pacemaker has no hope of getting one either.
Alternatively, I recall there was a bug in your version that could prevent recovery here.  So if corosync looks fine, perhaps try an upgrade.
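
For example, compare the membership that each node reports. On all three nodes:

    corosync-quorumtool -s
    corosync-cmapctl | grep members

and on the two pacemaker nodes:

    crm_node -l

All of them should agree on who the members are; if they don't, the problem is below pacemaker and that is where to dig.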

> The "crm status" is below. What other logs should I attach for the diagnostics?
> 
> Also, "service corosync stop" on node1 hangs forever with no additional lines in logs, so I cannot even stop the service. (But after "service corosync stop" on node1 the node node2 starts thinking that node1 is offline, although the command still hangs.)
> 
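On the hanging "service corosync stop": I can't say for certain from what is here, but stopping corosync out from under a running pacemaker is a common way to end up with a hung shutdown. Stop pacemaker first, then corosync:

    service pacemaker stop
    service corosync stop
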
> 
> root at node2:~# crm status
> Current DC: node1 (1760315215) - partition with quorum
> 2 Nodes configured
> 6 Resources configured
> Online: [ node1 node2 ]
> Master/Slave Set: ms_drbd [drbd]
>      Masters: [ node2 ]
>      Slaves: [ node1 ]
> Resource Group: server
>      fs	(ocf::heartbeat:Filesystem):	Started node2
>      postgresql	(lsb:postgresql):	Started node2
>      bind9	(lsb:bind9):	Started node2
>      nginx	(lsb:nginx):	Started node2
> 
> 
> root at node1:/var/log# crm status
> Current DC: node1 (1760315215) - partition with quorum
> 2 Nodes configured
> 6 Resources configured
> Online: [ node1 ]
> OFFLINE: [ node2 ]
> Master/Slave Set: ms_drbd [drbd]
>      Masters: [ node1 ]
>      Stopped: [ node2 ]
> Resource Group: server
>      fs	(ocf::heartbeat:Filesystem):	Started node1
>      postgresql	(lsb:postgresql):	Started node1
>      bind9	(lsb:bind9):	Started node1
>      nginx	(lsb:nginx):	Started node1
> Failed actions:
>     drbd_promote_0 (node=node1, call=634, rc=1, status=Timed Out, last-rc-change=Thu Jan 22 10:30:08 2015, queued=20004ms, exec=0ms): unknown error
> 
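That failed promote will also stay in the status output until it is cleaned up. Once the underlying problem is sorted out, something along the lines of

    crm resource cleanup drbd

should clear it (the resource name here is taken from your status output above).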
> 
> A part of "crm configure show":
> 
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.10-42f2063" \
>         cluster-infrastructure="corosync" \
>         stonith-enabled="false" \
>         last-lrm-refresh="1421250983"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="100"
> 
> 
> Also I see in logs on node1 (maybe they're related to the issue, maybe not):
> 
> Jan 22 10:14:02 node1 pengine[2772]:  warning: pe_fence_node: Node node2 is unclean because it is partially and/or un-expectedly down
> Jan 22 10:14:02 node1 pengine[2772]:  warning: determine_online_status: Node node2 is unclean
> Jan 22 10:14:02 node1 pengine[2772]:  warning: stage6: Node node2 is unclean!
> Jan 22 10:14:02 node1 pengine[2772]:  warning: stage6: YOUR RESOURCES ARE NOW LIKELY COMPROMISED
> Jan 22 10:14:02 node1 pengine[2772]:    error: stage6: ENABLE STONITH TO KEEP YOUR RESOURCES SAFE
> 
> 
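Those last two lines deserve attention: with stonith-enabled=false the cluster has no safe way to recover resources from a node it has lost contact with, which is exactly what your logs show happening. The shape of the fix is roughly the following - the fence agent and its parameters are only placeholders here, substitute whatever fencing device your environment actually provides:

    crm configure primitive st-node2 stonith:fence_ipmilan \
            params pcmk_host_list="node2" ipaddr="<bmc-of-node2>" login="<user>" passwd="<password>"
    crm configure property stonith-enabled=true

(and the equivalent for node1.)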
> On node2 the logs are:
> 
> Jan 22 10:13:57 node2 corosync[32761]:  [TOTEM ] A new membership (188.166.54.190:6276) was formed. Members left: 1760315215 13071578
> Jan 22 10:13:57 node2 crmd[311]:   notice: peer_update_callback: Our peer on the DC is dead
> Jan 22 10:13:57 node2 crmd[311]:   notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> Jan 22 10:13:57 node2 corosync[32761]:  [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Jan 22 10:13:57 node2 corosync[32761]:  [QUORUM] Members[1]: 1017525950
> Jan 22 10:13:57 node2 crmd[311]:   notice: pcmk_quorum_notification: Membership 6276: quorum lost (1)
> Jan 22 10:13:57 node2 crmd[311]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was member)
> Jan 22 10:13:57 node2 crmd[311]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was member)
> Jan 22 10:13:57 node2 pacemakerd[302]:   notice: pcmk_quorum_notification: Membership 6276: quorum lost (1)
> Jan 22 10:13:57 node2 pacemakerd[302]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was member)
> Jan 22 10:13:57 node2 pacemakerd[302]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was member)
> Jan 22 10:13:57 node2 corosync[32761]:  [MAIN  ] Completed service synchronization, ready to provide service.
> Jan 22 10:14:01 node2 corosync[32761]:  [TOTEM ] A new membership (104.236.71.79:6288) was formed. Members joined: 1760315215 13071578
> Jan 22 10:14:02 node2 crmd[311]:    error: pcmk_cpg_membership: Node node1[1760315215] appears to be online even though we think it is dead
> Jan 22 10:14:02 node2 crmd[311]:   notice: crm_update_peer_state: pcmk_cpg_membership: Node node1[1760315215] - state is now member (was lost)
> Jan 22 10:14:03 node2 corosync[32761]:  [QUORUM] This node is within the primary component and will provide service.
> Jan 22 10:14:03 node2 corosync[32761]:  [QUORUM] Members[3]: 1760315215 13071578 1017525950
> Jan 22 10:14:03 node2 crmd[311]:   notice: pcmk_quorum_notification: Membership 6288: quorum acquired (3)
> Jan 22 10:14:03 node2 pacemakerd[302]:   notice: pcmk_quorum_notification: Membership 6288: quorum acquired (3)
> Jan 22 10:14:03 node2 pacemakerd[302]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now member (was lost)
> Jan 22 10:14:03 node2 corosync[32761]:  [MAIN  ] Completed service synchronization, ready to provide service.
> Jan 22 10:14:03 node2 crmd[311]:   notice: corosync_node_name: Unable to get node name for nodeid 13071578
> Jan 22 10:14:03 node2 crmd[311]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now member (was lost)
> Jan 22 10:14:03 node2 pacemakerd[302]:   notice: corosync_node_name: Unable to get node name for nodeid 13071578
> Jan 22 10:14:03 node2 pacemakerd[302]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now member (was lost)
> Jan 22 10:14:03 node2 crmd[311]:  warning: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_ELECTION
> Jan 22 10:14:04 node2 crmd[311]:   notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
> Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd (10000)
> Jan 22 10:14:05 node2 attrd[310]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> Jan 22 10:14:05 node2 crmd[311]:   notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> Jan 22 10:15:11 node2 corosync[32761]:  [TOTEM ] A new membership (104.236.71.79:6296) was formed. Members left: 13071578
> Jan 22 10:15:14 node2 corosync[32761]:  [TOTEM ] A new membership (128.199.116.218:6312) was formed. Members joined: 13071578 left: 1760315215
> Jan 22 10:15:17 node2 corosync[32761]:  [TOTEM ] A new membership (104.236.71.79:6324) was formed. Members joined: 1760315215
> Jan 22 10:15:19 node2 crmd[311]:   notice: peer_update_callback: Our peer on the DC is dead
> Jan 22 10:15:19 node2 crmd[311]:   notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> Jan 22 10:15:20 node2 kernel: [690741.179442] block drbd0: peer( Primary -> Secondary )
> Jan 22 10:15:20 node2 corosync[32761]:  [QUORUM] Members[3]: 1760315215 13071578 1017525950
> Jan 22 10:15:20 node2 corosync[32761]:  [MAIN  ] Completed service synchronization, ready to provide service.
> 
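Those repeated "new membership was formed" lines, with members leaving and rejoining within seconds, are the connectivity problems I mentioned above. Next time it happens, running

    corosync-cfgtool -s

on each node will show whether the totem ring itself is reporting faults, which helps separate a network problem from a corosync or pacemaker one.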
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




