[ClusterLabs] Previous DC fenced prior to integration

Ken Gaillot kgaillot at redhat.com
Fri Jul 29 16:09:15 EDT 2016


On 07/28/2016 01:48 PM, Nate Clark wrote:
> On Mon, Jul 25, 2016 at 2:48 PM, Nate Clark <nate at neworld.us> wrote:
>> On Mon, Jul 25, 2016 at 11:20 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>> On 07/23/2016 10:14 PM, Nate Clark wrote:
>>>> On Sat, Jul 23, 2016 at 1:06 AM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>>>>> 23.07.2016 01:37, Nate Clark пишет:
>>>>>> Hello,
>>>>>>
>>>>>> I am running pacemaker 1.1.13 with corosync and think I may have
>>>>>> encountered a startup timing issue on a two-node cluster. I didn't
>>>>>> notice anything in the changelogs for 1.1.14 or 1.1.15, or in the
>>>>>> open bugs, that looked similar to this.
>>>>>>
>>>>>> The rough outline of what happened:
>>>>>>
>>>>>> Module 1 and 2 running
>>>>>> Module 1 is DC
>>>>>> Module 2 shuts down
>>>>>> Module 1 updates node attributes used by resources
>>>>>> Module 1 shuts down
>>>>>> Module 2 starts up
>>>>>> Module 2 votes itself as DC
>>>>>> Module 1 starts up
>>>>>> Module 2 sees module 1 in corosync and notices it has quorum
>>>>>> Module 2 enters the policy engine state
>>>>>> Module 2's policy engine decides to fence module 1
>>>>>> Module 2 then continues and starts resources on itself based on
>>>>>> the old state
>>>>>>
>>>>>> For some reason the integration never occurred and module 2
>>>>>> started to perform actions based on stale state.
>>>>>>
>>>>>> Here are the full logs:
>>>>>> Jul 20 16:29:06.376805 module-2 crmd[21969]:   notice: Connecting to
>>>>>> cluster infrastructure: corosync
>>>>>> Jul 20 16:29:06.386853 module-2 crmd[21969]:   notice: Could not
>>>>>> obtain a node name for corosync nodeid 2
>>>>>> Jul 20 16:29:06.392795 module-2 crmd[21969]:   notice: Defaulting to
>>>>>> uname -n for the local corosync node name
>>>>>> Jul 20 16:29:06.403611 module-2 crmd[21969]:   notice: Quorum lost
>>>>>> Jul 20 16:29:06.409237 module-2 stonith-ng[21965]:   notice: Watching
>>>>>> for stonith topology changes
>>>>>> Jul 20 16:29:06.409474 module-2 stonith-ng[21965]:   notice: Added
>>>>>> 'watchdog' to the device list (1 active devices)
>>>>>> Jul 20 16:29:06.413589 module-2 stonith-ng[21965]:   notice: Relying
>>>>>> on watchdog integration for fencing
>>>>>> Jul 20 16:29:06.416905 module-2 cib[21964]:   notice: Defaulting to
>>>>>> uname -n for the local corosync node name
>>>>>> Jul 20 16:29:06.417044 module-2 crmd[21969]:   notice:
>>>>>> pcmk_quorum_notification: Node module-2[2] - state is now member (was
>>>>>> (null))
>>>>>> Jul 20 16:29:06.421821 module-2 crmd[21969]:   notice: Defaulting to
>>>>>> uname -n for the local corosync node name
>>>>>> Jul 20 16:29:06.422121 module-2 crmd[21969]:   notice: Notifications disabled
>>>>>> Jul 20 16:29:06.422149 module-2 crmd[21969]:   notice: Watchdog
>>>>>> enabled but stonith-watchdog-timeout is disabled
>>>>>> Jul 20 16:29:06.422286 module-2 crmd[21969]:   notice: The local CRM
>>>>>> is operational
>>>>>> Jul 20 16:29:06.422312 module-2 crmd[21969]:   notice: State
>>>>>> transition S_STARTING -> S_PENDING [ input=I_PENDING
>>>>>> cause=C_FSA_INTERNAL origin=do_started ]
>>>>>> Jul 20 16:29:07.416871 module-2 stonith-ng[21965]:   notice: Added
>>>>>> 'fence_sbd' to the device list (2 active devices)
>>>>>> Jul 20 16:29:08.418567 module-2 stonith-ng[21965]:   notice: Added
>>>>>> 'ipmi-1' to the device list (3 active devices)
>>>>>> Jul 20 16:29:27.423578 module-2 crmd[21969]:  warning: FSA: Input
>>>>>> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>>>>>> Jul 20 16:29:27.424298 module-2 crmd[21969]:   notice: State
>>>>>> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
>>>>>> cause=C_TIMER_POPPED origin=election_timeout_popped ]
>>>>>> Jul 20 16:29:27.460834 module-2 crmd[21969]:  warning: FSA: Input
>>>>>> I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
>>>>>> Jul 20 16:29:27.463794 module-2 crmd[21969]:   notice: Notifications disabled
>>>>>> Jul 20 16:29:27.463824 module-2 crmd[21969]:   notice: Watchdog
>>>>>> enabled but stonith-watchdog-timeout is disabled
>>>>>> Jul 20 16:29:27.473285 module-2 attrd[21967]:   notice: Defaulting to
>>>>>> uname -n for the local corosync node name
>>>>>> Jul 20 16:29:27.498464 module-2 pengine[21968]:   notice: Relying on
>>>>>> watchdog integration for fencing
>>>>>> Jul 20 16:29:27.498536 module-2 pengine[21968]:   notice: We do not
>>>>>> have quorum - fencing and resource management disabled
>>>>>> Jul 20 16:29:27.502272 module-2 pengine[21968]:  warning: Node
>>>>>> module-1 is unclean!
>>>>>> Jul 20 16:29:27.502287 module-2 pengine[21968]:   notice: Cannot fence
>>>>>> unclean nodes until quorum is attained (or no-quorum-policy is set to
>>>>>> ignore)
>>>
>>> The above two messages indicate that module-2 cannot see module-1 at
>>> startup, so it must assume module-1 is potentially misbehaving and
>>> must be shot. However, since it does not have quorum with only one
>>> out of two nodes, it must wait until module-1 joins before it can
>>> shoot it!
>>>
>>> This is a special problem with quorum in a two-node cluster. There are a
>>> variety of ways to deal with it, but the simplest is to set "two_node:
>>> 1" in corosync.conf (with corosync 2 or later). This will make each node
>>> wait for the other at startup, meaning both nodes must be started before
>>> the cluster can start working, but from that point on, it will assume it
>>> has quorum, and use fencing to ensure any lost node is really down.
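>>>
>>> For reference, a minimal quorum section with two_node enabled, as a
>>> sketch to be merged into your existing corosync.conf rather than
>>> copied verbatim, looks roughly like:
>>>
>>>   quorum {
>>>       # the standard quorum provider with corosync 2
>>>       provider: corosync_votequorum
>>>       # two_node: 1 implies wait_for_all, so both nodes must be
>>>       # seen at startup before the cluster will manage resources
>>>       two_node: 1
>>>   }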
>>
>> two_node is set to 1 for this system. I understand what you are
>> saying, but what usually happens is that S_INTEGRATION occurs after
>> quorum is achieved, and the current DC acknowledges the node that
>> just started and accepts it into the cluster. However, it looks like
>> S_POLICY_ENGINE occurred first here.
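>>
>> In case it helps when comparing the two cases, the elected DC and the
>> crmd state on a given node can be queried directly; the node name
>> below is just this cluster's:
>>
>>   crmadmin -D            # prints which node is currently the DC
>>   crmadmin -S module-2   # prints that node's crmd state, e.g. S_IDLE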
> 
> I saw a similar situation occur on another two-node system. Based on
> Ken's previous comment, it sounds like this is unexpected behavior
> when two_node is enabled, or did I misinterpret his comment?
> 
> Thanks
> -nate

I didn't think it through properly ... two_node will only affect quorum,
so the above sequence makes sense once the cluster decides fencing is
necessary.

I'm not sure why it sometimes goes into S_INTEGRATION and sometimes
S_POLICY_ENGINE. In the above logs, it goes into S_INTEGRATION because
the DC election timed out. How are the logs in the successful case
different? Maybe the other node happens to join before the DC election
times out?
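
If timing is the difference, one experiment (a guess, not a confirmed
fix) would be to raise dc-deadtime, which I believe is the timer behind
that I_DC_TIMEOUT message: it sets how long a newly started crmd waits
to hear from an existing DC before giving up, and the roughly 21s gap
in your logs matches its usual 20s default. For example:

  # give the peer more time to show up before the election times out
  crm_attribute --type crm_config --name dc-deadtime --update 60s

That would only mask a race rather than explain it, but it could help
confirm whether module-1 joining before the election timeout is what
makes the successful case succeed.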



