[ClusterLabs] Previous DC fenced prior to integration

Nate Clark nate at neworld.us
Thu Jul 28 14:48:03 EDT 2016


On Mon, Jul 25, 2016 at 2:48 PM, Nate Clark <nate at neworld.us> wrote:
> On Mon, Jul 25, 2016 at 11:20 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>> On 07/23/2016 10:14 PM, Nate Clark wrote:
>>> On Sat, Jul 23, 2016 at 1:06 AM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>>>> 23.07.2016 01:37, Nate Clark wrote:
>>>>> Hello,
>>>>>
>>>>> I am running pacemaker 1.1.13 with corosync and think I may have
>>>>> encountered a startup timing issue on a two-node cluster. I didn't
>>>>> notice anything in the changelogs for 1.1.14 or 1.1.15 that looked
>>>>> similar to this, or any open bugs.
>>>>>
>>>>> The rough outline of what happened:
>>>>>
>>>>> Module 1 and 2 running
>>>>> Module 1 is DC
>>>>> Module 2 shuts down
>>>>> Module 1 updates node attributes used by resources
>>>>> Module 1 shuts down
>>>>> Module 2 starts up
>>>>> Module 2 votes itself as DC
>>>>> Module 1 starts up
>>>>> Module 2 sees module 1 in corosync and notices it has quorum
>>>>> Module 2 enters policy engine state.
>>>>> Module 2 policy engine decides to fence 1
>>>>> Module 2 then continues and starts resources on itself based upon the old state
>>>>>
>>>>> For some reason the integration never occurred, and module 2 started
>>>>> performing actions based on stale state.
>>>>>
>>>>> Here are the full logs:
>>>>> Jul 20 16:29:06.376805 module-2 crmd[21969]:   notice: Connecting to
>>>>> cluster infrastructure: corosync
>>>>> Jul 20 16:29:06.386853 module-2 crmd[21969]:   notice: Could not
>>>>> obtain a node name for corosync nodeid 2
>>>>> Jul 20 16:29:06.392795 module-2 crmd[21969]:   notice: Defaulting to
>>>>> uname -n for the local corosync node name
>>>>> Jul 20 16:29:06.403611 module-2 crmd[21969]:   notice: Quorum lost
>>>>> Jul 20 16:29:06.409237 module-2 stonith-ng[21965]:   notice: Watching
>>>>> for stonith topology changes
>>>>> Jul 20 16:29:06.409474 module-2 stonith-ng[21965]:   notice: Added
>>>>> 'watchdog' to the device list (1 active devices)
>>>>> Jul 20 16:29:06.413589 module-2 stonith-ng[21965]:   notice: Relying
>>>>> on watchdog integration for fencing
>>>>> Jul 20 16:29:06.416905 module-2 cib[21964]:   notice: Defaulting to
>>>>> uname -n for the local corosync node name
>>>>> Jul 20 16:29:06.417044 module-2 crmd[21969]:   notice:
>>>>> pcmk_quorum_notification: Node module-2[2] - state is now member (was
>>>>> (null))
>>>>> Jul 20 16:29:06.421821 module-2 crmd[21969]:   notice: Defaulting to
>>>>> uname -n for the local corosync node name
>>>>> Jul 20 16:29:06.422121 module-2 crmd[21969]:   notice: Notifications disabled
>>>>> Jul 20 16:29:06.422149 module-2 crmd[21969]:   notice: Watchdog
>>>>> enabled but stonith-watchdog-timeout is disabled
>>>>> Jul 20 16:29:06.422286 module-2 crmd[21969]:   notice: The local CRM
>>>>> is operational
>>>>> Jul 20 16:29:06.422312 module-2 crmd[21969]:   notice: State
>>>>> transition S_STARTING -> S_PENDING [ input=I_PENDING
>>>>> cause=C_FSA_INTERNAL origin=do_started ]
>>>>> Jul 20 16:29:07.416871 module-2 stonith-ng[21965]:   notice: Added
>>>>> 'fence_sbd' to the device list (2 active devices)
>>>>> Jul 20 16:29:08.418567 module-2 stonith-ng[21965]:   notice: Added
>>>>> 'ipmi-1' to the device list (3 active devices)
>>>>> Jul 20 16:29:27.423578 module-2 crmd[21969]:  warning: FSA: Input
>>>>> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>>>>> Jul 20 16:29:27.424298 module-2 crmd[21969]:   notice: State
>>>>> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
>>>>> cause=C_TIMER_POPPED origin=election_timeout_popped ]
>>>>> Jul 20 16:29:27.460834 module-2 crmd[21969]:  warning: FSA: Input
>>>>> I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
>>>>> Jul 20 16:29:27.463794 module-2 crmd[21969]:   notice: Notifications disabled
>>>>> Jul 20 16:29:27.463824 module-2 crmd[21969]:   notice: Watchdog
>>>>> enabled but stonith-watchdog-timeout is disabled
>>>>> Jul 20 16:29:27.473285 module-2 attrd[21967]:   notice: Defaulting to
>>>>> uname -n for the local corosync node name
>>>>> Jul 20 16:29:27.498464 module-2 pengine[21968]:   notice: Relying on
>>>>> watchdog integration for fencing
>>>>> Jul 20 16:29:27.498536 module-2 pengine[21968]:   notice: We do not
>>>>> have quorum - fencing and resource management disabled
>>>>> Jul 20 16:29:27.502272 module-2 pengine[21968]:  warning: Node
>>>>> module-1 is unclean!
>>>>> Jul 20 16:29:27.502287 module-2 pengine[21968]:   notice: Cannot fence
>>>>> unclean nodes until quorum is attained (or no-quorum-policy is set to
>>>>> ignore)
>>
>> The above two messages indicate that module-2 cannot see module-1 at
>> startup, therefore it must assume module-1 is potentially misbehaving and
>> must be shot. However, since it does not have quorum with only one out of
>> two nodes, it must wait until module-1 joins before it can shoot it!
>>
>> This is a special problem with quorum in a two-node cluster. There are a
>> variety of ways to deal with it, but the simplest is to set "two_node:
>> 1" in corosync.conf (with corosync 2 or later). This will make each node
>> wait for the other at startup, meaning both nodes must be started before
>> the cluster can start working, but from that point on, it will assume it
>> has quorum, and use fencing to ensure any lost node is really down.
>
> two_node is set to 1 for this system. I understand what you are saying,
> but what usually happens is that S_INTEGRATION occurs after quorum is
> achieved, and the current DC acknowledges the node which just started and
> accepts it into the cluster. However, it looks like S_POLICY_ENGINE
> occurred first.

I saw a similar situation occur on another two-node system. Based on
Ken's previous comment, it sounds like this is unexpected behavior when
two_node is enabled, or did I misinterpret his comment?
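
If it would help, I can also capture the runtime quorum state from both
nodes the next time this happens. As far as I know, something along these
lines should show whether the 2Node and WaitForAll flags are actually in
effect:

    corosync-quorumtool -s           # the "Flags:" line should list them
    corosync-cmapctl | grep quorum   # quorum settings as corosync sees them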

Thanks
-nate



