[ClusterLabs] Antw: [EXT] Stonith failing

Andrei Borzenkov arvidjaar at gmail.com
Tue Aug 18 15:07:44 EDT 2020


18.08.2020 17:02, Ken Gaillot wrote:
> On Tue, 2020-08-18 at 08:21 +0200, Klaus Wenninger wrote:
>> On 8/18/20 7:49 AM, Andrei Borzenkov wrote:
>>> 17.08.2020 23:39, Jehan-Guillaume de Rorthais wrote:
>>>> On Mon, 17 Aug 2020 10:19:45 -0500
>>>> Ken Gaillot <kgaillot at redhat.com> wrote:
>>>>
>>>>> On Fri, 2020-08-14 at 15:09 +0200, Gabriele Bulfon wrote:
>>>>>> Thanks to all your suggestions, I now have the systems with
>>>>>> stonith
>>>>>> configured on ipmi.  
>>>>>
>>>>> A word of caution: if the IPMI is on-board -- i.e. it shares
>>>>> the same
>>>>> power supply as the computer -- power becomes a single point of
>>>>> failure. If the node loses power, the other node can't fence
>>>>> because
>>>>> the IPMI is also down, and the cluster can't recover.
>>>>>
>>>>> Some on-board IPMI controllers can share an Ethernet port with
>>>>> the main
>>>>> computer, which would be a similar situation.
>>>>>
>>>>> It's best to have a backup fencing method when using IPMI as
>>>>> the
>>>>> primary fencing method. An example would be an intelligent
>>>>> power switch
>>>>> or sbd.
>>>>
>>>> How would SBD be useful in this scenario? The poison pill will not be
>>>> swallowed by the dead node... Is it just to wait for the watchdog
>>>> timeout?
>>>>
>>>
>>> A node is expected to commit suicide if SBD loses access to the shared
>>> block device. So either the node swallowed the poison pill and died, or
>>> it died because it realized it could not see the poison pill, or it was
>>> already dead. After the watchdog timeout (twice the watchdog timeout,
>>> for safety) we assume the node is dead.
>>
>> Yes, like this a suicide via watchdog will be triggered if there are
>> issues with the disk. This is why it is important to have a reliable
>> watchdog with SBD even when using poison pill. As this alone would
>> make a single shared disk a SPOF, running with pacemaker integration
>> (default) a node with SBD will survive despite losing the disk
>> when it has quorum and pacemaker looks healthy. As corosync-quorum
>> in 2-node-mode obviously won't be fit for this purpose SBD will
>> switch
>> to checking for presence of both nodes if 2-node-flag is set.
>>
>> Sorry for the lengthy explanation but the full picture is required
>> to understand why it is sufficiently reliable and useful if configured
>> correctly.
>>
>> Klaus
> 
> What I'm not sure about is how watchdog-only sbd would behave as a
> fail-back method for a regular fence device. Will the cluster wait for
> the sbd timeout no matter what, or only if the regular fencing fails,
> or ...?
> 

Diskless SBD implicitly creates a fencing device ("watchdog"); its timeout
starts only when this device is selected for fencing. This device
appears to be completely invisible to normal stonith_admin operation, and I
do not know how to query for it. In my testing, the explicit stonith resource
was always called first, and only if it failed was "watchdog" self-fencing
attempted. I tried setting a negative priority on the CIB stonith
resource, but it did not change anything.
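
To make the ordering I observed concrete, here is a minimal Python sketch of it. It is only a model of the behaviour seen in my testing, not Pacemaker or sbd code: plan_fencing, the device callables and the returned tuple are hypothetical names, and the "twice the watchdog timeout" wait is taken from the rule of thumb quoted earlier in this thread, not from any configuration I verified.

```python
from typing import Callable, List, Tuple

# Hypothetical names throughout; this models the observed ordering only
# and does not call any Pacemaker or sbd API.
FenceFn = Callable[[str], bool]


def plan_fencing(
    target: str,
    explicit_devices: List[Tuple[str, FenceFn]],
    watchdog_timeout_s: float,
) -> Tuple[str, float, List[str]]:
    """Return (device that fenced, seconds waited on the watchdog, devices tried)."""
    tried: List[str] = []

    # Explicit stonith resources are tried first, in the order given.
    for name, fence in explicit_devices:
        tried.append(name)
        if fence(target):
            # Success: the watchdog timer is never started.
            return name, 0.0, tried

    # Only after every explicit device has failed is the implicit
    # "watchdog" device selected; the wait starts at this point.
    tried.append("watchdog")
    wait_s = 2 * watchdog_timeout_s  # assumption: twice the timeout, for safety
    return "watchdog", wait_s, tried


if __name__ == "__main__":
    # Simulate on-board IPMI that is down together with the node.
    ipmi_down = lambda node: False
    print(plan_fencing("node2", [("ipmi", ipmi_down)], 5.0))
```

The point of the sketch is that the watchdog wait is paid only on the fallback path; a working explicit device returns immediately.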

