[ClusterLabs] Antw: Re: Antw: Re: design of a two-node cluster
arvidjaar at gmail.com
Tue Dec 8 04:14:17 EST 2015
On Tue, Dec 8, 2015 at 12:01 PM, Ulrich Windl
<Ulrich.Windl at rz.uni-regensburg.de> wrote:
>>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 08.12.2015 at 09:01 in
> <CAA91j0Un+1EN6xRLM=dM6CK+UsDZmPnyYJtHa9d+BTRZFCgg2Q at mail.gmail.com>:
>> On Tue, Dec 8, 2015 at 10:44 AM, Ulrich Windl
>> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
>>>>>> Digimer <lists at alteeve.ca> wrote on 07.12.2015 at 22:40 in message
>>> <5665FCDC.1030001 at alteeve.ca>:
>>>> Node 1 looks up how to fence node 2, sees no delay and fences
>>>> immediately. Node 2 looks up how to fence node 1, sees a delay and
>>>> pauses. Node 2 will be dead long before the delay expires, ensuring that
>>>> node 2 always loses in such a case. If you have VMs on both nodes, then
>>>> no matter which node the delay is on, some servers will be interrupted.
>>> AFAIK, the cluster will try to migrate resources if a fencing is pending,
>>> but not yet complete. Is that true?
>> If by "migrate" you really mean "restart resources that were
>> located on the node that became inaccessible", I seriously hope the
>> answer is "no"; otherwise, what is the point of attempting fencing in
>> the first place?
> A node must be fenced if at least one resource fails to stop.
No, it "must" not. It is up to whoever configures the cluster to
decide. If this resource is so important that the cluster has to recover
it at any cost, then yes, fencing may be the only option. Leaving the
resource as failed and letting the administrator handle it manually is
another option (it is quite possible that if it failed to stop it will
also fail to start, in which case you just caused downtime without any
> That means other resources still may be able to be stopped or migrated before the fencing takes place. Possibly this is a decision between "kill everything as fast as possible" vs. "try to stop as many services as possible cleanly". I prefer the latter, but preferences may vary.
OK, in this context your question makes sense indeed. Personally I
also feel like "it has failed already, so it is not really that
urgent", especially if other resources can indeed be migrated
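
For reference, the fence-delay race Digimer describes earlier in the
thread can be modelled in a few lines. This is only a rough sketch under
my own assumptions, not how the cluster stack implements it: the function
name surviving_node, the fence_duration parameter, and the idea that a
fence action takes a fixed time are all hypothetical.

```python
def surviving_node(delay_before_fencing, fence_duration=2.0):
    """Model a mutual fencing race in a two-node cluster.

    delay_before_fencing maps each node name to the delay (in seconds)
    that its peer must wait before fencing it. fence_duration is an
    assumed, fixed time for a fence action to complete.

    Returns the name of the node that survives, or None when both fence
    actions would complete at the same moment (no clear winner).
    """
    if len(delay_before_fencing) != 2:
        raise ValueError("this sketch only models a two-node cluster")

    # Moment at which each node would be powered off by its peer.
    killed_at = {
        target: delay + fence_duration
        for target, delay in delay_before_fencing.items()
    }

    first, second = killed_at
    if killed_at[first] == killed_at[second]:
        return None

    # The node that would be killed later is never killed at all,
    # because its peer is already dead by then.
    return max(killed_at, key=killed_at.get)


# The delay protects node1: node2 has to wait 15 s before fencing it,
# while node1 fences node2 immediately.
print(surviving_node({"node1": 15.0, "node2": 0.0}))
```

With the delay on node1 the result is always node1, whichever node holds
the resources, which is why some services are interrupted either way if
both nodes run VMs.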