[ClusterLabs] Pacemaker startup-fencing

Wed Mar 16 08:47:52 EDT 2016

Andrei Borzenkov <arvidjaar at gmail.com> writes:

> On Wed, Mar 16, 2016 at 2:22 PM, Ferenc Wágner <wferi at niif.hu> wrote:
>
>> Pacemaker Explained says about this cluster option:
>>
>>     Advanced Use Only: Should the cluster shoot unseen nodes? Not using
>>     the default is very unsafe!
>>
>> 1. What are those "unseen" nodes?
>
> Nodes that lost communication with other nodes (think of unplugging cables)

Translating to node status, does it mean UNCLEAN (offline) nodes which
suddenly return?  Can Pacemaker tell these apart from abruptly power
cycled nodes (when reboot happens before the comeback)?  I guess if a
node was successfully fenced at the time, it won't be considered
UNCLEAN, but is that the only way to avoid that?

>> And a possibly related question:
>>
>> 2. If I've got UNCLEAN (offline) nodes, is there a way to clean them up,
>>    so that they don't get fenced when I switch them on?  I mean without
>>    removing the node altogether, to keep its capacity settings for
>>    example.
>
> You can declare a node as down using "crm node clearstate". You should
> not really do it unless you have ascertained that the node is actually
> physically down.

Great.  Is there an equivalent in bare-bones Pacemaker, that is, not
involving the CRM shell?  Like deleting some status or LRMD history
element of the node, for example?

>> And some more about fencing:
>>
>> 3. What's the difference in cluster behavior between
>>    - stonith-enabled=FALSE (9.3.2: how often will the stop operation be retried?)
>>    - having no configured STONITH devices (resources won't be started, right?)
>>    - failing to STONITH with some error (on every node)
>>    - timing out the STONITH operation
>>    - manual fencing
>
> I do not think there is much difference. Without fencing, Pacemaker
> cannot decide to relocate resources, so the cluster will be stuck.

Then I wonder why I hear the "must have working fencing if you value
your data" mantra so often (and always without explanation).  After all,
missing fencing risks only the automatic cluster recovery, not the data,
right?

>> 4. What's the modern way to do manual fencing?  (stonith_admin
>>    --confirm + what?)
>
> node name.

:) I worded that question really poorly.  I meant to ask what kind of
cluster (STONITH) configuration makes the cluster sit patiently until I
do the manual fencing, then carry on without timeouts or other errors.
Just as if some automatic fencing agent did the job, but letting me
investigate the node status beforehand.
-- 
Thanks,
Feri