[ClusterLabs] Antw: [EXT] Recovering from node failure

Gabriele Bulfon gbulfon at sonicle.com
Mon Dec 14 05:48:09 EST 2020


Thanks!

I tried the first option, by adding pcmk_delay_base to the two stonith primitives.
The first has 1 second, the second has 5 seconds.
It didn't work :( they still killed each other :(
Is there anything wrong with the way I did it?
 
Here's the config:
 
node 1: xstha1 \
        attributes standby=off maintenance=off
node 2: xstha2 \
        attributes standby=off maintenance=off
primitive xstha1-stonith stonith:external/ipmi \
        params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN passwd="***" interface=lanplus pcmk_delay_base=1 \
        op monitor interval=25 timeout=25 start-delay=25 \
        meta target-role=Started
primitive xstha1_san0_IP IPaddr \
        params ip=10.10.10.1 cidr_netmask=255.255.255.0 nic=san0
primitive xstha2-stonith stonith:external/ipmi \
        params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN passwd="***" interface=lanplus pcmk_delay_base=5 \
        op monitor interval=25 timeout=25 start-delay=25 \
        meta target-role=Started
primitive xstha2_san0_IP IPaddr \
        params ip=10.10.10.2 cidr_netmask=255.255.255.0 nic=san0
primitive zpool_data ZFS \
        params pool=test \
        op start timeout=90 interval=0 \
        op stop timeout=90 interval=0 \
        meta target-role=Started
location xstha1-stonith-pref xstha1-stonith -inf: xstha1
location xstha1_san0_IP_pref xstha1_san0_IP 100: xstha1
location xstha2-stonith-pref xstha2-stonith -inf: xstha2
location xstha2_san0_IP_pref xstha2_san0_IP 100: xstha2
order zpool_data_order inf: zpool_data ( xstha1_san0_IP )
location zpool_data_pref zpool_data 100: xstha1
colocation zpool_data_with_IPs inf: zpool_data xstha1_san0_IP
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.15-e174ec8 \
        cluster-infrastructure=corosync \
        stonith-action=poweroff \
        no-quorum-policy=stop
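 
A quick way to double-check that the delays actually landed in the live configuration (assuming crmsh and the standard pacemaker command-line tools; these commands only read the config, they change nothing):

# Show the stonith primitives as the cluster currently sees them:
crm configure show xstha1-stonith xstha2-stonith

# Or query the delay parameter directly:
crm_resource --resource xstha1-stonith --get-parameter pcmk_delay_base
crm_resource --resource xstha2-stonith --get-parameter pcmk_delay_base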
 
 
Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
 




----------------------------------------------------------------------------------

From: Andrei Borzenkov <arvidjaar at gmail.com>
To: users at clusterlabs.org 
Date: 13 December 2020 7.50.57 CET
Subject: Re: [ClusterLabs] Antw: [EXT] Recovering from node failure


12.12.2020 20:30, Gabriele Bulfon wrote:
> Thanks, I will experiment this.
>  
> Now, I have a last issue about stonith.
> I tried to reproduce a stonith situation by disabling the network interface used for HA on node 1.
> Stonith is configured with ipmi poweroff.
> What happens is that once the interface is down, both nodes try to stonith each other, causing both to power off...

Yes, this is expected. The options are basically

1. Have a separate stonith resource for each node and configure static
(pcmk_delay_base) or random (pcmk_delay_max) delays to avoid both nodes
starting stonith at the same time. This does not take resources into
account (a configuration sketch for all three options follows below).

2. Use a fencing topology and create a pseudo-stonith agent that does not
attempt to do anything but just delays for some time before the actual
fencing agent runs. The delay can be based on anything, including the
resources running on the node (also sketched below).

3. If you are using pacemaker 2.0.3+, you could use the new
priority-fencing-delay feature, which implements resource-based priority
fencing:

  + controller/fencing/scheduler: add new feature 'priority-fencing-delay'
    Optionally derive the priority of a node from the resource-priorities
    of the resources it is running.
    In a fencing race the node with the highest priority has a certain
    advantage over the others, as fencing requests for that node are
    executed with an additional delay.
    Controlled via the cluster option priority-fencing-delay (default = 0).


See also https://www.mail-archive.com/users@clusterlabs.org/msg10328.html
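
For illustration, here are minimal crm-shell sketches of the three options,
reusing the node and resource names from the config above. The delay and
priority values are arbitrary examples, and the "sleep-only" agent in option 2
(including its "sleep" parameter) is a hypothetical agent you would have to
provide yourself; none of this is a tested recommendation.

# Option 1: per-device delays. pcmk_delay_base adds a fixed delay before any
# fencing request executed through that device; pcmk_delay_max adds a random
# delay up to the given maximum instead. The node that is fenced by the device
# with the longer delay gets a head start and should win the fencing race.
primitive xstha1-stonith stonith:external/ipmi \
        params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN passwd="***" \
                interface=lanplus pcmk_delay_base=10s \
        op monitor interval=25 timeout=25
primitive xstha2-stonith stonith:external/ipmi \
        params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN passwd="***" \
                interface=lanplus \
        op monitor interval=25 timeout=25

# Option 2: fencing topology. A hypothetical sleep-only pseudo-agent is placed
# before the real IPMI device on the same level; it only sleeps and then
# reports success, so the real agent runs afterwards.
primitive xstha1-delay stonith:external/sleep-only params sleep=10
primitive xstha2-delay stonith:external/sleep-only params sleep=0
fencing_topology \
        xstha1: xstha1-delay,xstha1-stonith \
        xstha2: xstha2-delay,xstha2-stonith

# Option 3 (pacemaker 2.0.3+): give the important resource a priority and set
# a cluster-wide priority-fencing-delay; fencing of the node running the
# higher-priority resources is then delayed by that amount.
primitive zpool_data ZFS \
        params pool=test \
        meta priority=100
property cib-bootstrap-options: \
        priority-fencing-delay=15s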

> I would like the node running all the resources (zpool and nfs ip) to be the first to try to stonith the other node.
> Or is there anything better?
>  
> Here is the current crm config show:
>  

It is unreadable

> node 1: xstha1 \
>         attributes standby=off maintenance=off
> node 2: xstha2 \
>         attributes standby=off maintenance=off
> primitive xstha1-stonith stonith:external/ipmi \
>         params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN passwd="******" interface=lanplus \
>         op monitor interval=25 timeout=25 start-delay=25 \
>         meta target-role=Started
> primitive xstha1_san0_IP IPaddr \
>         params ip=10.10.10.1 cidr_netmask=255.255.255.0 nic=san0
> primitive xstha2-stonith stonith:external/ipmi \
>         params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN passwd="******" interface=lanplus \
>         op monitor interval=25 timeout=25 start-delay=25 \
>         meta target-role=Started
> primitive xstha2_san0_IP IPaddr \
>         params ip=10.10.10.2 cidr_netmask=255.255.255.0 nic=san0
> primitive zpool_data ZFS \
>         params pool=test \
>         op start timeout=90 interval=0 \
>         op stop timeout=90 interval=0 \
>         meta target-role=Started
> location xstha1-stonith-pref xstha1-stonith -inf: xstha1
> location xstha1_san0_IP_pref xstha1_san0_IP 100: xstha1
> location xstha2-stonith-pref xstha2-stonith -inf: xstha2
> location xstha2_san0_IP_pref xstha2_san0_IP 100: xstha2
> order zpool_data_order inf: zpool_data ( xstha1_san0_IP )
> location zpool_data_pref zpool_data 100: xstha1
> colocation zpool_data_with_IPs inf: zpool_data xstha1_san0_IP
> property cib-bootstrap-options: \
>         have-watchdog=false \
>         dc-version=1.1.15-e174ec8 \
>         cluster-infrastructure=corosync \
>         stonith-action=poweroff \
>         no-quorum-policy=stop
>  
> Thanks!
> Gabriele
>  
>  
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>  
> 
> 
> 
> 
> ----------------------------------------------------------------------------------
> 
> From: Andrei Borzenkov <arvidjaar at gmail.com>
> To: users at clusterlabs.org 
> Date: 11 December 2020 18.30.29 CET
> Subject: Re: [ClusterLabs] Antw: [EXT] Recovering from node failure
> 
> 
> 11.12.2020 18:37, Gabriele Bulfon wrote:
>> I found I can do this temporarily:
>>  
>> crm config property cib-bootstrap-options: no-quorum-policy=ignore
>>  
> 
> All two-node clusters I remember run with this setting forever :)
> 
>> then once node 2 is up again:
>>  
>> crm config property cib-bootstrap-options: no-quorum-policy=stop
>>  
>> so that I make sure nodes will not mount in another strange situation.
>>  
>> Is there any better way? 
> 
> "better" us subjective, but ...
> 
>> (such as ignore until everything is back to normal, then consider stop again)
>>  
> 
> That is what stonith does. Because quorum is pretty much useless in a
> two-node cluster, as I already said, all clusters I have seen used
> no-quorum-policy=ignore and stonith-enabled=true. It means that when a
> node boots and the other node is not available, stonith is attempted; if
> stonith succeeds, pacemaker continues with starting resources; if stonith
> fails, the node is stuck.
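
As a concrete sketch of the combination described above (quorum ignored,
fencing mandatory in a two-node cluster), the corresponding cluster
properties would be set roughly like this, assuming the crm shell:

crm configure property no-quorum-policy=ignore
crm configure property stonith-enabled=true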

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

