[ClusterLabs] Antw: Re: Antw: [EXT] Recovering from node failure

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Dec 14 05:53:22 EST 2020


>>> Gabriele Bulfon <gbulfon at sonicle.com> wrote on 14.12.2020 at 11:48 in
>>> message <1065144646.7212.1607942889206 at www>:
> Thanks!
> 
> I tried the first option, by adding pcmk_delay_base to the two stonith
> primitives.
> The first has 1 second, the second has 5 seconds.
> It didn't work :( they still killed each other :(
> Anything wrong with the way I did it?

Hard to say without seeing the logs...
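
For what it's worth, a crm_report archive covering the time of the race is
usually the easiest thing to attach; the timestamp and output name here are
only placeholders:

    crm_report -f "2020-12-14 11:00:00" /tmp/stonith-race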

>  
> Here's the config:
>  
> node 1: xstha1 \
>         attributes standby=off maintenance=off
> node 2: xstha2 \
>         attributes standby=off maintenance=off
> primitive xstha1-stonith stonith:external/ipmi \
>         params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN \
>                passwd="***" interface=lanplus pcmk_delay_base=1 \
>         op monitor interval=25 timeout=25 start-delay=25 \
>         meta target-role=Started
> primitive xstha1_san0_IP IPaddr \
>         params ip=10.10.10.1 cidr_netmask=255.255.255.0 nic=san0
> primitive xstha2-stonith stonith:external/ipmi \
>         params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN \
>                passwd="***" interface=lanplus pcmk_delay_base=5 \
>         op monitor interval=25 timeout=25 start-delay=25 \
>         meta target-role=Started
> primitive xstha2_san0_IP IPaddr \
>         params ip=10.10.10.2 cidr_netmask=255.255.255.0 nic=san0
> primitive zpool_data ZFS \
>         params pool=test \
>         op start timeout=90 interval=0 \
>         op stop timeout=90 interval=0 \
>         meta target-role=Started
> location xstha1-stonith-pref xstha1-stonith -inf: xstha1
> location xstha1_san0_IP_pref xstha1_san0_IP 100: xstha1
> location xstha2-stonith-pref xstha2-stonith -inf: xstha2
> location xstha2_san0_IP_pref xstha2_san0_IP 100: xstha2
> order zpool_data_order inf: zpool_data ( xstha1_san0_IP )
> location zpool_data_pref zpool_data 100: xstha1
> colocation zpool_data_with_IPs inf: zpool_data xstha1_san0_IP
> property cib-bootstrap-options: \
>         have-watchdog=false \
>         dc-version=1.1.15-e174ec8 \
>         cluster-infrastructure=corosync \
>         stonith-action=poweroff \
>         no-quorum-policy=stop
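
As a side note, and only as a sketch (the values are made up, not a diagnosis,
and this assumes the installed pacemaker actually honours these parameters):
the delays can also be adjusted on the running cluster, either by widening the
gap between the two static values or by switching to the random delay Andrei
mentions further down:

    crm resource param xstha2-stonith set pcmk_delay_base 15

    # or, instead of static delays, a random delay on both fencing resources:
    crm resource param xstha1-stonith set pcmk_delay_max 10
    crm resource param xstha2-stonith set pcmk_delay_max 10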
>  
>  
> Sonicle S.r.l. : http://www.sonicle.com 
> Music: http://www.gabrielebulfon.com 
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
>  
> 
> 
> 
> 
>
> ------------------------------------------------------------------------------
> 
> From: Andrei Borzenkov <arvidjaar at gmail.com>
> To: users at clusterlabs.org 
> Date: 13 December 2020 7.50.57 CET
> Subject: Re: [ClusterLabs] Antw: [EXT] Recovering from node failure
> 
> 
> 12.12.2020 20:30, Gabriele Bulfon wrote:
>> Thanks, I will experiment with this.
>>  
>> Now, I have one last issue about stonith.
>> I tried to reproduce a stonith situation by disabling the network interface
>> used for HA on node 1.
>> Stonith is configured with IPMI poweroff.
>> What happens is that once the interface is down, both nodes try to
>> stonith the other node, causing both to power off...
> 
> Yes, this is expected. The options are basically
> 
> 1. Have a separate stonith resource for each node and configure static
> (pcmk_delay_base) or random dynamic (pcmk_delay_max) delays to avoid
> both nodes starting stonith at the same time. This does not take
> resources into account.
> 
> 2. Use fencing topology and create a pseudo-stonith agent that does not
> attempt to do anything but just delays for some time before continuing
> with the actual fencing agent. The delay can be based on anything,
> including the resources running on the node.
> 
> 3. If you are using pacemaker 2.0.3+, you could use the new
> priority-fencing-delay feature that implements resource-based priority
> fencing:
> 
> + controller/fencing/scheduler: add new feature 'priority-fencing-delay'
>   Optionally derive the priority of a node from the resource-priorities
>   of the resources it is running.
>   In a fencing-race the node with the highest priority has a certain
>   advantage over the others as fencing requests for that node are
>   executed with an additional delay.
>   controlled via cluster option priority-fencing-delay (default = 0)
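
A rough illustration of option 2 in crm shell syntax. The external/delay-helper
agent and its parameters are hypothetical (such a delay-only agent has to be
written or obtained separately); only the fencing_topology wiring and the rule
that every device in a level must succeed before the node counts as fenced are
standard:

    # hypothetical helper that only sleeps (or refuses) before reporting success
    primitive xstha2-delay stonith:external/delay-helper \
            params hostname=xstha2 sleep=20
    # both devices in one level: the helper runs first, then the real IPMI device
    # (a mirror entry would be added for xstha1)
    fencing_topology \
            xstha2: xstha2-delay,xstha2-stonith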
> 
> 
> See also https://www.mail-archive.com/users@clusterlabs.org/msg10328.html 
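
And a minimal sketch of option 3, which applies only to pacemaker 2.0.3 or
later (not to the 1.1.15 shown in the configuration above); the priority and
delay values are just examples:

    # give every resource a small priority and the important one a larger one,
    # so the node running zpool_data gets the head start in a fencing race
    crm configure rsc_defaults priority=1
    crm resource meta zpool_data set priority 10
    crm configure property priority-fencing-delay=15s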
> 
>> I would like the node running all the resources (zpool and NFS IP) to be
>> the first to try to stonith the other node.
>> Or is there anything else better?
>>  
>> Here is the current crm config show:
>>  
> 
> It is unreadable
> 
>> node 1: xstha1 \
>>         attributes standby=off maintenance=off
>> node 2: xstha2 \
>>         attributes standby=off maintenance=off
>> primitive xstha1-stonith stonith:external/ipmi \
>>         params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN \
>>                passwd="******" interface=lanplus \
>>         op monitor interval=25 timeout=25 start-delay=25 \
>>         meta target-role=Started
>> primitive xstha1_san0_IP IPaddr \
>>         params ip=10.10.10.1 cidr_netmask=255.255.255.0 nic=san0
>> primitive xstha2-stonith stonith:external/ipmi \
>>         params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN \
>>                passwd="******" interface=lanplus \
>>         op monitor interval=25 timeout=25 start-delay=25 \
>>         meta target-role=Started
>> primitive xstha2_san0_IP IPaddr \
>>         params ip=10.10.10.2 cidr_netmask=255.255.255.0 nic=san0
>> primitive zpool_data ZFS \
>>         params pool=test \
>>         op start timeout=90 interval=0 \
>>         op stop timeout=90 interval=0 \
>>         meta target-role=Started
>> location xstha1-stonith-pref xstha1-stonith -inf: xstha1
>> location xstha1_san0_IP_pref xstha1_san0_IP 100: xstha1
>> location xstha2-stonith-pref xstha2-stonith -inf: xstha2
>> location xstha2_san0_IP_pref xstha2_san0_IP 100: xstha2
>> order zpool_data_order inf: zpool_data ( xstha1_san0_IP )
>> location zpool_data_pref zpool_data 100: xstha1
>> colocation zpool_data_with_IPs inf: zpool_data xstha1_san0_IP
>> property cib-bootstrap-options: \
>>         have-watchdog=false \
>>         dc-version=1.1.15-e174ec8 \
>>         cluster-infrastructure=corosync \
>>         stonith-action=poweroff \
>>         no-quorum-policy=stop
>>  
>> Thanks!
>> Gabriele
>>  
>>  
>> Sonicle S.r.l. : http://www.sonicle.com 
>> Music: http://www.gabrielebulfon.com 
>> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
>>  
>> 
>> 
>> 
>> 
>> 
>
>> ------------------------------------------------------------------------------
>> 
>> From: Andrei Borzenkov <arvidjaar at gmail.com>
>> To: users at clusterlabs.org 
>> Date: 11 December 2020 18.30.29 CET
>> Subject: Re: [ClusterLabs] Antw: [EXT] Recovering from node failure
>> 
>> 
>> 11.12.2020 18:37, Gabriele Bulfon wrote:
>>> I found I can do this temporarily:
>>>  
>>> crm config property cib-bootstrap-options: no-quorum-policy=ignore
>>>  
>> 
>> All two-node clusters I remember run with this setting forever :)
>> 
>>> then once node 2 is up again:
>>>  
>>> crm config property cib-bootstrap-options: no-quorum-policy=stop
>>>  
>>> so that I make sure the nodes will not mount in another strange situation.
>>>  
>>> Is there any better way? 
>> 
>> "better" is subjective, but ...
>> 
>>> (such as ignore until everything is back to normal, then consider stop
>>> again)
>>>  
>> 
>> That is what stonith does. Because quorum is pretty much useless in a
>> two-node cluster, as I already said, all clusters I have seen used
>> no-quorum-policy=ignore and stonith-enabled=true. It means that when a node
>> boots and the other node is not available, stonith is attempted; if stonith
>> succeeds, pacemaker continues with starting resources; if stonith fails,
>> the node is stuck.
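
In crm shell terms, the combination described above boils down to two cluster
properties (shown here only as a sketch):

    crm configure property no-quorum-policy=ignore
    crm configure property stonith-enabled=true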
>> 