[ClusterLabs] Antw: Re: Antw: [EXT] Recovering from node failure
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Dec 14 05:53:22 EST 2020
>>> Gabriele Bulfon <gbulfon at sonicle.com> wrote on 14.12.2020 at 11:48 in
message <1065144646.7212.1607942889206 at www>:
> Thanks!
>
> I tried the first option, by adding pcmk_delay_base to the two stonith
> primitives.
> The first has 1 second, the second has 5 seconds.
> It didn't work :( they still killed each other :(
> Anything wrong with the way I did it?
Hard to say without seeing the logs...
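As a hedged starting point (log locations and exact message wording differ
between distributions and Pacemaker versions), grepping the logs on both nodes
around the time of the test usually shows which node requested fencing and
whether any delay was applied before the action:

    # run on both nodes; adjust the paths to whatever this system actually logs to
    grep -iE 'stonith|fence|delay' /var/log/pacemaker.log /var/log/messages 2>/dev/null | less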
>
> Here's the config:
>
> node 1: xstha1 \
> attributes standby=off maintenance=off
> node 2: xstha2 \
> attributes standby=off maintenance=off
> primitive xstha1-stonith stonith:external/ipmi \
> params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN passwd="***" interface=lanplus pcmk_delay_base=1 \
> op monitor interval=25 timeout=25 start-delay=25 \
> meta target-role=Started
> primitive xstha1_san0_IP IPaddr \
> params ip=10.10.10.1 cidr_netmask=255.255.255.0 nic=san0
> primitive xstha2-stonith stonith:external/ipmi \
> params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN passwd="***" interface=lanplus pcmk_delay_base=5 \
> op monitor interval=25 timeout=25 start-delay=25 \
> meta target-role=Started
> primitive xstha2_san0_IP IPaddr \
> params ip=10.10.10.2 cidr_netmask=255.255.255.0 nic=san0
> primitive zpool_data ZFS \
> params pool=test \
> op start timeout=90 interval=0 \
> op stop timeout=90 interval=0 \
> meta target-role=Started
> location xstha1-stonith-pref xstha1-stonith -inf: xstha1
> location xstha1_san0_IP_pref xstha1_san0_IP 100: xstha1
> location xstha2-stonith-pref xstha2-stonith -inf: xstha2
> location xstha2_san0_IP_pref xstha2_san0_IP 100: xstha2
> order zpool_data_order inf: zpool_data ( xstha1_san0_IP )
> location zpool_data_pref zpool_data 100: xstha1
> colocation zpool_data_with_IPs inf: zpool_data xstha1_san0_IP
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.15-e174ec8 \
> cluster-infrastructure=corosync \
> stonith-action=poweroff \
> no-quorum-policy=stop
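A side note on the delays above, offered as a sketch rather than a diagnosis:
pcmk_delay_base delays fencing of the node that the device targets (its
hostname parameter), so with 1 s on xstha1-stonith and 5 s on xstha2-stonith it
is xstha2 that is favoured in a race. If the intent is for xstha1, which hosts
the resources, to survive, the longer delay would normally go on xstha1-stonith,
with little or none on xstha2-stonith, and the gap is usually made comfortably
larger than a few seconds. The values below are illustrative only:

    primitive xstha1-stonith stonith:external/ipmi \
            params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN \
                   passwd="***" interface=lanplus pcmk_delay_base=15 \
            op monitor interval=25 timeout=25 start-delay=25 \
            meta target-role=Started
    primitive xstha2-stonith stonith:external/ipmi \
            params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN \
                   passwd="***" interface=lanplus \
            op monitor interval=25 timeout=25 start-delay=25 \
            meta target-role=Started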
>
>
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>
> ----------------------------------------------------------------------------------
>
> From: Andrei Borzenkov <arvidjaar at gmail.com>
> To: users at clusterlabs.org
> Date: 13 December 2020 07:50:57 CET
> Subject: Re: [ClusterLabs] Antw: [EXT] Recovering from node failure
>
>
> On 12.12.2020 20:30, Gabriele Bulfon wrote:
>> Thanks, I will experiment this.
>>
>> Now, I have a last issue about stonith.
>> I tried to reproduce a stonith situation by disabling the network
>> interface used for HA on node 1.
>> Stonith is configured with ipmi poweroff.
>> What happens is that once the interface is down, both nodes try to
>> stonith the other node, causing both to power off...
>
> Yes, this is expected. The options are basically
>
> 1. Have a separate stonith resource for each node and configure static
> (pcmk_delay_base) or random dynamic (pcmk_delay_max) delays to avoid
> both nodes starting stonith at the same time. This does not take
> resources into account.
>
> 2. Use fencing topology and create a pseudo-stonith agent that does not
> attempt to do anything but just delays for some time before continuing
> with the actual fencing agent. The delay can be based on anything,
> including the resources running on the node (a rough sketch follows the
> link below).
>
> 3. If you are using pacemaker 2.0.3+, you could use the new
> priority-fencing-delay feature that implements resource-based priority
> fencing:
>
> + controller/fencing/scheduler: add new feature 'priority-fencing-delay'
>   Optionally derive the priority of a node from the resource-priorities
>   of the resources it is running.
>   In a fencing-race the node with the highest priority has a certain
>   advantage over the others, as fencing requests for that node are
>   executed with an additional delay.
>   Controlled via the cluster option priority-fencing-delay (default = 0).
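A minimal sketch of option 3, assuming a cluster that actually runs Pacemaker
2.0.3 or newer (the configuration above reports 1.1.15, which predates this
feature); the property name comes from the changelog entry above, while the
delay and priority values are only illustrative:

    # favour the node currently running the higher-priority resources
    crm configure property priority-fencing-delay=30s
    # node priority is derived from resource priorities, so give resources one,
    # e.g. a small default for everything (a higher meta priority could instead
    # be set on zpool_data alone)
    crm configure rsc_defaults priority=1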
>
>
> See also https://www.mail-archive.com/users@clusterlabs.org/msg10328.html
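And a rough sketch of option 2, with heavy caveats: xstha1-delay below stands
for a hypothetical, site-written pseudo-stonith agent that merely waits (for a
time it might compute from the resources on the node) and then reports success;
no such agent ships with Pacemaker. The idea relies on all devices within one
fencing-topology level having to succeed, in the listed order, before the level
as a whole succeeds:

    # crm shell syntax; devices separated by a comma share one fencing level
    fencing_topology \
            xstha1: xstha1-delay,xstha1-stonith \
            xstha2: xstha2-stonith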
>
>> I would like the node running all resources (zpool and nfs ip) to be the
>> first to try to stonith the other node.
>> Or is there anything else better?
>>
>> Here is the current crm config show:
>>
>
> It is unreadable
>
>> node 1: xstha1 \
>> attributes standby=off maintenance=off
>> node 2: xstha2 \
>> attributes standby=off maintenance=off
>> primitive xstha1-stonith stonith:external/ipmi \
>> params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN passwd="******" interface=lanplus \
>> op monitor interval=25 timeout=25 start-delay=25 \
>> meta target-role=Started
>> primitive xstha1_san0_IP IPaddr \
>> params ip=10.10.10.1 cidr_netmask=255.255.255.0 nic=san0
>> primitive xstha2-stonith stonith:external/ipmi \
>> params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN passwd="******" interface=lanplus \
>> op monitor interval=25 timeout=25 start-delay=25 \
>> meta target-role=Started
>> primitive xstha2_san0_IP IPaddr \
>> params ip=10.10.10.2 cidr_netmask=255.255.255.0 nic=san0
>> primitive zpool_data ZFS \
>> params pool=test \
>> op start timeout=90 interval=0 \
>> op stop timeout=90 interval=0 \
>> meta target-role=Started
>> location xstha1-stonith-pref xstha1-stonith -inf: xstha1
>> location xstha1_san0_IP_pref xstha1_san0_IP 100: xstha1
>> location xstha2-stonith-pref xstha2-stonith -inf: xstha2
>> location xstha2_san0_IP_pref xstha2_san0_IP 100: xstha2
>> order zpool_data_order inf: zpool_data ( xstha1_san0_IP )
>> location zpool_data_pref zpool_data 100: xstha1
>> colocation zpool_data_with_IPs inf: zpool_data xstha1_san0_IP
>> property cib-bootstrap-options: \
>> have-watchdog=false \
>> dc-version=1.1.15-e174ec8 \
>> cluster-infrastructure=corosync \
>> stonith-action=poweroff \
>> no-quorum-policy=stop
>>
>> Thanks!
>> Gabriele
>>
>>
>> Sonicle S.r.l. : http://www.sonicle.com
>> Music: http://www.gabrielebulfon.com
>> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>>
>> ----------------------------------------------------------------------------------
>>
>> From: Andrei Borzenkov <arvidjaar at gmail.com>
>> To: users at clusterlabs.org
>> Date: 11 December 2020 18:30:29 CET
>> Subject: Re: [ClusterLabs] Antw: [EXT] Recovering from node failure
>>
>>
>> On 11.12.2020 18:37, Gabriele Bulfon wrote:
>>> I found I can do this temporarily:
>>>
>>> crm config property cib-bootstrap-options: no-quorum-policy=ignore
>>>
>>
>> All two-node clusters I remember run with this setting forever :)
>>
>>> then once node 2 is up again:
>>>
>>> crm config property cib-bootstrap-options: no-quorum-policy=stop
>>>
>>> so that I make sure nodes will not mount in another strange situation.
>>>
>>> Is there any better way?
>>
>> "better" us subjective, but ...
>>
>>> (such as ignore until everything is back to normal, then consider stop
>>> again)
>>>
>>
>> That is what stonith does. Because quorum is pretty much useless in a
>> two-node cluster, as I already said, all clusters I have seen used
>> no-quorum-policy=ignore and stonith-enabled=true. It means that when a node
>> boots and the other node is not available, stonith is attempted; if stonith
>> succeeds, pacemaker continues with starting resources; if stonith fails,
>> the node is stuck.
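A minimal sketch of the combination described above, as crm shell commands
(whether to keep no-quorum-policy=ignore permanently is a site decision, but it
is what is being recommended here for a two-node cluster):

    # two-node cluster: do not block on quorum, rely on working stonith instead
    crm configure property stonith-enabled=true
    crm configure property no-quorum-policy=ignore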
>>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
More information about the Users mailing list