[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Recovering from node failure

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Dec 14 07:22:44 EST 2020


>>> Gabriele Bulfon <gbulfon at sonicle.com> wrote on 14.12.2020 at 12:40 in
message <16685368.7249.1607946038308 at www>:
> I isolated the log from when everything happens (when I disable the HA
> interface), attached here.

What looks odd to me is "A new membership (127.0.0.1:352) was formed"
using the localhost address.
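
If corosync has only that one ring, losing the interface can leave it
bound to loopback like this. A second, independent ring would avoid that;
a minimal corosync.conf sketch (the two networks are assumptions taken
from the config below, adjust to the real topology):

    totem {
        version: 2
        rrp_mode: passive        # run both rings, fail over between them
        interface {
            ringnumber: 0
            bindnetaddr: 10.10.10.0      # the san0 HA network (assumed)
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.221.0   # e.g. the IPMI network (assumed)
        }
    }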

Dec 14 12:34:42 [677] stonith-ng:     info: call_remote_stonith: Requesting that 'xstha1' perform op 'xstha2 poweroff' for crmd.681 (72s, 0s)
Dec 14 12:34:44 [677] stonith-ng:   notice: log_operation: Operation 'poweroff' [2235] (call 2 from crmd.681) for host 'xstha2' with device 'xstha2-stonith' returned: 0 (OK)
xstha2 should be off now...
Dec 14 12:34:44 [681]       crmd:     info: cib_fencing_updated: Fencing update 43 for xstha2: complete

This looks odd:
Dec 14 12:34:44 [681]       crmd:  warning: match_down_event: No reason to expect node 2 to be down

I could not see fencing of xstha1 from xstha2.
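
To check what fencing was actually attempted from each side, the fencing
history kept by pacemaker can help (standard stonith_admin tool; run it
on whichever node survives):

    stonith_admin --history xstha1    # fencing actions targeting xstha1
    stonith_admin --history xstha2    # fencing actions targeting xstha2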


>  
> Gabriele
>  
>  
> Sonicle S.r.l. : http://www.sonicle.com 
> Music: http://www.gabrielebulfon.com 
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets 
> 
> ----------------------------------------------------------------------------------
> 
> From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
> To: users at clusterlabs.org 
> Date: 14 December 2020 11:53:22 CET
> Subject: [ClusterLabs] Antw: Re: Antw: [EXT] Recovering from node failure
> 
> 
>>>> Gabriele Bulfon <gbulfon at sonicle.com> wrote on 14.12.2020 at 11:48 in
> message <1065144646.7212.1607942889206 at www>:
>> Thanks!
>> 
>> I tried the first option, adding pcmk_delay_base to the two stonith
>> primitives: the first has 1 second, the second 5 seconds.
>> It didn't work :( they still killed each other :(
>> Anything wrong with the way I did it?
> 
> Hard to say without seeing the logs...
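
One hedged guess: pcmk_delay_base is a fairly recent fencing parameter,
and the cluster below reports dc-version=1.1.15; if that Pacemaker
predates the parameter, it would simply be ignored. The older
alternative is a random delay via pcmk_delay_max, e.g. (sketch, value
illustrative):

    primitive xstha2-stonith stonith:external/ipmi \
            params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN \
                  passwd="***" interface=lanplus pcmk_delay_max=10 \
            op monitor interval=25 timeout=25 start-delay=25

Even where pcmk_delay_base is supported, a 4-second gap may be too
small: the un-delayed fence has to complete (including the IPMI
poweroff itself) before the other node's delay expires, so a wider gap
such as 0 and 10 seconds is safer.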
> 
>> 
>> Here's the config:
>> 
>> node 1: xstha1 \
>>         attributes standby=off maintenance=off
>> node 2: xstha2 \
>>         attributes standby=off maintenance=off
>> primitive xstha1-stonith stonith:external/ipmi \
>>         params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN \
>>               passwd="***" interface=lanplus pcmk_delay_base=1 \
>>         op monitor interval=25 timeout=25 start-delay=25 \
>>         meta target-role=Started
>> primitive xstha1_san0_IP IPaddr \
>>         params ip=10.10.10.1 cidr_netmask=255.255.255.0 nic=san0
>> primitive xstha2-stonith stonith:external/ipmi \
>>         params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN \
>>               passwd="***" interface=lanplus pcmk_delay_base=5 \
>>         op monitor interval=25 timeout=25 start-delay=25 \
>>         meta target-role=Started
>> primitive xstha2_san0_IP IPaddr \
>>         params ip=10.10.10.2 cidr_netmask=255.255.255.0 nic=san0
>> primitive zpool_data ZFS \
>>         params pool=test \
>>         op start timeout=90 interval=0 \
>>         op stop timeout=90 interval=0 \
>>         meta target-role=Started
>> location xstha1-stonith-pref xstha1-stonith -inf: xstha1
>> location xstha1_san0_IP_pref xstha1_san0_IP 100: xstha1
>> location xstha2-stonith-pref xstha2-stonith -inf: xstha2
>> location xstha2_san0_IP_pref xstha2_san0_IP 100: xstha2
>> order zpool_data_order inf: zpool_data ( xstha1_san0_IP )
>> location zpool_data_pref zpool_data 100: xstha1
>> colocation zpool_data_with_IPs inf: zpool_data xstha1_san0_IP
>> property cib-bootstrap-options: \
>>         have-watchdog=false \
>>         dc-version=1.1.15-e174ec8 \
>>         cluster-infrastructure=corosync \
>>         stonith-action=poweroff \
>>         no-quorum-policy=stop
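
After changing stonith parameters it is worth confirming that the CIB
actually accepted them (standard tools):

    crm_verify --live-check                  # validate the running CIB
    crm configure show | grep pcmk_delay     # check the delays really landed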
>> 
>> 
>> ----------------------------------------------------------------------------------
>> 
>> From: Andrei Borzenkov <arvidjaar at gmail.com>
>> To: users at clusterlabs.org 
>> Date: 13 December 2020 7:50:57 CET
>> Subject: Re: [ClusterLabs] Antw: [EXT] Recovering from node failure
>> 
>> 
>> On 12.12.2020 20:30, Gabriele Bulfon wrote:
>>> Thanks, I will experiment with this.
>>> 
>>> Now, I have a last issue about stonith.
>>> I tried to reproduce a stonith situation by disabling the network
>>> interface used for HA on node 1.
>>> Stonith is configured with ipmi poweroff.
>>> What happens is that once the interface is down, both nodes try to
>>> stonith the other node, causing both to power off...
>> 
>> Yes, this is expected. The options are basically:
>> 
>> 1. Have a separate stonith resource for each node and configure static
>> (pcmk_delay_base) or random dynamic (pcmk_delay_max) delays to avoid
>> both nodes starting stonith at the same time. This does not take
>> resources into account.
>> 
>> 2. Use fencing topology and create a pseudo-stonith agent that does not
>> attempt to do anything but just delays for some time before continuing
>> with the actual fencing agent. The delay can be based on anything,
>> including the resources running on the node.
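
A minimal sketch of such a delay-only agent, written against the
external/* stonith plugin convention (sub-command in $1, parameters via
environment variables); the name, path and "delay" parameter are
hypothetical, not a stock agent:

    #!/bin/sh
    # /usr/lib/stonith/plugins/external/delay (hypothetical)
    # getinfo-* sub-commands omitted for brevity
    case "$1" in
        gethosts)        echo $hostname_list ;;
        on|off|reset)    sleep ${delay:-10}; exit 0 ;;  # just wait, then "succeed"
        status)          exit 0 ;;
        getconfignames)  echo "hostname_list delay" ;;
        *)               exit 1 ;;
    esac

With delay-xstha1/delay-xstha2 configured as primitives of that agent
(different delay values per node), both devices go into the *same*
fencing level, so the real IPMI agent only fires after the sleep
succeeds:

    crm configure fencing_topology \
            xstha1: delay-xstha1,xstha1-stonith \
            xstha2: delay-xstha2,xstha2-stonith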
>> 
>> 3. If you are using pacemaker 2.0.3+, you could use the new
>> priority-fencing-delay feature that implements resource-based priority
>> fencing:
>> 
>> + controller/fencing/scheduler: add new feature 'priority-fencing-delay'
>>   Optionally derive the priority of a node from the resource-priorities
>>   of the resources it is running.
>>   In a fencing-race the node with the highest priority has a certain
>>   advantage over the others, as fencing requests for that node are
>>   executed with an additional delay.
>>   Controlled via the cluster option priority-fencing-delay (default = 0).
>> 
>> 
>> See also https://www.mail-archive.com/users@clusterlabs.org/msg10328.html 
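
For completeness, option 3 boils down to two settings (a sketch;
requires 2.0.3+, values illustrative):

    crm configure rsc_defaults priority=1            # give resources a priority
    crm configure property priority-fencing-delay=15s

Fencing requests targeting the node that runs the higher-priority
resources are then delayed, so that node wins the race.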
>> 
>>> I would like the node running all resources (zpool and NFS IP) to be
>>> the first to try to stonith the other node.
>>> Or is there anything else better?
>>> 
>>> Here is the current crm config show:
>>> 
>> 
>> It is unreadable.
>> 
>>> node 1: xstha1 \
>>>         attributes standby=off maintenance=off
>>> node 2: xstha2 \
>>>         attributes standby=off maintenance=off
>>> primitive xstha1-stonith stonith:external/ipmi \
>>>         params hostname=xstha1 ipaddr=192.168.221.18 userid=ADMIN \
>>>               passwd="******" interface=lanplus \
>>>         op monitor interval=25 timeout=25 start-delay=25 \
>>>         meta target-role=Started
>>> primitive xstha1_san0_IP IPaddr \
>>>         params ip=10.10.10.1 cidr_netmask=255.255.255.0 nic=san0
>>> primitive xstha2-stonith stonith:external/ipmi \
>>>         params hostname=xstha2 ipaddr=192.168.221.19 userid=ADMIN \
>>>               passwd="******" interface=lanplus \
>>>         op monitor interval=25 timeout=25 start-delay=25 \
>>>         meta target-role=Started
>>> primitive xstha2_san0_IP IPaddr \
>>>         params ip=10.10.10.2 cidr_netmask=255.255.255.0 nic=san0
>>> primitive zpool_data ZFS \
>>>         params pool=test \
>>>         op start timeout=90 interval=0 \
>>>         op stop timeout=90 interval=0 \
>>>         meta target-role=Started
>>> location xstha1-stonith-pref xstha1-stonith -inf: xstha1
>>> location xstha1_san0_IP_pref xstha1_san0_IP 100: xstha1
>>> location xstha2-stonith-pref xstha2-stonith -inf: xstha2
>>> location xstha2_san0_IP_pref xstha2_san0_IP 100: xstha2
>>> order zpool_data_order inf: zpool_data ( xstha1_san0_IP )
>>> location zpool_data_pref zpool_data 100: xstha1
>>> colocation zpool_data_with_IPs inf: zpool_data xstha1_san0_IP
>>> property cib-bootstrap-options: \
>>>         have-watchdog=false \
>>>         dc-version=1.1.15-e174ec8 \
>>>         cluster-infrastructure=corosync \
>>>         stonith-action=poweroff \
>>>         no-quorum-policy=stop
>>> 
>>> Thanks!
>>> Gabriele
>>> 
>>> ----------------------------------------------------------------------------------
>>> 
>>> From: Andrei Borzenkov <arvidjaar at gmail.com>
>>> To: users at clusterlabs.org 
>>> Date: 11 December 2020 18:30:29 CET
>>> Subject: Re: [ClusterLabs] Antw: [EXT] Recovering from node failure
>>> 
>>> 
>>> On 11.12.2020 18:37, Gabriele Bulfon wrote:
>>>> I found I can do this temporarily:
>>>> 
>>>> crm config property cib-bootstrap-options: no-quorum-policy=ignore
>>>> 
>>> 
>>> All two-node clusters I remember run with this setting forever :)
>>> 
>>>> then once node 2 is up again:
>>>> 
>>>> crm config property cib-bootstrap-options: no-quorum-policy=stop
>>>> 
>>>> so that I make sure nodes will not mount in another strange situation.
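
For reference, with a stock crmsh those two steps are usually spelled
like this (a sketch, assuming standard crmsh syntax):

    crm configure property no-quorum-policy=ignore   # while the peer is down
    crm configure property no-quorum-policy=stop     # once both nodes are back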
>>>> 
>>>> Is there any better way? 
>>> 
>>> "better" us subjective, but ...
>>> 
>>>> (such as ignore until everything is back to normal, then consider stop
>>>> again)
>>>> 
>>> 
>>> That is what stonith does. Because quorum is pretty much useless in a
>>> two-node cluster, as I already said, all clusters I have seen use
>>> no-quorum-policy=ignore and stonith-enabled=true. It means that when a
>>> node boots and the other node is not available, stonith is attempted;
>>> if stonith succeeds, pacemaker continues with starting resources; if
>>> stonith fails, the node is stuck.
>>> 