[ClusterLabs] Antw: [EXT] delaying start of a resource

Wed Dec 16 11:05:46 EST 2020

Looking at the two logs, it looks like corosync decided that xst1 was offline while xst2 was still online.
I just issued an "ifconfig ha0 down" on xst1, so I would expect that neither node can see the other, yet I see these same lines in both the xst1 and xst2 logs:

Dec 16 15:08:56 [667]    pengine:  warning: pe_fence_node:      Cluster node xstha1 will be fenced: peer is no longer part of the cluster
Dec 16 15:08:56 [667]    pengine:  warning: determine_online_status:    Node xstha1 is unclean
Dec 16 15:08:56 [667]    pengine:     info: determine_online_status_fencing:    Node xstha2 is active
Dec 16 15:08:56 [667]    pengine:     info: determine_online_status:    Node xstha2 is online

Why xst2 and not xst1?
I would expect no action at all in this case until stonith is done...
Instead it goes on with:

Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action xstha1_san0_IP_stop_0 on xstha1 is unrunnable (offline)
Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action zpool_data_stop_0 on xstha1 is unrunnable (offline)
Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action xstha2-stonith_stop_0 on xstha1 is unrunnable (offline)
Dec 16 15:08:56 [667]    pengine:  warning: custom_action:      Action xstha2-stonith_stop_0 on xstha1 is unrunnable (offline)

trying to stop everything on xst1 (but it's not runnable).
Then:

Dec 16 15:08:56 [667]    pengine:   notice: LogAction:   * Move       xstha1_san0_IP     ( xstha1 -> xstha2 )
Dec 16 15:08:56 [667]    pengine:     info: LogActions: Leave   xstha2_san0_IP  (Started xstha2)
Dec 16 15:08:56 [667]    pengine:   notice: LogAction:   * Move       zpool_data         ( xstha1 -> xstha2 )
Dec 16 15:08:56 [667]    pengine:     info: LogActions: Leave   xstha1-stonith  (Started xstha2)
Dec 16 15:08:56 [667]    pengine:   notice: LogAction:   * Stop       xstha2-stonith     (           xstha1 )   due to node availability

as if xst2 has been elected to be the running node, not knowing that xst1 will kill xst2 within a few seconds.

What is wrong here?
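
To compare what each node's scheduler decided, the pengine lines quoted above can be condensed into "who gets fenced" and "what gets moved". The following is only a sketch: the regular expressions are written against the exact line shapes pasted in this message, the function name `summarize` is made up for illustration, and other Pacemaker versions may format these lines differently.

```python
import re

# Patterns based only on the log lines quoted in this message (an assumption,
# not a general Pacemaker log grammar).
FENCE_RE = re.compile(r"pe_fence_node:\s+Cluster node (\S+) will be fenced")
MOVE_RE = re.compile(r"LogActions?:\s+\* Move\s+(\S+)\s+\(\s*(\S+) -> (\S+)\s*\)")

SAMPLE = [
    "Dec 16 15:08:56 [667]    pengine:  warning: pe_fence_node:      "
    "Cluster node xstha1 will be fenced: peer is no longer part of the cluster",
    "Dec 16 15:08:56 [667]    pengine:   notice: LogAction:   * Move       "
    "xstha1_san0_IP     ( xstha1 -> xstha2 )",
    "Dec 16 15:08:56 [667]    pengine:   notice: LogAction:   * Move       "
    "zpool_data         ( xstha1 -> xstha2 )",
]


def summarize(lines):
    """Return (nodes to be fenced, [(resource, from_node, to_node), ...])."""
    fenced, moves = [], []
    for line in lines:
        match = FENCE_RE.search(line)
        if match:
            fenced.append(match.group(1))
            continue
        match = MOVE_RE.search(line)
        if match:
            moves.append(match.groups())
    return fenced, moves


if __name__ == "__main__":
    fenced, moves = summarize(SAMPLE)
    print("fence targets:", fenced)
    for resource, source, target in moves:
        print(f"move {resource}: {source} -> {target}")
```

Running the same summary over each node's own log would show whether both schedulers really name the same fence target, or whether each node names the other.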

Thanks!
Gabriele

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets

From: Gabriele Bulfon <gbulfon at sonicle.com>
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Date: 16 December 2020 15:56:28 CET
Subject: Re: [ClusterLabs] Antw: [EXT] delaying start of a resource

Thanks, here are the logs; they contain information about how it tried to start resources on the nodes.
Keep in mind that node 1 was already running the resources, and I simulated a problem by taking down the ha interface.

Gabriele


----------------------------------------------------------------------------------

From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
To: users at clusterlabs.org
Date: 16 December 2020 15:45:36 CET
Subject: [ClusterLabs] Antw: [EXT] delaying start of a resource

>>> Gabriele Bulfon <gbulfon at sonicle.com> wrote on 16.12.2020 at 15:32 in
message <1523391015.734.1608129155836 at www>:
> Hi, I now have a two-node cluster using stonith with different 
> pcmk_delay_base, so that node 1 has priority to stonith node 2 in case of 
> problems.
> 
> However, there is still one problem: while node 2 delays its stonith action 
> for 10 seconds, and node 1 for just 1, node 2 does not delay the start of 
> resources, so it happens that while it has not yet been powered off by node 1 
> (and is still waiting out its delay to power off node 1) it actually starts 
> resources, causing a window of a few seconds where both the NFS IP and the 
> ZFS pool (!!!!!) are mounted by both!
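
The timing described in the quoted text can be put into numbers with a small sketch. It only models the arithmetic of the two delays (1 s and 10 s, taken from the message) and says nothing about how Pacemaker schedules resources; the function name `fencing_race` and the `fence_duration` parameter are assumptions made for illustration.

```python
def fencing_race(delays, fence_duration=0.0):
    """Return (winner, loser, seconds the loser stays powered on).

    delays maps each node to the delay, in seconds, before it fires its
    fencing request. Assumes exactly two nodes that both notice the
    partition at t = 0.
    """
    # The node with the smaller delay fires first and wins the race.
    winner = min(delays, key=delays.get)
    loser = max(delays, key=delays.get)
    # The loser keeps running until the winner's delay has elapsed and the
    # power-off itself (hypothetical duration) has completed.
    exposure = delays[winner] + fence_duration
    return winner, loser, exposure


if __name__ == "__main__":
    delays = {"node1": 1, "node2": 10}  # values from the message, in seconds
    winner, loser, exposure = fencing_race(delays, fence_duration=2.0)
    print(f"{winner} fences {loser}; {loser} stays up for about {exposure} s")
```

Under these assumptions the exposure window is the winner's delay plus however long the power-off takes, which is consistent with the "few seconds" reported above.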

AFAIK Pacemaker will not start resources on a node that is scheduled for stonith. Even more: Pacemaker will try to stop resources on a node scheduled for stonith in order to start them elsewhere.

> How can I delay node 2 resource start until the delayed stonith action is 
> done? Or how can I just delay the resource start so I can make it larger than 
> its pcmk_delay_base?

We probably need to see logs and configs to understand.

> 
> Also, it was suggested that I set "stonith-enabled=true", but I don't know 
> where to set this flag (cib-bootstrap-options is not happy with it...).

I think it's on by default, so you must have set it to false.
In crm shell it is "configure# property stonith-enabled=...".

Regards,
Ulrich

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


<<stonith1.txt>>
<<stonith2.txt>>