[ClusterLabs] Antw: [EXT] Re: Resources restart when a node joins in

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Fri Aug 28 02:07:11 EDT 2020


>>> Citron Vert <citron_vert at hotmail.com> wrote on 27.08.2020 at 09:46 in
message
<DB7PR02MB53716B42197E31D8D15CA0A288550 at DB7PR02MB5371.eurprd02.prod.outlook.com>

> Hi,
> 
> Sorry for using this email address, my name is Quentin. Thank you for 
> your reply.
> 
> I have already tried the stickiness solution (with the deprecated  
> value). I tried the one you gave me, and it does not change anything.

Hi!

I see all your resources are systemd-based. Personally I think systemd and the
cluster do not play well with each other, as systemd also has its own recovery
mechanisms that may interfere with the cluster. Also, did you make sure to
"disable" all the systemd services that the cluster controls?

There's an inconsistency:
Aug 27 08:46:16 [1330] NODE2    pengine:     info: determine_op_status:      
Operation monitor found resource SERVICE9 active on NODE1

SERVICE9 isn't listed as running before.

This should be checked:
Aug 27 08:46:55 [1330] NODE2    pengine:     info: native_color:      Resource
SERVICE14:1 cannot run anywhere
Aug 27 08:46:55 [1330] NODE2    pengine:     info: native_color:      Resource
SERVICE15:1 cannot run anywhere

This sounds odd, too:
Aug 27 08:46:59 [1326] NODE2        cib:     info: cib_process_replace:      
Replaced 0.434.19 with 0.434.19 from NODE2

Aug 27 08:46:59 [1326] NODE2        cib:     info: cib_process_replace:      
Replaced 0.434.19 with 0.434.19 from NODE2

You should check why this failed:
Aug 27 08:47:01 [1331] NODE2       crmd:  warning: status_from_rc:    Action
20 (SERVICE1_monitor_0) on NODE1 failed (target: 7 vs. rc: 0): Error

Aug 27 08:47:02 [1330] NODE2    pengine:     info: common_print:      SERVICE1
       (systemd:service1):     Started
Aug 27 08:47:02 [1330] NODE2    pengine:     info: common_print:             
1 : NODE1
Aug 27 08:47:02 [1330] NODE2    pengine:     info: common_print:             
2 : NODE2

Aug 27 08:47:02 [1330] NODE2    pengine:    error: native_create_actions:    
Resource SERVICE1 is active on 2 nodes (attempting recovery)

Then:
Aug 27 08:47:02 [1330] NODE2    pengine:   notice: LogAction:  * Move      
SERVICE1                 ( NODE1 -> NODE2 )

Aug 27 08:47:02 [1331] NODE2       crmd:  warning: status_from_rc:    Action
20 (SERVICE9_monitor_0) on NODE1 failed (target: 7 vs. rc: 0): Error

Aug 27 08:47:04 [1331] NODE2       crmd:     info: match_graph_event: Action
SERVICE1_stop_0 (36) confirmed on NODE2 (rc=0)
Aug 27 08:47:22 [1331] NODE2       crmd:  warning: status_from_rc:    Action
35 (SERVICE1_stop_0) on NODE1 failed (target: 0 vs. rc: 198): Error
Aug 27 08:47:22 [1331] NODE2       crmd:     info: match_graph_event: Action
SERVICE1_stop_0 (35) confirmed on NODE1 (rc=198, ignoring failure)

The above sounds very odd to me.

You should also include the corresponding systemd logs.
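Something like this would show them (a sketch; adjust the unit name and the
time window to match the incident):

```shell
# Show what systemd itself did to the unit around the time of the
# restarts; "service1" is a placeholder for your real unit name.
journalctl -u service1 --since "2020-08-27 08:46" --until "2020-08-27 08:48"
```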

The other thing I noticed is that you have colocation, but no ordering. I
guess your VIP should be started before its dependent resources, right?
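For example (hypothetical resource IDs; substitute your own):

```shell
# Colocation only says "run on the same node"; an ordering constraint
# additionally says "start the VIP before the service that needs it".
pcs constraint order start VIP then start SERVICE1
```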

Regards,
Ulrich

> 
> Resources don't seem to move from node to node (I don't see the changes 
> with crm_mon command).
> 
> 
> In the logs I found this line: "error: native_create_actions:
> Resource SERVICE1 is active on 2 nodes"
> 
> Which led me to contact you to understand and learn a little more about 
> this cluster, and why there are resources running on the passive node.
> 
> 
> You will find attached the logs during the reboot of the passive node 
> and my cluster configuration.
> 
> I think I'm missing something in the configuration / logs that I 
> don't understand.
> 
> 
> Thank you in advance for your help,
> 
> Quentin
> 
> 
> On 26/08/2020 at 20:16, Reid Wahl wrote:
>> Hi, Citron.
>>
>> Based on your description, it sounds like some resources **might** be 
>> moving from node 1 to node 2, failing on node 2, and then moving back 
>> to node 1. If that's what's happening (and even if it's not), then 
>> it's probably smart to set some resource stickiness as a resource 
>> default. The below command sets a resource stickiness score of 1.
>>
>>     # pcs resource defaults resource-stickiness=1
>>
>> Also note that the "default-resource-stickiness" cluster property is 
>> deprecated and should not be used.
>>
>> Finally, an explicit default resource stickiness score of 0 can 
>> interfere with the placement of cloned resource instances. If you 
>> don't want any stickiness, then it's better to leave stickiness unset. 
>> That way, primitives will have a stickiness of 0, but clone instances 
>> will have a stickiness of 1.
>>
>> If adding stickiness does not resolve the issue, can you share your 
>> cluster configuration and some logs that show the issue happening? Off 
>> the top of my head I'm not sure why resources would start and stop on 
>> node 2 without moving away from node1, unless they're clone instances 
>> that are starting and then failing a monitor operation on node 2.
>>
>> On Wed, Aug 26, 2020 at 8:42 AM Citron Vert <citron_vert at hotmail.com 
>> <mailto:citron_vert at hotmail.com>> wrote:
>>
>>     Hello,
>>     I am contacting you because I have a problem with my cluster and I
>>     cannot find (nor understand) any information that can help me.
>>
>>     I have a 2 nodes cluster (pacemaker, corosync, pcs) installed on
>>     CentOS 7 with a set of configuration.
>>     Everything seems to works fine, but here is what happens:
>>
>>       * Node1 and Node2 are running well with Node1 as primary
>>       * I reboot Node2 which is passive (no changes on Node1)
>>       * Node2 comes back in the cluster as passive
>>       * corosync logs shows resources getting started then stopped on
>>         Node2
>>       * "crm_mon" command shows some ressources on Node1 getting
>>         restarted
>>
>>     I don't understand how this should work.
>>     If a node comes back and becomes passive (since Node1 is running as
>>     primary), there is no reason for the resources to be started and then
>>     stopped on the new passive node, is there?
>>
>>     One of my resources becomes unstable because it gets started and
>>     then stopped too quickly on Node2, which seems to make it restart on
>>     Node1 without a failover.
>>
>>     I tried several things and solutions proposed by different sites
>>     and forums, but without success.
>>
>>
>>     Is there a way to ensure that a node which joins the cluster as
>>     passive does not start its own resources?
>>
>>
>>     thanks in advance
>>
>>
>>     Here are some information just in case :
>>
>>     $ rpm -qa | grep -E "corosync|pacemaker|pcs"
>>     corosync-2.4.5-4.el7.x86_64
>>     pacemaker-cli-1.1.21-4.el7.x86_64
>>     pacemaker-1.1.21-4.el7.x86_64
>>     pcs-0.9.168-4.el7.centos.x86_64
>>     corosynclib-2.4.5-4.el7.x86_64
>>     pacemaker-libs-1.1.21-4.el7.x86_64
>>     pacemaker-cluster-libs-1.1.21-4.el7.x86_64
>>
>>
>>             <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
>>             <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
>>             <nvpair id="cib-bootstrap-options-dc-deadtime" name="dc-deadtime" value="120s"/>
>>             <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
>>             <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.21-4.el7-f14e36fd43"/>
>>             <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
>>             <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="CLUSTER"/>
>>             <nvpair id="cib-bootstrap-options-last-lrm-refresh" name="last-lrm-refresh" value="1598446314"/>
>>             <nvpair id="cib-bootstrap-options-default-resource-stickiness" name="default-resource-stickiness" value="0"/>
>>
>>
>>
>>
>>     _______________________________________________
>>     Manage your subscription:
>>     https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>>     ClusterLabs home: https://www.clusterlabs.org/ 
>>
>>
>>
>> -- 
>> Regards,
>>
>> Reid Wahl, RHCA
>> Software Maintenance Engineer, Red Hat
>> CEE - Platform Support Delivery - ClusterHA




