[ClusterLabs] Resources restart when a node joins in

Reid Wahl nwahl at redhat.com
Fri Aug 28 04:27:12 EDT 2020


No problem! That's what we're here for. I'm glad it's sorted out :)

On Fri, Aug 28, 2020 at 12:27 AM Citron Vert <citron_vert at hotmail.com>
wrote:

> Hi,
>
> You are right, the problems seem to come from some services that are
> started automatically at boot.
>
> My installation script disables automatic startup for all the services we
> use, which is why I didn't focus on this possibility.
>
> But after a quick investigation, we found that a colleague had had the good
> idea of making a "security" script that monitors and starts certain services.
>
>
> Sorry to have contacted you for this little mistake,
>
> Thank you for the help; it was effective.
>
> Quentin
>
>
>
> Le 27/08/2020 à 09:56, Reid Wahl a écrit :
>
> Hi, Quentin. Thanks for the logs!
>
> I see you highlighted the fact that SERVICE1 was in "Stopping" state on
> both node 1 and node 2 when node 1 was rejoining the cluster. I also noted
> the following later in the logs, as well as some similar messages earlier:
>
> Aug 27 08:47:02 [1330] NODE2    pengine:     info: determine_op_status:       Operation monitor found resource SERVICE1 active on NODE1
> Aug 27 08:47:02 [1330] NODE2    pengine:     info: determine_op_status:       Operation monitor found resource SERVICE1 active on NODE1
> Aug 27 08:47:02 [1330] NODE2    pengine:     info: determine_op_status:       Operation monitor found resource SERVICE4 active on NODE2
> Aug 27 08:47:02 [1330] NODE2    pengine:     info: determine_op_status:       Operation monitor found resource SERVICE1 active on NODE2
> ...
> Aug 27 08:47:02 [1330] NODE2    pengine:     info: common_print:              1 : NODE1
> Aug 27 08:47:02 [1330] NODE2    pengine:     info: common_print:              2 : NODE2
> ...
> Aug 27 08:47:02 [1330] NODE2    pengine:    error: native_create_actions:     Resource SERVICE1 is active on 2 nodes (attempting recovery)
> Aug 27 08:47:02 [1330] NODE2    pengine:   notice: native_create_actions:     See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
>
>
> Can you make sure that all the cluster-managed systemd services are disabled from starting at boot (i.e., `systemctl is-enabled service1`, and the same for all the others) on both nodes? If they are enabled, disable them.
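>
> For example, a quick check and cleanup on each node might look something like this (the unit names service1/service2/service4 are just placeholders for whatever units your cluster resources actually manage):
>
>     # for svc in service1 service2 service4; do systemctl is-enabled $svc; done
>     # systemctl disable service1 service2 service4
>
> Pacemaker expects to be the only thing starting and stopping the services it manages, so anything else (systemd at boot, an init script, or a helper script) starting them behind Pacemaker's back can produce exactly this "active on 2 nodes" recovery behavior.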
>
>
> On Thu, Aug 27, 2020 at 12:46 AM Citron Vert <citron_vert at hotmail.com>
> wrote:
>
>> Hi,
>>
>> Sorry for using this email address; my name is Quentin. Thank you for your
>> reply.
>>
>> I have already tried the stickiness solution (with the deprecated
>> default-resource-stickiness property). I tried the one you gave me, and it
>> does not change anything.
>>
>> Resources don't seem to move from node to node (I don't see any changes
>> with the crm_mon command).
>>
>>
>> In the logs I found this line: "error: native_create_actions:
>> Resource SERVICE1 is active on 2 nodes".
>>
>> That is what led me to contact you, to understand and learn a little more
>> about this cluster and why there are resources running on the passive node.
>>
>>
>> You will find attached the logs during the reboot of the passive node and
>> my cluster configuration.
>>
>> I think I'm missing something in the configuration/logs that I don't
>> understand.
>>
>>
>> Thank you in advance for your help,
>>
>> Quentin
>>
>>
>> Le 26/08/2020 à 20:16, Reid Wahl a écrit :
>>
>> Hi, Citron.
>>
>> Based on your description, it sounds like some resources **might** be
>> moving from node 1 to node 2, failing on node 2, and then moving back to
>> node 1. If that's what's happening (and even if it's not), then it's
>> probably smart to set some resource stickiness as a resource default. The
>> below command sets a resource stickiness score of 1.
>>
>>     # pcs resource defaults resource-stickiness=1
>>
>> Also note that the "default-resource-stickiness" cluster property is
>> deprecated and should not be used.
>>
>> Finally, an explicit default resource stickiness score of 0 can interfere
>> with the placement of cloned resource instances. If you don't want any
>> stickiness, then it's better to leave stickiness unset. That way,
>> primitives will have a stickiness of 0, but clone instances will have a
>> stickiness of 1.
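>>
>> To double-check what is currently in effect, something like this should
>> work (I'm going from memory on the pcs 0.9 syntax):
>>
>>     # pcs resource defaults
>>     # pcs property list
>>
>> And if you want to drop the deprecated explicit default of 0 from the
>> cluster properties, I believe setting it to an empty value removes it:
>>
>>     # pcs property set default-resource-stickiness=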
>>
>> If adding stickiness does not resolve the issue, can you share your
>> cluster configuration and some logs that show the issue happening? Off the
>> top of my head I'm not sure why resources would start and stop on node 2
>> without moving away from node1, unless they're clone instances that are
>> starting and then failing a monitor operation on node 2.
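>>
>> For gathering that, something along these lines usually works (the
>> timestamps and destination path below are just placeholders):
>>
>>     # pcs config
>>     # crm_report -f "2020-08-26 00:00:00" -t "2020-08-27 12:00:00" /tmp/cluster-report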
>>
>> On Wed, Aug 26, 2020 at 8:42 AM Citron Vert <citron_vert at hotmail.com>
>> wrote:
>>
>>> Hello,
>>> I am contacting you because I have a problem with my cluster and I
>>> cannot find (nor understand) any information that can help me.
>>>
>>> I have a 2-node cluster (pacemaker, corosync, pcs) installed on CentOS
>>> 7 with a set of configurations.
>>> Everything seems to work fine, but here is what happens:
>>>
>>>    - Node1 and Node2 are running well with Node1 as primary
>>>    - I reboot Node2, which is passive (no changes on Node1)
>>>    - Node2 comes back into the cluster as passive
>>>    - corosync logs show resources getting started and then stopped on Node2
>>>    - "crm_mon" command shows some resources on Node1 getting restarted
>>>
>>> I don't understand how this is supposed to work.
>>> If a node comes back and becomes passive (since Node1 is still running as
>>> primary), there is no reason for the resources to be started and then
>>> stopped on the new passive node, is there?
>>>
>>> One of my resources becomes unstable because it gets started and then
>>> stopped too quickly on Node2, which seems to make it restart on Node1
>>> without a failover.
>>>
>>> I tried several things and solutions proposed by different sites and
>>> forums, but without success.
>>>
>>>
>>> Is there a way to ensure that a node which joins the cluster as passive
>>> does not start its own resources?
>>>
>>>
>>> Thanks in advance
>>>
>>>
>>> Here is some information, just in case:
>>> $ rpm -qa | grep -E "corosync|pacemaker|pcs"
>>> corosync-2.4.5-4.el7.x86_64
>>> pacemaker-cli-1.1.21-4.el7.x86_64
>>> pacemaker-1.1.21-4.el7.x86_64
>>> pcs-0.9.168-4.el7.centos.x86_64
>>> corosynclib-2.4.5-4.el7.x86_64
>>> pacemaker-libs-1.1.21-4.el7.x86_64
>>> pacemaker-cluster-libs-1.1.21-4.el7.x86_64
>>>
>>>
>>>         <nvpair id="cib-bootstrap-options-stonith-enabled" name=
>>> "stonith-enabled" value="false"/>
>>>         <nvpair id="cib-bootstrap-options-no-quorum-policy" name=
>>> "no-quorum-policy" value="ignore"/>
>>>         <nvpair id="cib-bootstrap-options-dc-deadtime" name=
>>> "dc-deadtime" value="120s"/>
>>>         <nvpair id="cib-bootstrap-options-have-watchdog" name=
>>> "have-watchdog" value="false"/>
>>>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
>>>  value="1.1.21-4.el7-f14e36fd43"/>
>>>         <nvpair id="cib-bootstrap-options-cluster-infrastructure" name=
>>> "cluster-infrastructure" value="corosync"/>
>>>         <nvpair id="cib-bootstrap-options-cluster-name" name=
>>> "cluster-name" value="CLUSTER"/>
>>>         <nvpair id="cib-bootstrap-options-last-lrm-refresh" name=
>>> "last-lrm-refresh" value="1598446314"/>
>>>         <nvpair id="cib-bootstrap-options-default-resource-stickiness"
>>>  name="default-resource-stickiness" value="0"/>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Manage your subscription:
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> ClusterLabs home: https://www.clusterlabs.org/
>>>
>>
>>
>> --
>> Regards,
>>
>> Reid Wahl, RHCA
>> Software Maintenance Engineer, Red Hat
>> CEE - Platform Support Delivery - ClusterHA
>>
>>
>
> --
> Regards,
>
> Reid Wahl, RHCA
> Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA
>
>

-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA