[ClusterLabs] [EXTERNAL] Users Digest, Vol 55, Issue 19

Michael Powell Michael.Powell at harmonicinc.com
Mon Aug 12 17:17:36 EDT 2019


Yes, I have tried that.  I used crm_resource --meta -p resource-stickiness -v 0 -r SS16201289RN00023 to disable resource stickiness, then ran kill -9 <pid> to kill the application associated with the master resource.  The result is the same: the slave resource remains a slave while the failed resource is restarted and becomes master again.
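
(In case it's relevant, before killing the process I also checked that the meta attribute had actually changed; this assumes I'm reading crm_resource's -g/--get-parameter option correctly:)

  # confirm the resource-stickiness meta attribute is now 0
  crm_resource --meta -r SS16201289RN00023 -g resource-stickiness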



One approach that seems to work is to run crm_resource -M -r ms-SS16201289RN00023 -H mgraid-16201289RN00023-1 to move the resource to the other node (assuming the master is currently running on node mgraid-16201289RN00023-0).  My original understanding was that this would "restart" the resource on the destination node, but that was apparently a misunderstanding.  I can change our scripts to use this approach, but a) I thought that maintaining the approach of demoting the master resource and promoting the slave to master was more generic, and b) I am unsure of any potential side effects of moving the resource.  Given what I'm trying to accomplish, is this in fact the preferred approach?
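
For reference, the sequence I'm experimenting with looks roughly like this (resource and node names are from our setup above; the final -U/--clear step is just my reading of the crm_resource man page, since -M appears to pin the resource with a location constraint that has to be removed afterwards):

  # move the master to the peer node (adds a temporary location constraint)
  crm_resource -M -r ms-SS16201289RN00023 -H mgraid-16201289RN00023-1

  # ... wait for the promotion on mgraid-16201289RN00023-1 to complete ...

  # drop the constraint created by -M so placement is unconstrained again
  crm_resource -U -r ms-SS16201289RN00023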



Regards,

    Michael





-----Original Message-----
From: Users <users-bounces at clusterlabs.org> On Behalf Of users-request at clusterlabs.org
Sent: Monday, August 12, 2019 1:10 PM
To: users at clusterlabs.org
Subject: [EXTERNAL] Users Digest, Vol 55, Issue 19



Send Users mailing list submissions to

                users at clusterlabs.org



To subscribe or unsubscribe via the World Wide Web, visit

                https://lists.clusterlabs.org/mailman/listinfo/users

or, via email, send a message with subject or body 'help' to

                users-request at clusterlabs.org



You can reach the person managing the list at

                users-owner at clusterlabs.org



When replying, please edit your Subject line so it is more specific than "Re: Contents of Users digest..."





Today's Topics:



   1. why is node fenced ? (Lentes, Bernd)

   2. Postgres HA - pacemaker RA do not support auto failback (Shital A)

   3. Re: why is node fenced ? (Chris Walker)

   4. Re: Master/slave failover does not work as expected

      (Andrei Borzenkov)





----------------------------------------------------------------------



Message: 1

Date: Mon, 12 Aug 2019 18:09:24 +0200 (CEST)

From: "Lentes, Bernd" <bernd.lentes at helmholtz-muenchen.de<mailto:bernd.lentes at helmholtz-muenchen.de>>

To: Pacemaker ML <users at clusterlabs.org<mailto:users at clusterlabs.org>>

Subject: [ClusterLabs] why is node fenced ?

Message-ID:

                <546330844.1686419.1565626164456.JavaMail.zimbra at helmholtz-muenchen.de<mailto:546330844.1686419.1565626164456.JavaMail.zimbra at helmholtz-muenchen.de>>



Content-Type: text/plain; charset=utf-8



Hi,



Last Friday (9th of August) I had to install patches on my two-node cluster.

I put one of the nodes (ha-idg-2) into standby (crm node standby ha-idg-2), patched it, rebooted, started the cluster (systemctl start pacemaker) again, put the node online again, and everything was fine.



Then I wanted to do the same procedure with the other node (ha-idg-1).

I put it in standby, patched it, rebooted, and started pacemaker again.

But then ha-idg-1 fenced ha-idg-2, saying the node was unclean.

I know that unclean nodes need to be shut down; that's logical.



But I don't know where the conclusion that the node is unclean comes from, or why it is unclean; I searched the logs and didn't find any hint.



I put the syslog and the pacemaker log on a Seafile share; I'd be very thankful if you'd have a look.

https://hmgubox.helmholtz-muenchen.de/d/53a10960932445fb9cfe/



Here is the CLI history of the commands:



17:03:04  crm node standby ha-idg-2

17:07:15  zypper up (install Updates on ha-idg-2)

17:17:30  systemctl reboot

17:25:21  systemctl start pacemaker.service

17:25:47  crm node online ha-idg-2

17:26:35  crm node standby ha-idg1-

17:30:21  zypper up (install Updates on ha-idg-1)

17:37:32  systemctl reboot

17:43:04  systemctl start pacemaker.service

17:44:00  ha-idg-1 is fenced



Thanks.



Bernd



OS is SLES 12 SP4, pacemaker 1.1.19, corosync 2.3.6-9.13.1





--



Bernd Lentes

Systemadministration

Institut für Entwicklungsgenetik

Gebäude 35.34 - Raum 208

HelmholtzZentrum münchen

bernd.lentes at helmholtz-muenchen.de

phone: +49 89 3187 1241

phone: +49 89 3187 3827

fax: +49 89 3187 2294

http://www.helmholtz-muenchen.de/idg



Perfect is whoever makes no mistakes

So the dead are perfect



Helmholtz Zentrum Muenchen

Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)

Ingolstaedter Landstr. 1

85764 Neuherberg

www.helmholtz-muenchen.de

Aufsichtsratsvorsitzende: MinDir'in Prof. Dr. Veronika von Messling

Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler, Kerstin Guenther

Registergericht: Amtsgericht Muenchen HRB 6466

USt-IdNr: DE 129521671







------------------------------



Message: 2

Date: Mon, 12 Aug 2019 12:24:02 +0530

From: Shital A <brightuser2019 at gmail.com>

To: pgsql-general at postgresql.com, Users at clusterlabs.org

Subject: [ClusterLabs] Postgres HA - pacemaker RA do not support auto

                failback

Message-ID:

                <CAMp7vw_KF2eM_Buh_fPbZNC9Z6PVvx+7rxjyMhfMcoZXuWGLKw at mail.gmail.com>

Content-Type: text/plain; charset="utf-8"



Hello,



Postgres version: 9.6

OS: RHEL 7.6



We are working on an HA setup for a Postgres cluster of two nodes in

active-passive mode.



Installed:

Pacemaker 1.1.19

Corosync 2.4.3



The pacemaker agent with this installation doesn't support automatic

failback. What I mean by that is explained below:

1. The cluster is set up as A - B, with A as master.

2. Kill services on A; node B comes up as master.

3. When node A is ready to rejoin the cluster, we have to delete the lock file it

creates on one of the nodes and execute the cleanup command to get the

node back as standby.



Step 3 is manual, so HA is not achieved in the real sense.
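
To make step 3 concrete, the manual recovery currently looks roughly like this (the lock file path assumes the pgsql RA's default tmpdir of /var/lib/pgsql/tmp, and the resource name pgsql is only a placeholder):

  # on the failed node: remove the lock file the pgsql RA leaves behind
  rm /var/lib/pgsql/tmp/PGSQL.lock

  # clear the failure history so the node can rejoin as standby
  pcs resource cleanup pgsql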



Please help to check:

1. Is there any version of the resource agent which supports automatic

failback, i.e. avoids generating the lock file so we don't have to delete it?



2. If there is no such support and we need this functionality, do we have

to modify the existing code?



How can this be achieved? Please suggest.

Thanks.



Thanks.

-------------- next part --------------

An HTML attachment was scrubbed...

URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20190812/737a010e/attachment-0001.html>



------------------------------



Message: 3

Date: Mon, 12 Aug 2019 17:47:02 +0000

From: Chris Walker <cwalker at cray.com>

To: Cluster Labs - All topics related to open-source clustering

                welcomed <users at clusterlabs.org>

Subject: Re: [ClusterLabs] why is node fenced ?

Message-ID: <EAFEF777-5A49-4C06-A2F6-8711F528B4B6 at cray.com>

Content-Type: text/plain; charset="utf-8"



When ha-idg-1 started Pacemaker around 17:43, it did not see ha-idg-2; for example:



Aug 09 17:43:05 [6318] ha-idg-1 pacemakerd:     info: pcmk_quorum_notification: Quorum retained | membership=1320 members=1



After ~20s (the dc-deadtime parameter), ha-idg-2 was marked 'unclean' and STONITHed as part of startup fencing.
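
(For reference, the dc-deadtime actually in effect can be queried with crm_attribute; the -G/--query form below is from the man page, so adjust if your packaging differs:)

  # show the current dc-deadtime cluster property (20s by default)
  crm_attribute --type crm_config --name dc-deadtime --query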



There is nothing in ha-idg-2's HA logs around 17:43 indicating that it saw ha-idg-1 either, so it appears that there was no communication at all between the two nodes.



I'm not sure exactly why the nodes did not see one another, but there are indications of network issues around this time:



2019-08-09T17:42:16.427947+02:00 ha-idg-2 kernel: [ 1229.245533] bond1: now running without any active interface!



so perhaps that's related.



HTH,

Chris





On 8/12/19, 12:09 PM, "Users on behalf of Lentes, Bernd" <users-bounces at clusterlabs.org on behalf of bernd.lentes at helmholtz-muenchen.de> wrote:



    Hi,



    last Friday (9th of August) i had to install patches on my two-node cluster.

    I put one of the nodes (ha-idg-2) into standby (crm node standby ha-idg-2), patched it, rebooted,

    started the cluster (systemctl start pacemaker) again, put the node again online, everything fine.



    Then i wanted to do the same procedure with the other node (ha-idg-1).

    I put it in standby, patched it, rebooted, started pacemaker again.

    But then ha-idg-1 fenced ha-idg-2, it said the node is unclean.

    I know that nodes which are unclean need to be shutdown, that's logical.



    But i don't know from where the conclusion comes that the node is unclean respectively why it is unclean,

    i searched in the logs and didn't find any hint.



    I put the syslog and the pacemaker log on a seafile share, i'd be very thankful if you'll have a look.

    https://hmgubox.helmholtz-muenchen.de/d/53a10960932445fb9cfe/



    Here the cli history of the commands:



    17:03:04  crm node standby ha-idg-2

    17:07:15  zypper up (install Updates on ha-idg-2)

    17:17:30  systemctl reboot

    17:25:21  systemctl start pacemaker.service

    17:25:47  crm node online ha-idg-2

    17:26:35  crm node standby ha-idg1-

    17:30:21  zypper up (install Updates on ha-idg-1)

    17:37:32  systemctl reboot

    17:43:04  systemctl start pacemaker.service

    17:44:00  ha-idg-1 is fenced



    Thanks.



    Bernd



    OS is SLES 12 SP4, pacemaker 1.1.19, corosync 2.3.6-9.13.1










------------------------------



Message: 4

Date: Mon, 12 Aug 2019 23:09:31 +0300

From: Andrei Borzenkov <arvidjaar at gmail.com>

To: Cluster Labs - All topics related to open-source clustering

                welcomed <users at clusterlabs.org>

Cc: Venkata Reddy Chappavarapu <Venkata.Chappavarapu at harmonicinc.com>

Subject: Re: [ClusterLabs] Master/slave failover does not work as

                expected

Message-ID:

                <CAA91j0WxSxt_eVmUvXgJ_0goBkBw69r3o-VesRvGc6atg6o=jQ at mail.gmail.com>

Content-Type: text/plain; charset="utf-8"



On Mon, Aug 12, 2019 at 4:12 PM Michael Powell <

Michael.Powell at harmonicinc.com> wrote:



> At 07:44:49, the ss agent discovers that the master instance has failed on

> node *mgraid?-0* as a result of a failed *ssadm* request in response to

> an *ss_monitor()* operation.  It issues a *crm_master -Q -D* command with

> the intent of demoting the master and promoting the slave, on the other

> node, to master.  The *ss_demote()* function finds that the application

> is no longer running and returns *OCF_NOT_RUNNING* (7).  In the older

> product, this was sufficient to promote the other instance to master, but

> in the current product, that does not happen.  Currently, the failed

> application is restarted, as expected, and is promoted to master, but this

> takes 10's of seconds.

>

>

>



Did you try to disable resource stickiness for this ms?
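
(A minimal sketch of what I mean, using the ms resource name from the thread above; whether stickiness has to be set on the ms itself rather than only the primitive is exactly what I'd check first:)

  # set stickiness to 0 on the ms resource itself, not just the primitive
  crm_resource --meta -p resource-stickiness -v 0 -r ms-SS16201289RN00023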

-------------- next part --------------

An HTML attachment was scrubbed...

URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20190812/12978d55/attachment.html>

-------------- next part --------------

A non-text attachment was scrubbed...

Name: image001.gif

Type: image/gif

Size: 1854 bytes

Desc: not available

URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20190812/12978d55/attachment.gif>



------------------------------



Subject: Digest Footer



_______________________________________________

Manage your subscription:

https://lists.clusterlabs.org/mailman/listinfo/users



ClusterLabs home: https://www.clusterlabs.org/



------------------------------



End of Users Digest, Vol 55, Issue 19

*************************************
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20190812/420b00e3/attachment-0001.html>

