[ClusterLabs] Help understanding recovery of a promotable resource after a "pcs cluster stop --all"

Strahil Nikolov hunter86_bg at yahoo.com
Tue May 3 01:14:47 EDT 2022


Have you checked with the DRBD commands whether the two nodes were in sync?
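For example (a sketch only; "r0" is a placeholder for whatever DRBD resource backs DRBDData), the connection and disk states can be checked on each node with:

    drbdadm status r0     # DRBD 9 style output: look for disk:UpToDate and peer-disk:UpToDate
    drbdadm cstate r0     # connection state, e.g. Connected
    drbdadm dstate r0     # disk state; UpToDate is what you want
    cat /proc/drbd        # DRBD 8.x: look for ds:UpToDate/UpToDate

If the surviving node reports Outdated or Inconsistent data, DRBD will normally refuse to be promoted there, which would also explain a missing master score.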

Also consider adding the shared directory, LVM, etc. into a single resource group -> see https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_administration/s1-resourcegroupcreatenfs-haaa
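A rough sketch, assuming a hypothetical LVM-activate resource called "nfs_lvm" and volume group "nfs_vg" (adjust names and parameters to your setup); group members start in the listed order and are colocated together, so the existing order/colocation constraints on the "nfs" group then cover the whole stack:

    # hypothetical LVM resource; vgname/vg_access_mode are assumptions
    pcs resource create nfs_lvm ocf:heartbeat:LVM-activate \
        vgname=nfs_vg vg_access_mode=system_id
    # place it in the existing "nfs" group, ahead of the filesystem
    pcs resource group add nfs nfs_lvm --before drbd_fs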
Best Regards,
Strahil Nikolov
 
 
On Tue, May 3, 2022 at 0:25, Ken Gaillot <kgaillot at redhat.com> wrote:

On Mon, 2022-05-02 at 13:11 -0300, Salatiel Filho wrote:
> Hi, Ken, here is the info you asked for.
> 
> 
> # pcs constraint
> Location Constraints:
>  Resource: fence-server1
>    Disabled on:
>      Node: server1 (score:-INFINITY)
>  Resource: fence-server2
>    Disabled on:
>      Node: server2 (score:-INFINITY)
> Ordering Constraints:
>  promote DRBDData-clone then start nfs (kind:Mandatory)
> Colocation Constraints:
>  nfs with DRBDData-clone (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
> Ticket Constraints:
> 
> # sudo crm_mon -1A
> ...
> Node Attributes:
>  * Node: server2:
>    * master-DRBDData                    : 10000

In the scenario you described, only server1 is up. If there is no
master score for server1, it cannot be master. It's up to the resource
agent to set it. I'm not familiar enough with that agent to know why it
might not.
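A possible next step (a sketch; the attribute name matches the crm_mon
output above, and "reboot" is the lifetime used for transient node
attributes): confirm that no promotion score exists for server1 at all,
as opposed to a negative or stale one:

    # query the transient promotion score for server1; an empty result or
    # error means the DRBD agent never set one on that node
    crm_attribute --query --node server1 --name master-DRBDData --lifetime reboot

With ocf:linbit:drbd the score is normally derived from the DRBD
disk/connection state, so if it is missing, checking "drbdadm status" on
server1 (and any DRBD-level fencing/quorum settings) is usually the next
place to look.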

> 
> 
> 
> Atenciosamente/Kind regards,
> Salatiel
> 
> On Mon, May 2, 2022 at 12:26 PM Ken Gaillot <kgaillot at redhat.com>
> wrote:
> > On Mon, 2022-05-02 at 09:58 -0300, Salatiel Filho wrote:
> > > Hi, I am trying to understand the recovery process of a promotable
> > > resource after "pcs cluster stop --all" and a shutdown of both nodes.
> > > I have a two-node cluster + qdevice for quorum, with a DRBD resource.
> > > 
> > > This is a summary of the resources before my test. Everything is
> > > working just fine and server2 is the master of DRBD.
> > > 
> > >  * fence-server1    (stonith:fence_vmware_rest):    Started server2
> > >  * fence-server2    (stonith:fence_vmware_rest):    Started server1
> > >  * Clone Set: DRBDData-clone [DRBDData] (promotable):
> > >    * Masters: [ server2 ]
> > >    * Slaves: [ server1 ]
> > >  * Resource Group: nfs:
> > >    * drbd_fs    (ocf::heartbeat:Filesystem):    Started server2
> > > 
> > > 
> > > 
> > > Then I issue "pcs cluster stop --all". The cluster is stopped on both
> > > nodes as expected.
> > > Now I restart server1 (previously the slave) and power off server2
> > > (previously the master). When server1 restarts it fences server2, and
> > > I can see server2 starting in vCenter, but I pressed a key at the GRUB
> > > menu so that server2 would not boot and instead stayed "paused" on the
> > > GRUB screen.
> > > 
> > > SSH'ing to server1 and running pcs status I get:
> > > 
> > > Cluster name: cluster1
> > > Cluster Summary:
> > >  * Stack: corosync
> > >  * Current DC: server1 (version 2.1.0-8.el8-7c3f660707) - partition with quorum
> > >  * Last updated: Mon May  2 09:52:03 2022
> > >  * Last change:  Mon May  2 09:39:22 2022 by root via cibadmin on server1
> > >  * 2 nodes configured
> > >  * 11 resource instances configured
> > > 
> > > Node List:
> > >  * Online: [ server1 ]
> > >  * OFFLINE: [ server2 ]
> > > 
> > > Full List of Resources:
> > >  * fence-server1    (stonith:fence_vmware_rest):    Stopped
> > >  * fence-server2    (stonith:fence_vmware_rest):    Started server1
> > >  * Clone Set: DRBDData-clone [DRBDData] (promotable):
> > >    * Slaves: [ server1 ]
> > >    * Stopped: [ server2 ]
> > >  * Resource Group: nfs:
> > >    * drbd_fs    (ocf::heartbeat:Filesystem):    Stopped
> > > 
> > > 
> > > So I can see there is quorum, but server1 is never promoted to DRBD
> > > master, so the remaining resources stay stopped until server2 is back.
> > > 1) What do I need to do to force the promotion and recover without
> > > restarting server2?
> > > 2) Why, if I instead reboot server2 and power off server1, can the
> > > cluster recover by itself?
> > > 
> > > 
> > > Thanks!
> > > 
> > 
> > You shouldn't need to force promotion; that is the default behavior in
> > that situation. There must be something else in the configuration that
> > is preventing promotion.
> > 
> > The DRBD resource agent should set a promotion score for the node. You
> > can run "crm_mon -1A" to show all node attributes; there should be one
> > like "master-DRBDData" for the active node.
> > 
> > You can also show the constraints in the cluster to see if there is
> > anything relevant to the master role.

-- 
Ken Gaillot <kgaillot at redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
  