[ClusterLabs] Unable to restart resources

Tue Mar 26 15:36:30 EDT 2019

On Tue, 2019-03-26 at 09:33 -0600, JCA wrote:
> Making some progress with Pacemaker/DRBD, but still trying to grasp
> some of the basics of this framework. Here is my current situation:
> 
> I have a two-node cluster, pmk1 and pmk2, with resources ClusterIP
> and DrbdFS. In what follows, commands preceded by '[pmk1] #' are to
> be understood as commands issued by the superuser in pmk1, whereas
> those preceded by '[pmk2] #' are issued by the superuser in pmk2
> (pretty obvious, but better make it crystal clear).
> 
> [pmk1] # pcs status resources
>  ClusterIP	(ocf::heartbeat:IPaddr2):	Started pmk1
>  Master/Slave Set: DrbdDataClone [DrbdData]
>      Masters: [ pmk1 ]
>      Slaves: [ pmk2 ]
>  DrbdFS	(ocf::heartbeat:Filesystem):	Started pmk1
> 
> [pmk2] # pcs status resources
>  ClusterIP	(ocf::heartbeat:IPaddr2):	Started pmk1
>  Master/Slave Set: DrbdDataClone [DrbdData]
>      Masters: [ pmk1 ]
>      Slaves: [ pmk2 ]
>  DrbdFS	(ocf::heartbeat:Filesystem):	Started pmk2

If this is an accurate copy/paste, something is already wrong: DrbdFS
is shown as started on pmk1 from pmk1's point of view, but on pmk2 from
pmk2's point of view. The cluster status should be the same no matter
which node you run the command on.

Maybe just a copy/paste error?
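
One way to check whether the two pastes really disagree is to compare
them mechanically. Below is a minimal Python sketch, assuming the
"pcs status resources" layout shown above. The helper names
parse_resources and diff_views are made up for illustration and are not
part of pcs or Pacemaker.

```python
import re

def parse_resources(text):
    """Parse `pcs status resources` text into {name: state} (hypothetical helper)."""
    view = {}
    current_set = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = re.match(r"Master/Slave Set: (\S+)", line)
        if m:
            current_set = m.group(1)
            continue
        m = re.match(r"(Masters|Slaves|Stopped): \[(.*)\]", line)
        if m and current_set:
            # Role lines belong to the most recent Master/Slave set header.
            view[f"{current_set}:{m.group(1)}"] = tuple(sorted(m.group(2).split()))
            continue
        m = re.match(r"(\S+)\s+\(([^)]+)\):\s+(.+)", line)
        if m:
            current_set = None
            view[m.group(1)] = m.group(3).strip()
    return view

def diff_views(view_a, view_b):
    """Return {name: (state_a, state_b)} for every entry the two views disagree on."""
    keys = sorted(set(view_a) | set(view_b))
    return {k: (view_a.get(k), view_b.get(k))
            for k in keys if view_a.get(k) != view_b.get(k)}

# The two views quoted above (tabs replaced by spaces for readability).
PMK1_VIEW = """\
 ClusterIP (ocf::heartbeat:IPaddr2): Started pmk1
 Master/Slave Set: DrbdDataClone [DrbdData]
     Masters: [ pmk1 ]
     Slaves: [ pmk2 ]
 DrbdFS (ocf::heartbeat:Filesystem): Started pmk1
"""
PMK2_VIEW = PMK1_VIEW.replace(
    "DrbdFS (ocf::heartbeat:Filesystem): Started pmk1",
    "DrbdFS (ocf::heartbeat:Filesystem): Started pmk2",
)

print(diff_views(parse_resources(PMK1_VIEW), parse_resources(PMK2_VIEW)))
# -> {'DrbdFS': ('Started pmk1', 'Started pmk2')}
```

An empty result would mean the two nodes agree, which is what a healthy
cluster should show.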

> 
> There is an ext4 filesystem in the DRBD device, mounted at
> /var/lib/pmk. When things are as described above, in pmk1 this
> directory contains the data that I used when I populated the DRBD
> filesystem in pmk1, whereas in pmk2 it contains nothing. I.e.
> everything is as expected.
> 
> Then I did
> 
> [pmk1] # pcs cluster stop pmk1
> pmk1: Stopping Cluster (pacemaker)...
> pmk1: Stopping Cluster (corosync)...
> 
> [pmk2] # pcs status resources
>  ClusterIP	(ocf::heartbeat:IPaddr2):	Started pmk2
>  Master/Slave Set: DrbdDataClone [DrbdData]
>      Masters: [ pmk2 ]
>      Stopped: [ pmk2 ]

Similarly, here I'd expect "Stopped: [ pmk1 ]", since pmk1 is the node
that was stopped.
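
The oddity in that output is that pmk2 is listed under two roles at
once. A small Python sketch of that sanity check follows, assuming a
single master/slave set as in this thread. The function name
nodes_in_multiple_roles is hypothetical, not a pcs or Pacemaker API.

```python
import re
from collections import defaultdict

def nodes_in_multiple_roles(status_text):
    """Return {node: [roles]} for nodes listed under more than one role line.

    Assumes only one Master/Slave set appears in the text, so role lines
    from different sets are not mixed together.
    """
    roles = defaultdict(set)
    for line in status_text.splitlines():
        m = re.match(r"\s*(Masters|Slaves|Stopped): \[(.*)\]", line)
        if m:
            for node in m.group(2).split():
                roles[node].add(m.group(1))
    return {node: sorted(r) for node, r in roles.items() if len(r) > 1}

# The pmk2 view quoted above (tabs replaced by spaces for readability).
SAMPLE = """\
 ClusterIP (ocf::heartbeat:IPaddr2): Started pmk2
 Master/Slave Set: DrbdDataClone [DrbdData]
     Masters: [ pmk2 ]
     Stopped: [ pmk2 ]
 DrbdFS (ocf::heartbeat:Filesystem): Started pmk2
"""

print(nodes_in_multiple_roles(SAMPLE))
# -> {'pmk2': ['Masters', 'Stopped']}
```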

>  DrbdFS	(ocf::heartbeat:Filesystem):	Started pmk2
> 
> After this the contents of /var/lib/pmk in pmk2 are those that were
> used to populate the DRBD filesystem in pmk1 (plus any changes
> introduced by pmk1 before I stopped it), whereas /var/lib/pmk in pmk1
> is now empty. This suggests that things are behaving OK, or at
> least the way I was expecting them to behave.
> 
> Next I tried to bring pmk1 back on:
> 
> [pmk1] # pcs cluster start pmk1
> pmk1: Starting Cluster (corosync)...
> pmk1: Starting Cluster (pacemaker)...
> 
> [pmk1] # pcs status resources
>  ClusterIP	(ocf::heartbeat:IPaddr2):	Stopped
>  Master/Slave Set: DrbdDataClone [DrbdData]
>      Stopped: [ pmk1 pmk2 ]
>  DrbdFS	(ocf::heartbeat:Filesystem):	Stopped
> 
> [pmk2] # pcs status resources
>  ClusterIP	(ocf::heartbeat:IPaddr2):	Started pmk2
>  Master/Slave Set: DrbdDataClone [DrbdData]
>      Masters: [ pmk2 ]
>      Stopped: [ pmk2 ]
>  DrbdFS	(ocf::heartbeat:Filesystem):	Started pmk2

A split view like this would be understandable immediately at start-up,
but pmk1 should quickly pick up the correct info from pmk2. Did it
stay this way for more than a minute?
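
To answer that reliably, one could poll both nodes and time how long
they take to agree. This is a rough Python sketch only: wait_until_equal
and local_status are illustrative names, and how you obtain the other
node's output (for example over ssh) is an assumption left to the
caller.

```python
import subprocess
import time

def local_status():
    """Run `pcs status resources` on this node and return its output."""
    result = subprocess.run(["pcs", "status", "resources"],
                            capture_output=True, text=True, check=False)
    return result.stdout

def _normalize(text):
    """Collapse whitespace so tab/space differences do not count as a mismatch."""
    return " ".join(text.split())

def wait_until_equal(fetch_a, fetch_b, timeout=90.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll two status fetchers until their output matches.

    Returns the seconds elapsed when they first agree, or None if they
    still disagree once `timeout` seconds have passed.
    """
    start = clock()
    while True:
        if _normalize(fetch_a()) == _normalize(fetch_b()):
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

For example, wait_until_equal(local_status, fetch_from_pmk2) would do
the timing, where fetch_from_pmk2 is a function you would have to
supply yourself. A None result means the split view persisted past the
timeout, which would point to a real problem rather than start-up lag.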

> Node pmk1 is back up, but ClusterIP and DrbdFS are not, at least on
> pmk1. And pmk2 remains in charge. I clumsily tried to restart those
> resources by hand in pmk1, to no avail:
> 
> [pmk1] # pcs resource restart ClusterIP
> Error: Error performing operation: No such device or address
> ClusterIP is not running anywhere and so cannot be restarted
> 
> I also tried stopping and starting the pmk1 node from pmk1, and also
> from pmk2, several times, to no avail.
> 
> How can I bring back the pmk1 node on correctly, so that everything
> is how it originally was - i.e. with pmk1 up and running, and with
> the resources also up and running in pmk1?
> 
-- 
Ken Gaillot <kgaillot at redhat.com>