[ClusterLabs] Maintenance & Pacemaker Restart Demotes MS Resources

Fri Jun 7 12:48:31 EDT 2019

On Fri, 2019-06-07 at 07:31 -0700, Dirk Gassen wrote:
> Thanks, that seems to have been the problem in my case. (For some
> reason the attribute did not reappear on its own, but adding it
> manually w/ crm_attribute did work).
> 
> I assume that this happened since I didn't have another node that
> could become the DC while restarting pacemaker? If I do add another
> node then the problem doesn't seem to appear.

Yes, that makes sense.

> 
> Dirk
> 
> On Wed, Jun 5, 2019 at 3:17 PM Ken Gaillot <kgaillot at redhat.com>
> wrote:
> > On Wed, 2019-06-05 at 13:28 -0700, Dirk Gassen wrote:
> > > Thanks for your quick reply. I should have been a bit more verbose
> > > in my problem description.
> > > 
> > > After starting up pacemaker again and before "crm node ready
> > > testras3" I did actually monitor the cluster with "crm_mon" and
> > > waited until it indicated that it knew about the states of the
> > > resources.
> > > 
> > > Here is actually the excerpt from syslog:
> > > * crm node maintenance testras3
> > > > 16:14:50 On loss of CCM Quorum: Ignore
> > > > 16:14:50 Forcing unmanaged master MariaDB:0 to remain promoted on testras3
> > > > 16:14:50 Calculated Transition 12: /var/lib/pacemaker/pengine/pe-input-72.bz2
> > > * systemctl stop pacemaker
> > > > 16:15:29 On loss of CCM Quorum: Ignore
> > > > 16:15:29 Forcing unmanaged master MariaDB:0 to remain promoted on testras3
> > 
> > Ah, there is no master score for MariaDB, so when the node leaves
> > maintenance mode, the resource must be demoted.
> > 
> > Restarting pacemaker clears all transient node attributes (including
> > the master score). The next monitor would set it again, but
> > maintenance mode cancels monitors, so it won't run until it comes out
> > of maintenance mode, at which point it wants to do the demote.
> > 
> > A good way around this would be to unmanage the MariaDB resource
> > before putting the node in maintenance. When you take the node out of
> > maintenance, the monitor will start up again, but it won't take any
> > actions. Once the monitor runs and sets the master score (which you
> > can confirm with crm_master --query --resource MariaDB --node <node>),
> > you can manage the resource.
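
As a rough sketch of the workaround quoted above (unmanage, restart under maintenance, wait for the master score, manage again), shell functions along these lines might do. The function names, the DRY_RUN switch, the timing variables, and the choice to unmanage ms_MariaDB rather than the primitive are assumptions for illustration. So is the idea that crm_master exits non-zero while no score is set. The commands are the ones mentioned in this thread plus crm resource unmanage/manage.

```shell
#!/bin/sh
# Sketch only: defaults are the names used in this thread.
NODE="${NODE:-testras3}"          # node being restarted
MS_RES="${MS_RES:-ms_MariaDB}"    # assumption: unmanage the ms wrapper
RES="${RES:-MariaDB}"             # primitive whose master score is queried
SETTLE_SECS="${SETTLE_SECS:-30}"  # hypothetical pause after starting pacemaker
POLL_SECS="${POLL_SECS:-5}"       # hypothetical polling interval
MAX_TRIES="${MAX_TRIES:-24}"      # hypothetical upper bound on polls

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Poll until crm_master reports a score for RES on NODE.
# Assumes crm_master exits non-zero while the attribute is unset.
wait_for_master_score() {
    tries=0
    while [ "$tries" -lt "$MAX_TRIES" ]; do
        if run crm_master --query --resource "$RES" --node "$NODE" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$POLL_SECS"
    done
    return 1
}

restart_pacemaker_unmanaged() {
    run crm resource unmanage "$MS_RES" || return 1
    run crm node maintenance "$NODE" || return 1
    run systemctl stop pacemaker || return 1
    run systemctl start pacemaker || return 1
    run sleep "$SETTLE_SECS" || return 1
    run crm node ready "$NODE" || return 1
    # The monitor runs again now, but nothing is acted on while unmanaged.
    if wait_for_master_score; then
        run crm resource manage "$MS_RES"
    else
        echo "master score for $RES not set; leaving $MS_RES unmanaged" >&2
        return 1
    fi
}
```

Sourcing the file and calling restart_pacemaker_unmanaged with DRY_RUN=1 only prints the commands, which is a cheap way to review the order before trying it on a test cluster.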
> > 
> > > > 16:15:29 Scheduling Node testras3 for shutdown
> > > > 16:15:29 Calculated Transition 13: /var/lib/pacemaker/pengine/pe-input-73.bz2
> > > > 16:15:29 Invoking handler for signal 15: Terminated
> > > * systemctl start pacemaker
> > > > 16:15:57 Additional logging available in /var/log/pacemaker.log
> > > > 16:16:20 On loss of CCM Quorum: Ignore
> > > > 16:16:20 Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-74.bz2
> > > > 16:16:20 On loss of CCM Quorum: Ignore
> > > > 16:16:20 Forcing unmanaged master MariaDB:0 to remain promoted on testras3
> > > > 16:16:20 Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-75.bz2
> > > * crm node ready testras3
> > > > 16:18:01 On loss of CCM Quorum: Ignore
> > > > 16:18:01 Stop    AppserverIP (testras3)
> > > > 16:18:01 Demote  MariaDB:0 (Master -> Slave testras3)
> > > > 16:18:01 Calculated Transition 2: /var/lib/pacemaker/pengine/pe-input-76.bz2
> > > > 16:18:01 On loss of CCM Quorum: Ignore
> > > > 16:18:01 Start   AppserverIP (testras3)
> > > > 16:18:01 Promote MariaDB:0 (Slave -> Master testras3)
> > > > 16:18:01 Calculated Transition 3: /var/lib/pacemaker/pengine/pe-input-77.bz2
> > > > 16:18:02 On loss of CCM Quorum: Ignore
> > > > 16:18:02 Calculated Transition 4: /var/lib/pacemaker/pengine/pe-input-78.bz2
> > > 
> > > So it looks to me like the cluster is demoting ms_MariaDB from
> > > Master to Slave. I'm not sure if I should have waited for something
> > > else to occur?
> > > 
> > > I have attached pe-input-76.bz2.
> > > 
> > > Dirk
> > > 
> > > On Wed, Jun 5, 2019 at 10:22 AM Ken Gaillot <kgaillot at redhat.com>
> > > wrote:
> > > > On Wed, 2019-06-05 at 07:40 -0700, Dirk Gassen wrote:
> > > > > Hi,
> > > > > 
> > > > > I have the following CIB:
> > > > > > primitive AppserverIP IPaddr \
> > > > > >         params ip=10.1.8.70 cidr_netmask=255.255.255.192 nic=eth0 \
> > > > > >         op monitor interval=30s
> > > > > > primitive MariaDB mysql \
> > > > > >         params binary="/usr/bin/mysqld_safe" pid="/var/run/mysqld/mysqld.pid" socket="/var/run/mysqld/mysqld.sock" replication_user=repl replication_passwd="r3plic at tion" max_slave_lag=15 evict_outdated_slaves=false test_user=repl test_passwd="r3plic at tion" config="/etc/mysql/my.cnf" user=mysql group=mysql datadir="/opt/mysql" \
> > > > > >         op monitor interval=27s role=Master OCF_CHECK_LEVEL=1 \
> > > > > >         op monitor interval=35s timeout=30 role=Slave OCF_CHECK_LEVEL=1 \
> > > > > >         op start interval=0 timeout=130 \
> > > > > >         op stop interval=0 timeout=130
> > > > > > ms ms_MariaDB MariaDB \
> > > > > >         meta master-max=1 master-node-max=1 clone-node-max=1 notify=true globally-unique=false target-role=Started is-managed=true
> > > > > > colocation colo_sm_aip inf: AppserverIP:Started ms_MariaDB:Master
> > > > > 
> > > > > When I do "crm node maintenance testras3 && systemctl stop
> > > > > pacemaker && systemctl start pacemaker && crm node ready
> > > > > testras3" the cluster decides to demote ms_MariaDB and (because
> > > > > of the colocation) to stop AppserverIP. It then follows up
> > > > > immediately with promoting ms_MariaDB and starting AppserverIP
> > > > > again.
> > > > > 
> > > > > If I leave out restarting pacemaker the cluster does not demote
> > > > > ms_MariaDB and AppserverIP is left running.
> > > > > 
> > > > > Why is the demotion happening and is there a way to avoid this?
> > > > 
> > > > It looks like there isn't enough time between starting pacemaker
> > > > and taking the node out of maintenance for pacemaker to re-detect
> > > > the state of all resources. It's best to do that manually, i.e.
> > > > wait for the status output to show all the resources again, but
> > > > you could automate it with a fixed sleep or maybe a brief sleep
> > > > plus crm_resource --wait.
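
The "brief sleep plus crm_resource --wait" idea could be sketched roughly as below. The function name, the DRY_RUN switch, and the 10-second default are assumptions rather than anything from the thread, and the right pause would need to be found by testing on the cluster in question.

```shell
#!/bin/sh
# Sketch only: defaults are the names used in this thread.
NODE="${NODE:-testras3}"          # node being taken out of maintenance
SETTLE_SECS="${SETTLE_SECS:-10}"  # hypothetical length of the "brief sleep"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

ready_after_settle() {
    run systemctl start pacemaker || return 1
    # Give the daemons a moment to come up before asking them anything.
    run sleep "$SETTLE_SECS" || return 1
    # Block until the cluster reports no pending actions.
    run crm_resource --wait || return 1
    run crm node ready "$NODE"
}
```

With DRY_RUN=1 the function only prints the four commands in order, so the sequence can be checked without touching the cluster.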
> > > > 
> > > > > Corosync 2.3.5-3ubuntu2.3 and Pacemaker 1.1.14-2ubuntu1.6
> > > > > 
> > > > > Sincerely,
> > > > > Dirk
> > > > > -- 
> > > > > Dirk Gassen
> > > > > Senior Software Engineer | GetWellNetwork
> > > > > o: 240.482.3146
> > > > > e: dgassen at getwellnetwork.com
> > > > > To help people take an active role in their health journey
> > > > -- 
> > > > Ken Gaillot <kgaillot at redhat.com>
> > > > 
> > > > _______________________________________________
> > > > Manage your subscription:
> > > > https://lists.clusterlabs.org/mailman/listinfo/users
> > > > 
> > > > ClusterLabs home: https://www.clusterlabs.org/
> > -- 
> > Ken Gaillot <kgaillot at redhat.com>
-- 
Ken Gaillot <kgaillot at redhat.com>