[ClusterLabs] Antw: Rebooting a standby node triggers lots of transitions
kadlecsik.jozsef at wigner.mta.hu
Fri Sep 7 06:46:20 EDT 2018
On Wed, 5 Sep 2018, Kadlecsik József wrote:
> On Wed, 5 Sep 2018, Ken Gaillot wrote:
> > > > For testing purposes one of our nodes was put in standby mode and
> > > > then rebooted several times. When the standby node started up, it
> > > > joined the cluster as a new member, which resulted in transitions
> > > > between the online nodes. However, when the standby node was
> > > > rebooted in mid-transition, it triggered further transitions.
> > > > As a result, live migrations were aborted and guests were
> > > > stopped/started.
> > > >
> > > > How can one make sure that join/leave operations of standby nodes
> > > > do not affect the location of the running resources?
> > > >
> > > > It's pacemaker 1.1.16-1 with corosync 2.4.2-3+deb9u1 on Debian
> > > > stretch nodes.
> > Node joins/leaves do and should trigger new transitions, but that should
> > not result in any actions if the node is in standby.
> > The cluster will wait for any actions in progress (such as a live
> > migration) to complete before beginning a new transition, so there is
> > likely something else going on that is affecting the migration.
> > > Logs and more details, please!
> > Particularly the detail log on the DC should be helpful. It will have
> > "pengine:" messages with "saving inputs" at each transition.
> I attached the log file.
> There are log lines like this
> Sep 5 12:22:30 atlas4 crmd: notice: Transition aborted by
> w2-utilization-cpu doing modify cpu=1: Configuration change
> which I don't understand: in the configuration the cpu utilization is
> explicitly set to cpu=2 for w2.
> Nothing changed, just the node atlas0 (in standby mode) was halted/started
> several times. Still, resources were migrated, like in this case:
> Sep 5 12:22:31 atlas4 VirtualDomain(mail0): INFO: mail0: Starting
> live migration to atlas3 (using: virsh --connect=qemu:///system --quiet
> migrate --live mail0 qemu+tls://atlas3/system ).
> And besides the successful migrations, sometimes the guest was
> stopped/started instead of being migrated:
> Sep 5 12:25:22 atlas4 crmd: notice: Result of stop operation for
> mail0 on atlas4: 0 (ok)
> Sep 5 12:25:22 atlas4 crmd: notice: Initiating start operation
> mail0_start_0 on atlas3
Just guessing: maybe utilization is taken into account even when a node is
offline, and that causes the transitions?
I can provide the pe-input files which were recorded during the events.
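
To see at a glance which changes abort the transitions, the log on the DC
could be summarised with something like the following. This is only a
minimal sketch: the regular expression is modelled on the single
"Transition aborted by" line quoted above, the function name abort_reasons
is made up for illustration, and other pacemaker versions may word the
message differently.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this thread; the exact
# message layout is an assumption and may differ between versions.
ABORT_RE = re.compile(
    r"crmd: notice: Transition aborted by (?P<source>\S+) doing "
    r"(?P<op>\w+) (?P<detail>[^:]+): (?P<reason>.+)$"
)


def abort_reasons(lines):
    """Count aborted transitions per (source, operation, detail, reason)."""
    counts = Counter()
    for line in lines:
        m = ABORT_RE.search(line.strip())
        if m:
            key = (m["source"], m["op"],
                   m["detail"].strip(), m["reason"].strip())
            counts[key] += 1
    return counts


# The two sample entries are the log lines from this thread, unwrapped.
sample = [
    "Sep 5 12:22:30 atlas4 crmd: notice: Transition aborted by "
    "w2-utilization-cpu doing modify cpu=1: Configuration change",
    "Sep 5 12:25:22 atlas4 crmd: notice: Result of stop operation for "
    "mail0 on atlas4: 0 (ok)",
]

for key, n in abort_reasons(sample).items():
    print(n, *key)
```

Feeding it the lines of the real log file instead of the sample list would
show whether the utilization attributes are the only source of the aborts.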
E-mail : kadlecsik.jozsef at wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
H-1525 Budapest 114, POB. 49, Hungary