[ClusterLabs] Antw: Rebooting a standby node triggers lots of transitions

Fri Sep 7 06:46:20 EDT 2018

On Wed, 5 Sep 2018, Kadlecsik József wrote:

> On Wed, 5 Sep 2018, Ken Gaillot wrote:
> 
> > > > For testing purposes one of our nodes was put in standby mode and
> > > > then rebooted several times. When the standby node started up, it
> > > > joined the cluster as a new member, which resulted in transitions
> > > > between the online nodes. However, when the standby node was
> > > > rebooted mid-transition, it triggered further transitions.
> > > > As a result, live migrations were aborted and guests were
> > > > stopped/started.
> > > > 
> > > > How can one make sure that join/leave operations of standby nodes
> > > > do not affect the location of the running resources?
> > > > 
> > > > It's pacemaker 1.1.16-1 with corosync 2.4.2-3+deb9u1 on Debian
> > > > stretch nodes.
> > 
> > Node joins/leaves do and should trigger new transitions, but that should 
> > not result in any actions if the node is in standby.
> >
> > The cluster will wait for any actions in progress (such as a live 
> > migration) to complete before beginning a new transition, so there is 
> > likely something else going on that is affecting the migration.
> > 
> > > Logs and more details, please!
> > 
> > Particularly the detail log on the DC should be helpful. It will have
> > "pengine:" messages with "saving inputs" at each transition.
> 
> I attached the log file.
> 
> There are log lines like this
> 
> Sep  5 12:22:30 atlas4 crmd[32776]:   notice: Transition aborted by 
> w2-utilization-cpu doing modify cpu=1: Configuration change 
> 
> which I don't understand: in the configuration the cpu utilization is 
> explicitly set to cpu=2 for w2.
> 
> Nothing changed, just the node atlas0 (in standby mode) was halted/started 
> several times. Still, resources were migrated, like in this case:
> 
> Sep  5 12:22:31 atlas4 VirtualDomain(mail0)[61781]: INFO: mail0: Starting 
> live migration to atlas3 (using: virsh --connect=qemu:///system --quiet 
> migrate --live  mail0 qemu+tls://atlas3/system ).
> 
> And besides the successful migrations, sometimes the guest was 
> stopped/started instead of being migrated:
> 
> Sep  5 12:25:22 atlas4 crmd[32776]:   notice: Result of stop operation for 
> mail0 on atlas4: 0 (ok) 
> Sep  5 12:25:22 atlas4 crmd[32776]:   notice: Initiating start operation 
> mail0_start_0 on atlas3 

Just guessing: maybe utilization is taken into account even when a node is 
offline, and that causes the transitions?

I can provide the pe-input files which were recorded during the events.
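
To see which changes keep aborting the transitions, the crmd lines in the 
log could be summarized with something like the Python sketch below. It is 
only a rough illustration: the regular expression is modelled on the single 
"Transition aborted by ..." line quoted above, so other abort messages may 
well use a different layout and be missed, and the function names are made 
up for this example.

```python
import re
from collections import Counter

# Pattern modelled on the one crmd line quoted in this thread; other
# "Transition aborted" variants may not match (assumption).
ABORT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+crmd\[\d+\]:\s+"
    r"notice: Transition aborted by (?P<trigger>.+?): (?P<reason>.+)$"
)


def aborted_transitions(lines):
    """Yield (timestamp, host, trigger, reason) for each matching log line."""
    for line in lines:
        match = ABORT_RE.match(line.rstrip())
        if match:
            yield (
                match.group("stamp"),
                match.group("host"),
                match.group("trigger"),
                match.group("reason"),
            )


def count_triggers(lines):
    """Count how often each trigger (e.g. an attribute modify) aborted a transition."""
    return Counter(trigger for _, _, trigger, _ in aborted_transitions(lines))
```

Passing the open log file to count_triggers() should show whether the 
utilization updates (such as the w2-utilization-cpu one) account for most 
of the aborts, or whether something else is involved as well.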

Best regards,
Jozsef
--
E-mail : kadlecsik.jozsef at wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics, Hungarian Academy of Sciences
         H-1525 Budapest 114, POB. 49, Hungary