[ClusterLabs] two virtual domains start and stop every 15 minutes

Ken Gaillot kgaillot at redhat.com
Fri Jul 5 20:54:00 EDT 2019


On Fri, 2019-07-05 at 13:07 +0200, Lentes, Bernd wrote:
> 
> ----- On Jul 4, 2019, at 1:25 AM, kgaillot kgaillot at redhat.com wrote:
> 
> > On Wed, 2019-06-19 at 18:46 +0200, Lentes, Bernd wrote:
> > > ----- On Jun 15, 2019, at 4:30 PM, Bernd Lentes
> > > bernd.lentes at helmholtz-muenchen.de wrote:
> > > 
> > > > ----- On Jun 14, 2019, at 9:20 PM, kgaillot
> > > > kgaillot at redhat.com wrote:
> > > > 
> > > > > On Fri, 2019-06-14 at 18:27 +0200, Lentes, Bernd wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > I had that problem once already, but it's still not clear
> > > > > > to me what really happens.
> > > > > > I ran into this problem again some days ago:
> > > > > > I have a 2-node cluster with several virtual domains as
> > > > > > resources. I put one node (ha-idg-2) into standby, and two
> > > > > > running virtual domains were migrated to the other node
> > > > > > (ha-idg-1). The other virtual domains were already running
> > > > > > on ha-idg-1.
> > > > > > Since then the two virtual domains which migrated
> > > > > > (vm_idcc_devel and vm_severin) start or stop every 15
> > > > > > minutes on ha-idg-1. ha-idg-2 resides in standby.
> > > > > > I know that the 15-minute interval is related to the
> > > > > > "cluster-recheck-interval".
> > > > > > But why are these two domains started and stopped?
> > > > > > I looked around a lot in the logs, checked the pe-input
> > > > > > files, and watched some graphs created by crm_simulate with
> > > > > > dotty ...
> > > > > > I always see that the domains are started, 15 minutes later
> > > > > > stopped, and 15 minutes later started ...
> > > > > > but I don't see WHY. I would really like to know that.
> > > > > > And why are the domains not started by the monitor
> > > > > > operation? It should recognize that the domain is stopped
> > > > > > and start it again. My monitor interval is 30 seconds.
> > > > > > I had two errors pending concerning these domains, a failed
> > > > > > migrate from ha-idg-1 to ha-idg-2, from some time before.
> > > > > > Could that be the culprit?
> > 
> > It did indeed turn out to be.
> > 
> > The resource history on ha-idg-1 shows the last failed action as a
> > migrate_to from ha-idg-1 to ha-idg-2, and the last successful
> > action as a migrate_from from ha-idg-2 to ha-idg-1. That confused
> > pacemaker as to the current status of the migration.
> > 
> > A full migration is migrate_to on the source node, migrate_from on
> > the target node, and stop on the source node. When the resource
> > history has a failed migrate_to on the source, and a stop but no
> > migrate_from on the target, the migration is considered "dangling"
> > and forces a stop of the resource on the source, because it's
> > possible the migrate_from never got a chance to be scheduled.
> > 
> > That is wrong in this situation. The resource is happily running on
> > the node with the failed migrate_to because it was later moved back
> > successfully, and the failed migrate_to is no longer relevant.
> > 
> > My current plan for a fix is that if a node with a failed
> > migrate_to has a successful migrate_from or start that's newer, and
> > the target node of the failed migrate_to has a successful stop,
> > then the migration should not be considered dangling.
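> > (As an aside, the per-node resource history involved here can be
> > inspected with crm_mon's operations view, e.g.:
> > 
> >     # one-shot cluster status including the recorded operation
> >     # history and fail counts for each resource on each node
> >     crm_mon --one-shot --operations --failcounts
> > 
> > which is where the failed migrate_to, the newer successful
> > migrate_from, and the stops discussed above show up.)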
> > 
> > A couple of side notes on your configuration:
> > 
> > Instead of putting action=off in fence device configurations, you
> > should use pcmk_reboot_action=off. Pacemaker adds action when
> > sending the fence command.
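> > For example, sketched with crmsh and placeholder values (the fence
> > agent, address and credentials below are made up; the point is only
> > pcmk_reboot_action=off):
> > 
> >     # hypothetical fence device for ha-idg-2; note
> >     # pcmk_reboot_action=off rather than action=off
> >     crm configure primitive fence-ha-idg-2 stonith:external/ipmi \
> >         params hostname=ha-idg-2 ipaddr=192.168.100.22 \
> >             userid=fenceadmin passwd=secret pcmk_reboot_action=off \
> >         op monitor interval=60s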
> 
> I did that already.
>  
> > When keeping a fence device off its target node, use a finite
> > negative score rather than -INFINITY. This ensures the node can
> > fence itself as a last resort.
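> > For example (constraint id and the -5000 score are arbitrary
> > placeholders; any finite negative value will do):
> > 
> >     # discourage, but don't forbid, running the fence device for
> >     # ha-idg-2 on ha-idg-2 itself
> >     crm configure location loc-fence-ha-idg-2-avoid fence-ha-idg-2 \
> >         -5000: ha-idg-2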
> 
> I will do that.
> 
> Thanks for clarifying this; it has happened quite often.
> I conclude that it's very important to clean up a resource failure
> quickly after finding the cause and solving the problem, so that no
> errors are left pending.

This is the first bug I can recall that was triggered by an old
failure, so I don't think it's important as a general policy outside of
live migrations.

I've got a fix I'll merge soon.
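In the meantime, clearing the stale failed migrate_to from the resource
history should keep the scheduler from treating the migration as
dangling. A minimal example, using the resource and node names from
this thread:

    # clear the failed operation history (including the old failed
    # migrate_to) for one affected domain on its old source node;
    # this does not stop or restart the running domain
    crm_resource --cleanup --resource vm_severin --node ha-idg-1

and the same for vm_idcc_devel.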

> 
> Bernd
>  
> 
-- 
Ken Gaillot <kgaillot at redhat.com>


