[ClusterLabs] two virtual domains start and stop every 15 minutes

Fri Jul 5 07:07:12 EDT 2019

----- On Jul 4, 2019, at 1:25 AM, kgaillot kgaillot at redhat.com wrote:

> On Wed, 2019-06-19 at 18:46 +0200, Lentes, Bernd wrote:
>> ----- On Jun 15, 2019, at 4:30 PM, Bernd Lentes
>> bernd.lentes at helmholtz-muenchen.de wrote:
>> 
>> > ----- On Jun 14, 2019, at 9:20 PM, kgaillot kgaillot at redhat.com
>> > wrote:
>> > 
>> > > On Fri, 2019-06-14 at 18:27 +0200, Lentes, Bernd wrote:
>> > > > Hi,
>> > > > 
>> > > > I have had this problem once before, but it is still not clear to
>> > > > me what really happens.
>> > > > I had this problem some days ago:
>> > > > I have a 2-node cluster with several virtual domains as resources.
>> > > > I put one node (ha-idg-2) into standby, and two running virtual
>> > > > domains were migrated to the other node (ha-idg-1). The other
>> > > > virtual domains were already running on ha-idg-1.
>> > > > Since then the two virtual domains which migrated (vm_idcc_devel
>> > > > and vm_severin) start or stop every 15 minutes on ha-idg-1.
>> > > > ha-idg-2 remains in standby.
>> > > > I know that the 15-minute interval is related to the
>> > > > "cluster-recheck-interval".
>> > > > But why are these two domains started and stopped?
>> > > > I looked through the logs, checked the pe-input files, and
>> > > > examined some graphs created by crm_simulate with dotty ...
>> > > > I always see that the domains are started, 15 minutes later
>> > > > stopped, and 15 minutes later started again ...
>> > > > but I don't see WHY. I would really like to know that.
>> > > > And why are the domains not restarted by the monitor operation?
>> > > > It should recognize that the domain is stopped and start it
>> > > > again. My monitor interval is 30 seconds.
>> > > > I had two errors pending concerning these domains, a failed
>> > > > migrate from ha-idg-1 to ha-idg-2, from some time before.
>> > > > Could that be the culprit?
> 
> It did indeed turn out to be.
> 
> The resource history on ha-idg-1 shows the last failed action as a
> migrate_to from ha-idg-1 to ha-idg-2, and the last successful action as
> a migrate_from from ha-idg-2 to ha-idg-1. That confused pacemaker as to
> the current status of the migration.
> 
> A full migration is migrate_to on the source node, migrate_from on the
> target node, and stop on the source node. When the resource history has
> a failed migrate_to on the source, and a stop but no migrate_from on
> the target, the migration is considered "dangling" and forces a stop of
> the resource on the source, because it's possible the migrate_from
> never got a chance to be scheduled.
> 
> That is wrong in this situation. The resource is happily running on the
> node with the failed migrate_to because it was later moved back
> successfully, and the failed migrate_to is no longer relevant.
> 
> My current plan for a fix is that if a node with a failed migrate_to
> has a successful migrate_from or start that's newer, and the target
> node of the failed migrate_to has a successful stop, then the migration
> should not be considered dangling.
> 
> A couple of side notes on your configuration:
> 
> Instead of putting action=off in fence device configurations, you
> should use pcmk_reboot_action=off. Pacemaker adds action when sending
> the fence command.

I did that already.

> When keeping a fence device off its target node, use a finite negative
> score rather than -INFINITY. This ensures the node can fence itself as
> a last resort.

I will do that.

Thanks for clarifying this; it happened very often.
I conclude that it's very important to clean up a resource failure promptly after finding the cause
and solving the problem, so that no errors remain pending.
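
The dangling-migration rule and the proposed fix described in the quoted reply above can be sketched roughly as follows. This is a simplified Python model, not Pacemaker's scheduler code: the Op record, its fields, the is_dangling function, and the example history are all hypothetical and only loosely modelled on the situation in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One entry of a resource's operation history (hypothetical model)."""
    node: str
    action: str   # "migrate_to", "migrate_from", "start" or "stop"
    ok: bool      # True if the operation succeeded
    when: int     # assumed ordering key; larger means newer


def is_dangling(history, source, target, apply_fix=True):
    """Return True if a failed migrate_to on `source` counts as dangling."""
    failed = [op for op in history
              if op.node == source and op.action == "migrate_to" and not op.ok]
    if not failed:
        return False
    last_failed = max(failed, key=lambda op: op.when)

    target_ops = [op for op in history if op.node == target]
    has_stop = any(op.action == "stop" for op in target_ops)
    has_migrate_from = any(op.action == "migrate_from" for op in target_ops)

    # Rule as described: a stop but no migrate_from on the target.
    if not has_stop or has_migrate_from:
        return False

    if apply_fix:
        # Proposed exception: the source has a newer successful
        # migrate_from or start, and the target has a successful stop.
        newer_active = any(
            op.node == source and op.ok
            and op.action in ("migrate_from", "start")
            and op.when > last_failed.when
            for op in history)
        target_stopped = any(op.action == "stop" and op.ok for op in target_ops)
        if newer_active and target_stopped:
            return False
    return True


# Illustrative history: an old failed migrate_to from ha-idg-1, followed
# later by a successful migration back from ha-idg-2 to ha-idg-1.
history = [
    Op("ha-idg-1", "migrate_to", ok=False, when=1),
    Op("ha-idg-2", "migrate_to", ok=True, when=2),
    Op("ha-idg-1", "migrate_from", ok=True, when=3),
    Op("ha-idg-2", "stop", ok=True, when=4),
]

print(is_dangling(history, "ha-idg-1", "ha-idg-2", apply_fix=False))
print(is_dangling(history, "ha-idg-1", "ha-idg-2", apply_fix=True))
```

Without the exception, the stale failed migrate_to is enough to mark the migration as dangling in this model, which would correspond to the repeated stop of the resource on ha-idg-1. With the exception applied, the newer successful migrate_from takes precedence and the resource is left alone.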

Bernd
