[ClusterLabs] why is node fenced ?

Wed Aug 19 10:04:11 EDT 2020

On Tue, 2020-08-18 at 12:30 -0500, Ken Gaillot wrote:
> On Tue, 2020-08-18 at 16:47 +0200, Lentes, Bernd wrote:
> > 
> > ----- On Aug 17, 2020, at 5:09 PM, kgaillot kgaillot at redhat.com
> > wrote:
> > 
> > 
> > > > I checked all relevant pe-files in this time period.
> > > > This is what I found (I am only listing the important entries):
> > 
> >  
> > > > Executing cluster transition:
> > > >  * Resource action: vm_nextcloud    stop on ha-idg-2
> > > > Revised cluster status:
> > > >  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> > > > 
> > > > ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3118 -G transition-4516.xml -D transition-4516.dot
> > > > Current cluster status:
> > > > Node ha-idg-1 (1084777482): standby
> > > > Online: [ ha-idg-2 ]
> > > >  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <============== vm_nextcloud is stopped
> > > > Transition Summary:
> > > >  * Shutdown ha-idg-1
> > > > Executing cluster transition:
> > > >  * Resource action: vm_nextcloud    stop on ha-idg-1 <==== why stop? It is already stopped
> > > 
> > > I'm not sure, I'd have to see the pe input.
> > 
> > You can find it here:
> > https://hmgubox2.helmholtz-muenchen.de/index.php/s/WJGtodMZ9k7rN29
> 
> This appears to be a scheduler bug.

The fix is in the master branch and will land in 2.0.5, which is expected
at the end of the year:

https://github.com/ClusterLabs/pacemaker/pull/2146

> The scheduler considers a migration to be "dangling" if it has a record
> of a failed migrate_to on the source node, but no migrate_from on the
> target node (and no migrate_from or start on the source node, which
> would indicate a later full restart or reverse migration).
> 
> In this case, any migrate_from on the target has since been superseded
> by a failed start and a successful stop, so there is no longer a record
> of it. Therefore the migration is considered dangling, which requires a
> full stop on the source node.
> 
> However in this case we already have a successful stop on the source
> node after the failed migrate_to, and I believe that should be
> sufficient to consider it no longer dangling.
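
To make the rule above easier to follow, here is a rough Python sketch of
the decision as described in this thread. It is not the scheduler's actual
code (Pacemaker is written in C), and OpRecord, is_dangling,
honor_later_stop and the sample history are all hypothetical names and
data made up for illustration. In particular, the ordering of the recorded
operations below is an assumption, not taken from the pe input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpRecord:
    """One remembered resource operation (hypothetical, simplified model)."""
    node: str    # node the operation ran on
    op: str      # "migrate_to", "migrate_from", "start", "stop"
    ok: bool     # True if the operation succeeded
    order: int   # larger means more recent (assumed global ordering)


def is_dangling(history, source, target, honor_later_stop):
    """Return True if the migration source -> target looks dangling.

    honor_later_stop=False follows the rule as quoted above.
    honor_later_stop=True also treats a successful stop on the source
    after the failed migrate_to as resolving the migration, which is
    the behaviour proposed above (sketched here as an assumption).
    """
    failed = [r for r in history
              if r.node == source and r.op == "migrate_to" and not r.ok]
    if not failed:
        return False
    migrate_to = max(failed, key=lambda r: r.order)

    # A remembered migrate_from on the target: not dangling.
    if any(r.node == target and r.op == "migrate_from" for r in history):
        return False

    # A later migrate_from or start on the source indicates a reverse
    # migration or a full restart.
    if any(r.node == source and r.op in ("migrate_from", "start")
           and r.order > migrate_to.order for r in history):
        return False

    # Proposed refinement: a later successful stop on the source is enough.
    if honor_later_stop and any(
            r.node == source and r.op == "stop" and r.ok
            and r.order > migrate_to.order for r in history):
        return False

    return True


# Made-up history loosely resembling the case in this thread: the
# migrate_from record on the target is gone, superseded by a failed
# start and a successful stop.
history = [
    OpRecord("ha-idg-1", "migrate_to", ok=False, order=1),
    OpRecord("ha-idg-1", "stop", ok=True, order=2),
    OpRecord("ha-idg-2", "start", ok=False, order=3),
    OpRecord("ha-idg-2", "stop", ok=True, order=4),
]

print(is_dangling(history, "ha-idg-1", "ha-idg-2", honor_later_stop=False))
print(is_dangling(history, "ha-idg-1", "ha-idg-2", honor_later_stop=True))
```

With the rule as quoted, the first call reports the migration as dangling,
which is what leads to the extra stop on the source node. With the later
successful stop taken into account, the second call does not.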
> 
> > > >  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <======= vm_nextcloud is stopped
> > > > Transition Summary:
> > > >  * Fence (Off) ha-idg-1 'resource actions are unrunnable'
> > > > Executing cluster transition:
> > > >  * Fencing ha-idg-1 (Off)
> > > >  * Pseudo action:   vm_nextcloud_stop_0 <======= why stop? It is already stopped
> > > > Revised cluster status:
> > > > Node ha-idg-1 (1084777482): OFFLINE (standby)
> > > > Online: [ ha-idg-2 ]
> > > >  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> > > > 
> > > > I don't understand why the cluster tries to stop a resource which is
> > > > already stopped.
> > 
> > Bernd
> > Helmholtz Zentrum München
-- 
Ken Gaillot <kgaillot at redhat.com>