[ClusterLabs] why is node fenced ?

Ken Gaillot kgaillot at redhat.com
Mon Aug 17 11:09:02 EDT 2020


On Fri, 2020-08-14 at 20:37 +0200, Lentes, Bernd wrote:
> ----- On Aug 9, 2020, at 10:17 PM, Bernd Lentes 
> bernd.lentes at helmholtz-muenchen.de wrote:
> 
> 
> > > So this appears to be the problem. From these logs I would guess the
> > > successful stop on ha-idg-1 did not get written to the CIB for some
> > > reason. I'd look at the pe input from this transition on ha-idg-2 to
> > > confirm that.
> > > 
> > > Without the DC knowing about the stop, it tries to schedule a new one,
> > > but the node is shutting down so it can't do it, which means it has to
> > > be fenced.
> 
> I checked all relevant pe files from this time period.
> This is what I found (I only show the important entries):
> 
> ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3116 -G transition-3116.xml -D transition-3116.dot
> Current cluster status:
>  ...
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Started ha-idg-1
> Transition Summary:
>  ...
> * Migrate    vm_nextcloud           ( ha-idg-1 -> ha-idg-2 )
> Executing cluster transition:
>  * Resource action: vm_nextcloud    migrate_from on ha-idg-2 <======= migrate vm_nextcloud
>  * Resource action: vm_nextcloud    stop on ha-idg-1 
>  * Pseudo action:   vm_nextcloud_start_0
> Revised cluster status:
> Node ha-idg-1 (1084777482): standby
> Online: [ ha-idg-2 ]
> vm_nextcloud   (ocf::heartbeat:VirtualDomain): Started ha-idg-2
> 
> 
> ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-error-48 -G transition-4514.xml -D transition-4514.dot
> Current cluster status:
> Node ha-idg-1 (1084777482): standby
> Online: [ ha-idg-2 ]
> ...
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): FAILED[ ha-idg-2 ha-idg-1 ] <====== migration failed
> Transition Summary:
> ..
>  * Recover    vm_nextcloud            (             ha-idg-2 )
> Executing cluster transition:
>  * Resource action: vm_nextcloud    stop on ha-idg-2
>  * Resource action: vm_nextcloud    stop on ha-idg-1 
>  * Resource action: vm_nextcloud    start on ha-idg-2
>  * Resource action: vm_nextcloud    monitor=30000 on ha-idg-2
> Revised cluster status:
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Started ha-idg-2
> 
> ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3117 -G transition-3117.xml -D transition-3117.dot
> Current cluster status:
> Node ha-idg-1 (1084777482): standby
> Online: [ ha-idg-2 ]
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): FAILED ha-idg-2 <====== start on ha-idg-2 failed
> Transition Summary:
>  * Stop       vm_nextcloud     ( ha-idg-2 )   due to node availability <==== stop vm_nextcloud (what does "due to node availability" mean?)

"Due to node availability" means no node is allowed to run the
resource, so it has to be stopped.
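
One way to see which nodes are ruled out is to look at the CIB saved in
the pe file. The Python sketch below is only an illustration and not
something taken from this thread: the function name unavailable_reasons
is made up, and it checks just two common causes, a node in standby and
an INFINITY fail count for the resource. It assumes the fail count is a
transient node attribute named fail-count-<resource> or
fail-count-<resource>#<operation>_<interval> (the naming depends on the
Pacemaker version). Location constraints and other rules are not
covered; crm_simulate with --show-scores gives the full picture.

```python
import xml.etree.ElementTree as ET


def unavailable_reasons(cib_root, rsc_id):
    """Map node name -> reasons it probably cannot run rsc_id (illustrative only)."""
    reasons = {}

    # Permanent node attributes such as standby are under <configuration><nodes>.
    for node in cib_root.findall("./configuration/nodes/node"):
        for nv in node.iter("nvpair"):
            if nv.get("name") == "standby" and nv.get("value") in ("on", "true"):
                reasons.setdefault(node.get("uname"), []).append("standby")

    # Fail counts are transient attributes under <status><node_state>.
    prefix = "fail-count-" + rsc_id
    for state in cib_root.findall("./status/node_state"):
        for nv in state.findall("./transient_attributes//nvpair"):
            name = nv.get("name", "")
            is_fail_count = name == prefix or name.startswith(prefix + "#")
            if is_fail_count and nv.get("value") == "INFINITY":
                reasons.setdefault(state.get("uname"), []).append(name)

    return reasons


# Example call (file name from the output above; decompress it first if it is .bz2):
#   root = ET.parse("pe-input-3117").getroot()
#   print(unavailable_reasons(root, "vm_nextcloud"))
```

If every node shows up in the result, that would be consistent with the
"due to node availability" stop, but it is only a hint and not proof.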

> Executing cluster transition:
>  * Resource action: vm_nextcloud    stop on ha-idg-2
> Revised cluster status:
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> 
> ha-idg-1:~/why-fenced/ha-idg-1/pengine # crm_simulate -S -x pe-input-3118 -G transition-4516.xml -D transition-4516.dot
> Current cluster status:
> Node ha-idg-1 (1084777482): standby
> Online: [ ha-idg-2 ]
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <============== vm_nextcloud is stopped
> Transition Summary:
>  * Shutdown ha-idg-1
> Executing cluster transition:
>  * Resource action: vm_nextcloud    stop on ha-idg-1 <==== why stop? It is already stopped

I'm not sure, I'd have to see the pe input.
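
To check this yourself, the operation history the scheduler works from
is in the status section of the pe file. The sketch below (Python,
standard library only, with hypothetical helper names load_cib and
recorded_ops) lists the lrm_rsc_op entries recorded for one resource on
each node, so you can see whether a successful stop (operation "stop",
rc-code "0") on ha-idg-1 is actually present in the input. It assumes
the usual CIB layout of node_state / lrm / lrm_resources / lrm_resource
/ lrm_rsc_op, and that the file is either plain XML or bzip2-compressed.

```python
import bz2
import xml.etree.ElementTree as ET


def load_cib(path):
    """Parse a saved scheduler input; accepts plain XML or a .bz2 file."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rb") as fh:
        return ET.parse(fh).getroot()


def recorded_ops(cib_root, rsc_id):
    """Return (node, op id, operation, rc-code) for each recorded op of rsc_id."""
    rows = []
    for state in cib_root.findall("./status/node_state"):
        for rsc in state.findall("./lrm/lrm_resources/lrm_resource"):
            if rsc.get("id") != rsc_id:
                continue
            for op in rsc.findall("lrm_rsc_op"):
                rows.append((state.get("uname"), op.get("id"),
                             op.get("operation"), op.get("rc-code")))
    return rows


# Example call (adjust the file name to the one on disk):
#   for row in recorded_ops(load_cib("pe-input-3118"), "vm_nextcloud"):
#       print(row)
```

If no entry for ha-idg-1 shows a stop with rc-code "0", the scheduler
has no record that the resource is stopped there and will schedule one.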

> Revised cluster status:
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> 
> ha-idg-1:~/why-fenced/ha-idg-2/pengine # crm_simulate -S -x pe-input-3545 -G transition-0.xml -D transition-0.dot
> Current cluster status:
> Node ha-idg-1 (1084777482): pending
> Online: [ ha-idg-2 ]
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <====== vm_nextcloud is stopped
> Transition Summary:
> 
> Executing cluster transition:
> Using the original execution date of: 2020-07-20 15:05:33Z
> Revised cluster status:
> vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> 
> ha-idg-1:~/why-fenced/ha-idg-2/pengine # crm_simulate -S -x pe-warn-749 -G transition-1.xml -D transition-1.dot
> Current cluster status:
> Node ha-idg-1 (1084777482): OFFLINE (standby)
> Online: [ ha-idg-2 ]
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped <======= vm_nextcloud is stopped
> Transition Summary:
>  * Fence (Off) ha-idg-1 'resource actions are unrunnable'
> Executing cluster transition:
>  * Fencing ha-idg-1 (Off)
>  * Pseudo action:   vm_nextcloud_stop_0 <======= why stop? It is already stopped
> Revised cluster status:
> Node ha-idg-1 (1084777482): OFFLINE (standby)
> Online: [ ha-idg-2 ]
>  vm_nextcloud   (ocf::heartbeat:VirtualDomain): Stopped
> 
> I don't understand why the cluster tries to stop a resource which is
> already stopped.
> 
> Bernd
> Helmholtz Zentrum München
-- 
Ken Gaillot <kgaillot at redhat.com>


