[ClusterLabs] why is node fenced ?

Ken Gaillot kgaillot at redhat.com
Mon Aug 10 17:59:07 EDT 2020


On Sun, 2020-08-09 at 22:17 +0200, Lentes, Bernd wrote:
> 
> ----- Am 29. Jul 2020 um 18:53 schrieb kgaillot kgaillot at redhat.com:
> 
> > On Wed, 2020-07-29 at 17:26 +0200, Lentes, Bernd wrote:
> > > Hi,
> > > 
> > > a few days ago one of my nodes was fenced and I don't know why,
> > > which is something I really don't like.
> > > What I did:
> > > I put one node (ha-idg-1) in standby. The resources on it (most of
> > > them virtual domains) were migrated to ha-idg-2, except one domain
> > > (vm_nextcloud): a mount point that the domain's XML points to was
> > > missing on ha-idg-2.
> > > The cluster then tried to start vm_nextcloud on ha-idg-2, which of
> > > course also failed.
> > > Then ha-idg-1 was fenced.
> > > I did a "crm history" over the respective time period; you can find
> > > it here:
> > > https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF
> > > 
> > > Here, from my point of view, the most interesting from the logs:
> > > ha-idg-1:
> > > Jul 20 16:59:33 [23763] ha-idg-1        cib:     info: cib_perform_op:  Diff: --- 2.16196.19 2
> > > Jul 20 16:59:33 [23763] ha-idg-1        cib:     info: cib_perform_op:  Diff: +++ 2.16197.0 bc9a558dfbe6d7196653ce56ad1ee758
> > > Jul 20 16:59:33 [23763] ha-idg-1        cib:     info: cib_perform_op:  +  /cib:  @epoch=16197, @num_updates=0
> > > Jul 20 16:59:33 [23763] ha-idg-1        cib:     info: cib_perform_op:  +  /cib/configuration/nodes/node[@id='1084777482']/instance_attributes[@id='nodes-1084777482']/nvpair[@id='nodes-1084777482-standby']:  @value=on
> > > ha-idg-1 set to standby
> > > 
> > > Jul 20 16:59:34 [23768] ha-idg-1       crmd:   notice: process_lrm_event:       ha-idg-1-vm_nextcloud_migrate_to_0:3169 [ error: Cannot access storage file '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso': No such file or directory\nocf-exit-reason:vm_nextcloud: live migration to ha-idg-2 failed: 1\n ]
> > > migration failed
> > > 
> > > Jul 20 17:04:01 [23767] ha-idg-1    pengine:    error: native_create_actions:   Resource vm_nextcloud is active on 2 nodes (attempting recovery)
> > > ???
> > 
> > This is standard for a failed live migration -- the cluster doesn't
> > know how far the migration actually got before failing, so it has to
> > assume the VM could be active on either node. (The log message would
> > make more sense saying "might be active" rather than "is active".)
> > 
> > > Jul 20 17:04:01 [23767] ha-idg-1    pengine:   notice: LogAction:        * Recover    vm_nextcloud           (             ha-idg-2 )
> > 
> > The recovery from that situation is a full stop on both nodes, and
> > start on one of them.
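> > 
> > If you want to cross-check that, something like this (untested, but
> > both are standard commands) shows where the cluster believes the
> > domain is active versus what libvirt on each node actually has
> > running:
> > 
> >   crm_resource --locate --resource vm_nextcloud   # cluster's view
> >   virsh list --all                                # libvirt's view, per node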
> > 
> > > Jul 20 17:04:01 [23768] ha-idg-1       crmd:   notice: te_rsc_command:  Initiating stop operation vm_nextcloud_stop_0 on ha-idg-2 | action 106
> > > Jul 20 17:04:01 [23768] ha-idg-1       crmd:   notice: te_rsc_command:  Initiating stop operation vm_nextcloud_stop_0 locally on ha-idg-1 | action 2
> > > 
> > > Jul 20 17:04:01 [23768] ha-idg-1       crmd:     info: match_graph_event:       Action vm_nextcloud_stop_0 (106) confirmed on ha-idg-2 (rc=0)
> > > 
> > > Jul 20 17:04:06 [23768] ha-idg-1       crmd:   notice: process_lrm_event:       Result of stop operation for vm_nextcloud on ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true cib-update=5960
> > 
> > It looks like both stops succeeded.
> > 
> > > Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd:   notice: crm_signal_dispatch:     Caught 'Terminated' signal | 15 (invoking handler)
> > > systemctl stop pacemaker.service
> > > 
> > > 
> > > ha-idg-2:
> > > Jul 20 17:04:03 [10691] ha-idg-2       crmd:   notice: process_lrm_event:       Result of stop operation for vm_nextcloud on ha-idg-2: 0 (ok) | call=157 key=vm_nextcloud_stop_0 confirmed=true cib-update=57
> > > the log from ha-idg-2 is two seconds ahead of ha-idg-1
> > > 
> > > Jul 20 17:04:08 [10688] ha-idg-2       lrmd:   notice: log_execute:     executing - rsc:vm_nextcloud action:start call_id:192
> > > Jul 20 17:04:09 [10688] ha-idg-2       lrmd:   notice: operation_finished:      vm_nextcloud_start_0:29107:stderr [ error: Failed to create domain from /mnt/share/vm_nextcloud.xml ]
> > > Jul 20 17:04:09 [10688] ha-idg-2       lrmd:   notice: operation_finished:      vm_nextcloud_start_0:29107:stderr [ error: Cannot access storage file '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso': No such file or directory ]
> > > Jul 20 17:04:09 [10688] ha-idg-2       lrmd:   notice: operation_finished:      vm_nextcloud_start_0:29107:stderr [ ocf-exit-reason:Failed to start virtual domain vm_nextcloud. ]
> > > Jul 20 17:04:09 [10688] ha-idg-2       lrmd:   notice: log_finished:    finished - rsc:vm_nextcloud action:start call_id:192 pid:29107 exit-code:1 exec-time:581ms queue-time:0ms
> > > the start on ha-idg-2 failed
> > 
> > The start failed ...
> > 
> > > Jul 20 17:05:32 [10691] ha-idg-2       crmd:     info: do_dc_takeover:  Taking over DC status for this partition
> > > ha-idg-1 stopped pacemaker
> > 
> > Since ha-idg-1 is now shutting down, ha-idg-2 becomes DC.
> > 
> > > Jul 20 17:05:33 [10690] ha-idg-2    pengine:  warning: unpack_rsc_op_failure:   Processing failed migrate_to of vm_nextcloud on ha-idg-1: unknown error | rc=1
> > > Jul 20 17:05:33 [10690] ha-idg-2    pengine:  warning: unpack_rsc_op_failure:   Processing failed start of vm_nextcloud on ha-idg-2: unknown error | rc
> > > 
> > > Jul 20 17:05:33 [10690] ha-idg-2    pengine:     info: native_color:    Resource vm_nextcloud cannot run anywhere
> > > logical
> > > 
> > > Jul 20 17:05:33 [10690] ha-idg-2    pengine:  warning: custom_action:   Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (pending)
> > > ???
> > 
> > So this appears to be the problem. From these logs I would guess the
> > successful stop on ha-idg-1 did not get written to the CIB for some
> > reason. I'd look at the pe input from this transition on ha-idg-2 to
> > confirm that.
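> > 
> > For example (the pe-input number below is only a placeholder -- use
> > the file referenced by that transition in the ha-idg-2 logs):
> > 
> >   crm_simulate -Sx /var/lib/pacemaker/pengine/pe-input-NNNN.bz2
> > 
> > That replays the scheduler input and shows the resource state the DC
> > was working from, i.e. whether the stop result from ha-idg-1 is in it.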
> > 
> > Without the DC knowing about the stop, it tries to schedule a new one,
> > but the node is shutting down so it can't do it, which means it has to
> > be fenced.
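> > 
> > The stop result, if it was recorded, would also be visible in the
> > status section of the CIB on ha-idg-2, e.g. something like (untested):
> > 
> >   cibadmin -Q -o status | grep -B2 -A2 vm_nextcloud
> > 
> > A successful stop should appear there as an lrm_rsc_op entry with
> > rc-code="0" under ha-idg-1's node_state.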
> > 
> > > Jul 20 17:05:35 [10690] ha-idg-2    pengine:  warning: custom_action:   Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (offline)
> > > Jul 20 17:05:35 [10690] ha-idg-2    pengine:  warning: pe_fence_node:   Cluster node ha-idg-1 will be fenced: resource actions are unrunnable
> > > Jul 20 17:05:35 [10690] ha-idg-2    pengine:  warning: stage6:  Scheduling Node ha-idg-1 for STONITH
> > > Jul 20 17:05:35 [10690] ha-idg-2    pengine:     info: native_stop_constraints: vm_nextcloud_stop_0 is implicit after ha-idg-1 is fenced
> > > Jul 20 17:05:35 [10690] ha-idg-2    pengine:   notice: LogNodeActions:   * Fence (Off) ha-idg-1 'resource actions are unrunnable'
> > > 
> > > 
> > > Why does it say "Jul 20 17:05:35 [10690] ha-idg-2    pengine:  warning: custom_action:   Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (offline)" although "Jul 20 17:04:06 [23768] ha-idg-1       crmd:   notice: process_lrm_event:       Result of stop operation for vm_nextcloud on ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true cib-update=5960" says that stop was ok?
> 
> I'm still digging through the logs trying to understand what happened.
> What I'm wondering about:
> 
> Jul 20 17:04:26 [23768] ha-idg-1       crmd:   notice: run_graph:       Transition 4515 (Complete=10, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3117.bz2): Complete
> 
> Jul 20 17:05:29 [23768] ha-idg-1       crmd:     info: abort_transition_graph:  Transition 4515 aborted by status-1084777482-shutdown doing create shutdown=1595257529: Transient attribute change | cib=2.16197.72 source=abort_unless_down:317   path=/cib/status/node_state[@id='1084777482']/transient_attributes[@id='1084777482']/instance_attributes[@id='status-1084777482'] complete=true
> 
> 
> A transition is completed and then aborted about a minute later. How
> can that be?

The most recent transition is aborted, but since all its actions are
complete, the only effect is to trigger a new transition.
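
If you're curious what that newly triggered transition would end up
doing, running the scheduler against the live CIB shows it, e.g.
something like:

  crm_simulate -SL

(-L takes the current CIB contents as input, -S simulates the resulting
actions and status.)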

We should probably rephrase the log message. In fact, the whole
"transition" terminology is kind of obscure. It's hard to come up with
something better though.

> 
> Bernd
> Helmholtz Zentrum München
> 
-- 
Ken Gaillot <kgaillot at redhat.com>


