[ClusterLabs] why is node fenced ?

Lentes, Bernd bernd.lentes at helmholtz-muenchen.de
Sun Aug 9 16:17:24 EDT 2020



----- On 29 Jul 2020 at 18:53, kgaillot kgaillot at redhat.com wrote:

> On Wed, 2020-07-29 at 17:26 +0200, Lentes, Bernd wrote:
>> Hi,
>> 
>> a few days ago one of my nodes was fenced and I don't know why,
>> which is something I really don't like.
>> What I did:
>> I put one node (ha-idg-1) into standby. The resources on it (mostly
>> virtual domains) were migrated to ha-idg-2, except one domain
>> (vm_nextcloud): on ha-idg-2, a mountpoint that the domain's XML
>> points to was missing.
>> The cluster then tried to start vm_nextcloud on ha-idg-2, which of
>> course also failed.
>> Then ha-idg-1 was fenced.
>> I did a "crm history" over the respective time period; you find it
>> here:
>> https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF
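>> 
>> For reference, the node was put into standby with the usual crmsh
>> command, i.e. something like:
>> 
>>   crm node standby ha-idg-1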
>> 
>> Here is, from my point of view, the most interesting part of the
>> logs:
>> ha-idg-1:
>> Jul 20 16:59:33 [23763] ha-idg-1        cib:     info:
>> cib_perform_op:  Diff: --- 2.16196.19 2
>> Jul 20 16:59:33 [23763] ha-idg-1        cib:     info:
>> cib_perform_op:  Diff: +++ 2.16197.0 bc9a558dfbe6d7196653ce56ad1ee758
>> Jul 20 16:59:33 [23763] ha-idg-1        cib:     info:
>> cib_perform_op:  +  /cib:  @epoch=16197, @num_updates=0
>> Jul 20 16:59:33 [23763] ha-idg-1        cib:     info:
>> cib_perform_op:  +  /cib/configuration/nodes/node[@id='1084777482']/i
>> nstance_attributes[@id='nodes-108
>> 4777482']/nvpair[@id='nodes-1084777482-standby']:  @value=on
>> ha-idg-1 set to standby
>> 
>> Jul 20 16:59:34 [23768] ha-idg-1       crmd:   notice:
>> process_lrm_event:       ha-idg-1-vm_nextcloud_migrate_to_0:3169 [
>> error: Cannot access storage file
>> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/
>> ubuntu-18.04.4-live-server-amd64.iso': No such file or
>> directory\nocf-exit-reason:vm_nextcloud: live migration to ha-idg-2
>> failed: 1\n ]
>> migration failed
>> 
>> Jul 20 17:04:01 [23767] ha-idg-1    pengine:    error:
>> native_create_actions:   Resource vm_nextcloud is active on 2 nodes
>> (attempting recovery)
>> ???
> 
> This is standard for a failed live migration -- the cluster doesn't
> know how far the migration actually got before failing, so it has to
> assume the VM could be active on either node. (The log message would
> make more sense saying "might be active" rather than "is active".)
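> 
> (If you hit that state again, something like
> 
>   crm_resource --resource vm_nextcloud --locate
> 
> should show the cluster considering it active on both nodes.)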
> 
>> Jul 20 17:04:01 [23767] ha-idg-1    pengine:   notice:
>> LogAction:        *
>> Recover    vm_nextcloud           (             ha-idg-2 )
> 
> The recovery from that situation is a full stop on both nodes, and
> start on one of them.
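> 
> Once the underlying mount problem is fixed, clearing the failed
> actions lets the cluster attempt a clean start again, e.g.:
> 
>   crm_resource --cleanup --resource vm_nextcloud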
> 
>> Jul 20 17:04:01 [23768] ha-idg-1       crmd:   notice:
>> te_rsc_command:  Initiating stop operation vm_nextcloud_stop_0 on ha-
>> idg-2 | action 106
>> Jul 20 17:04:01 [23768] ha-idg-1       crmd:   notice:
>> te_rsc_command:  Initiating stop operation vm_nextcloud_stop_0
>> locally on ha-idg-1 | action 2
>> 
>> Jul 20 17:04:01 [23768] ha-idg-1       crmd:     info:
>> match_graph_event:       Action vm_nextcloud_stop_0 (106) confirmed
>> on ha-idg-2 (rc=0)
>> 
>> Jul 20 17:04:06 [23768] ha-idg-1       crmd:   notice:
>> process_lrm_event:       Result of stop operation for vm_nextcloud on
>> ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true
>> cib-update=5960
> 
> It looks like both stops succeeded.
> 
>> Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd:   notice:
>> crm_signal_dispatch:     Caught 'Terminated' signal | 15 (invoking
>> handler)
>> systemctl stop pacemaker.service
>> 
>> 
>> ha-idg-2:
>> Jul 20 17:04:03 [10691] ha-idg-2       crmd:   notice:
>> process_lrm_event:       Result of stop operation for vm_nextcloud on
>> ha-idg-2: 0 (ok) | call=157 key=vm_nextcloud_stop_0 confirmed=true
>> cib-update=57
>> the clock on ha-idg-2 is two seconds ahead of ha-idg-1
>> 
>> Jul 20 17:04:08 [10688] ha-idg-2       lrmd:   notice:
>> log_execute:     executing - rsc:vm_nextcloud action:start
>> call_id:192
>> Jul 20 17:04:09 [10688] ha-idg-2       lrmd:   notice:
>> operation_finished:      vm_nextcloud_start_0:29107:stderr [ error:
>> Failed to create domain from /mnt/share/vm_nextcloud.xml ]
>> Jul 20 17:04:09 [10688] ha-idg-2       lrmd:   notice:
>> operation_finished:      vm_nextcloud_start_0:29107:stderr [ error:
>> Cannot access storage file
>> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/
>> ubuntu-18.04.4-live-server-amd64.iso': No such file or directory ]
>> Jul 20 17:04:09 [10688] ha-idg-2       lrmd:   notice:
>> operation_finished:      vm_nextcloud_start_0:29107:stderr [ ocf-
>> exit-reason:Failed to start virtual domain vm_nextcloud. ]
>> Jul 20 17:04:09 [10688] ha-idg-2       lrmd:   notice:
>> log_finished:    finished - rsc:vm_nextcloud action:start call_id:192
>> pid:29107 exit-code:1 exec-time:581ms queue-time:0ms
>> start on ha-idg-2 failed
> 
> The start failed ...
> 
>> Jul 20 17:05:32 [10691] ha-idg-2       crmd:     info:
>> do_dc_takeover:  Taking over DC status for this partition
>> ha-idg-1 stopped pacemaker
> 
> Since ha-idg-1 is now shutting down, ha-idg-2 becomes DC.
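> 
> (You can see which node currently holds the DC role with e.g.
> "crm_mon -1 | grep 'Current DC'".)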
> 
>> Jul 20 17:05:33 [10690] ha-idg-2    pengine:  warning:
>> unpack_rsc_op_failure:   Processing failed migrate_to of vm_nextcloud
>> on ha-idg-1: unknown error | rc=1
>> Jul 20 17:05:33 [10690] ha-idg-2    pengine:  warning:
>> unpack_rsc_op_failure:   Processing failed start of vm_nextcloud on
>> ha-idg-2: unknown error | rc
>> 
>> Jul 20 17:05:33 [10690] ha-idg-2    pengine:     info:
>> native_color:    Resource vm_nextcloud cannot run anywhere
>> logical
>> 
>> Jul 20 17:05:33 [10690] ha-idg-2    pengine:  warning:
>> custom_action:   Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable
>> (pending)
>> ???
> 
> So this appears to be the problem. From these logs I would guess the
> successful stop on ha-idg-1 did not get written to the CIB for some
> reason. I'd look at the pe input from this transition on ha-idg-2 to
> confirm that.
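> 
> Something along these lines should show what the scheduler saw
> (adjust the file name to the pe-input ha-idg-2 logged for that
> transition):
> 
>   crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-NNNN.bz2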
> 
> Without the DC knowing about the stop, it tries to schedule a new one,
> but the node is shutting down so it can't do it, which means it has to
> be fenced.
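> 
> You could also check directly whether the stop result ever made it
> into the CIB's status section, e.g. with something like:
> 
>   cibadmin --query --xpath "//node_state[@uname='ha-idg-1']//lrm_resource[@id='vm_nextcloud']"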
> 
>> Jul 20 17:05:35 [10690] ha-idg-2    pengine:  warning:
>> custom_action:   Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable
>> (offline)
>> Jul 20 17:05:35 [10690] ha-idg-2    pengine:  warning:
>> pe_fence_node:   Cluster node ha-idg-1 will be fenced: resource
>> actions are unrunnable
>> Jul 20 17:05:35 [10690] ha-idg-2    pengine:  warning:
>> stage6:  Scheduling Node ha-idg-1 for STONITH
>> Jul 20 17:05:35 [10690] ha-idg-2    pengine:     info:
>> native_stop_constraints: vm_nextcloud_stop_0 is implicit after ha-
>> idg-1 is fenced
>> Jul 20 17:05:35 [10690] ha-idg-2    pengine:   notice:
>> LogNodeActions:   * Fence (Off) ha-idg-1 'resource actions are
>> unrunnable'
>> 
>> 
>> Why does it say "Jul 20 17:05:35 [10690] ha-idg-
>> 2    pengine:  warning: custom_action:   Action vm_nextcloud_stop_0
>> on ha-idg-1 is unrunnable (offline)" although
>> "Jul 20 17:04:06 [23768] ha-idg-1       crmd:   notice:
>> process_lrm_event:       Result of stop operation for vm_nextcloud on
>> ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true
>> cib-update=5960"
>> says that the stop was OK?

I'm still digging in the logs, trying to understand what happened.
What I'm wondering about:

Jul 20 17:04:26 [23768] ha-idg-1       crmd:   notice: run_graph:       Transition 4515 (Complete=10, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3117.bz2): Complete

Jul 20 17:05:29 [23768] ha-idg-1       crmd:     info: abort_transition_graph:  Transition 4515 aborted by status-1084777482-shutdown doing create shutdown=1595257529: Transient attribute change | cib=2.16197.72 source=abort_unless_down:317   path=/cib/status/node_state[@id='1084777482']/transient_attributes[@id='1084777482']/instance_attributes[@id='status-1084777482'] complete=true


A transition is completed and about a minute later it is aborted. How can that be?
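
For reference, the transition lifecycle can be followed with something
like this (the log file name depends on the setup):

  grep -E 'run_graph|abort_transition_graph' /var/log/pacemaker.log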

Bernd
Helmholtz Zentrum München
