[ClusterLabs] why is node fenced ?

Ken Gaillot kgaillot at redhat.com
Wed Jul 29 12:53:28 EDT 2020


On Wed, 2020-07-29 at 17:26 +0200, Lentes, Bernd wrote:
> Hi,
> 
> a few days ago one of my nodes was fenced and I don't know why, which
> is something I really don't like.
> What I did:
> I put one node (ha-idg-1) into standby. The resources on it (mostly
> virtual domains) were migrated to ha-idg-2, except one domain
> (vm_nextcloud): on ha-idg-2 a mount point that the domain's XML
> points to was missing.
> Then the cluster tried to start vm_nextcloud on ha-idg-2, which of
> course also failed.
> Then ha-idg-1 was fenced.
> I did a "crm history" over the respective time period; you can find
> it here:
> https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF
> 
> Here, from my point of view, are the most interesting parts of the logs:
> ha-idg-1:
> Jul 20 16:59:33 [23763] ha-idg-1        cib:     info:
> cib_perform_op:  Diff: --- 2.16196.19 2
> Jul 20 16:59:33 [23763] ha-idg-1        cib:     info:
> cib_perform_op:  Diff: +++ 2.16197.0 bc9a558dfbe6d7196653ce56ad1ee758
> Jul 20 16:59:33 [23763] ha-idg-1        cib:     info:
> cib_perform_op:  +  /cib:  @epoch=16197, @num_updates=0
> Jul 20 16:59:33 [23763] ha-idg-1        cib:     info:
> cib_perform_op:  +  /cib/configuration/nodes/node[@id='1084777482']/i
> nstance_attributes[@id='nodes-108
> 4777482']/nvpair[@id='nodes-1084777482-standby']:  @value=on
> ha-idg-1 set to standby
> 
> Jul 20 16:59:34 [23768] ha-idg-1       crmd:   notice:
> process_lrm_event:       ha-idg-1-vm_nextcloud_migrate_to_0:3169 [
> error: Cannot access storage file
> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/
> ubuntu-18.04.4-live-server-amd64.iso': No such file or
> directory\nocf-exit-reason:vm_nextcloud: live migration to ha-idg-2
> failed: 1\n ]
> migration failed
> 
> Jul 20 17:04:01 [23767] ha-idg-1    pengine:    error:
> native_create_actions:   Resource vm_nextcloud is active on 2 nodes
> (attempting recovery)
> ???

This is standard for a failed live migration -- the cluster doesn't
know how far the migration actually got before failing, so it has to
assume the VM could be active on either node. (The log message would
make more sense saying "might be active" rather than "is active".)
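
If you want to see where the cluster currently believes the resource
is active, a quick check with the standard Pacemaker CLI tools (using
your resource name) would be something like:

    # one-shot cluster status, including inactive resources
    crm_mon -1r

    # list the node(s) Pacemaker thinks vm_nextcloud is running on
    crm_resource --resource vm_nextcloud --locate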

> Jul 20 17:04:01 [23767] ha-idg-1    pengine:   notice:
> LogAction:        *
> Recover    vm_nextcloud           (             ha-idg-2 )

The recovery from that situation is a full stop on both nodes,
followed by a start on one of them.

> Jul 20 17:04:01 [23768] ha-idg-1       crmd:   notice:
> te_rsc_command:  Initiating stop operation vm_nextcloud_stop_0 on ha-
> idg-2 | action 106
> Jul 20 17:04:01 [23768] ha-idg-1       crmd:   notice:
> te_rsc_command:  Initiating stop operation vm_nextcloud_stop_0
> locally on ha-idg-1 | action 2
> 
> Jul 20 17:04:01 [23768] ha-idg-1       crmd:     info:
> match_graph_event:       Action vm_nextcloud_stop_0 (106) confirmed
> on ha-idg-2 (rc=0)
> 
> Jul 20 17:04:06 [23768] ha-idg-1       crmd:   notice:
> process_lrm_event:       Result of stop operation for vm_nextcloud on
> ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true
> cib-update=5960

It looks like both stops succeeded.

> Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd:   notice:
> crm_signal_dispatch:     Caught 'Terminated' signal | 15 (invoking
> handler)
> systemctl stop pacemaker.service
> 
> 
> ha-idg-2:
> Jul 20 17:04:03 [10691] ha-idg-2       crmd:   notice:
> process_lrm_event:       Result of stop operation for vm_nextcloud on
> ha-idg-2: 0 (ok) | call=157 key=vm_nextcloud_stop_0 confirmed=true
> cib-update=57
> the log from ha-idg-2 is two seconds ahead of ha-idg-1
> 
> Jul 20 17:04:08 [10688] ha-idg-2       lrmd:   notice:
> log_execute:     executing - rsc:vm_nextcloud action:start
> call_id:192
> Jul 20 17:04:09 [10688] ha-idg-2       lrmd:   notice:
> operation_finished:      vm_nextcloud_start_0:29107:stderr [ error:
> Failed to create domain from /mnt/share/vm_nextcloud.xml ]
> Jul 20 17:04:09 [10688] ha-idg-2       lrmd:   notice:
> operation_finished:      vm_nextcloud_start_0:29107:stderr [ error:
> Cannot access storage file
> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/
> ubuntu-18.04.4-live-server-amd64.iso': No such file or directory ]
> Jul 20 17:04:09 [10688] ha-idg-2       lrmd:   notice:
> operation_finished:      vm_nextcloud_start_0:29107:stderr [ ocf-
> exit-reason:Failed to start virtual domain vm_nextcloud. ]
> Jul 20 17:04:09 [10688] ha-idg-2       lrmd:   notice:
> log_finished:    finished - rsc:vm_nextcloud action:start call_id:192
> pid:29107 exit-code:1 exec-time:581ms queue-time:0ms
> start on ha-idg-2 failed

The start failed ...

> Jul 20 17:05:32 [10691] ha-idg-2       crmd:     info:
> do_dc_takeover:  Taking over DC status for this partition
> ha-idg-1 stopped pacemaker

Since ha-idg-1 is now shutting down, ha-idg-2 becomes DC.
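
You can confirm which node is DC at any given moment with a one-shot
status, for example:

    # the "Current DC" line in crm_mon output names the current DC
    crm_mon -1 | grep "Current DC"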

> Jul 20 17:05:33 [10690] ha-idg-2    pengine:  warning:
> unpack_rsc_op_failure:   Processing failed migrate_to of vm_nextcloud
> on ha-idg-1: unknown error | rc=1
> Jul 20 17:05:33 [10690] ha-idg-2    pengine:  warning:
> unpack_rsc_op_failure:   Processing failed start of vm_nextcloud on
> ha-idg-2: unknown error | rc
> 
> Jul 20 17:05:33 [10690] ha-idg-2    pengine:     info:
> native_color:    Resource vm_nextcloud cannot run anywhere
> logical
> 
> Jul 20 17:05:33 [10690] ha-idg-2    pengine:  warning:
> custom_action:   Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable
> (pending)
> ???

So this appears to be the problem. From these logs I would guess the
successful stop on ha-idg-1 did not get written to the CIB for some
reason. I'd look at the pe input from this transition on ha-idg-2 to
confirm that.
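
For example, something like this would replay that transition from
the pe input (the file name below is only a placeholder -- the
ha-idg-2 logs for that transition reference the actual file):

    # re-run the scheduler against the saved CIB snapshot on ha-idg-2
    crm_simulate -S -s -x /var/lib/pacemaker/pengine/pe-input-123.bz2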

Without the DC knowing about the stop, it tries to schedule a new one,
but ha-idg-1 is shutting down so the stop can't be executed there,
which means the node has to be fenced.
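
To verify that guess, you could check whether ha-idg-1's stop result
is present in the status section of that pe input, e.g. (again with a
placeholder file name):

    # pe inputs are bz2-compressed CIB XML
    bzcat /var/lib/pacemaker/pengine/pe-input-123.bz2 \
        | grep -A3 'lrm_resource id="vm_nextcloud"'

If the stop was recorded, there should be an lrm_rsc_op entry with
operation="stop" and rc-code="0" under ha-idg-1's node_state; if it
is missing, that would explain why the DC scheduled another stop.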

> Jul 20 17:05:35 [10690] ha-idg-2    pengine:  warning:
> custom_action:   Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable
> (offline)
> Jul 20 17:05:35 [10690] ha-idg-2    pengine:  warning:
> pe_fence_node:   Cluster node ha-idg-1 will be fenced: resource
> actions are unrunnable
> Jul 20 17:05:35 [10690] ha-idg-2    pengine:  warning:
> stage6:  Scheduling Node ha-idg-1 for STONITH
> Jul 20 17:05:35 [10690] ha-idg-2    pengine:     info:
> native_stop_constraints: vm_nextcloud_stop_0 is implicit after ha-
> idg-1 is fenced
> Jul 20 17:05:35 [10690] ha-idg-2    pengine:   notice:
> LogNodeActions:   * Fence (Off) ha-idg-1 'resource actions are
> unrunnable'
> 
> 
> Why does it say "Jul 20 17:05:35 [10690] ha-idg-
> 2    pengine:  warning: custom_action:   Action vm_nextcloud_stop_0
> on ha-idg-1 is unrunnable (offline)" although
> "Jul 20 17:04:06 [23768] ha-idg-1       crmd:   notice:
> process_lrm_event:       Result of stop operation for vm_nextcloud on
> ha-idg-1: 0 (ok) | call=3197 key=vm_nextcloud_stop_0 confirmed=true
> cib-update=5960"
> says that the stop was OK?
> 
> 
> Bernd
> 
-- 
Ken Gaillot <kgaillot at redhat.com>


