[ClusterLabs] Antw: [EXT] why is node fenced ?
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Thu Jul 30 03:28:39 EDT 2020
>>> "Lentes, Bernd" <bernd.lentes at helmholtz-muenchen.de> wrote on 29.07.2020 at 17:26 in message
<1894379294.27456141.1596036406000.JavaMail.zimbra at helmholtz-muenchen.de>:
> Hi,
>
> a few days ago one of my nodes was fenced and I don't know why, which is
> something I really don't like.
> What I did:
> I put one node (ha-idg-1) into standby. The resources on it (mostly
> virtual domains) were migrated to ha-idg-2, except one domain
> (vm_nextcloud): on ha-idg-2, a mountpoint that the domain's XML points
> to was missing, so the live migration failed.
> The cluster then tried to start vm_nextcloud on ha-idg-2, which of
> course also failed.
> Then ha-idg-1 was fenced.
My guess is that ha-idg-1 was fenced because the failed migration to ha-idg-2
is treated like a stop failure, and stop failures cause fencing. You should
have tested the resource before going into production.
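For reference, the relevant knob is the stop operation's on-fail policy: with
stonith-enabled=true, a failed (or unrunnable) stop defaults to on-fail=fence.
A minimal crmsh sketch of what that looks like for a VirtualDomain resource
(timeouts and the explicit on-fail are illustrative, not taken from your
configuration; only the config path comes from your logs):

```
primitive vm_nextcloud ocf:heartbeat:VirtualDomain \
    params config="/mnt/share/vm_nextcloud.xml" \
    meta allow-migrate=true \
    op migrate_to timeout=300s interval=0 \
    op stop timeout=120s interval=0 on-fail=block
```

on-fail=block keeps a failed stop from escalating to fencing, at the cost of
leaving the resource blocked until you intervene; the default (fence) is
usually what you want in production.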
>
> I did a "crm history" over the respective time period, you find it here:
> https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF
>
> Here, from my point of view, the most interesting from the logs:
> ha-idg-1:
> Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op:
> Diff: --- 2.16196.19 2
> Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op:
> Diff: +++ 2.16197.0 bc9a558dfbe6d7196653ce56ad1ee758
> Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: +
> /cib: @epoch=16197, @num_updates=0
> Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: +
> /cib/configuration/nodes/node[@id='1084777482']/instance_attributes[@id='nodes-1084777482']/nvpair[@id='nodes-1084777482-standby']: @value=on
> ha-idg-1 set to standby
>
> Jul 20 16:59:34 [23768] ha-idg-1 crmd: notice: process_lrm_event:
> ha-idg-1-vm_nextcloud_migrate_to_0:3169 [ error: Cannot access storage
> file
> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso': No such file or
> directory\nocf-exit-reason:vm_nextcloud: live migration to ha-idg-2 failed:
> 1\n ]
> migration failed
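[Since the root cause was a storage path that existed on ha-idg-1 but not on
ha-idg-2, a pre-migration sanity check on the target node can catch this class
of failure before the cluster does. A minimal sketch; the function name is my
own, and extracting the paths from the domain XML (disk source files, the
config file itself) is left to the caller:]

```shell
#!/bin/sh
# Sketch: fail fast if any storage path a domain depends on is missing
# on this host. Feed it the domain's config file and every disk/ISO
# path referenced in the XML, and run it on the intended migration
# target (e.g. via ssh) before putting the source node into standby.
check_storage_paths() {
    rc=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p"
            rc=1
        fi
    done
    return $rc
}
```

[Usage on the target would be something like
`check_storage_paths /mnt/share/vm_nextcloud.xml || echo "do not migrate"`.]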
>
> Jul 20 17:04:01 [23767] ha-idg-1 pengine: error:
> native_create_actions: Resource vm_nextcloud is active on 2 nodes
> (attempting recovery)
> ???
>
> Jul 20 17:04:01 [23767] ha-idg-1 pengine: notice: LogAction: *
> Recover vm_nextcloud ( ha-idg-2 )
>
> Jul 20 17:04:01 [23768] ha-idg-1 crmd: notice: te_rsc_command:
> Initiating stop operation vm_nextcloud_stop_0 on ha-idg-2 | action 106
> Jul 20 17:04:01 [23768] ha-idg-1 crmd: notice: te_rsc_command:
> Initiating stop operation vm_nextcloud_stop_0 locally on ha-idg-1 | action 2
>
> Jul 20 17:04:01 [23768] ha-idg-1 crmd: info: match_graph_event:
> Action vm_nextcloud_stop_0 (106) confirmed on ha-idg-2 (rc=0)
>
> Jul 20 17:04:06 [23768] ha-idg-1 crmd: notice: process_lrm_event:
> Result of stop operation for vm_nextcloud on ha-idg-1: 0 (ok) | call=3197
> key=vm_nextcloud_stop_0 confirmed=true cib-update=5960
>
> Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd: notice: crm_signal_dispatch:
> Caught 'Terminated' signal | 15 (invoking handler)
> systemctl stop pacemaker.service
>
>
> ha-idg-2:
> Jul 20 17:04:03 [10691] ha-idg-2 crmd: notice: process_lrm_event:
> Result of stop operation for vm_nextcloud on ha-idg-2: 0 (ok) | call=157
> key=vm_nextcloud_stop_0 confirmed=true cib-update=57
> the clock on ha-idg-2 is two seconds ahead of ha-idg-1
>
> Jul 20 17:04:08 [10688] ha-idg-2 lrmd: notice: log_execute:
> executing - rsc:vm_nextcloud action:start call_id:192
> Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished:
> vm_nextcloud_start_0:29107:stderr [ error: Failed to create domain from
> /mnt/share/vm_nextcloud.xml ]
> Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished:
> vm_nextcloud_start_0:29107:stderr [ error: Cannot access storage file
> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso': No such file or directory ]
> Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished:
> vm_nextcloud_start_0:29107:stderr [ ocf-exit-reason:Failed to start
> virtual domain vm_nextcloud. ]
> Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: log_finished:
> finished - rsc:vm_nextcloud action:start call_id:192 pid:29107 exit-code:1
> exec-time:581ms queue-time:0ms
> start on ha-idg-2 failed
>
> Jul 20 17:05:32 [10691] ha-idg-2 crmd: info: do_dc_takeover:
> Taking over DC status for this partition
> ha-idg-1 stopped pacemaker
>
> Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning:
> unpack_rsc_op_failure: Processing failed migrate_to of vm_nextcloud on
> ha-idg-1: unknown error | rc=1
> Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning:
> unpack_rsc_op_failure: Processing failed start of vm_nextcloud on ha-idg-2:
> unknown error | rc
>
> Jul 20 17:05:33 [10690] ha-idg-2 pengine: info: native_color:
> Resource vm_nextcloud cannot run anywhere
> logical
>
> Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning: custom_action:
> Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (pending)
> ???
>
> Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: custom_action:
> Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (offline)
> Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: pe_fence_node:
> Cluster node ha-idg-1 will be fenced: resource actions are unrunnable
> Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: stage6: Scheduling
> Node ha-idg-1 for STONITH
> Jul 20 17:05:35 [10690] ha-idg-2 pengine: info:
> native_stop_constraints: vm_nextcloud_stop_0 is implicit after ha-idg-1 is
> fenced
> Jul 20 17:05:35 [10690] ha-idg-2 pengine: notice: LogNodeActions: *
> Fence (Off) ha-idg-1 'resource actions are unrunnable'
>
>
> Why does it say "Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning:
> custom_action: Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable
> (offline)" although
> "Jul 20 17:04:06 [23768] ha-idg-1 crmd: notice: process_lrm_event:
> Result of stop operation for vm_nextcloud on ha-idg-1: 0 (ok) | call=3197
> key=vm_nextcloud_stop_0 confirmed=true cib-update=5960"
> says that stop was ok ?
>
>
> Bernd
>
> --
>
> Bernd Lentes
> Systemadministration
> Institute for Metabolism and Cell Death (MCD)
> Building 25 - office 122
> HelmholtzZentrum München
> bernd.lentes at helmholtz-muenchen.de
> phone: +49 89 3187 1241
> phone: +49 89 3187 3827
> fax: +49 89 3187 2294
> http://www.helmholtz-muenchen.de/mcd
>
> stay healthy
> Helmholtz Zentrum München
>
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
> Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin
> Guenther
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/