[ClusterLabs] Antw: [EXT] why is node fenced ?
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Thu Jul 30 03:28:39 EDT 2020
>>> "Lentes, Bernd" <bernd.lentes at helmholtz-muenchen.de> wrote on 29.07.2020 at 17:26 in message
<1894379294.27456141.1596036406000.JavaMail.zimbra at helmholtz-muenchen.de>:
> Hi,
>
> a few days ago one of my nodes was fenced and I don't know why, which is
> something I really don't like.
> What I did:
> I put one node (ha-idg-1) into standby. The resources on it (mostly
> virtual domains) were migrated to ha-idg-2, except one domain
> (vm_nextcloud): on ha-idg-2, a mountpoint that the domain's XML points
> to was missing, so the live migration failed.
> The cluster then tried to start vm_nextcloud on ha-idg-2, which of
> course also failed.
> Then ha-idg-1 was fenced.
My guess is that ha-idg-1 was fenced because the failed migration to ha-idg-2
is treated like a stop failure, and stop failures cause fencing. You should
have tested the resource before going into production.
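For reference, the relevant knob is the stop operation's on-fail policy: with
stonith-enabled=true, a failed (or unrunnable) stop defaults to on-fail=fence.
A minimal crmsh sketch of what that looks like for a VirtualDomain resource
(timeouts and the explicit on-fail are illustrative, not taken from your
configuration; only the config path comes from your logs):

```
primitive vm_nextcloud ocf:heartbeat:VirtualDomain \
    params config="/mnt/share/vm_nextcloud.xml" \
    meta allow-migrate=true \
    op migrate_to timeout=300s interval=0 \
    op stop timeout=120s interval=0 on-fail=block
```

on-fail=block keeps a failed stop from escalating to fencing, at the cost of
leaving the resource blocked until you intervene; the default (fence) is
usually what you want in production.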
>
> I did a "crm history" over the respective time period, you find it here:
> https://hmgubox2.helmholtz-muenchen.de/index.php/s/529dfcXf5a72ifF
>
> Here, from my point of view, the most interesting from the logs:
> ha-idg-1:
> Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op:
> Diff: --- 2.16196.19 2
> Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op:
> Diff: +++ 2.16197.0 bc9a558dfbe6d7196653ce56ad1ee758
> Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: +
> /cib: @epoch=16197, @num_updates=0
> Jul 20 16:59:33 [23763] ha-idg-1 cib: info: cib_perform_op: +
> /cib/configuration/nodes/node[@id='1084777482']/instance_attributes[@id='nodes-1084777482']/nvpair[@id='nodes-1084777482-standby']: @value=on
> ha-idg-1 set to standby
>
> Jul 20 16:59:34 [23768] ha-idg-1 crmd: notice: process_lrm_event:
> ha-idg-1-vm_nextcloud_migrate_to_0:3169 [ error: Cannot access storage
> file
> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso': No such file or
> directory\nocf-exit-reason:vm_nextcloud: live migration to ha-idg-2 failed:
> 1\n ]
> migration failed
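[Since the root cause was a storage path that existed on ha-idg-1 but not on
ha-idg-2, a pre-migration sanity check on the target node can catch this class
of failure before the cluster does. A minimal sketch; the function name is my
own, and extracting the paths from the domain XML (disk source files, the
config file itself) is left to the caller:]

```shell
#!/bin/sh
# Sketch: fail fast if any storage path a domain depends on is missing
# on this host. Feed it the domain's config file and every disk/ISO
# path referenced in the XML, and run it on the intended migration
# target (e.g. via ssh) before putting the source node into standby.
check_storage_paths() {
    rc=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p"
            rc=1
        fi
    done
    return $rc
}
```

[Usage on the target would be something like
`check_storage_paths /mnt/share/vm_nextcloud.xml || echo "do not migrate"`.]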
>
> Jul 20 17:04:01 [23767] ha-idg-1 pengine: error:
> native_create_actions: Resource vm_nextcloud is active on 2 nodes
> (attempting recovery)
> ???
>
> Jul 20 17:04:01 [23767] ha-idg-1 pengine: notice: LogAction: *
> Recover vm_nextcloud ( ha-idg-2 )
>
> Jul 20 17:04:01 [23768] ha-idg-1 crmd: notice: te_rsc_command:
> Initiating stop operation vm_nextcloud_stop_0 on ha-idg-2 | action 106
> Jul 20 17:04:01 [23768] ha-idg-1 crmd: notice: te_rsc_command:
> Initiating stop operation vm_nextcloud_stop_0 locally on ha-idg-1 | action 2
>
> Jul 20 17:04:01 [23768] ha-idg-1 crmd: info: match_graph_event:
> Action vm_nextcloud_stop_0 (106) confirmed on ha-idg-2 (rc=0)
>
> Jul 20 17:04:06 [23768] ha-idg-1 crmd: notice: process_lrm_event:
> Result of stop operation for vm_nextcloud on ha-idg-1: 0 (ok) | call=3197
> key=vm_nextcloud_stop_0 confirmed=true cib-update=5960
>
> Jul 20 17:05:29 [23761] ha-idg-1 pacemakerd: notice: crm_signal_dispatch:
> Caught 'Terminated' signal | 15 (invoking handler)
> systemctl stop pacemaker.service
>
>
> ha-idg-2:
> Jul 20 17:04:03 [10691] ha-idg-2 crmd: notice: process_lrm_event:
> Result of stop operation for vm_nextcloud on ha-idg-2: 0 (ok) | call=157
> key=vm_nextcloud_stop_0 confirmed=true cib-update=57
> the clock on ha-idg-2 is two seconds ahead of ha-idg-1
>
> Jul 20 17:04:08 [10688] ha-idg-2 lrmd: notice: log_execute:
> executing - rsc:vm_nextcloud action:start call_id:192
> Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished:
> vm_nextcloud_start_0:29107:stderr [ error: Failed to create domain from
> /mnt/share/vm_nextcloud.xml ]
> Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished:
> vm_nextcloud_start_0:29107:stderr [ error: Cannot access storage file
> '/mnt/mcd/AG_BioInformatik/Technik/software_und_treiber/linux/ubuntu/ubuntu-18.04.4-live-server-amd64.iso': No such file or directory ]
> Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: operation_finished:
> vm_nextcloud_start_0:29107:stderr [ ocf-exit-reason:Failed to start
> virtual domain vm_nextcloud. ]
> Jul 20 17:04:09 [10688] ha-idg-2 lrmd: notice: log_finished:
> finished - rsc:vm_nextcloud action:start call_id:192 pid:29107 exit-code:1
> exec-time:581ms queue-time:0ms
> start on ha-idg-2 failed
>
> Jul 20 17:05:32 [10691] ha-idg-2 crmd: info: do_dc_takeover:
> Taking over DC status for this partition
> ha-idg-1 stopped pacemaker
>
> Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning:
> unpack_rsc_op_failure: Processing failed migrate_to of vm_nextcloud on
> ha-idg-1: unknown error | rc=1
> Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning:
> unpack_rsc_op_failure: Processing failed start of vm_nextcloud on ha-idg-2:
> unknown error | rc
>
> Jul 20 17:05:33 [10690] ha-idg-2 pengine: info: native_color:
> Resource vm_nextcloud cannot run anywhere
> logical
>
> Jul 20 17:05:33 [10690] ha-idg-2 pengine: warning: custom_action:
> Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (pending)
> ???
>
> Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: custom_action:
> Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable (offline)
> Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: pe_fence_node:
> Cluster node ha-idg-1 will be fenced: resource actions are unrunnable
> Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning: stage6: Scheduling
> Node ha-idg-1 for STONITH
> Jul 20 17:05:35 [10690] ha-idg-2 pengine: info:
> native_stop_constraints: vm_nextcloud_stop_0 is implicit after ha-idg-1 is
> fenced
> Jul 20 17:05:35 [10690] ha-idg-2 pengine: notice: LogNodeActions: *
> Fence (Off) ha-idg-1 'resource actions are unrunnable'
>
>
> Why does it say "Jul 20 17:05:35 [10690] ha-idg-2 pengine: warning:
> custom_action: Action vm_nextcloud_stop_0 on ha-idg-1 is unrunnable
> (offline)" although
> "Jul 20 17:04:06 [23768] ha-idg-1 crmd: notice: process_lrm_event:
> Result of stop operation for vm_nextcloud on ha-idg-1: 0 (ok) | call=3197
> key=vm_nextcloud_stop_0 confirmed=true cib-update=5960"
> says that stop was ok ?
>
>
> Bernd
>
> --
>
> Bernd Lentes
> Systemadministration
> Institute for Metabolism and Cell Death (MCD)
> Building 25 - office 122
> HelmholtzZentrum München
> bernd.lentes at helmholtz-muenchen.de
> phone: +49 89 3187 1241
> phone: +49 89 3187 3827
> fax: +49 89 3187 2294
> http://www.helmholtz-muenchen.de/mcd
>
> stay healthy
> Helmholtz Zentrum München
>
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
> Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin
> Guenther
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/