[ClusterLabs] VM failure during shutdown

Mon Jun 25 10:47:56 EDT 2018

On Mon, 2018-06-25 at 11:33 +0300, Vaggelis Papastavros wrote:
> Dear friends ,
> 
> We have the following configuration :
> 
> CentOS7 , pacemaker 0.9.152 and Corosync 2.4.0, storage with DRBD
> and 
> stonith eanbled with APC PDU devices.
> 
> I have a windows VM configured as cluster resource with the
> following 
> attributes :
> 
> Resource: WindowSentinelOne_res (class=ocf provider=heartbeat 
> type=VirtualDomain)
> Attributes: hypervisor=qemu:///system 
> config=/opt/customer_vms/conf/WindowSentinelOne/WindowSentinelOne.xml
>  
> migration_transport=ssh
> Utilization: cpu=8 hv_memory=8192
> Operations: start interval=0s timeout=120s 
> (WindowSentinelOne_res-start-interval-0s)
>                      stop interval=0s timeout=120s 
> (WindowSentinelOne_res-stop-interval-0s)
>                      monitor interval=10s timeout=30s 
> (WindowSentinelOne_res-monitor-interval-10s)
> 
> under some circumstances  (which i try to identify) the VM fails and 
> disappears under virsh list --all and also pacemaker reports the VM
> as 
> stopped .
> 
> If run pcs resource cleanup windows_wm everything is OK, but i can't 
> identify the reason of failure.
> 
> For example when shutdown the VM (with windows shutdown)  the
> cluster 
> reports the following :
> 
> WindowSentinelOne_res    (ocf::heartbeat:VirtualDomain): Started sgw-
> 02 
> (failure ignored)
> 
> Failed Actions:
> * WindowSentinelOne_res_monitor_10000 on sgw-02 'not running' (7): 
> call=67, status=complete, exitreason='none',
>      last-rc-change='Mon Jun 25 07:41:37 2018', queued=0ms, exec=0ms.
> 
> 
> My questions are
> 
> 1) why the VM shutdown is reported as (FailedAction) from cluster ?
> Its 
> a worthy operation during VM life cycle .

Pacemaker has no way of knowing that the VM was intentionally shut
down, vs crashed.

When some resource is managed by the cluster, all starts and stops of
the resource have to go through the cluster. You can either set target-
role=Stopped in the resource configuration, or if it's a temporary
issue (e.g. rebooting for some OS updates), you could set is-
managed=false to take it out of cluster control, do the work, then set
is-managed=true again.

> 2) why sometimes the resource is marked as stopped (the VM is
> healthy) 
> and needs cleanup ?

That's a problem. If the VM is truly healthy, it sounds like there's an
issue with the resource agent. You'd have to look at the logs to see if
it gave any more information (e.g. if it's a timeout, raising the
timeout might be sufficient).

> 3) I can't understand the corosync logs ... during the the VM
> shutdown 
> corosync logs is the following

FYI, the system log will have the most important messages. corosync.log
will additionally have info-level messages -- potentially helpful but
definitely difficult to follow.

> Jun 25 07:41:37 [5140] sgw-02       crmd:     info: 
> process_lrm_event:    Result of monitor operation for 
> WindowSentinelOne_res on sgw-02: 7 (not running) | call=67 
> key=WindowSentinelOne_res_monitor_10000 confirmed=false cib-update=36

This is really the only important message. It says that a recurring
monitor on the WindowSentinelOne_res resource on node sgw-02 exited
with status code 7 (which means the resource agent thinks the resource
is not running).

'key=WindowSentinelOne_res_monitor_10000' is how pacemaker identifies
resource agent actions. The format is <resource-name>_<action-
name>_<action-interval-in-milliseconds>

This is the only information Pacemaker will get from the resource
agent. To investigate more deeply, you'll have to check for log
messages from the agent itself.

> Jun 25 07:41:37 [5130] sgw-02        cib:     info: 
> cib_process_request:    Forwarding cib_modify operation for section 
> status to all (origin=local/crmd/36)
> Jun 25 07:41:37 [5130] sgw-02        cib:     info:
> cib_perform_op:    
> Diff: --- 0.4704.67 2
> Jun 25 07:41:37 [5130] sgw-02        cib:     info:
> cib_perform_op:    
> Diff: +++ 0.4704.68 (null)
> Jun 25 07:41:37 [5130] sgw-02        cib:     info:
> cib_perform_op:    
> +  /cib:  @num_updates=68
> Jun 25 07:41:37 [5130] sgw-02        cib:     info:
> cib_perform_op:    
> +  /cib/status/node_state[@id='2']: @crm-debug-
> origin=do_update_resource
> Jun 25 07:41:37 [5130] sgw-02        cib:     info:
> cib_perform_op:    
> ++ 
> /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resour
> ce[@id='WindowSentinelOne_res']: 
> <lrm_rsc_op id="WindowSentinelOne_res_last_failure_0" 
> operation_key="WindowSentinelOne_res_monitor_10000"
> operation="monitor" 
> crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" 
> transition-key="84:3:0:f910c793-a714-4e24-80d1-b0ec66275491" 
> transition-magic="0:7;84:3:0:f910c793-a714-4e24-80d1-b0ec66275491" 
> on_node="sgw-02" cal
> Jun 25 07:41:37 [5130] sgw-02        cib:     info: 
> cib_process_request:    Completed cib_modify operation for section 
> status: OK (rc=0, origin=sgw-02/crmd/36, version=0.4704.68)

You can usually ignore the 'cib' messages. This just means Pacemaker
recorded the result on disk.

> Jun 25 07:41:37 [5137] sgw-02      attrd:     info: 
> attrd_peer_update:    Setting fail-count-WindowSentinelOne_res[sgw-
> 02]: 
> (null) -> 1 from sgw-01

Since the cluster expected the resource to be running, this result is a
failure. Failures are counted using special node attributes that start
with "fail-count-". This is what Pacemaker uses to determine if a
resource has reached its migration-threshold.

> Jun 25 07:41:37 [5137] sgw-02      attrd:     info:
> write_attribute:    
> Sent update 10 with 1 changes for fail-count-WindowSentinelOne_res, 
> id=<n/a>, set=(null)
> Jun 25 07:41:37 [5130] sgw-02        cib:     info: 
> cib_process_request:    Forwarding cib_modify operation for section 
> status to all (origin=local/attrd/10)
> Jun 25 07:41:37 [5137] sgw-02      attrd:     info: 
> attrd_peer_update:    Setting 
> last-failure-WindowSentinelOne_res[sgw-02]: (null) -> 1529912497 from
> sgw-01

Similarly, the time the failure occurred is stored in a 'last-failure-' 
node attribute, which Pacemaker uses to determine if a resource has
reached its failure-timeout.

> Jun 25 07:41:37 [5137] sgw-02      attrd:     info:
> write_attribute:    
> Sent update 11 with 1 changes for last-failure-
> WindowSentinelOne_res, 
> id=<n/a>, set=(null)
> Jun 25 07:41:37 [5130] sgw-02        cib:     info: 
> cib_process_request:    Forwarding cib_modify operation for section 
> status to all (origin=local/attrd/11)
> Jun 25 07:41:37 [5130] sgw-02        cib:     info:
> cib_perform_op:    
> Diff: --- 0.4704.68 2
> Jun 25 07:41:37 [5130] sgw-02        cib:     info:
> cib_perform_op:    
> Diff: +++ 0.4704.69 (null)
> Jun 25 07:41:37 [5130] sgw-02        cib:     info:
> cib_perform_op:    
> +  /cib:  @num_updates=69
> Jun 25 07:41:37 [5130] sgw-02        cib:     info:
> cib_perform_op:    
> ++ 
> /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instanc
> e_attributes[@id='status-2']: 
> <nvpair id="status-2-fail-count-WindowSentinelOne_res" 
> name="fail-count-WindowSentinelOne_res" value="1"/>
> Jun 25 07:41:37 [5130] sgw-02        cib:     info: 
> cib_process_request:    Completed cib_modify operation for section 
> status: OK (rc=0, origin=sgw-02/attrd/10, version=0.4704.69)
> Jun 25 07:41:37 [5137] sgw-02      attrd:     info: 
> attrd_cib_callback:    Update 10 for fail-count-
> WindowSentinelOne_res: 
> OK (0)
> Jun 25 07:41:37 [5137] sgw-02      attrd:     info: 
> attrd_cib_callback:    Update 10 for 
> fail-count-WindowSentinelOne_res[sgw-02]=1: OK (0)
> Jun 25 07:41:37 [5130] sgw-02        cib:     info:
> cib_perform_op:    
> Diff: --- 0.4704.69 2
> Jun 25 07:41:37 [5130] sgw-02        cib:     info:
> cib_perform_op:    
> Diff: +++ 0.4704.70 (null)
> Jun 25 07:41:37 [5130] sgw-02        cib:     info:
> cib_perform_op:    
> +  /cib:  @num_updates=70
> Jun 25 07:41:37 [5130] sgw-02        cib:     info:
> cib_perform_op:    
> ++ 
> /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instanc
> e_attributes[@id='status-2']: 
> <nvpair id="status-2-last-failure-WindowSentinelOne_res" 
> name="last-failure-WindowSentinelOne_res" value="1529912497"/>
> Jun 25 07:41:37 [5130] sgw-02        cib:     info: 
> cib_process_request:    Completed cib_modify operation for section 
> status: OK (rc=0, origin=sgw-02/attrd/11, version=0.4704.70)
> Jun 25 07:41:37 [5137] sgw-02      attrd:     info: 
> attrd_cib_callback:    Update 11 for last-failure-
> WindowSentinelOne_res: 
> OK (0)
> Jun 25 07:41:37 [5137] sgw-02      attrd:     info: 
> attrd_cib_callback:    Update 11 for 
> last-failure-WindowSentinelOne_res[sgw-02]=1529912497: OK (0)
> Jun 25 07:41:42 [5130] sgw-02        cib:     info:
> cib_process_ping:    
> Reporting our current digest to sgw-01:
> 3e27415fcb003ef3373b47ffa6c5f358 
> for 0.4704.70 (0x7faac1729720 0)
> 
> Sincerely ,
> 
> Vaggelis Papastavros
-- 
Ken Gaillot <kgaillot at redhat.com>