[ClusterLabs] Antw: [EXT] Re: failed migration handled the wrong way

Fri Feb 5 09:31:43 EST 2021

05.02.2021 12:54, Ulrich Windl wrote:
>>>> Ulrich Windl wrote on 01.02.2021 at 11:59 in message <6017DF04.888 : 161 : 60728>:
>>>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 01.02.2021 at 11:05 in message
>> <CAA91j0V-4YzNfT-KJ1nzLE_UyEdNOoiBtUMFjaST4O8L+uX8aQ at mail.gmail.com>:
>>> On Mon, Feb 1, 2021 at 12:53 PM Ulrich Windl
>>> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
>> ...
>>>> Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice: prm_xen_test-jeos5_stop_0[33137] error output [ error: internal error: Failed to shutdown domain '13' with libxenlight ]
>>>> Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice: prm_xen_test-jeos5_stop_0[33137] error output [  ]
>>>> Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice: prm_xen_test-jeos5 stop (call 230, PID 33137) exited with status 0 (execution time 177112ms, queue time 0ms)
>>>>
>>>> ### Shouldn't the result be error?
> 
> I think I found the error: /usr/lib/ocf/resource.d/heartbeat/VirtualDomain's
> force_stop() returns $OCF_SUCCESS
> if the message did not match (the case has no default it seems):

a) You need to see the actual message before drawing any conclusion.
b) If the message does not match, it takes one of the two other branches,
where it checks the return code.

You need to reproduce the issue and collect exact command output.
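
One way to collect the exact output is to capture both the text and the
exit status of the command, the same way force_stop() does. What follows is
only a minimal sketch: run_and_record and fake_destroy are hypothetical
names, the stub stands in for the real virsh call, and its exit status of 1
is an assumption (the log above shows the message, not the status).

```shell
#!/bin/sh
# Hypothetical helper: run a command, keep its combined stdout/stderr
# and its exit status, and append both verbatim to a log file.
run_and_record() {
    log=$1; shift
    out=$("$@" 2>&1)
    ex=$?
    printf 'exit=%s output=[%s]\n' "$ex" "$out" >> "$log"
    return $ex
}

# Stub standing in for the virsh call. The message is copied from the
# log quoted above; the exit status 1 is assumed for illustration.
fake_destroy() {
    echo "error: internal error: Failed to shutdown domain '13' with libxenlight" >&2
    return 1
}

LOG=$(mktemp)
rc=0
run_and_record "$LOG" fake_destroy || rc=$?
cat "$LOG"
```

On the affected node the stub would be replaced by the actual command line
from the agent, so that the recorded status and text can be compared
against the patterns in force_stop().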

> 
> force_stop()
> {
>         local out ex translate
>         local status=0
> 
>         ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
>         out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
>         ex=$?
>         translate=$(echo $out|tr 'A-Z' 'a-z')
>         echo >&2 "$translate"
>         case $ex$translate in
>                 *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
>                 *"error:"*"failed to get domain"*)
>                         : ;; # unexpected path to the intended outcome, all is well
>                 [!0]*)
>                         ocf_exit_reason "forced stop failed"
>                         return $OCF_ERR_GENERIC ;;
>                 0*)
>                         while [ $status != $OCF_NOT_RUNNING ]; do
>                                 VirtualDomain_status
>                                 status=$?
>                         done ;;
>         esac
>         return $OCF_SUCCESS
> }
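
To make the control flow of the function above easier to follow, here is a
stand-alone sketch of the same case dispatch with the virsh call and the
OCF helpers removed. classify is a hypothetical name and the sample inputs
are made up for illustration; only the patterns are copied from
force_stop().

```shell
#!/bin/sh
# classify EXIT_STATUS MESSAGE: report which branch of the case
# statement in force_stop() a given exit status and output would take.
classify() {
    ex=$1
    translate=$(echo "$2" | tr 'A-Z' 'a-z')
    case $ex$translate in
        *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
        *"error:"*"failed to get domain"*)
            echo "already-gone" ;;   # agent treats this as success
        [!0]*)
            echo "failed" ;;         # agent would return $OCF_ERR_GENERIC
        0*)
            echo "wait-for-stop" ;;  # agent would poll VirtualDomain_status
    esac
}

# Illustrative inputs (assumed, not taken from a real run).
r1=$(classify 1 "error: Domain is not running")
r2=$(classify 1 "error: internal error: Failed to shutdown domain '13' with libxenlight")
r3=$(classify 0 "Domain 13 destroyed")
echo "$r1 $r2 $r3"
```

This prints "already-gone failed wait-for-stop". Because the matched string
starts with the exit status, anything that misses the first pattern list
begins either with 0 or with some other character, so it always reaches one
of the two remaining branches.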
> I also wonder: doesn't the continued (line-wrapped) regular expression break
> because the continued line is indented?
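
The continuation question can be checked with a small stand-alone test. In
the shell, a backslash followed by a newline is removed before the pattern
list is parsed, and blanks around the | separators are not part of any
pattern, so indentation on the continued line should not matter. A minimal
sketch (match is a hypothetical name, and the inputs are invented):

```shell
#!/bin/sh
# The second pattern sits on an indented continuation line after "|\".
match() {
    case $1 in
        *"error:"*"domain not found"*|\
                *"error:"*"failed to get domain"*)
            echo yes ;;
        *)
            echo no ;;
    esac
}

m1=$(match "1error: failed to get domain 'x'")
m2=$(match "1error: something else")
echo "$m1 $m2"
```

This prints "yes no": the pattern on the indented continuation line still
matches.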
> Personally I think error handling based on the error message text is very
> unstable, as messages may change and their authors aren't aware of what their
> error messages are used for...
> 
> Regards,
> Ulrich
> 
>>>>
>>>
>>> If domain remained active, I would say yes. But do not forget that
>>> failure to stop resources by default will kill the node.
>>
>> In fact "virsh list" still listed the domain, but the cluster had destroyed
>> the image (once again).
>> Trying a "restart" of the VM actually resulted in the node being fenced.
>>
>>>
>>>> Fortunately locking prevented duplicate activation of h18:
>>>> Feb 01 10:32:51 h18 systemd[1]: Started Virtualization daemon.
>>>> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/files already exists
>>>> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/lvmvolumes already exists
>>>> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/scsivolumes already exists
>>>>
>>>> So the main issue seems to be that a failed forced stop returned "success",
>>>> causing a "recover" on h18 while the VM still runs on h16.
>>>
>>> No, "recover" was caused by the failure to migrate. You told pacemaker
>>> that you now want this VM on another host, and your wish was its
>>> command - it attempted to fulfill it. It obviously needed to stop the VM
>>> on its current host before trying to (re-)start it on its new home.
>>
>> But the VM *wasn't* stopped on h16!
>>
>>>
>>>>
>>>> h16:~ # rpm -qf /usr/lib/ocf/resource.d/heartbeat/VirtualDomain
>>>> resource-agents-4.4.0+git57.70549516-3.12.1.x86_64
>>>>
>>>> (SLES15 SP2)
>>>>
>>>> Regards,
>>>> Ulrich
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Manage your subscription:
>>>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>>>
>>>> ClusterLabs home: https://www.clusterlabs.org/ 