[ClusterLabs] Antw: Re: Antw: [EXT] Re: failed migration handled the wrong way

Mon Feb 8 03:15:13 EST 2021

>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 05.02.2021 at 15:31 in
message <4572fad7-c5ae-6d93-2559-741d052e3f9a at gmail.com>:
> 05.02.2021 12:54, Ulrich Windl wrote:
>>>>> Ulrich Windl wrote on 01.02.2021 at 11:59 in message
>> <6017DF04.888 : 161 : 60728>:
>>>>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 01.02.2021 at 11:05 in
>>> message
>>> <CAA91j0V-4YzNfT-KJ1nzLE_UyEdNOoiBtUMFjaST4O8L+uX8aQ at mail.gmail.com>:
>>>> On Mon, Feb 1, 2021 at 12:53 PM Ulrich Windl
>>>> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
>>> ...
>>>>> Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice:
>>>>> prm_xen_test-jeos5_stop_0[33137] error output [ error: internal error:
>>>>> Failed to shutdown domain '13' with libxenlight ]
>>>>> Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice:
>>>>> prm_xen_test-jeos5_stop_0[33137] error output [  ]
>>>>> Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice: prm_xen_test-jeos5
>>>>> stop (call 230, PID 33137) exited with status 0 (execution time
>>>>> 177112ms, queue time 0ms)
>>>>>
>>>>> ### Shouldn't the result be error?
>> 
>> I think I found the error: /usr/lib/ocf/resource.d/heartbeat/VirtualDomain's
>> force_stop() returns $OCF_SUCCESS
>> if the message did not match (the case has no default, it seems):
> 
> a) you need to see the actual message before drawing any conclusion
> b) if the message does not match, it takes one of the two other branches,
> where it checks the return code.
> 
> You need to reproduce the issue and collect the exact command output.

I still think the default case should be error, not success.
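
To make that concrete, here is a minimal sketch of the same case dispatch with
an explicit default branch. It is not the agent's code: classify_destroy_result
and the branch labels it prints are hypothetical names made up for this
illustration, and virsh is not called at all (exit status and output are passed
in as arguments). As far as I can tell the default is only reached when the
exit status is missing, so it is purely defensive:

```shell
#!/bin/sh
# Sketch only: mirrors the case dispatch of force_stop() without calling virsh.
# classify_destroy_result is a hypothetical helper, not part of the agent.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

classify_destroy_result() {
    ex=$1     # exit status that "virsh destroy" would have returned
    out=$2    # combined stdout/stderr that "virsh destroy" would have printed
    translate=$(printf '%s' "$out" | tr 'A-Z' 'a-z')
    case $ex$translate in
        *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
        *"error:"*"failed to get domain"*)
            # Domain is already gone: treated as success by the agent
            echo "already-gone"; return $OCF_SUCCESS ;;
        [!0]*)
            # Non-zero exit status with any other message
            echo "destroy-failed"; return $OCF_ERR_GENERIC ;;
        0*)
            # Zero exit status (the agent then polls until the domain is gone)
            echo "destroy-ok"; return $OCF_SUCCESS ;;
        *)
            # Explicit default: anything unexpected is an error, not success
            echo "unexpected"; return $OCF_ERR_GENERIC ;;
    esac
}
```

Fed with exit status 1 and the libxenlight message from the log, this prints
"destroy-failed" and returns $OCF_ERR_GENERIC; with an empty exit status and
empty output it falls into the default branch.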

What about this?
Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice:
prm_xen_test-jeos5_stop_0[33137] error output [ error: Failed to shutdown
domain test-jeos5 ]
Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice:
prm_xen_test-jeos5_stop_0[33137] error output [ error: internal error: Failed
to shutdown domain '13' with libxenlight ]
Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice:
prm_xen_test-jeos5_stop_0[33137] error output [  ]
Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice: prm_xen_test-jeos5 stop
(call 230, PID 33137) exited with status 0 (execution time 177112ms, queue time
0ms)

Regards,
Ulrich

> 
>> 
>> force_stop()
>> {
>>         local out ex translate
>>         local status=0
>> 
>>         ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
>>         out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
>>         ex=$?
>>         translate=$(echo $out|tr 'A-Z' 'a-z')
>>         echo >&2 "$translate"
>>         case $ex$translate in
>>                 *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
>>                 *"error:"*"failed to get domain"*)
>>                         : ;; # unexpected path to the intended outcome, all is well
>>                 [!0]*)
>>                         ocf_exit_reason "forced stop failed"
>>                         return $OCF_ERR_GENERIC ;;
>>                 0*)
>>                         while [ $status != $OCF_NOT_RUNNING ]; do
>>                                 VirtualDomain_status
>>                                 status=$?
>>                         done ;;
>>         esac
>>         return $OCF_SUCCESS
>> }
>> I also wonder: doesn't the continued (line-wrapped) regular expression break
>> because the continued line is indented?
>> Personally I think error handling based on the error message text is very
>> unstable, as messages may change and the authors aren't aware of what their
>> error messages are used for...
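
On the wrapping question: as far as I know the shell removes a
backslash-newline pair before it splits the input into tokens, and blanks
between "|" and the next pattern are allowed, so the indentation should be
harmless. (Strictly speaking these are shell glob patterns, not regular
expressions.) A small sketch to check this; match_wrapped is a made-up helper
and the test strings are only illustrations:

```shell
#!/bin/sh
# Sketch: a case pattern list continued with backslash-newline, then indented.
# match_wrapped is a hypothetical helper used only for this illustration.
match_wrapped() {
    case $1 in
        *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
                *"error:"*"failed to get domain"*)
            echo "matched" ;;
        *)
            echo "no match" ;;
    esac
}

# The first string hits the pattern on the indented continuation line.
match_wrapped "1error: failed to get domain 'x'"
match_wrapped "1error: something else"
```

The first call prints "matched" and the second "no match", so the pattern on
the continued line is still part of the list.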
>> 
>> Regards,
>> Ulrich
>> 
>>>>>
>>>>
>>>> If domain remained active, I would say yes. But do not forget that
>>>> failure to stop resources by default will kill the node.
>>>
>>> In fact "virsh list" still listed the domain, but the cluster had destroyed
>>> the image (once again).
>>> Trying a "restart" of the VM actually resulted in the node being fenced.
>>>
>>>>
>>>>> Fortunately locking prevented duplicate activation of h18:
>>>>> Feb 01 10:32:51 h18 systemd[1]: Started Virtualization daemon.
>>>>> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid:
>>>>> Lockspace for path /var/lib/libvirt/lockd/files already exists
>>>>> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid:
>>>>> Lockspace for path /var/lib/libvirt/lockd/lvmvolumes already exists
>>>>> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid:
>>>>> Lockspace for path /var/lib/libvirt/lockd/scsivolumes already exists
>>>>>
>>>>> So the main issue seems to be that a failed forced stop returned "success",
>>>>> causing a "recover" on h18 while the VM was still running on h16.
>>>>
>>>> No, "recover" was caused by the failure to migrate. You told pacemaker
>>>> that you now want this VM on another host, and your wish was its
>>>> command - it attempted to fulfill it. It obviously needed to stop the VM
>>>> on its current host before trying to (re-)start it on a new home.
>>>
>>> But the VM *wasn't* stopped on h16!
>>>
>>>>
>>>>>
>>>>> h16:~ # rpm -qf /usr/lib/ocf/resource.d/heartbeat/VirtualDomain
>>>>> resource-agents-4.4.0+git57.70549516-3.12.1.x86_64
>>>>>
>>>>> (SLES15 SP2)
>>>>>
>>>>> Regards,
>>>>> Ulrich
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Manage your subscription:
>>>>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>>>>
>>>>> ClusterLabs home: https://www.clusterlabs.org/ 