[ClusterLabs] Antw: [EXT] Re: failed migration handled the wrong way

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Fri Feb 5 04:54:26 EST 2021


>>> Ulrich Windl wrote on 01.02.2021 at 11:59 in message <6017DF04.888 : 161 : 60728>:
>>>> Andrei Borzenkov <arvidjaar at gmail.com> wrote on 01.02.2021 at 11:05 in message
> <CAA91j0V-4YzNfT-KJ1nzLE_UyEdNOoiBtUMFjaST4O8L+uX8aQ at mail.gmail.com>:
> > On Mon, Feb 1, 2021 at 12:53 PM Ulrich Windl
> > <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> ...
> >> Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice: prm_xen_test-jeos5_stop_0[33137] error output [ error: internal error: Failed to shutdown domain '13' with libxenlight ]
> >> Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice: prm_xen_test-jeos5_stop_0[33137] error output [  ]
> >> Feb 01 10:33:08 h16 pacemaker-execd[7464]:  notice: prm_xen_test-jeos5 stop (call 230, PID 33137) exited with status 0 (execution time 177112ms, queue time 0ms)
> >>
> >> ### Shouldn't the result be error?

I think I found the error: force_stop() in
/usr/lib/ocf/resource.d/heartbeat/VirtualDomain returns $OCF_SUCCESS
if the message did not match (it seems the case statement has no default branch):

force_stop()
{
        local out ex translate
        local status=0

        ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
        out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
        ex=$?
        translate=$(echo $out|tr 'A-Z' 'a-z')
        echo >&2 "$translate"
        case $ex$translate in
                *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
                *"error:"*"failed to get domain"*)
                        : ;; # unexpected path to the intended outcome, all is well
                [!0]*)
                        ocf_exit_reason "forced stop failed"
                        return $OCF_ERR_GENERIC ;;
                0*)
                        while [ $status != $OCF_NOT_RUNNING ]; do
                                VirtualDomain_status
                                status=$?
                        done ;;
        esac
        return $OCF_SUCCESS
}
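
To check which branch a given virsh result takes, the case logic can be tried
in isolation. This is only a sketch: classify is a hypothetical helper that
copies the patterns from force_stop() above and prints a label instead of
acting. The sample messages are illustrative; the second one reuses the text
from the log above, without claiming that "virsh destroy" produced it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the VirtualDomain agent): mirrors only the
# pattern matching of force_stop() and prints which branch would be taken.
classify() {
        local ex translate
        ex=$1
        translate=$(echo "$2" | tr 'A-Z' 'a-z')
        case $ex$translate in
                # The continuation line is indented as in the agent; blanks
                # after '|' in a case pattern list are ignored by the shell.
                *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
                *"error:"*"failed to get domain"*)
                        echo "treated-as-stopped" ;;
                [!0]*)
                        echo "generic-error" ;;
                0*)
                        echo "wait-for-stop" ;;
                *)
                        echo "unmatched" ;;   # extra default branch, added for this sketch
        esac
}

classify 1 "error: Failed to get domain 'test-jeos5'"
# prints: treated-as-stopped
classify 1 "error: internal error: Failed to shutdown domain '13' with libxenlight"
# prints: generic-error
classify 0 "Domain test-jeos5 destroyed"
# prints: wait-for-stop
```

In this sketch a non-zero exit status with an unrecognized message ends up in
the [!0]* branch, and only the three listed messages fall through to the final
"return $OCF_SUCCESS". It exercises the patterns only, so it says nothing about
what virsh actually returned on h16.
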

I also wonder: doesn't the continued (line-wrapped) pattern break
because the continuation line is indented?
Personally, I think error handling based on the error message text is very
fragile, as messages may change, and their authors aren't aware of what their
error messages are used for...
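
One way to avoid depending on message text would be to check the outcome
instead. A minimal sketch, assuming that "virsh list --name" prints one active
domain name per line; domain_is_gone and its return codes are hypothetical and
not part of the VirtualDomain agent:

```shell
#!/bin/sh
# Hypothetical helper: decide from the list of active domains, not from
# error message text, whether a domain is really gone.
#   0 = query succeeded and the domain is not listed
#   1 = query succeeded and the domain is still listed
#   2 = the query itself failed, so the state is unknown
domain_is_gone() {
        local dom active
        dom=$1
        # Assignment on its own line so the exit status of virsh is kept.
        active=$(LANG=C virsh list --name 2>/dev/null) || return 2
        # -F fixed string, -x whole line, -q quiet: exact name match only.
        if printf '%s\n' "$active" | grep -Fxq -- "$dom"; then
                return 1
        fi
        return 0
}
```

A caller could then treat anything other than 0 as a failed stop, for example
"domain_is_gone "$DOMAIN_NAME" || return $OCF_ERR_GENERIC", so that an unknown
state is never reported as success.
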

Regards,
Ulrich

> >>
> > 
> > If domain remained active, I would say yes. But do not forget that
> > failure to stop resources by default will kill the node.
> 
> In fact "virsh list" still listed the domain, but the cluster had destroyed
> the image (once again).
> Trying a "restart" of the VM actually resulted in the node being fenced.
> 
> > 
> >> Fortunately locking prevented duplicate activation of h18:
> >> Feb 01 10:32:51 h18 systemd[1]: Started Virtualization daemon.
> >> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/files already exists
> >> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/lvmvolumes already exists
> >> Feb 01 10:32:52 h18 virtlockd[9904]: Requested operation is not valid: Lockspace for path /var/lib/libvirt/lockd/scsivolumes already exists
> >>
> >> So the main issue seems to be that a failed forced stop returned "success",
> >> causing a "recover" on h18 while the VM still runs on h16.
> > 
> > No, "recover" was caused by failure to migrate. You told pacemaker
> > that you now want this VM on another host, and your wish was its
> > command - it attempted to fulfill it. It obviously needed to stop VM
> > on its current host before trying to (re-)start on a new home.
> 
> But the VM *wasn't* stopped on h16!
> 
> > 
> >>
> >> h16:~ # rpm -qf /usr/lib/ocf/resource.d/heartbeat/VirtualDomain
> >> resource-agents-4.4.0+git57.70549516-3.12.1.x86_64
> >>
> >> (SLES15 SP2)
> >>
> >> Regards,
> >> Ulrich
> >>
> >>
> >>
> >> _______________________________________________
> >> Manage your subscription:
> >> https://lists.clusterlabs.org/mailman/listinfo/users 
> >>
> >> ClusterLabs home: https://www.clusterlabs.org/ 