[ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

Andrei Borzenkov arvidjaar at gmail.com
Fri Oct 23 01:11:23 EDT 2020


22.10.2020 23:29, Lentes, Bernd wrote:
> Hi guys,
> 
> Occasionally, stopping a VirtualDomain resource via "crm resource stop" does not work, and in the end the node is fenced, which is ugly.
> I had a look at the RA to see what it does. It first tries to stop the domain via "virsh shutdown ..."; if that does not succeed within a configurable time, it switches to "virsh destroy".
> I assume "virsh destroy" sends a SIGKILL to the respective process. But when the host is doing heavy I/O, the process may be in "D" state (uninterruptible sleep),
> in which it can't be terminated with a SIGKILL. The node the domain is running on is then fenced because of that.
> I dug deeper and found that the signal is often delivered a bit later (just a few seconds) and the process is killed, but Pacemaker has already decided to fence the node.
> It's all about this excerpt from the RA:
> 
> force_stop()
> {
>         local out ex translate
>         local status=0
> 
>         ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
>         out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
>         ex=$?
>         translate=$(echo $out|tr 'A-Z' 'a-z')
>         echo >&2 "$translate"
>         case $ex$translate in
>                 *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
>                 *"error:"*"failed to get domain"*)
>                         : ;; # unexpected path to the intended outcome, all is well
>                 [!0]*)
>                         ocf_exit_reason "forced stop failed"
>                         return $OCF_ERR_GENERIC ;;
>                 0*)
>                         while [ $status != $OCF_NOT_RUNNING ]; do
>                                 VirtualDomain_status
>                                 status=$?
>                         done ;;
>         esac
>         return $OCF_SUCCESS
> }
> 
> I'm thinking about the following:
> How about letting the script wait a bit after "virsh destroy"? I saw that it usually takes just a few seconds until "virsh destroy" succeeds.
> I'm thinking about this change:
> 
>         ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
>         out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
>         ex=$?
>         sleep 10    # <=== proposed addition (or maybe configurable)
>         translate=$(echo $out|tr 'A-Z' 'a-z')
> 
> 
> What do you think?
> 


It makes no difference. You just wait 10 seconds before parsing the output
of "virsh destroy", that's all. The sleep does not change the output itself,
so if the output indicates that "virsh destroy" failed, it will still
indicate that after 10 seconds.

Either you need to repeat "virsh destroy" in a loop, or virsh itself
should be more robust.
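
A minimal sketch of the first option, repeating "virsh destroy" in a loop.
This is not the RA's code: destroy_with_retry, DESTROY_RETRIES and
DESTROY_INTERVAL are hypothetical names, and the defaults are arbitrary.
It reuses the output matching from the force_stop() excerpt above and
leaves out the status polling that force_stop() does after a successful
destroy.

```shell
# Sketch only: retry "virsh destroy" until it succeeds, the domain is
# already gone, or the attempts run out.
# DESTROY_RETRIES and DESTROY_INTERVAL are hypothetical, not RA parameters.
destroy_with_retry()
{
        domain=$1
        retries=${DESTROY_RETRIES:-5}     # assumed default: 5 attempts
        interval=${DESTROY_INTERVAL:-2}   # assumed default: 2 seconds between attempts
        attempt=1

        while [ "$attempt" -le "$retries" ]; do
                # Re-run the command itself, so each pass sees fresh output.
                out=$(LANG=C virsh $VIRSH_OPTIONS destroy "$domain" 2>&1)
                ex=$?
                translate=$(echo "$out" | tr 'A-Z' 'a-z')
                case $ex$translate in
                        *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
                        *"error:"*"failed to get domain"*)
                                return 0 ;;     # domain is already gone
                        0*)
                                return 0 ;;     # destroy succeeded
                esac
                attempt=$((attempt + 1))
                # Pause only if another attempt follows.
                if [ "$attempt" -le "$retries" ]; then
                        sleep "$interval"
                fi
        done
        return 1        # every attempt failed
}
```

The total time spent here (retries times interval, plus however long each
"virsh destroy" call blocks) still has to fit inside the stop timeout
configured for the resource, otherwise the stop fails and the node is
fenced as before.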

