[ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?

Lentes, Bernd bernd.lentes at helmholtz-muenchen.de
Thu Oct 22 16:29:57 EDT 2020


Hi guys,

Occasionally, stopping a VirtualDomain resource via "crm resource stop" does not work, and in the end the node is fenced, which is ugly.
I had a look at the RA to see what it does. It first tries to stop the domain via "virsh shutdown ..."; if that does not succeed within a configurable time, it switches to "virsh destroy".
I assume "virsh destroy" sends a SIGKILL to the respective process. But when the host is doing heavy I/O, the process may be in "D" state (uninterruptible sleep),
in which it cannot be terminated by a SIGKILL. The node the domain is running on is then fenced because of that.
I dug deeper and found that the signal is often delivered a bit later (just a few seconds) and the process is killed, but by then Pacemaker has already decided to fence the node.
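
The "D" state mentioned above can be checked from the shell. This is only a minimal sketch, assuming a procps-style "ps"; the function names proc_state and is_uninterruptible are made up for illustration and are not part of the RA.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the VirtualDomain RA.

# Print the first letter of the process state of PID $1 (R, S, D, Z, T, ...).
# Prints nothing if the process does not exist (any more).
proc_state() {
    ps -o stat= -p "$1" 2>/dev/null | tr -d ' ' | cut -c1
}

# Return 0 if PID $1 is currently in uninterruptible sleep ("D").
is_uninterruptible() {
    [ "$(proc_state "$1")" = "D" ]
}
```

For example, is_uninterruptible "$pid" && echo "still blocked in I/O", where $pid would be the PID of the domain's qemu process (how to look it up is left out here).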
It's all about this excerpt from the RA:

force_stop()
{
        local out ex translate
        local status=0

        ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
        out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
        ex=$?
        translate=$(echo $out|tr 'A-Z' 'a-z')
        echo >&2 "$translate"
        case $ex$translate in
                *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
                *"error:"*"failed to get domain"*)
                        : ;; # unexpected path to the intended outcome, all is well
                [!0]*)
                        ocf_exit_reason "forced stop failed"
                        return $OCF_ERR_GENERIC ;;
                0*)
                        while [ $status != $OCF_NOT_RUNNING ]; do
                                VirtualDomain_status
                                status=$?
                        done ;;
        esac
        return $OCF_SUCCESS
}

Here is what I have in mind:
How about letting the script wait a bit after "virsh destroy"? I saw that it usually takes just a few seconds until "virsh destroy" succeeds.
I'm thinking about this change:

        ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
        out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
        ex=$?
        sleep 10    # <=== proposed delay (or maybe configurable)
        translate=$(echo $out|tr 'A-Z' 'a-z')
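
A possible variant of the "configurable" idea would be to poll with an upper limit instead of always sleeping the full time. This is only a rough sketch: wait_until is a hypothetical helper, not something the RA provides, and the timeout value would have to come from a new (so far non-existent) parameter.

```shell
#!/bin/sh
# Hypothetical helper, not part of the VirtualDomain RA.
# Usage: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Runs COMMAND once per second until it succeeds or TIMEOUT_SECONDS have passed.
# Returns 0 if COMMAND succeeded in time, 1 otherwise.
wait_until() {
    _wu_timeout=$1
    shift
    _wu_elapsed=0
    while ! "$@"; do
        # Give up once the time budget is used up.
        [ "$_wu_elapsed" -ge "$_wu_timeout" ] && return 1
        sleep 1
        _wu_elapsed=$((_wu_elapsed + 1))
    done
    return 0
}
```

In force_stop() this could be called with a small wrapper function that succeeds once VirtualDomain_status returns $OCF_NOT_RUNNING, so the stop returns as soon as the domain is gone but never waits longer than the configured limit.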


What do you think?

Bernd


-- 

Bernd Lentes 
Systemadministration 
Institute for Metabolism and Cell Death (MCD) 
Building 25 - office 122 
HelmholtzZentrum München 
bernd.lentes at helmholtz-muenchen.de 
phone: +49 89 3187 1241 
phone: +49 89 3187 3827 
fax: +49 89 3187 2294 
http://www.helmholtz-muenchen.de/mcd 



