[ClusterLabs] Antw: [EXT] Re: VirtualDomain & "deeper" monitors - what/how?

Sun Jun 6 16:56:05 EDT 2021

Let me know if there is a better approach to the following problem: when the virtual machine does not respond to a state query, I want the cluster to kick (restart) it.

I could not find any useful docs for using the nagios plugins. After reading the documentation about running a custom script via the "monitor" function in the RA, I determined that it would not meet my requirements, as it's only run on start and migrate (unless I read it incorrectly?).

Here is what I did (I'm on Ubuntu 20.04):

cp /usr/lib/ocf/resource.d/heartbeat/VirtualDomain /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain
cp /usr/share/resource-agents/ocft/configs/VirtualDomain /usr/share/resource-agents/ocft/configs/MyVirtDomain
sed -i 's/VirtualDomain/MyVirtDomain/g' /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain
sed -i 's/VirtualDomain/MyVirtDomain/g' /usr/share/resource-agents/ocft/configs/MyVirtDomain

I then edited the function *MyVirtDomain_status* in /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain, changing the status case *running|paused|idle|blocked|"in shutdown")* as follows:

FROM
                        running|paused|idle|blocked|"in shutdown")
                                # running: domain is currently actively consuming cycles
                                # paused: domain is paused (suspended)
                                # idle: domain is running but idle
                                # blocked: synonym for idle used by legacy Xen versions
                                # in shutdown: the domain is in process of shutting down, but has not completely shutdown or crashed.

                                ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
                                rc=$OCF_SUCCESS

TO
                        running|paused|idle|blocked|"in shutdown")
                                # running: domain is currently actively consuming cycles
                                # paused: domain is paused (suspended)
                                # idle: domain is running but idle
                                # blocked: synonym for idle used by legacy Xen versions
                                # in shutdown: the domain is in process of shutting down, but has not completely shutdown or crashed.
                                custom_chk=$(/path/to/myscript.sh -H "$DOMAIN_NAME" -C guest-get-time -l 25 -w 1)
                                custom_rc=$?
                                if [ ${custom_rc} -eq 0 ]; then
                                  ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
                                  rc=$OCF_SUCCESS
                                else
                                  ocf_log debug "Virtual domain $DOMAIN_NAME is currently ${custom_chk}."
                                  rc=$OCF_ERR_GENERIC
                                fi

The custom script uses the qemu-guest-agent in my guest, passing the guest-get-time command to grab the guest's time (it seems to be the most universally supported command [Windows, CentOS 6, Ubuntu, CentOS 7]). It runs up to 25 loops, sleeping 1 second between iterations. It exits 0 as soon as the agent responds with the time and exits 1 after the 25th loop; based on the docs, these map to OCF_SUCCESS and OCF_ERR_GENERIC.
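
For readers who want to try something similar, here is a minimal sketch of such a probe. It is not the actual myscript.sh; it only mirrors the behaviour described above (retry loop, [GOOD]/[BAD] lines, exit 0 or 1). The function names, the VIRSH override variable, and the early exit on a missing domain are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical sketch of a guest-agent probe; NOT the actual myscript.sh.
# VIRSH may be overridden (for example with a stub while testing).
: "${VIRSH:=virsh}"

# check_guest DOMAIN AGENT_COMMAND LOOPS WAIT_SECONDS
# Returns 0 as soon as the agent answers, 1 once all attempts have failed.
check_guest() {
    local domain=$1 cmd=$2 loops=$3 wait_s=$4
    local i out
    for ((i = 1; i <= loops; i++)); do
        if out=$("$VIRSH" qemu-agent-command "$domain" "{\"execute\":\"$cmd\"}" 2>&1); then
            echo "[GOOD] - $domain virsh qemu-agent-command $cmd output: $out"
            return 0
        fi
        echo "[BAD] - $domain virsh qemu-agent-command $cmd output: $out"
        # Assumption: a missing domain will not come back by retrying, so stop at once.
        case $out in
            *"failed to get domain"*) return 1 ;;
        esac
        # Sleep between iterations, but not after the last one.
        if ((i < loops)); then sleep "$wait_s"; fi
    done
    return 1
}

main() {
    local domain="" cmd="guest-get-time" loops=25 wait_s=1 opt
    local OPTIND=1
    while getopts "H:C:l:w:" opt; do
        case $opt in
            H) domain=$OPTARG ;;
            C) cmd=$OPTARG ;;
            l) loops=$OPTARG ;;
            w) wait_s=$OPTARG ;;
            *) echo "usage: $0 -H domain [-C command] [-l loops] [-w seconds]" >&2; return 2 ;;
        esac
    done
    [ -n "$domain" ] || { echo "missing -H domain" >&2; return 2; }
    check_guest "$domain" "$cmd" "$loops" "$wait_s"
}

# Run only when arguments were given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    main "$@"
    exit $?
fi
```

Keeping the loop in a function makes it easy to exercise the retry logic with a stub in place of virsh before wiring it into the RA.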

# /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
[GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}

or when it's not responding:
# /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
... (exits after the 25th loop, or as soon as the agent responds:)
[GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}

and when the VM isn't running:
# /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
[BAD] - myvm virsh qemu-agent-command guest-get-time output: error: failed to get domain 'myvm'

I updated my test VM to use the new RA, and raised the status timeout from the default of 30s to 40s just in case.
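
As a rough sanity check on that timeout, the worst-case duration of the probe can be estimated from the loop count and the sleep interval. The sketch below assumes the script sleeps between iterations but not after the last one, and treats the cost of a single failing virsh call as an unknown input (per_call); neither detail is stated above, so both are assumptions.

```shell
#!/bin/bash
# worst_case_seconds LOOPS WAIT PER_CALL
# Upper bound on the probe duration when every attempt fails.
# PER_CALL is an assumed cost, in whole seconds, of one failing virsh call.
worst_case_seconds() {
    local loops=$1 wait_s=$2 per_call=$3
    echo $(( loops * per_call + (loops - 1) * wait_s ))
}

worst_case_seconds 25 1 0   # sleeps only: prints 24
worst_case_seconds 25 1 1   # if each failing call took 1 second: prints 49
```

If failing calls return almost instantly, 25 loops fit within 40s; if each call were to take a second or more, the probe could outlast the timeout, so the loop count and the timeout are worth tuning together.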

I'd like to be able to update the parameters to *myscript.sh* via crm configure edit at some point, but will figure that out later...
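
One possible way to get there: Pacemaker hands resource parameters to the agent as OCF_RESKEY_<name> environment variables, so the hard-coded values could be read from those with defaults. The parameter names below (guest_check_script, guest_check_command, guest_check_loops, guest_check_wait) and the run_guest_check helper are hypothetical; they are not part of the stock VirtualDomain RA and would also need to be declared in the agent's meta-data XML before crm would accept them.

```shell
#!/bin/bash
# Hypothetical parameter names, with defaults matching the hard-coded call.
# They would also have to be declared in the agent's meta-data XML.
: "${OCF_RESKEY_guest_check_script:=/path/to/myscript.sh}"
: "${OCF_RESKEY_guest_check_command:=guest-get-time}"
: "${OCF_RESKEY_guest_check_loops:=25}"
: "${OCF_RESKEY_guest_check_wait:=1}"

# Run the guest check for DOMAIN using the configured (or default) values.
run_guest_check() {
    local domain=$1
    "$OCF_RESKEY_guest_check_script" -H "$domain" \
        -C "$OCF_RESKEY_guest_check_command" \
        -l "$OCF_RESKEY_guest_check_loops" \
        -w "$OCF_RESKEY_guest_check_wait"
}
```

Inside the status case, the hard-coded call would then become custom_chk=$(run_guest_check "$DOMAIN_NAME"), with custom_rc=$? handled exactly as before.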

My test:

Reboot the VM from within the OS and hit escape to enter the boot mode prompt. After ~30 seconds the cluster decides the resource is having a problem, marks it as failed, and restarts the virtual machine (on the same node, which in my case is desirable). Once the guest is back up and responding, the cluster reports the VM as Started.

I still have plenty more testing to do and will keep the list posted on progress.

-Kyle

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Thursday, May 27th, 2021 at 05:34, Kyle O'Donnell <kyleo at 0b10.mx> wrote:

> guest-get-fsinfo doesn't seem to work on older agents (CentOS 6); I've found guest-get-time more universal.
>
> Also, I found this helpful thread on using monitor_scripts, which is part of the VirtualDomain RA:
>
> https://linux-ha-dev.linux-ha.narkive.com/yxvySDA2/monitor-scripts-parameter-for-the-virtualdomain-ra-was-re-linux-ha-ocf-resource-agent-for-kvm
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>
> On Sunday, May 16th, 2021 at 22:49, Kyle O'Donnell kyleo at 0b10.mx wrote:
>
> > I am thinking about using the qemu-guest-agent to run one of the available commands to determine the health of the OS inside:
> >
> > virsh qemu-agent-command myvm --pretty '{"execute":"guest-get-fsinfo"}'
> >
> > https://qemu-project.gitlab.io/qemu/interop/qemu-ga-ref.html
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> >
> > On Thursday, May 13th, 2021 at 01:28, Andrei Borzenkov arvidjaar at gmail.com wrote:
> >
> > > On 03.05.2021 09:48, Ulrich Windl wrote:
> > >
> > > > > > > > Ken Gaillot kgaillot at redhat.com wrote on 30.04.2021 at 16:57 in message 3acef4bc31923fb019619c713300444c2dcd354a.camel at redhat.com:
> > > > > > >
> > > > > > > On Fri, 2021‑04‑30 at 11:00 +0100, lejeczek wrote:
> > > > >
> > > > > > Hi guys
> > > > > >
> > > > > > I'd like to ask around for thoughts & suggestions on any semi/official ways to monitor VirtualDomain. Something beyond what included RA does - such as actual health testing of and communication with VM's OS.
> > > > > >
> > > > > > many thanks, L.
> > > > >
> > > > > This use case led to a Pacemaker feature many moons ago ...
> > > > >
> > > > > Pacemaker supports nagios plug-ins as a resource type (e.g. nagios:check_apache_status). These are service checks usually used with monitoring software such as nagios, icinga, etc.
> > > > >
> > > > > If the service being monitored is inside a VirtualDomain, named vm1 for example, you can configure the nagios resource with the resource meta-attribute container="vm1". If the nagios check fails, Pacemaker will restart vm1.
> > > >
> > > > "check fails" means WARNING, CRITICAL, or UNKNOWN? ;-)
> > >
> > > switch (rc) {
> > >     case NAGIOS_STATE_OK:
> > >         return PCMK_OCF_OK;
> > >     case NAGIOS_INSUFFICIENT_PRIV:
> > >         return PCMK_OCF_INSUFFICIENT_PRIV;
> > >     case NAGIOS_NOT_INSTALLED:
> > >         return PCMK_OCF_NOT_INSTALLED;
> > >     case NAGIOS_STATE_WARNING:
> > >     case NAGIOS_STATE_CRITICAL:
> > >     case NAGIOS_STATE_UNKNOWN:
> > >     case NAGIOS_STATE_DEPENDENT:
> > >     default:
> > >         return PCMK_OCF_UNKNOWN_ERROR;
> > > }
> > >
> > > return PCMK_OCF_UNKNOWN_ERROR;
> > >