[ClusterLabs] Antw: [EXT] Re: VirtualDomain & "deeper" monitors - what/how?

Klaus Wenninger kwenning at redhat.com
Tue Oct 26 03:04:09 EDT 2021


On Mon, Oct 25, 2021 at 9:34 PM Kyle O'Donnell <kyleo at 0b10.mx> wrote:

> Finally got around to working on this.
>
> I spoke with someone on the #clusterlabs IRC channel who mentioned that
> the monitor_scripts param does indeed run at some frequency (op monitor
> timeout=? interval=?), not just during the "start" and "migrate_from"
> actions.
>
> The monitor_scripts param does not support scripts with command line args,
> just a space delimited list for running multiple scripts. This means that
> each VirtualDomain resource needs its own script to be able to define the
> ${DOMAIN_NAME}.  I found that a bit annoying so I created a symlink to a
> wrapper script using the ${DOMAIN_NAME} as the first part of the filename
> and a separator for awk:
>
The scripts called by the monitor operation should inherit the environment
of the monitor, so you should be able to use these variables directly.

Klaus

> ln -s /path/to/wrapper_script.sh /path/to/wrapper/myvmhostname_____wrapper_script.sh
>
> and in my wrapper_script.sh:
> #!/bin/bash
> # The domain name is everything before the first '____' in the symlink name.
> DOMAIN_NAME=$(basename "$0" | awk -F'____' '{print $1}')
> /path/to/myscript.sh -H "${DOMAIN_NAME}" -C guest-get-time -l 25 -w 1
>
> (a bit hack-y but better than creating 1 script per vm resource and
> modifying it with the ${DOMAIN_NAME})
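
A minimal sketch of what Klaus's remark about the inherited environment could look like in the wrapper, under stated assumptions. Pacemaker passes the resource ID to OCF agents as the environment variable OCF_RESOURCE_INSTANCE, so a script started by the agent should see it. Whether DOMAIN_NAME itself is exported depends on the agent, so treat that as an assumption. The function name resolve_domain_name is hypothetical, and the resource ID only equals the libvirt domain name when the resource was named after the domain (as with myvmhostname in this message).

```shell
#!/bin/bash
# Hypothetical helper: prefer names inherited from the resource agent's
# environment and fall back to the symlink-name trick from the wrapper.
resolve_domain_name() {
    local script_name="$1"
    if [ -n "${DOMAIN_NAME:-}" ]; then
        # Assumption: the agent exported DOMAIN_NAME to its children.
        printf '%s\n' "$DOMAIN_NAME"
    elif [ -n "${OCF_RESOURCE_INSTANCE:-}" ]; then
        # Resource ID set by Pacemaker; equals the domain name only if the
        # resource was named after the domain.
        printf '%s\n' "$OCF_RESOURCE_INSTANCE"
    else
        # Fallback: everything before the first '____' in the file name.
        basename "$script_name" | awk -F'____' '{print $1}'
    fi
}
```

The wrapper would then pass "$(resolve_domain_name "$0")" to the -H option of /path/to/myscript.sh. If one of the environment variables turns out to be available, a single script could be listed in monitor_scripts for every VM and the per-VM symlinks would no longer be needed.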
>
> Then creating the cluster resource:
> pcs resource create myvmhostname VirtualDomain \
>   config="/path/to/myvmhostname/myvmhostname.xml" hypervisor="qemu:///system" \
>   migration_transport="ssh" force_stop="false" \
>   monitor_scripts="/path/to/wrapper/myvmhostname_____wrapper_script.sh" \
>   meta allow-migrate="true" target-role="Stopped" \
>   op migrate_from timeout=90s interval=0s \
>   op migrate_to timeout=120s interval=0s \
>   op monitor timeout=40s interval=10s \
>   op start timeout=90s interval=0s \
>   op stop timeout=90s interval=0s
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>
> On Sunday, June 6th, 2021 at 16:56, Kyle O'Donnell <kyleo at 0b10.mx> wrote:
>
> > Let me know if there is a better approach to the following problem: when
> > the virtual machine does not respond to a state query, I want the cluster to
> > kick it.
> >
> > I could not find any useful docs for using the nagios plugins. After
> > reading the documentation about running a custom script via the "monitor"
> > function in the RA, I concluded that it would not meet my requirements, as it
> > is only run on start and migrate (unless I read it incorrectly?).
> >
> > Here is what I did (I'm on Ubuntu 20.04):
> >
> > cp /usr/lib/ocf/resource.d/heartbeat/VirtualDomain /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain
> >
> > cp /usr/share/resource-agents/ocft/configs/VirtualDomain /usr/share/resource-agents/ocft/configs/MyVirtDomain
> >
> > sed -i 's/VirtualDomain/MyVirtDomain/g' /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain
> >
> > sed -i 's/VirtualDomain/MyVirtDomain/g' /usr/share/resource-agents/ocft/configs/MyVirtDomain
> >
> > I then edited the function MyVirtDomain_status in
> > /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain, adding the following to the
> > status case running|paused|idle|blocked|"in shutdown"):
> >
> > FROM
> >
> > running|paused|idle|blocked|"in shutdown")
> >     # running: domain is currently actively consuming cycles
> >     # paused: domain is paused (suspended)
> >     # idle: domain is running but idle
> >     # blocked: synonym for idle used by legacy Xen versions
> >     # in shutdown: the domain is in process of shutting down, but has not
> >     # completely shutdown or crashed.
> >     ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
> >     rc=$OCF_SUCCESS
> >
> > TO
> >
> > running|paused|idle|blocked|"in shutdown")
> >     # running: domain is currently actively consuming cycles
> >     # paused: domain is paused (suspended)
> >     # idle: domain is running but idle
> >     # blocked: synonym for idle used by legacy Xen versions
> >     # in shutdown: the domain is in process of shutting down, but has not
> >     # completely shutdown or crashed.
> >     custom_chk=$(/path/to/myscript.sh -H $DOMAIN_NAME -C guest-get-time -l 25 -w 1)
> >     custom_rc=$?
> >     if [ ${custom_rc} -eq 0 ]; then
> >         ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
> >         rc=$OCF_SUCCESS
> >     else
> >         ocf_log debug "Virtual domain $DOMAIN_NAME is currently ${custom_chk}."
> >         rc=$OCF_ERR_GENERIC
> >     fi
> >
> > The custom script uses the qemu-guest-agent in my guest, passing the
> > parameter to grab the guest's time (this seems to be the most universal
> > command [Windows, CentOS 6, Ubuntu, CentOS 7]). It runs up to 25 loops,
> > sleeping 1 second between iterations; it exits 0 as soon as the agent
> > responds with the time and exits 1 after the 25th loop. These exit codes
> > correspond to OCF_SUCCESS and OCF_ERR_GENERIC according to the docs.
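
myscript.sh itself is not shown in this thread. The following is only a minimal sketch of the retry loop described above, assuming the script wraps virsh qemu-agent-command. The function name poll_guest_agent and the VIRSH override (useful for testing without libvirt) are hypothetical, and the output lines merely imitate the sample output in this message.

```shell
#!/bin/bash
# Assumption: virsh is the command used to reach the guest agent; VIRSH can be
# pointed at a stub for testing.
VIRSH="${VIRSH:-virsh}"

# poll_guest_agent DOMAIN AGENT_COMMAND LOOPS WAIT_SECONDS
# Returns 0 as soon as the agent answers, 1 after LOOPS failed attempts.
poll_guest_agent() {
    local domain agent_cmd loops wait_secs i out
    domain="$1"
    agent_cmd="$2"
    loops="$3"
    wait_secs="$4"
    i=1
    while [ "$i" -le "$loops" ]; do
        if out=$("$VIRSH" qemu-agent-command "$domain" \
                 "{\"execute\":\"$agent_cmd\"}" 2>&1); then
            echo "[GOOD] - $domain virsh qemu-agent-command $agent_cmd output: $out"
            return 0
        fi
        echo "[BAD] - $domain virsh qemu-agent-command $agent_cmd output: $out"
        # Sleep only between attempts, not after the last one.
        [ "$i" -lt "$loops" ] && sleep "$wait_secs"
        i=$((i + 1))
    done
    return 1
}
```

Calling poll_guest_agent myvm guest-get-time 25 1 would correspond to the -l 25 -w 1 invocation used in this message. With a 1 second wait the worst case takes at least 24 seconds plus the time virsh itself needs, which is worth keeping in mind when choosing the monitor timeout.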
> >
> > /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
> > =========================================================
> >
> > [GOOD] - myvm virsh qemu-agent-command guest-get-time output:
> {"return":1623011582178375000}
> >
> > or when it's not responding:
> >
> > /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
> > =========================================================
> >
> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error:
> Guest agent is not responding: QEMU guest agent is not connected
> >
> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error:
> Guest agent is not responding: QEMU guest agent is not connected
> >
> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error:
> Guest agent is not responding: QEMU guest agent is not connected
> >
> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error:
> Guest agent is not responding: QEMU guest agent is not connected
> >
> > ... (exits after the 25th attempt, or as soon as the agent responds:)
> >
> > [GOOD] - myvm virsh qemu-agent-command guest-get-time output:
> {"return":1623011582178375000}
> >
> > and when the VM isn't running:
> >
> > /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
> > =========================================================
> >
> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error:
> failed to get domain 'myvm'
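
The guest-get-time reply in the [GOOD] lines above is a JSON object whose return value is, per the QEMU guest agent reference, the guest time in nanoseconds since the epoch. A deeper check could compare that value with the host clock. A minimal sketch of extracting it with sed follows; the function name agent_time_to_epoch is hypothetical, and a real JSON parser would be more robust than this pattern match.

```shell
#!/bin/bash
# Hypothetical helper: print the guest time in whole seconds since the epoch,
# given the JSON reply of guest-get-time. Returns 1 if no value is found.
agent_time_to_epoch() {
    local ns
    ns=$(printf '%s\n' "$1" |
         sed -n 's/.*"return"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
    # Need at least ten digits so that one or more digits of seconds remain.
    [ "${#ns}" -gt 9 ] || return 1
    # Drop the last nine digits (nanoseconds) without integer arithmetic.
    printf '%s\n' "${ns%?????????}"
}
```

For the sample reply in this message, agent_time_to_epoch '{"return":1623011582178375000}' prints 1623011582, which can then be compared with the output of date +%s on the host.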
> >
> > I updated my test vm to use the new RA, updated the status timeout to
> 40s from default of 30s just in case.
> >
> > I'd like to be able to update the parameters to myscript.sh via crm
> configure edit at some point, but will figure that out later...
> >
> > My test:
> >
> > Reboot the VM from within the OS and hit escape to enter the boot
> > mode prompt. After ~30 seconds the cluster decides the resource has
> > a problem, marks it as failed, and restarts the virtual machine (on the
> > same node, which in my case is desirable). Once the guest is back up and
> > responding, the cluster reports the VM as Started.
> >
> > I still have plenty more testing to do and will keep the list posted on
> progress.
> >
> > -Kyle
> >
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> >
> > On Thursday, May 27th, 2021 at 05:34, Kyle O'Donnell kyleo at 0b10.mx
> wrote:
> >
> > > guest-get-fsinfo doesn't seem to work on older agents (CentOS 6); I've
> > > found guest-get-time more universal.
> > >
> > > Also, found this helpful thread on using monitor_scripts which is part
> of the VirtualDomain RA
> > >
> > >
> https://linux-ha-dev.linux-ha.narkive.com/yxvySDA2/monitor-scripts-parameter-for-the-virtualdomain-ra-was-re-linux-ha-ocf-resource-agent-for-kvm
> > >
> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > >
> > > On Sunday, May 16th, 2021 at 22:49, Kyle O'Donnell kyleo at 0b10.mx
> wrote:
> > >
> > > > I am thinking about using the qemu-guest-agent to run one of the
> > > > available commands to determine the health of the OS inside:
> > > >
> > > > virsh qemu-agent-command myvm --pretty '{"execute":"guest-get-fsinfo"}'
> > > >
> > > > https://qemu-project.gitlab.io/qemu/interop/qemu-ga-ref.html
> > > >
> > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > >
> > > > On Thursday, May 13th, 2021 at 01:28, Andrei Borzenkov
> arvidjaar at gmail.com wrote:
> > > >
> > > > > On 03.05.2021 09:48, Ulrich Windl wrote:
> > > > >
> > > > > > > > > > Ken Gaillot kgaillot at redhat.com wrote on 30.04.2021 at 16:57 in
> > > > > > > > > > message 3acef4bc31923fb019619c713300444c2dcd354a.camel at redhat.com:
> > > > > > > > >
> > > > > > > > > On Fri, 2021‑04‑30 at 11:00 +0100, lejeczek wrote:
> > > > > > >
> > > > > > > > Hi guys
> > > > > > > >
> > > > > > > > I'd like to ask around for thoughts & suggestions on any
> > > > > > > > semi/official ways to monitor VirtualDomain.
> > > > > > > > Something beyond what included RA does ‑ such as actual
> > > > > > > > health testing of and communication with VM's OS.
> > > > > > > >
> > > > > > > > many thanks, L.
> > > > > > >
> > > > > > > This use case led to a Pacemaker feature many moons ago ...
> > > > > > >
> > > > > > > Pacemaker supports nagios plug‑ins as a resource type (e.g.
> > > > > > > nagios:check_apache_status). These are service checks usually used with
> > > > > > > monitoring software such as nagios, icinga, etc.
> > > > > > >
> > > > > > > If the service being monitored is inside a VirtualDomain, named vm1 for
> > > > > > > example, you can configure the nagios resource with the resource
> > > > > > > meta‑attribute container="vm1". If the nagios check fails, Pacemaker will
> > > > > > > restart vm1.
> > > > > >
> > > > > > "check fails" means WARNING, CRITICAL, or UNKNOWN? ;-)
> > > > >
> > > > > switch (rc) {
> > > > >     case NAGIOS_STATE_OK:
> > > > >         return PCMK_OCF_OK;
> > > > >     case NAGIOS_INSUFFICIENT_PRIV:
> > > > >         return PCMK_OCF_INSUFFICIENT_PRIV;
> > > > >     case NAGIOS_NOT_INSTALLED:
> > > > >         return PCMK_OCF_NOT_INSTALLED;
> > > > >     case NAGIOS_STATE_WARNING:
> > > > >     case NAGIOS_STATE_CRITICAL:
> > > > >     case NAGIOS_STATE_UNKNOWN:
> > > > >     case NAGIOS_STATE_DEPENDENT:
> > > > >     default:
> > > > >         return PCMK_OCF_UNKNOWN_ERROR;
> > > > > }
> > > > >
> > > > > return PCMK_OCF_UNKNOWN_ERROR;
> > > > >
> > > > > Manage your subscription:
> > > > >
> > > > > https://lists.clusterlabs.org/mailman/listinfo/users
> > > > >
> > > > > ClusterLabs home: https://www.clusterlabs.org/