[ClusterLabs] Antw: Re: Antw: [EXT] Re: VirtualDomain & "deeper" monitors ‑ what/how?

Thu Oct 28 04:31:58 EDT 2021

I think OCF_RESOURCE_INSTANCE is the name of the cluster resource, which in my case matches my VM name but doesn't have to. Parsing the XML config for the VM is safer, which is what the VirtualDomain RA does too.
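
For anyone who wants both options in one place, here is a rough sketch of how a script listed in monitor_scripts could pick the domain name. It assumes the script inherits OCF_RESKEY_config and OCF_RESOURCE_INSTANCE from the monitor action, as in the environment dump quoted below. The function name get_domain_name is made up for this example and is not part of the VirtualDomain RA.

```shell
#!/bin/sh
# Hypothetical helper (not part of the VirtualDomain RA): print the libvirt
# domain name for the resource currently being monitored.
# Prefers the <name> element of the domain XML named by OCF_RESKEY_config and
# falls back to OCF_RESOURCE_INSTANCE only when the XML yields nothing.
get_domain_name() {
    name=""
    if [ -r "${OCF_RESKEY_config:-}" ]; then
        # Only lines that consist of <name>...</name> are considered, and
        # only the first such line is used.
        name=$(sed -n 's/^[[:space:]]*<name>\(.*\)<\/name>[[:space:]]*$/\1/p' \
            "$OCF_RESKEY_config" | head -n 1)
    fi
    # Fallback assumes the resource was named after the VM, which is a
    # convention and not something the cluster enforces.
    printf '%s\n' "${name:-${OCF_RESOURCE_INSTANCE:-}}"
}
```

A monitor script could then start with DOMAIN_NAME=$(get_domain_name) and pass "$DOMAIN_NAME" on to whatever check it runs.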

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Thursday, October 28th, 2021 at 03:03, Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de> wrote:

> Hi!
>
> I wonder: Shouldn't "OCF_RESOURCE_INSTANCE" help you to identify what is going
>
> to be monitored?
>
> (Reasonable naming assumed ;-))
>
> Regards,
>
> Ulrich
>
> > > > Kyle O'Donnell kyleo at 0b10.mx wrote on 26.10.2021 at 13:53 in message
> <uNHregOAnWaFxn5xMCQhuxDLxi-E_norRLuSuhxjZTewjFtRwmq_hVWHwJN4Lo4ybtKSmwYqAbkk9zf57Gp0Hmww903ZC09_P2QrHeneW0=@0b10.mx>:
>
> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> >
> > On Tuesday, October 26th, 2021 at 03:04, Klaus Wenninger kwenning at redhat.com wrote:
> >
> > > On Mon, Oct 25, 2021 at 9:34 PM Kyle O'Donnell kyleo at 0b10.mx wrote:
> > >
> > > > Finally got around to working on this.
> > > >
> > > > > I spoke with someone on the #clusterlabs IRC channel who mentioned that
> > > > > the monitor_scripts param does indeed run at some frequency (op monitor
> > > > > timeout=? interval=?), not just during the "start" and "migrate_from"
> > > > > actions.
> >
> > > > > The monitor_scripts param does not support scripts with command line args,
> > > > > just a space delimited list for running multiple scripts. This means that
> > > > > each VirtualDomain resource needs its own script to be able to define the
> > > > > ${DOMAIN_NAME}. I found that a bit annoying so I created a symlink to a
> > > > > wrapper script using the ${DOMAIN_NAME} as the first part of the filename
> > > > > and a separator for awk:
> >
> > > The scripts being called by the monitor operation should inherit the
> > >
> > > environment from the monitor so that you should be able to use these
> > >
> > > variables.
> > >
> > > Klaus
> >
> > Thanks!
> >
> > I tried referencing the ${DOMAIN_NAME} variable initially but that did not
> >
> > work. I tried running the function that creates the variable
> >
> > (VirtualDomain_getconfig) it also did not work.
> >
> > After some debugging it looks like the following variables are available
> >
> > from the parent script:
> >
> > error output [ OCF_ROOT=/usr/lib/ocf ] ]
> >
> > error output [ OCF_RESKEY_crm_feature_set=3.2.1 ]
> >
> > error output [ HA_LOGFACILITY=daemon ]
> >
> > error output [ PCMK_debug=0 ]
> >
> > error output [ HA_debug=0 ]
> >
> > error output [ PWD=/var/lib/pacemaker/cores ]
> >
> > error output [ OCF_RESKEY_hypervisor=qemu:///system ]
> >
> > error output [ HA_logfile=/var/log/pacemaker/pacemaker.log ]
> >
> > error output [ HA_logfacility=daemon ]
> >
> > error output [ OCF_EXIT_REASON_PREFIX=ocf-exit-reason: ]
> >
> > error output [ OCF_RESOURCE_PROVIDER=heartbeat ]
> >
> > error output [ PCMK_service=pacemaker-execd ]
> >
> > error output [ PCMK_mcp=true ]
> >
> > error output [ OCF_RESKEY_monitor_scripts=/path/to/myvmhostname____wrap_check.sh ]
> >
> > error output [ OCF_RA_VERSION_MAJOR=1 ]
> >
> > error output [ VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all ]
> >
> > error output [ HA_cluster_type=corosync ]
> >
> > error output [ INVOCATION_ID=652062571c8f415a9a7a228c5ad77b20 ]
> >
> > error output [ OCF_RESKEY_CRM_meta_interval=10000 ]
> >
> > error output [ OCF_RESOURCE_INSTANCE=myvmhostname ]
> >
> > error output [ HA_quorum_type=corosync ]
> >
> > error output [ OCF_RA_VERSION_MINOR=0 ]
> >
> > error output [ HA_mcp=true ]
> >
> > error output [ OCF_RESKEY_config=/path/to/myvmhostname/myvmhostname.xml ]
> >
> > error output [ PCMK_quorum_type=corosync ]
> >
> > error output [ OCF_RESKEY_CRM_meta_name=monitor ]
> >
> > error output [ OCF_RESKEY_migration_transport=ssh ]
> >
> > error output [ SHLVL=1 ]
> >
> > error output [ OCF_RESKEY_CRM_meta_on_node=node02 ]
> >
> > error output [ PCMK_watchdog=false ]
> >
> > error output [ PCMK_logfile=/var/log/pacemaker/pacemaker.log ]
> >
> > error output [ OCF_RESKEY_CRM_meta_timeout=40000 ]
> >
> > error output [ OCF_RESOURCE_TYPE=VirtualDomain ]
> >
> > error output [ PCMK_logfacility=daemon ]
> >
> > error output [ LC_ALL=C ]
> >
> > error output [ HA_LOGFILE=/var/log/pacemaker/pacemaker.log ]
> >
> > error output [ JOURNAL_STREAM=9:42440 ]
> >
> > error output [ OCF_RESKEY_CRM_meta_on_node_uuid=2 ]
> >
> > error output [ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb ]
> >
> > error output [ OCF_RESKEY_force_stop=false ]
> >
> > error output [ PCMK_cluster_type=corosync ]
> >
> > error output [ _=/usr/bin/env ]
> >
> > The most helpful variable is:
> >
> > error output [ OCF_RESKEY_config=/path/to/myvmhostname/myvmhostname.xml ]
> >
> > So I copied part of the "VirtualDomain_getconfig" function from the resource
> > script to populate the variable in the same way:
> >
> > DOMAIN_NAME=`egrep '[[:space:]]*<name>.*</name>[[:space:]]*$' ${OCF_RESKEY_config} 2>/dev/null | sed -e 's/[[:space:]]*<name>\(.*\)<\/name>[[:space:]]*$/\1/'`
> >
> > and now it's working without the hacky symlink
> >
> > > > > ln -s /path/to/wrapper_script.sh /path/to/wrapper/myvmhostname_____wrapper_script.sh
> > >
> > > > and in my wrapper_script.sh:
> > > >
> > > > #!/bin/bash
> > > >
> > > > DOMAIN_NAME=$(basename "$0" |awk -F'____' '{print $1}')
> > > >
> > > > /path/to/myscript.sh -H ${DOMAIN_NAME} -C guest-get-time -l 25 -w 1
> > > >
> > > > (a bit hack-y but better than creating 1 script per vm resource and
> > > >
> > > > modifying it with the ${DOMAIN_NAME})
> > >
> > > > Then creating the cluster resource:
> > > >
> > > > > pcs resource create myvmhostname VirtualDomain \
> > > > >   config="/path/to/myvmhostname/myvmhostname.xml" hypervisor="qemu:///system" \
> > > > >   migration_transport="ssh" force_stop="false" \
> > > > >   monitor_scripts="/path/to/wrapper/myvmhostname_____wrapper_script.sh" \
> > > > >   meta allow-migrate="true" target-role="Stopped" \
> > > > >   op migrate_from timeout=90s interval=0s \
> > > > >   op migrate_to timeout=120s interval=0s \
> > > > >   op monitor timeout=40s interval=10s \
> > > > >   op start timeout=90s interval=0s \
> > > > >   op stop timeout=90s interval=0s
>
> > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > >
> > > > On Sunday, June 6th, 2021 at 16:56, Kyle O'Donnell kyleo at 0b10.mx wrote:
> > > >
> > > > > > Let me know if there is a better approach to the following problem. When
> > > > > > the virtual machine does not respond to a state query I want the cluster
> > > > > > to kick it.
> >
> > > > > > I could not find any useful docs for using the nagios plugins. After
> > > > > > reading the documentation about running a custom script via the "monitor"
> > > > > > function in the RA I determined that would not meet my requirements as
> > > > > > it's only run on start and migrate (unless I read it incorrectly?).
> >
> > > > > > Here is what I did (I'm on Ubuntu 20.04):
> > > > > >
> > > > > > cp /usr/lib/ocf/resource.d/heartbeat/VirtualDomain /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain
> > > > > >
> > > > > > cp /usr/share/resource-agents/ocft/configs/VirtualDomain /usr/share/resource-agents/ocft/configs/MyVirtDomain
> > > > > >
> > > > > > sed -i 's/VirtualDomain/MyVirtDomain/g' /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain
> > > > > >
> > > > > > sed -i 's/VirtualDomain/MyVirtDomain/g' /usr/share/resource-agents/ocft/configs/MyVirtDomain
> > >
> > > > > > edited function MyVirtDomain_status in
> > > > > > /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain, adding the following to the
> > > > > > status case running|paused|idle|blocked|"in shutdown")
> >
> > > > > FROM
> > > > >
> > > > > > running|paused|idle|blocked|"in shutdown")
> > > > > >     # running: domain is currently actively consuming cycles
> > > > > >     # paused: domain is paused (suspended)
> > > > > >     # idle: domain is running but idle
> > > > > >     # blocked: synonym for idle used by legacy Xen versions
> > > > > >     # in shutdown: the domain is in process of shutting down, but has not
> > > > > >     # completely shutdown or crashed.
> > > > > >     ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
> > > > > >     rc=$OCF_SUCCESS
> > > > >
> > > > > TO
> > > > >
> > > > > > running|paused|idle|blocked|"in shutdown")
> > > > > >     # running: domain is currently actively consuming cycles
> > > > > >     # paused: domain is paused (suspended)
> > > > > >     # idle: domain is running but idle
> > > > > >     # blocked: synonym for idle used by legacy Xen versions
> > > > > >     # in shutdown: the domain is in process of shutting down, but has not
> > > > > >     # completely shutdown or crashed.
> > > > > >     custom_chk=$(/path/to/myscript.sh -H $DOMAIN_NAME -C guest-get-time -l 25 -w 1)
> > > > > >     custom_rc=$?
> > > > > >     if [ ${custom_rc} -eq 0 ]; then
> > > > > >         ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
> > > > > >         rc=$OCF_SUCCESS
> > > > > >     else
> > > > > >         ocf_log debug "Virtual domain $DOMAIN_NAME is currently ${custom_chk}."
> > > > > >         rc=$OCF_ERR_GENERIC
> > > > > >     fi
> > > > >
> > > > > > The custom script uses the qemu-guest-agent in my guest, passing the
> > > > > > parameter to grab the guest's time (seems to be most universal [windows,
> > > > > > centos6, ubuntu, centos 7]). Runs 25 loops, sleeps 1 second between
> > > > > > iterations, exit 0 as soon as the agent responds with the time and exit 1
> > > > > > after the 25th loop, which are OCF_SUCCESS and OCF_ERR_GENERIC based on
> > > > > > docs.
>
> > > > > > # /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
> > > > > > [GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}
> > >
> > > > > > or when it's not responding:
> > > > > >
> > > > > > # /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
> > > > > > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
> > > > > > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
> > > > > > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
> > > > > > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
> > > > > > ... (exits after 25th or
> > > > > > [GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}
> > >
> > > > > > and when the VM isn't running:
> > > > > >
> > > > > > # /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
> > > > > > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: failed to get domain 'myvm'
> >
> > > > > > I updated my test vm to use the new RA, updated the status timeout to 40s
> > > > > > from default of 30s just in case.
> >
> > > > > I'd like to be able to update the parameters to myscript.sh via crm
> > > > >
> > > > > configure edit at some point, but will figure that out later...
> > >
> > > > > My test:
> > > > >
> > > > > > reboot the VM from within the OS, hit escape so that I enter the boot mode
> > > > > > prompt... after ~30 seconds the cluster decides the resource is having a
> > > > > > problem, marks it as failed, and restarts the virtual machine (on the same
> > > > > > node -- which in my case is desirable), once the guest is back up and
> > > > > > responding the cluster reports the VM as Started
> >
> > > > > I still have plenty more testing to do and will keep the list posted on
> > > > >
> > > > > progress.
> > >
> > > > > -Kyle
> > > > >
> > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > >
> > > > > > On Thursday, May 27th, 2021 at 05:34, Kyle O'Donnell kyleo at 0b10.mx wrote:
>
> > > > > > > guest-get-fsinfo doesn't seem to work on older agents (centos6); I've
> > > > > > > found guest-get-time more universal.
> >
> > > > > > > Also, found this helpful thread on using monitor_scripts which is part of
> > > > > > > the VirtualDomain RA:
> > > > > > > https://linux-ha-dev.linux-ha.narkive.com/yxvySDA2/monitor-scripts-parameter-for-the-virtualdomain-ra-was-re-linux-ha-ocf-resource-agent-for-kvm
> >
> > > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > > >
> > > > > > > On Sunday, May 16th, 2021 at 22:49, Kyle O'Donnell kyleo at 0b10.mx wrote:
>
> > > > > > > > I am thinking about using the qemu-guest-agent to run one of the available
> > > > > > > > commands to determine the health of the OS inside
> > > > > > > >
> > > > > > > > virsh qemu-agent-command myvm --pretty '{"execute":"guest-get-fsinfo"}'
>
> > > > > > > https://qemu-project.gitlab.io/qemu/interop/qemu-ga-ref.html
> > > > > > >
> > > > > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> > > > > > >
> > > > > > > > On Thursday, May 13th, 2021 at 01:28, Andrei Borzenkov arvidjaar at gmail.com wrote:
> >
> > > > > > > > On 03.05.2021 09:48, Ulrich Windl wrote:
> > > > > > > >
> > > > > > > > > > > > > Ken Gaillot kgaillot at redhat.com wrote on 30.04.2021 at 16:57 in
> > > > > > > > > > > > > message 3acef4bc31923fb019619c713300444c2dcd354a.camel at redhat.com:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, 2021‑04‑30 at 11:00 +0100, lejeczek wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi guys
> > > > > > > > > > >
> > > > > > > > > > > I'd like to ask around for thoughts & suggestions on any
> > > > > > > > > > >
> > > > > > > > > > > semi/official ways to monitor VirtualDomain.
> > > > > > > > > > >
> > > > > > > > > > > Something beyond what included RA does ‑ such as actual
> > > > > > > > > > >
> > > > > > > > > > > health testing of and communication with VM's OS.
> > > > > > > > > > >
> > > > > > > > > > > many thanks, L.
> > > > > > > > > >
> > > > > > > > > > This use case led to a Pacemaker feature many moons ago ...
> > > > > > > > > >
> > > > > > > > > > > Pacemaker supports nagios plug-ins as a resource type (e.g.
> > > > > > > > > > > nagios:check_apache_status). These are service checks usually used with
> > > > > > > > > > > monitoring software such as nagios, icinga, etc.
> > > > > > > > > >
> > > > > > > > > > > If the service being monitored is inside a VirtualDomain, named vm1 for
> > > > > > > > > > > example, you can configure the nagios resource with the resource
> > > > > > > > > > > meta-attribute container="vm1". If the nagios check fails, Pacemaker will
> > > > > > > > > > > restart vm1.
> > > > > > > > >
> > > > > > > > > > "check fails" means WARNING, CRITICAL, or UNKNOWN? ;-)
> > > > > > > >
> > > > > > > > > switch (rc) {
> > > > > > > > >     case NAGIOS_STATE_OK:
> > > > > > > > >         return PCMK_OCF_OK;
> > > > > > > > >     case NAGIOS_INSUFFICIENT_PRIV:
> > > > > > > > >         return PCMK_OCF_INSUFFICIENT_PRIV;
> > > > > > > > >     case NAGIOS_NOT_INSTALLED:
> > > > > > > > >         return PCMK_OCF_NOT_INSTALLED;
> > > > > > > > >     case NAGIOS_STATE_WARNING:
> > > > > > > > >     case NAGIOS_STATE_CRITICAL:
> > > > > > > > >     case NAGIOS_STATE_UNKNOWN:
> > > > > > > > >     case NAGIOS_STATE_DEPENDENT:
> > > > > > > > >     default:
> > > > > > > > >         return PCMK_OCF_UNKNOWN_ERROR;
> > > > > > > > > }
> > > > > > > > > return PCMK_OCF_UNKNOWN_ERROR;
> > > > > > > >
> > > > > > > > Manage your subscription:
> > > > > > > >
> > > > > > > > https://lists.clusterlabs.org/mailman/listinfo/users
> > > > > > > >
> > > > > > > > ClusterLabs home: https://www.clusterlabs.org/
> > > > >
> > > > > Manage your subscription:
> > > > >
> > > > > https://lists.clusterlabs.org/mailman/listinfo/users
> > > > >
> > > > > ClusterLabs home: https://www.clusterlabs.org/
> > > >
> > > > Manage your subscription:
> > > >
> > > > https://lists.clusterlabs.org/mailman/listinfo/users
> > > >
> > > > ClusterLabs home: https://www.clusterlabs.org/