<div><br></div><div class="protonmail_signature_block protonmail_signature_block-empty"><div class="protonmail_signature_block-user protonmail_signature_block-empty"></div><div class="protonmail_signature_block-proton protonmail_signature_block-empty"><br></div></div><div class="protonmail_quote"><div>‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐<br></div><div>On Tuesday, October 26th, 2021 at 03:04, Klaus Wenninger <kwenning@redhat.com> wrote:<br></div></div><blockquote type="cite" class="protonmail_quote"><div dir="ltr"><div dir="ltr"><br></div><div><br></div><div class="gmail_quote"><div dir="ltr">On Mon, Oct 25, 2021 at 9:34 PM Kyle O'Donnell <<a target="_blank" rel="noopener noreferrer" href="mailto:kyleo@0b10.mx">kyleo@0b10.mx</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex; --darkreader-inline-border-left: #3e4446;" data-darkreader-inline-border-left=""><div>Finally got around to working on this.<br></div><div> <br></div><div> I spoke with someone on the #clusterlabs IRC channel who mentioned that the monitor_scripts param does indeed run periodically, at the interval of the monitor operation (op monitor timeout=? interval=?), not just during the "start" and "migrate_from" actions.<br></div><div> <br></div><div> The monitor_scripts param does not support scripts with command-line args, just a space-delimited list for running multiple scripts. This means that each VirtualDomain resource needs its own script to be able to define the ${DOMAIN_NAME}.  
I found that a bit annoying, so I created a symlink to a wrapper script, using the ${DOMAIN_NAME} as the first part of the filename followed by a separator for awk to split on:<br></div><div> <br></div></blockquote><div>The scripts called by the monitor operation should inherit the monitor's environment, so you should be able to use these variables.<br></div><div><br></div><div>Klaus <br></div></div></div></blockquote><div><br></div><div>Thanks!<br></div><div><br></div><div>I initially tried referencing the ${DOMAIN_NAME} variable, but that did not work. I also tried calling the function that sets the variable (VirtualDomain_getconfig), but that did not work either, presumably because neither the variable nor the function is exported to the child process.<br></div><div><br></div><div>After some debugging, it looks like the following environment variables are passed down from the parent script:<br></div><div>error output [ OCF_ROOT=/usr/lib/ocf ] ]<br></div><div>error output [ OCF_RESKEY_crm_feature_set=3.2.1 ]<br></div><div>error output [ HA_LOGFACILITY=daemon ]<br></div><div>error output [ PCMK_debug=0 ]<br></div><div>error output [ HA_debug=0 ]<br></div><div>error output [ PWD=/var/lib/pacemaker/cores ]<br></div><div>error output [ OCF_RESKEY_hypervisor=qemu:///system ]<br></div><div>error output [ HA_logfile=/var/log/pacemaker/pacemaker.log ] <br></div><div>error output [ HA_logfacility=daemon ]<br></div><div>error output [ OCF_EXIT_REASON_PREFIX=ocf-exit-reason: ]<br></div><div>error output [ OCF_RESOURCE_PROVIDER=heartbeat ]<br></div><div>error output [ PCMK_service=pacemaker-execd ]<br></div><div>error output [ PCMK_mcp=true ]<br></div><div>error output [ OCF_RESKEY_monitor_scripts=/path/to/myvmhostname____wrap_check.sh ]<br></div><div>error output [ OCF_RA_VERSION_MAJOR=1 ]<br></div><div>error output [ VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all ]<br></div><div>error output [ HA_cluster_type=corosync 
]<br></div><div>error output [ INVOCATION_ID=652062571c8f415a9a7a228c5ad77b20 ]<br></div><div>error output [ OCF_RESKEY_CRM_meta_interval=10000 ]<br></div><div>error output [ OCF_RESOURCE_INSTANCE=myvmhostname ]<br></div><div>error output [ HA_quorum_type=corosync ]<br></div><div>error output [ OCF_RA_VERSION_MINOR=0 ]<br></div><div>error output [ HA_mcp=true ]<br></div><div>error output [ OCF_RESKEY_config=/path/to/myvmhostname/myvmhostname.xml ]<br></div><div>error output [ PCMK_quorum_type=corosync ]<br></div><div>error output [ OCF_RESKEY_CRM_meta_name=monitor ]<br></div><div>error output [ OCF_RESKEY_migration_transport=ssh ]<br></div><div>error output [ SHLVL=1 ]<br></div><div>error output [ OCF_RESKEY_CRM_meta_on_node=node02 ]<br></div><div>error output [ PCMK_watchdog=false ]<br></div><div>error output [ PCMK_logfile=/var/log/pacemaker/pacemaker.log ]<br></div><div>error output [ OCF_RESKEY_CRM_meta_timeout=40000 ]<br></div><div>error output [ OCF_RESOURCE_TYPE=VirtualDomain ]<br></div><div>error output [ PCMK_logfacility=daemon ]<br></div><div>error output [ LC_ALL=C ]<br></div><div>error output [ HA_LOGFILE=/var/log/pacemaker/pacemaker.log ] <br></div><div>error output [ JOURNAL_STREAM=9:42440 ]<br></div><div>error output [ OCF_RESKEY_CRM_meta_on_node_uuid=2 ]<br></div><div>error output [ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb ]<br></div><div>error output [ OCF_RESKEY_force_stop=false ]<br></div><div>error output [ PCMK_cluster_type=corosync ]<br></div><div>error output [ _=/usr/bin/env ]<br></div><div><br></div><div>The most helpful variable is:<br></div><div>error output [ OCF_RESKEY_config=/path/to/myvmhostname/myvmhostname.xml ]<br></div><div><br></div><div>So I copied part of the "VirtualDomain_getconfig" function from the resource agent script to populate the variable in the same way:<br></div><div>DOMAIN_NAME=`egrep '[[:space:]]*<name>.*</name>[[:space:]]*$' ${OCF_RESKEY_config} 
2>/dev/null | sed -e 's/[[:space:]]*<name>\(.*\)<\/name>[[:space:]]*$/\1/'`<br></div><div><br></div><div>and now it's working without the hacky symlink <br></div><div><br></div><blockquote type="cite" class="protonmail_quote"><div dir="ltr"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex; --darkreader-inline-border-left: #3e4446;" data-darkreader-inline-border-left=""><div>ln -s /path/to/wrapper_script.sh /path/to/wrapper/myvmhostname_____wrapper_script.sh<br></div><div> <br></div><div> and in my wrapper_script.sh:<br></div><div> #!/bin/bash<br></div><div> DOMAIN_NAME=$(basename "$0" |awk -F'____' '{print $1}')<br></div><div> /path/to/myscript.sh -H ${DOMAIN_NAME} -C guest-get-time -l 25 -w 1<br></div><div> <br></div><div> (a bit hack-y but better than creating 1 script per vm resource and modifying it with the ${DOMAIN_NAME})<br></div><div> <br></div><div> Then creating the cluster resource:<br></div><div> pcs resource create myvmhostname VirtualDomain config="/path/to/myvmhostname/myvmhostname.xml" hypervisor="qemu:///system" migration_transport="ssh" force_stop="false" monitor_scripts="/path/to/wrapper/myvmhostname_____wrapper_script.sh" meta allow-migrate="true" target-role="Stopped" op migrate_from timeout=90s interval=0s op migrate_to timeout=120s interval=0s op monitor timeout=40s interval=10s op start timeout=90s interval=0s op stop timeout=90s interval=0s<br></div><div> <br></div><div> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐<br></div><div> <br></div><div> On Sunday, June 6th, 2021 at 16:56, Kyle O'Donnell <<a rel="noopener noreferrer" href="mailto:kyleo@0b10.mx" target="_blank">kyleo@0b10.mx</a>> wrote:<br></div><div> <br></div><div> > Let me know if there is a better approach to the following problem. 
When the virtual machine does not respond to a state query, I want the cluster to restart it.<br></div><div> ><br></div><div> > I could not find any useful docs for using the Nagios plugins. After reading the documentation about running a custom script via the "monitor" function in the RA, I determined that it would not meet my requirements, as it's only run on start and migrate (unless I read it incorrectly?).<br></div><div> ><br></div><div> > Here is what I did (I'm on Ubuntu 20.04):<br></div><div> ><br></div><div> > cp /usr/lib/ocf/resource.d/heartbeat/VirtualDomain /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain<br></div><div> ><br></div><div> > cp /usr/share/resource-agents/ocft/configs/VirtualDomain /usr/share/resource-agents/ocft/configs/MyVirtDomain<br></div><div> ><br></div><div> > sed -i 's/VirtualDomain/MyVirtDomain/g' /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain<br></div><div> ><br></div><div> > sed -i 's/VirtualDomain/MyVirtDomain/g' /usr/share/resource-agents/ocft/configs/MyVirtDomain<br></div><div> ><br></div><div> > Then I edited the function MyVirtDomain_status in /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain, adding the following to the status case running|paused|idle|blocked|"in shutdown")<br></div><div> ><br></div><div> > FROM<br></div><div> ><br></div><div> > running|paused|idle|blocked|"in shutdown")<br></div><div> ><br></div><div> > # running: domain is currently actively consuming cycles<br></div><div> ><br></div><div> > # paused: domain is paused (suspended)<br></div><div> ><br></div><div> > # idle: domain is running but idle<br></div><div> ><br></div><div> > # blocked: synonym for idle used by legacy Xen versions<br></div><div> ><br></div><div> > # in shutdown: the domain is in process of shutting down, but has not completely shutdown or crashed.<br></div><div> ><br></div><div> > ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."<br></div><div> ><br></div><div> > rc=$OCF_SUCCESS<br></div><div> ><br></div><div> > TO<br></div><div> 
><br></div><div> > running|paused|idle|blocked|"in shutdown")<br></div><div> ><br></div><div> > # running: domain is currently actively consuming cycles<br></div><div> ><br></div><div> > # paused: domain is paused (suspended)<br></div><div> ><br></div><div> > # idle: domain is running but idle<br></div><div> ><br></div><div> > # blocked: synonym for idle used by legacy Xen versions<br></div><div> ><br></div><div> > # in shutdown: the domain is in process of shutting down, but has not completely shutdown or crashed.<br></div><div> ><br></div><div> > custom_chk=$(/path/to/myscript.sh -H $DOMAIN_NAME -C guest-get-time -l 25 -w 1)<br></div><div> ><br></div><div> > custom_rc=$?<br></div><div> ><br></div><div> > if [ ${custom_rc} -eq 0 ]; then<br></div><div> ><br></div><div> > ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."<br></div><div> ><br></div><div> > rc=$OCF_SUCCESS<br></div><div> ><br></div><div> > else<br></div><div> ><br></div><div> > ocf_log debug "Virtual domain $DOMAIN_NAME is currently ${custom_chk}."<br></div><div> ><br></div><div> > rc=$OCF_ERR_GENERIC<br></div><div> ><br></div><div> > fi<br></div><div> ><br></div><div> > The custom script uses the qemu-guest-agent in my guest, passing the parameter to grab the guest's time (seems to be most universal [windows, centos6, ubuntu, centos 7]). 
Runs 25 loops, sleeps 1 second between iterations, exit 0 as soon as the agent responds with the time and exit 1 after the 25th loop, which are OCF_SUCCESS and OCF_ERR_GENERIC based on docs.<br></div><div> ><br></div><div> > /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1<br></div><div> > =========================================================<br></div><div> ><br></div><div> > [GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}<br></div><div> ><br></div><div> > or when its not responding:<br></div><div> ><br></div><div> > /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1<br></div><div> > =========================================================<br></div><div> ><br></div><div> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected<br></div><div> ><br></div><div> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected<br></div><div> ><br></div><div> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected<br></div><div> ><br></div><div> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected<br></div><div> ><br></div><div> > ... 
(exits after the 25th loop, or as soon as the agent responds)<br></div><div> ><br></div><div> > [GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}<br></div><div> ><br></div><div> > and when the VM isn't running:<br></div><div> ><br></div><div> > /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1<br></div><div> > =========================================================<br></div><div> ><br></div><div> > [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: failed to get domain 'myvm'<br></div><div> ><br></div><div> > I updated my test VM to use the new RA and raised the status timeout from the default of 30s to 40s, just in case.<br></div><div> ><br></div><div> > I'd like to be able to update the parameters to myscript.sh via crm configure edit at some point, but will figure that out later...<br></div><div> ><br></div><div> > My test:<br></div><div> ><br></div><div> > Reboot the VM from within the OS and hit escape so that I enter the boot mode prompt. After ~30 seconds the cluster decides the resource is having a problem, marks it as failed, and restarts the virtual machine (on the same node, which in my case is desirable). Once the guest is back up and responding, the cluster reports the VM as Started.<br></div><div> ><br></div><div> > I still have plenty more testing to do and will keep the list posted on progress.<br></div><div> ><br></div><div> > -Kyle<br></div><div> ><br></div><div> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐<br></div><div> ><br></div><div> > On Thursday, May 27th, 2021 at 05:34, Kyle O'Donnell <a rel="noopener noreferrer" href="mailto:kyleo@0b10.mx" target="_blank">kyleo@0b10.mx</a> wrote:<br></div><div> ><br></div><div> > > guest-get-fsinfo doesn't seem to work on older agents (CentOS 6); I've found guest-get-time to be more universal.<br></div><div> > ><br></div><div> > > Also, I found this helpful thread on using monitor_scripts, which is part of the VirtualDomain RA:<br></div><div> > ><br></div><div> > > <a 
href="https://linux-ha-dev.linux-ha.narkive.com/yxvySDA2/monitor-scripts-parameter-for-the-virtualdomain-ra-was-re-linux-ha-ocf-resource-agent-for-kvm" rel="noopener noreferrer" target="_blank">https://linux-ha-dev.linux-ha.narkive.com/yxvySDA2/monitor-scripts-parameter-for-the-virtualdomain-ra-was-re-linux-ha-ocf-resource-agent-for-kvm</a><br></div><div> > ><br></div><div> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐<br></div><div> > ><br></div><div> > > On Sunday, May 16th, 2021 at 22:49, Kyle O'Donnell <a rel="noopener noreferrer" href="mailto:kyleo@0b10.mx" target="_blank">kyleo@0b10.mx</a> wrote:<br></div><div> > ><br></div><div> > > > I am thinking about using the qemu-guest-agent to run one of the available commands to determine the health of the OS inside<br></div><div> > > ><br></div><div> > > > virsh qemu-agent-command myvm --pretty '{"execute":"guest-get-fsinfo"}'<br></div><div> > > ><br></div><div> > > > <a href="https://qemu-project.gitlab.io/qemu/interop/qemu-ga-ref.html" rel="noopener noreferrer" target="_blank">https://qemu-project.gitlab.io/qemu/interop/qemu-ga-ref.html</a><br></div><div> > > ><br></div><div> > > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐<br></div><div> > > ><br></div><div> > > > On Thursday, May 13th, 2021 at 01:28, Andrei Borzenkov <a rel="noopener noreferrer" href="mailto:arvidjaar@gmail.com" target="_blank">arvidjaar@gmail.com</a> wrote:<br></div><div> > > ><br></div><div> > > > > On 03.05.2021 09:48, Ulrich Windl wrote:<br></div><div> > > > ><br></div><div> > > > > > > > > Ken Gaillot <a rel="noopener noreferrer" href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a> schrieb am 30.04.2021 um 16:57 in<br></div><div> > > > > > > > ><br></div><div> > > > > > > > > Nachricht<br></div><div> > > > > > > > ><br></div><div> > > > > > > > > <a rel="noopener noreferrer" href="mailto:3acef4bc31923fb019619c713300444c2dcd354a.camel@redhat.com" target="_blank">3acef4bc31923fb019619c713300444c2dcd354a.camel@redhat.com</a>:<br></div><div> 
> > > > > > > ><br></div><div> > > > > > > > > On Fri, 2021‑04‑30 at 11:00 +0100, lejeczek wrote:<br></div><div> > > > > > ><br></div><div> > > > > > > > Hi guys<br></div><div> > > > > > > ><br></div><div> > > > > > > > I'd like to ask around for thoughts & suggestions on any<br></div><div> > > > > > > ><br></div><div> > > > > > > > semi/official ways to monitor VirtualDomain.<br></div><div> > > > > > > ><br></div><div> > > > > > > > Something beyond what the included RA does ‑ such as actual<br></div><div> > > > > > > ><br></div><div> > > > > > > > health testing of and communication with VM's OS.<br></div><div> > > > > > > ><br></div><div> > > > > > > > many thanks, L.<br></div><div> > > > > > ><br></div><div> > > > > > > This use case led to a Pacemaker feature many moons ago ...<br></div><div> > > > > > ><br></div><div> > > > > > > Pacemaker supports nagios plug‑ins as a resource type (e.g.<br></div><div> > > > > > ><br></div><div> > > > > > > nagios:check_apache_status). These are service checks usually used with<br></div><div> > > > > > ><br></div><div> > > > > > > monitoring software such as nagios, icinga, etc.<br></div><div> > > > > > ><br></div><div> > > > > > > If the service being monitored is inside a VirtualDomain, named vm1 for<br></div><div> > > > > > ><br></div><div> > > > > > > example, you can configure the nagios resource with the resource meta‑<br></div><div> > > > > > ><br></div><div> > > > > > > attribute container="vm1". If the nagios check fails, Pacemaker will<br></div><div> > > > > > ><br></div><div> > > > > > > restart vm1.<br></div><div> > > > > ><br></div><div> > > > > > "check fails" means WARNING, CRITICAL, or UNKNOWN? 
;-)<br></div><div> > > > ><br></div><div> > > > > switch (rc) {<br></div><div> > > > ><br></div><div> > > > > case NAGIOS_STATE_OK:<br></div><div> > > > ><br></div><div> > > > > return PCMK_OCF_OK;<br></div><div> > > > ><br></div><div> > > > > case NAGIOS_INSUFFICIENT_PRIV:<br></div><div> > > > ><br></div><div> > > > > return PCMK_OCF_INSUFFICIENT_PRIV;<br></div><div> > > > ><br></div><div> > > > > case NAGIOS_NOT_INSTALLED:<br></div><div> > > > ><br></div><div> > > > > return PCMK_OCF_NOT_INSTALLED;<br></div><div> > > > ><br></div><div> > > > > case NAGIOS_STATE_WARNING:<br></div><div> > > > ><br></div><div> > > > > case NAGIOS_STATE_CRITICAL:<br></div><div> > > > ><br></div><div> > > > > case NAGIOS_STATE_UNKNOWN:<br></div><div> > > > ><br></div><div> > > > > case NAGIOS_STATE_DEPENDENT:<br></div><div> > > > ><br></div><div> > > > > default:<br></div><div> > > > ><br></div><div> > > > > return PCMK_OCF_UNKNOWN_ERROR;<br></div><div> > > > ><br></div><div> > > > > }<br></div><div> > > > ><br></div><div> > > > > return PCMK_OCF_UNKNOWN_ERROR;<br></div><div> > > > ><br></div><div> > > > > Manage your subscription:<br></div><div> > > > ><br></div><div> > > > > <a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noopener noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br></div><div> > > > ><br></div><div> > > > > ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noopener noreferrer" target="_blank">https://www.clusterlabs.org/</a><br></div><div> ><br></div><div> > Manage your subscription:<br></div><div> ><br></div><div> > <a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noopener noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br></div><div> ><br></div><div> > ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noopener noreferrer" target="_blank">https://www.clusterlabs.org/</a><br></div><div> 
<br></div></blockquote></div></div></blockquote><div><br></div>
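<div>For anyone wanting to reuse the approach from the reply above (deriving the domain name from OCF_RESKEY_config inside the script listed in monitor_scripts), here is a minimal sketch of such a wrapper. The function names domain_name_from_config and run_check and the CHECK_SCRIPT override are hypothetical names chosen for illustration; /path/to/myscript.sh and its arguments are the ones used earlier in this thread. It uses grep -E in place of egrep and adds head -n 1, which is an assumption on my part to guard against more than one matching line.<br></div>

```shell
#!/bin/sh
# Sketch of a monitor_scripts wrapper. Names below are illustrative only.

# Health-check script from the thread; overridable for testing (assumption).
CHECK_SCRIPT=${CHECK_SCRIPT:-/path/to/myscript.sh}

# Print the libvirt domain name found in the XML file given as $1.
# Same pattern as the pipeline quoted above, written with grep -E.
domain_name_from_config() {
    grep -E '[[:space:]]*<name>.*</name>[[:space:]]*$' "$1" 2>/dev/null \
        | sed -e 's/[[:space:]]*<name>\(.*\)<\/name>[[:space:]]*$/\1/' \
        | head -n 1   # keep only the first match (added safeguard)
}

# Run the check for the domain described by OCF_RESKEY_config.
# Returns 1 if the config path or the domain name cannot be determined.
run_check() {
    [ -n "${OCF_RESKEY_config:-}" ] || return 1
    DOMAIN_NAME=$(domain_name_from_config "$OCF_RESKEY_config")
    [ -n "$DOMAIN_NAME" ] || return 1
    "$CHECK_SCRIPT" -H "$DOMAIN_NAME" -C guest-get-time -l 25 -w 1
}
```

<div>A real wrapper would end with a call to run_check, so that the script's exit status is that of the health check. Because the name comes from the environment Pacemaker passes down, the same file can be referenced by every VirtualDomain resource without a per-VM symlink.<br></div>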