[ClusterLabs] Antw: Re: Antw: [EXT] Re: VirtualDomain & "deeper" monitors ‑ what/how?

Thu Oct 28 03:03:36 EDT 2021

Hi!

I wonder: Shouldn't "OCF_RESOURCE_INSTANCE" help you to identify what is going
to be monitored?
(Reasonable naming assumed ;-))
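
A minimal sketch of what I mean, assuming the resource is named after the libvirt domain (as it is in the listing below). resolve_domain_name is a made-up helper name for illustration, not something the RA provides:

```shell
#!/bin/sh
# Hypothetical helper for a script listed in monitor_scripts.
# Assumption: the Pacemaker resource ID equals the libvirt domain name.

# Print the name of the resource currently being monitored, or fail
# if the script was not started by the cluster.
resolve_domain_name() {
    if [ -n "${OCF_RESOURCE_INSTANCE:-}" ]; then
        printf '%s\n' "$OCF_RESOURCE_INSTANCE"
        return 0
    fi
    echo "OCF_RESOURCE_INSTANCE is not set" >&2
    return 1
}
```

A monitor script could then call DOMAIN_NAME=$(resolve_domain_name) || exit 1 before running its check.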

Regards,
Ulrich

>>> Kyle O'Donnell <kyleo at 0b10.mx> wrote on 26.10.2021 at 13:53 in message
<uNHregOAnWaFxn5xMCQhuxDLxi-E_norRLuSuhxjZTewjFtRwmq_hVWHwJN4Lo4ybtKSmwYqAbkk9zf57Gp0Hmww903ZC09_P2QrHeneW0=@0b10.mx>:
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Tuesday, October 26th, 2021 at 03:04, Klaus Wenninger <kwenning at redhat.com> wrote:
> 
>> On Mon, Oct 25, 2021 at 9:34 PM Kyle O'Donnell <kyleo at 0b10.mx> wrote:
>>
>>> Finally got around to working on this.
>>>
>>> I spoke with someone on the #clusterlabs IRC channel who mentioned that the
>>> monitor_scripts param does indeed run at some frequency (op monitor timeout=?
>>> interval=?), not just during the "start" and "migrate_from" actions.
>>>
>>> The monitor_scripts param does not support scripts with command line args,
>>> just a space delimited list for running multiple scripts. This means that
>>> each VirtualDomain resource needs its own script to be able to define the
>>> ${DOMAIN_NAME}. I found that a bit annoying so I created a symlink to a
>>> wrapper script using the ${DOMAIN_NAME} as the first part of the filename and
>>> a separator for awk:
>>
>> The scripts being called by the monitor operation should inherit the
>> environment from the monitor so that you should be able to use these
>> variables.
>>
>> Klaus
> 
> Thanks!
> 
> I tried referencing the ${DOMAIN_NAME} variable initially but that did not
> work. I also tried running the function that creates the variable
> (VirtualDomain_getconfig), but that did not work either.
> 
> After some debugging it looks like the following variables are available 
> from the parent script:
> error output [ OCF_ROOT=/usr/lib/ocf ] ]
> error output [ OCF_RESKEY_crm_feature_set=3.2.1 ]
> error output [ HA_LOGFACILITY=daemon ]
> error output [ PCMK_debug=0 ]
> error output [ HA_debug=0 ]
> error output [ PWD=/var/lib/pacemaker/cores ]
> error output [ OCF_RESKEY_hypervisor=qemu:///system ]
> error output [ HA_logfile=/var/log/pacemaker/pacemaker.log ]
> error output [ HA_logfacility=daemon ]
> error output [ OCF_EXIT_REASON_PREFIX=ocf-exit-reason: ]
> error output [ OCF_RESOURCE_PROVIDER=heartbeat ]
> error output [ PCMK_service=pacemaker-execd ]
> error output [ PCMK_mcp=true ]
> error output [ OCF_RESKEY_monitor_scripts=/path/to/myvmhostname____wrap_check.sh ]
> error output [ OCF_RA_VERSION_MAJOR=1 ]
> error output [ VALGRIND_OPTS=--leak-check=full --trace-children=no --vgdb=no --num-callers=25 --log-file=/var/lib/pacemaker/valgrind-%p --suppressions=/usr/share/pacemaker/tests/valgrind-pcmk.suppressions --gen-suppressions=all ]
> error output [ HA_cluster_type=corosync ]
> error output [ INVOCATION_ID=652062571c8f415a9a7a228c5ad77b20 ]
> error output [ OCF_RESKEY_CRM_meta_interval=10000 ]
> error output [ OCF_RESOURCE_INSTANCE=myvmhostname ]
> error output [ HA_quorum_type=corosync ]
> error output [ OCF_RA_VERSION_MINOR=0 ]
> error output [ HA_mcp=true ]
> error output [ OCF_RESKEY_config=/path/to/myvmhostname/myvmhostname.xml ]
> error output [ PCMK_quorum_type=corosync ]
> error output [ OCF_RESKEY_CRM_meta_name=monitor ]
> error output [ OCF_RESKEY_migration_transport=ssh ]
> error output [ SHLVL=1 ]
> error output [ OCF_RESKEY_CRM_meta_on_node=node02 ]
> error output [ PCMK_watchdog=false ]
> error output [ PCMK_logfile=/var/log/pacemaker/pacemaker.log ]
> error output [ OCF_RESKEY_CRM_meta_timeout=40000 ]
> error output [ OCF_RESOURCE_TYPE=VirtualDomain ]
> error output [ PCMK_logfacility=daemon ]
> error output [ LC_ALL=C ]
> error output [ HA_LOGFILE=/var/log/pacemaker/pacemaker.log ]
> error output [ JOURNAL_STREAM=9:42440 ]
> error output [ OCF_RESKEY_CRM_meta_on_node_uuid=2 ]
> error output [ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/ucb ]
> error output [ OCF_RESKEY_force_stop=false ]
> error output [ PCMK_cluster_type=corosync ]
> error output [ _=/usr/bin/env ]
> 
> The most helpful variable is:
> error output [ OCF_RESKEY_config=/path/to/myvmhostname/myvmhostname.xml ]
> 
> So I copied part of the "VirtualDomain_getconfig" function from the resource
> script to populate the variable in the same way:
> DOMAIN_NAME=`egrep '[[:space:]]*<name>.*</name>[[:space:]]*$' ${OCF_RESKEY_config} 2>/dev/null | sed -e 's/[[:space:]]*<name>\(.*\)<\/name>[[:space:]]*$/\1/'`
> 
> and now it's working without the hacky symlink
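
For reference, a slightly more defensive variant of that extraction might look like the sketch below. It assumes the config file has its <name> element on a single line, which is what the pattern above already relies on. domain_name_from_config is a hypothetical helper name, not a function of the RA:

```shell
#!/bin/sh
# Hypothetical helper: print the first <name> element found in a libvirt
# domain XML file. Assumption: the element sits on a line of its own.
domain_name_from_config() {
    dnfc_file=$1
    # Refuse to continue if the config is missing or unreadable.
    [ -r "$dnfc_file" ] || return 1
    dnfc_name=$(sed -n 's/^[[:space:]]*<name>\(.*\)<\/name>[[:space:]]*$/\1/p' "$dnfc_file" | head -n 1)
    # An empty result means no usable <name> element was found.
    [ -n "$dnfc_name" ] || return 1
    printf '%s\n' "$dnfc_name"
}
```

In a monitor script this would be called as DOMAIN_NAME=$(domain_name_from_config "$OCF_RESKEY_config") || exit 1, so that a missing config makes the check fail instead of passing an empty name along.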
> 
>>> ln -s /path/to/wrapper_script.sh /path/to/wrapper/myvmhostname_____wrapper_script.sh
>>>
>>> and in my wrapper_script.sh:
>>> #!/bin/bash
>>> DOMAIN_NAME=$(basename "$0" |awk -F'____' '{print $1}')
>>> /path/to/myscript.sh -H ${DOMAIN_NAME} -C guest-get-time -l 25 -w 1
>>>
>>> (a bit hack-y but better than creating 1 script per vm resource and
>>> modifying it with the ${DOMAIN_NAME})
>>>
>>> Then creating the cluster resource:
>>> pcs resource create myvmhostname VirtualDomain \
>>>   config="/path/to/myvmhostname/myvmhostname.xml" hypervisor="qemu:///system" \
>>>   migration_transport="ssh" force_stop="false" \
>>>   monitor_scripts="/path/to/wrapper/myvmhostname_____wrapper_script.sh" \
>>>   meta allow-migrate="true" target-role="Stopped" \
>>>   op migrate_from timeout=90s interval=0s \
>>>   op migrate_to timeout=120s interval=0s \
>>>   op monitor timeout=40s interval=10s \
>>>   op start timeout=90s interval=0s \
>>>   op stop timeout=90s interval=0s
>>>
>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>
>>> On Sunday, June 6th, 2021 at 16:56, Kyle O'Donnell <kyleo at 0b10.mx> wrote:
>>>
>>>> Let me know if there is a better approach to the following problem. When the
>>>> virtual machine does not respond to a state query I want the cluster to kick
>>>> it.
>>>>
>>>> I could not find any useful docs for using the nagios plugins. After reading
>>>> the documentation about running a custom script via the "monitor" function in
>>>> the RA I determined that would not meet my requirements as it's only run on
>>>> start and migrate (unless I read it incorrectly?).
>>>>
>>>> Here is what I did (I'm on Ubuntu 20.04):
>>>>
>>>> cp /usr/lib/ocf/resource.d/heartbeat/VirtualDomain /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain
>>>>
>>>> cp /usr/share/resource-agents/ocft/configs/VirtualDomain /usr/share/resource-agents/ocft/configs/MyVirtDomain
>>>>
>>>> sed -i 's/VirtualDomain/MyVirtDomain/g' /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain
>>>>
>>>> sed -i 's/VirtualDomain/MyVirtDomain/g' /usr/share/resource-agents/ocft/configs/MyVirtDomain
>>>>
>>>> Then I edited the function MyVirtDomain_status in
>>>> /usr/lib/ocf/resource.d/heartbeat/MyVirtDomain, adding the following to the
>>>> status case running|paused|idle|blocked|"in shutdown")
>>>>
>>>> FROM
>>>>
>>>> running|paused|idle|blocked|"in shutdown")
>>>>     # running: domain is currently actively consuming cycles
>>>>     # paused: domain is paused (suspended)
>>>>     # idle: domain is running but idle
>>>>     # blocked: synonym for idle used by legacy Xen versions
>>>>     # in shutdown: the domain is in process of shutting down, but has not
>>>>     # completely shutdown or crashed.
>>>>     ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
>>>>     rc=$OCF_SUCCESS
>>>>
>>>> TO
>>>>
>>>> running|paused|idle|blocked|"in shutdown")
>>>>     # running: domain is currently actively consuming cycles
>>>>     # paused: domain is paused (suspended)
>>>>     # idle: domain is running but idle
>>>>     # blocked: synonym for idle used by legacy Xen versions
>>>>     # in shutdown: the domain is in process of shutting down, but has not
>>>>     # completely shutdown or crashed.
>>>>     custom_chk=$(/path/to/myscript.sh -H $DOMAIN_NAME -C guest-get-time -l 25 -w 1)
>>>>     custom_rc=$?
>>>>     if [ ${custom_rc} -eq 0 ]; then
>>>>         ocf_log debug "Virtual domain $DOMAIN_NAME is currently $status."
>>>>         rc=$OCF_SUCCESS
>>>>     else
>>>>         ocf_log debug "Virtual domain $DOMAIN_NAME is currently ${custom_chk}."
>>>>         rc=$OCF_ERR_GENERIC
>>>>     fi
>>>>
>>>> The custom script uses the qemu-guest-agent in my guest, passing the
>>>> parameter to grab the guest's time (seems to be most universal [windows,
>>>> centos6, ubuntu, centos 7]). It runs up to 25 loops, sleeping 1 second between
>>>> iterations; it exits 0 as soon as the agent responds with the time and exits 1
>>>> after the 25th loop, which are OCF_SUCCESS and OCF_ERR_GENERIC based on the
>>>> docs.
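
Since myscript.sh itself was not posted, the following is only a sketch of the loop described above, not the actual script. The function names (probe_guest_agent, poll_guest_agent) and the argument order are assumptions made for illustration; the only external command used is virsh qemu-agent-command with the guest-get-time request mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical stand-in for the polling part of myscript.sh.

# Ask the guest agent for the guest's time. Succeeds only if the agent answers.
probe_guest_agent() {
    virsh qemu-agent-command "$1" '{"execute":"guest-get-time"}'
}

# poll_guest_agent DOMAIN [LOOPS] [WAIT_SECONDS]
# Returns 0 as soon as one probe succeeds, 1 after LOOPS failed probes.
poll_guest_agent() {
    pga_domain=$1
    pga_loops=${2:-25}
    pga_wait=${3:-1}
    pga_attempt=1
    while [ "$pga_attempt" -le "$pga_loops" ]; do
        if pga_output=$(probe_guest_agent "$pga_domain" 2>&1); then
            echo "[GOOD] - $pga_domain guest-get-time output: $pga_output"
            return 0
        fi
        echo "[BAD] - $pga_domain guest-get-time output: $pga_output"
        # No need to sleep after the final failed attempt.
        if [ "$pga_attempt" -lt "$pga_loops" ]; then
            sleep "$pga_wait"
        fi
        pga_attempt=$((pga_attempt + 1))
    done
    return 1
}
```

A script would end with poll_guest_agent "$DOMAIN_NAME" 25 1 so that its exit status (0 or 1) is what the RA sees. With 25 loops and a 1 second wait the worst case is at least 24 seconds plus the time virsh takes per attempt, so the monitor timeout has to leave room for that.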
>>>>
>>>> /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
>>>> =========================================================
>>>>
>>>> [GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}
>>>>
>>>> or when it's not responding:
>>>>
>>>> /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
>>>> =========================================================
>>>>
>>>> [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
>>>>
>>>> [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
>>>>
>>>> [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
>>>>
>>>> [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: Guest agent is not responding: QEMU guest agent is not connected
>>>>
>>>> ... (exits after 25th or
>>>>
>>>> [GOOD] - myvm virsh qemu-agent-command guest-get-time output: {"return":1623011582178375000}
>>>>
>>>> and when the VM isn't running:
>>>>
>>>> /path/to/myscript.sh -H myvm -C guest-get-time -l 25 -w 1
>>>> =========================================================
>>>>
>>>> [BAD] - myvm virsh qemu-agent-command guest-get-time output: error: failed to get domain 'myvm'
>>>>
>>>> I updated my test VM to use the new RA, and raised the status timeout to 40s
>>>> from the default of 30s just in case.
>>>>
>>>> I'd like to be able to update the parameters to myscript.sh via crm
>>>> configure edit at some point, but will figure that out later...
>>>>
>>>> My test:
>>>>
>>>> Reboot the VM from within the OS and hit escape to enter the boot mode
>>>> prompt. After ~30 seconds the cluster decides the resource is having a
>>>> problem, marks it as failed, and restarts the virtual machine (on the same
>>>> node, which in my case is desirable). Once the guest is back up and
>>>> responding, the cluster reports the VM as Started.
>>>>
>>>> I still have plenty more testing to do and will keep the list posted on
>>>> progress.
>>>>
>>>> -Kyle
>>>>
>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>
>>>> On Thursday, May 27th, 2021 at 05:34, Kyle O'Donnell kyleo at 0b10.mx wrote:
>>>>
>>>> > guest-get-fsinfo doesn't seem to work on older agents (centos6). I've found
>>>> > guest-get-time more universal.
>>>> >
>>>> > Also, found this helpful thread on using monitor_scripts, which is part of
>>>> > the VirtualDomain RA:
>>>> >
>>>> > https://linux-ha-dev.linux-ha.narkive.com/yxvySDA2/monitor-scripts-parameter-for-the-virtualdomain-ra-was-re-linux-ha-ocf-resource-agent-for-kvm
>>>> >
>>>> > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>> >
>>>> > On Sunday, May 16th, 2021 at 22:49, Kyle O'Donnell kyleo at 0b10.mx wrote:
>>>> >
>>>> > > I am thinking about using the qemu-guest-agent to run one of the available
>>>> > > commands to determine the health of the OS inside
>>>> > >
>>>> > > virsh qemu-agent-command myvm --pretty '{"execute":"guest-get-fsinfo"}'
>>>> > >
>>>> > > https://qemu-project.gitlab.io/qemu/interop/qemu-ga-ref.html 
>>>> > >
>>>> > > ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>> > >
>>>> > > On Thursday, May 13th, 2021 at 01:28, Andrei Borzenkov arvidjaar at gmail.com wrote:
>>>> > >
>>>> > > > On 03.05.2021 09:48, Ulrich Windl wrote:
>>>> > > >
>>>> > > > > > > > Ken Gaillot kgaillot at redhat.com wrote on 30.04.2021 at 16:57 in
>>>> > > > > > > > message 3acef4bc31923fb019619c713300444c2dcd354a.camel at redhat.com:
>>>> > > > > > > >
>>>> > > > > > > > On Fri, 2021‑04‑30 at 11:00 +0100, lejeczek wrote:
>>>> > > > > >
>>>> > > > > > > Hi guys
>>>> > > > > > >
>>>> > > > > > > I'd like to ask around for thoughts & suggestions on any
>>>> > > > > > > semi/official ways to monitor VirtualDomain.
>>>> > > > > > > Something beyond what included RA does ‑ such as actual
>>>> > > > > > > health testing of and communication with VM's OS.
>>>> > > > > > >
>>>> > > > > > > many thanks, L.
>>>> > > > > >
>>>> > > > > > This use case led to a Pacemaker feature many moons ago ...
>>>> > > > > >
>>>> > > > > > Pacemaker supports nagios plug‑ins as a resource type (e.g.
>>>> > > > > > nagios:check_apache_status). These are service checks usually used with
>>>> > > > > > monitoring software such as nagios, icinga, etc.
>>>> > > > > >
>>>> > > > > > If the service being monitored is inside a VirtualDomain, named vm1 for
>>>> > > > > > example, you can configure the nagios resource with the resource
>>>> > > > > > meta‑attribute container="vm1". If the nagios check fails, Pacemaker will
>>>> > > > > > restart vm1.
>>>> > > > >
>>>> > > > > "check fails" means WARNING, CRITICAL, or UNKNOWN? ;-)
>>>> > > >
>>>> > > > switch (rc) {
>>>> > > >     case NAGIOS_STATE_OK:
>>>> > > >         return PCMK_OCF_OK;
>>>> > > >     case NAGIOS_INSUFFICIENT_PRIV:
>>>> > > >         return PCMK_OCF_INSUFFICIENT_PRIV;
>>>> > > >     case NAGIOS_NOT_INSTALLED:
>>>> > > >         return PCMK_OCF_NOT_INSTALLED;
>>>> > > >     case NAGIOS_STATE_WARNING:
>>>> > > >     case NAGIOS_STATE_CRITICAL:
>>>> > > >     case NAGIOS_STATE_UNKNOWN:
>>>> > > >     case NAGIOS_STATE_DEPENDENT:
>>>> > > >     default:
>>>> > > >         return PCMK_OCF_UNKNOWN_ERROR;
>>>> > > > }
>>>> > > > return PCMK_OCF_UNKNOWN_ERROR;
>>>> > > >
>>>> > > > Manage your subscription:
>>>> > > >
>>>> > > > https://lists.clusterlabs.org/mailman/listinfo/users 
>>>> > > >
>>>> > > > ClusterLabs home: https://www.clusterlabs.org/ 
>>>>