[ClusterLabs] Antw: [EXT] VirtualDomain - started but... not really
lejeczek
peljasz at yahoo.co.uk
Tue Dec 14 09:48:36 EST 2021
> Hi!
>
> My guess is that you checked the corresponding logs already; why not show them here?
> I can imagine that the VMs die rather early after start.
>
> Regards,
> Ulrich
>
>>>> lejeczek via Users <users at clusterlabs.org> schrieb am 10.12.2021 um 17:33 in
> Nachricht <df8eac8f-a58e-28e5-53b5-73eb1fe432b2 at yahoo.co.uk>:
>> Hi guys.
>>
>> I quite often.. well, too frequently in my mind, see a VM
>> of which the cluster says:
>> -> $ pcs resource status | grep -v disabled
>> ...
>> * c8kubermaster2 (ocf::heartbeat:VirtualDomain):
>> Started dzien
>> ..
>>
>> but that is false, and the cluster itself confirms it:
>> -> $ pcs resource debug-monitor c8kubermaster2
>> crm_resource: Error performing operation: Not running
>> Operation force-check for c8kubermaster2
>> (ocf:heartbeat:VirtualDomain) returned: 'not running' (7)
>>
>> What might the issue be here, and how best to troubleshoot it?
>>
>> -> $ pcs resource config c8kubermaster2
>> Resource: c8kubermaster2 (class=ocf provider=heartbeat
>> type=VirtualDomain)
>> Attributes:
>> config=/var/lib/pacemaker/conf.d/c8kubermaster2.xml
>> hypervisor=qemu:///system migration_transport=ssh
>> Meta Attrs: allow-migrate=true failure-timeout=30s
>> Operations: migrate_from interval=0s timeout=180s
>> (c8kubermaster2-migrate_from-interval-0s)
>> migrate_to interval=0s timeout=180s
>> (c8kubermaster2-migrate_to-interval-0s)
>> monitor interval=30s
>> (c8kubermaster2-monitor-interval-30s)
>> start interval=0s timeout=90s
>> (c8kubermaster2-start-interval-0s)
>> stop interval=0s timeout=90s
>> (c8kubermaster2-stop-interval-0s)
>>
>> many thanks, L.
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
Not much in the logs that I could see (which is probably
why the cluster decides the resource is okay).
Isn't that exactly what the resource's monitor is for - to
check the state of the resource? Whether it dies early or
late should not matter.
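One way to see what the monitor actually returns, outside the cluster, is to invoke the resource agent by hand with the OCF environment set up - roughly what `pcs resource debug-monitor` does. A minimal sketch, using the paths and parameters from the resource config above; the `ocf_rc_name` helper is illustrative, not part of any agent:

```shell
# Illustrative helper: name the OCF monitor exit codes relevant here.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS (running)" ;;
        7) echo "OCF_NOT_RUNNING" ;;
        *) echo "other/error ($1)" ;;
    esac
}

# Run the VirtualDomain agent's monitor action directly, with the same
# parameters the cluster passes (see the resource config in this thread).
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_config=/var/lib/pacemaker/conf.d/c8kubermaster2.xml
export OCF_RESKEY_hypervisor=qemu:///system
/usr/lib/ocf/resource.d/heartbeat/VirtualDomain monitor
ocf_rc_name $?
```

Running this on the node that supposedly hosts the VM shows whether the agent itself agrees with the cluster's "Started" claim.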
What suffices to "fix" such a resource false positive: I do
a quick disable/enable of the resource - or, as in this very
instance, rpm updates which restarted the node.
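Rather than a full disable/enable bounce, asking the cluster to re-probe the resource's real state may suffice. A rough sketch, assuming a reasonably recent pcs (`pcs resource refresh`); on older pcs, `pcs resource cleanup` plays a similar role and also clears the failure history, which matters with failure-timeout=30s set. The wrapper function and its argument guard are illustrative:

```shell
# Illustrative wrapper: force a re-probe of a resource's actual state
# instead of bouncing it with disable/enable.
reprobe() {
    rsc=$1
    # Refuse to run without a resource name (illustrative guard).
    [ -n "$rsc" ] || { echo "usage: reprobe <resource>" >&2; return 2; }
    pcs resource refresh "$rsc"
}

# Usage (on a cluster node):
# reprobe c8kubermaster2
```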
Again, how might the cluster think the resource is okay
while debug-monitor shows it's not?
I just do not know how to reproduce this in a controlled,
orderly manner.
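Since it is hard to reproduce on demand, one could instead try to catch the divergence as it happens: poll the hypervisor's view on the node that Pacemaker claims is hosting the VM, and log whenever it disagrees with the cluster's "Started". A rough sketch; the state-classifier helper is illustrative, and whether "paused" should count as running depends on how the agent is configured:

```shell
# Illustrative: does a "virsh domstate" answer count as the domain existing?
is_running_state() {
    case "$1" in
        running|paused) return 0 ;;   # hypervisor knows about the domain
        *) return 1 ;;
    esac
}

# Poll-loop sketch (run on the node Pacemaker says is hosting the VM):
watch_vm() {
    dom=$1
    while :; do
        state=$(virsh --connect qemu:///system domstate "$dom" 2>/dev/null)
        if ! is_running_state "$state"; then
            logger -t vm-watch \
                "divergence: cluster says Started, virsh says '${state:-gone}' for $dom"
        fi
        sleep 30
    done
}

# Usage: watch_vm c8kubermaster2 &
```

A syslog entry from this loop, timestamped next to the Pacemaker logs, would at least show when the VM actually went away.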
thanks, L