[ClusterLabs] Tuchanka

Олег Самойлов splarv at ya.ru
Mon Oct 5 07:33:20 EDT 2020



> On 2 Oct 2020, at 17:19, Klaus Wenninger <kwenning at redhat.com> wrote:
>   
>>> My English is poor, so I'll try to find other words. My primary and main task
>>> was to create a prototype of an automatic deployment system. So I used only
>>> the same techniques that will be used on the real hardware servers: RedHat DVD
>>> image + kickstart. And to test such deployment too. That's why I do not use
>>> any special image for virtual machines.
>> How exactly is using a vagrant box you built yourself different from
>> VirtualBox, where you clone (I suppose) an existing VM you built?

It looks like we still cannot understand each other. :) I do not clone any VM.
As I already said, my primary and main task was to create and test a working prototype
of an automatic deployment system for real (hardware) servers. So for deployment I do not use
any technique specific to VMs, neither VirtualBox nor Vagrant, only techniques suitable for real servers: the standard RedHat DVD image,
RedHat KickStart, and SSH. This deployment was also tested in a loop. By the way, I ran into a curious bug with pacemaker in the process.
 
If I assign only 1 CPU (100%) per VM, pacemaker sometimes gets into something like a race condition, with the VM's system load growing without bound.
This was hard to debug, because I cannot connect to the VM when it happens. A simple workaround was applied: 2 CPUs (50% each). In that case pacemaker worked fine.
But it is strange that pacemaker sometimes cannot work on a VM with only one virtual CPU.

The automatic test system itself was the second step.

>> 
>>>> Watchdog is kind of a self-fencing method. Cluster with quorum+watchdog, or
>>>> SBD+watchdog or quorum+SBD+watchdog are fine...without "active" fencing.  
>>> quorum+watchdog or SBD+watchdog are useless. Quorum+SBD+watchdog is a
>>> solution, but also has some drawback, so this is not perfect or fine yet.
>> Well, by "SBD", I meant "Storage Based Death": using a shared storage to poison
>> pill other nodes. Not just the sbd daemon, that is used for SBD and watchdog.
>> Sorry for the shortcut and the confusion.

I see. Using sbd as a watchdog daemon may look like a surplus and a non-optimal use of resources, but pacemaker does not have a dedicated watchdog daemon.
It might be much simpler if such functionality were inside corosync or pacemaker, as it is, for instance, in Patroni.

>>> SBD is not good as a watchdog daemon. In the version I use, it does not check
>>> that corosync and the pacemaker processes are not frozen (for
>>> instance by kill -STOP). It looks like checking for corosync has already been
>>> done: https://github.com/ClusterLabs/sbd/pull/83
>> Good.
>> 
>>> I don't know about checking all of the pacemaker's processes.
>> This moves toward the good direction I would say:
>> 
>>  https://lists.clusterlabs.org/pipermail/users/2020-August/027602.html
>> 
>> The main Pacemaker process is now checked by sbd. Maybe other processes will be
>> included in future releases as "more in-depth health checks" as written in this
>> email.
> We are targeting a hierarchical approach:
> 
> SBD is checking pacemakerd - more explicitly a timestamp
> when pacemaker was considered fine last time. So this task
> of checking liveness of the whole group of pacemaker
> daemons can be passed over to pacemakerd without risking
> that pacemakerd might be stalled or something.
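If I understand the hierarchical approach quoted above correctly, it can be sketched roughly like this (a toy illustration only, not sbd's actual code; the class, function names, and timeout value are all invented for the example):

```python
import time

HEALTH_TIMEOUT = 2.0  # seconds; illustrative value, not sbd's default

class MonitoredDaemon:
    """Stands in for pacemakerd: it refreshes a 'last considered fine'
    timestamp, which it would only do after checking its child daemons."""
    def __init__(self):
        self.last_ok = time.monotonic()

    def refresh(self):
        # In the real design, pacemakerd refreshes only when the whole
        # group of pacemaker daemons looks healthy.
        self.last_ok = time.monotonic()

def daemon_is_live(daemon, now=None, timeout=HEALTH_TIMEOUT):
    """Supervisor-side check: a stale timestamp means the daemon is
    stalled, so the watchdog would be allowed to fire."""
    now = time.monotonic() if now is None else now
    return (now - daemon.last_ok) <= timeout

d = MonitoredDaemon()
assert daemon_is_live(d)                         # fresh timestamp: healthy
assert not daemon_is_live(d, now=d.last_ok + 5)  # stale: watchdog fires
```

The point of the timestamp indirection is that the supervisor never has to trust pacemakerd to answer a query in real time: a stopped or stalled pacemakerd simply fails to refresh, and the check fails on its own.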

After I have moved to the new versions, I will add kill -STOP tests for corosync and the pacemaker processes.
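A minimal, Linux-only sketch of the property such a kill -STOP test relies on (the `sleep` child is just a stand-in for a daemon; this is not part of any real test suite): a SIGSTOP-frozen process still exists from the kernel's point of view, but its state in /proc becomes 'T', so only a liveness check that expects activity, not mere existence, can catch it.

```python
import os
import signal
import subprocess
import time

def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat (Linux only).
    'T' means the process is stopped, e.g. by SIGSTOP."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The comm field can contain spaces, so parse after the last ')'.
    return stat.rsplit(")", 1)[1].split()[0]

# Spawn a plain `sleep` child as a stand-in for a cluster daemon.
child = subprocess.Popen(["sleep", "30"])
try:
    os.kill(child.pid, signal.SIGSTOP)  # freeze it, as `kill -STOP` would
    time.sleep(0.1)
    frozen = process_state(child.pid) == "T"
finally:
    os.kill(child.pid, signal.SIGCONT)  # thaw so it can be terminated
    child.terminate()
    child.wait()

print(frozen)  # True: the frozen process shows state 'T'
```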

