[ClusterLabs] Tuchanka

Mon Oct 5 08:03:27 EDT 2020

On 10/5/20 1:33 PM, Олег Самойлов wrote:
>
>> On 2 Oct 2020, at 17:19, Klaus Wenninger <kwenning at redhat.com> wrote:
>>   
>>>> My English is poor, so I'll try to find other words. My primary task
>>>> was to create a prototype of an automatic deployment system. So I used only
>>>> the same technique that will be used on the real hardware servers: Red Hat
>>>> DVD image + Kickstart. And I wanted to test that deployment too. That's why
>>>> I do not use any special image for virtual machines.
>>> How exactly is using a Vagrant box you built yourself different from
>>> VirtualBox, where you clone (I suppose) an existing VM you built?
> It looks like we still cannot understand each other. :) I do not clone any VM.
> As I already said, my primary task was to create and test a working prototype
> of an automatic deployment system for real (hardware) servers. So for deployment I do not use
> any technique dedicated to VMs, neither VirtualBox nor Vagrant, only techniques suitable for real servers: the standard Red Hat DVD image,
> Red Hat Kickstart and SSH. This deployment was also tested in a loop. Btw, I hit a curious bug with pacemaker in the process.
>  
> If I assign only 1 CPU (100%) per VM, pacemaker sometimes gets into something like a race condition with infinite "system load" on the VM.
> It was hard to debug, because I cannot connect to the VM when this happens. I applied a simple workaround: 2 CPUs (50%). In this case pacemaker worked fine.
> But it is strange that pacemaker sometimes cannot work on a VM with only 1 virtual CPU.
Which versions of pacemaker & corosync were you using again?
I remember seeing corosync hogging the CPU on single-core
machines many, many years ago.
>
> The automatic test system itself was the second step.
>
>>>>> Watchdog is kind of a self-fencing method. Cluster with quorum+watchdog, or
>>>>> SBD+watchdog or quorum+SBD+watchdog are fine...without "active" fencing.  
>>>> quorum+watchdog or SBD+watchdog are useless. Quorum+SBD+watchdog is a
>>>> solution, but it also has some drawbacks, so it is not perfect or fine yet.
>>> Well, by "SBD", I meant "Storage Based Death": using a shared storage to poison
>>> pill other nodes. Not just the sbd daemon, that is used for SBD and watchdog.
>>> Sorry for the shortcut and the confusion.
> I see. Maybe using SBD as a watchdog daemon looks like a surplus and non-optimal use of resources, but pacemaker does not have a dedicated watchdog daemon.
> Maybe it would be much simpler if such functionality were inside corosync or pacemaker, as it is, for instance, in Patroni.
>
The watchdog daemon for pacemaker is SBD ;-)
Btw. you can enable usage of a watchdog-device in
corosync at compile-time although I don't think that
will really give you what you are searching for here.
Anyway, these watchdog daemons should be kept
as simple as possible, to be sure that the crucial
parts are done in a simple loop that also
takes care of kicking the hardware watchdog.
So spinning that off into a project of its own shouldn't
be the worst idea. Of course there are probably
arguments which would speak for a watchdog-daemon
to be kept in the source-tree of pacemaker and to spawn
it similarly as all the other pacemaker-subdaemons or
build it into pacemakerd (the daemon that spawns all
the rest of the pacemaker-daemons and already does some
supervision on them).
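
To illustrate the "simple loop" idea above, here is a minimal Python sketch. It is
not how SBD is implemented (SBD is written in C); the names watchdog_loop and
linux_watchdog_kicker are hypothetical, chosen only for this example. The one
assumption about the real system is the standard Linux watchdog device behaviour:
opening /dev/watchdog arms the timer, and any write to it resets the countdown.

```python
import time
from typing import Callable, Iterable, Optional


def linux_watchdog_kicker(path: str = "/dev/watchdog") -> Callable[[], None]:
    """Return a function that kicks the Linux watchdog device.

    WARNING: opening the device arms the hardware timer; if the kicks
    stop, the machine resets. Hypothetical helper, not used in the demo.
    """
    dev = open(path, "wb", buffering=0)

    def kick() -> None:
        dev.write(b"\0")  # any write resets the countdown

    return kick


def watchdog_loop(kick: Callable[[], None],
                  checks: Iterable[Callable[[], bool]],
                  interval: float = 1.0,
                  max_iterations: Optional[int] = None) -> int:
    """Kick the watchdog once per interval while every check passes.

    As soon as one check fails the loop stops kicking, so the hardware
    timer expires and the node is reset. Returns the number of kicks.
    """
    checks = list(checks)
    kicks = 0
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        iteration += 1
        if not all(check() for check in checks):
            break  # stop feeding the watchdog; let the hardware fence us
        kick()
        kicks += 1
        time.sleep(interval)
    return kicks
```

The loop takes the kick function and the health checks as parameters, so the
crucial part stays tiny and can be exercised with fake checks without touching
a real watchdog device.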
>>>> SBD is not good as a watchdog daemon. In my version it does not check
>>>> that corosync and the pacemaker processes are not frozen (for
>>>> instance by kill -STOP). It looks like the check for corosync has already
>>>> been done: https://github.com/ClusterLabs/sbd/pull/83
>>> Good.
>>>
>>>> I don't know about checking all the pacemaker processes.
>>> This moves in the right direction, I would say:
>>>
>>>  https://lists.clusterlabs.org/pipermail/users/2020-August/027602.html
>>>
>>> The main Pacemaker process is now checked by sbd. Maybe other processes will be
>>> included in future releases as "more in-depth health checks", as written in this
>>> email.
>> We are targeting a hierarchical approach:
>>
>> SBD is checking pacemakerd - more explicitly a timestamp
>> of when pacemaker was last considered fine. So the task
>> of checking liveness of the whole group of pacemaker
>> daemons can be passed over to pacemakerd without risking
>> that pacemakerd might be stalled or something.
> After I have moved to the new versions I will add kill -STOP tests for corosync and the pacemaker processes.
>
As said, that should work for corosync and for pacemakerd with
current pacemaker (master & probably 2.0.5 end of the year) &
SBD (currently just master). Closer observation of all the
pacemaker subdaemons is work in progress.
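
The timestamp-based check described above can be sketched in a few lines of
Python. This is only an illustration of the idea, not the actual interface
between sbd and pacemakerd; HealthStamp, mark_ok and is_fresh are hypothetical
names. The supervisor records when it last considered its children fine, and
the checker treats a missing or stale timestamp as a failure. A supervisor
frozen by kill -STOP can no longer refresh the stamp, so it is caught once the
stamp is older than max_age.

```python
import time
from typing import Optional


class HealthStamp:
    """Time at which the supervisor last considered its children fine."""

    def __init__(self) -> None:
        self.last_ok: Optional[float] = None  # None: never reported healthy

    def mark_ok(self, now: Optional[float] = None) -> None:
        # Called by the supervisor after it has verified its subdaemons.
        self.last_ok = time.monotonic() if now is None else now


def is_fresh(stamp: HealthStamp, max_age: float,
             now: Optional[float] = None) -> bool:
    """Checker side: the stamp must exist and be no older than max_age."""
    if stamp.last_ok is None:
        return False
    current = time.monotonic() if now is None else now
    return current - stamp.last_ok <= max_age


# Usage with explicit times so the outcome does not depend on the clock.
stamp = HealthStamp()
stamp.mark_ok(now=100.0)
print(is_fresh(stamp, max_age=5.0, now=104.0))  # stamp is 4 s old
print(is_fresh(stamp, max_age=5.0, now=106.0))  # stamp is 6 s old, stale
```

Checking a timestamp rather than a plain "alive" flag is what lets the checker
notice a stalled supervisor: a flag would stay set forever, while a stamp ages.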