[ClusterLabs] Pacemaker kill does not cause node fault ???

Wed Feb 8 03:45:29 EST 2017

Ken Gaillot <kgaillot at redhat.com> writes:

> On 02/03/2017 07:00 AM, RaSca wrote:
>> 
>> On 03/02/2017 11:06, Ferenc Wágner wrote:
>>> Ken Gaillot <kgaillot at redhat.com> writes:
>>>
>>>> On 01/10/2017 04:24 AM, Stefan Schloesser wrote:
>>>>
>>>>> I am currently testing a 2 node cluster under Ubuntu 16.04. The setup
>>>>> seems to be working ok including the STONITH.
>>>>> For test purposes I issued a "pkill -f pace" killing all pacemaker
>>>>> processes on one node.
>>>>>
>>>>> Result:
>>>>> The node is marked as "pending", all resources stay on it. If I
>>>>> manually kill a resource it is not noticed. On the other node a drbd
>>>>> "promote" command fails (drbd is still running as master on the first
>>>>> node).
>>>>
>>>> I suspect that, when you kill pacemakerd, systemd respawns it quickly
>>>> enough that fencing is unnecessary. Try "pkill -f pace; systemctl stop
>>>> pacemaker".
>>>
>>> What exactly is "quickly enough"?
>> 
>> What Ken is saying is that Pacemaker, as a service managed by systemd,
>> has this option in its service definition file
>> (/usr/lib/systemd/system/pacemaker.service):
>> 
>> Restart=on-failure
>> 
>> As explained in [1], systemd immediately restarts the process if it
>> ends for some unexpected reason (like a forced kill).
>> 
>> [1] https://www.freedesktop.org/software/systemd/man/systemd.service.html
>
> And the cluster itself is resilient to some daemon restarts. If only
> pacemakerd is killed, corosync and pacemaker's crmd can still function
> without any issues. When pacemakerd respawns, it reestablishes contact
> with any other cluster daemons still running (and its pacemakerd peers
> on other cluster nodes).

KillMode=process looks like a very important component of the service
file then.  Probably worth commenting, especially its relation to
Restart=on-failure (it also affects plain stop operations, of course).
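
To make the interplay concrete, here is a rough sketch of how the two
settings might be commented in a unit file.  Only Restart=on-failure and
KillMode=process are confirmed in this thread; the comments reflect my
reading of systemd.service(5) and systemd.kill(5), and the RestartSec
line is an assumption (it shows systemd's documented default, not
necessarily anything pacemaker.service sets).

```ini
# Illustrative excerpt only, not the shipped pacemaker.service.
[Service]
# Restart the main process (pacemakerd) when it ends uncleanly, e.g. with a
# non-zero exit status or by a signal such as SIGKILL or SIGSEGV.
Restart=on-failure

# Delay before the restart. Assumption: left unset, in which case systemd
# waits its default of 100ms.
#RestartSec=100ms

# On stop (and when cleaning up before a restart) signal only the main
# process; other processes in the unit's control group are left running.
KillMode=process
```

If that reading is right, the systemd side of "quickly enough" is on the
order of RestartSec, but that says nothing about how long the cluster
itself tolerates the gap.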

But I still wonder how "quickly enough" could be quantified.  Have we
got a timeout for this, or are we good while the cluster is quiescent,
or maybe something else?
-- 
Thanks,
Feri