[ClusterLabs] Pacemaker kill does not cause node fault ???

Mon Feb 13 23:18:36 UTC 2017

On 02/08/2017 02:45 AM, Ferenc Wágner wrote:
> Ken Gaillot <kgaillot at redhat.com> writes:
> 
>> On 02/03/2017 07:00 AM, RaSca wrote:
>>>
>>> On 03/02/2017 11:06, Ferenc Wágner wrote:
>>>> Ken Gaillot <kgaillot at redhat.com> writes:
>>>>
>>>>> On 01/10/2017 04:24 AM, Stefan Schloesser wrote:
>>>>>
>>>>>> I am currently testing a 2 node cluster under Ubuntu 16.04. The setup
>>>>>> seems to be working ok including the STONITH.
>>>>>> For test purposes I issued a "pkill -f pace" killing all pacemaker
>>>>>> processes on one node.
>>>>>>
>>>>>> Result:
>>>>>> The node is marked as "pending", all resources stay on it. If I
>>>>>> manually kill a resource it is not noticed. On the other node a drbd
>>>>>> "promote" command fails (drbd is still running as master on the first
>>>>>> node).
>>>>>
>>>>> I suspect that, when you kill pacemakerd, systemd respawns it quickly
>>>>> enough that fencing is unnecessary. Try "pkill -f pace; systemctl stop
>>>>> pacemaker".
>>>>
>>>> What exactly is "quickly enough"?
>>>
>>> What Ken is saying is that Pacemaker, as a service managed by systemd,
>>> has this option in its service definition file
>>> (/usr/lib/systemd/system/pacemaker.service):
>>>
>>> Restart=on-failure
>>>
>>> As explained in [1], systemd immediately restarts the process if it
>>> ends for some unexpected reason (like a forced kill).
>>>
>>> [1] https://www.freedesktop.org/software/systemd/man/systemd.service.html
>>
>> And the cluster itself is resilient to some daemon restarts. If only
>> pacemakerd is killed, corosync and pacemaker's crmd can still function
>> without any issues. When pacemakerd respawns, it reestablishes contact
>> with any other cluster daemons still running (and its pacemakerd peers
>> on other cluster nodes).
> 
> KillMode=process looks like a very important component of the service
> file then.  Probably worth commenting, especially its relation to
> Restart=on-failure (it also affects plain stop operations, of course).
> 
> But I still wonder how "quickly enough" could be quantified.  Have we
> got a timeout for this, or are we good while the cluster is quiescent,
> or maybe something else?

pacemakerd's main purpose is to monitor the other daemons and respawn
them if necessary. If systemd asks it to shut down, or if one of the
daemons exits with the "don't respawn" exit code, it will stop all
daemons. So if pacemakerd is not running, nothing immediately happens
that would lead to fencing. But if another daemon dies, or if systemd is
shutting down the host, pacemakerd can't do its job, and fencing might
result.
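
To make the cases above concrete, here is a rough toy model in Python. It
is only a sketch of the reasoning in this message, not Pacemaker code: the
Event names and the outcome() function are made up for illustration, and
the returned strings just paraphrase the paragraph above.

```python
from enum import Enum


class Event(Enum):
    """Hypothetical event names, chosen only for this illustration."""
    NONE = "nothing else happens"
    DAEMON_DIED = "another pacemaker daemon dies"
    HOST_SHUTDOWN = "systemd is shutting down the host"


def outcome(pacemakerd_running: bool, event: Event) -> str:
    """Toy summary of what happens, depending on whether pacemakerd is up."""
    if pacemakerd_running:
        if event is Event.DAEMON_DIED:
            # pacemakerd's main purpose: monitor and respawn the other daemons
            return "pacemakerd respawns the daemon"
        if event is Event.HOST_SHUTDOWN:
            # systemd asks pacemakerd to shut down, so it stops all daemons
            return "pacemakerd stops all daemons"
        return "nothing to do"
    # pacemakerd was killed and has not (yet) been respawned by systemd
    if event is Event.NONE:
        return "no immediate effect"
    # nobody is left to respawn or cleanly stop the other daemons
    return "fencing might result"


if __name__ == "__main__":
    for running in (True, False):
        for event in Event:
            print(f"pacemakerd running={running}, {event.value}: "
                  f"{outcome(running, event)}")
```

The point of the last branch is that a missing pacemakerd only matters once
something else goes wrong while it is absent.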