[ClusterLabs] Pacemaker kill does not cause node fault ???

Fri Feb 3 15:11:25 UTC 2017

On 02/03/2017 07:00 AM, RaSca wrote:
> 
> On 03/02/2017 11:06, Ferenc Wágner wrote:
>> Ken Gaillot <kgaillot at redhat.com> writes:
>>
>>> On 01/10/2017 04:24 AM, Stefan Schloesser wrote:
>>>
>>>> I am currently testing a 2 node cluster under Ubuntu 16.04. The setup
>>>> seems to be working ok including the STONITH.
>>>> For test purposes I issued a "pkill -f pace" killing all pacemaker
>>>> processes on one node.
>>>>
>>>> Result:
>>>> The node is marked as "pending", all resources stay on it. If I
>>>> manually kill a resource it is not noticed. On the other node a drbd
>>>> "promote" command fails (drbd is still running as master on the first
>>>> node).
>>>
>>> I suspect that, when you kill pacemakerd, systemd respawns it quickly
>>> enough that fencing is unnecessary. Try "pkill -f pace; systemctl stop
>>> pacemaker".
>>
>> What exactly is "quickly enough"?
> 
> What Ken is saying is that Pacemaker, as a service managed by systemd,
> has this option in its service definition file
> (/usr/lib/systemd/system/pacemaker.service):
> 
> Restart=on-failure
> 
> As explained in [1], systemd restarts the process immediately if it
> ends for some unexpected reason (like a forced kill).
> 
> [1] https://www.freedesktop.org/software/systemd/man/systemd.service.html

And the cluster itself is resilient to some daemon restarts. If only
pacemakerd is killed, corosync and pacemaker's crmd can still function
without any issues. When pacemakerd respawns, it reestablishes contact
with any other cluster daemons still running (and its pacemakerd peers
on other cluster nodes).
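
To check which Restart= policy is set in a given copy of the unit file,
a minimal sketch along these lines may help. The function name
restart_policy is hypothetical, and the unit file path is passed as an
argument because its location differs between distributions. The sketch
reads only the file it is given, so it does not see drop-in overrides.

```shell
# restart_policy FILE
# Hypothetical helper: print the value of the Restart= setting found in
# the given unit file. If the setting appears more than once, the last
# assignment is printed. Lines such as RestartSec= are not matched.
restart_policy() {
    sed -n 's/^[[:space:]]*Restart[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}
```

For example, "restart_policy /usr/lib/systemd/system/pacemaker.service"
would print on-failure for the file quoted earlier in the thread. On a
running node, "systemctl show -p Restart pacemaker" reports the value
systemd actually loaded, including any overrides.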