[ClusterLabs] Antw: Re: Pacemaker kill does not cause node fault ???

Mon Feb 6 15:13:34 UTC 2017

On 02/06/2017 03:28 AM, Ulrich Windl wrote:
>>>> RaSca <rasca at miamammausalinux.org> wrote on 03.02.2017 at 14:00 in
> message
> <0de64981-904f-5bdb-c98f-9c59ee47b6c5 at miamammausalinux.org>:
> 
>> On 03/02/2017 11:06, Ferenc Wágner wrote:
>>> Ken Gaillot <kgaillot at redhat.com> writes:
>>>
>>>> On 01/10/2017 04:24 AM, Stefan Schloesser wrote:
>>>>
>>>>> I am currently testing a 2 node cluster under Ubuntu 16.04. The setup
>>>>> seems to be working ok including the STONITH.
>>>>> For test purposes I issued a "pkill -f pace" killing all pacemaker
>>>>> processes on one node.
>>>>>
>>>>> Result:
>>>>> The node is marked as "pending", all resources stay on it. If I
>>>>> manually kill a resource it is not noticed. On the other node a drbd
>>>>> "promote" command fails (drbd is still running as master on the first
>>>>> node).
>>>>
>>>> I suspect that, when you kill pacemakerd, systemd respawns it quickly
>>>> enough that fencing is unnecessary. Try "pkill -f pace; systemctl stop
>>>> pacemaker".
>>>
>>> What exactly is "quickly enough"?
>>
>> What Ken is saying is that Pacemaker, as a service managed by systemd,
>> has this option in its service definition file
>> (/usr/lib/systemd/system/pacemaker.service):
>>
>> Restart=on-failure
>>
>> As explained in [1], systemd immediately restarts the process if it
>> ends for some unexpected reason (like a forced kill).
> 
> Isn't the question: Is crmd a process that is expected to die (and thus needs
> restarting)? Or wouldn't one prefer to debug this situation? I fear that
> restarting it might just cover up some fatal failure...

If crmd or corosync dies, the node will be fenced (if fencing is enabled
and working). If one of the crmd's persistent connections (such as to
the cib) fails, crmd will exit, so the outcome is the same. But the other
daemons (such as pacemakerd or attrd) can die and respawn without any
risk to services.

The failure will be logged, but it will not be reported in cluster
status, so there is a chance of not noticing it.
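
One way to reduce that chance, as a rough sketch rather than anything
Pacemaker itself provides: ask systemd how often it has restarted the unit.
This assumes a systemd new enough to expose the NRestarts property and the
unit name pacemaker.service; the helper names are made up for illustration.
Note that it only counts restarts done by systemd (that is, of pacemakerd),
not child daemons such as attrd that pacemakerd respawns on its own.

```python
import subprocess


def parse_show_output(text):
    """Parse KEY=VALUE lines, as printed by `systemctl show`, into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that carry no KEY=VALUE pair
            props[key.strip()] = value.strip()
    return props


def restart_count(props):
    """Return NRestarts as an int, or None if it is missing or not numeric."""
    value = props.get("NRestarts", "")
    return int(value) if value.isdigit() else None


def query_unit(unit="pacemaker.service"):
    """Query systemd for the unit's restart counter and state.

    The unit name is an assumption; adjust it to the local installation.
    """
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=NRestarts,ActiveState"],
        capture_output=True, text=True, check=True,
    )
    return parse_show_output(result.stdout)
```

Calling restart_count(query_unit()) from a periodic check and alerting when
the number grows would flag a silently respawned pacemakerd; the logs are
still the place to look for the actual cause.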

> 
>>
>> [1] https://www.freedesktop.org/software/systemd/man/systemd.service.html 
>>
>> -- 
>> RaSca
>> rasca at miamammausalinux.org