[ClusterLabs] Antw: Re: Antw: Re: Pacemaker kill does not cause node fault ???

Tue Feb 7 15:01:54 UTC 2017

On 02/07/2017 01:11 AM, Ulrich Windl wrote:
>>>> Ken Gaillot <kgaillot at redhat.com> wrote on 06.02.2017 at 16:13 in
> message
> <40eba339-2f46-28b8-4605-c7047e0ee703 at redhat.com>:
>> On 02/06/2017 03:28 AM, Ulrich Windl wrote:
>>>>>> RaSca <rasca at miamammausalinux.org> wrote on 03.02.2017 at 14:00 in
>>> message
>>> <0de64981-904f-5bdb-c98f-9c59ee47b6c5 at miamammausalinux.org>:
>>>
>>>> On 03/02/2017 11:06, Ferenc Wágner wrote:
>>>>> Ken Gaillot <kgaillot at redhat.com> writes:
>>>>>
>>>>>> On 01/10/2017 04:24 AM, Stefan Schloesser wrote:
>>>>>>
>>>>>>> I am currently testing a 2 node cluster under Ubuntu 16.04. The setup
>>>>>>> seems to be working ok including the STONITH.
>>>>>>> For test purposes I issued a "pkill -f pace" killing all pacemaker
>>>>>>> processes on one node.
>>>>>>>
>>>>>>> Result:
>>>>>>> The node is marked as "pending", all resources stay on it. If I
>>>>>>> manually kill a resource it is not noticed. On the other node a drbd
>>>>>>> "promote" command fails (drbd is still running as master on the first
>>>>>>> node).
>>>>>>
>>>>>> I suspect that, when you kill pacemakerd, systemd respawns it quickly
>>>>>> enough that fencing is unnecessary. Try "pkill -f pace; systemctl stop
>>>>>> pacemaker".
>>>>>
>>>>> What exactly is "quickly enough"?
>>>>
>>>> What Ken is saying is that Pacemaker, as a service managed by systemd,
>>>> has this option in its service definition file
>>>> (/usr/lib/systemd/system/pacemaker.service):
>>>>
>>>> Restart=on-failure
>>>>
>>>> As explained in [1], systemd immediately restarts the process
>>>> if it ends for some unexpected reason (like a forced kill).
>>>
>>> Isn't the question: Is crmd a process that is expected to die (and thus
>>> needs restarting)? Or wouldn't one prefer to debug this situation? I fear
>>> that restarting it might just cover some fatal failure...
>>
>> If crmd or corosync dies, the node will be fenced (if fencing is enabled
>> and working). If one of the crmd's persistent connections (such as to
>> the cib) fails, it will exit, so it ends up the same. But the other
> 
> But isn't it due to crmd not responding to network packets? So if the timeout
> is long enough, and crmd is started fast enough, will the node really be
> fenced?

If crmd dies, it leaves its corosync process group, and I'm pretty sure
the other nodes will fence it for that reason, regardless of the duration.

>> daemons (such as pacemakerd or attrd) can die and respawn without any
>> risk to services.
>>
>> The failure will be logged, but it will not be reported in cluster
>> status, so there is a chance of not noticing it.
> 
> I don't understand: A node is fenced, but it will not be noted in the cluster
> status???

I meant the case where pacemakerd or attrd respawns quickly. The node is
not fenced in that case, and the only indication will be in the logs.
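
To summarize the cases discussed above, here is a rough sketch in Python. It
is not Pacemaker code; the function name expected_outcome, the fencing_works
flag, and the result strings are made up for illustration. It only encodes
what was said in this thread: crmd or corosync dying leads to fencing (if
fencing is enabled and working), while pacemakerd or attrd can respawn
quickly with nothing but a log entry.

```python
# Illustration only: a lookup of the outcomes described in this thread.
# None of these names come from Pacemaker itself.

# Daemons whose death gets the node fenced, per the discussion above.
FENCED_ON_DEATH = {"crmd", "corosync"}

# Daemons named above as able to die and respawn without risk to services.
RESPAWN_WITHOUT_FENCING = {"pacemakerd", "attrd"}


def expected_outcome(daemon, fencing_works=True):
    """Return a short description of what the thread says should happen."""
    if daemon in FENCED_ON_DEATH:
        # Fencing only happens if it is enabled and working.
        return "node fenced" if fencing_works else "no fencing available"
    if daemon in RESPAWN_WITHOUT_FENCING:
        # Assumes the daemon respawns quickly; the failure is only logged,
        # it is not reported in cluster status.
        return "respawned; logged only"
    raise ValueError(f"{daemon!r} is not covered in this thread")


if __name__ == "__main__":
    for name in ("crmd", "corosync", "pacemakerd", "attrd"):
        print(name, "->", expected_outcome(name))
```

The practical point is the second branch: for pacemakerd or attrd there is
no fencing event to notice, so the logs are the only place to look.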