[ClusterLabs] Antw: Re: Antw: Re: Pacemaker kill does not cause node fault ???

Tue Feb 7 02:11:05 EST 2017

>>> Ken Gaillot <kgaillot at redhat.com> wrote on 06.02.2017 at 16:13 in message
<40eba339-2f46-28b8-4605-c7047e0ee703 at redhat.com>:
> On 02/06/2017 03:28 AM, Ulrich Windl wrote:
>>>>> RaSca <rasca at miamammausalinux.org> wrote on 03.02.2017 at 14:00 in
>> message
>> <0de64981-904f-5bdb-c98f-9c59ee47b6c5 at miamammausalinux.org>:
>> 
>>> On 03/02/2017 11:06, Ferenc Wágner wrote:
>>>> Ken Gaillot <kgaillot at redhat.com> writes:
>>>>
>>>>> On 01/10/2017 04:24 AM, Stefan Schloesser wrote:
>>>>>
>>>>>> I am currently testing a 2 node cluster under Ubuntu 16.04. The setup
>>>>>> seems to be working ok including the STONITH.
>>>>>> For test purposes I issued a "pkill -f pace" killing all pacemaker
>>>>>> processes on one node.
>>>>>>
>>>>>> Result:
>>>>>> The node is marked as "pending", all resources stay on it. If I
>>>>>> manually kill a resource it is not noticed. On the other node a drbd
>>>>>> "promote" command fails (drbd is still running as master on the first
>>>>>> node).
>>>>>
>>>>> I suspect that, when you kill pacemakerd, systemd respawns it quickly
>>>>> enough that fencing is unnecessary. Try "pkill -f pace; systemctl stop
>>>>> pacemaker".
>>>>
>>>> What exactly is "quickly enough"?
>>>
>>> What Ken is saying is that Pacemaker, as a service managed by systemd,
>>> has this option in its service definition file
>>> (/usr/lib/systemd/system/pacemaker.service):
>>>
>>> Restart=on-failure
>>>
>>> As explained in [1], systemd immediately restarts the process
>>> if it ends for some unexpected reason (like a forced kill).
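
For reference, here is a minimal sketch of the [Service] settings in question.
Only Restart=on-failure is taken from the message quoted above. RestartSec is
my assumption about what "immediately" means: it is shown with systemd's
documented default, and the unit file actually shipped with Pacemaker may set
a different value.

```ini
# Sketch only, not a copy of the shipped pacemaker.service
[Service]
# Restart the service when the main process ends uncleanly
# (non-zero exit code, unclean signal, timeout, watchdog)
Restart=on-failure
# Assumed: delay before the restart; 100ms is systemd's default
RestartSec=100ms
```

The effective values on a given node can be checked with
"systemctl show pacemaker -p Restart -p RestartUSec".
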
>> 
>> Isn't the question: Is crmd a process that is expected to die (and thus
>> needs restarting)? Or wouldn't one prefer to debug this situation? I fear
>> that restarting it might just cover up some fatal failure...
> 
> If crmd or corosync dies, the node will be fenced (if fencing is enabled
> and working). If one of the crmd's persistent connections (such as to
> the cib) fails, it will exit, so it ends up the same. But the other

But isn't the fencing due to crmd not responding to network packets? So if the
timeout is long enough, and crmd is restarted fast enough, will the node really
be fenced?

> daemons (such as pacemakerd or attrd) can die and respawn without any
> risk to services.
> 
> The failure will be logged, but it will not be reported in cluster
> status, so there is a chance of not noticing it.

I don't understand: A node is fenced, but it will not be noted in the cluster
status???

[...]

Regards,
Ulrich