[ClusterLabs] Antw: Re: Antw: Re: Pacemaker kill does not cause node fault ???

Ken Gaillot kgaillot at redhat.com
Tue Feb 7 15:01:54 UTC 2017


On 02/07/2017 01:11 AM, Ulrich Windl wrote:
>>>> Ken Gaillot <kgaillot at redhat.com> wrote on 06.02.2017 at 16:13 in
> message
> <40eba339-2f46-28b8-4605-c7047e0ee703 at redhat.com>:
>> On 02/06/2017 03:28 AM, Ulrich Windl wrote:
>>>>>> RaSca <rasca at miamammausalinux.org> wrote on 03.02.2017 at 14:00 in
>>> message
>>> <0de64981-904f-5bdb-c98f-9c59ee47b6c5 at miamammausalinux.org>:
>>>
>>>> On 03/02/2017 11:06, Ferenc Wágner wrote:
>>>>> Ken Gaillot <kgaillot at redhat.com> writes:
>>>>>
>>>>>> On 01/10/2017 04:24 AM, Stefan Schloesser wrote:
>>>>>>
>>>>>>> I am currently testing a 2 node cluster under Ubuntu 16.04. The setup
>>>>>>> seems to be working ok including the STONITH.
>>>>>>> For test purposes I issued a "pkill -f pace" killing all pacemaker
>>>>>>> processes on one node.
>>>>>>>
>>>>>>> Result:
>>>>>>> The node is marked as "pending", all resources stay on it. If I
>>>>>>> manually kill a resource it is not noticed. On the other node a drbd
>>>>>>> "promote" command fails (drbd is still running as master on the first
>>>>>>> node).
>>>>>>
>>>>>> I suspect that, when you kill pacemakerd, systemd respawns it quickly
>>>>>> enough that fencing is unnecessary. Try "pkill -f pace; systemctl stop
>>>>>> pacemaker".
>>>>>
>>>>> What exactly is "quickly enough"?
>>>>
>>>> What Ken is saying is that Pacemaker, as a service managed by systemd,
>>>> has in its service definition file
>>>> (/usr/lib/systemd/system/pacemaker.service) this option:
>>>>
>>>> Restart=on-failure
>>>>
>>>> As explained in [1], systemd immediately restarts the process
>>>> if it ends for some unexpected reason (like a forced kill).
>>>
>>> Isn't the question: Is crmd a process that is expected to die (and thus
>>> need restarting)? Or wouldn't one prefer to debug this situation? I fear that
>>> restarting it might just cover up some fatal failure...
>>
>> If crmd or corosync dies, the node will be fenced (if fencing is enabled
>> and working). If one of the crmd's persistent connections (such as to
>> the cib) fails, it will exit, so it ends up the same. But the other
> 
> But isn't it due to crmd not responding to network packets? So if the timeout
> is long enough, and crmd is started fast enough, will the node really be
> fenced?

If crmd dies, it leaves its corosync process group, and I'm pretty sure
the other nodes will fence it for that reason, regardless of the duration.

>> daemons (such as pacemakerd or attrd) can die and respawn without any
>> risk to services.
>>
>> The failure will be logged, but it will not be reported in cluster
>> status, so there is a chance of not noticing it.
> 
> I don't understand: A node is fenced, but it will not be noted in the cluster
> status???

I meant the case where pacemakerd or attrd respawns quickly. The node is
not fenced in that case, and the only indication will be in the logs.
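
Since the logs are the only indication in that case, a rough way to check
for it is to count the log lines that mention a respawn. This is only a
sketch: the helper name count_respawns is made up for illustration, and the
search pattern "respawn" is an assumption about the log wording, so adjust
it to whatever your Pacemaker version actually writes.

```shell
# Hypothetical helper: read log text on stdin and print the number of
# lines that mention a respawn (case-insensitive).
# The pattern "respawn" is an assumption, not a documented log format.
count_respawns() {
    # grep -c prints 0 and returns non-zero when nothing matches;
    # "|| true" keeps the helper's exit status at 0 in that case.
    grep -ci 'respawn' || true
}

# Possible usage on a systemd host (unit name "pacemaker" as in this thread):
#   journalctl -u pacemaker --since today | count_respawns
```

A non-zero count would be a hint to read the surrounding log entries, not
proof that anything went wrong.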



