[ClusterLabs] Antw: Re: Antw: Re: Pacemaker kill does not cause node fault ???

Wed Feb 8 08:49:20 UTC 2017

Ken Gaillot <kgaillot at redhat.com> writes:

> On 02/07/2017 01:11 AM, Ulrich Windl wrote:
>
>> Ken Gaillot <kgaillot at redhat.com> writes:
>>
>>> On 02/06/2017 03:28 AM, Ulrich Windl wrote:
>>>
>>>> Isn't the question: Is crmd a process that is expected to die (and
>>>> thus need restarting)? Or wouldn't one prefer to debug this
>>>> situation? I fear that restarting it might just cover up some fatal
>>>> failure...
>>>
>>> If crmd or corosync dies, the node will be fenced (if fencing is enabled
>>> and working). If one of the crmd's persistent connections (such as to
>>> the cib) fails, it will exit, so it ends up the same.
>>
>> But isn't it due to crmd not responding to network packets? So if the
>> timeout is long enough, and crmd is started fast enough, will the
>> node really be fenced?
>
> If crmd dies, it leaves its corosync process group, and I'm pretty sure
> the other nodes will fence it for that reason, regardless of the duration.

See http://lists.clusterlabs.org/pipermail/users/2016-March/002415.html
for a case when a Pacemaker cluster survived a crmd failure and restart.
Re-reading the thread, I'm still unsure what saved our ass from
resources being started in parallel and losing massive amounts of data.
I'd fully expect fencing in such cases...
-- 
Feri