[ClusterLabs] Antw: Re: Antw: Re: Pacemaker kill does not cause node fault ???

Mon Feb 13 19:29:42 EST 2017

On 02/08/2017 02:49 AM, Ferenc Wágner wrote:
> Ken Gaillot <kgaillot at redhat.com> writes:
> 
>> On 02/07/2017 01:11 AM, Ulrich Windl wrote:
>>
>>> Ken Gaillot <kgaillot at redhat.com> writes:
>>>
>>>> On 02/06/2017 03:28 AM, Ulrich Windl wrote:
>>>>
>>>>> Isn't the question: Is crmd a process that is expected to die (and
>>>>> thus need restarting)? Or wouldn't one prefer to debug this
>>>>> situation? I fear that restarting it might just cover up some fatal
>>>>> failure...
>>>>
>>>> If crmd or corosync dies, the node will be fenced (if fencing is enabled
>>>> and working). If one of the crmd's persistent connections (such as to
>>>> the cib) fails, it will exit, so it ends up the same.
>>>
>>> But isn't it due to crmd not responding to network packets? So if the
>>> timeout is long enough, and crmd is started fast enough, will the
>>> node really be fenced?
>>
>> If crmd dies, it leaves its corosync process group, and I'm pretty sure
>> the other nodes will fence it for that reason, regardless of the duration.
> 
> See http://lists.clusterlabs.org/pipermail/users/2016-March/002415.html
> for a case when a Pacemaker cluster survived a crmd failure and restart.
> Re-reading the thread, I'm still unsure what saved our ass from
> resources being started in parallel and losing massive data.  I'd fully
> expect fencing in such cases...

Looking at that again, crmd leaving the process group isn't enough to get
the node fenced -- that should abort the transition and update the node
state in the CIB, but it's up to the (new) DC to determine whether fencing
is needed.

If crmd respawns quickly enough to join the election for the new DC
(which seemed to be the case here), the node should just need to be
re-probed.
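
To make that reasoning concrete, here is a rough toy model in Python. It is only a sketch of the decision as described above, not Pacemaker code: the function name, the two timing parameters, and the returned strings are all made up for illustration, and the "fence" branch for a late respawn is my assumption about what the new DC would conclude.

```python
def new_dc_action(respawn_delay_s, election_window_s, fencing_enabled=True):
    """Toy model of what the (new) DC does after crmd dies on a node.

    All names and parameters here are hypothetical; real Pacemaker makes
    this decision from the CIB node state, not from two numbers.
    """
    # crmd leaving its corosync process group aborts the transition and
    # updates the node state in the CIB, but does not by itself fence.
    rejoined_in_time = respawn_delay_s <= election_window_s

    if rejoined_in_time:
        # crmd came back in time to join the DC election, so the node
        # only needs its resources re-probed.
        return "reprobe"

    # Assumption: a node whose crmd missed the election gets fenced,
    # provided fencing is enabled and working.
    return "fence" if fencing_enabled else "cannot-fence"


if __name__ == "__main__":
    print(new_dc_action(respawn_delay_s=1.0, election_window_s=5.0))
    print(new_dc_action(respawn_delay_s=30.0, election_window_s=5.0))
```

The point is just that leaving the process group is an input to the DC's decision rather than a fencing trigger on its own.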