[ClusterLabs] Pacemaker crash and fencing failure

Sat Nov 21 16:32:22 UTC 2015

On Sat, Nov 21, 2015 at 1:50 AM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> 21.11.2015 03:38, Brian Campbell wrote:
>>
>>
>> What I'm concerned about is the initial failure of crmd on master1
>> that led to master2 deciding to fence it, and then master2's failure
>> to fence master1 and thus getting stuck and not being able to manage
>> resources. It seems to have simply stopped doing anything, with no
>> logs indicating why it did so.
>>
>
> That's actually normal. If fencing is required but could not be performed
> cluster is stuck - no further actions can be completed in this state. So the
> root cause here seems to be unsuccessful fencing.

Yes, that part I expect. The problem I'm having is that there's no
indication of why fencing was unsuccessful. We had previously tested
fencing and it was working; in fact, we see fencing working later on
in the logs: after someone manually reboots master1, it is seen as
unclean and successfully fenced.

So the problem is that fencing failed without anything being logged
about why, which makes it hard to figure out what needs to be fixed to
make it more reliable in the future.

-- Brian