[ClusterLabs] Three node cluster becomes completely fenced if one node leaves

Seth Reid sreid at vendini.com
Fri Mar 31 19:15:53 UTC 2017


We are only using one mount, and that mount has nothing on it currently.


I have fixed the problem. Our OS is Ubuntu 16.04 LTS (Xenial). I added the
17.04 (Zesty) repo to get a newer version of Corosync. I upgraded
Corosync, which upgraded a long list of other related packages (Pacemaker
and gfs2 among them). My fencing now works properly. If a node loses
network connection to the cluster, only that node is fenced. Presumably,
the bug is in one of the packages in the Xenial repo and has been fixed in
the versions in Zesty.

-------
Seth Reid



On Fri, Mar 31, 2017 at 10:31 AM, Bob Peterson <rpeterso at redhat.com> wrote:

> ----- Original Message -----
> | I can confirm that doing an ifdown is not the source of my corosync
> issues.
> | My cluster is in another state, so I can't pull a cable, but I can down a
> | port on a switch. That had the exact same effect as doing an ifdown. Two
> | machines got fenced when only one should have been.
> |
> | -------
> | Seth Reid
>
>
> Hi Seth,
>
> I don't know if your problem is the same thing I'm looking at BUT:
> I'm currently working on a fix to the GFS2 file system for a
> similar problem. The scenario is something like this:
>
> 1. Node X goes down for some reason.
> 2. Node X gets fenced by one of the other nodes.
> 3. As part of the recovery, GFS2 on all the other nodes has to
>    replay the journals for all the file systems mounted on X.
> 4. GFS2 journal replay hogs the CPU, which causes corosync to be
>    starved for CPU on some node (say node Y).
> 5. Since corosync on node Y was starved for CPU, it doesn't respond
>    in time to the other nodes (say node Z).
> 6. Thus, node Z fences node Y.
>
> In my case, the solution is to fix GFS2 so that it adds some
> "cond_resched()" (conditional schedule) calls to allow corosync
> (and dlm) to get some work done. Thus, corosync isn't starved for
> CPU and does its work, and therefore, it doesn't get fenced.
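>
> To make the idea concrete, here is a minimal, hypothetical sketch (not
> the actual GFS2 patch; the structures and the replay_block() helper are
> made up for illustration). The point is simply to call cond_resched()
> between units of replay work so the scheduler can run corosync and dlm:
>
>     #include <linux/list.h>
>     #include <linux/sched.h>
>
>     /* Illustrative stand-ins for the real journal structures. */
>     struct journal_block {
>             struct list_head list;
>     };
>
>     struct journal {
>             struct list_head blocks;
>     };
>
>     static void replay_block(struct journal *jd, struct journal_block *blk)
>     {
>             /* ... CPU-heavy replay of one block ... */
>     }
>
>     static void replay_journal(struct journal *jd)
>     {
>             struct journal_block *blk;
>
>             list_for_each_entry(blk, &jd->blocks, list) {
>                     replay_block(jd, blk);
>                     cond_resched();  /* yield so corosync/dlm aren't starved */
>             }
>     }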
>
> I don't know if that's what is happening in your case.
> Do you have a lot of GFS2 mount points that would need recovery
> when the first fence event occurs?
> In my case, I can recreate the problem by having 60 GFS2 mount points.
>
> Hopefully I'll be sending a GFS2 patch to the cluster-devel
> mailing list for this problem soon.
>
> In testing my fix, I've periodically experienced some weirdness
> and other unexplained fencing, so maybe there's a second problem
> lurking (or maybe there's just something weird in the experimental
> kernel I'm using as a base). Hopefully testing will prove whether
> my fix to GFS2 recovery is enough or if there's another problem.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>