[ClusterLabs] Three node cluster becomes completely fenced if one node leaves

Bob Peterson rpeterso at redhat.com
Fri Mar 31 17:31:29 UTC 2017


----- Original Message -----
| I can confirm that doing an ifdown is not the source of my corosync issues.
| My cluster is in another state, so I can't pull a cable, but I can down a
| port on a switch. That had the exact same effect as doing an ifdown. Two
| machines got fenced when it should have only been one.
| 
| -------
| Seth Reid
| System Operations Engineer
| Vendini, Inc.
| 415.349.7736
| sreid at vendini.com
| www.vendini.com

Hi Seth,

I don't know if your problem is the same thing I'm looking at BUT:
I'm currently working on a fix to the GFS2 file system for a
similar problem. The scenario is something like this:

1. Node X goes down for some reason.
2. Node X gets fenced by one of the other nodes.
3. As part of the recovery, GFS2 on all the other nodes has to
   replay the journals for all the file systems mounted on X.
4. GFS2 journal replay hogs the CPU, which causes corosync to be
   starved for CPU on some node (say node Y).
5. Since corosync on node Y was starved for CPU, it doesn't respond
   in time to the other nodes (say node Z).
6. Thus, node Z fences node Y.

In my case, the solution is to fix GFS2 so that it makes some
"cond_resched()" (conditional schedule) calls during journal replay,
allowing corosync (and dlm) to get some work done. Thus, corosync
isn't starved for CPU, does its work, and therefore doesn't get fenced.

I don't know if that's what is happening in your case.
Do you have a lot of GFS2 mount points that would need recovery
when the first fence event occurs?
In my case, I can recreate the problem by having 60 GFS2 mount points.

Hopefully I'll be sending a GFS2 patch to the cluster-devel
mailing list for this problem soon.

In testing my fix, I've periodically experienced some weirdness
and other unexplained fencing, so maybe there's a second problem
lurking (or maybe there's just something weird in the experimental
kernel I'm using as a base). Hopefully testing will prove whether
my fix to GFS2 recovery is enough or if there's another problem.

Regards,

Bob Peterson
Red Hat File Systems
