[ClusterLabs] Two Node NFS doesn't failover with hardware glitches

Tue Apr 7 14:01:00 EDT 2015

Erich Prinz <erich at live2sync.com> writes:

> Still, this doesn't solve for the problem of a resource hanging on the primary node. Everything I'm reading indicates fencing is required, yet the boilerplate configuration from Linbit has stonith disabled.
>
> These units are running CentOS 6.5
> corosync 1.4.1
> pacemaker 1.1.10
> drbd
>
> Two questions then:
>
> 1. how do we handle cranky hardware issues to ensure a smooth failover?
> 2. what additional steps are needed to ensure the NFS mounts don't go stale on the clients?
>
>

As you might have guessed, you have already answered your own question:
what you need to solve this situation is stonith. When a node refuses to
die gracefully, you really do need stonith to force it into a known
state.
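
As a rough sketch only, this is what enabling stonith could look like
with crmsh, assuming your nodes have IPMI-capable management boards and
the fence_ipmilan agent is installed. The node names (nfs1, nfs2), the
addresses and the credentials are placeholders I made up, not values
from your setup, and a different fence agent may suit your hardware
better:

```shell
# Hypothetical example: one fencing device per node, driven over IPMI.
# Replace node names, addresses and credentials with your own.
crm configure primitive fence-nfs1 stonith:fence_ipmilan \
    params pcmk_host_list="nfs1" ipaddr="192.0.2.11" \
           login="admin" passwd="secret" \
    op monitor interval="60s"
crm configure primitive fence-nfs2 stonith:fence_ipmilan \
    params pcmk_host_list="nfs2" ipaddr="192.0.2.12" \
           login="admin" passwd="secret" \
    op monitor interval="60s"

# Keep each fencing device off the node it is meant to fence.
crm configure location l-fence-nfs1 fence-nfs1 -inf: nfs1
crm configure location l-fence-nfs2 fence-nfs2 -inf: nfs2

# Turn stonith back on (the boilerplate configuration has it disabled).
crm configure property stonith-enabled="true"
```

It is worth testing that fencing actually works, for example by
deliberately hanging one node, before relying on it in production.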

These days most documentation tries to emphasize this more than in the
past. I can recommend Tim's cartoon explanation of how and why stonith
works:

http://ourobengr.com/stonith-story/

-- 
// Kristoffer Grönlund
// kgronlund at suse.com