[ClusterLabs] Two Node NFS doesn't failover with hardware glitches
Erich Prinz
erich at live2sync.com
Tue Apr 7 19:31:50 UTC 2015
> On Apr 7, 2015, at 13:01, Kristoffer Grönlund <kgronlund at suse.com> wrote:
>
> Erich Prinz <erich at live2sync.com> writes:
>
>> Still, this doesn't solve the problem of a resource hanging on the primary node. Everything I'm reading indicates fencing is required, yet the boilerplate configuration from Linbit ships with stonith disabled.
>>
>> These units are running CentOS 6.5
>> corosync 1.4.1
>> pacemaker 1.1.10
>> drbd
>>
>> Two questions then:
>>
>> 1. how do we handle cranky hardware issues to ensure a smooth failover?
>> 2. what additional steps are needed to ensure the NFS mounts don't go stale on the clients?
>>
>>
>
> As you might have guessed, you have answered your question already -
> what you need to solve this situation is stonith. When a node refuses to
> die gracefully, you really do need stonith to force it into a known
> state.
>
> These days most documentation tries to emphasize this more than in the
> past. I can recommend Tim's cartoon explanation of how and why stonith
> works:
>
> http://ourobengr.com/stonith-story/
>
> --
> // Kristoffer Grönlund
> // kgronlund at suse.com
Thanks, Kristoffer.
I certainly understand the death match; the cartoon is a funny way to drive the point home.
The underlying question, then, is how to implement a non-power fence that forces the node to release its resources. Is that even possible?
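For what it's worth, here is a sketch of one non-power direction, assuming a managed switch reachable over SNMP: re-enable fencing and let fence_ifmib shut the failed node's switch port down, isolating it from the network instead of cutting power. The switch address, SNMP community string, and port mapping below are all placeholders for our environment.

  # Fencing is disabled in the Linbit boilerplate; turn it back on.
  crm configure property stonith-enabled=true

  # Network fencing: fence_ifmib downs the node's switch port via SNMP,
  # so a hung node can no longer serve stale NFS to the clients.
  # 10.0.0.1 is the switch, "private" the SNMP community, and the host
  # map ties each cluster node name to its switch port number.
  crm configure primitive fence-switch stonith:fence_ifmib \
      params ipaddr="10.0.0.1" community="private" \
      pcmk_host_map="node1:10;node2:11" \
      op monitor interval=60s

The appeal of port-level isolation, as I understand it, is that the hung node stops answering clients immediately, so the NFS virtual IP can move without two servers ever answering at once.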