[ClusterLabs] Failure of preferred node in a 2 node cluster

Digimer lists at alteeve.ca
Sun Apr 29 04:37:21 UTC 2018


Ah, ok, now I get it.

So node 2 should wait until it's confident that the lost node has either
shut down or been killed by its watchdog timer. After that, it will
consider the node fenced and proceed with recovery. I don't think ATB
will factor in here, as the cluster should treat this as a simple "node
was lost, fencing finally worked, it's safe to recover now" situation.
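
If this is sbd in watchdog-only (diskless) mode, that wait is
Pacemaker's stonith-watchdog-timeout. A minimal sketch, assuming RHEL 7
with pcs (the timeout values here are examples, not prescriptive):

    # /etc/sysconfig/sbd (both nodes) -- diskless sbd, hardware watchdog
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5

    # How long the survivor waits before assuming the lost node's
    # watchdog has fired; rule of thumb is twice SBD_WATCHDOG_TIMEOUT.
    pcs property set stonith-enabled=true
    pcs property set stonith-watchdog-timeout=10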

The node IDs shouldn't matter in this case. What decides the winner is
which node keeps access to the shared storage. The one that can is
allowed to keep kicking its watchdog. The one that loses access,
assuming it is alive at all, should be forced off when its watchdog
timer expires.
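
With sbd backed by a shared disk, that arbitration looks roughly like
the below. A sketch only; the device path is hypothetical:

    # Initialize the sbd header on the shared LUN (run once, from one node)
    sbd -d /dev/disk/by-id/scsi-SHARED_LUN create

    # /etc/sysconfig/sbd (both nodes)
    SBD_DEVICE=/dev/disk/by-id/scsi-SHARED_LUN
    SBD_WATCHDOG_DEV=/dev/watchdog

    # Each node's sbd daemon watches its slot on the disk and keeps
    # kicking the watchdog; a node that loses the disk, or reads a
    # "poison pill" written to its slot, stops kicking and gets reset.
    sbd -d /dev/disk/by-id/scsi-SHARED_LUN list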

digimer


On 2018-04-28 09:19 PM, Wei Shan wrote:
> Hi,
> 
> I'm using Red Hat Cluster Suite 7 with a watchdog-timer-based fence
> agent. I understand this is a really bad setup, but this is what the
> end-user wants.
> 
> ATB => auto_tie_breaker
> 
> "When the auto_tie_breaker is used in even-number member clusters, then
> the failure of the partition containing the auto_tie_breaker_node (by
> default the node with lowest ID) will cause other partition to become
> inquorate and it will self-fence. In 2-node clusters with
> auto_tie_breaker this means that failure of node favoured by
> auto_tie_breaker_node (typically nodeid 1) will result in reboot of
> other node (typically nodeid 2) that detects the inquorate state. If
> this is undesirable then corosync-qdevice can be used instead of the
> auto_tie_breaker to provide additional vote to quorum making behaviour
> closer to odd-number member clusters."
> 
> Thanks
> 
> 
> On Sun, 29 Apr 2018 at 02:15, Digimer <lists at alteeve.ca> wrote:
> 
>     On 2018-04-28 09:06 PM, Wei Shan wrote:
>     > Hi all,
>     >
>     > If I have a 2-node cluster with ATB enabled and the node with the
>     > lowest node ID fails, what will happen? My assumption is that the
>     > node with the higher node ID will self-fence and be rebooted. What
>     > happens after that?
>     >
>     > Thanks!
>     >
>     > --
>     > Regards,
>     > Ang Wei Shan
> 
>     Which cluster stack is this? I am not familiar with the term "ATB".
> 
>     If it's a standard pacemaker or cman/rgmanager cluster, then on node
>     failure, the good node should block and request a fence (a lost node
>     must not be assumed to have fenced itself, except when using a
>     watchdog-timer-based fence agent). If the fence doesn't work, the
>     survivor should remain blocked (better to hang than risk corruption).
>     If the fence succeeds, the survivor node will recover any lost
>     services based on the configuration of those services (usually a
>     simple (re)start on the good node).
> 
>     -- 
>     Digimer
>     Papers and Projects: https://alteeve.com/w/
>     "I am, somehow, less interested in the weight and convolutions of
>     Einstein’s brain than in the near certainty that people of equal talent
>     have lived and died in cotton fields and sweatshops." - Stephen Jay
>     Gould
> 
> 
> 
> -- 
> Regards,
> Ang Wei Shan
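
For reference, the behaviour described in that quote is driven by the
quorum section of corosync.conf. A minimal ATB sketch (the explicit
auto_tie_breaker_node value is illustrative; it defaults to the lowest
node ID):

    quorum {
        provider: corosync_votequorum
        expected_votes: 2
        auto_tie_breaker: 1
        # The partition holding this node keeps quorum in a 50/50 split.
        auto_tie_breaker_node: 1
    }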


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

