[ClusterLabs] Problem with stonith and starting services

Thu Jul 6 13:47:42 UTC 2017

On 07/04/2017 08:28 AM, Cesar Hernandez wrote:
> 
>>
>> Agreed, I don't think it's multicast vs unicast.
>>
>> I can't see from this what's going wrong. Possibly node1 is trying to
>> re-fence node2 when it comes back. Check that the fencing resources are
>> configured correctly, and check whether node1 sees the first fencing
>> succeed.
> 
> 
> Thanks. I checked the fencing resource and it always returns; it's a custom script I have used on other installations, and it has always worked.
> I think the clue is in the two messages that appear when it fails:
> 
> Jul  3 09:07:04 node2 pacemakerd[597]:  warning: The crmd process (608) can no longer be respawned, shutting the cluster down.
> Jul  3 09:07:04 node2 crmd[608]:     crit: We were allegedly just fenced by node1 for node1!
> 
> Does anyone know what they are related to? There doesn't seem to be much information on the Internet.
> 
> Thanks
> Cesar

"We were allegedly just fenced" means that the node just received a
notification from stonithd that another node successfully fenced it.
Clearly, this is a problem, because a node that is truly fenced should
be unable to receive any communications from the cluster. As such, the
cluster services immediately exit and stay down.

So, the above log means that node1 decided that node2 needed to be
fenced, requested fencing of node2, and received a successful result for
the fencing, and yet node2 was not killed.

Your fence agent should not return success until node2 has verifiably
been stopped. If there is some way to query the AWS API for whether node2
is running, that would be sufficient. Merely checking that the node is
not responding to some command such as ping is not sufficient, because an
unreachable node may still be running.
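
As a rough illustration of "do not return success until the node is
verifiably stopped", here is a minimal Python sketch of the confirmation
loop. It is not your script and not any shipped fence agent. The names
fence_and_confirm, request_stop and get_state are hypothetical. request_stop
and get_state stand for whatever calls your agent makes to the platform API
(on AWS, get_state might wrap an EC2 instance-state query), and the
"stopped" state string and the 0/1 return codes are assumptions to adapt to
your environment.

```python
import time
from typing import Callable

# Assumed convention: 0 means the node was fenced, non-zero means failure.
FENCE_OK = 0
FENCE_FAILED = 1


def fence_and_confirm(
    request_stop: Callable[[], None],   # hypothetical: asks the platform to power the node off
    get_state: Callable[[], str],       # hypothetical: asks the platform for the node's power state
    timeout: float = 60.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Request power-off, then report success only once the platform confirms it."""
    request_stop()
    deadline = clock() + timeout
    while True:
        try:
            state = get_state()
        except Exception:
            # A failed query is not proof that the node is down.
            state = "unknown"
        if state == "stopped":          # assumed name of the "powered off" state
            return FENCE_OK
        if clock() >= deadline:
            # Never claim success on a timeout; let the cluster retry or escalate.
            return FENCE_FAILED
        sleep(interval)
```

The point of the design is that the only path to a success result is a
positive answer from the platform that the node is off. A timeout, an API
error, or an unreachable node all end in failure.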