[ClusterLabs] Got pacemaker into a hung state

Ken Gaillot kgaillot at redhat.com
Mon Sep 16 15:13:32 UTC 2024


On Sun, 2024-09-15 at 19:47 -0400, Madison Kelly wrote:
> Hi all,
> 
>    I was working on our OCF RA, and had a bug where the RA hung
> (specifically, a DNS query returned a fake IP, probably a search
> engine, after entering an invalid domain, and the RA hung checking if
> the target was in ~/.ssh/known_hosts). I was trying to do a
> migration, which of course timed out and went into a FAILED state.
> 
>    I expected the FAILED state, but after that, both nodes were 
> repeatedly showing:
> 
> ====
> Sep 15 19:41:07 an-a01n02.alteeve.com pacemaker-controld[1283158]:  warning: Delaying join-33 finalization while transition in progress
> Sep 15 19:41:07 an-a01n02.alteeve.com pacemaker-controld[1283158]:  warning: Delaying join-33 finalization while transition in progress
> ====

That sounds like a bug. Once the timeout happened, the transition
should have been complete. The DC's pacemaker.log should show which
actions were needed in that transition, just before the most recent
"saving inputs" message preceding this point. Then you can check the
logs for the results of those actions to see whether something was
still in progress for a long time.
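
As a rough aid for that log search, here is a minimal Python sketch. It
assumes only that the scheduler's message contains the text "saving
inputs" and that the first "Delaying join-" warning marks the point of
interest; the function name and parameters are hypothetical, not part
of Pacemaker.

```python
from typing import Iterable, List, Optional


def context_before_last_marker(
    lines: Iterable[str],
    marker: str = "saving inputs",
    stop_at: Optional[str] = None,
    context: int = 20,
) -> List[str]:
    """Return up to `context` lines leading up to, and including, the last
    line containing `marker` that occurs before the first line containing
    `stop_at`. If `stop_at` is None or never appears, all lines are searched.
    """
    seen: List[str] = []
    for line in lines:
        # Stop reading once we reach the point of interest in the log.
        if stop_at is not None and stop_at in line:
            break
        seen.append(line.rstrip("\n"))

    marker_lc = marker.lower()
    # Walk backwards to find the most recent marker line.
    for idx in range(len(seen) - 1, -1, -1):
        if marker_lc in seen[idx].lower():
            return seen[max(0, idx - context): idx + 1]
    return []  # marker not found before the stop point
```

To use it, open the DC's pacemaker.log (often
/var/log/pacemaker/pacemaker.log, though the location depends on the
distribution and configuration) and pass the file object as `lines`
with `stop_at="Delaying join-"`. The returned lines are only a starting
point; the results of the listed actions still have to be looked up
further down in the log.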

Also, only the DC logs that message. Are you sure it was on both nodes
at the same time? If so, they must have lost cluster communication. But
of course that should lead to Corosync failure and fencing.

> 
>    I could not do a 'pcs resource cleanup', I could not withdraw the
> node I triggered the migration from, and even after I fenced the
> node that I had run the migration from, the peer remained stuck. In
> the end, I had to reboot both nodes in the pacemaker cluster.
> 
>    This was a dev system, so no harm, but now I am worried something 
> could leave a production system hung. How would you recover from a 
> situation like this, without rebooting?
> 
> Madi
> 
-- 
Ken Gaillot <kgaillot at redhat.com>


