[ClusterLabs] Got pacemaker into a hung state

Sun Sep 15 23:47:23 UTC 2024

Hi all,

   I was working on our OCF RA, and had a bug where the RA hung. 
(specifically, a DNS query returned a fake IP, probably a search engine 
after entering an invalid domain, and the RA hung checking if the target 
was in ~/.ssh/known_hosts). Specifically, I was trying to do a 
migration, which of course timed out and went into a FAILED state.

   I expected the FAILED state, but after that, both nodes were 
repeatedly showing:

====
Sep 15 19:41:07 an-a01n02.alteeve.com pacemaker-controld[1283158]:  
warning: Delaying join-33 finalization while transition in progress
Sep 15 19:41:07 an-a01n02.alteeve.com pacemaker-controld[1283158]:  
warning: Delaying join-33 finalization while transition in progress
====

   I could not do a 'pcs resource cleanup', I could not withdraw the 
node I triggered the migration from, and even after I fenced the node 
that I had run the migration from, the peer remained stuck. In the end, 
I had to reboot both nodes in the pacemaker cluster.

   This was a dev system, so no harm, but now I am worried something 
could leave a production system hung. How would you recover from a 
situation like this, without rebooting?

Madi

-- 
wiki - https://alteeve.com/w
cell - 647-471-0951
work - 647-417-7486 x 404