[ClusterLabs] Got pacemaker into a hung state
Madison Kelly
mkelly at alteeve.com
Sun Sep 15 23:47:23 UTC 2024
Hi all,
I was working on our OCF RA and hit a bug where the RA hung. (A DNS
query returned a fake IP, probably a search engine, after an invalid
domain was entered, and the RA hung while checking whether the target
was in ~/.ssh/known_hosts.) I was trying to do a migration at the
time, which of course timed out and went into a FAILED state.
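For illustration only, here is a minimal sketch of bounding an external
check like this with a hard timeout, so a hung lookup gives up well
before the cluster's operation timeout. This is not the actual RA code:
the helper names run_bounded and host_is_known, the 5 second default,
and the use of 'ssh-keygen -F' for the known_hosts lookup are all
assumptions made for the example.

```python
import subprocess
import sys


def run_bounded(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it ran past timeout_s seconds."""
    try:
        # subprocess.run() kills the child and raises TimeoutExpired
        # when the timeout elapses, so the caller never blocks forever.
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # never wait on an interactive prompt
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.returncode


def host_is_known(host, timeout_s=5.0):
    """Hypothetical check: True/False if ssh-keygen answered, None on timeout."""
    # Assumption: 'ssh-keygen -F <host>' exits 0 when the host is found
    # in the default known_hosts file.
    rc = run_bounded(["ssh-keygen", "-F", host], timeout_s)
    return None if rc is None else rc == 0
```

An agent could treat a None result as an error and return a failure
code right away, rather than waiting for the operation to time out.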
I expected the FAILED state, but after that, both nodes were
repeatedly showing:
====
Sep 15 19:41:07 an-a01n02.alteeve.com pacemaker-controld[1283158]: warning: Delaying join-33 finalization while transition in progress
Sep 15 19:41:07 an-a01n02.alteeve.com pacemaker-controld[1283158]: warning: Delaying join-33 finalization while transition in progress
====
I could not do a 'pcs resource cleanup', I could not withdraw the
node I triggered the migration from, and even after I fenced the node
that I had run the migration from, the peer remained stuck. In the end,
I had to reboot both nodes in the pacemaker cluster.
This was a dev system, so no harm, but now I am worried something
could leave a production system hung. How would you recover from a
situation like this, without rebooting?
Madi
--
wiki - https://alteeve.com/w
cell - 647-471-0951
work - 647-417-7486 x 404