[ClusterLabs] Connections to attrd are fragile if it is restarted
Thomas Jones
Thomas.Jones at ibm.com
Tue May 6 20:28:07 UTC 2025
I have seen an issue where the following sequence of events happens:
1. pacemaker-attrd becomes unresponsive to IPC for some reason.
2. pacemakerd detects this and kills it, then restarts it.
3. pacemaker-controld attempts to connect to pacemaker-attrd before it has finished starting, gets ECONNREFUSED, and treats this as a fatal error.
4. pacemakerd reaps the exit status of controld and starts shutting down.
Fencing was not configured, so this resulted in losing automation on the host, but either way this seems like something that pacemaker should handle.
While reproducing the issue by forcing attrd to be unresponsive with SIGSTOP, I have also observed it failing with a slightly different problem:
1. pacemaker-attrd becomes unresponsive to IPC for some reason.
2. pacemaker-controld connects to attrd and sends it a request.
3. pacemakerd kills attrd and restarts it.
4. pacemaker-controld times out waiting for a response, receives ENOTCONN, treats that as a fatal error, and exits.
5. pacemakerd reaps the exit status of controld and starts shutting down.
I saw this in pacemaker 2.1.6, but from reading the code I believe pacemaker 3.0 will behave identically if the timing lines up.
I have a patch for 2.1.6 that adds these two failures to the existing retry loop in connect_and_send_attrd_request(). Looking at the possible errors from the code paths, I think it's likely only worth retrying for the error codes I mentioned.
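For illustration, here is a minimal sketch of the shape of that change, assuming an errno-style return code and a simple fixed retry budget. The helper name, prototype, retry count, and sleep interval are all made up for the example; this is not pacemaker's actual code:

    #include <errno.h>
    #include <unistd.h>

    #define MAX_ATTEMPTS 5

    /* Hypothetical: one_attempt() performs a single connect-and-send and
     * returns 0 on success or an errno value on failure. */
    extern int one_attempt(void *api, const void *request);

    static int
    connect_and_send_with_retries(void *api, const void *request)
    {
        int rc = ENOTCONN;

        for (int tries = 0; tries < MAX_ATTEMPTS; tries++) {
            rc = one_attempt(api, request);
            switch (rc) {
                case EAGAIN:        /* already retried in 2.1.6 */
                case EALREADY:      /* already retried in 2.1.6 */
                case ECONNREFUSED:  /* attrd respawned, not yet listening */
                case ENOTCONN:      /* connection dropped mid-request */
                    sleep(1);       /* give pacemakerd time to restart attrd */
                    continue;
                default:
                    return rc;      /* success, or an error worth failing on */
            }
        }
        return rc;                  /* still failing after all attempts */
    }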
The issue I have is that the refactoring done since 2.1.6 has pushed the retry loops for EAGAIN and EALREADY down the stack. Just adding retries to pcmk__connect_ipc() wouldn't be a good idea, since the daemons call it to check for existing instances of themselves and decidedly don't want to retry on ECONNREFUSED. Similarly, I'm not sure about the implications of adding retries to pcmk__send_ipc_request().
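To make that self-check concern concrete, here is a hypothetical sketch (again, not pacemaker's code; the probe function is invented) of why a daemon's startup check relies on getting ECONNREFUSED back promptly:

    #include <errno.h>
    #include <stdbool.h>

    /* Hypothetical: make exactly one connection attempt to the named
     * daemon's IPC endpoint; returns 0 on success or an errno value. */
    extern int probe_daemon_ipc(const char *name);

    static bool
    instance_already_running(const char *name)
    {
        int rc = probe_daemon_ipc(name);

        /* During startup, ECONNREFUSED is the expected, desirable answer:
         * nothing is listening, so this daemon is free to start. If the
         * probe retried on ECONNREFUSED, every normal startup would stall
         * until the retries were exhausted. */
        return (rc == 0);
    }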
I'd like to get a fix for this upstream: though the consequences are much less severe if fencing is enabled, this is still something pacemaker is supposed to be able to handle without going down.
Is the proper fix to add retries on ENOTCONN and ECONNREFUSED to every call site of connect_and_send_attrd_request() in the pacemaker daemons?
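If per-call-site retries are the preferred shape, a thin wrapper like the following could keep the duplication manageable. Once more, only a sketch: the prototype for connect_and_send_attrd_request() and the retry budget are assumptions for the example, not the real API:

    #include <errno.h>
    #include <unistd.h>

    /* Assumed prototype for the example; the real function's signature
     * may differ. Returns 0 on success or an errno value on failure. */
    extern int connect_and_send_attrd_request(void *api, const void *request);

    static int
    attrd_request_with_reconnect(void *api, const void *request)
    {
        int rc = ENOTCONN;

        for (int tries = 0; tries < 5; tries++) {
            rc = connect_and_send_attrd_request(api, request);
            if ((rc != ECONNREFUSED) && (rc != ENOTCONN)) {
                break;      /* success, or an error a retry won't fix */
            }
            sleep(1);       /* wait for pacemakerd to respawn attrd */
        }
        return rc;
    }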
I am willing to work on a fix myself, but I'm wondering what it should look like to be accepted. The patch I have against 2.1.6 is attached. Ideas for improving the fix for that version are also very welcome.
Thomas Jones
Software Developer
He/Him
IBM
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-fix-possible-fatal-error-if-we-get-connection-errors.patch
Type: text/x-patch
Size: 1550 bytes
Desc: 0001-fix-possible-fatal-error-if-we-get-connection-errors.patch
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20250506/5b8ae340/attachment.bin>