[ClusterLabs] Connections to attrd are fragile if it is restarted
Thomas Jones
Thomas.Jones at ibm.com
Mon May 12 19:50:19 UTC 2025
Alright, sounds reasonable.
Though even if another daemon became unresponsive, wouldn't pacemakerd just restart it? The main issue I'm trying to address is that a failure pacemaker would otherwise be able to deal with can instead cause it to shut down entirely if the timing is wrong. I think the liveness timeout is long enough that this shouldn't be an issue if just restarting attrd solves it: we'd never end up waiting longer than the timeouts plus the wait time of the retry loop. Even if the liveness check does trip, pacemaker should still recover, shouldn't it?
A few seconds should be more than long enough, since any of the failures here would happen after attrd has been killed, so it just needs enough time to start back up.
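Just to make the timing concrete, here is a toy, self-contained example (the helper, attempt count, and delay are all made up rather than taken from the pacemaker tree): with five attempts and a one-second pause the worst case is about five seconds, so the total wait stays small and bounded.

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Stand-in for one connect+send attempt toward attrd; not a real
     * pacemaker function.  It fails twice and then succeeds, mimicking
     * attrd being respawned by pacemakerd. */
    static int
    try_send_request(void)
    {
        static int calls = 0;

        return (++calls < 3) ? ENOTCONN : 0;
    }

    int
    main(void)
    {
        const int max_attempts = 5;   /* placeholder: 5 x 1s = 5s worst case */
        const int retry_delay_s = 1;
        int rc = ENOTCONN;

        for (int attempt = 1; attempt <= max_attempts; attempt++) {
            rc = try_send_request();
            if ((rc != ENOTCONN) && (rc != ECONNREFUSED)) {
                break;                /* success or a non-retryable error */
            }
            sleep(retry_delay_s);     /* give attrd time to come back up */
        }

        printf("final rc: %d\n", rc);
        return rc;
    }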
I have created a PR against the main branch. https://github.com/ClusterLabs/pacemaker/pull/3878
________________________________
From: Chris Lumens <clumens at redhat.com>
Sent: 07 May 2025 13:14
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Cc: Thomas Jones <Thomas.Jones at ibm.com>
Subject: [EXTERNAL] Re: [ClusterLabs] Connections to attrd are fragile if it is restarted
Hi, Thomas.
Thanks for looking into this. We have a couple of bugs about this in the RH bug tracker, and while I don't see anything in the upstream tracker, I wouldn't be surprised if there's an issue filed there too. So it's great to have someone taking a look.
I am about to leave for a short vacation through Monday, so I do not
have the time right now to give your email the consideration it
deserves. I will definitely read it more thoroughly when I get back.
I am working on a lot of this IPC code right now, in fact (see
https://github.com/ClusterLabs/pacemaker/pull/3877 for instance) so
hopefully that means we can get this problem figured out.
>I'm not sure about the implications of adding retries to
>pcmk__send_ipc_request()
On main, there are two things to consider for retrying in that function.
The most obvious is the while loop at the end for reading synchronous
replies. We're already looping for EAGAIN, so looping for other error
codes is reasonable.
The biggest concern there is that, for any callers looking for a sync reply, looping will block them, which can lead to daemons becoming unresponsive (and I think that would cause us to hit this same problem again).
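Roughly what I'm picturing, as a sketch only (the helper name is made up and this is not the actual pcmk__send_ipc_request code): treat ENOTCONN/ECONNREFUSED like EAGAIN in that loop, but keep the whole thing under the caller's deadline so a synchronous caller can never be blocked indefinitely.

    #include <errno.h>
    #include <time.h>

    /* Stand-in for one attempt to read a reply off the IPC connection;
     * not a real pacemaker function.  Here the reply "arrives" on the
     * fourth try. */
    static int
    read_reply_once(void)
    {
        static int calls = 0;

        return (++calls < 4) ? EAGAIN : 0;
    }

    /* Retry EAGAIN (as today) plus ENOTCONN/ECONNREFUSED, but never past
     * the caller's timeout, so a synchronous caller cannot hang forever. */
    static int
    wait_for_reply(int timeout_ms)
    {
        struct timespec start;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (;;) {
            struct timespec now, pause = { 0, 100000000L };  /* 100ms */
            long elapsed_ms;
            int rc = read_reply_once();

            if ((rc != EAGAIN) && (rc != ENOTCONN) && (rc != ECONNREFUSED)) {
                return rc;           /* got a reply, or a fatal error */
            }

            clock_gettime(CLOCK_MONOTONIC, &now);
            elapsed_ms = (now.tv_sec - start.tv_sec) * 1000
                         + (now.tv_nsec - start.tv_nsec) / 1000000;
            if (elapsed_ms >= timeout_ms) {
                return ETIMEDOUT;    /* give up rather than block the caller */
            }
            nanosleep(&pause, NULL); /* brief pause before retrying */
        }
    }

    int
    main(void)
    {
        return wait_for_reply(5000);
    }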
The other one is crm_ipc_send, which despite its name can receive, then send, and then receive again. Luckily it is already set up to handle timeouts, so we should be able to handle additional error codes there too.
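As a sketch of that receive/send/receive pattern (again with made-up helpers, not the real crm_ipc_send internals), the idea would be to retry the connection-reset errors at any step; I've bounded it by a small attempt count here, though the real code would bound it by the caller's timeout instead.

    #include <errno.h>
    #include <stdio.h>

    /* Stubs standing in for the internal receive/send/receive steps;
     * none of these are real pacemaker functions. */

    static int
    drain_stale_replies(void)       /* the initial "receive" */
    {
        return 0;
    }

    static int
    send_request(void)
    {
        static int calls = 0;

        return (++calls < 2) ? ECONNREFUSED : 0;   /* fail once, then work */
    }

    static int
    wait_for_reply(void)            /* the final "receive" */
    {
        return 0;
    }

    /* Connection-reset errors at any step are retried, bounded here by a
     * small attempt count; the real code would bound it by the caller's
     * timeout instead. */
    static int
    ipc_roundtrip(int max_attempts)
    {
        int rc = ENOTCONN;

        for (int attempt = 0; attempt < max_attempts; attempt++) {
            rc = drain_stale_replies();
            if (rc == 0) {
                rc = send_request();
            }
            if (rc == 0) {
                rc = wait_for_reply();
            }
            if ((rc != ENOTCONN) && (rc != ECONNREFUSED)) {
                break;
            }
        }
        return rc;
    }

    int
    main(void)
    {
        printf("rc: %d\n", ipc_roundtrip(3));
        return 0;
    }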
>Is the proper fix to add retries on ENOTCONN and ECONNREFUSED to every
>call site of connect_and_send_attrd_request() in one of the pacemaker
>daemons?
I would have to think about this more, but maybe instead of adding retries to the callers, we should add retries to connect_and_send_attrd_request itself. That way, if initiating the IPC connection succeeds but sending the request does not, we're not spamming the IPC server with new connections.
In general, we want to try our best to keep the number of IPC messages
to a minimum in order to improve responsiveness.
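Something like the following is what I mean, purely as a sketch: the real connect_and_send_attrd_request and its callees look nothing like this. The point is that the retry logic lives in one place, we only reconnect when the connection is actually gone, and we cap the attempts so we never flood attrd with new connections.

    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Stubs for illustration only; the real connect_and_send_attrd_request
     * and its callees look nothing like this. */

    static bool connected = false;

    static int
    attrd_connect(void)
    {
        connected = true;             /* pretend the connection succeeded */
        return 0;
    }

    static int
    attrd_send(void)
    {
        static int calls = 0;

        if (!connected) {
            return ENOTCONN;
        }
        if (++calls < 2) {            /* first send fails: attrd restarting */
            connected = false;
            return ECONNREFUSED;
        }
        return 0;
    }

    /* The retries live here rather than in the callers: reconnect only
     * when the connection is actually gone, and cap the attempts so we
     * never flood attrd with new connections. */
    static int
    connect_and_send_attrd_request(void)
    {
        const int max_attempts = 5;   /* placeholder */
        int rc = ENOTCONN;

        for (int attempt = 1; attempt <= max_attempts; attempt++) {
            if (!connected) {
                rc = attrd_connect();
                if (rc != 0) {
                    sleep(1);         /* wait for pacemakerd to respawn attrd */
                    continue;
                }
            }
            rc = attrd_send();
            if ((rc != ENOTCONN) && (rc != ECONNREFUSED)) {
                break;                /* success or a non-retryable error */
            }
            sleep(1);
        }
        return rc;
    }

    int
    main(void)
    {
        printf("rc: %d\n", connect_and_send_attrd_request());
        return 0;
    }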
>I am willing to work on a fix myself, but I'm wondering what it should
>look like to get accepted. Patch that I have against 2.1.6 is attached.
>Ideas for improvements in the fix for that version are also very
>welcome.
Typically we do patch review on GitHub, which would require you to have an account there, fork the pacemaker repo, write a patch, and open a pull request. That's kind of a high bar, especially for a small-ish patch, so we can just do the review here.
I'd be happy to look into all of this more next week, but anyone else working on pacemaker who wants to take a look is welcome to as well.
- Chris