<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Alright, sounds reasonable.</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Though even if another daemon became unresponsive wouldn't pacemakerd just restart it? The main issue I'm trying to address is that it's possible for a failure pacemaker would otherwise be able to deal with to result in it just shutting down if the timing is
wrong. I think the timeout for liveliness is long enough that it shouldn't be an issue if just restarting attrd will solve it. We'd never end up waiting longer than the timeouts and the wait time of the retry loop. Even if the liveliness check does trip pacemaker
should still recover shouldn't it?</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
A few seconds should be more than long enough since any of the failures here would be after the attrd has been killed, so it just needs enough time to start back up.</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
I have created a PR against the main branch. <a href="https://github.com/ClusterLabs/pacemaker/pull/3878" id="LPlnk976385">
https://github.com/ClusterLabs/pacemaker/pull/3878</a></div>
<div id="appendonsend" style="color: inherit;"></div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<hr style="display: inline-block; width: 98%;">
<div id="divRplyFwdMsg" dir="ltr" style="color: inherit;"><span style="font-family: Calibri, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);"><b>From:</b> Chris Lumens <clumens@redhat.com><br>
<b>Sent:</b> 07 May 2025 13:14<br>
<b>To:</b> Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org><br>
<b>Cc:</b> Thomas Jones <Thomas.Jones@ibm.com><br>
<b>Subject:</b> [EXTERNAL] Re: [ClusterLabs] Connections to attrd are fragile if it is restarted</span>
<div> </div>
</div>
<div class="elementToProof" style="font-size: 11pt;">Hi, Thomas.<br>
<br>
Thanks for looking into this. We have a couple bugs in the RH bug<br>
tracker about this, and while I don't see anything in the upstream<br>
tracker, I wouldn't be surprised if there's an issue filed in there too.<br>
So, it's great to have someone taking a look.<br>
<br>
I am about to leave for a short vacation through Monday, so I do not<br>
have the time right now to give your email the consideration it<br>
deserves. I will definitely read it more thoroughly when I get back.<br>
<br>
I am working on a lot of this IPC code right now, in fact (see<br>
<a href="https://github.com/ClusterLabs/pacemaker/pull/3877" id="OWA5122f079-0292-e88e-0d58-4459fd1cb246" class="OWAAutoLink" data-auth="NotApplicable">https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ClusterLabs_pacemaker_pull_3877&d=DwIBAg&c=BSDicqBQBDjDI9RkVyTcHQ&r=83HsPgg1yF8WZuJd0f98-SMMOaVQmS2uG-iYW0MPHs8&m=KRvMNWj4hMk26GBoT608u9qXvFks34z2xLHNsz8R8QSGEyWlG5V7S2oRljrZCrJV&s=pKC2rySLvwTx0GgtsYVVkqowa1Bg5A2gsx1p1uQc2qE&e=</a>
for instance) so<br>
hopefully that means we can get this problem figured out.<br>
<br>
>I'm not sure about the implications of adding retries to<br>
>pcmk__send_ipc_request()<br>
<br>
On main, there's two things to consider for retrying in that function.<br>
The most obvious is the while loop at the end for reading synchronous<br>
replies. We're already looping for EAGAIN, so looping for other error<br>
codes is reasonable.<br>
<br>
The biggest concern there is that for any callers looking for a sync<br>
reply, looping will block them which can lead to daemons being<br>
unresponsive (which I think would cause us to hit this same problem<br>
again).<br>
<br>
The other one is crm_ipc_send, which despite its name can receive and<br>
then send and then receive again. It luckily is set up to handle<br>
timeouts, so we should be good to deal with additional error codes there<br>
too.<br>
<br>
>Is the proper fix to add retries on ENOTCONN and ECONNREFUSED to every<br>
>call site of connect_and_send_attrd_request() in one of the pacemaker<br>
>daemons?<br>
<br>
I would have to think about this more, but I think maybe instead of<br>
adding retries to the callers, we should add retries to<br>
connect_and_send_attrd_request itself. That way if initiating the IPC<br>
connection succeeds but sending the request does not, we're not spamming<br>
the IPC server with new connections.<br>
<br>
In general, we want to try our best to keep the number of IPC messages<br>
to a minimum in order to improve responsiveness.<br>
<br>
>I am willing to work on a fix myself, but I'm wondering what it should<br>
>look like to get accepted. Patch that I have against 2.1.6 is attached.<br>
>Ideas for improvements in the fix for that version are also very<br>
>welcome.<br>
<br>
Typically we do patch review on github, which would require you to have<br>
an account there, fork the pacemaker repo, write a patch, and send a<br>
pull request. That's kind of a high bar especially for a small-ish<br>
patch, so we can just do review here.<br>
<br>
I'd be happy to look into all this more next week, but anyone else on<br>
pacemaker that wants to take a look as well would be welcome to.<br>
<br>
- Chris<br>
<br>
</div>
</body>
</html>