[ClusterLabs] Q: repeating message " cmirrord[17741]: [yEa32lLX] Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN"

Mon Nov 12 03:25:11 EST 2018

Hello Ulrich,

Can you reproduce this issue reliably? If so, please share the steps.
We have run into a similar issue: it looks as if cmirrord cannot join the CPG (closed process group, a corosync concept), so the resource operation times out and the node is fenced.
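
To see how often the message repeats and whether the retry number ever changes, the log can be tallied with a short script. This is only a sketch: the regular expression is derived from the log lines quoted below, and the function name count_retries is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the syslog lines quoted in this thread (an assumption
# about the format on other systems). Only the stable part of the message is
# matched, so the trailing error code does not matter.
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"cmirrord\[(?P<pid>\d+)\]:.*Retry #(?P<retry>\d+) of cpg_mcast_joined"
)


def count_retries(lines):
    """Count cpg_mcast_joined retry messages per (timestamp, retry number)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            key = (match.group("stamp"), int(match.group("retry")))
            counts[key] += 1
    return counts
```

Feeding it the lines of the syslog file (for example /var/log/messages, which is an assumption about where syslog writes on your nodes) gives one counter entry per second and retry number, so it is easy to check whether anything other than "Retry #1" ever shows up.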

Thanks
Gang

>>> On 2018/11/12 at 15:46, in message
<5BE92FC2020000A10002E056 at gwsmtp1.uni-regensburg.de>, "Ulrich Windl"
<Ulrich.Windl at rz.uni-regensburg.de> wrote:
> Hi!
> 
> While analyzing some odd cluster problem in SLES11 SP4, I found this message 
> repeating quite a lot (several times per second) with the same text:
> 
> [...more...]
> Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of 
> cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
> Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of 
> cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
> Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of 
> cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
> Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of 
> cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
> Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of 
> cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
> Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of 
> cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
> Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of 
> cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
> Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of 
> cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
> Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of 
> cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
> Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of 
> cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
> Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of 
> cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
> [...many more...]
> 
> I wonder: Shouldn't the retry number be incremented? Or are these different 
> retries? If so, where is it visible?
> 
> The situation I'm analyzing is one where a node should have been fenced but 
> somehow wasn't; instead it just stopped working (it seemed frozen). The 
> last message, a few minutes(!) before the other nodes complained, was:
> 
> Nov 10 22:04:18 h01 crmd[16596]:   notice: throttle_mode: High CIB load 
> detected: 1.246333
> (After this the node seemed dead/frozen).
> 
> Regards,
> Ulrich