[ClusterLabs] Q: repeating message "cmirrord[17741]: [yEa32lLX] Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN"

Mon Nov 12 02:46:10 EST 2018

Hi!

While analyzing an odd cluster problem on SLES11 SP4, I found this message repeating very frequently (several times per second), always with the same text:

[...more...]
Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: SA_AIS_ERR_TRY_AGAIN
[...many more...]

I wonder: Shouldn't the retry number be incremented? Or are these different retries? If so, where is it visible?
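To put numbers on this, a minimal sketch along the following lines could tally the messages per second and per retry number. The function name summarize_retries and the idea of feeding it syslog on stdin are just assumptions for illustration, and the regular expression is modelled only on the lines quoted above. If only "#1" ever shows up in the tally, that would at least confirm that the counter never gets past 1 in the log.

```python
import re
import sys
from collections import Counter

# Matches syslog lines shaped like the ones quoted above, e.g.
# "Nov 10 22:10:47 h05 cmirrord[17741]: [yEa32lLX]  Retry #1 of cpg_mcast_joined: ..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"cmirrord\[(?P<pid>\d+)\]:.*Retry #(?P<retry>\d+) of cpg_mcast_joined"
)


def summarize_retries(lines):
    """Count matching messages per (timestamp, host) and per retry number.

    Hypothetical helper: lines that do not match the pattern are ignored.
    """
    per_second = Counter()
    per_retry = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue
        per_second[(m.group("ts"), m.group("host"))] += 1
        per_retry[int(m.group("retry"))] += 1
    return per_second, per_retry


if __name__ == "__main__":
    # Assumed usage: pipe the relevant syslog file into this script.
    per_second, per_retry = summarize_retries(sys.stdin)
    for (ts, host), count in sorted(per_second.items()):
        print(f"{ts} {host}: {count} message(s)")
    for retry, count in sorted(per_retry.items()):
        print(f"Retry #{retry}: {count} message(s)")
```
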

The situation I'm analyzing is one where a node should have been fenced but somehow wasn't; instead it simply stopped working (it seemed frozen). The last message, a few minutes(!) before the other nodes complained, was:

Nov 10 22:04:18 h01 crmd[16596]:   notice: throttle_mode: High CIB load detected: 1.246333
(After this, the node seemed dead/frozen.)

Regards,
Ulrich