[ClusterLabs] Too quick node reboot leads to failed corosync assert on other node(s)

Thu Feb 18 09:40:01 UTC 2016

On 12/02/16 15:51, Jan Friesse wrote:
> Michal,
> 
> 
>> Hello.
>>
>> The subject is just a hypothesis that I'd like to confirm/discuss here.
>>
>> TL;DR Token timeout shouldn't be greater than reboot cycle, is that
>> correct?
>>
> 
> actually when I was reading your mail it was like "this sounds so
> familiar, we solved this problem looong time ago". Chrissie, are you
> able to remember/find patch solving this problem.
> 

I definitely remember looking into this, or something very like it, ages
ago. I can't find anything in the commit logs for either corosync or
cman that looks relevant though. If you're seeing it on recent builds
then it's obviously still a problem anyway and we ought to look into it!

Chrissie

>>
>> The situation with the assert is as follows:
>>
>> node01:
>> 18:08:02 scheduling node01 for STONITH
>> [ gap ]
>> 18:08:26 start of log after reboot
>> 18:08:36 corosync started
>>
>> node02:
>> 18:08:02 it starts queuing outgoing messages (it has no token)
>> 18:08:32 timeout, it realizes token is lost in OPERATIONAL state
>> 18:08:32 it enters gather->commit->recovery process
>> 18:08:37 it received commit token for the 2nd time and wants to
>>           enter recovery state
>> 18:08:37 it thinks that node01 is transitioning from the old ring
>>           configuration to the new and wants to retransmit old messages
>>           with seq number greater that node01's aru -- problem is that
>>           the difference is too large and
>>           'assert(range < QUEUE_RTR_ITEMS_SIZE_MAX)' in
>>           memb_state_recovery_enter fails
>>
>> IOW, node01 shouldn't be considered transitioning from old ring to new
>> ring only because it is present in both configurations because it was
>> restarted(!) in between. In the end, its new aru is zero, however, the
>> seq number before the token loss might have been arbitrarily high,
>> especially greater than QUEUE_RTR_ITEMS_SIZE_MAX and thus the assertion
>> fails.
> 
> Yep, sounds very reasonable.
> 
>>
>> I'm referring to the Shift_to_recovery routine as described in paper [1]
>> on page 21. There are two types of nodes:
>> - my_new_memb (they are fresh new in the ring),
>> - my_trans_memb (they transition from the old ring).
>>
>> I've looked more into the paper, rather than into the code, so my
>> reasoning might not fit the reality. This happened with corosync-1.4.7,
>> however, the relevant code of memb_state_recovery_enter in upstream is
>> still the same.
>>
>> Below is last output [2] from corosync-fplay on the node which failed
>> the assertion.
>> Does this make sense to anyone?
>> FTR, totem:token value is set to 30000ms, perhaps decreasing this below
>> reboot cycle would prevent the issue.
> 
> Highly probably. Also did you find any difference between case where
> node was fenced (so corosync wasn't killed correctly) and case where
> corosync was properly terminated?
> 
> Do you have any reproducer for this problem? It's for sure something
> good to have fixed.
> 
> Regards,
>   Honza
> 
> 
>>
>> Thanks,
>> Michal
>>
>> [1] http://corosync.github.com/corosync/doc/tocssrp95.ps.gz
>> [2] corosync-fplay output
>>
>> rec=[5444857] time=[2015-10-24 18:08:02] Tracing(1) Messsage=Delivering
>> MCAST message with seq 12a9b4 to pending delivery queue
>> rec=[5444858] time=[2015-10-24 18:08:02] Tracing(1) Messsage=releasing
>> messages up to and including 12a9b3
>> rec=[5444859] time=[2015-10-24 18:08:02] Tracing(1) Messsage=releasing
>> messages up to and including 12a9b4
>>
>> rec=[5444860] time=[2015-10-24 18:08:04] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444861] time=[2015-10-24 18:08:06] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444862] time=[2015-10-24 18:08:07] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444863] time=[2015-10-24 18:08:09] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444864] time=[2015-10-24 18:08:10] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444865] time=[2015-10-24 18:08:12] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444866] time=[2015-10-24 18:08:13] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444867] time=[2015-10-24 18:08:15] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444868] time=[2015-10-24 18:08:16] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444869] time=[2015-10-24 18:08:18] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444870] time=[2015-10-24 18:08:20] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444871] time=[2015-10-24 18:08:21] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444872] time=[2015-10-24 18:08:23] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444873] time=[2015-10-24 18:08:24] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444874] time=[2015-10-24 18:08:26] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444875] time=[2015-10-24 18:08:27] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444876] time=[2015-10-24 18:08:30] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444877] time=[2015-10-24 18:08:31] Tracing(1) Messsage=mcasted
>> message added to pending queue
>>
>> rec=[5444878] time=[2015-10-24 18:08:32] Log Message=The token was lost
>> in the OPERATIONAL state.
>> rec=[5444879] time=[2015-10-24 18:08:32] Log Message=A processor failed,
>> forming new configuration.
>> rec=[5444880] time=[2015-10-24 18:08:32] Log Message=Receive multicast
>> socket recv buffer size (262142 bytes).
>> rec=[5444881] time=[2015-10-24 18:08:32] Log Message=Transmit multicast
>> socket send buffer size (262142 bytes).
>> rec=[5444882] time=[2015-10-24 18:08:32] Log Message=Local receive
>> multicast loop socket recv buffer size (262142 bytes).
>> rec=[5444883] time=[2015-10-24 18:08:32] Log Message=Local transmit
>> multicast loop socket send buffer size (262142 bytes).
>> rec=[5444884] time=[2015-10-24 18:08:32] Log Message=entering GATHER
>> state from 2.
>> rec=[5444885] time=[2015-10-24 18:08:33] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444886] time=[2015-10-24 18:08:34] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444887] time=[2015-10-24 18:08:36] Tracing(1) Messsage=mcasted
>> message added to pending queue
>> rec=[5444888] time=[2015-10-24 18:08:37] Log Message=got commit token
>> rec=[5444889] time=[2015-10-24 18:08:37] Log Message=Saving state aru
>> 12a9b4 high seq received 12a9b4
>> rec=[5444890] time=[2015-10-24 18:08:37] Log Message=Storing new
>> sequence id for ring 6a4
>> rec=[5444891] time=[2015-10-24 18:08:37] Log Message=entering COMMIT
>> state.
>> rec=[5444892] time=[2015-10-24 18:08:37] Log Message=got commit token
>> rec=[5444893] time=[2015-10-24 18:08:37] Log Message=entering RECOVERY
>> state.
>> rec=[5444894] time=[2015-10-24 18:08:37] Log Message=TRANS [0] member
>> 10.10.102.65: (node1, rebooted node)
>> rec=[5444895] time=[2015-10-24 18:08:37] Log Message=TRANS [1] member
>> 10.10.102.66: (node2)
>> rec=[5444896] time=[2015-10-24 18:08:37] Log Message=position [0] member
>> 10.10.102.65:
>> rec=[5444897] time=[2015-10-24 18:08:37] Log Message=previous ring seq
>> 6a0 rep 10.10.102.65
>>
>>
>> rec=[5444898] time=[2015-10-24 18:08:37] Log Message=aru 0 high
>> delivered 0 received flag 0
>>
>> rec=[5444899] time=[2015-10-24 18:08:37] Log Message=position [1] member
>> 10.10.102.66:
>> rec=[5444900] time=[2015-10-24 18:08:37] Log Message=previous ring seq
>> 6a0 rep 10.10.102.65
>> rec=[5444901] time=[2015-10-24 18:08:37] Log Message=aru 12a9b4 high
>> delivered 12a9b4 received flag 1
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>
>