[ClusterLabs] Too quick node reboot leads to failed corosync assert on other node(s)

Wed Feb 10 11:45:42 EST 2016

Hello.

The subject is just a hypothesis that I'd like to confirm/discuss here.

TL;DR Token timeout shouldn't be greater than reboot cycle, is that correct?

The situation with the assert is as follows:

node01:
18:08:02 scheduling node01 for STONITH
[ gap ]
18:08:26 start of log after reboot
18:08:36 corosync started

node02:
18:08:02 it starts queuing outgoing messages (it has no token)
18:08:32 timeout, it realizes token is lost in OPERATIONAL state
18:08:32 it enters gather->commit->recovery process
18:08:37 it received commit token for the 2nd time and wants to
         enter recovery state
18:08:37 it thinks that node01 is transitioning from the old ring
         configuration to the new and wants to retransmit old messages
         with seq number greater that node01's aru -- problem is that
         the difference is too large and
         'assert(range < QUEUE_RTR_ITEMS_SIZE_MAX)' in
         memb_state_recovery_enter fails

IOW, node01 shouldn't be considered transitioning from old ring to new
ring only because it is present in both configurations because it was
restarted(!) in between. In the end, its new aru is zero, however, the
seq number before the token loss might have been arbitrarily high,
especially greater than QUEUE_RTR_ITEMS_SIZE_MAX and thus the assertion
fails.

I'm referring to the Shift_to_recovery routine as described in paper [1]
on page 21. There are two types of nodes:
- my_new_memb (they are fresh new in the ring),
- my_trans_memb (they transition from the old ring).

I've looked more into the paper, rather than into the code, so my
reasoning might not fit the reality. This happened with corosync-1.4.7,
however, the relevant code of memb_state_recovery_enter in upstream is
still the same.

Below is last output [2] from corosync-fplay on the node which failed
the assertion.
Does this make sense to anyone?
FTR, totem:token value is set to 30000ms, perhaps decreasing this below
reboot cycle would prevent the issue.

Thanks,
Michal

[1] http://corosync.github.com/corosync/doc/tocssrp95.ps.gz
[2] corosync-fplay output

rec=[5444857] time=[2015-10-24 18:08:02] Tracing(1) Messsage=Delivering
MCAST message with seq 12a9b4 to pending delivery queue
rec=[5444858] time=[2015-10-24 18:08:02] Tracing(1) Messsage=releasing
messages up to and including 12a9b3
rec=[5444859] time=[2015-10-24 18:08:02] Tracing(1) Messsage=releasing
messages up to and including 12a9b4

rec=[5444860] time=[2015-10-24 18:08:04] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444861] time=[2015-10-24 18:08:06] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444862] time=[2015-10-24 18:08:07] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444863] time=[2015-10-24 18:08:09] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444864] time=[2015-10-24 18:08:10] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444865] time=[2015-10-24 18:08:12] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444866] time=[2015-10-24 18:08:13] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444867] time=[2015-10-24 18:08:15] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444868] time=[2015-10-24 18:08:16] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444869] time=[2015-10-24 18:08:18] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444870] time=[2015-10-24 18:08:20] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444871] time=[2015-10-24 18:08:21] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444872] time=[2015-10-24 18:08:23] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444873] time=[2015-10-24 18:08:24] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444874] time=[2015-10-24 18:08:26] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444875] time=[2015-10-24 18:08:27] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444876] time=[2015-10-24 18:08:30] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444877] time=[2015-10-24 18:08:31] Tracing(1) Messsage=mcasted
message added to pending queue

rec=[5444878] time=[2015-10-24 18:08:32] Log Message=The token was lost
in the OPERATIONAL state.
rec=[5444879] time=[2015-10-24 18:08:32] Log Message=A processor failed,
forming new configuration.
rec=[5444880] time=[2015-10-24 18:08:32] Log Message=Receive multicast
socket recv buffer size (262142 bytes).
rec=[5444881] time=[2015-10-24 18:08:32] Log Message=Transmit multicast
socket send buffer size (262142 bytes).
rec=[5444882] time=[2015-10-24 18:08:32] Log Message=Local receive
multicast loop socket recv buffer size (262142 bytes).
rec=[5444883] time=[2015-10-24 18:08:32] Log Message=Local transmit
multicast loop socket send buffer size (262142 bytes).
rec=[5444884] time=[2015-10-24 18:08:32] Log Message=entering GATHER
state from 2.
rec=[5444885] time=[2015-10-24 18:08:33] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444886] time=[2015-10-24 18:08:34] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444887] time=[2015-10-24 18:08:36] Tracing(1) Messsage=mcasted
message added to pending queue
rec=[5444888] time=[2015-10-24 18:08:37] Log Message=got commit token
rec=[5444889] time=[2015-10-24 18:08:37] Log Message=Saving state aru
12a9b4 high seq received 12a9b4
rec=[5444890] time=[2015-10-24 18:08:37] Log Message=Storing new
sequence id for ring 6a4
rec=[5444891] time=[2015-10-24 18:08:37] Log Message=entering COMMIT state.
rec=[5444892] time=[2015-10-24 18:08:37] Log Message=got commit token
rec=[5444893] time=[2015-10-24 18:08:37] Log Message=entering RECOVERY
state.
rec=[5444894] time=[2015-10-24 18:08:37] Log Message=TRANS [0] member
10.10.102.65: (node1, rebooted node)
rec=[5444895] time=[2015-10-24 18:08:37] Log Message=TRANS [1] member
10.10.102.66: (node2)
rec=[5444896] time=[2015-10-24 18:08:37] Log Message=position [0] member
10.10.102.65:
rec=[5444897] time=[2015-10-24 18:08:37] Log Message=previous ring seq
6a0 rep 10.10.102.65

rec=[5444898] time=[2015-10-24 18:08:37] Log Message=aru 0 high
delivered 0 received flag 0

rec=[5444899] time=[2015-10-24 18:08:37] Log Message=position [1] member
10.10.102.66:
rec=[5444900] time=[2015-10-24 18:08:37] Log Message=previous ring seq
6a0 rep 10.10.102.65
rec=[5444901] time=[2015-10-24 18:08:37] Log Message=aru 12a9b4 high
delivered 12a9b4 received flag 1

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: OpenPGP digital signature
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20160210/d73896f0/attachment-0002.sig>