[ClusterLabs] Strange lost quorum with qdevice

Олег Самойлов splarv at ya.ru
Tue Aug 13 08:39:06 EDT 2019



> 12 авг. 2019 г., в 8:46, Jan Friesse <jfriesse at redhat.com> написал(а):
> 
> Let me try to bring some light in there:
> 
> - dpd_interval is qnetd variable how often qnetd walks thru the list of all clients (qdevices) and checks timestamp of last sent message. If diff between current timestamp and last sent message timestamp is larger than 2 * timeout sent by client then client is considered as death.
> 
> - interval - affects how often qdevice sends heartbeat to corosync (this is half of the interval) about its liveness and also how often it sends heartbeat to qnetd (0.8 * interval). On corosync side this is used as a timeout after which qdevice daemon is considered death and its votes are no longer valid.
> 
> - sync_timeout - Not used by qdevice/qnetd. Used by corosync during sync phase. If corosync doesn't get reply by qdevice till this timeout it considers qdevice daemon death and continues sync process.
> 
> It was probably not evident from my reply, but what I meant was to change just dpd_interval. Could you please recheck with dpd_interval=1, timeout=20, sync_timeout=60?

Did you by mistype call 'timeout' parameter of qdevice as 'interval'?

I can't understand how is planned your configuration to work. dpd_interval=1, timeout=20, sync_timeout=60. Qnetd will check every 1 second and timeout will be detected on 2 second away from the last message from Qdevice. But Qdevice will send message only every 0.8*20=16 seconds. So, by your description, quorum must be lost every time.

The reality also is strange. Qnetd detect lost client not after 2 seconds, by your description, but after 33 seconds and this is only slightly less than 40s when dpd_interval=20. With 33 seconds timeout real reaction time is 54s which enough for timeout 60s, in this example. But looked like 6 (+-, may be slightly random) seconds gap is enough, there is no problems with lost quorum. 

Aug 13 02:21:38 witness corosync-qnetd: Aug 13 02:21:38 debug   Client ::ffff:192.168.89.12:36150 (cluster krogan1, node_id 2) sent membership node list.
Aug 13 02:21:38 witness corosync-qnetd: Aug 13 02:21:38 debug     msg seq num = 11
Aug 13 02:21:38 witness corosync-qnetd: Aug 13 02:21:38 debug     ring id = (2.30)
Aug 13 02:21:39 witness corosync-qnetd: Aug 13 02:21:38 debug     heuristics = Undefined
Aug 13 02:21:39 witness corosync-qnetd: Aug 13 02:21:38 debug     node list:
Aug 13 02:21:39 witness corosync-qnetd: Aug 13 02:21:38 debug       node_id = 2, data_center_id = 0, node_state = not set
Aug 13 02:21:39 witness corosync-qnetd: Aug 13 02:21:38 debug   ffsplit: Membership for cluster krogan1 is not yet stable
Aug 13 02:21:39 witness corosync-qnetd: Aug 13 02:21:38 debug   Algorithm result vote is Wait for reply
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 warning Client ::ffff:192.168.89.11:48924 doesn't sent any message during 33000ms. Disconnecting
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug   Client ::ffff:192.168.89.11:48924 (init_received 1, cluster krogan1, node_id 1) disconnect
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug   ffsplit: Membership for cluster krogan1 is now stable
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug   ffsplit: Quorate partition selected
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug     node list:
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug       node_id = 2, data_center_id = 0, node_state = not set
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug   ffsplit: No client gets NACK
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug   Sending vote info to client ::ffff:192.168.89.12:36150 (cluster krogan1, node_id 2)
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug     msg seq num = 6
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug     vote = ACK
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug   Client ::ffff:192.168.89.12:36150 (cluster krogan1, node_id 2) replied back to vote info message
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug     msg seq num = 6
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug   ffsplit: All ACK votes sent for cluster krogan1
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug   Client ::ffff:192.168.89.12:36150 (cluster krogan1, node_id 2) sent quorum node list.
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug     msg seq num = 12
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug     quorate = 1
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug     node list:
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug       node_id = 1, data_center_id = 0, node_state = dead
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug       node_id = 2, data_center_id = 0, node_state = member
Aug 13 02:22:32 witness corosync-qnetd: Aug 13 02:22:32 debug   Algorithm result vote is No change
Aug 13 02:22:38 witness ntpd[556]: 0.0.0.0 0612 02 freq_set kernel 11.647 PPM
Aug 13 02:22:38 witness ntpd[556]: 0.0.0.0 0615 05 clock_sync
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Client ::ffff:192.168.89.12:36150 (cluster krogan1, node_id 2) sent membership node list.
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     msg seq num = 13
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     ring id = (1.38)
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     heuristics = Undefined
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     node list:
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       node_id = 1, data_center_id = 0, node_state = not set
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       node_id = 2, data_center_id = 0, node_state = not set
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   ffsplit: Membership for cluster krogan1 is now stable
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   ffsplit: Quorate partition selected
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     node list:
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       node_id = 1, data_center_id = 0, node_state = not set
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       node_id = 2, data_center_id = 0, node_state = not set
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   ffsplit: No client gets NACK
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Sending vote info to client ::ffff:192.168.89.12:36150 (cluster krogan1, node_id 2)
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     msg seq num = 7
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     vote = ACK
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Algorithm result vote is No change
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Client ::ffff:192.168.89.12:36150 (cluster krogan1, node_id 2) replied back to vote info message
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     msg seq num = 7
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   ffsplit: All ACK votes sent for cluster krogan1
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Client ::ffff:192.168.89.12:36150 (cluster krogan1, node_id 2) sent quorum node list.
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     msg seq num = 14
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     quorate = 1
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     node list:
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       node_id = 1, data_center_id = 0, node_state = member
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       node_id = 2, data_center_id = 0, node_state = member
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Algorithm result vote is No change
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   New client connected
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     cluster name = krogan1
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     tls started = 1
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     tls peer certificate verified = 1
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     node_id = 1
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     pointer = 0x5602a3e6f5f0
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     addr_str = ::ffff:192.168.89.11:36144
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     ring id = (1.38)
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     cluster dump:
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       client = ::ffff:192.168.89.12:36150, node_id = 2
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       client = ::ffff:192.168.89.11:36144, node_id = 1
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Client ::ffff:192.168.89.11:36144 (cluster krogan1, node_id 1) sent initial node list.
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     msg seq num = 4
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     node list:
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       node_id = 1, data_center_id = 0, node_state = not set
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       node_id = 2, data_center_id = 0, node_state = not set
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Algorithm result vote is Ask later
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Client ::ffff:192.168.89.11:36144 (cluster krogan1, node_id 1) sent membership node list.
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     msg seq num = 5
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     ring id = (1.38)
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     heuristics = Undefined
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     node list:
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       node_id = 1, data_center_id = 0, node_state = not set
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       node_id = 2, data_center_id = 0, node_state = not set
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   ffsplit: Membership for cluster krogan1 is now stable
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   ffsplit: Quorate partition selected
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     node list:
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       node_id = 1, data_center_id = 0, node_state = not set
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       node_id = 2, data_center_id = 0, node_state = not set
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   ffsplit: No client gets NACK
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Sending vote info to client ::ffff:192.168.89.12:36150 (cluster krogan1, node_id 2)
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     msg seq num = 8
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     vote = ACK
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Sending vote info to client ::ffff:192.168.89.11:36144 (cluster krogan1, node_id 1)
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     msg seq num = 1
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     vote = ACK
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Algorithm result vote is No change
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Client ::ffff:192.168.89.11:36144 (cluster krogan1, node_id 1) sent quorum node list.
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     msg seq num = 6
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     quorate = 1
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     node list:
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       node_id = 1, data_center_id = 0, node_state = member
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug       node_id = 2, data_center_id = 0, node_state = member
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Algorithm result vote is No change
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Client ::ffff:192.168.89.12:36150 (cluster krogan1, node_id 2) replied back to vote info message
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     msg seq num = 8
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   Client ::ffff:192.168.89.11:36144 (cluster krogan1, node_id 1) replied back to vote info message
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug     msg seq num = 1
Aug 13 02:23:18 witness corosync-qnetd: Aug 13 02:23:18 debug   ffsplit: All ACK votes sent for cluster krogan1

I have checked with dpd_interval=1, timeout=10 (default), sync_timeout=30 (default).

The lost client timeout now 17 seconds, reaction time is 21 second. Also work without problem.

Well, I think, if don't take into accound strange dependency between dpd_interval and reaction time, IMHO, the increasing sync_timeout will be a straight approach to fix the problem. The sync_timeout must be:
10 (dpd_interval, how often qnetd checks)
+
20 (2*dpd_interval, this is timeout for a client)
+
10 (the lag time, I suppose it equal to dpd_interval)

I checked dpd_interval=10 (default), timeout=10 (default), sync_timeout=40. I got 20 seconds timeout and I saw reaction time between 17-34 seconds (time between "Algorithm result vote is Wait for reply" and "ffsplit: All ACK votes sent for cluster"). 


More information about the Users mailing list