[ClusterLabs] Fast-failover on 2 nodes + qnetd: qdevice connection disrupted.

Ken Gaillot kgaillot at redhat.com
Thu May 2 10:54:54 EDT 2024


I don't see fencing times in here -- fencing is absolutely essential.

With the setup you describe, I would drop qdevice. With fencing, quorum
is not strictly required in a two-node cluster (two_node should be set
in corosync.conf). You can set priority-fencing-delay to reduce the
chance of simultaneous fencing. For VMs, you can use fence_xvm, which
is extremely quick.
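
For example, roughly, with pcs (a sketch only: the resource name and
fence device names are illustrative, and fence_xvm requires fence_virtd
running on the hypervisors, which may need adapting for Proxmox):

    # corosync.conf quorum section, with qdevice dropped:
    #   quorum {
    #       provider: corosync_votequorum
    #       two_node: 1
    #   }

    # Give the app a priority so priority-fencing-delay knows which node
    # to favor, then add a small delay to avoid simultaneous fencing:
    pcs resource update my_app meta priority=10
    pcs property set priority-fencing-delay=2s

    # One fence_xvm device per node VM, served by fence_virtd on the hosts:
    pcs stonith create fence-node1 fence_xvm port=node1 pcmk_host_list=node1
    pcs stonith create fence-node2 fence_xvm port=node2 pcmk_host_list=node2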

On Thu, 2024-05-02 at 02:56 +0300, alexey at pavlyuts.ru wrote:
> Hi All,
>  
> I am trying to build an application-specific 2-node failover cluster
> using Ubuntu 22, Pacemaker 2.1.2 + Corosync 3.1.6 and DRBD 9.2.9,
> with knet transport.
>  
> For some reason I can’t use a 3-node setup, so I have to use qnetd +
> qdevice 3.0.1.
>  
> The main goal is to protect a custom app which is not cluster-aware
> by itself. It is quite stateful, can’t store its state outside memory
> and takes some time to converge with the other parts of the system,
> so the best scenario is “failover is a restart with the same config”,
> but every unnecessary restart is painful. So, once failover is done,
> the app must stay on the backup node until it fails or the admin
> pushes it back; this works well with the stickiness parameter.
>  
> So, the goal is to detect the serving node’s failure ASAP and restart
> the app ASAP on the other node, using DRBD-synced config/data. ASAP
> means within 5-7 seconds, not 30 or more.
>  
> I tried different combinations of timings and finally got an
> acceptable result, within 5 seconds in the best case. But the
> behavior is very unstable.
>  
> My setup is simple: two nodes as VMs, and one more VM as an arbiter
> (qnetd). The VMs run under Proxmox and are connected via an external
> Ethernet switch, to stay close to the real deployment, where the node
> VMs will sit on different physical hosts in the same rack.
>  
> Then, it was tuned for faster detection and failover, roughly as
> sketched below.
> In corosync, I left the token at its default of 1000ms but added
> “heartbeat_failures_allowed: 3”, which made corosync catch a node
> failure in about 200ms (4 x 50ms heartbeats).
> Both qnetd and qdevice are run with net_heartbeat_interval_min=200 to
> allow playing with faster heartbeats and detection.
> Also, the quorum device is configured with timeout: 500,
> sync_timeout: 3000 and algorithm LMS.
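> 
> For reference, the relevant corosync.conf pieces then look roughly
> like this (a sketch following the corosync.conf and corosync-qdevice
> man pages; the net_heartbeat_interval_min=200 override mentioned
> above is applied on top of it):
> 
>   totem {
>       token: 1000
>       heartbeat_failures_allowed: 3
>   }
>   quorum {
>       provider: corosync_votequorum
>       device {
>           model: net
>           timeout: 500
>           sync_timeout: 3000
>           net {
>               host: arbiter
>               port: 5403
>               algorithm: lms
>           }
>       }
>   }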
>  
> The test is to issue “date +%M:%S.%N && qm stop 201” and then check
> the logs for the timestamp when the app started on the “backup”
> host. And, when the stopped host boots up again, the test is to check
> the logs to confirm that the app was not restarted.
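> 
> Per test run it is roughly (201 is the primary node’s VM ID under
> Proxmox; the journalctl filter is only an illustration of how I check
> the logs):
> 
>   # on the Proxmox host: note the time, then hard-stop the primary VM
>   date +%M:%S.%N && qm stop 201
> 
>   # on the backup node: see when Pacemaker started the app
>   journalctl -u pacemaker | grep -i start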
>  
> Sometimes the switchover works like a charm, but sometimes it is
> delayed for dozens of seconds.
> Sometimes, when the primary host boots up again, the secondary holds
> quorum well and keeps the app running; sometimes quorum is lost first
> (and Pacemaker stops the app) and then regained, and Pacemaker brings
> the app up again, so an unwanted restart happens.
>  
> My investigation shows the difference between the “good” and “bad”
> cases:
> 
> Good case: all the logs are clear and reasonable.
>  
> Bad case: qnetd loses the connection to the second node just after
> the failed node’s disconnection is detected, and then it may take
> dozens of seconds to restore it. All this time qdevice keeps trying
> to connect to qnetd and failing:
>  
> Example: host 192.168.100.1 is the one sent to failure, and 100.2 is
> the one failed over to:
>  
> From qnetd:
> May 01 23:30:39 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.1:60686 doesn't sent any message during 600ms.
> Disconnecting
> May 01 23:30:39 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.1:60686 (init_received 1, cluster bsc-test-
> cluster, node_id 1) disconnect
> May 01 23:30:39 arbiter corosync-qnetd[6338]: algo-lms: Client
> 0x55a6fc6785b0 (cluster bsc-test-cluster, node_id 1) disconnect
> May 01 23:30:39 arbiter corosync-qnetd[6338]: algo-lms:   server
> going down 0
> >>> This disconnect is unexpected; in the normal scenario the
> connection persists
> May 01 23:30:40 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.2:32790 doesn't sent any message during 600ms.
> Disconnecting
> May 01 23:30:40 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.2:32790 (init_received 1, cluster bsc-test-
> cluster, node_id 2) disconnect
> May 01 23:30:40 arbiter corosync-qnetd[6338]: algo-lms: Client
> 0x55a6fc6363d0 (cluster bsc-test-cluster, node_id 2) disconnect
> May 01 23:30:40 arbiter corosync-qnetd[6338]: algo-lms:   server
> going down 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]: New client connected
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   cluster name = bsc-
> test-cluster
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   tls started = 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   tls peer certificate
> verified = 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   node_id = 2
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   pointer =
> 0x55a6fc6363d0
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   addr_str =
> ::ffff:192.168.100.2:57736
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   ring id = (2.801)
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   cluster dump:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     client =
> ::ffff:192.168.100.2:57736, node_id = 2
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent
> initial node list.
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   msg seq num = 99
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   Node list:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     0 node_id = 1,
> data_center_id = 0, node_state = not set
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     1 node_id = 2,
> data_center_id = 0, node_state = not set
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: cluster bsc-
> test-cluster config_list has 2 nodes
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Algorithm result vote
> is No change
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent
> membership node list.
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   msg seq num = 100
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   ring id = (2.801)
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   heuristics =
> Undefined
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   Node list:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     0 node_id = 2,
> data_center_id = 0, node_state = not set
> May 01 23:30:56 arbiter corosync-qnetd[6338]:
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: membership
> list from node 2 partition (2.801)
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-util: partition
> (2.801) (0x55a6fc67e110) has 1 nodes
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: Only 1
> partition. This is votequorum's problem, not ours
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Algorithm result vote
> is ACK
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent
> quorum node list.
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   msg seq num = 101
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   quorate = 0
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   Node list:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     0 node_id = 1,
> data_center_id = 0, node_state = dead
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     1 node_id = 2,
> data_center_id = 0, node_state = member
> May 01 23:30:56 arbiter corosync-qnetd[6338]:
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: quorum node
> list from node 2 partition (2.801)
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-util: partition
> (2.801) (0x55a6fc697e70) has 1 nodes
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: Only 1
> partition. This is votequorum's problem, not ours
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Algorithm result vote
> is ACK
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.2:57736 (cluster bsc-test-cluster, node_id 2) sent
> quorum node list.
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   msg seq num = 102
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   quorate = 1
> May 01 23:30:56 arbiter corosync-qnetd[6338]:   Node list:
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     0 node_id = 1,
> data_center_id = 0, node_state = dead
> May 01 23:30:56 arbiter corosync-qnetd[6338]:     1 node_id = 2,
> data_center_id = 0, node_state = member
> May 01 23:30:56 arbiter corosync-qnetd[6338]:
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: quorum node
> list from node 2 partition (2.801)
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-util: partition
> (2.801) (0x55a6fc669dc0) has 1 nodes
> May 01 23:30:56 arbiter corosync-qnetd[6338]: algo-lms: Only 1
> partition. This is votequorum's problem, not ours
> May 01 23:30:56 arbiter corosync-qnetd[6338]: Algorithm result vote
> is ACK
> May 01 23:30:58 arbiter corosync-qnetd[6338]: Client closed
> connection
> May 01 23:30:58 arbiter corosync-qnetd[6338]: Client
> ::ffff:192.168.100.2:57674 (init_received 0, cluster bsc-test-
> cluster, node_id 0) disconnect
> >> At this point the resource starts on the backup host
>  
> From qdevice:
>  
> May 01 23:30:40 node2 corosync-qdevice[781]: Server closed connection
> May 01 23:30:40 node2 corosync-qdevice[781]: algo-lms: disconnected.
> quorate = 1, WFA = 0
> May 01 23:30:40 node2 corosync-qdevice[781]: algo-lms: disconnected.
> reason = 22, WFA = 0
> May 01 23:30:40 node2 corosync-qdevice[781]: Algorithm result vote is
> NACK
> May 01 23:30:40 node2 corosync-qdevice[781]: Cast vote timer remains
> scheduled every 250ms voting NACK.
> May 01 23:30:40 node2 corosync-qdevice[781]: Sleeping for 161 ms
> before reconnect
> May 01 23:30:40 node2 corosync-qdevice[781]: Trying connect to qnetd
> server arbiter:5403 (timeout = 400ms)
> May 01 23:30:41 node2 corosync-qdevice[781]: Connect timeout
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected.
> quorate = 1, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected.
> reason = 27, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm result vote is
> NACK
> May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer remains
> scheduled every 250ms voting NACK.
> May 01 23:30:41 node2 corosync-qdevice[781]: Trying connect to qnetd
> server arbiter:5403 (timeout = 400ms)
> May 01 23:30:41 node2 corosync-qdevice[781]: Votequorum nodelist
> notify callback:
> May 01 23:30:41 node2 corosync-qdevice[781]:   Ring_id = (2.801)
> May 01 23:30:41 node2 corosync-qdevice[781]:   Node list (size = 1):
> May 01 23:30:41 node2 corosync-qdevice[781]:     0 nodeid = 2
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm decided to
> pause cast vote timer and result vote is No change
> May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer is now
> paused.
> May 01 23:30:41 node2 corosync-qdevice[781]: worker:
> qdevice_heuristics_worker_cmd_process_exec: Received exec command
> with seq_no "24" and timeout "1500"
> May 01 23:30:41 node2 corosync-qdevice[781]: Received heuristics exec
> result command with seq_no "24" and result "Disabled"
> May 01 23:30:41 node2 corosync-qdevice[781]: Votequorum heuristics
> exec result callback:
> May 01 23:30:41 node2 corosync-qdevice[781]:   seq_number = 24,
> exec_result = Disabled
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm decided to not
> send list, result vote is No change and heuristics is Undefined
> May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer is no
> longer paused.
> May 01 23:30:41 node2 corosync-qdevice[781]: Not scheduling
> heuristics timer because mode is not enabled
> May 01 23:30:41 node2 corosync-qdevice[781]: Votequorum quorum notify
> callback:
> May 01 23:30:41 node2 corosync-qdevice[781]:   Quorate = 0
> May 01 23:30:41 node2 corosync-qdevice[781]:   Node list (size = 3):
> May 01 23:30:41 node2 corosync-qdevice[781]:     0 nodeid = 1, state
> = 2
> May 01 23:30:41 node2 corosync-qdevice[781]:     1 nodeid = 2, state
> = 1
> May 01 23:30:41 node2 corosync-qdevice[781]:     2 nodeid = 0, state
> = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: quorum_notify.
> quorate = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm decided to not
> send list and result vote is No change
> May 01 23:30:41 node2 corosync-qdevice[781]: Connect timeout
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected.
> quorate = 0, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected.
> reason = 27, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm result vote is
> NACK
> May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer remains
> scheduled every 250ms voting NACK.
> May 01 23:30:41 node2 corosync-qdevice[781]: Trying connect to qnetd
> server arbiter:5403 (timeout = 400ms)
> >>> At this point quorum is reported as lost
> May 01 23:30:41 node2 corosync-qdevice[781]: Connect timeout
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected.
> quorate = 0, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: algo-lms: disconnected.
> reason = 27, WFA = 0
> May 01 23:30:41 node2 corosync-qdevice[781]: Algorithm result vote is
> NACK
> May 01 23:30:41 node2 corosync-qdevice[781]: Cast vote timer remains
> scheduled every 250ms voting NACK.
>  
> >>> This failure pattern repeats 31 times
> May 01 23:30:41 node2 corosync-qdevice[781]: Trying connect to qnetd
> server arbiter:5403 (timeout = 400ms)
> May 01 23:30:42 node2 corosync-qdevice[781]: Connect timeout
> May 01 23:30:42 node2 corosync-qdevice[781]: algo-lms: disconnected.
> quorate = 0, WFA = 0
> May 01 23:30:42 node2 corosync-qdevice[781]: algo-lms: disconnected.
> reason = 27, WFA = 0
> May 01 23:30:42 node2 corosync-qdevice[781]: Algorithm result vote is
> NACK
> May 01 23:30:42 node2 corosync-qdevice[781]: Cast vote timer remains
> scheduled every 250ms voting NACK.
> >>> End of repeated pattern; continuing
>  
> May 01 23:30:56 node2 corosync-qdevice[781]: Trying connect to qnetd
> server arbiter:5403 (timeout = 400ms)
> May 01 23:30:56 node2 corosync-qdevice[781]: Sending preinit msg to
> qnetd
> May 01 23:30:56 node2 corosync-qdevice[781]: Received preinit reply
> msg
> May 01 23:30:56 node2 corosync-qdevice[781]: Received init reply msg
> May 01 23:30:56 node2 corosync-qdevice[781]: Scheduling send of
> heartbeat every 400ms
> May 01 23:30:56 node2 corosync-qdevice[781]: Executing after-connect
> heuristics.
> May 01 23:30:56 node2 corosync-qdevice[781]: worker:
> qdevice_heuristics_worker_cmd_process_exec: Received exec command
> with seq_no "25" and timeout "250"
> May 01 23:30:56 node2 corosync-qdevice[781]: Received heuristics exec
> result command with seq_no "25" and result "Disabled"
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm decided to
> send config node list, send membership node list, send quorum node
> list, heuristics is Undefined and result vote is Wait for reply
> May 01 23:30:56 node2 corosync-qdevice[781]: Sending set option seq =
> 98, HB(0) = 0ms, KAP Tie-breaker(1) = Enabled
> May 01 23:30:56 node2 corosync-qdevice[781]: Sending config node list
> seq = 99
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list:
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 node_id = 1,
> data_center_id = 0, node_state = not set
> May 01 23:30:56 node2 corosync-qdevice[781]:     1 node_id = 2,
> data_center_id = 0, node_state = not set
> May 01 23:30:56 node2 corosync-qdevice[781]: Sending membership node
> list seq = 100, ringid = (2.801), heuristics = Undefined.
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list:
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 node_id = 2,
> data_center_id = 0, node_state = not set
> May 01 23:30:56 node2 corosync-qdevice[781]: Sending quorum node list
> seq = 101, quorate = 0
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list:
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 node_id = 1,
> data_center_id = 0, node_state = dead
> May 01 23:30:56 node2 corosync-qdevice[781]:     1 node_id = 2,
> data_center_id = 0, node_state = member
> May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer is now
> stopped.
> May 01 23:30:56 node2 corosync-qdevice[781]: Received set option
> reply seq(1) = 98, HB(0) = 0ms, KAP Tie-breaker(1) = Enabled
> May 01 23:30:56 node2 corosync-qdevice[781]: Received initial config
> node list reply
> May 01 23:30:56 node2 corosync-qdevice[781]:   seq = 99
> May 01 23:30:56 node2 corosync-qdevice[781]:   vote = No change
> May 01 23:30:56 node2 corosync-qdevice[781]:   ring id = (2.801)
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is
> No change
> May 01 23:30:56 node2 corosync-qdevice[781]: Received membership node
> list reply
> May 01 23:30:56 node2 corosync-qdevice[781]:   seq = 100
> May 01 23:30:56 node2 corosync-qdevice[781]:   vote = ACK
> May 01 23:30:56 node2 corosync-qdevice[781]:   ring id = (2.801)
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is
> ACK
> May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer is now
> scheduled every 250ms voting ACK.
> May 01 23:30:56 node2 corosync-qdevice[781]: Received quorum node
> list reply
> May 01 23:30:56 node2 corosync-qdevice[781]:   seq = 101
> May 01 23:30:56 node2 corosync-qdevice[781]:   vote = ACK
> May 01 23:30:56 node2 corosync-qdevice[781]:   ring id = (2.801)
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is
> ACK
> May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer remains
> scheduled every 250ms voting ACK.
> May 01 23:30:56 node2 corosync-qdevice[781]: Votequorum quorum notify
> callback:
> May 01 23:30:56 node2 corosync-qdevice[781]:   Quorate = 1
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list (size = 3):
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 nodeid = 1, state
> = 2
> May 01 23:30:56 node2 corosync-qdevice[781]:     1 nodeid = 2, state
> = 1
> May 01 23:30:56 node2 corosync-qdevice[781]:     2 nodeid = 0, state
> = 0
> May 01 23:30:56 node2 corosync-qdevice[781]: algo-lms: quorum_notify.
> quorate = 1
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm decided to
> send list and result vote is No change
> May 01 23:30:56 node2 corosync-qdevice[781]: Sending quorum node list
> seq = 102, quorate = 1
> May 01 23:30:56 node2 corosync-qdevice[781]:   Node list:
> May 01 23:30:56 node2 corosync-qdevice[781]:     0 node_id = 1,
> data_center_id = 0, node_state = dead
> May 01 23:30:56 node2 corosync-qdevice[781]:     1 node_id = 2,
> data_center_id = 0, node_state = member
> May 01 23:30:56 node2 corosync-qdevice[781]: Received quorum node
> list reply
> May 01 23:30:56 node2 corosync-qdevice[781]:   seq = 102
> May 01 23:30:56 node2 corosync-qdevice[781]:   vote = ACK
> May 01 23:30:56 node2 corosync-qdevice[781]:   ring id = (2.801)
> May 01 23:30:56 node2 corosync-qdevice[781]: Algorithm result vote is
> ACK
> May 01 23:30:56 node2 corosync-qdevice[781]: Cast vote timer remains
> scheduled every 250ms voting ACK.
> >>> Here everything becomes OK and the resource is started on node2
>  
> Also, I’ve done a Wireshark capture and found a great mess in TCP; it
> looks like the connection between qdevice and qnetd really stalls for
> some time and packets are not delivered.
>  
> My guess is that this matches corosync’s syncing activity, and I
> suspect that corosync blocks any other traffic on the interface it
> uses for its rings.
>  
> When I switch qnetd and qdevice to a different interface, it seems to
> work fine.
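> 
> In corosync.conf terms that just means pointing quorum.device.net at
> the arbiter’s address on the second subnet instead of the one
> corosync uses for its ring, roughly (the address is illustrative):
> 
>   quorum {
>       device {
>           net {
>               host: 192.168.101.3
>           }
>       }
>   }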
>  
> So, the question is: does corosync really temporarily block any other
> traffic on the interface it uses, or is it just a coincidence? If it
> does block it, is there a way to manage this?
>  
> Thank you for any suggestions on that!
>  
> Sincerely,
>  
> Alex
>  
>  
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
-- 
Ken Gaillot <kgaillot at redhat.com>
