[ClusterLabs] Corosync - qdevice not voting
Marcelo Terres
mhterres at gmail.com
Thu Mar 18 13:23:36 EDT 2021
Hello.
I have configured corosync with 2 nodes and added a qdevice to help with
the quorum.
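For reference, the qnetd side was set up with pcs, more or less like this
(a sketch of the usual CentOS 7 procedure rather than my exact shell
history):

# on the qdevice host
yum install pcs corosync-qnetd
pcs qdevice setup model net --enable --start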
On node1 I added firewall rules to block connections from node2 and the
qdevice, trying to simulate a network issue.
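The exact rules don't matter much; a sketch of what I used (placeholders
for the real addresses), dropping corosync's traffic from node2 on UDP
5405 and the qnetd traffic on TCP 5403:

# on node1
iptables -A INPUT -s X.X.X.3 -j DROP
iptables -A OUTPUT -d X.X.X.3 -j DROP
iptables -A INPUT -s <qdevice IP> -j DROP
iptables -A OUTPUT -d <qdevice IP> -j DROP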
The problem I'm having is that on node1 I can see it dropping the
service (the IP), but node2 never takes over the IP; it is as if the
qdevice is not voting.
This is my corosync.conf:
totem {
    version: 2
    cluster_name: cluster1
    token: 3000
    token_retransmits_before_loss_const: 10
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none
    interface {
        ringnumber: 0
        bindnetaddr: X.X.X.X
        mcastaddr: 239.255.43.2
        mcastport: 5405
        ttl: 1
    }
}
nodelist {
    node {
        ring0_addr: X.X.X.2
        name: node1.domain.com
        nodeid: 2
    }
    node {
        ring0_addr: X.X.X.3
        name: node2.domain.com
        nodeid: 3
    }
}
logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}
quorum {
    provider: corosync_votequorum
    device {
        votes: 1
        model: net
        net {
            tls: off
            host: qdevice.domain.com
            algorithm: lms
        }
        heuristics {
            mode: on
            exec_ping: /usr/bin/ping -q -c 1 "qdevice.domain.com"
        }
    }
}
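For reference, I added the device section above with pcs along these
lines (from memory, so treat it as a sketch; the heuristics syntax may
differ slightly between pcs versions):

pcs quorum device add model net host=qdevice.domain.com algorithm=lms \
    heuristics mode=on exec_ping="/usr/bin/ping -q -c 1 qdevice.domain.com"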
I'm getting this on the qdevice host (before adding the firewall rules),
so it looks like the cluster is properly configured:
pcs qdevice status net --full
QNetd address: *:5403
TLS: Supported (client certificate required)
Connected clients: 2
Connected clusters: 1
Maximum send/receive size: 32768/32768 bytes
Cluster "cluster1":
Algorithm: LMS
Tie-breaker: Node with lowest node ID
Node ID 3:
Client address: ::ffff:X.X.X.3:59746
HB interval: 8000ms
Configured node list: 2, 3
Ring ID: 2.95d
Membership node list: 2, 3
Heuristics: Pass (membership: Pass, regular: Undefined)
TLS active: No
Vote: ACK (ACK)
Node ID 2:
Client address: ::ffff:X.X.X.2:33944
HB interval: 8000ms
Configured node list: 2, 3
Ring ID: 2.95d
Membership node list: 2, 3
Heuristics: Pass (membership: Pass, regular: Undefined)
TLS active: No
Vote: ACK (ACK)
These are partial logs on node2 after activating the firewall rules on
node1. These logs repeat continuously until I remove the firewall rules:
Mar 18 12:48:56 [7202] node2.domain.com stonith-ng: info: crm_cs_flush:
Sent 0 CPG messages (1 remaining, last=16): Try again (6)
Mar 18 12:48:56 [7201] node2.domain.com cib: info: crm_cs_flush:
Sent 0 CPG messages (2 remaining, last=87): Try again (6)
Mar 18 12:48:56 [7202] node2.domain.com stonith-ng: info: crm_cs_flush:
Sent 0 CPG messages (1 remaining, last=16): Try again (6)
Mar 18 12:48:56 [7185] node2.domain.com pacemakerd: info: crm_cs_flush:
Sent 0 CPG messages (1 remaining, last=13): Try again (6)
[7177] node2.domain.com corosync info [VOTEQ ] waiting for quorum device Qdevice poll (but maximum for 30000 ms)
[7177] node2.domain.com corosync notice [TOTEM ] A new membership (X.X.X.3:2469) was formed. Members
[7177] node2.domain.com corosync warning [CPG ] downlist left_list: 0 received
[7177] node2.domain.com corosync warning [TOTEM ] Discarding JOIN message during flush, nodeid=3
Mar 18 12:48:56 [7201] node2.domain.com cib: info: crm_cs_flush:
Sent 0 CPG messages (2 remaining, last=87): Try again (6)
Mar 18 12:48:56 [7202] node2.domain.com stonith-ng: info: crm_cs_flush:
Sent 0 CPG messages (1 remaining, last=16): Try again (6)
Mar 18 12:48:56 [7185] node2.domain.com pacemakerd: info: crm_cs_flush:
Sent 0 CPG messages (1 remaining, last=13): Try again (6)
Mar 18 12:48:56 [7201] node2.domain.com cib: info: crm_cs_flush:
Sent 0 CPG messages (2 remaining, last=87): Try again (6)
Mar 18 12:48:56 [7185] node2.domain.com pacemakerd: info: crm_cs_flush:
Sent 0 CPG messages (1 remaining, last=13): Try again (6)
Also on node2:
pcs quorum status
Error: Unable to get quorum status: Unable to get node address for nodeid
2: CS_ERR_NOT_EXIST
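The local qdevice daemon can also be queried directly on a node, though
I didn't capture this output during the test:

corosync-qdevice-tool -sv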
And these are the logs on the qdevice host:
Mar 18 12:48:50 debug algo-lms: membership list from node 3 partition
(3.99d)
Mar 18 12:48:50 debug algo-util: all_ring_ids_match: seen nodeid 2
(client 0x55a99ce070d0) ring_id (2.995)
Mar 18 12:48:50 debug algo-util: nodeid 2 in our partition has different
ring_id (2.995) to us (3.99d)
Mar 18 12:48:50 debug algo-lms: nodeid 3: ring ID (3.99d) not unique in
this membership, waiting
Mar 18 12:48:50 debug Algorithm result vote is Wait for reply
Mar 18 12:48:52 debug algo-lms: Client 0x55a99cdfe590 (cluster cluster1,
node_id 3) Timer callback
Mar 18 12:48:52 debug algo-util: all_ring_ids_match: seen nodeid 2
(client 0x55a99ce070d0) ring_id (2.995)
Mar 18 12:48:52 debug algo-util: nodeid 2 in our partition has different
ring_id (2.995) to us (3.99d)
Mar 18 12:48:52 debug algo-lms: nodeid 3: ring ID (3.99d) not unique in
this membership, waiting
Mar 18 12:48:52 debug Algorithm for client ::ffff:X.X.X.3:59762 decided
to reschedule timer and not send vote with value Wait for reply
Mar 18 12:48:53 debug Client closed connection
Mar 18 12:48:53 debug Client ::ffff:X.X.X.2:33960 (init_received 1,
cluster cluster1, node_id 2) disconnect
Mar 18 12:48:53 debug algo-lms: Client 0x55a99ce070d0 (cluster cluster1,
node_id 2) disconnect
Mar 18 12:48:53 info algo-lms: server going down 0
Mar 18 12:48:54 debug algo-lms: Client 0x55a99cdfe590 (cluster cluster1,
node_id 3) Timer callback
Mar 18 12:48:54 debug algo-util: partition (3.99d) (0x55a99ce07780) has 1
nodes
Mar 18 12:48:54 debug algo-lms: Only 1 partition. This is votequorum's
problem, not ours
Mar 18 12:48:54 debug Algorithm for client ::ffff:X.X.X.3:59762 decided
to not reschedule timer and send vote with value ACK
Mar 18 12:48:54 debug Sending vote info to client ::ffff:X.X.X.3:59762
(cluster cluster1, node_id 3)
Mar 18 12:48:54 debug msg seq num = 1
Mar 18 12:48:54 debug vote = ACK
Mar 18 12:48:54 debug Client ::ffff:X.X.X.3:59762 (cluster cluster1,
node_id 3) replied back to vote info message
Mar 18 12:48:54 debug msg seq num = 1
Mar 18 12:48:54 debug algo-lms: Client 0x55a99cdfe590 (cluster cluster1,
node_id 3) replied back to vote info message
Mar 18 12:48:54 debug Client ::ffff:X.X.X.3:59762 (cluster cluster1,
node_id 3) sent membership node list.
Mar 18 12:48:54 debug msg seq num = 8
Mar 18 12:48:54 debug ring id = (3.9a1)
Mar 18 12:48:54 debug heuristics = Pass
Mar 18 12:48:54 debug node list:
Mar 18 12:48:54 debug node_id = 3, data_center_id = 0, node_state =
not set
Mar 18 12:48:54 debug
Mar 18 12:48:54 debug algo-lms: membership list from node 3 partition
(3.9a1)
Mar 18 12:48:54 debug algo-util: partition (3.99d) (0x55a99ce073f0) has 1
nodes
Mar 18 12:48:54 debug algo-lms: Only 1 partition. This is votequorum's
problem, not ours
Mar 18 12:48:54 debug Algorithm result vote is ACK
Mar 18 12:48:58 debug Client ::ffff:X.X.X.3:59762 (cluster cluster1,
node_id 3) sent membership node list.
Mar 18 12:48:58 debug msg seq num = 9
Mar 18 12:48:58 debug ring id = (3.9a5)
Mar 18 12:48:58 debug heuristics = Pass
Mar 18 12:48:58 debug node list:
Mar 18 12:48:58 debug node_id = 3, data_center_id = 0, node_state =
not set
I'm running this on CentOS 7 servers and tried to follow the official
RHEL 7 docs, but I found a few issues there, and a bug that they won't
correct since there is a workaround. In the end, it looks like everything
is working fine, except for this voting issue.
After a lot of time searching Google for answers, I decided to send a
message here in the hope that you can help me fix it (it is probably a
silly mistake).
Any help will be appreciated.
Thank you.
Marcelo H. Terres <mhterres at gmail.com>
https://www.mundoopensource.com.br
https://twitter.com/mhterres
https://linkedin.com/in/marceloterres