[ClusterLabs] Antw: [EXT] Re: QDevice not found after reboot but appears after cluster restart
Jan Friesse
jfriesse at redhat.com
Mon Aug 1 11:45:07 EDT 2022
Hi,
On 01/08/2022 16:18, john tillman wrote:
>>>>> "john tillman" <johnt at panix.com> schrieb am 29.07.2022 um 22:51 in
>> Nachricht
>> <beb30bf64d4c615aff6034000038118c.squirrel at mail.panix.com>:
>>>>> On Thursday 28 July 2022 at 22:17:01, john tillman wrote:
>>>>>
>>>>>> I have a two-node cluster setup with a qdevice. 'pcs quorum status'
>>>>>> from a cluster node shows the qdevice casting a vote. On the qdevice
>>>>>> node 'corosync-qnetd-tool -s' says I have 2 connected clients and 1
>>>>>> cluster. The vote count looks correct when I shut down either one of
>>>>>> the cluster nodes or the qdevice. So the voting seems to be working
>>>>>> at this point.
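For reference, a two-node-plus-qdevice layout like the one described is
normally driven by the quorum section of corosync.conf; a minimal sketch,
with the qnetd host name as a placeholder:

    quorum {
        provider: corosync_votequorum
        device {
            # the qdevice contributes one vote
            votes: 1
            model: net
            net {
                # placeholder: the machine running corosync-qnetd
                host: qnetd.example.com
                # fifty-fifty split algorithm, suited to two-node clusters
                algorithm: ffsplit
            }
        }
    }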
>>>>>
>>>>> Indeed - shutting down 1 of 3 nodes leaves quorum intact, therefore
>>>>> everything still awake knows what's going on.
>>>>>
>>>>>> From this state, if I reboot both my cluster nodes at the same time
>>>>>
>>>>> Ugh!
>>>>>
>>>>>> but leave the qdevice node running, the cluster will not see the
>>>>>> qdevice when the nodes come back up: 'pcs quorum status' shows 3
>>>>>> votes expected but only 2 votes cast (from the cluster nodes).
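The same vote breakdown is visible straight from corosync, which can help
rule out a pcs-layer problem (output fields vary by version):

    # expected votes, total votes, and qdevice/node membership
    corosync-quorumtool -s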
>>>>>
>>>>> I would think this is to be expected, since if you reboot 2 out of 3
>>>>> nodes, you completely lose quorum, so the single node left has no
>>>>> idea what to trust when the other nodes return.
>>>>
>>>> No, no. I do have quorum after the reboots. It is courtesy of the 2
>>>> cluster nodes casting their quorum votes. However, the qdevice is not
>>>> casting a vote, so I am down to 2 out of 3 expected votes.
>>>>
>>>> And the qdevice is not part of the cluster. It will never have any
>>>> resources running on it. Its job is just to vote.
>>>>
>>>> -John
>>>>
>>>
>>> I thought maybe the problem was that the network wasn't ready when
>>> corosync.service started, so I forced an "ExecStartPre=/usr/bin/sleep 10"
>>> into it, but that didn't change anything.
>>
>> This type of fix is broken anyway: what you want is not a delay but to
>> wait for an event (network up).
>> Basically the OS distribution should have configured it correctly
>> already.
>>
>> In SLES15 there is:
>> Requires=network-online.target
>> After=network-online.target
>>
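On distributions where the shipped unit lacks those lines, the usual fix
is a systemd drop-in rather than editing the unit file itself; a minimal
sketch (the drop-in path and file name here are an assumption):

    # /etc/systemd/system/corosync.service.d/wait-online.conf
    [Unit]
    Wants=network-online.target
    After=network-online.target

followed by `systemctl daemon-reload`. Note that network-online.target is
only meaningful if the matching wait-online service (for example,
NetworkManager-wait-online.service) is enabled.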
>
> Thank you for the response.
>
> Yes, I saw that those values were correctly set in the corosync service
> configuration file. The delay was just a test; I just wanted to make
> sure there wasn't a race condition between bringing up the bond and
> trying to connect to the quorum node.
>
> I was grep'ing the corosync log for VOTEQ entries and noticed that when
> it works I see, consecutively:
> ... [VOTEQ ] Sending quorum callback, quorate = 0
> ... [VOTEQ ] Received qdevice op 1 req from node 1 [QDevice]
> When it does not work, I never see the 'Received qdevice...' line in
> the log.
> Is there something else I can look for to find this problem? Some other
> test you can think of? Maybe some configuration of the votequorum
> service?
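For pulling those VOTEQ entries out of the log, something along these
lines works on most setups (the file path varies by distribution, and
journald is an alternative when corosync logs to syslog):

    # file-based corosync log, if logging to file is configured
    grep -E 'VOTEQ|qdevice' /var/log/cluster/corosync.log
    # or via the journal
    journalctl -u corosync | grep -i -E 'voteq|qdevice'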
Maybe a good start is to get the cluster into the "non-working" qdevice
state and then paste:
- /var/log/messages from corosync/qdevice
- output of `corosync-qdevice-tool -sv` (from the nodes) and
`corosync-qnetd-tool -lv` (from the machine where qnetd is running)

"Received qdevice op 1 req from node 1 [QDevice]" means the qdevice is
registered (= corosync-qdevice was started) - if the line is really
missing, it can mean corosync-qdevice is not running - the log, or
running `corosync-qdevice -f -d`, should give some insight into why it
is not running.
Honza
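Putting those checks together, a sketch of the sequence (assuming
systemd-managed services; a unit that was never enabled would match the
post-reboot symptom):

    # on each cluster node: is the qdevice daemon running and registered?
    corosync-qdevice-tool -sv
    systemctl status corosync-qdevice
    systemctl is-enabled corosync-qdevice

    # on the qnetd host: which clients are connected?
    corosync-qnetd-tool -lv

    # run the qdevice daemon in the foreground with debug output
    systemctl stop corosync-qdevice
    corosync-qdevice -f -d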
>
>
>>>
>>> I could still use some advice with debugging this oddity. Or have I
>>> used up my quota of questions this year :-)
>>>
>>> -John
>>>
>>>>>
>>>>> Starting from a situation such as this, your only hope is to rebuild
>>>>> the cluster from scratch, IMHO.
>>>>>
>>>>>
>>>>> Antony.
>>>>>
>>>>> --
>>>>> Police have found a cartoonist dead in his house. They say that
>>>>> details are currently sketchy.
>>>>>
>>>>> Please reply to the list; please *don't* CC me.