[ClusterLabs] Antw: [EXT] Re: QDevice not found after reboot but appears after cluster restart

john tillman johnt at panix.com
Mon Aug 1 15:07:11 EDT 2022


> Hi,
>
> On 01/08/2022 16:18, john tillman wrote:
>>>>>> "john tillman" <johnt at panix.com> schrieb am 29.07.2022 um 22:51 in
>>> Nachricht
>>> <beb30bf64d4c615aff6034000038118c.squirrel at mail.panix.com>:
>>>>>> On Thursday 28 July 2022 at 22:17:01, john tillman wrote:
>>>>>>
>>>>>>> I have a two-node cluster setup with a qdevice.  'pcs quorum status'
>>>>>>> from a cluster node shows the qdevice casting a vote.  On the qdevice
>>>>>>> node 'corosync-qnetd-tool -s' says I have 2 connected clients and 1
>>>>>>> cluster.
>>>>>>> The vote count looks correct when I shut down either one of the
>>>>>>> cluster nodes or the qdevice.  So the voting seems to be working at
>>>>>>> this point.
>>>>>>
>>>>>> Indeed - shutting down 1 of 3 nodes leaves quorum intact, therefore
>>>>>> everything still awake knows what's going on.
>>>>>>
>>>>>>> From this state, if I reboot both my cluster nodes at the same time
>>>>>>
>>>>>> Ugh!
>>>>>>
>>>>>>> but leave the qdevice node running, the cluster will not see the
>>>>>>> qdevice when the nodes come back up: 'pcs quorum status' shows 3
>>>>>>> votes expected but only 2 votes cast (from the cluster nodes).
>>>>>>
>>>>>> I would think this is to be expected, since if you reboot 2 out of 3
>>>>>> nodes, you completely lose quorum, so the single node left has no
>>>>>> idea what to trust when the other nodes return.
>>>>>
>>>>> No, no.  I do have quorum after the reboots.  It is courtesy of the 2
>>>>> cluster nodes casting their quorum votes.  However, the qdevice is not
>>>>> casting a vote, so I am down to 2 out of 3 votes.
>>>>>
>>>>> And the qdevice is not part of the cluster.  It will never have any
>>>>> resources running on it.  Its job is just to vote.
>>>>>
>>>>> -John
>>>>>
>>>>
>>>> I thought maybe the problem was that the network wasn't ready when
>>>> corosync.service started, so I forced an "ExecStartPre=/usr/bin/sleep 10"
>>>> into it, but that didn't change anything.
>>>
>>> This type of fix is broken anyway: what you need is not a fixed delay,
>>> but to wait for an event (network up).
>>> Basically the OS distribution should have configured it correctly
>>> already.
>>>
>>> In SLES15 there is:
>>> Requires=network-online.target
>>> After=network-online.target
>>>
>>
>> Thank you for the response.
>>
>> Yes, I saw that those values were correctly set in the service
>> configuration file for corosync.  The delay was just a test.  I just
>> wanted to make sure that it wasn't a race condition between bringing up
>> the bond and trying to connect to the quorum node.
>>
>> I was grep'ing the corosync log for VOTEQ entries and noticed that when
>> it works I see, consecutively:
>> ... [VOTEQ ] Sending quorum callback, quorate = 0
>> ... [VOTEQ ] Received qdevice op 1 req from node 1 [QDevice]
>> When it does not work I never see the 'Received qdevice...' line in the
>> log.  Is there something else I can look for to find this problem?  Some
>> other test you can think of?  Maybe some configuration of the votequorum
>> service?
>
> Maybe a good start is to get the cluster into the "non-working" qdevice
> state and then paste:
> - /var/log/messages entries for corosync/qdevice
> - output of `corosync-qdevice-tool -sv` (from the nodes) and
> `corosync-qnetd-tool -lv` (from the machine where qnetd is running)
>
> "Received qdevice op 1 req from node 1 [QDevice]" means the qdevice is
> registered (= corosync-qdevice was started) - if the line is really
> missing, it can mean corosync-qdevice is not running - the log, or
> running `corosync-qdevice -f -d`, should give some insight into why it
> is not running.
>
> Honza
>
>

My corosync-qdevice service was not enabled at boot.  Sigh.
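
For the record, it was re-running the checks Honza suggested that made
this obvious.  Roughly what that looks like - just a sketch; the systemd
status check is my own addition, and unit/log names are assumed from the
stock packages:

    # on each cluster node
    systemctl status corosync-qdevice   # shows whether the unit is enabled and running
    corosync-qdevice-tool -sv           # qdevice status as seen from the node
    corosync-qdevice -f -d              # run in the foreground with debug if it won't start

    # on the qnetd host
    corosync-qnetd-tool -lv             # list connected clusters and nodes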

Thank you Honza for pointing that out!  And thank you all for your
patience and attention.
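
In case it helps the next person searching the archives: the actual fix
is just making sure the service starts at boot.  A minimal sketch,
assuming the standard unit names shipped by the corosync-qdevice and
corosync-qnetd packages:

    # on each cluster node
    systemctl enable --now corosync-qdevice

    # on the quorum host, make sure qnetd is enabled too
    systemctl enable --now corosync-qnetd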

John

>>
>>
>>>>
>>>> I could still use some advice with debugging this oddity.  Or have I
>>>> used up my quota of questions this year? :-)
>>>>
>>>> -John
>>>>
>>>>>>
>>>>>> Starting from a situation such as this, your only hope is to rebuild
>>>>>> the cluster from scratch, IMHO.
>>>>>>
>>>>>>
>>>>>> Antony.
>>>>>>
>>>>>> --
>>>>>> Police have found a cartoonist dead in his house.  They say that
>>>>>> details are currently sketchy.
>>>>>>
>>>>>> Please reply to the list; please *don't* CC me.




More information about the Users mailing list