[ClusterLabs] Antw: [EXT] Re: QDevice not found after reboot but appears after cluster restart
Jan Friesse
jfriesse at redhat.com
Mon Aug 1 11:45:07 EDT 2022
Hi,
On 01/08/2022 16:18, john tillman wrote:
>>>>> "john tillman" <johnt at panix.com> schrieb am 29.07.2022 um 22:51 in
>> Nachricht
>> <beb30bf64d4c615aff6034000038118c.squirrel at mail.panix.com>:
>>>>> On Thursday 28 July 2022 at 22:17:01, john tillman wrote:
>>>>>
>>>>>> I have a two-node cluster setup with a qdevice. 'pcs quorum status'
>>>>>> from a cluster node shows the qdevice casting a vote. On the qdevice
>>>>>> node 'corosync-qnetd-tool -s' says I have 2 connected clients and 1
>>>>>> cluster. The vote count looks correct when I shut down either one of
>>>>>> the cluster nodes or the qdevice. So the voting seems to be working
>>>>>> at this point.
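For reference, a two-node-plus-qdevice layout like the one described is
normally driven by the quorum section of corosync.conf; a minimal sketch,
with the qnetd host name as a placeholder:

    quorum {
        provider: corosync_votequorum
        device {
            # the qdevice contributes one vote
            votes: 1
            model: net
            net {
                # placeholder: the machine running corosync-qnetd
                host: qnetd.example.com
                # fifty-fifty split algorithm, suited to two-node clusters
                algorithm: ffsplit
            }
        }
    }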
>>>>>
>>>>> Indeed - shutting down 1 of 3 nodes leaves quorum intact, therefore
>>>>> everything still awake knows what's going on.
>>>>>
>>>>>> From this state, if I reboot both my cluster nodes at the same time
>>>>>
>>>>> Ugh!
>>>>>
>>>>>> but leave the qdevice node running, the cluster will not see the
>>>>>> qdevice when the nodes come back up: 'pcs quorum status' shows 3
>>>>>> votes expected but only 2 votes cast (from the cluster nodes).
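The same vote breakdown is visible straight from corosync, which can help
rule out a pcs-layer problem (output fields vary by version):

    # expected votes, total votes, and qdevice/node membership
    corosync-quorumtool -s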
>>>>>
>>>>> I would think this is to be expected, since if you reboot 2 out of 3
>>>>> nodes, you completely lose quorum, so the single node left has no
>>>>> idea what to trust when the other nodes return.
>>>>
>>>> No, no. I do have quorum after the reboots. It is courtesy of the 2
>>>> cluster nodes casting their quorum votes. However, the qdevice is not
>>>> casting a vote, so I am down to 2 out of 3 expected votes.
>>>>
>>>> And the qdevice is not part of the cluster. It will never have any
>>>> resources running on it. Its job is just to vote.
>>>>
>>>> -John
>>>>
>>>
>>> I thought maybe the problem was that the network wasn't ready when
>>> corosync.service started, so I forced an "ExecStartPre=/usr/bin/sleep 10"
>>> into it, but that didn't change anything.
>>
>> This type of fix is broken anyway: what you want is not a delay but to
>> wait for an event (network up).
>> Basically the OS distribution should have configured it correctly
>> already.
>>
>> In SLES15 there is:
>> Requires=network-online.target
>> After=network-online.target
>>
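On distributions where the shipped unit lacks those lines, the usual fix
is a systemd drop-in rather than editing the unit file itself; a minimal
sketch (the drop-in path and file name here are an assumption):

    # /etc/systemd/system/corosync.service.d/wait-online.conf
    [Unit]
    Wants=network-online.target
    After=network-online.target

followed by `systemctl daemon-reload`. Note that network-online.target is
only meaningful if the matching wait-online service (for example,
NetworkManager-wait-online.service) is enabled.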
>
> Thank you for the response.
>
> Yes, I saw that those values were correctly set in the corosync service
> configuration file. The delay was just a test; I just wanted to make
> sure there wasn't a race condition between bringing up the bond and
> trying to connect to the quorum node.
>
> I was grep'ing the corosync log for VOTEQ entries and noticed that when
> it works I see, consecutively:
> ... [VOTEQ ] Sending quorum callback, quorate = 0
> ... [VOTEQ ] Received qdevice op 1 req from node 1 [QDevice]
> When it does not work, I never see the 'Received qdevice...' line in
> the log.
> Is there something else I can look for to find this problem? Some other
> test you can think of? Maybe some configuration of the votequorum
> service?
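For pulling those VOTEQ entries out of the log, something along these
lines works on most setups (the file path varies by distribution, and
journald is an alternative when corosync logs to syslog):

    # file-based corosync log, if logging to file is configured
    grep -E 'VOTEQ|qdevice' /var/log/cluster/corosync.log
    # or via the journal
    journalctl -u corosync | grep -i -E 'voteq|qdevice'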
Maybe a good start is to get the cluster into the "non-working" qdevice
state and then paste:
- /var/log/messages from corosync/qdevice
- output of `corosync-qdevice-tool -sv` (from the nodes) and
`corosync-qnetd-tool -lv` (from the machine where qnetd is running)

"Received qdevice op 1 req from node 1 [QDevice]" means the qdevice is
registered (= corosync-qdevice was started) - if the line is really
missing, it can mean corosync-qdevice is not running - the log, or
running `corosync-qdevice -f -d`, should give some insight into why it
is not running.
Honza
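Putting those checks together, a sketch of the sequence (assuming
systemd-managed services; a unit that was never enabled would match the
post-reboot symptom):

    # on each cluster node: is the qdevice daemon running and registered?
    corosync-qdevice-tool -sv
    systemctl status corosync-qdevice
    systemctl is-enabled corosync-qdevice

    # on the qnetd host: which clients are connected?
    corosync-qnetd-tool -lv

    # run the qdevice daemon in the foreground with debug output
    systemctl stop corosync-qdevice
    corosync-qdevice -f -d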
>
>
>>>
>>> I could still use some advice with debugging this oddity. Or have I
>>> used up my quota of questions this year :-)
>>>
>>> -John
>>>
>>>>>
>>>>> Starting from a situation such as this, your only hope is to rebuild
>>>>> the cluster from scratch, IMHO.
>>>>>
>>>>>
>>>>> Antony.
>>>>>
>>>>> --
>>>>> Police have found a cartoonist dead in his house. They say that
>>>>> details are currently sketchy.
>>>>>
>>>>> Please reply to the list; please *don't* CC me.