[ClusterLabs] Antw: Re: Antw: [EXT] The 2 servers of the cluster randomly reboot almost together
Sebastien BASTARD
sebastien at domalys.com
Tue Feb 22 08:02:58 EST 2022
Hello Ulrich, Hello team,
The two servers of the cluster restarted together twice today (08h17m07 and
08h50m04 on 22/02/2022 for server A, 08h16m32 and 08h49m43 on 22/02/2022 for
server B).
Here are the results of the up/down test:
*ServerA :*
*Log of Qdevice from ServerA :*
- None
*Log of ServerB from ServerA :*
- 21/02/2022 18:48:45 Down between 0 and 4 seconds
- 21/02/2022 18:58:33 Down between 0 and 4 seconds
- 21/02/2022 19:19:43 Down between 0 and 3 seconds
- *No trace of lost communication with server B around 08h17 and 08h50,
because the up/down test scripts were not restarted after the reboot.*
*ServerB :*
*Log of Qdevice from ServerB :*
- 21/02/2022 08:30:26 Down between 0 and 3 seconds
- 21/02/2022 23:02:14 Down between 0 and 3 seconds
*Log of ServerA from ServerB :*
- 21/02/2022 18:47:38 Down between 0 and 4 seconds
- 21/02/2022 19:25:06 Down between 0 and 4 seconds
- 21/02/2022 19:42:39 Down between 0 and 4 seconds
- *No trace of lost communication with server A around 08h16 and 08h49,
because the up/down test scripts were not restarted after the reboot.*
*QDevice :*
*Log of ServerA from Qdevice :*
- 22/02/2022 07:15:57 Down between 83 and 86 seconds => ( this matches the
restart of the server if we add 1 hour to the time )
- 22/02/2022 07:48:52 Down between 82 and 85 seconds => ( this matches the
restart of the server if we add 1 hour to the time )
*Log of ServerB from Qdevice :*
- 21/02/2022 23:02:22 Down between 0 and 4 seconds
- 22/02/2022 07:15:46 Down between 55 and 58 seconds => ( this matches the
restart of the server if we add 1 hour to the time )
- 22/02/2022 07:48:58 Down between 56 and 59 seconds => ( this matches the
restart of the server if we add 1 hour to the time )
Strangely, the clocks of the 3 machines show the same time but, each time, the
timestamps logged on the QDevice are 1 hour behind those of ServerA and ServerB.
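One possible explanation, not confirmed here: the log_time() line quoted further down formats timestamps with `date -u`, so the up/down logs are in UTC, whereas the restart times may have been read in local time (CET is UTC+1 in February). The sketch below only illustrates that effect. It assumes GNU date, the function names are made up for the example, and the epoch value is taken from the quoted log_up_down_qdevice_from_ServerB.txt entry.

```shell
#!/bin/bash
# Render one epoch timestamp in UTC and in CET (UTC+1).
# Assumes GNU date; fmt_utc and fmt_cet are illustrative names only.

fmt_utc() {
    # Same format string as the up/down log lines (date -u +%F_%T).
    date -d "@$1" -u +%F_%T
}

fmt_cet() {
    # POSIX TZ string "CET-1": local time is one hour ahead of UTC, no DST rule.
    TZ=CET-1 date -d "@$1" +%F_%T
}

t=1645432226   # epoch value from the quoted QDevice log on ServerB
echo "UTC: $(fmt_utc "$t")"
echo "CET: $(fmt_cet "$t")"
```

This prints `UTC: 2022-02-21_08:30:26` and `CET: 2022-02-21_09:30:26`, so the same instant appears one hour apart depending on which rendering is used.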
I don't understand why there is no trace of a lost connection between the
servers before they restarted.
If ServerA and ServerB lose connection with the QDevice, can someone
confirm whether they would restart (be fenced) or not?
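For reference while waiting for confirmation, here is a sketch of the usual majority rule. It assumes one vote per node plus one vote for the QDevice (an assumption, since the corosync configuration is not shown in this thread), and the function names are illustrative. It says nothing about what actually triggered the reboots.

```shell
#!/bin/bash
# Majority rule sketch: quorum needs more than half of the expected votes.
# Vote counts are assumptions (1 per node + 1 for the QDevice).

quorum_threshold() {
    # Integer division, so 3 expected votes give a threshold of 2.
    echo $(( $1 / 2 + 1 ))
}

is_quorate() {
    # $1 = votes currently present, $2 = expected votes
    [ "$1" -ge "$(quorum_threshold "$2")" ]
}

expected=3
if is_quorate 2 "$expected"; then
    echo "nodes A+B without QDevice: quorate"
fi
if ! is_quorate 1 "$expected"; then
    echo "single node without QDevice: not quorate"
fi
```

Under these assumptions, two nodes that still see each other keep a majority even without the QDevice vote, while a node that is isolated from both its peer and the QDevice does not.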
Thanks for your help.
On Mon, 21 Feb 2022 at 10:08, Sebastien BASTARD <sebastien at domalys.com>
wrote:
> Hello Ulrich,
>
> I modified your script to add the capability to test the TCP connectivity.
> Currently, between servers A or B and the QDevice, there is a firewall
> which doesn't answer to ping request. So, I tested the 5403 port.
>
> Here are the results from the weekend:
>
> Logs of Server A :
>
> ==> log_up_down_ServerB_from_ServerA.txt <==
> ---START 1645111039 (2022-02-17_15:17:19)
> 0 (11) -> 1 1645111050 (2022-02-17_15:17:30)
> ---EXIT 1645177062 (2022-02-18_09:37:42)
> ---START 1645199714 (2022-02-18_15:55:14)
> 0 (4) -> 1 1645199718 (2022-02-18_15:55:18)
>
>
> ==> log_up_down_qdevice_from_ServerA.txt <==
> ---START 1645117334 (2022-02-17_17:02:14)
> 0 (10) -> 1 1645117344 (2022-02-17_17:02:24)
> *1 (27820) -> 0 1645145164 (2022-02-18_00:46:04)*
> 0 (10) -> 1 1645145174 (2022-02-18_00:46:14)
> ---EXIT 1645177062 (2022-02-18_09:37:42)
> ---START 1645199684 (2022-02-18_15:54:44)
> 0 (3) -> 1 1645199687 (2022-02-18_15:54:47)
> *1 (19519) -> 0 1645219206 (2022-02-18_21:20:06)*
> 0 (3) -> 1 1645219209 (2022-02-18_21:20:09)
>
> The scripts on Server A stopped working because I forgot to launch them in
> the background. But we can see that server A lost connection with the
> QDevice twice.
>
> Logs of Server B :
>
> ==> log_up_down_ServerA_from_ServerB.txt <==
> ---START 1645110964 (2022-02-17_15:16:04)
> 0 (11) -> 1 1645110975 (2022-02-17_15:16:15)
> ---EXIT 1645199533 (2022-02-18_15:52:13)
> ---START 1645199576 (2022-02-18_15:52:56)
> 0 (4) -> 1 1645199580 (2022-02-18_15:53:00)
>
>
> ==> log_up_down_qdevice_from_ServerB.txt <==
> ---START 1645117428 (2022-02-17_17:03:48)
> 0 (10) -> 1 1645117438 (2022-02-17_17:03:58)
> ---EXIT 1645199529 (2022-02-18_15:52:09)
> ---START 1645199546 (2022-02-18_15:52:26)
> 0 (3) -> 1 1645199549 (2022-02-18_15:52:29)
> *1 (232677) -> 0 1645432226 (2022-02-21_08:30:26)*
> 0 (3) -> 1 1645432229 (2022-02-21_08:30:29)
>
>
> The scripts on Server B stopped working because I forgot to launch them in
> the background. But we can see that server B lost connection with the
> QDevice once.
>
> Logs of qDevice :
>
> ==> log_up_down_ServerA_from_qdevice.txt <==
> ---START 1645363302 (2022-02-20_13:21:42)
> 0 (4) -> 1 1645363306 (2022-02-20_13:21:46)
>
>
> ==> log_up_down_ServerB_from_qdevice.txt <==
> ---START 1645363310 (2022-02-20_13:21:50)
> 0 (4) -> 1 1645363314 (2022-02-20_13:21:54)
>
>
> The scripts on the QDevice stopped working because the input was still linked
> to the scripts and, after a few minutes, the OS killed them. We can see the
> QDevice never lost the connection with the 2 servers.
>
> I will keep checking the output of the scripts to see when the servers
> lose connection and when they are fenced.
>
> Best regards.
>
>
> On Fri, 18 Feb 2022 at 08:07, Ulrich Windl <
> Ulrich.Windl at rz.uni-regensburg.de> wrote:
>
>> >>> Sebastien BASTARD <sebastien at domalys.com> wrote on 17.02.2022 at
>> 16:28 in message
>> <CAAjZqdz9a2OorPyoSjdRFWNgJT5snOH2KehkpXdEbAuZrWOvEw at mail.gmail.com>:
>> > Thank you Ulrich for your script !
>> >
>> > I launched it, with a 10-second delay:
>> >
>> > - on Server A, to ping Server B
>> > - on Server B, to ping server A
>> > - on QDevice, to ping server A and Server B
>> >
>> > I currently can't ping the QDevice from servers A and B, because it is
>> > behind a firewall which only allows port 5403.
>> >
>> > Tomorrow, I will see the results.
>>
>> Maybe another remark: The script was not designed for clusters, so it was
>> good enough to redirect the output of the script to a file.
>> However, bash may buffer some lines before they are written. If the script
>> is killed, that's not a problem, but if the node is fenced, you might lose
>> the last line(s).
>> So maybe you want to change the echo statement in log_time() to:
>> echo "$@ $t ($(date -d@"$t" -u +%F_%T))" >> your_log_file
>>
>> Maybe you want to use a variable or parameter for that.
>>
>> Regards,
>> Ulrich
--
Sébastien BASTARD
*R&D Engineer* | Domalys • Créateurs d’autonomie
| phone : +33 5 49 83 00 08
| site : www.domalys.com
| email : sebastien at domalys.com
| address : 58 Rue du Vercors 86240 Fontaine-Le-Comte