<div dir="ltr">Hi guys, thanks for the support.<div>The query time isn't the problem; that query is known to take a while. The network is a bonded 10 Gbps link, practically impossible to saturate with queries :=).</div><div>The servers are heavily over-provisioned: with the database under full working load, only 20% of the resources are in use.</div><div>Checking the logs again, what is not clear to me is what caused the loss of quorum and the subsequent fencing of the node.</div><div>There is no information in the logs (not even in the iDRAC/motherboard event logs).</div><div><br></div><div>The only clear log entries are:</div><div>[228684] ltaoperdbs03 corosyncnotice  [TOTEM ] A processor failed, forming new configuration.<br>[228684] ltaoperdbs03 corosyncnotice  [TOTEM ] A new membership (172.18.2.12:227) was formed. Members left: 1<br>[228684] ltaoperdbs03 corosyncnotice  [TOTEM ] Failed to receive the leave message. failed: 1<br>[228684] ltaoperdbs03 corosyncwarning [CPG   ] downlist left_list: 1 received<br>[228684] ltaoperdbs03 corosyncwarning [CPG   ] downlist left_list: 1 received<br>Jul 13 00:40:37 [228695] ltaoperdbs03        cib:     info: pcmk_cpg_membership:        Group cib event 3: ltaoperdbs02 (node 1 pid 6136) left via cluster exit<br>Jul 13 00:40:37 [228695] ltaoperdbs03        cib:     info: crm_update_peer_proc:       pcmk_cpg_membership: Node ltaoperdbs02[1] - corosync-cpg is now offline<br>Jul 13 00:40:37 [228700] ltaoperdbs03       crmd:     info: pcmk_cpg_membership:        Group crmd event 3: ltaoperdbs02 (node 1 pid 6141) left via cluster exit<br>Jul 13 00:40:37 [228695] ltaoperdbs03        cib:   notice: crm_update_peer_state_iter: Node ltaoperdbs02 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc<br></div><div>Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:  warning: pe_fence_node:      Cluster node ltaoperdbs02 will be fenced: peer is no longer part of the cluster<br>Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:  warning: determine_online_status:    Node ltaoperdbs02 is unclean<br></div><div><br></div><div>Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:   notice: LogNodeActions:      * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'<br>Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:   notice: LogAction:   * Promote    pgsqld:0     ( Slave -> Master ltaoperdbs03 )<br>Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:     info: LogActions: Leave   pgsqld:1        (Slave ltaoperdbs04)<br></div><div><br></div><div><br></div><div>So the cluster worked flawlessly, as expected: as soon as ltaoperdbs02 became "unreachable", it formed a new quorum, fenced the lost node and promoted the new master.</div><div><br></div><div>What I can't find out is WHY it happened.</div><div>There is no useful information in the system logs or in the iDRAC motherboard logs.</div><div><br></div><div>Is there a way to improve or configure logging for fenced / failed nodes?</div><div><br></div><div>Thanks</div><div><br></div><div>Damiano</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 22 Jul 2021 at 15:06, Jehan-Guillaume de Rorthais <<a href="mailto:jgdr@dalibo.com">jgdr@dalibo.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi,<br>

<br>

On Wed, 14 Jul 2021 07:58:14 +0200<br>

"Ulrich Windl" <<a href="mailto:Ulrich.Windl@rz.uni-regensburg.de" target="_blank">Ulrich.Windl@rz.uni-regensburg.de</a>> wrote:<br>

[...]<br>

> Could it be that a command saturated the network?<br>

> Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936<br>

> UTC [172262] LOG:  duration: 660.329 ms  execute <unnamed>:  SELECT<br>

> xmf.file_id, f.size, fp.full_path  FROM ism_x_medium_file xmf  JOIN#011<br>

> ism_files f  ON f.id_file = xmf.file_id  JOIN#011 ism_files_path fp  ON<br>

> f.id_file = fp.file_id  JOIN ism_online o  ON o.file_id = xmf.file_id  WHERE<br>

> xmf.medium_id = 363 AND  xmf.x_media_file_status_id = 1  AND<br>

> o.online_status_id = 3    GROUP BY xmf.file_id, f.size,  fp.full_path   LIMIT<br>

> 7265 ;<br>

<br>

I doubt such a query could saturate the network. The query time itself isn't<br>

proportional to the result set size.<br>

<br>

Moreover, there are only three fields per row and, according to their names, I<br>

doubt the row size is really big.<br>

<br>

Plus, imagine the result set is that big, chances are that the frontend will<br>

not be able to cope with it as fast as the network, unless the frontend is doing<br>

nothing really fancy with the dataset. So the frontend itself might saturate<br>

before the network, giving some break to the latter.<br>

<br>

However, if this query time is unusual, that might illustrate some pressure on<br>

the server by some other means (CPU? MEM? IO?). Detailed metrics would help.<br>

<br>

Regards,<br>


</blockquote></div>
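<div>To make the sequence in the Pacemaker log excerpt easier to follow, here is a minimal Python sketch that splits lines of that shape into fields and keeps only the notice-and-above entries. The field layout is inferred from the excerpt alone, and the names parse_line, key_events and SAMPLE are hypothetical helpers for illustration, not part of Pacemaker or Corosync. The corosync lines in the excerpt carry no timestamp, so this sketch simply skips them.</div>

```python
import re

# Layout assumed from the excerpt in this thread:
# "Mon DD HH:MM:SS [pid] host daemon: severity: function: message"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]{8})\s+\[(\d+)\]\s+(\S+)"
    r"\s+(\w+):\s+(\w+):\s+(\w+):\s+(.*)$"
)

FIELDS = ("time", "pid", "host", "daemon", "severity", "function", "message")


def parse_line(line):
    """Return a dict of fields, or None if the line does not match the layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


def key_events(lines, severities=("notice", "warning", "error", "crit")):
    """Keep only parsed entries whose severity is in the given set."""
    events = []
    for line in lines:
        record = parse_line(line)
        if record is not None and record["severity"] in severities:
            events.append(record)
    return events


# A few lines copied from the excerpt above (hypothetical test input).
SAMPLE = [
    "Jul 13 00:40:37 [228695] ltaoperdbs03        cib:     info: pcmk_cpg_membership:        Group cib event 3: ltaoperdbs02 (node 1 pid 6136) left via cluster exit",
    "Jul 13 00:40:37 [228695] ltaoperdbs03        cib:   notice: crm_update_peer_state_iter: Node ltaoperdbs02 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc",
    "Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:  warning: pe_fence_node:      Cluster node ltaoperdbs02 will be fenced: peer is no longer part of the cluster",
    "Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:   notice: LogNodeActions:      * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'",
    "[228684] ltaoperdbs03 corosyncnotice  [TOTEM ] A processor failed, forming new configuration.",
]

if __name__ == "__main__":
    for event in key_events(SAMPLE):
        print(event["time"], event["daemon"], event["severity"], event["message"])
```

<div>Run against the surviving node's log, this only reorders what is already recorded: it shows the membership loss, the fencing decision and the promotion, but it cannot reveal why ltaoperdbs02 stopped responding.</div>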