[ClusterLabs] Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres

damiano giuliani damianogiuliani87 at gmail.com
Fri Jul 23 06:52:00 EDT 2021


Hi guys, thanks for the support.
The query time isn't the problem; it's known that it takes a while. The network
is a 10 Gb/s bond, practically impossible to saturate with queries :=).
The servers are heavily over-provisioned: at full database load only about 20%
of the resources are in use.
Checking the logs again, what is still not clear to me is the cause of the loss
of quorum and the subsequent fencing of the node.
There is no information in the logs (not even in the iDRAC / motherboard event
logs).

The only clear log entries are:

[228684] ltaoperdbs03 corosyncnotice  [TOTEM ] A processor failed, forming new configuration.
[228684] ltaoperdbs03 corosyncnotice  [TOTEM ] A new membership (172.18.2.12:227) was formed. Members left: 1
[228684] ltaoperdbs03 corosyncnotice  [TOTEM ] Failed to receive the leave message. failed: 1
[228684] ltaoperdbs03 corosyncwarning [CPG   ] downlist left_list: 1 received
[228684] ltaoperdbs03 corosyncwarning [CPG   ] downlist left_list: 1 received
Jul 13 00:40:37 [228695] ltaoperdbs03        cib:     info: pcmk_cpg_membership:        Group cib event 3: ltaoperdbs02 (node 1 pid 6136) left via cluster exit
Jul 13 00:40:37 [228695] ltaoperdbs03        cib:     info: crm_update_peer_proc:       pcmk_cpg_membership: Node ltaoperdbs02[1] - corosync-cpg is now offline
Jul 13 00:40:37 [228700] ltaoperdbs03       crmd:     info: pcmk_cpg_membership:        Group crmd event 3: ltaoperdbs02 (node 1 pid 6141) left via cluster exit
Jul 13 00:40:37 [228695] ltaoperdbs03        cib:   notice: crm_update_peer_state_iter: Node ltaoperdbs02 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:  warning: pe_fence_node:      Cluster node ltaoperdbs02 will be fenced: peer is no longer part of the cluster
Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:  warning: determine_online_status:    Node ltaoperdbs02 is unclean

Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:   notice: LogNodeActions:      * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'
Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:   notice: LogAction:   * Promote    pgsqld:0     ( Slave -> Master ltaoperdbs03 )
Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:     info: LogActions:    Leave   pgsqld:1        (Slave ltaoperdbs04)
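
If I understand correctly (please correct me if I'm wrong), the "A processor failed, forming new configuration" line means corosync stopped receiving the totem token from ltaoperdbs02 within the token timeout, so the remaining nodes declared it lost. The effective timeout on these hosts should be visible with something like this (just a sketch, I still have to double-check the exact key name):

    corosync-cmapctl | grep totem.token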


So the cluster worked flawlessly, as expected: as soon as ltaoperdbs02 became
"unreachable", the surviving nodes formed a new membership, fenced the lost
node and promoted the new master.
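
(For the record, the fencing event is also visible from the cluster side; assuming a recent enough Pacemaker, something like "stonith_admin --history '*'" should list it, although I haven't double-checked the output on these nodes.)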

What I can't find out is WHY it happened.
There is no useful information in the system logs nor in the iDRAC /
motherboard logs.

Is there a way to improve or configure logging so that more information is
captured about a fenced / failed node?
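For example, would raising corosync's log level and writing it to a dedicated file be the right direction? Something along these lines in /etc/corosync/corosync.conf (just a sketch on my side, not tested on these hosts):

    logging {
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        to_syslog: yes
        timestamp: on
        debug: on
    }

and perhaps PCMK_debug=yes in /etc/sysconfig/pacemaker to get debug output from the Pacemaker daemons as well?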

Thanks

Damiano

On Thu, 22 Jul 2021 at 15:06, Jehan-Guillaume de Rorthais <
jgdr at dalibo.com> wrote:

> Hi,
>
> On Wed, 14 Jul 2021 07:58:14 +0200
> "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> [...]
> > Could it be that a command saturated the network?
> > Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936 UTC [172262] LOG:  duration: 660.329 ms  execute <unnamed>:  SELECT xmf.file_id, f.size, fp.full_path  FROM ism_x_medium_file xmf  JOIN#011 ism_files f  ON f.id_file = xmf.file_id  JOIN#011 ism_files_path fp  ON f.id_file = fp.file_id  JOIN ism_online o  ON o.file_id = xmf.file_id  WHERE xmf.medium_id = 363 AND  xmf.x_media_file_status_id = 1  AND o.online_status_id = 3    GROUP BY xmf.file_id, f.size,  fp.full_path  LIMIT 7265 ;
>
> I doubt such a query could saturate the network. The query time itself isn't
> proportional to the result set size.
>
> Moreover, there are only three fields per row and, according to their names, I
> doubt the row size is really big.
>
> Plus, even if the result set were that big, chances are the frontend would not
> be able to consume it as fast as the network can deliver it, unless the
> frontend does nothing really fancy with the dataset. So the frontend itself
> would likely saturate before the network, giving the latter some breathing
> room.
>
> However, if this query time is unusual, it might be a sign of some pressure on
> the server from some other source (CPU? MEM? IO?). Detailed metrics would
> help.
>
> Regards,

