[ClusterLabs] Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres
damiano giuliani
damianogiuliani87 at gmail.com
Fri Jul 23 06:52:00 EDT 2021
Hi guys, thanks for the support.
The query time isn't the problem; that query is known to take a while. The network is a 10 Gb/s bond, pretty much impossible to saturate with queries :=). The servers are heavily overprovisioned: at full database load only about 20% of their resources are in use.
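Just to put rough numbers on it (assumed figures, not measured): even if every one of the LIMIT 7265 rows were around 1 KB, the whole result set is only ~7 MB, which a 10 Gb/s link moves in a few milliseconds:

    # back-of-envelope sketch: assumed row size, ignores protocol overhead
    rows = 7265               # LIMIT of the logged query
    bytes_per_row = 1024      # generous guess for file_id + size + full_path
    link_bps = 10 * 10**9     # 10 Gb/s bond
    payload_bits = rows * bytes_per_row * 8
    print("transfer time: %.2f ms" % (payload_bits / link_bps * 1000))
    # prints ~5.95 ms -- nowhere near enough traffic to starve corosync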
Checking the logs again, what is still not clear to me is the cause of the loss of quorum and the subsequent fencing of the node. There is no information in the logs (not even in the iDRAC / motherboard event logs).
The only clear log entries are:
[228684] ltaoperdbs03 corosync notice [TOTEM ] A processor failed, forming new configuration.
[228684] ltaoperdbs03 corosync notice [TOTEM ] A new membership (172.18.2.12:227) was formed. Members left: 1
[228684] ltaoperdbs03 corosync notice [TOTEM ] Failed to receive the leave message. failed: 1
[228684] ltaoperdbs03 corosync warning [CPG ] downlist left_list: 1 received
[228684] ltaoperdbs03 corosync warning [CPG ] downlist left_list: 1 received
Jul 13 00:40:37 [228695] ltaoperdbs03 cib: info: pcmk_cpg_membership: Group cib event 3: ltaoperdbs02 (node 1 pid 6136) left via cluster exit
Jul 13 00:40:37 [228695] ltaoperdbs03 cib: info: crm_update_peer_proc: pcmk_cpg_membership: Node ltaoperdbs02[1] - corosync-cpg is now offline
Jul 13 00:40:37 [228700] ltaoperdbs03 crmd: info: pcmk_cpg_membership: Group crmd event 3: ltaoperdbs02 (node 1 pid 6141) left via cluster exit
Jul 13 00:40:37 [228695] ltaoperdbs03 cib: notice: crm_update_peer_state_iter: Node ltaoperdbs02 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: warning: pe_fence_node: Cluster node ltaoperdbs02 will be fenced: peer is no longer part of the cluster
Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: warning: determine_online_status: Node ltaoperdbs02 is unclean
Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: notice: LogNodeActions: * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'
Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: notice: LogAction: * Promote pgsqld:0 ( Slave -> Master ltaoperdbs03 )
Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: info: LogActions: Leave pgsqld:1 ( Slave ltaoperdbs04 )
So the cluster worked flawlessly, as expected: as soon as ltaoperdbs02 became
"unreachable", the remaining nodes formed a new quorum, fenced the lost node and
promoted a new master.
What I can't figure out is WHY it happened. There is no useful information in
the system logs, nor in the iDRAC motherboard logs.
Is there a way to improve or configure logging for the fenced / failed node?
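For example, would raising the corosync log level help catch the next occurrence? Something along these lines in corosync.conf, perhaps (just going by the corosync.conf(5) man page, untested here):

    logging {
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        to_syslog: yes
        timestamp: on
        debug: on    # or a logger_subsys block just for TOTEM
    }

Or is something like the corosync-blackbox flight recorder a better way to see why the membership was lost?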
Thanks
Damiano
On Thu, 22 Jul 2021 at 15:06, Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote:
> Hi,
>
> On Wed, 14 Jul 2021 07:58:14 +0200
> "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> [...]
> > Could it be that a command saturated the network?
> > Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936
> > UTC [172262] LOG: duration: 660.329 ms execute <unnamed>: SELECT
> > xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN#011
> > ism_files f ON f.id_file = xmf.file_id JOIN#011 ism_files_path fp ON
> > f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE
> > xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND
> > o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT
> > 7265 ;
>
> I doubt such a query could saturate the network. The query time itself
> isn't proportional to the result set size.
>
> Moreover, there are only three fields per row and, according to their names,
> I doubt the row size is really big.
>
> Plus, even if the result set were that big, chances are the frontend would
> not be able to consume it as fast as the network, unless the frontend does
> nothing really fancy with the dataset. So the frontend itself might saturate
> before the network, giving some breathing room to the latter.
>
> However, if this query time is unusual, that might indicate some pressure on
> the server by some other means (CPU? MEM? IO?). Detailed metrics would help.
>
> Regards,