[ClusterLabs] Antw: Re: Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Fri Jul 23 07:32:15 EDT 2021
>>> damiano giuliani <damianogiuliani87 at gmail.com> wrote on 23.07.2021 at 12:52
in message
<CAG=zYNM9FRNaOsre92mjZL_BXYun37FfcuvqXe0qpURg0msimg at mail.gmail.com>:
> Hi guys, thanks for the support.
> The query time isn't the problem; it is known to take a while. The network
> is 10 Gb/s bonding, quite impossible to saturate with queries :=).
> The servers are heavily over-provisioned; at full database working load only
> about 20% of the resources are used.
> Checking the logs again, what is still not clear to me is what caused the
> loss of quorum and the subsequent fencing of the node.
> There is no information in the logs (not even in the iDRAC / motherboard
> event logs).
>
> The only clear log entries are:
> [228684] ltaoperdbs03 corosync notice  [TOTEM ] A processor failed, forming new configuration.
Hi!
I wonder: would the corosync blackbox (COROSYNC-BLACKBOX(8)) help? As an alternative, you could capture the TOTEM packets in a set of rotating files and try to find out what was going on.
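For example (an untested sketch; the bond0 interface name and the file paths are just placeholders, and 5405 is only corosync's default port, adjust if you changed mcastport):

  # Ring-buffer capture of corosync/TOTEM traffic: 10 files of 100 MB each
  tcpdump -i bond0 -C 100 -W 10 -w /var/log/totem-capture.pcap udp port 5405

  # After an incident, dump the corosync flight recorder ("blackbox")
  corosync-blackbox > /tmp/corosync-blackbox.txt

The flight recorder is an in-memory ring buffer of corosync's internal events, so it is best dumped soon after the membership change you are interested in.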
As it seems now, the issue is that a remote node cannot "be seen".
Regards,
Ulrich
> [228684] ltaoperdbs03 corosync notice  [TOTEM ] A new membership (172.18.2.12:227) was formed. Members left: 1
> [228684] ltaoperdbs03 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 1
> [228684] ltaoperdbs03 corosync warning [CPG   ] downlist left_list: 1 received
> [228684] ltaoperdbs03 corosync warning [CPG   ] downlist left_list: 1 received
> Jul 13 00:40:37 [228695] ltaoperdbs03 cib: info: pcmk_cpg_membership: Group cib event 3: ltaoperdbs02 (node 1 pid 6136) left via cluster exit
> Jul 13 00:40:37 [228695] ltaoperdbs03 cib: info: crm_update_peer_proc: pcmk_cpg_membership: Node ltaoperdbs02[1] - corosync-cpg is now offline
> Jul 13 00:40:37 [228700] ltaoperdbs03 crmd: info: pcmk_cpg_membership: Group crmd event 3: ltaoperdbs02 (node 1 pid 6141) left via cluster exit
> Jul 13 00:40:37 [228695] ltaoperdbs03 cib: notice: crm_update_peer_state_iter: Node ltaoperdbs02 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: warning: pe_fence_node: Cluster node ltaoperdbs02 will be fenced: peer is no longer part of the cluster
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: warning: determine_online_status: Node ltaoperdbs02 is unclean
>
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: notice: LogNodeActions: * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: notice: LogAction: * Promote pgsqld:0 ( Slave -> Master ltaoperdbs03 )
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: info: LogActions: Leave pgsqld:1 (Slave ltaoperdbs04)
>
>
> So the cluster worked flawlessly, as expected: as soon as ltaoperdbs02 became
> "unreachable", the remaining nodes formed a new quorum, fenced the lost node
> and promoted a new master.
>
> What I can't find out is WHY it happened.
> There is no useful information in the system logs, nor in the iDRAC
> motherboard logs.
>
> Is there a way to improve or configure logging for a fenced / failed
> node?
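One thing you can do is raise corosync's own log level. A minimal sketch for the logging section of corosync.conf (the logfile path is just an example, adjust it to your distribution):

  logging {
      to_syslog: yes
      to_logfile: yes
      logfile: /var/log/cluster/corosync.log
      timestamp: on
      # extra detail for the membership layer that declared the node lost
      logger_subsys {
          subsys: TOTEM
          debug: on
      }
  }

Pacemaker's daemons can be made more verbose as well (PCMK_debug in /etc/sysconfig/pacemaker), but for a node that silently drops out of the membership the corosync side is usually the more interesting one.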
>
> Thanks
>
> Damiano
>
> On Thu, 22 Jul 2021 at 15:06, Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote:
>
>> Hi,
>>
>> On Wed, 14 Jul 2021 07:58:14 +0200
>> "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> wrote:
>> [...]
>> > Could it be that a command saturated the network?
>> > Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936 UTC [172262] LOG: duration: 660.329 ms execute <unnamed>: SELECT xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN#011 ism_files f ON f.id_file = xmf.file_id JOIN#011 ism_files_path fp ON f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT 7265 ;
>>
>> I doubt such a query could saturate the network. The query time itself
>> isn't proportional to the result set size.
>>
>> Moreover, there are only three fields per row and, judging by their names,
>> I doubt the rows are really big.
>>
>> Plus, even if the result set were that big, chances are the frontend would
>> not be able to consume it as fast as the network can deliver it, unless it
>> does nothing really fancy with the dataset. So the frontend itself might
>> saturate before the network, giving the latter a break.
>>
>> However, if this query time is unusual, it might indicate pressure on the
>> server from some other source (CPU? memory? I/O?). Detailed metrics would
>> help.
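For collecting such metrics, a plain sysstat/sar recording is usually enough; a sketch, assuming the sysstat package is installed and using an arbitrary output file name:

  # Sample system activity once per second for an hour into a binary file
  sar -o /var/log/sar-ltaoperdbs03.bin 1 3600 >/dev/null 2>&1 &

  # Later, replay CPU, memory, block I/O and network figures around the incident
  sar -f /var/log/sar-ltaoperdbs03.bin -u -r -b -n DEV

On many distributions sysstat already records this every 10 minutes via cron/systemd, so data for the night of the fencing may already exist under /var/log/sa/ or /var/log/sysstat.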
>>
>> Regards,
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>