[ClusterLabs] Antw: Re: Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Fri Jul 23 07:32:15 EDT 2021
>>> damiano giuliani <damianogiuliani87 at gmail.com> wrote on 23.07.2021 at 12:52
in message
<CAG=zYNM9FRNaOsre92mjZL_BXYun37FfcuvqXe0qpURg0msimg at mail.gmail.com>:
> Hi guys, thanks for the support.
> The query time isn't the problem; it is known to take a while. The network
> is 10 Gb/s bonding, quite impossible to saturate with queries :=).
> The servers are heavily over-provisioned; at full database working load only
> about 20% of the resources are used.
> Checking the logs again, what is still not clear to me is what caused the
> loss of quorum and the subsequent fencing of the node.
> There is no information in the logs (not even in the iDRAC / motherboard
> event logs).
>
> The only clear log entries are:
> [228684] ltaoperdbs03 corosync notice  [TOTEM ] A processor failed, forming new configuration.
Hi!
I wonder: would the corosync blackbox (COROSYNC-BLACKBOX(8)) help? As an alternative, you could capture the TOTEM packets in a set of rotating files and try to find out what was going on.
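For example (an untested sketch; the bond0 interface name and the file paths are just placeholders, and 5405 is only corosync's default port, adjust if you changed mcastport):

  # Ring-buffer capture of corosync/TOTEM traffic: 10 files of 100 MB each
  tcpdump -i bond0 -C 100 -W 10 -w /var/log/totem-capture.pcap udp port 5405

  # After an incident, dump the corosync flight recorder ("blackbox")
  corosync-blackbox > /tmp/corosync-blackbox.txt

The flight recorder is an in-memory ring buffer of corosync's internal events, so it is best dumped soon after the membership change you are interested in.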
As it seems now, the issue is that a remote node cannot "be seen".
Regards,
Ulrich
> [228684] ltaoperdbs03 corosync notice  [TOTEM ] A new membership (172.18.2.12:227) was formed. Members left: 1
> [228684] ltaoperdbs03 corosync notice  [TOTEM ] Failed to receive the leave message. failed: 1
> [228684] ltaoperdbs03 corosync warning [CPG   ] downlist left_list: 1 received
> [228684] ltaoperdbs03 corosync warning [CPG   ] downlist left_list: 1 received
> Jul 13 00:40:37 [228695] ltaoperdbs03 cib: info: pcmk_cpg_membership: Group cib event 3: ltaoperdbs02 (node 1 pid 6136) left via cluster exit
> Jul 13 00:40:37 [228695] ltaoperdbs03 cib: info: crm_update_peer_proc: pcmk_cpg_membership: Node ltaoperdbs02[1] - corosync-cpg is now offline
> Jul 13 00:40:37 [228700] ltaoperdbs03 crmd: info: pcmk_cpg_membership: Group crmd event 3: ltaoperdbs02 (node 1 pid 6141) left via cluster exit
> Jul 13 00:40:37 [228695] ltaoperdbs03 cib: notice: crm_update_peer_state_iter: Node ltaoperdbs02 state is now lost | nodeid=1 previous=member source=crm_update_peer_proc
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: warning: pe_fence_node: Cluster node ltaoperdbs02 will be fenced: peer is no longer part of the cluster
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: warning: determine_online_status: Node ltaoperdbs02 is unclean
>
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: notice: LogNodeActions: * Fence (reboot) ltaoperdbs02 'peer is no longer part of the cluster'
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: notice: LogAction: * Promote pgsqld:0 ( Slave -> Master ltaoperdbs03 )
> Jul 13 00:40:37 [228699] ltaoperdbs03 pengine: info: LogActions: Leave pgsqld:1 (Slave ltaoperdbs04)
>
>
> So the cluster worked flawlessly, as expected: as soon as ltaoperdbs02 became
> "unreachable", the remaining nodes formed a new quorum, fenced the lost node
> and promoted a new master.
>
> What I can't find out is WHY it happened.
> There is no useful information in the system logs, nor in the iDRAC
> motherboard logs.
>
> Is there a way to improve or configure logging for a fenced / failed
> node?
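One thing you can do is raise corosync's own log level. A minimal sketch for the logging section of corosync.conf (the logfile path is just an example, adjust it to your distribution):

  logging {
      to_syslog: yes
      to_logfile: yes
      logfile: /var/log/cluster/corosync.log
      timestamp: on
      # extra detail for the membership layer that declared the node lost
      logger_subsys {
          subsys: TOTEM
          debug: on
      }
  }

Pacemaker's daemons can be made more verbose as well (PCMK_debug in /etc/sysconfig/pacemaker), but for a node that silently drops out of the membership the corosync side is usually the more interesting one.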
>
> Thanks
>
> Damiano
>
> On Thu, 22 Jul 2021 at 15:06, Jehan-Guillaume de Rorthais <jgdr at dalibo.com> wrote:
>
>> Hi,
>>
>> On Wed, 14 Jul 2021 07:58:14 +0200
>> "Ulrich Windl" <Ulrich.Windl at rz.uni-regensburg.de> wrote:
>> [...]
>> > Could it be that a command saturated the network?
>> > Jul 13 00:39:28 ltaoperdbs02 postgres[172262]: [20-1] 2021-07-13 00:39:28.936 UTC [172262] LOG: duration: 660.329 ms execute <unnamed>: SELECT xmf.file_id, f.size, fp.full_path FROM ism_x_medium_file xmf JOIN#011 ism_files f ON f.id_file = xmf.file_id JOIN#011 ism_files_path fp ON f.id_file = fp.file_id JOIN ism_online o ON o.file_id = xmf.file_id WHERE xmf.medium_id = 363 AND xmf.x_media_file_status_id = 1 AND o.online_status_id = 3 GROUP BY xmf.file_id, f.size, fp.full_path LIMIT 7265 ;
>>
>> I doubt such a query could saturate the network. The query time itself
>> isn't proportional to the result set size.
>>
>> Moreover, there are only three fields per row and, judging by their names,
>> I doubt the rows are really big.
>>
>> Plus, even if the result set were that big, chances are the frontend would
>> not be able to consume it as fast as the network can deliver it, unless it
>> does nothing really fancy with the dataset. So the frontend itself might
>> saturate before the network, giving the latter a break.
>>
>> However, if this query time is unusual, it might indicate pressure on the
>> server from some other source (CPU? memory? I/O?). Detailed metrics would
>> help.
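For collecting such metrics, a plain sysstat/sar recording is usually enough; a sketch, assuming the sysstat package is installed and using an arbitrary output file name:

  # Sample system activity once per second for an hour into a binary file
  sar -o /var/log/sar-ltaoperdbs03.bin 1 3600 >/dev/null 2>&1 &

  # Later, replay CPU, memory, block I/O and network figures around the incident
  sar -f /var/log/sar-ltaoperdbs03.bin -u -r -b -n DEV

On many distributions sysstat already records this every 10 minutes via cron/systemd, so data for the night of the fencing may already exist under /var/log/sa/ or /var/log/sysstat.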
>>
>> Regards,
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>