[ClusterLabs] Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres

damiano giuliani damianogiuliani87 at gmail.com
Fri Oct 8 09:00:30 EDT 2021


Hi guys, after months of sudden, unexpected failovers, checking every
corner and every type of log without any luck (no log showed any reason or
problem anywhere), I was on the edge of madness, but I finally managed to
find out what was causing these sudden switches.
It was a tough bout, but I think I finally got it.
I'm quite sure this can be useful, especially for high-load database
clusters.
The servers are all resource overkill, with 80 CPUs and 256 GB of RAM, even
though the DB ingests millions of records per day; the network is bonded
10 Gb/s and the disks are SSDs.
So I found out that under high load the DB suddenly switched for no apparent
reason, kicking out the master because of lost communication with it.
The network works flawlessly without dropping a packet, RAM was never
saturated, and the CPUs are quite overkill.
It turned out that a little bit of swap was in use, and I suspect the
corosync process was swapped to disk, creating enough lag that the 1s
default corosync timeout was not enough.
That's it: swapping doesn't log anything, and paging a process' memory back
from swap into RAM takes more than the 1s default timeout (probably many
times more).
I fixed it by lowering the swappiness of each server to 10 (at minimum), so
that the corosync process should no longer be swapped out.
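
For reference, a minimal sketch of the change on each node (assuming a
standard sysctl setup; the value 10 is just what I used, not a universal
recommendation):

    # check the current value
    sysctl vm.swappiness

    # apply the new value immediately
    sysctl -w vm.swappiness=10

    # persist it across reboots
    echo "vm.swappiness = 10" > /etc/sysctl.d/99-swappiness.conf
    sysctl --system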
This issue, which should have been easy, drove me crazy because process
swapping is not tracked in any log, yet it makes corosync hit the timeout
and makes the cluster fail over.
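
Nothing logs it, but the kernel does expose per-process swap usage, so a
quick check like this (assuming corosync is running) should show whether
corosync pages are sitting in swap:

    # VmSwap is how much of the process' memory is currently swapped out
    grep VmSwap /proc/$(pidof corosync)/status

    # watch overall swap-in/swap-out activity (si/so columns)
    vmstat 5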

Really hope it can help the community

Best

On Fri, 23 Jul 2021 at 15:46, Jehan-Guillaume de Rorthais <
jgdr at dalibo.com> wrote:

> On Fri, 23 Jul 2021 12:52:00 +0200
> damiano giuliani <damianogiuliani87 at gmail.com> wrote:
>
> > the query time isn't the problem, it's known that it takes its time. the
> > network is 10 Gb/s bonding, quite impossible to saturate with queries :=).
>
> Everything is possible, it's just harder :)
>
> [...]
> > checking the logs again, what is not clear to me is the cause of the loss
> > of quorum and then the fencing of the node.
>
> As said before, according to logs from other nodes, ltaoperdbs02 did not
> answer the TOTEM protocol anymore, so it left the communication group.
> But worse, it did it without saying goodbye properly:
>
>   > [TOTEM ] Failed to receive the leave message. failed: 1
>
> From this exact time, the node is then considered "unclean", aka its state
> is "unknown". To solve this trouble, the cluster needs to fence it to set a
> predictable state: OFF. So, the reaction to the trouble is sane.
>
> Now, from the starting point of this conversation, the question is: what
> happened? Logs on other nodes will probably not help, as they just
> witnessed a node disappearing without any explanation.
>
> Logs from ltaoperdbs02 might help, but the corosync log you sent stops at
> 00:38:44, almost 2 minutes before the fencing as reported by other nodes:
>
>   > Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:  warning: pe_fence_node:
>   >    Cluster node ltaoperdbs02 will be fenced: peer is no longer part of
>
>
> > So the cluster works flawlessly as expected: as soon as ltaoperdbs02 became
> > "unreachable", it formed a new quorum, fenced the lost node and promoted
> > the new master.
>
> Exactly.
>
> > What I can't find out is WHY it happened.
> > There is no useful information in the system logs nor in the iDRAC
> > motherboard logs.
>
> Because I suppose some logs were not synced to disk when the server was
> fenced.
>
> Either the server clocks were not synced (I doubt it), or you really lost
> almost 2 minutes of logs.
>
> > Is there a way to improve or configure a logging system for a fenced/failed
> > node?
>
> Yes:
>
> 1. Set up rsyslog to export logs to some dedicated logging servers. Such
> servers should receive and save logs from your clusters and other hardware
> (network?) and keep them safe. You will not lose messages anymore.
>
> 2. Gather a lot of system metrics and keep them safe (e.g. export them
> using pcp, collectd, etc.). Metrics and visualization are important to
> cross-compare with logs and pinpoint something behaving outside of the
> usual scope.
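
Regarding point 1, a minimal rsyslog forwarding sketch (the server name
logserver.example.com and port 514 are placeholders, not taken from this
thread):

    # on each cluster node, e.g. /etc/rsyslog.d/90-forward.conf
    # forward everything to the central log server over TCP (@@ = TCP, @ = UDP)
    *.* @@logserver.example.com:514

    # on the central log server, enable TCP reception
    module(load="imtcp")
    input(type="imtcp" port="514")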
>
>
> Looking at your log, I still find your query times suspicious. I'm not
> convinced they are the root cause; they might just be a bad symptom/signal
> of something going wrong there. Having a one-row INSERT taking 649.754 ms
> is suspicious. Maybe it's just a locking problem, maybe there are some
> CPU-bound PostGIS things involved, maybe with some GIN or GiST indexes, but
> it's still suspicious considering the server is over-sized in performance,
> as you stated...
>
> And maybe the network or SAN had a hiccup and corosync was too sensitive
> to it. Check the retransmit and timeout parameters?
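
The parameters in question live in the totem section of
/etc/corosync/corosync.conf; a rough sketch (the values are purely
illustrative, the thread only states that the default token timeout is 1s):

    totem {
        version: 2
        # token timeout in milliseconds; the thread mentions a 1s default
        token: 3000
        # how many token retransmits before the token is declared lost
        token_retransmits_before_loss_const: 10
    }

Changes to corosync.conf generally need a corosync restart (or a config
reload, depending on the version) on every node to take effect.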
>
>
> Regards,
>

