[ClusterLabs] Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres

Andrei Borzenkov arvidjaar at gmail.com
Sat Oct 9 02:55:28 EDT 2021


On 08.10.2021 16:00, damiano giuliani wrote:
> Hi guys, after months of sudden, unexpected failovers, and after checking
> every corner and every type of log without any luck (no logs, reasons or
> problems were shown anywhere), I was on the edge of madness, but I finally
> managed to find out what was causing these sudden switches.
> It was a tough bout, but I think I finally got it.
> I'm quite sure this can be useful, especially for high-load database
> clusters.
> The servers are generously over-provisioned, with 80 CPUs and 256 GB of RAM,
> even though the DB ingests millions of records per day; the network is
> bonded 10 Gb/s, with SSD disks.
> I found out that under high load the DB suddenly switches over for no
> apparent reason, kicking out the master because it stopped communicating.
> The network works flawlessly without dropping a packet, RAM was never
> saturated and the CPUs are more than sufficient.
> So it turns out that a little bit of swap was used, and I suspect the
> corosync process was swapped to disk, creating lag for which the 1 s default
> corosync timeout was not enough.

But you do not know whether corosync was swapped out at all, so it is just a
guess.
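
For what it's worth, if the node is still up you can check this directly
instead of guessing. A minimal check (assuming a single corosync process and
the usual Linux /proc layout):

  # VmSwap = kB of this process currently sitting in swap,
  # VmLck  = kB locked into RAM (corosync should lock itself via mlockall())
  grep -E 'VmSwap|VmLck' /proc/$(pidof corosync)/status

If VmSwap stays at 0 kB while the problem is reproducing, swapping of
corosync itself was probably not the trigger.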

> So it is: swap doesn't log anything, and moving a process from RAM to swap
> takes much longer than the 1 s default timeout (probably many times more).
> I fixed it by changing the swappiness of each server to 10 (at a minimum),
> preventing the corosync process from being swapped out.

The swappiness kernel parameter does not really prevent swap from being used.
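
To illustrate: something like the sketch below (values taken from your mail,
not a recommendation) only biases the kernel's reclaim decisions, it does not
forbid swapping corosync out:

  # make the setting persistent across reboots
  echo 'vm.swappiness = 10' > /etc/sysctl.d/90-swappiness.conf
  sysctl -p /etc/sysctl.d/90-swappiness.conf

If you really want to rule swapping out for corosync, making sure its memory
can be locked (e.g. LimitMEMLOCK=infinity in a systemd drop-in for the
corosync unit, so that an mlockall() call cannot fail for lack of rlimit) is
the stronger option.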

What is your kernel version? On several consecutive kernel versions I
observed the following effect: once swap started being used at all, the
system experienced periodic stalls of several seconds. It felt like a frozen
system. It did not matter how much swap was actually in use - several
megabytes were already enough.

As far as I understand, the problem was not really the time needed to swap
out/in, but the time the kernel spent traversing page tables to make the
decision. I think it started with kernel 5.3 (or maybe 5.2), and I have not
seen it any more since, I believe, kernel 5.7.
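
If you want to catch this kind of stall in the act, the pressure stall
information files (available since roughly kernel 4.20, when PSI is enabled)
are a cheap way to see whether tasks were blocked on memory around the time
corosync missed its token:

  # the "full" line is the share of wall time during which all non-idle
  # tasks were stalled waiting on memory; spikes here match "frozen" moments
  cat /proc/pressure/memory

Exporting that (and /proc/pressure/io) into whatever metrics system you keep
would also answer the "was it swap or not" question the next time it happens.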

> This issue, which should have been easy, drove me crazy, because process
> swapping is not tracked in any log, yet it makes corosync trigger the
> timeout and makes the cluster fail over.
> 
> Really hope it can help the community
> 
> Best
> 
> On Fri, 23 Jul 2021 at 15:46, Jehan-Guillaume de Rorthais <
> jgdr at dalibo.com> wrote:
> 
>> On Fri, 23 Jul 2021 12:52:00 +0200
>> damiano giuliani <damianogiuliani87 at gmail.com> wrote:
>>
>>> The query time isn't the problem; it is known to take its time. The network
>>> is 10 Gb/s bonding, nearly impossible to saturate with queries :=).
>>
>> Everything is possible, it's just harder :)
>>
>> [...]
>>> Checking the logs again, what is not clear to me is the cause of the loss
>>> of quorum and the subsequent fencing of the node.
>>
>> As said before, according to the logs from the other nodes, ltaoperdbs02 did
>> not answer the TOTEM protocol anymore, so it left the communication group.
>> Worse, it did so without saying goodbye properly:
>>
>>   > [TOTEM ] Failed to receive the leave message. failed: 1
>>
>> From this exact time, the node is considered "unclean", i.e. its state is
>> "unknown". To resolve this, the cluster needs to fence it in order to set a
>> predictable state: OFF. So the reaction to the trouble is sane.
>>
>> Now, from the starting point of this conversation, the question is: what
>> happened? Logs on the other nodes will probably not help, as they just
>> witnessed a node disappearing without any explanation.
>>
>> Logs from ltaoperdbs02 might help, but the corosync log you sent stops at
>> 00:38:44, almost 2 minutes before the fencing as reported by the other nodes:
>>
>>   > Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:  warning:
>> pe_fence_node:
>>   >    Cluster node ltaoperdbs02 will be fenced: peer is no longer part of
>>
>>
>>> So the cluster worked flawlessly, as expected: as soon as ltaoperdbs02
>>> became "unreachable", it formed a new quorum, fenced the lost node and
>>> promoted the new master.
>>
>> Exactly.
>>
>>> What I can't find out is WHY it happened.
>>> There is no useful information in the system logs, nor in the iDRAC
>>> motherboard logs.
>>
>> Probably because some logs were not yet synced to disk when the server was
>> fenced.
>>
>> Either the server clocks were not in sync (which I doubt), or you really
>> lost almost 2 minutes of logs.
>>
>>> Is there a way to improve or configure a logging system for a fenced/failed
>>> node?
>>
>> Yes:
>>
>> 1. Set up rsyslog to export logs to some dedicated logging servers. Such
>> servers should receive and save logs from your clusters and other hardware
>> (network gear?) and keep them safe. You will not lose messages anymore.
>>
>> 2. Gather a lot of system metrics and keep them safe (e.g. export them using
>> pcp, collectd, etc.). Metrics and visualization are important to cross-check
>> against the logs and to pinpoint anything behaving outside of the usual
>> scope.
>>
>>
>> Looking at your log, I still find your query times suspicious. I'm not
>> convinced they are the root cause; they might just be a bad symptom/signal
>> of something going wrong there. Having a one-row INSERT take 649.754 ms is
>> suspicious. Maybe it's just a locking problem, maybe there is some CPU-bound
>> PostGIS work involved, maybe with some GIN or GiST indexes, but it's still
>> suspicious considering the server is over-sized in performance, as you
>> stated...
>>
>> And maybe the network or SAN had a hiccup and corosync was too sensitive to
>> it. Check the retransmit and timeout parameters?
>>
>>
>> Regards,
>>
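As a footnote to the last two suggestions quoted above: the knobs
Jehan-Guillaume refers to live in the totem section of corosync.conf, and the
remote-logging part is a one-liner in rsyslog. The values below are purely
illustrative and need tuning for your environment:

  # /etc/corosync/corosync.conf (excerpt) - give the membership protocol more
  # slack before a node is declared lost; must be identical on all nodes
  totem {
      token: 3000
      token_retransmits_before_loss_const: 10
  }

  # /etc/rsyslog.d/90-forward.conf - ship a copy of everything to a central
  # log host over TCP (the hostname is a placeholder)
  *.* @@loghost.example.com:514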


