<div dir="ltr">Hi guys, after months of sudden, unexpected failovers, and after checking every corner and every type of log without any luck (no logs, no reasons and no problems were shown anywhere), I was on the edge of madness, but I finally managed to find out what was causing these sudden switches.<div>It was a tough fight, but I think I finally got it.</div><div>I'm quite sure this can be useful, especially for high-load database clusters.</div><div>The servers are all overkill in terms of resources, with 80 CPUs and 256 GB of RAM, even though the db ingests millions of records per day; the network is bonded 10 Gb/s and the disks are SSDs.</div><div>What I found is that under high load the db suddenly switches for no apparent reason, kicking out the master because communication with it is lost.</div><div>The network works flawlessly without dropping a single packet, the RAM was never saturated and the CPUs are quite overkill.</div><div>It turns out that a little bit of swap was in use, and I suspect the corosync process was swapped out to disk, creating a lag for which the default 1 s corosync timeout was not enough.</div><div>That's how it is: swapping doesn't log anything, and moving a process from allocated RAM to swap takes more than the default 1 s timeout (probably much, much more).</div><div>I fixed it by changing the swappiness on each server to 10 (as a minimum step), to keep the corosync process from being swapped out.</div><div>This issue, which should have been easy, drove me crazy because process swapping is not tracked in any log, yet it makes corosync hit its timeout and makes the cluster fail over.</div><div><br></div><div>I really hope this can help the community.</div><div><br></div><div>Best</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 23 Jul 2021 at 15:46, Jehan-Guillaume de Rorthais <<a href="mailto:jgdr@dalibo.com" target="_blank">jgdr@dalibo.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fri, 23 Jul 2021 12:52:00 +0200<br>

damiano giuliani <<a href="mailto:damianogiuliani87@gmail.com" target="_blank">damianogiuliani87@gmail.com</a>> wrote:<br>

<br>

> the query time isn't the problem; it is known to take its time. The network<br>

> is 10 Gb/s bonded, quite impossible to saturate with queries :=).<br>

<br>

Everything is possible, it's just harder :)<br>

<br>

[...]<br>

> checking the logs again, what is not clear to me is the cause of the loss<br>

> of quorum and the subsequent fencing of the node.<br>

<br>

As said before, according to the logs from the other nodes, ltaoperdbs02 did not<br>

answer to the TOTEM protocol anymore, so it left the communication group. But<br>

worse, it did it without saying goodbye properly:<br>

<br>

  > [TOTEM ] Failed to receive the leave message. failed: 1 <br>

<br>

From this exact time, the node is then considered "unclean", aka<br>

its state "unknown". To solve this trouble, the cluster needs to fence it to<br>

set a predictable state: OFF. So, the reaction to the trouble is sane.<br>

<br>

Now, from the starting point of this conversation, the question is what<br>

happened? Logs on other nodes will probably not help, as they just witnessed a<br>

node disappearing without any explanation.<br>

<br>

Logs from ltaoperdbs02 might help, but the corosync log you sent stops at<br>

00:38:44, almost 2 minutes before the fencing as reported from other nodes:<br>

<br>

  > Jul 13 00:40:37 [228699] ltaoperdbs03    pengine:  warning: pe_fence_node:<br>

  >    Cluster node ltaoperdbs02 will be fenced: peer is no longer part of<br>

<br>

<br>

> So the cluster works flawlessly, as expected: as soon as ltaoperdbs02 became<br>

> "unreachable", it formed a new quorum, fenced the lost node and promoted<br>

> the new master.<br>

<br>

Exactly.<br>

<br>

> What I can't find out is WHY it happened.<br>

> There is no useful information in the system logs or in the<br>

> iDRAC motherboard logs.<br>

<br>

Because, I suppose, some logs were not synced to disk when the server was<br>

fenced.<br>

<br>

Either the server clocks were not synced (which I doubt), or you really lost almost<br>

2 minutes of logs.<br>

<br>

> Is there a way to improve or configure a logging system for fenced / failed<br>

> nodes?<br>

<br>

Yes:<br>

<br>

1. Set up rsyslog to export logs to some dedicated logging servers. Such<br>

servers should receive and save logs from your clusters and other hardware<br>

(network?) and keep them safe. You will not lose messages anymore.<br>

<br>

2. Gather a lot of system metrics and keep them safe (e.g. export them using pcp,<br>

collectd, etc). Metrics and visualization are important to cross-compare with<br>

logs and pinpoint something behaving outside of the usual scope.<br>

<br>

<br>

Looking at your log, I still find your query times suspicious. I'm not<br>

convinced they are the root cause, they might be just a bad symptom/signal<br>

of something going wrong there. Having a one-row INSERT taking 649.754ms is<br>

suspicious. Maybe it's just a locking problem, maybe there's some CPU-bound<br>

postgis things involved, maybe with some GIN or GiST indexes, but it's still<br>

suspicious considering the server is over-sized in performance as you stated...<br>

<br>

And maybe the network or SAN had a hiccup and corosync has been too sensitive<br>

to it. Check the retransmit and timeout parameters?<br>

<br>

<br>

Regards,<br>

</blockquote></div>
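<div dir="ltr">For anyone who wants to check the swap suspicion described at the top of this message on their own nodes, here is a minimal sketch, not something taken from the thread. It assumes a Linux host where /proc/&lt;pid&gt;/status exposes a VmSwap line and where the process is simply named "corosync"; the helper names parse_vmswap_kb and swapped_processes are made up for illustration.</div>

```python
import os


def parse_vmswap_kb(status_text):
    """Return (name, swap_kb) parsed from the text of /proc/<pid>/status.

    swap_kb is 0 when there is no VmSwap line (e.g. kernel threads).
    """
    name = None
    swap_kb = 0
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmSwap:"):
            # Expected shape: "VmSwap:     1234 kB"
            swap_kb = int(line.split()[1])
    return name, swap_kb


def swapped_processes(names=("corosync",), proc_root="/proc"):
    """List (pid, name, swap_kb) for every process whose name is in `names`."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                name, swap_kb = parse_vmswap_kb(f.read())
        except OSError:
            continue  # process exited or is not readable
        if name in names:
            results.append((int(entry), name, swap_kb))
    return results


if __name__ == "__main__":
    # Current swappiness of this host (the value changed in the fix above).
    with open("/proc/sys/vm/swappiness") as f:
        print("vm.swappiness =", f.read().strip())
    for pid, name, swap_kb in swapped_processes():
        print(f"{name} (pid {pid}): {swap_kb} kB in swap")
```

<div dir="ltr">Run periodically (for example from cron, with the output sent to the remote log server suggested in the quoted reply), a non-zero value for corosync would be a hint that the process has pages in swap. It does not prove that swapping caused a given timeout.</div>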