[ClusterLabs] Antw: Re: Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres

Mon Oct 11 02:09:31 EDT 2021

>>> damiano giuliani <damianogiuliani87 at gmail.com> wrote on 08.10.2021 at 15:00
in message
<CAG=zYNO0ieawqearuzh2cDMy-6KzF3DHbBUBr0iiUrF47bg7jw at mail.gmail.com>:
> Hi guys, after months of sudden, unexpected failovers, checking every
> corner and type of log without any luck, because there were no logs and no reasons or

If you have no logs, you should clearly check your configuration.

...
> So it turns out that a little bit of swap was used, and I suspect the corosync
> process was swapped to disk, creating lag for which the 1s default corosync
> timeout was not enough.

BTW: Do you use thin-provisioned swap (just in case)?

> So there it is: swap doesn't log anything, and moving a process from allocated RAM to
> swap takes more time than the 1s default timeout (probably many times more).

When swapping to/from SSD, it's hard to believe that it takes so long that the cluster nodes would be fenced.
Also, code that is referenced periodically won't be swapped out, especially if you have plenty of RAM.
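
To judge whether a swap-in delay could plausibly exceed the timeout, it helps to know which token timeout is actually configured rather than assuming the 1s mentioned above. The following is only a rough sketch: the function name token_timeout_ms is made up for illustration, the usual config location /etc/corosync/corosync.conf is an assumption, and the parser is deliberately naive (it takes the first "token:" line and does not check that it sits inside the totem section).

```python
import re


def token_timeout_ms(conf_text):
    """Return the configured token timeout in ms, or None if none is set.

    Hypothetical helper: it scans corosync.conf-style text for the first
    line of the form "token: <number>". Commented-out lines and other
    keys such as token_retransmit do not match the pattern.
    """
    for line in conf_text.splitlines():
        match = re.match(r"\s*token\s*:\s*(\d+)\s*$", line)
        if match:
            return int(match.group(1))
    return None


if __name__ == "__main__":
    sample = """
    totem {
        version: 2
        token: 5000
    }
    """
    print(token_timeout_ms(sample))
```

If the function returns None, no explicit token value is set in that text and the built-in default of the installed corosync version applies.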

> I fixed it by changing the swappiness of each server to 10 (at minimum),
> preventing the corosync process from swapping.

Do you have proof that swap was the problem?
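
One way to gather at least some evidence is to look at the VmSwap field that Linux exposes in /proc/<pid>/status. The sketch below is an illustration only: the helper names parse_vmswap_kb and swapped_processes are made up here, and the process name "corosync" is assumed to match what /proc/<pid>/comm reports on your nodes.

```python
import re
from pathlib import Path


def parse_vmswap_kb(status_text):
    """Return the VmSwap value in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def swapped_processes(name, proc_root="/proc"):
    """Yield (pid, swapped_kb) for every process whose comm equals `name`."""
    root = Path(proc_root)
    if not root.is_dir():
        return
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            if (entry / "comm").read_text().strip() != name:
                continue
            swapped = parse_vmswap_kb((entry / "status").read_text())
        except OSError:
            # The process exited during the scan, or access was denied.
            continue
        if swapped is not None:
            yield int(entry.name), swapped


if __name__ == "__main__":
    swappiness = Path("/proc/sys/vm/swappiness")
    if swappiness.exists():
        print("vm.swappiness =", swappiness.read_text().strip())
    for pid, kb in swapped_processes("corosync"):
        print(f"corosync pid {pid}: {kb} kB in swap")
```

Note that this is only a snapshot: a non-zero value shows that pages of the process are in swap right now, not that a swap-in delay caused an earlier fence. A lower swappiness also only reduces the kernel's tendency to swap; it does not guarantee that a given process is never swapped.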

...

Regards,
Ulrich