[ClusterLabs] Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres

Fri Oct 8 10:41:24 EDT 2021

On Fri, 8 Oct 2021 15:00:30 +0200
damiano giuliani <damianogiuliani87 at gmail.com> wrote:

> Hi Guys,

Hi,

Good to hear from you, thanks for the follow-up!

My answers are below.

> ...
> So it turn out that a lil bit of swap was used and i suspect corosync
> process were swapped to disks creating lag where 1s default corosync
> timeout was not enough.

Did you see swap activity (pages swapped in/out, not just swap occupancy)
at the same time the member was lost on the corosync side?

Did you check whether corosync or some of its libs were indeed in swap?
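
A minimal sketch of how both questions could be checked on Linux, assuming
the corosync PID is passed as the first argument. The function names are
mine, not part of any corosync tooling. It reads the VmSwap field of
/proc/<pid>/status and the cumulative pswpin/pswpout counters of
/proc/vmstat; sampling the counters twice around an incident shows whether
pages were actually moving in or out of swap.

```python
import re
import sys
from pathlib import Path


def swap_kb(status_text):
    """Return the VmSwap value (kB) found in /proc/<pid>/status text, or None."""
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def swap_counters(vmstat_text):
    """Return cumulative (pswpin, pswpout) page counts from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


if __name__ == "__main__":
    pid = sys.argv[1]  # assumption: corosync PID given on the command line
    status = Path(f"/proc/{pid}/status").read_text()
    print("corosync VmSwap (kB):", swap_kb(status))
    print("pswpin, pswpout:", swap_counters(Path("/proc/vmstat").read_text()))
```

A non-zero VmSwap only tells you that some pages of the process sit in swap
right now; it is the change in pswpin/pswpout over time that shows activity.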

> So it is, swap doesnt log anything and moving process to allocated ram to
> swap take times more that 1s default timeout (probably many many mores).

Well, I have two different thoughts.

First, corosync now sits on a lot of memory because of knet. Did you try
switching back to udpu, which uses far less memory?

Second, a colleague suggested that I check whether corosync mlocks itself.
And indeed, it calls mlockall (see mlock(2)) to lock itself in physical
memory. The call might fail, but the error doesn't stop corosync from
starting anyway. Check your logs for this error:

  "Could not lock memory of service to avoid page faults"

On my side, the locked memory limit is unlimited. Check the values in
/proc/$(pidof corosync)/limits (be careful with the ulimit command, it only
reports the limits of your current shell; check the proc file itself).
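
As a rough sketch, the same check can be scripted. It assumes the usual
Linux layout of /proc/<pid>/limits, where the "Max locked memory" row
carries the soft and hard limits, and it takes the corosync PID as the
first argument. The function name is only illustrative.

```python
import sys
from pathlib import Path

LABEL = "Max locked memory"


def locked_memory_limit(limits_text):
    """Return (soft, hard) for 'Max locked memory' as strings, or None.

    Values are kept as strings because they may be 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith(LABEL):
            fields = line[len(LABEL):].split()
            if len(fields) >= 2:
                return fields[0], fields[1]
    return None


if __name__ == "__main__":
    pid = sys.argv[1]  # assumption: corosync PID given on the command line
    limits = Path(f"/proc/{pid}/limits").read_text()
    print("Max locked memory (soft, hard):", locked_memory_limit(limits))
```

If the soft limit is a small number of bytes rather than "unlimited", that
would be consistent with the mlockall call failing at startup.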

> i fix it changing the swappiness of each servers to 10 (at minimum)
> avoinding the corosync process could swap.

That would be my first reflex as well. Keep us informed whether that
definitely fixed your failover troubles.

>  this issue which should be easy drove me crazy because nowhere process
> swap is tracked on logs but make corosync trigger the timeout and make the
> cluster failover.

This is really interesting and useful.

Thanks,