[ClusterLabs] Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Mon Oct 11 05:57:37 EDT 2021
Hi,
I kept your full answer quoted in the history, to keep the list informed.
My answers are inline, down below.
On Mon, 11 Oct 2021 11:33:12 +0200
damiano giuliani <damianogiuliani87 at gmail.com> wrote:
> hey guys, sorry for being late, I was busy during the weekend
>
> here I am:
>
>
> > Did you see the swap activity (in/out, not just swap occupation) happen at
> > the same time the member was lost on the corosync side?
> > Did you check whether corosync or some of its libs were indeed in swap?
> >
> >
> No, and I don't know how to do it. I just noticed the swap occupation, which
> prompted me (and my colleague) to find out whether it could cause some trouble.
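For the record, here is a quick way to check it (just a sketch, adapt the
process name/PID to your setup):

  # swap in/out activity over time: watch the "si"/"so" columns
  vmstat 1

  # how much of the corosync process currently sits in swap
  grep VmSwap /proc/$(pidof corosync)/status

If si/so stay at 0 and VmSwap stays at 0 kB around the time of the membership
loss, swap was probably not the direct cause.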
>
> > First, corosync now sits on a lot of memory because of knet. Did you try to
> > switch back to udpu, which uses way less memory?
>
>
> No, I haven't moved to udpu; I can't stop the processes at all.
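Should you ever get a maintenance window, switching the transport back to udpu
is only a corosync.conf change; something like this in the totem section (only
a sketch, double-check against your current configuration):

  totem {
      # ... keep your existing settings ...
      transport: udpu
  }

As far as I know it requires restarting corosync on all nodes, so it is not a
live change.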
>
> "Could not lock memory of service to avoid page faults"
>
>
> grep -rn 'Could not lock memory of service to avoid page faults' /var/log/*
> returns nothing
This message should appear on corosync startup. Make sure the logs haven't been
rotated to a blackhole in the meantime...
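If your logs also go through journald, it may be worth looking there too, in
case the file logs were rotated away; for example:

  journalctl -u corosync | grep -i 'could not lock memory'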
> > On my side, mlock is unlimited in the ulimit settings. Check the values
> > in /proc/$(coro PID)/limits (be careful with the ulimit command, check the
> > proc itself).
>
>
> cat /proc/101350/limits
> Limit                     Soft Limit   Hard Limit   Units
> Max cpu time              unlimited    unlimited    seconds
> Max file size             unlimited    unlimited    bytes
> Max data size             unlimited    unlimited    bytes
> Max stack size            8388608      unlimited    bytes
> Max core file size        0            unlimited    bytes
> Max resident set          unlimited    unlimited    bytes
> Max processes             770868       770868       processes
> Max open files            1024         4096         files
> Max locked memory         unlimited    unlimited    bytes
> Max address space         unlimited    unlimited    bytes
> Max file locks            unlimited    unlimited    locks
> Max pending signals       770868       770868       signals
> Max msgqueue size         819200       819200       bytes
> Max nice priority         0            0
> Max realtime priority     0            0
> Max realtime timeout      unlimited    unlimited    us
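With locked memory unlimited, you can also double-check that corosync actually
mlock()ed its pages, e.g. reusing the PID from your output above:

  grep VmLck /proc/101350/status

A VmLck value close to the corosync resident set means its memory is pinned and
cannot be swapped out.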
>
> Ah... That's the first thing I change.
> > In SLES, that defaults to 10s, and so far I have never seen an
> > environment that is stable enough for the default 1s timeout.
>
>
> Old versions have a 10s default.
> You are not going to fix the problem this way; a 1s timeout for a bonded
> network and overkill hardware is an enormous amount of time.
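For completeness, should you still decide to experiment with a larger token
timeout, it is a single setting in the totem section of corosync.conf; e.g. the
10s value mentioned above (only an illustration, not a recommendation):

  totem {
      # ... existing settings ...
      token: 10000
  }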
>
> hostnamectl | grep Kernel
> Kernel: Linux 3.10.0-1160.6.1.el7.x86_64
> [root at ltaoperdbs03 ~]# cat /etc/os-release
> NAME="CentOS Linux"
> VERSION="7 (Core)"
>
> > Indeed. But it's a trade-off between swapping process memory or freeing
> > memory by removing data from the cache. For database servers, it is advised
> > to use a lower value for swappiness anyway, around 5-10, as a swapped
> > process means longer queries, data sitting longer in caches, piling-up
> > sessions, etc.
>
>
> Totally agree, for a DB server swappiness has to be 5-10.
>
> > kernel?
> > What are your settings for vm.dirty_* ?
>
>
>
> hostnamectl | grep Kernel
> Kernel: Linux 3.10.0-1160.6.1.el7.x86_64
> [root at ltaoperdbs03 ~]# cat /etc/os-release
> NAME="CentOS Linux"
> VERSION="7 (Core)"
>
>
> sysctl -a | grep dirty
> vm.dirty_background_bytes = 0
> vm.dirty_background_ratio = 10
Considering your 256GB of physical memory, this means you can dirty up to ~25GB
of pages in cache before the kernel starts writing them to storage.
You might want to trigger these lighter background syncs well before hitting
this limit.
> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 20
This is 20% of your 256GB of physical memory. Above this limit, writes have to
go directly to disk. Considering the time it takes to write to SSD compared to
memory, and the amount of dirty data to sync in the background as well (up to
~52GB), this could be very painful (see the illustration below).
> vm.dirty_writeback_centisecs = 500
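To give an idea, on large-memory database servers the dirty thresholds are
often capped in absolute bytes rather than ratios. The numbers below are purely
illustrative, not a recommendation for your workload (note that setting the
*_bytes variants resets the corresponding *_ratio to 0):

  # start background writeback after ~256MB of dirty pages
  sysctl -w vm.dirty_background_bytes=268435456
  # block writers once ~1GB of dirty pages is reached
  sysctl -w vm.dirty_bytes=1073741824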
>
>
> > Do you have a proof that swap was the problem?
>
>
> Not at all, but after switching swappiness to 10, the cluster hasn't suddenly
> swapped anymore for a month.
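For the archives, making that setting persistent usually looks like this (the
file name is just a convention):

  echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf
  sysctl --system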