[ClusterLabs] Antw: [EXT] unexpected fenced node and promotion of the new master PAF ‑ postgres
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Mon Oct 11 05:57:37 EDT 2021
Hi,
I kept your full answer quoted in the history, to keep the list informed.
My answers are inline, down below.
On Mon, 11 Oct 2021 11:33:12 +0200
damiano giuliani <damianogiuliani87 at gmail.com> wrote:
> hey guys, sorry for being late, I was busy during the weekend
>
> here I am:
>
>
> > Did you see the swap activity (in/out, not just swap occupation) happen at
> > the same time the member was lost on the corosync side?
> > Did you check whether corosync or some of its libs were indeed in swap?
> >
> >
> No, and I don't know how to do it. I just noticed the swap occupation, which
> prompted me (and my colleague) to find out whether it could cause some trouble.
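For the record, here is a quick way to check it (just a sketch, adapt the
process name/PID to your setup):

  # swap in/out activity over time: watch the "si"/"so" columns
  vmstat 1

  # how much of the corosync process currently sits in swap
  grep VmSwap /proc/$(pidof corosync)/status

If si/so stay at 0 and VmSwap stays at 0 kB around the time of the membership
loss, swap was probably not the direct cause.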
>
> > First, corosync now sits on a lot of memory because of knet. Did you try to
> > switch back to udpu, which uses way less memory?
>
>
> No, I haven't moved to udpu; I can't stop the processes at all.
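Should you ever get a maintenance window, switching the transport back to udpu
is only a corosync.conf change; something like this in the totem section (only
a sketch, double-check against your current configuration):

  totem {
      # ... keep your existing settings ...
      transport: udpu
  }

As far as I know it requires restarting corosync on all nodes, so it is not a
live change.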
>
> "Could not lock memory of service to avoid page faults"
>
>
> grep -rn 'Could not lock memory of service to avoid page faults' /var/log/*
> returns nothing
This message should appear on corosync startup. Make sure the logs haven't been
rotated to a blackhole in the meantime...
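If your logs also go through journald, it may be worth looking there too, in
case the file logs were rotated away; for example:

  journalctl -u corosync | grep -i 'could not lock memory'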
> > On my side, mlock is unlimited in the ulimit settings. Check the values
> > in /proc/$(coro PID)/limits (be careful with the ulimit command, check the
> > proc itself).
>
>
> cat /proc/101350/limits
> Limit                     Soft Limit   Hard Limit   Units
> Max cpu time              unlimited    unlimited    seconds
> Max file size             unlimited    unlimited    bytes
> Max data size             unlimited    unlimited    bytes
> Max stack size            8388608      unlimited    bytes
> Max core file size        0            unlimited    bytes
> Max resident set          unlimited    unlimited    bytes
> Max processes             770868       770868       processes
> Max open files            1024         4096         files
> Max locked memory         unlimited    unlimited    bytes
> Max address space         unlimited    unlimited    bytes
> Max file locks            unlimited    unlimited    locks
> Max pending signals       770868       770868       signals
> Max msgqueue size         819200       819200       bytes
> Max nice priority         0            0
> Max realtime priority     0            0
> Max realtime timeout      unlimited    unlimited    us
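With locked memory unlimited, you can also double-check that corosync actually
mlock()ed its pages, e.g. reusing the PID from your output above:

  grep VmLck /proc/101350/status

A VmLck value close to the corosync resident set means its memory is pinned and
cannot be swapped out.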
>
> Ah... That's the first thing I change.
> > In SLES, that defaults to 10s, and so far I have never seen an
> > environment that is stable enough for the default 1s timeout.
>
>
> Old versions have a 10s default.
> You are not going to fix the problem this way; a 1s timeout for a bonded
> network and overkill hardware is an enormous amount of time.
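For completeness, should you still decide to experiment with a larger token
timeout, it is a single setting in the totem section of corosync.conf; e.g. the
10s value mentioned above (only an illustration, not a recommendation):

  totem {
      # ... existing settings ...
      token: 10000
  }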
>
> hostnamectl | grep Kernel
> Kernel: Linux 3.10.0-1160.6.1.el7.x86_64
> [root at ltaoperdbs03 ~]# cat /etc/os-release
> NAME="CentOS Linux"
> VERSION="7 (Core)"
>
> > Indeed. But it's a trade-off between swapping process memory or freeing
> > memory by removing data from the cache. For database servers, it is advised
> > to use a lower value for swappiness anyway, around 5-10, as a swapped
> > process means longer queries, data sitting longer in caches, piling-up
> > sessions, etc.
>
>
> Totally agree, for a DB server swappiness has to be 5-10.
>
> > kernel?
> > What are your settings for vm.dirty_* ?
>
>
>
> hostnamectl | grep Kernel
> Kernel: Linux 3.10.0-1160.6.1.el7.x86_64
> [root at ltaoperdbs03 ~]# cat /etc/os-release
> NAME="CentOS Linux"
> VERSION="7 (Core)"
>
>
> sysctl -a | grep dirty
> vm.dirty_background_bytes = 0
> vm.dirty_background_ratio = 10
Considering your 256GB of physical memory, this means you can dirty up to ~25GB
of pages in cache before the kernel starts writing them to storage.
You might want to trigger these lighter background syncs well before hitting
this limit.
> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 20
This is 20% of your 256GB of physical memory. Above this limit, writes have to
go directly to disk. Considering the time it takes to write to SSD compared to
memory, and the amount of dirty data to sync in the background as well (up to
~52GB), this could be very painful (see the illustration below).
> vm.dirty_writeback_centisecs = 500
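To give an idea, on large-memory database servers the dirty thresholds are
often capped in absolute bytes rather than ratios. The numbers below are purely
illustrative, not a recommendation for your workload (note that setting the
*_bytes variants resets the corresponding *_ratio to 0):

  # start background writeback after ~256MB of dirty pages
  sysctl -w vm.dirty_background_bytes=268435456
  # block writers once ~1GB of dirty pages is reached
  sysctl -w vm.dirty_bytes=1073741824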
>
>
> > Do you have a proof that swap was the problem?
>
>
> Not at all, but after switching swappiness to 10, the cluster hasn't suddenly
> swapped anymore for a month.
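For the archives, making that setting persistent usually looks like this (the
file name is just a convention):

  echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf
  sysctl --system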