[ClusterLabs] Antw: [EXT] Non recoverable state of cluster after exit of one node due to killing of processes by oom killer

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Mon Feb 15 02:17:25 EST 2021


>>> shivraj dongawe <shivraj198 at gmail.com> wrote on 14.02.2021 at 12:03 in
message
<CALpaHO--3ERfwST70mBL-Wm9g6yH3YtD-wDA1r_CKnbrsxu4Sg at mail.gmail.com>:
> We are running a two-node cluster on Ubuntu 20.04 LTS. Cluster-related
> package versions are as follows:
> pacemaker/focal-updates,focal-security 2.0.3-3ubuntu4.1 amd64
> pacemaker/focal 2.0.3-3ubuntu3 amd64
> corosync/focal 3.0.3-2ubuntu2 amd64
> pcs/focal 0.10.4-3 all
> fence-agents/focal 4.5.2-1 amd64
> gfs2-utils/focal 3.2.0-3 amd64
> dlm-controld/focal 4.0.9-1build1 amd64
> lvm2-lockd/focal 2.03.07-1ubuntu1 amd64
> 
> Cluster configuration details:
> 1. The cluster has shared storage mounted as a gfs2 filesystem, managed
> with the help of dlm and lvmlockd.
> 2. Corosync is configured to use knet for transport.
> 3. Fencing is configured using fence_scsi on the same shared storage that
> backs the gfs2 filesystem.
> 4. The two main resources are the cluster/virtual IP and postgresql-12;
> postgresql-12 is configured as a systemd resource.
> We had done failover testing of the cluster (rebooting/shutting down a
> node, link failure) and had observed that resources were migrated properly
> to the active node.
> 
> Recently we came across an issue that has occurred repeatedly within a span
> of two days. Details are below:
> 1. The out-of-memory killer gets invoked on the active node and starts
> killing processes. A sample log line:
> postgres invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE),
> order=0, oom_score_adj=0
> 2. In one instance it started by killing pacemaker, in another by killing
> postgresql. It does not stop at a single process; it goes on killing others
> as well (more concerning is the killing of cluster-related processes). We
> have observed that swap space on that node is 2 GB against 96 GB of RAM,
> and we are in the process of increasing the swap space to see whether this
> resolves the issue. Postgres is configured with a shared_buffers value of
> 32 GB (which is well below 96 GB).
> We are not yet sure which process suddenly eats up that much memory.
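
As an aside: the kernel's OOM report normally includes a task dump right
after the "invoked oom-killer" line, listing each process with its RSS and
oom_score_adj, which is usually enough to identify the culprit. Independent
of that, it can help to make the OOM killer less likely to pick the cluster
stack. A minimal sketch, assuming the standard corosync/pacemaker unit names
on Ubuntu 20.04 (takes effect on the next start of the services):

  # /etc/systemd/system/corosync.service.d/override.conf
  # (and the same for pacemaker.service)
  [Service]
  OOMScoreAdjust=-1000

  systemctl daemon-reload

  # quick look at the biggest memory consumers
  ps -eo pid,rss,comm --sort=-rss | head
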
> 3. As a result of the killed processes on node1, node2 tries to fence node1
> and thereby initiates stopping of the cluster resources on node1.

How is fencing being done?
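That is, what does the actual stonith configuration look like, e.g. the
output of:

  pcs stonith config        # or "pcs stonith show --full" on older pcs

Note that fence_scsi only revokes the victim's SCSI persistent-reservation
key, i.e. it cuts the node off from the shared storage; it does not reboot or
power off the node. The fenced node keeps running (including its IP and
PostgreSQL) unless something like a watchdog reboots it, e.g. the
fence_scsi_check watchdog script that some distributions ship with
fence-agents, or SBD.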

> 4. At this point node1 is assumed to be down, and the application
> resources, the cluster IP and postgresql, are started on node2.
> 5. Postgresql on node2 fails to start within 60 seconds (the start
> operation timeout) and is declared failed. During the start operation of
> postgres we found many messages about fencing failures and about other
> resources, such as dlm and the VG, waiting for fencing to complete.
> The syslog messages from node2 during this event are attached in a file.
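
If the PostgreSQL data directory lives on the gfs2 filesystem, then dlm (and
therefore the filesystem and the VG) stays blocked until the pending fencing
operation completes, and a 60-second start timeout for PostgreSQL is easily
exceeded; a longer timeout only helps if fencing eventually succeeds. A
sketch, using the hypothetical resource names from above:

  # give the PostgreSQL start more time
  pcs resource update pgsql op start timeout=180s

  # make PostgreSQL depend on the shared filesystem being available
  pcs constraint order start shared-fs-clone then pgsql
  pcs constraint colocation add pgsql with shared-fs-clone
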
> 6. After this point both node1 and node2 end up in a fenced state and the
> resources are unrecoverable (all resources on both nodes).
> 
> Now, my question: the out-of-memory issue on node1 can be taken care of by
> increasing swap, finding the process responsible for such huge memory
> usage, and taking the necessary steps to minimize that usage; but what
> remains unclear is why the cluster did not shift to node2 cleanly and
> instead became unrecoverable.
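
Not necessarily the cause here, but two things are worth checking for the
"both nodes end up fenced and nothing recovers" outcome: with dlm/gfs2 the
cluster must freeze rather than stop resources on loss of quorum, and in a
two-node cluster both nodes can race to fence each other after a split. A
sketch, again using the hypothetical stonith device name from above:

  # required for dlm/gfs2 setups
  pcs property set no-quorum-policy=freeze

  # random delay to reduce fence races between the two nodes
  pcs stonith update scsi-fence pcmk_delay_max=10s

Also note that corosync's two_node mode implies wait_for_all (unless that is
explicitly disabled), so a node that boots while its peer is unreachable will
wait instead of starting resources on its own.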
