We are running a two-node cluster on Ubuntu 20.04 LTS. Cluster-related package versions are as follows:

pacemaker/focal-updates,focal-security 2.0.3-3ubuntu4.1 amd64
pacemaker/focal 2.0.3-3ubuntu3 amd64
corosync/focal 3.0.3-2ubuntu2 amd64
pcs/focal 0.10.4-3 all
fence-agents/focal 4.5.2-1 amd64
gfs2-utils/focal 3.2.0-3 amd64
dlm-controld/focal 4.0.9-1build1 amd64
lvm2-lockd/focal 2.03.07-1ubuntu1 amd64

Cluster configuration details:
1. The cluster has shared storage mounted as a GFS2 filesystem, managed with the help of dlm and lvmlockd.
2. Corosync is configured to use knet for transport.
3. Fencing is configured with fence_scsi on the same shared storage that backs the GFS2 filesystem.
4. The two main resources are the cluster/virtual IP and postgresql-12; postgresql-12 is configured as a systemd resource.
(Rough sketches of how these pieces are set up are appended at the end of this mail.)

We had done failover testing of the cluster (rebooting/shutting down a node, link failure) and had observed that resources migrated properly to the surviving node.

Recently we came across an issue that has occurred repeatedly within a span of two days. Details are below:

1. The out-of-memory killer is invoked on the active node and starts killing processes. A sample message:
postgres invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
2. On one occasion it started by killing pacemaker, on another by killing postgresql. It does not stop after killing a single process; it goes on killing others as well (the killing of cluster-related processes being the most concerning). Swap space on that node is 2 GB against 96 GB of RAM, and we are in the process of increasing swap space to see whether this resolves the issue. Postgres is configured with shared_buffers of 32 GB (well below the 96 GB of RAM). We are not yet sure which process is suddenly eating up that much memory.
3. As a result of the killed processes on node1, node2 tries to fence node1 and thereby initiates stopping of the cluster resources on node1.
4. At this point node1 is assumed to be down, and the application resources, cluster IP and postgresql, are started on node2.
5. Postgresql on node2 fails to start within 60 seconds (the start operation timeout) and is declared failed. During the postgres start operation we found many messages about fencing failures, and about other resources such as dlm and the VG waiting for fencing to complete. Details of the syslog messages on node2 during this event are attached in a file.
6. After this point both node1 and node2 end up in a fenced state and all resources on both nodes are unrecoverable.

Now my question: the out-of-memory issue on node1 can be taken care of by increasing swap, finding the process responsible for the huge memory usage and taking the necessary actions to minimize it, but the issue that remains unclear is why the cluster did not shift cleanly to node2 and instead became unrecoverable.
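For reference, the cluster creation, fencing and the GFS2 stack were set up roughly along the following lines. Node names, device paths and volume/group names below are placeholders rather than our exact values:

    # cluster created with knet transport (pcs 0.10)
    pcs cluster setup hacluster node1 node2 transport knet

    # fence_scsi on the shared LUN that also backs the GFS2 filesystem
    pcs stonith create scsi-fence fence_scsi \
        pcmk_host_list="node1 node2" \
        devices="/dev/mapper/shared_lun" \
        meta provides=unfencing

    # dlm and lvmlockd cloned on both nodes
    pcs resource create dlm ocf:pacemaker:controld \
        op monitor interval=30s --group locking
    pcs resource create lvmlockd ocf:heartbeat:lvmlockd \
        op monitor interval=30s --group locking
    pcs resource clone locking interleave=true

    # shared VG activated via lvmlockd, then the GFS2 filesystem on top
    pcs resource create shared-vg ocf:heartbeat:LVM-activate \
        vgname=shared_vg vg_access_mode=lvmlockd activation_mode=shared \
        --group gfs2-stack
    pcs resource create gfs2-fs ocf:heartbeat:Filesystem \
        device="/dev/shared_vg/shared_lv" directory="/sharedstorage" fstype=gfs2 \
        --group gfs2-stack
    pcs resource clone gfs2-stack interleave=true
    pcs constraint order start locking-clone then gfs2-stack-clone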
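The two application resources mentioned above (cluster/virtual IP and postgresql-12) look roughly like this; the IP address and the exact systemd unit name are placeholders. The 60-second start timeout is the one referenced in point 5:

    # cluster/virtual IP
    pcs resource create cluster-ip ocf:heartbeat:IPaddr2 \
        ip=192.0.2.10 cidr_netmask=24 \
        op monitor interval=30s

    # postgresql-12 managed as a systemd resource (unit name approximate)
    pcs resource create postgresql-12 systemd:postgresql@12-main \
        op start timeout=60s op stop timeout=60s op monitor interval=30s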
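On the memory side, the immediate steps we are taking are roughly the following (the swap size and file path are examples only, not final values):

    # add additional swap
    dd if=/dev/zero of=/swapfile2 bs=1M count=16384
    chmod 600 /swapfile2
    mkswap /swapfile2
    swapon /swapfile2

    # snapshot the top memory consumers periodically until the culprit shows up
    ps axo pid,comm,rss,vsz --sort=-rss | head -n 20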