[ClusterLabs] Antw: Re: Antw: [EXT] Non recoverable state of cluster after exit of one node due to killing of processes by oom killer

Ken Gaillot kgaillot at redhat.com
Thu Feb 25 12:21:25 EST 2021


On Thu, 2021-02-25 at 06:34 +0000, shivraj dongawe wrote:
> 
> @Ken Gaillot, Thanks for sharing your inputs on the possible behavior
> of the cluster. 
> We have reconfirmed that dlm on the healthy node was waiting for
> fencing of the faulty node, and shared storage access on the healthy
> node was blocked during this process. 
> Kindly let me know whether this is the expected behavior or the result
> of some misconfiguration. 

Your configuration looks perfect to me, except for one thing: I believe
lvmlockd should be *after* dlm_controld in the group. I don't know if
that's causing the problem, but it's worth trying.

It is expected that DLM will wait for fencing, but it should be happy
after fencing completes, so something is not right.
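
For reference, the layering usually documented for a gfs2 stack creates dlm
first and lvmlockd second in the same group, and then clones the group. A
minimal sketch with pcs follows; the names "locking", "dlm" and "lvmlockd"
are placeholders, not taken from your cluster:

    # dlm must be up before lvmlockd, so it goes first in the group
    pcs resource create dlm --group locking ocf:pacemaker:controld \
        op monitor interval=30s on-fail=fence
    pcs resource create lvmlockd --group locking ocf:heartbeat:lvmlockd \
        op monitor interval=30s on-fail=fence
    # clone the group so the locking stack runs on every node
    pcs resource clone locking interleave=true

With that ordering, dlm_controld is started before lvmlockd and stopped
after it on each node.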

> As asked, I am sharing the configuration information as an attachment
> to this mail. 
> 
> 
> On Fri, Feb 19, 2021 at 11:28 PM Ken Gaillot <kgaillot at redhat.com>
> wrote:
> > On Fri, 2021-02-19 at 07:48 +0530, shivraj dongawe wrote:
> > > Any update on this?
> > > Is there any issue in the configuration that we are using ?
> > > 
> > > On Mon, Feb 15, 2021, 14:40 shivraj dongawe <shivraj198 at gmail.com>
> > > wrote:
> > > > Kindly read "fencing is done using fence_scsi" from the previous
> > > > message as "fencing is configured". 
> > > > 
> > > > As per the error messages, we have analyzed that node2 initiated
> > > > fencing of node1, as many of node1's cluster-related processes had
> > > > been killed by the oom killer and node1 was marked as down. 
> > > > Now many resources on node2 waited for fencing of node1, as seen
> > > > from the following messages in the syslog of node2: 
> > > > dlm_controld[1616]: 91659 lvm_postgres_db_vg wait for fencing
> > > > dlm_controld[1616]: 91659 lvm_global wait for fencing
> > > > 
> > > > These messages appeared while the postgresql-12 service was being
> > > > started on node2. 
> > > > As the postgresql service depends on these services (dlm, lvmlockd
> > > > and gfs2), it did not start in time on node2. 
> > > > And node2 fenced itself after declaring that the services could not
> > > > be started on it. 
> > > > 
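
When dlm_controld logs "wait for fencing" like this, the lockspace state can
be checked directly on the surviving node. A minimal sketch, assuming the
dlm_tool utility from the dlm userspace tools is available:

    # list lockspaces and their membership/fencing state
    dlm_tool ls
    # daemon status, including quorum and fencing information
    dlm_tool status
    # dump the daemon debug buffer, which records the fencing requests made
    dlm_tool dump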
> > > > On Mon, Feb 15, 2021 at 9:00 AM Ulrich Windl <
> > > > Ulrich.Windl at rz.uni-regensburg.de> wrote:
> > > > > >>> shivraj dongawe <shivraj198 at gmail.com> wrote on
> > > > > 15.02.2021 at 08:27 in message
> > > > > <CALpaHO_6LsYM=t76CifsRkFeLYDKQc+hY3kz7PRKp7b4se=-Aw at mail.gmail.com>:
> > > > > > Fencing is done using fence_scsi.
> > > > > > Config details are as follows:
> > > > > >  Resource: scsi (class=stonith type=fence_scsi)
> > > > > >   Attributes: devices=/dev/mapper/mpatha pcmk_host_list="node1 node2"
> > > > > >               pcmk_monitor_action=metadata pcmk_reboot_action=off
> > > > > >   Meta Attrs: provides=unfencing
> > > > > >   Operations: monitor interval=60s (scsi-monitor-interval-60s)
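
For comparison, a fence_scsi device with those attributes would typically be
created with something like the following (a sketch only, reusing the ids and
values shown above):

    pcs stonith create scsi fence_scsi \
        devices=/dev/mapper/mpatha \
        pcmk_host_list="node1 node2" \
        pcmk_monitor_action=metadata \
        pcmk_reboot_action=off \
        op monitor interval=60s \
        meta provides=unfencing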
> > > > > > 
> > > > > >> On Mon, Feb 15, 2021 at 7:17 AM Ulrich Windl <
> > > > > >> Ulrich.Windl at rz.uni-regensburg.de> wrote:
> > > > > > 
> > > > > >> >>> shivraj dongawe <shivraj198 at gmail.com> wrote on
> > > > > >> 14.02.2021 at 12:03 in message
> > > > > >> <CALpaHO--3ERfwST70mBL-Wm9g6yH3YtD-wDA1r_CKnbrsxu4Sg at mail.gmail.com>:
> > > > > >> > We are running a two-node cluster on Ubuntu 20.04 LTS.
> > > > > >> > Cluster-related package version details are as follows:
> > > > > >> > pacemaker/focal-updates,focal-security 2.0.3-3ubuntu4.1 amd64
> > > > > >> > pacemaker/focal 2.0.3-3ubuntu3 amd64
> > > > > >> > corosync/focal 3.0.3-2ubuntu2 amd64
> > > > > >> > pcs/focal 0.10.4-3 all
> > > > > >> > fence-agents/focal 4.5.2-1 amd64
> > > > > >> > gfs2-utils/focal 3.2.0-3 amd64
> > > > > >> > dlm-controld/focal 4.0.9-1build1 amd64
> > > > > >> > lvm2-lockd/focal 2.03.07-1ubuntu1 amd64
> > > > > >> >
> > > > > >> > Cluster configuration details:
> > > > > >> > 1. The cluster has shared storage mounted through a gfs2
> > > > > >> > filesystem with the help of dlm and lvmlockd.
> > > > > >> > 2. Corosync is configured to use knet for transport.
> > > > > >> > 3. Fencing is configured using fence_scsi on the shared storage
> > > > > >> > which is being used for the gfs2 filesystem.
> > > > > >> > 4. The two main resources configured are the cluster/virtual IP
> > > > > >> > and postgresql-12; postgresql-12 is configured as a systemd
> > > > > >> > resource.
> > > > > >> > We had done failover testing of the cluster (rebooting/shutting
> > > > > >> > down a node, link failure) and had observed that resources were
> > > > > >> > being migrated properly to the active node.
> > > > > >> >
> > > > > >> > Recently we came across an issue which has occurred repeatedly
> > > > > >> > in a span of two days.
> > > > > >> > Details are below:
> > > > > >> > 1. The out-of-memory killer is getting invoked on the active
> > > > > >> > node and it starts killing processes.
> > > > > >> > Sample is as follows:
> > > > > >> > postgres invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
> > > > > >> > 2. In one instance it started by killing pacemaker and in
> > > > > >> > another by killing postgresql. It does not stop with the killing
> > > > > >> > of a single process; it goes on killing others as well (more
> > > > > >> > concerning is the killing of cluster-related processes). We have
> > > > > >> > observed that the swap space on that node is 2 GB against 96 GB
> > > > > >> > of RAM and are in the process of increasing the swap space to
> > > > > >> > see if this resolves the issue. Postgres is configured with a
> > > > > >> > shared_buffers value of 32 GB (which is way less than 96 GB).
> > > > > >> > We are not yet sure which process is suddenly eating up that
> > > > > >> > much memory.
> > > > > >> > 3. As a result of the processes being killed on node1, node2
> > > > > >> > tries to fence node1, thereby initiating the stopping of cluster
> > > > > >> > resources on node1.
> > > > > >>
> > > > > >> How is fencing being done?
> > > > > >>
> > > > > >> > 4. At this point we reach a stage where it is assumed that
> > > > > >> > node1 is down, and the application resources, cluster IP and
> > > > > >> > postgresql are being started on node2.
> > > > > 
> > > > > This is why I was asking: Is your fencing successful ("assumed
> > > > > that node1 is down"), or isn't it?
> > > > > 
> > > > > >> > 5. Postgresql on node2 fails to start within 60 sec (the start
> > > > > >> > operation timeout) and is declared failed. During the start
> > > > > >> > operation of postgres, we have found many messages related to
> > > > > >> > failure of fencing, and other resources such as dlm and the vg
> > > > > >> > waiting for fencing to complete.
> > 
> > It does seem that DLM is where the problem occurs.
> > 
> > Note that fencing is scheduled in two separate ways when node1 is
> > lost: once by DLM and once by the cluster itself.
> > 
> > The fencing scheduled by the cluster completes successfully:
> > 
> > Feb 13 11:07:56 DB-2 pacemaker-controld[2451]:  notice: Peer node1 was
> > terminated (reboot) by node2 on behalf of pacemaker-controld.2451: OK
> > 
> > but DLM just attempts fencing over and over, eventually causing
> > resource timeouts. Those timeouts cause the cluster to schedule
> > resource recovery (stop+start), but the stops time out for the same
> > reason, and it is those stop timeouts that cause node2 to be
> > fenced.
> > 
> > I'm not familiar enough with DLM to know what might keep it from being
> > able to contact Pacemaker for fencing.
> > 
> > Can you attach your configuration as well (with any sensitive info
> > removed)? I assume you've created an ocf:pacemaker:controld clone, and
> > that the other resources are layered on top of that with colocation and
> > ordering constraints.
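
As a rough sketch of that layering (ids like locking-clone, shared_fs-clone
and postgresql here are placeholders, not the actual resource names), the
constraints would look something like:

    # gfs2 filesystem starts after, and runs alongside, the dlm/lvmlockd clone
    pcs constraint order start locking-clone then shared_fs-clone
    pcs constraint colocation add shared_fs-clone with locking-clone
    # the database in turn depends on the filesystem
    pcs constraint order start shared_fs-clone then postgresql
    pcs constraint colocation add postgresql with shared_fs-clone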
> > 
> > > > > >> > Details of the syslog messages of node2 during this event are
> > > > > >> > attached in a file.
> > > > > >> > 6. After this point we are in a state where node1 and node2 both
> > > > > >> > go into a fenced state and the resources are unrecoverable (all
> > > > > >> > resources on both nodes).
> > > > > >> >
> > > > > >> > Now my question is: the out-of-memory issue on node1 can be
> > > > > >> > taken care of by increasing swap, finding the process
> > > > > >> > responsible for such huge memory usage and taking the necessary
> > > > > >> > actions to minimize that memory usage, but the other issue that
> > > > > >> > remains unclear is why the cluster did not shift to node2
> > > > > >> > cleanly and became unrecoverable.
> > > > > >>
-- 
Ken Gaillot <kgaillot at redhat.com>



More information about the Users mailing list