[ClusterLabs] Antw: Re: Antw: Re: Antw: [EXT] Non recoverable state of cluster after exit of one node due to killing of processes by oom killer

Thu Feb 25 05:21:50 EST 2021

>>> shivraj dongawe <shivraj198 at gmail.com> wrote on 25.02.2021 at 07:34 in
message
<CALpaHO9w3SCndDoqw1-YFMJhFMy8mZ-i6CYu0cC3BX6x26cs3g at mail.gmail.com>:
> @Ken Gaillot, Thanks for sharing your inputs on the possible behavior of
> the cluster.
> We have reconfirmed that dlm on a healthy node was waiting for fencing of
> faulty node and shared storage access on the healthy node was blocked
> during this process.
> Kindly let me know whether this is the natural behavior or is it a result
> of some misconfiguration.
> As asked, I am sharing configuration information as an attachment to this
> mail.

Hi!

I think this is the way it's intended to be: if a node is "unclean" (faulty), DLM waits for confirmation that the unclean node has become clean (i.e. it has been fenced and is known to be off). Then a new cluster configuration (quorum) is formed, and possible recovery actions (like releasing the locks the fenced node held) take place. I see the same with OCFS2: I/O may hang while the cluster is waiting for a node to be fenced.
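
To see which lockspaces are blocked while DLM waits, one option is to scan syslog for the "wait for fencing" messages quoted further down in this thread. The following is only a rough sketch: the regular expression is guessed from the two dlm_controld lines shown below, the function name is made up for illustration, and the third sample line is hypothetical (added to show de-duplication).

```python
import re

# Pattern assumed from the two log lines quoted in this thread, e.g.
#   dlm_controld[1616]: 91659 lvm_global wait for fencing
# Other dlm_controld versions may format these messages differently.
WAIT_RE = re.compile(r"dlm_controld\[\d+\]:\s+\d+\s+(\S+)\s+wait for fencing")


def lockspaces_waiting_for_fencing(lines):
    """Return lockspace names reported as waiting for fencing, in first-seen order."""
    seen = []
    for line in lines:
        match = WAIT_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


sample = [
    "dlm_controld[1616]: 91659 lvm_postgres_db_vg wait for fencing",
    "dlm_controld[1616]: 91659 lvm_global wait for fencing",
    # Hypothetical repeat, not taken from the original logs.
    "dlm_controld[1616]: 91660 lvm_global wait for fencing",
]

print(lockspaces_waiting_for_fencing(sample))
```

If the same lockspaces keep showing up long after Pacemaker has logged a successful fence of the peer, that would fit the picture of DLM not learning about the fencing result.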

Regards,
Ulrich

> 
> 
> On Fri, Feb 19, 2021 at 11:28 PM Ken Gaillot <kgaillot at redhat.com> wrote:
> 
>> On Fri, 2021-02-19 at 07:48 +0530, shivraj dongawe wrote:
>> > Any update on this .
>> > Is there any issue in the configuration that we are using ?
>> >
>> > On Mon, Feb 15, 2021, 14:40 shivraj dongawe <shivraj198 at gmail.com>
>> > wrote:
>> > > Kindly read "fencing is done using fence_scsi" from the previous
>> > > message as "fencing is configured".
>> > >
>> > > As per the error messages we have analyzed node2 initiated fencing
>> > > of node1 as many processes of node1 related to cluster have been
>> > > killed by oom killer and node1 marked as down.
>> > > Now many resources of node2 have waited for fencing of node1, as
>> > > seen from following messages of syslog of node2:
>> > > dlm_controld[1616]: 91659 lvm_postgres_db_vg wait for fencing
>> > > dlm_controld[1616]: 91659 lvm_global wait for fencing
>> > >
>> > > These were messages when postgresql-12 service was being started on
>> > > node2.
>> > > As postgresql service is dependent on these services(dlm,lvmlockd
>> > > and gfs2), it has not started in time on node2.
>> > > And node2 fenced itself after declaring that services can not be
>> > > started on it.
>> > >
>> > > On Mon, Feb 15, 2021 at 9:00 AM Ulrich Windl <
>> > > Ulrich.Windl at rz.uni-regensburg.de> wrote:
>> > > > >>> shivraj dongawe <shivraj198 at gmail.com> wrote on 15.02.2021
>> > > > at 08:27 in
>> > > > message
>> > > > <
>> > > > CALpaHO_6LsYM=t76CifsRkFeLYDKQc+hY3kz7PRKp7b4se=-Aw at mail.gmail.com 
>> > > > >:
>> > > > > Fencing is done using fence_scsi.
>> > > > > Config details are as follows:
>> > > > >  Resource: scsi (class=stonith type=fence_scsi)
>> > > > >   Attributes: devices=/dev/mapper/mpatha pcmk_host_list="node1
>> > > > node2"
>> > > > > pcmk_monitor_action=metadata pcmk_reboot_action=off
>> > > > >   Meta Attrs: provides=unfencing
>> > > > >   Operations: monitor interval=60s (scsi-monitor-interval-60s)
>> > > > >
>> > > > > On Mon, Feb 15, 2021 at 7:17 AM Ulrich Windl <
>> > > > > Ulrich.Windl at rz.uni-regensburg.de> wrote:
>> > > > >
>> > > > >> >>> shivraj dongawe <shivraj198 at gmail.com> wrote on
>> > > > 14.02.2021 at 12:03
>> > > > >> in
>> > > > >> message
>> > > > >> <
>> > > > CALpaHO--3ERfwST70mBL-Wm9g6yH3YtD-wDA1r_CKnbrsxu4Sg at mail.gmail.com 
>> > > > >:
>> > > > >> > We are running a two node cluster on Ubuntu 20.04 LTS.
>> > > > Cluster related
>> > > > >> > package version details are as
>> > > > >> > follows: pacemaker/focal-updates,focal-security 2.0.3-
>> > > > 3ubuntu4.1 amd64
>> > > > >> > pacemaker/focal 2.0.3-3ubuntu3 amd64
>> > > > >> > corosync/focal 3.0.3-2ubuntu2 amd64
>> > > > >> > pcs/focal 0.10.4-3 all
>> > > > >> > fence-agents/focal 4.5.2-1 amd64
>> > > > >> > gfs2-utils/focal 3.2.0-3 amd64
>> > > > >> > dlm-controld/focal 4.0.9-1build1 amd64
>> > > > >> > lvm2-lockd/focal 2.03.07-1ubuntu1 amd64
>> > > > >> >
>> > > > >> > Cluster configuration details:
>> > > > >> > 1. Cluster is having a shared storage mounted through gfs2
>> > > > filesystem
>> > > > >> with
>> > > > >> > the help of dlm and lvmlockd.
>> > > > >> > 2. Corosync is configured to use knet for transport.
>> > > > >> > 3. Fencing is configured using fence_scsi on the shared
>> > > > storage which is
>> > > > >> > being used for gfs2 filesystem
>> > > > >> > 4. Two main resources configured are cluster/virtual ip and
>> > > > >> postgresql-12,
>> > > > >> > postgresql-12 is configured as a systemd resource.
>> > > > >> > We had done failover testing(rebooting/shutting down of a
>> > > > node, link
>> > > > >> > failure) of the cluster and had observed that resources were
>> > > > getting
>> > > > >> > migrated properly on the active node.
>> > > > >> >
>> > > > >> > Recently we came across an issue which has occurred
>> > > > repeatedly in span of
>> > > > >> > two days.
>> > > > >> > Details are below:
>> > > > >> > 1. Out of memory killer is getting invoked on active node
>> > > > and it starts
>> > > > >> > killing processes.
>> > > > >> > Sample is as follows:
>> > > > >> > postgres invoked oom-killer:
>> > > > gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE),
>> > > > >> > order=0, oom_score_adj=0
>> > > > >> > 2. At one instance it started with killing of pacemaker and
>> > > > on another
>> > > > >> with
>> > > > >> > postgresql. It does not stop with the killing of a single
>> > > > process it goes
>> > > > >> > on killing others(more concerning is killing of cluster
>> > > > related
>> > > > >> processes)
>> > > > >> > as well. We have observed that swap space on that node is 2
>> > > > GB against
>> > > > >> RAM
>> > > > >> > of 96 GB and are in the process of increasing swap space to
>> > > > see if this
>> > > > >> > resolves this issue. Postgres is configured with
>> > > > shared_buffers value of
>> > > > >> 32
>> > > > >> > GB(which is way less than 96 GB).
>> > > > >> > We are not yet sure which process is eating up that much
>> > > > memory suddenly.
>> > > > >> > 3. As a result of killing processes on node1, node2 is
>> > > > trying to fence
>> > > > >> > node1 and thereby initiating stopping of cluster resources
>> > > > on node1.
>> > > > >>
>> > > > >> How is fencing being done?
>> > > > >>
>> > > > >> > 4. At this point we go in a stage where it is assumed that
>> > > > node1 is down
>> > > > >> > and application resources, cluster IP and postgresql are
>> > > > being started on
>> > > > >> > node2.
>> > > >
>> > > > This is why I was asking: Is your fencing successful ("assumed
>> > > > that node1 is down
>> > > > "), or isn't it?
>> > > >
>> > > > >> > 5. Postgresql on node 2 fails to start in 60 sec(start
>> > > > operation timeout)
>> > > > >> > and is declared as failed. During the start operation of
>> > > > postgres, we
>> > > > >> have
>> > > > >> > found many messages related to failure of fencing and other
>> > > > resources
>> > > > >> such
>> > > > >> > as dlm and vg waiting for fencing to complete.
>>
>> It does seem that DLM is where the problem occurs.
>>
>> Note that fencing is scheduled in two separate ways, once by DLM and
>> once by the cluster itself, when node1 is lost.
>>
>> The fencing scheduled by the cluster completes successfully:
>>
>> Feb 13 11:07:56 DB-2 pacemaker-controld[2451]:  notice: Peer node1 was
>> terminated (reboot) by node2 on behalf of pacemaker-controld.2451: OK
>>
>> but DLM just attempts fencing over and over, eventually causing
>> resource timeouts. Those timeouts cause the cluster to schedule
>> resource recovery (stop+start), but the stops timeout for the same
>> reason, and it is those stop timeouts that cause node2 to be fenced.
>>
>> I'm not familiar enough with DLM to know what might keep it from being
>> able to contact Pacemaker for fencing.
>>
>> Can you attach your configuration as well (with any sensitive info
>> removed)? I assume you've created an ocf:pacemaker:controld clone, and
>> that the other resources are layered on top of that with colocation and
>> ordering constraints.
>>
>> > > > >> > Details of syslog messages of node2 during this event are
>> > > > attached in
>> > > > >> file.
>> > > > >> > 6. After this point we are in a state where node1 and node2
>> > > > both go in
>> > > > >> > fenced state and resources are unrecoverable(all resources
>> > > > on both
>> > > > >> nodes).
>> > > > >> >
>> > > > >> > Now my question is out of memory issue of node1 can be taken
>> > > > care by
>> > > > >> > increasing swap and finding out the process responsible for
>> > > > such huge
>> > > > >> > memory usage and taking necessary actions to minimize that
>> > > > memory usage,
>> > > > >> > but the other issue that remains unclear is why cluster is
>> > > > not shifted to
>> > > > >> > node2 cleanly and become unrecoverable.
>> > > > >>
>> --
>> Ken Gaillot <kgaillot at redhat.com>