[ClusterLabs] DRBD ms resource keeps getting demoted

Stuart Massey djangoschef at gmail.com
Tue Jan 19 09:14:13 EST 2021


Here is the configuration (note that we have since put node01 in standby, in
order to keep the services stable for the moment):
===
node 1: node01.example.com \
        attributes standby=on
node 2: node02.example.com \
        attributes standby=off
primitive app_ourApp lsb:ourUser \
        meta target-role=Started \
        op stop interval=0s timeout=90s
primitive daemon_httpd apache \
        params configfile="/etc/httpd/conf/httpd.conf" port=80 \
        op start interval=0s timeout=60s \
        op monitor interval=5s timeout=20s \
        op stop interval=0s timeout=60s \
        meta target-role=Started
primitive drbd_ourApp ocf:linbit:drbd \
        params drbd_resource=ourApp \
        op monitor interval=15s role=Master \
        op monitor interval=30s role=Slave
primitive fs_ourApp Filesystem \
        params device="/dev/drbd0" directory="/data" fstype=xfs \
        op stop interval=0s timeout=90s
primitive ip_ourApp IPaddr2 \
        params ip=10.6.21.100 nic=bond0 cidr_netmask=24 iflabel=1
primitive pingd ocf:pacemaker:ping \
        params host_list=10.6.21.1 multiplier=100 \
        op monitor interval=30s timeout=20s
group httpd daemon_httpd \
        meta target-role=Started
group ourApp fs_ourApp ip_ourApp app_ourApp \
        meta target-role=Started
ms ms_drbd_ourApp drbd_ourApp \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started
clone pingdclone pingd \
        meta globally-unique=false target-role=Started
location cli-ban-httpd-on-node01.example.com httpd role=Started -inf: node01.example.com
location cli-ban-ms_drbd_ourApp-on-node01.example.com ms_drbd_ourApp role=Master -inf: node01.example.com
location cli-ban-ourApp-on-node01.example.com ourApp role=Started -inf: node01.example.com
colocation httpd-with-ip inf: daemon_httpd ip_ourApp
order httpd_after_ourApp inf: ourApp:start daemon_httpd
order ourApp_after_drbd inf: ms_drbd_ourApp:promote ourApp:start
colocation ourApp_on_drbd inf: ourApp ms_drbd_ourApp:Master
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.18-11.el7_5.3-2b07d5c5a9 \
        cluster-infrastructure=corosync \
        stonith-enabled=false \
        no-quorum-policy=stop \
        cluster-name=ourAppapp \
        last-lrm-refresh=1611024747
rsc_defaults rsc-options: \
        resource-stickiness=100
===
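For reference, node01 was put in standby from the crm shell; a minimal sketch, assuming crmsh (matching the configure output above):
===
# tell the cluster not to run any resources (including the DRBD secondary) on node01
crm node standby node01.example.com
# later, once node01 is healthy again, bring it back into service
crm node online node01.example.com
===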

Here are the constraints (if any) on each resource:
===
* httpd
  : Node node01.example.com     (score=-INFINITY, id=cli-ban-httpd-on-node01.example.com)
* ourApp
  : Node node01.example.com     (score=-INFINITY, id=cli-ban-ourApp-on-node01.example.com)
    ms_drbd_ourApp                (score=INFINITY, with role=Master, id=ourApp_on_drbd)
    : Node node01.example.com   (score=-INFINITY, id=cli-ban-ms_drbd_ourApp-on-node01.example.com)
* ourApp
  : Node node01.example.com     (score=-INFINITY, id=cli-ban-ourApp-on-node01.example.com)
    ms_drbd_ourApp                (score=INFINITY, with role=Master, id=ourApp_on_drbd)
    : Node node01.example.com   (score=-INFINITY, id=cli-ban-ms_drbd_ourApp-on-node01.example.com)
* ourApp
  : Node node01.example.com     (score=-INFINITY, id=cli-ban-ourApp-on-node01.example.com)
    ms_drbd_ourApp                (score=INFINITY, with role=Master, id=ourApp_on_drbd)
    : Node node01.example.com   (score=-INFINITY, id=cli-ban-ms_drbd_ourApp-on-node01.example.com)
    ourApp                        (score=INFINITY, needs role=Master, id=ourApp_on_drbd)
    : Node node01.example.com   (score=-INFINITY, id=cli-ban-ourApp-on-node01.example.com)
* ms_drbd_ourApp
  : Node node01.example.com     (score=-INFINITY, id=cli-ban-ms_drbd_ourApp-on-node01.example.com)
* pingdclone
===
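The cli-ban-* location constraints above look like the ones created with crm_resource --ban (or pcs resource ban); a sketch of how one was created and how it could be cleared later, assuming the crm_resource that ships with Pacemaker 1.1:
===
# ban ourApp from node01; this creates the -INFINITY location constraint
# named cli-ban-ourApp-on-node01.example.com shown above
crm_resource --ban --resource ourApp --node node01.example.com
# remove that ban again once node01 is trustworthy
crm_resource --clear --resource ourApp --node node01.example.com
===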

Here are all of the LogAction messages in the 30-minute period centered on
16:55 (i.e., around this particular example):
===
Jan 18 16:52:19 [21937] node02.example.com pengine: notice: LogAction: * Stop daemon_httpd ( node02.example.com ) due to node availability
Jan 18 16:52:19 [21937] node02.example.com pengine: notice: LogAction: * Stop fs_ourApp ( node02.example.com ) due to node availability
Jan 18 16:52:19 [21937] node02.example.com pengine: notice: LogAction: * Stop ip_ourApp ( node02.example.com ) due to node availability
Jan 18 16:52:19 [21937] node02.example.com pengine: notice: LogAction: * Stop app_ourApp ( node02.example.com ) due to node availability
Jan 18 16:52:19 [21937] node02.example.com pengine: info: LogActions: Leave drbd_ourApp:0 (Slave node01.example.com)
Jan 18 16:52:19 [21937] node02.example.com pengine: notice: LogAction: * Demote drbd_ourApp:1 ( Master -> Slave node02.example.com )
Jan 18 16:52:19 [21937] node02.example.com pengine: info: LogActions: Leave pingd:0 (Started node01.example.com)
Jan 18 16:52:19 [21937] node02.example.com pengine: info: LogActions: Leave pingd:1 (Started node02.example.com)
Jan 18 16:52:25 [21937] node02.example.com pengine: notice: LogAction: * Start daemon_httpd ( node02.example.com )
Jan 18 16:52:25 [21937] node02.example.com pengine: info: LogActions: Leave fs_ourApp (Started node02.example.com)
Jan 18 16:52:25 [21937] node02.example.com pengine: info: LogActions: Leave ip_ourApp (Started node02.example.com)
Jan 18 16:52:25 [21937] node02.example.com pengine: info: LogActions: Leave app_ourApp (Started node02.example.com)
Jan 18 16:52:25 [21937] node02.example.com pengine: info: LogActions: Leave drbd_ourApp:0 (Slave node01.example.com)
Jan 18 16:52:25 [21937] node02.example.com pengine: info: LogActions: Leave drbd_ourApp:1 (Master node02.example.com)
Jan 18 16:52:25 [21937] node02.example.com pengine: info: LogActions: Leave pingd:0 (Started node01.example.com)
Jan 18 16:52:25 [21937] node02.example.com pengine: info: LogActions: Leave pingd:1 (Started node02.example.com)
Jan 18 16:52:44 [21937] node02.example.com pengine: info: LogActions: Leave daemon_httpd (Started node02.example.com)
Jan 18 16:52:44 [21937] node02.example.com pengine: info: LogActions: Leave fs_ourApp (Started node02.example.com)
Jan 18 16:52:44 [21937] node02.example.com pengine: info: LogActions: Leave ip_ourApp (Started node02.example.com)
Jan 18 16:52:44 [21937] node02.example.com pengine: info: LogActions: Leave app_ourApp (Started node02.example.com)
Jan 18 16:52:44 [21937] node02.example.com pengine: info: LogActions: Leave drbd_ourApp:0 (Slave node01.example.com)
Jan 18 16:52:44 [21937] node02.example.com pengine: info: LogActions: Leave drbd_ourApp:1 (Master node02.example.com)
Jan 18 16:52:44 [21937] node02.example.com pengine: info: LogActions: Leave pingd:0 (Started node01.example.com)
Jan 18 16:52:44 [21937] node02.example.com pengine: info: LogActions: Leave pingd:1 (Started node02.example.com)
Jan 18 16:53:37 [21937] node02.example.com pengine: notice: LogAction: * Stop daemon_httpd ( node02.example.com ) due to node availability
Jan 18 16:53:37 [21937] node02.example.com pengine: notice: LogAction: * Stop fs_ourApp ( node02.example.com ) due to node availability
Jan 18 16:53:37 [21937] node02.example.com pengine: notice: LogAction: * Stop ip_ourApp ( node02.example.com ) due to node availability
Jan 18 16:53:37 [21937] node02.example.com pengine: notice: LogAction: * Stop app_ourApp ( node02.example.com ) due to node availability
Jan 18 16:53:37 [21937] node02.example.com pengine: info: LogActions: Leave drbd_ourApp:0 (Slave node01.example.com)
Jan 18 16:53:37 [21937] node02.example.com pengine: notice: LogAction: * Demote drbd_ourApp:1 ( Master -> Slave node02.example.com )
Jan 18 16:53:37 [21937] node02.example.com pengine: info: LogActions: Leave pingd:0 (Started node01.example.com)
Jan 18 16:53:37 [21937] node02.example.com pengine: info: LogActions: Leave pingd:1 (Started node02.example.com)
Jan 18 16:53:50 [21937] node02.example.com pengine: info: LogActions: Leave daemon_httpd (Stopped)
Jan 18 16:53:50 [21937] node02.example.com pengine: info: LogActions: Leave fs_ourApp (Stopped)
Jan 18 16:53:50 [21937] node02.example.com pengine: info: LogActions: Leave ip_ourApp (Stopped)
Jan 18 16:53:50 [21937] node02.example.com pengine: info: LogActions: Leave app_ourApp (Stopped)
Jan 18 16:53:50 [21937] node02.example.com pengine: info: LogActions: Leave drbd_ourApp:0 (Slave node01.example.com)
Jan 18 16:53:50 [21937] node02.example.com pengine: notice: LogAction: * Demote drbd_ourApp:1 ( Master -> Slave node02.example.com )
Jan 18 16:53:50 [21937] node02.example.com pengine: info: LogActions: Leave pingd:0 (Started node01.example.com)
Jan 18 16:53:50 [21937] node02.example.com pengine: info: LogActions: Leave pingd:1 (Started node02.example.com)
Jan 18 16:53:51 [21937] node02.example.com pengine: notice: LogAction: * Start daemon_httpd ( node02.example.com )
Jan 18 16:53:51 [21937] node02.example.com pengine: notice: LogAction: * Start fs_ourApp ( node02.example.com )
Jan 18 16:53:51 [21937] node02.example.com pengine: notice: LogAction: * Start ip_ourApp ( node02.example.com )
Jan 18 16:53:51 [21937] node02.example.com pengine: notice: LogAction: * Start app_ourApp ( node02.example.com )
Jan 18 16:53:51 [21937] node02.example.com pengine: info: LogActions: Leave drbd_ourApp:0 (Slave node01.example.com)
Jan 18 16:53:51 [21937] node02.example.com pengine: notice: LogAction: * Promote drbd_ourApp:1 ( Slave -> Master node02.example.com )
Jan 18 16:53:51 [21937] node02.example.com pengine: info: LogActions: Leave pingd:0 (Started node01.example.com)
Jan 18 16:53:51 [21937] node02.example.com pengine: info: LogActions: Leave pingd:1 (Started node02.example.com)
===
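If it helps, we can also replay the saved policy-engine input for the 16:52:19 transition that scheduled the demote; a sketch with crm_simulate, assuming the default pe-input location (the exact file name comes from the "saving inputs in" line that follows the corresponding "Calculated transition" message; NNN is a placeholder):
===
# re-run the saved transition offline and show the promotion/placement scores
crm_simulate --simulate --show-scores --xml-file /var/lib/pacemaker/pengine/pe-input-NNN.bz2
===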

On Tue, Jan 19, 2021 at 2:27 AM Reid Wahl <nwahl at redhat.com> wrote:

> Can you share the cluster configuration (e.g., `pcs config` or the CIB)?
> And are there any additional LogAction messages after that one (e.g.,
> Promote for node01)?
>
> On Mon, Jan 18, 2021 at 7:47 PM Stuart Massey <djangoschef at gmail.com>
> wrote:
>
>> So, we have a 2-node cluster with a quorum device. One of the nodes
>> (node1) is having some trouble, so we have added constraints to prevent any
>> resources from migrating to it, but we have not put it in standby, so that
>> the DRBD secondary on that node stays in sync. The problems node1 is having
>> lead to OS lockups that eventually resolve themselves, but they cause it to
>> be temporarily dropped from the cluster by the current master (node2).
>> Sometimes when node1 rejoins, node2 will demote the DRBD ms resource. That
>> causes all resources that depend on it to be stopped, leading to a service
>> outage. They are then restarted on node2, since they can't run on node1
>> (due to the constraints).
>> We are having a hard time understanding why this happens. It seems like
>> there may be some sort of DC contention happening. Does anyone have any
>> idea how we might prevent this?
>> Selected messages (de-identified) from pacemaker.log that illustrate our
>> suspicion of DC confusion are below. The update_dc and
>> abort_transition_graph (deletion of lrm) messages always seem to precede the
>> demotion, and a demotion always seems to follow (when not already demoted).
>>
>> Jan 18 16:52:17 [21938] node02.example.com       crmd:     info: do_dc_takeover:        Taking over DC status for this partition
>> Jan 18 16:52:17 [21938] node02.example.com       crmd:     info: update_dc:     Set DC to node02.example.com (3.0.14)
>> Jan 18 16:52:17 [21938] node02.example.com       crmd:     info: abort_transition_graph:        Transition aborted by deletion of lrm[@id='1']: Resource state removal | cib=0.89.327 source=abort_unless_down:357 path=/cib/status/node_state[@id='1']/lrm[@id='1'] complete=true
>> Jan 18 16:52:19 [21937] node02.example.com    pengine:     info: master_color:  ms_drbd_ourApp: Promoted 0 instances of a possible 1 to master
>> Jan 18 16:52:19 [21937] node02.example.com    pengine:   notice: LogAction:      * Demote     drbd_ourApp:1     ( Master -> Slave node02.example.com )
>>
>>
>
>
> --
> Regards,
>
> Reid Wahl, RHCA
> Senior Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA
>