[ClusterLabs] Resource Parameter Change Not Honoring Constraints

Marc Smith msmith626 at gmail.com
Sat Apr 11 01:03:43 EDT 2020


On Wed, Apr 1, 2020 at 8:01 PM Ken Gaillot <kgaillot at redhat.com> wrote:
>
> On Thu, 2020-03-19 at 13:39 -0400, Marc Smith wrote:
> > On Mon, Mar 16, 2020 at 1:26 PM Marc Smith <msmith626 at gmail.com>
> > wrote:
> > >
> > > On Thu, Mar 12, 2020 at 10:51 AM Ken Gaillot <kgaillot at redhat.com>
> > > wrote:
> > > >
> > > > On Wed, 2020-03-11 at 17:24 -0400, Marc Smith wrote:
> > > > > Hi,
> > > > >
> > > > > I'm using Pacemaker 1.1.20 (yes, I know, a bit dated now). I
> > > > > noticed
> > > >
> > > > I'd still consider that recent :)
> > > >
> > > > > when I modify a resource parameter (e.g., update its value),
> > > > > this causes the resource itself to restart. And that's fine, but
> > > > > when this resource is restarted, it doesn't appear to honor the
> > > > > full set of constraints for that resource.
> > > > >
> > > > > I see the output like this (right after the resource parameter
> > > > > change):
> > > > > ...
> > > > > Mar 11 20:43:25 localhost crmd[1943]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
> > > > > Mar 11 20:43:25 localhost crmd[1943]:   notice: Current ping state: S_POLICY_ENGINE
> > > > > Mar 11 20:43:25 localhost pengine[1942]:   notice: Clearing failure of p_bmd_140c58-1 on 140c58-1 because resource parameters have changed
> > > > > Mar 11 20:43:25 localhost pengine[1942]:   notice:  * Restart p_bmd_140c58-1             (                   140c58-1 )   due to resource definition change
> > > > > Mar 11 20:43:25 localhost pengine[1942]:   notice:  * Restart p_dummy_g_lvm_140c58-1     (                   140c58-1 )   due to required g_md_140c58-1 running
> > > > > Mar 11 20:43:25 localhost pengine[1942]:   notice:  * Restart p_lvm_140c58_vg_01         (                   140c58-1 )   due to required p_dummy_g_lvm_140c58-1 start
> > > > > Mar 11 20:43:25 localhost pengine[1942]:   notice: Calculated transition 41, saving inputs in /var/lib/pacemaker/pengine/pe-input-173.bz2
> > > > > Mar 11 20:43:25 localhost crmd[1943]:   notice: Initiating stop operation p_lvm_140c58_vg_01_stop_0 on 140c58-1
> > > > > Mar 11 20:43:25 localhost crmd[1943]:   notice: Transition aborted by deletion of lrm_rsc_op[@id='p_bmd_140c58-1_last_failure_0']: Resource operation removal
> > > > > Mar 11 20:43:25 localhost crmd[1943]:   notice: Current ping state: S_TRANSITION_ENGINE
> > > > >
> > > > > The stop on 'p_lvm_140c58_vg_01' then times out, because the
> > > > > other constraint (to stop the service above LVM) is never
> > > > > executed. I can see from the messages that it never even tries
> > > > > to demote the resource above it.
> > > > >
> > > > > Yet, if I use crmsh at the shell, and do a restart on that same
> > > > > resource, it works correctly, and all constraints are honored:
> > > > > crm
> > > > > resource restart p_bmd_140c58-1
> > > > >
> > > > > I can certainly provide my full cluster config if needed, but
> > > > > hoping
> > > > > to keep this email concise for clarity. =)
> > > > >
> > > > > I guess my questions are: 1) Is the difference in restart
> > > > > behavior expected, i.e., are not all constraints followed when
> > > > > resource parameters change (or on some other internally
> > > > > originated restart event like this)? 2) Or is this a known bug
> > > > > that has already been resolved in newer versions of Pacemaker?
> > > >
> > > > No to both. Can you attach that pe-input-173.bz2 file (with any
> > > > sensitive info removed)?
> > >
> > > Thanks; that system got wiped, so I reproduced it on another system
> > > and I am attaching that pe-input file. Log snippet is below for
> > > completeness:
> > >
> > > Mar 16 17:16:50 localhost crmd[1340]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
> > > Mar 16 17:16:50 localhost pengine[1339]:   notice:  * Restart p_bmd_126c4f-1             (                   126c4f-1 )   due to resource definition change
> > > Mar 16 17:16:50 localhost pengine[1339]:   notice:  * Restart p_dummy_g_lvm_126c4f-1     (                   126c4f-1 )   due to required g_md_126c4f-1 running
> > > Mar 16 17:16:50 localhost pengine[1339]:   notice:  * Restart p_lvm_126c4f_vg_01         (                   126c4f-1 )   due to required p_dummy_g_lvm_126c4f-1 start
> > > Mar 16 17:16:50 localhost pengine[1339]:   notice: Calculated transition 149, saving inputs in /var/lib/pacemaker/pengine/pe-input-46.bz2
> > >
> >
> > Hi Ken,
> >
> > Just a friendly bump to see if you had a chance to take a look at
> > this
> > issue? I appreciate your time and expertise! =)
> >
> > --Marc
>
> Sorry, I've been slammed lately.

No problem at all, appreciate you taking the time to investigate.


>
> There does appear to be a scheduler bug. The relevant constraint is (in
> plain language)
>
>    start g_lvm_* then promote ms_alua_*
>
> The implicit inverse of that is
>
>    demote ms_alua_* then stop g_lvm_*
>
> The bug is that ms_alua_* isn't demoted before g_lvm_* is stopped.
> (Note however that the configuration does not require ms_alua_* to be
> stopped.)
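
For anyone following along, I believe the ordering constraint Ken is
describing would look roughly like this in crmsh syntax (the constraint
ID here is hypothetical, and the resource names are abbreviated from my
actual config):

```
# Hypothetical crmsh sketch of the ordering constraint in question;
# the actual IDs and resource names in the real config may differ.
# Forward direction: start the LVM group, then promote the ALUA resource.
order o_lvm_before_alua Mandatory: g_lvm_140c58-1:start ms_alua_140c58-1:promote
# Ordering constraints are symmetrical by default, so Pacemaker implies
# the inverse: demote ms_alua_140c58-1, then stop g_lvm_140c58-1.
# That implied demote is what gets skipped on a parameter-change restart.
```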

Anything I can do to debug this further? I've worked around it for now
in my particular use case by simply stopping the ms_alua_* resource
before modifying the resource parameter; not ideal, but okay for now.
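
Concretely, the workaround sequence looks something like this (crmsh;
the parameter name and value below are just placeholders, and these
commands of course assume a running cluster):

```
# Workaround sketch; the parameter name/value are illustrative only.
crm resource stop ms_alua_140c58-1                            # demote/stop the ALUA resource first
crm resource param p_bmd_140c58-1 set some_param new_value    # now the parameter change restarts safely
crm resource start ms_alua_140c58-1                           # bring the ALUA resource back up
```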

--Marc


>
> > >
> > > --Marc
> > >
> > >
> > > > >
> > > > > I searched a bit for #2 but didn't get many (well, any) hits
> > > > > from other users experiencing this behavior.
> > > > >
> > > > > Many thanks in advance.
> > > > >
> > > > > --Marc
> > > >
> > > > --
> > > > Ken Gaillot <kgaillot at redhat.com>
> > > >
> > > > _______________________________________________
> > > > Manage your subscription:
> > > > https://lists.clusterlabs.org/mailman/listinfo/users
> > > >
> > > > ClusterLabs home: https://www.clusterlabs.org/
> >
> >
> --
> Ken Gaillot <kgaillot at redhat.com>
>

