[ClusterLabs] Pacemaker resource parameter reload confusion

Tue Dec 12 17:42:07 EST 2017

On Wed, 2017-11-01 at 10:04 +0100, Ferenc Wágner wrote:
> Ken Gaillot <kgaillot at redhat.com> writes:
> 
> > When an operation completes, a history entry (<lrm_rsc_op>) is added
> > to the pe-input file. If the agent supports reload, the entry will
> > include op-force-restart and op-restart-digest fields. Now I see
> > those are present in the vm-alder_last_0 entry, so agent support
> > isn't the issue.
> 
> Thanks for the explanation.
> 
> > However, the operation is recorded as a *failed* probe (i.e. the
> > resource was running where it wasn't expected). This gets recorded
> > as a separate vm-alder_last_failure_0 entry, which does not get the
> > special fields. It looks to me like this failure entry is forcing
> > the restart. That would be a good idea if it's an actual failure; if
> > we find a resource unexpectedly running, we don't know how it was
> > started, so a full restart makes sense.
> > 
> > However, I'm guessing it may not have been a real error, but a
> > resource cleanup. A cleanup clears the history so the resource is
> > re-probed, and I suspect that re-probe is what got recorded here as
> > a failure. Does that match what actually happened?
> 
> Well, I can't really remember, it happened two months ago...  I'm
> pretty sure the resource wasn't running unexpectedly, I'd surely
> recall such a grave failure.  Interestingly, though, my shell history
> contains a cleanup operation shortly after the parameter change.
> Also, if you look at the logs in my thread starting mail, you'll find
> 
> warning: Processing failed op monitor for vm-alder on vhbl05: not running (7)
> 
> which does not seem to match up with the failure in the lrm_rsc_op
> entry in pe-input.  It's sort of "normal" that such a resource
> disappears and gets restarted by the cluster.  If that report survived
> the unexpected restart, I might have wanted to routinely clean it up
> afterwards.
> 
> (I'm leaving for a short holiday now, expect longer delays.)

Looking at it again using crm_simulate from 1.1.18 plus patches, it
does appear that the combination of a cleanup and a parameter change in
the same transition turned the reload into a restart.

The cleanup results in a failed probe being recorded, and that history
entry does not have the op-force-restart and op-restart-digest
attributes that indicate reloadability.
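
For anyone who wants to check their own pe-input for the same pattern,
here is a minimal sketch (not part of Pacemaker) that lists which
<lrm_rsc_op> history entries carry both fields. The embedded XML is a
hypothetical, heavily trimmed excerpt with made-up attribute values, and
the function name is just for illustration; only the element and
attribute names come from the discussion above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed excerpt of resource history. The attribute
# values are placeholders; a real pe-input entry has many more attributes.
SAMPLE = """
<lrm_resource id="vm-alder">
  <lrm_rsc_op id="vm-alder_last_0"
              op-force-restart="placeholder"
              op-restart-digest="placeholder"/>
  <lrm_rsc_op id="vm-alder_last_failure_0"/>
</lrm_resource>
"""

# The two fields mentioned in this thread as marking reload support.
RELOAD_FIELDS = ("op-force-restart", "op-restart-digest")


def reload_fields_present(xml_text):
    """Map each lrm_rsc_op id to True if it carries both reload fields."""
    root = ET.fromstring(xml_text)
    return {
        op.get("id"): all(field in op.attrib for field in RELOAD_FIELDS)
        for op in root.iter("lrm_rsc_op")
    }


if __name__ == "__main__":
    for op_id, has_fields in reload_fields_present(SAMPLE).items():
        status = "has reload fields" if has_fields else "lacks reload fields"
        print(op_id, status)
```

To run it against a real file, read the (decompressed) pe-input XML into
a string and pass that instead of SAMPLE.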

I suspect that if you had changed the parameter, waited for the reload
to happen, and only then done the cleanup, it would have been fine.

I'll have to investigate a fix.
-- 
Ken Gaillot <kgaillot at redhat.com>