[ClusterLabs] Pacemaker resource parameter reload confusion
wferi at niif.hu
Wed Nov 1 05:04:26 EDT 2017
Ken Gaillot <kgaillot at redhat.com> writes:
> When an operation completes, a history entry (<lrm_rsc_op>) is added to
> the pe-input file. If the agent supports reload, the entry will include
> op-force-restart and op-restart-digest fields. Now I see those are
> present in the vm-alder_last_0 entry, so agent support isn't the issue.
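[For reference, a quick way to check for those fields is to grep the <lrm_rsc_op> entry in the pe-input file for the two reload attributes. A minimal sketch, with a hand-written stand-in for a real pe-input fragment (the attribute values are illustrative, not from an actual cluster):

```python
# Sketch: check whether a resource's recorded operation carries the
# reload-related fields Pacemaker adds when the agent supports reload.
# The XML below is a made-up stand-in for a real pe-input file.
import xml.etree.ElementTree as ET

PE_INPUT_SNIPPET = """
<lrm_resource id="vm-alder">
  <lrm_rsc_op id="vm-alder_last_0" operation="start" rc-code="0"
              op-force-restart=" config " op-restart-digest="0123abcd"/>
</lrm_resource>
"""

def supports_reload(lrm_resource_xml, op_id):
    """True if the given <lrm_rsc_op> entry has both reload fields."""
    root = ET.fromstring(lrm_resource_xml)
    for op in root.iter("lrm_rsc_op"):
        if op.get("id") == op_id:
            return ("op-force-restart" in op.attrib
                    and "op-restart-digest" in op.attrib)
    return False

print(supports_reload(PE_INPUT_SNIPPET, "vm-alder_last_0"))  # → True
```
]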
Thanks for the explanation.
> However, the operation is recorded as a *failed* probe (i.e. the
> resource was running where it wasn't expected). This gets recorded as a
> separate vm-alder_last_failure_0 entry, which does not get the special
> fields. It looks to me like this failure entry is forcing the restart.
> That would be a good idea if it's an actual failure; if we find a
> resource unexpectedly running, we don't know how it was started, so a
> full restart makes sense.
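[The decision described above can be sketched roughly like this. This is not Pacemaker's actual code, just an illustration of the rule: a recorded failure entry forces a full restart even when the normal history entry carries the reload fields.

```python
# Simplified sketch (NOT Pacemaker's real logic) of restart-vs-reload:
# a failure entry (rsc_last_failure_0) forces a full restart even if the
# normal entry (rsc_last_0) carries op-restart-digest.
def planned_action(last_op, last_failure_op, params_changed):
    if not params_changed:
        return "none"
    if last_failure_op is not None:
        # Resource found unexpectedly running: we don't know how it was
        # started, so restart from scratch.
        return "restart"
    if last_op and "op-restart-digest" in last_op:
        # Agent supports reload and only reloadable parameters changed.
        return "reload"
    return "restart"

print(planned_action({"op-restart-digest": "abcd"}, None, True))      # reload
print(planned_action({"op-restart-digest": "abcd"}, {"rc": 7}, True)) # restart
```
]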
> However, I'm guessing it may not have been a real error, but a resource
> cleanup. A cleanup clears the history so the resource is re-probed, and
> I suspect that re-probe is what got recorded here as a failure. Does
> that match what actually happened?
Well, I can't really remember; it happened two months ago... I'm pretty
sure the resource wasn't running unexpectedly, though: I'd surely recall
such a grave failure. Interestingly, my shell history contains a
cleanup operation shortly after the parameter change. Also, if you look
at the logs in my thread-starting mail, you'll find
warning: Processing failed op monitor for vm-alder on vhbl05: not running (7)
which does not seem to match up with the failure in the lrm_rsc_op entry
in pe-input. It's sort of "normal" that such a resource disappears and
gets restarted by the cluster. If that report survived the unexpected
restart, I might have wanted to routinely clean it up afterwards.
(I'm leaving for a short holiday now, expect longer delays.)