[ClusterLabs] Pacemaker resource parameter reload confusion
wferi at niif.hu
Wed Nov 1 05:04:26 EDT 2017
Ken Gaillot <kgaillot at redhat.com> writes:
> When an operation completes, a history entry (<lrm_rsc_op>) is added to
> the pe-input file. If the agent supports reload, the entry will include
> op-force-restart and op-restart-digest fields. Now I see those are
> present in the vm-alder_last_0 entry, so agent support isn't the issue.
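[For reference, a quick way to check for those fields is to grep the <lrm_rsc_op> entry in the pe-input file for the two reload attributes. A minimal sketch, with a hand-written stand-in for a real pe-input fragment (the attribute values are illustrative, not from an actual cluster):

```python
# Sketch: check whether a resource's recorded operation carries the
# reload-related fields Pacemaker adds when the agent supports reload.
# The XML below is a made-up stand-in for a real pe-input file.
import xml.etree.ElementTree as ET

PE_INPUT_SNIPPET = """
<lrm_resource id="vm-alder">
  <lrm_rsc_op id="vm-alder_last_0" operation="start" rc-code="0"
              op-force-restart=" config " op-restart-digest="0123abcd"/>
</lrm_resource>
"""

def supports_reload(lrm_resource_xml, op_id):
    """True if the given <lrm_rsc_op> entry has both reload fields."""
    root = ET.fromstring(lrm_resource_xml)
    for op in root.iter("lrm_rsc_op"):
        if op.get("id") == op_id:
            return ("op-force-restart" in op.attrib
                    and "op-restart-digest" in op.attrib)
    return False

print(supports_reload(PE_INPUT_SNIPPET, "vm-alder_last_0"))  # → True
```
]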
Thanks for the explanation.
> However, the operation is recorded as a *failed* probe (i.e. the
> resource was running where it wasn't expected). This gets recorded as a
> separate vm-alder_last_failure_0 entry, which does not get the special
> fields. It looks to me like this failure entry is forcing the restart.
> That would be a good idea if it's an actual failure; if we find a
> resource unexpectedly running, we don't know how it was started, so a
> full restart makes sense.
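[The decision described above can be sketched roughly like this. This is not Pacemaker's actual code, just an illustration of the rule: a recorded failure entry forces a full restart even when the normal history entry carries the reload fields.

```python
# Simplified sketch (NOT Pacemaker's real logic) of restart-vs-reload:
# a failure entry (rsc_last_failure_0) forces a full restart even if the
# normal entry (rsc_last_0) carries op-restart-digest.
def planned_action(last_op, last_failure_op, params_changed):
    if not params_changed:
        return "none"
    if last_failure_op is not None:
        # Resource found unexpectedly running: we don't know how it was
        # started, so restart from scratch.
        return "restart"
    if last_op and "op-restart-digest" in last_op:
        # Agent supports reload and only reloadable parameters changed.
        return "reload"
    return "restart"

print(planned_action({"op-restart-digest": "abcd"}, None, True))      # reload
print(planned_action({"op-restart-digest": "abcd"}, {"rc": 7}, True)) # restart
```
]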
> However, I'm guessing it may not have been a real error, but a resource
> cleanup. A cleanup clears the history so the resource is re-probed, and
> I suspect that re-probe is what got recorded here as a failure. Does
> that match what actually happened?
Well, I can't really remember; it happened two months ago... I'm pretty
sure the resource wasn't running unexpectedly, though: I'd surely recall
such a grave failure. Interestingly, my shell history contains a
cleanup operation shortly after the parameter change. Also, if you look
at the logs in my thread-starting mail, you'll find
warning: Processing failed op monitor for vm-alder on vhbl05: not running (7)
which does not seem to match up with the failure in the lrm_rsc_op entry
in pe-input. It's sort of "normal" that such a resource disappears and
gets restarted by the cluster. If that report survived the unexpected
restart, I might have wanted to routinely clean it up afterwards.
(I'm leaving for a short holiday now, expect longer delays.)