<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Sep 21, 2016 at 6:25 AM, Ken Gaillot <span dir="ltr">&lt;<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi everybody,<br>
<br>
Currently, Pacemaker's on-fail property allows you to configure how the<br>
cluster reacts to operation failures. The default "restart" means try to<br>
restart on the same node, optionally moving to another node once<br>
migration-threshold is reached. Other possibilities are "ignore",<br>
"block", "stop", "fence", and "standby".<br>
<br>
Occasionally, we get requests for something like migration-threshold<br>
that works with on-fail values other than "restart". For example, try<br>
restarting the resource on the same node 3 times, then fence.<br>
<br>
I'd like to get your feedback on two alternative approaches we're<br>
considering.<br>
<br>
###<br>
<br>
Our first proposed approach would add a new hard-fail-threshold<br>
operation property. If specified, the cluster would first try restarting<br>
the resource on the same node, </blockquote><div><br></div><div>Well, just as now, it would be _allowed_ to start on the same node, but this is not guaranteed.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">before doing the on-fail handling.<br>
<br>
For example, you could configure a promote operation with<br>
hard-fail-threshold=3 and on-fail=fence, to fence the node after 3 failures.</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
One point that's not settled is whether failures of *any* operation<br>
would count toward the 3 failures (which is how migration-threshold<br>
works now), or only failures of the specified operation.<br></blockquote><div><br></div><div>I think if hard-fail-threshold is per-op, then only failures of that operation should count.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
Currently, if a start fails (but is retried successfully), then a<br>
promote fails (but is retried successfully), then a monitor fails, the<br>
resource will move to another node if migration-threshold=3. We could<br>
keep that behavior with hard-fail-threshold, or only count monitor<br>
failures toward monitor's hard-fail-threshold. Each alternative has<br>
advantages and disadvantages.<br>
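<br>
To make the two counting rules concrete, here is a minimal Python sketch. It is not Pacemaker code: the function name threshold_reached and its parameters are hypothetical, and it only models how failures might be tallied under each rule.<br>
<br>
```python
from collections import Counter


def threshold_reached(failed_ops, op, threshold, per_operation):
    """Return True once enough failures have accumulated for `op`.

    failed_ops    -- names of the operations that have failed so far
    op            -- the operation whose threshold is being checked
    threshold     -- hypothetical hard-fail-threshold value
    per_operation -- True: count only failures of `op`;
                     False: count failures of any operation
    """
    counts = Counter(failed_ops)
    total = counts[op] if per_operation else sum(counts.values())
    return total >= threshold


# The sequence from the example: a start, a promote, then a monitor fails.
history = ["start", "promote", "monitor"]
print(threshold_reached(history, "monitor", 3, per_operation=False))
print(threshold_reached(history, "monitor", 3, per_operation=True))
```
<br>
Under the any-operation rule the third failure reaches the threshold; under the per-operation rule, monitor has failed only once, so it does not.<br>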
<br>
###<br>
<br>
The second proposed approach would add a new on-restart-fail resource<br>
property.<br>
<br>
As now, any on-fail value other than "restart" would be applied<br>
immediately after the first failure. A new value, "ban", would<br>
immediately move the resource to another node. (on-fail=ban would behave<br>
like on-fail=restart with migration-threshold=1.)<br>
<br>
When on-fail=restart, and restarting on the same node doesn't work, the<br>
cluster would do the on-restart-fail handling. on-restart-fail would<br>
allow the same values as on-fail (minus "restart"), and would default to<br>
"ban". </blockquote><div><br></div><div>I do wish you well tracking "is this a restart" across demote -> stop -> start -> promote in 4 different transitions :-)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
So, if you want to fence immediately after any promote failure, you<br>
would still configure on-fail=fence; if you want to try restarting a few<br>
times first, you would configure on-fail=restart and on-restart-fail=fence.<br>
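<br>
The decision flow of this second approach can be sketched in Python. This is an illustration only, not Pacemaker code: the function recovery_action is hypothetical, and it assumes that "restarting on the same node doesn't work" means the failure count has reached the resource's threshold.<br>
<br>
```python
def recovery_action(on_fail, failures, threshold, on_restart_fail="ban"):
    """Pick a recovery action under the second proposal (illustrative only)."""
    if on_restart_fail == "restart":
        # The proposal allows the same values as on-fail, minus "restart".
        raise ValueError("on-restart-fail cannot be 'restart'")
    if on_fail != "restart":
        # Any other on-fail value is applied on the first failure.
        return on_fail
    if failures >= threshold:
        # Restarting on the same node has not worked: escalate.
        return on_restart_fail
    return "restart"


# Fence immediately after any failure.
print(recovery_action("fence", failures=1, threshold=3))
# Try restarting a few times first, then fence.
print(recovery_action("restart", failures=2, threshold=3, on_restart_fail="fence"))
print(recovery_action("restart", failures=3, threshold=3, on_restart_fail="fence"))
```
<br>
The three calls print fence, restart, and fence, in that order. Leaving on_restart_fail at its default models the proposed "ban" default.<br>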
<br>
This approach keeps the current threshold behavior -- failures of any<br>
operation count toward the threshold. We'd rename migration-threshold to<br>
something like hard-fail-threshold, since it would apply to more than<br>
just migration, but unlike the first approach, it would stay a resource<br>
property.<br>
<br>
###<br>
<br>
Comparing the two approaches, the first is more flexible, but also more<br>
complex and potentially confusing.<br></blockquote><div><br></div><div>More complex to implement or more complex to configure?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
With either approach, we would deprecate the start-failure-is-fatal<br>
cluster property. start-failure-is-fatal=true would be equivalent to<br>
hard-fail-threshold=1 with the first approach, and on-fail=ban with the<br>
second approach. Either replacement would be both simpler and more useful,<br>
since it allows the value to be set differently per resource.<br>
<span class="gmail-HOEnZb"><font color="#888888">--<br>
Ken Gaillot &lt;<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>&gt;<br>
<br>
______________________________<wbr>_________________<br>
Users mailing list: <a href="mailto:Users@clusterlabs.org">Users@clusterlabs.org</a><br>
<a href="http://clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">http://clusterlabs.org/<wbr>mailman/listinfo/users</a><br>
<br>
Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" rel="noreferrer" target="_blank">http://www.clusterlabs.org/<wbr>doc/Cluster_from_Scratch.pdf</a><br>
Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a><br>
</font></span></blockquote></div><br></div></div>