[Pacemaker] migration-threshold causing unnecessary restart of underlying resources

Cnut Jansen work at cnutjansen.eu
Wed Aug 18 04:25:39 UTC 2010


  On 17.08.2010 13:51, Dejan Muhamedagic wrote:
> On Tue, Aug 17, 2010 at 04:14:17AM +0200, Cnut Jansen wrote:
>> And if so: Is there also any possibility to define one-sided
>> dependencies/influences?
> Take a look at mandatory vs. advisory constraints in the
> Configuration Explained doc. A group is equivalent to a set of
> order/collocation constraints with the infinite score (inf).
Yeah, just before your latest reply I had tested changing the scores of the colocation constraints from inf to 0, and unless the cluster gets picked on too hard, that seems to be an acceptable workaround for now, for when migration-threshold is really needed. But I guess we'll rather waive migration-threshold (and maybe try other options for a similar effect, if needed) than possibly mess around with optional/advisory scores.
You know, a score of 0 leaves situations possible where the dependent could be started (or be attempted to run) on a node other than the one hosting the resource it elementarily depends on, since that would be more like a "Hey, Paci, I'd be really glad if you at least tried to colocate the dependent with its underlying resource; but only if you feel like it!" than the required "Listen up, Paci: I insist(!!!) on colocating the dependent with its underlying resource! At all costs! That's a strict order!". I.e. you only need to move the underlying resource to another node (set a location constraint) and a score-0 colocation is already history.
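To illustrate the difference, here is a minimal sketch in crm shell syntax (the resource names dependent-rsc and underlying-rsc are made up for the example):

    # Mandatory colocation: dependent-rsc may only run where
    # underlying-rsc is running; moving underlying-rsc forces
    # dependent-rsc to follow (or stop).
    colocation col-strict inf: dependent-rsc underlying-rsc

    # Advisory colocation: merely a preference; a location
    # constraint on underlying-rsc can pull the two apart.
    colocation col-weak 0: dependent-rsc underlying-rsc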


>> Is it really "as expected" that many(!) minutes - and even
>> cluster-rechecks - after the last picking-on, and with a
>> failure-timeout of 45 seconds, the failure counter is still not only
>> showing a count of 3, but also obviously really being 3 (not 0,
>> after being reset), thus now migrating the resource already on the
>> first following picking-on?!
> Of course, that's not how it should work. If you observe such a
> case, please file a bugzilla and attach hb_report. I just
> commented what was shown above: 04:47:17 - 04:44:47 = 150.
> Perhaps I missed something happening earlier?
Only the picking-on up to migration-threshold's limit, and the pause of 
26 mins. (-;
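For context, the setup under discussion boils down to meta attributes along these lines (the primitive is just a stand-in; the values match my test):

    # Migrate away after 3 failures; expire old failures after 45s
    primitive dependent-rsc ocf:heartbeat:Dummy \
        meta migration-threshold=3 failure-timeout=45s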

I filed a bug report and hope that it's not too poor, since it was my 
first one ever of this kind, and bed has been calling for quite a long 
while already. (-#
http://developerbugs.linux-foundation.org/show_bug.cgi?id=2468

>>> The count gets reset, but the cluster acts on it only after the
>>> cluster-recheck-interval, unless something else makes the cluster
>>> calculate new scores.
>> See above, picking-on #4: More than 26 minutes after the last
> Hmm, sorry, couldn't see anything going on for 26 mins. I
> probably didn't look carefully enough.
Yeah, don't worry, you saw it right: You couldn't see anything going 
on during that time, because I indeed didn't do anything for those 26 
mins; I didn't even touch the VMs at all! d-#
I did - or rather, didn't do - that deliberately, to make absolutely 
sure that there are no other timers for resetting the failure counter 
or anything else, and to thus prove that it obviously really doesn't 
get reset at all. (The longest interval I've heard of so far is the 
default shutdown-escalation of 20 min.)
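
In case anyone wants to poke at this themselves: the failure counter can be inspected and cleared by hand with the crm shell; a sketch, assuming a resource named dependent-rsc on a node named node1:

    # Show the current failure count on the node
    crm resource failcount dependent-rsc show node1

    # Reset it manually (also cleans up failed operations)
    crm resource cleanup dependent-rsc node1

And the recheck interval Dejan mentioned is an ordinary cluster property, e.g.:

    crm configure property cluster-recheck-interval=60s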







