[ClusterLabs] interesting blog on Pacemaker-related outage

Adam Spiers aspiers at suse.com
Thu Dec 7 07:13:29 EST 2017


https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/

It's a great write-up, although a little frustrating that it is still
not fully understood why a -inf colocation failed whereas a +inf
succeeded.  (I actually have a vague memory of discovering something
very similar a while back, but I can't find the details.)

IMHO this serves as a good example of the difficulty Pacemaker faces,
and consequently as valuable feedback for how Pacemaker needs to
improve: it's all too easy to make one tiny misconfiguration which can
potentially bring the whole house of cards tumbling down, and it's
often really hard to understand what went wrong.

So FWIW, my personal view is that more than anything else right now,
Pacemaker needs to be made easier to understand.  I know this is a big
ask since HA is unavoidably complex, but I'm sure there are actionable
items which would serve as relatively manageable yet very worthwhile
steps towards this goal.  I alluded to this during my presentation at
the ClusterLabs Summit, e.g. see

    https://aspiers.github.io/clusterlabs-summit-2017-openstack-ha/#/debugging

and the following slide.  And in fact I remember some really good
discussions on this during the summit too, but I'm not sure if they
led anywhere.

Hope this feedback is useful!



