[ClusterLabs] Pacemaker/Corosync good fit for embedded product?

Ken Gaillot kgaillot at redhat.com
Thu Apr 12 11:29:12 EDT 2018


On Thu, 2018-04-12 at 14:37 +1200, David Hunt wrote:
> Thanks Guys,
> 
> Ideally I would like to have event-driven (rather than slower polled)
> inputs into pacemaker to quickly trigger the failover. I assume
> adding event-driven inputs to pacemaker isn't straightforward? If it

If you can detect the failure yourself, there is a way to inject a
failure into Pacemaker's state, using the crm_resource command-line
tool with the --fail option.
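For example (the resource and node names here are made up; check the
crm_resource help on your version for the exact option spelling):

    # tell the cluster that resource "my-svc" has failed on node "node1"
    crm_resource --fail --resource my-svc --node node1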

Given your timing requirements, you could squeeze a bit more by copying
the relevant portion of the crm_resource C code into whatever is doing
the failure detection (to avoid the overhead of spawning a process to
execute crm_resource).

The two sides of event-driven operation are being able to respond to events
(which pacemaker can do) and generating the events (which is often the
harder part). In a typical cluster today, node-level events are
generated by corosync, and resource-level events are generated by
resource agent monitor actions. Both of those methods require
significant time, but they have a high degree of certainty, i.e. they
don't require that the system being monitored is functioning correctly.
So they would still be a good backstop to any faster method you can
devise that generates events from within the system being monitored.
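
To give a sense of the polled side, the recurring monitor's interval is
what bounds resource-level detection latency. A minimal sketch in pcs
syntax (the resource name and IP are purely illustrative):

    # illustrative only: a floating IP whose health is polled every 2s
    pcs resource create test-ip ocf:heartbeat:IPaddr2 ip=192.168.122.10 \
        op monitor interval=2s timeout=20s

Even with a short interval, detection waits for the next poll, which is
why an injection mechanism like crm_resource --fail can be faster.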

> were possible to add event inputs to pacemaker, is pacemaker itself
> fast enough? Or is it also going to be relatively slow to switch?

You'll have to do live testing to determine that, but I would think it
would be difficult to find something faster that can handle the range
of failures that pacemaker can.

You'll want to consider and test as many failure scenarios as possible:

- Node-level failures. This could be a complete crash of a node, loss
of power, a node that is severely malfunctioning due to CPU or I/O
overload, etc. Pacemaker handles these via fencing, to ensure the node
is not competing for resources.

- Communication failures. This depends on what type of networking you
have between your nodes and with the outside world. It could be a
complete failure, or an intermittent failure. Pacemaker handles
communication failures between nodes via fencing; if there are any
other networks needed for serving resources, those must be monitored
via a resource, and handled via the usual resource recovery mechanisms.

- Service failures. This could be as simple as a process that crashed,
or as complex as a server that is accepting connections but
occasionally responding with garbage to particular request types due to
some memory buffer overflow. Pacemaker detects these via resource agent
monitors (or the injection method described earlier) and handles them
via a variety of expressed relationships (constraints) and recovery
goals (e.g. whether to try restarting the resource or immediately fence
the node, whether to try moving the resource to another node after a
certain number of failed restart attempts, etc.); a rough example of
such settings follows this list.
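
As a rough sketch of those recovery settings (pcs syntax again, with a
hypothetical resource name; the crm shell or raw XML would work just as
well):

    # restart locally on monitor failure, but move the resource to
    # another node after 2 failed recovery attempts
    pcs resource update my-svc op monitor interval=5s on-fail=restart \
        meta migration-threshold=2 failure-timeout=300s

    # or, more drastically, fence the node as soon as a monitor fails
    pcs resource update my-svc op monitor interval=5s on-fail=fence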

The goal of near-instant recovery has to be measured against the
variety of failure types. E.g. it may be possible to detect complete
network failure or power loss very quickly, but I can't imagine even a
theoretical way to immediately detect a (non-crash) failure of a server
process that passively responds to requests.

You'll need to ask: What are all the possible ways this system can
fail? What are all the possible ways to detect those failures? How
should recovery be attempted for each type of failure?

> It would seem based on this discussion it may still work to use
> pacemaker & corosync for initial setup & to handle services which
> can tolerate a slower switch-over time. For our services that
> require a much faster switch-over time it would appear we need
> something proprietary.
> 
> Regards
> David
> 
> On 12 April 2018 at 02:56, Klaus Wenninger <kwenning at redhat.com>
> wrote:
> > On 04/11/2018 10:44 AM, Jan Friesse wrote:
> > > David,
> > >
> > >> Hi,
> > >>
> > >> We are planning on creating a HA product in an active/standby
> > >> configuration whereby the standby unit needs to take over from
> > >> the active unit very fast (<50ms including all services
> > >> restored).
> > >>
> > >> We are able to do very fast signaling (say 1000Hz) between the
> > >> two units to detect failures so detecting a failure isn't really
> > >> an issue.
> > >>
> > >> Pacemaker looks to be a very useful piece of software for
> > >> managing resources so rather than roll our own it would make
> > >> sense to reuse pacemaker.
> > >>
> > >> So my initial questions are:
> > >>
> > >>     1. Do people think pacemaker is the right thing to use?
> > >>     Everything I read seems to be talking about multiple seconds
> > >>     for failure detection etc. Feature-wise it looks pretty
> > >>     similar to what we would want.
> > >>     2. Has anyone done anything similar to this?
> > >>     3. Any pointers on where/how to add additional failure
> > >>     detection inputs to pacemaker?
> > >>     4. For a new design would you go with pacemaker+corosync,
> > >>     pacemaker+corosync+knet or something different?
> > >>
> > >
> > >
> > > I will just share my point of view about the Corosync side.
> > >
> > > Corosync uses its own mechanism for detecting failure, based on
> > > token rotation. The default timeout for detecting a lost token is
> > > 1 second, so detecting a failure takes far more than 50ms. It can
> > > be lowered, but that is not really tested.
> > >
> > > That means it's not currently possible to use a different
> > > signaling mechanism without significant Corosync changes.
> > >
> > > So I don't think Corosync can really be used for the described
> > > scenario.
> > >
> > > Honza
> > 
> > On the other hand, if a failover is triggered by losing a node or
> > anything else that is detected by corosync, this is probably
> > already the fast path in a pacemaker cluster.
> > 
> > Detection of other types of failures (like a resource failing on
> > an otherwise functional node) is probably much slower still.
> > When a failure is detected by corosync, pacemaker has some kind of
> > an event-driven way to react to that.
> > We even have to add some delay to the mere corosync detection time
> > mentioned by Honza, as pacemaker will have to run e.g. an election
> > for the designated coordinator to be able to make decisions again.
> > 
> > For other failures the basic principle is rather to probe a
> > resource at a fixed rate (usually multiple seconds) to detect
> > failures, instead of using an event-driven mechanism.
> > There might be some trickery possible, though, using attributes to
> > achieve event-driven-like reactions to certain failures. But I
> > haven't done anything concrete to exploit these possibilities.
> > Others might have more info (which I personally would be
> > interested in as well ;-) ).
> > 
> > Approaches to realize event-driven mechanisms for resource failure
> > detection are under investigation/development (systemd resources,
> > IP resources sitting on interfaces, ...) but afaik there is nothing
> > available out of the box as of now.
> > 
> > Having said all that, I can add some personal experience from
> > having implemented an embedded product based on a pacemaker
> > cluster myself in the past:
> > 
> > As reaction times based on pacemaker would be too slow for e.g.
> > many communication protocols (e.g. things like SIP) or realtime
> > streams, it seems advisable to solve these issues on the
> > application layer inside a service (respectively a distributed
> > service in a cluster).
> > Pacemaker and its decision engine can then be used to bring up
> > this distributed service in the cluster in some kind of ordered
> > way.
> > Any additional services that are less demanding regarding
> > switch-over time can be made available via pacemaker directly.
> > 
> > Otherwise pacemaker configuration is very flexible, so you can
> > implement nearly anything. It might be advisable to avoid certain
> > approaches which are common in cases where a cluster is operated
> > by somebody who can be informed quickly and has to react under
> > certain SLAs. E.g. fencing a node by switching it off instead of
> > rebooting it might not be desirable with a kind of appliance that
> > is expected to just sit there and work without nearly any admin
> > effort/expense at all.
> > But that is of course just an example, and the configuration
> > (incl. the configuration concept) has to be tailored to your
> > requirements.
> > 
> > Regards,
> > Klaus
> >  
> > >
> > >>
> > >> Thanks
> > >>
> > >> David
> > >>
> 
-- 
Ken Gaillot <kgaillot at redhat.com>


