[ClusterLabs] restarting pacemakerd

Mon Jun 20 20:36:50 UTC 2016

On 06/18/2016 05:15 AM, Ferenc Wágner wrote:
> Hi,
> 
> Could somebody please elaborate a little why the pacemaker systemd
> service file contains "Restart=on-failure"?  I mean that a failed node
> gets fenced anyway, so most of the time this would be a futile effort.
> On the other hand, one could argue that restarting failed services
> should be the default behavior of systemd (or any init system).  Still,
> it is not.  I'd be grateful for some insight into the matter.

To clarify one point, the configuration mentioned here is systemd
configuration, not part of pacemaker configuration or operation. Systemd
monitors the processes it launches. With "Restart=on-failure", systemd
will re-launch pacemaker in situations it considers a "failure"
(exiting nonzero, exiting with core dump, etc.).

Systemd does have various rate-limiting options for restarts, which we
leave at their defaults in the pacemaker unit file. Perhaps one day we
could try to come up with ideal values, but repeated failures should be
a rare situation, and admins can always tune the limits as desired for
their system using an override file.
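
As a rough sketch of such an override (the values below are arbitrary
placeholders, not recommendations), a drop-in created with
"systemctl edit pacemaker" might look like the following. It assumes the
unit is named pacemaker.service and a systemd recent enough to accept the
start-limit settings in [Unit] under these names; older releases spell
the interval StartLimitInterval= and expect both settings in [Service],
so check systemd.unit(5) and systemd.service(5) for your version.

```ini
# /etc/systemd/system/pacemaker.service.d/override.conf
# Illustrative values only; tune for your own cluster.

[Unit]
# Allow at most 3 starts within any 60-second window;
# further start attempts are refused once the limit is hit.
StartLimitIntervalSec=60s
StartLimitBurst=3

[Service]
# Wait 5 seconds before each automatic restart.
RestartSec=5s
```

If you edit the file by hand rather than through "systemctl edit", run
"systemctl daemon-reload" afterward so systemd picks up the change.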

The goal of restart is of course to have a slightly better shot at
recovery. You're right, if fencing is configured and quorum is retained,
the node will almost certainly get fenced anyway, but those conditions
aren't always true.

Systemd upstream recommends Restart=on-failure or Restart=on-abnormal
for all long-running services. on-abnormal would probably be better for
pacemaker, but it's not supported in older systemd versions.
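
For anyone on a systemd that supports it, switching to on-abnormal is a
one-line drop-in. This is only a sketch of a local override, not
something the shipped unit file does. With on-abnormal, systemd restarts
the service after an unclean signal (including a core dump), a timeout,
or a watchdog expiry, but not after an ordinary nonzero exit.

```ini
# /etc/systemd/system/pacemaker.service.d/restart.conf
# (the file name is arbitrary; any *.conf in this directory is read)

[Service]
# Replaces the Restart=on-failure setting from the packaged unit file.
Restart=on-abnormal
```

After a daemon-reload, "systemctl show pacemaker -p Restart" should
report the value actually in effect.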