[ClusterLabs] Doing reload right

Thu Jul 21 10:33:31 EDT 2016

On 07/20/2016 07:32 PM, Andrew Beekhof wrote:
> On Thu, Jul 21, 2016 at 2:47 AM, Adam Spiers <aspiers at suse.com> wrote:
>> Ken Gaillot <kgaillot at redhat.com> wrote:
>>> Hello all,
>>>
>>> I've been meaning to address the implementation of "reload" in Pacemaker
>>> for a while now, and I think the next release will be a good time, as it
>>> seems to be coming up more frequently.
>>
>> [snipped]
>>
>> I don't want to comment directly on any of the excellent points which
>> have been raised in this thread, but it seems like a good time to make
>> a plea for easier reload / restart of individual instances of cloned
>> services, one node at a time.  Currently, if nodes are all managed by
>> a configuration management system (such as Chef in our case),
> 
> Puppet creates the same kinds of issues.
> Both seem designed for a magical world full of unrelated servers that
> require no co-ordination to update.
> Particularly when the timing of an update to some central store (cib,
> database, whatever) needs to be carefully ordered.
> 
> When you say "restart" though, is that a traditional stop/start cycle
> in Pacemaker that also results in all the dependencies being stopped
> too?
> I'm guessing you really want the "atomic reload" kind where nothing
> else is affected because we already have the other style covered by
> crm_resource --restart.

crm_resource --restart isn't sufficient for his use case because it
affects all clone instances cluster-wide, whereas he needs to reload or
restart (depending on the service) the local instance only.

> 
> I propose that we introduce a --force-restart option for crm_resource which:
> 
> 1. disables any recurring monitor operations

None of the other --force-* options disable monitors, so for
consistency, I think we should leave this to the user (or add it for
other --force-*).

> 2. calls a native restart action directly on the resource if it
> exists, otherwise calls the native stop+start actions

What do you mean by native restart action? Systemd restart?

> 3. re-enables the recurring monitor operations regardless of whether
> the reload succeeds, fails, or times out, etc
> 
> No maintenance mode required, and whatever state the resource ends up
> in is re-detected by the cluster in step 3.

If you're lucky :-)

The cluster may still mess with the resource even without monitors, e.g.
if a dependency fails or a preferred node comes online. Maintenance
mode/unmanaging would still be safer (though no --force-* option is
completely safe, besides --force-check).
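
To make the proposed sequence concrete, here is a rough sketch of the
three steps in Python. It is only an illustration, not how crm_resource
works or would work: force_restart and every callable passed to it are
hypothetical placeholders for whatever would actually disable monitors
or run the agent's actions. The one point it tries to show is that step
3 runs no matter how step 2 ends.

```python
from typing import Callable, Optional

Action = Callable[[], None]


def force_restart(
    disable_monitors: Action,
    enable_monitors: Action,
    stop: Action,
    start: Action,
    restart: Optional[Action] = None,
) -> None:
    """Hypothetical sketch of the proposed three-step sequence.

    All arguments are placeholders supplied by the caller; none of
    them corresponds to a real Pacemaker API.
    """
    # Step 1: disable any recurring monitor operations.
    disable_monitors()
    try:
        # Step 2: use a native restart if one exists, else stop + start.
        if restart is not None:
            restart()
        else:
            stop()
            start()
    finally:
        # Step 3: re-enable monitors whether step 2 succeeded or raised.
        enable_monitors()


if __name__ == "__main__":
    # Dry run that only records the order of the calls.
    calls = []
    force_restart(
        disable_monitors=lambda: calls.append("monitors off"),
        enable_monitors=lambda: calls.append("monitors on"),
        stop=lambda: calls.append("stop"),
        start=lambda: calls.append("start"),
    )
    print(calls)
```

Note that the sketch does nothing to stop the cluster from acting on the
resource between steps 1 and 3, which is the gap described above.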

>> when the
>> system wants to perform a configuration run on that node (e.g. when
>> updating a service's configuration file from a template), it is
>> necessary to place the entire node in maintenance mode before
>> reloading or restarting that service on that node.  It works OK, but
>> can result in ugly effects such as the node getting stuck in
>> maintenance mode if the chef-client run failed, without any easy way
>> to track down the original cause.
>>
>> I went through several design iterations before settling on this
>> approach, and they are detailed in a lengthy comment here, which may
>> help you better understand the challenges we encountered:
>>
>>   https://github.com/crowbar/crowbar-ha/blob/master/chef/cookbooks/crowbar-pacemaker/providers/service.rb#L61
>>
>> Similar challenges are posed during upgrade of Pacemaker-managed
>> OpenStack infrastructure.
>>
>> Cheers,
>> Adam