[ClusterLabs] Coming in 1.1.14: remapping sequential reboots to all-off-then-all-on

Mon Oct 19 17:14:32 UTC 2015

On 10/19/2015 11:42 AM, Digimer wrote:
> On 19/10/15 12:34 PM, Ken Gaillot wrote:
>> Pacemaker supports fencing "topologies", allowing multiple fencing
>> devices to be used (in conjunction or as fallbacks) when a node needs to
>> be fenced.
>>
>> However, there is a catch when using something like redundant power
>> supplies. If you put two power switches in the same topology level, and
>> Pacemaker needs to reboot the node, it will reboot the first power
>> switch and then the second -- which has no effect since the supplies are
>> redundant.
>>
>> Pacemaker's upstream master branch has new handling that will be part of
>> the eventual 1.1.14 release. In such a case, it will turn all the
>> devices off, then turn them all back on again.
> 
> How long will it stay in the 'off' state? Is it configurable? I
> ask because if it's too short, some PSUs may not actually lose power.
> One or two seconds should be way more than enough though.

It simply waits for the fence agent to return success from the "off"
command before proceeding. I wouldn't assume any particular time between
that and initiating "on", and there's no way to set a delay there --
it's up to the agent to not return success until the action is actually
complete.

The standard says that agents should actually confirm that the device is
in the desired state after sending a command, so hopefully this is
already baked in.
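
As a rough illustration of that ordering, here is a minimal Python sketch.
It is a simulation only, not Pacemaker code: the SimulatedOutlet class and
the remapped_reboot function are hypothetical names invented for this
example, and the real fencer talks to fence agents rather than Python
objects. It assumes each agent returns success only once the outlet is
confirmed to be in the requested state.

```python
class SimulatedOutlet:
    """Hypothetical stand-in for one fence device (not a Pacemaker API)."""

    def __init__(self, name, events):
        self.name = name
        self.powered = True
        self.events = events  # shared list recording the order of actions

    def run(self, action):
        # Assumption: a well-behaved agent reports success only after
        # confirming the outlet reached the requested state.
        self.powered = (action == "on")
        self.events.append((self.name, action))
        return self.powered == (action == "on")


def remapped_reboot(devices):
    """Turn every device off, then turn every device back on."""
    # Phase 1: each "off" must return success before the next step.
    for dev in devices:
        if not dev.run("off"):
            return False
    # Phase 2: begins as soon as the last "off" returns; there is no
    # configurable delay between the two phases.
    for dev in devices:
        dev.run("on")
    return True


events = []
outlets = [SimulatedOutlet("apc1", events), SimulatedOutlet("apc2", events)]
print(remapped_reboot(outlets), events)
```

The point of the sketch is that the only "off time" is whatever elapses
between the last agent confirming "off" and the first "on" being issued.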

>> With previous versions, there was a complicated configuration workaround
>> involving creating separate devices for the off and on actions. With the
>> new version, it happens automatically, and no special configuration is
>> needed.
>>
>> Here's an example where node1 is the affected node, and apc1 and apc2
>> are the fence devices:
>>
>>    pcs stonith level add 1 node1 apc1,apc2
> 
> Where would the outlet definition go? 'apc1:4,apc2:4'?

"apc1" here is the name of a Pacemaker fence resource. Hostname, port, etc.
would be configured in the definition of the "apc1" resource (which I
omitted above to focus on the topology config).

>> Of course you can configure it using crm or XML as well.
>>
>> The fencing operation will be treated as successful as long as the "off"
>> commands succeed, because then it is safe for the cluster to recover any
>> resources that were on the node. Timeouts and errors in the "on" phase
>> will be logged but ignored.
>>
>> Any action-specific timeout for the remapped action will be used (for
>> example, pcmk_off_timeout will be used when executing the "off" command,
>> not pcmk_reboot_timeout).
> 
> I think this answers my question about how long it stays off for. What
> would be an example config to control the off time then?

This isn't a delay, but a timeout before declaring the action failed. If
an "off" command does not return in this amount of time, the command
(and the entire topology level) will be considered failed, and the next
level will be tried.
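
To make the timeout-versus-delay distinction concrete, here is a small
Python sketch of the decision logic. It is a simplified model under stated
assumptions, not the actual implementation: fence_node and the dictionary
keys off_seconds and on_ok are hypothetical, "ipmi1" is a made-up device
name, and every level is treated as remapped to off-then-on for simplicity.
Durations are plain numbers, so nothing actually sleeps.

```python
def fence_node(levels):
    """Return (number of the level that succeeded, devices whose "on" failed).

    Each level is a list of dicts describing simulated devices:
      name             - device name
      off_seconds      - how long the simulated "off" command takes
      pcmk_off_timeout - limit before that "off" is declared failed
      on_ok            - whether the later "on" command succeeds
    """
    for number, level in enumerate(levels, start=1):
        # The timeout is a limit, not a delay: an "off" that finishes
        # sooner is not padded out to the timeout value.
        off_ok = all(
            dev["off_seconds"] <= dev["pcmk_off_timeout"] for dev in level
        )
        if not off_ok:
            # One failed or timed-out "off" fails the whole level,
            # so the next level is tried.
            continue
        # Problems in the "on" phase are noted but do not change the result.
        on_failures = [dev["name"] for dev in level if not dev["on_ok"]]
        return number, on_failures
    return None, []  # every level failed


levels = [
    [
        {"name": "apc1", "off_seconds": 5, "pcmk_off_timeout": 30, "on_ok": True},
        {"name": "apc2", "off_seconds": 45, "pcmk_off_timeout": 30, "on_ok": True},
    ],
    [
        {"name": "ipmi1", "off_seconds": 10, "pcmk_off_timeout": 60, "on_ok": False},
    ],
]
print(fence_node(levels))
```

Here apc2 exceeds its 30-second limit, so level 1 fails as a whole and
level 2 is used; the failed "on" for ipmi1 is reported but the fencing
still counts as successful.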

The timeouts are configured in the fence resource definition. So
combining the above questions, apc1 might be defined like this:

   pcs stonith create apc1 fence_apc_snmp \
      ipaddr=apc1.example.com \
      login=user passwd='supersecret' \
      pcmk_off_timeout=30s \
      pcmk_host_map="node1.example.com:1;node2.example.com:2"

>> The new code knows to skip the "on" step if the fence agent has
>> automatic unfencing (because it will happen when the node rejoins the
>> cluster). This allows fence_scsi to work with this feature.
> 
> http://i.imgur.com/i7BzivK.png

:-D