[ClusterLabs] multiple action= lines sent to STDIN of fencing agents - why?

Thu Oct 15 11:00:19 EDT 2015

On 10/15/2015 06:25 AM, Adam Spiers wrote:
> I inserted some debugging into fencing.py and found that stonithd
> sends stuff like this to STDIN of the fencing agents it forks:
> 
>     action=list
>     param1=value1
>     param2=value2
>     param3=value3
>     action=list
> 
> where paramX and valueX come from the configuration of the primitive
> for the fencing agent.
> 
> As a corollary, if the primitive for the fencing agent has 'action'
> defined as one of its parameters, this means that there will be three
> 'action=' lines, and the middle one could have a different value to
> the two sandwiching it.
> 
> When I first saw this, I had an extended #wtf moment and thought it
> was a bug.  But on closer inspection, it seems very deliberate, e.g.
> 
>   https://github.com/ClusterLabs/pacemaker/commit/bfd620645f151b71fafafa279969e9d8bd0fd74f
> 
> The "regardless of what the admin configured" comment suggests to me
> that there is an underlying assumption that any fencing agent will
> ensure that if the same parameter is duplicated on STDIN, the final
> value will override any previous ones.  And indeed fencing.py ensures
> this, but presumably it is possible to write agents which don't use
> fencing.py.
> 
> Is my understanding correct?  If so:

Yes, good sleuthing.

> 1) Is the first 'action=' line intended in order to set some kind of
>    default action, in the case that the admin didn't configure the
>    primitive with an 'action=' parameter *and* _action wasn't one of
>    list/status/monitor/metadata?  In what circumstances would this
>    happen?

The first action line is usually the only one.

Ideally, admins don't configure "action" as a parameter of a fence
device. They either specify nothing (in which case the cluster does what
it thinks should be done -- reboot, off, etc.), or they specify
pcmk_*_action to override the cluster's choice. For example,
pcmk_reboot_action=off tells the cluster to actually send the fence
agent "action=off" when a reboot is desired. (Perhaps the admin prefers
that flaky nodes stay down until investigated, or the fence device
doesn't handle reboots well.)

So the first action line is the result of that. If the admin configured
a pcmk_*_action for the requested action, the agent gets that value;
otherwise it gets the requested action.
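
To make that lookup concrete, here is a rough Python sketch. It is not
pacemaker's code; the function name and the dict of device parameters
are invented for illustration, and only the pcmk_<action>_action naming
comes from the description above.

```python
def first_action_line(requested, device_params):
    """Return the initial 'action=' line for a requested action.

    Hypothetical helper: use a pcmk_<requested>_action override from the
    device configuration if one exists, else the requested action itself.
    """
    override = device_params.get("pcmk_%s_action" % requested)
    return "action=%s" % (override or requested)


# The admin prefers that nodes stay down instead of rebooting.
print(first_action_line("reboot", {"pcmk_reboot_action": "off"}))  # action=off
# No override configured: the cluster's choice is passed through.
print(first_action_line("reboot", {}))  # action=reboot
```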

Second, any parameters in the device configuration are copied to the
agent. So if the admin did specify "action" there, it will get copied
(as a second instance, possibly different from the first).

But that would override *all* requested actions, which is a bad idea. No
one wants a recurring monitor action to shoot a node! :) So the last
step is a failsafe: if the requested action was informational
(list/status/monitor/metadata) rather than a "real" fencing action
(off/reboot), the original action line is re-sent after the copied
parameters, so that it wins over any "action" the admin supplied.
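
Putting the three steps together, a minimal Python sketch of how the
STDIN lines could be assembled is below. All names are hypothetical and
this is not the actual stonithd logic. Two assumptions are baked in:
pcmk_* options are treated as cluster-side settings and not forwarded
to the agent, and, matching the output quoted at the top of this
thread, the action line is re-sent for every informational action
whether or not the admin configured "action".

```python
INFORMATIONAL = ("list", "status", "monitor", "metadata")


def build_stdin_lines(requested, device_params):
    """Assemble the key=value lines for a fence agent (illustrative only)."""
    # Step 1: the cluster's choice, possibly overridden by pcmk_*_action.
    override = device_params.get("pcmk_%s_action" % requested)
    first = "action=%s" % (override or requested)
    lines = [first]

    # Step 2: copy the device parameters, including any admin "action".
    # Assumption: pcmk_* options are not passed on to the agent.
    for key, value in device_params.items():
        if key.startswith("pcmk_"):
            continue
        lines.append("%s=%s" % (key, value))

    # Step 3: failsafe. Re-send the original line for informational
    # actions so that the last value seen by the agent is the safe one.
    if requested in INFORMATIONAL:
        lines.append(first)
    return lines


for line in build_stdin_lines("monitor", {"param1": "value1", "action": "off"}):
    print(line)
```

With an admin-supplied action=off, a monitor request yields action=monitor,
param1=value1, action=off, action=monitor, so an agent that honors the
last value never fences on a monitor.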

> 2) Is this assumption of the agents always being order-sensitive
>    (i.e. last value always wins) documented anywhere?  The best
>    documentation on the API I could find was here:
> 
>       https://fedorahosted.org/cluster/wiki/FenceAgentAPI
> 
>    but it doesn't mention this.

Good point. It would be a good idea to add that to the API documentation,
since it's established practice, but it would probably also be a good
idea for pacemaker to send only the final value of any parameter.
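
For anyone writing an agent without fencing.py, the "last value wins"
convention is easy to honor. The Python sketch below shows one way to
do it, plus what sending only the final values could look like on the
sender side. Both function names are made up, and neither is taken
from fencing.py or pacemaker.

```python
def parse_agent_stdin(lines):
    """Parse key=value lines; a later duplicate replaces an earlier one."""
    options = {}
    for line in lines:
        line = line.strip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip blank or malformed lines
        options[key.strip()] = value
    return options


def final_values_only(lines):
    """Collapse duplicates so that each parameter is emitted exactly once."""
    return ["%s=%s" % (k, v) for k, v in parse_agent_stdin(lines).items()]


received = ["action=list", "param1=value1", "action=off", "action=list"]
print(parse_agent_stdin(received)["action"])  # list
print(final_values_only(received))
```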