[Pacemaker] Multi-site support in pacemaker (tokens, deadman, CTR)

Gao,Yan ygao at novell.com
Fri May 13 08:46:58 EDT 2011


Hi Andrew, Lars,

Thanks for the comments!

For the convenience of further development, and before we reach a
consensus on the CIB syntax, I have temporarily implemented the syntax
as follows -- it's somewhat different from my previous idea. :-)
Please also see the attached patch, which includes the schema, the
internal data structures and the functions for unpacking the
configuration.

The definitions of tickets:
<cib ...>
  <configuration>
     ...
    <tickets>
      <ticket id="ticketA" loss-policy="stop"/>
      <ticket id="ticketB" loss-policy="fence"/>
      <ticket id="ticketC" loss-policy="freeze"/>
    </tickets>
...

The state of tickets:

...
  <status>
    <cluster_state>
      <transient_attributes id="cluster">
        <instance_attributes id="status-cluster">
          <nvpair id="status-cluster-granted-ticket-ticketA"
name="granted-ticket-ticketA" value="true"/>
          <nvpair id="status-cluster-last-granted-ticket-ticketA"
name="last-granted-ticket-ticketA" value="1305259696"/>
          <nvpair id="status-cluster-granted-ticket-ticketB"
name="granted-ticket-ticketB" value="1"/>
          <nvpair id="status-cluster-last-granted-ticket-ticketB"
name="last-granted-ticket-ticketB" value="1305259760"/>
        </instance_attributes>
      </transient_attributes>
    </cluster_state>
    <node_state...
...

Even though I've already written some code for it, this is still open
for discussion. So if you think anything is inappropriate, or that I
misunderstood any of your points in the long mails, please let me
know. ;-)

The "last-granted" time would be automatically set whenever the ticket
is granted. I'm not sure that we need to store the expiration time in
cib and even in PE data, given handling expiration in PE is not
preferred. Unless the facilities really need to store the expiration
time in cib. Anyway, That's trivial if it's stored in "cluster_state".
That would not even affect the schema.
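For instance, if the facilities did turn out to need it, an expiration
timestamp could be stored as just one more nvpair under
"cluster_state" -- the nvpair below is purely illustrative and not part
of the attached patch:

          <nvpair id="status-cluster-expires-ticket-ticketA"
                  name="expires-ticket-ticketA" value="1305263296"/>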

Other replies and issues below:

On 05/12/11 19:47, Lars Marowsky-Bree wrote:
> On 2011-04-29T03:33:00, "Gao,Yan" <ygao at novell.com> wrote:
> 
>>> Yes; a ticket section, just like that.
>> All right. How about the schema:
>>     <element name="configuration">
>>       <interleave>
>> ...
>>         <element name="tickets">
>>           <zeroOrMore>
>>             <element name="ticket_set">
>>               <externalRef href="nvset.rng"/>
>>             </element>
>>           </zeroOrMore>
>>         </element>
>> ...
> 
> Makes sense to me.
Because we would need to set "loss-policy", and probably the
"owned-policy"/"bid" which you mentioned, one nvpair per ticket is
neither sufficient nor clear.
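To make that concrete, a ticket element could carry all of them as
attributes. The "owned-policy"/"bid" attributes below are only a sketch
of the idea and are not in the attached patch:

    <tickets>
      <ticket id="ticketA" loss-policy="stop" owned-policy="start" bid="100"/>
    </tickets>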

> 
>>> Personally, I lean towards this. (Andrew has expressed a wish to do
>>> without the "rsc_" prefix, so lets drop this ;-)
>> Well then, how about "ticket_dep" or "ticket_req"?
> 
> The first sounds a bit better to me.
I'm still not quite sure whether we have to introduce a new kind of
constraint, or whether we could just extend "colocation"/"order", which
might let us reuse some existing code...
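Just to make the two options concrete: the first line is an instance of
a dedicated constraint (per the "ticket_dep" schema quoted further
below), the second a hypothetical "ticket" attribute bolted onto the
existing colocation syntax -- both are sketches, not implemented syntax:

    <ticket_dep id="rsc1-req-ticketA" rsc="rsc1" rsc-role="Master" ticket="ticketA"/>
    <rsc_colocation id="rsc1-with-ticketA" rsc="rsc1" rsc-role="Master" ticket="ticketA" score="INFINITY"/>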

> 
>>> Not sure the kind="Deadman" is actually required, but it probably makes
>>> sense to be able to switch off the big hammer for debugging purposes.
>>> ;-)
>> I was thinking it's for switching on/off "immediately fence once the
>> dependency is no longer satisfied".
> 
> Agreed. I was saying that this is only for debugging purposes.
This might not be so useful if we can set the "loss-policy" in the
ticket definition.
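That is, rather than a per-constraint switch, the big hammer could
simply be requested where the ticket is defined. The first line below
is the syntax from the patch; the second is the hypothetical
per-constraint variant being discussed:

    <ticket id="ticketB" loss-policy="fence"/>
    <ticket_dep id="rsc1-req-ticketB" rsc="rsc1" ticket="ticketB" kind="Deadman"/>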

> 
>>>
>>> I don't see why any resource would depend on several tickets; but I can
>>> see a use case for wanting to depend on _not_ owning a ticket, similar
>>> to the node attributes. And the resource would need a role, obviously.
>> OK. The schema I can imagine:
>>
>>   <define name="element-ticket_dep">
>>     <element name="ticket_dep">
>>       <attribute name="id"><data type="ID"/></attribute>
>>       <choice>
>>         <oneOrMore>
>>           <ref name="element-resource-set"/>
>>         </oneOrMore>
>>         <group>
>>           <attribute name="rsc"><data type="IDREF"/></attribute>
>>           <optional>
>>             <attribute name="rsc-role">
>>               <ref name="attribute-roles"/>
>>             </attribute>
>>           </optional>
>>         </group>
>>       </choice>
>>       <attribute name="ticket"><text/></attribute>
> 
> Actually, if we were to define a new type for the ticket list, we could
> reference the id of the ticket element here and only allow configured
> tickets.
Right.
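If tickets are configured in their own section as above, the schema
could then reference them by id instead of accepting free text,
roughly like this (a plain IDREF would still not restrict the reference
to ticket elements specifically, so extra validation might be needed):

      <attribute name="ticket"><data type="IDREF"/></attribute>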

> 
>>> A site that does not own a ticket but has a non-zero value for it
>>> defined will request the ticket from the CTR; the CTR will grant it to
>>> the site with the highest bid (but not to a site with 0)
>> The site with the highest "bid" is being revoked the ticket.
> 
> Hm? No, I thought the site with the highest bid would be granted the
> ticket, not have it revoked.

Sorry, I misunderstood that. Any state change should be triggered by a
change of "bid" values, right?

As far as I understand, a "bid" value determines the behaviors of both
the cluster transition and the CTR. For example:
A site has the ticket, now its ticket is about to be revoked. There're
two possibilities:

- There's a higher bid from another site now:
   CTR sets "granted-ticket-ticketA=false" for this site. Pengine
invokes the specified "loss-policy". When the transition is successfully
done, CTR sets "granted-ticket-ticketA=true" for the site with the
higher bid. -- Can CTR know if/when the transition is successfully done?

2. The site changes its bid to "0" for any reason to give up the ticket.
   Pengine stops the relevant resources. When the transition is
successfully done, CTR is aware of that and sets
"granted-ticket-ticketA=false" for this site and then sets
"granted-ticket-ticketA=true" for the site with the higher bid.

So the CTR should be able to coordinate these actions?
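For case 1, I would expect the handover to show up in the two sites'
status sections roughly like this (sketch only), with the second change
applied only after the first site's transition has completed:

    <!-- site losing the ticket -->
    <nvpair id="status-cluster-granted-ticket-ticketA"
            name="granted-ticket-ticketA" value="false"/>
    <!-- site with the highest bid, afterwards -->
    <nvpair id="status-cluster-granted-ticket-ticketA"
            name="granted-ticket-ticketA" value="true"/>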

Should the "bid" go into "ticket" definition (that could cover the
owned-policy=(start|stop) you mentioned, right?) ? Or probably into
"cluster_state"? Or as an "instance_attribute" of a CTR?

> 
>> Should it clear the "bid" also? Otherwise it will get the ticket again
>> soon after?
> 
> My thinking was that the "bid" is actually an on-going event/finite
> state machine. Whenever a state change occurs - connection dropped
> somewhere, bid value changed, a site becomes non-quorate and gives up
> its ticket, etc - the CTRs that are still alive re-evaluate the grants
> and decide anew where the tickets go.
That means the CTRs would need to record the state of every site, along
with the current bid values, to determine their next actions? That
sounds a bit complex. If it is unavoidable, the finite state machine
needs to be really reliable and cover all possible states.

> 
>>> (Tangent - ownership appears to belong to the status section; the value
>>> seems belongs to the cib->ticket section(?).)
>> Perhaps. Although there's no appropriate place to set a cluster-wide
>> attribute in the status section so far.
> 
> Right, and it is actually not so easy, since the status section is
> reconstructed from each node at times.
I think the newly introduced "cluster_state" section could be created
when the first ticket is granted.

> 
>> Other solutions are:
>> A "ticket" is not a nvpair. It is
>>
>> - An object with "ownership" and "bid" attributes.
>> Or:
>> - A nvpair-set which includes the "ownership" and "bid" nvpairs.
> 
> Both might work. The first - an element with several attributes - might
> work best, I think, since it's a bit cleaner, and would allow us to do
> more validation when dependencies are defined.
Indeed.
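So, in the status section, something along these lines -- the element
and attribute names here are just placeholders for the idea:

      <tickets_state>
        <ticket_state id="ticketA" granted="true"
                      last-granted="1305259696" bid="100"/>
      </tickets_state>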

Thanks,
  Yan
-- 
Gao,Yan <ygao at novell.com>
Software Engineer
China Server Team, SUSE.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pacemaker-tickets.diff
Type: text/x-patch
Size: 9226 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20110513/91fecc20/attachment-0003.bin>

