[Pacemaker] booth is in the "started" state on pacemaker before booth writes ticket info to the CIB.

Thu Jan 31 01:05:41 EST 2013

Hi Yuichi

I have created two patches to try to fix this issue.

These patches extend lockfile() so that it records not only the daemon pid
but also the daemon's startup status ("starting" or "started").
They also modify the logic of the controld RA so that it can read
that status and return a more precise result.
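
To make the idea concrete, here is a minimal sketch in Python. It is not the
code from the attached patches. The file layout ("<pid> <status>" on one
line), the helper names open_lockfile(), record_status() and monitor(), the
use of flock(), and the choice to report "starting" as "not running" are all
assumptions made for illustration. Only the two status strings and the
general approach come from the description above.

```python
import fcntl
import os

# Exit codes from the OCF resource agent API.
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


def open_lockfile(path):
    """Open `path` and take an exclusive, non-blocking lock on it.

    Raises OSError if another process already holds the lock.
    The caller keeps the returned descriptor open to keep the lock.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    return fd


def record_status(fd, pid, status):
    """Overwrite the lockfile with "<pid> <status>" (hypothetical format)."""
    os.ftruncate(fd, 0)
    os.lseek(fd, 0, os.SEEK_SET)
    os.write(fd, f"{pid} {status}\n".encode())


def monitor(path):
    """What a resource agent's monitor action could do with the lockfile.

    Assumption: only "started" counts as running; a missing file, a
    malformed file, a dead pid, or "starting" all count as not running.
    """
    try:
        with open(path) as f:
            fields = f.read().split()
    except FileNotFoundError:
        return OCF_NOT_RUNNING
    if len(fields) != 2 or not fields[0].isdigit() or int(fields[0]) <= 0:
        return OCF_NOT_RUNNING
    pid, status = int(fields[0]), fields[1]
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return OCF_NOT_RUNNING
    except PermissionError:
        pass  # the process exists but belongs to another user
    return OCF_SUCCESS if status == "started" else OCF_NOT_RUNNING
```

In this sketch the daemon would call record_status(fd, os.getpid(),
"starting") right after taking the lock, and switch to "started" only once
the ticket information has been written, so the monitor cannot report
success too early.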

Would you mind testing them to see if they work for you?

Regards,
 Xia Li

>>> On 1/23/2013 at 12:43 PM, in message
<CAMb0o5J2JS3earuk5Z++O+z7p5f+5z=mMwzRNA7H+5nzeCtJJA at mail.gmail.com>, Yuichi
SEINO <seino.cluster2 at gmail.com> wrote: 
> Hi Jiaju, 
>  
> I understand the complete solution.
> However, because this issue causes the critical problem that multiple
> resources start, could you apply this request or simply revert a
> commit as a tentative fix until you resolve the issue in the
> summer? I think it is difficult for us to avoid this issue
> operationally, unlike the booth deadlock etc. If booth does not start
> at the same time, then booth can avoid the deadlock.
>  
> This issue causes the following problems.
> * Multiple resources start.
> * When booth deadlocks, the resource timeout does not happen.
> Previously, we could see the timeout in crm_mon. Currently, the
> timeout happens only after booth has daemonized.
>  
> Sincerely, 
> Yuichi 
>  
> 2013/1/21 Jiaju Zhang <jjzhang at suse.de>: 
> > Hi Yuichi, 
> > 
> > On Fri, 2013-01-18 at 17:02 +0900, Yuichi SEINO wrote: 
> >> Hi Jiaju, 
> >> 
> >> I tried fixing this issue by reverting a commit. What do you think about it?
> >> https://github.com/jjzhang/booth/pull/48 
> > 
> > Moving the whole setup stage before daemonizing does not seem to be a
> > sane solution. setup_ticket() needs to get the latest ticket
> > information by communicating with other nodes. Currently this is done
> > there using TCP, but the long-term, sane solution would be to move it
> > to the main poll(), asynchronously waiting for the catch-up result.
> > Before the catch-up is complete, booth can still respond; it can
> > participate in Paxos as a non-voting member.
> > 
> > To fix this issue, what do you think about removing the stale ticket
> > information in the CIB when booth is starting? We already have the APIs
> > in pacemaker.c which can clear the ticket information in the CIB. This
> > step is reasonable because the tickets at that moment are really stale
> > data.
> > 
> > About the implementation, I have not thought it through in detail, but
> > one idea that came to mind is that we could extend lockfile() (or
> > some wrapper around lockfile()) to do more: record not only
> > the daemon pid but also the daemon's startup status, like "starting" or
> > "started", so that the controld RA can read that status and return a
> > more precise result.
> > 
> > I'll have Xia look into this problem in more detail.
> > 
> > Thanks, 
> > Jiaju 
> > 
> > 
>  
>  
>  
> -- 
> Yuichi SEINO 
> METROSYSTEMS CORPORATION 
> E-mail:seino.cluster2 at gmail.com 
>  
>  

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: fix_booth_state_issue.patch
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20130130/182cf1a0/attachment-0006.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: fix_booth_state_issue_RA.patch
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20130130/182cf1a0/attachment-0007.ksh>