[Pacemaker] SLES 11 SP3 boothd behaviour

Sutherland, Rob RSutherland at BroadViewNet.com
Mon Aug 25 15:43:34 EDT 2014


Hello all,

We're in the process of implementing geo-redundancy on SLES 11 SP3 (booth version 0.1.0). We are seeing behavior in which site 2 in a geo-cluster decides that the ticket has expired long before the actual expiry. Here's an example timeline:

1 - All sites (site 1, site 2 and arbitrator) agree on the ticket owner and expiry, i.e. site 2 holds the ticket with a 60-second expiry:
Aug 25 10:07:10 linux-4i31 booth-arbitrator: [22526]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975690' was executed
Aug 25 10:07:10 bb5Btas0 booth-site: [27782]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975690' was executed
Aug 25 10:07:10 bb5Atas1 booth-site: [7826]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975690' was executed

2 - After 48 seconds (80% into the lease), the expiry is pushed out by another 60 seconds and all three nodes are still in agreement:
Site 2:
Aug 25 10:07:58 bb5Btas0 booth-site: [27782]: info: command: 'crm_ticket -t geo-ticket -S owner -v 2' was executed
Aug 25 10:07:58 bb5Btas0 booth-site: [27782]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975738' was executed

The arbitrator:
Aug 25 10:07:58 linux-4i31 crm_ticket[23836]:   notice: crm_log_args: Invoked: crm_ticket -t geo-ticket -S owner -v 2
Aug 25 10:07:58 linux-4i31 booth-arbitrator: [22526]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975738' was executed

Site 1:
Aug 25 10:07:58 bb5Atas1 booth-site: [7826]: info: command: 'crm_ticket -t geo-ticket -S owner -v 2' was executed
Aug 25 10:07:58 bb5Atas1 booth-site: [7826]: info: command: 'crm_ticket -t geo-ticket -S expires -v 1408975738' was executed

3 - Site 2 decides that the ticket has expired (at the expiry time set in step 1):
Aug 25 10:08:10 bb5Btas0 booth-site: [27782]: debug: lease expires ...

4 - At 10:08:58, both site 1 and the arbitrator expire the lease and pick a new master.
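
For reference, the values passed to `-S expires` in the log lines above are Unix epoch seconds. A minimal sketch for converting them to the local time used in the syslog timestamps, assuming those timestamps are EDT (a fixed UTC-4 offset); the helper name `expiry_local` is only for illustration:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the syslog timestamps are EDT (UTC-4), as in the date of this message.
EDT = timezone(timedelta(hours=-4), "EDT")

def expiry_local(epoch_seconds):
    """Format a ticket 'expires' value (epoch seconds) as HH:MM:SS in EDT."""
    return datetime.fromtimestamp(epoch_seconds, tz=EDT).strftime("%H:%M:%S")

print(expiry_local(1408975690))  # expiry set in step 1
print(expiry_local(1408975738))  # expiry set in step 2
```

The first value comes out as 10:08:10, which is when site 2 logs "lease expires"; the second is 10:08:58, which is when site 1 and the arbitrator expire the lease.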

I presume that there was some missed communication between site 2 and the rest of the geo-cluster. There is nothing in the logs to help debug this, though. Any hints on debugging this?
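
In case it helps anyone compare logs from the three nodes, here is a rough sketch that pulls the last `expires` value each host recorded out of merged syslog lines. It assumes the booth log line format shown above and nothing else; the regular expression and the function name `last_expiry_per_host` are my own, not part of booth.

```python
import re

# Matches only booth lines of the form quoted above, e.g.
#   Aug 25 10:07:10 host booth-site: [pid]: info: command:
#   'crm_ticket -t geo-ticket -S expires -v 1408975690' was executed
# Lines in any other format are ignored.
LINE_RE = re.compile(
    r"^\w{3} +\d+ [\d:]{8} (?P<host>\S+) booth-\w+: .*"
    r"'crm_ticket -t (?P<ticket>\S+) -S expires -v (?P<expires>\d+)'"
)

def last_expiry_per_host(lines):
    """Return {host: last 'expires' epoch value seen for that host}."""
    latest = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            latest[match.group("host")] = int(match.group("expires"))
    return latest
```

Feeding it the lines from steps 1 and 2 gives the same value, 1408975738, for all three hosts, which matches what we see: site 2 recorded the later expiry but still expired the lease at the earlier one.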

BTW: we only ever see this on a site 2 - never a site 1. This is consistent across several labs. Is there a bias towards site 1?

Thanks in advance,

Rob

