[ClusterLabs] Re: No slave is promoted to be master

Mon Apr 16 02:10:34 UTC 2018

Thank you, Rorthais. I see now.

-----Original Message-----
From: Jehan-Guillaume de Rorthais [mailto:jgdr at dalibo.com] 
Sent: April 13, 2018 17:17
To: 范国腾 <fanguoteng at highgo.com>
Cc: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Subject: Re: [ClusterLabs] No slave is promoted to be master

OK, I know what happened.

It seems your standbys were not replicating when the master "crashed". You can find many messages like these in the log files:

  WARNING: No secondary connected to the master
  WARNING: "db2" is not connected to the primary
  WARNING: "db3" is not connected to the primary

When a standby is not replicating, the master sets a negative master score on it to forbid its promotion, as it is probably lagging by some unknown amount of time.

The following command shows the scores just before the simulated master crash:

  $ crm_simulate -x pe-input-2039.bz2 -s|grep -E 'date|promotion'
  Using the original execution date of: 2018-04-11 16:23:07Z
  pgsqld:0 promotion score on db1: 1001
  pgsqld:1 promotion score on db2: -1000
  pgsqld:2 promotion score on db3: -1000
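
As a rough illustration, the promotion-score lines above can be checked mechanically. This is a minimal sketch, not part of any Pacemaker tool: the helper names are hypothetical, and it assumes the line format shown in this output and that only strictly positive scores make a standby eligible for promotion (consistent with the -1000 versus 1000-range values discussed here). Applied to the output above, it finds no standby that could take over from db1.

```python
import re

# Matches lines such as "pgsqld:1 promotion score on db2: -1000"
# (format taken from the crm_simulate output quoted above).
SCORE_RE = re.compile(r"^\s*(\S+) promotion score on (\S+): (-?\d+)\s*$")


def parse_promotion_scores(text):
    """Hypothetical helper: return {node: score} for each promotion-score line."""
    scores = {}
    for line in text.splitlines():
        m = SCORE_RE.match(line)
        if m:
            _resource, node, score = m.groups()
            scores[node] = int(score)
    return scores


def promotable_standbys(scores, master):
    """Hypothetical helper: nodes other than the master with a positive score.

    Assumption: a negative (or zero) score means the node cannot be promoted.
    """
    return sorted(n for n, s in scores.items() if n != master and s > 0)


output = """\
Using the original execution date of: 2018-04-11 16:23:07Z
pgsqld:0 promotion score on db1: 1001
pgsqld:1 promotion score on db2: -1000
pgsqld:2 promotion score on db3: -1000
"""

scores = parse_promotion_scores(output)
print(scores)                              # {'db1': 1001, 'db2': -1000, 'db3': -1000}
print(promotable_standbys(scores, "db1"))  # []
```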

The "1001" score designates the master. Streaming standbys always have a positive master score between 1000 and 1000-N*10, where N is the number of connected standbys.
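
The scoring rule described above can be modelled roughly as follows. This is a hypothetical sketch of the rule as stated in this message, not the resource agent's actual code: the function name is invented, and ranking the connected standbys from best to worst (for example by replication lag) is an assumption. The first connected standby gets 1000, the next 990, and so on, while any standby that is not streaming gets -1000. With no standby connected, as in this thread, it reproduces the scores shown by crm_simulate.

```python
def assign_promotion_scores(master, standbys, connected):
    """Hypothetical model of the master-score rule described above.

    master:    name of the primary node (always gets 1001)
    standbys:  all standby node names
    connected: streaming standbys, ordered from best to worst
               (the ordering criterion is an assumption)
    """
    scores = {master: 1001}
    for rank, node in enumerate(connected):
        # 1000, 990, 980, ... so N connected standbys stay within
        # the range 1000 down to 1000 - N*10.
        scores[node] = 1000 - rank * 10
    for node in standbys:
        if node not in scores:
            # Not replicating: forbid promotion, it may be lagging
            # by an unknown amount of time.
            scores[node] = -1000
    return scores


# Situation in this thread: neither standby was streaming.
print(assign_promotion_scores("db1", ["db2", "db3"], []))
# {'db1': 1001, 'db2': -1000, 'db3': -1000}
```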

On Fri, 13 Apr 2018 01:37:54 +0000
范国腾 <fanguoteng at highgo.com> wrote:

> The log is in the attachment.
> 
> We introduced a bug into the PG code on the master node so that it 
> could not be restarted anymore, in order to test the following 
> scenario: a slave should be promoted when the master crashes.
> 
> -----Original Message-----
> From: Jehan-Guillaume de Rorthais [mailto:jgdr at dalibo.com]
> Sent: April 12, 2018 17:39
> To: 范国腾 <fanguoteng at highgo.com>
> Cc: Cluster Labs - All topics related to open-source clustering 
> welcomed <users at clusterlabs.org>
> Subject: Re: [ClusterLabs] No slave is promoted to be master
> 
> Hi,
> On Thu, 12 Apr 2018 08:31:39 +0000
> 范国腾 <fanguoteng at highgo.com> wrote:
> 
> > Thank you very much for helping check this issue. The information is in 
> > the attachment.
> > 
> > I restarted the cluster after I sent my first email. I am not sure 
> > whether it affects checking the result of "crm_simulate -sL".
> 
> It does...
> 
> Could you please provide the files
> from /var/lib/pacemaker/pengine/pe-input-2039.bz2 to pe-input-2065.bz2?
> 
> [...]
> > Then the master was restarted and it could not start (that is OK, and 
> > we know the reason).
> 
> Why couldn't it start?

--
Jehan-Guillaume de Rorthais
Dalibo