[ClusterLabs] the PAF switchover does not happen if the VIP resource is stopped

Thu Apr 26 08:35:51 EDT 2018

On Thu, 26 Apr 2018 08:41:22 +0000
范国腾 <fanguoteng at highgo.com> wrote:

> Does it mean that once a node has had a resource failure, it can no longer
> be promoted to master unless I run the pcs cleanup to clear the failcount?

Each time you have a failure, you will have to handle it sooner or later. The
cluster might be able to auto-heal itself.

So yes, if you have a resource failure, you must take care of it as soon as
possible and fix whatever needs fixing. Then reset the failcount to restore
normal behavior. Note that you could set failure-timeout to "forget"
failcounts after some delay, but really, you should not, for your database's
sake (it could still be a valid parameter for some other setups, though).
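
A minimal sketch of that inspect-then-reset sequence, wrapped in shell
functions. It assumes pcs and crm_simulate are installed on the node. The
function names are only illustrative, and the exact pcs syntax may differ
between versions:

```shell
# Illustrative helpers, not part of PAF or pcs.

# Show the failcounts and allocation scores, then clear the failure
# state of the resource given as $1 once the root cause is fixed.
reset_failcount() {
    pcs resource failcount show "$1"   # failcount of the resource, per node
    crm_simulate -sL                   # scores computed from the live cluster
    pcs resource cleanup "$1"          # forget recorded failures for $1
}

# Let the cluster expire failures of resource $1 after duration $2.
# Not advised for a database resource; shown only for other setups.
expire_failures_after() {
    pcs resource meta "$1" failure-timeout="$2"
}
```

For example, "reset_failcount mastergroup" would apply this to the group
cleaned up earlier in this thread.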

> I am testing whether the cluster can still work if the VIP resource goes
> down for some reason. So I only ifdown the VIP network card (enp0s3), not
> the heartbeat network card (enp0s8).

If your pgsql cluster relies on the vIP for replication between nodes, then
no, your cluster cannot work without it. PgSQL instances will still be up, but
they will no longer be replicating.
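
One way to check this is to look for the vIP on the interface and to query
pg_stat_replication on the master. A small sketch, assuming psql can connect
locally as the current user and that enp0s3 carries the vIP as in your setup.
The function names are made up for illustration:

```shell
# Illustrative helpers, not part of PAF.

# List the IPv4 addresses on interface $1 (on the master, the vIP
# should be among them).
show_addresses() {
    ip -4 -o addr show dev "$1"
}

# On the master: one row per connected standby. No rows means nothing
# is replicating from this instance.
show_replication() {
    psql -At -c "SELECT application_name, client_addr, state FROM pg_stat_replication"
}
```

Running "show_addresses enp0s3" and then "show_replication" on the master
after each ifup/ifdown step shows whether the standby has reconnected.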

> -----Original Message-----
> From: Jehan-Guillaume de Rorthais [mailto:jgdr at dalibo.com]
> Sent: 2018-04-26 16:02
> To: 范国腾 <fanguoteng at highgo.com>
> Cc: Cluster Labs - All topics related to open-source clustering welcomed
> <users at clusterlabs.org>; 李梦怡 <limengyi at highgo.com>
> Subject: Re: [ClusterLabs] the PAF switchover does not happen if the VIP
> resource is stopped
> 
> On Thu, 26 Apr 2018 07:53:07 +0000
> 范国腾 <fanguoteng at highgo.com> wrote:
> 
> > 1. There is no failure in initial status. sds1 is master
> > 
> > [cid:image001.png at 01D3DD75.3F4BF110]  
> 
> yes.
> 
> > 2. ifdown the sds1 VIP network card.
> > 
> > [cid:image002.png at 01D3DD75.71D5DE70]  
> 
> ok, failcount and -inf score appear.
> 
> > 3. ifup the sds1 VIP network card and then ifdown sds2 VIP network card
> > 
> > [cid:image003.png at 01D3DD76.26C5E820]  
> 
> Now failcount and -inf score everywhere.
> 
> I'm not sure I understand your mail. Do you have a question?
> 
> > -----Original Message-----
> > From: Jehan-Guillaume de Rorthais [mailto:jgdr at dalibo.com]
> > Sent: 2018-04-26 15:07
> > To: 范国腾 <fanguoteng at highgo.com>
> > Cc: Cluster Labs - All topics related to open-source clustering
> > welcomed <users at clusterlabs.org>; 李梦怡 <limengyi at highgo.com>
> > Subject: Re: [ClusterLabs] the PAF switchover does not happen if the VIP
> > resource is stopped
> > 
> > On Thu, 26 Apr 2018 02:53:33 +0000
> > 范国腾 <fanguoteng at highgo.com> wrote:
> > 
> > > Hi Rorthais,
> > > 
> > > Thank you for your help.
> > > 
> > > The replication was working at that time.
> > > 
> > > I tried again today.
> > > (1) If I run "ifup enp0s3" on node2, then run "ifdown enp0s3" on node1,
> > > the switchover issue can be reproduced.
> > > (2) But if I run "ifup enp0s3" on node2, run "pcs resource cleanup
> > > mastergroup" to clean up the VIP resource so that there are no Failed
> > > Actions in "pcs status", and then run "ifdown enp0s3" on node1, it works.
> > > The switchover can happen again.
> > > 
> > > Is there any parameter to control this behavior so that I don't need to
> > > execute the "pcs cleanup" command every time?
> > 
> > Check the failcounts for each resource on each node (pcs resource
> > failcount [...]).
> > 
> > Check the scores as well (crm_simulate -sL).
> > 
> > > -----Original Message-----
> > > From: Jehan-Guillaume de Rorthais [mailto:jgdr at dalibo.com]
> > > Sent: 2018-04-25 18:39
> > > To: 范国腾 <fanguoteng at highgo.com>
> > > Cc: Cluster Labs - All topics related to open-source clustering
> > > welcomed <users at clusterlabs.org>; 李梦怡 <limengyi at highgo.com>
> > > Subject: Re: [ClusterLabs] the PAF switchover does not happen if the VIP
> > > resource is stopped
> > > 
> > > On Wed, 25 Apr 2018 08:58:34 +0000
> > > 范国腾 <fanguoteng at highgo.com> wrote:
> > > 
> > > > Our lab has two resources: (1) PAF (master/slave) and (2) a VIP (bound
> > > > to the master PAF node). The configuration is in the attachment.
> > > > 
> > > > Each node has two network cards: one (enp0s8) is for the pacemaker
> > > > heartbeat on the internal network, the other (enp0s3) is for the master
> > > > VIP on the external network.
> > > > 
> > > > We are testing the following case: if the master VIP network card is
> > > > down, the master postgres and VIP can switch to another node.
> > > > 
> > > > 1. At first, node2 is master. I run "ifdown enp0s3" on node2, then
> > > > node1 becomes the master. That is ok.
> > > > 
> > > > 2. Then I run "ifup enp0s3" on node2, wait for 60 seconds,
> > > 
> > > Did you check PostgreSQL instances were replicating again?
> > > 
> > > > then run "ifdown enp0s3" on node1, but node1 is still master.
> > > > Why doesn't the switchover happen? How can I recover to make the
> > > > system work?

-- 
Jehan-Guillaume de Rorthais
Dalibo