<div dir="ltr"><div dir="ltr">Here is the command output from crm configure show:<div><br></div><div><div>node 1: master \</div><div><span style="white-space:pre">   </span>attributes master-PGSQL=1001</div><div>node 2: secondary \</div><div><span style="white-space:pre">        </span>attributes master-PGSQL=1000</div><div>primitive AWSVIP awsvip \</div><div><span style="white-space:pre">  </span>params secondary_private_ip=10.x.x.x api_delay=5</div><div>primitive PGSQL pgsqlms \</div><div><span style="white-space:pre">      </span>params pgdata="/var/lib/postgresql/9.5/main" bindir="/usr/lib/postgresql/9.5/bin" pghost="/var/run/postgresql/" recovery_template="/etc/postgresql/9.5/main/recovery.conf.pcmk" start_opts="-c config_file=/etc/postgresql/9.5/main/postgresql.conf" \</div><div><span style="white-space:pre">      </span>op start timeout=60s interval=0 \</div><div><span style="white-space:pre">     </span>op stop timeout=60s interval=0 \</div><div><span style="white-space:pre">      </span>op promote timeout=15s interval=0 \</div><div><span style="white-space:pre">   </span>op demote timeout=120s interval=0 \</div><div><span style="white-space:pre">   </span>op monitor interval=15s timeout=10s role=Master \</div><div><span style="white-space:pre">     </span>op monitor interval=16s timeout=10s role=Slave \</div><div><span style="white-space:pre">      </span>op notify timeout=60 interval=0</div><div>primitive fencing-master stonith:external/ec2 \</div><div><span style="white-space:pre"> </span>params port=master \</div><div><span style="white-space:pre">  </span>op start interval=0s timeout=60s \</div><div><span style="white-space:pre">    </span>op monitor interval=360s timeout=60s \</div><div><span style="white-space:pre">        </span>op stop interval=0s timeout=60s</div><div>primitive fencing-secondary stonith:external/ec2 \</div><div><span style="white-space:pre">      </span>params port=secondary \</div><div><span style="white-space:pre">       
</span>op start interval=0s timeout=60s \</div><div><span style="white-space:pre">    </span>op monitor interval=360s timeout=60s \</div><div><span style="white-space:pre">        </span>op stop interval=0s timeout=60s</div><div>ms PGSQL-HA PGSQL \</div><div><span style="white-space:pre">     </span>meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interleave=true</div><div>colocation IPAWSIP-WITH-MASTER inf: AWSVIP PGSQL-HA:Master</div><div>order demote-then-stop-ip Mandatory: _rsc_set_ PGSQL-HA:demote AWSVIP:stop symmetrical=false</div><div>location loc-fence-master fencing-master -inf: master</div><div>location loc-fence-secondary fencing-secondary -inf: secondary</div><div>order promote-then-ip Mandatory: _rsc_set_ PGSQL-HA:promote AWSVIP:start symmetrical=false</div><div>property cib-bootstrap-options: \</div><div><span style="white-space:pre">    </span>have-watchdog=false \</div><div><span style="white-space:pre"> </span>dc-version=1.1.14-70404b0 \</div><div><span style="white-space:pre">   </span>cluster-infrastructure=corosync \</div><div><span style="white-space:pre">     </span>cluster-name=pgc-psql-ha \</div><div><span style="white-space:pre">    </span>stonith-enabled=true \</div><div><span style="white-space:pre">        </span>no-quorum-policy=ignore \</div><div><span style="white-space:pre">     </span>maintenance-mode=false \</div><div><span style="white-space:pre">      </span>last-lrm-refresh=1551885417</div><div>rsc_defaults rsc-options: \</div><div><span style="white-space:pre"> </span>resource-stickiness=10 \</div><div><span style="white-space:pre">      </span>migration-threshold=1</div></div><div><br></div><div>Should I change any of those timeout parameters in order to avoid timeout? 
</div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 19 Apr 2019 at 12:23, Danka Ivanović <<a href="mailto:danka.ivanovic@gmail.com">danka.ivanovic@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Thanks for the clarification about failure-timeout, migration threshold and pacemaker.<div><div dir="ltr"><div>Instances are hosted on AWS cloud, and they are in the same security groups and availability zones.</div><div>I don't have information about hardware which hosts those VMs since they are non dedicated. UTC timezone is configured on both machines and default ntp configuration.</div><div><div>     remote           refid      st t when poll reach   delay   offset  jitter</div><div>==============================================================================</div><div> 0.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000</div><div> 1.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000</div><div> 2.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000</div><div> 3.ubuntu.pool.n .POOL.          16 p    -   64    0    0.000    0.000   0.000</div><div> <a href="http://ntp.ubuntu.com" target="_blank">ntp.ubuntu.com</a>  .POOL.          16 p    -   64    0    0.000    0.000   0.000</div><div>+198.46.223.227  204.9.54.119     2 u   65  512  377   22.318    0.096   1.111</div><div>-time1.plumdev.n .GPS.            
1 u  116  512  377   72.487    1.386   0.544</div><div>-199.180.133.100 140.142.2.8      3 u  839 1024  377   65.574   -1.199   1.167</div><div>+helium.constant 128.59.0.245     2 u  217  512  377    7.368    0.952   0.090</div><div>*i.will.not.be.e 213.251.128.249  2 u  207  512  377   14.733    1.185   0.305</div><div><br></div></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 19 Apr 2019 at 11:46, Jehan-Guillaume de Rorthais <<a href="mailto:jgdr@dalibo.com" target="_blank">jgdr@dalibo.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fri, 19 Apr 2019 11:08:33 +0200<br>

Danka Ivanović <<a href="mailto:danka.ivanovic@gmail.com" target="_blank">danka.ivanovic@gmail.com</a>> wrote:<br>

<br>

> Hi,<br>

> Thank you for your response.<br>

> <br>

> Ok, It seems that fencing resources and secondary timed out at the same<br>

> time, together with ldap.<br>

> I understand that because of "migration-threshold=1", the standby tried to<br>

> recover just once and was then stopped. Is this OK, or should the threshold<br>

> be increased?<br>

<br>

It depends on your use case, really.<br>

<br>

Note that as soon as a resource hits its migration threshold, there's an implicit<br>

constraint forbidding it to come back on this node until you reset the<br>

failcount. That's why your pgsql master resource never came back anywhere.<br>

<br>

You can also set failure-timeout if you are brave enough to automate the<br>

failure reset. See:<br>

<a href="https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html" rel="noreferrer" target="_blank">https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html</a><br>
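<br>
To make the interaction between migration-threshold, the failcount and failure-timeout concrete, here is a toy model in Python. It is only a sketch of the bookkeeping described above, not Pacemaker's actual implementation: the class name FailureTracker, its methods and the timestamps are all hypothetical, and it assumes a failure is forgotten once failure-timeout seconds have passed without a new one.<br>

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FailureTracker:
    """Toy model of per-node failure bookkeeping (NOT Pacemaker's real code)."""

    migration_threshold: int = 1
    # Seconds after which an old failure is forgotten; None means never.
    failure_timeout: Optional[float] = None
    failcount: Dict[str, int] = field(default_factory=dict)
    last_failure: Dict[str, float] = field(default_factory=dict)

    def record_failure(self, node: str, now: float) -> None:
        """Count one failed operation on the given node."""
        self.failcount[node] = self.failcount.get(node, 0) + 1
        self.last_failure[node] = now

    def cleanup(self, node: str) -> None:
        """Manual failcount reset, the step an admin otherwise does by hand."""
        self.failcount.pop(node, None)
        self.last_failure.pop(node, None)

    def is_banned(self, node: str, now: float) -> bool:
        """True while the resource is not allowed to run on this node."""
        if self.failure_timeout is not None and node in self.last_failure:
            if now - self.last_failure[node] >= self.failure_timeout:
                self.cleanup(node)
        return self.failcount.get(node, 0) >= self.migration_threshold


# One failure per node with migration-threshold=1 and no failure-timeout:
# both nodes stay banned, so the resource has nowhere left to run.
tracker = FailureTracker(migration_threshold=1)
tracker.record_failure("secondary", now=0)
tracker.record_failure("master", now=6)
print(tracker.is_banned("secondary", now=3600))  # True
print(tracker.is_banned("master", now=3600))     # True

# After a manual reset, the node becomes eligible again.
tracker.cleanup("secondary")
print(tracker.is_banned("secondary", now=3600))  # False
```

With failure_timeout set (for example FailureTracker(failure_timeout=300)), the same failures would expire on their own, which is the automation mentioned above.<br>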

<br>

> The master server is started with systemctl, then pacemaker is started on<br>

> the master, which detects the master; then, when pacemaker is started on the secondary,<br>

> it brings up the postgres service in slave mode.<br>

<br>

You should not. Systemd should not mess with resources handled by Pacemaker.<br>

<br>

> I didn't manage to start the postgres master through pacemaker. I tested<br>

> failover with a setup like this and it works. I will try to set up postgres to<br>

> be run by pacemaker,<br>

<br>

Pacemaker is supposed to start the resource itself if it is enabled in its<br>

setup. Look at this whole chapter (its end is important):<br>

<a href="https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#starting-or-stopping-the-cluster" rel="noreferrer" target="_blank">https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#starting-or-stopping-the-cluster</a><br>

<br>

> but I am concerned about those timeouts, which<br>

> caused the cluster to crash. Can you help me investigate why this happened, or<br>

> what I should change in order to avoid it? The AWS virtual IP is an AWS<br>

> secondary IP.<br>

<br>

Really, I can't help with this. It looks like both VMs suddenly froze most of<br>

their processes, or maybe there was some kind of clock jump, exhausting the timeouts... I<br>

really don't know.<br>

<br>

It sounds more related to your virtualization stack I suppose. Maybe some kind<br>

of "hot" backup? Maybe the hypervisor didn't schedule enough CPU to your VMs<br>

for too long?<br>

<br>

It is surprising that both VMs had timeouts at almost the same time. Do you know if<br>

they are on the same hypervisor host? If they are, this is a SPoF: you should<br>

move one of them to another host.<br>

<br>

++<br>

<br>

> On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais <<a href="mailto:jgdr@dalibo.com" target="_blank">jgdr@dalibo.com</a>><br>

> wrote:<br>

> <br>

> > On Thu, 18 Apr 2019 14:19:44 +0200<br>

> > Danka Ivanović <<a href="mailto:danka.ivanovic@gmail.com" target="_blank">danka.ivanovic@gmail.com</a>> wrote:<br>

> ><br>

> ><br>

> ><br>

> > It seems you had timeouts for both fencing resources and your standby at<br>

> > the same time here:<br>

> >  <br>

> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op<br>

> > >   monitor for fencing-secondary on master: unknown error (1)<br>

> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op<br>

> > >   monitor for fencing-master on secondary: unknown error (1)<br>

> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op<br>

> > >   monitor for PGSQL:1 on secondary: unknown error (1)<br>

> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary<br>

> > >   away from master after 1 failures (max=1)<br>

> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master away<br>

> > >   from secondary after 1 failures (max=1)<br>

> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from<br>

> > >   secondary after 1 failures (max=1)<br>

> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away from<br>

> > >   secondary after 1 failures (max=1)<br>

> ><br>

> > Because you have "migration-threshold=1", the standby will be shut down:<br>

> >  <br>

> > > Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)  <br>

> ><br>

> > The transition is stopped because the pgsql master timed out in the<br>

> > meantime:<br>

> >  <br>

> > > Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,<br>

> > > Pending=0, Fired=0, Skipped=1, Incomplete=6,<br>

> > > Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped  <br>

> ><br>

> > and as you mentioned, your ldap as well:<br>

> >  <br>

> > > Apr 17 10:03:40 master nslcd[1518]: [d7e446] <group(all)> ldap_result()<br>

> > > timed out  <br>

> ><br>

> > Here are the four timeout errors (2 fencings and 2 pgsql instances):<br>

> >  <br>

> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op<br>

> > >   monitor for fencing-secondary on master: unknown error (1)<br>

> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op<br>

> > >   monitor for PGSQL:0 on master: unknown error (1)<br>

> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op<br>

> > >   monitor for fencing-master on secondary: unknown error (1)<br>

> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op<br>

> > >   monitor for PGSQL:1 on secondary: unknown error (1)  <br>
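<br>
One way to check which monitors failed together is to group the "Processing failed op" warnings by timestamp. The sketch below does that in Python for the four lines quoted just above, re-joined after mail wrapping. The helper name failed_ops_by_time is hypothetical, and the regular expression assumes the exact syslog layout shown in this thread, so it may need adjusting for other Pacemaker versions.<br>

```python
import re
from collections import defaultdict

# Sample lines copied from the log excerpt quoted in this thread
# (each entry re-joined onto one line after mail wrapping).
LOG = """\
Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for fencing-secondary on master: unknown error (1)
Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for PGSQL:0 on master: unknown error (1)
Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for fencing-master on secondary: unknown error (1)
Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op monitor for PGSQL:1 on secondary: unknown error (1)
"""

# Groups: timestamp, operation, resource, node, reason.
FAILED_OP = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ pengine\[\d+\]: warning: "
    r"Processing failed op (\S+) for (\S+) on (\S+): (.+)$"
)


def failed_ops_by_time(text):
    """Return {timestamp: [(resource, node, operation), ...]}."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        match = FAILED_OP.match(line)
        if match:
            stamp, operation, resource, node, _reason = match.groups()
            grouped[stamp].append((resource, node, operation))
    return dict(grouped)


for stamp, failures in failed_ops_by_time(LOG).items():
    print(stamp, len(failures))
    for resource, node, operation in failures:
        print("  ", operation, "of", resource, "on", node)
```

Run against a full log, several resources on different nodes sharing one timestamp would point to a cause outside any single resource.<br>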

> ><br>

> > As a reaction, Pacemaker decides to stop everything because it cannot move<br>

> > resources anywhere:<br>

> >  <br>

> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from<br>

> > > master after 1 failures (max=1)<br>

> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from<br>

> > > master after 1 failures (max=1)<br>

> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary<br>

> > > away from master after 1 failures (max=1)<br>

> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master away<br>

> > > from secondary after 1 failures (max=1)<br>

> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from<br>

> > > secondary after 1 failures (max=1)<br>

> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away from<br>

> > > secondary after 1 failures (max=1)<br>

> > > Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)<br>

> > > Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0 (Master -><br>

> > > Stopped master)<br>

> > > Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary)  <br>

> ><br>

> > Now, the following lines are really not expected. Why does systemd detect<br>

> > PostgreSQL as stopped?<br>

> >  <br>

> > > Apr 17 10:03:40 master postgresql@9.5-main[32458]: Cluster is not running.<br>

> > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Control<br>

> > > process exited, code=exited status=2<br>

> > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Unit<br>

> > > entered failed state.<br>

> > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Failed with<br>

> > > result 'exit-code'.<br>

> ><br>

> > I suspect the service is still enabled or has been started by hand.<br>

> ><br>

> > As soon as you set up a resource in Pacemaker, the admin should **always** ask<br>

> > Pacemaker to start/stop it. Never use systemctl to handle the resource<br>

> > yourself.<br>

> ><br>

> > You must disable this service in systemd.<br>

</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail-m_-7268986109580146188gmail_signature">Regards<br>Danka Ivanovic</div>

</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature">Regards<br>Danka Ivanovic</div>