<div dir="ltr">Thanks for the clarification about failure-timeout, migration threshold and pacemaker.<div><div dir="ltr"><div>Instances are hosted on AWS cloud, and they are in the same security groups and availability zones.</div><div>I don't have information about hardware which hosts those VMs since they are non dedicated. UTC timezone is configured on both machines and default ntp configuration.</div><div><div> remote refid st t when poll reach delay offset jitter</div><div>==============================================================================</div><div> 0.ubuntu.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.000</div><div> 1.ubuntu.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.000</div><div> 2.ubuntu.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.000</div><div> 3.ubuntu.pool.n .POOL. 16 p - 64 0 0.000 0.000 0.000</div><div> <a href="http://ntp.ubuntu.com">ntp.ubuntu.com</a> .POOL. 16 p - 64 0 0.000 0.000 0.000</div><div>+198.46.223.227 204.9.54.119 2 u 65 512 377 22.318 0.096 1.111</div><div>-time1.plumdev.n .GPS. 1 u 116 512 377 72.487 1.386 0.544</div><div>-199.180.133.100 140.142.2.8 3 u 839 1024 377 65.574 -1.199 1.167</div><div>+helium.constant 128.59.0.245 2 u 217 512 377 7.368 0.952 0.090</div><div>*i.will.not.be.e 213.251.128.249 2 u 207 512 377 14.733 1.185 0.305</div><div><br></div></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 19 Apr 2019 at 11:46, Jehan-Guillaume de Rorthais <<a href="mailto:jgdr@dalibo.com">jgdr@dalibo.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fri, 19 Apr 2019 11:08:33 +0200<br>
Danka Ivanović <danka.ivanovic@gmail.com> wrote:

> Hi,
> Thank you for your response.
>
> OK, it seems that the fencing resources and the secondary timed out at the
> same time, together with LDAP.
> I understand that because of "migration-threshold=1", the standby tried to
> recover just once and then was stopped. Is this OK, or should the threshold
> be increased?

It depends on your use case, really.

Note that as soon as a resource hits its migration-threshold, there is an
implicit constraint forbidding it to come back on this node until you reset
the failcount. That's why your pgsql master resource never came back
anywhere.
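
For example, to check and reset it by hand (a sketch with the standard
Pacemaker tools, using the resource and node names from your logs):

  # show current failcounts for all resources
  crm_mon --failcounts -1

  # clear the failure history so PGSQL-HA may run on "secondary" again
  crm_resource --cleanup --resource PGSQL-HA --node secondary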

You can also set a failure-timeout if you are brave enough to automate the
failure reset. See:
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html

> Master server is started with systemctl, then pacemaker is started on
> master, which detects the master, and then starting pacemaker on the
> secondary brings up the postgres service in slave mode.

You should not. systemd must not mess with resources handled by Pacemaker.
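
On your Ubuntu setup that means something like this (the unit name is taken
from your logs; the start.conf step is specific to Debian/Ubuntu packaging):

  # make sure systemd never starts the instance itself; Pacemaker owns it
  sudo systemctl disable postgresql@9.5-main.service

  # keep the Debian tooling from auto-starting the cluster at boot
  echo manual | sudo tee /etc/postgresql/9.5/main/start.conf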

> I didn't manage to start the postgres master over pacemaker. I tested
> failover with a setup like this and it works. I will try to set up postgres
> to be run with pacemaker,

Pacemaker is supposed to start the resource itself if it is enabled in its
setup. Look at this whole chapter (its end is important):
https://clusterlabs.github.io/PAF/CentOS-7-admin-cookbook.html#starting-or-stopping-the-cluster

> but I am concerned about those timeouts which caused the cluster to crash.
> Can you help me investigate why this happened, or what should I change in
> order to avoid it? For the AWS virtual IP, an AWS secondary IP is used.

Really, I can't help on this. It looks like both VMs suddenly froze most of
their processes, or maybe some kind of clock jump exhausted the timeouts... I
really don't know.

It sounds more related to your virtualization stack, I suppose. Maybe some
kind of "hot" backup? Maybe the hypervisor didn't schedule enough CPU to your
VMs for too long?
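
One quick thing to check around the incident timestamps is the CPU "steal"
time, which shows how long the hypervisor starved the VM (just a guess worth
verifying):

  # the "st" column is the % of time stolen by the hypervisor
  vmstat 1 5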

It is surprising that both VMs hit timeouts at almost the same time. Do you
know whether they are on the same hypervisor host? If they are, this is a
SPoF: you should move one of them to another host.
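
On AWS you cannot choose the host directly, but a "spread" placement group
forces instances onto distinct underlying hardware (hypothetical group name;
existing instances must be relaunched into the group):

  aws ec2 create-placement-group --group-name ha-spread --strategy spread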

++

> On Thu, 18 Apr 2019 at 18:24, Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
> wrote:
>
> > On Thu, 18 Apr 2019 14:19:44 +0200
> > Danka Ivanović <danka.ivanovic@gmail.com> wrote:
> >
> > It seems you had timeouts for both fencing resources and your standby at
> > the same time here:
> >
> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > monitor for fencing-secondary on master: unknown error (1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > monitor for fencing-master on secondary: unknown error (1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Processing failed op
> > > monitor for PGSQL:1 on secondary: unknown error (1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-secondary
> > > away from master after 1 failures (max=1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing fencing-master
> > > away from secondary after 1 failures (max=1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > from secondary after 1 failures (max=1)
> > > Apr 17 10:03:34 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > from secondary after 1 failures (max=1)
> >
> > Because you have "migration-threshold=1", the standby will be shut down:
> >
> > > Apr 17 10:03:34 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
> >
> > The transition is stopped because the pgsql master timed out in the
> > meantime:
> >
> > > Apr 17 10:03:40 master crmd[12481]: notice: Transition 3462 (Complete=5,
> > > Pending=0, Fired=0, Skipped=1, Incomplete=6,
> > > Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Stopped
> >
> > and, as you mentioned, your LDAP as well:
> >
> > > Apr 17 10:03:40 master nslcd[1518]: [d7e446] <group(all)> ldap_result()
> > > timed out
> >
> > Here are the four timeout errors (2 fencings and 2 pgsql instances):
> >
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > > monitor for fencing-secondary on master: unknown error (1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > > monitor for PGSQL:0 on master: unknown error (1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > > monitor for fencing-master on secondary: unknown error (1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Processing failed op
> > > monitor for PGSQL:1 on secondary: unknown error (1)
> >
> > As a reaction, Pacemaker decides to stop everything because it cannot move
> > the resources anywhere:
> >
> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > from master after 1 failures (max=1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > from master after 1 failures (max=1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-secondary
> > > away from master after 1 failures (max=1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing fencing-master
> > > away from secondary after 1 failures (max=1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > from secondary after 1 failures (max=1)
> > > Apr 17 10:03:40 master pengine[12480]: warning: Forcing PGSQL-HA away
> > > from secondary after 1 failures (max=1)
> > > Apr 17 10:03:40 master pengine[12480]: notice: Stop AWSVIP (master)
> > > Apr 17 10:03:40 master pengine[12480]: notice: Demote PGSQL:0 (Master ->
> > > Stopped master)
> > > Apr 17 10:03:40 master pengine[12480]: notice: Stop PGSQL:1 (secondary)
> >
> > Now, the following lines are really not expected. Why does systemd detect
> > that PostgreSQL stopped?
> >
> > > Apr 17 10:03:40 master postgresql@9.5-main[32458]: Cluster is not
> > > running.
> > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Control
> > > process exited, code=exited status=2
> > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Unit
> > > entered failed state.
> > > Apr 17 10:03:40 master systemd[1]: postgresql@9.5-main.service: Failed
> > > with result 'exit-code'.
> >
> > I suspect the service is still enabled, or has been started by hand.
> >
> > As soon as you set up a resource in Pacemaker, the admin should **always**
> > ask Pacemaker to start/stop it. Never use systemctl to handle the resource
> > yourself.
> >
> > You must disable this service in systemd.

-- 
Pozdrav
Danka Ivanovic