[ClusterLabs] How to figure out the reason for pgsql_monitor timeout, which resulted in a failover

Benjamin Fras benjamin.fras at petafuel.de
Fri Jan 15 03:44:58 EST 2016


Dear list members,

 

We are running a couple of postgres clusters based on corosync / pacemaker,
each consisting of three nodes (master, slave and a witness host that runs
no postgres resources). In the attached logs, the master is nbgprepdb6, the
recovered host is nbgprepdb5 and the witness host is nbgprepwitness56. The
resource configuration can be found in pgsql_crm.txt.

 

The setup is stable and generally runs fine. However, today we experienced
some strange behaviour on one of our cluster nodes. First we did a planned
failover and a successful recovery, after which the recovered host was
correctly recognized as a slave and the cluster seemed to be fine. After a
while, though, pacemaker performed a failover, and I don't see why this
failover actually happened.

 

According to the log files (I have attached the pacemaker.log from all three
nodes), the demotion of the master node and the failover were caused by a
timeout of pgsql_monitor on the master server. But why did it time out?
Postgres itself obviously didn't have a problem; it was shut down cleanly,
triggered by pacemaker. There are no errors in the postgres.log or in the
syslog (e.g. the system running out of memory or similar). I was not able
to find an explanation for this, so do you have any ideas where to look?
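
For anyone who wants to retrace this in the attached logs, here is a minimal
Python sketch of how the candidate lines could be pulled out of a
pacemaker.log. The function name find_monitor_timeouts and the loose
"timed out / timeout" pattern are assumptions for illustration only; the
exact wording pacemaker uses for a monitor timeout depends on the version,
so the match is deliberately broad.

```python
import re
from typing import Iterable, List, Tuple

# Loose, case-insensitive match for "timeout", "timed out", "Timed Out", ...
# This is an assumption, not the exact message format pacemaker emits.
TIMEOUT_RE = re.compile(r"time(d)?[ _-]?out", re.IGNORECASE)


def find_monitor_timeouts(
    lines: Iterable[str], resource: str = "pgsql"
) -> List[Tuple[int, str]]:
    """Return (line number, text) for lines that mention both the
    resource's monitor operation and something that looks like a timeout."""
    needle = f"{resource}_monitor".lower()
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if needle in line.lower() and TIMEOUT_RE.search(line):
            hits.append((lineno, line.rstrip("\n")))
    return hits
```

Passing an open log file, e.g. find_monitor_timeouts(open(path)), yields the
matching lines with their line numbers, which makes it easier to compare the
timestamps across the three nodes.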

 

I should add that we had some issues starting the recovered slave node,
because the pgsql_start-timeout was too low (120s). As postgres didn't
manage to catch up within this time, it was shut down by pacemaker. We
retried a few times and eventually postgres came up. Still, I don't see
how this could be related to the issue described above.

 

Appreciate your help.

Best regards,

 

Ben

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: witness_host.txt
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20160115/a164377b/attachment-0008.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pgsql_crm.txt
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20160115/a164377b/attachment-0009.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: master_host.txt
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20160115/a164377b/attachment-0010.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: recovered_host.txt
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20160115/a164377b/attachment-0011.txt>