<div dir="ltr">Thank you for your answer!<div><br></div><div>I can't see something useful in the pgsqlms log:</div><div><br></div><div><pre style="white-space:pre-wrap;margin-top:0px;margin-bottom:20px;font-family:Monaco,"Vera Sans Terminal",monospace;color:rgb(51,51,51);font-size:12px;padding:8px 15px;background:rgb(245,245,245);border-radius:5px;border:1px solid rgb(229,229,229)">[1398] server2 cib: info: cib_perform_op: ++ /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources: <lrm_resource id="pgsqld" type="pgsqlms" class="ocf" provider="heartbeat"/><br>[1402] server2 pengine: info: native_print: pgsqld (ocf::heartbeat:pgsqlms): FAILED Master server1<br>[1402] server2 pengine: info: native_print: pgsqld (ocf::heartbeat:pgsqlms): FAILED server1<br>1402] server2 pengine: info: native_print: pgsqld (ocf::heartbeat:pgsqlms): FAILED server1 (blocked)<br>pgsqlms(pgsqld)[3824]: Mar 06 18:04:21 WARNING: No secondary connected to the master<br>pgsqlms(pgsqld)[3824]: Mar 06 18:04:21 WARNING: "server1" is not connected to the primary<br>[1402] server2 pengine: info: native_print: pgsqld (ocf::heartbeat:pgsqlms): FAILED server1 (blocked)
</pre></div><div><br></div><div>However, here are the general logs:</div><div><br></div><div><pre style="white-space:pre-wrap;margin-top:0px;margin-bottom:20px;font-family:Monaco,"Vera Sans Terminal",monospace;color:rgb(51,51,51);font-size:12px;padding:8px 15px;background:rgb(245,245,245);border-radius:5px;border:1px solid rgb(229,229,229)">[1398] server2 cib: info: cib_perform_op: ++ <lrm_rsc_op id="pgsqld_last_0" operation_key="pgsqld_promote_0" operation="promote" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="6:0:0:bb00f80b-88c5-4453-8291-44dae9fcc635" transition-magic="0:0;6:0:0:bb00f80b-88c5-4453-8291-44dae9fcc635" on_node="server2" call-id="26" rc-code="0" op-status="0" interval="0" last-run="1583514091" last-rc-change="1583514091" exec-tim<br>[1398] server2 cib: info: cib_perform_op: ++ <lrm_rsc_op id="pgsqld_monitor_15000" operation_key="pgsqld_monitor_15000" operation="monitor" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10" transition-key="7:0:8:bb00f80b-88c5-4453-8291-44dae9fcc635" transition-magic="0:8;7:0:8:bb00f80b-88c5-4453-8291-44dae9fcc635" on_node="server2" call-id="29" rc-code="8" op-status="0" interval="15000" last-rc-change="1583514092" exec-time="469"<br>[1403] server2 crmd: notice: te_rsc_command: Initiating notify operation pgsqld_pre_notify_start_0 locally on server2 | action 48<br>[1403] server2 crmd: info: do_lrm_rsc_op: Performing key=48:5:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0<br>[1403] server2 crmd: info: match_graph_event: Action pgsqld_notify_0 (48) confirmed on server2 (rc=0)<br>[1403] server2 crmd: notice: process_lrm_event: Result of notify operation for pgsqld on server2: 0 (ok) | call=31 key=pgsqld_notify_0 confirmed=true cib-update=0<br>[1398] server2 cib: info: cib_perform_op: ++ <lrm_rsc_op id="pgsqld_last_failure_0" operation_key="pgsqld_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635" transition-magic="0:9;5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635" exit-reason="Instance "pgsqld" controldata indicates a running primary instance, the<br>[1398] server2 cib: info: cib_perform_op: ++ <lrm_rsc_op id="pgsqld_last_0" operation_key="pgsqld_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" transition-key="5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635" transition-magic="0:9;5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635" exit-reason="Instance "pgsqld" controldata indicates a running primary instance, the instance<br>[1403] server2 crmd: notice: abort_transition_graph: Transition aborted by operation pgsqld_monitor_0 'create' on server1: Event failed | magic=0:9;5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635 cib=0.24.17 source=match_graph_event:310 complete=false<br>[1403] server2 crmd: info: match_graph_event: Action pgsqld_monitor_0 (5) confirmed on server1 (rc=9)<br>[1403] server2 crmd: info: process_graph_event: Detected action (5.5) pgsqld_monitor_0.6=master (failed): failed<br>[1403] server2 crmd: info: abort_transition_graph: Transition aborted by operation pgsqld_monitor_0 'create' on server1: Event failed | magic=0:9;5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635 cib=0.24.17 source=match_graph_event:310 complete=false<br>[1403] server2 crmd: info: match_graph_event: Action pgsqld_monitor_0 (5) confirmed on server1 (rc=9)<br>[1403] server2 crmd: info: process_graph_event: Detected action (5.5) pgsqld_monitor_0.6=master (failed): failed<br>[1403] server2 
crmd: notice: te_rsc_command: Initiating notify operation pgsqld_pre_notify_demote_0 on server1 | action 56<br>[1403] server2 crmd: notice: te_rsc_command: Initiating notify operation pgsqld_pre_notify_demote_0 locally on server2 | action 58<br>[1403] server2 crmd: info: do_lrm_rsc_op: Performing key=58:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0<br>[1403] server2 crmd: info: match_graph_event: Action pgsqld_notify_0 (58) confirmed on server2 (rc=0)<br>[1403] server2 crmd: notice: process_lrm_event: Result of notify operation for pgsqld on server2: 0 (ok) | call=32 key=pgsqld_notify_0 confirmed=true cib-update=0<br>[1403] server2 crmd: info: match_graph_event: Action pgsqld_notify_0 (56) confirmed on server1 (rc=0)<br>[1403] server2 crmd: notice: te_rsc_command: Initiating demote operation pgsqld_demote_0 on server1 | action 6<br>[1398] server2 cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqld']/lrm_rsc_op[@id='pgsqld_last_failure_0']: @operation_key=pgsqld_demote_0, @operation=demote, @transition-key=6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635, @transition-magic=0:1;6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635, @call-id=22, @rc-code=1, @last-run=1583514254, @last-rc-change=1583514254, @exec-time=183<br>[1398] server2 cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqld']/lrm_rsc_op[@id='pgsqld_last_0']: @operation_key=pgsqld_demote_0, @operation=demote, @transition-key=6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635, @transition-magic=0:1;6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635, @call-id=22, @rc-code=1, @last-run=1583514254, @last-rc-change=1583514254, @exec-time=183<br>[1403] server2 crmd: warning: status_from_rc: Action 6 (pgsqld_demote_0) on server1 failed (target: 0 vs. rc: 1): Error<br>[1403] server2 crmd: notice: abort_transition_graph: Transition aborted by operation pgsqld_demote_0 'modify' on server1: Event failed | magic=0:1;6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635 cib=0.24.21 source=match_graph_event:310 complete=false<br>[1403] server2 crmd: info: match_graph_event: Action pgsqld_demote_0 (6) confirmed on server1 (rc=1)<br>[1403] server2 crmd: info: process_graph_event: Detected action (6.6) pgsqld_demote_0.22=unknown error: failed<br>[1403] server2 crmd: warning: status_from_rc: Action 6 (pgsqld_demote_0) on server1 failed (target: 0 vs. 
rc: 1): Error<br>[1403] server2 crmd: info: abort_transition_graph: Transition aborted by operation pgsqld_demote_0 'modify' on server1: Event failed | magic=0:1;6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635 cib=0.24.21 source=match_graph_event:310 complete=false<br>[1403] server2 crmd: info: match_graph_event: Action pgsqld_demote_0 (6) confirmed on server1 (rc=1)<br>[1403] server2 crmd: info: process_graph_event: Detected action (6.6) pgsqld_demote_0.22=unknown error: failed<br>[1403] server2 crmd: notice: te_rsc_command: Initiating notify operation pgsqld_post_notify_demote_0 on server1 | action 57<br>[1403] server2 crmd: notice: te_rsc_command: Initiating notify operation pgsqld_post_notify_demote_0 locally on server2 | action 59<br>[1403] server2 crmd: info: do_lrm_rsc_op: Performing key=59:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0<br>[1403] server2 crmd: info: match_graph_event: Action pgsqld_notify_0 (57) confirmed on server1 (rc=0)<br>[1403] server2 crmd: info: match_graph_event: Action pgsqld_notify_0 (59) confirmed on server2 (rc=0)<br>[1403] server2 crmd: notice: process_lrm_event: Result of notify operation for pgsqld on server2: 0 (ok) | call=33 key=pgsqld_notify_0 confirmed=true cib-update=0<br>[1403] server2 crmd: notice: te_rsc_command: Initiating notify operation pgsqld_pre_notify_stop_0 on server1 | action 46<br>[1403] server2 crmd: notice: te_rsc_command: Initiating notify operation pgsqld_pre_notify_stop_0 locally on server2 | action 47<br>[1403] server2 crmd: info: do_lrm_rsc_op: Performing key=47:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0<br>[1403] server2 crmd: info: match_graph_event: Action pgsqld_notify_0 (47) confirmed on server2 (rc=0)<br>[1403] server2 crmd: notice: process_lrm_event: Result of notify operation for pgsqld on server2: 0 (ok) | call=34 key=pgsqld_notify_0 confirmed=true cib-update=0<br>[1403] server2 crmd: info: match_graph_event: Action pgsqld_notify_0 (46) confirmed on server1 (rc=0)<br>[1403] server2 crmd: notice: te_rsc_command: Initiating stop operation pgsqld_stop_0 on server1 | action 2<br>[1398] server2 cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqld']/lrm_rsc_op[@id='pgsqld_last_failure_0']: @operation_key=pgsqld_stop_0, @operation=stop, @transition-key=2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635, @transition-magic=0:1;2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635, @exit-reason=Unexpected state for instance "pgsqld" (returned 9), @call-id=25, @last-run=1583514255, @last-rc-change=1583514255, @exec-tim<br>[1398] server2 cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqld']/lrm_rsc_op[@id='pgsqld_last_0']: @operation_key=pgsqld_stop_0, @operation=stop, @transition-key=2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635, @transition-magic=0:1;2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635, @exit-reason=Unexpected state for instance "pgsqld" (returned 9), @call-id=25, @last-run=1583514255, @last-rc-change=1583514255, @exec-time=190<br>[1403] server2 crmd: warning: status_from_rc: Action 2 (pgsqld_stop_0) on server1 failed (target: 0 vs. 
rc: 1): Error<br>[1403] server2 crmd: notice: abort_transition_graph: Transition aborted by operation pgsqld_stop_0 'modify' on server1: Event failed | magic=0:1;2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635 cib=0.24.24 source=match_graph_event:310 complete=false<br>[1403] server2 crmd: info: match_graph_event: Action pgsqld_stop_0 (2) confirmed on server1 (rc=1)<br>[1403] server2 crmd: info: process_graph_event: Detected action (7.2) pgsqld_stop_0.25=unknown error: failed<br>[1403] server2 crmd: warning: status_from_rc: Action 2 (pgsqld_stop_0) on server1 failed (target: 0 vs. rc: 1): Error<br>[1403] server2 crmd: info: abort_transition_graph: Transition aborted by operation pgsqld_stop_0 'modify' on server1: Event failed | magic=0:1;2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635 cib=0.24.24 source=match_graph_event:310 complete=false<br>[1403] server2 crmd: info: match_graph_event: Action pgsqld_stop_0 (2) confirmed on server1 (rc=1)<br>[1403] server2 crmd: info: process_graph_event: Detected action (7.2) pgsqld_stop_0.25=unknown error: failed<br>[1403] server2 crmd: notice: te_rsc_command: Initiating notify operation pgsqld_post_notify_stop_0 locally on server2 | action 48<br>[1403] server2 crmd: info: do_lrm_rsc_op: Performing key=48:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0<br>[1403] server2 crmd: info: match_graph_event: Action pgsqld_notify_0 (48) confirmed on server2 (rc=0)<br>[1403] server2 crmd: notice: process_lrm_event: Result of notify operation for pgsqld on server2: 0 (ok) | call=35 key=pgsqld_notify_0 confirmed=true cib-update=0
</pre></div><div>Here I can see these kinds of log entries:</div><div>Action 6 (pgsqld_demote_0) on server1 failed (target: 0 vs. rc: 1): Error<br></div><div>Action 2 (pgsqld_stop_0) on server1 failed (target: 0 vs. rc: 1): Error<br></div><div><br></div><div>It seems the cluster cannot demote the previous master to a slave. The recovery.conf file is present on this (failed) node.</div><div><br></div><div>Should I assume that every ungraceful shutdown scenario (and even a manual fence) results in a node failover, so that I have to rebuild the instance? Rebuilding the instance is also problematic: stopping the pacemaker/corosync services, or running any related command, does not work (it is blocked) while the node is in this state (FAILED/blocked). I would find it more intuitive for the node to at least end up in the "stopped" state rather than FAILED/blocked.
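</div><div><br></div><div>For reference, this is roughly the rebuild procedure I had in mind for the old primary, following the PAF administration page (the replication user name, host names and the recovery.conf handling below are just from my setup, so please treat this as a sketch and correct me if it is not the right approach):</div><div><br></div><div><pre style="white-space:pre-wrap;font-size:12px;padding:8px 15px;background:rgb(245,245,245);border-radius:5px;border:1px solid rgb(229,229,229)"># On the failed node (server1): check what the control data says.
# PAF blocks the resource because the controldata still describes a
# running/crashed primary instead of a cleanly shut-down standby.
/usr/pgsql-9.6/bin/pg_controldata /var/lib/pgsql/9.6/data | grep "cluster state"

# Rebuild the old primary as a standby from the new primary (run as postgres).
mv /var/lib/pgsql/9.6/data /var/lib/pgsql/9.6/data.old
pg_basebackup -h server2 -U replication -D /var/lib/pgsql/9.6/data -X stream -P
cp /var/lib/pgsql/9.6/data.old/recovery.conf /var/lib/pgsql/9.6/data/

# Then clear the failed state so Pacemaker tries to start pgsqld here again.
pcs resource cleanup pgsqld</pre>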
</div><div><br></div><div>Also, as a note, the fencing itself does work: it successfully reboots any node. It is the rejoin of the previous master that runs into this problem.</div><div><br></div><div>Thank you in advance,</div><div>Aleksandra</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 5 Mar 2020 at 12:40, Jehan-Guillaume de Rorthais <<a href="mailto:jgdr@dalibo.com">jgdr@dalibo.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello,<br>
<br>
On Thu, 5 Mar 2020 12:21:14 +0100<br>
Aleksandra C <<a href="mailto:aleksandra29c@gmail.com" target="_blank">aleksandra29c@gmail.com</a>> wrote:<br>
[...]<br>
> I would be very happy to get some help from you.<br>
> <br>
> I have configured PostgreSQL cluster with Pacemaker+PAF. The pacemaker<br>
> configuration is the following (from<br>
> <a href="https://clusterlabs.github.io/PAF/Quick_Start-CentOS-7.html" rel="noreferrer" target="_blank">https://clusterlabs.github.io/PAF/Quick_Start-CentOS-7.html</a>)<br>
> <br>
> # pgsqld<br>
> pcs -f cluster1.xml resource create pgsqld ocf:heartbeat:pgsqlms \<br>
> bindir=/usr/pgsql-9.6/bin pgdata=/var/lib/pgsql/9.6/data \<br>
> op start timeout=60s \<br>
> op stop timeout=60s \<br>
> op promote timeout=30s \<br>
> op demote timeout=120s \<br>
> op monitor interval=15s timeout=10s role="Master" \<br>
> op monitor interval=16s timeout=10s role="Slave" \<br>
> op notify timeout=60s<br>
<br>
If you can, I would recommend using PostgreSQL v11 or v12. Support for v12 is in<br>
PAF 2.3rc2 which is supposed to be released next week.<br>
<br>
<br>
[...]<br>
> The cluster is behaving in strange way. When I manually fence the master<br>
> node (or ungracefully shutdown), after unfencing/starting, the node has<br>
> status Failed/blocked and the node is constantly fenced (restarted) by the<br>
> fencing agent. Should the fencing recover the cluster as Master/Slave<br>
> without problem?<br>
<br>
I suppose a failover occurred after the ungraceful shutdown? The old primary is<br>
probably seen as crashed from PAF's point of view.<br>
<br>
Could you share pgsqlms detailed log?<br>
<br>
[...]<br>
> Is this a cluster misconfiguration? Any idea would be greatly appreciated.<br>
<br>
I don't think so. Make sure to look at<br>
<a href="https://clusterlabs.github.io/PAF/administration.html#failover" rel="noreferrer" target="_blank">https://clusterlabs.github.io/PAF/administration.html#failover</a><br>
<br>
Regards,<br>
</blockquote></div>