[ClusterLabs] PostgreSQL cluster with Pacemaker+PAF problems
Aleksandra C
aleksandra29c at gmail.com
Fri Mar 6 12:38:37 EST 2020
Thank you for your answer!
I can't see anything useful in the pgsqlms log:
[1398] server2 cib: info: cib_perform_op: ++
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources:
<lrm_resource id="pgsqld" type="pgsqlms" class="ocf"
provider="heartbeat"/>
[1402] server2 pengine: info: native_print: pgsqld
(ocf::heartbeat:pgsqlms): FAILED Master server1
[1402] server2 pengine: info: native_print: pgsqld
(ocf::heartbeat:pgsqlms): FAILED server1
[1402] server2 pengine: info: native_print: pgsqld
(ocf::heartbeat:pgsqlms): FAILED server1 (blocked)
pgsqlms(pgsqld)[3824]: Mar 06 18:04:21 WARNING: No secondary
connected to the master
pgsqlms(pgsqld)[3824]: Mar 06 18:04:21 WARNING: "server1" is not
connected to the primary
[1402] server2 pengine: info: native_print: pgsqld
(ocf::heartbeat:pgsqlms): FAILED server1 (blocked)
However, here is what I see in the general Pacemaker logs:
[1398] server2 cib: info: cib_perform_op: ++
<lrm_rsc_op id="pgsqld_last_0"
operation_key="pgsqld_promote_0" operation="promote"
crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10"
transition-key="6:0:0:bb00f80b-88c5-4453-8291-44dae9fcc635"
transition-magic="0:0;6:0:0:bb00f80b-88c5-4453-8291-44dae9fcc635"
on_node="server2" call-id="26" rc-code="0" op-status="0" interval="0"
last-run="1583514091" last-rc-change="1583514091" exec-tim
[1398] server2 cib: info: cib_perform_op: ++
<lrm_rsc_op id="pgsqld_monitor_15000"
operation_key="pgsqld_monitor_15000" operation="monitor"
crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10"
transition-key="7:0:8:bb00f80b-88c5-4453-8291-44dae9fcc635"
transition-magic="0:8;7:0:8:bb00f80b-88c5-4453-8291-44dae9fcc635"
on_node="server2" call-id="29" rc-code="8" op-status="0"
interval="15000" last-rc-change="1583514092" exec-time="469"
[1403] server2 crmd: notice: te_rsc_command: Initiating
notify operation pgsqld_pre_notify_start_0 locally on server2 | action
48
[1403] server2 crmd: info: do_lrm_rsc_op: Performing
key=48:5:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0
[1403] server2 crmd: info: match_graph_event: Action
pgsqld_notify_0 (48) confirmed on server2 (rc=0)
[1403] server2 crmd: notice: process_lrm_event: Result of
notify operation for pgsqld on server2: 0 (ok) | call=31
key=pgsqld_notify_0 confirmed=true cib-update=0
[1398] server2 cib: info: cib_perform_op: ++
<lrm_rsc_op
id="pgsqld_last_failure_0" operation_key="pgsqld_monitor_0"
operation="monitor" crm-debug-origin="do_update_resource"
crm_feature_set="3.0.10"
transition-key="5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635"
transition-magic="0:9;5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635"
exit-reason="Instance "pgsqld" controldata indicates a
running primary instance, the
[1398] server2 cib: info: cib_perform_op: ++
<lrm_rsc_op
id="pgsqld_last_0" operation_key="pgsqld_monitor_0"
operation="monitor" crm-debug-origin="do_update_resource"
crm_feature_set="3.0.10"
transition-key="5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635"
transition-magic="0:9;5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635"
exit-reason="Instance "pgsqld" controldata indicates a
running primary instance, the instance
[1403] server2 crmd: notice: abort_transition_graph:
Transition aborted by operation pgsqld_monitor_0 'create' on server1:
Event failed | magic=0:9;5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635
cib=0.24.17 source=match_graph_event:310 complete=false
[1403] server2 crmd: info: match_graph_event: Action
pgsqld_monitor_0 (5) confirmed on server1 (rc=9)
[1403] server2 crmd: info: process_graph_event:
Detected action (5.5) pgsqld_monitor_0.6=master (failed): failed
[1403] server2 crmd: info: abort_transition_graph:
Transition aborted by operation pgsqld_monitor_0 'create' on server1:
Event failed | magic=0:9;5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635
cib=0.24.17 source=match_graph_event:310 complete=false
[1403] server2 crmd: info: match_graph_event: Action
pgsqld_monitor_0 (5) confirmed on server1 (rc=9)
[1403] server2 crmd: info: process_graph_event:
Detected action (5.5) pgsqld_monitor_0.6=master (failed): failed
[1403] server2 crmd: notice: te_rsc_command: Initiating
notify operation pgsqld_pre_notify_demote_0 on server1 | action 56
[1403] server2 crmd: notice: te_rsc_command: Initiating
notify operation pgsqld_pre_notify_demote_0 locally on server2 |
action 58
[1403] server2 crmd: info: do_lrm_rsc_op: Performing
key=58:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0
[1403] server2 crmd: info: match_graph_event: Action
pgsqld_notify_0 (58) confirmed on server2 (rc=0)
[1403] server2 crmd: notice: process_lrm_event: Result of
notify operation for pgsqld on server2: 0 (ok) | call=32
key=pgsqld_notify_0 confirmed=true cib-update=0
[1403] server2 crmd: info: match_graph_event: Action
pgsqld_notify_0 (56) confirmed on server1 (rc=0)
[1403] server2 crmd: notice: te_rsc_command: Initiating
demote operation pgsqld_demote_0 on server1 | action 6
[1398] server2 cib: info: cib_perform_op: +
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqld']/lrm_rsc_op[@id='pgsqld_last_failure_0']:
@operation_key=pgsqld_demote_0, @operation=demote,
@transition-key=6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@transition-magic=0:1;6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@call-id=22, @rc-code=1, @last-run=1583514254,
@last-rc-change=1583514254, @exec-time=183
[1398] server2 cib: info: cib_perform_op: +
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqld']/lrm_rsc_op[@id='pgsqld_last_0']:
@operation_key=pgsqld_demote_0, @operation=demote,
@transition-key=6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@transition-magic=0:1;6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@call-id=22, @rc-code=1, @last-run=1583514254,
@last-rc-change=1583514254, @exec-time=183
[1403] server2 crmd: warning: status_from_rc: Action 6
(pgsqld_demote_0) on server1 failed (target: 0 vs. rc: 1): Error
[1403] server2 crmd: notice: abort_transition_graph:
Transition aborted by operation pgsqld_demote_0 'modify' on server1:
Event failed | magic=0:1;6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635
cib=0.24.21 source=match_graph_event:310 complete=false
[1403] server2 crmd: info: match_graph_event: Action
pgsqld_demote_0 (6) confirmed on server1 (rc=1)
[1403] server2 crmd: info: process_graph_event:
Detected action (6.6) pgsqld_demote_0.22=unknown error: failed
[1403] server2 crmd: warning: status_from_rc: Action 6
(pgsqld_demote_0) on server1 failed (target: 0 vs. rc: 1): Error
[1403] server2 crmd: info: abort_transition_graph:
Transition aborted by operation pgsqld_demote_0 'modify' on server1:
Event failed | magic=0:1;6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635
cib=0.24.21 source=match_graph_event:310 complete=false
[1403] server2 crmd: info: match_graph_event: Action
pgsqld_demote_0 (6) confirmed on server1 (rc=1)
[1403] server2 crmd: info: process_graph_event:
Detected action (6.6) pgsqld_demote_0.22=unknown error: failed
[1403] server2 crmd: notice: te_rsc_command: Initiating
notify operation pgsqld_post_notify_demote_0 on server1 | action 57
[1403] server2 crmd: notice: te_rsc_command: Initiating
notify operation pgsqld_post_notify_demote_0 locally on server2 |
action 59
[1403] server2 crmd: info: do_lrm_rsc_op: Performing
key=59:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0
[1403] server2 crmd: info: match_graph_event: Action
pgsqld_notify_0 (57) confirmed on server1 (rc=0)
[1403] server2 crmd: info: match_graph_event: Action
pgsqld_notify_0 (59) confirmed on server2 (rc=0)
[1403] server2 crmd: notice: process_lrm_event: Result of
notify operation for pgsqld on server2: 0 (ok) | call=33
key=pgsqld_notify_0 confirmed=true cib-update=0
[1403] server2 crmd: notice: te_rsc_command: Initiating
notify operation pgsqld_pre_notify_stop_0 on server1 | action 46
[1403] server2 crmd: notice: te_rsc_command: Initiating
notify operation pgsqld_pre_notify_stop_0 locally on server2 | action
47
[1403] server2 crmd: info: do_lrm_rsc_op: Performing
key=47:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0
[1403] server2 crmd: info: match_graph_event: Action
pgsqld_notify_0 (47) confirmed on server2 (rc=0)
[1403] server2 crmd: notice: process_lrm_event: Result of
notify operation for pgsqld on server2: 0 (ok) | call=34
key=pgsqld_notify_0 confirmed=true cib-update=0
[1403] server2 crmd: info: match_graph_event: Action
pgsqld_notify_0 (46) confirmed on server1 (rc=0)
[1403] server2 crmd: notice: te_rsc_command: Initiating
stop operation pgsqld_stop_0 on server1 | action 2
[1398] server2 cib: info: cib_perform_op: +
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqld']/lrm_rsc_op[@id='pgsqld_last_failure_0']:
@operation_key=pgsqld_stop_0, @operation=stop,
@transition-key=2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@transition-magic=0:1;2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@exit-reason=Unexpected state for instance "pgsqld" (returned 9),
@call-id=25, @last-run=1583514255, @last-rc-change=1583514255,
@exec-tim
[1398] server2 cib: info: cib_perform_op: +
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqld']/lrm_rsc_op[@id='pgsqld_last_0']:
@operation_key=pgsqld_stop_0, @operation=stop,
@transition-key=2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@transition-magic=0:1;2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@exit-reason=Unexpected state for instance "pgsqld" (returned 9),
@call-id=25, @last-run=1583514255, @last-rc-change=1583514255,
@exec-time=190
[1403] server2 crmd: warning: status_from_rc: Action 2
(pgsqld_stop_0) on server1 failed (target: 0 vs. rc: 1): Error
[1403] server2 crmd: notice: abort_transition_graph:
Transition aborted by operation pgsqld_stop_0 'modify' on server1:
Event failed | magic=0:1;2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635
cib=0.24.24 source=match_graph_event:310 complete=false
[1403] server2 crmd: info: match_graph_event: Action
pgsqld_stop_0 (2) confirmed on server1 (rc=1)
[1403] server2 crmd: info: process_graph_event:
Detected action (7.2) pgsqld_stop_0.25=unknown error: failed
[1403] server2 crmd: warning: status_from_rc: Action 2
(pgsqld_stop_0) on server1 failed (target: 0 vs. rc: 1): Error
[1403] server2 crmd: info: abort_transition_graph:
Transition aborted by operation pgsqld_stop_0 'modify' on server1:
Event failed | magic=0:1;2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635
cib=0.24.24 source=match_graph_event:310 complete=false
[1403] server2 crmd: info: match_graph_event: Action
pgsqld_stop_0 (2) confirmed on server1 (rc=1)
[1403] server2 crmd: info: process_graph_event:
Detected action (7.2) pgsqld_stop_0.25=unknown error: failed
[1403] server2 crmd: notice: te_rsc_command: Initiating
notify operation pgsqld_post_notify_stop_0 locally on server2 | action
48
[1403] server2 crmd: info: do_lrm_rsc_op: Performing
key=48:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0
[1403] server2 crmd: info: match_graph_event: Action
pgsqld_notify_0 (48) confirmed on server2 (rc=0)
[1403] server2 crmd: notice: process_lrm_event: Result of
notify operation for pgsqld on server2: 0 (ok) | call=35
key=pgsqld_notify_0 confirmed=true cib-update=0
In particular, I can see these errors:
Action 6 (pgsqld_demote_0) on server1 failed (target: 0 vs. rc: 1): Error
Action 2 (pgsqld_stop_0) on server1 failed (target: 0 vs. rc: 1): Error
It seems it cannot demote the previous master to a slave. The recovery.conf
file is present on this (failed) node.
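If I read the monitor error above correctly ("controldata indicates a
running primary instance"), PAF is looking at the controldata of the old
primary. This is the check I ran on server1 (a sketch only, using the
binary and data paths from the pcs configuration quoted below):

# on server1, the old primary
/usr/pgsql-9.6/bin/pg_controldata /var/lib/pgsql/9.6/data | grep 'Database cluster state'
# after the ungraceful shutdown this still reports "in production" rather
# than "shut down", which I assume is why PAF treats the instance as a
# crashed primary and refuses to demote it
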
Should I assume that every ungraceful shutdown (and even a manual fence)
will result in a failover, so that I have to rebuild the instance afterwards?
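If a rebuild is indeed the expected procedure, this is roughly what I have
in mind for the failed node (a sketch only: it assumes server2 is now the
primary, reuses the paths from the configuration quoted below, and the
replication user "replicator" is just a placeholder):

# on server1, as postgres, after the node has been unfenced
rm -rf /var/lib/pgsql/9.6/data
/usr/pgsql-9.6/bin/pg_basebackup -h server2 -U replicator \
    -D /var/lib/pgsql/9.6/data -X stream -P
# then recreate recovery.conf pointing at the new primary before letting
# Pacemaker start the resource again
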
Also, rebuilding the instance is problematic: while the node is in this
state (FAILED/blocked), stopping the pacemaker/corosync services or running
any related command simply blocks and never completes. I would find it more
intuitive if the node at least ended up in the Stopped state rather than
FAILED/blocked.
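For completeness, once the data directory has been rebuilt I would expect
something like the following to clear the failure state and let Pacemaker
re-probe the node (a sketch, using the resource and node names from above),
but at the moment even these commands hang:

pcs resource cleanup pgsqld
# or, with the lower-level tool:
crm_resource --cleanup --resource pgsqld --node server1
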
Also, as a note, fencing itself does work: it successfully reboots any node.
It is the rejoin of the previous master that runs into this problem.
Thank you in advance,
Aleksandra
On Thu, 5 Mar 2020 at 12:40, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
wrote:
> Hello,
>
> On Thu, 5 Mar 2020 12:21:14 +0100
> Aleksandra C <aleksandra29c at gmail.com> wrote:
> [...]
> > I would be very happy to use some help from you.
> >
> > I have configured PostgreSQL cluster with Pacemaker+PAF. The pacemaker
> > configuration is the following (from
> > https://clusterlabs.github.io/PAF/Quick_Start-CentOS-7.html)
> >
> > # pgsqld
> > pcs -f cluster1.xml resource create pgsqld ocf:heartbeat:pgsqlms \
> > bindir=/usr/pgsql-9.6/bin pgdata=/var/lib/pgsql/9.6/data \
> > op start timeout=60s \
> > op stop timeout=60s \
> > op promote timeout=30s \
> > op demote timeout=120s \
> > op monitor interval=15s timeout=10s role="Master" \
> > op monitor interval=16s timeout=10s role="Slave" \
> > op notify timeout=60s
>
> If you can, I would recommend using PostgreSQL v11 or v12. Support for v12
> is in PAF 2.3rc2, which is supposed to be released next week.
>
>
> [...]
> > The cluster is behaving in strange way. When I manually fence the master
> > node (or ungracefully shutdown), after unfencing/starting, the node has
> > status Failed/blocked and the node is constantly fenced(restarted) by the
> > fencing agent. Should the fencing recover the cluster as Master/Slave
> > without problem?
>
> I suppose a failover occurred after the ungraceful shutdown? The old
> primary is probably seen as crashed from PAF point of view.
>
> Could you share pgsqlms detailed log?
>
> [...]
> > Is this a cluster misconfiguration? Any idea would be greatly
> appreciated.
>
> I don't think so. Make sure to look at
> https://clusterlabs.github.io/PAF/administration.html#failover
>
> Regards,
>