[ClusterLabs] PostgreSQL cluster with Pacemaker+PAF problems

Aleksandra C aleksandra29c at gmail.com
Fri Mar 6 12:38:37 EST 2020


Thank you for your answer!

I can't see anything useful in the pgsqlms log:

[1398] server2        cib:     info: cib_perform_op:    ++
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources:
<lrm_resource id="pgsqld" type="pgsqlms" class="ocf"
provider="heartbeat"/>
[1402] server2    pengine:     info: native_print:           pgsqld
 (ocf::heartbeat:pgsqlms):       FAILED Master server1
[1402] server2    pengine:     info: native_print:           pgsqld
 (ocf::heartbeat:pgsqlms):       FAILED server1
[1402] server2    pengine:     info: native_print:           pgsqld
(ocf::heartbeat:pgsqlms):       FAILED server1 (blocked)
pgsqlms(pgsqld)[3824]:  Mar 06 18:04:21  WARNING: No secondary
connected to the master
pgsqlms(pgsqld)[3824]:  Mar 06 18:04:21  WARNING: "server1" is not
connected to the primary
[1402] server2    pengine:     info: native_print:           pgsqld
 (ocf::heartbeat:pgsqlms):       FAILED server1 (blocked)


However, here are the general logs:

[1398] server2        cib:     info: cib_perform_op:    ++
                            <lrm_rsc_op id="pgsqld_last_0"
operation_key="pgsqld_promote_0" operation="promote"
crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10"
transition-key="6:0:0:bb00f80b-88c5-4453-8291-44dae9fcc635"
transition-magic="0:0;6:0:0:bb00f80b-88c5-4453-8291-44dae9fcc635"
on_node="server2" call-id="26" rc-code="0" op-status="0" interval="0"
last-run="1583514091" last-rc-change="1583514091" exec-tim
[1398] server2        cib:     info: cib_perform_op:    ++
                            <lrm_rsc_op id="pgsqld_monitor_15000"
operation_key="pgsqld_monitor_15000" operation="monitor"
crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10"
transition-key="7:0:8:bb00f80b-88c5-4453-8291-44dae9fcc635"
transition-magic="0:8;7:0:8:bb00f80b-88c5-4453-8291-44dae9fcc635"
on_node="server2" call-id="29" rc-code="8" op-status="0"
interval="15000" last-rc-change="1583514092" exec-time="469"
[1403] server2       crmd:   notice: te_rsc_command:    Initiating
notify operation pgsqld_pre_notify_start_0 locally on server2 | action
48
[1403] server2       crmd:     info: do_lrm_rsc_op:     Performing
key=48:5:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0
[1403] server2       crmd:     info: match_graph_event: Action
pgsqld_notify_0 (48) confirmed on server2 (rc=0)
[1403] server2       crmd:   notice: process_lrm_event: Result of
notify operation for pgsqld on server2: 0 (ok) | call=31
key=pgsqld_notify_0 confirmed=true cib-update=0
[1398] server2        cib:     info: cib_perform_op:    ++
                                                   <lrm_rsc_op
id="pgsqld_last_failure_0" operation_key="pgsqld_monitor_0"
operation="monitor" crm-debug-origin="do_update_resource"
crm_feature_set="3.0.10"
transition-key="5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635"
transition-magic="0:9;5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635"
exit-reason="Instance "pgsqld" controldata indicates a
running primary instance, the
[1398] server2        cib:     info: cib_perform_op:    ++
                                                   <lrm_rsc_op
id="pgsqld_last_0" operation_key="pgsqld_monitor_0"
operation="monitor" crm-debug-origin="do_update_resource"
crm_feature_set="3.0.10"
transition-key="5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635"
transition-magic="0:9;5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635"
exit-reason="Instance "pgsqld" controldata indicates a
running primary instance, the instance
[1403] server2       crmd:   notice: abort_transition_graph:
Transition aborted by operation pgsqld_monitor_0 'create' on server1:
Event failed | magic=0:9;5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635
cib=0.24.17 source=match_graph_event:310 complete=false
[1403] server2       crmd:     info: match_graph_event: Action
pgsqld_monitor_0 (5) confirmed on server1 (rc=9)
[1403] server2       crmd:     info: process_graph_event:
Detected action (5.5) pgsqld_monitor_0.6=master (failed): failed
[1403] server2       crmd:     info: abort_transition_graph:
Transition aborted by operation pgsqld_monitor_0 'create' on server1:
Event failed | magic=0:9;5:5:7:bb00f80b-88c5-4453-8291-44dae9fcc635
cib=0.24.17 source=match_graph_event:310 complete=false
[1403] server2       crmd:     info: match_graph_event: Action
pgsqld_monitor_0 (5) confirmed on server1 (rc=9)
[1403] server2       crmd:     info: process_graph_event:
Detected action (5.5) pgsqld_monitor_0.6=master (failed): failed
[1403] server2       crmd:   notice: te_rsc_command:    Initiating
notify operation pgsqld_pre_notify_demote_0 on server1 | action 56
[1403] server2       crmd:   notice: te_rsc_command:    Initiating
notify operation pgsqld_pre_notify_demote_0 locally on server2 |
action 58
[1403] server2       crmd:     info: do_lrm_rsc_op:     Performing
key=58:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0
[1403] server2       crmd:     info: match_graph_event: Action
pgsqld_notify_0 (58) confirmed on server2 (rc=0)
[1403] server2       crmd:   notice: process_lrm_event: Result of
notify operation for pgsqld on server2: 0 (ok) | call=32
key=pgsqld_notify_0 confirmed=true cib-update=0
[1403] server2       crmd:     info: match_graph_event: Action
pgsqld_notify_0 (56) confirmed on server1 (rc=0)
[1403] server2       crmd:   notice: te_rsc_command:    Initiating
demote operation pgsqld_demote_0 on server1 | action 6
[1398] server2        cib:     info: cib_perform_op:    +
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqld']/lrm_rsc_op[@id='pgsqld_last_failure_0']:
 @operation_key=pgsqld_demote_0, @operation=demote,
@transition-key=6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@transition-magic=0:1;6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@call-id=22, @rc-code=1, @last-run=1583514254,
@last-rc-change=1583514254, @exec-time=183
[1398] server2        cib:     info: cib_perform_op:    +
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqld']/lrm_rsc_op[@id='pgsqld_last_0']:
 @operation_key=pgsqld_demote_0, @operation=demote,
@transition-key=6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@transition-magic=0:1;6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@call-id=22, @rc-code=1, @last-run=1583514254,
@last-rc-change=1583514254, @exec-time=183
[1403] server2       crmd:  warning: status_from_rc:    Action 6
(pgsqld_demote_0) on server1 failed (target: 0 vs. rc: 1): Error
[1403] server2       crmd:   notice: abort_transition_graph:
Transition aborted by operation pgsqld_demote_0 'modify' on server1:
Event failed | magic=0:1;6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635
cib=0.24.21 source=match_graph_event:310 complete=false
[1403] server2       crmd:     info: match_graph_event: Action
pgsqld_demote_0 (6) confirmed on server1 (rc=1)
[1403] server2       crmd:     info: process_graph_event:
Detected action (6.6) pgsqld_demote_0.22=unknown error: failed
[1403] server2       crmd:  warning: status_from_rc:    Action 6
(pgsqld_demote_0) on server1 failed (target: 0 vs. rc: 1): Error
[1403] server2       crmd:     info: abort_transition_graph:
Transition aborted by operation pgsqld_demote_0 'modify' on server1:
Event failed | magic=0:1;6:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635
cib=0.24.21 source=match_graph_event:310 complete=false
[1403] server2       crmd:     info: match_graph_event: Action
pgsqld_demote_0 (6) confirmed on server1 (rc=1)
[1403] server2       crmd:     info: process_graph_event:
Detected action (6.6) pgsqld_demote_0.22=unknown error: failed
[1403] server2       crmd:   notice: te_rsc_command:    Initiating
notify operation pgsqld_post_notify_demote_0 on server1 | action 57
[1403] server2       crmd:   notice: te_rsc_command:    Initiating
notify operation pgsqld_post_notify_demote_0 locally on server2 |
action 59
[1403] server2       crmd:     info: do_lrm_rsc_op:     Performing
key=59:6:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0
[1403] server2       crmd:     info: match_graph_event: Action
pgsqld_notify_0 (57) confirmed on server1 (rc=0)
[1403] server2       crmd:     info: match_graph_event: Action
pgsqld_notify_0 (59) confirmed on server2 (rc=0)
[1403] server2       crmd:   notice: process_lrm_event: Result of
notify operation for pgsqld on server2: 0 (ok) | call=33
key=pgsqld_notify_0 confirmed=true cib-update=0
[1403] server2       crmd:   notice: te_rsc_command:    Initiating
notify operation pgsqld_pre_notify_stop_0 on server1 | action 46
[1403] server2       crmd:   notice: te_rsc_command:    Initiating
notify operation pgsqld_pre_notify_stop_0 locally on server2 | action
47
[1403] server2       crmd:     info: do_lrm_rsc_op:     Performing
key=47:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0
[1403] server2       crmd:     info: match_graph_event: Action
pgsqld_notify_0 (47) confirmed on server2 (rc=0)
[1403] server2       crmd:   notice: process_lrm_event: Result of
notify operation for pgsqld on server2: 0 (ok) | call=34
key=pgsqld_notify_0 confirmed=true cib-update=0
[1403] server2       crmd:     info: match_graph_event: Action
pgsqld_notify_0 (46) confirmed on server1 (rc=0)
[1403] server2       crmd:   notice: te_rsc_command:    Initiating
stop operation pgsqld_stop_0 on server1 | action 2
[1398] server2        cib:     info: cib_perform_op:    +
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqld']/lrm_rsc_op[@id='pgsqld_last_failure_0']:
 @operation_key=pgsqld_stop_0, @operation=stop,
@transition-key=2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@transition-magic=0:1;2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@exit-reason=Unexpected state for instance "pgsqld" (returned 9),
@call-id=25, @last-run=1583514255, @last-rc-change=1583514255,
@exec-tim
[1398] server2        cib:     info: cib_perform_op:    +
/cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='pgsqld']/lrm_rsc_op[@id='pgsqld_last_0']:
 @operation_key=pgsqld_stop_0, @operation=stop,
@transition-key=2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@transition-magic=0:1;2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635,
@exit-reason=Unexpected state for instance "pgsqld" (returned 9),
@call-id=25, @last-run=1583514255, @last-rc-change=1583514255,
@exec-time=190
[1403] server2       crmd:  warning: status_from_rc:    Action 2
(pgsqld_stop_0) on server1 failed (target: 0 vs. rc: 1): Error
[1403] server2       crmd:   notice: abort_transition_graph:
Transition aborted by operation pgsqld_stop_0 'modify' on server1:
Event failed | magic=0:1;2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635
cib=0.24.24 source=match_graph_event:310 complete=false
[1403] server2       crmd:     info: match_graph_event: Action
pgsqld_stop_0 (2) confirmed on server1 (rc=1)
[1403] server2       crmd:     info: process_graph_event:
Detected action (7.2) pgsqld_stop_0.25=unknown error: failed
[1403] server2       crmd:  warning: status_from_rc:    Action 2
(pgsqld_stop_0) on server1 failed (target: 0 vs. rc: 1): Error
[1403] server2       crmd:     info: abort_transition_graph:
Transition aborted by operation pgsqld_stop_0 'modify' on server1:
Event failed | magic=0:1;2:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635
cib=0.24.24 source=match_graph_event:310 complete=false
[1403] server2       crmd:     info: match_graph_event: Action
pgsqld_stop_0 (2) confirmed on server1 (rc=1)
[1403] server2       crmd:     info: process_graph_event:
Detected action (7.2) pgsqld_stop_0.25=unknown error: failed
[1403] server2       crmd:   notice: te_rsc_command:    Initiating
notify operation pgsqld_post_notify_stop_0 locally on server2 | action
48
[1403] server2       crmd:     info: do_lrm_rsc_op:     Performing
key=48:7:0:bb00f80b-88c5-4453-8291-44dae9fcc635 op=pgsqld_notify_0
[1403] server2       crmd:     info: match_graph_event: Action
pgsqld_notify_0 (48) confirmed on server2 (rc=0)
[1403] server2       crmd:   notice: process_lrm_event: Result of
notify operation for pgsqld on server2: 0 (ok) | call=35
key=pgsqld_notify_0 confirmed=true cib-update=0

Here I can see log entries like these:
Action 6 (pgsqld_demote_0) on server1 failed (target: 0 vs. rc: 1): Error
Action 2 (pgsqld_stop_0) on server1 failed (target: 0 vs. rc: 1): Error

It seems it cannot demote the previous master to a slave. The recovery.conf
file is present on this (failed) node.
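
For reference, the exit-reason in the logs above ("controldata indicates a
running primary instance") suggests PAF is looking at the control file of the
old primary. A way I could check that on the failed node (only a rough sketch,
assuming the bindir and pgdata from the configuration quoted below) would be:

  # on server1, as the postgres user, using the paths from the pcs configuration
  /usr/pgsql-9.6/bin/pg_controldata /var/lib/pgsql/9.6/data | grep 'cluster state'
  # "in production" would mean the instance was never shut down cleanly as a primary,
  # while a healthy standby would report "in archive recovery" or "shut down in recovery"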

Should I assume that every ungraceful shutdown scenario (and even a manual
fence) will result in a failover, so that I have to rebuild the instance?
Rebuilding the instance is also problematic: stopping the pacemaker/corosync
service, or running any related command, does not complete (it blocks) while
the node is in this FAILED/blocked state. I would find it more intuitive if the
node at least ended up with status Stopped rather than FAILED/blocked.
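
To be explicit about what I mean by rebuilding: the procedure I have in mind
(only a rough sketch, untested in this blocked state, and assuming a
replication user named "replicator", which is just a placeholder) is roughly:

  # on the fenced old primary (server1), after it has rebooted, as postgres:
  rm -rf /var/lib/pgsql/9.6/data
  /usr/pgsql-9.6/bin/pg_basebackup -h server2 -U replicator \
      -D /var/lib/pgsql/9.6/data -X stream -R
  # -R writes a recovery.conf with standby_mode and primary_conninfo;
  # I would still adjust it (application_name etc.) to match the PAF setup,
  # then clear the failed actions so Pacemaker can start it as a standby:
  pcs resource cleanup pgsqld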

Also, as a note, fencing does work: it successfully reboots any node. It is
the rejoin of the previous master that seems to have this problem.

Thank you in advance,
Aleksandra


On Thu, 5 Mar 2020 at 12:40, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
wrote:

> Hello,
>
> On Thu, 5 Mar 2020 12:21:14 +0100
> Aleksandra C <aleksandra29c at gmail.com> wrote:
> [...]
> > I would be very happy to use some help from you.
> >
> > I have configured PostgreSQL cluster with Pacemaker+PAF. The pacemaker
> > configuration is the following (from
> > https://clusterlabs.github.io/PAF/Quick_Start-CentOS-7.html)
> >
> > # pgsqld
> > pcs -f cluster1.xml resource create pgsqld ocf:heartbeat:pgsqlms \
> >     bindir=/usr/pgsql-9.6/bin pgdata=/var/lib/pgsql/9.6/data     \
> >     op start timeout=60s                                         \
> >     op stop timeout=60s                                          \
> >     op promote timeout=30s                                       \
> >     op demote timeout=120s                                       \
> >     op monitor interval=15s timeout=10s role="Master"            \
> >     op monitor interval=16s timeout=10s role="Slave"             \
> >     op notify timeout=60s
>
> If you can, I would recommend using PostgreSQL v11 or v12. Support for v12
> is in
> PAF 2.3rc2 which is supposed to be released next week.
>
>
> [...]
> > The cluster is behaving in strange way. When I manually fence the master
> > node (or ungracefully shutdown), after unfencing/starting, the node has
> > status Failed/blocked and the node is constantly fenced(restarted) by the
> > fencing agent. Should the fencing recover the cluster as Master/Slave
> > without problem?
>
> I suppose a failover occurred after the ungraceful shutdown? The old
> primary is
> probably seen as crashed from PAF point of view.
>
> Could you share pgsqlms detailed log?
>
> [...]
> > Is this a cluster misconfiguration? Any idea would be greatly
> appreciated.
>
> I don't think so. Make sure to look at
> https://clusterlabs.github.io/PAF/administration.html#failover
>
> Regards,
>