[Pacemaker] Master-slave master not promoted on Corosync restart

Fri Oct 24 10:00:39 UTC 2014

Hi Andrew,

Yep, forgot the attachments. I did reproduce the issue, please find
the bz2 files attached. Please tell if you need hb_report being used.

Thank you !

2014-10-07 5:07 GMT+02:00 Andrew Beekhof <andrew at beekhof.net>:
> I think you forgot the attachments (and my eyes are going blind trying to read the word-wrapped logs :-)
>
> On 26 Sep 2014, at 6:37 pm, Sékine Coulibaly <scoulibaly at gmail.com> wrote:
>
>> Hi everyone,
>>
>> I'm trying my  best to diagnose a strange behaviour of my cluster.
>>
>> My cluster is basically a Master-Slave PostgreSQL cluster, with a VIP.
>> Two nodes (clustera and clusterb). I'm running RHEL 6.5, Corosync
>> 1.4.1-1 and Pacemaker 1.1.10.
>>
>> For the simplicity sake of the diagnostic, I took of the slave node.
>>
>> My problem is that the cluster properly promotes the POSTGRESQL
>> resource once (I issue a resource cleanup MS_POSTGRESQL to reset
>> failcount counter, and then all resources are mounted on clustera).
>> After a Corosync restart, the POSTGRESQL resource is not promoted.
>>
>> I narrowed down to the point where I add a location constraint
>> (without this location constraint, after a Corosync restart,
>> POSTGRESQL resource is promoted):
>>
>> location VIP_MGT_needs_gw VIP_MGT rule -inf: not_defined pingd or pingd lte 0
>>
>> The logs show that the pingd attribute value is 1000 (the ping IP is
>> pingable, and pinged [used tcpdump]). This attribute is set by :
>> primitive ping_eth1_mgt_gw ocf:pacemaker:ping params
>> host_list=178.3.1.47 multiplier=1000 op monitor interval=10s meta
>> migration-threshold=3
>>
>> From corosync.log I can see :
>> Sep 26 09:49:36 [22188] clustera    pengine:   notice: LogActions:
>> Start   POSTGRESQL:0    (clustera)
>> Sep 26 09:49:36 [22188] clustera    pengine:     info: LogActions:
>> Leave   POSTGRESQL:1    (Stopped)
>> [...]
>> Sep 26 09:49:36 [22186] clustera       lrmd:     info: log_execute:
>> executing - rsc:POSTGRESQL action:start call_id:20
>> [...]
>> Sep 26 09:49:37 [22187] clustera      attrd:   notice:
>> attrd_trigger_update:    Sending flush op to all hosts for:
>> master-POSTGRESQL (50)
>> [...]
>> Sep 26 09:49:37 [22189] clustera       crmd:     info:
>> match_graph_event:       Action POSTGRESQL_notify_0 (46) confirmed on
>> clustera (rc=0)
>> [...]
>> Sep 26 09:49:38 [22186] clustera       lrmd:     info: log_finished:
>> finished - rsc:ping_eth1_mgt_gw action:start call_id:22 pid:22352
>> exit-code:0 exec-time:2175ms queue-time:0ms
>> [...]
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: clone_print:
>>  Master/Slave Set: MS_POSTGRESQL [POSTGRESQL]
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: short_print:
>>      Slaves: [ clustera ]
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: short_print:
>>      Stopped: [ clusterb ]
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: native_print:
>> VIP_MGT (ocf::heartbeat:IPaddr2):       Stopped
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: clone_print:
>>  Clone Set: cloned_ping_eth1_mgt_gw [ping_eth1_mgt_gw]
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: short_print:
>>      Started: [ clustera ]
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: short_print:
>>      Stopped: [ clusterb ]
>> Sep 26 09:49:38 [22188] clustera    pengine:     info:
>> rsc_merge_weights:       VIP_MGT: Rolling back scores from
>> MS_POSTGRESQL
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: native_color:
>> Resource VIP_MGT cannot run anywhere
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: native_color:
>> POSTGRESQL:1: Rolling back scores from VIP_MGT
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: native_color:
>> Resource POSTGRESQL:1 cannot run anywhere
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: master_color:
>> MS_POSTGRESQL: Promoted 0 instances of a possible 1 to master
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: native_color:
>> Resource ping_eth1_mgt_gw:1 cannot run anywhere
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: RecurringOp:
>>  Start recurring monitor (60s) for POSTGRESQL:0 on clustera
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: RecurringOp:
>>  Start recurring monitor (60s) for POSTGRESQL:0 on clustera
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: RecurringOp:
>>  Start recurring monitor (10s) for ping_eth1_mgt_gw:0 on clustera
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: LogActions:
>> Leave   POSTGRESQL:0    (Slave clustera)
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: LogActions:
>> Leave   POSTGRESQL:1    (Stopped)
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: LogActions:
>> Leave   VIP_MGT (Stopped)
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: LogActions:
>> Leave   ping_eth1_mgt_gw:0      (Started clustera)
>> Sep 26 09:49:38 [22188] clustera    pengine:     info: LogActions:
>> Leave   ping_eth1_mgt_gw:1      (Stopped)
>> [...]
>>
>> Then everything goes weird. The POSTGRESQL is monitored, seen as
>> master (rc=8). Since it is expected (???) to be OCF_RUNNING, the
>> monitor operation failed. That's wierd is n't it, or am I missing
>> something ? :
>>
>> Sep 26 09:49:38 [22189] clustera       crmd:     info:
>> do_state_transition:     State transition S_POLICY_ENGINE ->
>> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
>> origin=handle_response ]
>> Sep 26 09:49:38 [22189] clustera       crmd:     info: do_te_invoke:
>> Processing graph 1 (ref=pe_calc-dc-1411717778-16) derived from
>> /var/lib/pacemaker/pengine/pe-input-1057.bz2
>> Sep 26 09:49:38 [22189] clustera       crmd:   notice: te_rsc_command:
>> Initiating action 6: monitor POSTGRESQL_monitor_60000 on clustera
>> (local)
>> Sep 26 09:49:38 [22189] clustera       crmd:     info: do_lrm_rsc_op:
>> Performing key=6:1:0:613bb344-94ce-4210-879c-6d6e90f35871
>> op=POSTGRESQL_monitor_60000
>> Sep 26 09:49:38 [22189] clustera       crmd:   notice: te_rsc_command:
>> Initiating action 33: monitor ping_eth1_mgt_gw_monitor_10000 on
>> clustera (local)
>> Sep 26 09:49:38 [22189] clustera       crmd:     info: do_lrm_rsc_op:
>> Performing key=33:1:0:613bb344-94ce-4210-879c-6d6e90f35871
>> op=ping_eth1_mgt_gw_monitor_10000
>> Sep 26 09:49:38 [22184] clustera        cib:     info: crm_client_new:
>> Connecting 0x1bbcc00 for uid=0 gid=0 pid=22767
>> id=0ef393a5-6f7b-4893-9b06-0fab26d9a06d
>> Sep 26 09:49:38 [22184] clustera        cib:     info:
>> cib_process_request:     Completed cib_query operation for section
>> nodes: OK (rc=0, origin=local/crm_attribute/2, version=0.692.9)
>> Sep 26 09:49:38 [22187] clustera      attrd:   notice:
>> attrd_trigger_update:    Sending flush op to all hosts for:
>> master-POSTGRESQL (100)
>> Sep 26 09:49:38 [22184] clustera        cib:     info:
>> cib_process_request:     Completed cib_query operation for section
>> //cib/status//node_state[@id='clustera']//transient_attributes//nvpair[@name='master-POSTGRESQL']:
>> OK (rc=0, origin=local/attrd/8, version=0.692.9)
>> Sep 26 09:49:38 [22184] clustera        cib:     info:
>> crm_client_destroy:      Destroying 0 events
>> Sep 26 09:49:38 [22187] clustera      attrd:   notice:
>> attrd_perform_update:    Sent update 9: master-POSTGRESQL=100
>> Sep 26 09:49:38 [22184] clustera        cib:     info:
>> cib_process_request:     Completed cib_modify operation for section
>> status: OK (rc=0, origin=local/attrd/9, version=0.692.10)
>> Sep 26 09:49:38 [22189] clustera       crmd:     info:
>> abort_transition_graph:  te_update_diff:172 - Triggered transition
>> abort (complete=0, node=clustera, tag=nvpair,
>> id=status-clustera-master-POSTGRESQL, name=master-POSTGRESQL,
>> value=100, magic=NA, cib=0.692.10) : Transient attribute: update
>> Sep 26 09:49:38 [22189] clustera       crmd:   notice:
>> process_lrm_event:       LRM operation POSTGRESQL_monitor_60000
>> (call=29, rc=8, cib-update=33, confirmed=false) master
>> Sep 26 09:49:38 [22184] clustera        cib:     info:
>> cib_process_request:     Completed cib_modify operation for section
>> status: OK (rc=0, origin=local/crmd/33, version=0.692.11)
>> Sep 26 09:49:38 [22189] clustera       crmd:  warning: status_from_rc:
>> Action 6 (POSTGRESQL_monitor_60000) on clustera failed (target: 0 vs.
>> rc: 8): Error
>> Sep 26 09:49:38 [22189] clustera       crmd:  warning:
>> update_failcount:        Updating failcount for POSTGRESQL on clustera
>> after failed monitor: rc=8 (update=value++, time=1411717778)
>> Sep 26 09:49:38 [22189] clustera       crmd:     info:
>> abort_transition_graph:  match_graph_event:313 - Triggered transition
>> abort (complete=0, node=clustera, tag=lrm_rsc_op,
>> id=POSTGRESQL_last_failure_0,
>> magic=0:8;6:1:0:613bb344-94ce-4210-879c-6d6e90f35871, cib=0.692.11) :
>> Event failed
>> Sep 26 09:49:38 [22189] clustera       crmd:     info:
>> match_graph_event:       Action POSTGRESQL_monitor_60000 (6) confirmed
>> on clustera (rc=4)
>> Sep 26 09:49:38 [22189] clustera       crmd:  warning:
>> update_failcount:        Updating failcount for POSTGRESQL on clustera
>> after failed monitor: rc=8 (update=value++, time=1411717778)
>> Sep 26 09:49:38 [22189] clustera       crmd:     info:
>> process_graph_event:     Detected action (1.6)
>> POSTGRESQL_monitor_60000.29=master: failed
>> Sep 26 09:49:38 [22187] clustera      attrd:   notice:
>> attrd_trigger_update:    Sending flush op to all hosts for:
>> fail-count-POSTGRESQL (1)
>> Sep 26 09:49:38 [22184] clustera        cib:     info:
>> cib_process_request:     Completed cib_query operation for section
>> //cib/status//node_state[@id='clustera']//transient_attributes//nvpair[@name='fail-count-POSTGRESQL']:
>> No such device or address (rc=-6, origin=local/attrd/10,
>> version=0.692.11)
>> Sep 26 09:49:38 [22184] clustera        cib:     info:
>> cib_process_request:     Completed cib_query operation for section
>> /cib: OK (rc=0, origin=local/attrd/11, version=0.692.11)
>> Sep 26 09:49:38 [22187] clustera      attrd:   notice:
>> attrd_perform_update:    Sent update 12: fail-count-POSTGRESQL=1
>>
>> Attached are the bz2 files and the corosync.log.
>>
>> Any help is appreciated !
>>
>> Sekine
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-208.bz2
Type: application/x-bzip2
Size: 3902 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0064.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-209.bz2
Type: application/x-bzip2
Size: 3834 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0065.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-210.bz2
Type: application/x-bzip2
Size: 3717 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0066.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-211.bz2
Type: application/x-bzip2
Size: 3709 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0067.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-212.bz2
Type: application/x-bzip2
Size: 3744 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0068.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-213.bz2
Type: application/x-bzip2
Size: 3778 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0069.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-214.bz2
Type: application/x-bzip2
Size: 3912 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0070.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-215.bz2
Type: application/x-bzip2
Size: 3971 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0071.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-216.bz2
Type: application/x-bzip2
Size: 3951 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0072.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-217.bz2
Type: application/x-bzip2
Size: 3960 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0073.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-218.bz2
Type: application/x-bzip2
Size: 3963 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0074.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-219.bz2
Type: application/x-bzip2
Size: 3966 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0075.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-220.bz2
Type: application/x-bzip2
Size: 3968 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0076.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-221.bz2
Type: application/x-bzip2
Size: 3968 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0077.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-222.bz2
Type: application/x-bzip2
Size: 3962 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0078.bz2>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pe-input-223.bz2
Type: application/x-bzip2
Size: 3948 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20141024/6396ac09/attachment-0079.bz2>