[ClusterLabs] Re: Pacemaker Master restarts when Slave is added to the cluster

范国腾 fanguoteng at highgo.com
Thu Dec 28 04:36:51 UTC 2017


Andrei,

I set interleave=true and it does not restart any more. Thank you very much.
One word from you solved a problem that had confused me for several days 😊
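
The change amounted to just the interleave meta attribute on the master resource, along these lines (exact syntax may differ between pcs versions):

    pcs resource meta pgsql-ha interleave=true

It can also be set when creating the master resource, e.g. "pcs resource master pgsql-ha pgsqld notify=true interleave=true", if the pcs version accepts meta options there.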


-----Original Message-----
From: Andrei Borzenkov [mailto:arvidjaar at gmail.com]
Sent: December 27, 2017 19:06
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
Subject: Re: [ClusterLabs] Pacemaker Master restarts when Slave is added to the cluster
主题: Re: [ClusterLabs] Pacemaker Master restarts when Slave is added to the cluster

Usual suspect: interleave=false on the clone resource.
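
With interleave=false, the ordering constraint "start clvmd-clone then pgsql-ha" applies to the clone as a whole, so starting the new clvmd instance on db1 forces every dependent pgsql-ha instance, including the master on db2, to restart. With interleave=true, each pgsql-ha instance only waits for the clvmd copy on its own node.

To see why a restart was scheduled, you can replay the scheduler input named in your log (a sketch; the restart decision may be in a later pe-input file than the one shown):

    crm_simulate --simulate --show-scores --xml-file /var/lib/pacemaker/pengine/pe-input-116.bz2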

On Wed, Dec 27, 2017 at 10:49 AM, 范国腾 <fanguoteng at highgo.com> wrote:
> Hello,
>
>
>
> In my test environment I have hit an issue with Pacemaker: when a new
> node is added to the cluster, the master node restarts. This leaves
> the system out of service for a while whenever a node is added,
> because there is temporarily no master. Could you please tell me how
> to debug this kind of issue?
>
>
>
> I have a Pacemaker master/slave cluster as shown below. pgsql-ha is a
> resource whose agent I made by copying the script from
> /usr/lib/ocf/resource.d/heartbeat/Dummy and adding some simple code to
> support promote/demote.
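>
> Roughly, the promote/demote additions look like this (a simplified
> sketch, not the actual script; the state-file path and function names
> are just for illustration):
>
> pgsqld_promote() {
>     # record that this instance is now the master
>     echo master > "${HA_RSCTMP}/pgsqld.state"
>     return $OCF_SUCCESS
> }
>
> pgsqld_demote() {
>     # drop back to slave
>     echo slave > "${HA_RSCTMP}/pgsqld.state"
>     return $OCF_SUCCESS
> }
>
> pgsqld_monitor() {
>     [ -f "${HA_RSCTMP}/pgsqld.state" ] || return $OCF_NOT_RUNNING
>     if grep -q master "${HA_RSCTMP}/pgsqld.state"; then
>         ocf_log info "pgsqld monitor : $OCF_RUNNING_MASTER"
>         return $OCF_RUNNING_MASTER   # rc 8, as in the logs below
>     fi
>     return $OCF_SUCCESS
> }
>
> (A real master/slave agent would also set a promotion score with
> crm_master, but that is beside the point here.)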
>
> Now when I run "pcs cluster stop" on db1, db1 stops and db2 stays master.
>
> The problem is: when I then run "pcs cluster start" on db1, the db2
> status changes as follows: master -> slave -> stop -> slave -> master.
> Why does db2 restart?
>
> CENTOS7:
> ======================================================
>
> 2 nodes and 7 resources configured
>
> Online: [ db1 db2 ]
>
> Full list of resources:
>
> Clone Set: dlm-clone [dlm]
>      Started: [ db1 db2 ]
> Clone Set: clvmd-clone [clvmd]
>      Started: [ db1 db2 ]
> scsi-stonith-device    (stonith:fence_scsi):   Started db2
> Master/Slave Set: pgsql-ha [pgsqld]
>      Masters: [ db2 ]
>      Slaves: [ db1 ]
>
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> [root at db1 heartbeat]#
>
> ==========================================================
>
> /var/log/messages:
>
> Dec 27 00:52:50 db2 cib[3290]:  notice: Purged 1 peers with id=1 and/or uname=db1 from the membership cache
> Dec 27 00:52:51 db2 kernel: dlm: closing connection to node 1
> Dec 27 00:52:51 db2 corosync[3268]: [TOTEM ] A new membership (192.168.199.199:372) was formed. Members left: 1
> Dec 27 00:52:51 db2 corosync[3268]: [QUORUM] Members[1]: 2
> Dec 27 00:52:51 db2 corosync[3268]: [MAIN  ] Completed service synchronization, ready to provide service.
> Dec 27 00:52:51 db2 crmd[3295]:  notice: Node db1 state is now lost
> Dec 27 00:52:51 db2 crmd[3295]:  notice: do_shutdown of peer db1 is complete
> Dec 27 00:52:51 db2 pacemakerd[3289]:  notice: Node db1 state is now lost
> Dec 27 00:52:57 db2 Doctor(pgsqld)[6671]: INFO: pgsqld monitor : 8
> Dec 27 00:53:12 db2 Doctor(pgsqld)[6681]: INFO: pgsqld monitor : 8
> Dec 27 00:53:27 db2 Doctor(pgsqld)[6746]: INFO: pgsqld monitor : 8
> Dec 27 00:53:33 db2 corosync[3268]: [TOTEM ] A new membership (192.168.199.197:376) was formed. Members joined: 1
> Dec 27 00:53:33 db2 corosync[3268]: [QUORUM] Members[2]: 1 2
> Dec 27 00:53:33 db2 corosync[3268]: [MAIN  ] Completed service synchronization, ready to provide service.
> Dec 27 00:53:33 db2 crmd[3295]:  notice: Node db1 state is now member
> Dec 27 00:53:33 db2 pacemakerd[3289]:  notice: Node db1 state is now member
> Dec 27 00:53:33 db2 crmd[3295]:  notice: do_shutdown of peer db1 is complete
> Dec 27 00:53:33 db2 crmd[3295]:  notice: State transition S_IDLE -> S_INTEGRATION
> Dec 27 00:53:33 db2 pengine[3294]:  notice: Calculated transition 17, saving inputs in /var/lib/pacemaker/pengine/pe-input-116.bz2
> Dec 27 00:53:33 db2 crmd[3295]:  notice: Transition 17 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-116.bz2): Complete
> Dec 27 00:53:33 db2 crmd[3295]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> Dec 27 00:53:33 db2 stonith-ng[3291]:  notice: Node db1 state is now member
> Dec 27 00:53:33 db2 attrd[3293]:  notice: Node db1 state is now member
> Dec 27 00:53:33 db2 cib[3290]:  notice: Node db1 state is now member
> Dec 27 00:53:34 db2 crmd[3295]:  notice: State transition S_IDLE -> S_INTEGRATION
> Dec 27 00:53:37 db2 crmd[3295]: warning: No reason to expect node 2 to be down
> Dec 27 00:53:38 db2 pengine[3294]:  notice: Unfencing db1: node discovery
> Dec 27 00:53:38 db2 pengine[3294]:  notice: Start   dlm:1#011(db1)
> Dec 27 00:53:38 db2 pengine[3294]:  notice: Start   clvmd:1#011(db1)
> Dec 27 00:53:38 db2 pengine[3294]:  notice: Restart pgsqld:0#011(Master db2)
>
> /var/log/cluster/corosync.log:
>
> Dec 27 00:53:37 [3290] db2        cib:     info: cib_process_request:   Completed cib_modify operation for section status: OK (rc=0, origin=db2/crmd/99, version=0.60.29)
> Dec 27 00:53:37 [3290] db2        cib:     info: cib_process_request:   Forwarding cib_delete operation for section //node_state[@uname='db2']/lrm to all (origin=local/crmd/100)
> Dec 27 00:53:37 [3295] db2       crmd:     info: do_state_transition:   State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE | input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state
> Dec 27 00:53:37 [3295] db2       crmd:     info: abort_transition_graph:        Transition aborted: Peer Cancelled | source=do_te_invoke:161 complete=true
> Dec 27 00:53:37 [3293] db2      attrd:     info: attrd_client_refresh:  Updating all attributes
> Dec 27 00:53:37 [3293] db2      attrd:     info: write_attribute:       Sent update 12 with 2 changes for shutdown, id=<n/a>, set=(null)
> Dec 27 00:53:37 [3293] db2      attrd:     info: write_attribute:       Sent update 13 with 1 changes for last-failure-pgsqld, id=<n/a>, set=(null)
> Dec 27 00:53:37 [3293] db2      attrd:     info: write_attribute:       Sent update 14 with 2 changes for terminate, id=<n/a>, set=(null)
> Dec 27 00:53:37 [3293] db2      attrd:     info: write_attribute:       Sent update 15 with 1 changes for fail-count-pgsqld, id=<n/a>, set=(null)
> Dec 27 00:53:37 [3290] db2        cib:     info: cib_process_request:   Forwarding cib_modify operation for section status to all (origin=local/crmd/101)
> Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        Diff: --- 0.60.29 2
> Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        Diff: +++ 0.60.30 (null)
> Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        -- /cib/status/node_state[@id='2']/lrm[@id='2']
> Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        +  /cib:  @num_updates=30
> Dec 27 00:53:37 [3295] db2       crmd:  warning: match_down_event:      No reason to expect node 2 to be down
> Dec 27 00:53:37 [3295] db2       crmd:     info: abort_transition_graph:        Transition aborted by deletion of lrm[@id='2']: Resource state removal | cib=0.60.30 source=abort_unless_down:343 path=/cib/status/node_state[@id='2']/lrm[@id='2'] complete=true
> Dec 27 00:53:37 [3290] db2        cib:     info: cib_process_request:   Completed cib_delete operation for section //node_state[@uname='db2']/lrm: OK (rc=0, origin=db2/crmd/100, version=0.60.30)
> Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        Diff: --- 0.60.30 2
> Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        Diff: +++ 0.60.31 (null)
> Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        +  /cib:  @num_updates=31
> Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        +  /cib/status/node_state[@id='2']:  @crm-debug-origin=do_lrm_query_internal
> Dec 27 00:53:37 [3290] db2        cib:     info: cib_perform_op:        ++ /cib/status/node_state[@id='2']:  <lrm id="2"/>
>
> I use these commands to create the resource:
>
> pcs resource create pgsqld ocf:heartbeat:Doctor \
>     op start timeout=60s op stop timeout=60s \
>     op promote timeout=30s op demote timeout=120s \
>     op monitor interval=15s timeout=10s role="Master" \
>     op monitor interval=16s timeout=10s role="Slave" \
>     op notify timeout=60s
> pcs resource master pgsql-ha pgsqld notify=true
> pcs constraint order start clvmd-clone then pgsql-ha

_______________________________________________
Users mailing list: Users at clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

