[ClusterLabs] Cluster node getting stopped from other node (resending mail)
Ken Gaillot
kgaillot at redhat.com
Tue Jul 7 16:23:03 CEST 2015
On 07/07/2015 03:58 AM, Arjun Pandey wrote:
> Hi Ken
>
> If I look at the logs on the other node around the same time, I see this. I
> can't figure out the reason based on these. Attaching the corosync.log for
> the other node as well.
I don't see anything there either. The relevant part could be earlier or
later.
Are you using the same version of pacemaker and corosync on both nodes?
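A quick way to compare, assuming an RPM-based install like the package list
further down, is to run on each node:

    rpm -q pacemaker corosync            # installed package versions
    cibadmin --query | grep dc-version   # version recorded in the CIB

The dc-version cluster property should match the pacemaker build running on
the DC.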
> Jun 01 13:55:44 [1965] messi crmd: info: do_dc_join_offer_one:
> An unknown node joined - (re-)offer to any unconfirmed nodes
> Jun 01 13:55:44 [1965] messi crmd: info: join_make_offer:
> Making join offers based on membership 224
> Jun 01 13:55:44 [1965] messi crmd: info: join_make_offer:
> Skipping messi: already known 4
> Jun 01 13:55:44 [1965] messi crmd: info: join_make_offer:
> join-2: Sending offer to ronaldo
> Jun 01 13:55:44 [1960] messi cib: info: cib_process_request:
> Forwarding cib_modify operation for section status to master
> (origin=local/crmd/138)
> Jun 01 13:55:44 [1960] messi cib: info: cib_perform_op:
> Diff: --- 0.80.2 2
> Jun 01 13:55:44 [1960] messi cib: info: cib_perform_op:
> Diff: +++ 0.80.3 (null)
> Jun 01 13:55:44 [1965] messi crmd: info: crm_update_peer_join:
> join_make_offer: Node ronaldo[2] - join-2 phase 0 -> 1
> Jun 01 13:55:44 [1965] messi crmd: info: abort_transition_graph:
> Transition aborted: Peer Halt (source=do_te_invoke:158, 1)
> Jun 01 13:55:44 [1960] messi cib: info: cib_perform_op: +
> /cib: @num_updates=3
> Jun 01 13:55:44 [1960] messi cib: info: cib_perform_op: +
> /cib/status/node_state[@id='ronaldo']: @crmd=online,
> @crm-debug-origin=peer_update_callback
> Jun 01 13:55:44 [1960] messi cib: info: cib_process_request:
> Completed cib_modify operation for section status: OK (rc=0,
> origin=messi/crmd/138, version=0.80.3)
> Jun 01 13:55:44 [1960] messi cib: info: cib_process_request:
> Completed cib_modify operation for section nodes: OK (rc=0,
> origin=ronaldo/crmd/3, version=0.80.3)
> Jun 01 13:55:45 [1965] messi crmd: info: do_dc_join_offer_one:
> join-2: Processing join_announce request from ronaldo in state
> S_INTEGRATION
> Jun 01 13:55:45 [1965] messi crmd: info: crm_update_peer_join:
> do_dc_join_offer_one: Node ronaldo[2] - join-2 phase 1 -> 0
> Jun 01 13:55:45 [1965] messi crmd: info: join_make_offer:
> join-2: Sending offer to ronaldo
> Jun 01 13:55:45 [1965] messi crmd: info: crm_update_peer_join:
> join_make_offer: Node ronaldo[2] - join-2 phase 0 -> 1
> Jun 01 13:55:45 [1965] messi crmd: info: crm_update_peer_join:
> join_make_offer: Node messi[1] - join-2 phase 4 -> 0
> Jun 01 13:55:45 [1965] messi crmd: info: join_make_offer:
> join-2: Sending offer to messi
> Jun 01 13:55:45 [1965] messi crmd: info: crm_update_peer_join:
> join_make_offer: Node messi[1] - join-2 phase 0 -> 1
> Jun 01 13:55:45 [1965] messi crmd: info: abort_transition_graph:
> Transition aborted: Node join (source=do_dc_join_offer_one:236, 1)
> Jun 01 13:55:45 [1965] messi crmd: info: crm_update_peer_join:
> do_dc_join_filter_offer: Node messi[1] - join-2 phase 1 -> 2
> Jun 01 13:55:46 [1965] messi crmd: info: crm_update_peer_join:
> do_dc_join_filter_offer: Node ronaldo[2] - join-2 phase 1 -> 2
> Jun 01 13:55:46 [1965] messi crmd: info:
> crm_update_peer_expected: do_dc_join_filter_offer: Node ronaldo[2] -
> expected state is now member (was down)
> Jun 01 13:55:46 [1965] messi crmd: info: do_state_transition:
> State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_FSA_INTERNAL origin=check_join_state ]
> Jun 01 13:55:46 [1965] messi crmd: info: crmd_join_phase_log:
> join-2: messi=integrated
> Jun 01 13:55:46 [1965] messi crmd: info: crmd_join_phase_log:
> join-2: ronaldo=integrated
> Jun 01 13:55:46 [1965] messi crmd: info: do_dc_join_finalize:
> join-2: Syncing our CIB to the rest of the cluster
> Jun 01 13:55:46 [1965] messi crmd: info: crm_update_peer_join:
> finalize_join_for: Node messi[1] - join-2 phase 2 -> 3
> Jun 01 13:55:46 [1965] messi crmd: info: crm_update_peer_join:
> finalize_join_for: Node ronaldo[2] - join-2 phase 2 -> 3
> Jun 01 13:55:46 [1965] messi crmd: info: crm_update_peer_join:
> do_dc_join_ack: Node messi[1] - join-2 phase 3 -> 4
> Jun 01 13:55:46 [1965] messi crmd: info: do_dc_join_ack:
> join-2: Updating node state to member for messi
> Jun 01 13:55:46 [1965] messi crmd: info: erase_status_tag:
> Deleting xpath: //node_state[@uname='messi']/lrm
> Jun 01 13:55:46 [1960] messi cib: info: cib_process_replace:
> Digest matched on replace from messi: 5138b696984c7b834dd2b528dadabe0d
> Jun 01 13:55:46 [1960] messi cib: info: cib_process_replace:
> Replaced 0.80.3 with 0.80.3 from messi
> Jun 01 13:55:46 [1965] messi crmd: info: crm_update_peer_join:
> do_dc_join_ack: Node ronaldo[2] - join-2 phase 3 -> 4
>
>
> Regards
> Arjun
>
> On Wed, Jul 1, 2015 at 7:17 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
>
>> On 06/30/2015 11:30 PM, Arjun Pandey wrote:
>>> Hi
>>>
>>> I am running a two-node cluster with this config on CentOS 6.5/6.6
>>>
>>> Master/Slave Set: foo-master [foo]
>>> Masters: [ messi ]
>>> Stopped: [ronaldo ]
>>> eth1-CP (ocf::pw:IPaddr): Started messi
>>> eth2-UP (ocf::pw:IPaddr): Started messi
>>> eth3-UPCP (ocf::pw:IPaddr): Started messi
>>>
>>> where I have a multi-state resource foo running in master/slave mode, and
>>> the IPaddr RA is just a modified IPaddr2 RA. Additionally, I have a
>>> colocation constraint for the IP addresses to be colocated with the master.
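>>> (For reference, such a colocation constraint can be created with something
>>> like the following; this is illustrative only, and exact pcs syntax varies
>>> by version. The same form applies to the other two IPs:)
>>>
>>>     # colocate eth1-CP with the Master role of foo-master
>>>     pcs constraint colocation add eth1-CP with master foo-master INFINITY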
>>>
>>> Sometimes when I set up the cluster, I find that one of the nodes (the
>>> second node to join) gets stopped, and I see this log.
>>>
>>> 2015-06-01T13:55:46.153941+05:30 ronaldo pacemaker: Starting Pacemaker
>>> Cluster Manager
>>> 2015-06-01T13:55:46.233639+05:30 ronaldo attrd[25988]: notice:
>>> attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
>>> 2015-06-01T13:55:46.234162+05:30 ronaldo crmd[25990]: notice:
>>> do_state_transition: State transition S_PENDING -> S_NOT_DC [
>>> input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
>>> 2015-06-01T13:55:46.234701+05:30 ronaldo attrd[25988]: notice:
>>> attrd_local_callback: Sending full refresh (origin=crmd)
>>> 2015-06-01T13:55:46.234708+05:30 ronaldo attrd[25988]: notice:
>>> attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
>>> ********************** This looks to be the likely reason **********************
>>> 2015-06-01T13:55:46.254310+05:30 ronaldo crmd[25990]: error:
>>> handle_request: We didn't ask to be shut down, yet our DC is telling us too.
>>> *********************************************************************************
>>
>> Hi Arjun,
>>
>> I'd check the other node's logs at this time, to see why it requested
>> the shutdown.
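>>
>> On the DC node, the decision to shut a peer down usually shows up in the
>> pengine/crmd messages. Something along these lines may help locate it (the
>> log path and exact message wording vary by version and distro):
>>
>>     # search the DC's log for shutdown-related decisions
>>     grep -iE "shutdown|do_shutdown|CRM_OP_SHUTDOWN" /var/log/cluster/corosync.log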
>>
>>> 2015-06-01T13:55:46.254577+05:30 ronaldo crmd[25990]: notice:
>>> do_state_transition: State transition S_NOT_DC -> S_STOPPING [
>>> input=I_STOP cause=C_HA_MESSAGE origin=route_message ]
>>> 2015-06-01T13:55:46.255134+05:30 ronaldo crmd[25990]: notice:
>>> lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown...
>>> waiting (2 ops remaining)
>>>
>>> Based on the logs, pacemaker on the active node was stopping the secondary
>>> every time it joined the cluster. This issue seems similar to
>>> http://pacemaker.oss.clusterlabs.narkive.com/rVvN8May/node-sends-shutdown-request-to-other-node-error
>>>
>>> Packages used:
>>> pacemaker-1.1.12-4.el6.x86_64
>>> pacemaker-libs-1.1.12-4.el6.x86_64
>>> pacemaker-cli-1.1.12-4.el6.x86_64
>>> pacemaker-cluster-libs-1.1.12-4.el6.x86_64
>>> pacemaker-debuginfo-1.1.10-14.el6.x86_64
>>> pcsc-lite-libs-1.5.2-13.el6_4.x86_64
>>> pcs-0.9.90-2.el6.centos.2.noarch
>>> pcsc-lite-1.5.2-13.el6_4.x86_64
>>> pcsc-lite-openct-0.6.19-4.el6.x86_64
>>> corosync-1.4.1-17.el6.x86_64
>>> corosynclib-1.4.1-17.el6.x86_64