[ClusterLabs] Cluster node getting stopped from other node(resending mail)

Tue Jul 7 10:23:03 EDT 2015

On 07/07/2015 03:58 AM, Arjun Pandey wrote:
> Hi Ken
> 
> If i look at the logs on the other node around the same time i see this. I
> can't figure out the reason based on these.Attaching the corosync.log for
> the other node as well.

I don't see anything there either. The relevant part could be earlier or
later.

Are you using the same version of pacemaker and corosync on both nodes?

> Jun 01 13:55:44 [1965] messi       crmd:     info: do_dc_join_offer_one:
>      An unknown node joined - (re-)offer to any unconfirmed nodes
> Jun 01 13:55:44 [1965] messi       crmd:     info: join_make_offer:
> Making join offers based on membership 224
> Jun 01 13:55:44 [1965] messi       crmd:     info: join_make_offer:
> Skipping messi: already known 4
> Jun 01 13:55:44 [1965] messi       crmd:     info: join_make_offer:
> join-2: Sending offer to ronaldo
> Jun 01 13:55:44 [1960] messi        cib:     info: cib_process_request:
>     Forwarding cib_modify operation for section status to master
> (origin=local/crmd/138)
> Jun 01 13:55:44 [1960] messi        cib:     info: cib_perform_op:
>  Diff: --- 0.80.2 2
> Jun 01 13:55:44 [1960] messi        cib:     info: cib_perform_op:
>  Diff: +++ 0.80.3 (null)
> Jun 01 13:55:44 [1965] messi       crmd:     info: crm_update_peer_join:
>      join_make_offer: Node ronaldo[2] - join-2 phase 0 -> 1
> Jun 01 13:55:44 [1965] messi       crmd:     info: abort_transition_graph:
>      Transition aborted: Peer Halt (source=do_te_invoke:158, 1)
> Jun 01 13:55:44 [1960] messi        cib:     info: cib_perform_op:      +
>  /cib:  @num_updates=3
> Jun 01 13:55:44 [1960] messi        cib:     info: cib_perform_op:      +
>  /cib/status/node_state[@id='ronaldo']:  @crmd=online,
> @crm-debug-origin=peer_update_callback
> Jun 01 13:55:44 [1960] messi        cib:     info: cib_process_request:
>     Completed cib_modify operation for section status: OK (rc=0,
> origin=messi/crmd/138, version=0.80.3)
> Jun 01 13:55:44 [1960] messi        cib:     info: cib_process_request:
>     Completed cib_modify operation for section nodes: OK (rc=0,
> origin=ronaldo/crmd/3, version=0.80.3)
> Jun 01 13:55:45 [1965] messi       crmd:     info: do_dc_join_offer_one:
>      join-2: Processing join_announce request from ronaldo in state
> S_INTEGRATION
> Jun 01 13:55:45 [1965] messi       crmd:     info: crm_update_peer_join:
>      do_dc_join_offer_one: Node ronaldo[2] - join-2 phase 1 -> 0
> Jun 01 13:55:45 [1965] messi       crmd:     info: join_make_offer:
> join-2: Sending offer to ronaldo
> Jun 01 13:55:45 [1965] messi       crmd:     info: crm_update_peer_join:
>      join_make_offer: Node ronaldo[2] - join-2 phase 0 -> 1
> Jun 01 13:55:45 [1965] messi       crmd:     info: crm_update_peer_join:
>      join_make_offer: Node messi[1] - join-2 phase 4 -> 0
> Jun 01 13:55:45 [1965] messi       crmd:     info: join_make_offer:
> join-2: Sending offer to messi
> Jun 01 13:55:45 [1965] messi       crmd:     info: crm_update_peer_join:
>      join_make_offer: Node messi[1] - join-2 phase 0 -> 1
> Jun 01 13:55:45 [1965] messi       crmd:     info: abort_transition_graph:
>      Transition aborted: Node join (source=do_dc_join_offer_one:236, 1)
> Jun 01 13:55:45 [1965] messi       crmd:     info: crm_update_peer_join:
>      do_dc_join_filter_offer: Node messi[1] - join-2 phase 1 -> 2
> Jun 01 13:55:46 [1965] messi       crmd:     info: crm_update_peer_join:
>      do_dc_join_filter_offer: Node ronaldo[2] - join-2 phase 1 -> 2
> Jun 01 13:55:46 [1965] messi       crmd:     info:
> crm_update_peer_expected:    do_dc_join_filter_offer: Node ronaldo[2] -
> expected state is now member (was down)
> Jun 01 13:55:46 [1965] messi       crmd:     info: do_state_transition:
>     State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED
> cause=C_FSA_INTERNAL origin=check_join_state ]
> Jun 01 13:55:46 [1965] messi       crmd:     info: crmd_join_phase_log:
>     join-2: messi=integrated
> Jun 01 13:55:46 [1965] messi       crmd:     info: crmd_join_phase_log:
>     join-2: ronaldo=integrated
> Jun 01 13:55:46 [1965] messi       crmd:     info: do_dc_join_finalize:
>     join-2: Syncing our CIB to the rest of the cluster
> Jun 01 13:55:46 [1965] messi       crmd:     info: crm_update_peer_join:
>      finalize_join_for: Node messi[1] - join-2 phase 2 -> 3
> Jun 01 13:55:46 [1965] messi       crmd:     info: crm_update_peer_join:
>      finalize_join_for: Node ronaldo[2] - join-2 phase 2 -> 3
> Jun 01 13:55:46 [1965] messi       crmd:     info: crm_update_peer_join:
>      do_dc_join_ack: Node messi[1] - join-2 phase 3 -> 4
> Jun 01 13:55:46 [1965] messi       crmd:     info: do_dc_join_ack:
>  join-2: Updating node state to member for messi
> Jun 01 13:55:46 [1965] messi       crmd:     info: erase_status_tag:
>  Deleting xpath: //node_state[@uname='messi']/lrm
> Jun 01 13:55:46 [1960] messi        cib:     info: cib_process_replace:
>     Digest matched on replace from messi: 5138b696984c7b834dd2b528dadabe0d
> Jun 01 13:55:46 [1960] messi        cib:     info: cib_process_replace:
>     Replaced 0.80.3 with 0.80.3 from messi
> Jun 01 13:55:46 [1965] messi       crmd:     info: crm_update_peer_join:
>      do_dc_join_ack: Node ronaldo[2] - join-2 phase 3 -> 4
> 
> 
> Regards
> Arjun
> 
> On Wed, Jul 1, 2015 at 7:17 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> 
>> On 06/30/2015 11:30 PM, Arjun Pandey wrote:
>>> Hi
>>>
>>> I am running a 2 node cluster with this config on centos 6.5/6.6
>>>
>>> Master/Slave Set: foo-master [foo]
>>> Masters: [ messi ]
>>> Stopped: [ronaldo ]
>>>  eth1-CP        (ocf::pw:IPaddr):       Started messi
>>>  eth2-UP        (ocf::pw:IPaddr):       Started messi
>>>  eth3-UPCP      (ocf::pw:IPaddr):       Started messi
>>>
>>> where i have a multi-state resource foo being run in master/slave mode
>> and
>>>  IPaddr RA is just modified IPAddr2 RA. Additionally i have a
>>> collocation constraint for the IP addr to be collocated with the master.
>>>
>>> Sometimes when i setup the cluster , i find that one of the nodes (the
>>> second node that joins ) gets stopped and i find this log.
>>>
>>> 2015-06-01T13:55:46.153941+05:30 ronaldo pacemaker: Starting Pacemaker
>>> Cluster Manager
>>> 2015-06-01T13:55:46.233639+05:30 ronaldo attrd[25988]:   notice:
>>> attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
>>> 2015-06-01T13:55:46.234162+05:30 ronaldo crmd[25990]:   notice:
>>> do_state_transition: State transition S_PENDING -> S_NOT_DC [
>>> input=I_NOT_DC cause=C_HA_MESSAG
>>> E origin=do_cl_join_finalize_respond ]
>>> 2015-06-01T13:55:46.234701+05:30 ronaldo attrd[25988]:   notice:
>>> attrd_local_callback: Sending full refresh (origin=crmd)
>>> 2015-06-01T13:55:46.234708+05:30 ronaldo attrd[25988]:   notice:
>>> attrd_trigger_update: Sending flush op to all hosts for: shutdown (0)
>>> ************************ This looks to be the likely
>>> reason*******************************************
>>> 2015-06-01T13:55:46.254310+05:30 ronaldo crmd[25990]:    error:
>>> handle_request: We didn't ask to be shut down, yet our DC is telling us
>> too
>>> .
>>>
>> *********************************************************************************************************
>>
>> Hi Arjun,
>>
>> I'd check the other node's logs at this time, to see why it requested
>> the shutdown.
>>
>>> 2015-06-01T13:55:46.254577+05:30 ronaldo crmd[25990]:   notice:
>>> do_state_transition: State transition S_NOT_DC -> S_STOPPING [
>> input=I_STOP
>>> cause=C_HA_MESSAGE
>>>  origin=route_message ]
>>> 2015-06-01T13:55:46.255134+05:30 ronaldo crmd[25990]:   notice:
>>> lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown...
>>> waiting (2 ops remaining)
>>>
>>> Based on the logs , pacemaker on active was stopping the secondary cloud
>>> everytime it joins cluster. This issue seems similar to
>>>
>> http://pacemaker.oss.clusterlabs.narkive.com/rVvN8May/node-sends-shutdown-request-to-other-node-error
>>>
>>> Packages used :-
>>> pacemaker-1.1.12-4.el6.x86_64
>>> pacemaker-libs-1.1.12-4.el6.x86_64
>>> pacemaker-cli-1.1.12-4.el6.x86_64
>>> pacemaker-cluster-libs-1.1.12-4.el6.x86_64
>>> pacemaker-debuginfo-1.1.10-14.el6.x86_64
>>> pcsc-lite-libs-1.5.2-13.el6_4.x86_64
>>> pcs-0.9.90-2.el6.centos.2.noarch
>>> pcsc-lite-1.5.2-13.el6_4.x86_64
>>> pcsc-lite-openct-0.6.19-4.el6.x86_64
>>> corosync-1.4.1-17.el6.x86_64
>>> corosynclib-1.4.1-17.el6.x86_64