[ClusterLabs] Corosync - both nodes stay online

Jan Friesse jfriesse at redhat.com
Tue Jan 17 08:27:53 UTC 2017


> Hi all,
>
> I have a two node cluster with the following details:
> - Ubuntu 10.04.4 LTS (I know it's old)
> - corosync 1.2.0

Isn't this a typo? I mean, 1.2.0 is ... ancient and full of already-fixed
bugs.

> - pacemaker 1.0.8+hg15494-2ubuntu2
>
> Following configuration is applied to corosync:
>
> totem {
>          version: 2
>          token: 3000
>          token_retransmits_before_loss_const: 10
>          join: 60
>          consensus: 5000
>          vsftype: none
>          max_messages: 20
>          clear_node_high_bit: yes
>          secauth: off
>          threads: 0
>          rrp_mode: none
>          cluster_name: firewall-ha
>
>          interface {
>                  ringnumber: 0
>                  bindnetaddr: 192.168.211.1
>                  broadcast: yes
>                  mcastport: 5405
>                  ttl: 1
>          }
>
>          transport: udpu
> }
>
> nodelist {
>          node {
>                  ring0_addr: 192.168.211.1
>                  name: net1
>                  nodeid: 1
>          }
>          node {
>                  ring0_addr: 192.168.211.2
>                  name: net2
>                  nodeid: 2
>          }
> }
>
> quorum {
>          provider: corosync_votequorum
>          two_node: 1
> }
>
> amf {
>          mode: disabled
> }
>
> service {
>          ver:       0
>          name:      pacemaker
> }
>
> aisexec {
>          user:   root
>          group:  root
> }
>
> logging {
>          fileline: off
>          to_stderr: yes
>          to_logfile: yes
>          to_syslog: yes
>          logfile: /var/log/corosync/corosync.log
>          syslog_facility: daemon
>          debug: off
>          timestamp: on
>          logger_subsys {
>                  subsys: AMF
>                  debug: off
>                  tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>          }
> }
>

Actually, the config file most likely doesn't work as you expect. For
example, nodelist is a 2.x concept and is unsupported by 1.x. The same
applies to corosync_votequorum. The udpu transport is not implemented in
1.2.0 either (it was added in 1.3.0).
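
For what it's worth, the 2.x-only directives already in the posted config
are the right shape for a 2.x series. A rough sketch of the relevant parts
(addresses, names and nodeids taken verbatim from the post; this is not a
verified config):

```
totem {
        version: 2
        cluster_name: firewall-ha
        transport: udpu
}

nodelist {
        node {
                ring0_addr: 192.168.211.1
                name: net1
                nodeid: 1
        }
        node {
                ring0_addr: 192.168.211.2
                name: net2
                nodeid: 2
        }
}

quorum {
        provider: corosync_votequorum
        two_node: 1
}
```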

I would recommend using a backports repo and upgrading.
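
Until then, the version mismatch described above can be caught mechanically.
A minimal sketch of a hypothetical helper (not part of corosync; the
directive list is an assumption based on this thread, not exhaustive) that
flags config lines a 1.2.0 parser will not understand:

```python
# Directives from this thread that a corosync 1.2.0 stack does not support.
TWO_X_ONLY = {
    "nodelist": "nodelist {} blocks are a 2.x concept, unsupported by 1.x",
    "provider: corosync_votequorum": "corosync_votequorum is 2.x-only",
    "transport: udpu": "udpu transport only exists from 1.3.0 onward",
}

def find_unsupported(conf_text):
    """Return (line_no, warning) pairs for directives 1.2.0 will not parse."""
    hits = []
    for lineno, line in enumerate(conf_text.splitlines(), 1):
        norm = " ".join(line.split())  # collapse indentation and stray spaces
        for needle, why in TWO_X_ONLY.items():
            if norm.startswith(needle):
                hits.append((lineno, why))
    return hits

if __name__ == "__main__":
    sample = """\
totem {
        transport: udpu
}
nodelist {
        node { nodeid: 1 }
}
quorum {
        provider: corosync_votequorum
        two_node: 1
}
"""
    for lineno, why in find_unsupported(sample):
        print(f"line {lineno}: {why}")
```

Run against the posted config, this would flag the nodelist block, the
votequorum provider, and the udpu transport, matching the issues noted above.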

Regards,
   Honza

> Here is the output of crm status after starting corosync on both nodes:
> ============
> Last updated: Mon Jan 16 21:24:18 2017
> Stack: openais
> Current DC: net1 - partition with quorum
> Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
>
> Online: [ net1 net2 ]
>
> Now if I kill net2 with:
> killall -9 corosync
>
> The primary host doesn't "see" anything; the cluster still appears to be online on net1:
> ============
> Last updated: Mon Jan 16 21:25:25 2017
> Stack: openais
> Current DC: net1 - partition with quorum
> Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
>
> Online: [ net1 net2 ]
>
> I just see this part in the logs:
> Jan 16 21:35:21 corosync [TOTEM ] A processor failed, forming new configuration.
>
> And then, when I start corosync on net2, cluster stays offline:
> ============
> Last updated: Mon Jan 16 21:38:13 2017
> Stack: openais
> Current DC: NONE
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
>
> OFFLINE: [ net1 net2 ]
>
> I have to kill corosync on both nodes, and start it on both nodes together to get back online.
>
>
> When the two nodes are up, I can see traffic with tcpdump:
> 21:41:49.653780 IP 192.168.211.1.5404 > 255.255.255.255.5405: UDP, length 82
> 21:41:49.678846 IP 192.168.211.1.5404 > 192.168.211.2.5405: UDP, length 70
> 21:41:49.680339 IP 192.168.211.2.5404 > 192.168.211.1.5405: UDP, length 70
> 21:41:49.889424 IP 192.168.211.1.5404 > 255.255.255.255.5405: UDP, length 82
> 21:41:49.910492 IP 192.168.211.1.5404 > 192.168.211.2.5405: UDP, length 70
> 21:41:49.911990 IP 192.168.211.2.5404 > 192.168.211.1.5405: UDP, length 70
>
> Here is the output of the state of the ring on net1:
> corosync-cfgtool -s
> Printing ring status.
> Local node ID 30648512
> RING ID 0
> 	id	= 192.168.211.1
> 	status	= ring 0 active with no faults
>
> And net2:
> Printing ring status.
> Local node ID 47425728
> RING ID 0
> 	id	= 192.168.211.2
> 	status	= ring 0 active with no faults
>
> Here is the log on net1 when I start the cluster on both nodes:
> Jan 16 21:41:52 net1 crmd: [15288]: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped!
> Jan 16 21:41:52 net1 crmd: [15288]: WARN: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Jan 16 21:41:52 net1 crmd: [15288]: info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Jan 16 21:41:52 net1 crmd: [15288]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
> Jan 16 21:41:52 net1 crmd: [15288]: info: do_te_control: Registering TE UUID: 53d7e000-3468-4548-b9f9-5bdb9ac9bfc7
> Jan 16 21:41:52 net1 crmd: [15288]: WARN: cib_client_add_notify_callback: Callback already present
> Jan 16 21:41:52 net1 crmd: [15288]: info: set_graph_functions: Setting custom graph functions
> Jan 16 21:41:52 net1 crmd: [15288]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
> Jan 16 21:41:52 net1 crmd: [15288]: info: do_dc_takeover: Taking over DC status for this partition
> Jan 16 21:41:52 net1 cib: [15284]: info: cib_process_readwrite: We are now in R/W mode
> Jan 16 21:41:52 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/6, version=0.51.0): ok (rc=0)
> Jan 16 21:41:52 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/7, version=0.51.0): ok (rc=0)
> Jan 16 21:41:53 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/9, version=0.51.0): ok (rc=0)
> Jan 16 21:41:53 net1 crmd: [15288]: info: join_make_offer: Making join offers based on membership 36
> Jan 16 21:41:53 net1 crmd: [15288]: info: do_dc_join_offer_all: join-1: Waiting on 2 outstanding join acks
> Jan 16 21:41:53 net1 crmd: [15288]: info: ais_dispatch: Membership 36: quorum retained
> Jan 16 21:41:53 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/11, version=0.51.0): ok (rc=0)
> Jan 16 21:41:53 net1 crmd: [15288]: info: crm_ais_dispatch: Setting expected votes to 2
> Jan 16 21:41:53 net1 crmd: [15288]: info: config_query_callback: Checking for expired actions every 900000ms
> Jan 16 21:41:53 net1 crmd: [15288]: info: config_query_callback: Sending expected-votes=2 to corosync
> Jan 16 21:41:53 net1 crmd: [15288]: info: update_dc: Set DC to net1 (3.0.1)
> Jan 16 21:41:53 net1 crmd: [15288]: info: ais_dispatch: Membership 36: quorum retained
> Jan 16 21:41:53 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/14, version=0.51.0): ok (rc=0)
> Jan 16 21:41:53 net1 crmd: [15288]: info: crm_ais_dispatch: Setting expected votes to 2
> Jan 16 21:41:53 net1 crmd: [15288]: info: te_connect_stonith: Attempting connection to fencing daemon...
> Jan 16 21:41:53 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/17, version=0.51.0): ok (rc=0)
> Jan 16 21:41:54 net1 crmd: [15288]: info: te_connect_stonith: Connected
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: All 2 cluster nodes responded to the join offer.
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_dc_join_finalize: join-1: Syncing the CIB from net1 to the rest of the cluster
> Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/18, version=0.51.0): ok (rc=0)
> Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/19, version=0.51.0): ok (rc=0)
> Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/20, version=0.51.0): ok (rc=0)
> Jan 16 21:41:54 net1 crmd: [15288]: info: update_attrd: Connecting to attrd...
> Jan 16 21:41:54 net1 attrd: [15286]: info: find_hash_entry: Creating hash entry for terminate
> Jan 16 21:41:54 net1 attrd: [15286]: info: find_hash_entry: Creating hash entry for shutdown
> Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='net1']/transient_attributes (origin=local/crmd/21, version=0.51.0): ok (rc=0)
> Jan 16 21:41:54 net1 crmd: [15288]: info: erase_xpath_callback: Deletion of "//node_state[@uname='net1']/transient_attributes": ok (rc=0)
> Jan 16 21:41:54 net1 attrd: [15286]: info: crm_new_peer: Node net2 now has id: 47425728
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_dc_join_ack: join-1: Updating node state to member for net2
> Jan 16 21:41:54 net1 attrd: [15286]: info: crm_new_peer: Node 47425728 is now known as net2
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_dc_join_ack: join-1: Updating node state to member for net1
> Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='net2']/transient_attributes (origin=net2/crmd/7, version=0.51.0): ok (rc=0)
> Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='net2']/lrm (origin=local/crmd/22, version=0.51.0): ok (rc=0)
> Jan 16 21:41:54 net1 crmd: [15288]: info: erase_xpath_callback: Deletion of "//node_state[@uname='net2']/lrm": ok (rc=0)
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
> Jan 16 21:41:54 net1 crmd: [15288]: info: crm_update_quorum: Updating quorum status to true (call=28)
> Jan 16 21:41:54 net1 crmd: [15288]: info: abort_transition_graph: do_te_invoke:191 - Triggered transition abort (complete=1) : Peer Cancelled
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke: Query 29: Requesting the current CIB: S_POLICY_ENGINE
> Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='net1']/lrm (origin=local/crmd/24, version=0.51.1): ok (rc=0)
> Jan 16 21:41:54 net1 crmd: [15288]: info: erase_xpath_callback: Deletion of "//node_state[@uname='net1']/lrm": ok (rc=0)
> Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/26, version=0.51.2): ok (rc=0)
> Jan 16 21:41:54 net1 cib: [15284]: info: log_data_element: cib:diff: - <cib admin_epoch="0" epoch="51" num_updates="2" />
> Jan 16 21:41:54 net1 cib: [15284]: info: log_data_element: cib:diff: + <cib dc-uuid="net1" admin_epoch="0" epoch="52" num_updates="1" />
> Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/28, version=0.52.1): ok (rc=0)
> Jan 16 21:41:54 net1 crmd: [15288]: info: abort_transition_graph: need_abort:59 - Triggered transition abort (complete=1) : Non-status change
> Jan 16 21:41:54 net1 crmd: [15288]: info: need_abort: Aborting on change to admin_epoch
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke: Query 30: Requesting the current CIB: S_POLICY_ENGINE
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke_callback: Invoking the PE: query=30, ref=pe_calc-dc-1484599314-11, seq=36, quorate=1
> Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_local_callback: Sending full refresh (origin=crmd)
> Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
> Jan 16 21:41:54 net1 pengine: [15287]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
> Jan 16 21:41:54 net1 pengine: [15287]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net2 is online
> Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net1 is online
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Jan 16 21:41:54 net1 crmd: [15288]: info: unpack_graph: Unpacked transition 0: 2 actions in 2 synapses
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_te_invoke: Processing graph 0 (ref=pe_calc-dc-1484599314-11) derived from /var/lib/pengine/pe-input-713.bz2
> Jan 16 21:41:54 net1 crmd: [15288]: info: te_rsc_command: Initiating action 2: probe_complete probe_complete on net1 (local) - no waiting
> Jan 16 21:41:54 net1 crmd: [15288]: info: te_rsc_command: Initiating action 3: probe_complete probe_complete on net2 - no waiting
> Jan 16 21:41:54 net1 crmd: [15288]: info: run_graph: ====================================================
> Jan 16 21:41:54 net1 crmd: [15288]: notice: run_graph: Transition 0 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-713.bz2): Complete
> Jan 16 21:41:54 net1 crmd: [15288]: info: te_graph_trigger: Transition 0 is now complete
> Jan 16 21:41:54 net1 crmd: [15288]: info: notify_crmd: Transition 0 status: done - <null>
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: Starting PEngine Recheck Timer
> Jan 16 21:41:54 net1 attrd: [15286]: info: find_hash_entry: Creating hash entry for probe_complete
> Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_perform_update: Sent update 10: probe_complete=true
> Jan 16 21:41:54 net1 crmd: [15288]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=1, tag=transient_attributes, id=net1, magic=NA, cib=0.52.2) : Transient attribute: update
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke: Query 31: Requesting the current CIB: S_POLICY_ENGINE
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke_callback: Invoking the PE: query=31, ref=pe_calc-dc-1484599314-14, seq=36, quorate=1
> Jan 16 21:41:54 net1 crmd: [15288]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=1, tag=transient_attributes, id=net2, magic=NA, cib=0.52.3) : Transient attribute: update
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke: Query 32: Requesting the current CIB: S_POLICY_ENGINE
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke_callback: Invoking the PE: query=32, ref=pe_calc-dc-1484599314-15, seq=36, quorate=1
> Jan 16 21:41:54 net1 pengine: [15287]: info: process_pe_message: Transition 0: PEngine Input stored in: /var/lib/pengine/pe-input-713.bz2
> Jan 16 21:41:54 net1 pengine: [15287]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Jan 16 21:41:54 net1 pengine: [15287]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net2 is online
> Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net1 is online
> Jan 16 21:41:54 net1 crmd: [15288]: info: handle_response: pe_calc calculation pe_calc-dc-1484599314-14 is obsolete
> Jan 16 21:41:54 net1 cib: [15311]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-52.raw
> Jan 16 21:41:54 net1 pengine: [15287]: info: process_pe_message: Transition 1: PEngine Input stored in: /var/lib/pengine/pe-input-714.bz2
> Jan 16 21:41:54 net1 pengine: [15287]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Jan 16 21:41:54 net1 pengine: [15287]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net2 is online
> Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net1 is online
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Jan 16 21:41:54 net1 crmd: [15288]: info: unpack_graph: Unpacked transition 2: 0 actions in 0 synapses
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_te_invoke: Processing graph 2 (ref=pe_calc-dc-1484599314-15) derived from /var/lib/pengine/pe-input-715.bz2
> Jan 16 21:41:54 net1 crmd: [15288]: info: run_graph: ====================================================
> Jan 16 21:41:54 net1 crmd: [15288]: notice: run_graph: Transition 2 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-715.bz2): Complete
> Jan 16 21:41:54 net1 crmd: [15288]: info: te_graph_trigger: Transition 2 is now complete
> Jan 16 21:41:54 net1 crmd: [15288]: info: notify_crmd: Transition 2 status: done - <null>
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: Starting PEngine Recheck Timer
> Jan 16 21:41:54 net1 cib: [15311]: info: write_cib_contents: Wrote version 0.52.0 of the CIB to disk (digest: 44f7626e0420b36260b8c67e9e576a7e)
> Jan 16 21:41:54 net1 pengine: [15287]: info: process_pe_message: Transition 2: PEngine Input stored in: /var/lib/pengine/pe-input-715.bz2
> Jan 16 21:41:54 net1 cib: [15311]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.ZpwuYr (digest: /var/lib/heartbeat/crm/cib.zVG2jb)
> Jan 16 21:50:50 net1 cib: [15284]: info: cib_stats: Processed 48 operations (2708.00us average, 0% utilization) in the last 10min
>
> Any help is appreciated!
>
> Thanks
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
