[ClusterLabs] Corosync - both nodes stay online

Matthieu fsm.matthieu at gmail.com
Mon Jan 16 16:10:44 EST 2017


Hi all,

I have a two node cluster with the following details:
- Ubuntu 10.04.4 LTS (I know its old…)
- corosync 1.2.0
- pacemaker 1.0.8+hg15494-2ubuntu2

Following configuration is applied to corosync:

totem {
        version: 2
        token: 3000
        token_retransmits_before_loss_const: 10
        join: 60
        consensus: 5000
        vsftype: none
        max_messages: 20
        clear_node_high_bit: yes
        secauth: off
        threads: 0
        rrp_mode: none
        cluster_name: firewall-ha

        interface {
                ringnumber: 0
                bindnetaddr: 192.168.211.1
                broadcast : yes
                mcastport: 5405
                ttl : 1
        }

        transport: udpu
}

nodelist {
        node {
                ring0_addr: 192.168.211.1
                name: net1
                nodeid: 1
        }
        node {
                ring0_addr: 192.168.211.2
                name: net2
                nodeid: 2
        }
}

quorum {
        provider: corosync_votequorum
        two_node: 1
}

amf {
        mode: disabled
}

service {
        ver:       0
        name:      pacemaker
}

aisexec {
        user:   root
        group:  root
}

logging {
        fileline: off
        to_stderr: yes
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/corosync/corosync.log
		syslog_facility: daemon
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
                tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }
}

Here is an output of crm status after starting coro sync on both nodes:
============
Last updated: Mon Jan 16 21:24:18 2017
Stack: openais
Current DC: net1 - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ net1 net2 ]

Now if I kill net2 with:
killall -9 corosync

The primary host don’t « see » anything, the cluster still appear to be online on net1:
============
Last updated: Mon Jan 16 21:25:25 2017
Stack: openais
Current DC: net1 - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ net1 net2 ]

I just see this part in the logs:
Jan 16 21:35:21 corosync [TOTEM ] A processor failed, forming new configuration.

And then, when I start corosync on net2, cluster stays offline:
============
Last updated: Mon Jan 16 21:38:13 2017
Stack: openais
Current DC: NONE
2 Nodes configured, 2 expected votes
0 Resources configured.
============

OFFLINE: [ net1 net2 ]

I have to kill corosync on both nodes, and start on both node together to get back online.


When the two nodes are up, I can see trafic with tcpdump:
21:41:49.653780 IP 192.168.211.1.5404 > 255.255.255.255.5405: UDP, length 82
21:41:49.678846 IP 192.168.211.1.5404 > 192.168.211.2.5405: UDP, length 70
21:41:49.680339 IP 192.168.211.2.5404 > 192.168.211.1.5405: UDP, length 70
21:41:49.889424 IP 192.168.211.1.5404 > 255.255.255.255.5405: UDP, length 82
21:41:49.910492 IP 192.168.211.1.5404 > 192.168.211.2.5405: UDP, length 70
21:41:49.911990 IP 192.168.211.2.5404 > 192.168.211.1.5405: UDP, length 70

Here is the output of the state of the ring on net1:
corosync-cfgtool -s
Printing ring status.
Local node ID 30648512
RING ID 0
	id	= 192.168.211.1
	status	= ring 0 active with no faults

And net2:
Printing ring status.
Local node ID 47425728
RING ID 0
	id	= 192.168.211.2
	status	= ring 0 active with no faults

Here is the log on net1 when I start the cluster on both nodes:
Jan 16 21:41:52 net1 crmd: [15288]: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped!
Jan 16 21:41:52 net1 crmd: [15288]: WARN: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Jan 16 21:41:52 net1 crmd: [15288]: info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
Jan 16 21:41:52 net1 crmd: [15288]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Jan 16 21:41:52 net1 crmd: [15288]: info: do_te_control: Registering TE UUID: 53d7e000-3468-4548-b9f9-5bdb9ac9bfc7
Jan 16 21:41:52 net1 crmd: [15288]: WARN: cib_client_add_notify_callback: Callback already present
Jan 16 21:41:52 net1 crmd: [15288]: info: set_graph_functions: Setting custom graph functions
Jan 16 21:41:52 net1 crmd: [15288]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
Jan 16 21:41:52 net1 crmd: [15288]: info: do_dc_takeover: Taking over DC status for this partition
Jan 16 21:41:52 net1 cib: [15284]: info: cib_process_readwrite: We are now in R/W mode
Jan 16 21:41:52 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/6, version=0.51.0): ok (rc=0)
Jan 16 21:41:52 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/7, version=0.51.0): ok (rc=0)
Jan 16 21:41:53 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/9, version=0.51.0): ok (rc=0)
Jan 16 21:41:53 net1 crmd: [15288]: info: join_make_offer: Making join offers based on membership 36
Jan 16 21:41:53 net1 crmd: [15288]: info: do_dc_join_offer_all: join-1: Waiting on 2 outstanding join acks
Jan 16 21:41:53 net1 crmd: [15288]: info: ais_dispatch: Membership 36: quorum retained
Jan 16 21:41:53 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/11, version=0.51.0): ok (rc=0)
Jan 16 21:41:53 net1 crmd: [15288]: info: crm_ais_dispatch: Setting expected votes to 2
Jan 16 21:41:53 net1 crmd: [15288]: info: config_query_callback: Checking for expired actions every 900000ms
Jan 16 21:41:53 net1 crmd: [15288]: info: config_query_callback: Sending expected-votes=2 to corosync
Jan 16 21:41:53 net1 crmd: [15288]: info: update_dc: Set DC to net1 (3.0.1)
Jan 16 21:41:53 net1 crmd: [15288]: info: ais_dispatch: Membership 36: quorum retained
Jan 16 21:41:53 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/14, version=0.51.0): ok (rc=0)
Jan 16 21:41:53 net1 crmd: [15288]: info: crm_ais_dispatch: Setting expected votes to 2
Jan 16 21:41:53 net1 crmd: [15288]: info: te_connect_stonith: Attempting connection to fencing daemon...
Jan 16 21:41:53 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/17, version=0.51.0): ok (rc=0)
Jan 16 21:41:54 net1 crmd: [15288]: info: te_connect_stonith: Connected
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: All 2 cluster nodes responded to the join offer.
Jan 16 21:41:54 net1 crmd: [15288]: info: do_dc_join_finalize: join-1: Syncing the CIB from net1 to the rest of the cluster
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/18, version=0.51.0): ok (rc=0)
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/19, version=0.51.0): ok (rc=0)
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/20, version=0.51.0): ok (rc=0)
Jan 16 21:41:54 net1 crmd: [15288]: info: update_attrd: Connecting to attrd...
Jan 16 21:41:54 net1 attrd: [15286]: info: find_hash_entry: Creating hash entry for terminate
Jan 16 21:41:54 net1 attrd: [15286]: info: find_hash_entry: Creating hash entry for shutdown
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='net1']/transient_attributes (origin=local/crmd/21, version=0.51.0): ok (rc=0)
Jan 16 21:41:54 net1 crmd: [15288]: info: erase_xpath_callback: Deletion of "//node_state[@uname='net1']/transient_attributes": ok (rc=0)
Jan 16 21:41:54 net1 attrd: [15286]: info: crm_new_peer: Node net2 now has id: 47425728
Jan 16 21:41:54 net1 crmd: [15288]: info: do_dc_join_ack: join-1: Updating node state to member for net2
Jan 16 21:41:54 net1 attrd: [15286]: info: crm_new_peer: Node 47425728 is now known as net2
Jan 16 21:41:54 net1 crmd: [15288]: info: do_dc_join_ack: join-1: Updating node state to member for net1
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='net2']/transient_attributes (origin=net2/crmd/7, version=0.51.0): ok (rc=0)
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='net2']/lrm (origin=local/crmd/22, version=0.51.0): ok (rc=0)
Jan 16 21:41:54 net1 crmd: [15288]: info: erase_xpath_callback: Deletion of "//node_state[@uname='net2']/lrm": ok (rc=0)
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Jan 16 21:41:54 net1 crmd: [15288]: info: do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
Jan 16 21:41:54 net1 crmd: [15288]: info: crm_update_quorum: Updating quorum status to true (call=28)
Jan 16 21:41:54 net1 crmd: [15288]: info: abort_transition_graph: do_te_invoke:191 - Triggered transition abort (complete=1) : Peer Cancelled
Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke: Query 29: Requesting the current CIB: S_POLICY_ENGINE
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='net1']/lrm (origin=local/crmd/24, version=0.51.1): ok (rc=0)
Jan 16 21:41:54 net1 crmd: [15288]: info: erase_xpath_callback: Deletion of "//node_state[@uname='net1']/lrm": ok (rc=0)
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/26, version=0.51.2): ok (rc=0)
Jan 16 21:41:54 net1 cib: [15284]: info: log_data_element: cib:diff: - <cib admin_epoch="0" epoch="51" num_updates="2" />
Jan 16 21:41:54 net1 cib: [15284]: info: log_data_element: cib:diff: + <cib dc-uuid="net1" admin_epoch="0" epoch="52" num_updates="1" />
Jan 16 21:41:54 net1 cib: [15284]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/28, version=0.52.1): ok (rc=0)
Jan 16 21:41:54 net1 crmd: [15288]: info: abort_transition_graph: need_abort:59 - Triggered transition abort (complete=1) : Non-status change
Jan 16 21:41:54 net1 crmd: [15288]: info: need_abort: Aborting on change to admin_epoch
Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke: Query 30: Requesting the current CIB: S_POLICY_ENGINE
Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke_callback: Invoking the PE: query=30, ref=pe_calc-dc-1484599314-11, seq=36, quorate=1
Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_local_callback: Sending full refresh (origin=crmd)
Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
Jan 16 21:41:54 net1 pengine: [15287]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
Jan 16 21:41:54 net1 pengine: [15287]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net2 is online
Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net1 is online
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jan 16 21:41:54 net1 crmd: [15288]: info: unpack_graph: Unpacked transition 0: 2 actions in 2 synapses
Jan 16 21:41:54 net1 crmd: [15288]: info: do_te_invoke: Processing graph 0 (ref=pe_calc-dc-1484599314-11) derived from /var/lib/pengine/pe-input-713.bz2
Jan 16 21:41:54 net1 crmd: [15288]: info: te_rsc_command: Initiating action 2: probe_complete probe_complete on net1 (local) - no waiting
Jan 16 21:41:54 net1 crmd: [15288]: info: te_rsc_command: Initiating action 3: probe_complete probe_complete on net2 - no waiting
Jan 16 21:41:54 net1 crmd: [15288]: info: run_graph: ====================================================
Jan 16 21:41:54 net1 crmd: [15288]: notice: run_graph: Transition 0 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-713.bz2): Complete
Jan 16 21:41:54 net1 crmd: [15288]: info: te_graph_trigger: Transition 0 is now complete
Jan 16 21:41:54 net1 crmd: [15288]: info: notify_crmd: Transition 0 status: done - <null>
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: Starting PEngine Recheck Timer
Jan 16 21:41:54 net1 attrd: [15286]: info: find_hash_entry: Creating hash entry for probe_complete
Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jan 16 21:41:54 net1 attrd: [15286]: info: attrd_perform_update: Sent update 10: probe_complete=true
Jan 16 21:41:54 net1 crmd: [15288]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=1, tag=transient_attributes, id=net1, magic=NA, cib=0.52.2) : Transient attribute: update
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke: Query 31: Requesting the current CIB: S_POLICY_ENGINE
Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke_callback: Invoking the PE: query=31, ref=pe_calc-dc-1484599314-14, seq=36, quorate=1
Jan 16 21:41:54 net1 crmd: [15288]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=1, tag=transient_attributes, id=net2, magic=NA, cib=0.52.3) : Transient attribute: update
Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke: Query 32: Requesting the current CIB: S_POLICY_ENGINE
Jan 16 21:41:54 net1 crmd: [15288]: info: do_pe_invoke_callback: Invoking the PE: query=32, ref=pe_calc-dc-1484599314-15, seq=36, quorate=1
Jan 16 21:41:54 net1 pengine: [15287]: info: process_pe_message: Transition 0: PEngine Input stored in: /var/lib/pengine/pe-input-713.bz2
Jan 16 21:41:54 net1 pengine: [15287]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 16 21:41:54 net1 pengine: [15287]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net2 is online
Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net1 is online
Jan 16 21:41:54 net1 crmd: [15288]: info: handle_response: pe_calc calculation pe_calc-dc-1484599314-14 is obsolete
Jan 16 21:41:54 net1 cib: [15311]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-52.raw
Jan 16 21:41:54 net1 pengine: [15287]: info: process_pe_message: Transition 1: PEngine Input stored in: /var/lib/pengine/pe-input-714.bz2
Jan 16 21:41:54 net1 pengine: [15287]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jan 16 21:41:54 net1 pengine: [15287]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net2 is online
Jan 16 21:41:54 net1 pengine: [15287]: info: determine_online_status: Node net1 is online
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jan 16 21:41:54 net1 crmd: [15288]: info: unpack_graph: Unpacked transition 2: 0 actions in 0 synapses
Jan 16 21:41:54 net1 crmd: [15288]: info: do_te_invoke: Processing graph 2 (ref=pe_calc-dc-1484599314-15) derived from /var/lib/pengine/pe-input-715.bz2
Jan 16 21:41:54 net1 crmd: [15288]: info: run_graph: ====================================================
Jan 16 21:41:54 net1 crmd: [15288]: notice: run_graph: Transition 2 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-715.bz2): Complete
Jan 16 21:41:54 net1 crmd: [15288]: info: te_graph_trigger: Transition 2 is now complete
Jan 16 21:41:54 net1 crmd: [15288]: info: notify_crmd: Transition 2 status: done - <null>
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Jan 16 21:41:54 net1 crmd: [15288]: info: do_state_transition: Starting PEngine Recheck Timer
Jan 16 21:41:54 net1 cib: [15311]: info: write_cib_contents: Wrote version 0.52.0 of the CIB to disk (digest: 44f7626e0420b36260b8c67e9e576a7e)
Jan 16 21:41:54 net1 pengine: [15287]: info: process_pe_message: Transition 2: PEngine Input stored in: /var/lib/pengine/pe-input-715.bz2
Jan 16 21:41:54 net1 cib: [15311]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.ZpwuYr (digest: /var/lib/heartbeat/crm/cib.zVG2jb)
Jan 16 21:50:50 net1 cib: [15284]: info: cib_stats: Processed 48 operations (2708.00us average, 0% utilization) in the last 10min

Any help is appreciated !

Thanks




-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20170116/89bb5ac8/attachment-0002.html>


More information about the Users mailing list