[Pacemaker] crmd - set dc unset dc loop

Andrew Beekhof andrew at beekhof.net
Thu Feb 23 18:45:57 EST 2012


On Thu, Feb 23, 2012 at 10:37 PM, Philippe Carbonnier
<Philippe.Carbonnier at vif.fr> wrote:
> Hello Andreas and everybody,
>
> Even with the new timeout : totem token:  5000, sometimes, when jboss1 goes
> standby, I've this errors : cib: [12758]: ERROR: send_ais_text: Sending
> message 194: FAILED (rc=2): Library error: Connection timed out (110)  and
> after cib: [12758]: ERROR: send_ais_message: Not connected to AIS

Any openais logs from about that time?

>
>
> Feb 22 14:13:45 ujboss1 crmd: [12762]: info: run_graph:
> ====================================================
> Feb 22 14:13:45 ujboss1 crmd: [12762]: notice: run_graph: Transition 3285
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pengine/pe-input-11116.bz2): Complete
> Feb 22 14:13:45 ujboss1 crmd: [12762]: info: te_graph_trigger: Transition
> 3285 is now complete
> Feb 22 14:13:45 ujboss1 crmd: [12762]: info: notify_crmd: Transition 3285
> status: done - <null>
> Feb 22 14:13:45 ujboss1 crmd: [12762]: info: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Feb 22 14:13:45 ujboss1 crmd: [12762]: info: do_state_transition: Starting
> PEngine Recheck Timer
> Feb 22 14:13:45 ujboss1 pengine: [12761]: info: process_pe_message:
> Transition 3285: PEngine Input stored in:
> /var/lib/pengine/pe-input-11116.bz2
> Feb 22 14:17:50 ujboss1 cib: [12758]: info: cib_stats: Processed 1
> operations (0.00us average, 0% utilization) in the last 10min
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: -
> <cib admin_epoch="0" epoch="291" num_updates="1" >
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: -
> <configuration >
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff:
> -     <nodes >
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff:
> -       <node id="ujboss1" >
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff:
> -         <instance_attributes id="nodes-ujboss1" >
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff:
> -           <nvpair value="off" id="nodes-ujboss1-standby" />
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff:
> -         </instance_attributes>
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff:
> -       </node>
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff:
> -     </nodes>
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: -
> </configuration>
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: -
> </cib>
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: +
> <cib admin_epoch="0" epoch="292" num_updates="1" >
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: +
> <configuration >
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff:
> +     <nodes >
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff:
> +       <node id="ujboss1" >
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff:
> +         <instance_attributes id="nodes-ujboss1" >
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff:
> +           <nvpair value="on" id="nodes-ujboss1-standby" />
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff:
> +         </instance_attributes>
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff:
> +       </node>
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff:
> +     </nodes>
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: +
> </configuration>
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: log_data_element: cib:diff: +
> </cib>
> Feb 22 14:17:56 ujboss1 cib: [12758]: info: cib_process_request: Operation
> complete: op cib_modify for section nodes (origin=local/crm_attribute/4,
> version=0.292.1): ok (rc=0)
> Feb 22 14:17:56 ujboss1 crmd: [12762]: info: abort_transition_graph:
> need_abort:59 - Triggered transition abort (complete=1) : Non-status change
> Feb 22 14:17:56 ujboss1 crmd: [12762]: info: need_abort: Aborting on change
> to admin_epoch
> Feb 22 14:17:56 ujboss1 crmd: [12762]: info: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=abort_transition_graph ]
> Feb 22 14:17:56 ujboss1 crmd: [12762]: info: do_state_transition: All 2
> cluster nodes are eligible to run resources.
> Feb 22 14:17:56 ujboss1 crmd: [12762]: info: do_pe_invoke: Query 3407:
> Requesting the current CIB: S_POLICY_ENGINE
> Feb 22 14:17:58 ujboss1 cib: [12758]: ERROR: send_ais_text: Sending message
> 194: FAILED (rc=2): Library error: Connection timed out (110)
> Feb 22 14:17:59 ujboss1 crmd: [12762]: info: do_pe_invoke_callback: Invoking
> the PE: query=3407, ref=pe_calc-dc-1329916679-3431, seq=604, quorate=1
> Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Feb 22 14:17:59 ujboss1 pengine: [12761]: info: unpack_config: Node scores:
> 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Feb 22 14:17:59 ujboss1 pengine: [12761]: info: determine_online_status:
> Node ujboss2 is online
> Feb 22 14:17:59 ujboss1 pengine: [12761]: info: unpack_status: Node ujboss1
> is in standby-mode
> Feb 22 14:17:59 ujboss1 pengine: [12761]: info: determine_online_status:
> Node ujboss1 is standby
> Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: group_print:  Resource
> Group: vifGroup
> Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: native_print:
> clusterIP   (ocf::heartbeat:IPaddr2):       Started ujboss1
> Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: native_print:
> routing-jboss       (lsb:routing-jboss):    Started ujboss1
> Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: RecurringOp:  Start
> recurring monitor (30s) for clusterIP on ujboss2
> Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: RecurringOp:  Start
> recurring monitor (30s) for routing-jboss on ujboss2
> Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: LogActions: Move resource
> clusterIP    (Started ujboss1 -> ujboss2)
> Feb 22 14:17:59 ujboss1 pengine: [12761]: notice: LogActions: Move resource
> routing-jboss        (Started ujboss1 -> ujboss2)
> Feb 22 14:18:00 ujboss1 cib: [10032]: info: write_cib_contents: Archived
> previous version as /var/lib/heartbeat/crm/cib-56.raw
> Feb 22 14:18:00 ujboss1 cib: [10032]: info: write_cib_contents: Wrote
> version 0.292.0 of the CIB to disk (digest:
> e978102f42a9b6512133008fe219e5a4)
> Feb 22 14:18:00 ujboss1 cib: [10032]: info: retrieveCib: Reading cluster
> configuration from: /var/lib/heartbeat/crm/cib.2Gj3jp (digest:
> /var/lib/heartbeat/crm/cib.ipauMZ)
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=handle_response ]
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: unpack_graph: Unpacked
> transition 3286: 11 actions in 11 synapses
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: do_te_invoke: Processing graph
> 3286 (ref=pe_calc-dc-1329916679-3431) derived from
> /var/lib/pengine/pe-input-11117.bz2
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: te_pseudo_action: Pseudo action
> 15 fired and confirmed
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: te_rsc_command: Initiating
> action 10: stop routing-jboss_stop_0 on ujboss1 (local)
> Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: cancel_op: operation
> monitor[67] on lsb::routing-jboss::routing-jboss for client 12762, its
> parameters: CRM_meta_interval=[30000] CRM_meta_timeout=[20000]
> crm_feature_set=[3.0.1] CRM_meta_name=[monitor]  cancelled
> Feb 22 14:18:00 ujboss1 pengine: [12761]: info: process_pe_message:
> Transition 3286: PEngine Input stored in:
> /var/lib/pengine/pe-input-11117.bz2
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: do_lrm_rsc_op: Performing
> key=10:3286:0:cd17d699-5ac6-4c00-bc70-f61bba0c140c op=routing-jboss_stop_0 )
> Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: rsc:routing-jboss:68: stop
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: process_lrm_event: LRM
> operation routing-jboss_monitor_30000 (call=67, status=1, cib-update=0,
> confirmed=true) Cancelled
> Feb 22 14:18:00 ujboss1 lrmd: [10035]: WARN: For LSB init script, no
> additional parameters are needed.
> Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output:
> (routing-jboss:stop:stdout) Disabling traffic redirection from 128.1.13.9 to
> 128.1.13.7
> Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output:
> (routing-jboss:stop:stdout) [
> Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output:
> (routing-jboss:stop:stdout)   OK
> Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output:
> (routing-jboss:stop:stdout) ]
> Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output:
> (routing-jboss:stop:stdout)
> Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output:
> (routing-jboss:stop:stdout)
>
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: process_lrm_event: LRM
> operation routing-jboss_stop_0 (call=68, rc=0, cib-update=3408,
> confirmed=true) ok
> Feb 22 14:18:00 ujboss1 cib: [12758]: ERROR: send_ais_message: Not connected
> to AIS
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: match_graph_event: Action
> routing-jboss_stop_0 (10) confirmed on ujboss1 (rc=0)
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: te_rsc_command: Initiating
> action 7: stop clusterIP_stop_0 on ujboss1 (local)
> Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: cancel_op: operation
> monitor[65] on ocf::IPaddr2::clusterIP for client 12762, its parameters:
> CRM_meta_interval=[30000] ip=[128.1.13.9] cidr_netmask=[32]
> CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor]
> iflabel=[jbossfailover]  cancelled
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: do_lrm_rsc_op: Performing
> key=7:3286:0:cd17d699-5ac6-4c00-bc70-f61bba0c140c op=clusterIP_stop_0 )
> Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: rsc:clusterIP:69: stop
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: process_lrm_event: LRM
> operation clusterIP_monitor_30000 (call=65, status=1, cib-update=0,
> confirmed=true) Cancelled
> Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output:
> (clusterIP:stop:stderr) logger: unknown facility name: none.
>
> Feb 22 14:18:00 ujboss1 lrmd: [12759]: info: RA output:
> (clusterIP:stop:stderr) logger: unknown facility name: none.
>
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: process_lrm_event: LRM
> operation clusterIP_stop_0 (call=69, rc=0, cib-update=3409, confirmed=true)
> ok
> Feb 22 14:18:00 ujboss1 cib: [12758]: ERROR: send_ais_message: Not connected
> to AIS
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: match_graph_event: Action
> clusterIP_stop_0 (7) confirmed on ujboss1 (rc=0)
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: te_pseudo_action: Pseudo action
> 16 fired and confirmed
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: te_pseudo_action: Pseudo action
> 3 fired and confirmed
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: te_pseudo_action: Pseudo action
> 13 fired and confirmed
> Feb 22 14:18:00 ujboss1 crmd: [12762]: info: te_rsc_command: Initiating
> action 8: start clusterIP_start_0 on ujboss2
> Feb 22 14:18:01 ujboss1 cib: [12758]: ERROR: send_ais_message: Not connected
> to AIS
> Feb 22 14:18:01 ujboss1 crmd: [12762]: info: match_graph_event: Action
> clusterIP_start_0 (8) confirmed on ujboss2 (rc=0)
> Feb 22 14:18:01 ujboss1 crmd: [12762]: info: te_rsc_command: Initiating
> action 9: monitor clusterIP_monitor_30000 on ujboss2
> Feb 22 14:18:01 ujboss1 crmd: [12762]: info: te_rsc_command: Initiating
> action 11: start routing-jboss_start_0 on ujboss2
> Feb 22 14:18:01 ujboss1 cib: [12758]: ERROR: send_ais_message: Not connected
> to AIS
>
>
>
>
> Le 19/01/2012 17:38, Philippe Carbonnier a écrit :
>
> Hello Andreas, all,
>
> many thanks for your help. I have change my configurations, and will take
> you informed.
> Best regards,
> Philippe
>
> Le 19/01/2012 14:38, Andreas Kurz a écrit :
>
> Hello,
>
> On 01/19/2012 12:21 PM, Philippe Carbonnier wrote:
>
> Hello Andreas, and others,
>
> I'm sorry not having well explain my problem (I'm french and don't speak
> english fluently...).
>
> np .. I'm also not a native speaker ;-)
>
> The loop has arrived after putting ujoss1 (not ujboss2) standy, this is
> why I put more log for ujboss1 and not ujboss2.
>
> I'll try to answer your questions :
> the network setup is simple : ujboss1 ans ujboss2 are on the same lan,
> without firewall between. Perhaps it's not what you was looking for?
>
> hmm ... looks really like a short split brain ... default timeouts in
> corosync are quite short at 1000ms for token lost ... try e.g. "token:
> 5000" in the "totem" section of your corosync config.
>
> Regards,
> Andreas
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>




More information about the Pacemaker mailing list