[ClusterLabs] A processor failed, forming new configuration very often and without reason
Jan Friesse
jfriesse at redhat.com
Wed Apr 29 11:10:03 UTC 2015
Philippe,
Philippe Carbonnier wrote:
> Hello,
> just for the guys who don't want to read all the logs, I put my question
> at the top (and at the end) of this post:
> Is there a timer that I can raise to give each node more time to
> see the other BEFORE TOTEM fires the "A processor failed, forming new
> configuration", because the 2 nodes are really up and running?
There are many timers, but basically almost everything depends on the token
timeout, so just set "token" to a higher value.
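For example, a minimal sketch of the relevant part of
/etc/corosync/corosync.conf (the 10000 ms value is only illustrative, pick
something suited to your environment, and restart corosync on both nodes
afterwards):

totem {
        version: 2
        # how long (in ms) to wait for the token before declaring a
        # processor failed; the corosync 1.x default is 1000 ms
        token: 10000
        # keep the existing interface/other totem options unchanged
}

On a running node, corosync-objctl should show the configured value (look
for "totem.token").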
>
> The 2 Linux servers (vif5_7 and host2.example.com) are 2 VMs on the same
> VMware ESX server. Maybe the network is not working the way corosync
> wants?
Yep. But first give the token timeout increase a chance.
>
Regards,
Honza
>
> 2015-04-26 22:08 GMT+02:00 Andrew Beekhof <andrew at beekhof.net>:
>
>>
>>> On 13 Apr 2015, at 7:08 pm, Philippe Carbonnier <
>> Philippe.Carbonnier at vif.fr> wrote:
>>>
>>> Hello Mr Beekhof,
>>>
>>> thanks for your answer. The error when trying to stop the service is
>> just the result of the unsuccessful start of the service: the start tries
>> to create a new IP alias, which fails because the other node is still running it,
>>
>> Is this IP also a managed resource? If so, it should have been removed
>> when the service was asked to stop (and not reported success if it could
>> not do so).
>>
>> Yes, this IP is managed (clusterIP ocf:heartbeat:IPaddr2).
>
>>> so the stop can't be successful because the IP alias is not up on this
>> node.
>>
>
>> I think we'd better see your full config and agent, something sounds very
>> wrong.
>>
>
> I don't know what you mean by full config; is the crm configuration
> enough?
>
> crm configuration:
> node host2.example.com \
>         attributes standby="off"
> node vif5_7 \
>         attributes standby="off"
> primitive clusterIP ocf:heartbeat:IPaddr2 \
>         params ip="10.10.72.3" cidr_netmask="32" iflabel="jbossfailover" \
>         op monitor interval="30s"
> primitive routing-jboss lsb:routing-jboss \
>         op monitor interval="30s"
> group vifGroup clusterIP routing-jboss
> location prefer-clusterIP clusterIP 50: vif5_7
> property $id="cib-bootstrap-options" \
>         dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         stonith-enabled="false" \
>         no-quorum-policy="ignore"
> rsc_defaults $id="rsc-options" \
>         resource-stickiness="20"
>
> lsb:routing-jboss is a simple script running iptables.
> When it starts:
> iptables -t nat -A OUTPUT -d $CLUSTERIP -j DNAT --to-destination $JBOSSFAILOVER
> and when it stops:
> iptables -t nat -D OUTPUT -d $CLUSTERIP -j DNAT --to-destination $JBOSSFAILOVER
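>
> For illustration, a sketch of how the stop can be made to always return 0
> (this is only a sketch, not my exact script; the iptables-save check is
> just one way to test whether the rule is still there):
>
> stop() {
>     # LSB: stopping an already-stopped service must still exit 0,
>     # otherwise Pacemaker records a failed stop (and failed stops
>     # normally lead to fencing)
>     if iptables-save -t nat | grep -q -- "-d ${CLUSTERIP}.*--to-destination ${JBOSSFAILOVER}"; then
>         iptables -t nat -D OUTPUT -d "$CLUSTERIP" -j DNAT --to-destination "$JBOSSFAILOVER"
>     fi
>     return 0
> }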
>
> When the situation is clean, i.e. both resources are running on the vif5_7
> node, all is fine for 2 hours, and then sometimes the 2 nodes don't see
> each other.
> The problem I am trying to understand is 'just' the "A processor failed,
> forming new configuration".
>
> On the vif5_7 side:
> --------------------------
> Apr 20 03:21:29 vif5_7 lrmd: [12556]: info: rsc:clusterIP:5: monitor
> Apr 20 04:18:40 corosync [pcmk ] notice: pcmk_peer_update: Transitional
> membership event on ring 385864: memb=1, new=0, lost=1
> Apr 20 04:18:40 corosync [pcmk ] info: pcmk_peer_update: memb: vif5_7
> 38275594
> Apr 20 04:18:40 corosync [pcmk ] info: pcmk_peer_update: lost:
> hosts2.example.com 122161674
> Apr 20 04:18:40 corosync [pcmk ] notice: pcmk_peer_update: Stable
> membership event on ring 385864: memb=1, new=0, lost=0
> Apr 20 04:18:40 corosync [pcmk ] info: pcmk_peer_update: MEMB: vif5_7
> 38275594
> Apr 20 04:18:40 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node
> hosts2.example.com was not seen in the previous transition
> Apr 20 04:18:40 corosync [pcmk ] info: update_member: Node 122161674/
> hosts2.example.com is now: lost
> Apr 20 04:18:40 corosync [pcmk ] info: send_member_notification: Sending
> membership update 385864 to 2 children
> Apr 20 04:18:40 corosync [TOTEM ] A processor joined or left the membership
> and a new membership was formed.
> Apr 20 04:18:40 corosync [MAIN ] Completed service synchronization, ready
> to provide service.
> Apr 20 04:18:40 vif5_7 crmd: [12559]: WARN: check_dead_member: Our DC node (
> hosts2.example.com) left the cluster
> ...
> Apr 20 04:18:40 vif5_7 crmd: [12559]: info: update_dc: Set DC to vif5_7
> (3.0.1)
>
>
> And on hosts2.example.com, at the same time, I have:
> -----------------------------------------------
> Apr 20 04:09:46 hosts2.example.com cib: [16956]: info: cib_stats: Processed
> 1 operations (10000.00us average, 0% utilization) in the last 10min
> Apr 20 04:18:39 corosync [TOTEM ] A processor failed, forming new
> configuration.
> Apr 20 04:18:40 corosync [pcmk ] notice: pcmk_peer_update: Transitional
> membership event on ring 385864: memb=1, new=0, lost=1
> Apr 20 04:18:40 corosync [pcmk ] info: pcmk_peer_update: memb:
> hosts2.example.com 122161674
> Apr 20 04:18:40 corosync [pcmk ] info: pcmk_peer_update: lost: vif5_7
> 38275594
> Apr 20 04:18:40 corosync [pcmk ] notice: pcmk_peer_update: Stable
> membership event on ring 385864: memb=1, new=0, lost=0
> Apr 20 04:18:40 corosync [pcmk ] info: pcmk_peer_update: MEMB:
> hosts2.example.com 122161674
> Apr 20 04:18:40 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node
> vif5_7 was not seen in the previous transition
> Apr 20 04:18:40 corosync [pcmk ] info: update_member: Node 38275594/vif5_7
> is now: lost
> Apr 20 04:18:40 corosync [pcmk ] info: send_member_notification: Sending
> membership update 385864 to 2 children
>
>
> So vif5_7 says host2.example.com has disappeared, and host2.example.com
> says vif5_7 has disappeared. But why?
>
>
> And in the SAME second, host2.example.com reappears from vif5_7's point of
> view!
> ---------------------------------------------------------------------------------------------------------------------------
> Apr 20 04:18:40 corosync [pcmk ] notice: pcmk_peer_update: Transitional
> membership event on ring 385868: memb=1, new=0, lost=0
> Apr 20 04:18:40 corosync [pcmk ] info: pcmk_peer_update: memb: vif5_7
> 38275594
> Apr 20 04:18:40 corosync [pcmk ] notice: pcmk_peer_update: Stable
> membership event on ring 385868: memb=2, new=1, lost=0
> Apr 20 04:18:40 corosync [pcmk ] info: update_member: Node 122161674/
> host2.example.com is now: member
> Apr 20 04:18:40 corosync [pcmk ] info: pcmk_peer_update: NEW:
> host2.example.com 122161674
> Apr 20 04:18:40 corosync [pcmk ] info: pcmk_peer_update: MEMB: vif5_7
> 38275594
> Apr 20 04:18:40 corosync [pcmk ] info: pcmk_peer_update: MEMB:
> host2.example.com 122161674
> Apr 20 04:18:40 corosync [pcmk ] info: send_member_notification: Sending
> membership update 385868 to 2 children
> Apr 20 04:18:40 corosync [TOTEM ] A processor joined or left the membership
> and a new membership was formed.
> Apr 20 04:18:40 vif5_7 cib: [12555]: notice: ais_dispatch: Membership
> 385868: quorum acquired
> Apr 20 04:18:40 vif5_7 cib: [12555]: info: crm_update_peer: Node
> host2.example.com: id=122161674 state=member (new) addr=r(0) ip(10.10.72.7)
> votes=1 born=385364 seen=385868 proc=00000000000000000000000000013312
> Apr 20 04:18:40 vif5_7 cib: [12555]: info: cib_process_diff: Diff 0.3023.5
> -> 0.3023.6 not applied to 0.3023.1: current "num_updates" is less than
> required
> Apr 20 04:18:40 vif5_7 cib: [12555]: WARN: cib_server_process_diff: Not
> requesting full refresh in slave mode.
> Apr 20 04:18:40 vif5_7 cib: [12555]: info: cib_process_diff: Diff 0.3023.6
> -> 0.3023.7 not applied to 0.3023.1: current "num_updates" is less than
> required
> Apr 20 04:18:40 vif5_7 cib: [12555]: WARN: cib_server_process_diff: Not
> requesting full refresh in slave mode.
> Apr 20 04:18:40 vif5_7 cib: [12555]: info: cib_process_diff: Diff 0.3023.7
> -> 0.3023.8 not applied to 0.3023.1: current "num_updates" is less than
> required
> Apr 20 04:18:40 vif5_7 cib: [12555]: WARN: cib_server_process_diff: Not
> requesting full refresh in slave mode.
> Apr 20 04:18:40 vif5_7 cib: [12555]: info: cib_process_diff: Diff 0.3023.8
> -> 0.3024.1 not applied to 0.3023.1: current "num_updates" is less than
> required
> Apr 20 04:18:40 vif5_7 cib: [12555]: WARN: cib_server_process_diff: Not
> requesting full refresh in slave mode.
> Apr 20 04:18:40 corosync [MAIN ] Completed service synchronization, ready
> to provide service.
> Apr 20 04:18:41 vif5_7 crmd: [12559]: info: te_connect_stonith: Connected
> Apr 20 04:18:41 vif5_7 crmd: [12559]: notice: ais_dispatch: Membership
> 385868: quorum acquired
> Apr 20 04:18:41 vif5_7 crmd: [12559]: info: ais_status_callback: status:
> host2.example.com is now member (was lost)
> Apr 20 04:18:41 vif5_7 crmd: [12559]: info: crm_update_peer: Node
> host2.example.com: id=122161674 state=member (new) addr=r(0) ip(10.10.72.7)
> votes=1 born=385364 seen=385868 proc=00000000000000000000000000013312
>
>
> On host2.example.com, same thing: vif5_7 reappears:
> Apr 20 04:18:40 corosync [pcmk ] info: update_member: Node 38275594/vif5_7
> is now: member
> Apr 20 04:18:40 corosync [pcmk ] info: pcmk_peer_update: NEW: vif5_7
> 38275594
> Apr 20 04:18:40 corosync [pcmk ] info: pcmk_peer_update: MEMB: vif5_7
> 38275594
> Apr 20 04:18:40 corosync [pcmk ] info: pcmk_peer_update: MEMB:
> host2.example.com 122161674
> ...
>
> Pacemaker (pengine) sees that the resources are online on both nodes:
> Apr 20 04:18:41 host2.example.com pengine: [16959]: info:
> determine_online_status: Node host2.example.com is online
> Apr 20 04:18:41 host2.example.com pengine: [16959]: info:
> determine_online_status: Node vif5_7 is online
> Apr 20 04:18:41 host2.example.com pengine: [16959]: ERROR:
> native_add_running: Resource lsb::routing-jboss:routing-jboss appears to be
> active on 2 nodes.
> Apr 20 04:18:41 host2.example.com pengine: [16959]: WARN: See
> http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Apr 20 04:18:41 host2.example.com pengine: [16959]: ERROR:
> native_add_running: Resource ocf::IPaddr2:clusterIP appears to be active on
> 2 nodes.
> Apr 20 04:18:41 host2.example.com pengine: [16959]: WARN: See
> http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
> Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: group_print:
> Resource Group: vifGroup
> Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: native_print:
> clusterIP (ocf::heartbeat:IPaddr2) Started
> Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: native_print:
> 0 : host2.example.com
> Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: native_print:
> 1 : vif5_7
> Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: native_print:
> routing-jboss (lsb:routing-jboss) Started
> Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: native_print:
> 0 : host2.example.com
> Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: native_print:
> 1 : vif5_7
> Apr 20 04:18:41 host2.example.com pengine: [16959]: WARN:
> native_create_actions: Attempting recovery of resource clusterIP
> Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: RecurringOp:
> Start recurring monitor (30s) for clusterIP on vif5_7
> Apr 20 04:18:41 host2.example.com pengine: [16959]: WARN:
> native_create_actions: Attempting recovery of resource routing-jboss
> Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: RecurringOp:
> Start recurring monitor (30s) for routing-jboss on vif5_7
> Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: LogActions:
> Move resource clusterIP (Started host2.example.com -> vif5_7)
> Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: LogActions:
> Move resource routing-jboss (Started host2.example.com -> vif5_7)
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: do_state_transition:
> State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [
> input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: unpack_graph:
> Unpacked transition 737: 13 actions in 13 synapses
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: do_te_invoke:
> Processing graph 737 (ref=pe_calc-dc-1429496321-2258) derived from
> /var/lib/pengine/pe-error-545.bz2
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_pseudo_action:
> Pseudo action 17 fired and confirmed
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
> Initiating action 12: stop routing-jboss_stop_0 on host2.example.com (local)
> Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: cancel_op: operation
> monitor[357] on lsb::routing-jboss::routing-jboss for client 16960, its
> parameters: CRM_meta_interval=[30000] CRM_meta_timeout=[20000]
> crm_feature_set=[3.0.1] CRM_meta_name=[monitor] cancelled
> Apr 20 04:18:41 host2.example.com attrd: [16958]: info:
> attrd_local_callback: Sending full refresh (origin=crmd)
> Apr 20 04:18:41 host2.example.com attrd: [16958]: info:
> attrd_trigger_update: Sending flush op to all hosts for:
> fail-count-routing-jboss (<null>)
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: do_lrm_rsc_op:
> Performing key=12:737:0:90d438c0-2db4-4d39-bdf9-f0476d6f06ff
> op=routing-jboss_stop_0 )
> Apr 20 04:18:41 host2.example.com lrmd: [16957]: info:
> rsc:routing-jboss:358: stop
> Apr 20 04:18:41 host2.example.com attrd: [16958]: info:
> attrd_trigger_update: Sending flush op to all hosts for: probe_complete
> (true)
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
> Initiating action 13: stop routing-jboss_stop_0 on vif5_7
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: process_lrm_event:
> LRM operation routing-jboss_monitor_30000 (call=357, status=1,
> cib-update=0, confirmed=true) Cancelled
> Apr 20 04:18:41 host2.example.com lrmd: [17123]: WARN: For LSB init script,
> no additional parameters are needed.
> Apr 20 04:18:41 host2.example.com pengine: [16959]: ERROR:
> process_pe_message: Transition 737: ERRORs found during PE processing.
> PEngine Input stored in: /var/lib/pengine/pe-error-545.bz2
> Apr 20 04:18:41 host2.example.com attrd: [16958]: info:
> attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
> Apr 20 04:18:41 host2.example.com attrd: [16958]: info:
> attrd_trigger_update: Sending flush op to all hosts for:
> last-failure-routing-jboss (<null>)
> Apr 20 04:18:41 host2.example.com attrd: [16958]: info:
> attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
> Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
> (routing-jboss:stop:stdout) Disabling traffic redirection from 10.10.72.3
> to 10.10.72.7
> Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
> (routing-jboss:stop:stdout) [
> Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
> (routing-jboss:stop:stdout) OK
> Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
> (routing-jboss:stop:stdout) ]
> Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
> (routing-jboss:stop:stdout) ^M
> Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
> (routing-jboss:stop:stdout)
>
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: process_lrm_event:
> LRM operation routing-jboss_stop_0 (call=358, rc=0, cib-update=3395,
> confirmed=true) ok
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
> Action routing-jboss_stop_0 (12) confirmed on host2.example.com (rc=0)
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
> Action routing-jboss_stop_0 (13) confirmed on vif5_7 (rc=0)
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
> Initiating action 9: stop clusterIP_stop_0 on host2.example.com (local)
> Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: cancel_op: operation
> monitor[355] on ocf::IPaddr2::clusterIP for client 16960, its parameters:
> CRM_meta_interval=[30000] ip=[10.10.72.3] cidr_netmask=[32]
> CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor]
> iflabel=[jbossfailover] cancelled
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: do_lrm_rsc_op:
> Performing key=9:737:0:90d438c0-2db4-4d39-bdf9-f0476d6f06ff
> op=clusterIP_stop_0 )
> Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: rsc:clusterIP:359:
> stop
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
> Initiating action 10: stop clusterIP_stop_0 on vif5_7
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: process_lrm_event:
> LRM operation clusterIP_monitor_30000 (call=355, status=1, cib-update=0,
> confirmed=true) Cancelled
> Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
> (clusterIP:stop:stderr) logger: unknown facility name: none.
> Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
> (clusterIP:stop:stderr) logger: unknown facility name: none.
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: process_lrm_event:
> LRM operation clusterIP_stop_0 (call=359, rc=0, cib-update=3396,
> confirmed=true) ok
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
> Action clusterIP_stop_0 (9) confirmed on host2.example.com (rc=0)
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
> Action clusterIP_stop_0 (10) confirmed on vif5_7 (rc=0)
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_pseudo_action:
> Pseudo action 18 fired and confirmed
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_pseudo_action:
> Pseudo action 5 fired and confirmed
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_pseudo_action:
> Pseudo action 15 fired and confirmed
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
> Initiating action 11: start clusterIP_start_0 on vif5_7
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
> Action clusterIP_start_0 (11) confirmed on vif5_7 (rc=0)
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
> Initiating action 4: monitor clusterIP_monitor_30000 on vif5_7
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
> Initiating action 14: start routing-jboss_start_0 on vif5_7
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
> Action routing-jboss_start_0 (14) confirmed on vif5_7 (rc=0)
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_pseudo_action:
> Pseudo action 16 fired and confirmed
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
> Initiating action 3: monitor routing-jboss_monitor_30000 on vif5_7
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
> Action routing-jboss_monitor_30000 (3) confirmed on vif5_7 (rc=0)
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
> Action clusterIP_monitor_30000 (4) confirmed on vif5_7 (rc=0)
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: run_graph:
> ====================================================
> Apr 20 04:18:41 host2.example.com crmd: [16960]: notice: run_graph:
> Transition 737 (Complete=13, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pengine/pe-error-545.bz2): Complete
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_graph_trigger:
> Transition 737 is now complete
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: notify_crmd:
> Transition 737 status: done - <null>
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: do_state_transition:
> State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Apr 20 04:18:41 host2.example.com crmd: [16960]: info: do_state_transition:
> Starting PEngine Recheck Timer
> Apr 20 04:18:44 host2.example.com lrmd: [16957]: info: RA output:
> (clusterIP:start:stderr) ARPING 10.10.72.3 from 10.10.72.3 eth0
> Sent 5 probes (5 broadcast(s))
> Received 0 response(s)
>
> Apr 20 04:19:46 host2.example.com cib: [16956]: info: cib_stats: Processed
> 80 operations (2750.00us average, 0% utilization) in the last 10min
> Apr 20 04:33:41 host2.example.com crmd: [16960]: info: crm_timer_popped:
> PEngine Recheck Timer (I_PE_CALC) just popped!
> Apr 20 04:33:41 host2.example.com crmd: [16960]: info: do_state_transition:
> State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Apr 20 04:33:41 host2.example.com crmd: [16960]: info: do_state_transition:
> Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> Apr 20 04:33:41 host2.example.com crmd: [16960]: info: do_state_transition:
> All 2 cluster nodes are eligible to run resources.
> Apr 20 04:33:41 host2.example.com crmd: [16960]: info: do_pe_invoke: Query
> 3397: Requesting the current CIB: S_POLICY_ENGINE
> Apr 20 04:33:41 host2.example.com crmd: [16960]: info:
> do_pe_invoke_callback: Invoking the PE: query=3397,
> ref=pe_calc-dc-1429497221-2267, seq=385868, quorate=1
> Apr 20 04:33:42 host2.example.com pengine: [16959]: notice: unpack_config:
> On loss of CCM Quorum: Ignore
> Apr 20 04:33:42 host2.example.com pengine: [16959]: info: unpack_config:
> Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Apr 20 04:33:42 host2.example.com pengine: [16959]: info:
> determine_online_status: Node host2.example.com is online
> Apr 20 04:33:42 host2.example.com pengine: [16959]: info:
> determine_online_status: Node vif5_7 is online
> Apr 20 04:33:42 host2.example.com pengine: [16959]: notice: group_print:
> Resource Group: vifGroup
> Apr 20 04:33:42 host2.example.com pengine: [16959]: notice: native_print:
> clusterIP (ocf::heartbeat:IPaddr2): Started vif5_7
> Apr 20 04:33:42 host2.example.com pengine: [16959]: notice: native_print:
> routing-jboss (lsb:routing-jboss): Started vif5_7
> Apr 20 04:33:42 host2.example.com pengine: [16959]: notice: LogActions:
> Leave resource clusterIP (Started vif5_7)
> Apr 20 04:33:42 host2.example.com pengine: [16959]: notice: LogActions:
> Leave resource routing-jboss (Started vif5_7)
>
> so the resources have been returned normally to the vif5_7 node.
>
> But 11 hours later, the same thing again: the 2 nodes disappeared from each
> other's point of view while they were really up and running:
> Apr 20 15:33:42 host2.example.com pengine: [16959]: info:
> process_pe_message: Transition 782: PEngine Input stored in:
> /var/lib/pengine/pe-input-88684.bz2
> Apr 20 15:38:40 corosync [TOTEM ] A processor failed, forming new
> configuration.
> Apr 20 15:38:42 corosync [pcmk ] notice: pcmk_peer_update: Transitional
> membership event on ring 385872: memb=1, new=0, lost=1
> Apr 20 15:38:42 corosync [pcmk ] info: pcmk_peer_update: memb:
> host2.example.com 122161674
> Apr 20 15:38:42 corosync [pcmk ] info: pcmk_peer_update: lost: vif5_7
> 38275594
> Apr 20 15:38:42 corosync [pcmk ] notice: pcmk_peer_update: Stable
> membership event on ring 385872: memb=1, new=0, lost=0
> Apr 20 15:38:42 corosync [pcmk ] info: pcmk_peer_update: MEMB:
> host2.example.com 122161674
> Apr 20 15:38:42 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node
> vif5_7 was not seen in the previous transition
>
> and so on:
> Apr 20 04:18:39 corosync [TOTEM ] A processor failed, forming new
> configuration.
> Apr 20 04:18:40 corosync [TOTEM ] A processor joined or left the membership
> and a new membership was formed.
> Apr 20 04:18:40 corosync [TOTEM ] A processor joined or left the membership
> and a new membership was formed.
> Apr 20 15:38:40 corosync [TOTEM ] A processor failed, forming new
> configuration.
> Apr 20 15:38:42 corosync [TOTEM ] A processor joined or left the membership
> and a new membership was formed.
> Apr 20 15:38:42 corosync [TOTEM ] A processor joined or left the membership
> and a new membership was formed.
> Apr 20 15:57:35 corosync [TOTEM ] A processor failed, forming new
> configuration.
> Apr 20 15:57:36 corosync [TOTEM ] A processor joined or left the membership
> and a new membership was formed.
> Apr 20 15:57:38 corosync [TOTEM ] A processor joined or left the membership
> and a new membership was formed.
> Apr 20 16:20:06 corosync [TOTEM ] A processor failed, forming new
> configuration.
> Apr 20 16:20:07 corosync [TOTEM ] A processor joined or left the membership
> and a new membership was formed.
> Apr 20 16:20:11 corosync [TOTEM ] A processor joined or left the membership
> and a new membership was formed.
>
>
> So my question is: is there a timer that I can raise to give each node
> more time to see the other BEFORE TOTEM fires the "A processor failed,
> forming new configuration"?
> Once again, the 2 Linux servers (vif5_7 and host2.example.com) are 2 VMs on
> the same VMware ESX server. Maybe the network is not working the way
> corosync wants?
>
> Sorry for my English. I tried to include all the logs to help you
> understand the problem...
> Best regards
>
>
>
>
>
>>
>>> IMHO, it's just the symptom, while the root cause is that the 2 nodes
>> sometimes don't see each other.
>>>
>>> But I've listened to your suggestion and changed the return code of the
>> agent when it is asked to stop: now it returns 0 even if it can't
>> remove an iptables NAT rule.
>>>
>>> Do you think this message can help? The 2 VMs are on VMware, which
>> sometimes gives strange time: 'current "epoch" is greater than required'
>>>
>>> Best regards,
>>>
>>>
>>> 2015-04-13 5:00 GMT+02:00 Andrew Beekhof <andrew at beekhof.net>:
>>>
>>>> On 10 Apr 2015, at 11:37 pm, Philippe Carbonnier <
>> Philippe.Carbonnier at vif.fr> wrote:
>>>>
>>>> Hello,
>>>>
>>>> The context :
>>>> Red Hat Enterprise Linux Server release 5.7
>>>> corosynclib-1.2.7-1.1.el5.x86_64
>>>> corosync-1.2.7-1.1.el5.x86_64
>>>> pacemaker-1.0.10-1.4.el5.x86_64
>>>> pacemaker-libs-1.0.10-1.4.el5.x86_64
>>>> 2 nodes, both on the same ESX server
>>>>
>>>> I have lots of "processor joined or left the membership" messages but
>> can't understand why, because the 2 hosts are up and running, and when
>> corosync tries to start the cluster's resources it can't, because they are
>> already up on the first node.
>>>> We can see "Another DC detected", so the communication between the 2 VMs
>> is OK.
>>>>
>>>> I've tried to raise the totem parameters, without success.
>>>
>>>> Apr 10 13:34:55 host2.example.com pengine: [26529]: WARN:
>> unpack_rsc_op: Processing failed op routing-jboss_stop_0 on
>> host2.example.com: invalid parameter (2)
>>>
>>> ^^^ Failed stops lead to fencing.
>>>
>>> The agent and/or your config need fixing.
>>>
>>>
>>
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>