[ClusterLabs] A processor failed, forming new configuration very often and without reason

Philippe Carbonnier philippe.carbonnier at vif.fr
Wed Apr 29 04:52:15 EDT 2015


Hello,
just for those who don't want to read all the logs, I've put my question at
the top (and at the end) of the post:
Is there a timer I can raise to give each node more time to see the other
BEFORE TOTEM fires "A processor failed, forming new configuration"? The two
nodes really are up and running.

The two Linux servers (vif5_7 and host2.example.com) are two VMs on the same
VMware ESX server. Maybe the network is not behaving the way corosync
expects?


2015-04-26 22:08 GMT+02:00 Andrew Beekhof <andrew at beekhof.net>:

>
> > On 13 Apr 2015, at 7:08 pm, Philippe Carbonnier <
> Philippe.Carbonnier at vif.fr> wrote:
> >
> > Hello Mr Beekhof,
> >
> > thanks for your answer. The error when trying to stop the service is
> just the result of the unsuccessful start of the service: the start tries
> to create a new IP alias, which fails because the other node is still running it,
>
> Is this IP also a managed resource? If so, it should have been removed
> when the service was asked to stop (and not reported ’success’ if it could
> not do so).
>
Yes, this IP is managed (clusterIP ocf:heartbeat:IPaddr2).

> > so the stop can't be successful because the IP alias is not up on this
> node.
>

> I think we’d better see your full config and agent, something sounds very
> wrong.
>

I'm not sure what you mean by "full config"; is the crm configuration
enough?

crm configuration:
node host2.example.com \
        attributes standby="off"
node vif5_7 \
        attributes standby="off"
primitive clusterIP ocf:heartbeat:IPaddr2 \
        params ip="10.10.72.3" cidr_netmask="32" iflabel="jbossfailover" \
        op monitor interval="30s"
primitive routing-jboss lsb:routing-jboss \
        op monitor interval="30s"
group vifGroup clusterIP routing-jboss
location prefer-clusterIP clusterIP 50: vif5_7
property $id="cib-bootstrap-options" \
        dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
        resource-stickiness="20"

lsb:routing-jboss is a simple script that manages an iptables rule.
On start it runs:
iptables -t nat -A OUTPUT -d $CLUSTERIP -j DNAT --to-destination $JBOSSFAILOVER
and on stop:
iptables -t nat -D OUTPUT -d $CLUSTERIP -j DNAT --to-destination $JBOSSFAILOVER
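
To make the discussion concrete, here is roughly what such a script can look
like. This is a simplified sketch only, not our real script; the addresses and
the rule check are assumptions. The point is that stop exits 0 even when the
rule is already gone, which as far as I understand is what the LSB convention
and Pacemaker expect:

#!/bin/sh
# Simplified sketch of an LSB-style routing-jboss script (illustration only)
CLUSTERIP=10.10.72.3        # address managed by the clusterIP resource
JBOSSFAILOVER=10.10.72.7    # JBoss failover target

rule_present() {
    # rough check: is the DNAT rule currently in the nat table?
    iptables-save -t nat 2>/dev/null \
        | grep -q -e "-d ${CLUSTERIP}.* --to-destination ${JBOSSFAILOVER}"
}

case "$1" in
  start)
    # add the rule only if it is not already there; report failure otherwise
    rule_present || iptables -t nat -A OUTPUT -d "$CLUSTERIP" \
        -j DNAT --to-destination "$JBOSSFAILOVER" || exit 1
    ;;
  stop)
    # per LSB, stopping an already-stopped service must still exit 0,
    # otherwise Pacemaker treats the stop as failed
    if rule_present; then
        iptables -t nat -D OUTPUT -d "$CLUSTERIP" \
            -j DNAT --to-destination "$JBOSSFAILOVER"
    fi
    ;;
  status)
    # 0 = running, 3 = not running (what Pacemaker expects from LSB status)
    rule_present || exit 3
    ;;
  *)
    echo "Usage: $0 {start|stop|status}" >&2
    exit 2
    ;;
esac
exit 0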

When the situation is clean, i.e. both resources are running on the vif5_7
node, everything is fine for a couple of hours, and then from time to time
the two nodes stop seeing each other.
The problem I am trying to understand is 'just' the "A processor failed,
forming new configuration" message.

On the vif5_7 side:
--------------------------
Apr 20 03:21:29 vif5_7 lrmd: [12556]: info: rsc:clusterIP:5: monitor
Apr 20 04:18:40 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 385864: memb=1, new=0, lost=1
Apr 20 04:18:40 corosync [pcmk  ] info: pcmk_peer_update: memb: vif5_7
38275594
Apr 20 04:18:40 corosync [pcmk  ] info: pcmk_peer_update: lost:
hosts2.example.com 122161674
Apr 20 04:18:40 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 385864: memb=1, new=0, lost=0
Apr 20 04:18:40 corosync [pcmk  ] info: pcmk_peer_update: MEMB: vif5_7
38275594
Apr 20 04:18:40 corosync [pcmk  ] info: ais_mark_unseen_peer_dead: Node
hosts2.example.com was not seen in the previous transition
Apr 20 04:18:40 corosync [pcmk  ] info: update_member: Node 122161674/
hosts2.example.com is now: lost
Apr 20 04:18:40 corosync [pcmk  ] info: send_member_notification: Sending
membership update 385864 to 2 children
Apr 20 04:18:40 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Apr 20 04:18:40 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Apr 20 04:18:40 vif5_7 crmd: [12559]: WARN: check_dead_member: Our DC node (
hosts2.example.com) left the cluster
...
Apr 20 04:18:40 vif5_7 crmd: [12559]: info: update_dc: Set DC to vif5_7
(3.0.1)


And on host2.example.com, at the same time, I have:
-----------------------------------------------
Apr 20 04:09:46 hosts2.example.com cib: [16956]: info: cib_stats: Processed
1 operations (10000.00us average, 0% utilization) in the last 10min
Apr 20 04:18:39 corosync [TOTEM ] A processor failed, forming new
configuration.
Apr 20 04:18:40 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 385864: memb=1, new=0, lost=1
Apr 20 04:18:40 corosync [pcmk  ] info: pcmk_peer_update: memb:
hosts2.example.com 122161674
Apr 20 04:18:40 corosync [pcmk  ] info: pcmk_peer_update: lost: vif5_7
38275594
Apr 20 04:18:40 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 385864: memb=1, new=0, lost=0
Apr 20 04:18:40 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
hosts2.example.com 122161674
Apr 20 04:18:40 corosync [pcmk  ] info: ais_mark_unseen_peer_dead: Node
vif5_7 was not seen in the previous transition
Apr 20 04:18:40 corosync [pcmk  ] info: update_member: Node 38275594/vif5_7
is now: lost
Apr 20 04:18:40 corosync [pcmk  ] info: send_member_notification: Sending
membership update 385864 to 2 children


So vif5_7 says that host2.example.com has disappeared, and host2.example.com
says that vif5_7 has disappeared. But why?


And in the SAME second, host2.example.com reappears from vif5_7's point of
view!
--------------------------
Apr 20 04:18:40 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 385868: memb=1, new=0, lost=0
Apr 20 04:18:40 corosync [pcmk  ] info: pcmk_peer_update: memb: vif5_7
38275594
Apr 20 04:18:40 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 385868: memb=2, new=1, lost=0
Apr 20 04:18:40 corosync [pcmk  ] info: update_member: Node 122161674/
host2.example.com is now: member
Apr 20 04:18:40 corosync [pcmk  ] info: pcmk_peer_update: NEW:
host2.example.com 122161674
Apr 20 04:18:40 corosync [pcmk  ] info: pcmk_peer_update: MEMB: vif5_7
38275594
Apr 20 04:18:40 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
host2.example.com 122161674
Apr 20 04:18:40 corosync [pcmk  ] info: send_member_notification: Sending
membership update 385868 to 2 children
Apr 20 04:18:40 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Apr 20 04:18:40 vif5_7 cib: [12555]: notice: ais_dispatch: Membership
385868: quorum acquired
Apr 20 04:18:40 vif5_7 cib: [12555]: info: crm_update_peer: Node
host2.example.com: id=122161674 state=member (new) addr=r(0) ip(10.10.72.7)
 votes=1 born=385364 seen=385868 proc=00000000000000000000000000013312
Apr 20 04:18:40 vif5_7 cib: [12555]: info: cib_process_diff: Diff 0.3023.5
-> 0.3023.6 not applied to 0.3023.1: current "num_updates" is less than
required
Apr 20 04:18:40 vif5_7 cib: [12555]: WARN: cib_server_process_diff: Not
requesting full refresh in slave mode.
Apr 20 04:18:40 vif5_7 cib: [12555]: info: cib_process_diff: Diff 0.3023.6
-> 0.3023.7 not applied to 0.3023.1: current "num_updates" is less than
required
Apr 20 04:18:40 vif5_7 cib: [12555]: WARN: cib_server_process_diff: Not
requesting full refresh in slave mode.
Apr 20 04:18:40 vif5_7 cib: [12555]: info: cib_process_diff: Diff 0.3023.7
-> 0.3023.8 not applied to 0.3023.1: current "num_updates" is less than
required
Apr 20 04:18:40 vif5_7 cib: [12555]: WARN: cib_server_process_diff: Not
requesting full refresh in slave mode.
Apr 20 04:18:40 vif5_7 cib: [12555]: info: cib_process_diff: Diff 0.3023.8
-> 0.3024.1 not applied to 0.3023.1: current "num_updates" is less than
required
Apr 20 04:18:40 vif5_7 cib: [12555]: WARN: cib_server_process_diff: Not
requesting full refresh in slave mode.
Apr 20 04:18:40 corosync [MAIN  ] Completed service synchronization, ready
to provide service.
Apr 20 04:18:41 vif5_7 crmd: [12559]: info: te_connect_stonith: Connected
Apr 20 04:18:41 vif5_7 crmd: [12559]: notice: ais_dispatch: Membership
385868: quorum acquired
Apr 20 04:18:41 vif5_7 crmd: [12559]: info: ais_status_callback: status:
host2.example.com is now member (was lost)
Apr 20 04:18:41 vif5_7 crmd: [12559]: info: crm_update_peer: Node
host2.example.com: id=122161674 state=member (new) addr=r(0) ip(10.10.72.7)
 votes=1 born=385364 seen=385868 proc=00000000000000000000000000013312


On host2.example.com, the same thing: vif5_7 reappears:
Apr 20 04:18:40 corosync [pcmk  ] info: update_member: Node 38275594/vif5_7
is now: member
Apr 20 04:18:40 corosync [pcmk  ] info: pcmk_peer_update: NEW:  vif5_7
38275594
Apr 20 04:18:40 corosync [pcmk  ] info: pcmk_peer_update: MEMB: vif5_7
38275594
Apr 20 04:18:40 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
host2.example.com  122161674
...

The policy engine then sees both nodes as online, with the resources
apparently active on both of them:
Apr 20 04:18:41 host2.example.com pengine: [16959]: info:
determine_online_status: Node host2.example.com is online
Apr 20 04:18:41 host2.example.com pengine: [16959]: info:
determine_online_status: Node vif5_7 is online
Apr 20 04:18:41 host2.example.com pengine: [16959]: ERROR:
native_add_running: Resource lsb::routing-jboss:routing-jboss appears to be
active on 2 nodes.
Apr 20 04:18:41 host2.example.com pengine: [16959]: WARN: See
http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Apr 20 04:18:41 host2.example.com pengine: [16959]: ERROR:
native_add_running: Resource ocf::IPaddr2:clusterIP appears to be active on
2 nodes.
Apr 20 04:18:41 host2.example.com pengine: [16959]: WARN: See
http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: group_print:
 Resource Group: vifGroup
Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: native_print:
   clusterIP       (ocf::heartbeat:IPaddr2) Started
Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: native_print:
   0 : host2.example.com
Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: native_print:
   1 : vif5_7
Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: native_print:
   routing-jboss   (lsb:routing-jboss) Started
Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: native_print:
   0 : host2.example.com
Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: native_print:
   1 : vif5_7
Apr 20 04:18:41 host2.example.com pengine: [16959]: WARN:
native_create_actions: Attempting recovery of resource clusterIP
Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: RecurringOp:
 Start recurring monitor (30s) for clusterIP on vif5_7
Apr 20 04:18:41 host2.example.com pengine: [16959]: WARN:
native_create_actions: Attempting recovery of resource routing-jboss
Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: RecurringOp:
 Start recurring monitor (30s) for routing-jboss on vif5_7
Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: LogActions:
Move resource clusterIP        (Started host2.example.com -> vif5_7)
Apr 20 04:18:41 host2.example.com pengine: [16959]: notice: LogActions:
Move resource routing-jboss    (Started host2.example.com -> vif5_7)
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: do_state_transition:
State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [
input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: unpack_graph:
Unpacked transition 737: 13 actions in 13 synapses
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: do_te_invoke:
Processing graph 737 (ref=pe_calc-dc-1429496321-2258) derived from
/var/lib/pengine/pe-error-545.bz2
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_pseudo_action:
Pseudo action 17 fired and confirmed
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
Initiating action 12: stop routing-jboss_stop_0 on host2.example.com (local)
Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: cancel_op: operation
monitor[357] on lsb::routing-jboss::routing-jboss for client 16960, its
parameters: CRM_meta_interval=[30000] CRM_meta_timeout=[20000]
crm_feature_set=[3.0.1] CRM_meta_name=[monitor]  cancelled
Apr 20 04:18:41 host2.example.com attrd: [16958]: info:
attrd_local_callback: Sending full refresh (origin=crmd)
Apr 20 04:18:41 host2.example.com attrd: [16958]: info:
attrd_trigger_update: Sending flush op to all hosts for:
fail-count-routing-jboss (<null>)
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: do_lrm_rsc_op:
Performing key=12:737:0:90d438c0-2db4-4d39-bdf9-f0476d6f06ff
op=routing-jboss_stop_0 )
Apr 20 04:18:41 host2.example.com lrmd: [16957]: info:
rsc:routing-jboss:358: stop
Apr 20 04:18:41 host2.example.com attrd: [16958]: info:
attrd_trigger_update: Sending flush op to all hosts for: probe_complete
(true)
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
Initiating action 13: stop routing-jboss_stop_0 on vif5_7
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: process_lrm_event:
LRM operation routing-jboss_monitor_30000 (call=357, status=1,
cib-update=0, confirmed=true) Cancelled
Apr 20 04:18:41 host2.example.com lrmd: [17123]: WARN: For LSB init script,
no additional parameters are needed.
Apr 20 04:18:41 host2.example.com pengine: [16959]: ERROR:
process_pe_message: Transition 737: ERRORs found during PE processing.
PEngine Input stored in: /var/lib/pengine/pe-error-545.bz2
Apr 20 04:18:41 host2.example.com attrd: [16958]: info:
attrd_trigger_update: Sending flush op to all hosts for: terminate (<null>)
Apr 20 04:18:41 host2.example.com attrd: [16958]: info:
attrd_trigger_update: Sending flush op to all hosts for:
last-failure-routing-jboss (<null>)
Apr 20 04:18:41 host2.example.com attrd: [16958]: info:
attrd_trigger_update: Sending flush op to all hosts for: shutdown (<null>)
Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
(routing-jboss:stop:stdout) Disabling traffic redirection from 10.10.72.3
to 10.10.72.7
Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
(routing-jboss:stop:stdout) [
Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
(routing-jboss:stop:stdout)   OK
Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
(routing-jboss:stop:stdout) ]
Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
(routing-jboss:stop:stdout) ^M
Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
(routing-jboss:stop:stdout)

Apr 20 04:18:41 host2.example.com crmd: [16960]: info: process_lrm_event:
LRM operation routing-jboss_stop_0 (call=358, rc=0, cib-update=3395,
confirmed=true) ok
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
Action routing-jboss_stop_0 (12) confirmed on host2.example.com (rc=0)
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
Action routing-jboss_stop_0 (13) confirmed on vif5_7 (rc=0)
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
Initiating action 9: stop clusterIP_stop_0 on host2.example.com (local)
Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: cancel_op: operation
monitor[355] on ocf::IPaddr2::clusterIP for client 16960, its parameters:
CRM_meta_interval=[30000] ip=[10.10.72.3] cidr_netmask=[32]
CRM_meta_timeout=[20000] crm_feature_set=[3.0.1] CRM_meta_name=[monitor]
iflabel=[jbossfailover]  cancelled
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: do_lrm_rsc_op:
Performing key=9:737:0:90d438c0-2db4-4d39-bdf9-f0476d6f06ff
op=clusterIP_stop_0 )
Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: rsc:clusterIP:359:
stop
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
Initiating action 10: stop clusterIP_stop_0 on vif5_7
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: process_lrm_event:
LRM operation clusterIP_monitor_30000 (call=355, status=1, cib-update=0,
confirmed=true) Cancelled
Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
(clusterIP:stop:stderr) logger: unknown facility name: none.
Apr 20 04:18:41 host2.example.com lrmd: [16957]: info: RA output:
(clusterIP:stop:stderr) logger: unknown facility name: none.
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: process_lrm_event:
LRM operation clusterIP_stop_0 (call=359, rc=0, cib-update=3396,
confirmed=true) ok
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
Action clusterIP_stop_0 (9) confirmed on host2.example.com (rc=0)
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
Action clusterIP_stop_0 (10) confirmed on vif5_7 (rc=0)
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_pseudo_action:
Pseudo action 18 fired and confirmed
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_pseudo_action:
Pseudo action 5 fired and confirmed
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_pseudo_action:
Pseudo action 15 fired and confirmed
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
Initiating action 11: start clusterIP_start_0 on vif5_7
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
Action clusterIP_start_0 (11) confirmed on vif5_7 (rc=0)
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
Initiating action 4: monitor clusterIP_monitor_30000 on vif5_7
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
Initiating action 14: start routing-jboss_start_0 on vif5_7
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
Action routing-jboss_start_0 (14) confirmed on vif5_7 (rc=0)
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_pseudo_action:
Pseudo action 16 fired and confirmed
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_rsc_command:
Initiating action 3: monitor routing-jboss_monitor_30000 on vif5_7
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
Action routing-jboss_monitor_30000 (3) confirmed on vif5_7 (rc=0)
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: match_graph_event:
Action clusterIP_monitor_30000 (4) confirmed on vif5_7 (rc=0)
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: run_graph:
====================================================
Apr 20 04:18:41 host2.example.com crmd: [16960]: notice: run_graph:
Transition 737 (Complete=13, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-error-545.bz2): Complete
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: te_graph_trigger:
Transition 737 is now complete
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: notify_crmd:
Transition 737 status: done - <null>
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: do_state_transition:
State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr 20 04:18:41 host2.example.com crmd: [16960]: info: do_state_transition:
Starting PEngine Recheck Timer
Apr 20 04:18:44 host2.example.com lrmd: [16957]: info: RA output:
(clusterIP:start:stderr) ARPING 10.10.72.3 from 10.10.72.3 eth0
Sent 5 probes (5 broadcast(s))
Received 0 response(s)

Apr 20 04:19:46 host2.example.com cib: [16956]: info: cib_stats: Processed
80 operations (2750.00us average, 0% utilization) in the last 10min
Apr 20 04:33:41 host2.example.com crmd: [16960]: info: crm_timer_popped:
PEngine Recheck Timer (I_PE_CALC) just popped!
Apr 20 04:33:41 host2.example.com crmd: [16960]: info: do_state_transition:
State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Apr 20 04:33:41 host2.example.com crmd: [16960]: info: do_state_transition:
Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Apr 20 04:33:41 host2.example.com crmd: [16960]: info: do_state_transition:
All 2 cluster nodes are eligible to run resources.
Apr 20 04:33:41 host2.example.com crmd: [16960]: info: do_pe_invoke: Query
3397: Requesting the current CIB: S_POLICY_ENGINE
Apr 20 04:33:41 host2.example.com crmd: [16960]: info:
do_pe_invoke_callback: Invoking the PE: query=3397,
ref=pe_calc-dc-1429497221-2267, seq=385868, quorate=1
Apr 20 04:33:42 host2.example.com pengine: [16959]: notice: unpack_config:
On loss of CCM Quorum: Ignore
Apr 20 04:33:42 host2.example.com pengine: [16959]: info: unpack_config:
Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Apr 20 04:33:42 host2.example.com pengine: [16959]: info:
determine_online_status: Node host2.example.com is online
Apr 20 04:33:42 host2.example.com pengine: [16959]: info:
determine_online_status: Node vif5_7 is online
Apr 20 04:33:42 host2.example.com pengine: [16959]: notice: group_print:
 Resource Group: vifGroup
Apr 20 04:33:42 host2.example.com pengine: [16959]: notice: native_print:
   clusterIP       (ocf::heartbeat:IPaddr2):       Started vif5_7
Apr 20 04:33:42 host2.example.com pengine: [16959]: notice: native_print:
   routing-jboss   (lsb:routing-jboss):    Started vif5_7
Apr 20 04:33:42 host2.example.com pengine: [16959]: notice: LogActions:
Leave resource clusterIP       (Started vif5_7)
Apr 20 04:33:42 host2.example.com pengine: [16959]: notice: LogActions:
Leave resource routing-jboss   (Started vif5_7)

So the resources were returned normally to the vif5_7 node.

But 11 hours later, the same thing happens again: each node sees the other
disappear, while both are really up and running:
Apr 20 15:33:42 host2.example.com pengine: [16959]: info:
process_pe_message: Transition 782: PEngine Input stored in:
/var/lib/pengine/pe-input-88684.bz2
Apr 20 15:38:40 corosync [TOTEM ] A processor failed, forming new
configuration.
Apr 20 15:38:42 corosync [pcmk  ] notice: pcmk_peer_update: Transitional
membership event on ring 385872: memb=1, new=0, lost=1
Apr 20 15:38:42 corosync [pcmk  ] info: pcmk_peer_update: memb:
host2.example.com 122161674
Apr 20 15:38:42 corosync [pcmk  ] info: pcmk_peer_update: lost: vif5_7
38275594
Apr 20 15:38:42 corosync [pcmk  ] notice: pcmk_peer_update: Stable
membership event on ring 385872: memb=1, new=0, lost=0
Apr 20 15:38:42 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
host2.example.com 122161674
Apr 20 15:38:42 corosync [pcmk  ] info: ais_mark_unseen_peer_dead: Node
vif5_7 was not seen in the previous transition

And so on:
Apr 20 04:18:39 corosync [TOTEM ] A processor failed, forming new
configuration.
Apr 20 04:18:40 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Apr 20 04:18:40 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Apr 20 15:38:40 corosync [TOTEM ] A processor failed, forming new
configuration.
Apr 20 15:38:42 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Apr 20 15:38:42 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Apr 20 15:57:35 corosync [TOTEM ] A processor failed, forming new
configuration.
Apr 20 15:57:36 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Apr 20 15:57:38 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Apr 20 16:20:06 corosync [TOTEM ] A processor failed, forming new
configuration.
Apr 20 16:20:07 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.
Apr 20 16:20:11 corosync [TOTEM ] A processor joined or left the membership
and a new membership was formed.


So my question is: is there a timer I can raise to give each node more time
to see the other BEFORE TOTEM fires "A processor failed, forming new
configuration"?
Once again, the two Linux servers (vif5_7 and host2.example.com) are two VMs
on the same VMware ESX server. Maybe the network is not behaving the way
corosync expects?
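
In case it helps the discussion, this is the kind of change I have in mind,
if raising a timer is indeed the right approach: the totem section of
/etc/corosync/corosync.conf on both nodes, with a larger token timeout. The
values below are only guesses to be tuned, and the rest of our existing
totem/interface settings would stay unchanged:

totem {
        version: 2
        # token: how long (in ms) corosync waits for the totem token before it
        # declares "A processor failed, forming new configuration".
        # The corosync 1.x default is 1000 ms; raising it gives the nodes more
        # time to see each other over a slow or busy (virtual) network.
        token: 5000
        # number of token retransmissions attempted before the loss is declared
        token_retransmits_before_loss_const: 10
        # consensus must be larger than token (commonly about 1.2 x token)
        consensus: 6000
        # ... existing interface / secauth settings unchanged ...
}

As far as I understand, corosync has to be restarted on both nodes for such a
change to take effect.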

Sorry for my English. I've tried to include all the logs that might help you
understand the problem...
Best regards





>
> > IMHO, it's just the consequence, while the root cause is that the two nodes
> sometimes don't see each other.
> >
> > But I've listened to your suggestion and I've changed the return code of
> the agent when we ask it to stop: now it returns 0 even if it can't
> remove an iptables nat rule.
> >
> > Do you think this message can help? The two VMs are on VMware, which
> sometimes gives strange times: " current "epoch" is greater than required"
> >
> > Best regards,
> >
> >
> > 2015-04-13 5:00 GMT+02:00 Andrew Beekhof <andrew at beekhof.net>:
> >
> > > On 10 Apr 2015, at 11:37 pm, Philippe Carbonnier <
> Philippe.Carbonnier at vif.fr> wrote:
> > >
> > > Hello,
> > >
> > > The context :
> > >   Red Hat Enterprise Linux Server release 5.7
> > >   corosynclib-1.2.7-1.1.el5.x86_64
> > >   corosync-1.2.7-1.1.el5.x86_64
> > >   pacemaker-1.0.10-1.4.el5.x86_64
> > >   pacemaker-libs-1.0.10-1.4.el5.x86_64
> > >   2 nodes, both on same ESX server
> > >
> > > I get lots of "processor joined or left the membership" messages but can't
> understand why, because the two hosts are up and running, and when the
> cluster tries to start the resources it can't, because they are
> already up on the first node.
> > > We can see "Another DC detected", so the communication between the two VMs
> is OK.
> > >
> > > I've tried to raise the totem parameters, without success.
> >
> > > Apr 10 13:34:55 host2.example.com pengine: [26529]: WARN:
> unpack_rsc_op: Processing failed op routing-jboss_stop_0 on
> host2.example.com: invalid parameter (2)
> >
> > ^^^ Failed stops lead to fencing.
> >
> > The agent and/or your config need fixing.


