[Pacemaker] Need help!!! resources fail-over not taking place properly...

Dejan Muhamedagic dejanmm at fastmail.fm
Thu Feb 18 08:20:03 EST 2010


Hi,

On Thu, Feb 18, 2010 at 05:09:09PM +0530, Jayakrishnan wrote:
> Sir,
> 
> I have set up a two-node cluster on Ubuntu 9.10. I have added a cluster IP
> using ocf:heartbeat:IPaddr2, cloned the LSB script "postgresql-8.4", and also
> added a manually created script for Slony database replication.
> 
> Everything works fine, but I am not able to use the OCF resource scripts:
> fail-over is not taking place, or the resource is not even taken over by the
> other node. My ha.cf file and CIB configuration are included below.
> 
> My ha.cf file
> 
> autojoin none
> keepalive 2
> deadtime 15
> warntime 5
> initdead 64
> udpport 694
> bcast eth0
> auto_failback off
> node node1
> node node2
> crm respawn
> use_logd yes
> 
> 
> My cib.xml configuration in CLI (crm shell) format:
> 
> node $id="3952b93e-786c-47d4-8c2f-a882e3d3d105" node2 \
>     attributes standby="off"
> node $id="ac87f697-5b44-4720-a8af-12a6f2295930" node1 \
>     attributes standby="off"
> primitive pgsql lsb:postgresql-8.4 \
>     meta target-role="Started" resource-stickiness="inherited" \
>     op monitor interval="15s" timeout="25s" on-fail="standby"
> primitive slony-fail lsb:slony_failover \
>     meta target-role="Started"
> primitive vir-ip ocf:heartbeat:IPaddr2 \
>     params ip="192.168.10.10" nic="eth0" cidr_netmask="24" broadcast="192.168.10.255" \
>     op monitor interval="15s" timeout="25s" on-fail="standby" \
>     meta target-role="Started"
> clone pgclone pgsql \
>     meta notify="true" globally-unique="false" interleave="true" target-role="Started"
> colocation ip-with-slony inf: slony-fail vir-ip
> order slony-b4-ip inf: vir-ip slony-fail
> property $id="cib-bootstrap-options" \
>     dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
>     cluster-infrastructure="Heartbeat" \
>     no-quorum-policy="ignore" \
>     stonith-enabled="false" \
>     last-lrm-refresh="1266488780"
> rsc_defaults $id="rsc-options" \
>     resource-stickiness="INFINITY"
> 
> 
> 
> I am assigning the cluster IP (192.168.10.10) on eth0; the nodes' own eth0
> addresses are 192.168.10.129 on one machine and 192.168.10.130 on the other.
> 
> When I pull out the eth0 interface cable, fail-over is not taking place.

That's a split brain, which is more than a resource failure. Without
stonith, you'll end up with both nodes running all resources.
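
For example, if your nodes have IPMI management boards, stonith could be
configured roughly along these lines with the crm shell (external/ipmi
plugin; the addresses and credentials below are placeholders, not taken
from your setup):

    primitive st-node1 stonith:external/ipmi \
        params hostname="node1" ipaddr="192.168.10.201" userid="admin" passwd="secret" \
        op monitor interval="60s"
    primitive st-node2 stonith:external/ipmi \
        params hostname="node2" ipaddr="192.168.10.202" userid="admin" passwd="secret" \
        op monitor interval="60s"
    location st-node1-not-on-node1 st-node1 -inf: node1
    location st-node2-not-on-node2 st-node2 -inf: node2
    property stonith-enabled="true"

The location constraints just keep each stonith resource off the node it
is supposed to fence.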

> This is the log message I am getting when I pull out the cable:
> 
> "Feb 18 16:55:58 node2 NetworkManager: <info>  (eth0): carrier now OFF
> (device state 1)"
> 
> and after a minute or two
> 
> log snippet:
> -------------------------------------------------------------------
> Feb 18 16:57:37 node2 cib: [21940]: info: cib_stats: Processed 3 operations
> (13333.00us average, 0% utilization) in the last 10min
> Feb 18 17:02:53 node2 crmd: [21944]: info: crm_timer_popped: PEngine Recheck
> Timer (I_PE_CALC) just popped!
> Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
> origin=crm_timer_popped ]
> Feb 18 17:02:53 node2 crmd: [21944]: WARN: do_state_transition: Progressed
> to state S_POLICY_ENGINE after C_TIMER_POPPED
> Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: All 2
> cluster nodes are eligible to run resources.
> Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke: Query 111:
> Requesting the current CIB: S_POLICY_ENGINE
> Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke_callback: Invoking
> the PE: ref=pe_calc-dc-1266492773-121, seq=2, quorate=1
> Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_config: Node scores:
> 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node
> node2 is online
> Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op:
> slony-fail_monitor_0 on node2 returned 0 (ok) instead of the expected value:
> 7 (not running)
> Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation
> slony-fail_monitor_0 found resource slony-fail active on node2
> Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op:
> pgsql:0_monitor_0 on node2 returned 0 (ok) instead of the expected value: 7
> (not running)
> Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation
> pgsql:0_monitor_0 found resource pgsql:0 active on node2
> Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status: Node
> node1 is online
> Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print:
> vir-ip#011(ocf::heartbeat:IPaddr2):#011Started node2
> Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print:
> slony-fail#011(lsb:slony_failover):#011Started node2
> Feb 18 17:02:53 node2 pengine: [21982]: notice: clone_print: Clone Set:
> pgclone
> Feb 18 17:02:53 node2 pengine: [21982]: notice: print_list: #011Started: [
> node2 node1 ]
> Feb 18 17:02:53 node2 pengine: [21982]: notice: RecurringOp:  Start
> recurring monitor (15s) for pgsql:1 on node1
> Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource
> vir-ip#011(Started node2)
> Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource
> slony-fail#011(Started node2)
> Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource
> pgsql:0#011(Started node2)
> Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave resource
> pgsql:1#011(Started node1)
> Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=handle_response ]
> Feb 18 17:02:53 node2 crmd: [21944]: info: unpack_graph: Unpacked transition
> 26: 1 actions in 1 synapses
> Feb 18 17:02:53 node2 crmd: [21944]: info: do_te_invoke: Processing graph 26
> (ref=pe_calc-dc-1266492773-121) derived from
> /var/lib/pengine/pe-input-125.bz2
> Feb 18 17:02:53 node2 crmd: [21944]: info: te_rsc_command: Initiating action
> 15: monitor pgsql:1_monitor_15000 on node1
> Feb 18 17:02:53 node2 pengine: [21982]: ERROR: write_last_sequence: Cannout
> open series file /var/lib/pengine/pe-input.last for writing

This is probably a permission problem. /var/lib/pengine should be
owned by the cluster user and group, hacluster:haclient.
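
Something like this should show the current ownership and fix it if needed
(assuming the stock Debian/Ubuntu user and group names):

    ls -ld /var/lib/pengine
    chown -R hacluster:haclient /var/lib/pengine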

> Feb 18 17:02:53 node2 pengine: [21982]: info: process_pe_message: Transition
> 26: PEngine Input stored in: /var/lib/pengine/pe-input-125.bz2
> Feb 18 17:02:55 node2 crmd: [21944]: info: match_graph_event: Action
> pgsql:1_monitor_15000 (15) confirmed on node1 (rc=0)
> Feb 18 17:02:55 node2 crmd: [21944]: info: run_graph:
> ====================================================
> Feb 18 17:02:55 node2 crmd: [21944]: notice: run_graph: Transition 26
> (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pengine/pe-input-125.bz2): Complete
> Feb 18 17:02:55 node2 crmd: [21944]: info: te_graph_trigger: Transition 26
> is now complete
> Feb 18 17:02:55 node2 crmd: [21944]: info: notify_crmd: Transition 26
> status: done - <null>
> Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: Starting
> PEngine Recheck Timer
> ------------------------------------------------------------------------------

I don't see anything in the logs about the IP address resource.
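
If in doubt, something like this would show where the cluster thinks it is
running (the log path is a guess for a default Ubuntu syslog setup):

    crm_mon -1
    crm_resource --locate -r vir-ip
    grep -i IPaddr2 /var/log/syslog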

> Also, I am not able to use the pgsql OCF script and hence I am using the init

Why is that? Something wrong with the pgsql OCF RA? If so, it should be
fixed. It's always much better to use the OCF RA than the LSB one.
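
For reference, a cloned ocf:heartbeat:pgsql primitive could look roughly
like this (the paths below are guesses for a stock Debian/Ubuntu
PostgreSQL 8.4 install and will likely need adjusting):

    primitive pgsql ocf:heartbeat:pgsql \
        params pgctl="/usr/lib/postgresql/8.4/bin/pg_ctl" \
               pgdata="/var/lib/postgresql/8.4/main" \
               config="/etc/postgresql/8.4/main/postgresql.conf" \
        op monitor interval="15s" timeout="25s"
    clone pgclone pgsql \
        meta globally-unique="false" interleave="true"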

Thanks,

Dejan

> script and cloned it, as I need to run it on both nodes for Slony database
> replication.
> 
> I am using the heartbeat and pacemaker debs from the updated Ubuntu Karmic
> repository (Heartbeat 2.99).
> 
> Please check my configuration and tell me what I am missing.
> -- 
> Regards,
> 
> Jayakrishnan. L
> 
> Visit: www.jayakrishnan.bravehost.com




> _______________________________________________
> Pacemaker mailing list
> Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker




