<br>Sir,<br><br>You got the point. After two weeks of trying, I thought there were mistakes in my configuration. If I enable stonith and manage to shut down the failed machine, most of my problems will be solved. Now I feel much more confident.<br>
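<br>A minimal sketch of the stonith setup I plan to try. It assumes the nodes have IPMI boards and that the external/ipmi stonith plugin is installed, neither of which I have verified. The helper name stonith_config, the IPMI address, and the user/password are placeholders of mine, not values from my cluster. The script only prints crm shell lines and changes nothing by itself.<br>

```shell
#!/bin/sh
# Sketch only: prints crm configuration lines for one IPMI fencing device.
# Assumptions: IPMI hardware, the external/ipmi plugin, and placeholder
# address and credentials (replace "admin"/"secret" with real values).
stonith_config() {
    node="$1"        # node to be fenced
    ipmi_addr="$2"   # address of that node's IPMI board (hypothetical)
    cat <<EOF
primitive st-$node stonith:external/ipmi params hostname="$node" ipaddr="$ipmi_addr" userid="admin" passwd="secret" interface="lan"
location l-st-$node st-$node -inf: $node
property stonith-enabled="true"
EOF
}

# Example: fencing device for node2; the address is made up.
stonith_config node2 192.168.10.230
```

<br>The location line keeps the fencing resource off the node it is supposed to shut down. The printed lines could then be reviewed and entered in "crm configure".<br>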


<br>But sir, I still need to clear the resource failures for my slony_failover script, because when the slony failover takes place it prints this warning message:<br><br><span style="background-color: rgb(255, 255, 51);">Feb 16 14:50:01 node1 lrmd: [2477]: info: RA output: (slony-fail:start:stderr) <stdin>:4: NOTICE:  failedNode: set 1 has no other direct receivers - move now</span><br>


<br>It goes to stderr or stdout, and heartbeat-pacemaker treats these warning messages as resource failures. So if I add another script for a second database failover, I am afraid the first script may block the execution of the second. For now I have only one database replication for testing, and the slony-failover script runs last during failovers.<br>
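<br>A minimal sketch of how I could keep such notices out of the RA output: run the failover command through a small helper that appends stdout and stderr to a log file and hands back the exit status unchanged. As far as I understand, the cluster judges an LSB script by its exit code, so this may be enough. The helper name run_quiet and the log path are placeholders of mine, and the "sh -c" line only stands in for the real slonik call.<br>

```shell
#!/bin/sh
# Sketch only: run_quiet and the log path are placeholder names of mine.
LOGFILE="${LOGFILE:-/tmp/slony_failover.log}"

# Run a command, append its stdout and stderr to $LOGFILE,
# and return the command's own exit status.
run_quiet() {
    "$@" >>"$LOGFILE" 2>&1
}

# Stand-in for the real slonik call: prints a NOTICE on stderr, exits 0.
run_quiet sh -c 'echo "NOTICE: failedNode: move now" >&2; exit 0'
echo "exit status: $?"
```

<br>Inside the init script, the start action would call run_quiet with the real failover command and return its status.<br>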


<br><br>And I still can't believe that I am chatting with the person who made the "crm-cli".<br><br><br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Dejan Muhamedagic</b> <span dir="ltr"><<a href="mailto:dejanmm@fastmail.fm">dejanmm@fastmail.fm</a>></span><br>


Date: Thu, Feb 18, 2010 at 9:02 PM<br>Subject: Re: [Pacemaker] Need help!!! resources fail-over not taking place  properly...<br>To: <a href="mailto:pacemaker@oss.clusterlabs.org">pacemaker@oss.clusterlabs.org</a><br><br>

<br>

Hi,<br>

<div class="im"><br>

On Thu, Feb 18, 2010 at 08:22:26PM +0530, Jayakrishnan wrote:<br>

> Hello Dejan,<br>

><br>

> First of all, thank you very much for your reply. I found that one of my nodes<br>

> has the permission problem. There the ownership of /var/lib/pengine<br>

> was set to "999:999", I am not sure how! However, I changed it.<br>

><br>

> Sir, when I pull out the interface cable I get only this log message:<br>

><br>

> Feb 18 16:55:58 node2 NetworkManager: <info> (eth0): carrier now OFF (device<br>

> state 1)<br>

><br>

> And the resource IP is not moving anywhere at all. It is still there on the<br>

> same machine. I can see that the IP is still assigned to the eth0<br>

> interface via "# ip addr show", even though the interface status is 'down'.<br>

> Is this the split-brain? If so, how can I clear it?<br>

<br>

</div>With fencing (stonith). Please read some documentation available<br>

here: <a href="http://clusterlabs.org/wiki/Documentation" target="_blank">http://clusterlabs.org/wiki/Documentation</a><br>

<div class="im"><br>

> Because of the on-fail=standby in the pgsql part of my cib I am able to do a<br>

> failover to another node when I manually stop the postgres service on the<br>

> active machine. However, even after restarting the postgres service via<br>

> "/etc/init.d/postgresql-8.4 start" I have to run<br>

> crm resource cleanup <pgclone><br>

<br>

</div>Yes, that's necessary.<br>

<div class="im"><br>

> to make crm_mon or the cluster recognize that the service is on. Until then it is<br>

> shown as a failed action<br>

><br>

> crm_mon snippet<br>

> --------------------------------------------------------------------<br>

> Last updated: Thu Feb 18 20:17:28 2010<br>

> Stack: Heartbeat<br>

> Current DC: node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105) - partition with<br>

> quorum<br>

><br>

> Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56<br>

> 2 Nodes configured, unknown expected votes<br>

> 3 Resources configured.<br>

> ============<br>

><br>

> Node node2 (3952b93e-786c-47d4-8c2f-a882e3d3d105): standby (on-fail)<br>

> Online: [ node1 ]<br>

><br>

> vir-ip  (ocf::heartbeat:IPaddr2):       Started node1<br>

> slony-fail      (lsb:slony_failover):   Started node1<br>

> Clone Set: pgclone<br>

>         Started: [ node1 ]<br>

>         Stopped: [ pgsql:0 ]<br>

><br>

> Failed actions:<br>

>     pgsql:0_monitor_15000 (node=node2, call=33, rc=7, status=complete): not<br>

> running<br>

> --------------------------------------------------------------------------------<br>

><br>

> Is there any way to run crm resource cleanup <resource> periodically??<br>

<br>

</div>Why would you want to do that? Do you expect your resources to<br>

fail regularly?<br>

<div class="im"><br>

> I don't know if there is any mistake in the pgsql ocf script, sir. I have given<br>

> all parameters correctly but it is giving a "syntax error" all the<br>

> time when I use it.<br>

<br>

</div>Best to report such a case, it's either a configuration problem<br>

(did you read its metadata) or perhaps a bug in the RA.<br>

<br>

Thanks,<br>

<font color="#888888"><br>

Dejan<br>

</font><div><div></div><div class="h5"><br>

> I put the same meta attributes as for the current lsb<br>

> as shown below...<br>

><br>

> Please help me out... should I reinstall the nodes again??<br>

><br>

><br>

> On Thu, Feb 18, 2010 at 6:50 PM, Dejan Muhamedagic <<a href="mailto:dejanmm@fastmail.fm">dejanmm@fastmail.fm</a>>wrote:<br>

><br>

> > Hi,<br>

> ><br>

> > On Thu, Feb 18, 2010 at 05:09:09PM +0530, Jayakrishnan wrote:<br>

> > > sir,<br>

> > ><br>

> > > I have set up a two node cluster in Ubuntu 9.10. I have added a cluster-ip<br>

> > > using ocf:heartbeat:IPaddr2, cloned lsb script "postgresql-8.4" and also<br>

> > > added a manually created script for slony database replication.<br>

> > ><br>

> > > Now everything works fine but I am not able to use the ocf resource<br>

> > > scripts. I mean fail-over is not taking place, or else the resource is not<br>

> > > even started. My <a href="http://ha.cf" target="_blank">ha.cf</a> file and cib configuration are attached with this mail<br>

> > ><br>

> > > My <a href="http://ha.cf" target="_blank">ha.cf</a> file<br>

> > ><br>

> > > autojoin none<br>

> > > keepalive 2<br>

> > > deadtime 15<br>

> > > warntime 5<br>

> > > initdead 64<br>

> > > udpport 694<br>

> > > bcast eth0<br>

> > > auto_failback off<br>

> > > node node1<br>

> > > node node2<br>

> > > crm respawn<br>

> > > use_logd yes<br>

> > ><br>

> > ><br>

> > > My cib.xml configuration file in cli format:<br>

> > ><br>

> > > node $id="3952b93e-786c-47d4-8c2f-a882e3d3d105" node2 \<br>

> > >     attributes standby="off"<br>

> > > node $id="ac87f697-5b44-4720-a8af-12a6f2295930" node1 \<br>

> > >     attributes standby="off"<br>

> > > primitive pgsql lsb:postgresql-8.4 \<br>

> > >     meta target-role="Started" resource-stickness="inherited" \<br>

> > >     op monitor interval="15s" timeout="25s" on-fail="standby"<br>

> > > primitive slony-fail lsb:slony_failover \<br>

> > >     meta target-role="Started"<br>

> > > primitive vir-ip ocf:heartbeat:IPaddr2 \<br>

> > >     params ip="192.168.10.10" nic="eth0" cidr_netmask="24"<br>

> > > broadcast="192.168.10.255" \<br>

> > >     op monitor interval="15s" timeout="25s" on-fail="standby" \<br>

> > >     meta target-role="Started"<br>

> > > clone pgclone pgsql \<br>

> > >     meta notify="true" globally-unique="false" interleave="true"<br>

> > > target-role="Started"<br>

> > > colocation ip-with-slony inf: slony-fail vir-ip<br>

> > > order slony-b4-ip inf: vir-ip slony-fail<br>

> > > property $id="cib-bootstrap-options" \<br>

> > >     dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \<br>

> > >     cluster-infrastructure="Heartbeat" \<br>

> > >     no-quorum-policy="ignore" \<br>

> > >     stonith-enabled="false" \<br>

> > >     last-lrm-refresh="1266488780"<br>

> > > rsc_defaults $id="rsc-options" \<br>

> > >     resource-stickiness="INFINITY"<br>

> > ><br>

> > ><br>

> > ><br>

> > > I am assigning the cluster-ip (192.168.10.10) in eth0 with ip<br>

> > 192.168.10.129<br>

> > > in one machine and 192.168.10.130 in another machine.<br>

> > ><br>

> > > When I pull out the eth0 interface cable fail-over is not taking place.<br>

> ><br>

> > That's split brain. More than a resource failure. Without<br>

> > stonith, you'll have both nodes running all resources.<br>

> ><br>

> > > This is the log message i am getting while I pull out the cable:<br>

> > ><br>

> > > "Feb 18 16:55:58 node2 NetworkManager: <info>  (eth0): carrier now OFF<br>

> > > (device state 1)"<br>

> > ><br>

> > > and after a minute or two<br>

> > ><br>

> > > log snippet:<br>

> > > -------------------------------------------------------------------<br>

> > > Feb 18 16:57:37 node2 cib: [21940]: info: cib_stats: Processed 3<br>

> > operations<br>

> > > (13333.00us average, 0% utilization) in the last 10min<br>

> > > Feb 18 17:02:53 node2 crmd: [21944]: info: crm_timer_popped: PEngine<br>

> > Recheck<br>

> > > Timer (I_PE_CALC) just popped!<br>

> > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State<br>

> > > transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC<br>

> > cause=C_TIMER_POPPED<br>

> > > origin=crm_timer_popped ]<br>

> > > Feb 18 17:02:53 node2 crmd: [21944]: WARN: do_state_transition:<br>

> > Progressed<br>

> > > to state S_POLICY_ENGINE after C_TIMER_POPPED<br>

> > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: All 2<br>

> > > cluster nodes are eligible to run resources.<br>

> > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke: Query 111:<br>

> > > Requesting the current CIB: S_POLICY_ENGINE<br>

> > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_pe_invoke_callback:<br>

> > Invoking<br>

> > > the PE: ref=pe_calc-dc-1266492773-121, seq=2, quorate=1<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_config: On loss of<br>

> > > CCM Quorum: Ignore<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_config: Node scores:<br>

> > > 'red' = -INFINITY, 'yellow' = 0, 'green' = 0<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status:<br>

> > Node<br>

> > > node2 is online<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op:<br>

> > > slony-fail_monitor_0 on node2 returned 0 (ok) instead of the expected<br>

> > value:<br>

> > > 7 (not running)<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation<br>

> > > slony-fail_monitor_0 found resource slony-fail active on node2<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: info: unpack_rsc_op:<br>

> > > pgsql:0_monitor_0 on node2 returned 0 (ok) instead of the expected value:<br>

> > 7<br>

> > > (not running)<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: unpack_rsc_op: Operation<br>

> > > pgsql:0_monitor_0 found resource pgsql:0 active on node2<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: info: determine_online_status:<br>

> > Node<br>

> > > node1 is online<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print:<br>

> > > vir-ip#011(ocf::heartbeat:IPaddr2):#011Started node2<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: native_print:<br>

> > > slony-fail#011(lsb:slony_failover):#011Started node2<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: clone_print: Clone Set:<br>

> > > pgclone<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: print_list: #011Started:<br>

> > [<br>

> > > node2 node1 ]<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: RecurringOp:  Start<br>

> > > recurring monitor (15s) for pgsql:1 on node1<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave<br>

> > resource<br>

> > > vir-ip#011(Started node2)<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave<br>

> > resource<br>

> > > slony-fail#011(Started node2)<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave<br>

> > resource<br>

> > > pgsql:0#011(Started node2)<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: notice: LogActions: Leave<br>

> > resource<br>

> > > pgsql:1#011(Started node1)<br>

> > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_state_transition: State<br>

> > > transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS<br>

> > > cause=C_IPC_MESSAGE origin=handle_response ]<br>

> > > Feb 18 17:02:53 node2 crmd: [21944]: info: unpack_graph: Unpacked<br>

> > transition<br>

> > > 26: 1 actions in 1 synapses<br>

> > > Feb 18 17:02:53 node2 crmd: [21944]: info: do_te_invoke: Processing graph<br>

> > 26<br>

> > > (ref=pe_calc-dc-1266492773-121) derived from<br>

> > > /var/lib/pengine/pe-input-125.bz2<br>

> > > Feb 18 17:02:53 node2 crmd: [21944]: info: te_rsc_command: Initiating<br>

> > action<br>

> > > 15: monitor pgsql:1_monitor_15000 on node1<br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: ERROR: write_last_sequence:<br>

> > Cannout<br>

> > > open series file /var/lib/pengine/pe-input.last for writing<br>

> ><br>

> > This is probably a permission problem. /var/lib/pengine should be<br>

> > owned by hacluster:haclient.<br>

> ><br>

> > > Feb 18 17:02:53 node2 pengine: [21982]: info: process_pe_message:<br>

> > Transition<br>

> > > 26: PEngine Input stored in: /var/lib/pengine/pe-input-125.bz2<br>

> > > Feb 18 17:02:55 node2 crmd: [21944]: info: match_graph_event: Action<br>

> > > pgsql:1_monitor_15000 (15) confirmed on node1 (rc=0)<br>

> > > Feb 18 17:02:55 node2 crmd: [21944]: info: run_graph:<br>

> > > ====================================================<br>

> > > Feb 18 17:02:55 node2 crmd: [21944]: notice: run_graph: Transition 26<br>

> > > (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0,<br>

> > > Source=/var/lib/pengine/pe-input-125.bz2): Complete<br>

> > > Feb 18 17:02:55 node2 crmd: [21944]: info: te_graph_trigger: Transition<br>

> > 26<br>

> > > is now complete<br>

> > > Feb 18 17:02:55 node2 crmd: [21944]: info: notify_crmd: Transition 26<br>

> > > status: done - <null><br>

> > > Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: State<br>

> > > transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS<br>

> > > cause=C_FSA_INTERNAL origin=notify_crmd ]<br>

> > > Feb 18 17:02:55 node2 crmd: [21944]: info: do_state_transition: Starting<br>

> > > PEngine Recheck Timer<br>

> > ><br>

> > ------------------------------------------------------------------------------<br>

> ><br>

> > Don't see anything in the logs about the IP address resource.<br>

> ><br>

> > > Also I am not able to use the pgsql ocf script and hence I am using the<br>

> > init<br>

> ><br>

> > Why is that? Something wrong with pgsql? If so, then it should be<br>

> > fixed. It's always much better to use the OCF instead of LSB RA.<br>

> ><br>

> > Thanks,<br>

> ><br>

> > Dejan<br>

> ><br>

> > > script and cloned it as  I need to run it on both nodes for slony data<br>

> > base<br>

> > > replication.<br>

> > ><br>

> > > I am using the heartbeat and pacemaker debs from the updated ubuntu<br>

> > karmic<br>

> > > repo. (Heartbeat 2.99)<br>

> > ><br>

> > > Please check my configuration and tell me what I am missing.<br>

> > > --<br>

> > > Regards,<br>

> > ><br>

> > > Jayakrishnan. L<br>

> > ><br>

> > > Visit: <a href="http://www.jayakrishnan.bravehost.com" target="_blank">www.jayakrishnan.bravehost.com</a><br>

> ><br>

> ><br>

> ><br>

> ><br>

> > > _______________________________________________<br>

> > > Pacemaker mailing list<br>

> > > <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>

> > > <a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>

> ><br>

> ><br>


> ><br>

><br>

><br>

><br>

> --<br>

> Regards,<br>

><br>

> Jayakrishnan. L<br>

><br>

> Visit: <a href="http://www.jayakrishnan.bravehost.com" target="_blank">www.jayakrishnan.bravehost.com</a><br>

<br>


<br>

<br>


</div></div></div><br><br clear="all"><br>-- <br>Regards,<br><br>Jayakrishnan. L<br><br>Visit: <a href="http://www.jayakrishnan.bravehost.com">www.jayakrishnan.bravehost.com</a><br><br>