[Pacemaker] stonith pacemaker problem

Vadym Chepkov vchepkov at gmail.com
Mon Oct 11 14:58:45 EDT 2010


On Oct 11, 2010, at 2:14 AM, Andrew Beekhof wrote:

> On Sun, Oct 10, 2010 at 11:20 PM, Shravan Mishra
> <shravan.mishra at gmail.com> wrote:
>> Andrew,
>> 
>> We were able to solve our problem. Obviously if no one else is having
>> it then it has to be our environment. It's just that time pressure and
>> mgmt pressure were causing us to go really bonkers.
>> 
>> We had been struggling with this for the past 4 days.
>> So here is the story:
>> 
>> We had the following versions of the HA libs on our appliance:
>> 
>> heartbeat=3.0.0
>> openais=1.0.0
>> pacemaker=1.0.9
>> 
>> When I started installing glue=1.0.3 on top of it I started getting
>> a bunch of conflicts, so I basically
>> uninstalled heartbeat and openais and proceeded to install the
>> following in the given order:
>> 
>> 1. glue=1.0.3
>> 2. corosync=1.1.1
>> 3. pacemaker=1.0.9
>> 4. agents=1.0.3
>> 
>> 
>> 
>> And that's when we started seeing this problem.
>> So after 2 days of going nowhere with this we said let's leave the
>> packages as they are and try to install using the --replace-files option.
>> 
>> We are using a build tool called conary, which has this option, and not
>> the standard make/make install.
>> 
>> So we let the above heartbeat and openais remain as they were and installed
>> glue, corosync and pacemaker on top of them with the --replace-files
>> option, this time with no conflicts, and bingo, it all works fine.
>> 
>> So that sort of confused me as to why we still need heartbeat, given
>> the above 4 packages.
> 
> strictly speaking you don't.
> but at least on fedora, the policy is that $x-libs always requires $x
> so just building against heartbeat-libs means that yum will suck in
> the main heartbeat package :-(

I don't think that's the case for properly designed RPMs. On this box, for
example, postgresql-libs is installed without the main postgresql package:

[root@fedora ~]# cat /etc/fedora-release
Fedora release 13 (Goddard)

[root@fedora ~]# rpm -qa | grep postgres
postgresql-libs-8.4.4-1.fc13.i686

The heartbeat dependency is, for some reason, built into the spec file:

%package libs
Summary:          Heartbeat libraries
Group:            System Environment/Daemons
Requires:         heartbeat = %{version}-%{release}

And I don't think it should be.
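
To look for the same pattern in other spec files, something like the
following Python sketch might help. It is only an illustration: the function
name and the list of section directives are my own assumptions, and it
handles only comma-separated Requires lines, not everything rpm accepts.

```python
import re

# Directives assumed to end a %package preamble (not an exhaustive list).
SECTION_ENDS = ("%description", "%prep", "%build", "%install", "%check",
                "%files", "%changelog", "%pre", "%post")


def subpackages_requiring_main(spec_text, main_name):
    """Return subpackages whose preamble lists main_name under Requires."""
    hits = []
    current = None  # subpackage whose preamble we are inside, if any
    for raw in spec_text.splitlines():
        line = raw.strip()
        if line.startswith("%package"):
            args = line.split()[1:]
            # '%package libs' means <main>-libs; '%package -n foo' names it outright.
            if len(args) >= 2 and args[0] == "-n":
                current = args[1]
            elif args:
                current = f"{main_name}-{args[0]}"
            else:
                current = None
        elif line.startswith(SECTION_ENDS):
            current = None
        elif current and re.match(r"Requires(\([^)]*\))?:", line):
            deps = line.split(":", 1)[1]
            for entry in deps.split(","):
                words = entry.split()
                # The first word of each entry is the required package name.
                if words and words[0] == main_name and current not in hits:
                    hits.append(current)
    return hits


# The fragment quoted above, copied as-is.
SPEC = """\
%package libs
Summary:          Heartbeat libraries
Group:            System Environment/Daemons
Requires:         heartbeat = %{version}-%{release}
"""

print(subpackages_requiring_main(SPEC, "heartbeat"))
```

For the fragment above this reports heartbeat-libs, which is the subpackage
that pulls in the main heartbeat package.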

Vadym


> 
> glad you found a path forward though
> 
>> I understand that /usr/lib/ocf/resource.d/heartbeat has ocf scripts
>> provided by heartbeat but that can be part of the "Reusable cluster
>> agents" subsystem.
>> 
>> Frankly, I thought that installing the system the way I had, by erasing
>> the old packages and installing fresh ones, should have worked.
>> 
>> But all said and done I learned a lot of cluster code by gdbing it.
>> I'll be having a peaceful Thanksgiving.
>> 
>> Thanks and happy Thanksgiving.
>> Shravan
>> 
>> On Sun, Oct 10, 2010 at 2:46 PM, Andrew Beekhof <andrew at beekhof.net> wrote:
>>> Not enough information.
>>> We'd need more than just the lrmd's logs, they only show what happened not why.
>>> 
>>> On Thu, Oct 7, 2010 at 11:02 PM, Shravan Mishra
>>> <shravan.mishra at gmail.com> wrote:
>>>> Hi,
>>>> 
>>>> Description of my environment:
>>>>   corosync=1.2.8
>>>>   pacemaker=1.1.3
>>>>   Linux= 2.6.29.6-0.6.smp.gcc4.1.x86_64 #1 SMP
>>>> 
>>>> 
>>>> We are having a problem with our pacemaker which is continuously
>>>> canceling the monitoring operation of our stonith devices.
>>>> 
>>>> We ran:
>>>> 
>>>> stonith -d -t external/safe/ipmi hostname=ha2.itactics.com
>>>> ipaddr=192.168.2.7 userid=hellouser passwd=hello interface=lanplus -S
>>>> 
>>>> Its output is attached as stonith.output.
>>>> 
>>>> We have been trying to debug this issue for a few days now with no success.
>>>> We are hoping that someone can help us, as we are under immense
>>>> pressure to move to RCS unless we can solve this issue in a day or two,
>>>> which I personally don't want to do because we like the product.
>>>> 
>>>> Any help will be greatly appreciated.
>>>> 
>>>> 
>>>> Here is an excerpt from the /var/log/messages:
>>>> =========================
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>>>> rsc:ha2.itactics.com-stonith:11155: start
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>>>> rsc:ha2.itactics.com-stonith:11156: monitor
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: cancel_op: operation
>>>> monitor[11156] on
>>>> stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584,
>>>> its parameters: CRM_meta_interval=[20000] target_role=[started]
>>>> ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000]
>>>> crm_feature_set=[3.0.2] CRM_meta_name=[monitor]
>>>> hostname=[ha2.itactics.com] passwd=[Ft01ST0pMF@]
>>>> userid=[safe_ipmi_admin]  cancelled
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11157: stop
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>>>> rsc:ha2.itactics.com-stonith:11158: start
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>>>> rsc:ha2.itactics.com-stonith:11159: monitor
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: cancel_op: operation
>>>> monitor[11159] on
>>>> stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584,
>>>> its parameters: CRM_meta_interval=[20000] target_role=[started]
>>>> ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000]
>>>> crm_feature_set=[3.0.2] CRM_meta_name=[monitor]
>>>> hostname=[ha2.itactics.com] passwd=[Ft01ST0pMF@]
>>>> userid=[safe_ipmi_admin]  cancelled
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11160: stop
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>>>> rsc:ha2.itactics.com-stonith:11161: start
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>>>> rsc:ha2.itactics.com-stonith:11162: monitor
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: cancel_op: operation
>>>> monitor[11162] on
>>>> stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584,
>>>> its parameters: CRM_meta_interval=[20000] target_role=[started]
>>>> ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000]
>>>> crm_feature_set=[3.0.2] CRM_meta_name=[monitor]
>>>> hostname=[ha2.itactics.com] passwd=[Ft01ST0pMF@]
>>>> userid=[safe_ipmi_admin]  cancelled
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11163: stop
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>>>> rsc:ha2.itactics.com-stonith:11164: start
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>>>> rsc:ha2.itactics.com-stonith:11165: monitor
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: cancel_op: operation
>>>> monitor[11165] on
>>>> stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584,
>>>> its parameters: CRM_meta_interval=[20000] target_role=[started]
>>>> ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000]
>>>> crm_feature_set=[3.0.2] CRM_meta_name=[monitor]
>>>> hostname=[ha2.itactics.com] passwd=[Ft01ST0pMF@]
>>>> userid=[safe_ipmi_admin]  cancelled
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11166: stop
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>>>> rsc:ha2.itactics.com-stonith:11167: start
>>>> Oct  7 16:58:29 ha1 lrmd: [3581]: info:
>>>> rsc:ha2.itactics.com-stonith:11168: monitor
>>>> Oct  7 16:58:30 ha1 lrmd: [3581]: info: cancel_op: operation
>>>> monitor[11168] on
>>>> stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584,
>>>> its parameters: CRM_meta_interval=[20000] target_role=[started]
>>>> ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000]
>>>> crm_feature_set=[3.0.2] CRM_meta_name=[monitor]
>>>> hostname=[ha2.itactics.com] passwd=[Ft01ST0pMF@]
>>>> userid=[safe_ipmi_admin]  cancelled
>>>> Oct  7 16:58:30 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11169: stop
>>>> Oct  7 16:58:30 ha1 lrmd: [3581]: info:
>>>> rsc:ha2.itactics.com-stonith:11170: start
>>>> Oct  7 16:58:30 ha1 lrmd: [3581]: info: stonithRA plugin: got
>>>> metadata: <?xml version="1.0"?> <!DOCTYPE resource-agent SYSTEM
>>>> "ra-api-1.dtd"> <resource-agent name="external/safe/ipmi">
>>>> <version>1.0</version>   <longdesc lang="en"> ipmitool based power
>>>> management. Apparently, the power off method of ipmitool is
>>>> intercepted by ACPI which then makes a regular shutdown. If case of a
>>>> split brain on a two-node it may happen that no node survives. For
>>>> two-node clusters use only the reset method.    </longdesc>
>>>> <shortdesc lang="en">IPMI STONITH external device </shortdesc>
>>>> <parameters> <parameter name="hostname" unique="1"> <content
>>>> type="string" /> <shortdesc lang="en"> Hostname </shortdesc> <longdesc
>>>> lang="en"> The name of the host to be managed by this STONITH device.
>>>> </longdesc> </parameter>  <parameter name="ipaddr" unique="1">
>>>> <content type="string" /> <shortdesc lang="en"> IP Address
>>>> </shortdesc> <longdesc lang="en"> The IP address of the STONITH
>>>> device. </longdesc> </parameter>  <parameter name="userid" unique="1">
>>>> <content type="string" /> <shortdesc lang="en"> Login </shortdesc>
>>>> <longdesc lang="en"> The username used for logging in to the STONITH
>>>> device. </longdesc> </parameter>  <parameter name="passwd" unique="1">
>>>> <content type="string" /> <shortdesc lang="en"> Password </shortdesc>
>>>> <longdesc lang="en"> The password used for logging in to the STONITH
>>>> device. </longdesc> </parameter>  <parameter name="interface"
>>>> unique="1"> <content type="string" default="lan"/> <shortdesc
>>>> lang="en"> IPMI interface </shortdesc> <longdesc lang="en"> IPMI
>>>> interface to use, such as "lan" or "lanplus". </longdesc> </parameter>
>>>>  </parameters>    <actions>     <action name="start"   timeout="15" />
>>>>    <action name="stop"    timeout="15" />     <action name="status"
>>>> timeout="15" />     <action name="monitor" timeout="15" interval="15"
>>>> start-delay="15" />     <action name="meta-data"  timeout="15" />
>>>> </actions>   <special tag="heartbeat">     <version>2.0</version>
>>>> </special> </resource-agent>
>>>> Oct  7 16:58:30 ha1 lrmd: [3581]: info:
>>>> rsc:ha2.itactics.com-stonith:11171: monitor
>>>> Oct  7 16:58:30 ha1 lrmd: [3581]: info: cancel_op: operation
>>>> monitor[11171] on
>>>> stonith::external/safe/ipmi::ha2.itactics.com-stonith for client 3584,
>>>> its parameters: CRM_meta_interval=[20000] target_role=[started]
>>>> ipaddr=[192.168.2.7] interface=[lanplus] CRM_meta_timeout=[180000]
>>>> crm_feature_set=[3.0.2] CRM_meta_name=[monitor]
>>>> hostname=[ha2.itactics.com] passwd=[Ft01ST0pMF@]
>>>> userid=[safe_ipmi_admin]  cancelled
>>>> Oct  7 16:58:30 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11172: stop
>>>> Oct  7 16:58:30 ha1 lrmd: [3581]: info:
>>>> rsc:ha2.itactics.com-stonith:11173: start
>>>> Oct  7 16:58:30 ha1 lrmd: [3581]: info:
>>>> rsc:ha2.itactics.com-stonith:11174: monitor
>>>> 
>>>> ==========================
>>>> 
>>>> Thanks
>>>> 
>>>> Shravan
>>>> 
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>> 
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker




