[Pacemaker] Problem in Stonith configuration

Andreas Kurz andreas at hastexo.com
Fri Oct 28 17:49:50 EDT 2011


Hello,

On 10/28/2011 01:21 PM, neha chatrath wrote:
> Hello,
> 
> 1. How about using the integrated iLO device for fencing? I am using an
> HP ProLiant DL360 G7 server, which supports iLO3.
>    - Can the RILOE stonith plugin be used for this?

Yes, that works fine, e.g. with the external/ipmi stonith module.
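
As an untested sketch, an external/ipmi fencing resource for one node
could look roughly like this in the crm shell (host name, iLO address and
credentials are placeholders; the parameter names are the ones I remember
from the external/ipmi meta-data, so verify them with
"stonith -t external/ipmi -n" first):

  primitive st-mcg2 stonith:external/ipmi \
          params hostname="mcg2" ipaddr="192.168.1.202" \
                 userid="iloadmin" passwd="ilopassword" interface="lanplus" \
          op monitor interval="60s"
  # a node should never run the device that is supposed to fence it
  location l-st-mcg2 st-mcg2 -inf: mcg2

One such resource per node, each with its matching location constraint,
is the usual pattern in a two-node cluster.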

> 
> 2. Can the meatware Stonith plugin be used for production software?

yes
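
Keep in mind that meatware only completes a fencing operation after a
human operator has confirmed that the failed node is really powered off.
A minimal, untested sketch (hostlist is the parameter name I remember
from the meatware meta-data, so check it with "stonith -t meatware -n"):

  primitive st-meat stonith:meatware \
          params hostlist="mcg1 mcg2" \
          op monitor interval="3600s"

When fencing is needed, the logs ask the operator to power-cycle the node
and then acknowledge it, typically with something like "meatclient -c mcg2".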

> 
> 3. One more issue which I am facing is that when I run
>           the "crm ra list stonith" command, there is no output, although
> different RAs under the heartbeat class are visible.

I have never seen this behavior when all the required packages were installed ...

>           - Also, the stonith class is visible in the output of the
> "crm ra classes" command.
>           - All the default stonith RAs like meatware, suicide,
> ibmrsa, ipmi etc. are present in the /usr/lib/stonith/plugins directory.
>           - Due to this I am not able to configure stonith in my system.

If the stonith agents show up when you use the "stonith" command-line
tool, I would expect this to work. You can also use that tool to query
an agent's meta-data if tab completion in "crm configure" mode is not
enough.
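
For example, something along these lines should work on a node with
cluster-glue installed (flags from memory, see the stonith(8) man page):

  # list all stonith plugin types cluster-glue knows about
  stonith -L

  # list the parameters a particular plugin expects
  stonith -t external/ipmi -n

  # detailed description / meta-data of a plugin
  stonith -t external/ipmi -h

If "stonith -L" shows the plugins but "crm ra list stonith" stays empty,
that points at the crm shell / cluster-glue installation rather than at
your configuration.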

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


>  
> Thanks and regards
> Neha Chatrath
> 
> On Tue, Oct 18, 2011 at 2:51 PM, neha chatrath <nehachatrath at gmail.com> wrote:
> 
>     Hello,
> 
>     > 1. If a resource fails, node should reboot (through fencing mechanism)
>     > and resources should re-start on the node.
> 
>     Why would you want that? This would increase the service downtime
>     considerably. Why is a local restart not possible ... and even if there
>     is a good reason for a reboot, why not start the resource on the
>     other node?
>     -In our system, there are some primitive, clone resources along with
>     3 different master-slave resources.
>     -All the masters and slaves of these resources are co-located i.e.
>     all the 3 masters are co-located on a node and 3 slaves on the other
>     node.
>     -These 3 master-slave resources are tightly coupled. There is a
>     requirement that the failure of even one of these resources
>     restarts all the resources in the group.
>     -All these resources can be shifted to the other node but
>     subsequently these should also be restarted as a lot of data/control
>     plane synching is being done between the two nodes.
>     e.g. If one of the resources running on node1 as a Master fails,
>     then all these 3 resources are shifted to the other node i.e. node2 
>     (with corresponding slave resources being promoted as master). On 
>     node1, these resources should get re-started as slaves.
> 
>     We understand that a node restart will increase the downtime, but since
>     we could not find much on the option of a group restart of
>     master-slave resources, we are trying the node restart option.
> 
> 
>     Thanks and regards
>     Neha Chatrath
> 
>     ---------- Forwarded message ----------
>     From: *Andreas Kurz* <andreas at hastexo.com>
>     Date: Tue, Oct 18, 2011 at 1:55 PM
>     Subject: Re: [Pacemaker] Problem in Stonith configuration
>     To: pacemaker at oss.clusterlabs.org
> 
> 
>     Hello,
> 
> 
>     On 10/18/2011 09:00 AM, neha chatrath wrote:
>     > Hello,
>     >
>     > Minor updates in the first requirement.
>     > 1. If a resource fails, node should reboot (through fencing mechanism)
>     > and resources should re-start on the node.
> 
>     Why would you want that? This would increase the service downtime
>     considerably. Why is a local restart not possible ... and even if there
>     is a good reason for a reboot, why not start the resource on the
>     other node?
> 
> 
>     > 2. If the physical link between the nodes in a cluster fails then that
>     > node should be isolated (kind of a power down) and the resources
>     should
>     > continue to run on the other nodes
> 
>     That is how stonith works, yes.
> 
>     crm ra list stonith ... gives you a list of all available stonith
>     plugins.
> 
>     crm ra info stonith:xxxx ... details for a specific plugin.
> 
>     Using external/ipmi is often a good choice because a lot of servers
>     already have a BMC with IPMI on board, or they are shipped with a
>     management card supporting IPMI.
> 
>     Regards,
>     Andreas
> 
> 
>     On Tue, Oct 18, 2011 at 12:30 PM, neha chatrath <nehachatrath at gmail.com> wrote:
> 
>         Hello,
> 
>         Minor updates in the first requirement.
>         1. If a resource fails, node should reboot (through fencing
>         mechanism) and resources should re-start on the node.
> 
>         2. If the physical link between the nodes in a cluster fails
>         then that node should be isolated (kind of a power down) and the
>         resources should continue to run on the other nodes
> 
>         Apologies for the inconvenience.
> 
> 
>         Thanks and regards
>         Neha Chatrath
> 
>         On Tue, Oct 18, 2011 at 12:08 PM, neha chatrath <nehachatrath at gmail.com> wrote:
> 
>             Hello Andreas,
> 
>             Thanks for the reply.
> 
>             So can you please suggest which Stonith plugin I should use
>             for the production release of my software? I have the
>             following system requirements:
>             1. If a node in the cluster fails, it should be rebooted and
>             the resources should restart on the node.
>             2. If the physical link between the nodes in a cluster fails
>             then that node should be isolated (kind of a power down) and
>             the resources should continue to run on the other nodes.
> 
>             I have different types of resources, e.g. primitive,
>             master-slave and clone, running on my system.
> 
>             Thanks and regards
>             Neha Chatrath
> 
> 
>             Date: Mon, 17 Oct 2011 15:08:16 +0200
>             From: Andreas Kurz <andreas at hastexo.com>
>             To: pacemaker at oss.clusterlabs.org
>             Subject: Re: [Pacemaker] Problem in Stonith configuration
>             Message-ID: <4E9C28C0.8070904 at hastexo.com>
>             Content-Type: text/plain; charset="iso-8859-1"
> 
>             Hello,
> 
> 
>             On 10/17/2011 12:34 PM, neha chatrath wrote:
>             > Hello,
>             > I am configuring a 2 node cluster with following
>             configuration:
>             >
>             > *[root at MCG1 init.d]# crm configure show
>             >
>             > node $id="16738ea4-adae-483f-9d79-
>             b0ecce8050f4" mcg2 \
>             > attributes standby="off"
>             >
>             > node $id="3d507250-780f-414a-b674-8c8d84e345cd" mcg1 \
>             > attributes standby="off"
>             >
>             > primitive ClusterIP ocf:heartbeat:IPaddr \
>             > params ip="192.168.1.204" cidr_netmask="255.255.255.0"
>             nic="eth0:1" \
>             >
>             > op monitor interval="40s" timeout="20s" \
>             > meta target-role="Started"
>             >
>             > primitive app1_fencing stonith:suicide \
>             > op monitor interval="90" \
>             > meta target-role="Started"
>             >
>             > primitive myapp1 ocf:heartbeat:Redundancy \
>             > op monitor interval="60s" role="Master" timeout="30s"
>             on-fail="standby" \
>             > op monitor interval="40s" role="Slave" timeout="40s"
>             on-fail="restart"
>             >
>             > primitive myapp2 ocf:mcg:Redundancy_myapp2 \
>             > op monitor interval="60" role="Master" timeout="30"
>             on-fail="standby" \
>             > op monitor interval="40" role="Slave" timeout="40"
>             on-fail="restart"
>             >
>             > primitive myapp3 ocf:mcg:red_app3 \
>             > op monitor interval="60" role="Master" timeout="30"
>             on-fail="fence" \
>             > op monitor interval="40" role="Slave" timeout="40"
>             on-fail="restart"
>             >
>             > ms ms_myapp1 myapp1 \
>             > meta master-max="1" master-node-max="1" clone-max="2"
>             clone-node-max="1"
>             > notify="true"
>             >
>             > ms ms_myapp2 myapp2 \
>             > meta master-max="1" master-node-max="1" clone-max="2"
>             clone-node-max="1"
>             > notify="true"
>             >
>             > ms ms_myapp3 myapp3 \
>             > meta master-max="1" master-max-node="1" clone-max="2"
>             clone-node-max="1"
>             > notify="true"
>             >
>             > colocation myapp1_col inf: ClusterIP ms_myapp1:Master
>             >
>             > colocation myapp2_col inf: ClusterIP ms_myapp2:Master
>             >
>             > colocation myapp3_col inf: ClusterIP ms_myapp3:Master
>             >
>             > order myapp1_order inf: ms_myapp1:promote ClusterIP:start
>             >
>             > order myapp2_order inf: ms_myapp2:promote ms_myapp1:start
>             >
>             > order myapp3_order inf: ms_myapp3:promote ms_myapp2:start
>             >
>             > property $id="cib-bootstrap-options" \
>             > dc-version="1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1" \
>             > cluster-infrastructure="Heartbeat" \
>             > stonith-enabled="true" \
>             > no-quorum-policy="ignore"
>             >
>             > rsc_defaults $id="rsc-options" \
>             > resource-stickiness="100" \
>             > migration-threshold="3"
>             > *
> 
>             > I start the Heartbeat daemon on only one of the nodes,
>             > e.g. mcg1. But none of the
>             > resources (myapp1, myapp2 etc.) get started even on this node.
>             > Following is the output of the "*crm_mon -f*" command:
>             >
>             > *Last updated: Mon Oct 17 10:19:22 2011
> 
>             > Stack: Heartbeat
>             > Current DC: mcg1 (3d507250-780f-414a-b674-8c8d84e345cd)-
>             partition with
>             > quorum
>             > Version: 1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1
>             > 2 Nodes configured, unknown expected votes
>             > 5 Resources configured.
>             > ============
>             > Node mcg2 (16738ea4-adae-483f-9d79-b0ecce8050f4): UNCLEAN
>             (offline)
> 
>             The cluster is waiting for a successful fencing event before
>             starting all resources ... that is the only way to be sure
>             the second node runs no resources.
> 
>             Since you are using the suicide plugin, this will never happen
>             as long as Heartbeat is not started on that node. If this is
>             only a _test_ setup, go with the ssh or even the null stonith
>             plugin ... never use them on production systems!
> 
>             Regards,
>             Andreas
> 
> 
>             On Mon, Oct 17, 2011 at 4:04 PM, neha chatrath
>             <nehachatrath at gmail.com> wrote:
> 
>                 Hello,
>                 I am configuring a 2 node cluster with following
>                 configuration:
> 
>                 *[root at MCG1 init.d]# crm configure show
> 
>                 node $id="16738ea4-adae-483f-9d79-b0ecce8050f4" mcg2 \
>                 attributes standby="off"
> 
>                 node $id="3d507250-780f-414a-b674-8c8d84e345cd" mcg1 \
>                 attributes standby="off"
> 
>                 primitive ClusterIP ocf:heartbeat:IPaddr \
>                 params ip="192.168.1.204" cidr_netmask="255.255.255.0"
>                 nic="eth0:1" \
> 
>                 op monitor interval="40s" timeout="20s" \
>                 meta target-role="Started"
> 
>                 primitive app1_fencing stonith:suicide \
>                 op monitor interval="90" \
>                 meta target-role="Started"
> 
>                 primitive myapp1 ocf:heartbeat:Redundancy \
>                 op monitor interval="60s" role="Master" timeout="30s"
>                 on-fail="standby" \
>                 op monitor interval="40s" role="Slave" timeout="40s"
>                 on-fail="restart"
> 
>                 primitive myapp2 ocf:mcg:Redundancy_myapp2 \
>                 op monitor interval="60" role="Master" timeout="30"
>                 on-fail="standby" \
>                 op monitor interval="40" role="Slave" timeout="40"
>                 on-fail="restart"
> 
>                 primitive myapp3 ocf:mcg:red_app3 \
>                 op monitor interval="60" role="Master" timeout="30"
>                 on-fail="fence" \
>                 op monitor interval="40" role="Slave" timeout="40"
>                 on-fail="restart"
> 
>                 ms ms_myapp1 myapp1 \
>                 meta master-max="1" master-node-max="1" clone-max="2"
>                 clone-node-max="1" notify="true"
> 
>                 ms ms_myapp2 myapp2 \
>                 meta master-max="1" master-node-max="1" clone-max="2"
>                 clone-node-max="1" notify="true"
> 
>                 ms ms_myapp3 myapp3 \
>                 meta master-max="1" master-max-node="1" clone-max="2"
>                 clone-node-max="1" notify="true"
> 
>                 colocation myapp1_col inf: ClusterIP ms_myapp1:Master
> 
>                 colocation myapp2_col inf: ClusterIP ms_myapp2:Master
> 
>                 colocation myapp3_col inf: ClusterIP ms_myapp3:Master
> 
>                 order myapp1_order inf: ms_myapp1:promote ClusterIP:start
> 
>                 order myapp2_order inf: ms_myapp2:promote ms_myapp1:start
> 
>                 order myapp3_order inf: ms_myapp3:promote ms_myapp2:start
> 
>                 property $id="cib-bootstrap-options" \
>                 dc-version="1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1"
>                 \
>                 cluster-infrastructure="Heartbeat" \
>                 stonith-enabled="true" \
>                 no-quorum-policy="ignore"
> 
>                 rsc_defaults $id="rsc-options" \
>                 resource-stickiness="100" \
>                 migration-threshold="3"
>                 *
>                 I start the Heartbeat daemon on only one of the nodes,
>                 e.g. mcg1. But none of the resources (myapp1, myapp2 etc.)
>                 get started even on this node.
>                 Following is the output of the "*crm_mon -f*" command:
> 
>                 *Last updated: Mon Oct 17 10:19:22 2011
>                 Stack: Heartbeat
>                 Current DC: mcg1 (3d507250-780f-414a-b674-8c8d84e345cd)-
>                 partition with quorum
>                 Version: 1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1
>                 2 Nodes configured, unknown expected votes
>                 5 Resources configured.
>                 ============
>                 Node mcg2 (16738ea4-adae-483f-9d79-b0ecce8050f4):
>                 UNCLEAN (offline)
>                 Online: [ mcg1 ]
>                 app1_fencing    (stonith:suicide):Started mcg1
> 
>                 Migration summary:
>                 * Node mcg1:
>                 *
>                 When I set "stonith-enabled" to false, all my
>                 resources come up.
> 
>                 Can somebody help me with STONITH configuration? 
> 
>                 Cheers
>                 Neha Chatrath
>                                           KEEP SMILING!!!!
> 
> 
> 
> 
> 
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



