[Pacemaker] Problem in Stonith configuration

neha chatrath nehachatrath at gmail.com
Fri Oct 28 07:21:26 EDT 2011


Hello,

1. How about using the integrated iLO device for fencing? I am using an HP ProLiant
DL360 G7 server, which supports iLO3.
   - Can the RILOE STONITH plugin be used for this? (A rough external/ipmi sketch
follows below.)
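
Would something along these lines be the right direction? (A rough, untested
sketch; iLO3 supports IPMI as far as I know. The parameter names should be
double-checked with "crm ra info stonith:external/ipmi"; the address, user
and password below are only placeholders.)

    primitive st-mcg2 stonith:external/ipmi \
        params hostname="mcg2" ipaddr="<ilo3-address-of-mcg2>" \
               userid="<ilo-user>" passwd="<ilo-password>" interface="lanplus" \
        op monitor interval="60s"
    # a fencing device should not run on the node it is meant to fence
    location st-mcg2-placement st-mcg2 -inf: mcg2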

2. Can the meatware STONITH plugin be used in a production system?

3. One more issue I am facing is that when I run the
          - "crm ra list stonith" command, there is no output, although
the different RAs under the Heartbeat class are visible.
          - Also, the stonith class is visible in the output of the "crm ra
classes" command.
          - All the default stonith RAs like meatware, suicide, ibmrsa,
ipmi etc. are present in the /usr/lib/stonith/plugins directory.
          - Because of this I am not able to configure STONITH on my system
(a few checks are sketched below).
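
Would the following checks help narrow this down? (Command names as I
understand them; the stonith binary should come with cluster-glue.)

    stonith -L                      # list the plugin types cluster-glue knows about
    crm ra list stonith             # should show roughly the same names
    crm ra info stonith:meatware    # metadata for one specific plugin

If "stonith -L" lists the plugins but "crm ra list stonith" stays empty, the
problem is probably in the crm shell/lrmd layer rather than in the plugins
themselves.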

Thanks and regards
Neha Chatrath

On Tue, Oct 18, 2011 at 2:51 PM, neha chatrath <nehachatrath at gmail.com> wrote:

> Hello,
>
> > 1. If a resource fails, the node should reboot (through the fencing mechanism)
> > and the resources should re-start on the node.
>
> Why would you want that? This would increase the service downtime
> considerably. Why is a local restart not possible ... and even if there
> is a good reason for a reboot, why not start the resource on the
> other node?
> - In our system, there are some primitive and clone resources along with 3
> different master-slave resources.
> - All the masters and slaves of these resources are co-located, i.e. all
> 3 masters are co-located on one node and the 3 slaves on the other node.
> - These 3 master-slave resources are tightly coupled. There is a
> requirement that a failure of even one of these resources restarts all
> the resources in the group.
> - All these resources can be shifted to the other node, but they should
> then also be restarted, as a lot of data/control-plane syncing is
> done between the two nodes.
> E.g. if one of the resources running on node1 as a master fails, then all
> 3 resources are shifted to the other node, i.e. node2 (with the
> corresponding slave resources being promoted to master). On node1, these
> resources should get restarted as slaves.
>
> We understand that a node restart will increase the downtime, but since we
> could not find much on group restarts of master-slave resources, we are
> trying the node-restart option. (A rough constraint sketch follows below.)
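>
> As a rough, untested sketch (crm shell syntax, resource names from the
> configuration further down in this thread), the three masters could also be
> tied directly to each other instead of only indirectly via ClusterIP:
>
>     colocation masters_12 inf: ms_myapp2:Master ms_myapp1:Master
>     colocation masters_23 inf: ms_myapp3:Master ms_myapp2:Master
>
> Whether this, together with migration-threshold, gives the "restart the
> whole group" behaviour we need would still have to be tested.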
>
>
> Thanks and regards
> Neha Chatrath
>
> ---------- Forwarded message ----------
> From: Andreas Kurz <andreas at hastexo.com>
> Date: Tue, Oct 18, 2011 at 1:55 PM
> Subject: Re: [Pacemaker] Problem in Stonith configuration
> To: pacemaker at oss.clusterlabs.org
>
>
> Hello,
>
>
> On 10/18/2011 09:00 AM, neha chatrath wrote:
> > Hello,
> >
> > Minor updates in the first requirement.
> > 1. If a resource fails, the node should reboot (through the fencing mechanism)
> > and the resources should re-start on the node.
>
> Why would you want that? This would increase the service downtime
> considerably. Why is a local restart not possible ... and even if there
> is a good reason for a reboot, why not start the resource on the
> other node?
>
>
> > 2. If the physical link between the nodes in a cluster fails, then that
> > node should be isolated (a kind of power-down) and the resources should
> > continue to run on the other nodes.
>
> That is how stonith works, yes.
>
> crm ra list stonith ... gives you a list of all available stonith plugins.
>
> crm ra info stonith:xxxx ... details for a specific plugin.
>
> Using external/ipmi is often a good choice because a lot of servers
> already have a BMC with IPMI on board or they are shipped with a
> management card supporting IPMI.
>
> Regards,
> Andreas
>
>
> On Tue, Oct 18, 2011 at 12:30 PM, neha chatrath <nehachatrath at gmail.com> wrote:
>
>> Hello,
>>
>> Minor updates in the first requirement.
>> 1. If a resource fails, the node should reboot (through the fencing mechanism) and
>> the resources should re-start on the node.
>>
>> 2. If the physical link between the nodes in a cluster fails, then that
>> node should be isolated (a kind of power-down) and the resources should
>> continue to run on the other nodes.
>>
>> Apologies for the inconvenience.
>>
>>
>> Thanks and regards
>> Neha Chatrath
>>
>> On Tue, Oct 18, 2011 at 12:08 PM, neha chatrath <nehachatrath at gmail.com> wrote:
>>
>>> Hello Andreas,
>>>
>>> Thanks for the reply.
>>>
>>> So can you please suggest which STONITH plugin I should use for the
>>> production release of my software? I have the following system requirements:
>>> 1. If a node in the cluster fails, it should be rebooted and the resources
>>> should re-start on the node.
>>> 2. If the physical link between the nodes in a cluster fails, then that
>>> node should be isolated (a kind of power-down) and the resources should
>>> continue to run on the other nodes.
>>>
>>> I have different types of resources, e.g. primitive, master-slave and clone,
>>> running on my system.
>>>
>>> Thanks and regards
>>> Neha Chatrath
>>>
>>>
>>> Date: Mon, 17 Oct 2011 15:08:16 +0200
>>> From: Andreas Kurz <andreas at hastexo.com>
>>> To: pacemaker at oss.clusterlabs.org
>>> Subject: Re: [Pacemaker] Problem in Stonith configuration
>>> Message-ID: <4E9C28C0.8070904 at hastexo.com>
>>> Content-Type: text/plain; charset="iso-8859-1"
>>>
>>> Hello,
>>>
>>>
>>> On 10/17/2011 12:34 PM, neha chatrath wrote:
>>> > Hello,
>>> > I am configuring a 2 node cluster with following configuration:
>>> >
>>> > *[root at MCG1 init.d]# crm configure show
>>> >
>>> > node $id="16738ea4-adae-483f-9d79-
>>> b0ecce8050f4" mcg2 \
>>> > attributes standby="off"
>>> >
>>> > node $id="3d507250-780f-414a-b674-8c8d84e345cd" mcg1 \
>>> > attributes standby="off"
>>> >
>>> > primitive ClusterIP ocf:heartbeat:IPaddr \
>>> > params ip="192.168.1.204" cidr_netmask="255.255.255.0" nic="eth0:1" \
>>> >
>>> > op monitor interval="40s" timeout="20s" \
>>> > meta target-role="Started"
>>> >
>>> > primitive app1_fencing stonith:suicide \
>>> > op monitor interval="90" \
>>> > meta target-role="Started"
>>> >
>>> > primitive myapp1 ocf:heartbeat:Redundancy \
>>> > op monitor interval="60s" role="Master" timeout="30s" on-fail="standby"
>>> \
>>> > op monitor interval="40s" role="Slave" timeout="40s" on-fail="restart"
>>> >
>>> > primitive myapp2 ocf:mcg:Redundancy_myapp2 \
>>> > op monitor interval="60" role="Master" timeout="30" on-fail="standby" \
>>> > op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
>>> >
>>> > primitive myapp3 ocf:mcg:red_app3 \
>>> > op monitor interval="60" role="Master" timeout="30" on-fail="fence" \
>>> > op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
>>> >
>>> > ms ms_myapp1 myapp1 \
>>> > meta master-max="1" master-node-max="1" clone-max="2"
>>> clone-node-max="1"
>>> > notify="true"
>>> >
>>> > ms ms_myapp2 myapp2 \
>>> > meta master-max="1" master-node-max="1" clone-max="2"
>>> clone-node-max="1"
>>> > notify="true"
>>> >
>>> > ms ms_myapp3 myapp3 \
>>> > meta master-max="1" master-max-node="1" clone-max="2"
>>> clone-node-max="1"
>>> > notify="true"
>>> >
>>> > colocation myapp1_col inf: ClusterIP ms_myapp1:Master
>>> >
>>> > colocation myapp2_col inf: ClusterIP ms_myapp2:Master
>>> >
>>> > colocation myapp3_col inf: ClusterIP ms_myapp3:Master
>>> >
>>> > order myapp1_order inf: ms_myapp1:promote ClusterIP:start
>>> >
>>> > order myapp2_order inf: ms_myapp2:promote ms_myapp1:start
>>> >
>>> > order myapp3_order inf: ms_myapp3:promote ms_myapp2:start
>>> >
>>> > property $id="cib-bootstrap-options" \
>>> > dc-version="1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1" \
>>> > cluster-infrastructure="Heartbeat" \
>>> > stonith-enabled="true" \
>>> > no-quorum-policy="ignore"
>>> >
>>> > rsc_defaults $id="rsc-options" \
>>> > resource-stickiness="100" \
>>> > migration-threshold="3"
>>> > *
>>>
>>> > I start the Heartbeat daemon on only one of the nodes, e.g. mcg1. But none
>>> > of the resources (myapp1, myapp2, etc.) gets started even on this node.
>>> > Following is the output of "*crm_mon -f *" command:
>>> >
>>> > *Last updated: Mon Oct 17 10:19:22 2011
>>>
>>> > Stack: Heartbeat
>>> > Current DC: mcg1 (3d507250-780f-414a-b674-8c8d84e345cd)- partition with
>>> > quorum
>>> > Version: 1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1
>>> > 2 Nodes configured, unknown expected votes
>>> > 5 Resources configured.
>>> > ============
>>> > Node mcg2 (16738ea4-adae-483f-9d79-b0ecce8050f4): UNCLEAN (offline)
>>>
>>> The cluster is waiting for a successful fencing event before starting
>>> all resources ... the only way to be sure the second node runs no
>>> resources.
>>>
>>> Since you are using the suicide plugin, this will never happen if Heartbeat
>>> is not started on that node. If this is only a _test_ setup, go with the ssh
>>> or even the null stonith plugin ... never use them on production systems!
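>>>
>>> For example, a test-only fencing primitive could look roughly like this
>>> (just a sketch; check "crm ra info stonith:null" or "crm ra info
>>> stonith:ssh" for the exact parameters):
>>>
>>>     primitive test_fencing stonith:null \
>>>         params hostlist="mcg1 mcg2" \
>>>         op monitor interval="60s"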
>>>
>>> Regards,
>>> Andreas
>>>
>>>
>>> On Mon, Oct 17, 2011 at 4:04 PM, neha chatrath <nehachatrath at gmail.com> wrote:
>>>
>>>> Hello,
>>>> I am configuring a 2 node cluster with following configuration:
>>>>
>>>> *[root at MCG1 init.d]# crm configure show
>>>>
>>>> node $id="16738ea4-adae-483f-9d79-b0ecce8050f4" mcg2 \
>>>> attributes standby="off"
>>>>
>>>> node $id="3d507250-780f-414a-b674-8c8d84e345cd" mcg1 \
>>>> attributes standby="off"
>>>>
>>>> primitive ClusterIP ocf:heartbeat:IPaddr \
>>>> params ip="192.168.1.204" cidr_netmask="255.255.255.0" nic="eth0:1" \
>>>>
>>>> op monitor interval="40s" timeout="20s" \
>>>> meta target-role="Started"
>>>>
>>>> primitive app1_fencing stonith:suicide \
>>>> op monitor interval="90" \
>>>> meta target-role="Started"
>>>>
>>>> primitive myapp1 ocf:heartbeat:Redundancy \
>>>> op monitor interval="60s" role="Master" timeout="30s" on-fail="standby"
>>>> \
>>>> op monitor interval="40s" role="Slave" timeout="40s" on-fail="restart"
>>>>
>>>> primitive myapp2 ocf:mcg:Redundancy_myapp2 \
>>>> op monitor interval="60" role="Master" timeout="30" on-fail="standby" \
>>>> op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
>>>>
>>>> primitive myapp3 ocf:mcg:red_app3 \
>>>> op monitor interval="60" role="Master" timeout="30" on-fail="fence" \
>>>> op monitor interval="40" role="Slave" timeout="40" on-fail="restart"
>>>>
>>>> ms ms_myapp1 myapp1 \
>>>> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
>>>> notify="true"
>>>>
>>>> ms ms_myapp2 myapp2 \
>>>> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
>>>> notify="true"
>>>>
>>>> ms ms_myapp3 myapp3 \
>>>> meta master-max="1" master-max-node="1" clone-max="2" clone-node-max="1"
>>>> notify="true"
>>>>
>>>> colocation myapp1_col inf: ClusterIP ms_myapp1:Master
>>>>
>>>> colocation myapp2_col inf: ClusterIP ms_myapp2:Master
>>>>
>>>> colocation myapp3_col inf: ClusterIP ms_myapp3:Master
>>>>
>>>> order myapp1_order inf: ms_myapp1:promote ClusterIP:start
>>>>
>>>> order myapp2_order inf: ms_myapp2:promote ms_myapp1:start
>>>>
>>>> order myapp3_order inf: ms_myapp3:promote ms_myapp2:start
>>>>
>>>> property $id="cib-bootstrap-options" \
>>>> dc-version="1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1" \
>>>> cluster-infrastructure="Heartbeat" \
>>>> stonith-enabled="true" \
>>>> no-quorum-policy="ignore"
>>>>
>>>> rsc_defaults $id="rsc-options" \
>>>> resource-stickiness="100" \
>>>> migration-threshold="3"
>>>> *
>>>> I start the Heartbeat daemon on only one of the nodes, e.g. mcg1. But none of the
>>>> resources (myapp1, myapp2, etc.) gets started even on this node.
>>>> Following is the output of "*crm_mon -f *" command:
>>>>
>>>> *Last updated: Mon Oct 17 10:19:22 2011
>>>> Stack: Heartbeat
>>>> Current DC: mcg1 (3d507250-780f-414a-b674-8c8d84e345cd)- partition with
>>>> quorum
>>>> Version: 1.0.11-db98485d06ed3fe0fe236509f023e1bd4a5566f1
>>>> 2 Nodes configured, unknown expected votes
>>>> 5 Resources configured.
>>>> ============
>>>> Node mcg2 (16738ea4-adae-483f-9d79-b0ecce8050f4): UNCLEAN (offline)
>>>> Online: [ mcg1 ]
>>>> app1_fencing    (stonith:suicide):Started mcg1
>>>>
>>>> Migration summary:
>>>> * Node mcg1:
>>>> *
>>>> When I set "stonith-enabled" to false, then all my resources come up.
>>>>
>>>> Can somebody help me with STONITH configuration?
>>>>
>>>> Cheers
>>>> Neha Chatrath
>>>>                           KEEP SMILING!!!!
>>>>
>>>
>>>
>>
>