[Pacemaker] Best way to recover from failed STONITH?
Andreas Kurz
andreas at hastexo.com
Fri Dec 21 19:22:57 EST 2012
On 12/21/2012 07:47 PM, Andrew Martin wrote:
> Andreas,
>
> Thanks for the help. Please see my replies inline below.
>
> ----- Original Message -----
>> From: "Andreas Kurz" <andreas at hastexo.com>
>> To: pacemaker at oss.clusterlabs.org
>> Sent: Friday, December 21, 2012 10:11:08 AM
>> Subject: Re: [Pacemaker] Best way to recover from failed STONITH?
>>
>> On 12/21/2012 04:18 PM, Andrew Martin wrote:
>>> Hello,
>>>
>>> Yesterday a power failure took out one of the nodes and its STONITH
>>> device (they share an upstream power source) in a 3-node
>>> active/passive cluster (Corosync 2.1.0, Pacemaker 1.1.8). After
>>> logging into the cluster, I saw that the STONITH operation had
>>> given up in failure and that none of the resources were running on
>>> the other nodes:
>>> Dec 20 17:59:14 [18909] quorumnode crmd: notice:
>>> too_many_st_failures: Too many failures to fence node0 (11),
>>> giving up
>>>
>>> I brought the failed node back online and it rejoined the cluster,
>>> but no more STONITH attempts were made and the resources remained
>>> stopped. Eventually I set stonith-enabled="false", ran killall on
>>> all pacemaker-related processes on the other (remaining) nodes,
>>> then restarted pacemaker, and the resources successfully migrated
>>> to one of the other nodes. This seems like a rather invasive
>>> technique. My questions about this type of situation are:
>>> - is there a better way to tell the cluster "I have manually
>>> confirmed this node is dead/safe"? I see there is the meatclient
>>> command, but can that only be used with the meatware STONITH
>>> plugin?
>>
>> crm node cleanup quorumnode
>
> I'm using the latest version of crmsh (1.2.1) but it doesn't seem to support this command:
ah ... sorry, true ... it's the "clearstate" command ... but it does a
"cleanup" ;-)
> root at node0:~# crm --version
> 1.2.1 (Build unknown)
> root at node0:~# crm node
> crm(live)node# help
>
> Node management and status commands.
>
> Available commands:
>
> status          show nodes' status as XML
> show            show node
> standby         put node into standby
> online          set node online
> fence           fence node
> clearstate      Clear node state
> delete          delete node
> attribute       manage attributes
> utilization     manage utilization attributes
> status-attr     manage status attributes
> help            show help (help topics for list of topics)
> end             go back one level
> quit            exit the program
> Also, do I run cleanup on just the node that failed, or all of them?
You need to specify a node with this command, and you should run it only
for the failed node.
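
To make that concrete, here is a minimal Python sketch that builds the clearstate call for exactly one named node and rejects anything else. The helper names `clearstate_command` and `clear_failed_node` and the dry-run default are assumptions for illustration; only `crm node clearstate <node>` comes from the crmsh help output quoted above. Run it only after you have confirmed by hand that the node is really down.

```python
import subprocess


def clearstate_command(failed_node):
    """Build the crmsh call that clears the state of exactly one node."""
    node = failed_node.strip()
    # clearstate takes a single node name; reject empty input or anything
    # that looks like several names, so only the failed node is touched.
    if not node or len(node.split()) != 1:
        raise ValueError("expected exactly one node name")
    return ["crm", "node", "clearstate", node]


def clear_failed_node(failed_node, dry_run=True):
    """Print the command; only execute it when dry_run is False."""
    cmd = clearstate_command(failed_node)
    print(" ".join(cmd))
    if not dry_run:
        # crmsh may ask for confirmation before dropping the node state.
        subprocess.check_call(cmd)
    return cmd


if __name__ == "__main__":
    # node0 is the node that could not be fenced in the log excerpt above.
    clear_failed_node("node0")
```

With the default `dry_run=True` the sketch only prints the command, which makes it safe to try on a machine without a cluster.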
>
>
>>
>>> - in general, is there a way to force the cluster to start
>>> resources, if you just need to get them back online and as a
>>> human have confirmed that things are okay? Something like crm
>>> resource start rsc --force?
>>
>> ... see above ;-)
>
> On a related note, is there a way to get better information
> about why the cluster is in its current state? For example, in this
> situation it would be nice to be able to run a command and have the
> cluster print "resources stopped until node XXX can be fenced" to
> be able to quickly assess the problem with the cluster.
yeah ... not all cluster command output and logs are user-friendly ;-)
... sorry, I'm not aware of a direct way to get better information. Maybe
someone else is?
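
As a stopgap, the logs can be scanned for the "giving up" message quoted at the top of this thread. The Python sketch below is not a Pacemaker tool; the function names are made up for illustration, the pattern is taken from that single log line, and reading the number in parentheses as a failure count is an assumption. The summary wording is also an inference: the log line itself only says that fencing was abandoned.

```python
import re

# Matches the crmd message quoted earlier in this thread, e.g.
# "too_many_st_failures: Too many failures to fence node0 (11), giving up"
GAVE_UP = re.compile(
    r"too_many_st_failures:\s+Too many failures to fence\s+(\S+)\s+\((\d+)\)"
)


def unfenced_nodes(log_text):
    """Return {node: count} for nodes crmd stopped trying to fence."""
    # The message may be wrapped over several lines, so collapse all
    # whitespace into single spaces before searching.
    flat = " ".join(log_text.split())
    found = {}
    for node, count in GAVE_UP.findall(flat):
        # Keep the highest count seen for each node.
        found[node] = max(found.get(node, 0), int(count))
    return found


def summarize(log_text):
    """Produce one readable line per node that still needs fencing."""
    return [
        "resources stopped until %s can be fenced (gave up after %d failures)"
        % (node, count)
        for node, count in sorted(unfenced_nodes(log_text).items())
    ]


if __name__ == "__main__":
    sample = (
        "Dec 20 17:59:14 [18909] quorumnode crmd: notice:\n"
        "too_many_st_failures: Too many failures to fence node0 (11),\n"
        "giving up\n"
    )
    for line in summarize(sample):
        print(line)
```

Feeding it the contents of the cluster log file instead of `sample` gives a quick list of nodes the cluster is still waiting on.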
>
>>
>>> - how can I completely clear out saved data for the cluster and
>>> start over from scratch (last-resort option)? Stopping pacemaker
>>> and removing everything from /var/lib/pacemaker/cib and
>>> /var/lib/pacemaker/pengine cleans the CIB, but the nodes end up
>>> sitting in the "pending" state for a very long time (30 minutes
>>> or more). Am I missing another directory that needs to be
>>> cleared?
>>
>> you started with a completely empty CIB and the two (or three?) nodes
>> needed 30min to form a cluster?
> Yes, in fact I cleared out both /var/lib/pacemaker/cib and /var/lib/pacemaker/pengine
> several times and most of the times after starting pacemaker again
> one node would become "online" pretty quickly (less than 5 minutes), but the other two
> would remain "pending" for quite some time. I left it going overnight
> and this morning all of the nodes
that doesn't sound right ... any logs from this time?
>>
>>>
>>> I am going to look into making the power source for the STONITH
>>> device independent of the power source for the node itself,
>>> however even with that setup there's still a chance that something
>>> could take out both power sources at the same time, in which case
>>> manual intervention and confirmation that the node is dead would
>>> be required.
>>
>> Pacemaker 1.1.8 supports (again) stonith topologies ... so you can
>> have more than one fencing device, and they can be "logically"
>> combined.
>
> Where can I find documentation on STONITH topologies and configuring
> more than one fencing device for a single node? I don't see it mentioned
> in the Cluster Labs documentation (Clusters from Scratch or Pacemaker Explained).
hmm ... good question ... besides the source code, which includes an
example as a comment ....
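
For orientation, here is a rough Python sketch that generates a topology section for one node with two fencing levels. The element and attribute names (`fencing-topology`, `fencing-level`, `id`, `target`, `index`, `devices`) reflect my understanding of the CIB schema and should be checked against the schema shipped with your Pacemaker version. The device names `st-pdu-node0` and `st-ipmi-node0` and the id format are hypothetical. As far as I know, all devices listed in one level must succeed, and the next level is tried only if the current one fails.

```python
import xml.etree.ElementTree as ET


def fencing_topology(levels):
    """Build a <fencing-topology> element from (target, index, devices)."""
    topo = ET.Element("fencing-topology")
    for target, index, devices in levels:
        ET.SubElement(
            topo,
            "fencing-level",
            {
                # Hypothetical id scheme; any unique id should do.
                "id": "fl-%s-%d" % (target, index),
                "target": target,
                "index": str(index),
                # Comma-separated list of stonith resource ids.
                "devices": ",".join(devices),
            },
        )
    return topo


if __name__ == "__main__":
    # Hypothetical device names: try the PDU first, then fall back to IPMI.
    topo = fencing_topology([
        ("node0", 1, ["st-pdu-node0"]),
        ("node0", 2, ["st-ipmi-node0"]),
    ])
    print(ET.tostring(topo, encoding="unicode"))
```

The printed XML is only a fragment; it would still have to be loaded into the configuration section of the CIB with whatever tool you normally use.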
Best regards,
Andreas
>
> Thanks,
>
> Andrew
>
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started:
>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>