[Pacemaker] Best way to recover from failed STONITH?

Andreas Kurz andreas at hastexo.com
Sat Dec 22 00:22:57 UTC 2012


On 12/21/2012 07:47 PM, Andrew Martin wrote:
> Andreas,
> 
> Thanks for the help. Please see my replies inline below.
> 
> ----- Original Message -----
>> From: "Andreas Kurz" <andreas at hastexo.com>
>> To: pacemaker at oss.clusterlabs.org
>> Sent: Friday, December 21, 2012 10:11:08 AM
>> Subject: Re: [Pacemaker] Best way to recover from failed STONITH?
>>
>> On 12/21/2012 04:18 PM, Andrew Martin wrote:
>>> Hello,
>>>
>>> Yesterday a power failure took out one of the nodes and its STONITH
>>> device (they share an upstream power source) in a 3-node
>>> active/passive cluster (Corosync 2.1.0, Pacemaker 1.1.8). After
>>> logging into the cluster, I saw that the STONITH operation had
>>> given up in failure and that none of the resources were running on
>>> the other nodes:
>>> Dec 20 17:59:14 [18909] quorumnode       crmd:   notice:
>>> too_many_st_failures:       Too many failures to fence node0 (11),
>>> giving up
>>>
>>> I brought the failed node back online and it rejoined the cluster,
>>> but no more STONITH attempts were made and the resources remained
>>> stopped. Eventually I set stonith-enabled="false", ran killall on
>>> all pacemaker-related processes on the other (remaining) nodes,
>>> then restarted pacemaker, and the resources successfully migrated
>>> to one of the other nodes. This seems like a rather invasive
>>> technique. My questions about this type of situation are:
>>>  - is there a better way to tell the cluster "I have manually
>>>  confirmed this node is dead/safe"? I see there is the meatclient
>>>  command, but can that only be used with the meatware STONITH
>>>  plugin?
>>
>> crm node cleanup quorumnode
> 
> I'm using the latest version of crmsh (1.2.1), but it doesn't seem to support this command:

ah ... sorry, true ... it's the "clearstate" command ... but it does a
"cleanup" ;-)

> root at node0:~# crm --version
> 1.2.1 (Build unknown)
> root at node0:~# crm node
> crm(live)node# help
> 
> Node management and status commands.
> 
> Available commands:
> 
> 	status           show nodes' status as XML
> 	show             show node
> 	standby          put node into standby
> 	online           set node online
> 	fence            fence node
> 	clearstate       Clear node state
> 	delete           delete node
> 	attribute        manage attributes
> 	utilization      manage utilization attributes
> 	status-attr      manage status attributes
> 	help             show help (help topics for list of topics)
> 	end              go back one level
> 	quit             exit the program
> Also, do I run cleanup on just the node that failed, or all of them?

You need to specify a node with this command, and you only need to (and
should) run it for the failed node.
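
For example, assuming the failed node is quorumnode (adjust the name to
your setup):

  crm node clearstate quorumnode

This drops the node's recorded state from the CIB, i.e. it tells the
cluster the node can be treated as cleanly down ... so only run it after
you have manually verified that the node really is off.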

> 
> 
>>
>>>  - in general, is there a way to force the cluster to start
>>>  resources, if you just need to get them back online and as a
>>>  human have confirmed that things are okay? Something like crm
>>>  resource start rsc --force?
>>
>> ... see above ;-)
> 
> On a related note, is there a way to get better information
> about why the cluster is in its current state? For example, in this 
> situation it would be nice to be able to run a command and have the
> cluster print "resources stopped until node XXX can be fenced" to
> be able to quickly assess the problem with the cluster.

yeah ... not all cluster command outputs and logs are user-friendly ;-)
... sorry, I'm not aware of a direct way to get better information ...
maybe someone else is?
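
The only indirect things I can think of are the fail counts and a dry run
of the policy engine against the live CIB, e.g. (just a sketch):

  crm_mon -1 -f      # one-shot status including fail counts
  crm_simulate -S -L # show what the PE would do with the current live CIB

Not the plain-language "resources stopped until node XXX is fenced" you
are after, but it usually narrows things down.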

> 
>>
>>>  - how can I completely clear out saved data for the cluster and
>>>  start over from scratch (last-resort option)? Stopping pacemaker
>>>  and removing everything from /var/lib/pacemaker/cib and
>>>  /var/lib/pacemaker/pengine cleans the CIB, but the nodes end up
>>>  sitting in the "pending" state for a very long time (30 minutes
>>>  or more). Am I missing another directory that needs to be
>>>  cleared?
>>
>> you started with a completely empty CIB and the two (or three?) nodes
>> needed 30 minutes to form a cluster?
> Yes, in fact I cleared out both /var/lib/pacemaker/cib and /var/lib/pacemaker/pengine
> several times, and most of the time after starting pacemaker again
> one node would become "online" pretty quickly (less than 5 minutes), but the other two
> would remain "pending" for quite some time. I left it going overnight
> and this morning all of the nodes

that does not sound correct ... any logs from this time?
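
Something like the following usually shows what the cluster was doing
while the nodes sat in "pending" (assuming corosync/pacemaker log via
syslog; adjust the file name to your logging setup):

  grep -E 'crmd|pengine|stonith-ng' /var/log/syslog | less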

>>
>>>
>>> I am going to look into making the power source for the STONITH
>>> device independent of the power source for the node itself,
>>> however even with that setup there's still a chance that something
>>> could take out both power sources at the same time, in which case
>>> manual intervention and confirmation that the node is dead would
>>> be required.
>>
>> Pacemaker 1.1.8 supports (again) stonith topologies ... so more than one
>> fencing device per node, and they can be "logically" combined.
> 
> Where can I find documentation on STONITH topologies and configuring
> more than one fencing device for a single node? I don't see it mentioned
> in the Cluster Labs documentation (Clusters from Scratch or Pacemaker Explained).

hmm ... good question ... besides the source code, which includes an
example as a comment ...
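
To give you a rough idea, in the CIB it looks something like this
(st-ipmi and st-pdu are placeholder stonith resource names for node0,
untested, adjust to your devices):

  <fencing-topology>
    <fencing-level id="fl-node0-1" target="node0" index="1" devices="st-ipmi"/>
    <fencing-level id="fl-node0-2" target="node0" index="2" devices="st-pdu"/>
  </fencing-topology>

Level 1 is tried first; only if all devices on that level fail does
stonith-ng fall back to level 2.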

Best regards,
Andreas

> 
> Thanks,
> 
> Andrew
> 
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now

