[Pacemaker] detect/cleanup failed resource
Bernd Schubert
bs_lists at aakef.fastmail.fm
Thu Oct 21 06:42:40 EDT 2010
Hi all,
Is there a better way to detect a failed resource than to run "crm_mon -1 -r"?
For example, I just 'created' a failed resource:
crm_mon -1 -r
Failed actions:
ost_janlus_27_start_0 (node=vm3, call=108, rc=2, status=complete): invalid parameter
This cannot easily be parsed using 'grep', as "Failed actions:" is a complete
section. Well, with a Python or Perl script it still wouldn't be too
difficult. But how can I figure out the resource name there?
I cannot run "crm resource cleanup ost_janlus_27_start_0", as that is
obviously not the resource name. I also cannot simply cut off "start_0",
as other actions than "start" may fail as well.
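Just to illustrate, the kind of parsing I'd end up with looks roughly like
this (only a sketch; it assumes the operation id always follows the
<resource>_<action>_<interval> scheme and that the failed-action lines
always look like the one above, which I'm not sure about):

crm_mon -1 -r | awk '
    /^Failed actions:/ { failed = 1; next }
    failed && /\(node=/ {
        op = $1; node = $2
        sub(/^\(node=/, "", node); sub(/,.*$/, "", node)
        # assumption: strip <action>_<interval> to get the resource name
        sub(/_[^_]+_[^_]+$/, "", op)
        print op, node
    }'

That feels rather fragile, which is why I'd prefer something properly
parseable in the first place.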
In fact, the crm_mon output is already annoying to handle as a human being:
for a cleanup, a simple copy-and-paste with the mouse does not work, since
I always have to cut off the action first...
Cleaning up dozens to hundreds of resources manually is not an option, so
we have a script that goes over all resources and does that. However, on
larger clusters that can easily take up to 90 minutes.
For a small-sized cluster:
[root at vm3 ~]# time crm resource cleanup ost_janlus_27
Cleaning up ost_janlus_27 on vm6
Cleaning up ost_janlus_27 on vm7
Cleaning up ost_janlus_27 on vm8
Cleaning up ost_janlus_27 on vm1
Cleaning up ost_janlus_27 on vm2
Cleaning up ost_janlus_27 on vm3
real 0m7.129s
user 0m0.471s
sys 0m0.106s
[root at vm3 ~]# time crm resource cleanup ost_janlus_27 vm6
Cleaning up ost_janlus_27 on vm6
real 0m1.348s
user 0m0.203s
sys 0m0.071s
[root at vm3 ~]# time cluster_resources cleanup
resource: mds-janlus-grp
Cleaning up vg_janlus on vm6
Cleaning up mgs on vm6
Cleaning up mdt_janlus on vm6
Cleaning up vg_janlus on vm7
Cleaning up mgs on vm7
[...]
real 3m35.463s
user 0m13.704s
sys 0m3.440s
(cluster_resources is a small front end for crm that runs it
for all of our resources.)
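For reference, cluster_resources boils down to roughly the following
(just a sketch; the real script also knows about our resource groups and
takes the resource names from our configuration):

#!/bin/sh
# rough sketch of cluster_resources: walk over a flat list of resource
# names and clean up each one on all nodes
RESOURCES="vg_janlus mgs mdt_janlus ost_janlus_27"   # ...and so on
for rsc in $RESOURCES; do
    echo "resource: $rsc"
    crm resource cleanup "$rsc"
done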
So about 1.35 s per resource. That is no problem for a few resources on
all nodes of a 3-node system, but already an annoying 3+ minutes
for 28 resources and 6 nodes on our default small-sized systems.
And it is definitely not an option anymore on an 18-node cluster with 230 resources
(calculated time: 230 resources * 18 nodes * 1.35 s = 5589 s ≈ 1.5 *hours*).
And cleaning up 230 resources manually when something went wrong on the cluster
is no fun either, and not exactly fast.
So I'm looking for *any* sane way to clean up resources, or at least
for a well parseable way to get the failed resources and their
corresponding nodes.
Thanks,
Bernd
--
Bernd Schubert
DataDirect Networks