[Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue

Proskurin Kirill k.proskurin at corp.mail.ru
Thu Sep 29 15:17:18 UTC 2011


Hello all.

corosync-1.4.1
pacemaker-1.1.5
pacemaker runs with "ver: 1"

I ran into some problems this week. I'm not sure if I should have sent
three separate emails; sorry if so.

1)
I set a node to standby and then back to online, and after that I saw this in top:

2643 root     RT 0 11424 2052 1744 R 100.9 0.0 657502:53 /usr/lib/heartbeat/stonithd
2644 hacluste RT 0 12432 3440 2240 R 100.9 0.0 657502:43 /usr/lib/heartbeat/cib
2648 hacluste RT 0 11828 2860 2456 R 100.9 0.0 657502:45 /usr/lib/heartbeat/crmd
2646 hacluste RT 0 11764 2240 1904 R  99.9 0.0 657502:49 /usr/lib/heartbeat/attrd
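
For reference, the standby/online cycle was done with the usual crm shell
commands; a sketch, with a hypothetical node name:

# crm node standby node-01
# crm node online node-01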

I was in a hurry and it's a production server, so I killed these processes
and stopped pacemakerd & corosync, then started them again, and all was ok.
I believe pacemakerd and corosync themselves were still running while this
problem occurred; I assume this because when I ran stop via their init
scripts it took some time for them to stop.
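
The recovery itself was roughly this (a sketch, assuming the standard init
scripts; the PIDs are the ones from the top output above):

# kill -9 2643 2644 2646 2648
# /etc/init.d/pacemaker stop
# /etc/init.d/corosync stop
# /etc/init.d/corosync start
# /etc/init.d/pacemaker start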

Any hints?

2)
This one is scary.
Twice now I have run into a situation where Pacemaker thinks a resource is
started, but it is not. We use a slightly modified version of the "anything"
agent for our scripts, but it is aware of OCF return codes and the rest.

I ran our agent's monitor action from the console:

# env -i OCF_ROOT=/usr/lib/ocf \
    OCF_RESKEY_binfile=/usr/local/mpop/bin/my/dialogues_notify.pl \
    /usr/lib/ocf/resource.d/mail.ru/generic monitor
generic[14992]: DEBUG: default monitor : 7

So our agent said the resource is not running (exit code 7, OCF_NOT_RUNNING),
but Pacemaker still thought it was. This went on for two days until I was
forced to clean the resource up, and then Pacemaker found out within seconds
that it was not running.
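
The cleanup was the usual one; a sketch, using the resource name from the
config below:

# crm resource cleanup dialogues_notify.pl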

This is a really scary situation. I can't reproduce it, but I have already
hit it twice... maybe more times that I didn't notice, who knows.

I attach our agent script; this is how we run it:

primitive dialogues_notify.pl ocf:mail.ru:generic \
         op monitor interval="30" timeout="300" on-fail="restart" \
         op start interval="0" timeout="300" \
         op stop interval="0" timeout="300" \
         params binfile="/usr/local/mpop/bin/my/dialogues_notify.pl" \
         meta failure-timeout="120"

3)
This one is confusing and dangerous.

I use failure-timeout on most resources to wipe transient warning messages
out of crm_verify -LV, which I use for monitoring the cluster. It all works
well, but I found this:

1) A resource fails to start on a node and migrates to the next one.
2) It fails to start there too, and on all the other nodes.
3) It gives up and stops. crm_verify -LV reports many errors about all of
this - which is good.
4) The failure-timeout expires and... wipes out all the errors.
5) We end up with a stopped resource and all errors wiped, and we don't know
whether it was stopped by the hands of an admin or because of errors.

I think failure-timeout should not fire on a stopped resource.
Any chance to avoid this?
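
Until then, maybe the monitoring can watch the resource state itself rather
than only the errors; a sketch of a one-shot check that also lists stopped
resources and fail counts:

# crm_mon -1rf

Here -1 prints one-shot output, -r includes inactive (stopped) resources and
-f shows fail counts, so a stopped resource stays visible even after
failure-timeout wipes the error messages.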

-- 
Best regards,
Proskurin Kirill
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: generic.txt
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20110929/93d4869d/attachment-0003.txt>

