[Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue

Andrew Beekhof andrew at beekhof.net
Sun Oct 2 21:32:07 EDT 2011


On Fri, Sep 30, 2011 at 1:17 AM, Proskurin Kirill
<k.proskurin at corp.mail.ru> wrote:
> Hello all.
>
> corosync-1.4.1
> pacemaker-1.1.5
> pacemaker runs with "ver: 1"
>
> I ran into some problems this week. I'm not sure whether I should have sent
> 3 separate letters; sorry if so.

I believe that should be fixed in 1.1.6.
There was a problem with IPC going nuts if corosync crashed.
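Once you are on the newer packages, something like this should confirm what is
actually running on the node (assuming the standard binaries are in the path):

  # corosync -v
  # pacemakerd --version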

>
> 1)
> I set a node to standby and then back to online. After this I got this:
>
> 2643 root RT 0 11424 2052 1744 R 100.9 0.0 657502:53
> /usr/lib/heartbeat/stonithd
> 2644 hacluste RT 0 12432 3440 2240 R 100.9 0.0 657502:43
> /usr/lib/heartbeat/cib
> 2648 hacluste RT 0 11828 2860 2456 R 100.9 0.0 657502:45
> /usr/lib/heartbeat/crmd
> 2646 hacluste RT 0 11764 2240 1904 R 99.9 0.0 657502:49
> /usr/lib/heartbeat/attrd
>
> I was in a hurry and it's a production server, so I killed these processes
> and stopped pacemakerd & corosync, then started them again, and all was OK.
> I suppose pacemakerd and corosync were still running while this problem
> occurred; I assume this because when I ran stop via their init scripts it
> took some time for them to stop.
>
> Any hints?
>
> 2)
> This one is scary.
> Twice I have run into a situation where Pacemaker thinks a resource is
> started but it is not.

The RA is misbehaving.  Pacemaker will only consider a resource running if
the RA tells us it is (either running or in a failed state).
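The contract is simple: the cluster believes whatever exit code the last
monitor returned.  A well-behaved monitor action looks roughly like this (a
minimal sketch only; the pidfile parameter is just an example, and the OCF_*
constants come from the ocf-shellfuncs file that agents normally source):

  generic_monitor() {
      # no pidfile: cleanly stopped, report OCF_NOT_RUNNING (7), not an error
      [ -f "$OCF_RESKEY_pidfile" ] || return $OCF_NOT_RUNNING
      # pidfile exists and the process answers signal 0: OCF_SUCCESS (0)
      if kill -0 "$(cat "$OCF_RESKEY_pidfile")" 2>/dev/null; then
          return $OCF_SUCCESS
      fi
      # pidfile left behind but the process is gone: OCF_ERR_GENERIC (1)
      return $OCF_ERR_GENERIC
  }

If a monitor exits 0 while the process is actually gone, Pacemaker has no way
to know better.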

> We use a slightly modified version of the "anything" agent for our
> scripts, but it is aware of OCF return codes and other such things.
>
> I ran a monitor with our agent from the console:
> # env -i ; OCF_ROOT=/usr/lib/ocf
> OCF_RESKEY_binfile=/usr/local/mpop/bin/my/dialogues_notify.pl
> /usr/lib/ocf/resource.d/mail.ru/generic monitor
> # generic[14992]: DEBUG: default monitor : 7
>
> So our agent says it is not running, but Pacemaker still thinks it is.
> This went on for 2 days until I was forced to clean it up, after which it
> found that it's not running within seconds.

Did you configure a recurring monitor operation?
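Without one, the resource is only probed when it is first started on a node
and its state is never re-checked afterwards.  You can double-check what the
cluster actually has configured with, for example:

  # crm configure show dialogues_notify.pl

and look for an "op monitor" entry with a non-zero interval.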

>
> This is a really scary situation. I can't reproduce it, but I have already
> hit it twice... maybe more times that I did not notice, who knows.
>
> I have attached our agent script, and this is how we run it:
>
> primitive dialogues_notify.pl ocf:mail.ru:generic \
>        op monitor interval="30" timeout="300" on-fail="restart" \
>        op start interval="0" timeout="300" \
>        op stop interval="0" timeout="300" \
>        params binfile="/usr/local/mpop/bin/my/dialogues_notify.pl" \
>        meta failure-timeout="120"
>
> 3)
> This one is confusing and dangerous.
>
> I use failure-timeout on most resources to wipe out temporary warning
> messages from crm_verify -LV, which I use for monitoring the cluster. All
> works well, but I found this:
>
> 1) A resource can't start on a node and migrates to the next one.
> 2) It can't start there either, nor on any other node.
> 3) It gives up and stops. There are many errors about all this in
> crm_verify -LV - and that is good.
> 4) failure-timeout kicks in and... wipes out all the errors.
> 5) We are left with a stopped resource and all errors wiped, and we don't
> know whether it was stopped by the hand of an admin or because of errors.

If it had been stopped by hand, you'd see target-role=Stopped;
failure-timeout would not affect such resources.
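You can check that from the shell with something along these lines:

  # crm_resource --resource dialogues_notify.pl --meta --get-parameter target-role

If the attribute is not set at all, the stop came from the failure handling
rather than from an admin.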

> I think failure-timeout should not apply to a stopped resource.
> Any chance to avoid this?

Not sure why you think this is dangerous; the cluster is doing exactly
what you told it to.
If you want resources to stay stopped, either set failure-timeout=0
(disabled) or set the target-role to Stopped.
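For example, with the resource from your config (crm shell syntax, adjust as
needed), either keep failures around indefinitely so crm_verify keeps
reporting them:

  # crm resource meta dialogues_notify.pl set failure-timeout 0

or stop the resource explicitly so its state is unambiguous:

  # crm resource stop dialogues_notify.pl

The second command simply sets target-role=Stopped under the hood.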



