[Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue

Proskurin Kirill k.proskurin at corp.mail.ru
Mon Oct 3 02:50:49 EDT 2011


On 10/03/2011 05:32 AM, Andrew Beekhof wrote:
>> corosync-1.4.1
>> pacemaker-1.1.5
>> pacemaker runs with "ver: 1"

>> 2)
>> This one is scary.
>> Twice now I have run into a situation where pacemaker thinks a resource
>> is started when it is not.
>
> RA is misbehaving.  Pacemaker will only consider a resource running if
> the RA tells us it is (running or in a failed state).

But as you can see below, the agent returns "7".
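
Return code 7 is OCF_NOT_RUNNING. For illustration, the monitor action of
our agent boils down to roughly the sketch below (the pidfile path and the
process check are hypothetical; the real agent is a modified "anything"):

  # OCF return codes, normally provided by ocf-shellfuncs:
  : ${OCF_SUCCESS=0}
  : ${OCF_NOT_RUNNING=7}

  generic_monitor() {
      pidfile="/var/run/dialogues_notify.pid"  # hypothetical path
      if [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
          return $OCF_SUCCESS      # 0: process is running
      fi
      return $OCF_NOT_RUNNING      # 7: cleanly stopped
  }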

>> We use a slightly modified version of the "anything" agent for our
>> scripts, but it is aware of OCF return codes and other such things.
>>
>> I ran the monitor action of our agent from the console:
>> # env -i ; OCF_ROOT=/usr/lib/ocf
>> OCF_RESKEY_binfile=/usr/local/mpop/bin/my/dialogues_notify.pl
>> /usr/lib/ocf/resource.d/mail.ru/generic monitor
>> # generic[14992]: DEBUG: default monitor : 7
>>
>> So our agent said that it is not running, but pacemaker still thought
>> it did. It ran like that for 2 days until I was forced to do a cleanup.
>> After the cleanup it found out within seconds that it is not running.
>
> Did you configure a recurring monitor operation?

Of course. I included my primitive configuration in the original letter; 
it has:
op monitor interval="30" timeout="300" on-fail="restart" \
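
For context, the whole primitive presumably looks something like this in
crm shell syntax (the resource name is made up; the provider, agent and
binfile are taken from the console run quoted above):

  primitive dialogues_notify ocf:mail.ru:generic \
      params binfile="/usr/local/mpop/bin/my/dialogues_notify.pl" \
      op monitor interval="30" timeout="300" on-fail="restart"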

This has now happened a third time, and this time I found this in the logs:
Oct 01 02:00:12 mysender34.mail.ru pengine: [26301]: notice:
unpack_rsc_op: Ignoring expired failure tranprocessor_stop_0 (rc=-2,
magic=2:-2;121:690:0:4c16dc39-1fd3-41f2-b582-0236f6b6eccc) on
mysender34.mail.ru

The resource name is different because these logs are from the third 
occurrence, but the problem is the same.
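
For the record, the fail counts and the manual cleanup I mentioned can be
inspected and done like this (resource and node names are taken from the
log line above; treat the exact syntax as a sketch for this crm shell
version):

  # crm_mon -1 -f
  # crm resource cleanup tranprocessor mysender34.mail.ru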


>> 3)
>> This one is confusing and dangerous.
>>
>> I use failure-timeout on most resources to wipe out temporary warning
>> messages from crm_verify -LV, which I use for monitoring the cluster.
>> All works well, but I found this:
>>
>> 1) A resource can't start on a node and migrates to the next one.
>> 2) It can't start there either, nor on any of the others.
>> 3) It gives up and stops. There are many errors about all of this in
>> crm_verify -LV - and that is good.
>> 4) The failure-timeout expires and... wipes out all the errors.
>> 5) We are left with a stopped resource and all errors wiped, and we
>> don't know if it was stopped by the hands of an admin or because of
>> errors.

>> I think the failure-timeout should not fire on a stopped resource.
>> Any chance to avoid this?

> Not sure why you think this is dangerous; the cluster is doing exactly
> what you told it to.
> If you want resources to stay stopped either set failure-timeout=0
> (disabled) or set the target-role to Stopped.

No, I want to use failure-timeout, but not have it wipe out the errors 
when the resource was already stopped by pacemaker because of errors and 
not by an admin's hands.
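
Until then, one way to tell the two cases apart is to check the
target-role meta attribute: an admin stop sets it to Stopped, while a
failure stop leaves it alone. A sketch, assuming the crm_resource options
in this Pacemaker version and the resource name from the log above:

  # crm_resource --resource tranprocessor --meta --get-parameter target-role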

-- 
Best regards,
Proskurin Kirill



