[Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue

Proskurin Kirill k.proskurin at corp.mail.ru
Wed Oct 5 15:47:00 UTC 2011


On 10/05/2011 04:19 AM, Andrew Beekhof wrote:
> On Mon, Oct 3, 2011 at 5:50 PM, Proskurin Kirill
> <k.proskurin at corp.mail.ru>  wrote:
>> On 10/03/2011 05:32 AM, Andrew Beekhof wrote:
>>>>
>>>> corosync-1.4.1
>>>> pacemaker-1.1.5
>>>> pacemaker runs with "ver: 1"
>>
>>>> 2)
>>>> This one is scary.
>>>> I twice run on situation then pacemaker thinks what resource is started
>>>> but
>>>> it is not.
>>>
>>> RA is misbehaving.  Pacemaker will only consider a resource running if
>>> the RA tells us it is (running or in a failed state).
>>
>> But you can see below, what agent return "7".
>
> Its still broken. Not one stop action succeeds.
>
> Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN:
> tranprocessor:stop process (PID 4082) timed out (try 1).  Killing with
> signal SIGTERM (15).
> Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN:
> tranprocessor:stop process (PID 21859) timed out (try 1).  Killing
> with signal SIGTERM (15).
> Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN:
> tranprocessor:stop process (PID 24576) timed out (try 1).  Killing
> with signal SIGTERM (15).
>
> /That/ is why pacemaker thinks its still running.

I ran an experiment.

I created a script that does not die on SIGTERM:

#!/usr/bin/perl
# Ignore SIGTERM, then sleep forever.
$SIG{TERM} = "IGNORE";
sleep 1 while 1;

Then I ran it under Pacemaker.
I ran 3 tests:
1) primitive test-kill-15.pl ocf:mail.ru:generic \
         op monitor interval="20" timeout="5" on-fail="restart" \
         params binfile="/tmp/test-kill-15.pl" external_pidfile="1"

2) Same, but with on-fail="block".

3) Same, but with meatware stonith.

Each time I ran:
crm resource stop test-kill-15.pl

In cases 1 and 2 the resource became "unmanaged".
In case 3 I got a stonith situation.

From IRC:
(12:20:44 PM) beekhof: Oloremo: what the hell is the cluster supposed to 
do if stop fails and you dont want fencing?  it cant start it anywhere 
because its still active in the original location
(12:30:09 PM) Oloremo: I get the point, really.  But may be it should 
make it unmanaged?

And it does.
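
For illustration, here is a minimal sketch of a stop routine that escalates from SIGTERM to SIGKILL, so that a process which ignores SIGTERM, like the test script above, does not leave the stop action waiting until it times out. The function name stop_with_escalation and the default 5-second grace period are assumptions made for this example; this is not the code of ocf:mail.ru:generic.

```shell
#!/bin/sh
# Hypothetical helper, not taken from any real resource agent.
# Usage: stop_with_escalation PID [GRACE_SECONDS]
stop_with_escalation() {
    pid=$1
    grace=${2:-5}   # assumed default grace period, in seconds

    # Ask politely first; if the signal cannot be delivered,
    # treat the process as already stopped.
    kill -TERM "$pid" 2>/dev/null || return 0

    # Poll once per second until the process exits or the grace period ends.
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done

    # Still alive: SIGKILL cannot be caught or ignored.
    kill -KILL "$pid" 2>/dev/null
}
```

A real agent should also confirm that the process is really gone before it reports a successful stop.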

So can I assume that my monitoring problem is still unexplained? There I
did not get "unmanaged": Pacemaker simply thinks the resource is started
when it is not.


-- 
Best regards,
Proskurin Kirill



