[Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue

Proskurin Kirill k.proskurin at corp.mail.ru
Mon Oct 17 15:21:56 EDT 2011


Hello Beekhof.

First of all, I don't want to waste your time, but this problem is really 
important to me, I can't solve it by myself, and it looks like a bug. 
I think I failed to describe the problem properly, so I will try again 
and summarize the previous conversation.

I have a situation where pacemaker thinks a resource is running when 
it is not. The agent, run from the console, says it is not running.
I have no fencing, and this resource fails to stop because the stop 
action times out. You said that this is the reason for the situation. 
But I ran an experiment and found that if pacemaker can't stop a 
resource, it marks it "unmanaged".

My resource was not "unmanaged": it was simply reported as running, 
and I had no indication of a problem.

We have already fixed the unstoppable scripts, but I want to be sure 
that I will not run into this problem again.

Below are some quotes from the previous conversation, if needed.

12.10.2011 6:11, Andrew Beekhof wrote:
>>>>>> On 10/03/2011 05:32 AM, Andrew Beekhof wrote:
>>>>>>>>
>>>>>>>> corosync-1.4.1
>>>>>>>> pacemaker-1.1.5
>>>>>>>> pacemaker runs with "ver: 1"
>>>>>>
>>>>>>>> 2)
>>>>>>>> This one is scary.
>>>>>>>> Twice I ran into a situation where pacemaker thinks a resource is
>>>>>>>> started but it is not.
>>>>>>>
>>>>>>> RA is misbehaving.  Pacemaker will only consider a resource running if
>>>>>>> the RA tells us it is (running or in a failed state).
>>>>>>
>>>>>> But as you can see below, the agent returns "7".
>>>>>
>>>>> It's still broken. Not one stop action succeeds.
>>>>>
>>>>> Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN:
>>>>> tranprocessor:stop process (PID 4082) timed out (try 1).  Killing with
>>>>> signal SIGTERM (15).
>>>>> Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN:
>>>>> tranprocessor:stop process (PID 21859) timed out (try 1).  Killing
>>>>> with signal SIGTERM (15).
>>>>> Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN:
>>>>> tranprocessor:stop process (PID 24576) timed out (try 1).  Killing
>>>>> with signal SIGTERM (15).
>>>>>
>>>>> /That/ is why pacemaker thinks it's still running.
>>>>
>>>> I ran an experiment.
>>>>
>>>> I created a script that doesn't die on SIGTERM:
>>>>
>>>> #!/usr/bin/perl
>>>> $SIG{TERM} = "IGNORE"; sleep 1 while 1
>>>>
>>>> Then I ran it under pacemaker.
>>>> I ran 3 tests:
>>>> 1) primitive test-kill-15.pl ocf:mail.ru:generic \
>>>>         op monitor interval="20" timeout="5" on-fail="restart" \
>>>>         params binfile="/tmp/test-kill-15.pl" external_pidfile="1"
>>>>
>>>> 2) Same, but with on-fail="block"
>>>>
>>>> 3) Same, but with meatware stonith.
>>>>
>>>> Each time I ran:
>>>> crm resource stop test-kill-15.pl
>>>>
>>>> And in cases 1 and 2 the resource became "unmanaged".
>
> Because you've not configured any fencing devices.
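
For anyone who wants to reproduce the stop-timeout behaviour outside the
cluster, here is a minimal Python sketch. It is not the ocf:mail.ru:generic
agent and not how lrmd is implemented; the names CHILD_CODE,
stop_with_escalation and run_demo are made up for illustration. It starts a
child that ignores SIGTERM (like the Perl test script), sends SIGTERM, waits
for a timeout, and only then escalates to SIGKILL. It assumes a POSIX system.

```python
import subprocess
import sys

# Hypothetical child process: ignores SIGTERM, similar in spirit to the
# Perl test script. It prints "ready" once the handler is installed.
CHILD_CODE = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "while True:\n"
    "    time.sleep(1)\n"
)


def stop_with_escalation(proc, term_timeout=2.0):
    """Send SIGTERM; if the process is still alive after term_timeout
    seconds, send SIGKILL. Returns the name of the signal that was needed."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=term_timeout)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be ignored
        proc.wait()
        return "SIGKILL"


def run_demo(term_timeout=2.0):
    """Start the SIGTERM-ignoring child and stop it with escalation."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # wait until the child has installed its handler
    how = stop_with_escalation(proc, term_timeout)
    proc.stdout.close()
    return how, proc.returncode


if __name__ == "__main__":
    how, rc = run_demo()
    print("stopped with", how, "return code", rc)
```

A stop action that only ever sends SIGTERM would wait forever on such a
process, which matches the repeated "timed out" warnings in the log quoted
above.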


-- 
Best regards,
Proskurin Kirill
