[Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue

Proskurin Kirill k.proskurin at corp.mail.ru
Fri Oct 7 09:40:45 UTC 2011


On 10/07/2011 02:13 AM, Andrew Beekhof wrote:
> On Thu, Oct 6, 2011 at 2:47 AM, Proskurin Kirill
> <k.proskurin at corp.mail.ru>  wrote:
>> On 10/05/2011 04:19 AM, Andrew Beekhof wrote:
>>>
>>> On Mon, Oct 3, 2011 at 5:50 PM, Proskurin Kirill
>>> <k.proskurin at corp.mail.ru>    wrote:
>>>>
>>>> On 10/03/2011 05:32 AM, Andrew Beekhof wrote:
>>>>>>
>>>>>> corosync-1.4.1
>>>>>> pacemaker-1.1.5
>>>>>> pacemaker runs with "ver: 1"
>>>>
>>>>>> 2)
>>>>>> This one is scary.
>>>>>> I have twice run into a situation where pacemaker thinks a resource is
>>>>>> started when it is not.
>>>>>
>>>>> RA is misbehaving.  Pacemaker will only consider a resource running if
>>>>> the RA tells us it is (running or in a failed state).
>>>>
>>>> But as you can see below, the agent returns "7".
>>>
>>> It's still broken. Not one stop action succeeds.
>>>
>>> Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN:
>>> tranprocessor:stop process (PID 4082) timed out (try 1).  Killing with
>>> signal SIGTERM (15).
>>> Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN:
>>> tranprocessor:stop process (PID 21859) timed out (try 1).  Killing
>>> with signal SIGTERM (15).
>>> Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN:
>>> tranprocessor:stop process (PID 24576) timed out (try 1).  Killing
>>> with signal SIGTERM (15).
>>>
>>> /That/ is why pacemaker thinks it's still running.
>>
>> I ran an experiment.
>>
>> I created a script that does not die on SIGTERM:
>>
>> #!/usr/bin/perl
>> # Ignore SIGTERM so that "kill -15" cannot terminate this process.
>> $SIG{TERM} = 'IGNORE';
>> sleep 1 while 1;
>>
>> Then I ran it under pacemaker.
>> I ran 3 tests:
>> 1) primitive test-kill-15.pl ocf:mail.ru:generic \
>>         op monitor interval="20" timeout="5" on-fail="restart" \
>>         params binfile="/tmp/test-kill-15.pl" external_pidfile="1"
>>
>> 2) Same, but with on-fail="block"
>>
>> 3) Same, but with meatware stonith.
>>
>> Each time I ran:
>> crm resource stop test-kill-15.pl
>>
>> In cases 1 and 2 the resource ends up "unmanaged".
>> In case 3 stonith is triggered.
>
> I can't comment based on only a partial config.

Sorry about that. I have attached the full crm config and the logs from that day.
The resource is called test-kill-15.pl.
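
For reference, the test script above ignores SIGTERM, so a stop action can only succeed against it if it escalates to SIGKILL. Below is a minimal sketch of such an escalation in plain sh. The function name stop_with_escalation and the default grace period are hypothetical, chosen only for illustration; this is not the code of ocf:mail.ru:generic or of any shipped resource agent.

```shell
#!/bin/sh
# Hypothetical helper (not taken from any real resource agent):
# stop a process by PID, escalating from SIGTERM to SIGKILL.
# Returns 0 once the process is gone, 1 if it is still present after SIGKILL.
stop_with_escalation() {
    pid="$1"
    grace="${2:-5}"   # seconds to wait after SIGTERM (assumed default)

    # Already gone: a stop action should report success in this case.
    kill -0 "$pid" 2>/dev/null || return 0

    # Ask politely first.
    kill -TERM "$pid" 2>/dev/null
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done

    # The process ignored SIGTERM; SIGKILL cannot be trapped or ignored.
    kill -KILL "$pid" 2>/dev/null
    i=0
    while [ "$i" -lt 5 ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    return 1
}
```

A stop action built this way would call it with the PID read from the pidfile, for example stop_with_escalation "$pid" 10, and the grace period plus the SIGKILL wait would have to fit inside the stop timeout configured for the resource.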

-- 
Best regards,
Proskurin Kirill
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.log.bz2
Type: application/x-bzip
Size: 183236 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20111007/ed64ba85/attachment-0004.bin>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cib.txt
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20111007/ed64ba85/attachment-0004.txt>

