[Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue

Tue Oct 11 22:11:30 EDT 2011

On Fri, Oct 7, 2011 at 8:40 PM, Proskurin Kirill
<k.proskurin at corp.mail.ru> wrote:
> On 10/07/2011 02:13 AM, Andrew Beekhof wrote:
>>
>> On Thu, Oct 6, 2011 at 2:47 AM, Proskurin Kirill
>> <k.proskurin at corp.mail.ru>  wrote:
>>>
>>> On 10/05/2011 04:19 AM, Andrew Beekhof wrote:
>>>>
>>>> On Mon, Oct 3, 2011 at 5:50 PM, Proskurin Kirill
>>>> <k.proskurin at corp.mail.ru>    wrote:
>>>>>
>>>>> On 10/03/2011 05:32 AM, Andrew Beekhof wrote:
>>>>>>>
>>>>>>> corosync-1.4.1
>>>>>>> pacemaker-1.1.5
>>>>>>> pacemaker runs with "ver: 1"
>>>>>
>>>>>>> 2)
>>>>>>> This one is scary.
>>>>>>> Twice I have run into a situation where pacemaker thinks a resource
>>>>>>> is started, but it is not.
>>>>>>
>>>>>> RA is misbehaving.  Pacemaker will only consider a resource running if
>>>>>> the RA tells us it is (running or in a failed state).
>>>>>
>>>>> But you can see below that the agent returns "7".
>>>>
>>>> It's still broken. Not one stop action succeeds.
>>>>
>>>> Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN:
>>>> tranprocessor:stop process (PID 4082) timed out (try 1).  Killing with
>>>> signal SIGTERM (15).
>>>> Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN:
>>>> tranprocessor:stop process (PID 21859) timed out (try 1).  Killing
>>>> with signal SIGTERM (15).
>>>> Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN:
>>>> tranprocessor:stop process (PID 24576) timed out (try 1).  Killing
>>>> with signal SIGTERM (15).
>>>>
>>>> /That/ is why pacemaker thinks it's still running.
>>>
>>> I made an experiment.
>>>
>>> I created a script that doesn't die on SIGTERM:
>>>
>>> #!/usr/bin/perl
>>> $SIG{TERM} = "IGNORE"; sleep 1 while 1
>>>
>>> And ran it under pacemaker.
>>> I ran 3 tests:
>>> 1) primitive test-kill-15.pl ocf:mail.ru:generic \
>>>        op monitor interval="20" timeout="5" on-fail="restart" \
>>>        params binfile="/tmp/test-kill-15.pl" external_pidfile="1"
>>>
>>> 2) Same but on-fail=block
>>>
>>> 3) Same but with meatware stonith.
>>>
>>> Each time I do:
>>> crm resource stop test-kill-15.pl
>>>
>>> In cases 1 and 2, the resource ends up "unmanaged".

Because you've not configured any fencing devices.
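
Fencing aside, the stop only fails here because the process ignores
SIGTERM. I have not seen how ocf:mail.ru:generic implements stop, so the
following is only a sketch of a common approach, not that agent's code:
send SIGTERM, wait a bounded time, then fall back to SIGKILL, which cannot
be ignored. The names stop_process, spawn_term_ignoring_child and
term_timeout are made up for illustration, and Python is used only to keep
the example short (POSIX signals assumed).

```python
import signal
import subprocess
import sys


def spawn_term_ignoring_child():
    """Start a child that ignores SIGTERM, like the Perl test script."""
    code = (
        "import signal, time;"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN);"
        "print('ready', flush=True);"
        "time.sleep(60)"
    )
    child = subprocess.Popen([sys.executable, "-c", code],
                             stdout=subprocess.PIPE)
    # Block until the child has installed its SIGTERM handler.
    child.stdout.readline()
    return child


def stop_process(proc, term_timeout=2.0):
    """Try SIGTERM first, then SIGKILL. Returns the last signal sent."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=term_timeout)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        # SIGTERM was ignored; SIGKILL cannot be caught or ignored.
        proc.kill()
        proc.wait()
        return "SIGKILL"
```

Calling stop_process(spawn_term_ignoring_child()) falls through to
SIGKILL. In a real agent the bounded wait would need to be shorter than the
stop timeout configured in the cluster, otherwise lrmd gives up first.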

>>> In case 3 I get a stonith situation.

Because now there is something the cluster can do to try and automate
recovery when the stop operation fails.
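
To put the three tests side by side, here is a deliberately simplified
model of that decision. It is not Pacemaker's code, and it assumes the stop
operation's own on-fail is left at its default (fence when fencing is
available, block otherwise). The function name and the tests table are
invented for the example. Note that on-fail on the monitor operation does
not enter into it, which would explain why tests 1 and 2 behave the same.

```python
def recovery_after_failed_stop(fencing_available):
    """Simplified model: what the cluster can do once a stop has failed."""
    if fencing_available:
        # The node can be fenced, which guarantees the resource is down.
        return "fence"
    # Nothing safe can be done automatically; the resource is left alone
    # (shown as unmanaged) until an admin cleans it up.
    return "block"


# The three tests from the experiment: (monitor on-fail, fencing available)
tests = {
    1: ("restart", False),
    2: ("block", False),
    3: ("restart", True),   # same as test 1, plus a stonith device
}

outcomes = {
    n: recovery_after_failed_stop(fencing)
    for n, (_monitor_on_fail, fencing) in tests.items()
}
```

With these inputs, outcomes maps tests 1 and 2 to "block" and test 3 to
"fence", matching what was observed.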

>>
>> I can't comment based on only a partial config.
>
> Sorry for that. I attached full crm config & logs of that day.
> Resource called test-kill-15.pl
>
> --
> Best regards,
> Proskurin Kirill