[Pacemaker] 1) attrd, crmd, cib, stonithd going to 100% CPU after standby 2) monitoring bug 3) meta failure-timeout issue

Andrew Beekhof andrew at beekhof.net
Thu Oct 6 18:13:14 EDT 2011


On Thu, Oct 6, 2011 at 2:47 AM, Proskurin Kirill
<k.proskurin at corp.mail.ru> wrote:
> On 10/05/2011 04:19 AM, Andrew Beekhof wrote:
>>
>> On Mon, Oct 3, 2011 at 5:50 PM, Proskurin Kirill
>> <k.proskurin at corp.mail.ru>  wrote:
>>>
>>> On 10/03/2011 05:32 AM, Andrew Beekhof wrote:
>>>>>
>>>>> corosync-1.4.1
>>>>> pacemaker-1.1.5
>>>>> pacemaker runs with "ver: 1"
>>>
>>>>> 2)
>>>>> This one is scary.
>>>>> I have twice run into a situation where pacemaker thinks a resource is
>>>>> started but it is not.
>>>>
>>>> RA is misbehaving.  Pacemaker will only consider a resource running if
>>>> the RA tells us it is (running or in a failed state).
>>>
>>> But you can see below that the agent returns "7".
>>
>> It's still broken. Not one stop action succeeds.
>>
>> Sep 30 13:58:41 mysender34.mail.ru lrmd: [26299]: WARN:
>> tranprocessor:stop process (PID 4082) timed out (try 1).  Killing with
>> signal SIGTERM (15).
>> Sep 30 14:09:34 mysender34.mail.ru lrmd: [26299]: WARN:
>> tranprocessor:stop process (PID 21859) timed out (try 1).  Killing
>> with signal SIGTERM (15).
>> Sep 30 20:04:17 mysender34.mail.ru lrmd: [26299]: WARN:
>> tranprocessor:stop process (PID 24576) timed out (try 1).  Killing
>> with signal SIGTERM (15).
>>
>> /That/ is why pacemaker thinks it's still running.
>
> I ran an experiment.
>
> I created a script that doesn't die on SIGTERM:
>
> #!/usr/bin/perl
> $SIG{TERM} = "IGNORE"; sleep 1 while 1
>
> And ran it under pacemaker.
> I ran 3 tests:
> 1) primitive test-kill-15.pl ocf:mail.ru:generic \
>        op monitor interval="20" timeout="5" on-fail="restart" \
>        params binfile="/tmp/test-kill-15.pl" external_pidfile="1"
>
> 2) Same but on-fail=block
>
> 3) Same but with meatware stonith.
>
> Each time I ran:
> crm resource stop test-kill-15.pl
>
> In cases 1 and 2 the resource ends up "unmanaged".
> In case 3 I get a stonith situation.

I can't comment based on only a partial config.
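
For illustration only: the source of the ocf:mail.ru:generic agent is not shown
in this thread, so the following is a rough sketch of how a stop action could
escalate from SIGTERM to SIGKILL, so that stop still succeeds for a process
that ignores SIGTERM like the test script above. The function name
stop_with_escalation and the grace-period defaults are hypothetical; the
return values follow the OCF convention of 0 for success and 1 for a generic
error.

```shell
#!/bin/sh
# Hypothetical helper, not taken from any real agent.
# Stops PID $1, escalating from SIGTERM to SIGKILL.
# Returns 0 (OCF_SUCCESS) once the process is gone, 1 (OCF_ERR_GENERIC) otherwise.
stop_with_escalation() {
    pid=$1
    grace=${2:-5}   # seconds to wait after SIGTERM (assumed default)

    # Already stopped: a stop action must report success in this case.
    kill -0 "$pid" 2>/dev/null || return 0

    # Polite request first.
    kill -TERM "$pid" 2>/dev/null
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done

    # The process ignored SIGTERM. SIGKILL cannot be caught or ignored.
    kill -KILL "$pid" 2>/dev/null
    i=0
    while [ "$i" -lt 5 ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done

    # Still present: report failure so the cluster can react.
    return 1
}
```

The grace period plus the SIGKILL wait has to fit inside the stop operation's
timeout; otherwise the stop is still killed for timing out, as in the logs
quoted above.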

>
> From IRC:
> (12:20:44 PM) beekhof: Oloremo: what the hell is the cluster supposed to do
> if stop fails and you dont want fencing?  it cant start it anywhere because
> its still active in the original location
> (12:30:09 PM) Oloremo: I get the point, really.  But may be it should make
> it unmanaged?
>
> And it does.
>
> So can I assume that my problem with monitoring is still not that clear? I
> don't get "unmanaged"; it just thinks that the resource is started, but it's
> not.
>
>
> --
> Best regards,
> Proskurin Kirill