[Pacemaker] failure handling on a cloned resource

Andrew Beekhof andrew at beekhof.net
Wed May 15 22:54:49 EDT 2013


On 16/05/2013, at 12:45 AM, Johan Huysmans <johan.huysmans at inuits.be> wrote:

> Hi Andrew,
> 
> Thx!
> 
> I tested your GitHub pacemaker repository by building an RPM from it and installing it on my test setup.
> 
> Before I could build the rpm I had to change 2 things in the GNUmakefile:
> * --without=doc should be --without doc

That would be a dependency issue, i.e. something needed was not installed.

> * --target i686 was missing
> If I didn't make these modifications, the rpmbuild command failed (on CentOS 6).
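> 
> For reference, a corrected invocation would then look roughly like this (a
> sketch only: it assumes the stock pacemaker.spec and the i686 target
> discussed above; the exact spec path depends on how the tree was set up):
> 
> # rpmbuild -bb --without doc --target i686 pacemaker.spec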

What was the command you ran?

> 
> I performed the test that failed before, and everything seems OK.
> Once the failing resource was restored, the dependent resources were started automatically.
> 
> Thanks for this fast fix!
> 
> 
> In which release can I expect this fix? And when is it planned?

1.1.10, planned for as soon as all the bugs are fixed :)
We're at rc2 now; rc3 should be out today or tomorrow.

> For now I will use the head build I created. This is OK for my test setup,
> but I don't want to run this version in production.
> 
> Greetings,
> Johan Huysmans
> 
> On 2013-05-10 06:55, Andrew Beekhof wrote:
>> Fixed!
>> 
>>   https://github.com/beekhof/pacemaker/commit/d87de1b
>> 
>> On 10/05/2013, at 11:59 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
>> 
>>> On 07/05/2013, at 5:15 PM, Johan Huysmans <johan.huysmans at inuits.be> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I only keep a couple of pe-input files, and that pe-input-1 version was already overwritten.
>>>> I redid my tests as described in my previous mails.
>>>> 
>>>> At the end of the test it was again written to pe-input-1, which is included as an attachment.
>>> Perfect.
>>> Basically the PE doesn't know how to correctly recognise that d_tomcat_monitor_15000 needs to be processed after d_tomcat_last_failure_0:
>>> 
>>>            <lrm_rsc_op id="d_tomcat_monitor_15000" operation_key="d_tomcat_monitor_15000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" transition-magic="0:0;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" call-id="44" rc-code="0" op-status="0" interval="15000" last-rc-change="1367910303" exec-time="0" queue-time="0" op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
>>>            <lrm_rsc_op id="d_tomcat_last_failure_0" operation_key="d_tomcat_monitor_15000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" transition-magic="0:1;18:360:0:ade789ed-b68e-4f0d-9092-684d0aaa0e89" call-id="44" rc-code="1" op-status="0" interval="15000" last-rc-change="1367909258" exec-time="0" queue-time="0" op-digest="0c738dfc69f09a62b7ebf32344fddcf6"/>
>>> 
>>> which would allow it to recognise that the resource is healthy once again
>>> (note that both entries carry the same call-id, 44, so only last-rc-change
>>> distinguishes their order).
>>> 
>>> I'll see what I can do...
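>>> 
>>> In the meantime, the stale failure record can usually be cleared by hand so
>>> that the PE re-evaluates the resource. A sketch using the names from this
>>> thread (check crm_resource --help on your build for the exact flags):
>>> 
>>> # crm_resource --cleanup --resource d_tomcat --node CSE-1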
>>> 
>>>> gr.
>>>> Johan
>>>> 
>>>> On 2013-05-07 04:08, Andrew Beekhof wrote:
>>>>> I have a much clearer idea of the problem you're seeing now, thank you.
>>>>> 
>>>>> Could you attach /var/lib/pacemaker/pengine/pe-input-1.bz2 from CSE-1 ?
>>>>> 
>>>>> On 03/05/2013, at 10:40 PM, Johan Huysmans <johan.huysmans at inuits.be> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Below you can see my setup and my test; it shows that my cloned resource with on-fail=block does not recover automatically.
>>>>>> 
>>>>>> My Setup:
>>>>>> 
>>>>>> # rpm -aq | grep -i pacemaker
>>>>>> pacemaker-libs-1.1.9-1512.el6.i686
>>>>>> pacemaker-cluster-libs-1.1.9-1512.el6.i686
>>>>>> pacemaker-cli-1.1.9-1512.el6.i686
>>>>>> pacemaker-1.1.9-1512.el6.i686
>>>>>> 
>>>>>> # crm configure show
>>>>>> node CSE-1
>>>>>> node CSE-2
>>>>>> primitive d_tomcat ocf:ntc:tomcat \
>>>>>>   op monitor interval="15s" timeout="510s" on-fail="block" \
>>>>>>   op start interval="0" timeout="510s" \
>>>>>>   params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" monitor_timeout="120" \
>>>>>>   meta migration-threshold="1"
>>>>>> primitive ip_11 ocf:heartbeat:IPaddr2 \
>>>>>>   op monitor interval="10s" \
>>>>>>   params broadcast="172.16.11.31" ip="172.16.11.31" nic="bond0.111" iflabel="ha" \
>>>>>>   meta migration-threshold="1" failure-timeout="10"
>>>>>> primitive ip_19 ocf:heartbeat:IPaddr2 \
>>>>>>   op monitor interval="10s" \
>>>>>>   params broadcast="172.18.19.31" ip="172.18.19.31" nic="bond0.119" iflabel="ha" \
>>>>>>   meta migration-threshold="1" failure-timeout="10"
>>>>>> group svc-cse ip_19 ip_11
>>>>>> clone cl_tomcat d_tomcat
>>>>>> colocation colo_tomcat inf: svc-cse cl_tomcat
>>>>>> order order_tomcat inf: cl_tomcat svc-cse
>>>>>> property $id="cib-bootstrap-options" \
>>>>>>   dc-version="1.1.9-1512.el6-2a917dd" \
>>>>>>   cluster-infrastructure="cman" \
>>>>>>   pe-warn-series-max="9" \
>>>>>>   no-quorum-policy="ignore" \
>>>>>>   stonith-enabled="false" \
>>>>>>   pe-input-series-max="9" \
>>>>>>   pe-error-series-max="9" \
>>>>>>   last-lrm-refresh="1367582088"
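>>>>>> 
>>>>>> As a side note, a configuration like the above can be sanity-checked
>>>>>> against the live cluster before testing, e.g.:
>>>>>> 
>>>>>> # crm_verify --live-check -V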
>>>>>> 
>>>>>> Currently only 1 node is available, CSE-1.
>>>>>> 
>>>>>> 
>>>>>> This is how I am currently testing my setup:
>>>>>> 
>>>>>> => Starting point: Everything up and running
>>>>>> 
>>>>>> # crm resource status
>>>>>> Resource Group: svc-cse
>>>>>>    ip_19    (ocf::heartbeat:IPaddr2):    Started
>>>>>>    ip_11    (ocf::heartbeat:IPaddr2):    Started
>>>>>> Clone Set: cl_tomcat [d_tomcat]
>>>>>>    Started: [ CSE-1 ]
>>>>>>    Stopped: [ d_tomcat:1 ]
>>>>>> 
>>>>>> => Causing failure: Change system so tomcat is running but has a failure (in attachment step_2.log)
>>>>>> 
>>>>>> # crm resource status
>>>>>> Resource Group: svc-cse
>>>>>>    ip_19    (ocf::heartbeat:IPaddr2):    Stopped
>>>>>>    ip_11    (ocf::heartbeat:IPaddr2):    Stopped
>>>>>> Clone Set: cl_tomcat [d_tomcat]
>>>>>>    d_tomcat:0    (ocf::ntc:tomcat):    Started (unmanaged) FAILED
>>>>>>    Stopped: [ d_tomcat:1 ]
>>>>>> 
>>>>>> => Fixing failure: Revert system so tomcat is running without failure (in attachment step_3.log)
>>>>>> 
>>>>>> # crm resource status
>>>>>> Resource Group: svc-cse
>>>>>>    ip_19    (ocf::heartbeat:IPaddr2):    Stopped
>>>>>>    ip_11    (ocf::heartbeat:IPaddr2):    Stopped
>>>>>> Clone Set: cl_tomcat [d_tomcat]
>>>>>>    d_tomcat:0    (ocf::ntc:tomcat):    Started (unmanaged) FAILED
>>>>>>    Stopped: [ d_tomcat:1 ]
>>>>>> 
>>>>>> As you can see in the logs, the OCF script no longer returns any failure. Pacemaker notices this,
>>>>>> but it is not reflected in crm_mon, and the dependent resources are not started.
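>>>>>> 
>>>>>> To double-check what pacemaker has actually recorded at this point, the
>>>>>> fail count can be queried directly (a sketch; flags may vary between
>>>>>> versions, see each tool's --help):
>>>>>> 
>>>>>> # crm_mon -1 --failcounts
>>>>>> # crm_failcount -G -r d_tomcat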
>>>>>> 
>>>>>> Gr.
>>>>>> Johan
>>>>>> 
>>>>>> On 2013-05-03 03:04, Andrew Beekhof wrote:
>>>>>>> On 02/05/2013, at 5:45 PM, Johan Huysmans <johan.huysmans at inuits.be> wrote:
>>>>>>> 
>>>>>>>> On 2013-05-01 05:48, Andrew Beekhof wrote:
>>>>>>>>> On 17/04/2013, at 9:54 PM, Johan Huysmans <johan.huysmans at inuits.be> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi All,
>>>>>>>>>> 
>>>>>>>>>> I'm trying to set up a specific configuration in our cluster; however, I'm struggling with it.
>>>>>>>>>> 
>>>>>>>>>> This is what I'm trying to achieve:
>>>>>>>>>> On both nodes of the cluster a daemon must be running (tomcat).
>>>>>>>>>> Some failover addresses are configured and must be running on the node with a correctly running tomcat.
>>>>>>>>>> 
>>>>>>>>>> I have achieved this with a cloned tomcat resource and a colocation between the cloned tomcat and the failover addresses.
>>>>>>>>>> When I cause a failure in the tomcat on the node running the failover addresses, the failover addresses fail over to the other node as expected.
>>>>>>>>>> crm_mon shows that this tomcat has a failure.
>>>>>>>>>> When I configure the tomcat resource with failure-timeout=0, the failure alarm in crm_mon isn't cleared even after the tomcat failure is fixed.
>>>>>>>>> All sounds right so far.
>>>>>>>> If my broken tomcat is automatically fixed, I expect pacemaker to notice this and that node to become able to run my failover addresses again;
>>>>>>>> however, I don't see this happening.
>>>>>>> This is very hard to discuss without seeing logs.
>>>>>>> 
>>>>>>> So you created a tomcat error, waited for pacemaker to notice, fixed the error and observed that pacemaker did not re-notice?
>>>>>>> How long did you wait? More than the 15s repeat interval, I assume? Did at least the resource agent notice?
>>>>>>> 
>>>>>>>>>> When I configure the tomcat resource with failure-timeout=30, the failure alarm in crm_mon is cleared after 30 seconds, even though the tomcat still has a failure.
>>>>>>>>> Can you define "still having a failure"?
>>>>>>>>> You mean it still shows up in crm_mon?
>>>>>>>>> Have you read this link?
>>>>>>>>>   http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-rules-recheck.html
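>>>>>>>>> 
>>>>>>>>> In short: an expired failure-timeout is only acted on when the cluster
>>>>>>>>> re-checks its state, which happens on the cluster-recheck-interval.
>>>>>>>>> That interval can be shortened if needed, e.g.:
>>>>>>>>> 
>>>>>>>>> # crm configure property cluster-recheck-interval="60s"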
>>>>>>>> "Still having a failure" means that the tomcat is still broken and my OCF script reports it as a failure.
>>>>>>>>>> What I expect is that pacemaker reports the failure for as long as it exists, and reports that everything is OK once the problem is resolved.
>>>>>>>>>> 
>>>>>>>>>> Do I do something wrong with my configuration?
>>>>>>>>>> Or how can I achieve my wanted setup?
>>>>>>>>>> 
>>>>>>>>>> Here is my configuration:
>>>>>>>>>> 
>>>>>>>>>> node CSE-1
>>>>>>>>>> node CSE-2
>>>>>>>>>> primitive d_tomcat ocf:custom:tomcat \
>>>>>>>>>>   op monitor interval="15s" timeout="510s" on-fail="block" \
>>>>>>>>>>   op start interval="0" timeout="510s" \
>>>>>>>>>>   params instance_name="NMS" monitor_use_ssl="no" monitor_urls="/cse/health" monitor_timeout="120" \
>>>>>>>>>>   meta migration-threshold="1" failure-timeout="0"
>>>>>>>>>> primitive ip_1 ocf:heartbeat:IPaddr2 \
>>>>>>>>>>   op monitor interval="10s" \
>>>>>>>>>>   params nic="bond0" broadcast="10.1.1.1" iflabel="ha" ip="10.1.1.1"
>>>>>>>>>> primitive ip_2 ocf:heartbeat:IPaddr2 \
>>>>>>>>>>   op monitor interval="10s" \
>>>>>>>>>>   params nic="bond0" broadcast="10.1.1.2" iflabel="ha" ip="10.1.1.2"
>>>>>>>>>> group svc-cse ip_1 ip_2
>>>>>>>>>> clone cl_tomcat d_tomcat
>>>>>>>>>> colocation colo_tomcat inf: svc-cse cl_tomcat
>>>>>>>>>> order order_tomcat inf: cl_tomcat svc-cse
>>>>>>>>>> property $id="cib-bootstrap-options" \
>>>>>>>>>>   dc-version="1.1.8-7.el6-394e906" \
>>>>>>>>>>   cluster-infrastructure="cman" \
>>>>>>>>>>   no-quorum-policy="ignore" \
>>>>>>>>>>   stonith-enabled="false"
>>>>>>>>>> 
>>>>>>>>>> Thanks!
>>>>>>>>>> 
>>>>>>>>>> Greetings,
>>>>>>>>>> Johan Huysmans
>>>>>>>>>> 
>>>>>> <step_2.log><step_3.log>
>>>> <pe-input-1.bz2>
>> 
> 




