[Pacemaker] Service restoration in clone resource group

Tue Oct 15 19:39:13 EDT 2013

On Oct 15, 2013, at 6:21 PM, Andrew Beekhof <andrew at beekhof.net> wrote:

> 
> On 10/10/2013, at 12:52 PM, Sean Lutner <sean at rentul.net> wrote:
> 
>> 
>> On Oct 8, 2013, at 9:45 AM, Sean Lutner <sean at rentul.net> wrote:
>> 
>>> 
>>> On Oct 8, 2013, at 9:33 AM, Lars Marowsky-Bree <lmb at suse.com> wrote:
>>> 
>>>> On 2013-10-08T09:29:14, Sean Lutner <sean at rentul.net> wrote:
>>>> 
>>>>> The clone was created using the interleave=true option, yes. 
> 
> You might want to trawl the raw xml to make sure pcs did the right thing.
>   cibadmin -Ql | grep interleave
> 
> would tell you.

Thanks, that's very helpful. I'll have a look.
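
For reference, a minimal sketch of that check which prints the value rather than the raw XML line. The function name show_interleave is my own, and it assumes each nvpair element sits on a single line, which is how cibadmin normally prints the CIB.

```shell
# Reads CIB XML on stdin and prints the value of every "interleave" nvpair.
# show_interleave is a hypothetical helper name, not a Pacemaker tool.
show_interleave() {
    # Isolate each <nvpair ...> element whose name is "interleave" ...
    grep -o '<nvpair[^>]*name="interleave"[^>]*>' |
        # ... then pull out its value attribute.
        sed -n 's/.*value="\([^"]*\)".*/interleave=\1/p'
}

# On a cluster node this would be used as:
#   cibadmin -Ql | show_interleave
```

No output at all would suggest the option never made it into the CIB.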

> 
>>>> 
>>>> Ok, so pcs hides that (interesting to know).
>>>> 
>>>>> Does this have an effect on what I'm trying to accomplish?
>>>> 
>>>> Yes, if you hadn't set that, it might have been an explanation. My best
>>>> guess right now would be to upgrade first; the PE has gotten quite a few
>>>> fixes since 1.1.8 again.
>>> 
>>> Are you saying that the behavior I expect, the resource being marked as Started on the now-passive node, is what Pacemaker should be doing, and that what I'm seeing could be a bug?
>>> 
>>> If it would help, I can provide the full CIB configuration and the logs captured while I run my tests. I won't be able to do that until tonight (EST), but I can if it may help.
>>> 
>>> Thanks
>>> Sean
>> 
>> Sorry for following up on my own post, but I have another question, this time about the failcount for a resource. Does crm_resource --cleanup erase the failcount on the resource it's run against?
> 
> Older versions didn't, but I don't recall exactly when we started doing that.

In practice that's what I'm observing so it seems that with 1.1.8 it does.
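
This is roughly how I've been checking it, as a sketch. The function name and the resource id passed to it are placeholders, and DRY_RUN=1 just prints the commands instead of running them.

```shell
# Show fail counts, clean up one resource, then show fail counts again.
# check_cleanup is a hypothetical helper; $1 is a placeholder resource id
# (assumed to contain no spaces).
check_cleanup() {
    rsc="$1"
    for cmd in "crm_mon -1 -f" \
               "crm_resource --cleanup --resource $rsc" \
               "crm_mon -1 -f"; do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "$cmd"      # dry run: only print the command
        else
            $cmd             # intentionally unquoted so the words split
        fi
    done
}
```

If the fail count shown by the second crm_mon is gone, the cleanup cleared it.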

> 
>> I'm looking at changing failure-timeout and cluster-recheck-interval. Combined with my values of resource-stickiness=100 and migration-threshold=1, that should allow the services on the failed node to be restarted and marked as Started in the cluster without causing an unnecessary failover.
>> 
>> Does this make sense?
> 
> yes

I currently have failure-timeout and cluster-recheck-interval both set to 10m, but I'm not seeing the failcount clear. If I trigger a failover by stopping the resource/service, the failover works as expected. But if I then manually restart the services on the previously failed node, Pacemaker never marks the resources as Started again.
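
For anyone following along, here is a sketch of how those two values can be set with the low-level tools. The helper names and the resource id in the usage line are placeholders I made up; the crm_resource and crm_attribute options are the standard long forms. DRY_RUN=1 prints the commands instead of running them.

```shell
# run: execute a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_failure_expiry RESOURCE_ID INTERVAL   (hypothetical helper)
set_failure_expiry() {
    rsc="$1"
    interval="$2"
    # Resource meta attribute: how long before a failure may expire.
    run crm_resource --resource "$rsc" --meta \
        --set-parameter failure-timeout --parameter-value "$interval"
    # Cluster option: how often the cluster re-evaluates its state when idle.
    run crm_attribute --type crm_config \
        --name cluster-recheck-interval --update "$interval"
}

# Example (placeholder id): set_failure_expiry my-clone 10min
```

Running crm_mon -1 -f afterwards shows whether the fail count is still present once the interval has passed.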

I think I may be hitting the bug you fixed back in May. The commit for the fix is https://github.com/beekhof/pacemaker/commit/d87de1b and the thread discussing the issue is http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg15979.html.

I think that fits what I'm seeing, because the default on-fail behavior for a stop operation is block when STONITH is disabled.

I will be pulling a newer version of pacemaker from git and building an RPM to test with.

> 
>> 
>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> Lars
>>>> 
>>>> -- 
>>>> Architect Storage/HA
>>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
>>>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>> 
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>> 
