[Pacemaker] Recovery after simple master-master failover

Thu Feb 23 18:17:47 EST 2012

On Fri, Feb 24, 2012 at 3:08 AM, David Gubler <dg at doodle.com> wrote:
> Hi Jake,
>
> Thanks for your answer. I had another go today.
>
>
> On 22.02.2012 00:09, Jake Smith wrote:
>>
>> Still probably not the nicest/cleanest solution but you could do a cronjob
>> that runs 'crm resource reprobe node_name'.  That will check for resources
>> the cluster didn't start and prevent the cleanup actions.
>
>
> Unfortunately that doesn't work if the last error was a monitor timeout.

It should. Please file a bug with an hb_report tarball on bugs.clusterlabs.org.

> Oddly enough I have to do "crm resource cleanup apacheClone" - not "apache"

This doesn't work because there is no actual resource called "apache".
Granted we could be smarter and work it out.  Patch anyone?
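
To illustrate, here is a minimal sketch of the kind of configuration I
assume you have. The agent, the intervals and the file name are guesses
for illustration, not taken from your CIB. The primitive is only the
template; the clone is what the cluster instantiates on each node (as
apache:0, apache:1), which is why cleanup wants the clone's id.

```shell
# Hypothetical snippet: agent, intervals and file name are assumptions.
cat > apache-clone.crm <<'EOF'
primitive apache ocf:heartbeat:apache \
    op monitor interval="10s" timeout="20s"
clone apacheClone apache
EOF
# On a test cluster this could be loaded with:
#   crm configure load update apache-clone.crm
# and failures are then cleared through the clone id:
#   crm resource cleanup apacheClone
```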

> - to fix the state of the apache resource, even though the monitor is part
> of the apache resource, not the clone. If I try both variants with reprobe,
> nothing happens.
>
> By the way, if I stop apache (/etc/init.d/apache2 stop), wait until
> Pacemaker notices, and start it again, then Pacemaker also notices that
> apache is back and moves the IPs accordingly!
>
> Why does it matter to pacemaker whether the service is shut down normally
> vs. a monitor timeout?
>
>
>> what about an 'on-fail' in the op monitor section - probably with an
>> =ignore?
>> More on that one here:
>>
>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-resource-operations.html
>
>
> That doesn't help - Pacemaker sometimes (it's not deterministic and often
> only happens on one of the two nodes) still stops and starts apache.
>
> Even after reading the documentation several times, I still barely get what
> on-fail=something is supposed to do. When I set e.g. "on-fail=ignore" on the
> apache primitive, it has no apparent effect (ditto for restart) - Pacemaker
> acts exactly as if that option were not set. Which kind of makes sense:
>
> "The default for the stop operation is fence when STONITH is enabled and
> block otherwise. All other operations default to stop."
>
> Thus, "ignore" equals "stop", and "stop" equals "block" (since I don't have
> STONITH). So what good is "ignore", if it's just another way of saying
> "block"?

No, ignore means "pretend it never happened", so in the case of a
monitor failure it means "pretend that everything is still happily
running".
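
Note that on-fail is set per operation, so it belongs on the op line of
the operation whose failures you want to ignore. A rough sketch, where
the agent, the intervals and the file name are again assumptions rather
than your actual configuration:

```shell
# Hypothetical snippet: only the monitor op carries on-fail="ignore".
cat > apache-ignore.crm <<'EOF'
primitive apache ocf:heartbeat:apache \
    op monitor interval="10s" timeout="20s" on-fail="ignore"
EOF
# On a test cluster this could be loaded with:
#   crm configure load update apache-ignore.crm
```

With this in place a failed or timed-out monitor should be treated as if
the resource were still running; start and stop keep their defaults.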

>
> So I *suppose* what I'm seeing is that my failed apache resource gets into
> the blocked state, and since "blocked" means "don't do anything with that
> resource", no surprise it doesn't recover automatically. But I still have
> no clue as to how I should do this instead...

I've missed the backstory, but the only way it should be able to get
into a blocked state is if the stop action fails/times out and stonith
is inactive, or if you've specifically set on-fail=block for an op.
To which the solution is "make sure stop succeeds" or "don't do that".
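
In practice, "make sure stop succeeds" often comes down to giving the
stop operation a timeout the service can actually meet. A sketch with
made-up values (the agent, the 120s figure and the file name are
assumptions; pick a timeout from how long apache really takes to stop
on your nodes):

```shell
# Hypothetical snippet: explicit stop op with a generous timeout.
cat > apache-stop.crm <<'EOF'
primitive apache ocf:heartbeat:apache \
    op monitor interval="10s" timeout="20s" \
    op stop interval="0" timeout="120s"
EOF
# On a test cluster this could be loaded with:
#   crm configure load update apache-stop.crm
# Fail counts can be inspected afterwards with:
#   crm_mon -1 -f
```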

>
> Thanks,
>
>
> David
>
> --
> David Gubler
> Senior Software & Operations Engineer
> MeetMe: http://doodle.com/david
> E-Mail: dg at doodle.com