[ClusterLabs] [Pacemaker] Beginner | Resources stuck unloading

Sat Dec 19 00:20:35 UTC 2015

>But generally, stonith-enabled=false can lead to error recovery problems
and make trouble harder to diagnose. If you can take the time to get
stonith working, it should at least stop your first problem from causing
further problems.

Yeah, I feel like a lot of my broken post-failover cluster state comes
from not having implemented fencing yet. The nodes are VMs running on a
Proxmox instance, and I was hoping to get a proof of concept of failover
working before I implemented STONITH. I'm mostly figuring out how to
recover cluster state at the moment.

>If you're using corosync 2, you can set "two_node: 1" in corosync.conf,
and delete the no-quorum-policy=ignore setting in Pacemaker. It won't
make a huge difference, but corosync 2 can handle it better now.

I'll look into doing this. I am running Corosync 2 and Pacemaker 1.1.10 as
per what is provided via Ubuntu 14.04's repositories.
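
A minimal sketch of the change suggested above, assuming the usual
/etc/corosync/corosync.conf location and that the votequorum provider is
in use. The run helper and DRY_RUN variable are illustrative names, not
part of any tool; by default the script only prints what it would do.

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stanza to merge into the quorum section of corosync.conf on both nodes
# (assumed path: /etc/corosync/corosync.conf).
quorum_stanza() {
    cat <<'EOF'
quorum {
    provider: corosync_votequorum
    two_node: 1
}
EOF
}

quorum_stanza

# After corosync has been restarted with the new setting, remove the
# Pacemaker override so corosync's quorum handling takes effect.
run crm_attribute --type crm_config --name no-quorum-policy --delete
```

Note that two_node: 1 turns on wait_for_all by default, so both nodes
have to be seen together once after a full cluster start before either
one is considered quorate.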

>If you are doing a planned failover, a better way would be to put the
node into standby mode first, then stop pacemaker.

Yeah, figured this out later. I had a higher success rate failing over
resources that way.
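
The standby-then-stop sequence can be sketched roughly as follows,
assuming crmsh is installed (as it is with the Ubuntu 14.04 packages).
The node name node1, the run helper, and DRY_RUN are placeholders for
illustration; by default nothing is executed, only printed.

```shell
#!/bin/sh
# NODE is a hypothetical node name; DRY_RUN=1 (default) only prints commands.
NODE=${NODE:-node1}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

planned_failover() {
    run crm node standby "$NODE"   # ask Pacemaker to move resources off NODE
    run crm_mon -1                 # one-shot status; repeat until the peer runs everything
    run service pacemaker stop     # by now the node has nothing left to unload
}

rejoin() {
    run service pacemaker start
    run crm node online "$NODE"    # leave standby so NODE may host resources again
}

planned_failover
```

Putting the node in standby first lets the cluster move resources while
Pacemaker is still fully running on that node, rather than during its
shutdown.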

Right now it's just so difficult to get the cluster back to two online
nodes with one node running resources. I've tried a ground zero approach
where I kill every process and every service Pacemaker is supposed to
handle and then start everything up again. I've tried clearing node state,
among other things, but I keep getting NODE: OFFLINE and crmd refusing to
stop itself. There are a lot of tutorials on getting things running but
not many guides for when your cluster is fubar.
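
One possible inspect-then-clean routine, sketched under the assumption
that crmsh and the Pacemaker command-line tools are available. It is not
a guaranteed fix for the OFFLINE or crmd symptoms described above. The
resource name my_resource, the run helper, and DRY_RUN are hypothetical;
by default the commands are only printed.

```shell
#!/bin/sh
# RSC is a hypothetical resource name; DRY_RUN=1 (default) only prints commands.
RSC=${RSC:-my_resource}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

inspect() {
    run crm_mon -1 -f        # one-shot cluster status, including fail counts
    run crm_verify -L -V     # check the live configuration for errors
}

cleanup() {
    # Clear recorded failures for RSC so the cluster re-probes it.
    run crm resource cleanup "$RSC"
}

inspect
cleanup
```

Looking at fail counts and configuration errors before killing processes
tends to show which resource the cluster is actually stuck on.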

Thanks so much for your advice.

On Sun, Dec 13, 2015 at 10:18 PM, Tyler Hampton <dr.frankinfurter at gmail.com>
wrote:

> Hi!
>
> I'm currently trying to semi-follow Sebastien Han's blog post on
> implementing HA with Ceph rbd volumes and I am hitting some walls. The
> difference between what I'm trying to do and the blog post is that I'm
> trying to implement an active/passive instead of an active/active.
>
> I am able to get the two nodes to recognize each other and for a single
> node to assume resources. However, the setup is fairly finicky (I'm
> assuming due to my ignorance) and I can't get it to work most of the time.
>
> When I do get a pair and try to fail over (service pacemaker stop) the
> node that I'm stopping pacemaker on fails to unload its controlled
> resources and goes into a loop. A 'proper' failover has only happened twice.
>
> pacemaker stop output (with log output):
> https://gist.github.com/howdoicomputer/d88e224f6fead4623efc
>
> resource configuration:
> https://gist.github.com/howdoicomputer/a6f846eb54c3024a5be9
>
> Any help is greatly appreciated.
>