[ClusterLabs] [Pacemaker] Beginner | Resources stuck unloading

Wed Dec 16 16:02:47 UTC 2015

On 12/14/2015 12:18 AM, Tyler Hampton wrote:
> Hi!
> 
> I'm currently trying to semi-follow Sebastien Han's blog post on
> implementing HA with Ceph rbd volumes and I am hitting some walls. The
> difference between what I'm trying to do and the blog post is that I'm
> trying to implement active/passive instead of active/active.
> 
> I am able to get the two nodes to recognize each other and for a single
> node to assume resources. However, the setup is fairly finicky (I'm
> assuming due to my ignorance) and I can't get it to work most of the time.
> 
> When I do get a pair and try to fail over (service pacemaker stop) the node
> that I'm stopping pacemaker on fails to unload its controlled resources and
> goes into a loop. A 'proper' failover has only happened twice.
> 
> pacemaker stop output (with log output):
> https://gist.github.com/howdoicomputer/d88e224f6fead4623efc
> 
> resource configuration:
> https://gist.github.com/howdoicomputer/a6f846eb54c3024a5be9
> 
> Any help is greatly appreciated.

Hopefully someone with more Ceph or Upstart experience can give you more
specifics.

In general, though, running with stonith-enabled=false can lead to error
recovery problems (a resource that fails to stop cannot be recovered
without fencing the node) and makes trouble harder to diagnose. If you
can take the time to get stonith working, it should at least keep your
first problem from causing further problems.
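
As a rough starting point for getting stonith working, the sketch below
wraps three Pacemaker command-line tools (stonith_admin, crm_attribute,
crm_verify). The run helper and the DRY_RUN variable are my own additions
for illustration, not part of Pacemaker. The fence device itself is not
shown, because which agent and parameters you need depends entirely on
your hardware.

```shell
#!/bin/sh
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# List the fence agents installed on this node.
list_fence_agents() {
    run stonith_admin --list-installed
}

# Turn fencing back on. Only do this after a fence device resource
# has been configured and tested.
enable_stonith() {
    run crm_attribute --type crm_config --name stonith-enabled --update true
}

# Check the live cluster configuration for errors and warnings.
verify_config() {
    run crm_verify --live-check -V
}
```

Running list_fence_agents first shows which agents are available;
verify_config after enable_stonith should report a problem if
stonith is enabled but no fence device is defined.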

If you're using corosync 2, you can set "two_node: 1" in the quorum
section of corosync.conf, and delete the no-quorum-policy=ignore setting
in Pacemaker. It won't make a huge difference, but corosync 2 can handle
two-node quorum itself now, and does it better.
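
For reference, the relevant part of corosync.conf would look roughly like
this. It is a fragment only; the rest of your file (totem, nodelist,
logging) stays as it is, and I am assuming you use the votequorum
provider that ships with corosync 2.

```
quorum {
    provider: corosync_votequorum
    two_node: 1
}
```

Note that two_node also turns on wait_for_all by default, so after a
full cluster shutdown both nodes must be seen once before the cluster
becomes quorate. Corosync has to be restarted on both nodes for the
change to take effect.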

If you are doing a planned failover, a better way is to put the node
into standby mode first, then stop Pacemaker. That ensures all resources
have successfully moved to the other node before anything is shut down.
When the node comes back, it stays in standby until you take it out, so
you decide when it's ready to host resources again, which gives you time
for administration, troubleshooting, or whatever reason you took it down.
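
A minimal sketch of that sequence, assuming you manage the cluster with
crmsh (the "crm" shell) and start Pacemaker through "service" as in your
original mail. The function names, the run helper and DRY_RUN are
hypothetical, and the node name is whatever "crm_node -n" prints on the
node in question.

```shell
#!/bin/sh
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: put the node into standby, then show a one-shot status so you
# can confirm the resources are running on the other node.
enter_standby() {
    run crm node standby "$1"
    run crm_mon -1
}

# Step 2: once nothing is left running on the node, stop Pacemaker on it.
stop_node() {
    run service pacemaker stop
}

# Step 3: after maintenance, start Pacemaker again and take the node out
# of standby so it may host resources again.
rejoin() {
    run service pacemaker start
    run crm node online "$1"
}
```

Call enter_standby with the node name, look at the status output, and
only call stop_node when it shows the resources started elsewhere. If
you use pcs instead of crmsh, the standby commands are different.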