<div dir="ltr">>But generally, stonith-enabled=false can lead to error recovery problems<div>and make trouble harder to diagnose. If you can take the time to get</div><div>stonith working, it should at least stop your first problem from causing</div><div>further problems.</div><div><br></div><div>Yeah, I feel like a lot of my broken post-failover cluster state comes from not having implemented fencing yet. The nodes are VMs running on a Proxmox instance, and I was hoping to get a proof of concept of failover working before I implemented STONITH. I'm mostly figuring out how to recover cluster state at the moment.</div><div><br></div><div>>If you're using corosync 2, you can set "two_node: 1" in corosync.conf,</div><div>and delete the no-quorum-policy=ignore setting in Pacemaker. It won't</div><div>make a huge difference, but corosync 2 can handle it better now.</div><div><br></div><div>I'll look into doing this. I am running Corosync 2 and Pacemaker 1.1.10, as provided by Ubuntu 14.04's repositories.</div><div><br></div><div>>If you are doing a planned failover, a better way would be to put the</div><div>node into standby mode first, then stop pacemaker.</div><div><br></div><div>Yeah, figured this out later. I had a higher success rate failing over resources that way.</div><div><br></div><div>Right now it's just so difficult to get the cluster back to two online nodes with one node running resources. I've tried a ground zero approach where I kill every process and every service Pacemaker is supposed to handle and then start up everything again. I've tried clearing node state, among other things, but I keep getting a lot of NODE: OFFLINE and crmd refusing to stop itself. 
There are a lot of tutorials on getting things running but not many guides on what to do when your cluster is fubar.</div><div><br></div><div>Thanks so much for your advice.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Dec 13, 2015 at 10:18 PM, Tyler Hampton <span dir="ltr"><<a href="mailto:dr.frankinfurter@gmail.com" target="_blank">dr.frankinfurter@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi!<div><br></div><div>I'm currently trying to semi-follow Sebastien Han's blog post on implementing HA with Ceph rbd volumes and I am hitting some walls. The difference between what I'm trying to do and the blog post is that I'm trying to implement an active/passive setup instead of an active/active one.</div><div><br></div><div>I am able to get the two nodes to recognize each other and a single node to assume resources. However, the setup is fairly finicky (I'm assuming due to my ignorance) and I can't get it to work most of the time.</div><div><br></div><div>When I do get a pair and try to fail over (service pacemaker stop), the node that I'm stopping pacemaker on fails to unload its controlled resources and goes into a loop. A 'proper' failover has only happened twice.</div><div><br></div><div>pacemaker stop output (with log output): <a href="https://gist.github.com/howdoicomputer/d88e224f6fead4623efc" target="_blank">https://gist.github.com/howdoicomputer/d88e224f6fead4623efc</a></div><div><br></div><div>resource configuration: <a href="https://gist.github.com/howdoicomputer/a6f846eb54c3024a5be9" target="_blank">https://gist.github.com/howdoicomputer/a6f846eb54c3024a5be9</a></div><div><br></div><div>Any help is greatly appreciated.</div></div>

</blockquote></div><br></div>