<div dir="ltr">Thanks, Andrew. The goal was to use either Pacemaker and Corosync 1.x from the Debian packages, or both compiled from source. So, with the compiled version, I was hoping to avoid CMAN. However, it seems the packaged version of Pacemaker doesn't support CMAN anyway, so the point is moot.<div>
<br></div><div>I rebuilt my VMs from scratch, re-installed Pacemaker and Corosync from the Debian packages, but I'm still having an odd problem. Here is the config portion of my CIB:</div><div><br></div><div><div> <crm_config></div>
<div> <cluster_property_set id="cib-bootstrap-options"></div><div> <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff"/></div>
<div> <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="openais"/></div><div> <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2"/></div>
<div> <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/></div><div> <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/></div>
<div> </cluster_property_set></div><div> </crm_config></div></div><div><br></div><div>I set no-quorum-policy=ignore based on the documentation example for a 2-node cluster. But when Pacemaker starts up on the first node, the DRBD resource is in slave mode and none of the other resources are started (they depend on DRBD being master), and I see these lines in the log:</div>
<div><br></div><div><div>Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: unpack_config: On loss of CCM Quorum: Ignore</div><div>Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start nfs_fs (test-vm-1 - blocked)</div>
<div>Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start nfs_ip (test-vm-1 - blocked)</div><div>Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start nfs (test-vm-1 - blocked)</div>
<div>Oct 03 15:29:18 test-vm-1 pengine: [3742]: notice: LogActions: Start drbd_r0:0 (test-vm-1)</div></div><div><br></div><div>I'm assuming the NFS resources show "blocked" because the resource they depend on is not in the correct state.</div>
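<div><br></div><div>For reference, the dependency is expressed with constraints along these lines (a sketch in crm shell syntax; "ms_drbd_r0" is a placeholder for the master/slave resource that wraps drbd_r0, since I've omitted the full resource section of my CIB):</div>

```shell
# Sketch of the constraints (ms_drbd_r0 is a placeholder name):
# the NFS filesystem may only run where the DRBD master is, and
# may only start after the promotion has happened.
crm configure colocation nfs_with_drbd_master inf: nfs_fs ms_drbd_r0:Master
crm configure order nfs_after_drbd_promote inf: ms_drbd_r0:promote nfs_fs:start
```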
<div><br></div><div>Even when the second node (test-vm-2) comes online, the state of these resources does not change. I can shut down and restart Pacemaker over and over again on test-vm-2, but nothing changes. However... and this is where it gets weird... if I shut down Pacemaker on test-vm-1, then all of the resources immediately fail over to test-vm-2 and start correctly. And I see these lines in the log:</div>
<div><br></div><div><div>Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: unpack_config: On loss of CCM Quorum: Ignore</div><div>Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: stage6: Scheduling Node test-vm-1 for shutdown</div>
<div>Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Start nfs_fs (test-vm-2)</div><div>Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Start nfs_ip (test-vm-2)</div><div>Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Start nfs (test-vm-2)</div>
<div>Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Stop drbd_r0:0 (test-vm-1)</div><div>Oct 03 15:44:26 test-vm-1 pengine: [5305]: notice: LogActions: Promote drbd_r0:1 (Slave -> Master test-vm-2)</div>
</div><div><br></div><div>After that, I can generally move the resources back and forth, and even fail them over by hard-failing a node, without any problems. The real problem, though, is that this isn't consistent. Every once in a while, I'll hard-fail a node and the surviving one will go into this "stuck" state: Pacemaker knows it lost a node, but DRBD stays in slave mode and the other resources never start. It seems to happen quite randomly. Then, even if I restart Pacemaker on both nodes, or reboot them altogether, I run into the startup issue mentioned previously.</div>
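<div><br></div><div>When a node is in that stuck state, one way to see why the policy engine is leaving the resources blocked is crm_simulate, which ships with Pacemaker (a sketch; as I understand it, -L runs against the live CIB and -s shows the allocation scores):</div>

```shell
# Replay the policy engine against the live cluster state:
#   -L  use the live CIB as input
#   -s  print resource allocation scores
crm_simulate -L -s
```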
<div><br></div><div>Any ideas?</div><div><br></div><div> Thanks,</div><div> Dave</div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Oct 2, 2013 at 1:01 AM, Andrew Beekhof <span dir="ltr"><<a href="mailto:andrew@beekhof.net" target="_blank">andrew@beekhof.net</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><br>
On 02/10/2013, at 5:24 AM, David Parker <<a href="mailto:dparker@utica.edu">dparker@utica.edu</a>> wrote:<br>
<br>
> Thanks, I did a little Googling and found the git repository for pcs.<br>
<br>
</div>pcs won't help you rebuild pacemaker with cman support (or corosync 2.x support) turned on though.<br>
<div class="im"><br>
<br>
> Is there any way to make a two-node cluster work with the stock Debian packages, though? It seems odd that this would be impossible.<br>
<br>
</div>it really depends how the debian maintainers built pacemaker.<br>
by the sounds of it, it only supports the pacemaker plugin mode for corosync 1.x<br>
<div class="HOEnZb"><div class="h5"><br>
><br>
><br>
> On Tue, Oct 1, 2013 at 3:16 PM, Larry Brigman <<a href="mailto:larry.brigman@gmail.com">larry.brigman@gmail.com</a>> wrote:<br>
> pcs is another package you will need to install.<br>
><br>
> On Oct 1, 2013 9:04 AM, "David Parker" <<a href="mailto:dparker@utica.edu">dparker@utica.edu</a>> wrote:<br>
> Hello,<br>
><br>
> Sorry for the delay in my reply. I've been doing a lot of experimentation, but so far I've had no luck.<br>
><br>
> Thanks for the suggestion, but it seems I'm not able to use CMAN. I'm running Debian Wheezy with Corosync and Pacemaker installed via apt-get. When I installed CMAN and set up a cluster.conf file, Pacemaker refused to start and said that CMAN was not supported. When CMAN is not installed, Pacemaker starts up fine, but I see these lines in the log:<br>
><br>
> Sep 30 23:36:29 test-vm-1 crmd: [6941]: ERROR: init_quorum_connection: The Corosync quorum API is not supported in this build<br>
> Sep 30 23:36:29 test-vm-1 pacemakerd: [6932]: ERROR: pcmk_child_exit: Child process crmd exited (pid=6941, rc=100)<br>
> Sep 30 23:36:29 test-vm-1 pacemakerd: [6932]: WARN: pcmk_child_exit: Pacemaker child process crmd no longer wishes to be respawned. Shutting ourselves down.<br>
><br>
> So, then I checked to see which plugins are supported:<br>
><br>
> # pacemakerd -F<br>
> Pacemaker 1.1.7 (Build: ee0730e13d124c3d58f00016c3376a1de5323cff)<br>
> Supporting: generated-manpages agent-manpages ncurses heartbeat corosync-plugin snmp libesmtp<br>
><br>
> Am I correct in believing that this Pacemaker package has been compiled without support for any quorum API? If so, does anyone know if there is a Debian package which has the correct support?<br>
><br>
> I also tried compiling LibQB, Corosync and Pacemaker from source via git, following the instructions documented here:<br>
><br>
> <a href="http://clusterlabs.org/wiki/SourceInstall" target="_blank">http://clusterlabs.org/wiki/SourceInstall</a><br>
><br>
> I was hopeful that this would work, because as I understand it, Corosync 2.x no longer uses CMAN. Everything compiled and started fine, but the compiled version of Pacemaker did not include either the 'crm' or 'pcs' commands. Do I need to install something else in order to get one of these?<br>
><br>
> Any and all help is greatly appreciated!<br>
><br>
> Thanks,<br>
> Dave<br>
><br>
><br>
> On Wed, Sep 25, 2013 at 6:08 AM, David Lang <<a href="mailto:david@lang.hm">david@lang.hm</a>> wrote:<br>
> the cluster is trying to reach quorum (a majority of the nodes talking to each other), and that is never going to happen with only one node, so you have to disable this.<br>
><br>
> try putting<br>
> <cman two_node="1" expected_votes="1" transport="udpu"/><br>
> in your cluster.conf<br>
><br>
> David Lang<br>
><br>
> On Tue, 24 Sep 2013, David Parker wrote:<br>
><br>
> Date: Tue, 24 Sep 2013 11:48:59 -0400<br>
> From: David Parker <<a href="mailto:dparker@utica.edu">dparker@utica.edu</a>><br>
> Reply-To: The Pacemaker cluster resource manager<br>
> <<a href="mailto:pacemaker@oss.clusterlabs.org">pacemaker@oss.clusterlabs.org</a>><br>
> To: The Pacemaker cluster resource manager <<a href="mailto:pacemaker@oss.clusterlabs.org">pacemaker@oss.clusterlabs.org</a>><br>
> Subject: Re: [Pacemaker] Corosync won't recover when a node fails<br>
><br>
><br>
> I forgot to mention, OS is Debian Wheezy 64-bit, Corosync and Pacemaker<br>
> installed from packages via apt-get, and there are no local firewall rules<br>
> in place:<br>
><br>
> # iptables -L<br>
> Chain INPUT (policy ACCEPT)<br>
> target prot opt source destination<br>
><br>
> Chain FORWARD (policy ACCEPT)<br>
> target prot opt source destination<br>
><br>
> Chain OUTPUT (policy ACCEPT)<br>
> target prot opt source destination<br>
><br>
><br>
> On Tue, Sep 24, 2013 at 11:41 AM, David Parker <<a href="mailto:dparker@utica.edu">dparker@utica.edu</a>> wrote:<br>
><br>
> Hello,<br>
><br>
> I have a 2-node cluster using Corosync and Pacemaker, where the nodes are<br>
> actually two VirtualBox VMs on the same physical machine. I have some<br>
> resources set up in Pacemaker, and everything works fine if I move them in<br>
> a controlled way with the "crm_resource -r <resource> --move --node <node>"<br>
> command.<br>
><br>
> However, when I hard-fail one of the nodes via the "poweroff" command in<br>
> VirtualBox, which "pulls the plug" on the VM, the resources do not move,<br>
> and I see the following output in the log on the remaining node:<br>
><br>
> Sep 24 11:20:30 corosync [TOTEM ] The token was lost in the OPERATIONAL<br>
> state.<br>
> Sep 24 11:20:30 corosync [TOTEM ] A processor failed, forming new<br>
> configuration.<br>
> Sep 24 11:20:30 corosync [TOTEM ] entering GATHER state from 2.<br>
> Sep 24 11:20:31 test-vm-2 lrmd: [2503]: debug: rsc:drbd_r0:0 monitor[31]<br>
> (pid 8495)<br>
> drbd[8495]: 2013/09/24_11:20:31 WARNING: This resource agent is<br>
> deprecated and may be removed in a future release. See the man page for<br>
> details. To suppress this warning, set the "ignore_deprecation" resource<br>
> parameter to true.<br>
> drbd[8495]: 2013/09/24_11:20:31 WARNING: This resource agent is<br>
> deprecated and may be removed in a future release. See the man page for<br>
> details. To suppress this warning, set the "ignore_deprecation" resource<br>
> parameter to true.<br>
> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c<br>
> /etc/drbd.conf role r0<br>
> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Exit code 0<br>
> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Command output:<br>
> Secondary/Primary<br>
> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Calling drbdadm -c<br>
> /etc/drbd.conf cstate r0<br>
> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Exit code 0<br>
> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0: Command output: Connected<br>
> drbd[8495]: 2013/09/24_11:20:31 DEBUG: r0 status: Secondary/Primary<br>
> Secondary Primary Connected<br>
> Sep 24 11:20:31 test-vm-2 lrmd: [2503]: info: operation monitor[31] on<br>
> drbd_r0:0 for client 2506: pid 8495 exited with return code 0<br>
> Sep 24 11:20:32 corosync [TOTEM ] entering GATHER state from 0.<br>
> Sep 24 11:20:34 corosync [TOTEM ] The consensus timeout expired.<br>
> Sep 24 11:20:34 corosync [TOTEM ] entering GATHER state from 3.<br>
> Sep 24 11:20:36 corosync [TOTEM ] The consensus timeout expired.<br>
> Sep 24 11:20:36 corosync [TOTEM ] entering GATHER state from 3.<br>
> Sep 24 11:20:38 corosync [TOTEM ] The consensus timeout expired.<br>
> Sep 24 11:20:38 corosync [TOTEM ] entering GATHER state from 3.<br>
> Sep 24 11:20:40 corosync [TOTEM ] The consensus timeout expired.<br>
> Sep 24 11:20:40 corosync [TOTEM ] entering GATHER state from 3.<br>
> Sep 24 11:20:40 corosync [TOTEM ] Totem is unable to form a cluster<br>
> because of an operating system or network fault. The most common cause of<br>
> this message is that the local firewall is configured improperly.<br>
> Sep 24 11:20:43 corosync [TOTEM ] The consensus timeout expired.<br>
> Sep 24 11:20:43 corosync [TOTEM ] entering GATHER state from 3.<br>
> Sep 24 11:20:43 corosync [TOTEM ] Totem is unable to form a cluster<br>
> because of an operating system or network fault. The most common cause of<br>
> this message is that the local firewall is configured improperly.<br>
> Sep 24 11:20:45 corosync [TOTEM ] The consensus timeout expired.<br>
> Sep 24 11:20:45 corosync [TOTEM ] entering GATHER state from 3.<br>
> Sep 24 11:20:45 corosync [TOTEM ] Totem is unable to form a cluster<br>
> because of an operating system or network fault. The most common cause of<br>
> this message is that the local firewall is configured improperly.<br>
> Sep 24 11:20:47 corosync [TOTEM ] The consensus timeout expired.<br>
><br>
> Those last 3 messages just repeat over and over, the cluster never<br>
> recovers, and the resources never move. "crm_mon" reports that the<br>
> resources are still running on the dead node, and shows no indication that<br>
> anything has gone wrong.<br>
><br>
> Does anyone know what the issue could be? My expectation was that the<br>
> remaining node would become the sole member of the cluster, take over the<br>
> resources, and everything would keep running.<br>
><br>
> For reference, my corosync.conf file is below:<br>
><br>
> compatibility: whitetank<br>
><br>
> totem {<br>
> version: 2<br>
> secauth: off<br>
> interface {<br>
> member {<br>
> memberaddr: 192.168.25.201<br>
> }<br>
> member {<br>
> memberaddr: 192.168.25.202<br>
> }<br>
> ringnumber: 0<br>
> bindnetaddr: 192.168.25.0<br>
> mcastport: 5405<br>
> }<br>
> transport: udpu<br>
> }<br>
><br>
> logging {<br>
> fileline: off<br>
> to_logfile: yes<br>
> to_syslog: yes<br>
> debug: on<br>
> logfile: /var/log/cluster/corosync.log<br>
> timestamp: on<br>
> logger_subsys {<br>
> subsys: AMF<br>
> debug: on<br>
> }<br>
> }<br>
><br>
><br>
> Thanks!<br>
> Dave<br>
><br>
> --<br>
> Dave Parker<br>
> Systems Administrator<br>
> Utica College<br>
> Integrated Information Technology Services<br>
> (315) 792-3229<br>
> Registered Linux User #408177<br>
><br>
><br>
><br>
><br>
><br>
> _______________________________________________<br>
><br>
> Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>
><br>
> <a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>
><br>
><br>
><br>
> Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>
><br>
> Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
><br>
> Bugs: <a href="http://bugs.clusterlabs.org" target="_blank">http://bugs.clusterlabs.org</a><br>
><br>
><br>
><br>
><br>
><br>
><br>
><br>
><br>
><br>
><br>
><br>
><br>
><br>
<br>
</div></div><br>
<br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div>Dave Parker</div>Systems Administrator<br>Utica College<br>Integrated Information Technology Services<br>(315) 792-3229<br>Registered Linux User #408177
</div>