<div dir="ltr">Did you not notice that after setting attributes on "db02", the remote node "db02" went offline as "unclean", even though vm-db02 was still running? That strikes me as wrong! Once it gets into this state, I can order vm-db02 to stop, but it never does. Indeed, pacemaker doesn't do much at this point -- I can put everything into standby mode, and services still don't shut down. That is why I resorted to the forcible reboot, and also why I don't know (yet) what happens to a service on db02 in this state -- restarting the cluster takes too long to carry out many tests in one day!<div>
<br></div><div>I'll review asymmetrical clusters -- I think my mistake was assuming that an infinite-score location constraint putting DummyOnVM on db02 would prevent it from running anywhere else; of course, if db02 isn't running, my one rule isn't equivalent to having -inf scores everywhere else. Still, it is odd that shutting down vm-db02 would trigger a migration of an unrelated VM. (The fact that it would also stop vm-swbuildsl6 is the known problem that constraints don't work well with migration.)</div>
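<div><br></div><div>Concretely, the one rule I had was just (an untested sketch, with resource and node names from this thread):</div><div><br></div><div># pcs constraint location DummyOnVM prefers db02=INFINITY</div><div><br></div><div>which only adds a +INFINITY score on db02. To actually pin it there in a symmetric cluster, I suppose I would also need explicit bans on the other nodes, something like:</div><div><br></div><div># pcs constraint location DummyOnVM avoids cvmh01=INFINITY</div><div># pcs constraint location DummyOnVM avoids cvmh02=INFINITY</div><div># pcs constraint location DummyOnVM avoids cvmh03=INFINITY</div><div># pcs constraint location DummyOnVM avoids cvmh04=INFINITY</div>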
<div><br></div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Jul 2, 2013 at 6:20 PM, David Vossel <span dir="ltr"><<a href="mailto:dvossel@redhat.com" target="_blank">dvossel@redhat.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">----- Original Message -----<br>
> From: "Lindsay Todd" <<a href="mailto:rltodd.ml1@gmail.com">rltodd.ml1@gmail.com</a>><br>
</div><div class="im">> To: "The Pacemaker cluster resource manager" <<a href="mailto:pacemaker@oss.clusterlabs.org">pacemaker@oss.clusterlabs.org</a>><br>
> Sent: Tuesday, July 2, 2013 4:05:22 PM<br>
> Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes<br>
><br>
</div><div class="im">> Sorry for the delayed response, but I was out last week. I've applied this<br>
> patch to 1.1.10-rc5 and have been testing:<br>
><br>
<br>
</div>Thanks for testing :)<br>
<div><div class="h5"><br>
><br>
><br>
> # crm_attribute --type status --node "db02" --name "service_postgresql"<br>
> --update "true"<br>
> # crm_attribute --type status --node "db02" --name "service_postgresql"<br>
> scope=status name=service_postgresql value=true<br>
> # crm resource stop vm-db02<br>
> # crm resource start vm-db02<br>
> ### Wait a bit<br>
> # crm_attribute --type status --node "db02" --name "service_postgresql"<br>
> scope=status name=service_postgresql value=(null)<br>
> Error performing operation: No such device or address<br>
> # crm_attribute --type status --node "db02" --name "service_postgresql"<br>
> --update "true"<br>
> # crm_attribute --type status --node "db02" --name "service_postgresql"<br>
> scope=status name=service_postgresql value=true<br>
><br>
> Good so far. But now look at this (every node was clean, and all services<br>
> were running, before we started):<br>
><br>
><br>
><br>
> # crm status<br>
> Last updated: Tue Jul 2 16:15:14 2013<br>
> Last change: Tue Jul 2 16:15:12 2013 via crmd on cvmh02<br>
> Stack: cman<br>
> Current DC: cvmh02 - partition with quorum<br>
> Version: 1.1.10rc5-1.el6.ccni-2718638<br>
> 9 Nodes configured, unknown expected votes<br>
> 59 Resources configured.<br>
><br>
><br>
> Node db02: UNCLEAN (offline)<br>
> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01<br>
> ldap02:vm-ldap02 ]<br>
> OFFLINE: [ swbuildsl6:vm-swbuildsl6 ]<br>
><br>
> Full list of resources:<br>
><br>
> fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04<br>
> fence-cvmh02 (stonith:fence_ipmilan): Started cvmh04<br>
> fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04<br>
> fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01<br>
> Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> Clone Set: c-p-libvirtd [p-libvirtd]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> Clone Set: c-watch-ib0 [p-watch-ib0]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> Clone Set: c-fs-gpfs [p-fs-gpfs]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh03<br>
> vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Stopped<br>
> vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh02<br>
> vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03<br>
> vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04<br>
> DummyOnVM (ocf::pacemaker:Dummy): Started cvmh01<br>
><br>
> Not so good, and I'm not sure how to clean this up. I can't seem to stop<br>
<br>
</div></div>Clean what up? I don't understand what I'm expected to notice out of place here! The remote-node is up; everything looks happy.<br>
<div class="im"><br>
> vm-db02 any more, even after I've entered:<br>
><br>
><br>
><br>
> # crm_node -R db02 --force<br>
<br>
</div>That won't stop the remote-node. 'crm resource stop vm-db02' should, though.<br>
<div class="im"><br>
> # crm resource start vm-db02<br>
<br>
</div>Ha, now I'm confused. Why are you trying to start it? I thought you were trying to stop the resource.<br>
<div><div class="h5"><br>
><br>
><br>
><br>
> ### Wait a bit<br>
><br>
><br>
><br>
> # crm status<br>
> Last updated: Tue Jul 2 16:32:38 2013<br>
> Last change: Tue Jul 2 16:27:28 2013 via cibadmin on cvmh01<br>
> Stack: cman<br>
> Current DC: cvmh02 - partition with quorum<br>
> Version: 1.1.10rc5-1.el6.ccni-2718638<br>
> 8 Nodes configured, unknown expected votes<br>
> 54 Resources configured.<br>
><br>
><br>
> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ldap01:vm-ldap01 ldap02:vm-ldap02<br>
> swbuildsl6:vm-swbuildsl6 ]<br>
> OFFLINE: [ db02:vm-db02 ]<br>
><br>
> fence-cvmh01 (stonith:fence_ipmilan): Started cvmh03<br>
> fence-cvmh02 (stonith:fence_ipmilan): Started cvmh03<br>
> fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04<br>
> fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01<br>
> Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> Clone Set: c-p-libvirtd [p-libvirtd]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> Clone Set: c-watch-ib0 [p-watch-ib0]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> Clone Set: c-fs-gpfs [p-fs-gpfs]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh02<br>
> vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Started cvmh01<br>
> vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03<br>
> vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04<br>
> DummyOnVM (ocf::pacemaker:Dummy): Started cvmh01<br>
><br>
> My only recourse has been to reboot the cluster.<br>
><br>
> So let's do that and try<br>
> setting a location constraint on DummyOnVM, to force it on db02...<br>
><br>
><br>
><br>
><br>
><br>
><br>
><br>
> Last updated: Tue Jul 2 16:43:46 2013<br>
> Last change: Tue Jul 2 16:27:28 2013 via cibadmin on cvmh01<br>
> Stack: cman<br>
> Current DC: cvmh02 - partition with quorum<br>
> Version: 1.1.10rc5-1.el6.ccni-2718638<br>
> 8 Nodes configured, unknown expected votes<br>
> 54 Resources configured.<br>
><br>
><br>
> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01<br>
> ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]<br>
><br>
> fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04<br>
> fence-cvmh02 (stonith:fence_ipmilan): Started cvmh03<br>
> fence-cvmh03 (stonith:fence_ipmilan): Started cvmh04<br>
> fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01<br>
> Clone Set: c-fs-libvirt-VM-xcm [fs-libvirt-VM-xcm]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> Clone Set: c-p-libvirtd [p-libvirtd]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> Clone Set: c-fs-bind-libvirt-VM-cvmh [fs-bind-libvirt-VM-cvmh]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> Clone Set: c-watch-ib0 [p-watch-ib0]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> Clone Set: c-fs-gpfs [p-fs-gpfs]<br>
> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]<br>
> Stopped: [ db02 ldap01 ldap02 swbuildsl6 ]<br>
> vm-compute-test (ocf::ccni:xcatVirtualDomain): Started cvmh01<br>
> vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): Started cvmh01<br>
> vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh02<br>
> vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh03<br>
> vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh04<br>
> DummyOnVM (ocf::pacemaker:Dummy): Started cvmh03<br>
><br>
> # pcs constraint location DummyOnVM prefers db02<br>
> # crm status<br>
> ...<br>
> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01<br>
> ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]<br>
> ...<br>
> DummyOnVM (ocf::pacemaker:Dummy): Started db02<br>
><br>
><br>
> That's what we want to see. It would be interesting to stop db02. I expect<br>
> DummyOnVM to stop.<br>
<br>
</div></div>OH, okay, so you wanted DummyOnVM to start on db02.<br>
<div class="im"><br>
><br>
><br>
><br>
> # crm resource stop vm-db02<br>
> # crm status<br>
> ...<br>
> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ldap01:vm-ldap01 ldap02:vm-ldap02 ]<br>
> OFFLINE: [ db02:vm-db02 swbuildsl6:vm-swbuildsl6 ]<br>
> ...<br>
> DummyOnVM (ocf::pacemaker:Dummy): Started cvmh02<br>
><br>
> Failed actions:<br>
> vm-compute-test_migrate_from_0 (node=cvmh02, call=147, rc=1, status=Timed<br>
> Out, last-rc-change=Tue Jul 2 16:48:17 2013<br>
> , queued=20003ms, exec=0ms<br>
> ): unknown error<br>
><br>
> Well, that is odd. (It is the case that vm-swbuildsl6 has an order dependency<br>
> on vm-compute-test, as I was trying to understand how migrations worked with<br>
> order dependencies (not very well).<br>
<br>
</div>I don't think this failure has anything to do with the order dependencies. If pacemaker attempted to live-migrate the VM and it failed, that's a resource problem. Do you have your virtual machine images on shared storage?<br>
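One way to rule pacemaker out (assuming the libvirt domain is named like the resource, e.g. compute-test -- I'm guessing here) is to try the live migration by hand:<br>
<br>
# virsh migrate --live compute-test qemu+ssh://cvmh02/system<br>
<br>
If that fails too, it's a libvirt/storage issue rather than anything to do with your constraints.<br>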
<div class="im"><br>
> Once vm-compute-test recovers,<br>
> vm-swbuildsl6 does come back up.) This isn't really very good -- if I am<br>
> running services in VM or other containers, I need them to run only in that<br>
> container!<br>
<br>
</div>Read about the differences between asymmetrical and symmetrical clusters; I think that will help this make sense. By default, resources can run anywhere -- you just gave more weight to db02 for the Dummy resource, meaning it prefers that node when it is around.<br>
<br>
<a href="http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_deciding_which_nodes_a_resource_can_run_on" target="_blank">http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_deciding_which_nodes_a_resource_can_run_on</a><br>
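For example (just a sketch with your resource names; untested), making the cluster opt-in means resources can run nowhere by default and only where a constraint enables them:<br>
<br>
# pcs property set symmetric-cluster=false<br>
# pcs constraint location DummyOnVM prefers db02=INFINITY<br>
<br>
In that setup, when db02 goes away, DummyOnVM has nowhere it is allowed to run, so it simply stops instead of moving to a cluster node.<br>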
<div class="im"><br>
<br>
><br>
> If I start vm-db02 back up, I see that DummyOnVM is stopped and moved to<br>
> db02.<br>
<br>
</div>Yep, this is what I'd expect for a symmetrical cluster.<br>
<br>
Thanks again for the feedback, hope the asymmetrical/symmetrical cluster stuff helps :)<br>
<span class="HOEnZb"><font color="#888888"><br>
-- Vossel<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
><br>
><br>
> On Thu, Jun 20, 2013 at 4:16 PM, David Vossel < <a href="mailto:dvossel@redhat.com">dvossel@redhat.com</a> > wrote:<br>
><br>
><br>
><br>
> ----- Original Message -----<br>
> > From: "David Vossel" < <a href="mailto:dvossel@redhat.com">dvossel@redhat.com</a> ><br>
> > To: "The Pacemaker cluster resource manager" <<br>
> > <a href="mailto:pacemaker@oss.clusterlabs.org">pacemaker@oss.clusterlabs.org</a> ><br>
> > Sent: Thursday, June 20, 2013 1:35:44 PM<br>
> > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes<br>
> ><br>
> > ----- Original Message -----<br>
> > > From: "David Vossel" < <a href="mailto:dvossel@redhat.com">dvossel@redhat.com</a> ><br>
> > > To: "The Pacemaker cluster resource manager"<br>
> > > < <a href="mailto:pacemaker@oss.clusterlabs.org">pacemaker@oss.clusterlabs.org</a> ><br>
> > > Sent: Wednesday, June 19, 2013 4:47:58 PM<br>
> > > Subject: Re: [Pacemaker] Pacemaker remote nodes, naming, and attributes<br>
> > ><br>
> > > ----- Original Message -----<br>
> > > > From: "Lindsay Todd" < <a href="mailto:rltodd.ml1@gmail.com">rltodd.ml1@gmail.com</a> ><br>
> > > > To: "The Pacemaker cluster resource manager"<br>
> > > > < <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a> ><br>
> > > > Sent: Wednesday, June 19, 2013 4:11:58 PM<br>
> > > > Subject: [Pacemaker] Pacemaker remote nodes, naming, and attributes<br>
> > > ><br>
> > > > I built a set of rpms for pacemaker 1.1.0-rc4 and updated my test<br>
> > > > cluster<br>
> > > > (hopefully won't be a "test" cluster forever), as well as my VMs<br>
> > > > running<br>
> > > > pacemaker-remote. The OS everywhere is Scientific Linux 6.4. I am<br>
> > > > wanting<br>
> > > > to<br>
> > > > set some attributes on remote nodes, which I can use to control where<br>
> > > > services run.<br>
> > > ><br>
> > > > The first deviation I note from the documentation is the naming of the<br>
> > > > remote<br>
> > > > nodes. I see:<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > Last updated: Wed Jun 19 16:50:39 2013<br>
> > > > Last change: Wed Jun 19 16:19:53 2013 via cibadmin on cvmh04<br>
> > > > Stack: cman<br>
> > > > Current DC: cvmh02 - partition with quorum<br>
> > > > Version: 1.1.10rc4-1.el6.ccni-d19719c<br>
> > > > 8 Nodes configured, unknown expected votes<br>
> > > > 49 Resources configured.<br>
> > > ><br>
> > > ><br>
> > > > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 db02:vm-db02 ldap01:vm-ldap01<br>
> > > > ldap02:vm-ldap02 swbuildsl6:vm-swbuildsl6 ]<br>
> > > ><br>
> > > > Full list of resources:<br>
> > > ><br>
> > > > and so forth. The "remote-node" names are simply the hostname, so the<br>
> > > > vm-db02<br>
> > > > VirtualDomain resource has a remote-node name of db02. The "Pacemaker<br>
> > > > Remote" manual suggests this should be displayed as "db02", not<br>
> > > > "db02:vm-db02", although I can see how the latter format would be<br>
> > > > useful.<br>
> > ><br>
> > > Yep, this got changed since the documentation was published. We wanted<br>
> > > people to be able to recognize which remote-node went with which resource<br>
> > > easily.<br>
> > ><br>
> > > ><br>
> > > > So now let's set an attribute on this remote node. What name do I use?<br>
> > > > How<br>
> > > > about:<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > # crm_attribute --node "db02:vm-db02" \<br>
> > > > --name "service_postgresql" \<br>
> > > > --update "true"<br>
> > > > Could not map name=db02:vm-db02 to a UUID<br>
> > > > Please choose from one of the matches above and suppy the 'id' with<br>
> > > > --attr-id<br>
> > > ><br>
> > > > Perhaps not the most informative output, but obviously it fails. Let's<br>
> > > > try<br>
> > > > the unqualified name:<br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > ><br>
> > > > # crm_attribute --node "db02" \<br>
> > > > --name "service_postgresql" \<br>
> > > > --update "true"<br>
> > > > Remote-nodes do not maintain permanent attributes,<br>
> > > > 'service_postgresql=true'<br>
> > > > will be removed after db02 reboots.<br>
> > > > Error setting service_postgresql=true (section=status,<br>
> > > > set=status-db02):<br>
> > > > No<br>
> > > > such device or address<br>
> > > > Error performing operation: No such device or address<br>
> ><br>
> > I just tested this and ran into the same errors you did. Turns out this<br>
> > happens when the remote-node's status section is empty. If you start a<br>
> > resource on the node and then set the attribute it will work... obviously<br>
> > this is a bug. I'm working on a fix.<br>
><br>
> This should help with the attributes bit.<br>
><br>
> <a href="https://github.com/ClusterLabs/pacemaker/commit/26d34a9171bddae67c56ebd8c2513ea8fa770204" target="_blank">https://github.com/ClusterLabs/pacemaker/commit/26d34a9171bddae67c56ebd8c2513ea8fa770204</a><br>
><br>
> -- Vossel<br>
><br>
> _______________________________________________<br>
> Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>
> <a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>
><br>
> Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>
> Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
> Bugs: <a href="http://bugs.clusterlabs.org" target="_blank">http://bugs.clusterlabs.org</a><br>
><br>
><br>
><br>
<br>
</div></div></blockquote></div><br></div>