[Pacemaker] Asymmetric cluster, clones, and location constraints

Thu Oct 24 18:40:25 EDT 2013

----- Original Message -----
> From: "Lindsay Todd" <rltodd.ml1 at gmail.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Wednesday, October 23, 2013 2:38:17 PM
> Subject: Re: [Pacemaker] Asymmetric cluster, clones, and location constraints
> 
> David,
> 
> The Infiniband network takes a nondeterministic amount of time to actually
> finish initializing, so we use ethmonitor to watch it; the OS is supposed to
> bring it up at boot time, but it moves on through the boot sequence without
> actually waiting for it. So in self defense we watch it with pacemaker. I
> guess I could restructure this to use a resource that brings up IB (with a
> really long timeout) and use ordering to wait for that to complete, but it
> seems that ethmonitor would be more adaptive to short-term IB network
> issues. Since ethmonitor works by setting an attribute (the RA running means
> it is watching the network, not that the network is up), I've used location
> constraints instead of ordering constraints.
> 
> So I have completely restarted my cluster. Right now the physical nodes see
> each other, and the fencing agents are running. The first thing that should
> start are the ethmonitor resource agents on the VM hosts (the c-watch-ib0
> clones of the p-watch-ib0 primitive). They are not starting (like they used
> to).
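
For readers following the thread, the gating described above can be sketched in a few lines of Python. This is only a simplified model of how location rules combine in an opt-in (asymmetric) cluster, not Pacemaker's policy engine: the function names `add_scores` and `eligible_nodes` are hypothetical, the two rules mirror the ones quoted further down for c-fs-gpfs, and the node attribute values are made up for illustration.

```python
NEG_INF = float("-inf")


def add_scores(a, b):
    # Assumption of this model: -INFINITY dominates any other score.
    if a == NEG_INF or b == NEG_INF:
        return NEG_INF
    return a + b


def eligible_nodes(node_attrs, rules):
    """Return the nodes a resource may run on in an opt-in cluster.

    node_attrs: {node name: {attribute name: value}}
    rules: list of (score, predicate over a node's attribute dict)
    """
    # Opt-in cluster: a node that no rule matches is simply not allowed.
    scores = {}
    for score, matches in rules:
        for node, attrs in node_attrs.items():
            if matches(attrs):
                scores[node] = add_scores(scores.get(node, 0), score)
    # A node is usable only if its combined score is not negative.
    return sorted(node for node, s in scores.items() if s >= 0)


rules = [
    # rule 0: nodetype eq "vmhost"
    (0, lambda a: a.get("nodetype") == "vmhost"),
    # rule -inf: not_defined "ethmonitor-ib0" or "ethmonitor-ib0" eq 0
    (NEG_INF, lambda a: a.get("ethmonitor-ib0") in (None, "0")),
]

# Illustrative attribute values only.
nodes = {
    "cvmh01": {"nodetype": "vmhost", "ethmonitor-ib0": "1"},
    "cvmh02": {"nodetype": "vmhost"},  # ethmonitor has not set the attribute yet
    "db02": {},                        # remote node without nodetype
}

print(eligible_nodes(nodes, rules))
```

In this model a VM host becomes eligible only once ethmonitor has set its attribute to a non-zero value, and a node without nodetype=vmhost is never eligible, which matches the behaviour expected in the original report.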

I see.  Your cib generates an invalid transition.  I'll try and look into it in more detail soon to understand the cause.

One completely unrelated thought I had while looking at your config involves your fencing agents. You shouldn't have to use location constraints on the fencing agents. I believe stonith is smart enough now to execute the agent on a node that isn't the target, regardless of where the policy engine puts it.

-- Vossel

> The cib snapshot can be seen in http://pastebin.com/TccTHQPS (some
> slight editing to hide passwords in fencing agents).
> 
> /Lindsay
> 
> 
> On Wed, Oct 23, 2013 at 11:20 AM, David Vossel < dvossel at redhat.com > wrote:
> 
> 
> 
> ----- Original Message -----
> > From: "Lindsay Todd" < rltodd.ml1 at gmail.com >
> > To: "The Pacemaker cluster resource manager" <
> > Pacemaker at oss.clusterlabs.org >
> > Sent: Tuesday, October 22, 2013 4:19:11 PM
> > Subject: [Pacemaker] Asymmetric cluster, clones, and location constraints
> > 
> > I am getting rather unexpected behavior when I combine clones, location
> > constraints, and remote nodes in an asymmetric cluster. My cluster is
> > configured to be asymmetric, distinguishing between vmhosts and various
> > sorts of remote nodes. Currently I am running upstream version b6d42ed. I am
> > simplifying my description to avoid confusion, hoping in so doing I don't
> > miss any salient points...
> > 
> > My physical cluster nodes, also the VM hosts, have the attribute
> > "nodetype=vmhost". They also have Infiniband interfaces, which take some
> > time to come up. I don't want my shared file system (which needs IB), or
> > libvirtd (which needs the file system), to come up before IB... So I have
> > this in my configuration:
> > 
> > 
> > 
> > 
> > primitive p-watch-ib0 ocf:heartbeat:ethmonitor \
> > params \
> > interface="ib0" \
> > op monitor timeout="100s" interval="10s"
> > clone c-watch-ib0 p-watch-ib0 \
> > meta interleave="true"
> > #
> > location loc-watch-ib-only-vmhosts c-watch-ib0 \
> > rule 0: nodetype eq "vmhost"
> > 
> > Something broke between upstream versions 0a2570a and c68919f -- the
> > c-watch-ib0 clone never starts. I've found that if I run "crm_resource
> > --force-start -r p-watch-ib0" when IB is running, the ethmonitor-ib0
> > attribute is not set like it used to be. Oh well, I can set it manually. So
> > let's.
> 
> A re-write of the attrd component was introduced around that time period.
> This should have been resolved at this point in the b6d42ed build.
> 
> > We use GPFS for a shared file system, so I have an agent to start it and
> > wait for a file system to mount. It should only run on VM hosts, and only
> > when IB is running. So I have this:
> 
> So the IB resource is setting some attribute that enables the fs to run? Why
> can't an ordering constraint be used here between IB and FS?
> 
> > 
> > 
> > 
> > 
> > primitive p-fs-gpfs ocf:ccni:gpfs \
> > params \
> > fspath="/gpfs/lb/utility" \
> > op monitor timeout="20s" interval="30s" \
> > op start timeout="180s" \
> > op stop timeout="120s"
> > clone c-fs-gpfs p-fs-gpfs \
> > meta interleave="true"
> > location loc-fs-gpfs-needs-ib0 c-fs-gpfs \
> > rule -inf: not_defined "ethmonitor-ib0" or "ethmonitor-ib0" eq 0
> > location loc-fs-gpfs-on-vmhosts c-fs-gpfs \
> > rule 0: nodetype eq "vmhost"
> > 
> > That all used to start nicely. Now even if I set the ethmonitor-ib0
> > attribute, it doesn't. However, I can use "crm_resource --force-start -r
> > p-fs-gpfs" on each of my VM hosts, then issue "crm resource cleanup
> > c-fs-gpfs", and all is well. I can use "crm status" to see something like:
> > 
> > 
> > 
> > Last updated: Tue Oct 22 16:35:43 2013
> > Last change: Tue Oct 22 15:50:52 2013 via crmd on cvmh01
> > Stack: cman
> > Current DC: cvmh04 - partition with quorum
> > Version: 1.1.10-19.el6.ccni-b6d42ed
> > 8 Nodes configured
> > 92 Resources configured
> > 
> > 
> > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > 
> > fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh02 (stonith:fence_ipmilan): Started cvmh01
> > fence-cvmh03 (stonith:fence_ipmilan): Started cvmh01
> > fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> > Clone Set: c-fs-gpfs [p-fs-gpfs]
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > which is what I would expect (other than I expect pacemaker to have started
> > these for me, like it used to).
> > 
> > Now I also have clone resources to NFS-mount another file system, and
> > actually do a bind mount out of the GPFS file system, which behave like the
> > GPFS resource -- they used to just work, now I need to use "crm_resource
> > --force-start" and clean up. That finally lets me start libvirtd, using this
> > configuration:
> > 
> > 
> > 
> > 
> > primitive p-libvirtd lsb:libvirtd \
> > op monitor interval="30s"
> > clone c-p-libvirtd p-libvirtd \
> > meta interleave="true"
> > order o-libvirtd-after-storage inf: \
> > ( c-fs-libvirt-VM-xcm c-fs-bind-libvirt-VM-cvmh ) \
> > c-p-libvirtd
> > location loc-libvirtd-on-vmhosts c-p-libvirtd \
> > rule 0: nodetype eq "vmhost"
> > 
> > Of course that used to just work, but now, like the other clones, I need to
> > force-start libvirtd on the VM hosts, and clean up. Once I do that, all my
> > VM resources, which are not clones, just start up like they are supposed to!
> > Several of these are configured as remote nodes, and they have services
> > configured to run in them. But now other strange things happen:
> > 
> > 
> > 
> > 
> > Last updated: Tue Oct 22 16:46:29 2013
> > Last change: Tue Oct 22 15:50:52 2013 via crmd on cvmh01
> > Stack: cman
> > Current DC: cvmh04 - partition with quorum
> > Version: 1.1.10-19.el6.ccni-b6d42ed
> > 8 Nodes configured
> > 92 Resources configured
> > 
> > 
> > ContainerNode slurmdb02:vm-slurmdb02: UNCLEAN (offline)
> > Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Containers: [ db02:vm-db02 ldap01:vm-ldap01 ldap02:vm-ldap02 ]
> > 
> > fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> > fence-cvmh02 (stonith:fence_ipmilan): Started cvmh01
> > fence-cvmh03 (stonith:fence_ipmilan): Started cvmh01
> > fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> > Clone Set: c-p-libvirtd [p-libvirtd]
> > p-libvirtd (lsb:libvirtd): FAILED slurmdb02
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 ]
> > Clone Set: c-watch-ib0 [p-watch-ib0]
> > p-watch-ib0 (ocf::heartbeat:ethmonitor): FAILED slurmdb02
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 ]
> > Clone Set: c-fs-gpfs [p-fs-gpfs]
> > p-fs-gpfs (ocf::ccni:gpfs): FAILED slurmdb02
> > Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> > Stopped: [ db02 ldap01 ldap02 ]
> > vm-compute-test (ocf::ccni:xcatVirtualDomain): FAILED [ cvmh04 slurmdb02 ]
> > vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): FAILED slurmdb02
> > vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh01
> > vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh02
> > vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> > p-postgres (ocf::heartbeat:pgsql): FAILED [ db02 slurmdb02 ]
> > p-mysql (ocf::heartbeat:mysql): FAILED [ db02 slurmdb02 ]
> > Clone Set: c-fs-share-config-data [fs-share-config-data]
> > fs-share-config-data (ocf::heartbeat:Filesystem): FAILED slurmdb02
> > Stopped: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ldap01 ldap02 ]
> > p-mysql-slurm (ocf::heartbeat:mysql): FAILED slurmdb02
> > p-slurmdbd (ocf::ccni:SlurmDBD): FAILED slurmdb02
> > Clone Set: c-ldapagent [s-ldapagent]
> > s-ldapagent (ocf::ccni:WrapInitScript): FAILED slurmdb02
> > Stopped: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ldap01 ldap02 ]
> > Clone Set: c-ldap [s-ldap]
> > s-ldap (ocf::ccni:WrapInitScript): FAILED slurmdb02
> > Started: [ ldap01 ldap02 ]
> > Stopped: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ]
> > 
> > Now this is unexpected for a couple of reasons. I do have constraints like:
> > 
> > 
> > 
> > 
> > location loc-vm-swbuildsl6 vm-swbuildsl6 \
> > rule $id="loc-vm-swbuildsl6-rule" 0: nodetype eq vmhost
> > order o-vm-swbuildsl6 inf: c-p-libvirtd vm-swbuildsl6
> > 
> > And it is not the case that slurmdb02 has the vmhost attribute set; using
> > "crm_mon -o -1 -N -A" we see:
> > 
> > 
> > 
> > 
> > Node Attributes:
> > * Node cvmh01:
> > + ethmonitor-ib0 : 1
> > + nodetype : vmhost
> > * Node cvmh02:
> > + ethmonitor-ib0 : 1
> > + nodetype : vmhost
> > * Node cvmh03:
> > + ethmonitor-ib0 : 1
> > + nodetype : vmhost
> > * Node cvmh04:
> > + ethmonitor-ib0 : 1
> > + nodetype : vmhost
> > * Node db02:
> > * Node ldap01:
> > * Node ldap02:
> > * Node slurmdb02:
> > 
> > The results are unexpected to me also because I (perhaps naively) wouldn't
> > expect it to show me the new nodes on the "stopped" lines -- I kind of
> > expected a location rule to limit where clones would even be attempted. For
> > example, with the rule limiting c-p-libvirtd to the vmhosts, I don't really
> > expect to be told that the clones are stopped on the remote VM nodes db02,
> > ldap01, and ldap02 (let alone be started on slurmdb02!).
> > 
> > Until I wrote this note, even the cloned ldap resource c-ldap needed to be
> > started using force-start. Not sure why this time it started on its own...
> > Perhaps this stack trace in the core dump pacemaker left on one of the VM
> > hosts has a clue?
> > 
> > 
> > 
> > 
> > 
> > #0 0x00007f121e9ac8e5 in raise () from /lib64/libc.so.6
> > #1 0x00007f121e9ae0c5 in abort () from /lib64/libc.so.6
> > #2 0x00007f121e9ea7f7 in __libc_message () from /lib64/libc.so.6
> > #3 0x00007f121e9f0126 in malloc_printerr () from /lib64/libc.so.6
> > #4 0x00007f121e9f05ad in malloc_consolidate () from /lib64/libc.so.6
> > #5 0x00007f121e9f33c5 in _int_malloc () from /lib64/libc.so.6
> > #6 0x00007f121e9f45e6 in calloc () from /lib64/libc.so.6
> > #7 0x00007f121e9e91ed in open_memstream () from /lib64/libc.so.6
> > #8 0x00007f121ea5ebdb in __vsyslog_chk () from /lib64/libc.so.6
> > #9 0x00007f121ea5f1b3 in __syslog_chk () from /lib64/libc.so.6
> > #10 0x00007f121e72b9fb in ?? () from /usr/lib64/libqb.so.0
> > #11 0x00007f121e72a6a2 in qb_log_real_va_ () from /usr/lib64/libqb.so.0
> > #12 0x00007f121e72a91d in qb_log_real_ () from /usr/lib64/libqb.so.0
> > #13 0x000000000042e994 in te_rsc_command (graph=0x20c7b40, action=0x23b0c90) at te_actions.c:412
> 
> This is crashing at a log message. Apparently we are trying to plug a NULL
> pointer into one of the format string's "%s" entries. Looking at that log
> message, none of those values should be NULL, so something is wrong here.
> 
> 
> > #14 0x0000003a64404019 in initiate_action (graph=0x20c7b40) at graph.c:172
> > #15 fire_synapse (graph=0x20c7b40) at graph.c:211
> > #16 run_graph (graph=0x20c7b40) at graph.c:366
> > #17 0x000000000042f8cd in te_graph_trigger (user_data=<value optimized out>) at te_utils.c:331
> > #18 0x0000003a6202b283 in crm_trigger_dispatch (source=<value optimized out>, callback=<value optimized out>, userdata=<value optimized out>) at mainloop.c:105
> > #19 0x00000038b3c38f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> > #20 0x00000038b3c3c938 in ?? () from /lib64/libglib-2.0.so.0
> > #21 0x00000038b3c3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
> > #22 0x00000000004058ee in crmd_init () at main.c:154
> > #23 0x0000000000405c2c in main (argc=1, argv=0x7fffdc207528) at main.c:121
> > 
> > Not sure how to take this further. It has been difficult to characterize
> > what exactly is or isn't happening, and hopefully I've not left out some
> > critical detail. Thanks.
> 
> There is a whole lot going on here, which is making it a bit difficult to
> know where to start. You are using attributes and rules to enable resources.
> The attrd has recently been re-written, which could have caused some of the
> problems you are seeing (especially if you ever attempted to write an
> attribute to a remote node using a build from sometime in September).
> 
> To make this easier to understand I'd recommend this... Get to the point
> where you'd expect a resource to start and it isn't. Capture the cib with
> "cibadmin -Q > cibsnapshot.cib". Pastebin the cib and tell us which resource
> you'd expect to be starting. Then we can try and determine accurately what
> is preventing it from starting. That will at least give us something solid
> to work from.
> 
> -- Vossel
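
The capture step suggested above could be scripted roughly as follows. This is a hedged sketch rather than an official procedure: the helper names `capture_cib` and `simulate_command` are hypothetical, the injectable `run` parameter exists only so the function can be exercised without a cluster, and it assumes `cibadmin` and `crm_simulate` are installed on a cluster node. Replaying the snapshot with `crm_simulate` is an optional extra for local inspection and was not part of the original suggestion.

```python
import subprocess


def capture_cib(path, run=subprocess.run):
    """Dump the live CIB as XML into `path` and return the command used."""
    cmd = ["cibadmin", "--query"]
    # check=True raises CalledProcessError if cibadmin exits non-zero.
    result = run(cmd, check=True, capture_output=True, text=True)
    with open(path, "w") as fh:
        fh.write(result.stdout)
    return cmd


def simulate_command(path):
    """Build a crm_simulate invocation that replays a saved CIB snapshot
    and prints the allocation scores (nothing is executed here)."""
    return ["crm_simulate", "--xml-file", path, "--simulate", "--show-scores"]
```

On a cluster node, calling `capture_cib("cibsnapshot.cib")` writes the snapshot, and passing the list returned by `simulate_command("cibsnapshot.cib")` to `subprocess.run` shows what the policy engine would do with it. Remember to strip fencing passwords from the file before posting it.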
> 
> > /Lindsay
> > 
> > 
> > 
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> > 