[Pacemaker] Asymmetric cluster, clones, and location constraints

David Vossel dvossel at redhat.com
Thu Oct 31 12:00:01 EDT 2013


----- Original Message -----
> From: "Andrew Beekhof" <andrew at beekhof.net>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Wednesday, October 30, 2013 1:08:12 AM
> Subject: Re: [Pacemaker] Asymmetric cluster, clones, and location constraints
> 
> 
> On 25 Oct 2013, at 9:40 am, David Vossel <dvossel at redhat.com> wrote:
> 
> > 
> > 
> > 
> > 
> > ----- Original Message -----
> >> From: "Lindsay Todd" <rltodd.ml1 at gmail.com>
> >> To: "The Pacemaker cluster resource manager"
> >> <pacemaker at oss.clusterlabs.org>
> >> Sent: Wednesday, October 23, 2013 2:38:17 PM
> >> Subject: Re: [Pacemaker] Asymmetric cluster, clones, and location
> >> constraints
> >> 
> >> David,
> >> 
> >> The Infiniband network takes a nondeterministic amount of time to actually
> >> finish initializing, so we use ethmonitor to watch it; the OS is supposed
> >> to
> >> bring it up at boot time, but it moves on through the boot sequence
> >> without
> >> actually waiting for it. So in self-defense we watch it with Pacemaker. I
> >> guess I could restructure this to use a resource that brings up IB (with a
> >> really long timeout) and use ordering to wait for that to complete, but it
> >> seems that ethmonitor would be more adaptive to short-term IB network
> >> issues. Since ethmonitor works by setting an attribute (the RA running
> >> means
> >> it is watching the network, not that the network is up), I've used
> >> location
> >> constraints instead of ordering constraints.
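
(The attribute-gated pattern described here -- ethmonitor publishing a node
attribute and a location rule keyed on it -- looks roughly like the sketch
below; "c-some-service" is a placeholder, and the real constraints appear
further down in the thread.)

  location loc-needs-ib0 c-some-service \
          rule -inf: not_defined ethmonitor-ib0 or ethmonitor-ib0 eq 0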
> >> 
> >> So I have completely restarted my cluster. Right now the physical nodes
> >> see
> >> each other, and the fencing agents are running. The first things that
> >> should
> >> start are the ethmonitor resource agents on the VM hosts (the c-watch-ib0
> >> clones of the p-watch-ib0 primitive). They are not starting (like they
> >> used
> >> to).
> > 
> > I see.  Your cib generates an invalid transition.  I'll try and look into
> > it in more detail soon to understand the cause.
> 
> According to git bisect, the winner is:

I always knew I was a winner

> 
> 15a86e501a57b50fdb3b8ce0ed432b183c343c74 is the first bad commit
> commit 15a86e501a57b50fdb3b8ce0ed432b183c343c74
> Author: David Vossel <dvossel at redhat.com>
> Date:   Mon Sep 23 18:55:21 2013 -0500
> 
>     High: pengine: Probe container nodes
> 
> 
> I'll take a look in the morning unless David beats me to it :-)

This is a tough one.  I enabled probing container nodes, but didn't anticipate the scenario where there's an ordering constraint involving a container node's "container resource" (the VM).

I have an idea of how to fix this, but the end result might make probing containers useless.  I'll give this some thought.

Until then, there is a really easy workaround for this: set the 'enable-container-probes' global config option to "false".
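
If 'enable-container-probes' behaves like any other cluster property, applying
the workaround might look like the sketch below (the option name is taken from
the message above; verify that your build actually recognizes it):

  # set the option in the cluster-wide crm_config section
  crm_attribute --type crm_config --name enable-container-probes --update false

  # or, equivalently, via the crm shell
  crm configure property enable-container-probes=false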

-- Vossel
  


> > 
> > One completely unrelated thought I had while looking at your config
> > involves your fencing agents. You shouldn't have to use location
> > constraints at all on the fencing agents. I believe stonith is smart enough
> > now to execute the agent on a node that isn't the target regardless of
> > where the policy engine puts it.
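
The sort of rule this makes unnecessary is the usual "don't run the fence
device on its own target" constraint; a hypothetical example, using resource
and node names from the status output further on:

  # hypothetical constraint that should no longer be required
  location l-fence-cvmh01-not-self fence-cvmh01 -inf: cvmh01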
> > 
> > -- Vossel
> > 
> >> The cib snapshot can be seen at http://pastebin.com/TccTHQPS (some
> >> slight editing to hide passwords in fencing agents).
> >> 
> >> /Lindsay
> >> 
> >> 
> >> On Wed, Oct 23, 2013 at 11:20 AM, David Vossel < dvossel at redhat.com >
> >> wrote:
> >> 
> >> 
> >> 
> >> ----- Original Message -----
> >>> From: "Lindsay Todd" < rltodd.ml1 at gmail.com >
> >>> To: "The Pacemaker cluster resource manager" <
> >>> Pacemaker at oss.clusterlabs.org >
> >>> Sent: Tuesday, October 22, 2013 4:19:11 PM
> >>> Subject: [Pacemaker] Asymmetric cluster, clones, and location constraints
> >>> 
> >>> I am getting rather unexpected behavior when I combine clones, location
> >>> constraints, and remote nodes in an asymmetric cluster. My cluster is
> >>> configured to be asymmetric, distinguishing between vmhosts and various
> >>> sorts of remote nodes. Currently I am running upstream version b6d42ed. I
> >>> am
> >>> simplifying my description to avoid confusion, hoping in so doing I don't
> >>> miss any salient points...
> >>> 
> >>> My physical cluster nodes, also the VM hosts, have the attribute
> >>> "nodetype=vmhost". They also have Infiniband interfaces, which take some
> >>> time to come up. I don't want my shared file system (which needs IB), or
> >>> libvirtd (which needs the file system), to come up before IB... So I have
> >>> this in my configuration:
> >>> 
> >>> 
> >>> 
> >>> 
> >>> primitive p-watch-ib0 ocf:heartbeat:ethmonitor \
> >>>     params \
> >>>         interface="ib0" \
> >>>     op monitor timeout="100s" interval="10s"
> >>> clone c-watch-ib0 p-watch-ib0 \
> >>>     meta interleave="true"
> >>> #
> >>> location loc-watch-ib-only-vmhosts c-watch-ib0 \
> >>>     rule 0: nodetype eq "vmhost"
> >>> 
> >>> Something broke between upstream versions 0a2570a and c68919f -- the
> >>> c-watch-ib0 clone never starts. I've found that if I run "crm_resource
> >>> --force-start -r p-watch-ib0" when IB is running, the ethmonitor-ib0
> >>> attribute is not set like it used to be. Oh well, I can set it manually.
> >>> So
> >>> let's.
> >> 
> >> A rewrite of the attrd component was introduced around that time. This
> >> should already be resolved in the b6d42ed build.
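
Setting the attribute by hand, as mentioned above, might look something like
this (a sketch; the node name comes from the status output later in the
thread, and ethmonitor normally maintains this as a transient attribute):

  # manually set the attribute ethmonitor would normally publish
  crm_attribute --node cvmh01 --name ethmonitor-ib0 --update 1 --lifetime reboot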
> >> 
> >>> We use GPFS for a shared file system, so I have an agent to start it and
> >>> wait
> >>> for a file system to mount. It should only run on VM hosts, and only when
> >>> IB
> >>> is running. So I have this:
> >> 
> >> So the IB resource is setting some attribute that enables the fs to run?
> >> Why can't an ordering constraint be used here between IB and FS?
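
For comparison, the ordering-constraint alternative being suggested might look
roughly like this (a sketch using the clone names from the thread; as Lindsay
noted earlier, ethmonitor merely running does not mean the link is up, which
is why the attribute-based location rule was chosen instead):

  # hypothetical ordering between the IB monitor clone and the GPFS clone
  order o-fs-gpfs-after-watch-ib0 inf: c-watch-ib0 c-fs-gpfs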
> >> 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> primitive p-fs-gpfs ocf:ccni:gpfs \
> >>>     params \
> >>>         fspath="/gpfs/lb/utility" \
> >>>     op monitor timeout="20s" interval="30s" \
> >>>     op start timeout="180s" \
> >>>     op stop timeout="120s"
> >>> clone c-fs-gpfs p-fs-gpfs \
> >>>     meta interleave="true"
> >>> location loc-fs-gpfs-needs-ib0 c-fs-gpfs \
> >>>     rule -inf: not_defined "ethmonitor-ib0" or "ethmonitor-ib0" eq 0
> >>> location loc-fs-gpfs-on-vmhosts c-fs-gpfs \
> >>>     rule 0: nodetype eq "vmhost"
> >>> 
> >>> That all used to start nicely. Now even if I set the ethmonitor-ib0
> >>> attribute, it doesn't. However, I can use "crm_resource --force-start -r
> >>> p-fs-gpfs" on each of my VM hosts, then issue "crm resource cleanup
> >>> c-fs-gpfs", and all is well. I can use "crm status" to see something
> >>> like:
> >>> 
> >>> 
> >>> 
> >>> Last updated: Tue Oct 22 16:35:43 2013
> >>> Last change: Tue Oct 22 15:50:52 2013 via crmd on cvmh01
> >>> Stack: cman
> >>> Current DC: cvmh04 - partition with quorum
> >>> Version: 1.1.10-19.el6.ccni-b6d42ed
> >>> 8 Nodes configured
> >>> 92 Resources configured
> >>> 
> >>> 
> >>> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> >>> 
> >>> fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> >>> fence-cvmh02 (stonith:fence_ipmilan): Started cvmh01
> >>> fence-cvmh03 (stonith:fence_ipmilan): Started cvmh01
> >>> fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> >>> Clone Set: c-fs-gpfs [p-fs-gpfs]
> >>> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> >>> 
> >>> which is what I would expect (other than that I expected pacemaker to have
> >>> started these for me, like it used to).
> >>> 
> >>> Now I also have clone resources to NFS-mount another file system, and
> >>> actually do a bind mount out of the GPFS file system, which behave like
> >>> the
> >>> GPFS resource -- they used to just work; now I need to use "crm_resource
> >>> --force-start" and clean up. That finally lets me start libvirtd, using
> >>> this
> >>> configuration:
> >>> 
> >>> 
> >>> 
> >>> 
> >>> primitive p-libvirtd lsb:libvirtd \
> >>>     op monitor interval="30s"
> >>> clone c-p-libvirtd p-libvirtd \
> >>>     meta interleave="true"
> >>> order o-libvirtd-after-storage inf: \
> >>>     ( c-fs-libvirt-VM-xcm c-fs-bind-libvirt-VM-cvmh ) \
> >>>     c-p-libvirtd
> >>> location loc-libvirtd-on-vmhosts c-p-libvirtd \
> >>>     rule 0: nodetype eq "vmhost"
> >>> 
> >>> Of course that used to just work, but now, like the other clones, I need
> >>> to
> >>> force-start libvirtd on the VM hosts, and clean up. Once I do that, all
> >>> my
> >>> VM resources, which are not clones, just start up like they are supposed
> >>> to!
> >>> Several of these are configured as remote nodes, and they have services
> >>> configured to run in them. But now other strange things happen:
> >>> 
> >>> 
> >>> 
> >>> 
> >>> Last updated: Tue Oct 22 16:46:29 2013
> >>> Last change: Tue Oct 22 15:50:52 2013 via crmd on cvmh01
> >>> Stack: cman
> >>> Current DC: cvmh04 - partition with quorum
> >>> Version: 1.1.10-19.el6.ccni-b6d42ed
> >>> 8 Nodes configured
> >>> 92 Resources configured
> >>> 
> >>> 
> >>> ContainerNode slurmdb02:vm-slurmdb02: UNCLEAN (offline)
> >>> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> >>> Containers: [ db02:vm-db02 ldap01:vm-ldap01 ldap02:vm-ldap02 ]
> >>> 
> >>> fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
> >>> fence-cvmh02 (stonith:fence_ipmilan): Started cvmh01
> >>> fence-cvmh03 (stonith:fence_ipmilan): Started cvmh01
> >>> fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
> >>> Clone Set: c-p-libvirtd [p-libvirtd]
> >>> p-libvirtd (lsb:libvirtd): FAILED slurmdb02
> >>> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> >>> Stopped: [ db02 ldap01 ldap02 ]
> >>> Clone Set: c-watch-ib0 [p-watch-ib0]
> >>> p-watch-ib0 (ocf::heartbeat:ethmonitor): FAILED slurmdb02
> >>> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> >>> Stopped: [ db02 ldap01 ldap02 ]
> >>> Clone Set: c-fs-gpfs [p-fs-gpfs]
> >>> p-fs-gpfs (ocf::ccni:gpfs): FAILED slurmdb02
> >>> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
> >>> Stopped: [ db02 ldap01 ldap02 ]
> >>> vm-compute-test (ocf::ccni:xcatVirtualDomain): FAILED [ cvmh04 slurmdb02 ]
> >>> vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): FAILED slurmdb02
> >>> vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh01
> >>> vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh02
> >>> vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh03
> >>> p-postgres (ocf::heartbeat:pgsql): FAILED [ db02 slurmdb02 ]
> >>> p-mysql (ocf::heartbeat:mysql): FAILED [ db02 slurmdb02 ]
> >>> Clone Set: c-fs-share-config-data [fs-share-config-data]
> >>> fs-share-config-data (ocf::heartbeat:Filesystem): FAILED slurmdb02
> >>> Stopped: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ldap01 ldap02 ]
> >>> p-mysql-slurm (ocf::heartbeat:mysql): FAILED slurmdb02
> >>> p-slurmdbd (ocf::ccni:SlurmDBD): FAILED slurmdb02
> >>> Clone Set: c-ldapagent [s-ldapagent]
> >>> s-ldapagent (ocf::ccni:WrapInitScript): FAILED slurmdb02
> >>> Stopped: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ldap01 ldap02 ]
> >>> Clone Set: c-ldap [s-ldap]
> >>> s-ldap (ocf::ccni:WrapInitScript): FAILED slurmdb02
> >>> Started: [ ldap01 ldap02 ]
> >>> Stopped: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ]
> >>> 
> >>> Now this is unexpected for a couple of reasons. I do have constraints
> >>> like:
> >>> 
> >>> 
> >>> 
> >>> 
> >>> location loc-vm-swbuildsl6 vm-swbuildsl6 \
> >>>     rule $id="loc-vm-swbuildsl6-rule" 0: nodetype eq vmhost
> >>> order o-vm-swbuildsl6 inf: c-p-libvirtd vm-swbuildsl6
> >>> 
> >>> And it is not the case that slurmdb02 has the vmhost attribute set; using
> >>> "crm_mon -o -1 -N -A" we see:
> >>> 
> >>> 
> >>> 
> >>> 
> >>> Node Attributes:
> >>> * Node cvmh01:
> >>> + ethmonitor-ib0 : 1
> >>> + nodetype : vmhost
> >>> * Node cvmh02:
> >>> + ethmonitor-ib0 : 1
> >>> + nodetype : vmhost
> >>> * Node cvmh03:
> >>> + ethmonitor-ib0 : 1
> >>> + nodetype : vmhost
> >>> * Node cvmh04:
> >>> + ethmonitor-ib0 : 1
> >>> + nodetype : vmhost
> >>> * Node db02:
> >>> * Node ldap01:
> >>> * Node ldap02:
> >>> * Node slurmdb02:
> >>> 
> >>> The results are unexpected to me also because I (perhaps naively)
> >>> wouldn't
> >>> expect it to show me the new nodes on the "stopped" lines -- I kind of
> >>> expected a location rule to limit where clones would even be attempted.
> >>> For
> >>> example, with the rule limiting c-p-libvirtd to the vmhosts, I don't
> >>> really
> >>> expect to be told that the clones are stopped on the remote VM nodes
> >>> db02,
> >>> ldap01, and ldap02 (let alone be started on slurmdb02!).
> >>> 
> >>> Until I wrote this note, even the cloned ldap resource c-ldap needed to
> >>> be
> >>> started using force-start. Not sure why this time it started on its
> >>> own...
> >>> Perhaps this stack trace in the core dump pacemaker left on one of the VM
> >>> hosts has a clue?
> >>> 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> #0 0x00007f121e9ac8e5 in raise () from /lib64/libc.so.6
> >>> #1 0x00007f121e9ae0c5 in abort () from /lib64/libc.so.6
> >>> #2 0x00007f121e9ea7f7 in __libc_message () from /lib64/libc.so.6
> >>> #3 0x00007f121e9f0126 in malloc_printerr () from /lib64/libc.so.6
> >>> #4 0x00007f121e9f05ad in malloc_consolidate () from /lib64/libc.so.6
> >>> #5 0x00007f121e9f33c5 in _int_malloc () from /lib64/libc.so.6
> >>> #6 0x00007f121e9f45e6 in calloc () from /lib64/libc.so.6
> >>> #7 0x00007f121e9e91ed in open_memstream () from /lib64/libc.so.6
> >>> #8 0x00007f121ea5ebdb in __vsyslog_chk () from /lib64/libc.so.6
> >>> #9 0x00007f121ea5f1b3 in __syslog_chk () from /lib64/libc.so.6
> >>> #10 0x00007f121e72b9fb in ?? () from /usr/lib64/libqb.so.0
> >>> #11 0x00007f121e72a6a2 in qb_log_real_va_ () from /usr/lib64/libqb.so.0
> >>> #12 0x00007f121e72a91d in qb_log_real_ () from /usr/lib64/libqb.so.0
> >>> #13 0x000000000042e994 in te_rsc_command (graph=0x20c7b40,
> >>> action=0x23b0c90)
> >>> at te_actions.c:412
> >> 
> >> This is crashing at a log message. Apparently we are trying to plug a
> >> "NULL" pointer into one of the format string's "%s" entries. Looking at
> >> that log message, none of those values should be NULL; something is wrong
> >> here.
> >> 
> >> 
> >>> #14 0x0000003a64404019 in initiate_action (graph=0x20c7b40) at
> >>> graph.c:172
> >>> #15 fire_synapse (graph=0x20c7b40) at graph.c:211
> >>> #16 run_graph (graph=0x20c7b40) at graph.c:366
> >>> #17 0x000000000042f8cd in te_graph_trigger (user_data=<value optimized
> >>> out>)
> >>> at te_utils.c:331
> >>> #18 0x0000003a6202b283 in crm_trigger_dispatch (source=<value optimized
> >>> out>,
> >>> callback=<value optimized out>, userdata=<value optimized out>)
> >>> at mainloop.c:105
> >>> #19 0x00000038b3c38f0e in g_main_context_dispatch ()
> >>> from /lib64/libglib-2.0.so.0
> >>> #20 0x00000038b3c3c938 in ?? () from /lib64/libglib-2.0.so.0
> >>> #21 0x00000038b3c3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
> >>> #22 0x00000000004058ee in crmd_init () at main.c:154
> >>> #23 0x0000000000405c2c in main (argc=1, argv=0x7fffdc207528) at
> >>> main.c:121
> >>> 
> >>> Not sure how to take this further. It has been difficult to characterize
> >>> what
> >>> exactly is or isn't happening, and hopefully I've not left out some
> >>> critical
> >>> detail. Thanks.
> >> 
> >> There is a whole lot going on here, which is making it a bit difficult to
> >> know where to start. You are using attributes and rules to enable
> >> resources. The attrd component has recently been rewritten, which could
> >> have caused some of the problems you are seeing (especially if you ever
> >> attempted to write an attribute to a remote node using a build from
> >> sometime in September).
> >> 
> >> To make this easier to understand, I'd recommend this... Get to the point
> >> where you'd expect a resource to start and it isn't. Capture the cib with
> >> "cibadmin -q > cibsnapshot.cib". Pastebin the cib and tell us which
> >> resource you'd expect to be starting. Then we can try to determine
> >> accurately what is preventing it from starting. That will at least give us
> >> something solid to work from.
> >> 
> >> -- Vossel
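
A minimal capture-and-inspect sequence along those lines (crm_simulate is an
extra step beyond what is suggested above, but it can replay the snapshot
through the policy engine and show placement scores):

  # snapshot the live CIB
  cibadmin -q > cibsnapshot.cib

  # optionally, see what the policy engine would do with that snapshot
  crm_simulate --xml-file cibsnapshot.cib --show-scores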
> >> 
> >>> /Lindsay



