[Pacemaker] Asymmetric cluster, clones, and location constraints

Wed Oct 30 02:08:12 EDT 2013

On 25 Oct 2013, at 9:40 am, David Vossel <dvossel at redhat.com> wrote:

> 
> 
> 
> 
> ----- Original Message -----
>> From: "Lindsay Todd" <rltodd.ml1 at gmail.com>
>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>> Sent: Wednesday, October 23, 2013 2:38:17 PM
>> Subject: Re: [Pacemaker] Asymmetric cluster, clones, and location constraints
>> 
>> David,
>> 
>> The Infiniband network takes a nondeterministic amount of time to actually
>> finish initializing, so we use ethmonitor to watch it; the OS is supposed to
>> bring it up at boot time, but it moves on through the boot sequence without
>> actually waiting for it. So in self defense we watch it with pacemaker. I
>> guess I could restructure this to use a resource that brings up IB (with a
>> really long timeout) and use ordering to wait for that to complete, but it
>> seems that ethmonitor would be more adaptive to short-term IB network
>> issues. Since ethmonitor works by setting an attribute (the RA running means
>> it is watching the network, not that the network is up), I've used location
>> constraints instead of ordering constraints.
>> 
>> So I have completely restarted my cluster. Right now the physical nodes see
>> each other, and the fencing agents are running. The first thing that should
>> start are the ethmonitor resource agents on the VM hosts (the c-watch-ib0
>> clones of the p-watch-ib0 primitive). They are not starting (like they used
>> to).
> 
> I see.  Your cib generates an invalid transition.  I'll try and look into it in more detail soon to understand the cause.

According to git bisect, the winner is:

15a86e501a57b50fdb3b8ce0ed432b183c343c74 is the first bad commit
commit 15a86e501a57b50fdb3b8ce0ed432b183c343c74
Author: David Vossel <dvossel at redhat.com>
Date:   Mon Sep 23 18:55:21 2013 -0500

    High: pengine: Probe container nodes

I'll take a look in the morning unless David beats me to it :-)

> 
> One completely unrelated thought I had while looking at your config involves your fencing agents. You shouldn't have to use location constraints at all on the fencing agents. I believe stonith is smart enough now to execute the agent on a node that isn't the target, regardless of where the policy engine puts it.
> 
> -- Vossel
> 
>> The cib snapshot can be seen in http://pastebin.com/TccTHQPS (some
>> slight editing to hide passwords in fencing agents).
>> 
>> /Lindsay
>> 
>> 
>> On Wed, Oct 23, 2013 at 11:20 AM, David Vossel < dvossel at redhat.com > wrote:
>> 
>> 
>> 
>> ----- Original Message -----
>>> From: "Lindsay Todd" < rltodd.ml1 at gmail.com >
>>> To: "The Pacemaker cluster resource manager" <Pacemaker at oss.clusterlabs.org>
>>> Sent: Tuesday, October 22, 2013 4:19:11 PM
>>> Subject: [Pacemaker] Asymmetric cluster, clones, and location constraints
>>> 
>>> I am getting rather unexpected behavior when I combine clones, location
>>> constraints, and remote nodes in an asymmetric cluster. My cluster is
>>> configured to be asymmetric, distinguishing between vmhosts and various
>>> sorts of remote nodes. Currently I am running upstream version b6d42ed. I am
>>> simplifying my description to avoid confusion, hoping in so doing I don't
>>> miss any salient points...
>>> 
>>> My physical cluster nodes, also the VM hosts, have the attribute
>>> "nodetype=vmhost". They also have Infiniband interfaces, which take some
>>> time to come up. I don't want my shared file system (which needs IB), or
>>> libvirtd (which needs the file system), to come up before IB... So I have
>>> this in my configuration:
>>> 
>>> 
>>> 
>>> 
>>> primitive p-watch-ib0 ocf:heartbeat:ethmonitor \
>>> params \
>>> interface="ib0" \
>>> op monitor timeout="100s" interval="10s"
>>> clone c-watch-ib0 p-watch-ib0 \
>>> meta interleave="true"
>>> #
>>> location loc-watch-ib-only-vmhosts c-watch-ib0 \
>>> rule 0: nodetype eq "vmhost"
>>> 
>>> Something broke between upstream versions 0a2570a and c68919f -- the
>>> c-watch-ib0 clone never starts. I've found that if I run "crm_resource
>>> --force-start -r p-watch-ib0" when IB is running, the ethmonitor-ib0
>>> attribute is not set like it used to be. Oh well, I can set it manually. So
>>> let's.
>> 
>> A re-write of the attrd component was introduced around that time period.
>> This should have been resolved at this point in the b6d42ed build.
>> 
>>> We use GPFS for a shared file system, so I have an agent to start it and wait
>>> for a file system to mount. It should only run on VM hosts, and only when IB
>>> is running. So I have this:
>> 
>> So the IB resource is setting some attribute that enables the fs to run? Why
>> can't an ordering constraint be used here between IB and FS?
>> 
>>> 
>>> 
>>> 
>>> 
>>> primitive p-fs-gpfs ocf:ccni:gpfs \
>>> params \
>>> fspath="/gpfs/lb/utility" \
>>> op monitor timeout="20s" interval="30s" \
>>> op start timeout="180s" \
>>> op stop timeout="120s"
>>> clone c-fs-gpfs p-fs-gpfs \
>>> meta interleave="true"
>>> location loc-fs-gpfs-needs-ib0 c-fs-gpfs \
>>> rule -inf: not_defined "ethmonitor-ib0" or "ethmonitor-ib0" eq 0
>>> location loc-fs-gpfs-on-vmhosts c-fs-gpfs \
>>> rule 0: nodetype eq "vmhost"
>>> 
>>> That all used to start nicely. Now even if I set the ethmonitor-ib0
>>> attribute, it doesn't. However, I can use "crm_resource --force-start -r
>>> p-fs-gpfs" on each of my VM hosts, then issue "crm resource cleanup
>>> c-fs-gpfs", and all is well. I can use "crm status" to see something like:
>>> 
>>> 
>>> 
>>> Last updated: Tue Oct 22 16:35:43 2013
>>> Last change: Tue Oct 22 15:50:52 2013 via crmd on cvmh01
>>> Stack: cman
>>> Current DC: cvmh04 - partition with quorum
>>> Version: 1.1.10-19.el6.ccni-b6d42ed
>>> 8 Nodes configured
>>> 92 Resources configured
>>> 
>>> 
>>> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
>>> 
>>> fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
>>> fence-cvmh02 (stonith:fence_ipmilan): Started cvmh01
>>> fence-cvmh03 (stonith:fence_ipmilan): Started cvmh01
>>> fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
>>> Clone Set: c-fs-gpfs [p-fs-gpfs]
>>> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
>>> 
>>> which is what I would expect (except that I expect Pacemaker to have started
>>> these for me, like it used to).
>>> 
>>> Now I also have clone resources to NFS-mount another file system, and
>>> actually do a bind mount out of the GPFS file system, which behave like the
>>> GPFS resource -- they used to just work, now I need to use "crm_resource
>>> --force-start" and clean up. That finally lets me start libvirtd, using this
>>> configuration:
>>> 
>>> 
>>> 
>>> 
>>> primitive p-libvirtd lsb:libvirtd \
>>> op monitor interval="30s"
>>> clone c-p-libvirtd p-libvirtd \
>>> meta interleave="true"
>>> order o-libvirtd-after-storage inf: \
>>> ( c-fs-libvirt-VM-xcm c-fs-bind-libvirt-VM-cvmh ) \
>>> c-p-libvirtd
>>> location loc-libvirtd-on-vmhosts c-p-libvirtd \
>>> rule 0: nodetype eq "vmhost"
>>> 
>>> Of course that used to just work, but now, like the other clones, I need to
>>> force-start libvirtd on the VM hosts, and clean up. Once I do that, all my
>>> VM resources, which are not clones, just start up like they are supposed to!
>>> Several of these are configured as remote nodes, and they have services
>>> configured to run in them. But now other strange things happen:
>>> 
>>> 
>>> 
>>> 
>>> Last updated: Tue Oct 22 16:46:29 2013
>>> Last change: Tue Oct 22 15:50:52 2013 via crmd on cvmh01
>>> Stack: cman
>>> Current DC: cvmh04 - partition with quorum
>>> Version: 1.1.10-19.el6.ccni-b6d42ed
>>> 8 Nodes configured
>>> 92 Resources configured
>>> 
>>> 
>>> ContainerNode slurmdb02:vm-slurmdb02: UNCLEAN (offline)
>>> Online: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
>>> Containers: [ db02:vm-db02 ldap01:vm-ldap01 ldap02:vm-ldap02 ]
>>> 
>>> fence-cvmh01 (stonith:fence_ipmilan): Started cvmh04
>>> fence-cvmh02 (stonith:fence_ipmilan): Started cvmh01
>>> fence-cvmh03 (stonith:fence_ipmilan): Started cvmh01
>>> fence-cvmh04 (stonith:fence_ipmilan): Started cvmh01
>>> Clone Set: c-p-libvirtd [p-libvirtd]
>>> p-libvirtd (lsb:libvirtd): FAILED slurmdb02
>>> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
>>> Stopped: [ db02 ldap01 ldap02 ]
>>> Clone Set: c-watch-ib0 [p-watch-ib0]
>>> p-watch-ib0 (ocf::heartbeat:ethmonitor): FAILED slurmdb02
>>> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
>>> Stopped: [ db02 ldap01 ldap02 ]
>>> Clone Set: c-fs-gpfs [p-fs-gpfs]
>>> p-fs-gpfs (ocf::ccni:gpfs): FAILED slurmdb02
>>> Started: [ cvmh01 cvmh02 cvmh03 cvmh04 ]
>>> Stopped: [ db02 ldap01 ldap02 ]
>>> vm-compute-test (ocf::ccni:xcatVirtualDomain): FAILED [ cvmh04 slurmdb02 ]
>>> vm-swbuildsl6 (ocf::ccni:xcatVirtualDomain): FAILED slurmdb02
>>> vm-db02 (ocf::ccni:xcatVirtualDomain): Started cvmh01
>>> vm-ldap01 (ocf::ccni:xcatVirtualDomain): Started cvmh02
>>> vm-ldap02 (ocf::ccni:xcatVirtualDomain): Started cvmh03
>>> p-postgres (ocf::heartbeat:pgsql): FAILED [ db02 slurmdb02 ]
>>> p-mysql (ocf::heartbeat:mysql): FAILED [ db02 slurmdb02 ]
>>> Clone Set: c-fs-share-config-data [fs-share-config-data]
>>> fs-share-config-data (ocf::heartbeat:Filesystem): FAILED slurmdb02
>>> Stopped: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ldap01 ldap02 ]
>>> p-mysql-slurm (ocf::heartbeat:mysql): FAILED slurmdb02
>>> p-slurmdbd (ocf::ccni:SlurmDBD): FAILED slurmdb02
>>> Clone Set: c-ldapagent [s-ldapagent]
>>> s-ldapagent (ocf::ccni:WrapInitScript): FAILED slurmdb02
>>> Stopped: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ldap01 ldap02 ]
>>> Clone Set: c-ldap [s-ldap]
>>> s-ldap (ocf::ccni:WrapInitScript): FAILED slurmdb02
>>> Started: [ ldap01 ldap02 ]
>>> Stopped: [ cvmh01 cvmh02 cvmh03 cvmh04 db02 ]
>>> 
>>> Now this is unexpected for a couple of reasons. I do have constraints like:
>>> 
>>> 
>>> 
>>> 
>>> location loc-vm-swbuildsl6 vm-swbuildsl6 \
>>> rule $id="loc-vm-swbuildsl6-rule" 0: nodetype eq vmhost
>>> order o-vm-swbuildsl6 inf: c-p-libvirtd vm-swbuildsl6
>>> 
>>> And it is not the case that slurmdb02 has the vmhost attribute set; using
>>> "crm_mon -o -1 -N -A" we see:
>>> 
>>> 
>>> 
>>> 
>>> Node Attributes:
>>> * Node cvmh01:
>>> + ethmonitor-ib0 : 1
>>> + nodetype : vmhost
>>> * Node cvmh02:
>>> + ethmonitor-ib0 : 1
>>> + nodetype : vmhost
>>> * Node cvmh03:
>>> + ethmonitor-ib0 : 1
>>> + nodetype : vmhost
>>> * Node cvmh04:
>>> + ethmonitor-ib0 : 1
>>> + nodetype : vmhost
>>> * Node db02:
>>> * Node ldap01:
>>> * Node ldap02:
>>> * Node slurmdb02:
>>> 
>>> The results are unexpected to me also because I (perhaps naively) wouldn't
>>> expect it to show me the new nodes on the "stopped" lines -- I kind of
>>> expected a location rule to limit where clones would even be attempted. For
>>> example, with the rule limiting c-p-libvirtd to the vmhosts, I don't really
>>> expect to be told that the clones are stopped on the remote VM nodes db02,
>>> ldap01, and ldap02 (let alone be started on slurmdb02!).
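The expectation described above can be written down as a small model. This is only a sketch of the intended opt-in semantics, not Pacemaker's actual policy engine: the helper names (`ib0_unavailable`, `is_vmhost`, `eligible_nodes`) are hypothetical, and it assumes that in an asymmetric cluster a node is a candidate only when at least one location rule matches it and the summed score is not -INFINITY. The node attributes are the ones shown in the crm_mon output above, and the rules mirror the two location constraints on c-fs-gpfs.

```python
NEG_INF = float("-inf")

# Node attributes as reported by "crm_mon -o -1 -N -A" in this thread.
NODES = {
    "cvmh01": {"nodetype": "vmhost", "ethmonitor-ib0": "1"},
    "cvmh02": {"nodetype": "vmhost", "ethmonitor-ib0": "1"},
    "cvmh03": {"nodetype": "vmhost", "ethmonitor-ib0": "1"},
    "cvmh04": {"nodetype": "vmhost", "ethmonitor-ib0": "1"},
    "db02": {},
    "ldap01": {},
    "ldap02": {},
    "slurmdb02": {},
}


def ib0_unavailable(attrs):
    # rule -inf: not_defined "ethmonitor-ib0" or "ethmonitor-ib0" eq 0
    return "ethmonitor-ib0" not in attrs or attrs["ethmonitor-ib0"] == "0"


def is_vmhost(attrs):
    # rule 0: nodetype eq "vmhost"
    return attrs.get("nodetype") == "vmhost"


# (score, predicate) pairs standing in for the location rules on c-fs-gpfs.
RULES = [(NEG_INF, ib0_unavailable), (0, is_vmhost)]


def eligible_nodes(nodes, rules):
    """Return the nodes on which a clone instance would be expected to run.

    Assumption (opt-in cluster): a node with no matching rule is never a
    candidate, and any matching -inf rule bans the node outright.
    """
    allowed = []
    for name, attrs in nodes.items():
        matched = [score for score, applies in rules if applies(attrs)]
        if matched and sum(matched) > NEG_INF:
            allowed.append(name)
    return sorted(allowed)


if __name__ == "__main__":
    print(eligible_nodes(NODES, RULES))
```

Under these assumptions only the four VM hosts qualify, and the remote nodes (including slurmdb02) would not be considered at all, which is the behaviour the configuration was written to get.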
>>> 
>>> Until I wrote this note, even the cloned ldap resource c-ldap needed to be
>>> started using force-start. Not sure why this time it started on its own...
>>> Perhaps this stack trace in the core dump pacemaker left on one of the VM
>>> hosts has a clue?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> #0 0x00007f121e9ac8e5 in raise () from /lib64/libc.so.6
>>> #1 0x00007f121e9ae0c5 in abort () from /lib64/libc.so.6
>>> #2 0x00007f121e9ea7f7 in __libc_message () from /lib64/libc.so.6
>>> #3 0x00007f121e9f0126 in malloc_printerr () from /lib64/libc.so.6
>>> #4 0x00007f121e9f05ad in malloc_consolidate () from /lib64/libc.so.6
>>> #5 0x00007f121e9f33c5 in _int_malloc () from /lib64/libc.so.6
>>> #6 0x00007f121e9f45e6 in calloc () from /lib64/libc.so.6
>>> #7 0x00007f121e9e91ed in open_memstream () from /lib64/libc.so.6
>>> #8 0x00007f121ea5ebdb in __vsyslog_chk () from /lib64/libc.so.6
>>> #9 0x00007f121ea5f1b3 in __syslog_chk () from /lib64/libc.so.6
>>> #10 0x00007f121e72b9fb in ?? () from /usr/lib64/libqb.so.0
>>> #11 0x00007f121e72a6a2 in qb_log_real_va_ () from /usr/lib64/libqb.so.0
>>> #12 0x00007f121e72a91d in qb_log_real_ () from /usr/lib64/libqb.so.0
>>> #13 0x000000000042e994 in te_rsc_command (graph=0x20c7b40, action=0x23b0c90) at te_actions.c:412
>> 
>> This is crashing at a log message. Apparently we are trying to plug a "NULL"
>> pointer into one of the format string's "%s" entries. Looking at that log
>> message, none of those values should be NULL, so something is wrong here.
>> 
>> 
>>> #14 0x0000003a64404019 in initiate_action (graph=0x20c7b40) at graph.c:172
>>> #15 fire_synapse (graph=0x20c7b40) at graph.c:211
>>> #16 run_graph (graph=0x20c7b40) at graph.c:366
>>> #17 0x000000000042f8cd in te_graph_trigger (user_data=<value optimized out>) at te_utils.c:331
>>> #18 0x0000003a6202b283 in crm_trigger_dispatch (source=<value optimized out>, callback=<value optimized out>, userdata=<value optimized out>) at mainloop.c:105
>>> #19 0x00000038b3c38f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>>> #20 0x00000038b3c3c938 in ?? () from /lib64/libglib-2.0.so.0
>>> #21 0x00000038b3c3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
>>> #22 0x00000000004058ee in crmd_init () at main.c:154
>>> #23 0x0000000000405c2c in main (argc=1, argv=0x7fffdc207528) at main.c:121
>>> 
>>> Not sure how to take this further. It has been difficult to characterize what
>>> exactly is or isn't happening, and hopefully I've not left out some critical
>>> detail. Thanks.
>> 
>> There is a whole lot going on here, which is making it a bit difficult to
>> know where to start. You are using attributes and rules to enable resources.
>> The attrd has recently been re-written, which could have caused some of the
>> problems you are seeing (especially if you ever attempted to write an
>> attribute to a remote node using a build from sometime in September).
>> 
>> To make this easier to understand I'd recommend this... Get to the point
>> where you'd expect a resource to start and it isn't. Capture the cib with
>> "cibadmin -Q > cibsnapshot.cib". Pastebin the cib and tell us which resource
>> you'd expect to be starting. Then we can try and determine accurately what
>> is preventing it from starting. That will at least give us something solid
>> to work from.
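Before a snapshot like this goes onto a public pastebin, the fencing passwords need to be masked, as was done by hand for the snapshot posted in this thread. A minimal sketch of that step follows. It assumes the secrets are stored as nvpair elements whose name is "passwd" or "password"; that list is a guess and should be checked against the fence agents actually in use. The function name `redact_cib` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed parameter names that hold secrets; adjust for the agents in use.
SECRET_NAMES = {"passwd", "password"}


def redact_cib(xml_text, secret_names=SECRET_NAMES, mask="****"):
    """Return the CIB XML with the value of every secret nvpair masked."""
    root = ET.fromstring(xml_text)
    for nvpair in root.iter("nvpair"):
        if nvpair.get("name") in secret_names:
            nvpair.set("value", mask)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Read a saved snapshot and write a copy that is safe to share.
    with open("cibsnapshot.cib") as src:
        cleaned = redact_cib(src.read())
    with open("cibsnapshot-redacted.cib", "w") as dst:
        dst.write(cleaned)
```

Only the attribute values are changed; resource ids, constraints and status entries stay as they were, so the redacted file should still be usable for reproducing a transition.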
>> 
>> -- Vossel
>> 
>>> /Lindsay
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>> 