[Pacemaker] pacemaker-remote tls handshaking

Tue May 21 17:14:01 EDT 2013

----- Original Message -----
> From: "Lindsay Todd" <rltodd.ml1 at gmail.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Monday, May 20, 2013 5:49:28 PM
> Subject: Re: [Pacemaker] pacemaker-remote tls handshaking
> 
> The man page for gnutls_handshake certainly suggests it should handle a
> non-blocking socket. I'd been wondering if somehow time_t was unsigned and
> rearranged some of the timeout calculations accordingly, but that hasn't
> changed behavior. It looks to me like it is pacemaker_remoted that is
> choosing to drop the connection.

I tested the latest pacemaker code, and my test environment (integrating virtual machines as remote nodes) still works for me.
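
On the non-blocking question: as far as I know, gnutls_handshake reports GNUTLS_E_AGAIN or GNUTLS_E_INTERRUPTED when it would block, and the caller is expected to call it again once the socket is ready. Here is a rough sketch of that retry pattern. It is written in Python with the standard ssl module as a stand-in for gnutls, so it is an analogy and not the pacemaker code; the function name and the 5 second timeout are made up for illustration.

```python
import select
import ssl
import time


def handshake_nonblocking(tls_sock, timeout_s=5.0):
    """Drive a TLS handshake on a non-blocking socket.

    Returns True when the handshake completes, False if the deadline
    passes first. Fatal TLS errors propagate as exceptions.
    """
    # Use a monotonic clock so the remaining time can be compared
    # against zero without any wrap-around or clock-jump surprises.
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            tls_sock.do_handshake()
            return True
        except ssl.SSLWantReadError:
            want_read, want_write = [tls_sock], []
        except ssl.SSLWantWriteError:
            want_read, want_write = [], [tls_sock]
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Sleep until the socket is ready in the direction TLS asked for,
        # or until the deadline, whichever comes first.
        select.select(want_read, want_write, [], remaining)
```

The point of the sketch is that a "want read" or "want write" result is not a failure; only a fatal error or an expired deadline should drop the connection.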

I'd suggest running the pacemaker_remote regression test to see what happens. It starts an instance of pacemaker_remote locally and issues client commands to it, testing both the TLS connection and the ability to start, stop, and monitor services.

/usr/share/pacemaker/tests/lrmd/regression.py -R

> But I have noticed that sometimes the remote resource fails before the VM has
> completely started up. It might be helpful to have some sort of helper
> script that VirtualDomain resources can use in their monitor functions to
> verify that pacemaker_remoted is running, before the remote resource's
> connection timeout clock even starts running.
> (But making the timeout longer
> doesn't help with the handshaking problem that happens later on.)

By default, the connection should retry for 60 seconds after the VM resource starts. As you've noticed, this can be extended to account for VMs that take longer to boot.
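
If you want to experiment with the helper idea, something along these lines could be a starting point. This is only a sketch in Python: the function name, the polling interval, and the way it would be wired into a VirtualDomain monitor are all assumptions, and the 60 second default just mirrors the retry window mentioned above. It polls the pacemaker_remote port (3121) until a TCP connect succeeds.

```python
import socket
import time


def wait_for_port(host, port=3121, timeout_s=60.0, interval_s=1.0):
    """Return True once a TCP connect to host:port succeeds, else False.

    Keeps retrying until timeout_s has elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening;
            # close it again right away.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused, unreachable, or timed out: the VM is probably
            # still booting, so fall through and retry.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, `wait_for_port("swbuildsl6")` would wait up to a minute for the VM in this thread. Keep in mind that a successful connect only shows that something is listening on the port; it says nothing about whether the TLS handshake will succeed afterwards.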

> 
> There have been a few segfaults of crmd during my testing of this, so perhaps
> there is a memory smash somewhere. (A couple times the failure was at
> remote_lrmd_ra.c:186,

Please provide a gdb backtrace. We need to get this resolved ASAP, before the release of v1.1.10 is complete. I believe a new rc is already in the works.

> and a resource cleanup on the remote resource
> sometimes triggers this.)
> 
> > I doubt this will make a difference, but here's the key I use during
> > testing,
> > lrmd:ce9db0bc3cec583d3b3bf38b0ac9ff91
> 
> Wait, I thought you were creating a binary authkey file and not using a
> username with PSK authentication?

Yep, that's right. It treats the above as a binary authkey even though it looks like a username+key. I just sent that out so we could rule out that your authkey was a problem.
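
Since the key is read as raw bytes, one way to double-check that every node and the VM have exactly the same authkey is to hash the file contents on each machine and compare the results. A small sketch in Python; the function name is made up, and the default path is the /etc/pacemaker/authkey location from your earlier mail.

```python
import hashlib


def authkey_digest(path="/etc/pacemaker/authkey"):
    """Return the SHA-256 hex digest of the key file's raw bytes."""
    # Open in binary mode: the key is opaque bytes, so any text decoding
    # or newline translation would change what gets hashed.
    with open(path, "rb") as key_file:
        return hashlib.sha256(key_file.read()).hexdigest()
```

Running this on each machine and comparing the output is equivalent to the checksum comparison you already did; note that a stray trailing newline in one copy is enough to produce a different digest.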

> 
> /Lindsay
> 
> 
> On Thu, May 16, 2013 at 6:47 PM, David Vossel < dvossel at redhat.com > wrote:
> 
> 
> 
> ----- Original Message -----
> > From: "Lindsay Todd" < rltodd.ml1 at gmail.com >
> > To: "The Pacemaker cluster resource manager" <
> > Pacemaker at oss.clusterlabs.org >
> > Sent: Thursday, May 16, 2013 3:44:09 PM
> > Subject: [Pacemaker] pacemaker-remote tls handshaking
> > 
> > I've built pacemaker 1.1.10rc2 and am trying to get the pacemaker-remote
> > features working on my Scientific Linux 6.4 system. It almost works...
> > 
> > The /etc/pacemaker/authkey file is on all the cluster nodes, as well as my
> > test VM (readable to all users, and checksums are the same everywhere). I
> > can connect via telnet to port 3121 of the VM.
> > 
> > I even see the ghost node
> > appear for my VM when I use either 'crm status' or 'pcs status'. (Aside:
> > crmsh doesn't know about the new meta attributes for remote...)
> > 
> > But the communication isn't quite working. In my log I see:
> > 
> > May 16 15:58:34 cvmh04 crmd[4893]: warning: lrmd_tcp_connect_cb: Client tls handshake failed for server swbuildsl6:3121. Disconnecting
> > May 16 15:58:34 swbuildsl6 pacemaker_remoted[2308]: error: lrmd_remote_client_msg: Remote lrmd tls handshake failed
> > May 16 15:58:35 cvmh04 crmd[4893]: warning: lrmd_tcp_connect_cb: Client tls handshake failed for server swbuildsl6:3121. Disconnecting
> > May 16 15:58:35 swbuildsl6 pacemaker_remoted[2308]: error: lrmd_remote_client_msg: Remote lrmd tls handshake failed
> > 
> > and it isn't long before pacemaker stops trying.
> > 
> > Is there some additional configuration I need?
> 
> Ah, you dared to try my new feature, and this is what you get! :D
> 
> It looks like you have it covered. If you can telnet into the VM from the
> host (it should kick you off pretty quickly), then all the firewall
> rules are correct. I'm not sure what is going on. The only thing I can think
> of is perhaps your gnutls version doesn't like that I'm using a non-blocking
> socket during the tls handshake.
> 
> I doubt this will make a difference, but here's the key I use during testing,
> lrmd:ce9db0bc3cec583d3b3bf38b0ac9ff91
> 
> Has anyone else had success or run into something similar yet? I'll help
> investigate this next week. I'll be out of the office until Tuesday.
> 
> -- Vossel
> 
> > /Lindsay
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> > 