[Pacemaker] pacemaker-remote tls handshaking

Fri May 24 00:21:33 EDT 2013

----- Original Message -----
> From: "Lindsay Todd" <rltodd.ml1 at gmail.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Thursday, May 23, 2013 4:35:02 PM
> Subject: Re: [Pacemaker] pacemaker-remote tls handshaking
> 
> Working on this problem further...
> 
> On Tue, May 21, 2013 at 5:14 PM, David Vossel <dvossel at redhat.com> wrote:
> > I'd suggest this.  Try running the pacemaker_remote regression test and
> > see what happens.  This will start up an instance of pacemaker_remote
> > locally and issue client commands to it to test both the TLS connection
> > and the ability to start/stop/monitor services.
> >
> > /usr/share/pacemaker/tests/lrmd/regression.py  -R
> 
> But sadly SL 6.4 doesn't have the systemctl commands this is trying to

oops

> use.  (Also, I am building RPMs and installing those, and the lrmd
> regression tests aren't included in pacemaker-cts.

another oops

> No problem, I ran
> directly from the build directory.)  It doesn't seem to make much
> progress.  The stdout is:
> 
>     sh: systemctl: command not found
>     sh: /lib/systemd/system/lrmd_dummy_daemon.service: No such file or directory
>     sh: systemctl: command not found
>     Starting ...
> 
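A guard along the following lines would avoid the "command not found" noise on hosts without systemd. This is only a sketch under stated assumptions, not what regression.py actually does: the helper name split_by_availability and the test names in the usage note are hypothetical.

```python
import shutil


def split_by_availability(tests, which=shutil.which):
    """Split (name, required_command) pairs into runnable and skipped names.

    A required_command of None means the test needs no external command.
    `which` is injectable so the logic can be exercised without touching PATH.
    """
    runnable, skipped = [], []
    for name, command in tests:
        if command is None or which(command) is not None:
            runnable.append(name)
        else:
            skipped.append(name)
    return runnable, skipped
```

Called with something like [("generic", None), ("systemd", "systemctl")], it would put "systemd" in the skipped list on a box such as SL 6.4 where systemctl is not on the PATH, so those tests could be reported as skipped instead of failing in the shell.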
> And the lrmd-regression.log has:
>     Set r/w permissions for uid=496, gid=494 on /tmp/lrmd-regression.log
>     May 23 15:14:39 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: lrmd
>     May 23 15:14:39 [3610] swbuildsl6 pacemaker_remoted:   notice: lrmd_init_remote_tls_server:     Starting a tls listener on port 3121.
>     May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: cib_ro
>     May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: cib_rw
>     May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: cib_shm
>     May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: attrd
>     May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: stonith-ng
>     May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: crmd
>     May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: main:    Starting
> 
> 
> > By default, the connection should retry for 60 seconds after the vm
> > resource starts.  Like you've noticed, this can be extended to account
> > for vms that take longer to boot.
> 
> But maybe this should start after the monitor method for the VM first
> indicates success?  Or does it already?

The policy engine has no way of expressing this right now, and it would be difficult to add.  Your idea of additional start scripts that verify when the VM's network is actually available is likely the better choice.
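
As a rough illustration of that kind of check, here is a minimal sketch that polls until the guest accepts TCP connections on the pacemaker_remote port. The function name wait_for_port and its timeout/interval parameters are assumptions made for this example; nothing like it ships with Pacemaker.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=2.0):
    """Return True once a TCP connect to host:port succeeds, False on timeout.

    Hypothetical helper for a start script: it polls every `interval`
    seconds and gives up after roughly `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass  # refused, unreachable, or timed out: retry until deadline
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A start script could call wait_for_port("swbuildsl6", 3121, timeout=120) and fail the start if it returns False.  Note that this only shows the port is reachable, much like the telnet test mentioned below; it says nothing about whether the TLS handshake will succeed.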

> 
> >> There have been a few segfaults of crmd during my testing of this, so
> >> perhaps
> >> there is a memory smash somewhere. (A couple times the failure was at
> >> remote_lrmd_ra.c:186,
> >
> > Please provide gdb backtrace.  We need to get this resolved asap before the
> > release of v.1.1.10 is complete.
> > I believe there is a new rc in the works already.
> 
> So I've attached results from a few core dumps.  All were triggered
> using "crm resource cleanup swbuildsl6" where swbuildsl6 is the host
> name of the VM  (that I can still telnet to port 3121).

thanks :)

> >> > I doubt this will make a difference, but here's the key I use during
> >> > testing,
> >> > lrmd:ce9db0bc3cec583d3b3bf38b0ac9ff91
> 
> It makes no difference.  I had wondered if the shorter key would matter.
> 
> Also, I've attached some patches I made to 1.1.10rc3 to try to resolve
> this problem.  So far no success.  Some of these add logging; the
> others fix what looks to me like fishy code, with cases that aren't
> completely handled.  With the additional logging, I see these results
> being logged:
> 
>     May 23 17:06:51 swbuildsl6 pacemaker_remoted[2326]:   notice: lrmd_remote_listen: LRMD client connection established. 0x995250 id: df04d8ee-7fcb-4025-8c8f-8a1555a4d097
>     May 23 17:06:53 cvmh02 crmd[18982]:  warning: lrmd_tcp_connect_cb: Client tls handshake failed for server swbuildsl6:3121. Disconnecting
>     May 23 17:06:52 swbuildsl6 pacemaker_remoted[2326]:    error: lrmd_remote_client_msg: Remote lrmd tls handshake failed: -9
>     May 23 17:06:52 swbuildsl6 pacemaker_remoted[2326]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: <unknown> id: df04d8ee-7fcb-4025-8c8f-8a1555a4d097
> 
> Puzzling -- nothing being logged from
> crm_initiate_client_tls_handshake -- is there something I need to add
> to somehow activate the crm_err and crm_info calls?

Well, you've definitely gotten my attention.  I tried this on my RHEL 6 box and sure enough, I'm seeing the exact same thing you're seeing.  No worries, I'll track this down.  I'm sure it has to do with the GnuTLS version being used.

In the meantime, if you want to test this feature, it does work in Fedora 18.  Thanks for all your work on testing this.  Your feedback came just in time.  We are about to release 1.1.10 :)

-- Vossel

> /rlt
> 