[Pacemaker] pacemaker-remote tls handshaking

Lindsay Todd rltodd.ml1 at gmail.com
Thu May 23 17:35:02 EDT 2013


Working on this problem further...

On Tue, May 21, 2013 at 5:14 PM, David Vossel <dvossel at redhat.com> wrote:
> I'd suggest this.  Try running the pacemaker_remote regression test and see what happens.  This will start up
> an instance of pacemaker_remote locally and issue client commands to it to test both the TLS connection and
> the ability to start/stop/monitor services.
>
> /usr/share/pacemaker/tests/lrmd/regression.py  -R

But sadly SL 6.4 doesn't have the systemctl command this is trying to
use.  (Also, I am building RPMs and installing those, and the lrmd
regression tests aren't included in pacemaker-cts.  No problem, I ran
them directly from the build directory.)  The test doesn't seem to
make much progress.  The stdout is:

    sh: systemctl: command not found
    sh: /lib/systemd/system/lrmd_dummy_daemon.service: No such file or directory
    sh: systemctl: command not found
    Starting ...

And the lrmd-regression.log has:

    Set r/w permissions for uid=496, gid=494 on /tmp/lrmd-regression.log
    May 23 15:14:39 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: lrmd
    May 23 15:14:39 [3610] swbuildsl6 pacemaker_remoted:   notice: lrmd_init_remote_tls_server:     Starting a tls listener on port 3121.
    May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: cib_ro
    May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: cib_rw
    May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: cib_shm
    May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: attrd
    May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: stonith-ng
    May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: qb_ipcs_us_publish:      server name: crmd
    May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted:     info: main:    Starting


> By default, the connection should retry for 60 seconds after the vm resource starts.  Like you've noticed, this
> can be extended to account for vms that take longer to boot.

But maybe the 60-second retry window should start only after the
monitor method for the VM first indicates success?  Or does it already?
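
To make the question concrete, here is a minimal sketch of the two
possible anchors for the retry window.  It is not Pacemaker's code:
reconnect_window_t, reconnect_allowed and DEFAULT_RECONNECT_WINDOW_S
are hypothetical names, and only the 60-second figure comes from the
quoted reply.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* 60 seconds is the default quoted above; the macro name is hypothetical. */
#define DEFAULT_RECONNECT_WINDOW_S 60

/* Hypothetical bookkeeping for one remote connection's retry window. */
typedef struct {
    time_t anchor;    /* moment the window opens: either when the vm start
                         completes, or when its monitor first succeeds */
    int    window_s;  /* how long reconnect attempts are allowed */
} reconnect_window_t;

/* Return true while another connection attempt is still permitted. */
bool reconnect_allowed(const reconnect_window_t *w, time_t now)
{
    assert(w != NULL);
    return difftime(now, w->anchor) < (double) w->window_s;
}
```

Under these assumptions, a VM that needs 45 seconds to boot leaves
only 15 seconds of retries if the window is anchored at the start
operation, but the full 60 seconds if it is anchored at the first
successful monitor.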

>> There have been a few segfaults of crmd during my testing of this, so perhaps
>> there is a memory smash somewhere. (A couple times the failure was at
>> remote_lrmd_ra.c:186,
>
> Please provide gdb backtrace.  We need to get this resolved asap before the release of v.1.1.10 is complete.
> I believe there is a new rc in the works already.

So I've attached results from a few core dumps.  All were triggered
using "crm resource cleanup swbuildsl6", where swbuildsl6 is the host
name of the VM (which I can still reach by telnet on port 3121).

>> > I doubt this will make a difference, but here's the key I use during
>> > testing,
>> > lrmd:ce9db0bc3cec583d3b3bf38b0ac9ff91

It makes no difference.  I had wondered if the shorter key would matter.

Also, I've attached some patches I made to 1.1.10rc3 to try to resolve
this problem.  So far no success.  Some of these add logging; the
others fix what looks to me like fishy code, with cases that aren't
completely handled.  With the additional logging, I see these results:

    May 23 17:06:51 swbuildsl6 pacemaker_remoted[2326]:   notice: lrmd_remote_listen: LRMD client connection established. 0x995250 id: df04d8ee-7fcb-4025-8c8f-8a1555a4d097
    May 23 17:06:53 cvmh02 crmd[18982]:  warning: lrmd_tcp_connect_cb: Client tls handshake failed for server swbuildsl6:3121. Disconnecting
    May 23 17:06:52 swbuildsl6 pacemaker_remoted[2326]:    error: lrmd_remote_client_msg: Remote lrmd tls handshake failed: -9
    May 23 17:06:52 swbuildsl6 pacemaker_remoted[2326]:   notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: <unknown> id: df04d8ee-7fcb-4025-8c8f-8a1555a4d097

Puzzling: nothing is being logged from
crm_initiate_client_tls_handshake.  Is there something I need to add
to activate the crm_err and crm_info calls?

/rlt
-------------- next part --------------
Using 1.1.10rc3:

Bus error

(gdb) where
#0  0x000000000042541f in retry_start_cmd_cb (data=0x82f090)
    at remote_lrmd_ra.c:186
#1  0x0000003bba03961b in ?? () from /lib64/libglib-2.0.so.0
#2  0x0000003bba038f0e in g_main_context_dispatch ()
   from /lib64/libglib-2.0.so.0
#3  0x0000003bba03c938 in ?? () from /lib64/libglib-2.0.so.0
#4  0x0000003bba03cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#5  0x000000000040530e in crmd_init () at main.c:154
#6  0x000000000040560c in main (argc=1, argv=0x7fff7b552368) at main.c:120
(gdb) list
181         lrm_state_t *lrm_state = data;
182         remote_ra_data_t *ra_data = lrm_state->remote_ra_data;
183         remote_ra_cmd_t *cmd = NULL;
184         int rc = -1;
185   
186         if (!ra_data || !ra_data->cur_cmd) {
187             return FALSE;
188         }
189         cmd = ra_data->cur_cmd;
190         if (safe_str_neq(cmd->action, "start")) {


Using 1.1.10rc2

Segmentation fault:

(gdb) where
#0  0x00007f03c6603eed in lrmd_tls_connection_destroy (
    userdata=<value optimized out>) at lrmd_client.c:506
#1  0x00007f03c66046c0 in lrmd_tcp_connect_cb (userdata=0x9541b0, sock=-1)
    at lrmd_client.c:1079
#2  0x00007f03c6a3764a in check_connect_finished (userdata=0x97d350)
    at remote.c:736
#3  0x0000003b67a3961b in ?? () from /lib64/libglib-2.0.so.0
#4  0x0000003b67a38f0e in g_main_context_dispatch ()
   from /lib64/libglib-2.0.so.0
#5  0x0000003b67a3c938 in ?? () from /lib64/libglib-2.0.so.0
#6  0x0000003b67a3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#7  0x000000000040530e in crmd_init () at main.c:154
#8  0x000000000040560c in main (argc=1, argv=0x7fff26ac1ab8) at main.c:120
(gdb) list
501         lrmd_t *lrmd = userdata;
502         lrmd_private_t *native = lrmd->private;
503
504         crm_info("TLS connection destroyed");
505
506         if (native->remote->tls_session) {
507             gnutls_bye(*native->remote->tls_session, GNUTLS_SHUT_RDWR);
508             gnutls_deinit(*native->remote->tls_session);
509             gnutls_free(native->remote->tls_session);
510         }

Segmentation fault:

(gdb) where
#0  0x00007f0e0751ceed in lrmd_tls_connection_destroy (
    userdata=<value optimized out>) at lrmd_client.c:506
#1  0x00007f0e0751d6c0 in lrmd_tcp_connect_cb (userdata=0x21b6ae0, sock=-110)
    at lrmd_client.c:1079
#2  0x00007f0e0795064a in check_connect_finished (userdata=0x21639d0)
    at remote.c:736
#3  0x0000003b67a3961b in ?? () from /lib64/libglib-2.0.so.0
#4  0x0000003b67a38f0e in g_main_context_dispatch ()
   from /lib64/libglib-2.0.so.0
#5  0x0000003b67a3c938 in ?? () from /lib64/libglib-2.0.so.0
#6  0x0000003b67a3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#7  0x000000000040530e in crmd_init () at main.c:154
#8  0x000000000040560c in main (argc=1, argv=0x7fff567875b8) at main.c:120


Segmentation fault:

(gdb) where
#0  0x00007fac43d87eed in lrmd_tls_connection_destroy (
    userdata=<value optimized out>) at lrmd_client.c:506
#1  0x00007fac43d886c0 in lrmd_tcp_connect_cb (userdata=0x1bd5f60, sock=-110)
    at lrmd_client.c:1079
#2  0x00007fac441bb64a in check_connect_finished (userdata=0x1bc4360)
    at remote.c:736
#3  0x0000003b67a3961b in ?? () from /lib64/libglib-2.0.so.0
#4  0x0000003b67a38f0e in g_main_context_dispatch ()
   from /lib64/libglib-2.0.so.0
#5  0x0000003b67a3c938 in ?? () from /lib64/libglib-2.0.so.0
#6  0x0000003b67a3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#7  0x000000000040530e in crmd_init () at main.c:154
#8  0x000000000040560c in main (argc=1, argv=0x7fff0a0ee808) at main.c:120

Segmentation fault:

(gdb) where
#0  0x00007f0242ebceed in lrmd_tls_connection_destroy (
    userdata=<value optimized out>) at lrmd_client.c:506
#1  0x00007f0242ebd6c0 in lrmd_tcp_connect_cb (userdata=0x1eec210, sock=-1)
    at lrmd_client.c:1079
#2  0x00007f02432f064a in check_connect_finished (userdata=0x1f17150)
    at remote.c:736
#3  0x0000003b67a3961b in ?? () from /lib64/libglib-2.0.so.0
#4  0x0000003b67a38f0e in g_main_context_dispatch ()
   from /lib64/libglib-2.0.so.0
#5  0x0000003b67a3c938 in ?? () from /lib64/libglib-2.0.so.0
#6  0x0000003b67a3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#7  0x000000000040530e in crmd_init () at main.c:154
#8  0x000000000040560c in main (argc=1, argv=0x7fff162351e8) at main.c:120
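
Every 1.1.10rc2 backtrace above faults at lrmd_client.c:506, the
first dereference of native->remote->tls_session, after
lrmd_tcp_connect_cb was called with a failed socket (sock=-1 or
sock=-110).  One possible explanation, which the backtraces alone do
not confirm, is that part of that pointer chain is NULL by then.  The
sketch below shows the kind of guard that would cover that case.  The
sketch_* types and sketch_has_tls_session are hypothetical stand-ins,
not Pacemaker's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the chain visible in the listing
 * (lrmd -> private -> remote -> tls_session). */
typedef struct { void *tls_session; } sketch_remote_t;
typedef struct { sketch_remote_t *remote; } sketch_private_t;
typedef struct { sketch_private_t *private_data; } sketch_lrmd_t;

/* True only if every link in the chain exists, so a teardown path can
 * check this before touching tls_session. */
bool sketch_has_tls_session(const sketch_lrmd_t *lrmd)
{
    if (lrmd == NULL || lrmd->private_data == NULL) {
        return false;
    }
    if (lrmd->private_data->remote == NULL) {
        return false;
    }
    return lrmd->private_data->remote->tls_session != NULL;
}
```

A guard like this only helps if a pointer is actually NULL.  If the
structure has already been freed by the time the callback runs, the
check itself reads invalid memory and the fix has to be in the
ownership logic instead.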
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pacemaker-ccni.patch
Type: application/octet-stream
Size: 4177 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20130523/b054e526/attachment-0003.obj>

