[ClusterLabs] corosync-qdevice doesn't daemonize (or stay running)
Jason Gauthier
jagauthier at gmail.com
Fri Jul 13 13:09:22 EDT 2018
I've finally solved this. Solution inline.
On Fri, Jul 13, 2018 at 9:55 AM Jan Friesse <jfriesse at redhat.com> wrote:
>
> Jason,
>
> > On Thu, Jun 21, 2018 at 10:47 AM Jason Gauthier <jagauthier at gmail.com> wrote:
> >>
> >> On Thu, Jun 21, 2018 at 9:49 AM Jan Pokorný <jpokorny at redhat.com> wrote:
> >>>
> >>> On 21/06/18 07:05 -0400, Jason Gauthier wrote:
> >>>> On Thu, Jun 21, 2018 at 5:11 AM Christine Caulfield <ccaulfie at redhat.com> wrote:
> >>>>> On 19/06/18 18:47, Jason Gauthier wrote:
> >>>>>> Attached!
> >>>>>
> >>>>> That's very odd. I can see communication with the server and corosync in
> >>>>> there (so it's doing something) but no logging at all. When I start
> >>>>> qdevice on my systems it logs loads of messages even if it doesn't
> >>>>> manage to contact the server. Do you have any logging entries in
> >>>>> corosync.conf that might be stopping it?
> >>>>
> >>>> I haven't checked the corosync logs for any entries before, but I just
> >>>> did. There isn't anything logged.
> >>>
> >>> What about syslog entries (may boil down to /var/log/messages,
> >>> journald log, or whatever sink is configured)?
> >>
> >> I took a look, since both you and Chrissie mentioned that.
> >>
> >> There aren't any new entries added to any of the /var/log files.
> >>
> >> # corosync-qdevice -f -d
> >> # date
> >> Thu Jun 21 10:36:06 EDT 2018
> >>
> >> # ls -lt|head
> >> total 152072
> >> -rw-r----- 1 root adm 68018 Jun 21 10:34 auth.log
> >> -rw-rw-r-- 1 root utmp 18704352 Jun 21 10:34 lastlog
> >> -rw-rw-r-- 1 root utmp 107136 Jun 21 10:34 wtmp
> >> -rw-r----- 1 root adm 248444 Jun 21 10:34 daemon.log
> >> -rw-r----- 1 root adm 160899 Jun 21 10:34 syslog
> >> -rw-r----- 1 root adm 1119856 Jun 21 09:46 kern.log
> >>
> >> I did look through daemon, messages, and syslog just to be sure.
> >>
> >>>>> Where did the binary come from? Did you build it yourself or is it from
> >>>>> a package? I wonder if it's got corrupted or is a bad version. Possibly
> >>>>> linked against a 'dodgy' libqb - there have been some things going on
> >>>>> there that could cause logging to go missing in some circumstances.
> >>>>>
> >>>>> Honza (the qdevice expert) is away at the moment, so I'm guessing a bit
> >>>>> here anyway!
>
> Corosync-qdevice uses the same config as corosync, so to get messages on
> stderr, please configure
>
> logging.to_stderr: on
Yes! I added a logging subsection for the QDEVICE subsystem and enabled stderr.
Then, and only then, did corosync-qdevice -f -d behave the way I expected it to.
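For anyone searching the archives, the logging block I ended up with looks
roughly like this (a sketch, not a verbatim copy of my corosync.conf; the
important parts are to_stderr and the QDEVICE logger_subsys):

logging {
    to_stderr: on
    to_syslog: yes
    logger_subsys {
        subsys: QDEVICE
        debug: on
    }
}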
>
> >>>>
> >>>> Hmm. Interesting. I installed the Debian package. When it didn't
> >>>> work, I grabbed the source from GitHub. They both act the same way,
> >>>> but if there is an underlying library issue then that will continue to
> >>>> be a problem.
> >>>>
> >>>> It doesn't say much:
> >>>> /usr/lib/x86_64-linux-gnu/libqb.so.0.18.1
> >>>
> >>> You are likely using libqb v1.0.1.
> >>
> >> Correct. I didn't even think to look at the output of dpkg -l for the
> >> package version.
> >> Debian 9 also packages binutils-2.28
> >>
> >>> Ability to figure out the proper package version is one of the most
> >>> basic skills to provide useful diagnostics about the issues with
> >>> distro-provided packages.
> >>>
> >>> With Debian, the proper incantation seems to be
> >>>
> >>> dpkg -s libqb-dev | grep -i version
> >>>
> >>> or
> >>>
> >>> apt list libqb-dev
> >>>
> >>> (or substitute libqb0 for libqb-dev).
> >>>
> >>> As Chrissie mentioned, there is some fishiness possible if you happen
> >>> to use the ld linker from binutils 2.29+ for building with this old
> >>> libqb in the mix, so if the issues persist and logging seems to be
> >>> missing, try recompiling with a binutils package downgraded below
> >>> said breakage point.
> >>
> >> Since the system already has a lower numbered binutils (2.28) I wonder
> >> if I should attempt to build a newer version of the libqb library.
> >>
> >> As Chrissie mentioned, I will open a bug with Debian in the interim.
> >> But I don't believe I will see resolution to that any time soon. :)
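(In case anyone does need to rebuild libqb: the upstream tree at
https://github.com/ClusterLabs/libqb uses the usual autotools flow, so a
from-source build presumably looks something like the sketch below; the
prefix is just an example. In the end I didn't need this, since the problem
turned out to be configuration.)

git clone https://github.com/ClusterLabs/libqb.git
cd libqb
./autogen.sh
./configure --prefix=/usr
make
sudo make install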
> >
> > I was finally able to look at this problem again, and found that qnetd
> > is giving me some messaging, but I don't know what to do with it.
> >
> > Jun 29 16:34:35 debug New client connected
> > Jun 29 16:34:35 debug cluster name = zeta
> > Jun 29 16:34:35 debug tls started = 1
> > Jun 29 16:34:35 debug tls peer certificate verified = 1
> > Jun 29 16:34:35 debug node_id = 1084772368
> > Jun 29 16:34:35 debug pointer = 0x563afd609d70
> > Jun 29 16:34:35 debug addr_str = ::ffff:192.168.80.16:38010
> > Jun 29 16:34:35 debug ring id = (40a85010.89ec)
> > Jun 29 16:34:35 debug cluster dump:
> > Jun 29 16:34:35 debug client = ::ffff:192.168.80.16:38010,
> > node_id = 1084772368
> > Jun 29 16:34:35 debug Client ::ffff:192.168.80.16:38010 (cluster
> > zeta, node_id 1084772368) sent initial node list.
> > Jun 29 16:34:35 debug msg seq num 4
> > Jun 29 16:34:35 debug node list:
> > Jun 29 16:34:35 error ffsplit: Received empty config node list for
> > client ::ffff:192.168.80.16:38010
>
> Yes, this is interesting. Could you please share your config?
Yes, see below.
> > Jun 29 16:34:35 error Algorithm returned error code. Sending error reply.
> > Jun 29 16:34:35 debug Client ::ffff:192.168.80.16:38010 (cluster
> > zeta, node_id 1084772368) sent membership node list.
> > Jun 29 16:34:35 debug msg seq num 5
> > Jun 29 16:34:35 debug ring id = (40a85010.89ec)
> > Jun 29 16:34:35 debug node list:
> > Jun 29 16:34:35 debug node_id = 1084772368, data_center_id = 0,
> > node_state = not set
> > Jun 29 16:34:35 debug node_id = 1084772369, data_center_id = 0,
> > node_state = not set
> > Jun 29 16:34:35 debug Algorithm result vote is Ask later
> > Jun 29 16:34:35 debug Client ::ffff:192.168.80.16:38010 (cluster
> > zeta, node_id 1084772368) sent quorum node list.
> > Jun 29 16:34:35 debug msg seq num 6
> > Jun 29 16:34:35 debug quorate = 1
> > Jun 29 16:34:35 debug node list:
> > Jun 29 16:34:35 debug node_id = 1084772368, data_center_id = 0,
> > node_state = member
> > Jun 29 16:34:35 debug node_id = 1084772369, data_center_id = 0,
> > node_state = member
> >
> > It looks like "config node list" is empty, but the other lists are
> > not. I'm not sure where it's getting that node list from. For fun, I
> > added
> > nodelist {
> >     node {
> >         alpha: 192.168.80.16
> >     }
> >     node {
> >         beta: 192.168.80.17
> >     }
> > }
> > }
>
> This is not how a nodelist should look. It should look like:
> nodelist {
>     node {
>         ring0_addr: 192.168.80.16
>         nodeid: 1
>     }
>     node {
>         ring0_addr: 192.168.80.17
>         nodeid: 2
>     }
> }
>
You are correct. I figured this out as well, after some
experimentation and finding examples online. However, this alone did
not resolve the issue. When I started it like this, corosync-qdevice
would not send a config nodelist, and it would exit with an error code
of 18.
It wasn't until I moved the nodelist above the quorum section that
corosync-qdevice actually started successfully.
> But it's really weird that corosync-qdevice started without a proper nodelist
> (it shouldn't).
It wasn't. It was exiting with an error code of 18, but that was
because it never saw the nodelist: the incorrect nodelist above was
never interpreted. When I fixed the nodelist syntax and still
received error 18, I moved the nodelist above the quorum section and
it started.
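For completeness, the ordering that finally worked for me is essentially the
sketch below, with the nodelist appearing before the quorum section (the
qnetd host address is a placeholder, and the quorum/device values are
paraphrased rather than copied verbatim from my file; ffsplit matches what
the qnetd logs above show):

nodelist {
    node {
        ring0_addr: 192.168.80.16
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.80.17
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    device {
        votes: 1
        model: net
        net {
            # placeholder for the qnetd server address
            host: 192.168.80.1
            algorithm: ffsplit
        }
    }
}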
> Honza
>
> > to corosync.conf, and restarted both nodes. But that didn't help.