[ClusterLabs] Corosync not initializing successfully

Nikhil Utane nikhil.subscribed at gmail.com
Tue May 3 13:34:19 UTC 2016


Thanks for your response, Dejan.

I do not know yet whether this has anything to do with endianness.
FWIW, there could be something quirky with the system, so I'm keeping all
options open. :)

I added some debug prints to understand what's happening under the hood.

*Success case (on x86 machine):*
[TOTEM ] entering OPERATIONAL state.
[TOTEM ] A new membership (10.206.1.7:137220) was formed. Members joined:
181272839
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
my_high_delivered=0
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=0
[TOTEM ] Delivering 0 to 1
[TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
my_high_delivered=1
[TOTEM ] Delivering 1 to 2
[TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn. header->id=0
[SYNC  ] Nikhil: Entering sync_barrier_handler
[SYNC  ] Committing synchronization for corosync configuration map access
.
[TOTEM ] Delivering 2 to 4
[TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
[TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
[CPG   ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
[CPG   ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
[SYNC  ] Committing synchronization for corosync cluster closed process
group service v1.01
*[MAIN  ] Completed service synchronization, ready to provide service.*


*Failure case (on ppc):*
[TOTEM ] entering OPERATIONAL state.
[TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined:
181344357
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
my_high_delivered=0
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=0
[TOTEM ] Delivering 0 to 1
[TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
[SYNC  ] Nikhil: Inside sync_deliver_fn header->id=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=1
The above message repeats continuously.

So it appears that in the failure case I do not receive the messages with
sequence numbers 2-4.
If somebody can throw out some ideas, that would help a lot.

-Thanks
Nikhil

On Tue, May 3, 2016 at 5:26 PM, Dejan Muhamedagic <dejanmm at fastmail.fm>
wrote:

> Hi,
>
> On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:
> > >As your hardware is probably capable of running ppcle, and if you can
> > >get an environment at hand without too much effort, it might pay off
> > >to try that.
> > >There are of course distributions out there that support corosync on
> > >big-endian architectures, but I don't know if there is an automated
> > >regression test for corosync on big-endian that would catch
> > >big-endian issues right away with something as current as your 2.3.5.
> >
> > No, we are not testing big-endian.
> >
> > So I totally agree with Klaus. Give ppcle a try. Also make sure all
> > nodes are little-endian. Corosync should work in a mixed BE/LE
> > environment, but because it's not tested, it may not work (and that's a
> > bug, so if ppcle works I will try to fix BE).
>
> I tested a cluster consisting of big-endian/little-endian nodes
> (s390 and x86-64), but that was a while ago. IIRC, all relevant
> bugs in corosync were fixed at that time. I don't know what the
> situation is with the latest version.
>
> Thanks,
>
> Dejan
>
> > Regards,
> >   Honza
> >
> > >
> > >Regards,
> > >Klaus
> > >
> > >On 05/02/2016 06:44 AM, Nikhil Utane wrote:
> > >>Re-sending as I don't see my post on the thread.
> > >>
> > >>On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
> > >><nikhil.subscribed at gmail.com> wrote:
> > >>
> > >>     Hi,
> > >>
> > >>     Looking for some guidance here as we are completely blocked
> > >>     otherwise :(.
> > >>
> > >>     -Regards
> > >>     Nikhil
> > >>
> > >>         On Fri, Apr 29, 2016 at 6:11 PM, Sriram <sriram.ec at gmail.com> wrote:
> > >>
> > >>         Corrected the subject.
> > >>
> > >>         We went ahead and captured corosync debug logs for our ppc board.
> > >>         After analyzing the logs and comparing them with the successful
> > >>         logs (from the x86 machine), we didn't find *"[MAIN  ] Completed
> > >>         service synchronization, ready to provide service."* in the ppc
> > >>         logs.
> > >>         So it looks like corosync is not in a position to accept
> > >>         connections from Pacemaker.
> > >>         I even tried with the new corosync.conf, with no success.
> > >>
> > >>         Any hints on this issue would be really helpful.
> > >>
> > >>         Attaching ppc_notworking.log, x86_working.log, corosync.conf.
> > >>
> > >>         Regards,
> > >>         Sriram
> > >>
> > >>
> > >>
> > >>         On Fri, Apr 29, 2016 at 2:44 PM, Sriram <sriram.ec at gmail.com> wrote:
> > >>
> > >>             Hi,
> > >>
> > >>             I went ahead and made some changes in the file system (I
> > >>             brought in /etc/init.d/corosync, /etc/init.d/pacemaker,
> > >>             and /etc/sysconfig). After that I was able to run "pcs
> > >>             cluster start".
> > >>             But it failed with the following error:
> > >>             # pcs cluster start
> > >>             Starting Cluster...
> > >>             Starting Pacemaker Cluster Manager[FAILED]
> > >>             Error: unable to start pacemaker
> > >>
> > >>             And in /var/log/pacemaker.log, I saw these errors:
> > >>             pacemakerd:     info: mcp_read_config:  cmap connection
> > >>             setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
> > >>             Apr 29 08:53:47 [15863] node_cu pacemakerd:     info:
> > >>             mcp_read_config:  cmap connection setup failed:
> > >>             CS_ERR_TRY_AGAIN.  Retrying in 5s
> > >>             Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
> > >>             mcp_read_config:  Could not connect to Cluster
> > >>             Configuration Database API, error 6
> > >>             Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
> > >>             main:     Could not obtain corosync config data, exiting
> > >>             Apr 29 08:53:52 [15863] node_cu pacemakerd:     info:
> > >>             crm_xml_cleanup:  Cleaning up memory from libxml2
> > >>
> > >>
> > >>             And in /var/log/Debuglog, I saw these errors coming
> > >>             from corosync:
> > >>             20160429 085347.487050 airv_cu
> > >>             daemon.warn corosync[12857]:   [QB    ] Denied connection,
> > >>             is not ready (12857-15863-14)
> > >>             20160429 085347.487067 airv_cu
> > >>             daemon.info corosync[12857]:   [QB    ] Denied connection,
> > >>             is not ready (12857-15863-14)
> > >>
> > >>
> > >>             I browsed the libqb code and found that it fails in
> > >>
> https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
> > >>
> > >>             Line 600:
> > >>             handle_new_connection function
> > >>
> > >>             Line 637:
> > >>             if (auth_result == 0 &&
> > >>                 c->service->serv_fns.connection_accept) {
> > >>                     res = c->service->serv_fns.connection_accept(c,
> > >>                                              c->euid, c->egid);
> > >>             }
> > >>             if (res != 0) {
> > >>                     goto send_response;
> > >>             }
> > >>
> > >>             Any hints on this issue would be really helpful for me to
> > >>             move ahead.
> > >>             Please let me know if any logs are required.
> > >>
> > >>             Regards,
> > >>             Sriram
> > >>
> > >>             On Thu, Apr 28, 2016 at 2:42 PM, Sriram
> > >>             <sriram.ec at gmail.com> wrote:
> > >>
> > >>                 Thanks Ken and Emmanuel.
> > >>                 It's a big-endian machine. I will try running "pcs
> > >>                 cluster setup" and "pcs cluster start".
> > >>                 Inside cluster.py, "service pacemaker start" and
> > >>                 "service corosync start" are executed to bring up
> > >>                 pacemaker and corosync.
> > >>                 Those service scripts and the infrastructure needed to
> > >>                 bring up the processes in that manner don't exist
> > >>                 on my board.
> > >>                 As it is an embedded board with limited memory, a
> > >>                 full-fledged Linux is not installed.
> > >>                 I'm just curious what the reason could be that
> > >>                 pacemaker throws this error:
> > >>
> > >>                 "cmap connection setup failed: CS_ERR_TRY_AGAIN.
> > >>                 Retrying in 1s"
> > >>
> > >>                 Thanks for the response.
> > >>
> > >>                 Regards,
> > >>                 Sriram.
> > >>
> > >>                 On Thu, Apr 28, 2016 at 8:55 AM, Ken Gaillot
> > >>                 <kgaillot at redhat.com> wrote:
> > >>
> > >>                     On 04/27/2016 11:25 AM, emmanuel segura wrote:
> > >>                     > you need to use pcs to do everything: pcs
> > >>                     cluster setup and pcs
> > >>                     > cluster start. Try the Red Hat docs for
> > >>                     more information.
> > >>
> > >>                     Agreed -- pcs cluster setup will create a proper
> > >>                     corosync.conf for you.
> > >>                     Your corosync.conf below uses corosync 1 syntax,
> > >>                     and there were
> > >>                     significant changes in corosync 2. In particular,
> > >>                     you don't need the
> > >>                     file created in step 4, because pacemaker is no
> > >>                     longer launched via a
> > >>                     corosync plugin.
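For reference, a corosync 2 style corosync.conf is much shorter than the corosync 1 style config quoted below; a minimal sketch of the shape that pcs cluster setup generates (cluster and node names are placeholders taken from this thread):

```
totem {
    version: 2
    cluster_name: mycluster
    transport: udpu
}

nodelist {
    node {
        ring0_addr: node_cu
        nodeid: 1
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_syslog: yes
}
```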
> > >>
> > >>                     > 2016-04-27 17:28 GMT+02:00 Sriram
> > >>                     <sriram.ec at gmail.com>:
> > >>                     >> Dear All,
> > >>                     >>
> > >>                     >> I'm trying to use pacemaker and corosync for
> > >>                     the clustering requirement that
> > >>                     >> came up recently.
> > >>                     >> We have cross-compiled corosync, pacemaker and
> > >>                     pcs (python) for the ppc
> > >>                     >> environment (Target board where pacemaker and
> > >>                     corosync are supposed to run)
> > >>                     >> I'm having trouble bringing up pacemaker in
> > >>                     that environment, though I could
> > >>                     >> successfully bring up corosync.
> > >>                     >> Any help is welcome.
> > >>                     >>
> > >>                     >> I'm using these versions of pacemaker and
> corosync
> > >>                     >> [root at node_cu pacemaker]# corosync -v
> > >>                     >> Corosync Cluster Engine, version '2.3.5'
> > >>                     >> Copyright (c) 2006-2009 Red Hat, Inc.
> > >>                     >> [root at node_cu pacemaker]# pacemakerd -$
> > >>                     >> Pacemaker 1.1.14
> > >>                     >> Written by Andrew Beekhof
> > >>                     >>
> > >>                     >> For running corosync, I did the following.
> > >>                     >> 1. Created the following directories,
> > >>                     >>     /var/lib/pacemaker
> > >>                     >>     /var/lib/corosync
> > >>                     >>     /var/lib/pacemaker/cores
> > >>                     >>     /var/lib/pacemaker/pengine
> > >>                     >>     /var/lib/pacemaker/blackbox
> > >>                     >>     /var/lib/pacemaker/cib
> > >>                     >>
> > >>                     >>
> > >>                     >> 2. Created a file called corosync.conf under
> > >>                     /etc/corosync folder with the
> > >>                     >> following contents
> > >>                     >>
> > >>                     >> totem {
> > >>                     >>
> > >>                     >>         version: 2
> > >>                     >>         token:          5000
> > >>                     >>         token_retransmits_before_loss_const: 20
> > >>                     >>         join:           1000
> > >>                     >>         consensus:      7500
> > >>                     >>         vsftype:        none
> > >>                     >>         max_messages:   20
> > >>                     >>         secauth:        off
> > >>                     >>         cluster_name:   mycluster
> > >>                     >>         transport:      udpu
> > >>                     >>         threads:        0
> > >>                     >>         clear_node_high_bit: yes
> > >>                     >>
> > >>                     >>         interface {
> > >>                     >>                 ringnumber: 0
> > >>                     >>                 # The following three values
> > >>                     need to be set based on your
> > >>                     >> environment
> > >>                     >>                 bindnetaddr: 10.x.x.x
> > >>                     >>                 mcastaddr: 226.94.1.1
> > >>                     >>                 mcastport: 5405
> > >>                     >>         }
> > >>                     >>  }
> > >>                     >>
> > >>                     >>  logging {
> > >>                     >>         fileline: off
> > >>                     >>         to_syslog: yes
> > >>                     >>         to_stderr: no
> > >>                     >>         logfile: /var/log/corosync.log
> > >>                     >>         syslog_facility: daemon
> > >>                     >>         debug: on
> > >>                     >>         timestamp: on
> > >>                     >>  }
> > >>                     >>
> > >>                     >>  amf {
> > >>                     >>         mode: disabled
> > >>                     >>  }
> > >>                     >>
> > >>                     >>  quorum {
> > >>                     >>         provider: corosync_votequorum
> > >>                     >>  }
> > >>                     >>
> > >>                     >> nodelist {
> > >>                     >>   node {
> > >>                     >>         ring0_addr: node_cu
> > >>                     >>         nodeid: 1
> > >>                     >>        }
> > >>                     >> }
> > >>                     >>
> > >>                     >> 3.  Created authkey under /etc/corosync
> > >>                     >>
> > >>                     >> 4.  Created a file called pcmk under
> > >>                     /etc/corosync/service.d and contents as
> > >>                     >> below,
> > >>                     >>       cat pcmk
> > >>                     >>       service {
> > >>                     >>          # Load the Pacemaker Cluster Resource
> > >>                     Manager
> > >>                     >>          name: pacemaker
> > >>                     >>          ver:  1
> > >>                     >>       }
> > >>                     >>
> > >>                     >> 5. Added the node name "node_cu" in /etc/hosts
> > >>                     with 10.X.X.X ip
> > >>                     >>
> > >>                     >> 6. ./corosync -f -p & --> this step started
> > >>                     corosync
> > >>                     >>
> > >>                     >> [root at node_cu pacemaker]# netstat -alpn | grep
> > >>                     -i coros
> > >>                     >> udp        0      0 10.X.X.X:61841     0.0.0.0:
> *
> > >>                     >> 9133/corosync
> > >>                     >> udp        0      0 10.X.X.X:5405      0.0.0.0:
> *
> > >>                     >> 9133/corosync
> > >>                     >> unix  2      [ ACC ]     STREAM     LISTENING
> > >>                        148888 9133/corosync
> > >>                     >> @quorum
> > >>                     >> unix  2      [ ACC ]     STREAM     LISTENING
> > >>                        148884 9133/corosync
> > >>                     >> @cmap
> > >>                     >> unix  2      [ ACC ]     STREAM     LISTENING
> > >>                        148887 9133/corosync
> > >>                     >> @votequorum
> > >>                     >> unix  2      [ ACC ]     STREAM     LISTENING
> > >>                        148885 9133/corosync
> > >>                     >> @cfg
> > >>                     >> unix  2      [ ACC ]     STREAM     LISTENING
> > >>                        148886 9133/corosync
> > >>                     >> @cpg
> > >>                     >> unix  2      [ ]         DGRAM
> > >>                       148840 9133/corosync
> > >>                     >>
> > >>                     >> 7. ./pacemakerd -f & gives the following error
> > >>                     and exits.
> > >>                     >> [root at node_cu pacemaker]# pacemakerd -f
> > >>                     >> cmap connection setup failed:
> > >>                     CS_ERR_TRY_AGAIN.  Retrying in 1s
> > >>                     >> cmap connection setup failed:
> > >>                     CS_ERR_TRY_AGAIN.  Retrying in 2s
> > >>                     >> cmap connection setup failed:
> > >>                     CS_ERR_TRY_AGAIN.  Retrying in 3s
> > >>                     >> cmap connection setup failed:
> > >>                     CS_ERR_TRY_AGAIN.  Retrying in 4s
> > >>                     >> cmap connection setup failed:
> > >>                     CS_ERR_TRY_AGAIN.  Retrying in 5s
> > >>                     >> Could not connect to Cluster Configuration
> > >>                     Database API, error 6
> > >>                     >>
> > >>                     >> Can you please point me, what is missing in
> > >>                     these steps ?
> > >>                     >>
> > >>                     >> Before trying these steps, I tried running "pcs
> > >>                     cluster start", but that
> > >>                     >> command fails because the "service" script is
> > >>                     not found, as the root filesystem
> > >>                     >> doesn't contain either /etc/init.d/ or
> > >>                     /sbin/service.
> > >>                     >>
> > >>                     >> So, the plan is to bring up corosync and
> > >>                     pacemaker manually, later do the
> > >>                     >> cluster configuration using "pcs" commands.
> > >>                     >>
> > >>                     >> Regards,
> > >>                     >> Sriram
> > >>                     >>
> > >>                     >> _______________________________________________
> > >>                     >> Users mailing list: Users at clusterlabs.org
> > >>                     >> http://clusterlabs.org/mailman/listinfo/users
> > >>                     >>
> > >>                     >> Project Home: http://www.clusterlabs.org
> > >>                     >> Getting started:
> > >>
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > >>                     >> Bugs: http://bugs.clusterlabs.org
> > >>                     >>
> > >>                     >
> > >>                     >
> > >>                     >
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >
> > >
> > >
> >
> >
>
>

