[ClusterLabs] Corosync not initializing successfully
Nikhil Utane
nikhil.subscribed at gmail.com
Tue May 3 13:34:19 UTC 2016
Thanks for your response, Dejan.
I do not know yet whether this has anything to do with endianness.
FWIW, there could be something quirky with the system, so I am keeping all
options open. :)
I added some debug prints to understand what's happening under the hood.
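Since endianness is the prime suspect, a quick sanity check (a sketch; it assumes python3 or lscpu is present on both the x86 and ppc nodes) is to print each node's byte order and compare:

```shell
# Print this host's byte order; run on every cluster node and compare.
python3 -c "import sys; print(sys.byteorder)"   # prints "little" or "big"

# Where lscpu is available, it reports the same information:
lscpu | grep "Byte Order"
```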
Success case (on x86 machine):
[TOTEM ] entering OPERATIONAL state.
[TOTEM ] A new membership (10.206.1.7:137220) was formed. Members joined:
181272839
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
my_high_delivered=0
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=0
[TOTEM ] Delivering 0 to 1
[TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
[SYNC ] Nikhil: Inside sync_deliver_fn. header->id=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
my_high_delivered=1
[TOTEM ] Delivering 1 to 2
[TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
[SYNC ] Nikhil: Inside sync_deliver_fn. header->id=0
[SYNC ] Nikhil: Entering sync_barrier_handler
[SYNC ] Committing synchronization for corosync configuration map access
.
[TOTEM ] Delivering 2 to 4
[TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
[TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
[CPG ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
[CPG ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
[SYNC ] Committing synchronization for corosync cluster closed process
group service v1.01
[MAIN ] Completed service synchronization, ready to provide service.
Failure case (on ppc):
[TOTEM ] entering OPERATIONAL state.
[TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined:
181344357
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
my_high_delivered=0
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=0
[TOTEM ] Delivering 0 to 1
[TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
[SYNC ] Nikhil: Inside sync_deliver_fn header->id=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=1
[TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
my_high_delivered=1
The above message repeats continuously.
So it appears that in the failure case I never receive the messages with
sequence numbers 2 to 4.
If somebody can throw out some ideas, that would help a lot.
-Thanks
Nikhil
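(Side note on the configuration discussed in the quoted thread that follows: the corosync.conf there mixes corosync 1.x options such as vsftype, threads, and the amf block. A minimal corosync 2.x equivalent, sketched here with the same cluster/node names and addresses left to be adjusted, would look like:)

```
totem {
    version: 2
    cluster_name: mycluster
    transport: udpu
}

logging {
    to_syslog: yes
}

quorum {
    provider: corosync_votequorum
}

nodelist {
    node {
        ring0_addr: node_cu
        nodeid: 1
    }
}
```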
On Tue, May 3, 2016 at 5:26 PM, Dejan Muhamedagic <dejanmm at fastmail.fm>
wrote:
> Hi,
>
> On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:
> > >As your hardware is probably capable of running ppcle, and if you have
> > >an environment at hand without too much effort, it might pay off to try
> > >that. There are of course distributions out there that support corosync
> > >on big-endian architectures, but I don't know if there is an automated
> > >regression for corosync on big-endian that would catch big-endian issues
> > >right away with something as current as your 2.3.5.
> >
> > No, we are not testing big-endian.
> >
> > So I totally agree with Klaus: give ppcle a try. Also make sure all
> > nodes are little-endian. Corosync should work in a mixed BE/LE
> > environment, but because it's not tested, it may not work (and that's a
> > bug, so if ppcle works I will try to fix BE).
>
> I tested a cluster consisting of big-endian/little-endian nodes
> (s390 and x86-64), but that was a while ago. IIRC, all relevant
> bugs in corosync got fixed at that time. I don't know what the
> situation is with the latest version.
>
> Thanks,
>
> Dejan
>
> > Regards,
> > Honza
> >
> > >
> > >Regards,
> > >Klaus
> > >
> > >On 05/02/2016 06:44 AM, Nikhil Utane wrote:
> > >>Re-sending as I don't see my post on the thread.
> > >>
> > >>On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
> > >><nikhil.subscribed at gmail.com> wrote:
> > >>
> > >> Hi,
> > >>
> > >> Looking for some guidance here as we are completely blocked
> > >> otherwise :(.
> > >>
> > >> -Regards
> > >> Nikhil
> > >>
> > >>         On Fri, Apr 29, 2016 at 6:11 PM, Sriram <sriram.ec at gmail.com> wrote:
> > >>
> > >>             Corrected the subject.
> > >>
> > >>             We went ahead and captured corosync debug logs for our ppc board.
> > >>             After analyzing the logs and comparing them with the successful
> > >>             logs (from the x86 machine), we didn't find "[ MAIN ] Completed
> > >>             service synchronization, ready to provide service." in the ppc
> > >>             logs.
> > >>             So it looks like corosync is not in a position to accept
> > >>             connections from Pacemaker.
> > >>             I even tried with the new corosync.conf, with no success.
> > >>
> > >>             Any hints on this issue would be really helpful.
> > >>
> > >>             Attaching ppc_notworking.log, x86_working.log, corosync.conf.
> > >>
> > >> Regards,
> > >> Sriram
> > >>
> > >>
> > >>
> > >>             On Fri, Apr 29, 2016 at 2:44 PM, Sriram <sriram.ec at gmail.com> wrote:
> > >>
> > >> Hi,
> > >>
> > >>                 I went ahead and made some changes in the file system
> > >>                 (I brought in /etc/init.d/corosync, /etc/init.d/pacemaker,
> > >>                 and /etc/sysconfig). After that I was able to run "pcs
> > >>                 cluster start".
> > >>                 But it failed with the following error:
> > >>                 # pcs cluster start
> > >>                 Starting Cluster...
> > >>                 Starting Pacemaker Cluster Manager[FAILED]
> > >>                 Error: unable to start pacemaker
> > >>
> > >> And in the /var/log/pacemaker.log, I saw these errors
> > >> pacemakerd: info: mcp_read_config: cmap connection
> > >> setup failed: CS_ERR_TRY_AGAIN. Retrying in 4s
> > >> Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
> > >> mcp_read_config: cmap connection setup failed:
> > >> CS_ERR_TRY_AGAIN. Retrying in 5s
> > >> Apr 29 08:53:52 [15863] node_cu pacemakerd: warning:
> > >> mcp_read_config: Could not connect to Cluster
> > >> Configuration Database API, error 6
> > >> Apr 29 08:53:52 [15863] node_cu pacemakerd: notice:
> > >> main: Could not obtain corosync config data, exiting
> > >> Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
> > >> crm_xml_cleanup: Cleaning up memory from libxml2
> > >>
> > >>
> > >>                 And in the /var/log/Debuglog, I saw these errors coming
> > >>                 from corosync
> > >>                 20160429 085347.487050 airv_cu daemon.warn corosync[12857]:
> > >>                 [QB ] Denied connection, is not ready (12857-15863-14)
> > >>                 20160429 085347.487067 airv_cu daemon.info corosync[12857]:
> > >>                 [QB ] Denied connection, is not ready (12857-15863-14)
> > >>
> > >>
> > >>                 I browsed the code of libqb and found that it is failing in
> > >>                 https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
> > >>                 in the handle_new_connection function (line 600), around
> > >>                 line 637:
> > >>
> > >>                     if (auth_result == 0 &&
> > >>                         c->service->serv_fns.connection_accept) {
> > >>                         res = c->service->serv_fns.connection_accept(c,
> > >>                             c->euid, c->egid);
> > >>                     }
> > >>                     if (res != 0) {
> > >>                         goto send_response;
> > >>                     }
> > >>
> > >>                 Any hints on this issue would be really helpful for me to
> > >>                 move ahead.
> > >>                 Please let me know if any logs are required.
> > >>
> > >> Regards,
> > >> Sriram
> > >>
> > >>                 On Thu, Apr 28, 2016 at 2:42 PM, Sriram <sriram.ec at gmail.com> wrote:
> > >>
> > >> Thanks Ken and Emmanuel.
> > >>                     It's a big-endian machine. I will try running "pcs
> > >>                     cluster setup" and "pcs cluster start".
> > >>                     Inside cluster.py, "service pacemaker start" and
> > >>                     "service corosync start" are executed to bring up
> > >>                     pacemaker and corosync.
> > >>                     Those service scripts and the infrastructure needed to
> > >>                     bring up the processes in that manner don't exist on
> > >>                     my board.
> > >>                     As it is an embedded board with limited memory, a
> > >>                     full-fledged Linux is not installed.
> > >>                     Just curious to know what could be the reason
> > >>                     pacemaker throws that error:
> > >>
> > >>                     "cmap connection setup failed: CS_ERR_TRY_AGAIN.
> > >>                     Retrying in 1s"
> > >>
> > >>                     Thanks for the response.
> > >>
> > >> Regards,
> > >> Sriram.
> > >>
> > >>                     On Thu, Apr 28, 2016 at 8:55 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
> > >>
> > >> On 04/27/2016 11:25 AM, emmanuel segura wrote:
> > >>                         > you need to use pcs to do everything: pcs
> > >>                         cluster setup and pcs
> > >>                         > cluster start. Try the Red Hat docs for
> > >>                         more information.
> > >>
> > >> Agreed -- pcs cluster setup will create a proper
> > >> corosync.conf for you.
> > >> Your corosync.conf below uses corosync 1 syntax,
> > >> and there were
> > >> significant changes in corosync 2. In particular,
> > >> you don't need the
> > >> file created in step 4, because pacemaker is no
> > >> longer launched via a
> > >> corosync plugin.
> > >>
> > >>                         > 2016-04-27 17:28 GMT+02:00 Sriram <sriram.ec at gmail.com>:
> > >> >> Dear All,
> > >> >>
> > >>                         >> I'm trying to use pacemaker and corosync for
> > >>                         the clustering requirement that
> > >>                         >> came up recently.
> > >>                         >> We have cross-compiled corosync, pacemaker,
> > >>                         and pcs (python) for the ppc
> > >>                         >> environment (the target board where pacemaker
> > >>                         and corosync are supposed to run).
> > >>                         >> I'm having trouble bringing up pacemaker in
> > >>                         that environment, though I could
> > >>                         >> successfully bring up corosync.
> > >>                         >> Any help is welcome.
> > >> >>
> > >>                         >> I'm using these versions of pacemaker and corosync:
> > >>                         >> [root at node_cu pacemaker]# corosync -v
> > >>                         >> Corosync Cluster Engine, version '2.3.5'
> > >>                         >> Copyright (c) 2006-2009 Red Hat, Inc.
> > >>                         >> [root at node_cu pacemaker]# pacemakerd -$
> > >>                         >> Pacemaker 1.1.14
> > >>                         >> Written by Andrew Beekhof
> > >> >>
> > >> >> For running corosync, I did the following.
> > >> >> 1. Created the following directories,
> > >> >> /var/lib/pacemaker
> > >> >> /var/lib/corosync
> > >> >> /var/lib/pacemaker
> > >> >> /var/lib/pacemaker/cores
> > >> >> /var/lib/pacemaker/pengine
> > >> >> /var/lib/pacemaker/blackbox
> > >> >> /var/lib/pacemaker/cib
> > >> >>
> > >> >>
> > >> >> 2. Created a file called corosync.conf under
> > >> /etc/corosync folder with the
> > >> >> following contents
> > >> >>
> > >> >> totem {
> > >> >>
> > >> >> version: 2
> > >> >> token: 5000
> > >> >> token_retransmits_before_loss_const: 20
> > >> >> join: 1000
> > >> >> consensus: 7500
> > >> >> vsftype: none
> > >> >> max_messages: 20
> > >> >> secauth: off
> > >> >> cluster_name: mycluster
> > >> >> transport: udpu
> > >> >> threads: 0
> > >> >> clear_node_high_bit: yes
> > >> >>
> > >> >> interface {
> > >> >> ringnumber: 0
> > >> >> # The following three values
> > >> need to be set based on your
> > >> >> environment
> > >> >> bindnetaddr: 10.x.x.x
> > >> >> mcastaddr: 226.94.1.1
> > >> >> mcastport: 5405
> > >> >> }
> > >> >> }
> > >> >>
> > >> >> logging {
> > >> >> fileline: off
> > >> >> to_syslog: yes
> > >> >> to_stderr: no
> > >> >> to_syslog: yes
> > >> >> logfile: /var/log/corosync.log
> > >> >> syslog_facility: daemon
> > >> >> debug: on
> > >> >> timestamp: on
> > >> >> }
> > >> >>
> > >> >> amf {
> > >> >> mode: disabled
> > >> >> }
> > >> >>
> > >> >> quorum {
> > >> >> provider: corosync_votequorum
> > >> >> }
> > >> >>
> > >> >> nodelist {
> > >> >> node {
> > >> >> ring0_addr: node_cu
> > >> >> nodeid: 1
> > >> >> }
> > >> >> }
> > >> >>
> > >> >> 3. Created authkey under /etc/corosync
> > >> >>
> > >> >> 4. Created a file called pcmk under
> > >> /etc/corosync/service.d and contents as
> > >> >> below,
> > >> >> cat pcmk
> > >> >> service {
> > >> >> # Load the Pacemaker Cluster Resource
> > >> Manager
> > >> >> name: pacemaker
> > >> >> ver: 1
> > >> >> }
> > >> >>
> > >> >> 5. Added the node name "node_cu" in /etc/hosts
> > >> with 10.X.X.X ip
> > >> >>
> > >> >> 6. ./corosync -f -p & --> this step started
> > >> corosync
> > >> >>
> > >>                         >> [root at node_cu pacemaker]# netstat -alpn | grep -i coros
> > >>                         >> udp   0  0 10.X.X.X:61841  0.0.0.0:*  9133/corosync
> > >>                         >> udp   0  0 10.X.X.X:5405   0.0.0.0:*  9133/corosync
> > >>                         >> unix  2  [ ACC ] STREAM LISTENING 148888 9133/corosync @quorum
> > >>                         >> unix  2  [ ACC ] STREAM LISTENING 148884 9133/corosync @cmap
> > >>                         >> unix  2  [ ACC ] STREAM LISTENING 148887 9133/corosync @votequorum
> > >>                         >> unix  2  [ ACC ] STREAM LISTENING 148885 9133/corosync @cfg
> > >>                         >> unix  2  [ ACC ] STREAM LISTENING 148886 9133/corosync @cpg
> > >>                         >> unix  2  [     ] DGRAM            148840 9133/corosync
> > >> >>
> > >> >> 7. ./pacemakerd -f & gives the following error
> > >> and exits.
> > >> >> [root at node_cu pacemaker]# pacemakerd -f
> > >> >> cmap connection setup failed:
> > >> CS_ERR_TRY_AGAIN. Retrying in 1s
> > >> >> cmap connection setup failed:
> > >> CS_ERR_TRY_AGAIN. Retrying in 2s
> > >> >> cmap connection setup failed:
> > >> CS_ERR_TRY_AGAIN. Retrying in 3s
> > >> >> cmap connection setup failed:
> > >> CS_ERR_TRY_AGAIN. Retrying in 4s
> > >> >> cmap connection setup failed:
> > >> CS_ERR_TRY_AGAIN. Retrying in 5s
> > >> >> Could not connect to Cluster Configuration
> > >> Database API, error 6
> > >> >>
> > >>                         >> Can you please point out what is missing in
> > >>                         these steps?
> > >> >>
> > >>                         >> Before trying these steps, I tried running "pcs
> > >>                         cluster start", but that
> > >>                         >> command failed because the "service" script was
> > >>                         not found: the root filesystem
> > >>                         >> contains neither /etc/init.d/ nor
> > >>                         /sbin/service.
> > >> >>
> > >> >> So, the plan is to bring up corosync and
> > >> pacemaker manually, later do the
> > >> >> cluster configuration using "pcs" commands.
> > >> >>
> > >> >> Regards,
> > >> >> Sriram
> > >> >>
> > >>                         >> _______________________________________________
> > >>                         >> Users mailing list: Users at clusterlabs.org
> > >>                         >> http://clusterlabs.org/mailman/listinfo/users
> > >>                         >>
> > >>                         >> Project Home: http://www.clusterlabs.org
> > >>                         >> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > >>                         >> Bugs: http://bugs.clusterlabs.org
> > >> >>
> > >> >
> > >> >
> > >> >
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >
> > >
> > >
> >
> >
>
>