[ClusterLabs] [ClusterLab] : Corosync not initializing successfully
Nikhil Utane
nikhil.subscribed at gmail.com
Thu May 5 11:15:20 CEST 2016
Found the root-cause.
In schedwrk.c, the function handle2void() uses a union that was not
initialized.
Because of that, the handle value was computed incorrectly (its lower half
was garbage).
static hdb_handle_t
void2handle (const void *v) { union u u={}; u.v = v; return u.h; }
static const void *
handle2void (hdb_handle_t h) { union u u={}; u.h = h; return u.v; }
After initializing the union (the ={} in both functions above), corosync
initialization seems to be going through fine. I will check other things.
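
For anyone following along, the idea behind the fix can be sketched as
below. This is only an illustration, not the upstream patch: the 64-bit
typedef for hdb_handle_t and the layout of union u are assumptions about
schedwrk.c, memset() stands in for the ={} initializer, and
round_trip_selfcheck() is a hypothetical helper. On a 32-bit target the
pointer member covers only 4 of the union's 8 bytes, so zeroing the union
first keeps the remaining bytes of the handle deterministic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: hdb_handle_t is a 64-bit unsigned integer. */
typedef uint64_t hdb_handle_t;

/* Assumption: the union overlays the handle with a pointer. */
union u {
    hdb_handle_t h;
    const void *v;
};

/* Zero all bytes first; a 32-bit pointer fills only part of the union. */
static hdb_handle_t void2handle(const void *v)
{
    union u u;
    memset(&u, 0, sizeof(u));
    u.v = v;
    return u.h;
}

/* Same zeroing on the way back, so no byte is ever read uninitialized. */
static const void *handle2void(hdb_handle_t h)
{
    union u u;
    memset(&u, 0, sizeof(u));
    u.h = h;
    return u.v;
}

/* Hypothetical helper: the conversion must round-trip and be repeatable. */
static void round_trip_selfcheck(void)
{
    int object = 0;
    hdb_handle_t h = void2handle(&object);

    assert(handle2void(h) == (const void *)&object);
    assert(void2handle(&object) == h);
}
```

Calling round_trip_selfcheck() from a small test program on the target
should be a quick way to confirm that the same pointer always yields the
same handle.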
-Regards
Nikhil
On Tue, May 3, 2016 at 7:04 PM, Nikhil Utane <nikhil.subscribed at gmail.com>
wrote:
> Thanks for your response Dejan.
>
> I do not know yet whether this has anything to do with endianness.
> FWIW, there could be something quirky with the system so keeping all
> options open. :)
>
> I added some debug prints to understand what's happening under the hood.
>
> *Success case: (on x86 machine): *
> [TOTEM ] entering OPERATIONAL state.
> [TOTEM ] A new membership (10.206.1.7:137220) was formed. Members joined:
> 181272839
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
> my_high_delivered=0
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=0
> [TOTEM ] Delivering 0 to 1
> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
> [SYNC ] Nikhil: Inside sync_deliver_fn. header->id=1
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
> my_high_delivered=1
> [TOTEM ] Delivering 1 to 2
> [TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
> [SYNC ] Nikhil: Inside sync_deliver_fn. header->id=0
> [SYNC ] Nikhil: Entering sync_barrier_handler
> [SYNC ] Committing synchronization for corosync configuration map access
> .
> [TOTEM ] Delivering 2 to 4
> [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
> [TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
> [CPG ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
> [CPG ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0
> left:0)
> [SYNC ] Committing synchronization for corosync cluster closed process
> group service v1.01
> *[MAIN ] Completed service synchronization, ready to provide service.*
>
>
> *Failure case: (on ppc)*:
> [TOTEM ] entering OPERATIONAL state.
> [TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined:
> 181344357
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
> my_high_delivered=0
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=0
> [TOTEM ] Delivering 0 to 1
> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
> [SYNC ] Nikhil: Inside sync_deliver_fn header->id=1
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=1
> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
> my_high_delivered=1
> Above message repeats continuously.
>
> So it appears that in the failure case I do not receive messages with
> sequence numbers 2-4.
> If somebody can suggest some ideas, that would help a lot.
>
> -Thanks
> Nikhil
>
> On Tue, May 3, 2016 at 5:26 PM, Dejan Muhamedagic <dejanmm at fastmail.fm>
> wrote:
>
>> Hi,
>>
>> On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:
>> > >As your hardware is probably capable of running ppcle, and if you
>> > >have an environment at hand without too much effort, it might pay
>> > >off to try that.
>> > >There are of course distributions out there that support corosync
>> > >on big-endian architectures, but I don't know if there is an
>> > >automated regression test for corosync on big-endian that would
>> > >catch big-endian issues right away with something as current as
>> > >your 2.3.5.
>> >
>> > No we are not testing big-endian.
>> >
>> > So I totally agree with Klaus. Give ppcle a try. Also make sure all
>> > nodes are little-endian. Corosync should work in a mixed BE/LE
>> > environment, but because that is not tested, it may not work (and
>> > that would be a bug, so if ppcle works I will try to fix BE).
>>
>> I tested a cluster consisting of big endian/little endian nodes
>> (s390 and x86-64), but that was a while ago. IIRC, all relevant
>> bugs in corosync got fixed at that time. Don't know what is the
>> situation with the latest version.
>>
>> Thanks,
>>
>> Dejan
>>
>> > Regards,
>> > Honza
>> >
>> > >
>> > >Regards,
>> > >Klaus
>> > >
>> > >On 05/02/2016 06:44 AM, Nikhil Utane wrote:
>> > >>Re-sending as I don't see my post on the thread.
>> > >>
>> > >>On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
>> > >><nikhil.subscribed at gmail.com> wrote:
>> > >>
>> > >> Hi,
>> > >>
>> > >> Looking for some guidance here as we are completely blocked
>> > >> otherwise :(.
>> > >>
>> > >> -Regards
>> > >> Nikhil
>> > >>
>> > >> On Fri, Apr 29, 2016 at 6:11 PM, Sriram <sriram.ec at gmail.com> wrote:
>> > >>
>> > >> Corrected the subject.
>> > >>
>> > >> We went ahead and captured corosync debug logs for our ppc
>> > >> board. After comparing them with the successful logs (from
>> > >> the x86 machine), we did not find "[MAIN ] Completed service
>> > >> synchronization, ready to provide service." in the ppc logs.
>> > >> So it looks like corosync is not in a position to accept
>> > >> connections from Pacemaker.
>> > >> I also tried a new corosync.conf, with no success.
>> > >>
>> > >> Any hints on this issue would be really helpful.
>> > >>
>> > >> Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>> > >>
>> > >> Regards,
>> > >> Sriram
>> > >>
>> > >>
>> > >>
>> > >> On Fri, Apr 29, 2016 at 2:44 PM, Sriram <sriram.ec at gmail.com> wrote:
>> > >>
>> > >> Hi,
>> > >>
>> > >> I went ahead and made some changes to the file system (I
>> > >> brought in /etc/init.d/corosync, /etc/init.d/pacemaker and
>> > >> /etc/sysconfig). After that I was able to run "pcs
>> > >> cluster start".
>> > >> But it failed with the following error:
>> > >> # pcs cluster start
>> > >> Starting Cluster...
>> > >> Starting Pacemaker Cluster Manager[FAILED]
>> > >> Error: unable to start pacemaker
>> > >>
>> > >> And in the /var/log/pacemaker.log, I saw these errors
>> > >> pacemakerd: info: mcp_read_config: cmap connection
>> > >> setup failed: CS_ERR_TRY_AGAIN. Retrying in 4s
>> > >> Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
>> > >> mcp_read_config: cmap connection setup failed:
>> > >> CS_ERR_TRY_AGAIN. Retrying in 5s
>> > >> Apr 29 08:53:52 [15863] node_cu pacemakerd: warning:
>> > >> mcp_read_config: Could not connect to Cluster
>> > >> Configuration Database API, error 6
>> > >> Apr 29 08:53:52 [15863] node_cu pacemakerd: notice:
>> > >> main: Could not obtain corosync config data, exiting
>> > >> Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
>> > >> crm_xml_cleanup: Cleaning up memory from libxml2
>> > >>
>> > >>
>> > >> And in the /var/log/Debuglog, I saw these errors coming
>> > >> from corosync
>> > >> 20160429 085347.487050 airv_cu daemon.warn corosync[12857]:
>> > >> [QB ] Denied connection, is not ready (12857-15863-14)
>> > >> 20160429 085347.487067 airv_cu daemon.info corosync[12857]:
>> > >> [QB ] Denied connection, is not ready (12857-15863-14)
>> > >>
>> > >>
>> > >> I browsed the code of libqb to find that it is failing in
>> > >>
>> > >>
>> https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
>> > >>
>> > >> Line 600 :
>> > >> handle_new_connection function
>> > >>
>> > >> Line 637:
>> > >> if (auth_result == 0 &&
>> > >> c->service->serv_fns.connection_accept) {
>> > >> res = c->service->serv_fns.connection_accept(c,
>> > >> c->euid, c->egid);
>> > >> }
>> > >> if (res != 0) {
>> > >> goto send_response;
>> > >> }
>> > >>
>> > >> Any hints on this issue would be really helpful for me to
>> > >> go ahead.
>> > >> Please let me know if any logs are required,
>> > >>
>> > >> Regards,
>> > >> Sriram
>> > >>
>> > >> On Thu, Apr 28, 2016 at 2:42 PM, Sriram
>> > >> <sriram.ec at gmail.com> wrote:
>> > >>
>> > >> Thanks Ken and Emmanuel.
>> > >> It's a big-endian machine. I will try running "pcs
>> > >> cluster setup" and "pcs cluster start".
>> > >> Inside cluster.py, "service pacemaker start" and
>> > >> "service corosync start" are executed to bring up
>> > >> pacemaker and corosync.
>> > >> Those service scripts, and the infrastructure needed to
>> > >> bring up the processes in that manner, don't exist on
>> > >> my board.
>> > >> As it is an embedded board with limited memory, a
>> > >> full-fledged Linux is not installed.
>> > >> Just curious to know what could be the reason
>> > >> pacemaker throws this error:
>> > >>
>> > >> "cmap connection setup failed: CS_ERR_TRY_AGAIN.
>> > >> Retrying in 1s"
>> > >>
>> > >> Thanks for response.
>> > >>
>> > >> Regards,
>> > >> Sriram.
>> > >>
>> > >> On Thu, Apr 28, 2016 at 8:55 AM, Ken Gaillot
>> > >> <kgaillot at redhat.com> wrote:
>> > >>
>> > >> On 04/27/2016 11:25 AM, emmanuel segura wrote:
>> > >> > you need to use pcs to do everything, pcs cluster setup and pcs
>> > >> > cluster start, try to use the redhat docs for more information.
>> > >>
>> > >> Agreed -- pcs cluster setup will create a proper corosync.conf
>> > >> for you. Your corosync.conf below uses corosync 1 syntax, and
>> > >> there were significant changes in corosync 2. In particular, you
>> > >> don't need the file created in step 4, because pacemaker is no
>> > >> longer launched via a corosync plugin.
>> > >>
>> > >> > 2016-04-27 17:28 GMT+02:00 Sriram <sriram.ec at gmail.com>:
>> > >> >> Dear All,
>> > >> >>
>> > >> >> I'm trying to use pacemaker and corosync for a clustering
>> > >> >> requirement that came up recently.
>> > >> >> We have cross-compiled corosync, pacemaker and pcs (python)
>> > >> >> for a ppc environment (the target board where pacemaker and
>> > >> >> corosync are supposed to run).
>> > >> >> I'm having trouble bringing up pacemaker in that environment,
>> > >> >> though I could successfully bring up corosync.
>> > >> >> Any help is welcome.
>> > >> >>
>> > >> >> I'm using these versions of pacemaker and corosync:
>> > >> >> [root at node_cu pacemaker]# corosync -v
>> > >> >> Corosync Cluster Engine, version '2.3.5'
>> > >> >> Copyright (c) 2006-2009 Red Hat, Inc.
>> > >> >> [root at node_cu pacemaker]# pacemakerd -$
>> > >> >> Pacemaker 1.1.14
>> > >> >> Written by Andrew Beekhof
>> > >> >>
>> > >> >> For running corosync, I did the following.
>> > >> >> 1. Created the following directories,
>> > >> >> /var/lib/pacemaker
>> > >> >> /var/lib/corosync
>> > >> >> /var/lib/pacemaker
>> > >> >> /var/lib/pacemaker/cores
>> > >> >> /var/lib/pacemaker/pengine
>> > >> >> /var/lib/pacemaker/blackbox
>> > >> >> /var/lib/pacemaker/cib
>> > >> >>
>> > >> >>
>> > >> >> 2. Created a file called corosync.conf under the /etc/corosync
>> > >> >> folder with the following contents
>> > >> >>
>> > >> >> totem {
>> > >> >>
>> > >> >> version: 2
>> > >> >> token: 5000
>> > >> >> token_retransmits_before_loss_const: 20
>> > >> >> join: 1000
>> > >> >> consensus: 7500
>> > >> >> vsftype: none
>> > >> >> max_messages: 20
>> > >> >> secauth: off
>> > >> >> cluster_name: mycluster
>> > >> >> transport: udpu
>> > >> >> threads: 0
>> > >> >> clear_node_high_bit: yes
>> > >> >>
>> > >> >> interface {
>> > >> >> ringnumber: 0
>> > >> >> # The following three values need to be set based on your environment
>> > >> >> bindnetaddr: 10.x.x.x
>> > >> >> mcastaddr: 226.94.1.1
>> > >> >> mcastport: 5405
>> > >> >> }
>> > >> >> }
>> > >> >>
>> > >> >> logging {
>> > >> >> fileline: off
>> > >> >> to_syslog: yes
>> > >> >> to_stderr: no
>> > >> >> to_syslog: yes
>> > >> >> logfile: /var/log/corosync.log
>> > >> >> syslog_facility: daemon
>> > >> >> debug: on
>> > >> >> timestamp: on
>> > >> >> }
>> > >> >>
>> > >> >> amf {
>> > >> >> mode: disabled
>> > >> >> }
>> > >> >>
>> > >> >> quorum {
>> > >> >> provider: corosync_votequorum
>> > >> >> }
>> > >> >>
>> > >> >> nodelist {
>> > >> >> node {
>> > >> >> ring0_addr: node_cu
>> > >> >> nodeid: 1
>> > >> >> }
>> > >> >> }
>> > >> >>
>> > >> >> 3. Created authkey under /etc/corosync
>> > >> >>
>> > >> >> 4. Created a file called pcmk under /etc/corosync/service.d
>> > >> >> with the contents below,
>> > >> >> cat pcmk
>> > >> >> service {
>> > >> >> # Load the Pacemaker Cluster Resource Manager
>> > >> >> name: pacemaker
>> > >> >> ver: 1
>> > >> >> }
>> > >> >>
>> > >> >> 5. Added the node name "node_cu" in /etc/hosts with 10.X.X.X ip
>> > >> >>
>> > >> >> 6. ./corosync -f -p & --> this step started corosync
>> > >> >>
>> > >> >> [root at node_cu pacemaker]# netstat -alpn | grep -i coros
>> > >> >> udp   0  0  10.X.X.X:61841  0.0.0.0:*  9133/corosync
>> > >> >> udp   0  0  10.X.X.X:5405   0.0.0.0:*  9133/corosync
>> > >> >> unix  2  [ ACC ]  STREAM  LISTENING  148888  9133/corosync  @quorum
>> > >> >> unix  2  [ ACC ]  STREAM  LISTENING  148884  9133/corosync  @cmap
>> > >> >> unix  2  [ ACC ]  STREAM  LISTENING  148887  9133/corosync  @votequorum
>> > >> >> unix  2  [ ACC ]  STREAM  LISTENING  148885  9133/corosync  @cfg
>> > >> >> unix  2  [ ACC ]  STREAM  LISTENING  148886  9133/corosync  @cpg
>> > >> >> unix  2  [ ]      DGRAM              148840  9133/corosync
>> > >> >>
>> > >> >> 7. ./pacemakerd -f & gives the following error and exits.
>> > >> >> [root at node_cu pacemaker]# pacemakerd -f
>> > >> >> cmap connection setup failed: CS_ERR_TRY_AGAIN. Retrying in 1s
>> > >> >> cmap connection setup failed: CS_ERR_TRY_AGAIN. Retrying in 2s
>> > >> >> cmap connection setup failed: CS_ERR_TRY_AGAIN. Retrying in 3s
>> > >> >> cmap connection setup failed: CS_ERR_TRY_AGAIN. Retrying in 4s
>> > >> >> cmap connection setup failed: CS_ERR_TRY_AGAIN. Retrying in 5s
>> > >> >> Could not connect to Cluster Configuration Database API, error 6
>> > >> >>
>> > >> >> Can you please point out what is missing in these steps?
>> > >> >>
>> > >> >> Before trying these steps, I tried running "pcs cluster
>> > >> >> start", but that command fails with "service" script not
>> > >> >> found, as the root filesystem contains neither /etc/init.d/
>> > >> >> nor /sbin/service.
>> > >> >>
>> > >> >> So the plan is to bring up corosync and pacemaker manually,
>> > >> >> and later do the cluster configuration using "pcs" commands.
>> > >> >>
>> > >> >> Regards,
>> > >> >> Sriram
>> > >> >>
>> > >> >>
>> _______________________________________________
>> > >> >> Users mailing list: Users at clusterlabs.org
>> > >> >> http://clusterlabs.org/mailman/listinfo/users
>> > >> >>
>> > >> >> Project Home: http://www.clusterlabs.org
>> > >> >> Getting started:
>> > >>
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > >> >> Bugs: http://bugs.clusterlabs.org
>> > >> >>
>> > >> >
>> > >> >
>> > >> >