[ClusterLabs] [ClusterLab] : Corosync not initializing successfully
Jan Friesse
jfriesse at redhat.com
Thu May 5 09:28:49 UTC 2016
Nikhil
> Found the root cause.
> In file schedwrk.c, the function handle2void() uses a union which was not
> initialized.
> Because of that, the handle value was computed incorrectly (the lower half
> was garbage).
>
> 56 static hdb_handle_t
> 57 void2handle (const void *v) { union u u={}; u.v = v; return u.h; }
> 58 static const void *
> 59 handle2void (hdb_handle_t h) { union u u={}; u.h = h; return u.v; }
>
> After initializing (as highlighted), the corosync initialization seems to
> be going through fine. Will check other things.
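For context, here is a minimal standalone sketch (not the corosync code path itself) of why that union misbehaves, assuming a 32-bit big-endian target, which the "lower half was garbage" symptom suggests: hdb_handle_t is 64-bit while a pointer is only 32-bit, so assigning u.v leaves half of u.h indeterminate.

    /* Hypothetical illustration only; the line numbers above refer to schedwrk.c. */
    #include <inttypes.h>
    #include <stdio.h>

    typedef uint64_t hdb_handle_t;      /* as in corosync's hdb.h */

    union u {
        const void  *v;                 /* 4 bytes on a 32-bit target */
        hdb_handle_t h;                 /* 8 bytes everywhere */
    };

    static hdb_handle_t
    void2handle (const void *v)
    {
        union u u;                      /* uninitialized: 4 of u.h's 8 bytes stay indeterminate */
        u.v = v;                        /* on 32-bit big-endian this fills only the upper half of u.h */
        return u.h;                     /* lower 32 bits of the returned handle are garbage */
    }

    int main (void)
    {
        int x = 42;
        printf ("handle = 0x%016" PRIx64 "\n", void2handle (&x));
        return 0;
    }

Zero-initializing the union (as in the patch quoted above) makes the value deterministic, though, as noted below, that alone is not the correct fix.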
Your patch is incorrect and actually doesn't work. As I said (when
pointing you to schedwrk.c), I will send you a proper patch, but fixing
that issue correctly is not easy.
Regards,
Honza
>
> -Regards
> Nikhil
>
> On Tue, May 3, 2016 at 7:04 PM, Nikhil Utane <nikhil.subscribed at gmail.com>
> wrote:
>
>> Thanks for your response, Dejan.
>>
>> I do not know yet whether this has anything to do with endianness.
>> FWIW, there could be something quirky with the system, so I'm keeping all
>> options open. :)
>>
>> I added some debug prints to understand what's happening under the hood.
>>
>> *Success case: (on x86 machine): *
>> [TOTEM ] entering OPERATIONAL state.
>> [TOTEM ] A new membership (10.206.1.7:137220) was formed. Members joined:
>> 181272839
>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
>> my_high_delivered=0
>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>> my_high_delivered=0
>> [TOTEM ] Delivering 0 to 1
>> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
>> [SYNC ] Nikhil: Inside sync_deliver_fn. header->id=1
>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=2,
>> my_high_delivered=1
>> [TOTEM ] Delivering 1 to 2
>> [TOTEM ] Delivering MCAST message with seq 2 to pending delivery queue
>> [SYNC ] Nikhil: Inside sync_deliver_fn. header->id=0
>> [SYNC ] Nikhil: Entering sync_barrier_handler
>> [SYNC ] Committing synchronization for corosync configuration map access
>> .
>> [TOTEM ] Delivering 2 to 4
>> [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
>> [TOTEM ] Delivering MCAST message with seq 4 to pending delivery queue
>> [CPG ] comparing: sender r(0) ip(10.206.1.7) ; members(old:0 left:0)
>> [CPG ] chosen downlist: sender r(0) ip(10.206.1.7) ; members(old:0
>> left:0)
>> [SYNC ] Committing synchronization for corosync cluster closed process
>> group service v1.01
>> *[MAIN ] Completed service synchronization, ready to provide service.*
>>
>>
>> *Failure case: (on ppc)*:
>> [TOTEM ] entering OPERATIONAL state.
>> [TOTEM ] A new membership (10.207.24.101:16) was formed. Members joined:
>> 181344357
>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=0,
>> my_high_delivered=0
>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>> my_high_delivered=0
>> [TOTEM ] Delivering 0 to 1
>> [TOTEM ] Delivering MCAST message with seq 1 to pending delivery queue
>> [SYNC ] Nikhil: Inside sync_deliver_fn header->id=1
>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>> my_high_delivered=1
>> [TOTEM ] Nikhil: Inside messages_deliver_to_app. end_point=1,
>> my_high_delivered=1
>> Above message repeats continuously.
>>
>> So it appears that in the failure case I do not receive messages with
>> sequence numbers 2-4.
>> If somebody can throw out some ideas, that would help a lot.
>>
>> -Thanks
>> Nikhil
>>
>> On Tue, May 3, 2016 at 5:26 PM, Dejan Muhamedagic <dejanmm at fastmail.fm>
>> wrote:
>>
>>> Hi,
>>>
>>> On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:
>>>>> As your hardware is probably capable of running ppcle, and if you have an
>>>>> environment at hand without too much effort, it might pay off to try that.
>>>>> There are of course distributions out there that support corosync on
>>>>> big-endian architectures, but I don't know if there is an automated
>>>>> regression for corosync on big-endian that would catch big-endian issues
>>>>> right away with something as current as your 2.3.5.
>>>>
>>>> No, we are not testing big-endian.
>>>>
>>>> So I totally agree with Klaus. Give ppcle a try. Also make sure all
>>>> nodes are little-endian. Corosync should work in a mixed BE/LE
>>>> environment, but because it's not tested, it may not work (and it's a
>>>> bug, so if ppcle works I will try to fix BE).
>>>
>>> I tested a cluster consisting of big-endian/little-endian nodes
>>> (s390 and x86-64), but that was a while ago. IIRC, all relevant
>>> bugs in corosync got fixed at that time. I don't know what the
>>> situation is with the latest version.
>>>
>>> Thanks,
>>>
>>> Dejan
>>>
>>>> Regards,
>>>> Honza
>>>>
>>>>>
>>>>> Regards,
>>>>> Klaus
>>>>>
>>>>> On 05/02/2016 06:44 AM, Nikhil Utane wrote:
>>>>>> Re-sending as I don't see my post on the thread.
>>>>>>
>>>>>> On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
>>>>>> <nikhil.subscribed at gmail.com <mailto:nikhil.subscribed at gmail.com>>
>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Looking for some guidance here as we are completely blocked
>>>>>> otherwise :(.
>>>>>>
>>>>>> -Regards
>>>>>> Nikhil
>>>>>>
>>>>>> On Fri, Apr 29, 2016 at 6:11 PM, Sriram <sriram.ec at gmail.com
>>>>>> <mailto:sriram.ec at gmail.com>> wrote:
>>>>>>
>>>>>> Corrected the subject.
>>>>>>
>>>>>> We went ahead and captured corosync debug logs for our ppc board.
>>>>>> After log analysis and comparison with the successful logs
>>>>>> (from the x86 machine), we didn't find *"[ MAIN ] Completed service
>>>>>> synchronization, ready to provide service."* in the ppc logs.
>>>>>> So it looks like corosync is not in a position to accept
>>>>>> connections from Pacemaker.
>>>>>> I even tried with the new corosync.conf, with no success.
>>>>>>
>>>>>> Any hints on this issue would be really helpful.
>>>>>>
>>>>>> Attaching ppc_notworking.log, x86_working.log, corosync.conf.
>>>>>>
>>>>>> Regards,
>>>>>> Sriram
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 29, 2016 at 2:44 PM, Sriram <sriram.ec at gmail.com
>>>>>> <mailto:sriram.ec at gmail.com>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I went ahead and made some changes in the file system (I
>>>>>> brought in /etc/init.d/corosync, /etc/init.d/pacemaker,
>>>>>> and /etc/sysconfig). After that I was able to run "pcs
>>>>>> cluster start".
>>>>>> But it failed with the following error:
>>>>>> # pcs cluster start
>>>>>> Starting Cluster...
>>>>>> Starting Pacemaker Cluster Manager[FAILED]
>>>>>> Error: unable to start pacemaker
>>>>>>
>>>>>> And in the /var/log/pacemaker.log, I saw these errors
>>>>>> pacemakerd: info: mcp_read_config: cmap connection
>>>>>> setup failed: CS_ERR_TRY_AGAIN. Retrying in 4s
>>>>>> Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
>>>>>> mcp_read_config: cmap connection setup failed:
>>>>>> CS_ERR_TRY_AGAIN. Retrying in 5s
>>>>>> Apr 29 08:53:52 [15863] node_cu pacemakerd: warning:
>>>>>> mcp_read_config: Could not connect to Cluster
>>>>>> Configuration Database API, error 6
>>>>>> Apr 29 08:53:52 [15863] node_cu pacemakerd: notice:
>>>>>> main: Could not obtain corosync config data, exiting
>>>>>> Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
>>>>>> crm_xml_cleanup: Cleaning up memory from libxml2
>>>>>>
>>>>>>
>>>>>> And in the /var/log/Debuglog, I saw these errors coming
>>>>>> from corosync:
>>>>>> 20160429 085347.487050 airv_cu daemon.warn corosync[12857]:
>>>>>> [QB ] Denied connection, is not ready (12857-15863-14)
>>>>>> 20160429 085347.487067 airv_cu daemon.info corosync[12857]:
>>>>>> [QB ] Denied connection, is not ready (12857-15863-14)
>>>>>>
>>>>>>
>>>>>> I browsed the code of libqb to find that it is failing in
>>>>>>
>>>>>>
>>> https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
>>>>>>
>>>>>> Line 600 :
>>>>>> handle_new_connection function
>>>>>>
>>>>>> Line 637:
>>>>>>     if (auth_result == 0 &&
>>>>>>         c->service->serv_fns.connection_accept) {
>>>>>>         res = c->service->serv_fns.connection_accept(c,
>>>>>>                 c->euid, c->egid);
>>>>>>     }
>>>>>>     if (res != 0) {
>>>>>>         goto send_response;
>>>>>>     }
>>>>>>
>>>>>> Any hints on this issue would be really helpful for me to
>>>>>> go ahead.
>>>>>> Please let me know if any logs are required.
>>>>>>
>>>>>> Regards,
>>>>>> Sriram
>>>>>>
>>>>>> On Thu, Apr 28, 2016 at 2:42 PM, Sriram
>>>>>> <sriram.ec at gmail.com <mailto:sriram.ec at gmail.com>>
>>> wrote:
>>>>>>
>>>>>> Thanks Ken and Emmanuel.
>>>>>> It's a big-endian machine. I will try running "pcs
>>>>>> cluster setup" and "pcs cluster start".
>>>>>> Inside cluster.py, "service pacemaker start" and
>>>>>> "service corosync start" are executed to bring up
>>>>>> pacemaker and corosync.
>>>>>> Those service scripts and the infrastructure needed to
>>>>>> bring up the processes in that manner
>>>>>> don't exist on my board.
>>>>>> As it is an embedded board with limited memory, a
>>>>>> full-fledged Linux is not installed.
>>>>>> Just curious to know what the reason could be that
>>>>>> pacemaker throws this error:
>>>>>>
>>>>>> "cmap connection setup failed: CS_ERR_TRY_AGAIN.
>>>>>> Retrying in 1s"
>>>>>>
>>>>>> Thanks for the response.
>>>>>>
>>>>>> Regards,
>>>>>> Sriram.
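As an aside on the CS_ERR_TRY_AGAIN errors quoted in this thread: corosync refuses IPC connections (libqb's "Denied connection, is not ready") until it has completed service synchronization, and the client library surfaces that as CS_ERR_TRY_AGAIN. Below is a hedged sketch of such a probe against the cmap service, roughly the kind of retry loop pacemakerd performs; it is an illustration, not pacemaker's actual mcp_read_config code.

    #include <stdio.h>
    #include <unistd.h>
    #include <corosync/cmap.h>

    int main (void)
    {
        cmap_handle_t handle;
        cs_error_t rc = CS_ERR_TRY_AGAIN;
        int delay;

        for (delay = 1; delay <= 5; delay++) {
            rc = cmap_initialize (&handle);     /* connect to corosync's cmap IPC */
            if (rc == CS_OK) {
                printf ("connected to cmap\n");
                cmap_finalize (handle);
                return 0;
            }
            if (rc != CS_ERR_TRY_AGAIN) {
                break;                          /* some other error; retrying won't help */
            }
            fprintf (stderr, "cmap connection setup failed: CS_ERR_TRY_AGAIN. "
                             "Retrying in %ds\n", delay);
            sleep (delay);
        }
        fprintf (stderr, "Could not connect to Cluster Configuration Database API, "
                         "error %d\n", rc);
        return 1;
    }

If such a probe never gets past CS_ERR_TRY_AGAIN, the problem is on the corosync side (it never reached "Completed service synchronization, ready to provide service"), which matches the ppc logs above.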
>>>>>>
>>>>>> On Thu, Apr 28, 2016 at 8:55 AM, Ken Gaillot
>>>>>> <kgaillot at redhat.com <mailto:kgaillot at redhat.com>>
>>> wrote:
>>>>>>
>>>>>> On 04/27/2016 11:25 AM, emmanuel segura wrote:
>>>>>> > You need to use pcs to do everything: pcs cluster setup and pcs
>>>>>> > cluster start. Try the Red Hat docs for more information.
>>>>>>
>>>>>> Agreed -- pcs cluster setup will create a proper
>>>>>> corosync.conf for you.
>>>>>> Your corosync.conf below uses corosync 1 syntax,
>>>>>> and there were
>>>>>> significant changes in corosync 2. In particular,
>>>>>> you don't need the
>>>>>> file created in step 4, because pacemaker is no
>>>>>> longer launched via a
>>>>>> corosync plugin.
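For reference, a minimal corosync 2.x-style corosync.conf, roughly the shape that "pcs cluster setup" generates, might look like the sketch below (node name and options are taken from this thread; the exact file pcs writes differs by version):

    totem {
        version: 2
        cluster_name: mycluster
        transport: udpu
    }

    nodelist {
        node {
            ring0_addr: node_cu
            nodeid: 1
        }
    }

    quorum {
        provider: corosync_votequorum
    }

    logging {
        to_syslog: yes
    }

Note there is no service { } stanza and no /etc/corosync/service.d/pcmk file: with corosync 2, pacemaker connects over IPC rather than being loaded as a plugin.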
>>>>>>
>>>>>> > 2016-04-27 17:28 GMT+02:00 Sriram
>>>>>> <sriram.ec at gmail.com <mailto:sriram.ec at gmail.com
>>>>> :
>>>>>> >> Dear All,
>>>>>> >>
>>>>>> >> I'm trying to use pacemaker and corosync for the clustering
>>>>>> >> requirement that came up recently.
>>>>>> >> We have cross-compiled corosync, pacemaker, and pcs (python) for the
>>>>>> >> ppc environment (the target board where pacemaker and corosync are
>>>>>> >> supposed to run).
>>>>>> >> I'm having trouble bringing up pacemaker in that environment, though
>>>>>> >> I could successfully bring up corosync.
>>>>>> >> Any help is welcome.
>>>>>> >>
>>>>>> >> I'm using these versions of pacemaker and corosync:
>>>>>> >> [root at node_cu pacemaker]# corosync -v
>>>>>> >> Corosync Cluster Engine, version '2.3.5'
>>>>>> >> Copyright (c) 2006-2009 Red Hat, Inc.
>>>>>> >> [root at node_cu pacemaker]# pacemakerd -$
>>>>>> >> Pacemaker 1.1.14
>>>>>> >> Written by Andrew Beekhof
>>>>>> >>
>>>>>> >> For running corosync, I did the following.
>>>>>> >> 1. Created the following directories,
>>>>>> >> /var/lib/pacemaker
>>>>>> >> /var/lib/corosync
>>>>>> >> /var/lib/pacemaker
>>>>>> >> /var/lib/pacemaker/cores
>>>>>> >> /var/lib/pacemaker/pengine
>>>>>> >> /var/lib/pacemaker/blackbox
>>>>>> >> /var/lib/pacemaker/cib
>>>>>> >>
>>>>>> >>
>>>>>> >> 2. Created a file called corosync.conf under
>>>>>> /etc/corosync folder with the
>>>>>> >> following contents
>>>>>> >>
>>>>>> >> totem {
>>>>>> >>
>>>>>> >> version: 2
>>>>>> >> token: 5000
>>>>>> >> token_retransmits_before_loss_const:
>>> 20
>>>>>> >> join: 1000
>>>>>> >> consensus: 7500
>>>>>> >> vsftype: none
>>>>>> >> max_messages: 20
>>>>>> >> secauth: off
>>>>>> >> cluster_name: mycluster
>>>>>> >> transport: udpu
>>>>>> >> threads: 0
>>>>>> >> clear_node_high_bit: yes
>>>>>> >>
>>>>>> >> interface {
>>>>>> >> ringnumber: 0
>>>>>> >> # The following three values
>>>>>> need to be set based on your
>>>>>> >> environment
>>>>>> >> bindnetaddr: 10.x.x.x
>>>>>> >> mcastaddr: 226.94.1.1
>>>>>> >> mcastport: 5405
>>>>>> >> }
>>>>>> >> }
>>>>>> >>
>>>>>> >> logging {
>>>>>> >> fileline: off
>>>>>> >> to_syslog: yes
>>>>>> >> to_stderr: no
>>>>>> >> to_syslog: yes
>>>>>> >> logfile: /var/log/corosync.log
>>>>>> >> syslog_facility: daemon
>>>>>> >> debug: on
>>>>>> >> timestamp: on
>>>>>> >> }
>>>>>> >>
>>>>>> >> amf {
>>>>>> >> mode: disabled
>>>>>> >> }
>>>>>> >>
>>>>>> >> quorum {
>>>>>> >> provider: corosync_votequorum
>>>>>> >> }
>>>>>> >>
>>>>>> >> nodelist {
>>>>>> >> node {
>>>>>> >> ring0_addr: node_cu
>>>>>> >> nodeid: 1
>>>>>> >> }
>>>>>> >> }
>>>>>> >>
>>>>>> >> 3. Created authkey under /etc/corosync
>>>>>> >>
>>>>>> >> 4. Created a file called pcmk under
>>>>>> /etc/corosync/service.d and contents as
>>>>>> >> below,
>>>>>> >> cat pcmk
>>>>>> >> service {
>>>>>> >> # Load the Pacemaker Cluster Resource
>>>>>> Manager
>>>>>> >> name: pacemaker
>>>>>> >> ver: 1
>>>>>> >> }
>>>>>> >>
>>>>>> >> 5. Added the node name "node_cu" in /etc/hosts
>>>>>> with 10.X.X.X ip
>>>>>> >>
>>>>>> >> 6. ./corosync -f -p & --> this step started
>>>>>> corosync
>>>>>> >>
>>>>>> >> [root at node_cu pacemaker]# netstat -alpn |
>>> grep
>>>>>> -i coros
>>>>>> >> udp 0 0 10.X.X.X:61841
>>> 0.0.0.0:*
>>>>>> >> 9133/corosync
>>>>>> >> udp 0 0 10.X.X.X:5405
>>> 0.0.0.0:*
>>>>>> >> 9133/corosync
>>>>>> >> unix 2 [ ACC ] STREAM LISTENING
>>>>>> 148888 9133/corosync
>>>>>> >> @quorum
>>>>>> >> unix 2 [ ACC ] STREAM LISTENING
>>>>>> 148884 9133/corosync
>>>>>> >> @cmap
>>>>>> >> unix 2 [ ACC ] STREAM LISTENING
>>>>>> 148887 9133/corosync
>>>>>> >> @votequorum
>>>>>> >> unix 2 [ ACC ] STREAM LISTENING
>>>>>> 148885 9133/corosync
>>>>>> >> @cfg
>>>>>> >> unix 2 [ ACC ] STREAM LISTENING
>>>>>> 148886 9133/corosync
>>>>>> >> @cpg
>>>>>> >> unix 2 [ ] DGRAM
>>>>>> 148840 9133/corosync
>>>>>> >>
>>>>>> >> 7. ./pacemakerd -f & gives the following error
>>>>>> and exits.
>>>>>> >> [root at node_cu pacemaker]# pacemakerd -f
>>>>>> >> cmap connection setup failed:
>>>>>> CS_ERR_TRY_AGAIN. Retrying in 1s
>>>>>> >> cmap connection setup failed:
>>>>>> CS_ERR_TRY_AGAIN. Retrying in 2s
>>>>>> >> cmap connection setup failed:
>>>>>> CS_ERR_TRY_AGAIN. Retrying in 3s
>>>>>> >> cmap connection setup failed:
>>>>>> CS_ERR_TRY_AGAIN. Retrying in 4s
>>>>>> >> cmap connection setup failed:
>>>>>> CS_ERR_TRY_AGAIN. Retrying in 5s
>>>>>> >> Could not connect to Cluster Configuration
>>>>>> Database API, error 6
>>>>>> >>
>>>>>> >> Can you please point out what is missing in these steps?
>>>>>> >>
>>>>>> >> Before trying these steps, I tried running "pcs cluster start", but
>>>>>> >> that command fails because the "service" script is not found, as the
>>>>>> >> root filesystem doesn't contain either /etc/init.d/ or /sbin/service.
>>>>>> >>
>>>>>> >> So the plan is to bring up corosync and pacemaker manually, and later
>>>>>> >> do the cluster configuration using "pcs" commands.
>>>>>> >>
>>>>>> >> Regards,
>>>>>> >> Sriram
>>>>>> >>
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
More information about the Users
mailing list