[Pacemaker] Multiple threads after rebooting server: the node doesn't go online

Giovanni Di Milia gdimilia at cfa.harvard.edu
Tue Nov 17 16:31:57 EST 2009


Another problem has appeared:
after rebooting one server I often get a cluster partition, and
both servers elect themselves DC.

Even if the partition doesn't appear right after the reboot of one
server (e.g. serverA), if I restart corosync on the other
server (e.g. serverB), the partition appears.
Then, if I also restart corosync on the first server (serverA),
everything works fine again.
But if I restart corosync on the second server (serverB), nothing
changes and the partition appears again.

It seems to me that there is still something wrong with the first
run of corosync just after the server reboot.

I didn't configure any fencing method, because my configuration is
really simple and I don't think I need it.

Thanks again for your patience,
Giovanni


On Nov 17, 2009, at 12:07 PM, Giovanni Di Milia wrote:

> After disabling syslog, the problem disappears.
>
> Thank you very much,
> Giovanni
>
>
>
> On Nov 16, 2009, at 4:51 PM, hj lee wrote:
>
>> Hi,
>>
>> Please disable syslog in openais.conf and try again. This issue  
>> seems to be related to the interaction between fork() and syslog().
>>
>> hj
>>
>> On Fri, Nov 13, 2009 at 1:08 PM, Giovanni Di Milia <gdimilia at cfa.harvard.edu> wrote:
>> Thank you very much for your response.
>>
>> The only thing I really don't understand is: why doesn't this  
>> problem appear in any of my simulations?
>> I configured at least 7 pairs of virtual servers with VMware 2 and  
>> CentOS 5.3 and 5.4 (32- and 64-bit), and I never had this kind of  
>> problem!
>>
>> The only difference in the configuration is that I used private IPs  
>> in the simulations and public IPs on the real servers, but I  
>> don't think that matters.
>>
>> Thanks for your patience,
>> Giovanni
>>
>>
>>
>> On Nov 13, 2009, at 1:36 PM, hj lee wrote:
>>
>>> Hi,
>>>
>>> I have the same problem on CentOS 5.3 with pacemaker-1.0.5 and  
>>> openais-0.80.5. This is an openais bug! There are two problems:
>>> 1. Starting the openais service sometimes segfaults. This is more  
>>> likely to happen if the openais service is started before syslog.
>>> 2. The seg fault handler of openais calls syslog(). syslog() is  
>>> one of the UNSAFE functions that must not be called from a signal  
>>> handler, because it is not reentrant.
>>>
>>> To fix this issue: get the openais source, find the sigsegv_handler  
>>> function in exec/main.c and just comment out log_flush(), as shown  
>>> below. Then recompile and install it (make and make install). The  
>>> log_flush() call should be removed from all signal handlers in the  
>>> openais code base. I am still not sure where the seg fault occurs,  
>>> but commenting out log_flush() prevents it.
>>>
>>>
>>> -------------------------------------------------------------------------
>>> static void sigsegv_handler (int num)
>>> {
>>>         signal (SIGSEGV, SIG_DFL);
>>> //      log_flush ();
>>>         raise (SIGSEGV);
>>> }
>>>
>>> Thanks
>>> hj
>>>
>>> On Thu, Nov 12, 2009 at 4:21 PM, Giovanni Di Milia <gdimilia at cfa.harvard.edu> wrote:
>>> I set up a cluster of two CentOS 5.4 x86_64 servers with pacemaker  
>>> 1.0.6 and corosync 1.1.2.
>>>
>>> I only installed the x86_64 packages (yum install pacemaker tries  
>>> to install the 32-bit ones as well).
>>>
>>> I configured a shared cluster IP (it's a public IP) and a cluster  
>>> website.
>>>
>>> Everything works fine if I stop corosync on one of the two  
>>> servers (the services pass from one machine to the other without  
>>> problems), but if I reboot one server, when it comes back up it  
>>> cannot go online in the cluster.
>>> I also noticed that there are several corosync threads, and if I  
>>> kill all of them and then start corosync again, everything works  
>>> fine again.
>>>
>>> I don't know what is happening, and I'm not able to reproduce the  
>>> same situation on any virtual servers!
>>>
>>> Thanks,
>>> Giovanni
>>>
>>>
>>>
>>> the configuration of corosync is the following:
>>>
>>> ##############################################
>>> # Please read the corosync.conf.5 manual page
>>> compatibility: whitetank
>>>
>>> aisexec {
>>>        # Run as root - this is necessary to be able to manage  
>>> resources with Pacemaker
>>>        user:   root
>>>        group:  root
>>> }
>>>
>>> service {
>>>        # Load the Pacemaker Cluster Resource Manager
>>>        ver:       0
>>>        name:      pacemaker
>>>        use_mgmtd: yes
>>>        use_logd:  yes
>>> }
>>>
>>> totem {
>>>        version: 2
>>>
>>>        # How long before declaring a token lost (ms)
>>>        token:          5000
>>>
>>>        # How many token retransmits before forming a new  
>>> configuration
>>>        token_retransmits_before_loss_const: 10
>>>
>>>        # How long to wait for join messages in the membership  
>>> protocol (ms)
>>>        join:           1000
>>>
>>>        # How long to wait for consensus to be achieved before  
>>> starting a new round of membership configuration (ms)
>>>        consensus:      2500
>>>
>>>        # Turn off the virtual synchrony filter
>>>        vsftype:        none
>>>
>>>        # Number of messages that may be sent by one processor on  
>>> receipt of the token
>>>        max_messages:   20
>>>
>>>        # Stagger sending the node join messages by 1..send_join ms
>>>        send_join: 45
>>>
>>>        # Limit generated nodeids to 31-bits (positive signed  
>>> integers)
>>>        clear_node_high_bit: yes
>>>
>>>        # Disable encryption
>>>        secauth:        off
>>>
>>>        # How many threads to use for encryption/decryption
>>>        threads:        0
>>>
>>>        # Optionally assign a fixed node id (integer)
>>>        # nodeid:         1234
>>>
>>>        interface {
>>>                ringnumber: 0
>>>
>>>                # The following values need to be set based on your  
>>> environment
>>>                bindnetaddr: XXX.XXX.XXX.0  # here I put the right IP for my configuration
>>>                mcastaddr: 226.94.1.1
>>>                mcastport: 4000
>>>        }
>>> }
>>>
>>> logging {
>>>        fileline: off
>>>        to_stderr: yes
>>>        to_logfile: yes
>>>        to_syslog: yes
>>>        logfile: /tmp/corosync.log
>>>        debug: off
>>>        timestamp: on
>>>        logger_subsys {
>>>                subsys: AMF
>>>                debug: off
>>>        }
>>> }
>>>
>>> amf {
>>>        mode: disabled
>>> }
>>>
>>> ##################################################
>>>
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list
>>> Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>>
>>>
>>> -- 
>>> Dream with longterm vision!
>>> kerdosa
>>
>>
>>
>>
>>
>>
>


