[Pacemaker] Multiple thread after rebooting server: the node doesn't go online

Giovanni Di Milia gdimilia at cfa.harvard.edu
Fri Nov 13 15:08:43 EST 2009


Thank you very much for your response.

The only thing I really don't understand is: why doesn't this problem
appear in any of my simulations?
I configured at least 7 pairs of virtual servers with VMware 2 and
CentOS 5.3 and 5.4 (32- and 64-bit) and I never had this kind of
problem!

The only difference in the configuration is that I used private IPs  
for the simulations and public IPs for the real servers, but I don't  
think it is important.

Thanks for your patience,
Giovanni



On Nov 13, 2009, at 1:36 PM, hj lee wrote:

> Hi,
>
> I have the same problem on CentOS 5.3 with pacemaker-1.0.5 and
> openais-0.80.5. This is an openais bug! There are two problems:
> 1. Starting the openais service sometimes segfaults. This is more
> likely to happen if the openais service is started before syslog.
> 2. The segfault handler of openais calls syslog(). syslog() is an
> UNSAFE function that must not be called from a signal handler
> because it is non-reentrant.
>
> To fix this issue: get the openais source, find the sigsegv_handler
> function in exec/main.c and comment out log_flush(), as shown below.
> Then recompile and install it (make and make install). log_flush
> should be removed from all signal handlers in the openais code base. I
> am still not sure where the segfault occurs, but commenting out
> log_flush prevents it.
>
>
> -------------------------------------------------------------------------
> static void sigsegv_handler (int num)
> {
>         signal (SIGSEGV, SIG_DFL);
> //      log_flush ();
>         raise (SIGSEGV);
> }
>
> Thanks
> hj
>
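
The signal-safety point in hj's reply can be illustrated with a minimal sketch. It is not the openais code: the function name install_sigsegv_handler and the message text are hypothetical, and the only assumption is a POSIX system. The handler limits itself to calls on the POSIX async-signal-safe list (write, signal, raise) rather than syslog() or a log flush.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical fixed message, so the handler does no formatting or allocation. */
static const char segv_msg[] = "caught SIGSEGV, re-raising with default action\n";

/* Handler restricted to async-signal-safe calls: write(), signal(), raise(). */
void sigsegv_handler(int num)
{
    (void)num;
    /* write() is async-signal-safe; syslog() and stdio functions are not. */
    ssize_t ignored = write(STDERR_FILENO, segv_msg, sizeof segv_msg - 1);
    (void)ignored;
    /* Restore the default action and re-raise so the process still
       terminates (and dumps core) as it would without a handler. */
    signal(SIGSEGV, SIG_DFL);
    raise(SIGSEGV);
}

/* Install the handler. Returns 0 on success, -1 on failure, like sigaction(). */
int install_sigsegv_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sigsegv_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

A program would call install_sigsegv_handler() once, early in main(). Anything that needs the logging subsystem is better done outside the handler.
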
> On Thu, Nov 12, 2009 at 4:21 PM, Giovanni Di Milia <gdimilia at cfa.harvard.edu> wrote:
> I set up a cluster of two CentOS 5.4 x86_64 servers with pacemaker
> 1.06 and corosync 1.1.2.
>
> I only installed the x86_64 packages (yum install pacemaker tries to
> install the 32-bit ones as well).
>
> I configured a shared cluster IP (it's a public IP) and a cluster
> website.
>
> Everything works fine if I stop corosync on one of the two
> servers (the services move from one machine to the other without
> problems), but if I reboot one server, it cannot go online in the
> cluster when it comes back up.
> I also noticed that there are several corosync threads; if I
> kill all of them and then start corosync again, everything works
> fine again.
>
> I don't know what is happening, and I'm not able to reproduce the
> same situation on virtual servers!
>
> Thanks,
> Giovanni
>
>
>
> The corosync configuration is the following:
>
> ##############################################
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
>
> aisexec {
>        # Run as root - this is necessary to be able to manage resources with Pacemaker
>        user:   root
>        group:  root
> }
>
> service {
>        # Load the Pacemaker Cluster Resource Manager
>        ver:       0
>        name:      pacemaker
>        use_mgmtd: yes
>        use_logd:  yes
> }
>
> totem {
>        version: 2
>
>        # How long before declaring a token lost (ms)
>        token:          5000
>
>        # How many token retransmits before forming a new configuration
>        token_retransmits_before_loss_const: 10
>
>        # How long to wait for join messages in the membership protocol (ms)
>        join:           1000
>
>        # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
>        consensus:      2500
>
>        # Turn off the virtual synchrony filter
>        vsftype:        none
>
>        # Number of messages that may be sent by one processor on receipt of the token
>        max_messages:   20
>
>        # Stagger sending the node join messages by 1..send_join ms
>        send_join: 45
>
>        # Limit generated nodeids to 31-bits (positive signed integers)
>        clear_node_high_bit: yes
>
>        # Disable encryption
>        secauth:        off
>
>        # How many threads to use for encryption/decryption
>        threads:        0
>
>        # Optionally assign a fixed node id (integer)
>        # nodeid:         1234
>
>        interface {
>                ringnumber: 0
>
>                # The following values need to be set based on your environment
>                bindnetaddr: XXX.XXX.XXX.0 # here I put the right IP for my configuration
>                mcastaddr: 226.94.1.1
>                mcastport: 4000
>        }
> }
>
> logging {
>        fileline: off
>        to_stderr: yes
>        to_logfile: yes
>        to_syslog: yes
>        logfile: /tmp/corosync.log
>        debug: off
>        timestamp: on
>        logger_subsys {
>                subsys: AMF
>                debug: off
>        }
> }
>
> amf {
>        mode: disabled
> }
>
> ##################################################
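
On the bindnetaddr line in the configuration above: per the corosync.conf.5 manual page, the value is the network address of the interface, that is, the interface IP ANDed with its netmask. The following is a minimal sketch of that calculation. The helper name network_address is hypothetical, and the addresses in the usage note are the manual page's examples, not the poster's real values.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <sys/socket.h>

/* Compute the IPv4 network address (ip AND netmask) as a dotted quad.
   Returns 0 on success, -1 if either input fails to parse or out is too small. */
int network_address(const char *ip, const char *netmask, char *out, size_t outlen)
{
    struct in_addr addr, mask, net;

    if (inet_pton(AF_INET, ip, &addr) != 1)
        return -1;
    if (inet_pton(AF_INET, netmask, &mask) != 1)
        return -1;

    /* Both values are in network byte order; a bitwise AND does not
       depend on byte order, so no conversion is needed. */
    net.s_addr = addr.s_addr & mask.s_addr;

    return inet_ntop(AF_INET, &net, out, (socklen_t)outlen) != NULL ? 0 : -1;
}
```

For an interface at 192.168.5.92 with netmask 255.255.255.0 this yields 192.168.5.0, and with netmask 255.255.255.192 it yields 192.168.5.64. The same rule applies whether the address is private or public.
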
>
>
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
>
>
> -- 
> Dream with longterm vision!
> kerdosa
