[Pacemaker] Multiple thread after rebooting server: the node doesn't go online

hj lee kerdosa at gmail.com
Fri Nov 13 13:36:58 EST 2009


Hi,

I have the same problem on CentOS 5.3 with pacemaker-1.0.5 and
openais-0.80.5. This is an openais bug, and there are actually two problems:
1. Starting the openais service sometimes segfaults. This is more likely to
happen if the openais service is started before syslog.
2. The segfault handler in openais calls syslog(). syslog() is one of the
UNSAFE functions that must not be called from a signal handler, because it
is a non-reentrant function.

To fix this issue: get the openais source, find the sigsegv_handler function
in exec/main.c and comment out the log_flush() call, as shown below. Then
recompile and install it (make and make install). log_flush() should be
removed from all signal handlers in the openais code base. I am still not
sure where the seg fault occurs, but commenting out log_flush() prevents the
seg fault.


-------------------------------------------------------------------------
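static void sigsegv_handler (int num)
{
        /* Restore the default action so the re-raised signal ends the process */
        signal (SIGSEGV, SIG_DFL);
        /* Disabled: log_flush() leads to syslog(), which is unsafe in a handler */
//      log_flush ();
        raise (SIGSEGV);
}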

Thanks
hj

On Thu, Nov 12, 2009 at 4:21 PM, Giovanni Di Milia <gdimilia at cfa.harvard.edu> wrote:

> I set up a cluster of two CentOS 5.4 x86_64 servers with pacemaker 1.06 and
> corosync 1.1.2.
>
> I only installed the x86_64 packages (yum install pacemaker tries to install
> the 32-bit ones as well).
>
> I configured a shared cluster IP (it's a public ip) and a cluster website.
>
> Everything works fine if I stop corosync on one of the two servers
> (the services move from one machine to the other without problems), but if I
> reboot one server, it cannot go online in the cluster when it comes back up.
> I also noticed that there are several corosync threads, and if I kill all
> of them and then start corosync again, everything works fine again.
>
> I don't know what is happening and I'm not able to reproduce the same
> situation on some virtual servers!
>
> Thanks,
> Giovanni
>
>
>
> the configuration of corosync is the following:
>
> ##############################################
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
>
> aisexec {
>        # Run as root - this is necessary to be able to manage resources with Pacemaker
>        user:   root
>        group:  root
> }
>
> service {
>        # Load the Pacemaker Cluster Resource Manager
>        ver:       0
>        name:      pacemaker
>        use_mgmtd: yes
>        use_logd:  yes
> }
>
> totem {
>        version: 2
>
>        # How long before declaring a token lost (ms)
>        token:          5000
>
>        # How many token retransmits before forming a new configuration
>        token_retransmits_before_loss_const: 10
>
>        # How long to wait for join messages in the membership protocol (ms)
>        join:           1000
>
>        # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
>        consensus:      2500
>
>        # Turn off the virtual synchrony filter
>        vsftype:        none
>
>        # Number of messages that may be sent by one processor on receipt of the token
>        max_messages:   20
>
>        # Stagger sending the node join messages by 1..send_join ms
>        send_join: 45
>
>        # Limit generated nodeids to 31-bits (positive signed integers)
>        clear_node_high_bit: yes
>
>        # Disable encryption
>        secauth:        off
>
>        # How many threads to use for encryption/decryption
>        threads:        0
>
>        # Optionally assign a fixed node id (integer)
>        # nodeid:         1234
>
>        interface {
>                ringnumber: 0
>
>                # The following values need to be set based on your environment
>                bindnetaddr: XXX.XXX.XXX.0 # here I put the right IP for my configuration
>                mcastaddr: 226.94.1.1
>                mcastport: 4000
>        }
> }
>
> logging {
>        fileline: off
>        to_stderr: yes
>        to_logfile: yes
>        to_syslog: yes
>        logfile: /tmp/corosync.log
>        debug: off
>        timestamp: on
>        logger_subsys {
>                subsys: AMF
>                debug: off
>        }
> }
>
> amf {
>        mode: disabled
> }
>
> ##################################################
>
>
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>



-- 
Dream with longterm vision!
kerdosa