[Pacemaker] Node remains offline (was Node remains online)

Bart Coninckx bart.coninckx at telenet.be
Thu Mar 10 20:10:47 UTC 2011


Hi all,

I have a three-node cluster, and the third node, which I am currently
adding, remains offline no matter what I do. Another symptom is that
stopping openais on that node takes forever, because it keeps waiting
for crmd to unload.

The log file, however, shows this node (xen3) as online:

Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_ipc: Recorded connection 0x6987c0 for attrd/10120
Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_ipc: Recorded connection 0x69cb20 for cib/10118
Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_ipc: Sending membership update 4100 to cib
Mar 10 20:55:26 corosync [CLM   ] CLM CONFIGURATION CHANGE
Mar 10 20:55:26 corosync [CLM   ] New Configuration:
Mar 10 20:55:26 corosync [CLM   ]       r(0) ip(10.0.1.13) r(1) ip(10.0.2.13)
Mar 10 20:55:26 corosync [CLM   ] Members Left:
Mar 10 20:55:26 corosync [CLM   ] Members Joined:
Mar 10 20:55:26 corosync [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 4104: memb=1, new=0, lost=0
Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_peer_update: memb: xen3 218169354
Mar 10 20:55:26 corosync [CLM   ] CLM CONFIGURATION CHANGE
Mar 10 20:55:26 corosync [CLM   ] New Configuration:
Mar 10 20:55:26 corosync [CLM   ]       r(0) ip(10.0.1.11) r(1) ip(10.0.2.11)
Mar 10 20:55:26 corosync [CLM   ]       r(0) ip(10.0.1.12) r(1) ip(10.0.2.12)
Mar 10 20:55:26 corosync [CLM   ]       r(0) ip(10.0.1.13) r(1) ip(10.0.2.13)
Mar 10 20:55:26 corosync [CLM   ] Members Left:
Mar 10 20:55:26 corosync [CLM   ] Members Joined:
Mar 10 20:55:26 corosync [CLM   ]       r(0) ip(10.0.1.11) r(1) ip(10.0.2.11)
Mar 10 20:55:26 corosync [CLM   ]       r(0) ip(10.0.1.12) r(1) ip(10.0.2.12)
Mar 10 20:55:26 corosync [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 4104: memb=3, new=2, lost=0
Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Creating entry for node 184614922 born on 4104
Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Node 184614922/unknown is now: member
Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_peer_update: NEW: .pending. 184614922
Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Creating entry for node 201392138 born on 4104
Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Node 201392138/unknown is now: member
Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_peer_update: NEW: .pending. 201392138
Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_peer_update: MEMB: .pending. 184614922
Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_peer_update: MEMB: .pending. 201392138
Mar 10 20:55:26 corosync [pcmk  ] info: pcmk_peer_update: MEMB: xen3 218169354
Mar 10 20:55:26 corosync [pcmk  ] info: send_member_notification: Sending membership update 4104 to 1 children
Mar 10 20:55:26 corosync [pcmk  ] info: update_member: 0x7f4268000c80 Node 218169354 ((null)) born on: 4104
Mar 10 20:55:26 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Mar 10 20:55:26 corosync [pcmk  ] info: update_member: 0x7f4268001120 Node 201392138 (xen2) born on: 3800
Mar 10 20:55:26 corosync [pcmk  ] info: update_member: 0x7f4268001120 Node 201392138 now known as xen2 (was: (null))
Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Node xen2 now has process list: 00000000000000000000000000151312 (1381138)
Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Node xen2 now has 1 quorum votes (was 0)
Mar 10 20:55:26 corosync [pcmk  ] info: send_member_notification: Sending membership update 4104 to 1 children
Mar 10 20:55:26 corosync [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
Mar 10 20:55:26 xen3 cib: [10118]: notice: ais_dispatch_message: Membership 4104: quorum acquired
Mar 10 20:55:26 corosync [pcmk  ] info: update_member: 0x7f4268000aa0 Node 184614922 (xen1) born on: 3792
Mar 10 20:55:26 corosync [pcmk  ] info: update_member: 0x7f4268000aa0 Node 184614922 now known as xen1 (was: (null))
Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Node xen1 now has process list: 00000000000000000000000000151312 (1381138)
Mar 10 20:55:26 corosync [pcmk  ] info: update_member: Node xen1 now has 1 quorum votes (was 0)
Mar 10 20:55:26 corosync [pcmk  ] info: update_expected_votes: Expected quorum votes 2 -> 3
Mar 10 20:55:26 corosync [pcmk  ] info: send_member_notification: Sending membership update 4104 to 1 children
Mar 10 20:55:26 corosync [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
Mar 10 20:55:26 corosync [TOTEM ] Marking ringid 1 interface 10.0.2.13 FAULTY - adminisrtative intervention required.
Mar 10 20:55:26 corosync [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
Mar 10 20:55:26 xen3 cib: [10118]: WARN: cib_diff_notify: Local-only Change (client:crmd, call: 1742): -1.-1.-1 (Application of an update diff failed, requesting a full refresh)
Mar 10 20:55:27 corosync [pcmk  ] info: pcmk_ipc: Recorded connection 0x7f4268002040 for crmd/10122
Mar 10 20:55:27 corosync [pcmk  ] info: pcmk_ipc: Sending membership update 4104 to crmd
Mar 10 20:55:27 xen3 crmd: [10122]: notice: ais_dispatch_message: Membership 4104: quorum acquired
Mar 10 20:55:27 xen3 crmd: [10122]: notice: crmd_peer_update: Status update: Client xen3/crmd now has status [online] (DC=<null>)
Mar 10 20:55:27 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Mar 10 20:55:27 xen3 cib: [10118]: WARN: cib_server_process_diff: Not applying diff 0.1672.12 -> 0.1672.13 (sync in progress)
Mar 10 20:55:27 xen3 mgmtd: [10123]: debug: main: run the loop...
Mar 10 20:55:27 xen3 mgmtd: [10123]: info: Started.
Mar 10 20:55:27 xen3 lrmd: [10119]: info: setting max-children to 4
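
For anyone trying to match the numeric node IDs in the log to hosts: they
appear to be the ring 0 IPv4 addresses stored as 32-bit integers in
little-endian byte order. That is my reading of the numbers, not something
the log states. A minimal Python sketch under that assumption (the function
name nodeid_to_ip is made up for this example):

```python
import socket
import struct


def nodeid_to_ip(nodeid):
    """Read a numeric node ID as an IPv4 address packed little-endian.

    Assumption: the ID was auto-generated from the ring 0 address.
    """
    return socket.inet_ntoa(struct.pack("<I", nodeid))


# Node IDs copied from the log above.
for nodeid in (184614922, 201392138, 218169354):
    print(nodeid, nodeid_to_ip(nodeid))
```

With the three IDs from the log this gives 10.0.1.11, 10.0.1.12 and
10.0.1.13, which agrees with the names xen1, xen2 and xen3 that the log
assigns to them.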


Nevertheless, ps afx shows all relevant processes in a normal state:

10111 ?        Ssl    0:00 /usr/sbin/corosync
10117 ?        S      0:00  \_ /usr/lib64/heartbeat/stonithd
10118 ?        S      0:00  \_ /usr/lib64/heartbeat/cib
10119 ?        S      0:00  \_ /usr/lib64/heartbeat/lrmd
10120 ?        S      0:00  \_ /usr/lib64/heartbeat/attrd
10121 ?        S      0:00  \_ /usr/lib64/heartbeat/pengine
10122 ?        S      0:00  \_ /usr/lib64/heartbeat/crmd
10123 ?        S      0:00  \_ /usr/lib64/heartbeat/mgmtd


I tried to remove the node with crm_node -R= to no avail.

The versions in use are:

corosync-1.2.6-0.2.2
openais-1.1.3-0.2.3
pacemaker-1.1.2-0.7.1

corosync.conf looks like this:

aisexec {
        group:  root
        user:   root
}
service {
        use_mgmtd:      yes
        ver:    0
        name:   pacemaker
}
totem {
        rrp_mode:       passive
        token_retransmits_before_loss_const:    10
        join:   1000
        max_messages:   20
        vsftype:        none
        token:  5000
        consensus:      7500
        secauth:        off
        version:        2

        interface {
                bindnetaddr:    10.0.1.0
                mcastaddr:      226.94.1.1
                mcastport:      5405
                ringnumber:     0

        }
        interface {
                bindnetaddr:    10.0.2.0
                mcastaddr:      226.84.2.1
                mcastport:      5406
                ringnumber:     1
        }
        clear_node_high_bit:    yes
}
logging {
        to_logfile:     yes
        logfile:        /var/log/ha-log
        timestamp:      on
        syslog_facility:        daemon
        to_syslog:      no
        debug:  on
        to_stderr:      yes
        fileline:       off

}
amf {
        mode:   disable
}
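
Since ring 1 is the one marked FAULTY, it may be worth comparing the
interface blocks of corosync.conf on all three nodes. A minimal Python
sketch that pulls out those blocks so they can be compared side by side; the
function name totem_interfaces is hypothetical, and the parser only handles
the flat "key: value" layout shown above, not the full corosync.conf syntax:

```python
import re

# The two interface blocks from the configuration above.
SAMPLE = """
totem {
        interface {
                bindnetaddr:    10.0.1.0
                mcastaddr:      226.94.1.1
                mcastport:      5405
                ringnumber:     0
        }
        interface {
                bindnetaddr:    10.0.2.0
                mcastaddr:      226.84.2.1
                mcastport:      5406
                ringnumber:     1
        }
}
"""


def totem_interfaces(conf_text):
    """Return one dict of key/value pairs per `interface { ... }` block."""
    blocks = []
    for body in re.findall(r"interface\s*\{(.*?)\}", conf_text, flags=re.DOTALL):
        block = {}
        for line in body.splitlines():
            key, sep, value = line.partition(":")
            if sep:  # skip blank lines and anything without a colon
                block[key.strip()] = value.strip()
        blocks.append(block)
    return blocks


interfaces = totem_interfaces(SAMPLE)
for iface in interfaces:
    print(iface["ringnumber"], iface["bindnetaddr"],
          iface["mcastaddr"], iface["mcastport"])
```

Running the same function on the file from each node and comparing the
printed lines would show whether bindnetaddr, mcastaddr and mcastport match
per ring on every host.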


Does anyone have any suggestions on how to proceed?

Thank you!!

B.




