[Pacemaker] unable to join cluster

Hisashi Osanai osanai.hisashi at jp.fujitsu.com
Thu Mar 22 00:07:59 EDT 2012


Hello,

I have a three-node cluster using pacemaker/corosync. When I reboot one node,
that node is sometimes unable to rejoin the cluster. I see this kind of split
brain about 10-20% of the time (reproduction rate) when I shut down a node.

What do you think of this problem?

My questions are:
- Is this a known problem?
- Is there any workaround to avoid it?
- How can I solve this problem?

[testserver001]
============
Last updated: Sat Mar 10 14:18:49 2012
Stack: openais
Current DC: NONE
3 Nodes configured, 3 expected votes
4 Resources configured.
============

OFFLINE: [ testserver001 testserver002 testserver003 ]


Migration summary:

[testserver002]
============
Last updated: Sat Mar 10 14:15:17 2012
Stack: openais
Current DC: testserver002 - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
3 Nodes configured, 3 expected votes
4 Resources configured.
============

Online: [ testserver002 testserver003 ]
OFFLINE: [ testserver001 ]

 Resource Group: testgroup
     testrsc     (lsb:testmgr):   Started testserver002
stonith-testserver002        (stonith:external/ipmi):        Started testserver003
stonith-testserver003        (stonith:external/ipmi):        Started testserver002
stonith-testserver001        (stonith:external/ipmi):        Started testserver003

Migration summary:
* Node testserver003:
* Node testserver002:

[testserver003]
============
Last updated: Sat Mar 10 14:19:07 2012
Stack: openais
Current DC: testserver002 - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
3 Nodes configured, 3 expected votes
4 Resources configured.
============

Online: [ testserver002 testserver003 ]
OFFLINE: [ testserver001 ]

 Resource Group: testgroup
     testrsc     (lsb:testmgr):   Started testserver002
stonith-testserver002        (stonith:external/ipmi):        Started testserver003
stonith-testserver003        (stonith:external/ipmi):        Started testserver002
stonith-testserver001        (stonith:external/ipmi):        Started testserver003

Migration summary:
* Node testserver003:
* Node testserver002:
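
For reference, the status output above is consistent with simple majority
quorum: the partition of testserver002 and testserver003 holds 2 of the 3
expected votes, while testserver001 alone holds 1. A minimal sketch of that
arithmetic follows. It assumes one vote per node and a strict-majority rule;
the function name has_quorum is only illustrative and is not a Pacemaker API.

```python
def has_quorum(votes_present: int, expected_votes: int) -> bool:
    """Return True when a partition holds a strict majority of the expected votes.

    Assumption: every node contributes exactly one vote.
    """
    return votes_present * 2 > expected_votes


# Partition sizes taken from the crm_mon output above (3 expected votes).
for partition, votes in (("testserver002+testserver003", 2), ("testserver001", 1)):
    print(partition, "has quorum:", has_quorum(votes, 3))
```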

- Information I checked
  + https://bugzilla.redhat.com/show_bug.cgi?id=525589
    It looks like the packages I am using already support this.
  + http://comments.gmane.org/gmane.linux.highavailability.user/36101
    I checked the entries in /etc/hosts but did not find a wrong entry.
    ===
    127.0.0.1 testserver001 localhost
    ::1             localhost6.localdomain6 localhost6
    ===
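
A check like the one above can also be scripted. The following is only a
sketch: it lists which names a hosts file assigns to loopback addresses, so
they can be compared by hand with the cluster node names. Whether such an
entry is relevant to this problem is an assumption, and the helper name
names_on_loopback is hypothetical.

```python
import ipaddress


def names_on_loopback(hosts_text: str) -> dict[str, list[str]]:
    """Map each loopback address found in hosts-file text to the names listed for it."""
    result: dict[str, list[str]] = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue
        address, *names = line.split()
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            continue  # skip lines that do not start with an IP address
        if ip.is_loopback:
            result.setdefault(address, []).extend(names)
    return result


# The two /etc/hosts lines quoted above.
sample = """\
127.0.0.1 testserver001 localhost
::1             localhost6.localdomain6 localhost6
"""
print(names_on_loopback(sample))
```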

- Investigation with tcpdump
  OK case: after MESSAGE_TYPE_ORF_TOKEN is received, pacemaker sends MESSAGE_TYPE_MCAST.
           I captured this in a VMware environment.
  
    + MESSAGE_TYPE_ORF_TOKEN
      No.     Time                       Source                Destination           Protocol Length Info
          119 2012-03-19 22:00:15.250310 172.27.4.1            172.27.4.2            UDP      112    Source port: 23489  Destination port: 23490

      Frame 119: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
      Ethernet II, Src: Vmware_6b:b9:9a (00:0c:29:6b:b9:9a), Dst: Vmware_8e:74:92 (00:0c:29:8e:74:92)
      Internet Protocol Version 4, Src: 172.27.4.1 (172.27.4.1), Dst: 172.27.4.2 (172.27.4.2)
      User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
      Data (70 bytes)

      0000  00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00   ..".............
      0010  00 00 00 00 00 00 00 00 ac 1b 04 01 02 00 ac 1b   ................
      (snip)

    + MESSAGE_TYPE_MCAST
      No.     Time                       Source                Destination           Protocol Length Info
         5141 2012-03-19 22:01:19.198346 172.27.4.2            226.94.16.16          UDP      1486   Source port: 23489  Destination port: 23490

      Frame 5141: 1486 bytes on wire (11888 bits), 1486 bytes captured (11888 bits)
      Ethernet II, Src: Vmware_8e:74:92 (00:0c:29:8e:74:92), Dst: IPv4mcast_5e:10:10 (01:00:5e:5e:10:10)
      Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 226.94.16.16 (226.94.16.16)
      User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
      Data (1444 bytes)

      0000  01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b   ..".............
      0010  04 02 08 00 02 00 ac 1b 04 02 08 00 04 00 ac 1b   ................
      (snip)

  NG case: MESSAGE_TYPE_ORF_TOKEN is sent and received repeatedly, and I can
           see the following messages in pacemaker.log.

    + MESSAGE_TYPE_ORF_TOKEN
      No.     Time                       Source                Destination           Protocol Length Info
         39605 2012-03-10 14:18:13.826778 172.27.4.2            172.27.4.3            UDP      112    Source port: 23489  Destination port: 23490

      Frame 39605: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
      Ethernet II, Src: FujitsuT_98:79:4b (00:19:99:98:79:4b), Dst: FujitsuT_97:8d:15 (00:19:99:97:8d:15)
      Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 172.27.4.3 (172.27.4.3)
      User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
      Data (70 bytes)

      0000  00 00 22 ff ac 1b 04 01 00 00 00 00 01 00 00 00   ..".............
      0010  ff ff ff ff ac 1b 04 01 ac 1b 04 01 02 00 ac 1b   ................
      (snip)

    + pacemaker.log
      Mar 10 14:20:09 testserver001 crmd: [7551]: info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped!
      Mar 10 14:20:09 testserver001 crmd: [7551]: WARN: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
      Mar 10 14:20:09 testserver001 crmd: [7551]: info: do_state_transition: State transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED origin=crm_timer_popped ]
      Mar 10 14:22:09 testserver001 crmd: [7551]: ERROR: crm_timer_popped: Election Timeout (I_ELECTION_DC) just popped!
      Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=crm_timer_popped ]
      Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_te_control: Registering TE UUID: b2bb3cc4-cead-475c-bb73-3adbb60142ae
      Mar 10 14:22:09 testserver001 crmd: [7551]: WARN: cib_client_add_notify_callback: Callback already present
      Mar 10 14:22:09 testserver001 crmd: [7551]: info: set_graph_functions: Setting custom graph functions
      Mar 10 14:22:09 testserver001 crmd: [7551]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
      Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_dc_takeover: Taking over DC status for this partition
      Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_readwrite: We are now in R/W mode
      Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/6, version=0.143.0): ok (rc=0)
      Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/7, version=0.143.0): ok (rc=0)
      Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/9, version=0.143.0): ok (rc=0)
      Mar 10 14:22:09 testserver001 crmd: [7551]: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
      Mar 10 14:22:09 testserver001 crmd: [7551]: info: ais_dispatch: Membership 516: quorum still lost
      Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/11, version=0.143.0): ok (rc=0)
      Mar 10 14:22:09 testserver001 crmd: [7551]: info: crm_ais_dispatch: Setting expected votes to 3
      Mar 10 14:22:09 testserver001 crmd: [7551]: info: config_query_callback: Checking for expired actions every 900000ms
      Mar 10 14:22:09 testserver001 crmd: [7551]: info: config_query_callback: Sending expected-votes=3 to corosync
      Mar 10 14:22:09 testserver001 crmd: [7551]: info: ais_dispatch: Membership 516: quorum still lost
      Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/14, version=0.143.0): ok (rc=0)
      Mar 10 14:22:09 testserver001 crmd: [7551]: info: crm_ais_dispatch: Setting expected votes to 3
      Mar 10 14:22:09 testserver001 crmd: [7551]: info: te_connect_stonith: Attempting connection to fencing daemon...
      Mar 10 14:22:09 testserver001 cib: [7547]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/16, version=0.143.0): ok (rc=0)
      Mar 10 14:22:10 testserver001 crmd: [7551]: info: te_connect_stonith: Connected

    + enum message_type {
              MESSAGE_TYPE_ORF_TOKEN = 0,         /* Ordering, Reliability, Flow (ORF) control Token */
              MESSAGE_TYPE_MCAST = 1,             /* ring ordered multicast message */
              MESSAGE_TYPE_MEMB_MERGE_DETECT = 2, /* merge rings if there are available rings */
              MESSAGE_TYPE_MEMB_JOIN = 3,         /* membership join message */
              MESSAGE_TYPE_MEMB_COMMIT_TOKEN = 4, /* membership commit token */
              MESSAGE_TYPE_TOKEN_HOLD_CANCEL = 5, /* cancel the holding of the token */
      };
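
The captures above were classified by hand against this enum. As a rough
sketch, the same mapping can be done in a few lines of Python. It assumes that
the first byte of the UDP payload carries the message type and that the
payload is not encrypted, which matches the dumps shown here (00 for the token
frames, 01 for the multicast frame) but is not verified beyond them. The names
MessageType and classify_payload are illustrative only.

```python
from enum import IntEnum


class MessageType(IntEnum):
    """Values copied from the enum message_type listing above."""
    ORF_TOKEN = 0
    MCAST = 1
    MEMB_MERGE_DETECT = 2
    MEMB_JOIN = 3
    MEMB_COMMIT_TOKEN = 4
    TOKEN_HOLD_CANCEL = 5


def classify_payload(hex_dump: str) -> MessageType:
    """Classify a payload given as space-separated hex bytes.

    Assumption: the message type is the first byte of the UDP payload.
    """
    payload = bytes.fromhex(hex_dump)
    return MessageType(payload[0])


# First 16 payload bytes of frames 119, 5141 and 39605 from the dumps above.
frames = {
    119: "00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00",
    5141: "01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b",
    39605: "00 00 22 ff ac 1b 04 01 00 00 00 00 01 00 00 00",
}
for number, dump in frames.items():
    print(number, classify_payload(dump).name)
```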

- packages on CentOS 5.6
  + pacemaker-1.0.10-1.4.el5
  + corosync-1.2.5-1.3.el5

Thank you in advance,
Hisashi Osanai
