[Pacemaker] lenny + clvm + pacemaker/openais...

Dejan Muhamedagic dejanmm at fastmail.fm
Thu May 28 13:43:49 EDT 2009


Hi,

On Thu, May 28, 2009 at 05:19:34PM +0000, Alain St-Denis wrote:
> 
> Andrew Beekhof wrote:
> > You might want to check out Martin's packages.
> > If I understood correctly, he's built the version of clvm used by SUSE
> > (which we know works) against 0.80.5
> >
> > Look for his email with the subject "lvm2-clvm RPMs in opensuse.org
> > package repo?"
> 
> Thanks!
> 
> I installed Martin's packages. Here's what I have:
> 
> pacemaker-openais                       1.0.3+svn20090522-2~bpo50+1
> clvm-openais                            2.02.44-4~bpo50+1
> libopenais-legacy-2                     0.80.5+svn20090522-2~bpo50+1
> openais-legacy                          0.80.5+svn20090522-2~bpo50+1
> heartbeat-common                        2.99.2+sles11r9-3~bpo50+1
> libheartbeat2                           2.99.2+sles11r9-3~bpo50+1
> 
> Now, soon after I start clvmd, aisexec dies with a segv (in
> openais_conn_private_data_get). On my 3-node test cluster, I start openais
> on all nodes, then I start clvmd on one of the nodes. Not long after, aisexec
> dies on the other nodes. Here are the last messages logged by aisexec:
> 
> May 28 16:19:04.924914 [TOTEM] entering GATHER state from 11.
> May 28 16:19:05.079052 [TOTEM] Saving state aru 20 high seq received 20
> May 28 16:19:05.079094 [TOTEM] Storing new sequence id for ring 298
> May 28 16:19:05.079155 [TOTEM] entering COMMIT state.
> May 28 16:19:05.079500 [TOTEM] entering RECOVERY state.
> May 28 16:19:05.079558 [TOTEM] position [0] member 142.135.16.107:
> May 28 16:19:05.079571 [TOTEM] previous ring seq 660 rep 142.135.16.107
> May 28 16:19:05.079578 [TOTEM] aru a high delivered a received flag 1
> May 28 16:19:05.079587 [TOTEM] position [1] member 142.135.16.109:
> May 28 16:19:05.079594 [TOTEM] previous ring seq 660 rep 142.135.16.109
> May 28 16:19:05.079612 [TOTEM] aru 20 high delivered 20 received flag 1
> May 28 16:19:05.079627 [TOTEM] Did not need to originate any messages in recovery.
> May 28 16:19:05.080669 [CLM  ] CLM CONFIGURATION CHANGE
> May 28 16:19:05.080711 [CLM  ] New Configuration:
> May 28 16:19:05.080724 [CLM  ]  r(0) ip(142.135.16.109)
> May 28 16:19:05.080733 [CLM  ] Members Left:
> May 28 16:19:05.080774 [CLM  ] Members Joined:
> May 28 16:19:05.080790 [crm  ] notice: pcmk_peer_update: Transitional membership event on ring 664: memb=1, new=0, lost=0
> May 28 16:19:05.080805 [crm  ] info: pcmk_peer_update: memb: lab09 1829799822
> May 28 16:19:05.080843 [CLM  ] CLM CONFIGURATION CHANGE
> May 28 16:19:05.080855 [CLM  ] New Configuration:
> May 28 16:19:05.080865 [CLM  ]  r(0) ip(142.135.16.107)
> May 28 16:19:05.080901 [CLM  ]  r(0) ip(142.135.16.109)
> May 28 16:19:05.080914 [CLM  ] Members Left:
> May 28 16:19:05.080923 [CLM  ] Members Joined:
> May 28 16:19:05.080938 [CLM  ]  r(0) ip(142.135.16.107)
> May 28 16:19:05.080972 [crm  ] notice: pcmk_peer_update: Stable membership event on ring 664: memb=2, new=1, lost=0
> May 28 16:19:05.080985 [MAIN ] info: update_member: Node 1796245390/lab07 is now: member
> May 28 16:19:05.081001 [crm  ] info: pcmk_peer_update: NEW:  lab07 1796245390
> May 28 16:19:05.081036 [crm  ] info: pcmk_peer_update: MEMB: lab07 1796245390
> May 28 16:19:05.081044 [crm  ] info: pcmk_peer_update: MEMB: lab09 1829799822
> May 28 16:19:05.081063 [crm  ] info: send_member_notification: Sending membership update 664 to 2 children
> May 28 16:19:05.081118 [SYNC ] This node is within the primary component and will provide service.
> May 28 16:19:05.081144 [TOTEM] entering OPERATIONAL state.
> May 28 16:19:05.082382 [MAIN ] info: update_member: 0x7f1188002510 Node 1796245390 (lab07) born on: 664
> May 28 16:19:05.082416 [crm  ] info: send_member_notification: Sending membership update 664 to 2 children
> May 28 16:19:05.082757 [CLM  ] got nodejoin message 142.135.16.107
> May 28 16:19:05.082832 [CLM  ] got nodejoin message 142.135.16.109
> May 28 16:19:05.087292 [CPG  ] got joinlist message from node 1829799822
> 
> Then it crashes. Martin (or anybody), have you seen this? I attached my 
> openais.conf file. Maybe I'm doing something stupid in there?
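
To see the membership changes leading up to the crash at a glance, the
pcmk_peer_update lines in the log above can be summarized with a short
script. This is only a sketch: the function name and the regular expression
are illustrative assumptions based on the message format quoted above, not
part of Pacemaker or openais, and it expects each log message on a single
line.

```python
import re

# Pattern derived from the two pcmk_peer_update messages quoted in this thread.
EVENT_RE = re.compile(
    r"pcmk_peer_update: (?P<kind>Transitional|Stable) membership event "
    r"on ring (?P<ring>\d+): memb=(?P<memb>\d+), new=(?P<new>\d+), "
    r"lost=(?P<lost>\d+)"
)


def membership_events(lines):
    """Return one dict per membership event found in the given log lines."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            events.append({
                "kind": match.group("kind"),
                "ring": int(match.group("ring")),
                "memb": int(match.group("memb")),
                "new": int(match.group("new")),
                "lost": int(match.group("lost")),
            })
    return events


# The two event lines from the log above, unwrapped.
SAMPLE = [
    "May 28 16:19:05.080790 [crm  ] notice: pcmk_peer_update: Transitional "
    "membership event on ring 664: memb=1, new=0, lost=0",
    "May 28 16:19:05.080972 [crm  ] notice: pcmk_peer_update: Stable "
    "membership event on ring 664: memb=2, new=1, lost=0",
]

if __name__ == "__main__":
    for event in membership_events(SAMPLE):
        print(event)
```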

You should file a bug against openais in bugzilla. Please use
hb_report: it collects all the relevant information, including
the stack traces (I hope a core was dumped).
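
Whether a core file can be written at all depends on the core size limit of
the crashing process. A minimal sketch, assuming Python is available on the
node, that reports the limit inherited from the current shell. The helper
name is hypothetical, and aisexec started from an init script may run with a
different limit, so treat the result as a hint only.

```python
import resource


def core_dumps_enabled():
    """Return True if the soft core-size limit of this process is non-zero.

    Note: this inspects the limit of the Python process itself (inherited
    from the shell that started it), not the limit of a running aisexec.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    # A soft limit of 0 means no core file is written; RLIM_INFINITY
    # (unlimited) is a non-zero value and therefore counts as enabled.
    return soft != 0


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    shown = "unlimited" if soft == resource.RLIM_INFINITY else soft
    print("soft core limit:", shown)
    print("core dumps enabled:", core_dumps_enabled())
```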

Thanks,

Dejan

> Alain
> 
> -- 
> Alain St-Denis
> Supercomputing, Systems and Storage / Superinformatique, systèmes et stockage,
> High Performance Computing Support / Soutien aux calculs en haute performance
> Chief Information Officer Branch / Direction Générale du dirigeant principal
> de l'information
> Environment Canada / Environnement Canada
> Tel: +1 514 421 4697

> # Please read the openais.conf.5 manual page
> 
> aisexec {
> 	# Run as root - this is necessary to be able to manage resources with Pacemaker
> 	user:	root
> 	group:	root
> }
> 
> service {
> 	# Load the Pacemaker Cluster Resource Manager
> 	name: pacemaker
> 	ver:  0
> }
> 
> totem {
> 	version: 2
> 
> 	# How long before declaring a token lost (ms)
> 	token:          10000
> 
> 	# How many token retransmits before forming a new configuration
> 	token_retransmits_before_loss_const: 20
> 
> 	# How long to wait for join messages in the membership protocol (ms)
> 	join:           60
> 
> 	# How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
> 	consensus:      4800
> 
> 	# Turn off the virtual synchrony filter
> 	vsftype:        none
> 
> 	# Number of messages that may be sent by one processor on receipt of the token
> 	max_messages:   20
> 
> 	# Limit generated nodeids to 31-bits (positive signed integers)
> 	clear_node_high_bit: yes
> 
> 	# Disable encryption
> 	secauth: off
> 
> 	# How many threads to use for encryption/decryption
> 	threads: 0
> 
> 	# Optionally assign a fixed node id (integer)
> 	# nodeid:         1234
> 
> 	interface {
> 		ringnumber: 0
> 
> 		# The following values need to be set based on your environment
> 		bindnetaddr: 142.135.16.0
> 		mcastaddr: 226.94.1.1
> 		mcastport: 5405
> 	}
> }
> 
> logging {
> 	debug: on
> 	fileline: off
> 	to_syslog: yes
> 	to_stderr: yes
> 	syslog_facility: daemon
> 	timestamp: on
> }
> 
> amf {
> 	mode: disabled
> }
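
When comparing configurations between nodes or attaching them to a bug
report, it can help to load the file into a plain data structure. The sketch
below is a hypothetical helper, not the parser openais itself uses; it
assumes only the syntax visible in the file above (name { ... } sections,
key: value pairs, and # comments) and keeps all values as strings.

```python
def parse_openais_conf(text):
    """Parse openais.conf-style text into nested dicts (illustrative only).

    Limitation: if a section name repeats (for example several interface
    blocks), later sections overwrite earlier ones.
    """
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.endswith("{"):
            section = {}
            stack[-1][line[:-1].strip()] = section
            stack.append(section)
        elif line == "}":
            if len(stack) > 1:  # ignore a stray closing brace
                stack.pop()
        else:
            key, _, value = line.partition(":")
            stack[-1][key.strip()] = value.strip()
    return root


# A fragment of the totem section quoted above.
SAMPLE = """
totem {
    version: 2
    # How long before declaring a token lost (ms)
    token:          10000
    consensus:      4800
    interface {
        ringnumber: 0
        bindnetaddr: 142.135.16.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}
"""

if __name__ == "__main__":
    conf = parse_openais_conf(SAMPLE)
    print(conf["totem"]["token"], conf["totem"]["interface"]["mcastaddr"])
```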

> _______________________________________________
> Pacemaker mailing list
> Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker




