[Pacemaker] Configuration recommandations for (very?) large cluster

Thu Aug 14 03:35:34 EDT 2014

On 14 Aug 2014, at 3:28 pm, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:

> 14.08.2014 05:24, Andrew Beekhof wrote:
>> 
>> On 14 Aug 2014, at 12:05 am, Lars Ellenberg <Lars.Ellenberg at linbit.com> wrote:
>> 
>>> On Wed, Aug 13, 2014 at 10:33:55AM +1000, Andrew Beekhof wrote:
>>>> On 13 Aug 2014, at 2:02 am, Cédric Dufour - Idiap Research Institute <cedric.dufour at idiap.ch> wrote:
>>>>> On 12/08/14 07:52, Andrew Beekhof wrote:
>>>>>> On 11 Aug 2014, at 10:10 pm, Cédric Dufour - Idiap Research Institute <cedric.dufour at idiap.ch> wrote:
>>> 
>>> ...
>>> 
>>>>> While I still had the ~450 resources, I also "accidentally" brought
>>>>> all 22 nodes back to life together (well, actually started the DC
>>>>> alone and then started the remaining 21 nodes together). As could be
>>>>> expected, the DC got quite busy (dispatching/executing the ~450*22
>>>>> monitoring operations on all nodes). It took 40 minutes for the
>>>>> cluster to stabilize. But it did stabilize, with no timeout and not
>>>>> monitor operations failure! A few "high CIB load detected / throttle
>>>>> down mode" messages popped up but all went well.
>>> 
>>> Cool.
>>> 
>>>> Thats about 0.12s per operation, not too bad.
>>>> More importantly, I'm glad to hear that real-world clusters are seeing
>>>> the same kind of improvements as those in the lab.
>>>> 
>>>> It would be interesting to know how the 40 minutes compares to bringing one node online at a time.
>>>> 
>>>>> 
>>>>> Q: Is there a way to favorize more powerful nodes for the DC (iow. push the DC "election" process in a preferred direction) ?
>>>> 
>>>> Only by starting it first and ensuring it doesn't die (we prfioritize the node with the largest crmd process uptime).
>>> 
>>> Uhm, there was a patch once for pacemaker-1.0.
>>> The latest version I found right now is below.
>>> Written by Klaus Wenninger, iirc.
>> 
>> The new CIB code actually reduces the case for a patch like this - since all updates are performed on all hosts.
>> So the workload from the CIB should be pretty much identical on all nodes.
>> 
>> The load from the crmd is mostly from talking to the lrmd, which is dependant on resource placement rather than being (or not being) the DC.
> 
> I've seen the different picture with 1024 unique clone instances. crmd's
> CPU load on DC is much higher for that case during probe/start/stop.

Thats with the new CIB code?

> 
>> 
>> About the only reason would be to make the pengine go faster - and I'm not completely convinced that is a sufficient justification.
> 
> In one cluster I have heterogeneous HW - some nodes are Xeons, some are
> Atoms. I'd prefer to have a way to perform pe calculations on the former
> ones.
> 
>> 
>>> 
>>> The idea was to communicate via environment
>>> a "HA_dc_prio" value, with meanings:
>>> 
>>> unset => use default of 1
>>> <= -1: Node does not become DC (does not vote)
>> 
>> The "< 0" behaviour sounds dangerous.
>> That would preclude a DC from being elected in a partition where all nodes had this value.
>> 
>>> ==  0: Node may only become DC if no node with >= 1 is available.
>>>       It also will trigger an election whenever a node joins.
>>> (==  1: default)
>>>> =  1: "classic pacemaker behavior",
>>> 	 but changed so positive prio will be checked first,
>>>        and higher positive prio will win
>>> 
>>> It may still apply (with white space changes) to current pacemaker 1.0.
>>> 
>>> It will need some more adjustments for pacemaker 1.1,
>>> but a quick browse through the code suggests it won't be too much work.
>>> 
>>> 	Lars
>>> 
>>> 
>>> --- crmd/election.c.orig	2011-11-28 16:24:54.345431668 +0100
>>> +++ crmd/election.c	2011-11-28 16:39:18.008420543 +0100
>>> @@ -33,6 +33,7 @@
>>> GHashTable *voted = NULL;
>>> uint highest_born_on = -1;
>>> static int current_election_id = 1;
>>> +static int our_dc_prio = INT_MIN; /* INT_MIN/<0/==0/>0 not_set/not_voting/retrigger_election/default_behaviour_plus_prio */
>>> 
>>> static int
>>> crm_uptime(struct timeval *output)
>>> @@ -107,6 +108,20 @@
>>>            break;
>>>    }
>>> 
>>> +    if (our_dc_prio == INT_MIN) {
>>> +        char * dc_prio_str = getenv("HA_dc_prio");
>>> +
>>> +        if (dc_prio_str == NULL) {
>>> +            our_dc_prio = 1;
>>> +        } else {
>>> +            our_dc_prio = atoi(dc_prio_str);
>>> +        }
>>> +    }
>>> +
>>> +    if (our_dc_prio < 0) {
>>> +        not_voting = TRUE;
>>> +    }
>>> +
>>>    if (not_voting == FALSE) {
>>>        if (is_set(fsa_input_register, R_STARTING)) {
>>>            not_voting = TRUE;
>>> @@ -123,6 +138,7 @@
>>>    current_election_id++;
>>>    crm_xml_add(vote, F_CRM_ELECTION_OWNER, fsa_our_uuid);
>>>    crm_xml_add_int(vote, F_CRM_ELECTION_ID, current_election_id);
>>> +    crm_xml_add_int(vote, F_CRM_DC_PRIO, our_dc_prio);
>>> 
>>>    crm_uptime(&age);
>>>    crm_xml_add_int(vote, F_CRM_ELECTION_AGE_S, age.tv_sec);
>>> @@ -241,8 +258,9 @@
>>> {
>>>    struct timeval your_age;
>>>    int age;
>>>    int election_id = -1;
>>> +    int your_dc_prio = 1;
>>>    int log_level = LOG_INFO;
>>>    gboolean use_born_on = FALSE;
>>>    gboolean done = FALSE;
>>>    gboolean we_loose = FALSE;
>>> @@ -273,6 +291,18 @@
>>>    your_version = crm_element_value(vote->msg, F_CRM_VERSION);
>>>    election_owner = crm_element_value(vote->msg, F_CRM_ELECTION_OWNER);
>>>    crm_element_value_int(vote->msg, F_CRM_ELECTION_ID, &election_id);
>>> +    crm_element_value_int(vote->msg, F_CRM_DC_PRIO, &your_dc_prio);
>>> +
>>> +    if (our_dc_prio == INT_MIN) {
>>> +        char * dc_prio_str = getenv("HA_dc_prio");
>>> +
>>> +        if (dc_prio_str == NULL) {
>>> +            our_dc_prio = 1;
>>> +        } else {
>>> +            our_dc_prio = atoi(dc_prio_str);
>>> +        }
>>> +    }
>>> +
>>>    crm_element_value_int(vote->msg, F_CRM_ELECTION_AGE_S, (int *)&(your_age.tv_sec));
>>>    crm_element_value_int(vote->msg, F_CRM_ELECTION_AGE_US, (int *)&(your_age.tv_usec));
>>> 
>>> @@ -334,6 +364,13 @@
>>>        reason = "Recorded";
>>>        done = TRUE;
>>> 
>>> +    } else if(our_dc_prio < your_dc_prio) {
>>> +        reason = "DC Prio";
>>> +        we_loose = TRUE;
>>> +
>>> +    } else if(our_dc_prio > your_dc_prio) {
>>> +        reason = "DC Prio";
>>> +
>>>    } else if (compare_version(your_version, CRM_FEATURE_SET) < 0) {
>>>        reason = "Version";
>>>        we_loose = TRUE;
>>> @@ -400,6 +437,7 @@
>>> 
>>>        crm_xml_add(novote, F_CRM_ELECTION_OWNER, election_owner);
>>>        crm_xml_add_int(novote, F_CRM_ELECTION_ID, election_id);
>>> +	crm_xml_add_int(novote, F_CRM_DC_PRIO, 0); /* rather don't advertise a negative value */
>>> 
>>>        send_cluster_message(vote_from, crm_msg_crmd, novote, TRUE);
>>>        free_xml(novote);
>>> --- include/crm/msg_xml.h.orig	2011-11-28 16:41:47.309414327 +0100
>>> +++ include/crm/msg_xml.h	2011-11-28 16:42:23.921417584 +0100
>>> @@ -33,6 +33,7 @@
>>> #  define F_CRM_USER			"crm_user"
>>> #  define F_CRM_JOIN_ID			"join_id"
>>> #  define F_CRM_ELECTION_ID		"election-id"
>>> +#  define F_CRM_DC_PRIO                 "dc-prio"
>>> #  define F_CRM_ELECTION_AGE_S		"election-age-sec"
>>> #  define F_CRM_ELECTION_AGE_US		"election-age-nano-sec"
>>> #  define F_CRM_ELECTION_OWNER		"election-owner"
>>> --- lib/ais/plugin.c.orig	2011-11-28 16:42:57.002411543 +0100
>>> +++ lib/ais/plugin.c	2011-11-28 16:44:22.160413844 +0100
>>> @@ -409,6 +409,9 @@
>>>    get_config_opt(pcmk_api, local_handle, "use_logd", &value, "no");
>>>    pcmk_env.use_logd = value;
>>> 
>>> +    get_config_opt(pcmk_api, local_handle, "dc_prio", &value, "1");
>>> +    pcmk_env.dc_prio = value;
>>> +
>>>    get_config_opt(pcmk_api, local_handle, "use_mgmtd", &value, "no");
>>>    if (ais_get_boolean(value) == FALSE) {
>>>        int lpc = 0;
>>> @@ -599,6 +602,7 @@
>>>    pcmk_env.logfile = NULL;
>>>    pcmk_env.use_logd = "false";
>>>    pcmk_env.syslog = "daemon";
>>> +    pcmk_env.dc_prio = "1";
>>> 
>>>    if (cs_uid != root_uid) {
>>>        ais_err("Corosync must be configured to start as 'root',"
>>> --- lib/ais/utils.c.orig	2011-11-28 16:45:01.940415754 +0100
>>> +++ lib/ais/utils.c	2011-11-28 16:45:33.018412117 +0100
>>> @@ -237,6 +237,7 @@
>>> 	setenv("HA_logfacility",	pcmk_env.syslog,   1);
>>> 	setenv("HA_LOGFACILITY",	pcmk_env.syslog,   1);
>>> 	setenv("HA_use_logd",		pcmk_env.use_logd, 1);
>>> +        setenv("HA_dc_prio", pcmk_env.dc_prio, 1);
>>> 	setenv("HA_quorum_type",	pcmk_env.quorum,   1);
>>> /* *INDENT-ON* */
>>> 
>>> --- lib/ais/utils.h.orig	2011-11-28 16:45:45.143412597 +0100
>>> +++ lib/ais/utils.h	2011-11-28 16:46:37.026410208 +0100
>>> @@ -238,6 +238,7 @@
>>>    const char *syslog;
>>>    const char *logfile;
>>>    const char *use_logd;
>>> +    const char *dc_prio;
>>>    const char *quorum;
>>> };
>>> 
>>> --- crmd/messages.c.orig	2012-05-25 16:23:22.913106180 +0200
>>> +++ crmd/messages.c	2012-05-25 16:28:30.330263392 +0200
>>> @@ -36,6 +36,8 @@
>>> #include <crmd_messages.h>
>>> #include <crmd_lrm.h>
>>> 
>>> +static int our_dc_prio = INT_MIN;
>>> +
>>> GListPtr fsa_message_queue = NULL;
>>> extern void crm_shutdown(int nsig);
>>> 
>>> @@ -693,7 +695,19 @@
>>>    /*========== DC-Only Actions ==========*/
>>>    if (AM_I_DC) {
>>>        if (strcmp(op, CRM_OP_JOIN_ANNOUNCE) == 0) {
>>> -            return I_NODE_JOIN;
>>> +	    if (our_dc_prio == INT_MIN) {
>>> +        	char * dc_prio_str = getenv("HA_dc_prio");
>>> +
>>> +        	if (dc_prio_str == NULL) {
>>> +		    our_dc_prio = 1;
>>> +        	} else {
>>> +            	    our_dc_prio = atoi(dc_prio_str);
>>> +        	}
>>> +    	    }		
>>> +	    if (our_dc_prio == 0)
>>> +		return I_ELECTION;	
>>> +            else 
>>> +		return I_NODE_JOIN;
>>> 
>>>        } else if (strcmp(op, CRM_OP_JOIN_REQUEST) == 0) {
>>>            return I_JOIN_REQUEST;
>>> 
>>> 
>>> -- 
>>> : Lars Ellenberg
>>> : LINBIT | Your Way to High Availability
>>> : DRBD/HA support and consulting http://www.linbit.com
>>> 
>>> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>>> 
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>> 
>> 
>> 
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>> 
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 841 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20140814/ab710a07/attachment-0003.sig>