[Pacemaker] The larger cluster is tested.

Andrew Beekhof andrew at beekhof.net
Mon Nov 18 19:01:00 EST 2013


On 16 Nov 2013, at 12:22 am, yusuke iida <yusk.iida at gmail.com> wrote:

> Hi, Andrew
> 
> Thanks for the various suggestions.
> 
> I fixed batch-limit at 1, 2, 3, and 4 from the start of each test run, in
> order to confirm which value is suitable.
> 
> The results in my environment were as follows.
> batch-limit=1 and 2: no timeouts.
> batch-limit=3: 1 timeout.
> batch-limit=4: 5 timeouts.
> 
> From the above results, I think the limit is still too high with
> "limit = QB_MAX (1, peers / 4)".

Remember these results are specific to your (virtual) hardware and configured timeouts.
I would argue that 5 timeouts out of 2853 actions is actually quite impressive for a default value in this sort of situation.[1]

Some tuning in a cluster of this kind is to be expected.

[1] It took crm_simulate 4 minutes to even pretend to perform all those operations.
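
For reference, here is a minimal sketch of the limit arithmetic under discussion, based only on the throttle.c hunk quoted further down. The function, enum, and macro names are hypothetical stand-ins (the real code uses QB_MAX from libqb and the crmd throttle enum), and the peer counts are illustrative only. The batch-limit values of 8 and 16 in the quoted logs would be consistent with 16 peers, but that count is an assumption.

```c
#include <assert.h>

/* Local stand-in for libqb's QB_MAX so the sketch compiles on its own. */
#define SKETCH_MAX(a, b) (((a) > (b)) ? (a) : (b))

/* Hypothetical names; the real enum in crmd/throttle.c has more states. */
enum sketch_mode { SKETCH_MODE_OTHER, SKETCH_MODE_HIGH, SKETCH_MODE_EXTREME };

/* Cap as computed by the lines the quoted patch removes ("-" lines).
 * limit == 0 is treated as "no batch-limit configured". */
int job_limit_current(enum sketch_mode mode, int limit, int peers)
{
    assert(peers >= 0);
    switch (mode) {
        case SKETCH_MODE_EXTREME:
            if (limit == 0 || limit > peers / 2) {
                limit = peers / 2;
            }
            break;
        case SKETCH_MODE_HIGH:
            if (limit == 0 || limit > peers) {
                limit = peers;
            }
            break;
        default:
            break;
    }
    return limit;
}

/* Cap as computed by the lines the quoted patch adds ("+" lines).
 * The MAX(1, ...) keeps the result from dropping to 0 on tiny clusters. */
int job_limit_proposed(enum sketch_mode mode, int limit, int peers)
{
    assert(peers >= 0);
    switch (mode) {
        case SKETCH_MODE_EXTREME:
            if (limit == 0 || limit > peers / 4) {
                limit = SKETCH_MAX(1, peers / 4);
            }
            break;
        case SKETCH_MODE_HIGH:
            if (limit == 0 || limit > peers / 2) {
                limit = SKETCH_MAX(1, peers / 2);
            }
            break;
        default:
            break;
    }
    return limit;
}
```

With an assumed 16 peers and no configured batch-limit, job_limit_current() gives 8 (extreme) and 16 (high), while job_limit_proposed() gives 4 and 8. A configured batch-limit that is already below the cap, such as 2, is left unchanged by both versions.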

> 
> So I have created a fix that pins batch-limit to 2 when the throttle
> state becomes extreme.
> https://github.com/yuusuke/pacemaker/commit/efe2d6ebc55be39b8be43de38e7662f039b61dec
> 
> I ran the test several times, and it seems to work without problems.
> 
> The reports from the tests with a fixed batch-limit are below.
> batch-limit=1
> https://drive.google.com/file/d/0BwMFJItoO-fVNk8wTGlYNjNnSHc/edit?usp=sharing
> batch-limit=2
> https://drive.google.com/file/d/0BwMFJItoO-fVTnc4bXY2YXF2M2M/edit?usp=sharing
> batch-limit=3
> https://drive.google.com/file/d/0BwMFJItoO-fVYl9Gbks2VlJMR0k/edit?usp=sharing
> batch-limit=4
> https://drive.google.com/file/d/0BwMFJItoO-fVZnJIazd5MFQ1aGs/edit?usp=sharing
> 
> The report from a run with my test code is the following.
> https://drive.google.com/file/d/0BwMFJItoO-fVbzB0NjFLeVY3Zmc/edit?usp=sharing
> 
> Regards,
> Yusuke
> 
> 2013/11/13 Andrew Beekhof <andrew at beekhof.net>:
>> Did you look at the load numbers in the logs?
>> The CPUs are being slammed for over 20 minutes.
>> 
>> The automatic tuning can only help so much, you're simply asking the cluster to do more work than it is capable of.
>> Giving more priority to cib operations that come via IPC is one option, but as I explained earlier, it comes at the cost of correctness.
>> 
>> Given the huge mismatch between the nodes' capacity and the tasks you're asking them to achieve, your best path forward is probably setting a load-threshold < 40% or a batch-limit <= 8.
>> Or we could try a patch like the one below if we think that the defaults are not aggressive enough.
>> 
>> diff --git a/crmd/throttle.c b/crmd/throttle.c
>> index d77195a..7636d4a 100644
>> --- a/crmd/throttle.c
>> +++ b/crmd/throttle.c
>> @@ -611,14 +611,14 @@ throttle_get_total_job_limit(int l)
>>         switch(r->mode) {
>> 
>>             case throttle_extreme:
>> -                if(limit == 0 || limit > peers/2) {
>> -                    limit = peers/2;
>> +                if(limit == 0 || limit > peers/4) {
>> +                    limit = QB_MAX(1, peers/4);
>>                 }
>>                 break;
>> 
>>             case throttle_high:
>> -                if(limit == 0 || limit > peers) {
>> -                    limit = peers;
>> +                if(limit == 0 || limit > peers/2) {
>> +                    limit = QB_MAX(1, peers/2);
>>                 }
>>                 break;
>>             default:
>> 
>> This may also be worthwhile:
>> 
>> diff --git a/crmd/throttle.c b/crmd/throttle.c
>> index d77195a..586513a 100644
>> --- a/crmd/throttle.c
>> +++ b/crmd/throttle.c
>> @@ -387,22 +387,36 @@ static bool throttle_io_load(float *load, unsigned int *blocked)
>> }
>> 
>> static enum throttle_state_e
>> -throttle_handle_load(float load, const char *desc)
>> +throttle_handle_load(float load, const char *desc, int cores)
>> {
>> -    if(load > THROTTLE_FACTOR_HIGH * throttle_load_target) {
>> +    float adjusted_load = load;
>> +
>> +    if(cores <= 0) {
>> +        /* No adjusting of the supplied load value */
>> +
>> +    } else if(cores == 1) {
>> +        /* On a single core machine, a load of 1.0 is already too high */
>> +        adjusted_load = load * THROTTLE_FACTOR_MEDIUM;
>> +
>> +    } else {
>> +        /* Normalize the load to be per-core */
>> +        adjusted_load = load / cores;
>> +    }
>> +
>> +    if(adjusted_load > THROTTLE_FACTOR_HIGH * throttle_load_target) {
>>         crm_notice("High %s detected: %f", desc, load);
>>         return throttle_high;
>> 
>> -    } else if(load > THROTTLE_FACTOR_MEDIUM * throttle_load_target) {
>> +    } else if(adjusted_load > THROTTLE_FACTOR_MEDIUM * throttle_load_target) {
>>         crm_info("Moderate %s detected: %f", desc, load);
>>         return throttle_med;
>> 
>> -    } else if(load > THROTTLE_FACTOR_LOW * throttle_load_target) {
>> +    } else if(adjusted_load > THROTTLE_FACTOR_LOW * throttle_load_target) {
>>         crm_debug("Noticable %s detected: %f", desc, load);
>>         return throttle_low;
>>     }
>> 
>> -    crm_trace("Negligable %s detected: %f", desc, load);
>> +    crm_trace("Negligable %s detected: %f", desc, adjusted_load);
>>     return throttle_none;
>> }
>> 
>> @@ -464,22 +478,12 @@ throttle_mode(void)
>>     }
>> 
>>     if(throttle_load_avg(&load)) {
>> -        float simple = load / cores;
>> -        mode |= throttle_handle_load(simple, "CPU load");
>> +        mode |= throttle_handle_load(load, "CPU load", cores);
>>     }
>> 
>>     if(throttle_io_load(&load, &blocked)) {
>> -        float blocked_ratio = 0.0;
>> -
>> -        mode |= throttle_handle_load(load, "IO load");
>> -
>> -        if(cores) {
>> -            blocked_ratio = blocked / cores;
>> -        } else {
>> -            blocked_ratio = blocked;
>> -        }
>> -
>> -        mode |= throttle_handle_load(blocked_ratio, "blocked IO ratio");
>> +        mode |= throttle_handle_load(load, "IO load", 0);
>> +        mode |= throttle_handle_load(blocked, "blocked IO ratio", cores);
>>     }
>> 
>>     if(mode & throttle_extreme) {
>> 
>> 
>> 
>> 
>> On 12 Nov 2013, at 3:25 pm, yusuke iida <yusk.iida at gmail.com> wrote:
>> 
>>> Hi, Andrew
>>> 
>>> I'm sorry.
>>> This report was taken when two cores were assigned to the virtual machine.
>>> https://drive.google.com/file/d/0BwMFJItoO-fVdlIwTVdFOGRkQ0U/edit?usp=sharing
>>> 
>>> I'm sorry for the confusion.
>>> 
>>> This is the report acquired with one core.
>>> https://drive.google.com/file/d/0BwMFJItoO-fVSlo0dE0xMzNORGc/edit?usp=sharing
>>> 
>>> LRMD_MAX_CHILDREN is not defined on any node.
>>> load-threshold is still at its default.
>>> cib_max_cpu is set to 0.4 by the following code.
>>> 
>>>       if(cores == 1) {
>>>           cib_max_cpu = 0.4;
>>>       }
>>> 
>>> So if the CIB load exceeds 60% (1.5 * 0.4), the node enters the Extreme state.
>>> Nov 08 11:08:31 [2390] vm01       crmd: (  throttle.c:441   )  notice:
>>> throttle_mode:        Extreme CIB load detected: 0.670000
>>> 
>>> Shortly afterwards, the DC detects that vm01 is in the Extreme state.
>>> Nov 08 11:08:32 [2387] vm13       crmd: (  throttle.c:701   )   debug:
>>> throttle_update:     Host vm01 supports a maximum of 2 jobs and
>>> throttle mode 1000.  New job limit is 1
>>> 
>>> From the following log, the dynamic adjustment of batch-limit also seems
>>> to be working properly.
>>> # grep "throttle_get_total_job_limit" pacemaker.log
>>> (snip)
>>> Nov 08 11:08:31 [2387] vm13       crmd: (  throttle.c:629   )   trace:
>>> throttle_get_total_job_limit:    No change to batch-limit=0
>>> Nov 08 11:08:32 [2387] vm13       crmd: (  throttle.c:632   )   trace:
>>> throttle_get_total_job_limit:    Using batch-limit=8
>>> (snip)
>>> Nov 08 11:10:32 [2387] vm13       crmd: (  throttle.c:632   )   trace:
>>> throttle_get_total_job_limit:    Using batch-limit=16
>>> 
>>> The above shows that the problem is not solved even when the total
>>> number of jobs is restricted by batch-limit.
>>> Are there any other ways to reduce the number of synchronization messages?
>>> 
>>> There are not many internal IPC messages.
>>> Could at least a few of them be handled while the synchronization
>>> messages are being processed?
>>> 
>>> Regards,
>>> Yusuke
>>> 
>>> 2013/11/12 Andrew Beekhof <andrew at beekhof.net>:
>>>> 
>>>> On 11 Nov 2013, at 11:48 pm, yusuke iida <yusk.iida at gmail.com> wrote:
>>>> 
>>>>> I also checked the execution of the graph.
>>>>> Since the number of pending actions is capped at 16 from partway
>>>>> through, I judge that batch-limit is taking effect.
>>>>> However, even when jobs are restricted by batch-limit, two or more
>>>>> jobs are always fired per second.
>>>>> Each completed job returns a result, which generates CIB
>>>>> synchronization messages.
>>>>> A node that keeps receiving synchronization messages processes them
>>>>> first and postpones internal IPC messages.
>>>>> I think that caused the timeouts.
>>>> 
>>>> What load-threshold were you running this with?
>>>> 
>>>> I see this in the logs:
>>>> "Host vm10 supports a maximum of 4 jobs and throttle mode 0100.  New job limit is 1"
>>>> 
>>>> Have you set LRMD_MAX_CHILDREN=4 on these nodes?
>>>> I wouldn't recommend that for a single core VM.  I'd let the default of 2*cores be used.
>>>> 
>>>> 
>>>> Also, I'm not seeing "Extreme CIB load detected".  Are these still single core machines?
>>>> If so it would suggest that something about:
>>>> 
>>>>       if(cores == 1) {
>>>>           cib_max_cpu = 0.4;
>>>>       }
>>>>       if(throttle_load_target > 0.0 && throttle_load_target < cib_max_cpu) {
>>>>           cib_max_cpu = throttle_load_target;
>>>>       }
>>>> 
>>>>       if(load > 1.5 * cib_max_cpu) {
>>>>           /* Can only happen on machines with a low number of cores */
>>>>           crm_notice("Extreme %s detected: %f", desc, load);
>>>>           mode |= throttle_extreme;
>>>> 
>>>> is wrong.
>>>> 
>>>> What was load-threshold configured as?
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>> 
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> ----------------------------------------
>>> METRO SYSTEMS CO., LTD
>>> 
>>> Yusuke Iida
>>> Mail: yusk.iida at gmail.com
>>> ----------------------------------------
>>> 
> 
> 
> 
> -- 
> ----------------------------------------
> METRO SYSTEMS CO., LTD
> 
> Yusuke Iida
> Mail: yusk.iida at gmail.com
> ----------------------------------------
> 
