[ClusterLabs] 'crm node standby' command failing with "Error performing operation: Communication error on send . Return code is 70"

Mon Sep 24 13:57:06 UTC 2018

On Fri, 2018-09-21 at 13:34 +0530, Prasad Nagaraj wrote:
> Hi -
> 
> Yesterday, I noticed that when I tried to execute the 'crm node
> standby' command on one of my cluster nodes, it failed with
> 
> "Error performing operation: Communication error on send . Return
> code is 70"
> 
> My corosync logs had these entries during that time:
> 
> Sep 20 22:14:54 [4454] vm5c336912f1       crmd:   notice:
> throttle_handle_load: High CPU load detected: 1.850000
> Sep 20 22:14:57 [4449] vm5c336912f1        cib:     info:
> cib_process_ping:     Reporting our current digest to vmb546073338:
> 8fe67fcfcd20515c246c225a124a8902 for 0.481.2 (0x2742230 0)
> Sep 20 22:15:09 [4449] vm5c336912f1        cib:     info:
> cib_process_request:  Forwarding cib_modify operation for section
> nodes to master (origin=local/crm_attribute/4)
> Sep 20 22:15:24 [4454] vm5c336912f1       crmd:   notice:
> throttle_handle_load: High CPU load detected: 1.640000
> Sep 20 22:15:54 [4454] vm5c336912f1       crmd:     info:
> throttle_handle_load: Moderate CPU load detected: 0.990000
> Sep 20 22:15:54 [4454] vm5c336912f1       crmd:     info:
> throttle_send_command:        New throttle mode: 0010 (was 0100)
> Sep 20 22:16:24 [4454] vm5c336912f1       crmd:     info:
> throttle_send_command:        New throttle mode: 0001 (was 0010)
> Sep 20 22:16:54 [4454] vm5c336912f1       crmd:     info:
> throttle_send_command:        New throttle mode: 0000 (was 0001)
> Sep 20 22:17:09 [4449] vm5c336912f1        cib:     info:
> cib_process_request:  Forwarding cib_modify operation for section
> nodes to master (origin=local/crm_attribute/4)
> Sep 20 22:19:10 [4449] vm5c336912f1        cib:     info:
> cib_process_request:  Forwarding cib_modify operation for section
> nodes to master (origin=local/crm_attribute/4)
> Sep 20 22:23:08 [4449] vm5c336912f1        cib:     info:
> cib_perform_op:       Diff: --- 0.481.2 2
> Sep 20 22:23:08 [4449] vm5c336912f1        cib:     info:
> cib_perform_op:       Diff: +++ 0.482.0
> 9bacc862b8713430c81ea91694942a41
> Sep 20 22:23:08 [4449] vm5c336912f1        cib:     info:
> cib_perform_op:       +  /cib:  @epoch=482, @num_updates=0 
> 
> 
> Is the above behavior due to Pacemaker thinking that the cluster is
> highly loaded and trying to throttle the execution of commands? What
> is the best way to resolve or work around such problems? We do have
> high I/O load on our cluster, which hosts a MySQL database.

Throttling is a natural way to handle occasional high load and is not a
problem in itself. I wouldn't expect a load of 1.85 to make a big
difference, so I wouldn't worry about that unless other load-related
problems emerge.
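
To judge whether load-related messages are occasional or persistent, it can help to pull the reported values out of the log. The following is only a minimal sketch: it assumes the message layout shown in the quoted excerpt (with the "> " quote markers stripped), and the names LOAD_RE, throttle_loads and SAMPLE are made up for this example.

```python
import re

# Matches crmd throttle messages such as
# "throttle_handle_load: High CPU load detected: 1.850000".
# \s+ is used between fields so that mail-wrapped lines still match.
LOAD_RE = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\[\d+\]\s+\S+\s+"
    r"crmd:\s+\w+:\s+throttle_handle_load:\s+(?P<level>\w+)\s+CPU\s+load\s+"
    r"detected:\s+(?P<load>\d+\.\d+)"
)


def throttle_loads(log_text):
    """Return (timestamp, level, load) tuples for each throttle load report."""
    return [
        (m.group("stamp"), m.group("level"), float(m.group("load")))
        for m in LOAD_RE.finditer(log_text)
    ]


# Two entries copied from the excerpt above, without the quote markers.
SAMPLE = """\
Sep 20 22:14:54 [4454] vm5c336912f1       crmd:   notice:
throttle_handle_load: High CPU load detected: 1.850000
Sep 20 22:15:54 [4454] vm5c336912f1       crmd:     info:
throttle_handle_load: Moderate CPU load detected: 0.990000
"""

for stamp, level, load in throttle_loads(SAMPLE):
    print(stamp, level, load)
```

A handful of reports that settle within a couple of minutes, as in the excerpt, is consistent with ordinary throttling rather than a persistent problem.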

The error message you reported sounds more like a networking issue than
a load issue. Are you seeing any network issues around that time?
Corosync retransmits or token timeouts would be especially significant.
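
One rough way to look for such messages is to filter the corosync log for a few telltale phrases. This is a sketch under assumptions, not a definitive check: the phrases in SUSPECT_PHRASES are guesses at typical corosync wording, which varies between versions, and find_suspect_lines and EXAMPLE_LINES are names invented for this example.

```python
import re

# ASSUMPTION: phrases commonly seen in corosync totem messages. The exact
# wording depends on the corosync version, so edit this list to match your logs.
SUSPECT_PHRASES = ("Retransmit List", "processor failed", "token")


def find_suspect_lines(lines, phrases=SUSPECT_PHRASES):
    """Return the lines that contain any phrase, matched case-insensitively."""
    pattern = re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


# Made-up lines for illustration only; they are not taken from a real log.
EXAMPLE_LINES = [
    "example: Retransmit List: 1a 1b\n",
    "example: cib_process_ping: Reporting our current digest\n",
    "example: A processor failed, forming new configuration.\n",
]

for line in find_suspect_lines(EXAMPLE_LINES):
    print(line)
```

In practice the function would be fed the lines of the actual log file, restricted to the period around the failed command.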

> 
> Also from the thread,
> https://lists.clusterlabs.org/pipermail/users/2017-May/005702.html
> 
> it was asked:
> > There is not much detail about “load-threshold”.
> > Please can someone share steps or any commands to modify
> > “load-threshold”.
> Could someone advise whether this is the way to control the
> throttling of cluster operations, and how to set this parameter?
> 
> Thanks in advance,
> Prasad
-- 
Ken Gaillot <kgaillot at redhat.com>