[ClusterLabs] Corosync: 100% cpu (corosync 2.3.5, libqb 0.17.1, pacemaker 1.1.13)

Pallai Roland pallair at magex.hu
Wed Aug 5 19:57:18 EDT 2015


hi,

I've built a recent cluster stack from source on Debian Jessie and I can't
get rid of CPU spikes. Corosync blocks the entire system for seconds on
every simple transition, including itself:

 drbdtest1 corosync[4734]:   [MAIN  ] Corosync main process was not scheduled for 2590.4512 ms (threshold is 2400.0000 ms). Consider token timeout increase.

and even drbd:
 drbdtest1 kernel: drbd p1: PingAck did not arrive in time.

My previous build (corosync 1.4.6, libqb 0.17.0, pacemaker 1.1.12) works
fine on these nodes with the same corosync/pacemaker setup.

What should I try? It's a test environment, the issue is 100% reproducible
in seconds. Network traffic is minimal all the time and there is no I/O
load.


*Pacemaker config:*

node 167969573: drbdtest1
node 167969574: drbdtest2
primitive drbd_p1 ocf:linbit:drbd \
        params drbd_resource=p1 \
        op monitor interval=30
primitive drbd_p2 ocf:linbit:drbd \
        params drbd_resource=p2 \
        op monitor interval=30
primitive dummy_test ocf:pacemaker:Dummy \
        meta allow-migrate=true \
        params state="/var/run/activenode"
primitive fence_libvirt stonith:external/libvirt \
        params hostlist="drbdtest1,drbdtest2" hypervisor_uri="qemu+ssh://libvirt-fencing@mgx4/system" \
        op monitor interval=30
primitive fs_boot Filesystem \
        params device="/dev/null" directory="/boot" fstype="*" \
        meta is-managed=false \
        op monitor interval=20 timeout=40 on-fail=block OCF_CHECK_LEVEL=20
primitive fs_f1 Filesystem \
        params device="/dev/drbd/by-res/p1" directory="/mnt/p1" fstype=ext4 options="commit=60,barrier=0,data=writeback" \
        op monitor interval=20 timeout=40 \
        op start timeout=300 interval=0 \
        op stop timeout=180 interval=0
primitive ip_10.3.3.138 IPaddr2 \
        params ip=10.3.3.138 cidr_netmask=32 \
        op monitor interval=10s timeout=20s
primitive sysinfo ocf:pacemaker:SysInfo \
        op start timeout=20s interval=0 \
        op stop timeout=20s interval=0 \
        op monitor interval=60s
group dummy-group dummy_test
ms ms_drbd_p1 drbd_p1 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
ms ms_drbd_p2 drbd_p2 \
        meta master-max=2 master-node-max=1 clone-max=2 notify=true
clone fencing_by_libvirt fence_libvirt \
        meta globally-unique=false
clone fs_boot_clone fs_boot
clone sysinfos sysinfo \
        meta globally-unique=false
location fs1_on_high_load fs_f1 \
        rule -inf: cpu_load gte 4
colocation dummy_coloc inf: dummy-group ms_drbd_p2:Master
colocation f1a-coloc inf: fs_f1 ms_drbd_p1:Master
colocation f1b-coloc inf: fs_f1 fs_boot_clone:Started
order dummy_order inf: ms_drbd_p2:promote dummy-group:start
order orderA inf: ms_drbd_p1:promote fs_f1:start
property cib-bootstrap-options: \
        dc-version=1.1.13-6052cd1 \
        cluster-infrastructure=corosync \
        expected-quorum-votes=2 \
        no-quorum-policy=ignore \
        symmetric-cluster=true \
        placement-strategy=default \
        last-lrm-refresh=1438735742 \
        have-watchdog=false
property cib-bootstrap-options-stonith: \
        stonith-enabled=true \
        stonith-action=reboot
rsc_defaults rsc-options: \
        resource-stickiness=100
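
The numeric node IDs in the config above (167969573, 167969574) look auto-generated rather than hand-picked: 167969573 is the 32-bit integer value of 10.3.3.37, the bindnetaddr shown in corosync.conf. Here is a minimal Python sketch of that arithmetic. It assumes the ID is the big-endian IPv4 value with the top bit cleared when clear_node_high_bit is set; the helper name is hypothetical and not part of corosync.

```python
import ipaddress

def nodeid_from_ipv4(addr, clear_high_bit=True):
    """Hypothetical helper: integer value of an IPv4 address.

    Assumption: the node ID equals the address read as a big-endian
    32-bit integer, with the highest bit cleared if requested.
    """
    value = int(ipaddress.IPv4Address(addr))
    if clear_high_bit:
        value &= 0x7FFFFFFF  # keep the ID positive as a signed 32-bit int
    return value

# 10.3.3.37 is the bindnetaddr from corosync.conf in this post.
print(nodeid_from_ipv4("10.3.3.37"))
```

For 10.3.3.37 this gives 167969573, matching drbdtest1. The ID for drbdtest2 would correspond to 10.3.3.38, but that address does not appear in the post and is only an inference.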


*corosync.conf:*

totem {
        version: 2
        token: 3000
        token_retransmits_before_loss_const: 10
        clear_node_high_bit: yes
        crypto_cipher: none
        crypto_hash: none
        interface {
                ringnumber: 0
                bindnetaddr: 10.3.3.37
                mcastaddr: 225.0.0.37
                mcastport: 5403
                ttl: 1
        }
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: on
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
}

quorum {
        provider: corosync_votequorum
        expected_votes: 2
}
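
The threshold in the "not scheduled" warning quoted earlier (2400 ms) is 80% of the token value configured here (3000 ms). Here is a minimal Python sketch that parses such a line and compares the pause against the token timeout. The 0.8 ratio is inferred from these two numbers only, and the helper name and returned field names are hypothetical.

```python
import re

# Matches the corosync "not scheduled" warning as quoted in this post.
PAUSE_RE = re.compile(
    r"not\s+scheduled\s+for\s+([\d.]+)\s+ms\s+"
    r"\(threshold\s+is\s+([\d.]+)\s+ms\)"
)

def parse_pause(line, token_ms):
    """Hypothetical helper: extract pause and threshold from a log line.

    Returns None if the line is not a scheduling-pause warning.
    """
    match = PAUSE_RE.search(line)
    if match is None:
        return None
    pause_ms = float(match.group(1))
    threshold_ms = float(match.group(2))
    return {
        "pause_ms": pause_ms,
        "threshold_ms": threshold_ms,
        # Ratio of the logged threshold to the configured token timeout.
        "threshold_ratio": threshold_ms / token_ms,
        # True if the pause alone was longer than the token timeout.
        "exceeds_token": pause_ms > token_ms,
    }

line = ("[MAIN  ] Corosync main process was not scheduled for 2590.4512 ms "
        "(threshold is 2400.0000 ms). Consider token timeout increase.")
info = parse_pause(line, token_ms=3000)
print(info)
```

With the values from this post the pause (about 2590 ms) is over the warning threshold but still under the 3000 ms token timeout, so a check like this can help separate warnings from pauses long enough to risk a token loss.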