[ClusterLabs] Corosync: 100% cpu (corosync 2.3.5, libqb 0.17.1, pacemaker 1.1.13)

Thu Aug 6 02:53:40 EDT 2015

Pallai Roland wrote:
> hi,
>
> I've built a recent cluster stack from sources on Debian Jessie and I can't
> get rid of cpu spikes. Corosync blocks the entire system for seconds on
> every simple transition, even itself:

How many cores do you have? Since 2.0, corosync uses only two threads (and 
one of them is only for logging), so it's virtually impossible for corosync 
to block the ENTIRE system as long as you have more than one core.
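
To check both numbers on a node, here is a minimal sketch. It assumes Linux 
with /proc mounted; the helper name thread_count is made up for this 
example, and the demo runs against the script's own PID, so substitute the 
corosync PID (4734 in the log quoted in this mail) on a real node.

```python
import os


def thread_count(pid):
    """Count the threads of a process by listing /proc/<pid>/task (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/task"))


# os.cpu_count() may return None when the count cannot be determined.
cores = os.cpu_count()
print(f"cores visible to the OS: {cores}")

# Demo on this script's own PID; replace with the corosync PID on a node.
pid = os.getpid()
print(f"threads in pid {pid}: {thread_count(pid)}")
```

If the core count is 1, a busy corosync main thread can starve everything 
else in userspace, which would change the picture.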

>
>   drbdtest1 corosync[4734]:   [MAIN  ] Corosync main process was not
> scheduled for 2590.4512 ms (threshold is 2400.0000 ms). Consider token
> timeout increase.
>
> and even drbd:
>   drbdtest1 kernel: drbd p1: PingAck did not arrive in time.

A kernel module blocked by an unrelated userspace app?
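
For comparing several of these scheduler warnings, the numbers can be pulled 
out of the log line with a short script. This is only a sketch: the regular 
expression is written against the one message quoted above, the helper name 
parse_pause is made up, and the token value 3000 is taken from the 
corosync.conf posted below. The ratio it prints is read off these numbers, 
not taken from documentation.

```python
import re

# The warning exactly as quoted in this thread, with the mail wrapping joined.
LOG_LINE = (
    "drbdtest1 corosync[4734]:   [MAIN  ] Corosync main process was not "
    "scheduled for 2590.4512 ms (threshold is 2400.0000 ms). Consider token "
    "timeout increase."
)

PATTERN = re.compile(r"not scheduled for ([\d.]+) ms \(threshold is ([\d.]+) ms\)")


def parse_pause(line):
    """Return (pause_ms, threshold_ms) from a scheduler warning, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


TOKEN_MS = 3000  # "token: 3000" from the posted corosync.conf

paused, threshold = parse_pause(LOG_LINE)
print(f"pause {paused} ms exceeds threshold by {paused - threshold:.1f} ms")
print(f"threshold / token = {threshold / TOKEN_MS:.2f}")
```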

>
> My previous build (corosync 1.4.6, libqb 0.17.0, pacemaker 1.1.12) works
> fine on these nodes with the same corosync/pacemaker setup.
>
> What should I try? It's a test environment, the issue is 100% reproducible
> in seconds. Network traffic is minimal all the time and there is no I/O
> load.

Set debug to on (or trace) in the logging section of corosync.conf and see 
what happens. Also make sure /dev/shm has enough free space.
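
A quick way to look at /dev/shm from a script, as a minimal sketch. The 
helper name free_mib is made up for this example, and no particular minimum 
is assumed here; the script only reports what is free.

```python
import os
import shutil


def free_mib(path):
    """Return the free space of the filesystem holding `path`, in MiB."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024 * 1024)


SHM = "/dev/shm"

# Guard the demo so the script also runs where /dev/shm does not exist.
if os.path.isdir(SHM):
    print(f"{SHM}: {free_mib(SHM):.1f} MiB free")
else:
    print(f"{SHM} is not present on this system")
```

Running it while the cluster is active shows whether the free space shrinks 
during a transition.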

Honza

>
>
> *Pacemaker config:*
>
> node 167969573: drbdtest1
> node 167969574: drbdtest2
> primitive drbd_p1 ocf:linbit:drbd \
>          params drbd_resource=p1 \
>          op monitor interval=30
> primitive drbd_p2 ocf:linbit:drbd \
>          params drbd_resource=p2 \
>          op monitor interval=30
> primitive dummy_test ocf:pacemaker:Dummy \
>          meta allow-migrate=true \
>          params state="/var/run/activenode"
> primitive fence_libvirt stonith:external/libvirt \
>          params hostlist="drbdtest1,drbdtest2" hypervisor_uri="qemu+ssh://libvirt-fencing@mgx4/system" \
>          op monitor interval=30
> primitive fs_boot Filesystem \
>          params device="/dev/null" directory="/boot" fstype="*" \
>          meta is-managed=false \
>          op monitor interval=20 timeout=40 on-fail=block OCF_CHECK_LEVEL=20
> primitive fs_f1 Filesystem \
>          params device="/dev/drbd/by-res/p1" directory="/mnt/p1" fstype=ext4 options="commit=60,barrier=0,data=writeback" \
>          op monitor interval=20 timeout=40 \
>          op start timeout=300 interval=0 \
>          op stop timeout=180 interval=0
> primitive ip_10.3.3.138 IPaddr2 \
>          params ip=10.3.3.138 cidr_netmask=32 \
>          op monitor interval=10s timeout=20s
> primitive sysinfo ocf:pacemaker:SysInfo \
>          op start timeout=20s interval=0 \
>          op stop timeout=20s interval=0 \
>          op monitor interval=60s
> group dummy-group dummy_test
> ms ms_drbd_p1 drbd_p1 \
>          meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
> ms ms_drbd_p2 drbd_p2 \
>          meta master-max=2 master-node-max=1 clone-max=2 notify=true
> clone fencing_by_libvirt fence_libvirt \
>          meta globally-unique=false
> clone fs_boot_clone fs_boot
> clone sysinfos sysinfo \
>          meta globally-unique=false
> location fs1_on_high_load fs_f1 \
>          rule -inf: cpu_load gte 4
> colocation dummy_coloc inf: dummy-group ms_drbd_p2:Master
> colocation f1a-coloc inf: fs_f1 ms_drbd_p1:Master
> colocation f1b-coloc inf: fs_f1 fs_boot_clone:Started
> order dummy_order inf: ms_drbd_p2:promote dummy-group:start
> order orderA inf: ms_drbd_p1:promote fs_f1:start
> property cib-bootstrap-options: \
>          dc-version=1.1.13-6052cd1 \
>          cluster-infrastructure=corosync \
>          expected-quorum-votes=2 \
>          no-quorum-policy=ignore \
>          symmetric-cluster=true \
>          placement-strategy=default \
>          last-lrm-refresh=1438735742 \
>          have-watchdog=false
> property cib-bootstrap-options-stonith: \
>          stonith-enabled=true \
>          stonith-action=reboot
> rsc_defaults rsc-options: \
>          resource-stickiness=100
>
>
> *corosync.conf:*
>
> totem {
>          version: 2
>          token: 3000
>          token_retransmits_before_loss_const: 10
>          clear_node_high_bit: yes
>          crypto_cipher: none
>          crypto_hash: none
>          interface {
>                  ringnumber: 0
>                  bindnetaddr: 10.3.3.37
>                  mcastaddr: 225.0.0.37
>                  mcastport: 5403
>                  ttl: 1
>          }
> }
>
> logging {
>          fileline: off
>          to_stderr: no
>          to_logfile: yes
>          logfile: /var/log/corosync/corosync.log
>          to_syslog: yes
>          syslog_facility: daemon
>          debug: off
>          timestamp: on
>          logger_subsys {
>                  subsys: QUORUM
>                  debug: off
>          }
> }
>
> quorum {
>          provider: corosync_votequorum
>          expected_votes: 2
> }
>