[Pacemaker] Corosync service taking 100% cpu and is unable to stop gracefully

Thu Apr 19 07:11:32 EDT 2012

Major issues:
1) Corosync reaching over 100% cpu usage.
2) Corosync unable to stop gracefully.
3) The Virtual IP of a resource being assigned as the primary IP on an 
interface after a cable disconnect/reconnect on that interface. The static IP 
on the interface is then shown as a global secondary IP.

Use case:
1) Two nodes in a cluster.
2) Two communication paths exist between the two nodes, with “rrp_mode” set to 
active in corosync.conf:
  a. One path is a back-to-back connection between the nodes.
  b. The second is via the LAN network switch.
3) The network cables were unplugged on one of the nodes (on both 
interfaces) and reconnected after a short while.
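
For reference, a two-ring setup of this kind would typically look something 
like the totem section below. This is only a sketch: the network addresses, 
multicast addresses and ports are placeholders, not the values from the 
affected cluster.

```
totem {
    version: 2
    rrp_mode: active

    # Ring 0: back-to-back link (placeholder addresses)
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.100.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }

    # Ring 1: path via the LAN switch (placeholder addresses)
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.0.0
        mcastaddr: 226.94.1.2
        mcastport: 5407
    }
}
```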

Observations:
1) Corosync service was taking 100% cpu on the node whose link was down:
  a. In the above scenario Corosync service could not be stopped gracefully. A 
SIGKILL had to be issued to stop the service.
  b. On this node, of the two interfaces configured in corosync.conf, one was 
also the preferred interface (eth) for the Virtual IP.
    i. It was observed that when the link came back up after the 
disconnection, the primary global IP on that interface was the Virtual IP 
configured for a resource.
    ii. The static IP assigned to the interface was listed as “scope global 
secondary” in the output of `ip addr show`.
    iii. Also, the Virtual IPs of the resources configured in pacemaker were 
active on both nodes.
    iv. `service network restart` also did not work.
  c. Corosync was stopped (killed with SIGKILL, since it could not be stopped 
gracefully), the network service was restarted, and then corosync was started 
again. Everything worked normally after this.
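
The check in 1.b.ii and the workaround in 1.c can be sketched roughly as 
below. The function names are made up for illustration, and the sketch assumes 
the one-line output format of `ip -o -4 addr show`, a `pkill` that supports 
`-x`, and init scripts named `network` and `corosync`. It is not a tested 
recovery procedure.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical helpers, not part of any tool.

# Reads `ip -o -4 addr show` output on stdin and prints "<iface> <addr/prefix>"
# for every IPv4 address that is flagged as secondary.
list_secondary_addrs() {
    awk '$3 == "inet" && / secondary / { print $2, $4 }'
}

# Workaround from 1.c: kill the hung corosync, restart networking,
# then start corosync again. Must be run as root.
recover_corosync() {
    pkill -KILL -x corosync     # graceful stop hangs, so send SIGKILL
    service network restart     # restores the normal address order
    service corosync start
}
```

Running `ip -o -4 addr show | list_secondary_addrs` on the affected node 
should list the static IP while the problem is present. `recover_corosync` is 
only defined here, not called, and has to be invoked explicitly.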