[Pacemaker] Process cib loops infinitely with 100% cpu usage and can't be killed

Gabriel Gomiz ggomiz at cooperativaobrera.coop
Thu May 29 08:00:09 EDT 2014

On 05/26/2014 08:56 PM, Andrew Beekhof wrote:
> On 27 May 2014, at 5:48 am, Gabriel Gomiz <ggomiz at cooperativaobrera.coop> wrote:
>> Hello Andrew and cluster folks!
>> Over the last month we have been experiencing a weird problem with the cib process on one of
>> our nodes ('gandalf'); it's a 4-node cluster. Brief description:
>> For some undetermined reason (we still can't figure out why) it begins looping infinitely and
>> consuming 100% CPU.
> Apart from the CPU usage, is there something in particular that makes you think it's looping?
Maybe because stracing the process also hangs and the process does not respond to the kill signal
(maybe it is stuck in a system call inside the kernel?).
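
One way to check the "stuck inside the kernel" guess on Linux is to read the process state from
/proc/<pid>/stat. This is only a minimal sketch; the function names are made up for illustration.
A state of D (uninterruptible sleep) would be consistent with a process that leaves signals
pending, while R together with 100% CPU would point more towards a busy loop.

```python
def parse_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so take everything after the LAST ')'.
    rest = stat_line[stat_line.rindex(")") + 1:]
    return rest.split()[0]


def process_state(pid):
    """Read /proc/<pid>/stat (Linux only) and return the state letter."""
    with open("/proc/%d/stat" % pid) as fh:
        return parse_state(fh.read())
```

For example, process_state(pid_of_cib) could be polled a few times while the CPU is pegged, where
pid_of_cib stands for whatever PID the cib process currently has.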
> There have been some big steps forward in cib for the next upstream release (it's basically 2 orders of magnitude faster/more efficient).
> Current versions will regularly max out a core, albeit for hopefully short periods of time depending on the cluster size:
> 	https://twitter.com/beekhof/status/412913549837475840
> It's also a vicious circle - a busy cib leads to failed resource actions, which leads to recovery operations, which leads to more work for the cib.
> Looking at the size of your cluster, 87 resources on 4 nodes... I can imagine that benefitting greatly from the coming version.
> I notice you're using a rhel package, are you a RH customer or is this on a clone?
> Also, did anything specific happen prior to the CIB going nuts?
The only thing I can think of is a lot of calls to crm_mon via a shell script that we use to check
which resource groups each node is servicing (attached if you're curious).
We use this script to apply puppet manifests conditionally to our nodes and do some monitoring. We
also have cron jobs that check via the script whether the resource group is active before running.
Maybe the sum of those calls can make the cib process very busy...?
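
If the number of crm_mon calls does turn out to matter, one option would be to cache the output
for a short time so that puppet runs and cron jobs share a single query. The following is only a
sketch of that idea, not what the attached script does; the function names, the cache path and
the 30 second lifetime are all assumptions.

```python
import os
import subprocess
import tempfile
import time


def run_command(cmd):
    """Run cmd (a list of arguments) and return its stdout as text."""
    return subprocess.check_output(cmd).decode()


def cached_output(cmd, cache_path, max_age=30.0, runner=run_command):
    """Return cmd's output, reusing a cached copy younger than max_age seconds."""
    try:
        if time.time() - os.path.getmtime(cache_path) < max_age:
            with open(cache_path) as fh:
                return fh.read()
    except OSError:
        pass  # no cache yet (or unreadable): fall through and refresh

    output = runner(cmd)
    # Write to a temporary file and rename it into place, so that
    # concurrent readers never see a partially written cache file.
    directory = os.path.dirname(cache_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        fh.write(output)
    os.replace(tmp_path, cache_path)
    return output
```

A caller could then use something like cached_output(["crm_mon", "-1"], "/var/run/crm_mon.cache"),
where the cache path is hypothetical, and parse the returned text as before. The trade-off is that
a job may act on status that is up to max_age seconds old.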

Anyway, I've built 1.1.12 rc1 RPMs and upgraded the cluster this morning. I will let you know if
anything weird happens after this upgrade.

