[ClusterLabs] Regular pengine warnings after a transient failure

Tue Mar 8 09:08:44 EST 2016

Ken Gaillot <kgaillot at redhat.com> writes:

> On 03/07/2016 02:03 PM, Ferenc Wágner wrote:
>
>> The transition-keys match, does this mean that the above is a late
>> result from the monitor operation which was considered timed-out
>> previously?  How did it reach vhbl07, if the DC at that time was vhbl03?
>> 
>>> The pe-input files from the transitions around here should help.
>> 
>> They are available.  What shall I look for?
>
> It's not the most user-friendly of tools, but crm_simulate can show how
> the cluster would react to each transition: crm_simulate -Sx $FILE.bz2

$ /usr/sbin/crm_simulate -Sx pe-input-430.bz2 -D recover_many.dot
[...]
$ dot recover_many.dot -Tpng >recover_many.png
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.573572 to fit

The result is a 32767x254 bitmap of green ellipses connected by arrows.
Most arrows are impossible to follow, but the picture seems to agree
with the textual output from crm_simulate:

* 30 FAILED resources on vhbl05 are to be recovered
* 32 Stopped resources are to be started (these are actually running,
  but considered Stopped as a consequence of the crmd restart on vhbl03)

On the other hand, the simulation based on pe-input-431.bz2 reports:

* only 2 FAILED resources to recover on vhbl05
* 36 resources to start (the 4 new ones are those whose recoveries
  started during the previous -- aborted -- transition)
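
For comparing several pe-input files without counting by eye, the
tallies above could be scripted.  What follows is only a rough sketch,
not something I ran against these files.  It assumes that each resource
in the "Current cluster status" part of the crm_simulate output is
printed as "name (class:type): STATE [node]" and that this part ends at
a "Transition Summary" line; both are assumptions about the output
format and may need adjusting for a given Pacemaker version.  The
function names count_states and simulate are made up for illustration.

```python
import subprocess
from collections import Counter
import re

# Assumed status words; real output may contain others as well.
STATES = ("FAILED", "Stopped", "Started")


def count_states(text):
    """Count resources per state in the current-status part of the text."""
    counts = Counter()
    for line in text.splitlines():
        # Assumption: the planned actions start here, so stop counting.
        if line.strip().startswith("Transition Summary"):
            break
        # Assumption: a status line looks like "name (class:type): STATE".
        match = re.search(r"\):\s+(\w+)", line)
        if match and match.group(1) in STATES:
            counts[match.group(1)] += 1
    return counts


def simulate(path):
    """Run crm_simulate -S -x on one pe-input file and return its stdout."""
    result = subprocess.run(
        ["crm_simulate", "-S", "-x", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    for name in ("pe-input-430.bz2", "pe-input-431.bz2"):
        print(name, dict(count_states(simulate(name))))
```

If the assumptions hold, the FAILED and Stopped counts printed for the
two files should match the numbers listed above.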

I failed to extract anything more from these simulations than what was
already known from the logs.  But I'm happy to see that the cluster
probes the disappeared resources on vhbl03 (where they disappeared with
the crmd restart) even though it plans to start some of them on other
nodes.
-- 
Regards,
Feri