[ClusterLabs] Antw: Re: Regular pengine warnings after a transient failure

Wed Mar 9 08:31:55 CET 2016

>>> Ferenc Wágner <wferi at niif.hu> wrote on 08.03.2016 at 15:08 in message
<87wppdoydv.fsf at lant.ki.iif.hu>:
> Ken Gaillot <kgaillot at redhat.com> writes:
> 
>> On 03/07/2016 02:03 PM, Ferenc Wágner wrote:
>>
>>> The transition-keys match, does this mean that the above is a late
>>> result from the monitor operation which was considered timed-out
>>> previously?  How did it reach vhbl07, if the DC at that time was vhbl03?
>>> 
>>>> The pe-input files from the transitions around here should help.
>>> 
>>> They are available.  What shall I look for?
>>
>> It's not the most user-friendly of tools, but crm_simulate can show how
>> the cluster would react to each transition: crm_simulate -Sx $FILE.bz2
> 
> $ /usr/sbin/crm_simulate -Sx pe-input-430.bz2 -D recover_many.dot
> [...]
> $ dot recover_many.dot -Tpng >recover_many.png
> dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.573572 to fit
> 
> The result is a 32767x254 bitmap of green ellipses connected by arrows.

That completely agrees with my experience: for real-life configurations those
graphs are gigantic. When outputting SVG instead of PNG (dot -Tsvg), you can at
least zoom and pan in the graph (even Firefox can do it). Or, if you feel crazy
enough, you can load the graph into Inkscape and delete the details you are not
interested in.

The best solution of course would be to omit irrelevant details from the graph
before creating the dot file...
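
Lacking that, one can at least prune the dot file after the fact. Below is a
minimal sketch of such a filter, not a tested tool: it assumes the dot file has
one node or edge statement per line, and it is a plain text filter rather than
a real dot parser. The function name filter_dot and the resource names in the
sample are hypothetical; only the node names vhbl03 and vhbl05 come from this
thread.

```python
import re


def filter_dot(dot_text, keyword):
    """Keep only the node and edge statements that mention `keyword`.

    Assumes one statement per line; lines that do not mention the
    keyword are dropped, except for the graph header and closing brace.
    """
    kept = []
    for line in dot_text.splitlines():
        stripped = line.strip()
        # Always keep the graph header and the closing brace.
        if re.match(r"(strict\s+)?(di)?graph\b", stripped) or stripped == "}":
            kept.append(line)
        # Keep any statement that mentions the keyword (e.g. a node name).
        elif keyword in stripped:
            kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical sample input: resources "a" and "b" are made up.
sample = '''digraph "g" {
"a_start_0 vhbl05" -> "a_monitor_0 vhbl05" [ style = bold]
"b_start_0 vhbl03" -> "b_monitor_0 vhbl03" [ style = bold]
"a_start_0 vhbl05" [ style=bold color="green" fontcolor="black"]
"b_start_0 vhbl03" [ style=bold color="green" fontcolor="black"]
}
'''

# Keep only what concerns vhbl05.
small = filter_dot(sample, "vhbl05")
print(small)
```

The reduced text can then be written to a file and fed to dot as before. Edges
are kept when either end mentions the keyword, so a few neighbouring nodes may
still show up in the picture.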

> Most arrows are impossible to follow, but the picture seems to agree
> with the textual output from crm_simulate:
> 
> * 30 FAILED resources on vhbl05 are to be recovered
> * 32 Stopped resources are to be started (these are actually running,
>   but considered Stopped as a consequence of the crmd restart on vhbl03)
> 
> On the other hand, simulation based on pe-input-431.bz2 reports
> * only 2 FAILED resources to recover on vhbl05
> * 36 resources to start (the 4 new are the ones whose recoveries started
>   during the previous -- aborted -- transition)
> 
> I failed to extract anything more out of these simulations than what was
> already known from the logs.  But I'm happy to see that the cluster
> probes the disappeared resources on vhbl03 (where they disappeared with
> the crmd restart) even though it plans to start some of them on other
> nodes.
> -- 
> Regards,
> Feri
> 