[ClusterLabs] Antw: Re: Why shouldn't one store resource configuration in the CIB?

Wed Apr 19 08:32:26 CEST 2017

>>> Ferenc Wágner <wferi at niif.hu> wrote on 18.04.2017 at 18:46 in message
<87tw5l64v0.fsf at lant.ki.iif.hu>:
> Ken Gaillot <kgaillot at redhat.com> writes:
> 
>> On 04/13/2017 11:11 AM, Ferenc Wágner wrote:
>> 
>>> I encountered several (old) statements on various forums along the lines
>>> of: "the CIB is not a transactional database and shouldn't be used as
>>> one" or "resource parameters should only uniquely identify a resource,
>>> not configure it" and "the CIB was not designed to be a configuration
>>> database but people still use it that way".  Sorry if I misquote these;
>>> I'm going from memory, as a quick search failed to turn up the links.
>>> 
>>> Well, I've been guilty of the above offenses for years, but it
>>> worked out pretty well that way which helped to suppress these warnings
>>> in the back of my head.  Still, I'm curious: what's the reason for these
>>> warnings, what are the dangers of "abusing" the CIB this way?
>>> /var/lib/pacemaker/cib/cib.xml is 336 kB with 6 nodes and 155 resources
>>> configured.  Old Pacemaker versions required tuning PCMK_ipc_buffer to
>>> handle this, but even the default is big enough nowadays (128 kB after
>>> compression, I guess).
>>> 
>>> Am I walking on thin ice?  What should I look out for?
>>
>> That's a good question. Certainly, there is some configuration
>> information in most resource definitions, so it's more a matter of degree.
>>
>> The main concerns I can think of are:
>>
>> 1. Size: Increasing the CIB size increases the I/O, CPU and networking
>> overhead of the cluster (and if it crosses the compression threshold,
>> significantly). It also marginally increases the time it takes the
>> policy engine to calculate a new state, which slows recovery.
> 
> Thanks for the input, Ken!  Is this what you mean?
> 
> cib: info: crm_compress_string: Compressed 1028972 bytes into 69095 (ratio 14:1) in 138ms
> 
> At the same time /var/lib/pacemaker/cib/cib.xml is 336K, and

I wonder why the CIB is transferred as a whole all the time. Considering that
the configuration rarely changes, it should not have to be sent every time.
Even when a change occurs, only the affected element (e.g. a single resource)
should be transferred. The same applies to the status section.
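
To illustrate what "only the affected element" could mean, here is a minimal
Python sketch. It is not how Pacemaker synchronizes the CIB; the two XML
snippets and the helper name changed_children are made up for illustration,
and deletions are deliberately ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical before/after resource sections (not taken from a real cluster).
OLD = """<resources>
  <primitive id="prm_a" class="ocf" provider="heartbeat" type="Dummy"/>
  <primitive id="prm_b" class="ocf" provider="heartbeat" type="Dummy"/>
</resources>"""

NEW = """<resources>
  <primitive id="prm_a" class="ocf" provider="heartbeat" type="Dummy"/>
  <primitive id="prm_b" class="ocf" provider="heartbeat" type="Stateful"/>
</resources>"""


def changed_children(old_xml, new_xml):
    """Return {id: serialized element} for direct children that are new or differ.

    Children are matched by their "id" attribute. Removed children are not
    reported; a real delta format would need to cover them as well.
    """
    old = {c.get("id"): ET.tostring(c).strip() for c in ET.fromstring(old_xml)}
    changed = {}
    for child in ET.fromstring(new_xml):
        blob = ET.tostring(child).strip()
        if old.get(child.get("id")) != blob:
            changed[child.get("id")] = blob
    return changed


if __name__ == "__main__":
    delta = changed_children(OLD, NEW)
    delta_bytes = sum(len(blob) for blob in delta.values())
    print(sorted(delta), delta_bytes, "bytes instead of", len(NEW.encode()))
```

With 155 resources, a delta for one changed resource would be a small fraction
of the full section, which is the saving I have in mind.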

> 
> # cibadmin -Q --scope resources | wc -c
> 330951
> # cibadmin -Q --scope status | wc -c
> 732820

On a smaller scale, I have 55759 bytes of resources vs. 111181 bytes of status.

As mentioned in another thread, one reason for the large size is the IDs
used to identify elements. For example, in resource "prm_foobar" an
attribute named "iflabel" has the ID "prm_foobar-instance_attributes-iflabel".
Considering that the XML element is (at least) inside a <primitive>,
<instance_attributes>, <nvpair> hierarchy, I wonder whether it's really
necessary to encode the whole path in the ID.
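
A rough way to quantify this is to count the bytes spent on id attributes.
The following Python sketch does that for a small, hypothetical <primitive>
(the sample XML and the function name id_overhead are assumptions, not taken
from a real CIB); the same function could be fed the output of
"cibadmin -Q --scope resources".

```python
import xml.etree.ElementTree as ET

# Hypothetical resource definition using path-style IDs as described above.
SAMPLE = """<primitive id="prm_foobar" class="ocf" provider="heartbeat" type="IPaddr2">
  <instance_attributes id="prm_foobar-instance_attributes">
    <nvpair id="prm_foobar-instance_attributes-iflabel" name="iflabel" value="foo"/>
    <nvpair id="prm_foobar-instance_attributes-ip" name="ip" value="192.0.2.10"/>
  </instance_attributes>
</primitive>"""


def id_overhead(xml_text):
    """Return (bytes used by id="..." attributes, total bytes of xml_text)."""
    id_bytes = 0
    for element in ET.fromstring(xml_text).iter():
        value = element.get("id")
        if value is not None:
            # Count the attribute as it appears in the file: ' id="value"'.
            id_bytes += len((' id="' + value + '"').encode())
    return id_bytes, len(xml_text.encode())


if __name__ == "__main__":
    used, total = id_overhead(SAMPLE)
    print(f"{used} of {total} bytes ({100 * used / total:.0f}%) are id attributes")
```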

The status section is similar: a significant portion is consumed by
transition-key and transition-magic values, which seem "over-unique". For
example, consider these: "158:49:0:69e31903-245d-4265-b732-760ddd369df2" and
"0:0;158:49:0:69e31903-245d-4265-b732-760ddd369df2". They add extra
information to a UUID (Universally Unique ID, 128 bits), which is overkill.
Is it just to add extra semantics? A UUID alone would be more than enough. Even
a 64-bit ID would be enough IMHO. (Note that Microsoft's GUID, Globally Unique
ID, is the same 128-bit format as a UUID.)
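
For reference, the extra fields do seem to carry meaning. As far as I
understand the Pacemaker documentation, a transition-key is
action:transition:expected-rc:UUID, and transition-magic prefixes it with
op-status:rc. The Python sketch below splits the two example values on that
assumption; the field names are my own labels, not Pacemaker identifiers.

```python
import uuid
from typing import NamedTuple


class TransitionKey(NamedTuple):
    action_id: int        # assumed: number of the action in the transition graph
    transition_id: int    # assumed: number of the transition graph
    target_rc: int        # assumed: expected return code
    source: uuid.UUID     # assumed: UUID of the instance that scheduled it


def parse_transition_key(key):
    """Split 'action:transition:target_rc:uuid' into typed fields."""
    action, transition, target_rc, source = key.split(":", 3)
    return TransitionKey(int(action), int(transition), int(target_rc),
                         uuid.UUID(source))


def parse_transition_magic(magic):
    """Split 'op_status:rc;transition-key' into (op_status, rc, TransitionKey)."""
    status_rc, key = magic.split(";", 1)
    op_status, rc = status_rc.split(":")
    return int(op_status), int(rc), parse_transition_key(key)


if __name__ == "__main__":
    print(parse_transition_key("158:49:0:69e31903-245d-4265-b732-760ddd369df2"))
    print(parse_transition_magic("0:0;158:49:0:69e31903-245d-4265-b732-760ddd369df2"))
```

Seen this way, the magic value repeats the complete key, so each operation
entry stores the same UUID twice.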

> 
> Even though I consume about 2 kB per resource, the status section
> weighs 2.2 times as much as the resources section.  Which means shrinking
> the resource size wouldn't change the full size significantly.
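
The figures quoted in this thread can be checked with a few lines of Python.
All numbers below are copied from the messages above; only the variable names
and the "halve the resources section" scenario are mine.

```python
# Sizes in bytes, as reported by cibadmin and crm_compress_string above.
RESOURCES_BYTES = 330951
STATUS_BYTES = 732820
RESOURCE_COUNT = 155
UNCOMPRESSED_BYTES = 1028972
COMPRESSED_BYTES = 69095

per_resource = RESOURCES_BYTES / RESOURCE_COUNT           # about 2135 bytes
status_ratio = STATUS_BYTES / RESOURCES_BYTES             # about 2.21
compression_ratio = UNCOMPRESSED_BYTES / COMPRESSED_BYTES  # about 14.9


def saving_if_resources_shrink(factor):
    """Fraction of (resources + status) saved if resources shrink by `factor`."""
    total = RESOURCES_BYTES + STATUS_BYTES
    return RESOURCES_BYTES * factor / total


if __name__ == "__main__":
    print(f"bytes per resource:  {per_resource:.0f}")
    print(f"status / resources:  {status_ratio:.2f}")
    print(f"compression ratio:   {compression_ratio:.1f}:1")
    print(f"saving when halving: {saving_if_resources_shrink(0.5):.1%}")
```

So even halving every resource definition would save only about 16% of the
two sections combined.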

Another big saving would be replacing the XML elements with a tokenized
representation (back when RAM was scarce, even BASIC interpreters did
that). As no one edits the CIB directly, that wouldn't affect any user (if
cibadmin did the conversions, for example).

> 
> At the same time, we should probably monitor the trends of the cluster
> messaging health as we expand it (with nodes and resources).  What would
> be some useful indicators to graph?

Round-trip time, I guess: the longer the messages, the longer the processing
(compressing/decompressing) and transfer times.
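
To get a feeling for how compression time grows with message size, one can
time it offline. The Python sketch below uses bz2 on the assumption that
crm_compress_string is bzip2-based; the repeated nvpair line is synthetic
filler, so real CIB data will compress differently and the timings are only
indicative.

```python
import bz2
import time

# Synthetic payload line; a real test should use actual cibadmin -Q output.
ROW = b'<nvpair id="prm_foobar-instance_attributes-iflabel" name="iflabel" value="foo"/>\n'


def compress_timing(payload):
    """Return (input bytes, compressed bytes, elapsed milliseconds)."""
    start = time.perf_counter()
    packed = bz2.compress(payload)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return len(payload), len(packed), elapsed_ms


if __name__ == "__main__":
    for copies in (1000, 4000, 16000):
        size, packed, ms = compress_timing(ROW * copies)
        print(f"{size:>8} bytes -> {packed:>6} bytes in {ms:.1f} ms")
```

Graphing the "Compressed ... in ...ms" log lines over time would give the
same kind of trend from the live cluster.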

> 
>> 2. Consistency: Clusters can become partitioned. If changes are made on
>> one or more partitions during the separation, the changes won't be
>> reflected on all nodes until the partition heals, at which time the
>> cluster will reconcile them, potentially losing one side's changes.

If only one side of a partitioned cluster is allowed to make (valid) changes,
that isn't really a problem.
Maybe not everything is working as smoothly as it should.

> 
> Ah, that's a very good point, which I neglected totally: even inquorate
> partitions can have configuration changes.  Thanks for bringing this up!
> I wonder if there's any practical workaround for that.
> 
>> I suppose this isn't qualitatively different from using a separate
>> configuration file, but those tend to be more static, and failure to
>> modify all copies would be more obvious when doing them individually
>> rather than issuing a single cluster command.
> 
> From a different angle: if a node is off, you can't modify its
> configuration file.  So you need an independent mechanism to do what the
> CIB synchronization does anyway, or a shared file system with its added
> complexity.  On the other hand, one needn't guess how Pacemaker
> reconciles the conflicting resource configuration changes.  Indeed, how
> does it?
> -- 
> Thanks,
> Feri
> 