[ClusterLabs] Why shouldn't one store resource configuration in the CIB?

Tue Apr 18 19:21:21 EDT 2017

On 04/18/2017 11:46 AM, Ferenc Wágner wrote:
> Ken Gaillot <kgaillot at redhat.com> writes:
> 
>> On 04/13/2017 11:11 AM, Ferenc Wágner wrote:
>>
>>> I encountered several (old) statements on various forums along the lines
>>> of: "the CIB is not a transactional database and shouldn't be used as
>>> one" or "resource parameters should only uniquely identify a resource,
>>> not configure it" and "the CIB was not designed to be a configuration
>>> database but people still use it that way".  Sorry if I misquote these,
>>> I go by my memories now, I failed to dig up the links by a quick try.
>>>
>>> Well, I've been feeling guilty in the above offenses for years, but it
>>> worked out pretty well that way which helped to suppress these warnings
>>> in the back of my head.  Still, I'm curious: what's the reason for these
>>> warnings, what are the dangers of "abusing" the CIB this way?
>>> /var/lib/pacemaker/cib/cib.xml is 336 kB with 6 nodes and 155 resources
>>> configured.  Old Pacemaker versions required tuning PCMK_ipc_buffer to
>>> handle this, but even the default is big enough nowadays (128 kB after
>>> compression, I guess).
>>>
>>> Am I walking on thin ice?  What should I look out for?
>>
>> That's a good question. Certainly, there is some configuration
>> information in most resource definitions, so it's more a matter of degree.
>>
>> The main concerns I can think of are:
>>
>> 1. Size: Increasing the CIB size increases the I/O, CPU and networking
>> overhead of the cluster (and if it crosses the compression threshold,
>> significantly). It also marginally increases the time it takes the
>> policy engine to calculate a new state, which slows recovery.
> 
> Thanks for the input, Ken!  Is this what you mean?
> 
> cib: info: crm_compress_string: Compressed 1028972 bytes into 69095 (ratio 14:1) in 138ms

yep

> At the same time /var/lib/pacemaker/cib/cib.xml is 336K, and
> 
> # cibadmin -Q --scope resources | wc -c
> 330951
> # cibadmin -Q --scope status | wc -c
> 732820
> 
> Even though I consume about 2 kB per resource, the status section
> weights 2.2 times the resources section.  Which means shrinking the
> resource size wouldn't change the full size significantly.

good point

> At the same time, we should probably monitor the trends of the cluster
> messaging health as we expand it (with nodes and resources).  What would
> be some useful indicators to graph?

I think the main concern would be CPU spikes when a new state needs to
be calculated (which is at least every cluster-recheck-interval).

Network traffic on the cluster communication link would be interesting,
especially at start-up when everything is happening at once, or after a
global clean-up of all resources.

I/O on whatever holds /var/lib/pacemaker will probably be small, but
wouldn't hurt to check.

>> 2. Consistency: Clusters can become partitioned. If changes are made on
>> one or more partitions during the separation, the changes won't be
>> reflected on all nodes until the partition heals, at which time the
>> cluster will reconcile them, potentially losing one side's changes.
> 
> Ah, that's a very good point, which I neglected totally: even inquorate
> partitions can have configuration changes.  Thanks for bringing this up!
> I wonder if there's any practical workaround for that.
> 
>> I suppose this isn't qualitatively different from using a separate
>> configuration file, but those tend to be more static, and failure to
>> modify all copies would be more obvious when doing them individually
>> rather than issuing a single cluster command.
> 
> From a different angle: if a node is off, you can't modify its
> configuration file.  So you need an independent mechanism to do what the
> CIB synchronization does anyway, or a shared file system with its added
> complexity.  On the other hand, one needn't guess how Pacemaker
> reconciles the conflicting resource configuration changes.  Indeed, how
> does it?

Good question, I haven't delved deeply into that code. It's not merging
diffs or anything like that -- some changes are blessed, and anything
incompatible is discarded.