Deduplicate is called from various places in the engine and can cause
a lot of garbage to get created. It first creates a map and then
adds each value to the map in order (1st alloc). It then creates a
new slice (2nd alloc) and appends everything from the map to the slice.
Finally, it sorts the new slice (3rd alloc).
This switches the algorithm to use a stable sort and reuse the existing
slice, avoiding those allocations.
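A minimal sketch of the new approach, using simplified stand-ins for the engine's value types (the names and fields here are illustrative, not the actual API):

```go
package tsm1

import "sort"

// Value and Values are simplified stand-ins for the engine's types.
type Value struct {
	UnixNano int64
	V        float64
}

type Values []Value

func (a Values) Len() int           { return len(a) }
func (a Values) Less(i, j int) bool { return a[i].UnixNano < a[j].UnixNano }
func (a Values) Swap(i, j int)      { a[i], a[j] = a[j], a[i] }

// Deduplicate stable-sorts the values by time and removes entries with
// duplicate timestamps in place, keeping the later occurrence. No map and
// no new slice are allocated; the input's backing array is reused.
func (a Values) Deduplicate() Values {
	if len(a) <= 1 {
		return a
	}

	// The stable sort preserves the relative order of equal timestamps, so
	// the most recently written duplicate still wins below.
	sort.Stable(a)

	out := a[:1]
	for _, v := range a[1:] {
		if v.UnixNano == out[len(out)-1].UnixNano {
			out[len(out)-1] = v // overwrite with the newer point
			continue
		}
		out = append(out, v) // stays within a's existing capacity
	}
	return out
}
```

Because `out` only overwrites positions that have already been read, the whole dedup stays within the input's backing array.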
The decoders were held on each iterator to avoid creating them all
the time. Some of them use quite a bit of memory, so they can
be expensive to create when querying across many series.
Instead, move them to a reusable pool where we create only the minimum
that could actively be in use. This reduces garbage and makes the iterators
less expensive to create.
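A minimal sketch of such a pool, sized to the number of decoders that can actually be in use at once (the names here are hypothetical, not the engine's exact pool type):

```go
package tsm1

import "sync"

// decoderPool is a fixed-capacity pool of reusable decoders. Get hands out
// an idle decoder or builds a new one; Put parks a decoder for reuse but
// drops it if the pool is already full, so we never hold more than the
// number that could actively be in use.
type decoderPool struct {
	mu    sync.Mutex
	free  []interface{}
	newFn func() interface{}
}

func newDecoderPool(capacity int, fn func() interface{}) *decoderPool {
	return &decoderPool{free: make([]interface{}, 0, capacity), newFn: fn}
}

func (p *decoderPool) Get() interface{} {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.free); n > 0 {
		d := p.free[n-1]
		p.free = p.free[:n-1]
		return d
	}
	return p.newFn()
}

func (p *decoderPool) Put(d interface{}) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.free) < cap(p.free) {
		p.free = append(p.free, d)
	}
}
```

An iterator takes a decoder from the pool when it needs one and returns it when it is done, so the expensive decoder state is shared across iterators instead of being rebuilt for every series.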
This allows encoders to be re-used and maintained in a pool to
avoid allocating new ones on every compaction and write of an encoded
block. The pool used is not a sync.Pool, ensuring that the encoders
will not be garbage collected.
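A minimal sketch of the idea, with a hypothetical FloatEncoder standing in for the real encoders: a plain buffered channel keeps parked encoders reachable, so unlike a sync.Pool they are never reclaimed by the garbage collector between compactions.

```go
package tsm1

// FloatEncoder is a hypothetical stand-in; real encoders carry large
// internal buffers that are expensive to reallocate.
type FloatEncoder struct {
	buf []byte
}

func (e *FloatEncoder) Reset() { e.buf = e.buf[:0] }

// floatEncoderPool holds idle encoders. Because the channel keeps a
// reference to each parked encoder, the GC cannot collect them, unlike
// objects parked in a sync.Pool.
var floatEncoderPool = make(chan *FloatEncoder, 16)

func getFloatEncoder() *FloatEncoder {
	select {
	case e := <-floatEncoderPool:
		e.Reset()
		return e
	default:
		return &FloatEncoder{buf: make([]byte, 0, 1024)}
	}
}

func putFloatEncoder(e *FloatEncoder) {
	select {
	case floatEncoderPool <- e:
	default:
		// Pool is full; drop this encoder and let it be collected.
	}
}
```

A compaction or encoded-block write then brackets its encoding with getFloatEncoder / putFloatEncoder instead of constructing a new encoder each time.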
Due to a bug in compactions, it's possible some blocks may have duplicate
points stored. If those blocks are decoded and re-compacted, an assertion
panic could trigger.
We now dedup those blocks if necessary to remove the duplicate points
and avoid the panic.
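A minimal sketch of the check, reusing the simplified Value type from the earlier sketch; only blocks whose timestamps are not strictly increasing pay for a dedup pass before being re-compacted.

```go
package tsm1

// Value is the simplified stand-in used in the earlier sketches.
type Value struct {
	UnixNano int64
	V        float64
}

// needsDedup reports whether a decoded block contains out-of-order or
// duplicate timestamps, which can happen in blocks written by the buggy
// compactions. Blocks that fail this check are sorted and deduplicated
// before the merge, so its ordering assertion no longer panics.
func needsDedup(a []Value) bool {
	for i := 1; i < len(a); i++ {
		if a[i].UnixNano <= a[i-1].UnixNano {
			return true
		}
	}
	return false
}
```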
If a series contains a point that is overwritten, the compactor
would load the whole series into RAM during a full compaction. If
the series was large, this could cause very large RAM spikes and OOMs.
The change reworks the compactor to merge blocks more incrementally,
similar to the fix done in #6556.
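A minimal sketch of the incremental shape, under assumed names (nextBlock, writeBlock, and maxPointsPerBlock are hypothetical): instead of decoding the entire series before writing anything, the compactor keeps a small buffer and flushes a block as soon as it is full, so memory stays bounded by a couple of blocks rather than the whole series.

```go
package tsm1

// Value is the simplified stand-in used in the other sketches.
type Value struct {
	UnixNano int64
	V        float64
}

const maxPointsPerBlock = 1000

// compactKey merges the decoded, time-ordered blocks for one series key and
// writes finished blocks immediately. nextBlock yields one decoded block at
// a time (already sorted, with overlaps resolved upstream); writeBlock
// encodes and writes a completed block. Only a couple of blocks' worth of
// points are ever buffered in RAM.
func compactKey(nextBlock func() ([]Value, bool), writeBlock func([]Value) error) error {
	var buf []Value
	for {
		block, ok := nextBlock()
		if !ok {
			break
		}
		buf = append(buf, block...)

		// Flush full blocks as soon as we have them instead of accumulating
		// the entire series first.
		for len(buf) >= maxPointsPerBlock {
			if err := writeBlock(buf[:maxPointsPerBlock]); err != nil {
				return err
			}
			buf = append(buf[:0], buf[maxPointsPerBlock:]...)
		}
	}
	if len(buf) > 0 {
		return writeBlock(buf)
	}
	return nil
}
```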
In some query scenarios, if there are a lot of points on disk spread
across many blocks in TSM files and a point is overwritten near the
beginning of the shard's time range, the full series could be loaded
into RAM, triggering OOMs and huge allocations.
The issue was that the KeyCursor code that handles overwriting points
had a simple implementation that just deduped the whole series in this
case. This falls over when the series is quite large.
Instead, the KeyCursor has been changed to only decode blocks with
updated points. It then keeps track of which sections of the blocks
have already been read so they are not re-read when later points are
decoded.
Since the points in a block are always sorted, the code was also changed
to remove the Deduplicate calls, which end up reallocating the slice.
Instead, we do a sorted merge and re-use the slice as much as we can.
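A minimal sketch of that merge, with hypothetical names (keyCursor, mergeOverwrites): both the buffered points and the newly decoded overwrite block are already sorted, so duplicates are resolved in a single merge pass, the cursor remembers how far the caller has read so earlier sections are never revisited, and the two buffers are swapped so the old one is reused as the next merge destination instead of being reallocated.

```go
package tsm1

// Value is the simplified stand-in used in the other sketches.
type Value struct {
	UnixNano int64
	V        float64
}

// keyCursor is a hypothetical sketch of the cursor state.
type keyCursor struct {
	buf     []Value // decoded points for the key, sorted by time
	scratch []Value // reused merge destination, avoids reallocating
	pos     int     // next unread index; earlier entries are never re-read
}

// mergeOverwrites merges a newly decoded block of overwriting points into
// the unread portion of the buffer. Both inputs are sorted, so no
// Deduplicate call is needed: a duplicate timestamp is replaced as we go.
func (c *keyCursor) mergeOverwrites(updates []Value) {
	unread := c.buf[c.pos:]
	out := c.scratch[:0]

	i, j := 0, 0
	for i < len(unread) && j < len(updates) {
		switch {
		case unread[i].UnixNano < updates[j].UnixNano:
			out = append(out, unread[i])
			i++
		case unread[i].UnixNano > updates[j].UnixNano:
			out = append(out, updates[j])
			j++
		default: // same timestamp: the overwriting point wins
			out = append(out, updates[j])
			i++
			j++
		}
	}
	out = append(out, unread[i:]...)
	out = append(out, updates[j:]...)

	// Swap buffers so the old one becomes the scratch space for the next merge.
	c.buf, c.scratch = out, c.buf[:0]
	c.pos = 0
}
```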
This commit makes a number of performance improvements to
reduce allocations during query execution. Several objects
and buffers are now reused across the components to avoid
allocations.
Previously a simple `count(value)` query across 1M points
would require 26,000+ allocations. After the changes in
this commit that number has been reduced to 88.
MinTime is now in the index for each block, so storing it in the block
header is redundant. The encodings also store it in their header, so
we are actually storing it 3 times.
Removing it is an incompatible change to the current tsm1 file format.
There are a lot of allocations performed when decoding blocks. These
types can be re-used to reduce allocations in many cases. This change
allows a type-specific slice to be passed in to the decode funcs and re-used
if it is large enough.
The existing decode is left for backwards compatibility but is not
very efficient right now. It may be removed.
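A minimal sketch of the signature and the capacity check (the block layout here is a toy 16-bytes-per-point format, not the real tsm1 encoding; FloatValue and DecodeFloatBlock are illustrative names):

```go
package tsm1

import (
	"encoding/binary"
	"math"
)

// FloatValue is a stand-in for the engine's decoded float point.
type FloatValue struct {
	UnixNano int64
	V        float64
}

// DecodeFloatBlock decodes block into a. If the caller passes back the
// slice from its previous call and its capacity is large enough, it is
// reused and the decode allocates nothing; otherwise a larger slice is
// allocated once and can be reused on subsequent calls.
func DecodeFloatBlock(block []byte, a []FloatValue) []FloatValue {
	n := len(block) / 16

	if cap(a) < n {
		a = make([]FloatValue, n)
	} else {
		a = a[:n]
	}

	for i := 0; i < n; i++ {
		off := i * 16
		a[i].UnixNano = int64(binary.BigEndian.Uint64(block[off:]))
		a[i].V = math.Float64frombits(binary.BigEndian.Uint64(block[off+8:]))
	}
	return a
}
```

Callers hold on to the returned slice and pass it back on the next call, e.g. `values = DecodeFloatBlock(block, values)`.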
If DecodeSameTypeBlock is called on an empty Values slice, it would
panic with an index out of bounds error. This func can actually be removed
because DecodeBlock can already determine what type of values are encoded.
This will still panic if the block cannot be decoded for other reasons.
Fixes #4365
If similar float values were encoded, the number of leading bits would
overflow the 5 bits available to store them (e.g. storing 33 in 5 bits). When
decoding, the values after the overflowed value would spike to very large and
very small values.
To prevent the overflow, we clamp the value to 31, which is the maximum
number of leading zero bits we can encode.
Fixes #4357
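A minimal sketch of the clamp in a Gorilla-style XOR float encoder (the rest of the encoder is elided; only the bounds check matters here):

```go
package tsm1

import "math/bits"

// clampedLeadingZeros returns the number of leading zero bits in the XOR of
// two consecutive float values, clamped to 31 so it always fits in the
// 5-bit field of the encoding. Without the clamp, similar values can
// produce counts such as 33, which overflow the field and corrupt every
// value decoded after that point.
func clampedLeadingZeros(xor uint64) uint64 {
	lz := uint64(bits.LeadingZeros64(xor))
	if lz > 31 {
		lz = 31
	}
	return lz
}
```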