Limit number of concurrent optimized compactions so that level compactions do not get starved. Starved level compactions result in a sudden increase in disk usage.
Add [data] max-concurrent-optimized-compactions for configuring maximum number of concurrent optimized compactions. Default value is 1.
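A minimal sketch of how such a cap could be enforced, assuming a non-blocking counting semaphore. The `fixedLimiter` type and its methods are hypothetical names for illustration, not the limiter used in the engine.

```go
package main

// fixedLimiter is a hypothetical counting semaphore built on a buffered
// channel. Its capacity plays the role of max-concurrent-optimized-compactions.
type fixedLimiter chan struct{}

// newFixedLimiter returns a limiter that admits at most n concurrent holders.
func newFixedLimiter(n int) fixedLimiter {
	return make(fixedLimiter, n)
}

// TryTake claims a slot without blocking and reports whether it succeeded.
// A scheduler would skip the optimized compaction when this returns false,
// leaving capacity for level compactions.
func (l fixedLimiter) TryTake() bool {
	select {
	case l <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot previously claimed with TryTake.
func (l fixedLimiter) Release() {
	<-l
}
```

With the default value of 1, a second optimized compaction is refused until the first one calls `Release`.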
Co-authored-by: davidby-influx <dbyrne@influxdata.com>
Co-authored-by: devanbenz <devandbenz@gmail.com>
Closes: #26315
PlanOptimize is being checked far too frequently. This PR is the simplest change that can be made to ensure that PlanOptimize is not run too often. To reduce the frequency, I've added a lastWrite parameter to PlanOptimize and an additional test that mocks the edge case seen in the wild that led to this PR.
Previously, the test cases for PlanOptimize did not check whether certain cases would be picked up by Plan. I've adjusted a few of the existing test cases after modifying Plan and PlanOptimize to have the same lastWrite time.
* feat: Add CompactPointsPerBlock config opt
This PR adds an additional parameter for influxd,
CompactPointsPerBlock. It adjusts the DefaultAggressiveMaxPointsPerBlock
to 10,000. We had discovered that with the points per block set to
100,000, compacted TSM files were increasing in size. After modifying the
points per block to 10,000, we noticed that the file sizes decreased.
The value has been set as a parameter that can be adjusted by administrators.
This allows some tuning if compression problems are encountered.
The error type check for errBlockRead was incorrect,
and bad TSM files were not being moved aside when
that error was encountered. Use errors.Join,
errors.Is, and errors.As to correctly unwrap multiple
errors.
Closes https://github.com/influxdata/influxdb/issues/25838
* feat: Modify optimized compaction to cover edge cases
This PR changes the algorithm for compaction to account for the following
cases that were not previously accounted for:
- Many generations with a groupsize over 2 GB
- Single generation with many files and a groupsize under 2 GB
- Shards that may have over a 2 GB groupsize but
  many fragmented files (under 2 GB and under aggressive
  points per block count)
Here, groupsize is the total size of the TSM files in said shard directory.
closes https://github.com/influxdata/influxdb/issues/25666
There are a number of code paths in Compactor.write which
on error can lead to leaked file handles to temporary files.
This, in turn, prevents the removal of the temporary files until
InfluxDB is rebooted, releasing the file handles.
closes https://github.com/influxdata/influxdb/issues/25724
* fix: prevent retention service from hanging
Fix issue that can cause the retention service to hang waiting on a
`Shard.Close` call. When this occurs, no other shards will be deleted
by the retention service. This is usually noticed as an increase in
disk usage because old shards are not cleaned up.
The fix adds two new methods to `Store`, `SetShardNewReadersBlocked`
and `InUse`. `InUse` can be used to poll if a shard has active readers,
which the retention service uses to skip over in-use shards to prevent
the service from hanging. `SetShardNewReadersBlocked` determines if
new read access may be granted to a shard. This is required to prevent
race conditions around the use of `InUse` and the deletion of shards.
If the retention service skips over a shard because it is in-use, the
shard will be checked again the next time the retention service is run.
It can be deleted on subsequent checks if it is no longer in-use. If
the shard is stuck in-use, the retention service will not be able to
delete the shard, which can be observed in the logs for manual
intervention. Other shards can still be deleted by the retention service
even if a shard is stuck with readers.
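The interaction between the two methods can be sketched as follows. This is a simplified model under stated assumptions: the `shard` type, `Acquire`, `Release`, and `tryDelete` are hypothetical, and the real methods live on `Store` and take shard identifiers.

```go
package main

import (
	"errors"
	"sync"
)

// shard is a hypothetical, simplified model of a shard's reader bookkeeping.
type shard struct {
	mu      sync.Mutex
	readers int
	blocked bool
}

var errReadersBlocked = errors.New("new readers blocked")

// Acquire registers a reader unless new readers are currently blocked.
func (s *shard) Acquire() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.blocked {
		return errReadersBlocked
	}
	s.readers++
	return nil
}

// Release unregisters a reader.
func (s *shard) Release() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.readers--
}

// SetNewReadersBlocked controls whether Acquire may succeed.
func (s *shard) SetNewReadersBlocked(blocked bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.blocked = blocked
}

// InUse reports whether any reader still holds the shard.
func (s *shard) InUse() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.readers > 0
}

// tryDelete mirrors one retention pass over one shard: block new readers
// first, skip the shard if it is in use, otherwise delete it.
func tryDelete(s *shard, del func()) bool {
	s.SetNewReadersBlocked(true)
	if s.InUse() {
		s.SetNewReadersBlocked(false) // retry on the next retention run
		return false
	}
	del()
	return true
}
```

Blocking new readers before polling `InUse` is what closes the race: no reader can appear between the check and the delete.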
closes: #25054
When the compaction planner runs, if it cannot acquire
a lock on the files it plans to compact, it returns a
nil list of compaction groups. This, in turn, sets the
engine statistics for compaction queues to zero,
which is incorrect. Instead, use the length of pending
files which would have been returned.
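A small sketch of the idea, assuming a planner that reports the pending count separately from the groups it managed to lock. The `plan` function and its `tryLock` callback are hypothetical.

```go
package main

// plan is a hypothetical planner step. When the candidate files cannot be
// locked it returns nil groups, but it still reports how many groups were
// pending so the queue statistic does not drop to zero.
func plan(candidates [][]string, tryLock func([][]string) bool) (groups [][]string, pending int) {
	pending = len(candidates)
	if !tryLock(candidates) {
		return nil, pending
	}
	return candidates, pending
}
```

The caller would record `pending` in the queue statistic rather than `len(groups)`.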
closes https://github.com/influxdata/influxdb/issues/22138
Compaction logging will generate intermediate information on
volume of data written and output files created, as well as
improve some of the anti-entropy messages related to compaction.
This will also apply to `influx_tools compact`
Closes https://github.com/influxdata/influxdb/issues/21704
* fix(storage): skip TSM files with block read errors
When we find a bad TSM file during compaction, propagate the error up and move
the bad file aside. The engine will disregard the file so the next compaction
will not hit the same error.
StringArrayEncodeAll will panic if the total length of strings
contained in the src slice is > 0xffffffff. This change adds a unit
test to replicate the issue and an associated fix to return an error.
This also raises an issue that compactions will be unable to make
progress under the following condition:
* multiple string blocks are to be merged to a single block and
* the total length of all strings exceeds the maximum block size that
snappy will encode (0xffffffff)
The observable effect of this is errors in the logs indicating a
compaction failure.
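The fix amounts to checking the total length up front and returning an error. A minimal sketch, with the limit passed as a parameter so it can be exercised with small inputs; `checkTotalLen` is a hypothetical name and not part of StringArrayEncodeAll.

```go
package main

import "fmt"

// maxEncodedLen is the limit named in the message above.
const maxEncodedLen uint64 = 0xffffffff

// checkTotalLen is a hypothetical pre-check. It returns an error instead of
// letting an encoder panic when the combined string length exceeds limit.
func checkTotalLen(src []string, limit uint64) error {
	var total uint64
	for _, s := range src {
		total += uint64(len(s))
	}
	if total > limit {
		return fmt.Errorf("total string length %d exceeds limit %d", total, limit)
	}
	return nil
}
```

A caller would pass `maxEncodedLen` as the limit and surface the error as a failed compaction rather than crashing.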
Fixes #13687
This commit limits the number of files that can be compacted in
a single group when forcing a full compaction or when a shard
becomes cold. This is to prevent too many files being compacted
at the same time.
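Splitting a long file list into bounded groups can be sketched as below. The `chunk` helper is hypothetical, and the limit is a parameter because the actual maximum is not stated in this message.

```go
package main

// chunk splits files into consecutive groups of at most limit files each.
// The limit is a parameter here; the real maximum is not stated above.
func chunk(files []string, limit int) [][]string {
	if limit <= 0 {
		return nil
	}
	var groups [][]string
	for len(files) > limit {
		groups = append(groups, files[:limit])
		files = files[limit:]
	}
	if len(files) > 0 {
		groups = append(groups, files)
	}
	return groups
}
```

Each returned group can then be compacted on its own, so a cold shard with many files does not turn into a single oversized compaction.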
The InUse call on TSMFiles is inherently racy in the presence of
Ref calls outside of the file store mutex. In addition, we return
some TSMFiles to callers without them being Ref'd which might allow
them to be closed from underneath. I believe this cannot actually
happen, because the only thing that gets a handle externally is
compaction, and compaction enforces that only one handle exists at
a time, so the file is only deleted once, after the compaction is
done with it. However, that is not very obvious or enforced.
Instead, always return a TSMFile with a Ref call under the read
lock, and require that no one else calls Ref. That way, it cannot
transition to referenced if the InUse call returns false under the
write lock.
The CreateSnapshot method was racy in a number of ways in the presence
of multiple calls or compactions: it did not take references to the
TSMFiles, and the temporary directory it creates could have been
shared with concurrent CreateSnapshot calls. In addition, the
files slice could have been concurrently mutated during a compaction
as well.
Instead, under the write lock, make a local copy of the state for
the compaction, including Ref calls (write locks are implicitly
read locks). Then, there is no need for a lock at all afterward.
Add some comments to explain these issues at the call sites of InUse,
and document that the Files method that returns the slice unprotected
is only for tests.
This changes the approach to adjusting the amount of concurrency
used for snapshotting to be based on the snapshot latency vs
cardinality. The cardinality approach could use too much concurrency
and increase the number of level 1 TSM files too quickly which incurs
more disk IO.
The latency model seems to adjust better to different workloads.
If there were many individual deletes to a series that ended up
deleting every value in the block and the tombstone timestamps
were not contiguous, it was possible for the TSMKeyIterator to
return false for Next incorrectly. This causes the compaction to
drop any remaining data in the file.
Normally, if all the data is deleted via tombstones, we remove the
whole key from the TSM index. In this case, we're not able to determine
that the key is fully deleted until the block is decoded and tombstones
are applied.
This changes the TSMKeyIterator to detect this condition and continue
to the next key instead of aborting.
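The control-flow change can be illustrated with a simplified model. The `block` and `tombstone` types and the `survivors` function are hypothetical; the real iterator works on encoded blocks rather than slices of timestamps.

```go
package main

// block is a hypothetical decoded block: a series key and its timestamps.
type block struct {
	key   string
	times []int64
}

// tombstone is a deleted time range [min, max] for a key.
type tombstone struct {
	key      string
	min, max int64
}

// survivors applies tombstones after decoding and returns the keys that still
// have data. A fully deleted key is skipped with continue rather than ending
// the iteration, so later keys are not dropped.
func survivors(blocks []block, ts []tombstone) []string {
	var keys []string
	for _, b := range blocks {
		remaining := 0
		for _, t := range b.times {
			if !deleted(b.key, t, ts) {
				remaining++
			}
		}
		if remaining == 0 {
			continue // key fully deleted: move on to the next key
		}
		keys = append(keys, b.key)
	}
	return keys
}

// deleted reports whether timestamp t of key falls inside any tombstone.
func deleted(key string, t int64, ts []tombstone) bool {
	for _, d := range ts {
		if d.key == key && t >= d.min && t <= d.max {
			return true
		}
	}
	return false
}
```

With non-contiguous tombstones covering every point of the first key, the second key is still returned.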
This adds the capability to the engine to force a full compaction
to be scheduled. When called, it snapshots any data in the cache,
aborts running compactions and prevents the level planners from
returning level plans.
Some files seem to get orphaned behind higher levels. This causes
the compactions to get blocked as the lower level files will not
get picked up by their lower level planners. This allows the full
plan to identify them and pull them into its plans.
This check doesn't make sense for high cardinality data as the files
typically get big and sparse very quickly. This causes a lot of extra
disk space to be used which is taken up by large indexes and sparse
data.
With higher cardinality or larger series keys, the files can roll
over early which causes them to take longer to be compacted by higher
levels. This causes larger disk usage and higher numbers of TSM files
at times.
Compactions would create their own TSMReaders for simplicity. With
very high cardinality compactions, creating the reader and indirectIndex
can start to use a significant amount of memory.
This changes the compactions to use a reader that is already allocated
and managed by the FileStore.
This switches all the interfaces that take string series key to
take a []byte. This eliminates many small allocations where we
convert between the two repeatedly. Eventually, this change should
propagate further up the stack.
When snapshots and compactions are disabled, the check to see if
the compaction should be aborted occurs in between writing to the
next TSM file. If a large compaction is running, it might take
a while for the file to be finished writing causing long delays.
This now interrupts compactions while iterating over the blocks to
write which allows them to abort immediately.
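A minimal sketch of a per-block interrupt check, assuming the interrupt is signalled by closing a channel. `writeBlocks` and `errCompactionAborted` are hypothetical names.

```go
package main

import "errors"

var errCompactionAborted = errors.New("compaction aborted")

// writeBlocks is a hypothetical write loop. It checks the interrupt channel
// before every block, not only between output files, so closing the channel
// stops the work after at most one more block. It returns the number of
// blocks written.
func writeBlocks(blocks [][]byte, interrupt <-chan struct{}, write func([]byte) error) (int, error) {
	for i, b := range blocks {
		select {
		case <-interrupt:
			return i, errCompactionAborted
		default:
			// Not interrupted: fall through and write the block.
		}
		if err := write(b); err != nil {
			return i, err
		}
	}
	return len(blocks), nil
}
```

The non-blocking `select` keeps the check cheap enough to run on every iteration.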
This changes full compactions within a shard to run sequentially
instead of running all the compaction groups in parallel. Normally,
there is only 1 full compaction group to run. At times, there could
be several which causes instability if they are all running concurrently
as they tie up a CPU for long periods of time.
Level compactions are also capped to a max of 4 concurrently running for each level
in a shard. This prevents sudden spikes in CPU and disk usage due to a large backlog
of TSM files at a given level.
The compactor prevents the same file from being compacted by different
compaction runs, but it can result in warning errors in the logs that
are confusing.
This adds compaction plan tracking to the planner so that files are
only part of one plan at a given time.
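Plan tracking of this kind can be sketched as a set of claimed file names guarded by a mutex. The `planTracker` type and its methods are hypothetical and only illustrate the all-or-nothing claim.

```go
package main

import "sync"

// planTracker is a hypothetical record of files that belong to an active plan.
type planTracker struct {
	mu    sync.Mutex
	inUse map[string]struct{}
}

func newPlanTracker() *planTracker {
	return &planTracker{inUse: make(map[string]struct{})}
}

// Acquire claims all files for one plan, or none if any is already claimed.
func (p *planTracker) Acquire(files []string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, f := range files {
		if _, ok := p.inUse[f]; ok {
			return false
		}
	}
	for _, f := range files {
		p.inUse[f] = struct{}{}
	}
	return true
}

// Release returns a finished plan's files so later plans can use them.
func (p *planTracker) Release(files []string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, f := range files {
		delete(p.inUse, f)
	}
}
```

A planner that fails `Acquire` simply drops the overlapping plan, so the compactor never sees the conflict and never logs the confusing warning.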
If blocks containing overlapping ranges of time were partially
recombined, it was possible for some points to get dropped
during compactions. This occurred because the window of time of
the points we need to merge did not account for the partial blocks
created from a prior merge.
Fixes #8084