TSI did not check the max select series limit during planning
the same way that inmem did. This means that the limit could be
set but the planning of a high cardinality query would still OOM
the server. This enforces the limit and also makes the query interruptible
during planning.
There was a race on the WaitGroup where we could end up calling Add
while another goroutine was still waiting. The functions were confusing,
so they have been simplified a bit since the compaction goroutines
have been reworked a lot already.
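One common way to avoid this class of race, shown here as a sketch rather
than the change that was actually made: guard Add with a mutex and a closed
flag so that no Add can happen once another goroutine has begun waiting. The
worker type and its start and stop methods are hypothetical.

```go
package main

import "sync"

// worker is a hypothetical owner of background goroutines.
type worker struct {
	mu     sync.Mutex
	closed bool
	wg     sync.WaitGroup
}

// start launches fn in a goroutine unless the worker has been stopped.
// Add is called while holding mu, so it cannot race with stop's Wait.
func (w *worker) start(fn func()) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.closed {
		return false
	}
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		fn()
	}()
	return true
}

// stop prevents new goroutines from starting, then waits for running ones.
func (w *worker) stop() {
	w.mu.Lock()
	w.closed = true
	w.mu.Unlock()
	w.wg.Wait()
}
```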
The scheduling logic ended up favoring more backlogged shards
too much and would starve active, less backed up shards. This
occurred because the scheduling kicks in once a second. When it
runs, it schedules as many compactions as it can. A backed up shard
would end up having more compactions to run during the loop and would
generally get to schedule them more frequently.
This now allows each shard to try and schedule one compaction at a time
which provides a more balanced approach. At some point, we'll probably
want to more directly balance each shard's backlog vs letting it happen
somewhat randomly.
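The "one compaction per shard per pass" rule can be sketched as follows. This
is an illustration under assumed names (shard, scheduleRound, and a simple
slot count), not the engine's scheduler.

```go
package main

// shard is a hypothetical stand-in for a shard with pending compactions.
type shard struct {
	name    string
	backlog int // number of compactions waiting to run
}

// scheduleRound gives each shard at most one compaction per pass, for as
// long as free slots remain. It returns the names of the shards that
// started a compaction during this pass.
func scheduleRound(shards []*shard, slots int) []string {
	var started []string
	for _, s := range shards {
		if slots == 0 {
			break // no capacity left for this pass
		}
		if s.backlog == 0 {
			continue // nothing to compact for this shard
		}
		s.backlog--
		slots--
		started = append(started, s.name)
	}
	return started
}
```

With two shards holding backlogs of 10 and 1 and four free slots, a single
pass starts one compaction for each shard instead of four for the backlogged
one. When slots are scarce, a fixed iteration order still favors the shards
at the front of the list, so a fuller version would likely vary the order.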
Some files seem to get orphaned behind higher levels. This causes
the compactions to get blocked as the lower level files will not
get picked up by their lower level planners. This allows the full
plan to identify them and pull them into their plans.
This check doesn't make sense for high cardinality data as the files
typically get big and sparse very quickly. This causes a lot of extra
disk space to be used which is taken up by large indexes and sparse
data.
One shard might be able to run a compaction, but could fail due to
limits being hit. This loop would continue indefinitely as the
same task would continue to be rescheduled.
With higher cardinality or larger series keys, the files can roll
over early which causes them to take longer to be compacted by higher
levels. This causes larger disk usage and higher numbers of TSM files
at times.
This changes the compaction scheduling to better utilize the cores
that are free. Previously, a level was planned in its own goroutine
and would kick off a number of compaction groups. The problem with this
model was that if there were 4 groups, and 3 completed quickly, the planning
would be blocked for that level until the last group finished. If the compactions
at the prior level are running more quickly, a large backlog could accumulate.
This now moves the planning to a single goroutine that plans each level in
succession and starts as many groups as it can. When one group finishes,
the planning will start the next group for the level.
The fsyncs due to large writes when writing to TSM files and the
WAL can eventually cause large pauses. Since we already buffer
writes, using synchronous IO reduces fsync latency by ensuring
the individual writes hit disk. This spreads out the latency
across multiple writes better.
This commit adds a basic TSI versioning scheme, by adding a Version field
to an index's MANIFEST file.
Existing TSI indexes will not have this field present in their MANIFEST
files, and thus will be deemed incompatible with the current version.
Users with existing TSI indexes will be able to remove them, and convert the
resulting inmem indexes to the current version of a TSI index using the
influx_inspect tooling.