This commit ensures that the series file should work appropriately on
32-bit architecturs. It does this by reducing the maximum size of a
series file to 512MB on 32-bit systems, which should be fully
addressable.
It further updates tests so that the series file size can be reduced
further when running many tests in parallel on 32-bit architectures.
This limits the disk IO for writing TSM files during compactions
and snapshots. This helps reduce the spiky IO patterns on SSDs and
when compactions run very quickly.
Since possibly v0.9 DELETE SERIES has had the unwanted side effect of
removing series from the index when the last traces of series data are
removed from TSM. This occurred because the inmem index was rebuilt on
startup, and if there was no TSM data for a series then there could be
not series to add to the index.
This commit returns to the original (documented) DROP/DETETE SERIES
behaviour. As such, when issuing DROP SERIES all instances of matching
series will be removed from both the TSM engine and the index. When
issuing DELETE SERIES only TSM data will be removed.
It is up to the operator to remove series from the index.
NB, this commit does not address how to remove series data from the
series file when a shard rolls over.
This changes the approach to adjusting the amount of concurrency
used for snapshotting to be based on the snapshot latency vs
cardinality. The cardinality approach could use too much concurrency
and increase the number of level 1 TSM files too quickly which incurs
more disk IO.
The latency model seems to adjust better to different workloads.
The disk based temp index for writing a TSM file was used for
compactions other than snapshot compactions. That meant it was
used even for smaller compactiont that would not use much memory.
An unintended side-effect of this is higher disk IO when copying
the index to the final file.
This switches when to use the index based on the estimated size of
the new index that will be written. This isn't exact, but seems to
work kick in at higher cardinality and larger compactions when it
is necessary to avoid OOMs.
The default max-concurrent-compactions settings allows up to 50%
of cores to be used for compactions. When the number of cores is
high (>8), this can lead to high disk utilization. Capping at
4 and combined with high snapshot sizes seems to keep the compaction
backlog reasonable and not tax the disks as much. Systems with lots
of IOPS, RAM and CPU cores may want to increase these.
With the recent changes to compactions and snapshotting, the current
default can create lots of small level 1 TSM files. This increases
the default in order to create larger level 1 files and less disk
utilization.
O_SYNC was added with writing TSM files to fix an issue where the
final fsync at the end cause the process to stall. This ends up
increase disk util to much so this change switches to use multiple
fsyncs while writing the TSM file instead of O_SYNC or one large
one at the end.
If a delete for a time that does not exist was run, we would not
remove the series key from the slice of series to remove from the
index.
This could be triggered by running somethin like "delete from cpu where
time = 0" and if there was no data at time 0, the series would still
be removed from the index.
If there were many individual deletes to a series that ended up
deleting every value in the block and the tombstone timestamps
were not contigous, it was possible for the TSMKeyIterator to
return false for Next incorrectly. This causes the compaction to
drop any remaining data in the file.
Normally, if all the data is deleted via tombstones, we remove the
whole key from the TSM index. In this case, we're not able to determine
that the key is fully deleted until the block is decode and tombstones
are applied.
This changes the TSMKeyIterator to detect this condition and continue
to the next key instead of aborting.
The loop to check if a series still exists in a TSM file was wrong
in that it 1) exited early after one iteration and 2) had an off
by one error that causes the wrong series to be marked as existing.
This fixes both of these cases which can cause the index to become
inconsistent with the data store on disk.
This fixes a regression in the Cache introduced in ca40c1ad3c where
not all the values in the cache entry would be removed. Previously,
calling Exclude did not require the values to be sorted. The change
in ca40c1ad3c relies on the values being sorted so it was possible for
it to find the wrong indexes in when calling FindRange and leave some
data that should be deleted.
Fixes#9161
This commit firstly ensures that a shard's size on disk is accurately
reported when using the tsi1 index, by including the on-disk size of the
tsi1 index in the calculation.
Secondly, this commit add support for shard streaming/copying when using
the tsi1 index. Prior to this, a tsi1 index would not be correctly
restored when streaming shards.
The Cache.ApplyEntryFn iterates keys according to the partitions
and hashed values. This can cause the deleteKeys slice to contain
unsorted keys when deleting series. The code uses a binary search
on this slice later on and this can fail to detect that the series
should still exists. The series is then removed from the index
even though it has data still.
Fixes#9116
If the close happens when next is being called, it can result in a race
condition where the current iterator gets set to nil after the initial
check.
This also fixes the finalizer so it runs the close method in a goroutine
instead of running it by itself. This is because all finalizers run on
the same goroutine so a close that takes a long time can cause a backup
for all finalizers. This also removes the redundant call to
`runtime.SetFinalizer` from the finalizer itself because a finalizer,
when called, has already cleared itself.
* replaces coordinating goroutines for single k-way heap merge iterator
* removes contention sending keys across buffered channels
startup time from 46s -> 28s for iterating 1MM keys across 14 shards
If the first block that needs to be read was partially deleted such
that the trailing end has no values, it was possible for the query
cursor end early.
This was caused by the KeyCursor.ReadFloatBlock returning no values instead
of checking the remaing blocks.
This adds the capability to the engine to force a full compaction
to be scheduled. When called, it snapshots any data in the cache,
aborts running compactions and prevents level plans from returning
level plans.