Under concurrent writes and deletes of the same series, a nil panic
could occur in bytes.Compare. Instead of setting the seriesKeys to
nil, set them to a zero-length slice, which prevents the panic.
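A minimal sketch of the pattern, with illustrative names (`keysMu`,
`seriesKeys`); the point is that readers racing with a reset always see
an empty slice rather than a nil one:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
	"sync"
)

var (
	keysMu     sync.RWMutex
	seriesKeys [][]byte
)

// resetKeys publishes a zero-length slice instead of nil so concurrent
// readers sorting or comparing entries never observe a nil value.
func resetKeys() {
	keysMu.Lock()
	defer keysMu.Unlock()
	seriesKeys = make([][]byte, 0)
}

func sortedKeys() [][]byte {
	keysMu.RLock()
	defer keysMu.RUnlock()
	keys := make([][]byte, len(seriesKeys))
	copy(keys, seriesKeys)
	sort.Slice(keys, func(i, j int) bool {
		return bytes.Compare(keys[i], keys[j]) < 0
	})
	return keys
}

func main() {
	seriesKeys = [][]byte{[]byte("cpu"), []byte("mem")}
	resetKeys()
	fmt.Println(len(sortedKeys())) // 0: no nil entries left to compare
}
```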
This separates the dropping of a measurement from the dropping of its
series to avoid frequent checks to see if a measurement still has
series. The series are dropped individually, we keep track of which
measurements are involved, and then each measurement is deleted
afterwards.
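A sketch of the two-phase approach with illustrative types; the
engine's real structures differ:

```go
package main

import "fmt"

type seriesKey struct{ measurement, series string }

// dropSeries deletes each series individually, records which
// measurements were touched, and only afterwards drops measurements
// that have no series left, avoiding a per-series emptiness check.
func dropSeries(index map[string]map[string]struct{}, keys []seriesKey) {
	touched := make(map[string]struct{})
	for _, k := range keys {
		delete(index[k.measurement], k.series)
		touched[k.measurement] = struct{}{}
	}
	// One pass at the end instead of a check after every deleted series.
	for m := range touched {
		if len(index[m]) == 0 {
			delete(index, m)
		}
	}
}

func main() {
	index := map[string]map[string]struct{}{
		"cpu": {"cpu,host=a": {}, "cpu,host=b": {}},
	}
	dropSeries(index, []seriesKey{{"cpu", "cpu,host=a"}, {"cpu", "cpu,host=b"}})
	fmt.Println(len(index)) // 0: "cpu" dropped once, after all series drops
}
```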
If the fields.idx was corrupted in some way, it would cause the shard
to fail to load. Deleting the file allows it to be rebuilt. This change
handles that automatically, so the file is rebuilt if necessary without
user intervention.
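A sketch of the auto-recovery on load, assuming hypothetical
loadFields/rebuildFields helpers in place of the real parsing and
rebuild logic:

```go
package main

import (
	"log"
	"os"
)

// openFields loads the fields index, removing and rebuilding it if the
// file fails to parse, so a corrupt fields.idx no longer blocks shard
// load.
func openFields(path string) error {
	if err := loadFields(path); err != nil {
		log.Printf("fields index %s corrupt (%v); removing and rebuilding", path, err)
		if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
			return err
		}
		return rebuildFields(path) // regenerate from shard data
	}
	return nil
}

// loadFields and rebuildFields stand in for the real logic.
func loadFields(path string) error    { /* parse file at path */ return nil }
func rebuildFields(path string) error { /* rewrite the index */ return nil }

func main() {
	if err := openFields("fields.idx"); err != nil {
		log.Fatal(err)
	}
}
```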
This commit adds the ability to correctly mark a series as deleted in
the global series file. Whenever a shard engine determines that a
series should be deleted, it checks every shard's bitset to verify that
the series is no longer contained in any shard-local bitset.
Series no longer contained in any bitset are then removed from the
series file.
This test could hang due to an existing race that is still not fixed.
The snapshot and level compaction goroutines would end up waiting on
the wrong channel to be closed, so they would never exit.
This commit adds a bitset into each shard's in-memory index, to be used to
track undeleted series ids. Currently tsi1 support is not implemented.
When new series are added to the shard, the series id is added
to the bitset. When series are deleted from the shard, the series
ids are removed from the bitset.
Because each shard shares the same inmem index reference, the bitset
is stored in the `ShardIndex`, which is local to each shard, and
different references are passed into the shared `Index` object
depending on which shard is writing the series.
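A sketch of the shard-local bitset, using
github.com/RoaringBitmap/roaring as a stand-in for the engine's bitset
implementation (type and method names are illustrative):

```go
package main

import (
	"fmt"
	"sync"

	"github.com/RoaringBitmap/roaring"
)

// ShardIndex is shard-local and wraps the shared inmem Index; each
// shard keeps its own bitset of undeleted series ids.
type ShardIndex struct {
	mu     sync.RWMutex
	series *roaring.Bitmap
}

func NewShardIndex() *ShardIndex {
	return &ShardIndex{series: roaring.New()}
}

// AddSeries records a series id when new series are written to the shard.
func (s *ShardIndex) AddSeries(id uint32) {
	s.mu.Lock()
	s.series.Add(id)
	s.mu.Unlock()
}

// DropSeries clears the id when the series is deleted from the shard.
func (s *ShardIndex) DropSeries(id uint32) {
	s.mu.Lock()
	s.series.Remove(id)
	s.mu.Unlock()
}

func (s *ShardIndex) HasSeries(id uint32) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.series.Contains(id)
}

func main() {
	idx := NewShardIndex()
	idx.AddSeries(42)
	idx.DropSeries(42)
	fmt.Println(idx.HasSeries(42)) // false: deleted in this shard
}
```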
* Live Restore + Enterprise data format compatibility
* Extended ImportData to import all DBs if no db name given
* Added a new enterprise data test, and backup command now prints the backup file paths at conclusion
* Added whole-system backup test
* Update to use protobuf in all enterprise data cases
* Update to test to do cross-testing with enterprise version
* incremental enterprise backup format support
The cache defaulted to an entry capacity of 32. This default is fine
for lower cardinalities, but it causes big spikes in InUse heap with
higher cardinalities that can OOM the process. Since the hints
previously had to be removed due to increased memory usage, they are
now completely removed. For lower cardinalities we do grow the slice,
but this has a small performance penalty compared to the large memory
usage/OOMs with larger cardinalities.
* only call ParseTags when necessary
* remove dependency on inmem.Series in tsdb test package
* Measurement and Series are no longer exported. Their use is restricted
to the inmem package
* improve Measurement and Series types by exporting immutable
fields and removing unnecessary APIs and locks
Reduced startup time from 28s to 17s. Overall improvement including
#9162 reduces startup from 46s to 17s for 1MM series across 14 shards.
This commit ensures that the series file works appropriately on
32-bit architectures. It does this by reducing the maximum size of a
series file to 512MB on 32-bit systems, which should be fully
addressable.
It further updates tests so that the series file size can be reduced
further when running many tests in parallel on 32-bit architectures.
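One way this could be expressed with per-file build constraints; the
file names, package name, and the 64-bit limit below are illustrative,
not the actual values:

```go
// max_size_32bit.go: applies only to 32-bit builds.
//go:build 386 || arm || mips || mipsle

package seriesfile

// MaxSeriesFileSize keeps the mmap'd series file fully addressable
// in a 32-bit address space.
const MaxSeriesFileSize = 512 << 20 // 512MB
```

```go
// max_size_64bit.go: the default for 64-bit builds.
//go:build !(386 || arm || mips || mipsle)

package seriesfile

const MaxSeriesFileSize = 32 << 30 // illustrative larger limit
```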
This limits the disk IO for writing TSM files during compactions
and snapshots. It helps reduce spiky IO patterns on SSDs, especially
when compactions run very quickly.
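A sketch of the throttling idea using a token-bucket limiter;
golang.org/x/time/rate stands in for the engine's own limiter, and the
rates are illustrative:

```go
package main

import (
	"context"
	"io"
	"os"

	"golang.org/x/time/rate"
)

// limitedWriter throttles Write calls so compactions and snapshots
// cannot saturate the disk in short bursts.
type limitedWriter struct {
	w   io.Writer
	lim *rate.Limiter
}

func (l *limitedWriter) Write(p []byte) (int, error) {
	// WaitN blocks until writing len(p) bytes stays under the rate.
	// Writes larger than the burst would need to be chunked first.
	if err := l.lim.WaitN(context.Background(), len(p)); err != nil {
		return 0, err
	}
	return l.w.Write(p)
}

func main() {
	f, err := os.Create("example.tsm.tmp")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Allow ~48MB/s with bursts up to 1MB.
	w := &limitedWriter{w: f, lim: rate.NewLimiter(48<<20, 1<<20)}
	if _, err := w.Write(make([]byte, 1<<20)); err != nil {
		panic(err)
	}
}
```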
Since possibly v0.9, DELETE SERIES has had the unwanted side effect of
removing series from the index when the last traces of series data were
removed from TSM. This occurred because the inmem index was rebuilt on
startup, and if there was no TSM data for a series then there were no
series to add to the index.
This commit returns to the original (documented) DROP/DELETE SERIES
behaviour. As such, when issuing DROP SERIES all instances of matching
series will be removed from both the TSM engine and the index. When
issuing DELETE SERIES only TSM data will be removed.
It is up to the operator to remove series from the index.
NB, this commit does not address how to remove series data from the
series file when a shard rolls over.
This changes the approach to adjusting the amount of concurrency
used for snapshotting to be based on the snapshot latency vs
cardinality. The cardinality approach could use too much concurrency
and increase the number of level 1 TSM files too quickly which incurs
more disk IO.
The latency model seems to adjust better to different workloads.
The disk-based temp index for writing a TSM file was used for
compactions other than snapshot compactions. That meant it was used
even for smaller compactions that would not use much memory. An
unintended side effect of this is higher disk IO when copying the index
to the final file.
This switches when to use the index based on the estimated size of the
new index that will be written. This isn't exact, but it seems to kick
in at higher cardinalities and larger compactions, where it is
necessary to avoid OOMs.
O_SYNC was added when writing TSM files to fix an issue where the
final fsync at the end caused the process to stall. That ended up
increasing disk utilization too much, so this change switches to
multiple fsyncs while writing the TSM file instead of O_SYNC or one
large fsync at the end.
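A sketch of the periodic-fsync pattern described above; the flush
interval is illustrative, not the engine's actual value:

```go
package main

import "os"

// syncingWriter fsyncs every syncEvery bytes so dirty pages are
// flushed incrementally instead of in one large stall at the end.
type syncingWriter struct {
	f         *os.File
	sinceSync int
	syncEvery int
}

func (w *syncingWriter) Write(p []byte) (int, error) {
	n, err := w.f.Write(p)
	if err != nil {
		return n, err
	}
	w.sinceSync += n
	if w.sinceSync >= w.syncEvery {
		if err := w.f.Sync(); err != nil {
			return n, err
		}
		w.sinceSync = 0
	}
	return n, nil
}

func main() {
	f, err := os.Create("example.tsm.tmp")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	w := &syncingWriter{f: f, syncEvery: 4 << 20} // flush every 4MB
	for i := 0; i < 16; i++ {
		if _, err := w.Write(make([]byte, 1<<20)); err != nil {
			panic(err)
		}
	}
	f.Sync() // final sync, now much cheaper
}
```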
If a delete was run for a time that does not exist, we would not
remove the series key from the slice of series to remove from the
index.
This could be triggered by running something like `delete from cpu
where time = 0`: if there was no data at time 0, the series would still
be removed from the index.
If there were many individual deletes to a series that ended up
deleting every value in a block, and the tombstone timestamps were not
contiguous, it was possible for the TSMKeyIterator to incorrectly
return false from Next. This caused the compaction to drop any
remaining data in the file.
Normally, if all the data is deleted via tombstones, we remove the
whole key from the TSM index. In this case, we're not able to determine
that the key is fully deleted until the block is decoded and the
tombstones are applied.
This changes the TSMKeyIterator to detect this condition and continue
to the next key instead of aborting.
The loop to check if a series still exists in a TSM file was wrong
in that it 1) exited early after one iteration and 2) had an
off-by-one error that caused the wrong series to be marked as existing.
This fixes both of these cases, which could cause the index to become
inconsistent with the data store on disk.
This fixes a regression in the Cache introduced in ca40c1ad3c where
not all the values in a cache entry would be removed. Previously,
calling Exclude did not require the values to be sorted. The change
in ca40c1ad3c relies on the values being sorted, so it was possible for
it to find the wrong indexes when calling FindRange and leave some
data that should have been deleted.
Fixes #9161
This commit firstly ensures that a shard's size on disk is accurately
reported when using the tsi1 index, by including the on-disk size of the
tsi1 index in the calculation.
Secondly, this commit adds support for shard streaming/copying when
using the tsi1 index. Prior to this, a tsi1 index would not be
correctly restored when streaming shards.
The Cache.ApplyEntryFn iterates keys according to the partitions
and hashed values. This can cause the deleteKeys slice to contain
unsorted keys when deleting series. The code uses a binary search
on this slice later on, and this can fail to detect that a series
still exists. The series is then removed from the index even though
it still has data.
Fixes #9116
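A sketch of why the sort matters here: keys gathered per partition come
back in hash order, and the later binary search is only correct on a
sorted slice (the names below are illustrative):

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

func main() {
	// Keys collected via Cache.ApplyEntryFn arrive in partition/hash
	// order, not lexical order.
	deleteKeys := [][]byte{[]byte("mem"), []byte("cpu"), []byte("disk")}

	// Sort once so the binary search below is valid.
	sort.Slice(deleteKeys, func(i, j int) bool {
		return bytes.Compare(deleteKeys[i], deleteKeys[j]) < 0
	})

	contains := func(key []byte) bool {
		i := sort.Search(len(deleteKeys), func(i int) bool {
			return bytes.Compare(deleteKeys[i], key) >= 0
		})
		return i < len(deleteKeys) && bytes.Equal(deleteKeys[i], key)
	}
	fmt.Println(contains([]byte("cpu"))) // true only because the slice is sorted
}
```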
If the close happens when next is being called, it can result in a race
condition where the current iterator gets set to nil after the initial
check.
This also fixes the finalizer so it runs the close method in a goroutine
instead of running it by itself. This is because all finalizers run on
the same goroutine so a close that takes a long time can cause a backup
for all finalizers. This also removes the redundant call to
`runtime.SetFinalizer` from the finalizer itself because a finalizer,
when called, has already cleared itself.
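A minimal sketch of the finalizer pattern described above; the iterator
type is illustrative:

```go
package main

import (
	"runtime"
	"time"
)

type iterator struct{ /* cursors, TSM file refs */ }

func (i *iterator) Close() error {
	time.Sleep(10 * time.Millisecond) // may be slow: releases TSM files
	return nil
}

func newIterator() *iterator {
	itr := &iterator{}
	// Run Close on its own goroutine so a slow close cannot back up the
	// single goroutine all finalizers share. No SetFinalizer(itr, nil)
	// is needed here: a finalizer is cleared automatically once it runs.
	runtime.SetFinalizer(itr, func(i *iterator) { go i.Close() })
	return itr
}

func main() {
	_ = newIterator()
	runtime.GC()
	time.Sleep(50 * time.Millisecond) // give the finalizer a chance to run
}
```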
* replaces coordinating goroutines for single k-way heap merge iterator
* removes contention sending keys across buffered channels
startup time from 46s -> 28s for iterating 1MM keys across 14 shards
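A sketch of a single k-way merge over sorted per-shard key slices built
on container/heap, in place of coordinating goroutines and buffered
channels (types and names are illustrative):

```go
package main

import (
	"container/heap"
	"fmt"
)

type cursor struct {
	keys []string // sorted keys from one shard
	pos  int
}

type mergeHeap []*cursor

func (h mergeHeap) Len() int            { return len(h) }
func (h mergeHeap) Less(i, j int) bool  { return h[i].keys[h[i].pos] < h[j].keys[h[j].pos] }
func (h mergeHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *mergeHeap) Push(x interface{}) { *h = append(*h, x.(*cursor)) }
func (h *mergeHeap) Pop() interface{} {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergeKeys iterates all shards' keys in order with no goroutines or
// channel sends; each step is a cheap heap fix-up.
func mergeKeys(shards [][]string) []string {
	h := &mergeHeap{}
	for _, keys := range shards {
		if len(keys) > 0 {
			*h = append(*h, &cursor{keys: keys})
		}
	}
	heap.Init(h)

	var out []string
	for h.Len() > 0 {
		c := (*h)[0]
		out = append(out, c.keys[c.pos])
		c.pos++
		if c.pos == len(c.keys) {
			heap.Pop(h)
		} else {
			heap.Fix(h, 0)
		}
	}
	return out
}

func main() {
	fmt.Println(mergeKeys([][]string{
		{"cpu,host=a", "mem,host=a"},
		{"cpu,host=b"},
	}))
}
```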
If the first block that needed to be read was partially deleted such
that the trailing end had no values, it was possible for the query
cursor to end early.
This was caused by KeyCursor.ReadFloatBlock returning no values instead
of checking the remaining blocks.
This adds the capability for the engine to force a full compaction
to be scheduled. When called, it snapshots any data in the cache,
aborts running compactions, and prevents the level planners from
returning plans.
* did not handle cached values correctly
* sort shards by time in either ascending or descending
order depending on the RPC request ordering to ensure they
are traversed in the correct order.
The DropSeries code path ended up creating a MeasurementSeriesIterator
for each dropped series; this was too expensive just to see if a
series exists.
This adds a HasSeries func and fixes an issue where TSI files were
compacted while an iterator was still in use, causing a panic.
This removes the containsSeries func which ends up creating a map
sized to the slice of keys passed in. This doesn't scale well to
high cardinalities and creates a lot of garbage.
The query language min and max times are slightly different than the
values used in the engine. This allows faster code paths to be used
when the whole time range is deleted.
This is a version of DeleteRange that takes a func predicate to
determine whether a series key should be deleted or not. This avoids
the large slice allocations with higher cardinalities.
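A sketch of the predicate-style API shape with illustrative types: the
engine walks its own keys and consults the predicate per key, so the
caller never allocates a slice of every matching series key:

```go
package main

import (
	"bytes"
	"fmt"
)

type engine struct{ keys [][]byte }

func (e *engine) deleteRange(key []byte, min, max int64) {
	fmt.Printf("delete %s in [%d, %d]\n", key, min, max)
}

// DeleteSeriesRangeWithPredicate asks the predicate for each key the
// engine already knows about, instead of receiving a materialized
// slice of matching keys.
func (e *engine) DeleteSeriesRangeWithPredicate(min, max int64, pred func(key []byte) bool) {
	for _, key := range e.keys {
		if pred(key) {
			e.deleteRange(key, min, max)
		}
	}
}

func main() {
	e := &engine{keys: [][]byte{[]byte("cpu,host=a"), []byte("mem,host=a")}}
	e.DeleteSeriesRangeWithPredicate(0, 100, func(k []byte) bool {
		return bytes.HasPrefix(k, []byte("cpu"))
	})
}
```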
This adds a new v4 tombstone format that extends the v3 format
by allowing multiple batches of tombstones to be written without
having to re-read all the existing tombstones. It uses gzip
multistream to append multiple v3 files together to create the v4
format.
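A sketch of the multistream trick: each batch is written as a complete
gzip stream appended to the file, and the standard library reader
consumes concatenated streams transparently (the file name and payloads
are illustrative):

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io/ioutil"
	"os"
)

// appendBatch writes one batch of tombstones as its own gzip stream at
// the end of the file; nothing already written needs to be re-read.
func appendBatch(path string, batch []byte) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	w := gzip.NewWriter(f)
	if _, err := w.Write(batch); err != nil {
		return err
	}
	return w.Close() // ends this stream; the next append starts a new one
}

func main() {
	const path = "tombstone.v4"
	appendBatch(path, []byte("cpu,host=a\n"))
	appendBatch(path, []byte("mem,host=b\n"))

	f, _ := os.Open(path)
	defer f.Close()
	// gzip.Reader reads concatenated (multistream) members by default.
	r, _ := gzip.NewReader(f)
	all, _ := ioutil.ReadAll(r)
	fmt.Print(string(all)) // both batches, in append order
}
```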
The previous SHA was taken from a revision on a devel branch that I
thought would remain in the tree after it was merged. That revision was
rebased away, and the API was changed for the logger.
This updates the usage of the logger and adds a simple package for
constructing the base logger.
The 1.0 version of zap changed the format of the default console logger
so this change moves over to this new logger instead of attempting to
retain backwards compatibility with the old format.
Update support in the `toml` package for parsing human-readable byte
sizes. Supported size suffixes are "k" or "K" for kibibytes, "m" or "M"
for mebibytes, and "g" or "G" for gibibytes. If no size suffix is
specified, bytes are assumed.
In the config, `cache-max-memory-size` and `cache-snapshot-memory-size` are
now typed as `toml.Size` and support the new syntax.
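A sketch of the suffix parsing, not the package's exact code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSize interprets "k/K", "m/M", and "g/G" suffixes as kibi-,
// mebi-, and gibibytes; a bare number is plain bytes.
func parseSize(s string) (uint64, error) {
	mult := uint64(1)
	switch {
	case strings.HasSuffix(s, "k"), strings.HasSuffix(s, "K"):
		mult, s = 1<<10, s[:len(s)-1]
	case strings.HasSuffix(s, "m"), strings.HasSuffix(s, "M"):
		mult, s = 1<<20, s[:len(s)-1]
	case strings.HasSuffix(s, "g"), strings.HasSuffix(s, "G"):
		mult, s = 1<<30, s[:len(s)-1]
	}
	n, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, err
	}
	return n * mult, nil
}

func main() {
	for _, in := range []string{"512", "25m", "1G"} {
		n, _ := parseSize(in)
		fmt.Printf("%s = %d bytes\n", in, n)
	}
}
```

In a config this would read like `cache-max-memory-size = "1g"`.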
* Fprint* functions
* No nakedness
* clarify panic messages
* spacing between case statements
* remove break in favor of return
* remove goto in favor of for { continue }
* batch cursors return slices of timestamps and values to reduce call
overhead. Significantly improved iteration.
* added CreateCursor API to Shard, Engine
* moved build*Cursor to code gen
* array has already been sized correctly
* eliminates bounds checking for each element access
* reduces decoding of 30,000,000 points via storage API from
584ms to 540ms on average
* Introduces EXPLAIN ANALYZE command, which
produces a detailed tree of operations used to
execute the query.
introduce context.Context to APIs
metrics package
* create groups of named measurements
* safe for concurrent access
tracing package
EXPLAIN ANALYZE implementation for OSS
Serialize EXPLAIN ANALYZE traces from remote nodes
use context.Background for tests
group with other stdlib packages
additional documentation and remove unused API
use influxdb/pkg/testing/assert
remove testify reference
If multiple tombstones existed for a series and ended up causing the
full data to be deleted, the blocks were not removed from the offsets
in the index. This caused the TSMReader to report that a key exists
but has no data.
During a compaction, every key should have at least one value. Since
this invariant was broken, the compaction aborted early and ended up
dropping all series keys lexicographically greater than where the
breakage occurred. This would cause data to be dropped during the
compaction.
This fixes a potential bug where the BlockIterator would skip blocks
if the underlying TSMReader had deletes on it concurrently. This
could possibly occur due to changes in 91eb9de3 that now use the
existing TSMReaders from the FileStore instead of creating new ones
during compaction.
There was a race on the WaitGroup where we could end up calling Add
while another goroutine was still waiting. The functions were
confusing, so they have been simplified a bit, since the compaction
goroutines have been reworked a lot already.
The scheduling logic ended up favoring more backlogged shards too much
and would starve active, less backed-up shards. This occurred because
the scheduling kicks in once a second. When it runs, it schedules as
many compactions as it can. A backed-up shard would end up having more
compactions to run during the loop and would generally get to schedule
them more frequently.
This now allows each shard to try and schedule one compaction at a
time, which provides a more balanced approach. At some point, we'll
probably want to more directly balance each shard's backlog vs letting
it happen somewhat randomly.
Some files seem to get orphaned behind higher levels. This causes
compactions to get blocked, as the lower level files will not get
picked up by their lower level planners. This allows the full plan to
identify them and pull them into its plans.
This check doesn't make sense for high cardinality data as the files
typically get big and sparse very quickly. This causes a lot of extra
disk space to be used which is taken up by large indexes and sparse
data.
One shard might be able to run a compaction, but could fail due to
limits being hit. This loop would continue indefinitely, as the same
task would continue to be rescheduled.
With higher cardinality or larger series keys, the files can roll
over early which causes them to take longer to be compacted by higher
levels. This causes larger disk usage and higher numbers of tsm files
at times.
This changes the compaction scheduling to better utilize the cores that
are free. Previously, a level was planned in its own goroutine and
would kick off a number of compaction groups. The problem with this
model was that if there were 4 groups and 3 completed quickly, planning
would be blocked for that level until the last group finished. If the
compactions at the prior level were running more quickly, a large
backlog could accumulate.
This moves the planning to a single goroutine that plans each level in
succession and starts as many groups as it can. When one group
finishes, the planning starts the next group for that level.
The fsyncs due to large writes when writing to TSM files and the
WAL can eventually cause large pauses. Since we already buffer
writes, using synchronous IO reduces fsync latency by ensuring the
individual writes hit disk. This spreads out the latency across
multiple writes better.
With higher cardinalities, the encoder pools were becoming a
bottleneck. This changes snapshot compactions to check out one encoder
of each type and re-use it while writing the snapshot, as opposed to
repeatedly checking it out and in.
This periodically re-allocates the cache store to avoid memory
fragmentation and gradual slowdown of the store after repeated
deletes and inserts into the map.
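A sketch of the periodic rebuild; `entry` and the store shape are
illustrative. Go maps never shrink their bucket array, so copying live
entries into a fresh map releases memory after heavy delete/insert
churn:

```go
package main

import "fmt"

type entry struct{ values []float64 }

// rebuild copies live entries into a fresh, right-sized map so the
// old, bloated buckets can be garbage collected.
func rebuild(store map[string]*entry) map[string]*entry {
	fresh := make(map[string]*entry, len(store))
	for k, v := range store {
		fresh[k] = v
	}
	return fresh
}

func main() {
	store := map[string]*entry{"cpu,host=a": {}}
	store = rebuild(store)
	fmt.Println(len(store))
}
```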
This instructs the kernel that it can release memory used by mmap'd
TSM files when they are not actively being used. If the mappings are
used, the kernel will fault the pages back in. On Linux, this causes
RES memory to drop immediately when run.
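A minimal Linux-only sketch of the idea using golang.org/x/sys/unix;
the file path is illustrative:

```go
//go:build linux

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.Open("example.tsm")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	fi, _ := f.Stat()

	b, err := unix.Mmap(int(f.Fd()), 0, int(fi.Size()), unix.PROT_READ, unix.MAP_SHARED)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Munmap(b)

	// Tell the kernel these pages may be dropped; they are faulted back
	// in transparently on the next access. RES drops immediately.
	if err := unix.Madvise(b, unix.MADV_DONTNEED); err != nil {
		log.Fatal(err)
	}
}
```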
Compactions would create their own TSMReaders for simplicity. With
very high cardinality compactions, creating the reader and indirectIndex
can start to use a significant amount of memory.
This changes the compactions to use a reader that is already allocated
and managed by the FileStore.
These are already sorted during compaction, so switch to sorting lazily
to avoid the CPU and allocations. The sort only occurs when using the
writer directly.
The directIndex used by the TSMWriter maintained a map of series keys
to index entries. When the index is written to the TSM file, the keys
are sorted and then written out in order.
The reason for this is that directIndex used to be the only index and
was optimized more for reading. The reading has since been replaced by
the indirectIndex, so the map of keys ends up wasting space.
During compactions, the series keys (and index entries) are already
sorted, so this change uses that ordering to avoid the map and the sort
when writing the index. This reduces allocations and CPU usage quite a
bit for larger cardinality TSM files.
This leaves the slower compactions that create full blocks to only
the full compaction. This helps reduce CPU usage and memory while shards
are hot, but increases disk usage (reduced compression) slightly.
Deleting high cardinality series could take a very long time, cause
write timeouts, and even deadlock the process. This fixes those issues
by changing the approach used to clean up the indexes and by reducing
lock contention.
The prior approach deleted each series and updated every index (inmem)
during the delete. This was very slow and caused the index to be locked
while items were removed from a slice one by one. This has been changed
to mark series as deleted and then rebuild the index asynchronously,
which speeds up the process.
There was also a deadlock that could occur when deleting the field set.
Deleting the field set held a write lock, and the function it invoked
under the lock could try to take a read lock on the field set, which
would then deadlock. This approach was also very slow and caused write
timeouts. It now uses a faster approach that checks for the existence
of the measurement in the cache and filestore, which does not take
write locks.
It prints the statistics of each iterator that will access the storage
engine. For each access of the storage engine, it will print the number
of shards that will potentially be accessed, the number of files that
may be accessed, the number of series that will be created, the number
of blocks, and the size of those blocks.
This is used quite a bit to determine which fields are needed in a
condition. When the condition gets large, the memory usage begins to
slow it down considerably, and it doesn't handle duplicates.
There are several places in the code where comma-ok map retrieval was
being used poorly. Some were benign, like checking existence before
issuing an unconditional delete with no cleanup. Others were potentially
far more serious: assuming that if 'ok' was true, then the resulting
pointer retrieved from the map would be non-nil. `nil` is a perfectly
valid value to store in a map of pointers, and the comma-ok syntax is
meant for when membership is distinct from having a non-zero value.
There were only one or two cases that I saw it being used correctly for
maps of pointers.
If a large compaction was running and was aborted, it could leave some
tmp files around for files that it had fully written. The currently
active file was cleaned up, but already completed ones were not. This
would occur when a TSM file needed to roll over due to size.
* <type>FinalizerIterator sets a runtime finalizer and calls Close
when garbage collected. This will ensure any associated cursors
are closed and the associated TSM files released
* `query.Iterators#Merge` call could return an error and the inputs
would not be closed, causing a cursor leak
The OnReplace func ends up trying to acquire locks on
MeasurementFields. When it's called via snapshotting, this can deadlock
because the snapshotting goroutine also holds an RLock on the engine.
If a delete measurement call runs at the right time, it will lock the
MeasurementFields and try to acquire a lock on the engine to disable
compactions. This creates a deadlock.
To fix this, the OnReplace callback is moved to a function parameter so
that only Replace calls made as part of a compaction invoke it, as
opposed to both snapshotting and compactions.
Fixes #8713
This change provides a clear separation between the query engine
mechanics and the query language so that the language can be parsed and
dealt with separate from the query engine itself.
Previously, pseudo iterators could be created for metadata such
as series, measurement, and tag data. These iterators were created
at a higher level and lacked a lot of the power of the query engine.
This commit moves system iterators down to the series level and
supports the following:
- _name
- _seriesKey
- _tagKey
- _tagValue
- _fieldKey
These can be used as normal fields such as:
SELECT _seriesKey FROM cpu
This will return all the series keys for `cpu`.
If there were multiple shards, drop measurement could update the index
and remove the measurement before the other shards ran their deletes.
This caused the later shards to not see any series to delete.
The fix is to allow deleteSeries to handle the index delete, which
already accounts for removing the measurement when it is fully removed
from the index.
The partiallyRead func didn't account for the initial values and would
return true for blocks that had not been read at all. This caused a
slower path during compactions that forced a block to be decoded when
it could just be merged as-is without decoding. This caused compactions
to consume more CPU and run slower at times.
This switches all the interfaces that take a string series key to
take a []byte. This eliminates many small allocations where we
converted between the two repeatedly. Eventually, this change should
propagate further up the stack.
The refs map was used to increment the file references only once each.
It doesn't hurt to increment them multiple times, though.
We also do not need to copy the files slice, as we are accessing it
under a read lock so it can't be changed.