The path info only contained the file name which caused tombstone
files to not be removed if there were queries running against
a file that was compacted.
This is now consistent with the TSMReader.Path which returns the
full path info.
If they were left around, re-enabling them again could cause
future compactions to continuously fail. A restart of the
server would clean them up correctly though.
If there were multiple TSM files and a delete/drop was run,
we would write the delete series to the tombstone file N
times for each file. This occurred because FileStore.WalkKeys walks
every key in every TSM file which can return duplicate keys.
This issue caused TSM files to be much larger than they should be
and also cause large memory usage during the delete.
This keeps some memory bounds when reloading a TSM files tombstones
so that the heap does not grow exceedintly fast and stay there
after the deletes are applied.
Tombstone were read fully into memory at startup which could consume
a lot of RAM and OOM the process if there were a lot of deleted
series and many TSM files.
This now walks the tombstone file and iteratively applies the tombstone
which uses significantly less RAM. This may be slightly slower in the
generate cause, but should scale better.
The `SHOW MEASUREMENTS` and `SHOW TAG VALUES` cannot go through the
query engine to get the speed they need. They also only need access to
the database index and do not need access to specific shards. This
removes the query rewriting that was done to turn these two queries into
a select statement and reimplements them inside of the coordinator as an
interface on the TSDBStore.
Normally, compactions do not conflict on the files they are compacting.
If the full cold threshold is set very low, it can cause conflicts where
two compactions compact the same files. The full compaction was the
only place this could happen as it's planning is greedy.
To make this safer for concurrent execution, the compaction tracks which
files are current being compacted and prevents any new compactions from
starting if the file set overlaps.
Fixes#6595
If a query is interrupted via kill query, the tsm files managed
by the file store purger would never get removeed because
KeyCursor.Close was never called.
KeyCursor.Close should always be called now.
If a query was running against a file being compacted, we close the file
and the query would end wherever it had read up to. This could result
in queries that randomly lost data, but running them again showed the
full results.
We now use a reference counting approach and move the in-use files out
of the way in the filestore and allow the queries to complete against
the old tsm files. The new files are installed and new queries will
use them.
Fixes#5501
benchmark old ns/op new ns/op delta
BenchmarkBooleanDecoder_2048-4 9954 7846 -21.18%
benchmark old allocs new allocs delta
BenchmarkBooleanDecoder_2048-4 0 0 +0.00%
benchmark old bytes new bytes delta
BenchmarkBooleanDecoder_2048-4 0 0 +0.00%
There was a race where the same series would get added to the in-memory
index for a measurement more than once. This would result in the same
series being returned more than once during queries causing duplicate
results. The issue was that we check for the series under the read
lock, but did not check again under the write lock where there was
a small window where the series could be added by another goroutine.
We now check for the series under the write lock.
Fixes#6946
A slower disk can can cause excessive allocations to occur when
writing to the WAL because the slower encoding and compression occurs
before taking the write lock. The encoding/compression grabs a large
byte slice from a pool and ultimately waits until it can acquire the
write lock.
This adds a throttle to limit how many inflight WAL writes can be queued
up to prevent OOMing the processess with slower disks and heavy writes.
If a delete is issued while a compaction is running, the a newly
deleted series could re-appear after the compaction completed. This
could occur the compaction had already written the blocks for series
that were just deleted. When the compaction completes, the newly
written tombstone files would be deleted, essentially undeleting the
series.
Reduce the lock contention on tsdb.Store by taking a short lived
read-lock instead of a long write lock. Also close shards in parallel
and drop the whole RP dir in bulk instead of each shard dir.
Reduces the lock contention on the tsdb.Store by taking a short
read lock instead of a long write lock. Also processes shards
in parallel instead of serially.
Due to a bug in compactions, it's possible some blocks may have duplicate
points stored. If those blocks are decoded and re-compacted, an assertion
panic could trigger.
We now dedup those blocks if necessary to remove the duplicate points
and avoid the panic.
For larger datasets, it's possible for shards to get into a state where
many large, dense TSM files exist. While the shard is still hot for
writes, full compactions will skip these files since they are already
fairly optimized and full compactions are expensive. If the write volume
is large enough, the shard can accumulate lots of these files. When
a file is in this state, it's index can contain every series which
causes startup times to increase since each file must parse the full
set of series keys for every file. If the number of series is high,
the index can be quite large causing large amount of disk IO at startup.
To fix this, a optmize compaction is run when a full compaction planning
step decides there is nothing to do. The optimize compaction combines
and spreads the data and series keys across all files resulting in each
file containing the full series data for that shard and a subset of the
total set of keys in the shard.
This allows a shard to only store a series key once in the shard reducing
storage size as well allows a shard to only load each key once at startup.
Large files created early in the leveled compactions could cause
a shard to get into a bad state. This reworks the level planner
to handle those cases as well as splits large compactions up into
multiple groups to leverage more CPUs when possible.
Truncate the time interval output of the monitor service to be on even
time intervals rather than on every minute based on the start time. This
normalizes the output from the monitor service.
If there were blocks in later TSM files that were for overwritten
points or writes into the past, they could be returned more than
once or out of order causing the cursor values to be unsorted.
One effect of this is that graphs in graphana would render with
the line going all over the place in spots.
This might also cause duplicate data to be returned.
Fixes#6738
The tsdb package had a substantial amount of dead code related to the
old query engine still in there. It is no longer used, so it was removed
since it was left unmaintained. There is likely still more code that is
the same, but wasn't found as part of this code cleanup.
influxql has dead code show up because of the code generation so it is
not included in this pruning.
Updated `influx_inspect` to use the `FieldDimensions` method instead
(more reliable anyway). The `influx_tsm` program used its own vendored
copy of `FieldCodec` so it is not affected by this change. `FieldCodec`
was only used for the `b1` and `bz1` engines which were removed in 0.12,
but the code that created the field codec was never removed. This
limited the maximum number of fields to 255 even though that restriction
was removed with the `tsm1` engine.
Fixes#6869.
A copy/paste error had nil cursors destined for a condition cursor get
set to the auxiliary cursor instead. When the number of conditions
exceeded the number of auxiliary fields, this would result in a stack
trace in some situations. When the number of conditions was less than or
equal to the number of auxiliary fields, it means that an auxiliary
cursor may have been overwritten with a nil cursor accidentally and a
leak might have happened since it was never closed.
Fixes#6859.
Anecdotally, the relationship between memory consumption and series
cardinality was thought to be exponential. I suspect that this is false.
The intent of the added benchmarks is to verify my suspicion. Eventually
the these benchmarks will run nightly to serve as a basis to evualuate
the memory performance in a controlled environment.
https://github.com/influxdata/docs.influxdata.com/issues/392
Restore would try to open the shard if there was an error. If there
was an error, the files written are very likely to be partially written
and they can cause the server to panic.
To prevent a shard from trying to open broken files, we now write to
a temp file and rename it to the actual name only after fully writing
and fsyncing the file.
The TSDBStore interface needs to also allow for remote TSDBStore but the
DatabaseIndex is only for a local TSDB instance. Moved the optimized
SHOW TAG VALUES path to do a typecast to the LocalTSDBStore struct
instead of always attempting to use the optimized version.
If the TSDBStore is not local and does not have the DatabaseIndex, it
will default to using the distributed query instead.
This commit optimizes `SHOW TAG VALUES` so that it avoids the
`SELECT` query engine execution and iterator creation. There
are also optimizations to reduce individual memory allocations
and to reduce in-memory heap size by only operating on one
measurement at a time.
Execution time has been reduce to approximately 900ms for
500,000 rows. This is about 2µs per row. Of this time,
approximately 1µs is spent retrieving and sorting the row
and 1µs is spent encoding into JSON and writing to the
response body.
If cache.Deduplicate is called while writes are in-flight on the cache, a data race
could occur.
WARNING: DATA RACE
Write by goroutine 15:
runtime.mapassign1()
/usr/local/go/src/runtime/hashmap.go:429 +0x0
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Cache).entry()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache.go:482 +0x27e
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Cache).WriteMulti()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache.go:207 +0x3b2
github.com/influxdata/influxdb/tsdb/engine/tsm1.TestCache_Deduplicate_Concurrent.func1()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache_test.go:421 +0x73
Previous read by goroutine 16:
runtime.mapiterinit()
/usr/local/go/src/runtime/hashmap.go:607 +0x0
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Cache).Deduplicate()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache.go:272 +0x7c
github.com/influxdata/influxdb/tsdb/engine/tsm1.TestCache_Deduplicate_Concurrent.func2()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache_test.go:429 +0x69
Goroutine 15 (running) created at:
github.com/influxdata/influxdb/tsdb/engine/tsm1.TestCache_Deduplicate_Concurrent()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache_test.go:423 +0x3f2
testing.tRunner()
/usr/local/go/src/testing/testing.go:473 +0xdc
Goroutine 16 (finished) created at:
github.com/influxdata/influxdb/tsdb/engine/tsm1.TestCache_Deduplicate_Concurrent()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache_test.go:431 +0x43b
testing.tRunner()
/usr/local/go/src/testing/testing.go:473 +0xdc
For restoring a shard, we need to be able to have the shard open,
but disabled. It was racy to open it and then disable it separately
since writes/queries could occur in between that time.
This switch the backup shard call to use the shard Snapshot that
internally creates a snapshot by hardlinking all of the TSM and
tombstone files instead. This reduces the time that the FileStore
is locked and will allow for larger shards to be backup more easily.
The level planner would keep including the same TSM files to be
recompacted even if they were already quite compacted and split
across several TSM files.
Fixes#6683
This fixes a pathalogical query condition cause by and problematic
structuring of TSM files based on how points were written. The
condition can occur when there are multiple TSM files and a large
number of points are written into the past. The earlier existing
TSM files must also have points in the past and close to the present
causing their time range to eclipse the later files.
When this condition occurs, some queries can spend an excessive amount
of time merge all the overlapping blocks.
The fix was to constrain the window of overlapping blocks based on
the first one we ran into. There was also a simple case in the Merge
where we could skip the binary search path and just append the two
inputs.
os.Open is documented as:
> Open opens the named file for reading. If successful, methods on
> the returned file can be used for reading;
That suggests the file's methods should only be called if opening
was successful. The original code would defer f.Close() right after
os.Open, before ensuring that err is nil, so f.Close() would run
even if os.Open did not return successfully.
Apply https://github.com/golang/go/wiki/CodeReviewComments#indent-error-flow
suggestion to keep the normal path at minimal indentation, and indent
the error handling code instead. This improves code readability.
If you use a statement like this:
SELECT value FROM one..cpu, two..cpu
It will access both the `one` and `two` databases as if you had selected
the `cpu` measurement twice for both of them. Updated the `tsdb.Shard`
create iterator function to filter out any sources that do not apply to
that shard so this duplication doesn't happen.
Fixes#6701.
The limit optimization was put into the wrong place and caused only part
of the shard to be read when a limit was used. The optimization is
possible, but requires a bit of refactoring to the code here so the call
iterator is created per series before handed to the limit iterator.
Fixes#6661.
Due to an bug in TSM tombstone files, it was possible to create
empty tombstone files. At startup, the TSM file would error out
and not load the TSM file.
Instead, treat it as an empty v1 file so the TSM file can load
correctly.
Fixes#6641
If there were duplicate points in multiple blocks, we would correctly
dedup the points and mark the regions of the blocks we've read.
Unfortunately, we were not excluding the already points as the cursor
moved to points in the later blocks which could cause points to be
return twice incorrectly.
Fixes#6611
The optimization to speed up shard loading had the side effect of
skipping adding series to the index that already exist. The skipping
was in the wrong location and also skipped the shards measurementFields
index which is required in order to query that series in the shard.
Switched the max keys test to write int64 of the same value so RLE
would kick in and the file size will be smaller (84MB vs 3.8MB).
Removed the chunking test which was skipped because the code will
not downsize a block into smaller chunks now.
Skip MaxKeys tests in various environments because it needs to
write too much data to run reliably.
If a large series contains a point that is overwritten, the compactor
would load the whole series into RAM during a full compaction. If
the series was large, it could cause very large RAM spikes and OOMs.
The change reworks the compactor to merge blocks more incrementally
similar to the fix done in #6556.
Fixes#6557
The list of field keys in the index may have differed from the field
keys in the actual shard. Fixing `SHOW FIELD KEYS` so it relies only on
the shard rather than the index.
Fixes#6659.
Casting syntax is done with the PostgreSQL syntax `field1::float` to
specify which type should be used when selecting a field. You can also
do `field1::field` or `tag1::tag` to specify that a field or tag should
be selected.
This makes it possible to select a tag when a field key and a tag key
conflict with each other in a measurement. It also means it's possible
to choose a field with a specific type if multiple shards disagree. If
no types are given, the same ordering for how a type is chosen is used
to determine which type to return.
The FieldDimensions method has been updated to return the data type for
the fields that get returned. The SeriesKeys function has also been
removed since it is no longer needed. SeriesKeys was originally used for
the fill iterator, but then expanded to be used by auxiliary iterators
for determining the channel iterator types. The fill iterator doesn't
need it anymore and the auxiliary types are better served by
FieldDimensions implementing that functionality, so SeriesKeys is no
longer needed.
Fixes#6519.
This locks showeed up in a deadlock systems running queries and
delete series across a large dataset. Queries should not need to
lock the tsdb.Store for writes
Drop database was closing and deleting each shard dir individually and
serially. It would then delete the empty database dirs.
This changes drop database to close all shards in parallel and run
one os.RemoveAll to remove everything under the db dir which is more
efficient.
This also reworked the locking to avoid locking the tsdb.Store for
long periods of time. That can cause queries and writes for other
databases to block as well.
On data sets with many series and potentially large series keys,
the cost of parsing the key and re-indexing can be high.
Loading the TSM keys into the index was being done repeatedly for
series that were already index by an earlier TSM file. This was
wasted worked and slows down shard loading.
Parsing the key was also innefficient and allocated a new string
slice. This was simplified to remove that allocation.
This commit changes the `tsm1.Engine` to create individual series
iterators in batches so that it can be parallelized. Iterators
are combined at the end so they can be redistributed to the
parallelized merge iterator.