If a delete is issued while a compaction is running, the a newly
deleted series could re-appear after the compaction completed. This
could occur the compaction had already written the blocks for series
that were just deleted. When the compaction completes, the newly
written tombstone files would be deleted, essentially undeleting the
series.
Reduce the lock contention on tsdb.Store by taking a short lived
read-lock instead of a long write lock. Also close shards in parallel
and drop the whole RP dir in bulk instead of each shard dir.
Reduces the lock contention on the tsdb.Store by taking a short
read lock instead of a long write lock. Also processes shards
in parallel instead of serially.
Due to a bug in compactions, it's possible some blocks may have duplicate
points stored. If those blocks are decoded and re-compacted, an assertion
panic could trigger.
We now dedup those blocks if necessary to remove the duplicate points
and avoid the panic.
For larger datasets, it's possible for shards to get into a state where
many large, dense TSM files exist. While the shard is still hot for
writes, full compactions will skip these files since they are already
fairly optimized and full compactions are expensive. If the write volume
is large enough, the shard can accumulate lots of these files. When
a file is in this state, it's index can contain every series which
causes startup times to increase since each file must parse the full
set of series keys for every file. If the number of series is high,
the index can be quite large causing large amount of disk IO at startup.
To fix this, a optmize compaction is run when a full compaction planning
step decides there is nothing to do. The optimize compaction combines
and spreads the data and series keys across all files resulting in each
file containing the full series data for that shard and a subset of the
total set of keys in the shard.
This allows a shard to only store a series key once in the shard reducing
storage size as well allows a shard to only load each key once at startup.
Large files created early in the leveled compactions could cause
a shard to get into a bad state. This reworks the level planner
to handle those cases as well as splits large compactions up into
multiple groups to leverage more CPUs when possible.
Truncate the time interval output of the monitor service to be on even
time intervals rather than on every minute based on the start time. This
normalizes the output from the monitor service.
If there were blocks in later TSM files that were for overwritten
points or writes into the past, they could be returned more than
once or out of order causing the cursor values to be unsorted.
One effect of this is that graphs in graphana would render with
the line going all over the place in spots.
This might also cause duplicate data to be returned.
Fixes#6738
The tsdb package had a substantial amount of dead code related to the
old query engine still in there. It is no longer used, so it was removed
since it was left unmaintained. There is likely still more code that is
the same, but wasn't found as part of this code cleanup.
influxql has dead code show up because of the code generation so it is
not included in this pruning.
Updated `influx_inspect` to use the `FieldDimensions` method instead
(more reliable anyway). The `influx_tsm` program used its own vendored
copy of `FieldCodec` so it is not affected by this change. `FieldCodec`
was only used for the `b1` and `bz1` engines which were removed in 0.12,
but the code that created the field codec was never removed. This
limited the maximum number of fields to 255 even though that restriction
was removed with the `tsm1` engine.
Fixes#6869.
A copy/paste error had nil cursors destined for a condition cursor get
set to the auxiliary cursor instead. When the number of conditions
exceeded the number of auxiliary fields, this would result in a stack
trace in some situations. When the number of conditions was less than or
equal to the number of auxiliary fields, it means that an auxiliary
cursor may have been overwritten with a nil cursor accidentally and a
leak might have happened since it was never closed.
Fixes#6859.
Anecdotally, the relationship between memory consumption and series
cardinality was thought to be exponential. I suspect that this is false.
The intent of the added benchmarks is to verify my suspicion. Eventually
the these benchmarks will run nightly to serve as a basis to evualuate
the memory performance in a controlled environment.
https://github.com/influxdata/docs.influxdata.com/issues/392
Restore would try to open the shard if there was an error. If there
was an error, the files written are very likely to be partially written
and they can cause the server to panic.
To prevent a shard from trying to open broken files, we now write to
a temp file and rename it to the actual name only after fully writing
and fsyncing the file.
The TSDBStore interface needs to also allow for remote TSDBStore but the
DatabaseIndex is only for a local TSDB instance. Moved the optimized
SHOW TAG VALUES path to do a typecast to the LocalTSDBStore struct
instead of always attempting to use the optimized version.
If the TSDBStore is not local and does not have the DatabaseIndex, it
will default to using the distributed query instead.
This commit optimizes `SHOW TAG VALUES` so that it avoids the
`SELECT` query engine execution and iterator creation. There
are also optimizations to reduce individual memory allocations
and to reduce in-memory heap size by only operating on one
measurement at a time.
Execution time has been reduce to approximately 900ms for
500,000 rows. This is about 2µs per row. Of this time,
approximately 1µs is spent retrieving and sorting the row
and 1µs is spent encoding into JSON and writing to the
response body.
If cache.Deduplicate is called while writes are in-flight on the cache, a data race
could occur.
WARNING: DATA RACE
Write by goroutine 15:
runtime.mapassign1()
/usr/local/go/src/runtime/hashmap.go:429 +0x0
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Cache).entry()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache.go:482 +0x27e
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Cache).WriteMulti()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache.go:207 +0x3b2
github.com/influxdata/influxdb/tsdb/engine/tsm1.TestCache_Deduplicate_Concurrent.func1()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache_test.go:421 +0x73
Previous read by goroutine 16:
runtime.mapiterinit()
/usr/local/go/src/runtime/hashmap.go:607 +0x0
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Cache).Deduplicate()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache.go:272 +0x7c
github.com/influxdata/influxdb/tsdb/engine/tsm1.TestCache_Deduplicate_Concurrent.func2()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache_test.go:429 +0x69
Goroutine 15 (running) created at:
github.com/influxdata/influxdb/tsdb/engine/tsm1.TestCache_Deduplicate_Concurrent()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache_test.go:423 +0x3f2
testing.tRunner()
/usr/local/go/src/testing/testing.go:473 +0xdc
Goroutine 16 (finished) created at:
github.com/influxdata/influxdb/tsdb/engine/tsm1.TestCache_Deduplicate_Concurrent()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/cache_test.go:431 +0x43b
testing.tRunner()
/usr/local/go/src/testing/testing.go:473 +0xdc
For restoring a shard, we need to be able to have the shard open,
but disabled. It was racy to open it and then disable it separately
since writes/queries could occur in between that time.
This switch the backup shard call to use the shard Snapshot that
internally creates a snapshot by hardlinking all of the TSM and
tombstone files instead. This reduces the time that the FileStore
is locked and will allow for larger shards to be backup more easily.
The level planner would keep including the same TSM files to be
recompacted even if they were already quite compacted and split
across several TSM files.
Fixes#6683
This fixes a pathalogical query condition cause by and problematic
structuring of TSM files based on how points were written. The
condition can occur when there are multiple TSM files and a large
number of points are written into the past. The earlier existing
TSM files must also have points in the past and close to the present
causing their time range to eclipse the later files.
When this condition occurs, some queries can spend an excessive amount
of time merge all the overlapping blocks.
The fix was to constrain the window of overlapping blocks based on
the first one we ran into. There was also a simple case in the Merge
where we could skip the binary search path and just append the two
inputs.
os.Open is documented as:
> Open opens the named file for reading. If successful, methods on
> the returned file can be used for reading;
That suggests the file's methods should only be called if opening
was successful. The original code would defer f.Close() right after
os.Open, before ensuring that err is nil, so f.Close() would run
even if os.Open did not return successfully.
Apply https://github.com/golang/go/wiki/CodeReviewComments#indent-error-flow
suggestion to keep the normal path at minimal indentation, and indent
the error handling code instead. This improves code readability.
If you use a statement like this:
SELECT value FROM one..cpu, two..cpu
It will access both the `one` and `two` databases as if you had selected
the `cpu` measurement twice for both of them. Updated the `tsdb.Shard`
create iterator function to filter out any sources that do not apply to
that shard so this duplication doesn't happen.
Fixes#6701.
The limit optimization was put into the wrong place and caused only part
of the shard to be read when a limit was used. The optimization is
possible, but requires a bit of refactoring to the code here so the call
iterator is created per series before handed to the limit iterator.
Fixes#6661.
Due to an bug in TSM tombstone files, it was possible to create
empty tombstone files. At startup, the TSM file would error out
and not load the TSM file.
Instead, treat it as an empty v1 file so the TSM file can load
correctly.
Fixes#6641
If there were duplicate points in multiple blocks, we would correctly
dedup the points and mark the regions of the blocks we've read.
Unfortunately, we were not excluding the already points as the cursor
moved to points in the later blocks which could cause points to be
return twice incorrectly.
Fixes#6611
The optimization to speed up shard loading had the side effect of
skipping adding series to the index that already exist. The skipping
was in the wrong location and also skipped the shards measurementFields
index which is required in order to query that series in the shard.
Switched the max keys test to write int64 of the same value so RLE
would kick in and the file size will be smaller (84MB vs 3.8MB).
Removed the chunking test which was skipped because the code will
not downsize a block into smaller chunks now.
Skip MaxKeys tests in various environments because it needs to
write too much data to run reliably.
If a large series contains a point that is overwritten, the compactor
would load the whole series into RAM during a full compaction. If
the series was large, it could cause very large RAM spikes and OOMs.
The change reworks the compactor to merge blocks more incrementally
similar to the fix done in #6556.
Fixes#6557
The list of field keys in the index may have differed from the field
keys in the actual shard. Fixing `SHOW FIELD KEYS` so it relies only on
the shard rather than the index.
Fixes#6659.
Casting syntax is done with the PostgreSQL syntax `field1::float` to
specify which type should be used when selecting a field. You can also
do `field1::field` or `tag1::tag` to specify that a field or tag should
be selected.
This makes it possible to select a tag when a field key and a tag key
conflict with each other in a measurement. It also means it's possible
to choose a field with a specific type if multiple shards disagree. If
no types are given, the same ordering for how a type is chosen is used
to determine which type to return.
The FieldDimensions method has been updated to return the data type for
the fields that get returned. The SeriesKeys function has also been
removed since it is no longer needed. SeriesKeys was originally used for
the fill iterator, but then expanded to be used by auxiliary iterators
for determining the channel iterator types. The fill iterator doesn't
need it anymore and the auxiliary types are better served by
FieldDimensions implementing that functionality, so SeriesKeys is no
longer needed.
Fixes#6519.
This locks showeed up in a deadlock systems running queries and
delete series across a large dataset. Queries should not need to
lock the tsdb.Store for writes
Drop database was closing and deleting each shard dir individually and
serially. It would then delete the empty database dirs.
This changes drop database to close all shards in parallel and run
one os.RemoveAll to remove everything under the db dir which is more
efficient.
This also reworked the locking to avoid locking the tsdb.Store for
long periods of time. That can cause queries and writes for other
databases to block as well.
On data sets with many series and potentially large series keys,
the cost of parsing the key and re-indexing can be high.
Loading the TSM keys into the index was being done repeatedly for
series that were already index by an earlier TSM file. This was
wasted worked and slows down shard loading.
Parsing the key was also innefficient and allocated a new string
slice. This was simplified to remove that allocation.
This commit changes the `tsm1.Engine` to create individual series
iterators in batches so that it can be parallelized. Iterators
are combined at the end so they can be redistributed to the
parallelized merge iterator.
Before #6038 was merged, we needed to filter "name" so that it didn't
accidentally hit the code path that used "name" to check the name of a
measurement. This was changed to "_name" to avoid a conflict with a
legitimate tag that used "name" as the key.
SHOW TAG VALUES was never modified to remove the code that filtered out
"name". This removes that line of code so a condition with "name"
doesn't get removed erroneously.
Example:
SHOW TAG VALUES WITH KEY = host WHERE "name" = 'jsternberg'
Fixes#6581.
If a large series contains a point that is overwritten, the compactor
would load the whole series into RAM during a full compaction. If
the series was large, it could cause very large RAM spikes and OOMs.
The change reworks the compactor to merge blocks more incrementally
similar to the fix done in #6556.
This commit moves the `CallIterator` to wrap the individual series
instead of wrapping a shard. This allows individual points to be
aggregated before being merged.
This will cause a small increase in memory usuage per series but
it shows a 20% decrease in query time when there are a moderate
number of points per series.
In some query scenarios, if there are a lot of points on disk spread
across many blocks in TSM files and a point is overwritten near the
begginning of the shard's timerange, the full series could be loaded
into RAM triggering OOMs and huge allocations.
The issue was that the KeyCursor code that handles overwriting points
had a simple implementation that just deduped the whole series in this
case. This falls over when the series is quite large.
Instead, the KeyCursor has been changed to only decode blocks with
updated points. It then keeps track of what section of the blocks
have been read so they are not re-read when the later points are
decoded.
Since the points in a block are always sorted, the code was also changed
to remove the Deduplicate calls since they end up
reallocating the slice. Instead, we do a sorted merge and re-use
the slice as much as we can.
The cursors were returning the wrong value in the case when points
existed in both the cache and tsm files with the same timestamp. The
cache value should have been returned, but the tsm value was returned
incorrectly.
Fixes#6439
This commit changes the `SeriesIterator` to process one measurement
at a time and uses a `floatFastDedupeIterator` to avoid point
encoding during deduplication.
If a shard is empty for a specific field and the field type is something
other than a float, a nil iterator would get returned from one of the
empty shards and cause the combined iterators to be cast to the float
type and all other iterator types to be discarded (or for integers, to
be cast).
This is rare since most aggregates don't accept strings or booleans, but
for queries like:
SELECT distinct(string) FROM mydata
It would result in nothing getting returned if one of the shards didn't
have a value for `string`.
This change modifies the query engine to return nil for the shards
instead of a fake iterator and then to only use the fake iterator if the
final aggregate iterator is nil (meaning that no iterators could be
constructed for the field from any shard).
Fixes#6495.
If multiple tombstone entries happen to exist for the same key in a
tombstone file, it was possible to panic. The first application
would remove all index entries and the second time around the code
still assumed entries would exist and would index into the nil slice.
Also fixes a case where the range of time would fully delete all index
entries, but it did not align with math.MinInt64 and math.MaxInt64. This
would cause the index locations to still exist in the offset slice. This
is inefficient because the BlockIterator would still scan and decode the block
only to discover that all the values are deleted. We now just remove it from
the offsets slice in this case since the range of values are deleted.
When a large tombstone file existed on disk, this code was slow since
it would apply each tombstone to the index one at a time causing the
index to be scanned for each key.
Instead, we group all the tombstones together by timestamp and apply
in bulk so that the index in scan once for each set of tombstones.
If we change to immuntable tombstone files, it might be better to just
write a file where all the keys have the same tombstone so we can re-apply
them efficiently.
This was the wrong fix. The real issue was the tombstones were
being read incorrectly and also applied incorrectly at times. This
code is slower and not necessary so reverting it.
Each iteration of the loop was incrementing the position by 4 incorrectly.
The position should start at four since the header is 4 bytes. This
caused tombstones at the end of the file to not be read because the counter
was out of sync with the actual file position which cause the loop to exit early.
Probably better to refactor this to check for io.EOF instead of using the counter.
The code for parsing a key our of the WAL or TSM files in the engine
was naive and didn't account for measurements with escape chars. This
uses the correct parsing code to parse and load them correctly.
Fixes#6496
This remove the dropMeta param from the tsdb.Store.DeleteSeries and
lets the shard determine when to remove the meta data from the index
based on what series still have data in the shard.
This uncovered a nasty bug in compactions where a fully deleted series would
prematurely end the compactions and not carry forward the rest of the data
in the TSM file. This is now fixed as well.
When a shard is closed and removed due to retention policy enforcement,
the series contained in the shard would still exists in the index causing
a memory leak. Restarting the server would cause them not to be loaded.
Fixes#6457
There are two TSMIndex implementations, the directIndex and the
indirectIndex. Originally, we only had the directIndex and later
added the indirectIndex and NewTSMReaderWithOptions in order to
allow both indexes to be used in tests and code. This has created
a problem since we really only use the directIndex for writing and
always use the indirectIndex for reading.
This changes removes the NewTSMReaderWithOptions func so that it is
no longer possible to create a TSMReader with a directIndex. This
will allow a lot of the block reading code used by the directIndex
to be removed and simplify maintainence. It also gives better test
coverage of the code that is actually used by the TSM engine now.
This adds support for a time range to tombstone files to allow a subset
of points to be deleted instead of the whole series. It changes the
tombstone file format to a binary format and maintains backwards compatibility
with the old text format tombstone files.
Binary math inside of a where condition was previously disallowed. Now,
these types of queries are just passed verbatim down to the underlying
query engine which can handle it.
We may want to revisit this when it comes to tags at some point as it
prevents the more efficient filtering of tags that a simple expression
allows, but it allows a query like this to be done:
SELECT * FROM cpu WHERE value + 2 < 5
So while it can be better, this is a good initial implementation to
provide this functionality. There are very rare situations where a tag
may be used appropriately in one of these circumstances.
Fixes#3558.
This commit changes the `FloatDecoder.val` from a `float64` type
to a `uint64` to avoid an additional type conversion during read.
Now the type gets converted to a `float64` only on call to `Values()`.
This has various benefits:
- Users embedding InfluxDB within other Go programs can specify a different logger / prefix easily.
- More consistent with code used elsewhere in InfluxDB (e.g. services, other `run.Server.*` fields, etc).
- This is also more efficient, because it means `executeQuery` no longer allocates a single `*log.Logger` each time it is called.
The cache max memory size is an approximate size and can prevent a
shard from loading at startup. This change disable the max size
at startup to prevent this problem and sets the limt back after
reloading.
Fixes#6109
The series keys within a tag set were previously not sorted which would
cause the output to be non-deterministic. This sorts the output series
by their keys so it has a consistent output especially when using
limits.
Fixes#3166.
This also switches the remaining iterators to be lazy so they can return
errors properly. They needed to be converted to lazy initialization
anyway, which has the side effect of making it much easier for us to
propagate the underlying error during initialization.
Updated the Emitter to return errors when it cannot read properly from
the iterators.
When a GROUP BY or multiple sources are used, the top level limit
iterator requires reading the entire iterator stream so it can find all
of the tag groups it needs to return. For large data series, this ends
up with the limit iterator discarding a lot of output.
This change adds a new lower level limit iterator on each series itself
so that there are fewer data points that have to be thrown away by the
top level iterator.
Fixes#5553.
Now it is possible to compare tags and fields and it is also now
possible to compare tags and tags. Previously, it was only possible to
compare fields with fields and tags with a string or a regex.
Fixes#3371.
This commit makes a number of performance improvements to
reduce allocations during query execution. Several objects
and buffers are now reused across the components to avoid
allocations.
Previously a simple `count(value)` query across 1M points
would require 26,000+ allocations. After the changes in
this commit that number has been reduced to 88.
A missing tag on a point was sometimes treated as `""` and sometimes
treated as a separate `null` entity. This change modifies the equality
operations to always treat a missing tag as an empty string.
Empty tags are *not* indexed and do not have the same performance as a
tag that exists.
Fixes#3773.
Send nil values from the tsm1 cursor at the end of the cursor. After the
cursor reached tsm1, the `nextAt()` call would always return the default
value rather than a nil value.
Descending also didn't work correctly because the seeking functionality
for tsm1 iterators would always act like they were ascending instead of
descending when choosing which value to select. This resulted in very
strange output from the emitter since it couldn't figure out if it was
ascending or descending.
Fixes#6206.
The QueryExecutor had a lot of dead code made obsolete by the query
engine refactor that has now been removed. The TSDBStore interface has
also been cleaned up so we can have multiple implementations of this
(such as a local and remote version).
A StatementExecutor interface has been created for adding custom
functionality to the QueryExecutor that may not be available in the open
source version. The QueryExecutor delegate all statement execution to
the StatementExecutor and the QueryExecutor will only keep track of
housekeeping. Implementing additional queries is as simple as wrapping
the cluster.StatementExecutor struct or replacing it with something
completely different.
The PointsWriter in the QueryExecutor has been changed to a simple
interface that implements the one method needed by the query executor.
This is to allow different PointsWriter implementations to be used by
the QueryExecutor. It has also been moved into the StatementExecutor
instead.
The TSDBStore interface has now been modified to contain the code for
creating an IteratorCreator. This is so the underlying TSDBStore can
implement different ways of accessing the underlying shards rather than
always having to access each shard individually (such as batch
requests).
Remove the show servers handling. This isn't a valid command in the open
source version of InfluxDB anymore.
The QueryManager interface is now built into QueryExecutor and is no
longer necessary. The StatementExecutor and QueryExecutor split allows
task management to much more easily be built into QueryExecutor rather
than as a separate struct.
Both Shard and Engine had the same reference to the measurementField map,
but they each protected it with their own locks. This causes a race when
write and queries are occurring because writes can add new fields to the
map while queries are reading from it.
The fix moves the ownership to the Engine and provides protected accessors
to that Shard now users. For the most parts, the access on shard were old
dead code.
Fixing the measurementFields map race created a new race on the internal
fields map. This is now unexported and protected via MeasurementFields
exported funcs.
Fixes#6188
The stats setup ends up creating a lot of lock contention which signifcantly
impacts write throughput when a large number of measurements are used.
Fixes#6131