The storage filters are modified to use the predicates directly so we do
not have to pass `semantic.FunctionExpression` around. Instead, since
simple expressions are all that are supported anyway, we transform
suitable function expressions into predicates as part of the push down
rule and this simplifies the influxdb reader code.
This also moves the storage predicate conversion code into the standard
library package as it is the only location that uses this code now that
the predicate conversion is done as part of the push down rule.
This refactor was prompted by another refactor of the
`semantic.FunctionExpression` that would cause it to always contain a
`semantic.Block`. Since the push down filter needs the expressions and
to combine them, this refactor allows us not do construct a combined
filter inside of blocks which allows us to have better type safety.
Filter cursors buffer points in between calls to Next() if the number
of read points exceeds 1000. Previously, this buffer was being cleared
out before being iterated over which caused queries to return a resultset
which had a number of rows divisable by 1000.
This change moves the clearing of the buffer until after the points have
been read. This change affects any queries which read more than 1000 points
from a single series & have a filter that can be successfully applied to at
least one of those points.
These APIs require a measurement, permitting an additional optimization
to reduce the search space against the TSM index. Specifically, the
search key prefix is extended from `org+bucket` to
`org+bucket,\x00=<measurement>`
* MeasurementNames
* MeasurementTagKeys
* MeasurementTagValues
* Adds an api to the models package for efficiently parsing the
measurement tag (\x00) from a normalized series key
* refactor(storage): move type ByTagKey to the only package that uses it
* refactor(tsdb): use types in tsdb/cursors
* refactor(tsdb): remove unused type SeriesIDElems
* refactor(tsdb): inline only use of tsdb.ReadAllSeriesIDIterator
* refactor(tsdb): move series file to its own package
* refactor(storage): remove platform->influxdb aliases
This removes the storage dependency on libflux by moving the interfaces
it implements to the `query` package so it can reference the definitions
rather than the package with the implementation and the registration
with the runtime. This breaks the dependency where a storage package
depends on a flux runtime package.
* refactor(storage): add readSource field accessors
* refactor(storage): remove unused limitSeriesCursor
* refactor(storage): export IndexSeriesCursor
This allows IDPE to use the same implementation, rather than duplicate
code. Also copied unit tests from IDPE.
* chore: go fmt
* refactor(storage): move unused code to repo that needs it
Turns out that a bunch of code is only needed in IDPE. This change
removes that code, and another PR adds it to IDPE.
* refactor(storage): export KeyMerger
* refactor(storage): export NilSortHi and NilSortLo
* refactor(storage): move StringIterator & friends to IDPE
* refactor(storage): unexport a few test helper funcs
* fix(storage): simplify storage/seriesCursor
storage/seriesCursor releases series file and TSI references sooner.
Remove unhelpful request object, inherited from 1.x
* chore(storage): replace SeriesCursor interface with sole implementation
* feat(backup): `influx backup` creates data backup
* feat(backup): initial restore work
* feat(restore): initial restore impl
Adds a restore tool which does offline restore of data and metadata.
* fix(restore): pr cleanup
* fix(restore): fix data dir creation
* fix(restore): pr cleanup
* chore: amend CHANGELOG
* fix: restore to empty dir fails differently
* feat(backup): backup and restore credentials
Saves the credentials file to backups and restores it from backups.
Additionally adds some logging for errors when fetching backup files.
* fix(restore): add missed commit
* fix(restore): pr cleanup
* fix(restore): fix default credentials restore path
* fix(backup): actually copy the credentials file for the backup
* fix: dirs get 0777, files get 0666
* fix: small review feedback
Co-authored-by: tmgordeeva <tanya@influxdata.com>
This adds an lru cache for the columns that are produced as tags. When
producing the columns that are part of the group key, it will generate
the column and then keep it in an lru cache to reuse for future tables.
The start and stop column are effectively cached for every table because
they are special and will be the same for all of the tables.
For the tags, it retains the most recently used since they may be used
by a future table. That way most of the columns will get shared with
each other.
When the size differs, a slice is used so the underlying data is still
shared, but the size is different.
This removes the duplicate filter that is used by the reader. The
storage engine shouldn't be sending us duplicate tables anyway and this
code hurts performance in high cardinality queries because of the memory
it uses to keep track of all of the keys that have been seen.
By default this feature is disabled; the full compaction behaviour does
not change. When this feature is enabled compactions can be limited
across multiple storage engines running in multiple processes.
The mechanism by which this happens is not part of the abstraction added
here.
* test(storage): ensure multiple engines can run concurrently
* feat(storage): expose control over retention run
Fixes#15134.
This commit adds the ability to inject a functional option into a
storage.Engine for controlling when the retention enforcer can run.
Previously, retention enforcers ran on an interval; if you ran multiple
storage engines (as we do in some environments) then it was not possible
to coordinate when engines ran retention. Often they would synchronise
because they started at the same time.
This change will let you specify a blocking function to control when the
retention enforcer can run.
A simple function for serialising retention enforcement across multiple
storage engines could look like:
```go
var mu sync.Mutex
func f() (done func()) {
mu.Lock()
return func() { mu.Unlock() }
}
```
The ResponseWriter would truncate the last series if the byte size of
the points frames exceeded the writeSize constant, causing a Flush to
occur and the cumulative ResponseWriter.sz to reset to zero. Because
ResponseWriter.sz was not incremented for each frame, it remained at
zero, which resulted in the final Flush short circuiting.
This commit implements the Size method for the cursors.Array types
to be used to estimate the size of frame. This is in place of calling
the Protocol Buffer `Size` function, which can be very expensive.
If the reader produces more than one table with the same group key, we
discard the later ones because the stream should never give us more than
one table with the same group key.
This is an error and it indicates the server sent us a bad set of data.
This change makes it so that the client is tolerant of that data and
will discard it if it exists.
Adds the ability to set the current generation to use when compacting
the cache only. Previously, we used the current generation for all
files but this causes issues and we should only use the current
generation for level 1 compaction.
When a buffered column reader was used, the length was not reset to
whatever the requested length was for the buffer so it was possible for
the length to be longer than the actual columns.
The storage table reader will now work correctly when there are multiple
outputs. The table interface now implements the new table and column
reader interfaces and works properly with `execute.CopyTable`. The
source uses `execute.CopyTable` to buffer the table in memory when there
are multiple output transformations.
I don't see anywhere obvious that an engine would be closed twice, but
if it was, the RLock would have been held permanently, such that a Lock
could not be taken later.
Running go test ./storage/... did not trigger a double-close.
The controller implementation is primarily used by influxdb so it
shouldn't be part of the flux repository. This copies the code from flux
to influxdb so it can be removed from the next flux release.
The copy was unnecessary since it was just going to be copied
immediately afterwards into an Arrow buffer. In the future, we will want
to have storage directly send the arrow buffer, but right now we are
putting it in an array and copying it anyway.
Even when we send an arrow buffer, the underlying sequence of bytes is
probably going to be different and we will rely on the allocator to
reuse bytes so let's remove the extra copy.
This manifested as incorrect sort ordering when serialized via RPC,
resulting in an `invalid partition key order` error.
This fix introduces a delimiter to ensure sort keys cannot collide.
These tables were previously used to perform meta queries.
Meta queries are now answered using a specific API, and as
a result, these tables can go away.
Translate the measurement and field tag key names to their non-storage
names and add the `_start` and `_stop` tag keys to the output since
they aren't real tags, but ones that are added by range.
The RPC call should translate `_measurement` and `_field` to their
proper shortened byte strings when requesting the tag values.
This also fixes the planner rewrites to return the root node even when
no rewrite happened as this is required by the planner.
The TagValues API will perform a linear scan if there is no predicate;
otherwise, it will use the index to find a list of candidate series
keys.
TagKeys expects the predicate to be transformed such that
`_measurement` and `_field` are remapped to `\x00` and `\xff`
respectively.
There is one TODO marked to analyze the predicate for a
`\x00 = '<measurement>'` pattern. If found, the predicate can be
eliminated and fall back to a linear prefix scan by combining the org,
bucket and measurement. This is tracked by issue #13497.
If a pattern is seen that matches the `v1.tagValues(...)` call, then it
will be replaced with a direct RPC call to read the tag values for the
selected tag key which should be better optimized than reading from the
storage engine tsm1 files.
If a pattern is seen that matches reading the tag keys, it will be
replaced with a direct RPC call to read the tag keys which should be
better optimized than reading from the storage engine tsm1 files.
* Extend storage service protobuf with TagKeys and TagValues
Co-authored-by: Michael Desa <mjdesa@gmail.com>
Co-authored-by: Jacob Marble <jacobmarble@influxdata.com>
* Extend the reads.Store interface with new TagKeys and TagValues APIs
* Extend readservice.store to implement refactored reads.Store interface
* Implement a StringIterator gRPC writer / serializer
* Implement a StringIterator gRPC reader / deserializer
* Implement a StringIterator merger
Storage converts references to _value in filter predicates to $.
It then considers any predicate that does not reference $ to be
a tag predicate. Tag predicates are used to construct series index
cursors.
This commit fixes a bug where field comparisons were being included
in tag predicates due to the HasFieldValueKey function searching
for comparison expressions referencing _value instead of $. Because
all references to _value had already been replaced. An expression
of the form '$ < 3000' would be considered a tag expression and
therefore would be mistakenly included as a tag predicate.
Fixes#13159.
A cursor can be in process (local) or remote. Remote cursors have
already applied the `hasPoints` check, to reduce network traffic.
Testing whether the cursor has points again here:
https://github.com/influxdata/influxdb/blob/master/storage/reads/reader.go#L221
will always return `false` for a remote cursor that has applied the
NoPoints optimization.
This temporary fix allows the `hasPoints` function to differentiate
a streamCursor and always return true in that case.
The storage engine will now drop any points that contain invalid tag
data. Special tag keys for the measurement and field key will be
excepted from this validation.
We will want to validate that all tag key/value data is valid unicode.
This commit changes the validation helper to only validate provided
tags, since measurements are currently very likely to contain invalid
utf-8 characters.
There are two exceptions to the tag validation: the validation of the
special tag keys for measurements and field keys.
When the WAL was moved up, the validation that happened at the cache
was skipped. This moves the field type validation for a batch of
points up ahead of the WAL again.
It is possible a StreamReader (gRPC) may return an empty response. This
change adds retry and bail-out support. When a bail-out occurs,
reads.ErrStreamNoData is returned.
StorageReadClient adapts a gRPC Storage_ReadClient to provide
cursors.CursorStats by reading the trailer after receiving the final
message from the stream.
This commit adds the pkg/lifecycle.Resource to help manage opening,
closing, and leasing out references to some resource. A resource
cannot be closed until all acquired references have been released.
If the debug_ref tag is enabled, all resource acquisitions keep
track of the stack trace that created them and have a finalizer
associated with them to print on stderr if they are leaked. It also
registers a handler on SIGUSR2 to dump all of the currently live
resources.
Having resources tracked in a uniform way with a data type allows us
to do more sophisticated tracking with the debug_ref tag, as well.
For example, we could panic the process if a resource cannot be
closed within a certain time frame, or attempt to figure out the
DAG of resource ownership dynamically.
This commit also fixes many issues around resources, correctness
during error scenarios, reporting of errors, idempotency of
close, tracking of memory for some data structures, resource leaks
in tests, and out of order dependency closes in tests.
It turns out that LastModified and DiskSize are unused, and so it
was easy to change to not care about the WAL.
This hooks up metrics and starts the WAL again.
At the cost of some nil checks, we don't have to have an interface, defend against
subtle bugs with nils in non-nil interfaces, an empty implementation, etc.
Also, the tsm1 engine is losing the WAL anyway.
Because the WAL relies on the tsm1.Value type, we move that into its own
tsm1/value package and set up some aliases forwarding them into tsm1. This
also required adding some methods and changing consumers to avoid the
unexported fields. I imagine this step will be useful one day when we make
the write path more efficient with respect to consuming points.
This commit additionally fixes some issues with generation. The iterator.tmpldata
and generation for array_cursor_* were removed accidentally when removing
iterators, making those generated files stale. Restore that and regenerate.
No change in functionality.
This commit improves the performance of a mass delete on the TSI index
by deleting at the measurement level instead of deleting each series
individually.
I did this with a dumb editor macro, so some comments changed too.
Also rename root package from platform to influxdb.
In interest of minimizing risk, anyone importing the root package has
now aliased it to "platform" so that no changes beyond imports were
necessary in those files.
Lastly, replace the old platform module to local path /dev/null so that
nobody can accidentally reintroduce a platform dependency while
migrating platform code to influxdb.
The storage bucket service wraps another bucket service, and invokes
actions on a storage engine based upon the actions taken upon buckets.
Currently, the storage bucket service will delete bucket data from the
storage engine when the bucket is deleted via the bucket service.
A standard Makefile is used now in all subdirs that run go generate.
Make will only generate the file if its source files changed.
The checkgenerate target runs clean to ensure all targets a generated
fresh.
The table interface was modified to expose the arrow buffers. The
storage table has now been converted to use this interface with the same
fixes so that it exposes arrow buffers.
The influxql package has also been updated to use the `DoArrow` method
from the `flux.Table` interface.