This speeds up snapshot persistence by collecting all of the persist jobs
and running them concurrently on a `JoinSet`. Rather than waiting for each
file to finish persisting before the next one can start, we now run all of
the persist jobs at the same time on the tokio runtime.
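A minimal sketch of the pattern, with hypothetical `persist_file` and `persist_all` names standing in for the real persist job:

```rust
use tokio::task::JoinSet;

// Hypothetical stand-in for the real per-file persist job.
async fn persist_file(path: String) -> std::io::Result<()> {
    tokio::fs::write(&path, b"snapshot bytes").await
}

async fn persist_all(paths: Vec<String>) -> std::io::Result<()> {
    let mut set = JoinSet::new();
    // Spawn every persist job at once instead of awaiting them one by one.
    for path in paths {
        set.spawn(persist_file(path));
    }
    // Wait for all jobs to finish, surfacing the first error encountered.
    while let Some(res) = set.join_next().await {
        res.expect("persist task panicked")?;
    }
    Ok(())
}
```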
Closes #24658
This refactors plugins and triggers so that plugins no longer need to be "created". Since plugins exist either in the configured local directory or in the GitHub repo, a user now only needs to create a trigger and reference the plugin filename.
Closes #25876
This change allows *both* the write and query commands to accept input
via stdin, a string, or a file. Larger queries are now more feasible, as
they can be written in a file, while smaller writes can still be passed
as a string. This also makes the program work the way people expect,
especially on Unix-based systems.
This commit also contains three tests to make sure the functionality works
as expected.
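A sketch of the resolution order this enables (flag and function names are illustrative, not the actual CLI's):

```rust
use std::io::Read;

// Illustrative precedence: an explicit file wins, then an inline string,
// and finally stdin when neither is given.
fn resolve_input(file: Option<&str>, inline: Option<&str>) -> std::io::Result<String> {
    if let Some(path) = file {
        return std::fs::read_to_string(path);
    }
    if let Some(s) = inline {
        return Ok(s.to_string());
    }
    let mut buf = String::new();
    std::io::stdin().read_to_string(&mut buf)?;
    Ok(buf)
}
```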
Closes #25772
Closes #25892
* feat: first stab at locally updating parquet cache
closes: https://github.com/influxdata/influxdb/issues/25887
* refactor: use enums to separate out the modes
This commit introduces the `Immediate` and `Eventual` modes for
fulfilling cache requests. In immediate mode, the data is readily
available to be cached, so we can avoid extra requests to the object
store.
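A sketch of the two modes; beyond the `Immediate`/`Eventual` names, the shapes here are assumptions:

```rust
// Sketch only: field and function names other than the mode names are invented.
enum CacheRequestMode {
    // The caller already holds the bytes (e.g. it just fetched or wrote
    // them), so they go straight into the cache with no extra GET.
    Immediate { data: Vec<u8> },
    // The bytes are not at hand; fetch from object store when fulfilling.
    Eventual,
}

fn fulfill(mode: CacheRequestMode) {
    match mode {
        CacheRequestMode::Immediate { data } => {
            // insert `data` into the parquet cache directly
            let _ = data;
        }
        CacheRequestMode::Eventual => {
            // issue a GET to object store, then insert the result
        }
    }
}
```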
part of: https://github.com/influxdata/influxdb/issues/25887
This commit does a few key things:
- Removes the 72 hour query and write restrictions in Core
- Limits queries to a default number of parquet files. We chose 432,
  as this is about 72 hours' worth of files using the default settings
  for the gen1 time block
- The file limit can be increased, but the help text and error message
when exceeded note that query performance will likely be degraded as
a result.
- We warn users who hit this query error to use smaller time ranges if
  possible
With this we eliminate the hard restriction we had in place and instead
create a soft one that users can choose to override, taking the
performance hit. If they can't take that hit, it is recommended that they
upgrade to Enterprise, which has the compactor built in to keep
historical queries performant.
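A sketch of the soft limit (the 432 default is from this PR; the function and error text are illustrative):

```rust
// Default limit: roughly 72 hours of files under default gen1 settings.
const DEFAULT_PARQUET_FILE_LIMIT: usize = 432;

// Hypothetical check run while gathering parquet chunks for a query.
fn check_file_limit(num_files: usize, limit: usize) -> Result<(), String> {
    if num_files > limit {
        return Err(format!(
            "query would read {num_files} parquet files, exceeding the limit of {limit}; \
             performance will likely be degraded. Use a smaller time range if possible, \
             or raise the limit to accept the performance hit."
        ));
    }
    Ok(())
}
```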
* refactor: reduce catalog locks when getting chunks
The main refactor was to change the `ChunkContainer` trait to use the
`DatabaseSchema` and `TableDefinition` types directly in its signature,
instead of their names, which had required an additional catalog lock and
lookups for both entities. This was already handled upstream in the
`QueryTable`, so there was no need to do the lookups again.
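A before/after sketch of the signature change, using placeholder types (the real trait carries more context than shown here):

```rust
use std::sync::Arc;

// Placeholders standing in for the real catalog and query types.
struct DatabaseSchema;
struct TableDefinition;
trait QueryChunk {}

// Before (illustrative): passing names forced another catalog lock plus
// lookups to resolve both entities.
trait ChunkContainerByName {
    fn get_table_chunks(&self, db_name: &str, table_name: &str) -> Vec<Arc<dyn QueryChunk>>;
}

// After (illustrative): the caller hands over the definitions it already
// resolved upstream in the QueryTable.
trait ChunkContainer {
    fn get_table_chunks(
        &self,
        db_schema: Arc<DatabaseSchema>,
        table_def: Arc<TableDefinition>,
    ) -> Vec<Arc<dyn QueryChunk>>;
}
```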
This required the addition of a test helper in `influxdb3_write::test_helpers`
that provides convenience methods for getting record batches from the
`WriteBuffer`. We had been implementing such a method manually in several
places, so it is nice to have it unified. A blanket impl is provided so
that anything implementing `WriteBuffer` gets the method.
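The blanket-impl shape, with placeholder types (the real helper queries the buffer and collects `RecordBatch`es):

```rust
// Placeholders so the blanket-impl pattern stands alone.
struct RecordBatch;
trait WriteBuffer {
    fn chunks(&self, db: &str, table: &str) -> Vec<RecordBatch>;
}

// Hypothetical helper trait: anything implementing WriteBuffer gets the
// convenience method for free via the blanket impl below.
trait WriteBufferTestHelpers {
    fn get_record_batches(&self, db: &str, table: &str) -> Vec<RecordBatch>;
}

impl<T: WriteBuffer> WriteBufferTestHelpers for T {
    fn get_record_batches(&self, db: &str, table: &str) -> Vec<RecordBatch> {
        // The real helper runs a query against the buffer; this just
        // forwards to the placeholder method.
        self.chunks(db, table)
    }
}
```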
Some other house cleaning was included.
* refactor: clean up test helpers in influxdb3_write
* refactor: pass original df filters forward with ChunkFilter
* chore: clippy
This updates plugins so that they will reload the code if the local file is modified. GitHub plugins continue to be loaded only once, when they are initially created or loaded on startup.
This will make iterating on plugin development locally much easier.
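One way to implement this, assuming an mtime check before each run (names are illustrative):

```rust
use std::path::Path;
use std::time::SystemTime;

// Hypothetical cache entry for a loaded local plugin.
struct LoadedPlugin {
    code: String,
    loaded_at: SystemTime,
}

// Re-read the plugin code only when the file has changed on disk.
fn maybe_reload(plugin: &mut LoadedPlugin, path: &Path) -> std::io::Result<()> {
    let modified = std::fs::metadata(path)?.modified()?;
    if modified > plugin.loaded_at {
        plugin.code = std::fs::read_to_string(path)?;
        plugin.loaded_at = modified;
    }
    Ok(())
}
```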
Closes #25863
* feat: Add request plugin capability
Adds the request plugin type. Triggers can be bound to an API endpoint at /api/v3/engine/<path>. Requests will get yielded to the plugin with the query parameters, request parameters, and request body.
I didn't implement the test endpoint for this plugin type, as it seems much more natural for users to save the file and make a new request. Once #25863 is done, that will be very easy.
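A sketch of the path-based dispatch (the registry shape is an assumption):

```rust
use std::collections::HashMap;

// Hypothetical trigger registry keyed by the path under /api/v3/engine/.
// Returns the plugin filename the request should be yielded to, if any.
fn route_request<'a>(
    triggers: &'a HashMap<String, String>,
    request_path: &str,
) -> Option<&'a str> {
    let trigger_path = request_path.strip_prefix("/api/v3/engine/")?;
    triggers.get(trigger_path).map(String::as_str)
}
```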
Closes #25862
* chore: fix spelling in error message
Although the `format` in the request is used, the value coming
through the `Accept` header is parsed earlier, so when that header
lookup fails, an error is returned (`InvalidMimeType`).
This commit adds extra checks to allow the default `Accept` header
values that come from the browser, defaulting them to `json`.
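A sketch of the fallback logic (the mime strings checked here are assumptions about typical browser defaults):

```rust
// Map an Accept header to an output format, defaulting browser-style
// values (and a missing header) to JSON instead of erroring.
fn format_from_accept(accept: Option<&str>) -> Result<&'static str, String> {
    match accept {
        None => Ok("json"),
        Some(v) if v.contains("application/json") => Ok("json"),
        Some(v) if v.contains("text/csv") => Ok("csv"),
        // Browsers commonly send e.g. "text/html,application/xhtml+xml,...,*/*".
        Some(v) if v.contains("text/html") || v.contains("*/*") => Ok("json"),
        Some(v) => Err(format!("invalid mime type: {v}")),
    }
}
```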
closes: https://github.com/influxdata/influxdb/issues/25874
Related to https://github.com/influxdata/influxdb_pro/issues/436
This PR updates the filter handling in the `WriteBuffer` so that sets of `Expr`s provided in a query will better prune both the chunks from the in-memory buffer and the set of parquet file chunks that are forwarded to DataFusion for query execution.
### New `BufferFilter` type
This introduces the [`BufferFilter`](bab428f0eb/influxdb3_write/src/lib.rs (L496)) type, which converts a set of `Expr`s from a logical query plan into a filter that can be used to:
* prune chunks based on a provided lower/upper `time` boundary from both the buffer and parquet
* prune chunks from the buffer based on any literal guarantees predicated on tag columns in the query, e.g., `WHERE tag = 'a'` or `WHERE tag IN ['a', 'b']`
This type is exposed such that it will be easy to use from replicated buffers and from the compactor when producing `Arc<dyn QueryChunk>`s in Enterprise.
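An illustrative shape for the filter and its time-window check (field names and bound semantics are assumptions, not the actual definition):

```rust
use std::collections::{HashMap, HashSet};

// Sketch of what the filter carries after converting the query's `Expr`s.
struct BufferFilter {
    // Inclusive lower / exclusive upper bounds on `time`, in nanoseconds.
    time_lower_bound_ns: Option<i64>,
    time_upper_bound_ns: Option<i64>,
    // Literal guarantees on tag columns, e.g. WHERE tag IN ('a', 'b').
    tag_guarantees: HashMap<String, HashSet<String>>,
}

impl BufferFilter {
    // A chunk whose [min, max] time range cannot overlap the bounds is pruned.
    fn overlaps(&self, chunk_min_ns: i64, chunk_max_ns: i64) -> bool {
        self.time_lower_bound_ns.map_or(true, |lo| chunk_max_ns >= lo)
            && self.time_upper_bound_ns.map_or(true, |hi| chunk_min_ns < hi)
    }
}
```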
### Tests
* Tests in the [`table_buffer`](bab428f0eb/influxdb3_write/src/write_buffer/table_buffer.rs) module were updated to use the `WriteValidator`. This allows rows to be constructed directly from line protocol, and cleaning up the tests a bit allowed me to extend some of the test cases in [this test](bab428f0eb/influxdb3_write/src/write_buffer/table_buffer.rs (L979)).
* I added [a test](bab428f0eb/influxdb3_write/src/write_buffer/table_buffer.rs (L1243)) that checks the buffer chunk index filtering for expressions against multiple tag columns.
* Added [a test](bab428f0eb/influxdb3_write/src/write_buffer/table_buffer.rs (L1153)) that checks time pruning.
* Added [a test](bab428f0eb/influxdb3_write/src/write_buffer/persisted_files.rs (L279)) that checks time pruning in `PersistedFiles`.
* I renamed several tests to start with `test_`.
* chore: add out of order tests
- assertions for what remains in the queryable buffer when out-of-order
  timestamps are encountered. This can happen when backfilling; in that
  case the backfilled data takes over the queryable buffer, moving all
  the recent data into parquet files (as part of snapshotting)
- assertions to check that the last cache still retains the most recent
  values when out-of-order data is encountered
* chore: update comment
Co-authored-by: Trevor Hilton <thilton@influxdata.com>
---------
Co-authored-by: Trevor Hilton <thilton@influxdata.com>
* feat(processing_engine): Add cron plugins and triggers to the processing engine.
* feat(processing_engine): switch from 'cron plugin' to 'schedule plugin', use TimeProvider.
* feat(processing_engine): add a test for the scheduled plugin.
* feat: improve plugin logging interface
Updates the plugin log functions so they can take any number of Python objects, which are converted into a single log line string.
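The joining behavior, sketched in plain Rust over `Display` values (the real functions convert each Python object with `str()` before joining):

```rust
use std::fmt::Display;

// Join the string form of any number of printable values into one log line,
// mirroring e.g. a plugin call like info("processed", 42, "rows").
fn log_line(values: &[&dyn Display]) -> String {
    values
        .iter()
        .map(|v| v.to_string())
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    assert_eq!(log_line(&[&"processed", &42, &"rows"]), "processed 42 rows");
}
```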
Closes #25847
* refactor: update on PR feedback
* feat: return better plugin execution errors
This sets up the framework for fleshing out more useful plugin execution errors that get returned to the user during testing. We'll also want to capture these for logging in system tables.
Also fixes a test that was broken in a previous commit on time limits; it didn't show up because of the feature flag.
* fix: compile errors without system-py feature
* refactor: update tests for wal file removal
- update the last wal file seen first, so that removal doesn't
  wait for one more cycle
- added the worked-out example test
- minor tidy-ups (introduce an inner function so that block scopes are
  delegated)
* refactor: address PR feedback
This updates the v1 `/query` API handler to support InfluxDB v1's unique
query response structure when GROUP BY clauses are provided.
The distinction is in the addition of a "tags" field to the emitted series
data that contains a map of the GROUP BY tags along with their distinct
values associated with the data in the "values" field.
This required splitting the `QueryExecutor` into two query paths, one for
InfluxQL and one for SQL, as this allows InfluxQL query parsing to be
handled in advance of query planning.
A set of snapshot tests was added to check that it all works.
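An illustrative grouped response, built with `serde_json` (the measurement, tags, and values are invented for the example):

```rust
use serde_json::{json, Value};

// What a v1 response looks like for e.g.
// SELECT mean(usage) FROM cpu GROUP BY host: one series per group, each
// carrying a "tags" map with that group's distinct tag values.
fn example_grouped_response() -> Value {
    json!({
        "results": [{
            "statement_id": 0,
            "series": [
                {
                    "name": "cpu",
                    "tags": { "host": "a" },
                    "columns": ["time", "mean"],
                    "values": [["2024-01-01T00:00:00Z", 0.5]]
                },
                {
                    "name": "cpu",
                    "tags": { "host": "b" },
                    "columns": ["time", "mean"],
                    "values": [["2024-01-01T00:00:00Z", 0.9]]
                }
            ]
        }]
    })
}
```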
Creates the `AmazonS3` object using all environment variables, for additional cases where command line parameters are not appropriate.
This makes step 4 here possible: https://docs.aws.amazon.com/sdk-for-rust/latest/dg/credproviders.html
An alternative approach would have been to add a similar command line option for `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI`, but this makes no sense to provide on a CLI, given it is only set automatically on Fargate and EKS containers.
Since `from_env()` looks up all relevant environment variables, it no longer makes sense to look for them as part of the CLI option parsing, so those env options are removed.
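With the `object_store` crate this is roughly the following (the bucket is shown as the one explicit parameter; error handling elided):

```rust
use object_store::aws::{AmazonS3, AmazonS3Builder};

// from_env() reads the AWS_* environment variables (credentials, region,
// container credentials URI, ...), so the CLI no longer mirrors each one.
fn build_s3(bucket: &str) -> object_store::Result<AmazonS3> {
    AmazonS3Builder::from_env()
        .with_bucket_name(bucket)
        .build()
}
```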
Fixes #25828
This commit sets InfluxDB 3 Core to have a 72 hour limit for queries and
writes. What this means is that writes that contain historical data
older than 72 hours will be rejected and queries will filter out data
older than 72 hours. Core is intended to be a database for recent time
series data, and performance over data older than 72 hours will degrade
without a garbage collector, which is a core feature of InfluxDB 3
Enterprise. Enterprise does not have this write or query limit in place.
Note that this does *not* mean older data is deleted. Older data is
still accessible in object storage as Parquet files that can still be
used in other services and analyzed with dataframe libraries like pandas
and polars.
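The cutoff amounts to a simple comparison against now minus 72 hours (a sketch; the names are illustrative):

```rust
const SEVENTY_TWO_HOURS_NS: i64 = 72 * 60 * 60 * 1_000_000_000;

// Writes older than the cutoff are rejected; queries filter such rows out.
fn within_limit(now_ns: i64, timestamp_ns: i64) -> bool {
    timestamp_ns >= now_ns - SEVENTY_TWO_HOURS_NS
}
```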
This commit does a few things:
- Uses timestamps in the year 2065 for tests, as these should not break
  for longer than most of us will be working. This is only needed for
  the integration tests, as other tests use the MockProvider for time.
- Filters the buffer and persisted files to only show data newer than
3 days ago
- Fixes the integration tests to work with the fact that writes older
than 3 days are rejected
This changes the CLI arg `host-id` to `writer-id` to more accurately
indicate its meaning.
This change also goes through the codebase and changes struct fields,
methods, and variables to use the term `writer_id` or `writer_identifier_prefix`
instead of `host_id` etc., to make the meaning clear in the code.
This also changes the catalog serialization to use the field `writer_id`
instead of `host_id`, which is a breaking change.
This updates the create plugin API and CLI so that they don't take the plugin code, but instead take the name of a file that must be in the plugin-dir of the server. An error is returned if the plugin-dir is not configured or if the file isn't there.
Also updates the WAL and catalog so that it doesn't store the plugin code directly. The code is read from disk one time when the plugin runs.
Closes #25797
* feat: introduce num wal files to keep
This commit allows a configurable number of wal files to be left behind
in object store. This is necessary, as Enterprise replicas rely on these files.
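A sketch of the removal pass (the oldest-first sorting is an assumption noted in the comment):

```rust
// Given wal file paths sorted oldest-first, return the ones to delete so
// that only the newest `num_to_keep` files are left behind.
fn wal_files_to_remove(mut sorted_paths: Vec<String>, num_to_keep: usize) -> Vec<String> {
    let num_to_remove = sorted_paths.len().saturating_sub(num_to_keep);
    sorted_paths.truncate(num_to_remove);
    sorted_paths
}
```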
closes: https://github.com/influxdata/influxdb/issues/25788
* refactor: address PR feedback
* refactor: address PR comment