This makes quite a few major changes to our CLI and how users interact
with it:
1. All commands are now in the form <verb> <noun>, which makes the
commands consistent. We had last-cache as a noun, but serve as a
verb in the top level. Given that we could only create or delete,
all noun-based commands have been moved under a create and delete
command
2. The --host short form is now -H rather than -h. -h is reassigned to
-h/--help for shorter help text, which is in line with what users
would expect from a CLI
3. Only the needed items from clap_blocks have been moved into
`influxdb3_clap_blocks` and any IOx specific references were changed
to InfluxDB 3 specific ones
4. References to InfluxDB 3.0 OSS have been changed to InfluxDB 3 Core
in our CLI tools
5. --dbname has been changed to --database to be consistent with --table
in many commands. The short -d flag still remains. In the create/
delete command for the database however the name of the database is
a positional arg
e.g. `influxdb3 create database foo` rather than
`influxdb3 database create --dbname foo`
6. --table has been removed from the delete/create command for tables
and is now a positional arg much like database
7. clap_blocks was removed as a dependency to avoid having IOx specific
env vars
8. --cache-name is now an optional positional arg for last_cache and meta_cache
9. last-cache/meta-cache commands are now last_cache and meta_cache respectively
Unfortunately we have quite a few options to run the software and I
couldn't cut down on them, but at least with this change commands and
options will be more discoverable and we have full control over our CLI
options now.
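As a rough illustration of the `<verb> <noun>` shape with a positional
name, here is a minimal sketch using only the standard library. The
`Command` enum and `parse_command` function are hypothetical and are not
the clap definitions used by the CLI; flags such as --database are left
out.

```rust
/// Hypothetical model of the `<verb> <noun> <name>` command shape.
#[derive(Debug, PartialEq)]
enum Command {
    CreateDatabase(String),
    DeleteDatabase(String),
    CreateTable(String),
    DeleteTable(String),
}

/// Parse arguments such as ["create", "database", "foo"].
/// The verb comes first, then the noun, then the positional name.
fn parse_command(args: &[&str]) -> Result<Command, String> {
    match args {
        ["create", "database", name] => Ok(Command::CreateDatabase(name.to_string())),
        ["delete", "database", name] => Ok(Command::DeleteDatabase(name.to_string())),
        ["create", "table", name] => Ok(Command::CreateTable(name.to_string())),
        ["delete", "table", name] => Ok(Command::DeleteTable(name.to_string())),
        // Anything else with at least two words is an unknown verb/noun pair.
        [verb, noun, ..] => Err(format!("unsupported command: {verb} {noun}")),
        _ => Err("expected <verb> <noun> <name>".to_string()),
    }
}
```

With this sketch, `parse_command(&["create", "database", "foo"])` succeeds,
while the old noun-first order `["database", "create", "foo"]` is rejected.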
Closes #25646
Prior to this change we would deny writes that used n and u for the
precision argument. We only accepted ns and us for those APIs. However,
to be backwards compatible we need to accept writes with n and u. This
is mostly just upgrading our deps, as this was a change that landed in
IOx first. We test that it works for our code by adding test cases for
these precisions in this repo.
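The accepted aliases can be pictured with a small sketch. The `Precision`
enum and `parse_precision` function are hypothetical names, not the types
from the upgraded dependency, and the `s` and `ms` arms are assumptions
added for completeness.

```rust
/// Hypothetical precision type; not the one used in the codebase.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Precision {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Accept both the long forms (`ns`, `us`) and the short,
/// backwards-compatible forms (`n`, `u`).
fn parse_precision(s: &str) -> Option<Precision> {
    match s {
        "s" => Some(Precision::Second),       // assumed, not stated above
        "ms" => Some(Precision::Millisecond), // assumed, not stated above
        "us" | "u" => Some(Precision::Microsecond),
        "ns" | "n" => Some(Precision::Nanosecond),
        _ => None,
    }
}
```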
Fixes bug in queryable buffer where if a block of data was missing one of the columns defined in a table sort key, the creation of the logical plan to sort and dedupe the data would fail, causing a panic.
Fixes #25670
* feat: add influxdb3_clap_blocks with runtime config
Added a new workspace crate `influxdb3_clap_blocks` which will be a
starting point for adding InfluxDB 3 OSS/Pro specific CLI configuration
that no longer references IOx, and allows for us to trim out unneeded
configurations for the monolithic InfluxDB 3.
Other than changing references from IOX to INFLUXDB3, this makes one
important change: it enables IO on the DataFusion runtime. This, for now,
is an experimental change to see if we can relieve some concurrency
issues that we have been experiencing.
* chore: add observability deps for windows
This changes the code to reference InfluxDB 3 OSS rather than Edge, which
had been its original name when we first started the project. With this
we now have the code reflect what we are actually calling it. On top of
this the long help text has been changed to give advice about how to
actually run the code with the bare minimum set of flags needed, as
`influxdb serve` is no longer a viable command on its own.
Closes #25649
* feat: add startup time to logging output
This change adds a startup time counter to the output when starting up
a server. The main purpose of this is to verify whether changes
actually speed up the loading of the server.
* feat: Significantly decrease startup times for WAL
This commit does a few important things to speed up startup times:
1. We avoid changing an Arc<str> to a String with the series key as the
From<String> impl will call with_column which will then turn it into
an Arc<str> again. Instead we can just call `with_column` directly
and pass in the iterator without also collecting into a Vec<String>
2. We switch to using bitcode as the serialization format for the WAL.
This significantly reduces startup time as this format is faster to
use instead of JSON, which was eating up massive amounts of time.
Part of this change involves not using the tag feature of serde as
it's currently not supported by bitcode
3. We also parallelize reading and deserializing the WAL files before
we then apply them in order. This reduces time waiting on IO and we
eagerly evaluate each spawned task in order as much as possible.
This gives us about a 189% speedup over what we were doing before.
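The third point can be sketched as follows: all loads are started up
front, and results are joined in file order so that application stays
sequential. This uses plain OS threads for illustration, which is an
assumption about the mechanism; `load_wal_file` is a hypothetical
stand-in for reading and deserializing one WAL file.

```rust
use std::thread;

/// Hypothetical stand-in for reading and deserializing one WAL file.
fn load_wal_file(id: u64) -> Vec<String> {
    vec![format!("op-from-file-{id}")]
}

/// Load every file concurrently, then apply the results in file order.
fn replay(file_ids: &[u64]) -> Vec<String> {
    // Spawn all loads up front so IO and decoding overlap.
    let handles: Vec<_> = file_ids
        .iter()
        .map(|&id| thread::spawn(move || load_wal_file(id)))
        .collect();

    // Joining in spawn order keeps application ordered, even if a
    // later file finishes loading before an earlier one.
    let mut applied = Vec::new();
    for handle in handles {
        applied.extend(handle.join().expect("loader thread panicked"));
    }
    applied
}
```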
Closes #25534
In this commit the vec backing the buffer is swapped for an array.
Criterion benchmarks were added to compare the perf and make sure the
change has not made it worse. The vec implementation was removed after
the benchmarks were run locally.
* feat: core metadata cache structs with basic tests
Implement the base MetaCache type that holds the hierarchical structure
of the metadata cache providing methods to create and push rows from the
WAL into the cache.
Added a prune method as well as a method for gathering record batches
from a meta cache. A test was added to check the latter for various
predicates and that the former works, though, pruning shows that we need
to modify how record batches are produced such that expired entries are
not emitted.
* refactor: filter expired entries and do some clean up in the meta cache
* chore: update core deps
- arrow/parquet deps are patched (as in core)
- three specific code changes to cope with changes in core crates
- TransitionPartitionId, use `from_parts` instead of `new`
- arrow buffers can take &[u8] directly without `to_vec()`/`vec!`
(used only in tests)
- `schema` and `influxdb_line_protocol` crates need `v3` feature enabled
* chore: update deny.toml
* chore: formatting and deny toml changes
Unicode-3.0 license is added to the allowed licenses list; without it
we end up with 19 errors (`zerovec`, `zerovec-derive` etc.)
* chore: address PR feedback
- move enabling v3 feature to root Cargo.toml
- added the upstream PR for datafusion-common that introduced RUSTSEC-2024-0384
`cargo deny` was showing that no crate matched the advisory criteria for this [RUSTSEC advisory](https://rustsec.org/advisories/RUSTSEC-2024-0376.html), so this PR removes the ignore entry.
In addition, the `hashbrown` crate was causing a new audit failure, and updating it required that the `Zlib` license be added to our list of allowed licenses.
No issue for this, but it is blocking another PR at the moment (https://github.com/influxdata/influxdb/pull/25515).
- Introduced traits, `ParquetMetrics` and `SystemInfoProvider` to enable
writing easier tests
- Uses mockito for code that depends on reqwest::Client and also uses
mockall to generally mock any traits like `SystemInfoProvider`
- Minor updates to docs
* feat: Add TableId and ColumnId
* feat: swap over to DbId and TableId everywhere
This commit swaps us over to using the DbId and TableId types everywhere
for our internal systems. Anywhere that's external facing, such as names
for last cache tables or line protocol parsing, use names. In these cases
we have the `Catalog` which keeps a map of TableIds and DbIds in a
bidirectional mapping for easy lookup i.e. id <-> names. While in essence
the change itself isn't that complicated given the nature of how much we
depended on names for things, the changes end up being quite invasive and
extensive. Luckily it shouldn't be too hard to review. Note this does
not add the column ids which will be done in a follow up PR.
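For illustration, a bidirectional id <-> name lookup can be kept as two
maps that are updated together. This is a minimal sketch; `TableMap` and
its methods are hypothetical, and the real `Catalog` may store the
mapping differently. The sketch does not clean up stale entries when an
id or name is re-inserted.

```rust
use std::collections::HashMap;
use std::sync::Arc;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct TableId(u32);

/// Hypothetical two-way map between table ids and table names.
#[derive(Default)]
struct TableMap {
    id_to_name: HashMap<TableId, Arc<str>>,
    name_to_id: HashMap<Arc<str>, TableId>,
}

impl TableMap {
    /// Record the pair in both directions, sharing one name allocation.
    fn insert(&mut self, id: TableId, name: &str) {
        let name: Arc<str> = Arc::from(name);
        self.id_to_name.insert(id, Arc::clone(&name));
        self.name_to_id.insert(name, id);
    }

    /// Internal systems hold ids; resolve to a name at the edges.
    fn name(&self, id: TableId) -> Option<&str> {
        self.id_to_name.get(&id).map(|n| &**n)
    }

    /// External input (e.g. line protocol) arrives as names.
    fn id(&self, name: &str) -> Option<TableId> {
        self.name_to_id.get(name).copied()
    }
}
```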
Closes #25375
Closes #25403
Closes #25404
Closes #25405
Closes #25412
Closes #25413
This adds a new crate `influxdb3_test_helpers` which provides two object
store helper types: one that tracks request counts made through the
store, and one that synchronizes requests made through the store.
Part of #25347
This sets up a new implementation of an in-memory parquet file cache in the `influxdb3_write` crate in the `parquet_cache.rs` module.
This module introduces the following types:
* `MemCachedObjectStore` - a wrapper around an `Arc<dyn ObjectStore>` that can serve GET-style requests to the store from an in-memory cache
* `ParquetCacheOracle` - an interface (trait) that can accept requests to create new cache entries in the cache used by the `MemCachedObjectStore`
* `MemCacheOracle` - implementation of the `ParquetCacheOracle` trait
## `MemCachedObjectStore`
This takes inspiration from the [`MemCacheObjectStore` type](1eaa4ed5ea/object_store_mem_cache/src/store.rs (L205-L213)) in core, but has some different semantics around its implementation of the `ObjectStore` trait, and uses a different cache implementation.
The reason for wrapping the object store is that this ensures that any GET-style request being made for a given object is served by the cache, e.g., metadata requests made by DataFusion.
The internal cache comes from the [`clru` crate](https://crates.io/crates/clru), which provides a least-recently used (LRU) cache implementation that allows for weighted entries. The cache is initialized with a capacity and entries are given a weight on insert to the cache that represents how much of the allotted capacity they will take up. If there isn't enough room for a new entry on insert, then the LRU item will be removed.
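To make the weighted-LRU behaviour concrete, here is a standard-library-only sketch of the idea. It does not use `clru` or its API; `WeightedLru` is a hypothetical type, the weight is assumed to be the byte length of the value, and the linear-time recency tracking is for readability rather than performance.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical weighted LRU: each entry consumes `weight` units of capacity.
struct WeightedLru {
    capacity: usize,
    used: usize,
    entries: HashMap<String, (Vec<u8>, usize)>,
    order: VecDeque<String>, // front = least recently used
}

impl WeightedLru {
    fn new(capacity: usize) -> Self {
        Self { capacity, used: 0, entries: HashMap::new(), order: VecDeque::new() }
    }

    /// A get updates recency, which is why it needs `&mut self`.
    fn get(&mut self, key: &str) -> Option<&[u8]> {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            let k = self.order.remove(pos).unwrap();
            self.order.push_back(k);
        }
        self.entries.get(key).map(|(v, _)| v.as_slice())
    }

    /// Insert a value, evicting least recently used entries until it fits.
    /// Returns false if the value can never fit.
    fn put(&mut self, key: &str, value: Vec<u8>) -> bool {
        let weight = value.len();
        if weight > self.capacity {
            return false;
        }
        // Replace any existing entry for this key.
        if let Some((_, old)) = self.entries.remove(key) {
            self.used -= old;
            self.order.retain(|k| k != key);
        }
        while self.used + weight > self.capacity {
            let Some(lru) = self.order.pop_front() else { break };
            if let Some((_, w)) = self.entries.remove(&lru) {
                self.used -= w;
            }
        }
        self.used += weight;
        self.entries.insert(key.to_string(), (value, weight));
        self.order.push_back(key.to_string());
        true
    }
}
```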
### Limitations of `clru`
The `clru` crate conveniently gives us an LRU eviction policy but its API may put some limitations on the system:
* gets to the cache require an `&mut` reference, which means that the cache needs to be behind a `Mutex`. If this slows down requests through the object store, then we may need to explore alternatives.
* we may want more sophisticated eviction policies than a straight LRU, i.e., to favour certain tables over others, or files that represent recent data over those that represent old data.
## `ParquetCacheOracle` / `MemCacheOracle`
The cache oracle is responsible for handling cache requests, i.e., to fetch an item and store it in the cache. In this PR, the oracle runs a background task to handle these requests. I defined this as a trait/struct pair since the implementation may look different in Pro vs. OSS.
* feat: Remove lock for FileId tests
Since we now are using cargo-nextest in CI we can remove
the locks used in the FileId tests to make sure that we
have no race conditions
* feat: Add u32 ID for Databases
This commit adds a new DbId for databases. It also updates paths to use
that id as part of the name. When starting up the WriteBuffer we apply
the DbId from the persisted snapshot much like we do for ParquetFileId's
This introduces the influxdb3_id crate to avoid circular deps with ids.
The ParquetFileId should also be moved into this crate, but it's
outside the scope of this change.
Closes #25301
- uses Arc<str> to represent create once and read everywhere type
of string
- updated snapshots for insta asserts, uses redaction to hardcode
randomly generated UUID strings
- added methods to catalog to expose instance and host ids
Closes: https://github.com/influxdata/influxdb/issues/25315
This extends the system tables available with a new `parquet_files` table
which will list the parquet files associated with a given table in a
database.
Queries to system.parquet_files must provide a table_name predicate to
specify the table name of interest.
The files are accessed through the QueryableBuffer.
In addition, a test was added to check success and failure modes of the
new system table query.
Finally, the Persister trait had its associated error type removed. This
was somewhat of a consequence of how I initially implemented this change,
but I felt it cleaned the code up a bit, so I kept it in the commit.
* refactor: Move Catalog into influxdb3_catalog crate
This moves the catalog and its serialization logic into its own crate. This is a precursor to recording more catalog modifications into the WAL.
Fixes #25204
* fix: cargo update
* fix: add version = 2 to deny.toml
* fix: update deny.toml
* fix: add CC0 to deny.toml
* feat: refactor WAL and WriteBuffer
There is a ton going on here, but here are the high level things. This implements a new WAL, which is backed entirely by object store. It then updates the WriteBuffer to be able to work with how the new WAL works, which also required an update to how the Catalog is modified and persisted.
The concept of Segments has been removed. Previously there was a separate WAL per segment of time. Instead, there is now a single WAL that all writes and updates flow into. Data within the write buffer is organized by Chunk(s) within tables, which is based on the timestamp of the row data. These are known as the Level0 files, which will be persisted as Parquet into object store. The default chunk duration for level 0 files is 10 minutes.
The WAL is written as single files that get created at the configured WAL flush interval (1s by default). After a certain number of files have been created, the server will attempt to snapshot the WAL (default is to snapshot the first 600 files of the WAL after we have 900 total, i.e. snapshot 10 minutes of WAL data).
The design goal with this is to persist 10 minute chunks of data that are no longer receiving writes, while clearing out old WAL files. This works if data is being written around "now" with no more than 5 minutes of delay. If we continue to have delayed writes, a snapshot of all data will be forced in order to clear out the WAL and free up memory in the buffer.
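The default snapshot arithmetic can be written out as a small sketch. The constant and function names are hypothetical, and the `>=` comparison is an assumption about how "after we have 900 total" is evaluated; the forced snapshot for delayed writes is not modelled.

```rust
/// Defaults described above; the names are hypothetical.
const FLUSH_INTERVAL_SECS: u64 = 1;
const SNAPSHOT_SIZE: usize = 600;
const SNAPSHOT_THRESHOLD: usize = 900;

/// Returns how many of the oldest WAL files to snapshot, if any.
fn files_to_snapshot(wal_file_count: usize) -> Option<usize> {
    (wal_file_count >= SNAPSHOT_THRESHOLD).then_some(SNAPSHOT_SIZE)
}

/// Wall-clock seconds covered by one snapshot at the default flush interval.
fn snapshot_span_secs() -> u64 {
    SNAPSHOT_SIZE as u64 * FLUSH_INTERVAL_SECS
}
```

At the defaults, one snapshot covers 600 files of 1 second each, i.e. 600 seconds or 10 minutes of WAL data, leaving the newest 300 files (5 minutes) in place.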
Overall, this structure of a single WAL, with flushes and snapshots and chunks in the queryable buffer, led to a simpler setup for the write buffer. I was able to clear out quite a bit of code related to the old segment organization.
Fixes #25142 and fixes #25173
* refactor: address PR feedback
* refactor: wal to replay and background flush on new
* chore: remove stray println
This commit updates us to rustc 1.80. There are three significant changes
here:
1. LazyLock and LazyCell have been stabilized meaning we can replace our
usage of Lazy from the once_cell crate with the std lib versions
2. Lints were added to handle unknown cfg directives. `tokio_unstable`
is affected by this, and while we do have the flags in our
.cargo/config.toml, Cargo still outputs a lint for it, so we suppress
that warning now in our Cargo.toml for the workspace
3. clippy now throws a new warning about priority levels for lints. It's
quite frankly a thing that doesn't make sense to me and should be
something cargo fixes, but here we are.
Besides that it was a painless upgrade and now we're on the latest and
greatest.
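As an example of the first point, a lazily initialized static that
previously used `once_cell::sync::Lazy` can use `std::sync::LazyLock`
(stable since Rust 1.80). The `DEFAULTS` map and its key are made up for
illustration and do not come from the codebase.

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

// Before: static DEFAULTS: once_cell::sync::Lazy<...> = Lazy::new(|| ...);
// After: the std lib equivalent, initialized on first access.
static DEFAULTS: LazyLock<HashMap<&'static str, u32>> = LazyLock::new(|| {
    let mut m = HashMap::new();
    m.insert("example_setting", 1); // hypothetical entry
    m
});

/// Look up a default; the first call triggers initialization.
fn default_for(key: &str) -> Option<u32> {
    DEFAULTS.get(key).copied()
}
```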
Part of #25067
Changes in this PR:
- Addition of a PROFILING.md file, which briefly outlines how to build the influxdb3 binary in preparation for profiling and explains usage of macOS's Instruments tool
- Addition of a quick-bench profile, which extends the already existing quick-release profile with debuginfo turned on
Closes #25096
- Adds a new HTTP API that allows the creation of a last cache, see the issue for details
- An E2E test was added to check success/failure behaviour of the API
- Adds the mime crate, for parsing request MIME types, but this is only used in the code I added - we may adopt it in other APIs / parts of the HTTP server in future PRs
* feat: base for last cache implementation
Each last cache holds a ring buffer for each column in an index map, which
preserves the insertion order for faster record batch production.
The ring buffer uses a custom type to handle the different supported
data types that we can have in the system.
* feat: implement last cache provider
LastCacheProvider is the API used to create last caches and write
table batches to them. It uses a two-layer RwLock/HashMap: the first for
the database, and the second layer for the table within the database.
This allows for table-level locks when writing in buffered data, and only
gets a database-level lock when creating a cache (and in future, when
removing them as well).
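A minimal sketch of the two-layer locking described above, using only
the standard library. `Provider` and its methods are hypothetical, and a
`Vec<i64>` stands in for the actual cache type.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical provider: outer lock per database map, inner lock per
/// table map within a database.
#[derive(Default)]
struct Provider {
    dbs: RwLock<HashMap<String, RwLock<HashMap<String, Vec<i64>>>>>,
}

impl Provider {
    /// Creating a cache takes the database-level write lock.
    fn create_cache(&self, db: &str, table: &str) {
        let mut dbs = self.dbs.write().unwrap();
        let tables = dbs.entry(db.to_string()).or_default();
        tables.write().unwrap().entry(table.to_string()).or_default();
    }

    /// Writing only needs a database-level read lock plus the
    /// table-level write lock for that database.
    fn write(&self, db: &str, table: &str, value: i64) -> bool {
        let dbs = self.dbs.read().unwrap();
        let Some(tables) = dbs.get(db) else { return false };
        let mut tables = tables.write().unwrap();
        match tables.get_mut(table) {
            Some(cache) => {
                cache.push(value);
                true
            }
            None => false,
        }
    }

    /// Number of values buffered for a table's cache, if it exists.
    fn cached_len(&self, db: &str, table: &str) -> Option<usize> {
        let dbs = self.dbs.read().unwrap();
        let tables = dbs.get(db)?.read().unwrap();
        let n = tables.get(table).map(|c| c.len());
        n
    }
}
```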
* test: APIs on write buffer and test for last cache
Added basic APIs on the write buffer to access the last cache and then a
test to the last_cache module to see that it works with a simple example
* docs: add some doc comments to last_cache
* chore: clippy
* chore: one small comment on IndexMap
* chore: clean up some stale comments
* refactor: part of PR feedback
Addressed three parts of PR feedback:
1. Remove double-lock on cache map
2. Re-order the get when writing to the cache to be outside the loop
3. Move the time check into the cache itself
* refactor: nest cache by key columns
This refactors the last cache to use a nested caching structure, where
the key columns for a given cache are used to create a hierarchy of
nested maps, terminating in the actual store for the values in the cache.
Access to the cache is done via a set of predicates which can optionally
specify the key column values at any level in the cache hierarchy to only
gather record batches from children of that node in the cache.
Some todos:
- Need to handle the TTL
- Need to move the TableProvider impl up to the LastCache type
* refactor: TableProvider impl to LastCache
This re-writes the datafusion TableProvider implementation on the correct
type, i.e., the LastCache, and adds conversion from the filter Expr's to
the Predicate type for the cache.
* feat: support TTL in last cache
Last caches will have expired entries walked when writes come in.
* refactor: add panic when unexpected predicate used
* refactor: small naming convention change
* refactor: include keys in query results and no null keys
Changed key columns so that they do not accept null values, i.e., rows
that are pushed that are missing key column values will be ignored.
When producing record batches for a cache, if not all key columns are
used in the predicate, then this change makes it so that the non-predicate
key columns are produced as columns in the outputted record batches.
A test with a few cases showing this was added.
* fix: last cache key column query output
Ensure key columns in the last cache that are not included in the
predicate are emitted in the RecordBatches as a column.
Cleaned up and added comments to the new test.
* chore: clippy and some un-needed code
* fix: clean up some logic errors in last_cache
* test: add tests for non default cache size and TTL
Added two tests, as per commit title. Also moved the eviction process
to a separate function so that it was not being done on every write to
the cache, which could be expensive, and this ensures that entries are
evicted regardless of whether writes are coming in or not.
* test: add invalid predicate test cases to last_cache
* test: last_cache with field key columns
* test: last_cache uses series key for default keys
* test: last_cache uses tag set as default keys
* docs: add doc comments to last_cache
* fix: logic error in last cache creation
CacheAlreadyExists errors were only being based on the database and
table names, and not including the cache names, which was not
correct.
* docs: add some comments to last cache create fn
* feat: support null values in last cache
This also adds explicit support for series key columns to distinguish
them from normal tags in terms of nullability
A test was added to check nulls work
* fix: reset last cache last time when ttl evicts all data
Introduce the experimental series key feature to monolith, along with the new `/api/v3/write` API which accepts the new line protocol to write to tables containing a series key.
Series key
* The series key is supported in the `schema::Schema` type by the addition of a metadata entry that stores the series key members in their correct order. Writes that are received to `v3` tables must have the same series key for every single write.
Series key columns are `NOT NULL`
* Nullability of columns is enforced in the core `schema` crate based on a column's membership in the series key. So, when building a `schema::Schema` using `schema::SchemaBuilder`, the arrow `Field`s that are injected into the schema will have `nullable` set to false for columns that are part of the series key, as well as the `time` column.
* The `NOT NULL` _constraint_, if you can call it that, is enforced in the buffer (see [here](https://github.com/influxdata/influxdb/pull/25066/files#diff-d70ef3dece149f3742ff6e164af17f6601c5a7818e31b0e3b27c3f83dcd7f199R102-R119)) by ensuring there are no gaps in data buffered for series key columns.
Series key columns are still tags
* Columns in the series key are annotated as tags in the arrow schema, which for now means that they are stored as Dictionaries. This was done to avoid having to support a new column type for series key columns.
New write API
* This PR introduces the new write API, `/api/v3/write`, which accepts the new `v3` line protocol. Currently, the only part of the new line protocol proposed in https://github.com/influxdata/influxdb/issues/24979 that is supported is the series key. New data types are not yet supported for fields.
Split write paths
* To support the existing write path alongside the new write path, a new module was set up to perform validation in the `influxdb3_write` crate (`write_buffer/validator.rs`). This re-uses the existing write validation logic, and replicates it with needed changes for the new API. I refactored the validation code to use a state machine over a series of nested function calls to help distinguish the fallible validation/update steps from the infallible conversion steps.
* The code in that module could potentially be refactored to reduce code duplication.
Remove reliance on data_types::ColumnType
Introduce TableSnapshot for serializing table information in the catalog.
Remove the columns BTree from the TableDefinition and use the schema
directly. BTrees are still used to ensure column ordering when tables are
created, or columns added to existing tables.
The custom Deserialize impl on TableDefinition used to block duplicate
column definitions in the serialized data. This preserves that behaviour
using serde_with and extends it to the other types in the catalog, namely
InnerCatalog and DatabaseSchema.
The serialization test for the catalog was extended to include multiple
tables in a database and multiple columns spanning the range of available
types in each table.
Snapshot testing was introduced using the insta crate to check the
serialized JSON form of the catalog, and help catch breaking changes
when introducing features to the catalog.
Added a test that verifies the no-duplicate key rules when deserializing
the map components in the Catalog
Introduction of the `TokioDatafusionConfig` clap block for configuring the DataFusion runtime - this exposes many new `--datafusion-*` options on start, including `--datafusion-num-threads`
To accommodate renaming of `QueryNamespaceProvider` to `QueryDatabase` in `influxdb3_core`, I renamed the `QueryDatabase` type to `Database`.
Fixed tests that broke as a result of sync.
For releases we need to have Docker images and binary images available for the
user to actually run influxdb3. These CI changes will build the binaries on a
release tag and the Docker image as well, test, sign, and publish them and make
them available for download.
Co-Authored-By: Brandon Pfeifer <bpfeifer@influxdata.com>
* feat: report system stats in load generator
Added the mechanism to report system stats during load generation. The
following stats are saved in a CSV file:
- cpu_usage
- disk_written_bytes
- disk_read_bytes
- memory
- virtual_memory
This only works when running the load generator against a local instance
of influxdb3, i.e., one that is running on your machine.
Generating system stats is done by passing the --system-stats flag to the
load generator.
* feat: /ping API to serve version
The /ping API was added, which is served at GET and
POST methods. The API responds with a JSON body
containing the version and revision of the build.
A new crate was added, influxdb3_process, which
takes the process_info.rs module from the influxdb3
crate, and puts it in a separate crate so that other
crates (influxdb3_server) can depend on it. This was
needed in order to have access to the version and
revision values, which are generated at build time,
in the HTTP API code of influxdb3_server.
An E2E test was added to check that /ping works.
E2E TestServer can now have logs emitted using the
TEST_LOG environment variable.
* feat: initial load generator implementation
This adds a load generator as a new crate. Initially it only generates write load, but the scaffolding is there to add a query load generator to complement the write load tool.
This could have been added as a subcommand to the influxdb3 program, but I thought it best to have it separate for now.
It's fairly light on tests and error handling given it's an internal tooling CLI. I've added only something very basic to test the line protocol generation and run the actual write command by hand.
I included pretty detailed instructions and some runnable examples.
* refactor: address PR feedback
feat: support the v1 query API
This PR adds support for the `/api/v1/query` API, which is meant to
serve the original InfluxDB v1 query API, to serve single statement
`SELECT` and `SHOW` queries. The response, which is returned as JSON,
can be chunked via the `chunked` and optional `chunk_size` parameters.
An optional `epoch` parameter can be supplied to have `time` column
timestamps converted to a UNIX epoch with the given precision.
## Buffering
The response is buffered by default, i.e., if the `chunked` parameter
is not supplied, or is passed as `false`, then the entire query
result will be buffered into memory before being returned in the
response. This is how the original API behaves, so we are replicating
that here.
When `chunked` is passed as `true`, then the response will be a
stream of chunks, where each chunk is a self-contained response,
with the same structure as that of the non-chunked response. Chunks
are split up by the provided `chunk_size`, or by series, i.e.,
measurement, whichever comes first. The default chunk size is 10,000
rows.
Buffering is implemented with the `QueryResponseStream` and
`ChunkBuffer` types, the former implements the `Stream` trait,
which allows it to be streamed in the HTTP response directly with
`hyper`'s `Body::wrap_stream`. The `QueryResponseStream` is a wrapper
around the inner arrow `RecordBatchStream`, which buffers the
streamed `RecordBatch`es according to the requested chunking parameters.
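The chunking rule (split on `chunk_size` or on a change of measurement,
whichever comes first) can be sketched over plain rows. This is not the
`QueryResponseStream` or `ChunkBuffer` implementation; `Row` and
`chunk_rows` are hypothetical, and rows are assumed to arrive already
grouped by measurement.

```rust
/// Hypothetical row: (measurement name, value).
type Row = (&'static str, i64);

/// Split rows into chunks, starting a new chunk when the current one is
/// full or when the measurement changes.
fn chunk_rows(rows: &[Row], chunk_size: usize) -> Vec<Vec<Row>> {
    let mut chunks: Vec<Vec<Row>> = Vec::new();
    for &row in rows {
        let start_new = match chunks.last() {
            None => true,
            // Every chunk holds at least one row, so `cur[0]` is safe.
            Some(cur) => cur.len() >= chunk_size || cur[0].0 != row.0,
        };
        if start_new {
            chunks.push(Vec::new());
        }
        chunks.last_mut().unwrap().push(row);
    }
    chunks
}
```

For example, three `cpu` rows followed by one `mem` row with a chunk size
of 2 produce three chunks: two `cpu` rows, one `cpu` row, one `mem` row.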
## Testing
Two new E2E tests were added to test basic query functionality and
chunking behaviour, respectively. In addition, some manual testing
was done to verify that the InfluxDB Grafana plugin works with this
API.
This changes the 'influxdb3 create token' command so that it will just
automatically generate a completely random base64 encoded token prepended with
'apiv3_' that is then fed into a Sha512 algorithm instead of Sha256. The
user can no longer pass in a token to be turned into the proper output.
This also changes the server code to handle the change to Sha512 as well.
Closes #24704
feat: support SHOW RETENTION POLICIES
Added support through the influxdb3 Query Executor to perform
SHOW RETENTION POLICIES queries, both on a specific database as well
as across all databases.
Test cases were added to check this functionality.
feat: add query_influxql api
This PR adds support for the /api/v3/query_influxql API. This re-uses code from the existing query_sql API, but some refactoring was done to allow for code re-use between the two.
The main change to the original code from the existing query_sql API was that the format is determined up front, in the event that the user provides some incorrect Accept header, so that the 400 BAD REQUEST is returned before performing the query.
Support of several InfluxQL queries that previously required a bridge to be executed in 3.0 was added:
- SHOW MEASUREMENTS
- SHOW TAG KEYS
- SHOW TAG VALUES
- SHOW FIELD KEYS
- SHOW DATABASES
- Handling of qualified measurement names in SELECT queries (see below)
This is accomplished with the newly added iox_query_influxql_rewrite crate, which provides the means to re-write an InfluxQL statement to strip out a database name and retention policy, if provided. Doing so allows the query_influxql API to have the database parameter optional, as it may be provided in the query string.
Handling qualified measurement names in SELECT
The implementation in this PR will inspect all measurements provided in a FROM clause and extract the database (DB) name and retention policy (RP) name (if not the default). If multiple DB/RP's are provided, an error is thrown.
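The check described above can be sketched as follows. This does not use the iox_query_influxql_rewrite crate; `single_qualifier` is a hypothetical function, each FROM measurement is reduced to an optional (database, retention policy) pair, and unqualified measurements are assumed to be ignored by the check.

```rust
/// Each FROM measurement, reduced to its optional (db, rp) qualifier.
type Qualifier<'a> = (Option<&'a str>, Option<&'a str>);

/// Return the single DB/RP qualifier used in the FROM clause, `None` if
/// no measurement is qualified, or an error if qualifiers conflict.
fn single_qualifier<'a>(from: &[Qualifier<'a>]) -> Result<Option<Qualifier<'a>>, String> {
    let mut found: Option<Qualifier<'a>> = None;
    for &(db, rp) in from {
        // Unqualified measurements do not constrain the result (assumption).
        if db.is_none() && rp.is_none() {
            continue;
        }
        match found {
            None => found = Some((db, rp)),
            Some(prev) if prev == (db, rp) => {}
            Some(_) => {
                return Err("multiple databases or retention policies in FROM clause".to_string())
            }
        }
    }
    Ok(found)
}
```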
Testing
E2E tests were added for performing basic queries against a running server on both the query_sql and query_influxql APIs. In addition, the test for query_influxql includes some of the InfluxQL-specific queries, e.g., SHOW MEASUREMENTS.
Other Changes
The influxdb3_client now has the api_v3_query_influxql method (and a basic test was added for this)