[package]
name = "influxdb3_write"
version.workspace = true
authors.workspace = true
edition.workspace = true
license.workspace = true
[dependencies]
# Core Crates
data_types.workspace = true
datafusion_util.workspace = true
influxdb-line-protocol.workspace = true
iox_catalog.workspace = true
iox_http.workspace = true
iox_query.workspace = true
iox_time.workspace = true
parquet_file.workspace = true
observability_deps.workspace = true
schema.workspace = true
refactor: implement new WAL and refactor write buffer (#25196)
* feat: refactor WAL and WriteBuffer
There is a lot going on here; these are the high-level changes. This implements a new WAL, backed entirely by object store, and updates the WriteBuffer to work with it, which also required changing how the Catalog is modified and persisted.
The concept of Segments has been removed. Previously there was a separate WAL per segment of time; now there is a single WAL that all writes and updates flow into. Data within the write buffer is organized into Chunks within tables, based on the timestamp of the row data. These are the Level 0 files, which will be persisted as Parquet into object store. The default chunk duration for Level 0 files is 10 minutes.
The WAL is written as individual files created at the configured WAL flush interval (1s by default). After a certain number of files have accumulated, the server attempts to snapshot the WAL (the default is to snapshot the first 600 WAL files once there are 900 in total, i.e., snapshot 10 minutes of WAL data). A sketch of this trigger logic follows below.
The design goal is to persist 10-minute chunks of data that are no longer receiving writes while clearing out old WAL files. This works when data is written around "now" with no more than 5 minutes of delay. If delayed writes continue, a snapshot of all data is forced in order to clear out the WAL and free up memory in the buffer.
Overall, this structure of a single WAL, with flushes, snapshots, and chunks in the queryable buffer, led to a simpler write buffer, and quite a bit of code related to the old segment organization was removed.
Fixes #25142 and fixes #25173
* refactor: address PR feedback
* refactor: wal to replay and background flush on new
* chore: remove stray println
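A minimal sketch of the snapshot trigger described above, using the default thresholds; `SnapshotPlan` and `maybe_plan_snapshot` are illustrative names, not the actual influxdb3_wal API.

```rust
/// Hypothetical illustration of the snapshot policy described above; names and
/// field layout are assumptions, not the crate's actual types.
const SNAPSHOT_TRIGGER_TOTAL: usize = 900; // start a snapshot once this many WAL files exist
const SNAPSHOT_FILE_COUNT: usize = 600; // snapshot (and then drop) this many of the oldest files

#[derive(Debug)]
struct SnapshotPlan {
    /// WAL file sequence numbers whose contents should be persisted as parquet chunks
    files_to_persist: Vec<u64>,
}

/// Decide whether a WAL flush should also kick off a snapshot.
fn maybe_plan_snapshot(wal_file_sequence_numbers: &[u64], force: bool) -> Option<SnapshotPlan> {
    if force {
        // Delayed writes kept data from aging out: snapshot everything to free the buffer.
        return Some(SnapshotPlan {
            files_to_persist: wal_file_sequence_numbers.to_vec(),
        });
    }
    if wal_file_sequence_numbers.len() >= SNAPSHOT_TRIGGER_TOTAL {
        // Persist the oldest ~10 minutes of WAL data; newer files stay in place.
        return Some(SnapshotPlan {
            files_to_persist: wal_file_sequence_numbers[..SNAPSHOT_FILE_COUNT].to_vec(),
        });
    }
    None
}

fn main() {
    let files: Vec<u64> = (0..900).collect();
    let plan = maybe_plan_snapshot(&files, false).expect("900 files should trigger a snapshot");
    assert_eq!(plan.files_to_persist.len(), 600);
}
```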
# Local deps
influxdb3_cache = { path = "../influxdb3_cache" }
influxdb3_catalog = { path = "../influxdb3_catalog" }
influxdb3_id = { path = "../influxdb3_id" }
influxdb3_test_helpers = { path = "../influxdb3_test_helpers" }
influxdb3_wal = { path = "../influxdb3_wal" }
influxdb3_telemetry = { path = "../influxdb3_telemetry" }
# crates.io dependencies
anyhow.workspace = true
arrow.workspace = true
async-trait.workspace = true
byteorder.workspace = true
bytes.workspace = true
bimap.workspace = true
chrono.workspace = true
crc32fast.workspace = true
crossbeam-channel.workspace = true
dashmap.workspace = true
datafusion.workspace = true
feat: memory-cached object store for parquet files (#25377)
Part of #25347
This sets up a new implementation of an in-memory parquet file cache in the `influxdb3_write` crate, in the `parquet_cache.rs` module.
This module introduces the following types:
* `MemCachedObjectStore` - a wrapper around an `Arc<dyn ObjectStore>` that can serve GET-style requests to the store from an in-memory cache
* `ParquetCacheOracle` - an interface (trait) that can accept requests to create new cache entries in the cache used by the `MemCachedObjectStore`
* `MemCacheOracle` - implementation of the `ParquetCacheOracle` trait
## `MemCachedObjectStore`
This takes inspiration from the [`MemCacheObjectStore` type](https://github.com/influxdata/influxdb3_core/blob/1eaa4ed5ea147bc24db98d9686e457c124dfd5b7/object_store_mem_cache/src/store.rs#L205-L213) in core, but it has different semantics around its implementation of the `ObjectStore` trait and uses a different cache implementation.
The reason for wrapping the object store is to ensure that any GET-style request for a given object is served by the cache, e.g., metadata requests made by DataFusion. A simplified sketch of this wrapper appears at the end of these notes.
The internal cache comes from the [`clru` crate](https://crates.io/crates/clru), which provides a least-recently-used (LRU) cache implementation that allows for weighted entries. The cache is initialized with a capacity, and each entry is given a weight on insert that represents how much of the allotted capacity it takes up. If there isn't enough room for a new entry on insert, the least-recently-used items are removed.
### Limitations of `clru`
The `clru` crate conveniently gives us an LRU eviction policy, but its API may put some limitations on the system:
* gets from the cache require an `&mut` reference, which means the cache needs to sit behind a `Mutex`. If this slows down requests through the object store, we may need to explore alternatives.
* we may want more sophisticated eviction policies than a straight LRU, e.g., favouring certain tables over others, or files that hold recent data over those that hold old data.
## `ParquetCacheOracle` / `MemCacheOracle`
The cache oracle is responsible for handling cache requests, i.e., fetching an item and storing it in the cache. In this PR, the oracle runs a background task to handle these requests. It is defined as a trait/struct pair since the implementation may look different in Pro vs. OSS.
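A minimal sketch of the wrapper idea described above, assuming clru's basic `CLruCache::new`/`get`/`put` API (the real cache weights entries by size) and skipping the full `ObjectStore` trait implementation; `CachedStore` and `get_bytes` are illustrative names, not the types added by this PR.

```rust
use std::{num::NonZeroUsize, sync::Arc};

use bytes::Bytes;
use clru::CLruCache;
use object_store::{path::Path, ObjectStore};
use parking_lot::Mutex;

/// Illustrative wrapper: serve GET-style requests from an in-memory LRU cache,
/// falling back to the inner object store on a miss. (Sketch only; the real
/// MemCachedObjectStore implements the whole ObjectStore trait.)
struct CachedStore {
    inner: Arc<dyn ObjectStore>,
    // clru's `get` needs `&mut self`, hence the Mutex noted in the limitations above.
    cache: Mutex<CLruCache<Path, Bytes>>,
}

impl CachedStore {
    fn new(inner: Arc<dyn ObjectStore>, max_entries: NonZeroUsize) -> Self {
        Self {
            inner,
            cache: Mutex::new(CLruCache::new(max_entries)),
        }
    }

    async fn get_bytes(&self, location: &Path) -> object_store::Result<Bytes> {
        if let Some(bytes) = self.cache.lock().get(location) {
            return Ok(bytes.clone()); // cache hit: no request to the underlying store
        }
        let bytes = self.inner.get(location).await?.bytes().await?;
        self.cache.lock().put(location.clone(), bytes.clone());
        Ok(bytes)
    }
}
```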
futures.workspace = true
futures-util.workspace = true
hashbrown.workspace = true
hex.workspace = true
feat: last cache implementation (#25109)
* feat: base for last cache implementation
Each last cache holds a ring buffer for each column in an index map, which preserves insertion order for faster record batch production.
The ring buffer uses a custom type to handle the different data types supported by the system. A simplified sketch of the overall cache shape appears at the end of these notes.
* feat: implement last cache provider
LastCacheProvider is the API used to create last caches and write
table batches to them. It uses a two-layer RwLock/HashMap: the first for
the database, and the second layer for the table within the database.
This allows table-level locks to be taken when writing buffered data in, while a database-level lock is only taken when creating a cache (and, in future, when removing one).
* test: APIs on write buffer and test for last cache
Added basic APIs on the write buffer to access the last cache, plus a test in the last_cache module showing that it works with a simple example.
* docs: add some doc comments to last_cache
* chore: clippy
* chore: one small comment on IndexMap
* chore: clean up some stale comments
* refactor: part of PR feedback
Addressed three parts of PR feedback:
1. Remove double-lock on cache map
2. Re-order the get when writing to the cache to be outside the loop
3. Move the time check into the cache itself
* refactor: nest cache by key columns
This refactors the last cache to use a nested caching structure, where
the key columns for a given cache are used to create a hierarchy of
nested maps, terminating in the actual store for the values in the cache.
Access to the cache is done via a set of predicates which can optionally
specify the key column values at any level in the cache hierarchy to only
gather record batches from children of that node in the cache.
Some todos:
- Need to handle the TTL
- Need to move the TableProvider impl up to the LastCache type
* refactor: TableProvider impl to LastCache
This re-implements the DataFusion TableProvider on the correct type, i.e., LastCache, and adds conversion from the filter Exprs to the cache's Predicate type.
* feat: support TTL in last cache
Expired entries in a last cache are walked and evicted when writes come in.
* refactor: add panic when unexpected predicate used
* refactor: small naming convention change
* refactor: include keys in query results and no null keys
Changed key columns so that they do not accept null values, i.e., rows written with missing key column values are ignored.
When producing record batches for a cache, if not all key columns are used in the predicate, the non-predicate key columns are now emitted as columns in the output record batches.
A test covering a few of these cases was added.
* fix: last cache key column query output
Ensure key columns in the last cache that are not included in the
predicate are emitted in the RecordBatches as a column.
Cleaned up and added comments to the new test.
* chore: clippy and some un-needed code
* fix: clean up some logic errors in last_cache
* test: add tests for non default cache size and TTL
Added two tests, as per the commit title. Also moved the eviction process into a separate function so that it is not run on every write to the cache, which could be expensive; this also ensures that entries are evicted regardless of whether writes are coming in.
* test: add invalid predicate test cases to last_cache
* test: last_cache with field key columns
* test: last_cache uses series key for default keys
* test: last_cache uses tag set as default keys
* docs: add doc comments to last_cache
* fix: logic error in last cache creation
CacheAlreadyExists errors were based only on the database and table names and did not take the cache name into account, which was incorrect.
* docs: add some comments to last cache create fn
* feat: support null values in last cache
This also adds explicit support for series key columns, to distinguish them from normal tags in terms of nullability. A test was added to check that nulls work.
* fix: reset last cache last time when ttl evicts all data
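A heavily simplified sketch of the cache shape described above, using a flat map from key-column values to a bounded, most-recent-first buffer with TTL eviction; all names here (`LastCacheSketch`, `push`, `evict_expired`, `rows`) are illustrative, and the real cache nests one map level per key column and stores columnar data for record batch production.

```rust
use std::{
    collections::{HashMap, VecDeque},
    time::{Duration, Instant},
};

/// Illustrative last-value cache for a single table: key column values map to a
/// bounded, most-recent-first buffer of rows. Row representation is an assumption.
struct LastCacheSketch {
    count: usize,
    ttl: Duration,
    store: HashMap<Vec<String>, VecDeque<(Instant, Vec<String>)>>,
}

impl LastCacheSketch {
    fn new(count: usize, ttl: Duration) -> Self {
        Self { count, ttl, store: HashMap::new() }
    }

    /// Push a row; rows missing a key column value would be ignored by the real cache.
    fn push(&mut self, key: Vec<String>, row: Vec<String>) {
        let buf = self.store.entry(key).or_default();
        buf.push_front((Instant::now(), row));
        buf.truncate(self.count); // ring-buffer behaviour: keep only the newest `count` rows
    }

    /// Walk the cache and drop entries older than the TTL (done periodically,
    /// not on every write, per the commit notes above).
    fn evict_expired(&mut self) {
        let now = Instant::now();
        for buf in self.store.values_mut() {
            buf.retain(|(t, _)| now.duration_since(*t) <= self.ttl);
        }
        self.store.retain(|_, buf| !buf.is_empty());
    }

    /// Gather rows for an exact match on all key column values.
    fn rows(&self, key: &[String]) -> Vec<&Vec<String>> {
        self.store
            .get(key)
            .map(|buf| buf.iter().map(|(_, row)| row).collect())
            .unwrap_or_default()
    }
}

fn main() {
    let key = vec!["us-east".to_string()];
    let mut cache = LastCacheSketch::new(1, Duration::from_secs(4 * 3600));
    cache.push(key.clone(), vec!["host-a".to_string(), "0.5".to_string()]);
    cache.push(key.clone(), vec!["host-a".to_string(), "0.7".to_string()]);
    cache.evict_expired();
    assert_eq!(cache.rows(&key).len(), 1); // count = 1, so only the latest row is kept
}
```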
indexmap.workspace = true
object_store.workspace = true
parking_lot.workspace = true
parquet.workspace = true
serde.workspace = true
serde_json.workspace = true
serde_with.workspace = true
sha2.workspace = true
snap.workspace = true
thiserror.workspace = true
tokio.workspace = true
url.workspace = true
uuid.workspace = true
feat: add basic WAL implementation for Edge (#24570)
* feat: add basic wal implementation for Edge
This WAL implementation re-uses some of the code from the wal crate, but departs from it significantly in many ways. For now it uses simple JSON encoding for the serialized ops; we may want to switch that to Protobuf at some point in the future. This version of the WAL doesn't have its own buffering. That will be implemented higher up in the BufferImpl, which will use the WAL and SegmentWriter to make data in the buffer durable.
The write flow is: writes come into the buffer and are validated against (and update) an in-memory Catalog. Once validated, writes are buffered up in memory and then flushed into the WAL periodically (likely every 10-20 ms). After being flushed to the WAL, the entire batch of writes is put into the in-memory queryable buffer, and responses are then sent back to the clients. This should considerably reduce write-lock pressure on the in-memory buffer.
In this PR:
- Update the Wal, WalSegmentWriter, and WalSegmentReader traits to line up with the new design/understanding
- Implement wal (mainly just a way to identify segment files in a directory)
- Implement WalSegmentWriter (write header, op batch with CRC, and track sequence number in segment; re-open existing file); a sketch of this framing appears below
- Implement WalSegmentReader
* refactor: make Wal return impl reader/writer
* refactor: clean up wal segment open
* fix: WriteBuffer and Wal usage
Turn wal and write buffer references into concrete types, rather than dyn.
* fix: have wal loading ignore invalid files
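A minimal sketch of the op-batch framing hinted at above, assuming a simple [length | crc32 | JSON payload] layout built from the crc32fast, byteorder, and serde_json dependencies listed here; the actual file header format, sequence-number tracking, and types in the WAL differ.

```rust
use std::io::Write;

use byteorder::{BigEndian, WriteBytesExt};
use crc32fast::Hasher;
use serde::Serialize;

/// Hypothetical frame per op batch: u32 payload length, u32 crc32 of the payload,
/// then the JSON-encoded batch. The real WalSegmentWriter also writes a segment
/// header and tracks sequence numbers.
fn write_op_batch<W: Write, T: Serialize>(mut w: W, batch: &T) -> std::io::Result<()> {
    let payload = serde_json::to_vec(batch)?;

    let mut hasher = Hasher::new();
    hasher.update(&payload);
    let checksum = hasher.finalize();

    w.write_u32::<BigEndian>(payload.len() as u32)?;
    w.write_u32::<BigEndian>(checksum)?;
    w.write_all(&payload)?;
    w.flush()
}

fn main() -> std::io::Result<()> {
    // Stand-in for a WAL op batch; the real crate has its own op types.
    #[derive(Serialize)]
    struct Op<'a> {
        lp: &'a str,
    }
    let mut segment = Vec::new();
    write_op_batch(&mut segment, &vec![Op { lp: "cpu,host=a usage=0.5 1" }])?;
    assert!(segment.len() > 8); // length + crc + payload were written
    Ok(())
}
```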
[dev-dependencies]
# Core Crates
arrow_util.workspace = true
insta.workspace = true
metric.workspace = true
feat: add the `api/v3/query_influxql` API (#24696)
feat: add query_influxql api
This PR adds support for the /api/v3/query_influxql API. It re-uses code from the existing query_sql API, with some refactoring to allow code re-use between the two.
The main change relative to the existing query_sql code is that the response format is determined up front, so that if the user provides an incorrect Accept header, the 400 BAD REQUEST is returned before the query is performed (see the sketch below).
Support was added for several InfluxQL queries that previously required a bridge to be executed in 3.0:
SHOW MEASUREMENTS
SHOW TAG KEYS
SHOW TAG VALUES
SHOW FIELD KEYS
SHOW DATABASES
Handling of qualified measurement names in SELECT queries (see below)
This is accomplished with the newly added iox_query_influxql_rewrite crate, which provides the means to re-write an InfluxQL statement to strip out a database name and retention policy, if provided. This allows the query_influxql API's database parameter to be optional, as it may instead be provided in the query string.
Handling qualified measurement names in SELECT
The implementation in this PR inspects all measurements provided in a FROM clause and extracts the database (DB) name and retention policy (RP) name (if not the default). If multiple DB/RPs are provided, an error is returned.
Testing
E2E tests were added that perform basic queries against a running server, on both the query_sql and query_influxql APIs. In addition, the test for query_influxql includes some of the InfluxQL-specific queries, e.g., SHOW MEASUREMENTS.
Other Changes
The influxdb3_client now has the api_v3_query_influxql method (and a basic test was added for it).
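A minimal sketch of the determine-the-format-up-front idea, with an assumed set of MIME types and an illustrative `QueryFormat` enum and `format_from_accept` function; the server's actual format negotiation differs.

```rust
/// Illustrative mapping of an Accept header to a response format, checked before
/// the query runs. The variants and MIME strings here are assumptions.
#[derive(Debug, PartialEq)]
enum QueryFormat {
    Json,
    Csv,
    Pretty,
    Parquet,
}

fn format_from_accept(accept: Option<&str>) -> Result<QueryFormat, String> {
    match accept.map(str::trim) {
        // No header (or wildcard): pick a default rather than failing.
        None | Some("") | Some("*/*") => Ok(QueryFormat::Json),
        Some("application/json") => Ok(QueryFormat::Json),
        Some("text/csv") => Ok(QueryFormat::Csv),
        Some("text/plain") => Ok(QueryFormat::Pretty),
        Some("application/vnd.apache.parquet") => Ok(QueryFormat::Parquet),
        // Anything else becomes a 400 BAD REQUEST before any query work is done.
        Some(other) => Err(format!("unsupported Accept header: {other}")),
    }
}

fn main() {
    assert_eq!(format_from_accept(Some("text/csv")), Ok(QueryFormat::Csv));
    assert!(format_from_accept(Some("application/msword")).is_err());
}
```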
pretty_assertions.workspace = true
test_helpers.workspace = true
test-log.workspace = true