Closes#25461
_Note: the first three commits on this PR are from https://github.com/influxdata/influxdb/pull/25492_
This PR makes the switch from using names for columns to the use of `ColumnId`s. Where column names are used, they are represented as `Arc<str>`. This impacts most components of the system, and the result is a fairly sizeable change set. The area where the most refactoring was needed was in the last-n-value cache.
One of the themes of this PR is to rely less on the arrow `Schema` for handling the column-level information, and tracking that info in our own `ColumnDefinition` type, which captures the `ColumnId`.
I will summarize the various changes in the PR below, and also leave some comments in-line in the PR.
## Switch to `u32` for `ColumnId`
The `ColumnId` now follows the `DbId` and `TableId`, and uses a globally unique `u32` to identify all columns in the database. This was a change from using a `u16` that was only unique within the column's table. This makes it easier to follow the patterns used for creating the other identifier types when dealing with columns, and should reduce the burden of having to manage the state of a table-scoped identifier.
## Changes in the WAL/Catalog
* `WriteBatch` now contains no names for tables or columns and purely uses IDs
* This PR relies on `IndexMap` for `_Id`-keyed maps so that the order of elements in the map is consistent. This has important implications, namely, that when iterating over an ID map, the elements therein will always be produced in the same order which allows us to make assertions on column order in a lot of our tests, and allows for the re-introduction of `insta` snapshots for serialization tests. This map type provides O(1) lookups, but also provides _fast_ iteration, which should help when serializing these maps in write batches to the WAL.
* Removed the need to serialize the bi-directional maps for `DatabaseSchema`/`TableDefinition` via use of `SerdeVecMap` (see comments in-line)
* The `tables` map in `DatabaseSchema` no stores an `Arc<TableDefinition>` so that the table definition can be shared around more easily. This meant that changes to tables in the catalog need to do a clone, but we were already having to do a clone for changes to the DB schema.
* Removal of the `TableSchema` type and consolidation of its parts/functions directly onto `TableDefinition`
* Added the `ColumnDefinition` type, which represents all we need to know about a column, and is used in place of the Arrow `Schema` for column-level meta-info. We were previously relying heavily on the `Schema` for iterating over columns, accessing data types, etc., but this gives us an API that we have more control over for our needs. The `Schema` is still held at the `TableDefinition` level, as it is needed for the query path, and is maintained to be consistent with what is contained in the `ColumnDefinition`s for a table.
## Changes in the Last-N-Value Cache
* There is a bigger distinction between caches that have an explicit set of value columns, and those that accept new fields. The former should be more performant.
* The Arrow `Schema` is managed differently now: it used to be updated more than it needed to be, and now is only updated when a row with new fields is pushed to a cache that accepts new fields.
## Changes in the write-path
* When ingesting, during validation, field names are qualified to their associated column ID
This PR introduces a new type `SerdeVecHashMap` that can be used in places where we need a HashMap with the following properties:
1. When serialized, it is serialized as a list of key-value pairs, instead of a map
2. When deserialized, it assumes the serialization format from (1.) and deserializes from a list of key-value pairs to a map
3. Does not allow for duplicate keys on deserialization
This is useful in places where we need to create map types that map from an identifier (integer) to some value, and need to serialize that data. For example: in the WAL when serializing write batches, and in the catalog when serializing the database/table schema.
This PR refactors the code in `influxdb3_wal` and `influxdb3_catalog` to use the new type for maps that use `DbId` and `TableId` as the key. Follow on work can give the same treatment to `ColumnId` based maps once that is fully worked out.
## Explanation
If we have a `HashMap<u32, String>`, `serde_json` will serialize it in the following way:
```json
{"0": "foo", "1": "bar"}
```
i.e., the integer keys are serialized as strings, since JSON doesn't support any other type of key in maps.
`SerdeVecHashMap<u32, String>` will be serialized by `serde_json` in the following way:
```json,
[[0, "foo"], [1, "bar"]]
```
and will deserialize from that vector-based structure back to the map. This allows serialization/deserialization to run directly off of the `HashMap`'s `Iterator`/`FromIterator` implementations.
## The Controversial Part
One thing I also did in this PR was switch the catalog from using a `BTreeMap` for tables to using the new `HashMap` type. This breaks the deterministic ordering of the database schema's `tables` map and therefore wrecks the snapshot tests we were using. I had to comment those parts of their respective tests out, because there isn't an easy way to make the underlying hashmap have a deterministic ordering just in tests that I am aware of.
If we think that using `BTreeMap` in the catalog is okay over a `HashMap`, then I think it would be okay to roll a similar `SerdeVecBTreeMap` type specifically for the catalog. Coincidentally, this may actually be a good use case for [`indexmap`](https://docs.rs/indexmap/latest/indexmap/), since it holds supposedly similar lookup performance characteristics to hashmap, while preserving order and _having faster iteration_ which could be a win for WAL serialization speed. It also accepts different hashing algorithms so could be swapped in with FNV like `HashMap` can.
## Follow-up work
Use the `SerdeVecHashMap` for column data in the WAL following https://github.com/influxdata/influxdb/issues/25461
* refactor: roll back addition of DatabaseSchemaProvider trait
* refactor: make parquet metrics optional in telemetry for pro
* refactor: make ParquetFileId Hash
* refactor: test harness logging
Allow the endpoint for telemetry to be passed in via the cli args, e.g
```
--telemetry-endpoint "https://somehost/test/"
```
and the actual endpoint always appends `v3` to it. So, above URL becomes
"https://somehost/test/v3"
Separate out methods of the Catalog API that are used on the query side into a new trait `DatabaseSchemaProvider`. The new trait includes methods from the Catalog that get the underlying `DatabaseSchema` or interact with names/IDs.
This will allow for a separate implementation of the Catalog for pro that only needs to hold a replicated/combined view in-memory of one or more catalogs without the need to do persistence that a write buffer's catalog needs to do.
While in there I also switched the `QueryExecutorImpl::new` method to take an args struct to avoid the clippy lint.
* feat: Add non-unique u16 Id to ColumnDefinition
This commit adds the column_id field to ColumnDefinition so that the
output for a Catalog will contain the id of that column. This is non
unique, whereas TableIds and DbIds will be unique. The column_id
corresponds to it's index in the schema.
Closes#25386
When running the tests repeatedly the tests failed intermittently
as the background runner wakes up to prune the cache and the tests
are loading and removing the data. The check whether `n_to_prune`
is greater than 0 before going ahead with pruning fixes the issue
Closes: https://github.com/influxdata/influxdb/issues/25446
* feat(circleci): add inclusivity checks
* chore(circleci): adjust package-validation for inclusive language
* chore: update tests for inclusive language
- Introduced traits, `ParquetMetrics` and `SystemInfoProvider` to enable
writing easier tests
- Uses mockito for code that depends on reqwest::Client and also uses
mockall to generally mock any traits like `SystemInfoProvider`
- Minor updates to docs
* feat: Add TableId and ColumnId
* feat: swap over to DbId and TableId everywhere
This commit swaps us over to using the DbId and TableId types everywhere
for our internal systems. Anywhere that's external facing, such as names
for last cache tables or line protocol parsing, use names. In these cases
we have the `Catalog` which keeps a map of TableIds and DbIds in a
bidirectional mapping for easy lookup i.e. id <-> names. While in essence
the change itself isn't that complicated given the nature of how much we
depended on names for things, the changes end up being quite invasive and
extensive. Luckily it shouldn't be too hard to review. Note this does
not add the column ids which will be done in a follow up PR.
Closes#25375Closes#25403Closes#25404Closes#25405Closes#25412Closes#25413
This adds a watch channel to check for when prunes have happened on
the parquet cache oracle, so we can notify something, like a test, that
needs to know when a prune has happened.
This should make the cache eviction test in the parquet_cache module
not flake out anymore.
- added mechanism within PersistedFile to expose parquet file related
metrics. The details are updated when new snapshot is generated and
also when all snapshots are loaded when the process starts up
- at the point of creating the telemetry payload these parquet metrics
are looked up before sending it to the server.
Closes: https://github.com/influxdata/influxdb/issues/25418
This adds a new crate `influxdb3_test_helpers` which provides two object
store helper types that can be used to track request counts made through
the store, as well as synchronize requests made through the store, resp.
We will need to wait on the RUSTSEC advisory to be resolved upstream,
i.e., by having tonic and hyper upgraded in core, before we can lift this
advisory ignore and use the latest versions of those crates.
- instrumented code to get read and write measurement
- introduced EventsBucket for collection of reads/writes
- sampler now samples every minute for all metrics (including
reads/writes)
- other tidy ups
closes: https://github.com/influxdata/influxdb/issues/25372
* test: check parquet cache in the write buffer
Checked that the parquet cache will serve queries when chunks are
requested from the write buffer. The added test also checks for get_range
requests made to the object store, which are typically made by DataFusion
to infer schema for parquet files.
* refactor: make parquet cache optional on write buffer
* test: add test to verify parquet cache function
This makes the parquet cache optional at the write buffer level, and adds
a test that verifies that the cache catches and prevents requests to the
object store in the event of a cache hit.
Closes#25382Closes#25383
This refactors the parquet cache to use less locking by switching from using the `clru` crate to a hand-rolled cache implementation. The new cache still acts as an LRU, but it uses atomics to track hit-time per entry, and handles pruning in a separate process that is decoupled from insertion/gets to the cache.
The `Cache` type uses a [`DashMap`](https://docs.rs/dashmap/latest/dashmap/struct.DashMap.html) internally to store cache entries. This should help reduce lock contention, and also has the added benefit of not requiring mutability to insert into _or_ get from the map.
The cache maps an `object_store::Path` to a `CacheEntry`. On a hit, an entry will have its `hit_time` (an `AtomicI64`) incremented. During a prune operation, entries that have the oldest hit times will be removed from the cache. See the `Cache::prune` method for details.
The cache is setup with a memory _capacity_ and a _prune percent_. The cache tracks memory used when entries are added, based on their _size_, and when a prune is invoked in the background, if the cache has exceeded its capacity, it will prune `prune_percent * cache.len()` entries from the cache.
Two tests were added:
* `cache_evicts_lru_when_full` to check LRU behaviour of the cache
* `cache_hit_while_fetching` to check that a cache entry hit while a request is in flight to fetch that entry will not result in extra calls to the underlying object store
- moved num samples for cpu/mem to be usize from u64
- generic float function to round to 2 decimal places added
- removed unnecessary cast of f32 to f64
Part of #25347
This sets up a new implementation of an in-memory parquet file cache in the `influxdb3_write` crate in the `parquet_cache.rs` module.
This module introduces the following types:
* `MemCachedObjectStore` - a wrapper around an `Arc<dyn ObjectStore>` that can serve GET-style requests to the store from an in-memory cache
* `ParquetCacheOracle` - an interface (trait) that can accept requests to create new cache entries in the cache used by the `MemCachedObjectStore`
* `MemCacheOracle` - implementation of the `ParquetCacheOracle` trait
## `MemCachedObjectStore`
This takes inspiration from the [`MemCacheObjectStore` type](1eaa4ed5ea/object_store_mem_cache/src/store.rs (L205-L213)) in core, but has some different semantics around its implementation of the `ObjectStore` trait, and uses a different cache implementation.
The reason for wrapping the object store is that this ensures that any GET-style request being made for a given object is served by the cache, e.g., metadata requests made by DataFusion.
The internal cache comes from the [`clru` crate](https://crates.io/crates/clru), which provides a least-recently used (LRU) cache implementation that allows for weighted entries. The cache is initialized with a capacity and entries are given a weight on insert to the cache that represents how much of the allotted capacity they will take up. If there isn't enough room for a new entry on insert, then the LRU item will be removed.
### Limitations of `clru`
The `clru` crate conveniently gives us an LRU eviction policy but its API may put some limitations on the system:
* gets to the cache require an `&mut` reference, which means that the cache needs to be behind a `Mutex`. If this slows down requests through the object store, then we may need to explore alternatives.
* we may want more sophisticated eviction policies than a straight LRU, i.e., to favour certain tables over others, or files that represent recent data over those that represent old data.
## `ParquetCacheOracle` / `MemCacheOracle`
The cache oracle is responsible for handling cache requests, i.e., to fetch an item and store it in the cache. In this PR, the oracle runs a background task to handle these requests. I defined this as a trait/struct pair since the implementation may look different in Pro vs. OSS.
* feat: Remove lock for FileId tests
Since we now are using cargo-nextest in CI we can remove
the locks used in the FileId tests to make sure that we
have no race conditions
* feat: Add u32 ID for Databases
This commit adds a new DbId for databases. It also updates paths to use
that id as part of the name. When starting up the WriteBuffer we apply
the DbId from the persisted snapshot much like we do for ParquetFileId's
This introduces the influxdb3_id crate to avoid circular deps with ids.
The ParquetFileId should also be moved into this crate, but it's
outside the scope of this change.
Closes#25301
- uses Arc<str> to represent create once and read everywhere type
of string
- updated snapshots for insta asserts, uses redaction to hardcode
randomly generated UUID strings
- added methods to catalog to expose instace and host ids
Closes: https://github.com/influxdata/influxdb/issues/25315
This changes our CI to use cargo-nextest which is faster and does not
have issues around global statics. Since it runs each test in it's
own process we don't have to worry about tests stepping on each other's
toes in this regard. It also updates the CI to ignore the current
cargo deny failure as we can't do anything until the arrow crates are
upgraded.
This makes some changes to the TestServer E2E framework, which is used
for running integration tests in the influxdb3 crate. These changes are
meant so that we can more easily split the code for pro.