This commit is a major refactor for the code base. It mainly does four
things:
1. Splits code shared between the internal IOx repository and this one
into it's own repo over at https://github.com/influxdata/influxdb3_core
2. Removes any docs or anything else that did not relate to this project
3. Reorganizes the Cargo.toml files to use the top level Cargo.toml to
declare dependencies and versions to keep all crates in sync and sets
all others to use `<dep>.workspace = true` unless it's an optional
dependency
4. Set the top level Cargo.toml to point to the core crates as git
dependencies
With this any changes specific to Edge will be contained here, updating
deps will be a PR over in `influxdata/influxdb3_core`, and we can prove
out the viability for this model to use for IOx.
* fix: persister loading with no segments
Fixes a bug where the persister would throw an error if attempting to load segments when none had been persisted.
Moved persister tests into tests block.
* feat: implement loader for persisted state
This implements a loader for the write buffer. It loads the catalog and the buffer from the WAL.
Move Persister errors into their own type now that the write buffer load could return errors from the persister.
This doesn't yet rotate segments or trigger persistence of newly closed segments, which will be addressed in a future PR.
* fix: cargo update to fix audit
* refactor: add error type to persister trait
* refactor: use generics instead of dyn
---------
Co-authored-by: Trevor Hilton <thilton@influxdata.com>
* feat: Add partial write and name check to write_lp
This commit adds new behavior to the v3 write_lp http endpoint by
implementing both partial writes and checking the db name for validity.
It also sets the partial write behavior as the default now, whereas
before we would reject the entire request if one line was incorrect.
Users who *do* actually want that behavior can now opt in by putting
'accept_partial=false' into the url of the request.
We also check that the db name used in the request contains only
numbers, letters, underscores and hyphens and that it must start with
either a number or letter.
We also introduce a more standardized way to return errors to the user
as JSON that we can expand over time to give actionable error messages
to the user that they can use to fix their requests.
Finally tests have been included to mock out and test the behavior for
all of the above so that changes to the error messages are reflected in
tests, that both partial and not partial writes work as expected, and
that invalid db names are rejected without writing.
* feat: Add precision to write_lp http endpoint
This commit adds the ability to control the precision of the time stamp
passed in to the endpoint. For example if a user chooses 'second' and
the timestamp 20 that will be 20 seconds past the Unix Epoch. If they
choose 'millisecond' instead it will be 20 milliseconds past the Epoch.
Up to this point we assumed that all data passed in was of nanosecond
precision. The data is still stored in the database as nanoseconds.
Instead upon receiving the data we convert it to nanoseconds. If the
precision URL parameter is not specified we default to auto and take a
best effort guess at what the user wanted based on the order of
magnitude of the data passed in.
This change will allow users finer grained control over what precision
they want to use for their data as well as trying our best to make a
good user experience and having things work as expected and not creating
a failure mode whereby a user wanted seconds and instead put in
nanoseconds by default.
* refactor: move write buffer into its own dir
* feat: implement write buffer segment with wal flushing
This creates the WriteBufferFlusher and OpenBufferSegment. If a wal is passed into the buffer, data written into it will be persisted to the wal for the initialized segment id.
* refactor: use crossbeam in flusher and pr cleanup
* chore(deps): Update arrow and datafusion to 49.0.0
This commit copies in our dependency code from influxdb_iox in order for
us to be able to upgrade from a forked version of 46.0.0 to 49.0.0 of
both arrow and datafusion. Most of the important changes were around how
we consumed the crates in influxdb3(_server/_write). Those diffs are
particularly worth looking at as the rest was a straight copy and we
don't touch those crates in our development currently for influxdb3
edge.
* fix: regenerate workspace hack crate
* fix: Protobuf issues with incompatibility labels
* fix: Broken CI yaml
* fix: buf version
* fix: Only check IOx repo
* fix: Remove protobuf lint
* fix: Comment out call to protobuf-lint
* feat: Implement Catalog r/w for persister
This commit implements reading and writing the Catalog to the object
store. This was already stubbed out functionality, but it just needed an
implementation. Saving it to the object store is pretty straight forward
as it just serializes it to JSON and writes it to the object store. For
loading, it finds the most recently added Catalog based on the file name
and returns that from the object store in it's deserialized form and
returned to the caller.
This commit also adds some tests to make sure that the above
functionality works as intended.
* feat: Implement Segment r/w for persister
This commit continues the work on the persister by implementing the
persist_segment and load_segment functions for the persister. Much like
the Catalog implementation, it's serialized to JSON before being
persisted to the object store in persist_segment. This is pretty
straightforward. For the loading though we need to find the most recent
n segment files and so we need to list them and then return the most
recent n. This is a little more complicated to do, but there are
comments in the code to make it easier to grok.
We also implement more tests to make sure that this part of the
persister works as expected.
* feat: Implement Parquet r/w to persister
This commit does a few things:
- First we add methods to the persister trait for reading and writing
parquet files as these were not stubbed out in prior commits
- Secondly we add a method to serialize a SendableRecordBatchStream into
Parquet bytes
- With these in place implementing the trait methods is pretty
straightforward: hand a path in and a stream and get back some
metadata about the file persisted and also get the bytes back if
loading from the store
Of course we also add more tests to make sure this all works as
expected. Do note that this does nothing to make sure that we bound how
much memory is used or if this is the most efficient way to write
parquet files. This is mostly to get things working with the
understanding that future refinement on the approach might be needed.
* fix: Update smallvec for crate advisory
* fix: Implement better filename handling
* feat: Handle loading > 1000 Segment Info files
This commit introduces 4 new types in the paths module for the
influxdb3_write crate. They are:
- ParquetFilePath
- CatalogFilePath
- SegmentInfoFilePath
- SegmentWalFilePath
Each of these corresponds to an object store path and for the WAL file an
on disk path that we can use to address the needed files in a consistent way
and not need to have path construction be duplicated to address these files.
These types also Deref/AsRef to the object_store::path::Path type (or the
std::path::Path type for the Wal) so that they can be used in places that
expect the type such as various object_store/std::fs and so that we can use
the underlying type's methods without needing to implement them for each
type as they are just a thin wrapper around those types.
This commit adds some tests to make sure that the path construction
works as intended and also updates the `wal.rs` file to use the new
`SegmentWalFilePath` instead of just a `PathBuf`.
Closes: #24578
* feat: add basic wal implementation for Edge
This WAL implementation uses some of the code from the wal crate, but departs pretty significantly from it in many ways. For now it uses simple JSON encoding for the serialized ops, but we may want to switch that to Protobuf at some point in the future. This version of the wal doesn't have its own buffering. That will be implemented higher up in the BufferImpl, which will use the wal and SegmentWriter to make data in the buffer durable.
The write flow will be that writes will come into the buffer and validate/update against an in memory Catalog. Once validated, writes will get buffered up in memory and then flushed into the WAL periodically (likely every 10-20ms). After being flushed to the wal, the entire batch of writes will be put into the in memory queryable buffer. After that responses will be sent back to the clients. This should reduce the write lock pressure on the in-memory buffer considerably.
In this PR:
- Update the Wal, WalSegmentWriter, and WalSegmentReader traits to line up with new design/understanding
- Implement wal (mainly just a way to identify segment files in a directory)
- Implement WalSegmentWriter (write header, op batch with crc, and track sequence number in segment, re-open existing file)
- Implement WalSegmentReader
* refactor: make Wal return impl reader/writer
* refactor: clean up wal segment open
* fix: WriteBuffer and Wal usage
Turn wal and write buffer references into a concrete type, rather than dyn.
* fix: have wal loading ignore invalid files
* fix: build, upgrade rustc, and deps
This commit upgrades Rust to 1.75.0, the latest release. We also
upgraded our dependencies to stay up to date and to clear out any
uneeded deps from the lockfile. In order to make sure everything works
this also fixes the build by upgrading the workspace-hack crate using
cargo hikari and removing the `workspace.lint` that was in
influxdb3_write that didn't need to be there, probably from a merge
issue.
With this we can build influxdb3 as our default on main, but this alone
is not enough to fix CI and will be addressed in future commits.
* fix: warnings for influxdb3 build
This commit fixes the warnings emitted by `cargo build` when compiling
influxdb3. Mainly it adds needed lifetimes and removes uneccesary
imports and functions calls.
* fix: all of the clippy lints
This for the most part just applies suggested fixes by clippy with a few
exceptions:
- Generated type crates had additional allows added since we can't
control what code gets made
- Things that couldn't be automatically fixed were done so manually in
particular adding a Send bound for traits that created a Future that
should be Send
We also had to fix a build issue by adding a feature for tokio-compat
due to the upgrade of deps. The workspace crate was updated accordingly.
* fix: failing test due to rust panic message change
Inbetween rustc 1.72 and rustc 1.75 the way that error messages were
displayed when panicing changed. One of our tests depended on the output
of that behavior and this commit updates the error message to the new
form so that tests will pass.
* fix: broken cargo doc link
* fix: cargo formatting run
* fix: add workspace-hack to influxdb3 crates
This was the last change needed to make sure that the workspace-hack
crate CI lint would pass.
* fix: remove tests that can not run anymore
We removed iox code from this code base and as a result some tests
cannot be run anymore and so this commit removes them from the code base
so that we can get a green build.
* WIP: basic influxdb3 command and http server
* WIP: write lp, buffer, query out
* WIP: test write & query on influxdb3_server, fix warnings
* WIP: pull write buffer and catalog into separate crate
* WIP: sketch out types used for write: buffer, wal, persister
* WIP: remove a bunch of old IOx stuff and fmt