During initialisation, the ingester connects to the Kafka brokers; this
involves per-partition leader discovery and connection establishment.
These connections are then retained for the lifetime of the process.
Prior to this commit, the ingester would establish a connection to all
partition leaders for a given topic. After this commit, the ingester
connects only to the partition leaders it is going to consume from
(i.e. those for the shards it is assigned).
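A minimal sketch of the new start-up behaviour, assuming a set of
assigned shard indexes (the types and names here are illustrative, not
the IOx definitions):

```rust
use std::collections::BTreeSet;

// Illustrative stand-in for an rskafka-style partition client.
struct PartitionClient;

async fn connect(_topic: &str, _partition: i32) -> PartitionClient {
    // Leader discovery + connection establishment happen here.
    PartitionClient
}

// Connect only to the partition leaders for the assigned shards,
// rather than to every partition leader of the topic.
async fn init_clients(
    topic: &str,
    assigned_shards: &BTreeSet<i32>,
) -> Vec<PartitionClient> {
    let mut clients = Vec::with_capacity(assigned_shards.len());
    for &shard in assigned_shards {
        clients.push(connect(topic, shard).await);
    }
    clients
}
```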
* ci: use same feature set in `build_dev` and `build_release`
* ci: also enable unstable tokio for `build_dev`
* chore: update tokio to 1.21 (to fix console-subscriber 0.1.8)
* fix: "must use"
Adds instrumentation to the low-level (post-aggregation) Kafka client,
capturing the approximate, uncompressed message size (calculated as the
sum of all `Record::approximate_size()` return values, ignoring the
largely static framing overhead).
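A sketch of that computation; the `Record` type below is an
illustrative stand-in for rskafka's record, not its real definition:

```rust
// Illustrative stand-in for rskafka's record type.
struct Record {
    key: Option<Vec<u8>>,
    value: Option<Vec<u8>>,
}

impl Record {
    // Approximate per-record payload size, mirroring the intent of the
    // real Record::approximate_size() referenced above.
    fn approximate_size(&self) -> usize {
        self.key.as_ref().map_or(0, Vec::len)
            + self.value.as_ref().map_or(0, Vec::len)
    }
}

// The captured metric: the sum of per-record sizes, ignoring the
// largely static framing overhead.
fn approximate_batch_size(records: &[Record]) -> usize {
    records.iter().map(Record::approximate_size).sum()
}
```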
Previously, aggregated writes were merged into a single Kafka Record -
this meant that all merged ops would be placed into the same Record, and
therefore receive the same sequence number once published to Kafka.
The new aggregator batches at the Record level, therefore aggregated
writes now get their own distinct sequence number. This commit updates
the batching tests to reflect this new sequence number assignment
behaviour.
The previous aggregator impl would assert that writes had been
partitioned before aggregating them (or rather, that the DML write had a
partition key assigned).
This should be true for all writes passing through the write buffer,
irrespective of which aggregator is used, therefore this assert is moved
"up" into the write buffer itself.
Replaces the DmlAggregator with the simpler RecordAggregator.
Metrics gathered as part of #5323 show there is practically no benefit
to the additional complexity of the DmlAggregator over the simpler
RecordAggregator impl.
This commit adds a new write buffer aggregator used by rskafka to
increase the size of Kafka messages on the wire. The Kafka write buffer
impl is the only impl to perform aggregation.
This Aggregator impl maps IOx-specific DML operations to rskafka Records
with no additional processing - it can be thought of as an IOx-specific
adaptor over rskafka's RecordAggregator.
By delegating batching of Record instances to rskafka's simple
RecordAggregator, we minimise code complexity / bug surface area / LoC.
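A rough sketch of the adaptor idea with placeholder types (the real
code targets rskafka's `Record` and leaves all batching to its
`RecordAggregator`):

```rust
// Illustrative placeholders; not the rskafka/IOx definitions.
struct Record {
    key: Option<Vec<u8>>,
    value: Option<Vec<u8>>,
}
struct DmlOperation;

// The adaptor does exactly one thing: serialise an IOx DML op into a
// Kafka record. Batching decisions stay in RecordAggregator.
fn to_record(op: &DmlOperation) -> Record {
    Record {
        key: None,
        value: Some(encode(op)),
    }
}

fn encode(_op: &DmlOperation) -> Vec<u8> {
    // In IOx this is the protobuf wire encoding; elided here.
    Vec::new()
}
```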
Changes the Kafka write buffer impl to parallelise initialisation of the
PartitionClient instances.
Now that the PartitionClient constructor also performs leader discovery
(using cached metadata, influxdata/rskafka#164) and establishes a broker
connection (influxdata/rskafka#166), executing these constructors in
parallel causes a proportional decrease in the time taken to bring IOx
up.
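A sketch of the concurrent initialisation, assuming an async,
rskafka-style per-partition constructor (the stand-ins below are
illustrative):

```rust
use futures::future::join_all;

// Illustrative stand-in for rskafka's PartitionClient.
struct PartitionClient;

async fn partition_client(_topic: String, _partition: i32) -> PartitionClient {
    // Leader discovery + broker connection happen in here.
    PartitionClient
}

// Initialise all clients concurrently instead of awaiting them one
// after another.
async fn init_all(topic: &str, partitions: &[i32]) -> Vec<PartitionClient> {
    join_all(
        partitions
            .iter()
            .map(|&p| partition_client(topic.to_owned(), p)),
    )
    .await
}
```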
* fix: do not lose data when Kafka reports that offset is above watermark
This can happen in certain cluster rebalance settings.
This is also linked to https://github.com/influxdata/rskafka/issues/147
but for the upstream issue I currently have no idea how to fix it, so
let's at least harden IOx against it.
Fixes #5128.
* refactor: panic for `SequenceNumberAfterWatermark`
Previously IOx mapped a single database to a single Kafka topic - this
is no longer the case, so referring to the Kafka topic name as the
"database name" is confusing.
Adds a decorator over the underlying Kafka client to capture the
latency distribution of the low-level Kafka writes, independent of the
aggregation/DML batching framework that sits "above" this client.
The latency measurements include the serialisation overhead, protocol
overhead, and actual network I/O.
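A sketch of the decorator shape; the histogram type is an illustrative
stand-in:

```rust
use std::time::{Duration, Instant};

// Illustrative histogram stand-in.
struct DurationHistogram;
impl DurationHistogram {
    fn record(&self, _d: Duration) {}
}

// Wrap the low-level produce call and record its wall-clock latency.
// The measurement covers serialisation, protocol overhead and the
// actual network I/O, independent of the batching layer above.
async fn instrumented_produce<F, Fut, T, E>(
    hist: &DurationHistogram,
    produce: F,
) -> Result<T, E>
where
    F: FnOnce() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let started = Instant::now();
    let res = produce().await;
    hist.record(started.elapsed());
    res
}
```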
The Kafka write buffer implementation (and only the Kafka impl) merges
together successive DML writes for the same namespace & partition within
a window of time.
This commit records the number of DML writes that have been merged
together to form a single batched op before it is dispatched to Kafka.
* test: add test for timestamps in kafka write buffer
* refactor: move timestamp batching test to generic tests
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Fixes interaction of `maybe_skip_kafka_integration!` and `should_panic`
by ensuring that `maybe_skip_kafka_integration!` panics to skip
`should_panic` tests.
Without that it is not possible to just run `cargo test -p write_buffer`.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
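A sketch of the mechanism (the env var and message handling are
assumptions): inside a `#[should_panic]` test, returning early would
fail the test for not panicking, so the skip path must panic with the
expected message instead.

```rust
// Illustrative version of the panic-to-skip behaviour.
macro_rules! maybe_skip_kafka_integration {
    ($panic_msg:expr) => {
        if std::env::var("KAFKA_CONNECT").is_err() {
            eprintln!("skipping Kafka integration test");
            // Satisfies #[should_panic(expected = ...)], so the
            // skipped test still "passes".
            panic!("{}", $panic_msg);
        }
    };
}
```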
Changes the Kafka message wire format to include the partition key for
serialised DML writes on the wire.
After this commit, the Kafka messages will contain the partition key for
each op, but this information will go unused in the ingester - this
enables us to roll out the producer side before making the value's
presence necessary on the consumer side.
A follow-up PR will change the ingester to utilise this embedded
partition key.
This has the unfortunate side effect of making the partition key part of
the public gRPC write API:
https://github.com/influxdata/influxdb_iox/issues/4866
This commit changes the Kafka write aggregator to only merge DML ops
destined for the same partition.
Prior to this commit the aggregator was merging DML ops that had
different partition keys, causing data to be persisted in incorrect
partitions:
https://github.com/influxdata/influxdb_iox/issues/4787
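A sketch of the fix: the key under which ops are grouped for merging
now includes the partition key, so ops destined for different
partitions can never be merged (the exact field set is illustrative):

```rust
// Ops are only merged when the entire key matches; including the
// partition key rules out cross-partition merging.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct AggregateKey {
    namespace: String,
    sequencer_id: u32,
    partition_key: String, // newly part of the key
}
```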
The default behavior of the ingester is to panic if the min unpersisted
sequence number in the catalog is unknown to the write buffer due to the
retention policies having evicted that sequence number.
Specifying `--skip-to-oldest-available` changes this behavior to skip to
the oldest sequence number the write buffer does have available and go
from there.
Fixes #4624.
Derive the `Debug` impl so it prints all the fields (specifically, the
"number of sequencers configured" is pretty helpful in a test).
Manual impls drift over time and are more effort than the derive!
Fixes symlinking in `maybe_auto_create_directories()` - previously it
would create a symlink specifying the target path relative to the
working dir, and not relative to the symlink.
If the working dir != the path in the WriteBufferCreationConfig,
subsequent calls would get stuck in an infinite loop attempting to
resolve the bad symlink.
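A minimal sketch of the corrected call (Unix-only; paths are
illustrative): the stored target must be relative to the symlink's own
parent directory, not to the process working directory.

```rust
use std::os::unix::fs::symlink;
use std::path::Path;

fn link_sibling(dir: &Path) -> std::io::Result<()> {
    // Correct: a bare name resolves relative to the link's directory.
    // Prefixing it with a CWD-relative path produced a dangling,
    // self-referential link whenever the CWD != dir.
    symlink("target_entry", dir.join("link_entry"))
}
```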
* chore: upgrade rskafka
* refactor: less cloning
* fix: defined behaviour when seeking to an unknown sequence number
The new, defined behaviour is: "return an error once and then end the
stream".
Co-authored-by: Edd Robinson <me@edd.io>
For sparse data the PB-encoded data (our Kafka wire format) is way
smaller than the MutableBatch (up to a factor of 20), so let's use the
former to estimate the size during batching.
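A sketch of that estimate using prost's `encoded_len()` (the generic
message bound stands in for the concrete IOx protobuf type):

```rust
use prost::Message;

// The protobuf wire size can be computed without actually encoding
// the message, which makes it cheap enough for batching decisions.
fn estimated_wire_size<M: Message>(msg: &M) -> usize {
    msg.encoded_len()
}
```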
Prod has a larger max message size for Kafka (10MB instead of 1MB), but
currently we're unable to wire all the write buffer configs through. As
a quick fix, let's hard-code the config. This however breaks the write
buffer when running under a default Kafka setup (1MB), so we should
revert this (tracked under #3723).
When creating a new aggregation span, you MUST NOT just create a new
random span context and put its child span into a span recorder, because
then only the child will be reported to the trace collector. Instead,
directly create a new root span without any parent.
This makes Jaeger slightly happier; it won't complain about broken
spans anymore.
* refactor: improve write buffer consumer interface
The change looks huge but is actually rather simple. To
understand the interface change, let me first explain what we want:
- be able to fetch watermarks for any sequencer
- have streams:
  - each stream tracks a sequencer and has its own offset state (no
    read multiplexing)
  - we can seek a stream
  - seeking and streaming cannot be done at the same time (that would
    be weird and would likely lead to many bugs, both in the write
    buffer and in user code)
  - ideally we don't need to create streams for all sequencers but can
    choose a subset
Before this change we had one mutable consumer struct from which you
could get all streams and watermark functions (this mutable-borrows the
consumer) or seek a single stream (this also mutable-borrows the
consumer). This is a bit weird for multiple reasons:
- you cannot seek a single stream without dropping all of them
- the mutable-borrow construct makes it really difficult to pass the
  streams into separate threads
- the consumer is boxed (because it's mutable), which makes it more
  difficult to handle in a large-scale application
What this change does is the following:
- you have an immutable consumer (similar to the producer)
- the consumer offers the following methods:
  - get the set of sequencer IDs
  - get the watermark for any sequencer
  - get a stream handler (see next point) for any sequencer
- the stream handler captures the stream state (offset) and provides
  you with a standard `Stream<_>` interface as well as a seek function.
  Mutable borrows ensure that you cannot use both at the same time.
The stream handler provides you the stream via `handler.stream()`. It
doesn't implement `Stream<_>` itself because of the way boxing, dynamic
dispatch, and pinning interact (i.e. I couldn't get it to work without
the indirection).
As a bonus (which we don't use, however), you can now create multiple
streams for the same sequencer, each with its own offset.
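A sketch of the resulting interface with placeholder types; the trait
names resemble the IOx ones, but the exact signatures here are
assumptions:

```rust
use async_trait::async_trait;
use futures::stream::BoxStream;

// Illustrative placeholders.
type DmlOperation = ();
type WriteBufferError = Box<dyn std::error::Error + Send + Sync>;

#[async_trait]
trait WriteBufferReading: Send + Sync {
    /// The set of sequencer IDs in this write buffer.
    fn sequencer_ids(&self) -> Vec<u32>;

    /// Watermark for any sequencer; only needs `&self`, so it can be
    /// called at any time.
    async fn fetch_high_watermark(
        &self,
        sequencer_id: u32,
    ) -> Result<u64, WriteBufferError>;

    /// An independent handler owning the offset state for one
    /// sequencer; multiple handlers per sequencer are possible.
    async fn stream_handler(
        &self,
        sequencer_id: u32,
    ) -> Result<Box<dyn WriteBufferStreamHandler>, WriteBufferError>;
}

#[async_trait]
trait WriteBufferStreamHandler: Send {
    /// The stream itself; the mutable borrow prevents seeking while a
    /// stream is in use.
    async fn stream(
        &mut self,
    ) -> BoxStream<'_, Result<DmlOperation, WriteBufferError>>;

    /// Reposition this handler; mutably borrows for the same reason.
    async fn seek(&mut self, sequence_number: u64) -> Result<(), WriteBufferError>;
}
```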
* fix: review comments
Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
All features are now covered by rskafka. This also removes the need to
specify a server ID for write buffer consumers; that was only needed
for rdkafka, which required a consumer group even though we did not use
any transactions.
* chore: upgrade rskafka + enable snappy support
* test: ensure that rskafka and rdkafka work together
Before removing rdkafka ensure that:
- rskafka can consume existing messages produced by rdkafka so we do not
need to drain existing topics
- rdkafka can consume new messages produced by rskafka so we can roll
back
I ran the whole `write_buffer` test suite (including the newly added
tests) using Apache Kafka as well as Redpanda.
* test: ensure we handle consumer offset in error case correctly
* docs: explain test setup
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: Add db_name/namespace to DmlWrite and DmlDelete
This is required for the new ingester to be able to work with the write buffer. The protobuf that gets serialized over Kafka already includes the database name; it just wasn't being carried through to the marshaled DML operation.
* fix: database != namespace, propagation through write buffer
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
- Use a more standard way to set up the tracing subsystem (as described
in the tracing-subscriber docs)
- Also capture content from `log` crate
- Play nice w/ Rust's libtest message capture
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: Sequencer wrapper
This type wraps an underlying WriteBufferWriter implementation, tagging
it with the sequencer ID it should use when enqueuing operations to the
buffer.
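A sketch of the wrapper shape; the trait below is an illustrative
stand-in for the real write buffer interface:

```rust
use std::sync::Arc;

// Illustrative stand-in for the write buffer trait.
trait WriteBufferWriting: Send + Sync {
    fn store(&self, sequencer_id: u32, payload: &[u8]);
}

// Ties a shared write buffer handle to the sequencer ID it should
// use, so callers no longer pass the ID on every enqueue.
struct Sequencer {
    id: u32,
    inner: Arc<dyn WriteBufferWriting>,
}

impl Sequencer {
    fn enqueue(&self, payload: &[u8]) {
        self.inner.store(self.id, payload);
    }
}
```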
* feat: mock sharder
Implements a mock Sharder impl that returns pre-configured responses to
shard(), and captures the input to the call.
* feat: sharded write buffer
Implements sharding of ops into an underlying WriteBuffer.
Writes are sharded by some abstract Sharder impl, collated per shard to
maximise the size of each op (and therefore compression efficiency),
converted into a DML operation, and then enqueued in parallel to the
underlying WriteBuffer implementation (see the sketch at the end of
this entry).
Deletes are modelled as being mapped to a single write buffer shard,
which is the case while we support sharding based on the table &
namespace only. Deletes will be extended to support (potentially)
multiple shards when column overrides are implemented.
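A sketch of the per-shard collation step referenced above, with
illustrative types:

```rust
use std::collections::HashMap;

// Illustrative placeholders.
type ShardId = u32;
type TableName = String;
struct TableBatch;

// Collect all (table, batch) pairs routed to the same shard into one
// group, maximising op size and therefore compression efficiency; the
// collated groups are then enqueued to their shards in parallel.
fn collate(
    sharded: Vec<(ShardId, TableName, TableBatch)>,
) -> HashMap<ShardId, Vec<(TableName, TableBatch)>> {
    let mut per_shard: HashMap<ShardId, Vec<(TableName, TableBatch)>> =
        HashMap::new();
    for (shard, table, batch) in sharded {
        per_shard.entry(shard).or_default().push((table, batch));
    }
    per_shard
}
```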
* refactor: runtime write buffers
Switch from static dispatch to a runtime-specified WriteBufferWriting
implementation.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
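A sketch of the dispatch change; the trait and impls below are
illustrative stand-ins:

```rust
use std::sync::Arc;

// Illustrative stand-ins.
trait WriteBufferWriting: Send + Sync {}
struct KafkaProducer;
struct MockProducer;
impl WriteBufferWriting for KafkaProducer {}
impl WriteBufferWriting for MockProducer {}

// Before: the writer type was a compile-time parameter (static
// dispatch). After: the concrete implementation is chosen at runtime
// behind a trait object.
fn make_writer(use_kafka: bool) -> Arc<dyn WriteBufferWriting> {
    if use_kafka {
        Arc::new(KafkaProducer)
    } else {
        Arc::new(MockProducer)
    }
}
```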