Commit Graph

6141 Commits (42b1436220e4a3bf3687940e30501597c5747b66)

Author SHA1 Message Date
Andrew Lamb 0b3df2ab50
fix: reduce verbosity of logs (#3159)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-11-19 10:03:27 +00:00
Carol (Nichols || Goulding) 25d55cd08a
feat: Move server config paths beneath 'nodes' (#3144)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-11-19 09:54:32 +00:00
Raphael Taylor-Davies e32d367e85
feat: flush delete mailbox on persist (#3126) (#3147)
* feat: flush delete mailbox on persist (#3126)

* chore: review feedback

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-11-19 09:45:29 +00:00
Marco Neumann 7c72a993a3 fix: don't retry "forever" sending Kafka messages
When a Kafka broker pod is recreated (for whatever reason) and gets a
new IP while doing so, the following happened:

1. Old broker pod gets terminated, but is still reachable via DNS and
   TCP.
2. rdkafka loses its connection, re-creates it using the old IP. The
   TCP connection can be established (this heavily depends on the K8s
   network setup), but won't be able to send any messages because the
   old broker is already shutting down / dead.
3. New broker gets created w/ new IP (but same DNS name).
4. Somewhat in parallel to step 3: rdkafka gets informed by other
   brokers that the topic lost its leader and then that the topic has
   the new leader (which has the same identity as the old one). Since
   leader changes in Kafka can also happen when brokers are totally
   healthy, it doesn't conclude that its TCP connection might be broken
   and tries to send messages to the new broker via the old TCP
   connection.
5. It takes a very long time (~130s on my test setup) for the old
   rdkafka->broker TCP connection to break. Since
   `message.send.max.retries` has a default of `2147483647`, rdkafka will
   not give up on the application level.
6. rdkafka re-connects, resolves the new broker IP via DNS while doing
   so, and is happy.
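
To illustrate why the effectively unlimited retry count in step 5 stalls the producer, here is a minimal std-only sketch of a bounded retry loop. It is not the rdkafka API and not the code of this patch: `send_with_retries`, `SendError` and the closure standing in for the broker connection are hypothetical names used only for illustration.

```rust
use std::time::Duration;

/// Hypothetical error returned once the retry budget is exhausted.
#[derive(Debug, PartialEq)]
struct SendError {
    /// Total number of attempts made (initial attempt plus retries).
    attempts: u32,
}

/// Calls `send` until it reports success or `max_retries` retries are used up.
/// `send` receives the zero-based attempt number and returns `true` on success.
/// On success the number of attempts that were needed is returned.
fn send_with_retries<F>(max_retries: u32, backoff: Duration, mut send: F) -> Result<u32, SendError>
where
    F: FnMut(u32) -> bool,
{
    for attempt in 0..=max_retries {
        if send(attempt) {
            return Ok(attempt.saturating_add(1));
        }
        // Only wait if another attempt will follow.
        if attempt < max_retries {
            std::thread::sleep(backoff);
        }
    }
    Err(SendError {
        attempts: max_retries.saturating_add(1),
    })
}

fn main() {
    // A stale connection never accepts a message: with a small budget the
    // caller gets an error back instead of retrying indefinitely.
    let stale = send_with_retries(3, Duration::from_millis(1), |_| false);
    println!("{:?}", stale);

    // A connection that recovers on the third attempt still succeeds.
    let recovered = send_with_retries(3, Duration::from_millis(1), |attempt| attempt >= 2);
    println!("{:?}", recovered);
}
```

With a finite budget the caller sees the failure and can report it or reconnect, rather than waiting for the stale TCP connection to time out.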

An alternative fix that was tried: Use the `connect` rdkafka callback to
hook into the place where it would issue the UNIX `connect` call. There
we can manipulate the socket. Setting `TCP_USER_TIMEOUT` to 5000ms also
solves the issue somewhat, but might have different implications (also
it then takes around 5s to kill the connection). Since this is a more
hackish implementation and somewhat an unofficial way to configure
rdkafka, I decided against it.

Test Setup
==========

```rust
#[tokio::test]
async fn write_forever() {
    maybe_start_logging();
    let conn = maybe_skip_kafka_integration!();
    let adapter = KafkaTestAdapter::new(conn);
    let ctx = adapter.new_context(NonZeroU32::new(1).unwrap()).await;

    let writer = ctx.writing(true).await.unwrap();
    let lp = "upc user=1 100";
    let sequencer_id = set_pop_first(&mut writer.sequencer_ids()).unwrap();

    for i in 1.. {
        println!("{}", i);

        let tables = mutable_batch_lp::lines_to_batches(lp, 0).unwrap();
        let write = DmlWrite::new(tables, DmlMeta::unsequenced(None));
        let operation = DmlOperation::Write(write);
        let res = writer.store_operation(sequencer_id, &operation).await;
        dbg!(res);

        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}
```

Make sure to set the rdkafka `log` config to `all`. Then use KinD,
set up a 3-node Strimzi cluster and start the test binary within the K8s
cluster. You need to start a debug container that is close enough to
your developer system (e.g. an old Debian DOES NOT work if you run
bleeding edge Arch):

```console
$(host) kubectl run -i --tty --rm debug --image=archlinux --restart=Never -n kafka -- bash
```

Then copy the test binary over to the container using [cargo-with](https://github.com/cbourjau/cargo-with):

```console
$(host) cargo with 'kubectl cp {bin} kafka/debug:/foo' -- test -p write_buffe
```

Within the container shell that you've just created, start the
forever-running test (make sure to set `KAFKA_CONNECT` according to your
Strimzi setup!):

```console
$(container) TEST_INTEGRATION=1 KAFKA_CONNECT=my-cluster-kafka-bootstrap:9092 RUST_BACKTRACE=1 RUST_LOG=debug ./foo write_forever --nocapture
```

The test should run and tell you that it is delivering messages. It also
tells you within the debug logs which broker it sends the messages to.
Now you need to kill the broker (in my example it was `my-cluster-kafka-1`):

```console
$(host) kubectl -n kafka delete pod my-cluster-kafka-1
```

The test should now stop delivering messages and report errors. Without
this patch it might take over 100s for it to recover even after the
deleted pod was re-created. With this patch it is quickly able to
deliver data again after the broker comes back online.

Fixes #3030.
2021-11-19 09:53:57 +01:00
Nga Tran c148251dcb feat: implement step2: compact and persist os chunks 2021-11-18 18:18:55 -05:00
Carol (Nichols || Goulding) a2454b542d
fix: Small cleanups in Cargo.tomls (#3160)
* fix: Add tokio rt-multi-thread feature so cargo test -p client_util compiles

* fix: Alphabetize dependencies

* fix: Add the data_types_conversions feature to get tests passing

* fix: Remove dev dependencies already listed under normal dependencies

* fix: Make sure the workspace is using the new resolver
2021-11-18 22:26:33 +00:00
Jacob Marble 2976244244
chore: update one-shot Dockerfile to not depend on rust:ci (#3133)
* chore: update one-shot Dockerfile to not depend on rust:ci

* chore: update Debian to bullseye

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-11-18 16:01:32 +00:00
kodiakhq[bot] 0d500b135b
Merge pull request #3118 from influxdata/cn/alias-db-commands
fix: Make delete/restore aliases for release/claim; remove tombstone
2021-11-18 15:22:10 +00:00
kodiakhq[bot] c9f02f83e7
Merge branch 'main' into cn/alias-db-commands 2021-11-18 15:13:43 +00:00
Nga Tran ccef3b535a feat: clean up and add comments for next steps 2021-11-18 10:11:51 -05:00
Andrew Lamb 5e7336b475
docs: Tweak comments on Mailbox (#3152) 2021-11-18 14:19:19 +00:00
Andrew Lamb 1fae3559cf
docs: document differences between Mailbox and channel (#3148) 2021-11-18 13:08:03 +00:00
kodiakhq[bot] d42b416bdb
Merge pull request #3138 from influxdata/crepererum/update_ci_builder
ci: update CI image builder to use newer docker
2021-11-18 10:03:31 +00:00
kodiakhq[bot] ba4e7c2dff
Merge branch 'main' into crepererum/update_ci_builder 2021-11-18 09:56:15 +00:00
Marco Neumann fef6cafa24 ci: explain some circle decisions 2021-11-18 10:55:36 +01:00
Raphael Taylor-Davies 714fc85c8d
refactor: extract Mailbox type (#3126) (#3142)
* refactor: extract Mailbox type (#3126)

* fix: doc

* chore: review feedback

Co-authored-by: Andrew Lamb <alamb@influxdata.com>
2021-11-18 09:34:06 +00:00
Nga Tran a5c04e5fe4 feat: framework for compact os chunks 2021-11-17 18:12:51 -05:00
Carol (Nichols || Goulding) f69d37e9a8
fix: Remove database delete/restore entirely 2021-11-17 12:03:11 -05:00
Carol (Nichols || Goulding) 7783e4a7ff
fix: Make delete/restore aliases for release/claim; remove tombstone
Fixes #2680
2021-11-17 11:41:08 -05:00
Raphael Taylor-Davies 8155747735
feat: add write buffer delete encoding (#2731) (#3127)
* feat: add write buffer delete encoding (#2731)

* chore: fix doc

* chore: review feedback

* chore: review feedback

* chore: fmt

* chore: review feedback
2021-11-17 16:12:19 +00:00
Andrew Lamb b5a7bf03da
feat: Add kafka write buffer consumer metrics (#3129)
* feat: Add kafka write buffer consumer metrics

* refactor: use unwrap_or_else

* fix: Update bucket boundaries
2021-11-17 14:35:40 +00:00
Andrew Lamb 47acd181c5
chore: Update datafusion + arrow/parquet/arrow-flight 6.2.0 (#3136)
* chore: Update datafusion and arrow

* chore: update arrow/parquet/arrow-flight to 6.2.0

* refactor: Add table_exists to SchemaProvider impl

* fix: clippy

* fix: clippy 2

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-11-17 14:04:49 +00:00
Dom da61966858
build: remove proc-macro2 pin (#3137)
Seems unused, builds without it!

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-11-17 13:55:39 +00:00
Marco Neumann c9168a2c13 ci: update CI image builder to use newer docker
This is a precondition to build ARM64 CI images.
2021-11-17 14:17:27 +01:00
Andrew Lamb d6c6e9a6c7
fix: Default kafka timeout to be shorter than gRPC timeout (60 sec --> 10 sec) (#3131)
* fix: Default kafka timeout to be shorter than gRPC timeout

* docs: fix link style
2021-11-17 12:19:53 +00:00
kodiakhq[bot] a87a320eb3
Merge pull request #3134 from influxdata/crepererum/bullseye
ci: update CI images from docker buster to bullseye
2021-11-17 09:46:11 +00:00
Marco Neumann 640cd88df3 ci: update CI images from docker buster to bullseye
This will break `perf_image` until the new CI image is built due to the
newly required `--all-tags` parameter to `docker push` that isn't
available for the docker version we run on buster.
2021-11-17 10:00:31 +01:00
kodiakhq[bot] 76790cadd8
Merge pull request #3135 from influxdata/crepererum/tokio140
chore: upgrade tokio to 1.14.0 to fix RUSTSEC-2021-0124
2021-11-17 08:58:18 +00:00
Marco Neumann 04d8133227 chore: upgrade tokio to 1.14.0 to fix RUSTSEC-2021-0124 2021-11-17 09:44:52 +01:00
Andrew Lamb 38ca9e1339
fix: capture all panic messages in logs (#3130) 2021-11-16 21:59:05 +00:00
kodiakhq[bot] 35f5725a3a
Merge pull request #3120 from influxdata/crepererum/issue3100
feat: emit Kafka stats as metrics instead of logs
2021-11-16 16:26:19 +00:00
Marco Neumann 79929c8cf4 feat: add more Kafka metrics 2021-11-16 17:18:41 +01:00
Marco Neumann 9ee004946e fix: do not overload rdkafka w/ statistics 2021-11-16 17:18:41 +01:00
Marco Neumann e6fdd79a0f feat: emit Kafka stats as metrics instead of logs
This maps a subset of Kafka stats as metrics. The set can -- of course
-- be changed in the future depending on our needs.

Fixes #3100.
2021-11-16 17:18:41 +01:00
Raphael Taylor-Davies 553e412226
refactor: DMLOperation write path (#2731) (#3121)
* refactor: DMLOperation write path (#2731)

* chore: fmt

* chore: review feedback
2021-11-16 12:42:19 +00:00
kodiakhq[bot] f3fd94148c
Merge pull request #3113 from influxdata/crepererum/issue3063
fix: ensure `ConsistenHasher` is consistent
2021-11-16 08:49:53 +00:00
kodiakhq[bot] 88de603fc2
Merge branch 'main' into crepererum/issue3063 2021-11-16 08:41:55 +00:00
Carol (Nichols || Goulding) bc11244828
feat: Rename database disown/adopt to release/claim (#3111)
* fix: Rename 'disown' to 'release' database

Connects to #3110

* fix: Rename 'adopt' to 'claim' database

Fixes #3110.
2021-11-15 20:28:09 +00:00
kodiakhq[bot] 2a9d840161
Merge pull request #3090 from influxdata/cn+jpg/adopt
feat: Add an Adopt Database API
2021-11-15 19:40:43 +00:00
Carol (Nichols || Goulding) d759d98612
fix: Update new code with API that changed since branching from main 2021-11-15 14:32:50 -05:00
kodiakhq[bot] cc693a780e
Merge branch 'main' into cn+jpg/adopt 2021-11-15 19:22:07 +00:00
Carol (Nichols || Goulding) 3545f6d65a
fix: Pass through error for already-owned database 2021-11-15 14:15:56 -05:00
kodiakhq[bot] 1cfbbf0245
Merge pull request #3115 from influxdata/crepererum/issue3020a
refactor: clarify `ServerType` background worker handling
2021-11-15 17:45:25 +00:00
Marco Neumann 1d68980e4f
fix: typo
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
2021-11-15 18:37:31 +01:00
Marco Neumann c88930a6a5 refactor: clarify `ServerType` background worker handling
Ref #3020.
2021-11-15 18:28:32 +01:00
Marco Neumann 4e71de508e fix: ensure `ConsistenHasher` is consistent
The std `DefaultHasher` is NOT guaranteed to stay the same, so let's
directly use the `SipHasher13` which at the moment (2021-11-15) is used
by the standard lib.
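
The idea of pinning the hash algorithm can be sketched with the standard library alone. The commit itself pins `SipHasher13`; the sketch below swaps in 64-bit FNV-1a only because it fits in a few lines, and `fnv1a_64` and `pick_shard` are hypothetical names, not part of this code base.

```rust
/// 64-bit FNV-1a over raw bytes. The algorithm and its constants are fixed,
/// so the result does not depend on the Rust version or on `DefaultHasher`.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;

    let mut hash = OFFSET_BASIS;
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Maps a key to one of `shard_count` shards using the pinned hash.
/// Panics if `shard_count` is zero.
fn pick_shard(key: &str, shard_count: usize) -> usize {
    assert!(shard_count > 0, "shard_count must be non-zero");
    // Hash the UTF-8 bytes directly rather than going through `Hash`, whose
    // exact byte feed for `str` is an implementation detail of std.
    (fnv1a_64(key.as_bytes()) % shard_count as u64) as usize
}

fn main() {
    for key in ["cpu", "mem", "disk"] {
        println!("{} -> shard {}", key, pick_shard(key, 4));
    }
}
```

Because every input to the hash is fixed, the same key maps to the same shard across builds and toolchain upgrades, which is the property a consistent hasher needs.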

Fixes #3063.
2021-11-15 17:39:17 +01:00
Raphael Taylor-Davies 3cd7d2eda2
refactor: improve usability of proto conversion traits (#3109)
* refactor: improve usability of proto conversion traits

* chore: review feedback
2021-11-15 16:10:29 +00:00
Jake Goulding af28cfa2a6
feat: Add an adopt database API
Fixes #2679.
2021-11-15 09:26:06 -05:00
Raphael Taylor-Davies 58f3e2e559
refactor: move delete predicate proto serialization to generated_types (#3108)
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2021-11-15 12:02:14 +00:00
kodiakhq[bot] 60eaf704a9
Merge pull request #3107 from influxdata/crepererum/improve_router_client_errors
feat: improve `RouterClient` errors
2021-11-15 11:45:07 +00:00