Commit Graph

8861 Commits (cb10a7c6d8d7d0a88d1e90848ac62eea9cdcbf90)

Author SHA1 Message Date
Nga Tran cb10a7c6d8
feat: More accurate memory estimate for compaction (#5471)
* feat: initial implementation of memory estimation for a compaction

* feat: estimate size of files and have the right actions for the needed budget

* feat: run candidates in parallel

* fix: have the right name for the column field of the output struct

* feat: add metrics for estimated budgets

* chore: cleanup

* chore: Apply suggestions from code review

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>

* fix: fix syntax after applying review's suggestions

* refactor: Convert a Vec to VecDeque to go well with pop and push

* chore: remove max_concurrent_size_bytes and input_size_threshold_bytes

* chore: remove input_file_count_threshold

* test: tests for estimate_arrow_bytes_for_file

Co-authored-by: Carol (Nichols || Goulding) <193874+carols10cents@users.noreply.github.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-30 13:44:44 +00:00
Dom 887d73f7e1
Merge pull request #5510 from influxdata/dom/empty-parquet
fix: remove empty parquet panic
2022-08-30 14:20:20 +01:00
Dom Dwyer 2fc0ddbea1 fix: compactor tolerates empty output
Changes the compactor code to tolerate a SplitExec yielding an empty
partition (with no rows).

This raises a WARN as the situation in which this is acceptable is very
rare, and is more likely indicative of an opportunity to improve the
SplitExec usage (i.e. pruning out unnecessary split points).
2022-08-30 14:52:31 +02:00
Dom Dwyer 7698264768 refactor: raise error for no rows in parquet file
Previously when attempting to serialise a stream of one or more
RecordBatch containing no rows (resulting in an empty file), the parquet
serialisation code would panic.

This changes the code path to raise an error instead, to support the
compactor making multiple splits at once, which may overlap a single
chunk:

                  ────────────── Time ────────────▶

                          │                │
                  ┌█████──────────────────────█████┐
                  │█████  │    Chunk 1     │  █████│
                  └█████──────────────────────█████┘
                          │                │

                          │                │

                      Split T1         Split T2

In the example above, the chunk has an unusual distribution of write
timestamps over the time range it covers, with all data having a
timestamp before T1, or after T2. When running a SplitExec to slice
this chunk at T1 and T2, the middle of the resulting 3 subsets will
contain no rows. Because we store only the min/max timestamps in the
chunk statistics, it is unfortunately impossible to prune one of these
split points from the plan ahead of time.
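
The situation described above can be reproduced with a minimal sketch. Everything here is a hypothetical stand-in rather than the compactor's real code: `split_by_time`, `min_max` and `prunable_from_stats` are illustrative names, and the inclusive/exclusive handling of rows exactly at a split point is an assumption.

```rust
/// Hypothetical stand-in for a SplitExec: slices row timestamps into
/// `split_points.len() + 1` partitions. A row goes into the first partition
/// whose split point is >= its timestamp (boundary handling is an assumption).
fn split_by_time(timestamps: &[i64], split_points: &[i64]) -> Vec<Vec<i64>> {
    let mut parts = vec![Vec::new(); split_points.len() + 1];
    for &t in timestamps {
        let idx = split_points
            .iter()
            .position(|&s| t <= s)
            .unwrap_or(split_points.len());
        parts[idx].push(t);
    }
    parts
}

/// The only information kept in the chunk statistics: min and max timestamp.
fn min_max(timestamps: &[i64]) -> Option<(i64, i64)> {
    let min = *timestamps.iter().min()?;
    let max = *timestamps.iter().max()?;
    Some((min, max))
}

/// A split point can be pruned from min/max alone only when every row falls
/// on one side of it, i.e. it lies outside `[min, max)`.
fn prunable_from_stats(split: i64, min: i64, max: i64) -> bool {
    split < min || split >= max
}
```

With rows at `[1, 2, 3, 98, 99]` and split points `10` and `90`, both splits lie inside the min/max range, so neither looks prunable, yet the middle partition comes out empty.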
2022-08-30 14:52:31 +02:00
Raphael Taylor-Davies 711ba77341
chore: update object_store to test IMDSv1 fallback (#5509)
* chore: update object_store to test IMDSv1 fallback

* chore: Run cargo hakari tasks

Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-30 12:31:49 +00:00
Marco Neumann fecbbd9fa1
refactor: improve namespace caching in querier (#5492)
1. Cache converted schema instead of catalog schema. This saves a bunch
   of memcopies during conversion.
2. Simplify creation of new chunks, we now only need a `CachedTable`
   instead of a namespace and a table schema.

In an artificial benchmark, this removed around 10ms from the query
(although that was prior to #5467 which moved schema conversion one
level up). Still, I think it is the cleaner cache design.
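
As a rough illustration of point 1, caching the converted schema means the conversion runs once per table and later lookups only clone an `Arc`. The types and names below (`CatalogSchema`, `QuerySchema`, `NamespaceCache`, `get_or_convert`) are hypothetical and are not the querier's actual API.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical catalog-side schema: (column name, column type) pairs.
struct CatalogSchema {
    columns: Vec<(String, String)>,
}

/// Hypothetical query-side schema produced by the conversion.
struct QuerySchema {
    column_names: Vec<String>,
}

/// The conversion step whose repeated copies the cache is meant to avoid.
fn convert(catalog: &CatalogSchema) -> QuerySchema {
    QuerySchema {
        column_names: catalog.columns.iter().map(|(n, _)| n.clone()).collect(),
    }
}

/// Caches the *converted* schema per table name (an assumed keying scheme).
struct NamespaceCache {
    converted: HashMap<String, Arc<QuerySchema>>,
    conversions: usize,
}

impl NamespaceCache {
    fn get_or_convert(&mut self, table: &str, catalog: &CatalogSchema) -> Arc<QuerySchema> {
        if let Some(schema) = self.converted.get(table) {
            // Cache hit: no conversion, only a reference-count bump.
            return Arc::clone(schema);
        }
        self.conversions += 1;
        let schema = Arc::new(convert(catalog));
        self.converted.insert(table.to_string(), Arc::clone(&schema));
        schema
    }
}
```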

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-30 11:42:21 +00:00
Marco Neumann 430536f05f
refactor: use a single timestamp in policy backend (#5508)
* refactor: use a single timestamp in policy backend

Prior to this PR we had at least 1 `TimeProvider::now` call per GET
request (for caches that only used LRU) and up to 3 calls (caches with
LRU + refresh + TTL). Let's instead use a single timestamp that is
created by the policy backend itself (instead of the policies). This has
the following consequences:

- **efficiency:** `SystemProvider::now` is not free: even though under Linux
  this doesn't result in a syscall, it uses the stdlib time system, which
  also checks for monotonicity
- **consistency:** All changes for a single trigger (e.g. a
  GET cache call) now use a single timestamp instead of slightly
  increasing ones. I argue these are the better semantics: simpler to
  understand and easier to debug.

In a (slightly artificial) local performance experiment, this shaves
off around 2ms per single-table SQL query. However, I expect that there might
be more degenerate cases (e.g. multi-table SQL queries or some
InfluxRPC requests that hit multiple tables).

The majority of this patch is moving the `TimeProvider` from the
policies into the policy backend.

* docs: explain `now` parameter
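
The idea of moving the clock into the backend can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual cache system: `CountingClock`, `Policy`, `Lru`, `Ttl` and `PolicyBackend::get` are hypothetical names, and timestamps are plain `u64` values.

```rust
use std::cell::Cell;

/// Hypothetical time source that counts how often it is queried.
struct CountingClock {
    now: u64,
    calls: Cell<u32>,
}

impl CountingClock {
    fn now(&self) -> u64 {
        self.calls.set(self.calls.get() + 1);
        self.now
    }
}

/// Hypothetical policy interface: policies receive `now` from the backend
/// instead of owning a time provider themselves.
trait Policy {
    fn on_get(&mut self, now: u64);
    /// The timestamp-derived value this policy recorded, for inspection.
    fn recorded(&self) -> Option<u64>;
}

struct Lru {
    last_used: Option<u64>,
}

impl Policy for Lru {
    fn on_get(&mut self, now: u64) {
        self.last_used = Some(now);
    }
    fn recorded(&self) -> Option<u64> {
        self.last_used
    }
}

struct Ttl {
    ttl: u64,
    expires_at: Option<u64>,
}

impl Policy for Ttl {
    fn on_get(&mut self, now: u64) {
        self.expires_at = Some(now + self.ttl);
    }
    fn recorded(&self) -> Option<u64> {
        self.expires_at
    }
}

struct PolicyBackend {
    clock: CountingClock,
    policies: Vec<Box<dyn Policy>>,
}

impl PolicyBackend {
    /// One clock read per GET; every policy sees the same timestamp.
    fn get(&mut self) -> u64 {
        let now = self.clock.now();
        for policy in self.policies.iter_mut() {
            policy.on_get(now);
        }
        now
    }
}
```

However many policies are attached, a single `get` reads the clock once, so all bookkeeping for that request is derived from one consistent timestamp.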
2022-08-30 11:23:25 +00:00
kodiakhq[bot] bf0a0ab3a5
Merge pull request #5505 from influxdata/dom/revert-object-store-bump
revert: object store bump
2022-08-30 08:56:10 +00:00
Dom 89af2f2b1d
Merge branch 'main' into dom/revert-object-store-bump 2022-08-30 09:47:02 +01:00
Dom 91167428f2
Merge pull request #5504 from influxdata/dom/dotenvy
build: bump dotenvy
2022-08-30 09:46:00 +01:00
Dom Dwyer 66f0b59dbb revert: remove Azure SDK / bump object_store
This reverts commit c2f8efa03a.
2022-08-30 10:41:29 +02:00
Dom Dwyer e752a707f8 revert: remove audit ignore for RUSTSEC-2022-0048
This reverts commit 227149e5b6.
2022-08-30 10:39:55 +02:00
Dom Dwyer dcc0f9d34f build: bump dotenvy
I fixed this while waiting for my build to deploy. I think that says
more about our build than anything else!
2022-08-30 10:34:26 +02:00
Dom 5530d02adb
Merge pull request #5500 from influxdata/dependabot/cargo/futures-0.3.24
chore(deps): Bump futures from 0.3.23 to 0.3.24
2022-08-30 09:20:20 +01:00
Dom 747f5440e1
Merge pull request #5496 from influxdata/dependabot/cargo/futures-channel-0.3.24
chore(deps): Bump futures-channel from 0.3.23 to 0.3.24
2022-08-30 09:20:12 +01:00
Dom b3a7602b47
Merge pull request #5503 from influxdata/dependabot/cargo/futures-core-0.3.24
chore(deps): Bump futures-core from 0.3.23 to 0.3.24
2022-08-30 09:19:07 +01:00
dependabot[bot] 852f6c5749
chore(deps): Bump futures-core from 0.3.23 to 0.3.24
Bumps [futures-core](https://github.com/rust-lang/futures-rs) from 0.3.23 to 0.3.24.
- [Release notes](https://github.com/rust-lang/futures-rs/releases)
- [Changelog](https://github.com/rust-lang/futures-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-lang/futures-rs/compare/0.3.23...0.3.24)

---
updated-dependencies:
- dependency-name: futures-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-08-30 01:25:21 +00:00
dependabot[bot] 0137db9adc
chore(deps): Bump futures from 0.3.23 to 0.3.24
Bumps [futures](https://github.com/rust-lang/futures-rs) from 0.3.23 to 0.3.24.
- [Release notes](https://github.com/rust-lang/futures-rs/releases)
- [Changelog](https://github.com/rust-lang/futures-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-lang/futures-rs/compare/0.3.23...0.3.24)

---
updated-dependencies:
- dependency-name: futures
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-08-30 01:24:21 +00:00
dependabot[bot] 480bcbda18
chore(deps): Bump futures-channel from 0.3.23 to 0.3.24
Bumps [futures-channel](https://github.com/rust-lang/futures-rs) from 0.3.23 to 0.3.24.
- [Release notes](https://github.com/rust-lang/futures-rs/releases)
- [Changelog](https://github.com/rust-lang/futures-rs/blob/master/CHANGELOG.md)
- [Commits](https://github.com/rust-lang/futures-rs/compare/0.3.23...0.3.24)

---
updated-dependencies:
- dependency-name: futures-channel
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-08-30 01:23:16 +00:00
kodiakhq[bot] 00aa4b9c83
Merge pull request #5470 from influxdata/cn/kafka-topic
feat: Renaming kafka topic types
2022-08-29 20:53:04 +00:00
kodiakhq[bot] 419efb91e9
Merge branch 'main' into cn/kafka-topic 2022-08-29 20:46:33 +00:00
Andrew Lamb de47f5605b
chore: Update datafusion (with new sqlparser release) - option 1 (#5433)
* chore: Update datafusion pin

* chore: Update now that user is a reserved word

* chore: Update cargo.lock

* fix: update query for user function

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-29 19:10:00 +00:00
Carol (Nichols || Goulding) dbd27f648f
refactor: Rename more mentions of Kafka to their other name where appropriate 2022-08-29 14:27:02 -04:00
Carol (Nichols || Goulding) 1b49ad25f7
refactor: Rename KafkaTopicId to TopicId 2022-08-29 14:27:02 -04:00
Carol (Nichols || Goulding) 58f0b63cdc
refactor: Rename KafkaTopic to Topic or TopicMetadata or topic name as appropriate 2022-08-29 14:27:02 -04:00
kodiakhq[bot] 122dbe1b4b
Merge pull request #5435 from influxdata/cn+jpg/shard
feat: renaming some of the confusing sequencer things
2022-08-29 18:16:15 +00:00
Carol (Nichols || Goulding) cb52683a1a
fix: Redo uses after rebase 2022-08-29 14:08:33 -04:00
Carol (Nichols || Goulding) 3aa3ae2ba5
docs: Add more comments about why to use ShardIndex or ShardId 2022-08-29 14:07:20 -04:00
Carol (Nichols || Goulding) 74c9529062
fix: Rename KafkaPartition to ShardIndex 2022-08-29 14:07:18 -04:00
Carol (Nichols || Goulding) c9567cad7d
fix: Rename some more sequencer to shard 2022-08-29 14:06:45 -04:00
Carol (Nichols || Goulding) ab20828c2f
fix: Rename some more comments and test values from sequencer to shard 2022-08-29 14:06:45 -04:00
Carol (Nichols || Goulding) 6443858870
fix: Rename compactor option from sequencer to shard 2022-08-29 14:06:45 -04:00
Carol (Nichols || Goulding) 95b7529079
fix: Rename more test values to shard 2022-08-29 14:06:45 -04:00
Carol (Nichols || Goulding) fe9c474620
fix: rustfmt 2022-08-29 14:06:45 -04:00
Carol (Nichols || Goulding) fbae4282df
fix: Rename another sequencer to shard to be hopefully clearer 2022-08-29 14:06:45 -04:00
Carol (Nichols || Goulding) f6c93f7e67
fix: Remove moot comment 2022-08-29 14:06:44 -04:00
Carol (Nichols || Goulding) 952a3ea498
fix: Return querier sharding to use sequencer ID 2022-08-29 14:06:44 -04:00
Carol (Nichols || Goulding) 240946d8f5
fix: Deprecate proto sequencer_id fields; add shard_id fields 2022-08-29 14:06:44 -04:00
Carol (Nichols || Goulding) 698f1a47ff
refactor: Rename test structures from sequencer to shard where appropriate 2022-08-29 14:06:44 -04:00
Jake Goulding 4abf21c724
refactor: Rename Sequencer (and its entourage) to Shard 2022-08-29 14:06:43 -04:00
Sam Arnold 05657ea068
fix: optimizations for metadata fetch and chunk pruning (#5467)
* fix: hoist repeated computation out of chunk creation

We have hundreds of chunks per table, so it is beneficial to only
do common work once.

* chore: remove TableCache as it is no longer used

* fix: prune chunks both before and after metadata fetch

Fetching the metadata for all the chunks in a table is expensive,
especially when we have a narrow time range query that only
needs a few chunks.

* chore: fix clippy

* fix: fix up some last tests

* fix: review comments
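
A minimal sketch of the two-stage pruning described above: chunks are first filtered on cheap information (here an assumed per-chunk time range), the expensive metadata is fetched only for the survivors, and a second filter runs on the fetched metadata. `ChunkRef`, `ChunkMeta`, `select_chunks` and the column-based second check are all hypothetical; the commit does not say what the second pruning pass inspects.

```rust
/// Hypothetical cheap chunk descriptor, available without a metadata fetch.
struct ChunkRef {
    id: u32,
    min_time: i64,
    max_time: i64,
}

/// Hypothetical expensive-to-fetch metadata.
struct ChunkMeta {
    id: u32,
    columns: Vec<String>,
}

/// Prunes by time range first, fetches metadata only for the survivors,
/// then prunes again using the fetched metadata.
fn select_chunks<F>(
    chunks: &[ChunkRef],
    query_range: (i64, i64),
    needed_column: &str,
    mut fetch: F,
) -> Vec<u32>
where
    F: FnMut(&ChunkRef) -> ChunkMeta,
{
    chunks
        .iter()
        // Stage 1: cheap overlap check, no metadata needed.
        .filter(|c| c.max_time >= query_range.0 && c.min_time <= query_range.1)
        // Expensive step, now limited to chunks that overlap the query.
        .map(|c| fetch(c))
        // Stage 2: prune again with the detailed metadata (assumed criterion).
        .filter(|m| m.columns.iter().any(|col| col.as_str() == needed_column))
        .map(|m| m.id)
        .collect()
}
```

For a narrow time range query over a table with hundreds of chunks, the number of `fetch` calls then tracks the number of overlapping chunks rather than the table size.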

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-29 14:59:05 +00:00
Marco Neumann e441b5b307
feat: add deadline config to backoff system (#5489)
This will simplify event emission in #5464.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-29 14:51:41 +00:00
kodiakhq[bot] 4f119d1e40
Merge pull request #5485 from influxdata/dom/kafka-msg-size-dist
feat: Kafka payload size distribution metric
2022-08-29 13:55:44 +00:00
kodiakhq[bot] 339b1e8b92
Merge branch 'main' into dom/kafka-msg-size-dist 2022-08-29 13:49:18 +00:00
Andrew Lamb 9aac78d30b
fix: Correctly lexicographically sort `_field` and `_measurement` with upper case tag keys (#5436)
Co-authored-by: Dom <dom@itsallbroken.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-29 13:45:03 +00:00
Adrian Thurston 33e31725c9
feat: added rustup toolchain dir to docker buildkit cache (#5474)
Added /usr/local/rustup to the list of directories cached during build. This is
where rustup installs the toolchain, so we save the download and install on
every build.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-29 12:32:20 +00:00
Dom 247841dbf1
Merge branch 'main' into dom/kafka-msg-size-dist 2022-08-29 13:27:38 +01:00
Marco Neumann 8bc7606cb5
refactor: provide process-wide static strings (version, UUID) (#5487)
We currently only use the human-readable version string for the CLI
help, but for #5464 I want to use the GIT hash and a process-time UUID.
This is the prep work for that.

Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
2022-08-29 12:25:19 +00:00
kodiakhq[bot] 0ddc21ef40
Merge pull request #5488 from influxdata/dom/fix-audit
build: bump object_store
2022-08-29 12:18:04 +00:00
Dom Dwyer 175cae2f56 feat: capture Kafka message size distribution
Adds instrumentation to the low-level (post-aggregation) Kafka client,
capturing the uncompressed, approximate message size (calculated as the
sum of all Record::approximate_size() returns, ignoring largely static
framing overhead).
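
The size calculation described above can be sketched like this. The `Record` type and its `approximate_size` body are simplified stand-ins (the real method belongs to the Kafka client library and may count more fields), and `SizeHistogram` with its bucket bounds is a hypothetical substitute for the actual metric type.

```rust
/// Simplified stand-in for a Kafka record; only payload bytes are modelled.
struct Record {
    key: Vec<u8>,
    value: Vec<u8>,
}

impl Record {
    /// Assumed approximation: key plus value length, ignoring framing.
    fn approximate_size(&self) -> usize {
        self.key.len() + self.value.len()
    }
}

/// Uncompressed, approximate message size: the sum over all records.
fn approximate_message_size(records: &[Record]) -> usize {
    records.iter().map(Record::approximate_size).sum()
}

/// Hypothetical histogram with inclusive upper bucket bounds plus one
/// overflow bucket for anything larger than the last bound.
struct SizeHistogram {
    bounds: Vec<usize>,
    counts: Vec<u64>,
}

impl SizeHistogram {
    fn new(bounds: Vec<usize>) -> Self {
        let counts = vec![0; bounds.len() + 1];
        Self { bounds, counts }
    }

    fn observe(&mut self, size: usize) {
        let idx = self
            .bounds
            .iter()
            .position(|&b| size <= b)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
    }
}
```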
2022-08-29 14:08:51 +02:00