* refactor: `TestPartition::update_sort_key` should return an `Arc`
The whole test framework is built around `Arc`s, so let's fix this
consistency issue.
* fix: actually calculate correct column set in test framework
* feat: check expected parquet file schema
While working on the querier I made some mistakes regarding schemas and
such a check would have greatly improved the debugging experience.
* feat: namespace cache expiration
* fix: improve parquet schema check
* fix: remove clone
The low-level chunk storage shouldn't care about the table name (this is
also true for parquet chunks, btw). In fact, the table name is only
partial information anyway, since it lacks the namespace.
If we need a table name, then the high-level chunk/data management is
responsible for that.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: store per-file column set in catalog
Together with the table-wide schema and the partition-wide sort key, this should
be everything we need to read a parquet file directly into memory
without peeking at any file-level metadata.
The querier will use this to directly load parquet files into the read
buffer.
**WARNING: This requires a catalog wipe!**
Ref #4124.
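To make the data flow concrete, here is a minimal self-contained sketch (type names such as `ParquetReadInfo` are illustrative, not the actual IOx types) of how the table-wide schema, the partition-wide sort key, and the new per-file column set together describe a file well enough to load it without reading its embedded metadata:

```rust
use std::collections::BTreeSet;
use std::sync::Arc;

/// Illustrative stand-ins for the real catalog/schema types.
#[derive(Debug, Clone)]
struct Column {
    name: String,
    // column type omitted for brevity
}

#[derive(Debug)]
struct TableSchema {
    columns: Vec<Column>,
}

/// Per-file column set as stored in the catalog.
#[derive(Debug, Clone)]
struct ColumnSet(BTreeSet<String>);

/// Everything needed to read one parquet file directly into memory,
/// without peeking at the file-level parquet metadata.
struct ParquetReadInfo {
    table_schema: Arc<TableSchema>, // table-wide
    sort_key: Vec<String>,          // partition-wide
    column_set: ColumnSet,          // per-file
}

impl ParquetReadInfo {
    /// The schema of this particular file is the table schema projected
    /// down to the columns recorded for the file.
    fn file_columns(&self) -> Vec<Column> {
        self.table_schema
            .columns
            .iter()
            .filter(|c| self.column_set.0.contains(&c.name))
            .cloned()
            .collect()
    }
}
```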
* refactor: use proper `ColumnSet` type
* refactor(querier): split ingester partitions into chunks
With the new wire protocol the ingester can now transmit multiple
snapshots per partition with different schemas. This changes the querier
to reflect this: the individual snapshots are used as chunks for the
query engine instead of a single partition.
The schema handling was changed so that instead of enforcing a
table-wide schema, we now use the snapshot-specific projections. This means we
do not need to create all-NULL columns any longer because the batches
within the chunks now always have the correct schema.
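A rough sketch of that splitting step; the types are made up for illustration and the real querier code is more involved:

```rust
use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;

/// One snapshot as transmitted by the ingester (illustrative).
struct IngesterSnapshot {
    batches: Vec<RecordBatch>,
}

/// One chunk handed to the query engine; carries its own schema.
struct IngesterChunk {
    /// Schema of this chunk only; may differ between chunks of the same
    /// partition, so no all-NULL padding to a table-wide schema is needed.
    schema: SchemaRef,
    batches: Vec<RecordBatch>,
}

fn snapshots_to_chunks(snapshots: Vec<IngesterSnapshot>) -> Vec<IngesterChunk> {
    snapshots
        .into_iter()
        .filter(|s| !s.batches.is_empty())
        .map(|s| IngesterChunk {
            // all batches within one snapshot share the same schema
            schema: s.batches[0].schema(),
            batches: s.batches,
        })
        .collect()
}
```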
* refactor: "disassembler" -> "decoder"
* fix: make ChunkOrder a u64 data type to accept the min sequence number 0
* fix: make ChunkOrder an i64 to match the sequence number type
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
The metrics and logs introduced in #4806 were emitted once for all
ingesters instead of per request. That accumulated view makes it pretty
hard to judge the actual request-response timings and the number of
requests.
Instead, we now measure the data per request.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* refactor: use new ingester<>querier wire protocol
Use and document the new and more flexible ingester<>querier wire
protocol.
Note that the ingester does NOT stream the response data yet, but the
internal data structures would allow that. A follow-up change will
adjust the ingester code to stream the data.
Ref #4849.
* fix: typos
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: clarify naming and public interface
* test: add schema assertion to `ingester_response_to_record_batches`
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: prepare new ingester<>querier protocol on the querier side
This changes the querier internals to work with the new protocol. The
wire protocol stays the same (for now). There's a (somewhat hackish)
adapter in place on the querier side that converts the old to the new
protocol on-the-fly. This is an intermediate step before we actually
change the wire protocol (and in a step after that also take advantage
of the new possibilities on the ingester side).
Ref #4849.
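A simplified sketch of what such an adapter can look like; the message and type names here are illustrative and do not mirror the real querier code:

```rust
use arrow::record_batch::RecordBatch;

/// Shape the querier internals are now written against (illustrative).
enum NewProtocolMessage {
    /// Partition metadata, followed by the chunks of that partition.
    Partition { partition_id: i64 },
    RecordBatch(RecordBatch),
}

/// Old wire protocol: one flat list of batches per partition.
struct OldResponse {
    partition_id: i64,
    batches: Vec<RecordBatch>,
}

/// The adapter converts an old-style response into the message sequence
/// that the new-protocol code paths expect.
fn adapt(old: OldResponse) -> Vec<NewProtocolMessage> {
    std::iter::once(NewProtocolMessage::Partition {
        partition_id: old.partition_id,
    })
    .chain(old.batches.into_iter().map(NewProtocolMessage::RecordBatch))
    .collect()
}
```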
* docs: explain adapter
* feat: extend flight client to accept multiple (changing) schemas
See #4849.
Originally I intended not to use Flight at all for the new
ingester<>querier protocol. However, since Flight already deals with
dictionary batches and multiple batches, and the gRPC protocol I would
write myself would look very similar, I will use Flight with slightly
more flexible message types.
The rough idea for the protocol is the following stream:
- for each partition:
  1. "none" message with partition metadata
  2. for each chunk (can have different schemas under certain
     circumstances):
     1. "schema" message (resets dictionary state)
     2. (optional) dictionary batch messages
     3. one or more "record batch" messages
The nice thing about it is that the same arrow client also works for the
existing client<>querier protocol since there we just send:
1. "schema" message (no app metadata)
2. (optional) dictionary batch messages
3. zero, one, or more "record batch" messages (no app metadata)
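The following sketch shows how a client might consume such a stream; the message enum and chunk type are illustrative, not the actual flight client API:

```rust
use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;

/// Illustrative message kinds; app metadata distinguishes them on the wire.
enum FlightMessage {
    /// "none" message: no payload, app metadata carries the partition info.
    None { partition_meta: Vec<u8> },
    /// "schema" message: opens a new chunk and resets dictionary state.
    Schema(SchemaRef),
    /// Dictionary batch: updates dictionary state for subsequent batches.
    Dictionary(Vec<u8>),
    /// Record batch encoded against the current schema + dictionaries.
    RecordBatch(RecordBatch),
}

struct Chunk {
    schema: SchemaRef,
    batches: Vec<RecordBatch>,
}

fn decode(stream: Vec<FlightMessage>) -> Vec<Chunk> {
    let mut chunks = Vec::new();
    for msg in stream {
        match msg {
            FlightMessage::None { .. } => {
                // new partition; its metadata is handled elsewhere
            }
            FlightMessage::Schema(schema) => {
                // a schema message resets dictionary state and opens a new chunk
                chunks.push(Chunk { schema, batches: Vec::new() });
            }
            FlightMessage::Dictionary(_) => {
                // would update the dictionary tracker of the current chunk
            }
            FlightMessage::RecordBatch(batch) => {
                if let Some(chunk) = chunks.last_mut() {
                    chunk.batches.push(batch);
                }
            }
        }
    }
    chunks
}
```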
* refactor: separate high- and low-level flight client
It is very unlikely that a user will use the high-level batch-producing
functionality and the low-level stuff within the same session. So let's
split this into two clients (the high-level client uses the low-level one
internally) to avoid confusion.
Also add documentation on our protocol handling.
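A minimal sketch of the split, with made-up type names: the high-level client is a thin layer that drives the low-level message stream and only surfaces decoded record batches:

```rust
use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;

/// Minimal message shape for this sketch (see the protocol sketch above).
enum Message {
    Schema(SchemaRef),
    Dictionary(Vec<u8>),
    RecordBatch(RecordBatch),
}

/// Low-level client: exposes the raw message stream.
struct LowLevelFlightClient;

impl LowLevelFlightClient {
    fn next_message(&mut self) -> Option<Message> {
        // raw gRPC/IPC decoding would live here
        None
    }
}

/// High-level client: hides message handling and yields decoded batches.
/// It is built on top of the low-level client rather than duplicating it.
struct HighLevelFlightClient {
    inner: LowLevelFlightClient,
}

impl HighLevelFlightClient {
    fn next_batch(&mut self) -> Option<RecordBatch> {
        loop {
            match self.inner.next_message()? {
                Message::RecordBatch(batch) => return Some(batch),
                // schema and dictionary messages only update internal state
                Message::Schema(_) | Message::Dictionary(_) => continue,
            }
        }
    }
}
```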
* refactor: enumerate all variants in match statement to better catch errors in the future
* feat: Log time spent requesting ingester partitions
Fixes #4558.
* feat: Record a metric for the duration queriers wait on ingesters
* fix: Use DurationHistogram instead of U64 Histogram
* test: Add a test for the ingester ms metric
* feat: Add back the logging to provide both logging and metrics for ingester duration
* refactor: Use sample_count method on metrics
* feat: Record ingester duration separately for success or failure
* fix: Create a separate test for the ingester metrics
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
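A hedged sketch of the per-request measurement pattern from the ingester-duration commits above; the histogram type is a simple stand-in, not the actual `metric` crate API:

```rust
use std::time::{Duration, Instant};

/// Stand-in for a duration histogram recorder.
#[derive(Default)]
struct DurationHistogram {
    samples: Vec<Duration>,
}

impl DurationHistogram {
    fn record(&mut self, d: Duration) {
        self.samples.push(d);
    }

    /// Mirrors the `sample_count` used in the tests.
    fn sample_count(&self) -> usize {
        self.samples.len()
    }
}

/// Separate histograms so success and failure timings don't mix.
#[derive(Default)]
struct IngesterDurationMetrics {
    success: DurationHistogram,
    error: DurationHistogram,
}

impl IngesterDurationMetrics {
    /// Time a single ingester request and record the duration into the
    /// histogram matching its outcome.
    fn observe<T, E>(&mut self, request: impl FnOnce() -> Result<T, E>) -> Result<T, E> {
        let start = Instant::now();
        let res = request();
        let elapsed = start.elapsed();
        match &res {
            Ok(_) => self.success.record(elapsed),
            Err(_) => self.error.record(elapsed),
        }
        res
    }
}
```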
* chore: TEMP Update DataFusion to pre-release
* chore: update arrow et al to 16.0.0
* chore: Run cargo hakari tasks
* fix: update reader read_dictionary API
* chore: Update to real Datafusion release
* fix: Update parquet API
* fix: update test
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
This commit changes the code base to use a new reference-counted
PartitionKey type wrapper instead of passing a bare String around. This
allows the compiler to type-check and verify usage of the partition key,
and by reference counting the underlying string we reduce memory usage
for some use cases.
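A minimal sketch of the idea (the real `PartitionKey` lives in the IOx data types crate and carries more trait impls):

```rust
use std::sync::Arc;

/// Reference-counted, type-checked partition key (illustrative version).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PartitionKey(Arc<str>);

impl From<&str> for PartitionKey {
    fn from(s: &str) -> Self {
        Self(Arc::from(s))
    }
}

impl std::fmt::Display for PartitionKey {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.write_str(&self.0)
    }
}

/// A function taking `PartitionKey` can no longer be handed an arbitrary
/// `String` by accident, and cloning the key only bumps a refcount.
fn lookup_partition(_key: &PartitionKey) { /* ... */ }
```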
To roughly gauge how much data we re-load into the cache (i.e. data that
was already loaded but was later evicted due to LRU pressure or TTL
eviction), this change introduces a new metric that estimates whether a
cache entry requested from the loader was already seen before (using a
probabilistic filter).
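As a sketch of the idea, a tiny Bloom-style filter can provide the "probably seen before" signal; names and sizing here are illustrative, and the real implementation may use a different probabilistic structure:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Tiny Bloom-style filter: answers "was this key probably seen before?".
struct SeenFilter {
    bits: Vec<bool>,
    /// Exposed as the new metric in the real code: loader requests for
    /// keys that were (probably) already loaded once.
    probably_reloaded: u64,
}

impl SeenFilter {
    fn new(size: usize) -> Self {
        Self {
            bits: vec![false; size],
            probably_reloaded: 0,
        }
    }

    /// Derive two bit positions for the key.
    fn positions(&self, key: &impl Hash) -> [usize; 2] {
        let mut out = [0usize; 2];
        for (i, slot) in out.iter_mut().enumerate() {
            let mut hasher = DefaultHasher::new();
            (i as u64, key).hash(&mut hasher);
            *slot = (hasher.finish() as usize) % self.bits.len();
        }
        out
    }

    /// Record the key; returns `true` if it was probably seen before.
    fn observe(&mut self, key: &impl Hash) -> bool {
        let pos = self.positions(key);
        let seen_before = pos.iter().all(|&p| self.bits[p]);
        for &p in pos.iter() {
            self.bits[p] = true;
        }
        if seen_before {
            self.probably_reloaded += 1;
        }
        seen_before
    }
}
```

Running `observe` on the loader path makes the counter an estimate of how much data is re-loaded after eviction; false positives are possible, which is why the metric is only an approximation.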
* feat: Change data type of catalog Postgres partition's sort_key from a string to an array of strings (a round-trip sketch follows after this commit list)
* test: add column with comma
* fix: use new protobuf field to avoid incompatibility
* fix: ensure sort_key is an empty array rather than NULL
* refactor: address review comments
* refactor: address more comments
* chore: clearer comments
* chore: Update iox_catalog/migrations/20220607102200_change_sort_key_type_to_array.sql
* chore: Update iox_catalog/migrations/20220607102200_change_sort_key_type_to_array.sql
* fix: Rename migration so it will be applied after
Co-authored-by: Marko Mikulicic <mkm@influxdata.com>
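A hedged sketch, assuming a `partition` table with a `sort_key` column and using `sqlx`, of how the sort key round-trips once it is stored as a Postgres array:

```rust
use sqlx::PgPool;

/// With the migration applied, the sort key round-trips as a Postgres
/// `TEXT[]`; an empty sort key is an empty array, never NULL. Array
/// elements may safely contain commas, unlike a single comma-separated
/// string.
async fn get_sort_key(pool: &PgPool, partition_id: i64) -> sqlx::Result<Vec<String>> {
    let sort_key: Vec<String> =
        sqlx::query_scalar("SELECT sort_key FROM partition WHERE id = $1")
            .bind(partition_id)
            .fetch_one(pool)
            .await?;
    Ok(sort_key)
}
```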
* refactor: make `Cache` a trait
To insert more high-level metrics (e.g. cache misses/hits) it would be
helpful if we could easily instrument the layer right above the cache
driver (that combines the backend and the loader). To do that without
polluting the types too much, let's introduce a trait that describes the
driver interface and that we could later wrap with instrumentation (sketched below).
This also pulls out the test into a generic setup, similar to how this
is done for the cache storage backends.
This does NOT include any functionality changes.
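A sketch of the trait shape and an instrumentation wrapper; the signatures are simplified relative to the real querier cache system:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

use async_trait::async_trait;

/// The driver interface: combines a backend and a loader behind `get`.
#[async_trait]
trait Cache: Send + Sync {
    type K: Send;
    type V: Send;

    /// Get the value for `key`, loading it on a miss.
    async fn get(&self, key: Self::K) -> Self::V;
}

/// Wraps any cache driver and counts requests, without the driver
/// knowing anything about metrics.
struct InstrumentedCache<C> {
    inner: C,
    requests: AtomicU64,
}

#[async_trait]
impl<C> Cache for InstrumentedCache<C>
where
    C: Cache,
{
    type K = C::K;
    type V = C::V;

    async fn get(&self, key: Self::K) -> Self::V {
        self.requests.fetch_add(1, Ordering::Relaxed);
        self.inner.get(key).await
    }
}
```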
* fix: typo
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* refactor: rework querier concurrency limiting
With #4752 we introduced a concurrency limit into the querier. It works
by drawing permits from a central semaphore whenever we create a
`QuerierNamespace`. This however only limits concurrency during query
planning and not query execution, because the objects contained within
the plan (chunks and some metadata) neither reference the permit nor the
`QuerierNamespace`.
Now one approach to fix that would be to wire up the permit all the way down
into all the query-related data structures. This however is very fiddly
and potentially will get lost at some point, because as soon as we
transform these data structures -- e.g. into streams -- the permit might
get lost again. This will be potentially query-dependent and very hard
to debug.
So instead we reverse the approach and track the permits at the upper
layer of the stack: the gRPC service entry points. There we also need to
be careful -- e.g. when we return streams to tonic -- but it's way
easier to review that than the deeply nested object hierarchy that is
involved with queries. Also the separation of concerns is a bit clearer:
why would a "chunk" care about "query concurrency" as a whole?
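A sketch of the pattern with a plain `tokio` semaphore (the real code differs in detail): the permit is acquired at the gRPC entry point and moved into the response, so it is held for the whole query execution rather than only for planning.

```rust
use std::sync::Arc;

use arrow::record_batch::RecordBatch;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// The permit lives inside the response, so the concurrency slot is held
/// for as long as the result is alive (e.g. while it is streamed out),
/// not just during query planning.
struct QueryResponse {
    batches: Vec<RecordBatch>,
    _permit: OwnedSemaphorePermit,
}

/// Sketch of a gRPC entry point: acquire the permit first, then plan and
/// execute, and hand the permit to the response.
async fn handle_query(
    semaphore: Arc<Semaphore>,
    run_query: impl std::future::Future<Output = Vec<RecordBatch>>,
) -> QueryResponse {
    let permit = semaphore
        .acquire_owned()
        .await
        .expect("semaphore is never closed");

    let batches = run_query.await;

    QueryResponse {
        batches,
        _permit: permit,
    }
}
```

Because the permit is owned by the response, dropping the response (including dropping a stream halfway through) releases the concurrency slot automatically.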
* refactor: improve gRPC permit keeping and prepare tests
This is a rather quick fix for prod. In the mid-term we probably want to
rethink our deployment strategy, e.g. by using "one query per pod" and
by deploying queryd w/ IOx into the same pod.