* refactor: remove `ParquetFileRepo::flag_for_delete`
* refactor: batch update parquet files in catalog
* refactor: avoid data roundtrips through postgres
* refactor: do not return ID from PG when we do not need it
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: aggregator for DataFusion statistics
Required to implement #7470, esp. to implement the statistics folding
done within `RecordBatchesExec`.
* docs: improve
An integration test asserting that a router returns an error when
attempting to partition a write with an invalid strftime partition
formatter, rather than panicking.
Changes the partitioning logic to be fallible. This prevents an invalid
partition template from causing a panic, previously possible through two
known code paths:
* TagValue formatter referencing a non-tag column
* Time formatter using an invalid strftime format string
If either occurs, the write attempt is now aborted and an error returned
to the user with a HTTP 500 status code.
Additionally unexpected partitioner errors now map to a catch-all error
instead of panicking.
* refactor: add `parquet_file` PG index for querier
Currently the `list_by_table_not_to_delete` catalog query is somewhat
expensive:
```text
iox_catalog_prod=> select table_id, sum((to_delete is NULL)::int) as n from parquet_file group by table_id order by n desc limit 5;
table_id | n
----------+------
1489038 | 7221
1489037 | 7019
1491534 | 5793
1491951 | 5522
1513377 | 5339
(5 rows)
iox_catalog_prod=> EXPLAIN ANALYZE SELECT id, namespace_id, table_id, partition_id, object_store_id,
min_time, max_time, to_delete, file_size_bytes,
row_count, compaction_level, created_at, column_set, max_l0_created_at
FROM parquet_file
WHERE table_id = 1489038 AND to_delete IS NULL;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on parquet_file (cost=46050.91..47179.26 rows=283 width=200) (actual time=464.368..472.514 rows=7221 loops=1)
Recheck Cond: ((table_id = 1489038) AND (to_delete IS NULL))
Heap Blocks: exact=7152
-> BitmapAnd (cost=46050.91..46050.91 rows=283 width=0) (actual time=463.341..463.343 rows=0 loops=1)
-> Bitmap Index Scan on parquet_file_table_idx (cost=0.00..321.65 rows=22545 width=0) (actual time=1.674..1.674 rows=7221 loops=1)
Index Cond: (table_id = 1489038)
-> Bitmap Index Scan on parquet_file_deleted_at_idx (cost=0.00..45728.86 rows=1525373 width=0) (actual time=460.717..460.717 rows=4772117 loops=1)
Index Cond: (to_delete IS NULL)
Planning Time: 0.092 ms
Execution Time: 472.907 ms
(10 rows)
```
I think this may also be because PostgreSQL kinda chooses the wrong
strategy, because it could just look at the existing index and filter
from there:
```text
iox_catalog_prod=> EXPLAIN ANALYZE SELECT id, namespace_id, table_id, partition_id, object_store_id,
min_time, max_time, to_delete, file_size_bytes,
row_count, compaction_level, created_at, column_set, max_l0_created_at
FROM parquet_file
WHERE table_id = 1489038;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using parquet_file_table_idx on parquet_file (cost=0.57..86237.78 rows=22545 width=200) (actual time=0.057..6.994 rows=7221 loops=1)
Index Cond: (table_id = 1489038)
Planning Time: 0.094 ms
Execution Time: 7.297 ms
(4 rows)
```
However PostgreSQL doesn't know the cardinalities well enough. So
let's add a dedicated index to make the querier faster.
* feat: new migration system
* docs: explain dirty migrations
This test constructs a partition key from an arbitrary selection of
pre-defined parts, and uses the resulting template to partition a write
containing an arbitrary selection of pre-defined tag columns.
Once a partition key is derived, the test asserts build_column_values()
reverses it into the original set of tag (column_name, value) tuples
present in the write.
This commit changes the format of partition keys when generated with
non-default partition key templates ONLY. A prior fixture test is
unchanged by this commit, ensuring the default partition keys remain
the same.
When a custom partition key template is provided, it may specify one or
more parts, with the TagValue template causing values extracted from tag
columns to appear in the derived partition key.
This commit changes the generated partition key in the following ways:
* The delimiter of multi-part partition keys; the character used to
delimit partition key parts is changed from "/" to "|" (the pipe
character) as it is less likely to occur in user-provided input,
reducing the encoding overhead.
* The format of the extracted TagValue values (see below).
Building on the work of custom partition key overrides, where an
immutable partition template is resolved and set at table creation time,
the changes in this PR enable the derived partition key to be
unambiguously reversed into the set of tag (column_name, column_value)
tuples it was generated from for use in query pruning logic. This is
implemented by the build_column_values() method in this commit, which
requires both the template, and the derived partition key.
Prior to this commit, a partition key value extracted from a tag column
was in the form "tagname_x" where "x" is the value and "tagname" is the
name of the tag column it was extracted from. After this commit, the
partition key value is in the form "x"; the column name is removed from
the derived string to reduce the catalog storage overhead (a key driver
of COGS). In the case of a NULL tag value, the sentinel value "!" is
inserted instead of the prior "tagname_" marker. In the case of an empty
string tag value (""), the sentinel "^" value is inserted instead of the
"tagname_-" marker, ensuring the distinction between an empty value and
a not-present tag is preserved.
Additionally tag values utilise percent encoding to encode reserved
characters (part delimiter, empty sentinel character, % itself) to
eliminate deserialisation ambiguity.
Examples of how this has changed derived partition keys, for a template
of [Time(YYYY-MM-DD), TagValue(region), TagValue(bananas)]:
Write: time=1970-01-01,region=west,other=ignored
Old: "1970-01-01-region_west-bananas"
New: "1970-01-01|west|!"
Write: time=1970-01-01,other=ignored
Old: "1970-01-01-region-bananas"
New: "1970-01-01|!|!"
This test asserts the partition key of a write derived from the default
partition key template (YYYY-MM-DD).
This test ensures that the default partition keys do not change with
subsequent changes, as these values are what are used today.
Move the import into the submodule itself, rather than re-exporting it
at the crate level.
This will make it possible to link to the specific module/logic.
This commit adds support for the CLI to query the namespace and schema
APIs to retrieve database and table names from the IDs found in WAL
entries being regenerated.
Instead, the type responsible for initialising it handles namespaced
`Write` initialisation and management, as well as the failure paths that
may need handling. This commit introduces a `NamespaceDemultiplexer`
type with a generic implementation allowing fallible `async` lazy init
of any type from a given `NamespaceId`. This paves the way for catalog-aware
initialisation of `LineProtoWriter`s.