This will hold the deterministic ID for partitions.
Until all existing partitions have this value, it is optional/nullable.
The row ID still exists and is used as the main foreign key in the
`parquet_file` and `skipped_compaction` tables.
The `hash_id` has a unique index so that we can look up records based on
it (if it's available).
If the parquet file record has a `partition_hash_id` value, use that to
generate the object storage path instead of the `partition_id`.
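A sketch of what such a migration could look like (the table name `partition`, the `bytea` type, and the index name are assumptions, not the actual migration):

```sql
-- Sketch: the column stays nullable until all existing partitions
-- have been backfilled with a deterministic hash ID.
ALTER TABLE partition
    ADD COLUMN hash_id bytea;

-- Unique index for lookups by hash_id where it is available; rows
-- without a hash_id are excluded from the index entirely.
CREATE UNIQUE INDEX IF NOT EXISTS partition_hash_id_idx
    ON partition (hash_id)
    WHERE hash_id IS NOT NULL;
```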
* refactor: add `parquet_file` PG index for querier
Currently the `list_by_table_not_to_delete` catalog query is somewhat
expensive:
```text
iox_catalog_prod=> select table_id, sum((to_delete is NULL)::int) as n from parquet_file group by table_id order by n desc limit 5;
table_id | n
----------+------
1489038 | 7221
1489037 | 7019
1491534 | 5793
1491951 | 5522
1513377 | 5339
(5 rows)
iox_catalog_prod=> EXPLAIN ANALYZE SELECT id, namespace_id, table_id, partition_id, object_store_id,
min_time, max_time, to_delete, file_size_bytes,
row_count, compaction_level, created_at, column_set, max_l0_created_at
FROM parquet_file
WHERE table_id = 1489038 AND to_delete IS NULL;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on parquet_file  (cost=46050.91..47179.26 rows=283 width=200) (actual time=464.368..472.514 rows=7221 loops=1)
   Recheck Cond: ((table_id = 1489038) AND (to_delete IS NULL))
   Heap Blocks: exact=7152
   ->  BitmapAnd  (cost=46050.91..46050.91 rows=283 width=0) (actual time=463.341..463.343 rows=0 loops=1)
         ->  Bitmap Index Scan on parquet_file_table_idx  (cost=0.00..321.65 rows=22545 width=0) (actual time=1.674..1.674 rows=7221 loops=1)
               Index Cond: (table_id = 1489038)
         ->  Bitmap Index Scan on parquet_file_deleted_at_idx  (cost=0.00..45728.86 rows=1525373 width=0) (actual time=460.717..460.717 rows=4772117 loops=1)
               Index Cond: (to_delete IS NULL)
 Planning Time: 0.092 ms
 Execution Time: 472.907 ms
(10 rows)
```
I think this may also be because PostgreSQL chooses the wrong strategy
here: it could just scan the existing `parquet_file_table_idx` index and
filter from there:
```text
iox_catalog_prod=> EXPLAIN ANALYZE SELECT id, namespace_id, table_id, partition_id, object_store_id,
min_time, max_time, to_delete, file_size_bytes,
row_count, compaction_level, created_at, column_set, max_l0_created_at
FROM parquet_file
WHERE table_id = 1489038;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using parquet_file_table_idx on parquet_file  (cost=0.57..86237.78 rows=22545 width=200) (actual time=0.057..6.994 rows=7221 loops=1)
   Index Cond: (table_id = 1489038)
 Planning Time: 0.094 ms
 Execution Time: 7.297 ms
(4 rows)
```
However, PostgreSQL doesn't know the cardinalities well enough to pick
that plan reliably. So let's add a dedicated index to make the querier
faster.
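One way to do that is a partial index covering exactly this predicate; a sketch (the index name is illustrative, not necessarily what the migration uses):

```sql
-- Sketch: matches `WHERE table_id = $1 AND to_delete IS NULL` directly
-- and stays small because soft-deleted files are excluded.
-- CONCURRENTLY avoids blocking writes but cannot run in a transaction.
CREATE INDEX CONCURRENTLY IF NOT EXISTS parquet_file_table_delete_idx
    ON parquet_file (table_id)
    WHERE to_delete IS NULL;
```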
* feat: new migration system
* docs: explain dirty migrations
Remove unused Postgres indices. This lowers database load and also gives
us room to install actually useful indices (see #7842).
To detect which indices are used, I've used the following query (on the
actual write/master replica in eu-central-1):
```sql
SELECT
n.nspname AS namespace_name,
t.relname AS table_name,
pg_size_pretty(pg_relation_size(t.oid)) AS table_size,
t.reltuples::bigint AS num_rows,
psai.indexrelname AS index_name,
pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
CASE WHEN i.indisunique THEN 'Y' ELSE 'N' END AS "unique",
psai.idx_scan AS number_of_scans,
psai.idx_tup_read AS tuples_read,
psai.idx_tup_fetch AS tuples_fetched
FROM
pg_index i
INNER JOIN pg_class t ON t.oid = i.indrelid
INNER JOIN pg_namespace n ON n.oid = t.relnamespace
INNER JOIN pg_stat_all_indexes psai ON i.indexrelid = psai.indexrelid
WHERE
n.nspname = 'iox_catalog' AND t.relname = 'parquet_file'
ORDER BY 1, 2, 5;
```
At `2023-05-23T16:00:00Z`:
```text
namespace_name | table_name | table_size | num_rows | index_name | index_size | unique | number_of_scans | tuples_read | tuples_fetched
----------------+--------------+------------+-----------+--------------------------------------------------+------------+--------+-----------------+----------------+----------------
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_file_deleted_at_idx | 5398 MB | N | 1693383413 | 21036174283392 | 21336337964
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_file_partition_created_idx | 11 GB | N | 34190874 | 4749070532 | 61934212
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_file_partition_idx | 2032 MB | N | 1612961601 | 9935669905489 | 8611676799872
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_file_pkey | 7135 MB | Y | 453927041 | 454181262 | 453894565
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_file_shard_compaction_delete_created_idx | 14 GB | N | 0 | 0 | 0
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_file_shard_compaction_delete_idx | 8767 MB | N | 2 | 30717 | 4860
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_file_table_idx | 1602 MB | N | 9136844 | 341839537275 | 27551
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_location_unique | 4989 MB | Y | 332341872 | 3123 | 3123
```
At `2023-05-24T09:50:00Z` (i.e. nearly 18h later):
```text
namespace_name | table_name | table_size | num_rows | index_name | index_size | unique | number_of_scans | tuples_read | tuples_fetched
----------------+--------------+------------+-----------+--------------------------------------------------+------------+--------+-----------------+----------------+----------------
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_file_deleted_at_idx | 5448 MB | N | 1693485804 | 21409285169862 | 21364369704
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_file_partition_created_idx | 11 GB | N | 34190874 | 4749070532 | 61934212
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_file_partition_idx | 2044 MB | N | 1615214409 | 10159380553599 | 8811036969123
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_file_pkey | 7189 MB | Y | 455128165 | 455382386 | 455095624
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_file_shard_compaction_delete_created_idx | 14 GB | N | 0 | 0 | 0
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_file_shard_compaction_delete_idx | 8849 MB | N | 2 | 30717 | 4860
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_file_table_idx | 1618 MB | N | 9239071 | 348304417343 | 27551
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_location_unique | 5043 MB | Y | 343484617 | 3123 | 3123
```
The cluster is currently under load and all components are running.
Conclusion:
- `parquet_file_deleted_at_idx`: Used, likely by the GC. We could
  probably shrink this index by binning `deleted_at` (within the index,
  not within the actual database table; see the sketch after this list),
  but let's do this in a later PR.
- `parquet_file_partition_created_idx`: Unused and huge (`created_at` is
NOT binned). So let's remove it.
- `parquet_file_partition_idx`: Used, likely by the compactor and
  querier because we currently don't have a better index (see #7842).
  It also includes deleted files, which is somewhat pointless. May
  become obsolete after #7842; not touching for now.
- `parquet_file_pkey`: Primary key. We should probably use the object
store UUID as a primary key BTW, which would also make the GC faster.
Not touching for now.
- `parquet_file_shard_compaction_delete_created_idx`: Huge unused index.
Shards don't exist anymore. Delete it.
- `parquet_file_shard_compaction_delete_idx`: Same as
`parquet_file_shard_compaction_delete_created_idx`.
- `parquet_file_table_idx`: Used but is somewhat too large because it
contains deleted files. Might become obsolete after #7842, don't touch
for now.
- `parquet_location_unique`: See the note on `parquet_file_pkey`; it's
  pointless to have two unique IDs here. Not touching for now, but this
  is a potential future improvement.
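For the binning idea mentioned under `parquet_file_deleted_at_idx`, a speculative sketch (assuming `to_delete` is a bigint of nanoseconds since the epoch; not an actual migration):

```sql
-- Speculative sketch: index hour-sized bins of the deletion time so
-- many rows share one index key and the index stays small. Queries
-- would need to filter on the same expression to use this index.
CREATE INDEX parquet_file_deleted_at_binned_idx
    ON parquet_file ((to_delete / 3600000000000))
    WHERE to_delete IS NOT NULL;
```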
So we remove:
- `parquet_file_partition_created_idx`
- `parquet_file_shard_compaction_delete_created_idx`
- `parquet_file_shard_compaction_delete_idx`
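A sketch of the corresponding removal migration:

```sql
-- Drop the three unused indices identified above.
DROP INDEX IF EXISTS parquet_file_partition_created_idx;
DROP INDEX IF EXISTS parquet_file_shard_compaction_delete_created_idx;
DROP INDEX IF EXISTS parquet_file_shard_compaction_delete_idx;
```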
- the table is unused
- there are no foreign keys or triggers based on this table
- the design is generally not scalable (N*M entries) and tombstones
  should rather have a timestamp so we can check if a parquet file
  includes that information or not (or some other form of serialization
  mechanism)
- it's currently empty in prod (and never was filled w/ data in any
  cluster)
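The mention of tombstones above suggests the table in question is the tombstone table; a sketch of the removal, assuming that name:

```sql
-- Sketch: assumes the table is named `tombstone`. Safe to drop since
-- it is empty everywhere and nothing references it.
DROP TABLE IF EXISTS tombstone;
```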
* feat: introduce a new way of tracking max_sequence_number for ingester, compactor and querier
* chore: cleanup
* feat: new column max_l0_created_at to order files for deduplication
* chore: cleanup
* chore: debug info for changing cpu.parquet
* fix: update test parquet file
Co-authored-by: Marco Neumann <marco@crepererum.net>
* perf: optimize not to update partitions with newly created level 2 files
* chore: cleanup
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: function to get partition candidates from partition table
* chore: cleanup
* fix: make new_file_at the same value as created_at
* chore: cleanup
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: compactor ignores max file count for first file
chore: typo in comment in compactor
* feat: restore special first file in partition compaction logic; add limit
* fix: calculation in compaction max file count
chore: clippy
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Adds a migration to add a column "persisted_sequence_number" that
defines the inclusive upper bound on sequencer writes that have been
materialised and uploaded to object store for the partition.
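A sketch of such a migration, assuming the column lives on the `partition` table (the text only says "for the partition"):

```sql
-- Sketch: nullable so existing partitions simply read as "nothing
-- persisted yet"; bigint to match the other sequence-number columns.
ALTER TABLE partition
    ADD COLUMN persisted_sequence_number bigint;
```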
* refactor: store per-file column set in catalog
Together with the table-wide schema and the partition-wide sort key, this
should be everything we need to read a parquet file directly into memory
without peeking at any file-level metadata.
The querier will use this to directly load parquet files into the read
buffer.
**WARNING: This requires a catalog wipe!**
Ref #4124.
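A sketch of the schema change, assuming the set is stored as an array of column IDs (and presuming the NOT NULL addition without a backfill is what forces the catalog wipe):

```sql
-- Sketch: per-file column set as an array of column IDs. Adding a
-- NOT NULL column with no default fails on a non-empty table, hence
-- (presumably) the catalog wipe.
ALTER TABLE parquet_file
    ADD COLUMN column_set bigint[] NOT NULL;
```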
* refactor: use proper `ColumnSet` type
* feat: Change data type of catalog Postgres partition's sort_key from a string to an array of strings
* test: add column with comma
* fix: use new protobuf field to avoid incompatibility
* fix: ensure sort_key is an empty array rather than NULL
* refactor: address review comments
* refactor: address more comments
* chore: clearer comments
* chore: Update iox_catalog/migrations/20220607102200_change_sort_key_type_to_array.sql
* chore: Update iox_catalog/migrations/20220607102200_change_sort_key_type_to_array.sql
* fix: Rename migration so it will be applied after
Co-authored-by: Marko Mikulicic <mkm@influxdata.com>
This PR is the first step where we add a new column sort_key_arr whose content we'll manually migrate from sort_key.
When we're done with this, we'll merge https://github.com/influxdata/influxdb_iox/pull/4801/ (whose migration script must be adapted slightly to rename the `sort_key_arr` column back to `sort_key`).
All this must be done while we shut down the ingesters and the compactors.
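A sketch of the two-step plan (names come from the text; the split logic is an assumption and glosses over column names that contain commas, which the "column with comma" test above is about):

```sql
-- Step 1 (this PR): add the array column alongside the old string one.
ALTER TABLE partition
    ADD COLUMN sort_key_arr text[];

-- Manual data migration; naive split shown for illustration only.
UPDATE partition
SET sort_key_arr = string_to_array(sort_key, ',')
WHERE sort_key IS NOT NULL;

-- Step 2 (follow-up PR, adapted from #4801): swap the columns.
-- ALTER TABLE partition DROP COLUMN sort_key;
-- ALTER TABLE partition RENAME COLUMN sort_key_arr TO sort_key;
```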
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
After checking the Postgres workload for the catalog in prod, this
missing index was identified as the cause of unexpectedly expensive plans
for simple queries.
Add indexes so the compactor can find candidate partitions and specific partition files quickly.
Limit the number of level 0 files returned for determining candidates. This should ensure that if compaction is very backed up, it will be able to work through the backlog without evaluating the entire world.
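A sketch of what this could look like (the index name and query shape are illustrative assumptions, not the actual migration):

```sql
-- Sketch: lets the compactor find a partition's live files quickly.
CREATE INDEX IF NOT EXISTS parquet_file_partition_delete_idx
    ON parquet_file (partition_id)
    WHERE to_delete IS NULL;

-- Candidate lookup with a cap on the level-0 files considered, so a
-- large backlog is worked through incrementally ($1 = limit).
SELECT id, partition_id
FROM parquet_file
WHERE compaction_level = 0 AND to_delete IS NULL
ORDER BY created_at
LIMIT $1;
```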
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Set to_delete to the time the file was marked as deleted rather than
true.
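A sketch of the marking step under that change ($1/$2 are illustrative parameters):

```sql
-- Sketch: soft-delete by recording *when* the file was marked deleted
-- instead of a bare boolean ($1 = current catalog time, $2 = file id).
UPDATE parquet_file
SET to_delete = $1
WHERE id = $2;
```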
Fixes #4059.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>