Remove unused Postgres indices. This not only lowers database load but also gives
us room to install actually useful indices (see #7842).
To detect which indices are used, I've used the following query (on the
actual write/master replica in eu-central-1):
```sql
SELECT
n.nspname AS namespace_name,
t.relname AS table_name,
pg_size_pretty(pg_relation_size(t.oid)) AS table_size,
t.reltuples::bigint AS num_rows,
psai.indexrelname AS index_name,
pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
CASE WHEN i.indisunique THEN 'Y' ELSE 'N' END AS "unique",
psai.idx_scan AS number_of_scans,
psai.idx_tup_read AS tuples_read,
psai.idx_tup_fetch AS tuples_fetched
FROM
pg_index i
INNER JOIN pg_class t ON t.oid = i.indrelid
INNER JOIN pg_namespace n ON n.oid = t.relnamespace
INNER JOIN pg_stat_all_indexes psai ON i.indexrelid = psai.indexrelid
WHERE
n.nspname = 'iox_catalog' AND t.relname = 'parquet_file'
ORDER BY 1, 2, 5;
```
At `2023-05-23T16:00:00Z`:
```text
namespace_name | table_name | table_size | num_rows | index_name | index_size | unique | number_of_scans | tuples_read | tuples_fetched
----------------+--------------+------------+-----------+--------------------------------------------------+------------+--------+-----------------+----------------+----------------
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_file_deleted_at_idx | 5398 MB | N | 1693383413 | 21036174283392 | 21336337964
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_file_partition_created_idx | 11 GB | N | 34190874 | 4749070532 | 61934212
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_file_partition_idx | 2032 MB | N | 1612961601 | 9935669905489 | 8611676799872
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_file_pkey | 7135 MB | Y | 453927041 | 454181262 | 453894565
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_file_shard_compaction_delete_created_idx | 14 GB | N | 0 | 0 | 0
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_file_shard_compaction_delete_idx | 8767 MB | N | 2 | 30717 | 4860
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_file_table_idx | 1602 MB | N | 9136844 | 341839537275 | 27551
iox_catalog | parquet_file | 31 GB | 120985000 | parquet_location_unique | 4989 MB | Y | 332341872 | 3123 | 3123
```
At `2023-05-24T09:50:00Z` (i.e. nearly 18h later):
```text
namespace_name | table_name | table_size | num_rows | index_name | index_size | unique | number_of_scans | tuples_read | tuples_fetched
----------------+--------------+------------+-----------+--------------------------------------------------+------------+--------+-----------------+----------------+----------------
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_file_deleted_at_idx | 5448 MB | N | 1693485804 | 21409285169862 | 21364369704
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_file_partition_created_idx | 11 GB | N | 34190874 | 4749070532 | 61934212
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_file_partition_idx | 2044 MB | N | 1615214409 | 10159380553599 | 8811036969123
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_file_pkey | 7189 MB | Y | 455128165 | 455382386 | 455095624
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_file_shard_compaction_delete_created_idx | 14 GB | N | 0 | 0 | 0
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_file_shard_compaction_delete_idx | 8849 MB | N | 2 | 30717 | 4860
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_file_table_idx | 1618 MB | N | 9239071 | 348304417343 | 27551
iox_catalog | parquet_file | 31 GB | 123869328 | parquet_location_unique | 5043 MB | Y | 343484617 | 3123 | 3123
```
The cluster is currently under load and all components are running.
Conclusions:
- `parquet_file_deleted_at_idx`: Used, likely by the GC. We could
probably shrink this index by binning `deleted_at` (within the index,
not within the actual database table), but let's do this in a later PR.
- `parquet_file_partition_created_idx`: Unused and huge (`created_at` is
NOT binned). So let's remove it.
- `parquet_file_partition_idx`: Used, likely by the compactor and
querier because we currently don't have a better index (see #7842 as
well). It also includes deleted files, which is somewhat pointless. May
become obsolete after #7842; not touching for now.
- `parquet_file_pkey`: Primary key. BTW, we should probably use the
object store UUID as the primary key, which would also make the GC
faster. Not touching for now.
- `parquet_file_shard_compaction_delete_created_idx`: Huge unused index.
Shards don't exist anymore. Delete it.
- `parquet_file_shard_compaction_delete_idx`: Same as
`parquet_file_shard_compaction_delete_created_idx`.
- `parquet_file_table_idx`: Used but is somewhat too large because it
contains deleted files. Might become obsolete after #7842, don't touch
for now.
- `parquet_location_unique`: See the note on `parquet_file_pkey`; it's
pointless to have two IDs here. Not touching for now, but this is a
potential future improvement.
So we remove:
- `parquet_file_partition_created_idx`
- `parquet_file_shard_compaction_delete_created_idx`
- `parquet_file_shard_compaction_delete_idx`
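Roughly, the drops could be executed like the sketch below (a sketch only; in practice this would presumably be an sqlx migration). `CONCURRENTLY` keeps `parquet_file` writable while each index is removed:

```rust
use sqlx::PgPool;

/// Drop the three unused indices identified above.
async fn drop_unused_indices(pool: &PgPool) -> Result<(), sqlx::Error> {
    for idx in [
        "parquet_file_partition_created_idx",
        "parquet_file_shard_compaction_delete_created_idx",
        "parquet_file_shard_compaction_delete_idx",
    ] {
        // `DROP INDEX CONCURRENTLY` cannot run inside a transaction block, so the
        // statements are executed one by one directly on the pool (autocommit).
        sqlx::query(&format!("DROP INDEX CONCURRENTLY IF EXISTS {idx};"))
            .execute(pool)
            .await?;
    }
    Ok(())
}
```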
This commit fixes loads of crates (47!) that had unused dependencies or
mis-configured dependencies (test deps declared as normal deps).
I added the `unused_crate_dependencies` lint to all crates to help prevent
this mess from growing again!
https://doc.rust-lang.org/beta/nightly-rustc/rustc_lint_defs/builtin/static.UNUSED_CRATE_DEPENDENCIES.html
This has the minor downside of false positives when specifying
dev-dependencies for test/bench binaries - these are files in /tests or
/benches (not normal unit tests). This commit includes a workaround:
importing them in lib.rs (gated by a feature flag). I think the
trade-off of better dependency management is worth it!
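A minimal sketch of what the workaround looks like in a crate's lib.rs (assumptions: `criterion` stands in for a bench-only dev-dependency, and `test_helpers` is an illustrative feature name, not necessarily the flag the repo uses):

```rust
//! lib.rs of an affected crate.
#![warn(unused_crate_dependencies)]

// Bench/integration-test binaries are compiled as separate crates, so this lint
// cannot see that they use the dev-dependency and flags it as unused. Importing
// it here as a no-op, behind a flag, keeps the lint quiet without affecting
// normal builds.
#[cfg(feature = "test_helpers")]
use criterion as _;
```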
- the table is unused
- there are no foreign keys or triggers based on this table
- the design is generally not scalable (N*M entries); tombstones should
rather have a timestamp so we can check if a parquet file includes that
information or not (or some other form of serialization mechanism)
- it's currently empty in prod (and never was filled w/ data in any
cluster)
* refactor: Change catalog configuration so it is entirely DSN-based / support end-to-end testing without Postgres
Restores code from https://github.com/influxdata/influxdb_iox/pull/7708
Revert "revert: PR #7708"
This reverts commit c9cfe05f8d.
* fix: merge
* fix: Update new test
Still insert them into the database and associate them with namespaces,
but don't ever query them back out.
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* fix(garbage collector): limit catalog update for files to delete
Impose a 1000-row LIMIT on flag_for_delete_by_retention so the garbage
collector's load on the catalog is limited. 1000 is already used as the
fixed limit in another catalog DML statement. (A rough sketch of the
limited query follows below.)
* follow up to requests in #7562
* chore: add test for limit on update
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
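A rough sketch of the limited flagging pass (illustrative only: the column names `to_delete`, `max_time` and `retention_period_ns` are assumptions here, and since Postgres' UPDATE has no LIMIT clause the cap goes through a sub-select):

```rust
use sqlx::PgPool;

/// Flag at most 1000 parquet files that have fallen out of their namespace's
/// retention period. Sketch only; not the exact catalog schema or statement.
async fn flag_for_delete_by_retention(pool: &PgPool, now_ns: i64) -> Result<u64, sqlx::Error> {
    let res = sqlx::query(
        r#"
UPDATE parquet_file
SET to_delete = $1
WHERE id IN (
    SELECT parquet_file.id
    FROM parquet_file
    JOIN namespace ON namespace.id = parquet_file.namespace_id
    WHERE parquet_file.to_delete IS NULL
      AND namespace.retention_period_ns IS NOT NULL
      AND parquet_file.max_time < $1 - namespace.retention_period_ns
    LIMIT 1000
);
"#,
    )
    .bind(now_ns)
    .execute(pool)
    .await?;
    Ok(res.rows_affected())
}
```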
Tests that use the in-memory catalog create different shards, which
then create old-style Parquet file paths, but in production everything
uses the transition shard now. To make the tests more like production,
only ever create and use the transition shard, and stop checking for
different shard IDs.
* test: set max_l0_created_at to reasonable values for the tests and also verify it using both test layout and catalog function
* fix: typo
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
This commit adds initial support for "soft" namespace deletion, where
the actual records & data remain, but are no longer queryable /
writeable.
Soft deletion is eventually consistent - users can expect to continue
writing to and reading from a bucket after issuing a soft delete call,
until the various components either restart, or have their caches
flushed.
The components treat soft-deleted namespaces differently:
* router: ignore soft deleted namespaces
* ingester: accept soft deleted namespaces
* compactor: accept soft deleted namespaces
* querier: ignore soft deleted namespaces
* various gRPC services: ignore soft deleted namespaces
This ensures that the ingester & compactor do not see rows "vanishing"
from the database, and continue to make forward progress.
Writes for the deleted namespace that are buffered in the ingester will
be persisted as normal, allowing us to support "un-delete" operations
where the system is restored to the state at which the delete was
issued (rather than losing the buffered data).
Follow-on work is required to ensure GC drops the orphaned parquet files
after the configured GC time, and optimisations such as not compacting
parquet from soft-deleted namespaces seem like a trivial win.
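One way to express that per-component split at the catalog interface is a row filter along these lines (a sketch; the type name, the exact API, and the `deleted_at` column name are illustrative here):

```rust
/// Filter applied by catalog namespace queries so each component can pick the
/// view it needs (router/querier exclude deleted rows, ingester/compactor don't).
#[derive(Debug, Clone, Copy)]
pub enum SoftDeletedRows {
    /// Return all rows, regardless of the soft-delete marker.
    AllRows,
    /// Return only rows where `deleted_at IS NULL`.
    ExcludeDeleted,
    /// Return only rows where `deleted_at IS NOT NULL` (e.g. for GC follow-up work).
    OnlyDeleted,
}

impl SoftDeletedRows {
    /// SQL predicate to splice into the namespace queries.
    pub fn as_sql_predicate(&self) -> &'static str {
        match self {
            Self::AllRows => "1=1",
            Self::ExcludeDeleted => "deleted_at IS NULL",
            Self::OnlyDeleted => "deleted_at IS NOT NULL",
        }
    }
}
```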
All our catalog tests run as one test, over one database connection.
Prior to this commit, there was no state reset during test execution, so
earlier tests would pollute the state of later tests, making it an
increasingly complex and intermingled set of tests trying to assert
their entities while ignoring other, previously created entities (or
changing the order of test execution in search of the golden ordering
that makes everything great again).
This is a bit of a hack, and is not how I'd have structured catalog
testing w/ clean state if I was writing it fresh. It is what it is.
This has been driving me mad for SO LONG it's SO BAD <shakes fist>.
* feat: `PartitionRepo::list_ids`
* refactor: `CatalogPartitionsSource` => `CatalogToCompactPartitionsSource`
* feat: allow the compactor to process all known partitions
Closes #6648.
* docs: improve
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
---------
Co-authored-by: Andrew Lamb <alamb@influxdata.com>
The maximum number of tables is part of the Namespace, which is already
loaded in its entirety. This commit copies the value into the
NamespaceSchema, making it available for the router to utilise.
We use UTC, but that doesn't mean everyone does. Queries that utilise
NOW() will return incorrect results when the server is using a non-UTC
tz while the application provides UTC timestamps / epochs.
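One way to avoid the mismatch, sketched (assuming the affected queries can take a bound parameter): compute "now" as an explicit UTC epoch in the application and bind it, rather than letting the server evaluate NOW() under whatever timezone it happens to be configured with.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// "Now" as UTC nanoseconds since the epoch, independent of the server timezone.
/// Bind this into the query instead of relying on the server-side NOW().
fn now_nanos_utc() -> i64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the UNIX epoch")
        .as_nanos() as i64
}
```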
* feat: introduce a new way of max_sequence_number for ingester, compactor and querier
* chore: cleanup
* feat: new column max_l0_created_at to order files for deduplication
* chore: cleanup
* chore: debug info for changing cpu.parquet
* fix: update test parquet file
Co-authored-by: Marco Neumann <marco@crepererum.net>
* feat: function to read partition IDs of all partitions with new writes
* chore: run fmt
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* perf: optimize not to update partitions with newly created level 2 files
* chore: cleanup
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: cold
* chore: debug info
* feat: only compact qualified cold partition candidates
* fix: catalog test
* chore: cleanup
* chore: add new config flag for cold partition candidates
* chore: implement display for CompactionType and add tests for max num partitions
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: function to get partition candidates from partition table
* chore: cleanup
* fix: make new_file_at the same value as created_at
* chore: cleanup
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Updating the sort key is not commutative and MUST be serialised. The
correctness of the current catalog interface relies on the caller
serialising updates globally, something it cannot reasonably assert in a
distributed system.
This change of the catalog interface pushes this responsibility to the
catalog itself where it can be effectively enforced, and allows a caller
to detect parallel updates to the sort key.
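The shape such an interface can take, sketched (names and types are illustrative): the caller passes the sort key it observed, and the catalog applies the update only if that is still the stored value, handing the observed value back otherwise so the caller can re-derive its update.

```rust
/// Returned when the sort key changed underneath the caller.
#[derive(Debug)]
pub struct SortKeyConflict {
    /// The sort key currently stored in the catalog.
    pub observed: Vec<String>,
}

pub trait PartitionRepo {
    /// Compare-and-swap the partition sort key: the update is applied only if the
    /// stored sort key still equals `old_sort_key`; otherwise the caller gets the
    /// observed value back and must retry with an update derived from it.
    fn cas_sort_key(
        &mut self,
        partition_id: i64,
        old_sort_key: Option<Vec<String>>,
        new_sort_key: Vec<String>,
    ) -> Result<(), SortKeyConflict>;
}
```

In a Postgres-backed catalog this kind of guard could map onto an `UPDATE ... SET sort_key = $new WHERE id = $id AND sort_key = $old` style statement, but that is an implementation detail beyond this sketch.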
To avoid other tests' state bleeding into this one and this one's state
bleeding into other tests, now that it's testing some queries without
scoping by shard.
* feat: compactor ignores max file count for first file
chore: typo in comment in compactor
* feat: restore special first file in partition compaction logic; add limit
* fix: calculation in compaction max file count
chore: clippy
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: create namespace API call in router
Co-authored-by: Nga Tran <nga-tran@live.com>
* chore: treat retention as ns except in CLI
* fix: overflow in nanosecond calc
* fix: retention test after changing it from hours to ns
* chore: comment clarification in cli; better response type for error in ns API
* fix: correct some rebase mistakes
* chore: merge namespace create & create_with_retention; renamed ns create test helper fn & const
* fix: ns autocreation test was wrong after rebase
* fix: mem catalog has default 1hr retention, accidentally removed in rebase
* chore: remove mem catalogs default 1hr retention; make it settable in sets & router
Co-authored-by: Luke Bond <luke.n.bond@gmail.com>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* feat: reject writes that are outside the retention period (see the sketch after this list)
* feat: add retention validator into handler stack
* chore: Apply suggestions from code review
Co-authored-by: Dom <dom@itsallbroken.com>
* refactor: address review comments
* test: unit tests for retention validation
* chore: address review comments
* test: more unit tests and integration tests
* refactor: make time inside retention period for ephemeral_mode test
* fix: 2 hours
Co-authored-by: Dom <dom@itsallbroken.com>
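The retention check referenced in the first item above, sketched (a simplification: the real validator sits in the router's handler stack and operates on whole write batches, not a single timestamp):

```rust
/// Returns true if `timestamp_ns` (nanoseconds since the epoch) is inside the
/// namespace's retention period, or if the namespace has no retention period.
fn within_retention(timestamp_ns: i64, now_ns: i64, retention_period_ns: Option<i64>) -> bool {
    match retention_period_ns {
        // No retention period configured: accept everything.
        None => true,
        // Otherwise the write must be newer than `now - retention`.
        Some(retention_ns) => timestamp_ns >= now_ns - retention_ns,
    }
}

fn main() {
    let hour_ns: i64 = 3_600 * 1_000_000_000;
    // A write 2h old against a 3h retention period is accepted...
    assert!(within_retention(10 * hour_ns, 12 * hour_ns, Some(3 * hour_ns)));
    // ...while an 11h old write is rejected.
    assert!(!within_retention(hour_ns, 12 * hour_ns, Some(3 * hour_ns)));
}
```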
* feat: flag partition for delete
* fix: compare the right date and time
* chore: Run cargo hakari tasks
* chore: cleanup
* fix: typos
* chore: rust style tidy ups in catalog
Co-authored-by: CircleCI[bot] <circleci@influxdata.com>
Co-authored-by: Luke Bond <luke.n.bond@gmail.com>
* feat: deletion flagging in GC based on retention policy
* chore: typo in comment
* fix: only soft delete parquet files that aren't yet soft deleted
* fix: guard against flakiness in catalog test
* chore: some better tests for parquet file delete flagging
Co-authored-by: Nga Tran <nga-tran@live.com>
* refactor: make namespace folder for all namespace's commands
* feat: WIP for add command to set retention period
* feat: more on updating retention period
* feat: grpc for update namespace retention period
* test: end to end test for namespace retention
* fix: lint proto
* chore: cleanup
* chore: kick CI run again
* fix: command hierarchy
* chore: fix comments
The checks for whether a column already exists with a different type
were relying on ordering of the input matching the ordering of the
columns returned from inserting the columns in Postgres.
Rather than trying to match the new ordering that is required to avoid
Postgres deadlocks, switch from a Vec to a HashMap and look up the
column type from the name.
This also reduces some allocations that weren't really needed.
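Sketched with the types simplified (the `ColumnType` variants and the error shape are illustrative, not the catalog's real types):

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnType {
    Tag,
    F64,
    Time,
}

/// Check the requested columns against what the catalog returned, looking the
/// existing type up by name instead of relying on both lists being ordered the
/// same way.
fn check_column_types(
    wanted: &[(&str, ColumnType)],
    returned: &[(String, ColumnType)],
) -> Result<(), String> {
    // Build the name -> type lookup once; the order of `returned` no longer matters.
    let by_name: HashMap<&str, ColumnType> = returned
        .iter()
        .map(|(name, t)| (name.as_str(), *t))
        .collect();

    for (name, want) in wanted {
        if let Some(got) = by_name.get(name) {
            if got != want {
                return Err(format!(
                    "column {name} already exists as {got:?}, wanted {want:?}"
                ));
            }
        }
    }
    Ok(())
}
```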
* fix: Avoid some allocations by collecting instead of inserting into a vec
* refactor: Encode that adding columns is for one table at a time
* test: Add another test of column limits
* test: Add below/above limit tests for create_or_get_many
* fix: Explicitly DO NOT check column limits when inserting many columns
* feat: Cache the max_columns_per_table on the NamespaceSchema
* feat: Add a function to validate column limits in-memory (sketched below)
* fix: Provide more useful information when over column limits
* fix: Swap types to remove intermediate allocation
* docs: Explain the interactions of the cache and the column limits
* test: Actually set up test that showcases column limit race condition
* fix: Allow writing to existing columns even if table is over column limit
Co-authored-by: Dom <dom@itsallbroken.com>
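A sketch of the in-memory limit check against the cached `NamespaceSchema` value (types simplified, names illustrative): only genuinely new columns count towards the limit, so writes that only touch existing columns keep working even when a table is already over the limit.

```rust
use std::collections::BTreeSet;

/// Validate a write against the per-table column limit using the cached schema.
/// Sketch only: real column metadata and error types are elided.
fn validate_column_limit(
    existing_columns: &BTreeSet<String>,
    write_columns: &BTreeSet<String>,
    max_columns_per_table: usize,
) -> Result<(), String> {
    // Only genuinely new columns count; writes to existing columns are always
    // allowed, even if the table is already at or over the limit.
    let new_columns = write_columns.difference(existing_columns).count();

    if new_columns > 0 && existing_columns.len() + new_columns > max_columns_per_table {
        return Err(format!(
            "write would add {new_columns} new column(s) and exceed the limit of \
             {max_columns_per_table} columns per table"
        ));
    }
    Ok(())
}
```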