# IOx Catalog
This crate contains the code for the IOx Catalog. This includes the definitions of namespaces,
their tables, the columns of those tables and their types, the Parquet files in object storage,
and delete tombstones. There's also some configuration information that the overall distributed
system uses for operation.
To run this crate's tests you'll need Postgres installed and running locally. You'll also need to
set the `INFLUXDB_IOX_CATALOG_DSN` environment variable so that sqlx can connect to your local DB.
For example, with the user and password filled in:
```
INFLUXDB_IOX_CATALOG_DSN=postgres://<postgres user>:<postgres password>@localhost/iox_shared
```
You can omit the host part if your Postgres is listening on the default Unix domain socket (useful
on macOS because, by default, the config installed by `brew install postgres` doesn't listen on a
TCP port):
```
INFLUXDB_IOX_CATALOG_DSN=postgres:///iox_shared
```
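
Rather than exporting the variable in every shell, you can also put it in a `.env` file at the
repository root; the integration tests read that file (see the Tests section below):

```
# .env at the repository root
INFLUXDB_IOX_CATALOG_DSN=postgres:///iox_shared
```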
You'll then need to create the database. You can do this via the sqlx command line.
```
cargo install sqlx-cli
DATABASE_URL=<dsn> sqlx database create
cargo run -q -- catalog setup
```
This will set up the database based on the files in `./migrations` in this crate. SQLx also creates
a table to keep track of which migrations have been run.
NOTE: **do not** use `sqlx database setup`, because that will create the migration table in the
wrong schema (namespace). Our `catalog setup` command does that part using the same sqlx migration
module, but with the right namespace configured.
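
To check which migrations have been applied, you can query that table directly. A sketch, assuming
the table lives in the `iox_catalog` schema and has sqlx's usual columns:

```sql
-- List applied migrations, oldest first
SELECT version, description, installed_on, success
FROM iox_catalog._sqlx_migrations
ORDER BY version;
```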
## Migrations
If you need to create and run migrations to add, remove, or change the schema, you'll need the
`sqlx-cli` tool. Install with `cargo install sqlx-cli` if you haven't already, then run `sqlx
migrate --help` to see the commands relevant to migrations.
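
For example, to add a new migration (the name here is just a placeholder) and apply it with the
same setup command used above:

```
sqlx migrate add my_schema_change  # run from this crate's directory; creates a .sql file under ./migrations
cargo run -q -- catalog setup      # applies any migrations that haven't run yet
```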
## Tests
To run the Postgres integration tests, ensure the above setup is complete first.
**CAUTION:** existing data in the database is dropped when tests are run, so you should use a
DIFFERENT database name for your test database than your `INFLUXDB_IOX_CATALOG_DSN` database.
* Set the `TEST_INFLUXDB_IOX_CATALOG_DSN=<testdsn>` env var, using the same DSN format as
  `INFLUXDB_IOX_CATALOG_DSN` above. The integration tests *will* pick up this value if set in your
  `.env` file.
* Set `TEST_INTEGRATION=1`.
* Run `cargo test -p iox_catalog` (a combined invocation is shown below).
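
Put together, a one-shot run might look like this (the database name `iox_shared_test` is only an
example; create the test database first as described above):

```
TEST_INFLUXDB_IOX_CATALOG_DSN=postgres:///iox_shared_test \
TEST_INTEGRATION=1 \
cargo test -p iox_catalog
```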
## Schema namespace
All IOx catalog tables are created in an `iox_catalog` schema. Remember to set the schema search
path when accessing the database with `psql`.
There are several ways to set the default search path, depending on whether you want to do it for
your session, for the database, or for the user; the latter two are shown at the end of this
section.
Setting a default search path for the database or user may interfere with tests (e.g. it may make
some tests pass when they should fail). The safest option is to set the search path on a
per-session basis. As always, there are a few ways to do that:

1. You can type `set search_path to public,iox_catalog;` inside psql.
2. You can add (1) to your `~/.psqlrc`.
3. Or you can pass it as a CLI argument with:
```
psql 'dbname=iox_shared options=-csearch_path=public,iox_catalog'
```
## Failed / Dirty Migrations
Migrations might be marked as dirty in prod if they do not run all the way through. In this case,
you have to clean up manually (using a read-write shell):

1. Revert the effects of the migration (e.g. drop created tables, drop created indices); a sketch
   is shown at the end of this section.
2. Remove the migration from the `_sqlx_migrations` table. E.g. if the version of the migration is
   1337, this is:
```sql
DELETE FROM _sqlx_migrations
WHERE version = 1337;
```
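
For step 1, the exact statements depend on what the failed migration did. A minimal sketch, with
hypothetical names for an index and a table created by migration 1337:

```sql
-- Hypothetical objects created by the failed migration
DROP INDEX IF EXISTS iox_catalog.parquet_file_my_new_idx;
DROP TABLE IF EXISTS iox_catalog.my_new_table;
```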