chore: lint markdown docs and use relative code references in docs (#3420)

* chore: lint markdown docs

docs: relative source file references in markdown docs

* chore: review feedback
Raphael Taylor-Davies 2022-01-04 09:50:13 +00:00 committed by GitHub
parent f9174c483b
commit 5b71306423
6 changed files with 192 additions and 143 deletions


@@ -335,6 +335,16 @@ jobs:
# other stuff that may have been added to main since last merge)
MERGE_BASE=$(git merge-base origin/main $CIRCLE_BRANCH) sh -c 'buf breaking --against ".git#ref=$MERGE_BASE"'
# Lint docs
docs-lint:
docker:
- image: python:3-slim-bullseye
steps:
- checkout
- run:
name: Lint docs
command: ./scripts/lint_docs.py ./docs
# Compile a cargo "release" profile binary for branches that end in `/perf`
#
# Uses the latest ci_image (influxdb/rust below) to build a release binary and
@@ -458,6 +468,7 @@ workflows:
- lint
- cargo_audit
- protobuf-lint
- docs-lint
- test
- test_heappy
- test_perf
@@ -473,6 +484,7 @@ workflows:
- lint
- cargo_audit
- protobuf-lint
- docs-lint
- test
- test_heappy
- test_perf


@@ -144,7 +144,7 @@ Schemas during write are only enforced on “best effort” basis by the Mutable
### 2.4 UTF-8 Passthrough
The solution presented here will pass UTF-8 strings (for table and column names) as is. No [unicode normalization](http://www.unicode.org/reports/tr15/) or case-handling will be implemented.
The solution presented here will pass UTF-8 strings (for table and column names) as is. No [unicode normalization](https://www.unicode.org/reports/tr15/) or case-handling will be implemented.
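The practical consequence of UTF-8 passthrough can be illustrated with a short sketch: two strings that render identically but use different Unicode compositions stay distinct, so they would be treated as distinct table or column names. This is an illustrative example, not IOx code.

```python
import unicodedata

# Two renderings of the visually identical string "café": NFC uses the
# precomposed U+00E9, NFD uses 'e' followed by the combining accent U+0301.
nfc = unicodedata.normalize("NFC", "cafe\u0301")
nfd = unicodedata.normalize("NFD", "caf\u00e9")

# With no normalization applied, these are different byte sequences and
# would therefore name two different columns.
print(nfc == nfd)                 # False
print(len(nfc.encode("utf-8")))   # 5 (precomposed form)
print(len(nfd.encode("utf-8")))   # 6 (decomposed form)
```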
### 2.5 Simple Transactions


@@ -76,7 +76,7 @@ Figure 1: Catalog and Data Organization
## Rebuild a Database from its Catalog
Since the `Catalog` is the core of the database, losing the catalog means losing the database, and rebuilding the database means rebuilding its catalog. Thus, to make the catalog rebuildable, its information needs to be saved in durable storage.
Basically, if an IOx server goes down unexpectedly, we will lose the in-memory catalog shown in Figure 1 and all of its in-memory data chunks `O-MUBs`, `F-MUBs`, and `RUBs`. Only data of `OS` chunks are not lost. As pointed out in [IOx Data Organization and LifeCycle](data_organization_lifecycle.md) that chunk data is persisted contiguously with their loading time, `OS` chunks only include data ingested before the one in `MUBs` and `RUBs`. Thus, in principal, rebuilding the catalog includes two major steps: first rebuild the catalog that links to the already persisted `OS` chunks, then rerun ingesting data of `O-MUBs`, `F-MUBs` and `RUBs`. To perform those 2 steps, IOx saves `catalog transactions` that include minimal information of the latest stage of all valid `OS` chunks (e.g, chunks are not deleted), and `checkpoints` of when in the past to re-ingest data. At the beginning of the `Catalog rebuild`, those transactions will be read from the Object Store, and their information is used to rebuild the Catalog and rerun the necessary ingestion. Due to IOx `DataLifeCyclePolicy` that is responsible for when to trigger compacting chunk data, the rebuilt catalog may look different from the one before but as long as all the data in previous `O-MUBs`, `F-MUBs`, and `RUBs` are reloaded/recovered and tracked in the catalog, it does not matter which chunk types the data belong to. Refer to [Catalog Persistence](catalog_persistence.md) for the original design of Catalog Persistence and [CheckPoint](https://github.com/influxdata/influxdb_iox/blob/b39e01f7ba4f5d19f92862c5e87b90a40879a6c9/persistence_windows/src/checkpoint.rs) for the detailed implementation of IOx Checkpoints.
Basically, if an IOx server goes down unexpectedly, we will lose the in-memory catalog shown in Figure 1 and all of its in-memory data chunks `O-MUBs`, `F-MUBs`, and `RUBs`. Only data of `OS` chunks are not lost. As pointed out in [IOx Data Organization and LifeCycle](data_organization_lifecycle.md), chunk data is persisted contiguously with its loading time, so `OS` chunks only include data ingested before the data in `MUBs` and `RUBs`. Thus, in principle, rebuilding the catalog involves two major steps: first rebuild the catalog entries that link to the already persisted `OS` chunks, then rerun ingestion of the data of `O-MUBs`, `F-MUBs` and `RUBs`. To perform those two steps, IOx saves `catalog transactions` that include minimal information on the latest stage of all valid `OS` chunks (e.g., chunks that are not deleted), and `checkpoints` of when in the past to re-ingest data. At the beginning of the `Catalog rebuild`, those transactions are read from the Object Store, and their information is used to rebuild the Catalog and rerun the necessary ingestion. Because the IOx `DataLifeCyclePolicy` is responsible for when to trigger compacting chunk data, the rebuilt catalog may look different from the one before, but as long as all the data in the previous `O-MUBs`, `F-MUBs`, and `RUBs` is reloaded/recovered and tracked in the catalog, it does not matter which chunk types the data belong to. Refer to [Catalog Persistence](catalog_persistence.md) for the original design of Catalog Persistence and [CheckPoint](../persistence_windows/src/checkpoint.rs) for the detailed implementation of IOx Checkpoints.
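The two-step rebuild described above can be sketched as follows. The data layout and names here (`transactions`, `checkpoint_seq`, `write_buffer`) are illustrative only, not the actual IOx structures.

```python
# Hypothetical minimal model of object-store state after a crash.
object_store = {
    "transactions": [{"os_chunks": [{"id": 1, "path": "db/p1/c1.parquet"}]}],
    "checkpoint_seq": 10,
    "write_buffer": [{"seq": 9, "row": "old"}, {"seq": 11, "row": "new"}],
}

def rebuild_catalog(store):
    # Step 1: relink the already-persisted OS chunks recorded in the
    # saved catalog transactions.
    catalog = {c["id"]: c["path"]
               for txn in store["transactions"] for c in txn["os_chunks"]}
    # Step 2: replay writes after the checkpoint to recreate the data that
    # lived in O-MUBs, F-MUBs and RUBs when the server went down.
    replay = [w for w in store["write_buffer"]
              if w["seq"] > store["checkpoint_seq"]]
    return catalog, replay

catalog, replay = rebuild_catalog(object_store)
print(catalog)  # {1: 'db/p1/c1.parquet'}
print(replay)   # [{'seq': 11, 'row': 'new'}]
```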
## Answer queries through the Catalog
The catalog is essential not only for leading IOx to the right chunk data when answering user queries but also for reading the minimal possible data. Besides the catalog structure shown in Figure 1, chunk statistics (e.g. chunk column's min, max, null count, row count) are calculated while building up the catalog and included in corresponding catalog objects. This information enables IOx to answer some queries right after reading its catalog without scanning physical chunk data as in some examples above [^query]. In addition, IOx provides system tables to let users query their catalog data such as number of tables, number of partitions per database or table, number of chunks per database or table or partition, number of each chunk type and so on.
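Statistics-based pruning can be illustrated with a toy example: a query for `time == 150` can skip the first chunk entirely because its min/max range excludes the value. The structures below are hypothetical, not the IOx catalog types.

```python
# Per-chunk statistics as they might be tracked in a catalog (illustrative).
chunks = [
    {"id": 1, "stats": {"time": (0, 100)}},
    {"id": 2, "stats": {"time": (101, 200)}},
]

def prune(chunks, column, value):
    """Keep only chunks whose [min, max] for `column` may contain `value`."""
    return [c for c in chunks
            if c["stats"][column][0] <= value <= c["stats"][column][1]]

print([c["id"] for c in prune(chunks, "time", 150)])  # [2]
```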


@@ -48,7 +48,7 @@ Chunk is considered the smallest block of data in IOx and the central discussion
[^dup]: The detail of `duplication` and `deduplication` during compaction and query are parts of a large topic that deserve another document.
### Chunk Types
A `Chunk` in IOx is an abstract object defined in the code as a [DbChunk](https://github.com/influxdata/influxdb_iox/blob/12c40b0f0f93e94e483015f9104639a1f766d594/server/src/db/chunk.rs#L78). To optimize the Data LifeCycle and Query Performance, IOx implements these types of physical chunks for a DbChunk: O-MUB, F-MUB, RUB, OS.
A `Chunk` in IOx is an abstract object defined in the code as a [DbChunk](../db/src/chunk.rs). To optimize the Data LifeCycle and Query Performance, IOx implements these types of physical chunks for a DbChunk: O-MUB, F-MUB, RUB, OS.
1. O-MUB: **O**pen **MU**table **B**uffer chunk is optimized for writes and the only chunk type that accepts ingesting data. O-MUB is an in-memory chunk but its data is not sorted and not heavily compressed.[^type]
1. F-MUB: **F**rozen **MU**table **B**uffer chunk has the same format as O-MUB (in memory, not sorted, not encoded) but it no longer accepts writes. It is used as a transition chunk while its data is being moved from optimized-for-writes to optimized-for-reads.
@@ -61,7 +61,7 @@ Depending on which stage of the lifecycle a chunk is in, it will be represented
### Stages of a Chunk
Before digging into Data Lifecycle, let us look into the stages of a chunk implemented as [ChunkStage](https://github.com/influxdata/influxdb_iox/blob/76befe94ad14cd121d6fc5c58aa112997d9e211a/server/src/db/catalog/chunk.rs#L130). A chunk goes through three stages demonstrated in Figure 2: `Open`, `Frozen`, and `Persisted`.
Before digging into Data Lifecycle, let us look into the stages of a chunk implemented as [ChunkStage](../db/src/catalog/chunk.rs). A chunk goes through three stages demonstrated in Figure 2: `Open`, `Frozen`, and `Persisted`.
* When data is ingested into IOx, it will be written into an open chunk which is an `O-MUB`.
* When triggered by some manual or automatic action of the lifecycle (described in next section), the open chunk will be frozen, first to `F-MUB` then transitioned to `RUB`.
* When the `RUB` is persisted to an `OS` chunk, its stage will be moved to persisted. Unlike the `Open` and `Frozen` stages that are represented by only one type of chunk at a moment in time, the `Persisted` stage can be represented by two chunk types at a time: `RUB` and `OS` that store the same data for the purpose of query performance. When a query needs to read data of a persisted chunk stage, it will first look for `RUB`, but, if not available, will look for `OS`. `RUB` will be unloaded from the persisted stage if IOx memory runs low, and reloaded if data of that chunk is queried a lot and IOx memory is underused.
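The stage and chunk-type transitions in the bullets above can be modeled as a small state machine. The real implementation is the Rust `ChunkStage` enum; this is only an illustration of the progression.

```python
# (stage, chunk type) -> next (stage, chunk type); illustrative only.
TRANSITIONS = {
    ("Open", "O-MUB"):   ("Frozen", "F-MUB"),   # freeze: stop accepting writes
    ("Frozen", "F-MUB"): ("Frozen", "RUB"),     # rebuild into read-optimized form
    ("Frozen", "RUB"):   ("Persisted", "OS"),   # persist to object store
}

state = ("Open", "O-MUB")
history = [state]
while state in TRANSITIONS:
    state = TRANSITIONS[state]
    history.append(state)

print(history)
# [('Open', 'O-MUB'), ('Frozen', 'F-MUB'), ('Frozen', 'RUB'), ('Persisted', 'OS')]
```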


@@ -35,7 +35,7 @@ $ ./influxdb_iox run database --log-filter debug --log-format logfmt
IOx makes use of Rust's [tracing](https://docs.rs/tracing) ecosystem to output application logs to stdout. It
additionally makes use of [tracing-log](https://docs.rs/tracing-log) to ensure that crates writing events to
the [log](docs.rs/log/) facade are also rendered to stdout.
the [log](https://docs.rs/log/) facade are also rendered to stdout.
### Macros

scripts/lint_docs.py — new executable file (37 lines)

@@ -0,0 +1,37 @@
#!/usr/bin/env python3
import argparse
import re
import sys
from pathlib import Path

parser = argparse.ArgumentParser(description="Lint docs")
parser.add_argument("docs", help="docs directory")
args = parser.parse_args()

# Match markdown links and capture the target; `[^]]` and `[^)]` keep each
# match within a single link, so multiple links on one line are all found.
regex = re.compile(r'\[[^]]+\]\((?P<link>[^)]+)\)')

success = True
for md_path in sorted(Path(args.docs).rglob("*.md")):
    print(f"Checking \"{md_path}\"")
    with open(md_path) as file:
        links = {match.group('link'): idx + 1
                 for (idx, line) in enumerate(file)
                 for match in regex.finditer(line)}

    for link, line_number in links.items():
        if link.startswith('https://') or link.startswith('#'):
            continue
        if link.startswith('http://'):
            print(f"FAIL: Non-SSL URL {link} in {md_path}:{line_number}", file=sys.stderr)
            success = False
            continue
        if not Path(md_path.parent, link).exists():
            print(f"FAIL: Link {link} not found for {md_path}:{line_number}", file=sys.stderr)
            success = False

if not success:
    sys.exit(1)
print("OK")
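As a quick illustration of the kind of link extraction the linter performs, a non-greedy markdown-link pattern (similar in spirit to the one in `lint_docs.py`, though not copied from it) pulls each link target out of a line separately:

```python
import re

# `.+?` is non-greedy, so two links on one line produce two matches rather
# than one match spanning from the first '[' to the last ')'.
link_re = re.compile(r'\[.+?\]\((?P<link>[^)]+)\)')

line = "See [Catalog Persistence](catalog_persistence.md) and [tracing](https://docs.rs/tracing)."
print([m.group('link') for m in link_re.finditer(line)])
# ['catalog_persistence.md', 'https://docs.rs/tracing']
```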