chore: lint markdown docs and use relative code references in docs (#3420)

* chore: lint markdown docs

docs: relative source file references in markdown docs

* chore: review feedback
Raphael Taylor-Davies 2022-01-04 09:50:13 +00:00 committed by GitHub
parent f9174c483b
commit 5b71306423
6 changed files with 192 additions and 143 deletions


@@ -335,6 +335,16 @@ jobs:
# other stuff that may have been added to main since last merge)
MERGE_BASE=$(git merge-base origin/main $CIRCLE_BRANCH) sh -c 'buf breaking --against ".git#ref=$MERGE_BASE"'
# Lint docs
docs-lint:
docker:
- image: python:3-slim-bullseye
steps:
- checkout
- run:
name: Lint docs
command: ./scripts/lint_docs.py ./docs
# Compile a cargo "release" profile binary for branches that end in `/perf`
#
# Uses the latest ci_image (influxdb/rust below) to build a release binary and
@@ -458,6 +468,7 @@ workflows:
- lint
- cargo_audit
- protobuf-lint
- docs-lint
- test
- test_heappy
- test_perf
@@ -473,6 +484,7 @@ workflows:
- lint
- cargo_audit
- protobuf-lint
- docs-lint
- test
- test_heappy
- test_perf


@@ -144,7 +144,7 @@ Schemas during write are only enforced on “best effort” basis by the Mutable
### 2.4 UTF-8 Passthrough
The solution presented here will pass UTF-8 strings (for table and column names) as is. No [unicode normalization](http://www.unicode.org/reports/tr15/) or case-handling will be implemented.
The solution presented here will pass UTF-8 strings (for table and column names) as is. No [unicode normalization](https://www.unicode.org/reports/tr15/) or case-handling will be implemented.
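The practical consequence of UTF-8 passthrough can be illustrated with a short sketch: two strings that render identically but use different Unicode compositions stay distinct, so they would be treated as distinct table or column names. This is an illustrative example, not IOx code.

```python
import unicodedata

# Two renderings of the visually identical string "café": NFC uses the
# precomposed U+00E9, NFD uses 'e' followed by the combining accent U+0301.
nfc = unicodedata.normalize("NFC", "cafe\u0301")
nfd = unicodedata.normalize("NFD", "caf\u00e9")

# With no normalization applied, these are different byte sequences and
# would therefore name two different columns.
print(nfc == nfd)                 # False
print(len(nfc.encode("utf-8")))   # 5 (precomposed form)
print(len(nfd.encode("utf-8")))   # 6 (decomposed form)
```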
### 2.5 Simple Transactions


@@ -76,7 +76,7 @@ Figure 1: Catalog and Data Organization
## Rebuild a Database from its Catalog
Since the `Catalog` is the core of the database, losing the catalog means losing the database, and rebuilding the database means rebuilding its catalog. Thus, to make the catalog rebuildable, its information needs to be saved in durable storage.
Basically, if an IOx server goes down unexpectedly, we will lose the in-memory catalog shown in Figure 1 and all of its in-memory data chunks `O-MUBs`, `F-MUBs`, and `RUBs`. Only data of `OS` chunks are not lost. As pointed out in [IOx Data Organization and LifeCycle](data_organization_lifecycle.md) that chunk data is persisted contiguously with their loading time, `OS` chunks only include data ingested before the one in `MUBs` and `RUBs`. Thus, in principal, rebuilding the catalog includes two major steps: first rebuild the catalog that links to the already persisted `OS` chunks, then rerun ingesting data of `O-MUBs`, `F-MUBs` and `RUBs`. To perform those 2 steps, IOx saves `catalog transactions` that include minimal information of the latest stage of all valid `OS` chunks (e.g, chunks are not deleted), and `checkpoints` of when in the past to re-ingest data. At the beginning of the `Catalog rebuild`, those transactions will be read from the Object Store, and their information is used to rebuild the Catalog and rerun the necessary ingestion. Due to IOx `DataLifeCyclePolicy` that is responsible for when to trigger compacting chunk data, the rebuilt catalog may look different from the one before but as long as all the data in previous `O-MUBs`, `F-MUBs`, and `RUBs` are reloaded/recovered and tracked in the catalog, it does not matter which chunk types the data belong to. Refer to [Catalog Persistence](catalog_persistence.md) for the original design of Catalog Persistence and [CheckPoint](https://github.com/influxdata/influxdb_iox/blob/b39e01f7ba4f5d19f92862c5e87b90a40879a6c9/persistence_windows/src/checkpoint.rs) for the detailed implementation of IOx Checkpoints.
Basically, if an IOx server goes down unexpectedly, we will lose the in-memory catalog shown in Figure 1 and all of its in-memory data chunks `O-MUBs`, `F-MUBs`, and `RUBs`. Only data of `OS` chunks are not lost. As pointed out in [IOx Data Organization and LifeCycle](data_organization_lifecycle.md), chunk data is persisted contiguously with its loading time, so `OS` chunks only include data ingested before the data in `MUBs` and `RUBs`. Thus, in principle, rebuilding the catalog involves two major steps: first rebuild the catalog entries that link to the already persisted `OS` chunks, then rerun ingestion of the data of `O-MUBs`, `F-MUBs` and `RUBs`. To perform those two steps, IOx saves `catalog transactions` that include minimal information on the latest stage of all valid `OS` chunks (e.g., chunks that are not deleted), and `checkpoints` of when in the past to re-ingest data. At the beginning of the `Catalog rebuild`, those transactions are read from the Object Store, and their information is used to rebuild the Catalog and rerun the necessary ingestion. Because the IOx `DataLifeCyclePolicy` is responsible for when to trigger compacting chunk data, the rebuilt catalog may look different from the one before, but as long as all the data in the previous `O-MUBs`, `F-MUBs`, and `RUBs` is reloaded/recovered and tracked in the catalog, it does not matter which chunk types the data belong to. Refer to [Catalog Persistence](catalog_persistence.md) for the original design of Catalog Persistence and [CheckPoint](../persistence_windows/src/checkpoint.rs) for the detailed implementation of IOx Checkpoints.
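The two-step rebuild described above can be sketched as follows. The data layout and names here (`transactions`, `checkpoint_seq`, `write_buffer`) are illustrative only, not the actual IOx structures.

```python
# Hypothetical minimal model of object-store state after a crash.
object_store = {
    "transactions": [{"os_chunks": [{"id": 1, "path": "db/p1/c1.parquet"}]}],
    "checkpoint_seq": 10,
    "write_buffer": [{"seq": 9, "row": "old"}, {"seq": 11, "row": "new"}],
}

def rebuild_catalog(store):
    # Step 1: relink the already-persisted OS chunks recorded in the
    # saved catalog transactions.
    catalog = {c["id"]: c["path"]
               for txn in store["transactions"] for c in txn["os_chunks"]}
    # Step 2: replay writes after the checkpoint to recreate the data that
    # lived in O-MUBs, F-MUBs and RUBs when the server went down.
    replay = [w for w in store["write_buffer"]
              if w["seq"] > store["checkpoint_seq"]]
    return catalog, replay

catalog, replay = rebuild_catalog(object_store)
print(catalog)  # {1: 'db/p1/c1.parquet'}
print(replay)   # [{'seq': 11, 'row': 'new'}]
```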
## Answer queries through the Catalog
The catalog is essential not only for leading IOx to the right chunk data when answering user queries but also for reading the minimal possible data. Besides the catalog structure shown in Figure 1, chunk statistics (e.g. chunk column's min, max, null count, row count) are calculated while building up the catalog and included in corresponding catalog objects. This information enables IOx to answer some queries right after reading its catalog without scanning physical chunk data as in some examples above [^query]. In addition, IOx provides system tables to let users query their catalog data such as number of tables, number of partitions per database or table, number of chunks per database or table or partition, number of each chunk type and so on.
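Statistics-based pruning can be illustrated with a toy example: a query for `time == 150` can skip the first chunk entirely because its min/max range excludes the value. The structures below are hypothetical, not the IOx catalog types.

```python
# Per-chunk statistics as they might be tracked in a catalog (illustrative).
chunks = [
    {"id": 1, "stats": {"time": (0, 100)}},
    {"id": 2, "stats": {"time": (101, 200)}},
]

def prune(chunks, column, value):
    """Keep only chunks whose [min, max] for `column` may contain `value`."""
    return [c for c in chunks
            if c["stats"][column][0] <= value <= c["stats"][column][1]]

print([c["id"] for c in prune(chunks, "time", 150)])  # [2]
```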


@@ -48,7 +48,7 @@ Chunk is considered the smallest block of data in IOx and the central discussion
[^dup]: The detail of `duplication` and `deduplication` during compaction and query are parts of a large topic that deserve another document.
### Chunk Types
A `Chunk` in IOx is an abstract object defined in the code as a [DbChunk](https://github.com/influxdata/influxdb_iox/blob/12c40b0f0f93e94e483015f9104639a1f766d594/server/src/db/chunk.rs#L78). To optimize the Data LifeCycle and Query Performance, IOx implements these types of physical chunks for a DbChunk: O-MUB, F-MUB, RUB, OS.
A `Chunk` in IOx is an abstract object defined in the code as a [DbChunk](../db/src/chunk.rs). To optimize the Data LifeCycle and Query Performance, IOx implements these types of physical chunks for a DbChunk: O-MUB, F-MUB, RUB, OS.
1. O-MUB: **O**pen **MU**table **B**uffer chunk is optimized for writes and the only chunk type that accepts ingesting data. O-MUB is an in-memory chunk but its data is not sorted and not heavily compressed.[^type]
1. F-MUB: **F**rozen **MU**table **B**uffer chunk has the same format as O-MUB (in memory, not sorted, not encoded) but it no longer accepts writes. It is used as a transition chunk while its data is being moved from optimized-for-writes to optimized-for-reads.
@@ -61,7 +61,7 @@ Depending on which stage of the lifecycle a chunk is in, it will be represented
### Stages of a Chunk
Before digging into Data Lifecycle, let us look into the stages of a chunk implemented as [ChunkStage](https://github.com/influxdata/influxdb_iox/blob/76befe94ad14cd121d6fc5c58aa112997d9e211a/server/src/db/catalog/chunk.rs#L130). A chunk goes through three stages demonstrated in Figure 2: `Open`, `Frozen`, and `Persisted`.
Before digging into Data Lifecycle, let us look into the stages of a chunk implemented as [ChunkStage](../db/src/catalog/chunk.rs). A chunk goes through three stages demonstrated in Figure 2: `Open`, `Frozen`, and `Persisted`.
* When data is ingested into IOx, it will be written into an open chunk which is an `O-MUB`.
* When triggered by some manual or automatic action of the lifecycle (described in next section), the open chunk will be frozen, first to `F-MUB` then transitioned to `RUB`.
* When the `RUB` is persisted to an `OS` chunk, its stage will be moved to persisted. Unlike the `Open` and `Frozen` stages that are represented by only one type of chunk at a moment in time, the `Persisted` stage can be represented by two chunk types at a time: `RUB` and `OS` that store the same data for the purpose of query performance. When a query needs to read data of a persisted chunk stage, it will first look for `RUB`, but, if not available, will look for `OS`. `RUB` will be unloaded from the persisted stage if IOx memory runs low, and reloaded if data of that chunk is queried a lot and IOx memory is underused.
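The stage and chunk-type transitions in the bullets above can be modeled as a small state machine. The real implementation is the Rust `ChunkStage` enum; this is only an illustration of the progression.

```python
# (stage, chunk type) -> next (stage, chunk type); illustrative only.
TRANSITIONS = {
    ("Open", "O-MUB"):   ("Frozen", "F-MUB"),   # freeze: stop accepting writes
    ("Frozen", "F-MUB"): ("Frozen", "RUB"),     # rebuild into read-optimized form
    ("Frozen", "RUB"):   ("Persisted", "OS"),   # persist to object store
}

state = ("Open", "O-MUB")
history = [state]
while state in TRANSITIONS:
    state = TRANSITIONS[state]
    history.append(state)

print(history)
# [('Open', 'O-MUB'), ('Frozen', 'F-MUB'), ('Frozen', 'RUB'), ('Persisted', 'OS')]
```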


@@ -35,7 +35,7 @@ $ ./influxdb_iox run database --log-filter debug --log-format logfmt
IOx makes use of Rust's [tracing](https://docs.rs/tracing) ecosystem to output application logs to stdout. It
additionally makes use of [tracing-log](https://docs.rs/tracing-log) to ensure that crates writing events to
the [log](docs.rs/log/) facade are also rendered to stdout.
the [log](https://docs.rs/log/) facade are also rendered to stdout.
### Macros

scripts/lint_docs.py — new executable file (37 lines)

@@ -0,0 +1,37 @@
#!/usr/bin/env python3
import argparse
import re
import sys
from pathlib import Path

parser = argparse.ArgumentParser(description="Lint docs")
parser.add_argument("docs", help="docs directory")
args = parser.parse_args()

# Match markdown links and capture the target; `[^]]` and `[^)]` keep each
# match within a single link, so multiple links on one line are all found.
regex = re.compile(r'\[[^]]+\]\((?P<link>[^)]+)\)')

success = True
for md_path in sorted(Path(args.docs).rglob("*.md")):
    print(f"Checking \"{md_path}\"")
    with open(md_path) as file:
        links = {match.group('link'): idx + 1
                 for (idx, line) in enumerate(file)
                 for match in regex.finditer(line)}

    for link, line_number in links.items():
        if link.startswith('https://') or link.startswith('#'):
            continue
        if link.startswith('http://'):
            print(f"FAIL: Non-SSL URL {link} in {md_path}:{line_number}", file=sys.stderr)
            success = False
            continue
        if not Path(md_path.parent, link).exists():
            print(f"FAIL: Link {link} not found for {md_path}:{line_number}", file=sys.stderr)
            success = False

if not success:
    sys.exit(1)
print("OK")
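As a quick illustration of the kind of link extraction the linter performs, a non-greedy markdown-link pattern (similar in spirit to the one in `lint_docs.py`, though not copied from it) pulls each link target out of a line separately:

```python
import re

# `.+?` is non-greedy, so two links on one line produce two matches rather
# than one match spanning from the first '[' to the last ')'.
link_re = re.compile(r'\[.+?\]\((?P<link>[^)]+)\)')

line = "See [Catalog Persistence](catalog_persistence.md) and [tracing](https://docs.rs/tracing)."
print([m.group('link') for m in link_re.finditer(line)])
# ['catalog_persistence.md', 'https://docs.rs/tracing']
```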