# Spell Checking Configuration Guide

This document explains the spell-checking rules and tools used in the InfluxData documentation repository.

## Overview

The docs-v2 repository uses **two complementary spell-checking tools**:

1. **Vale** - Integrated documentation spell checker (runs in pre-commit hooks)
2. **Codespell** - Lightweight code comment spell checker (recommended for CI/CD)

## Tool Comparison

| Feature                     | Vale                        | Codespell                              |
| --------------------------- | --------------------------- | -------------------------------------- |
| **Purpose**                 | Document spell checking     | Code comment spell checking            |
| **Integration**             | Pre-commit hooks (Docker)   | Manual runs; CI/CD pipeline (recommended) |
| **False Positives**         | Low (comprehensive filters) | Low (clear dictionary only)            |
| **Customization**           | YAML rules                  | INI config + dictionary lists          |
| **Performance**             | Moderate                    | Fast                                   |
| **True Positive Detection** | Document-level              | Code-level                             |

## Vale Configuration

### File: `.ci/vale/styles/InfluxDataDocs/Spelling.yml`

#### Why Code Blocks Are Included

Unlike other documentation style checkers, this configuration **intentionally includes code blocks** (`~code` is NOT excluded).
This is critical because:

1. **Comments in examples** - Users copy code blocks with comments:

   ```bash
   # Download and verify the GPG key
   curl https://repos.influxdata.com/influxdata-archive.key
   ```

   Typos in such comments become part of user documentation and scripts.

2. **Documentation strings** - Code examples may include documentation:

   ```python
   def create_database(name):
       """This funtion creates a new database."""  # ← typo caught
       pass
   ```

3. **Inline comments** - Shell script comments are checked:

   ```sh
   #!/bin/bash
   # Retrive configuration from server
   influxctl config get
   ```

   The misspelled `Retrive` in this comment is caught the same way.

### Filter Patterns Explained

#### 1. camelCase and snake\_case Identifiers

```regex
(?:_*[a-z]+(?:[A-Z][a-z0-9]*)+(?:[A-Z][a-zA-Z0-9]*)*|[a-z_][a-z0-9]*_[a-z0-9_]*)
```

**Why**: Prevents false positives on variable and method names while NOT matching normal prose

**Breakdown**:

- **camelCase**: `_*[a-z]+(?:[A-Z][a-z0-9]*)+(?:[A-Z][a-zA-Z0-9]*)*`
  - Requires at least one uppercase letter (distinguishes `myVariable` from `provide`)
  - Allows leading underscores for private variables (`_privateVar`)
- **snake\_case**: `[a-z_][a-z0-9]*_[a-z0-9_]*`
  - Requires at least one underscore
  - Distinguishes `my_variable` from normal words
  - Also covers double-underscore names such as `__dunder__`

**Examples Ignored**: `myVariable`, `targetField`, `getCwd`, `_privateVar`, `my_variable`, `terminationGracePeriodSeconds`

**Examples NOT Ignored** (caught by spell-checker): `provide`, `database`, `variable` (normal prose)

#### 2. UPPER\_CASE Constants

```regex
[A-Z_][A-Z0-9_]+
```

**Why**: Prevents false positives on environment variables and constants

**Examples Ignored**: `API_KEY`, `AWS_REGION`, `INFLUXDB_TOKEN`

**Note**: Also matches plain uppercase acronyms that contain no underscore, such as `AWS` and `API`. This is acceptable in docs.

#### 3. Version Numbers

```regex
\d+\.\d+(?:\.\d+)*
```

**Why**: Version numbers aren't words

**Examples Ignored**: `1.0`, `2.3.1`, `0.101.0`, `1.2.3.4`, and the numeric part of `v1.2.3` (the `v` prefix itself is not part of the pattern)

**Note**: Handles any number of version parts (2-part, 3-part, 4-part, etc.)

#### 4. Hexadecimal Values

```regex
0[xX][0-9a-fA-F]+
```

**Why**: Hex values appear in code and aren't dictionary words

**Examples Ignored**: `0xFF`, `0xDEADBEEF`, `0x1A`

#### 5. URLs and Paths

```regex
/[a-zA-Z0-9/_\-\.\{\}]+        # Paths: /api/v2/write
https?://[^\s\)\]>"]+          # Full URLs: https://docs.example.com
```

**Why**: URLs and paths contain hyphens, slashes, and special characters

**Examples Ignored**: `/api/v2/write`, `/kapacitor/v1/`, `https://docs.influxdata.com`

#### 6. Shortcode Attributes

```regex
(?:endpoint|method|url|href|src|path)="[^"]+"
```

**Why**: Hugo shortcode attribute values often contain hyphens and special characters

**Examples Ignored**: `endpoint="https://..."`, `method="POST"`

**Future Enhancement**: Add more attributes as needed (name, value, data, etc.)

#### 7. Code Punctuation

```regex
[@#$%^&*()_+=\[\]{};:,.<>?/\\|-]+
```

**Why**: Symbols and special characters aren't words

**Examples Ignored**: `()`, `{}`, `[]`, `->`, `=>`, `|`, etc.

### Ignored Words

The configuration references two word lists:

- **`InfluxDataDocs/Terms/ignore.txt`** - Product and technical terms that are not in the English dictionary
- **`InfluxDataDocs/Terms/query-functions.txt`** - InfluxQL/Flux function names

To add a word that should be ignored, edit the appropriate file.

## Codespell Configuration

### File: `.codespellrc`

#### Dictionary Choice: "clear"

**Why "clear" (not "rare" or "code")**:

- `clear` - Unambiguous spelling errors only
  - Examples: "recieve" → "receive", "occured" → "occurred"
  - False positive rate: \~1%
- `rare` - Includes uncommon but valid English words
  - Would flag legitimate technical terms
  - False positive rate: \~15-20%
- `code` - Includes code-specific words
  - Too aggressive for documentation
  - False positive rate: \~25-30%

#### Skip Directories

```ini
skip = public,node_modules,dist,.git,.vale,api-docs
```

- `public` - Generated HTML (not source)
- `node_modules` - npm dependencies (not our code)
- `dist` - Compiled TypeScript output (not source)
- `.git` - Repository metadata
- `.vale` - Vale configuration and cache
- `api-docs` - Generated OpenAPI specifications (many false positives)

#### Ignored Words

```ini
ignore-words-list = aks,invokable
```

- **`aks`** - Azure Kubernetes Service (acronym)
- **`invokable`** - InfluxData product branding term (scriptable tasks/queries)

**To add more**:

1. Edit `.codespellrc`
2. Add the word to `ignore-words-list` (comma-separated)
3. Add a comment on its own line explaining why (a trailing comment on the same line may be read as part of the value)

## Running Spell Checkers

### Vale (Pre-commit)

Vale runs automatically on the files you commit, via Lefthook.

**Manual check**:

```bash
# Check all content
docker compose run -T vale content/**/*.md

# Check specific file
docker compose run -T vale content/influxdb/cloud/reference/cli.md
```

### Codespell (Manual/CI)

```bash
# Check entire content directory
codespell content/ --builtin clear

# Check specific directory
codespell content/influxdb3/core/

# Interactive mode (prompts before writing each fix)
codespell content/ --builtin clear -w -i 3

# Auto-fix (USE WITH CAUTION)
codespell content/ --builtin clear -w
```

## Rule Validation

The spell-checking rules are designed to:

- ✅ Catch real spelling errors (true positives)
- ✅ Ignore code patterns, identifiers, and paths (false positive prevention)
- ✅ Respect product branding terms (invokable, Flux, InfluxQL)
- ✅ Fit into existing workflows (pre-commit hooks, CI/CD)

### Manual Validation

Pipe sample text with various patterns to codespell.
The `-` argument tells codespell to read from standard input; without it, codespell scans the current directory instead.

```bash
# Test camelCase handling
echo "variable myVariable is defined" | codespell -

# Test version numbers
echo "InfluxDB version 2.3.1 is released" | codespell -

# Test real typos (should be caught)
echo "recieve the data" | codespell -
```

## Troubleshooting

### Vale: False Positives

**Problem**: Vale flags a word that should be valid

**Solutions**:

1. Check if it's a code identifier (camelCase, UPPER\_CASE, hex, version)
2. Add it to `InfluxDataDocs/Terms/ignore.txt` if it's a technical term
3. Add a filter pattern to `.ci/vale/styles/InfluxDataDocs/Spelling.yml` if it's a recurring pattern

### Codespell: False Positives

**Problem**: Codespell flags a legitimate term

**Solutions**:

1. Add it to `ignore-words-list` in `.codespellrc`
2. Add a skip directory if an entire directory should be excluded
3. Use `-w -i 3` (interactive mode) to review each suggested fix before accepting it

### Both Tools: Missing Real Errors

**Problem**: A real typo isn't caught

**Solutions**:

1. Verify it's actually a typo (not a branding term or intentional)
2. Check if it's in an excluded scope (tables, URLs, code identifiers)
3. Report it as a GitHub issue for tool improvement

## Contributing

When adding content:

1. **Use semantic line feeds** (one sentence per line)
2. **Run Vale pre-commit** checks before committing
3. **Check code block comments** for typos
4. **Avoid adding to ignore lists** when possible
5. **Document why** you excluded a term (if necessary)

## Related Files

- `.ci/vale/styles/InfluxDataDocs/` - Vale rule configuration
- `.codespellrc` - Codespell configuration
- `.codespellignore` - Codespell ignore word list
- `DOCS-CONTRIBUTING.md` - General contribution guidelines
- `DOCS-TESTING.md` - Testing and validation guide

## Future Improvements

1. Create comprehensive test suite for spell-checking rules
2. Document how to add product-specific branding terms
3. Consider adding codespell to CI/CD pipeline
4. Monitor and update ignore lists quarterly
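
### Example: Filter Pattern Test Sketch

As a possible starting point for the test suite listed above, the following Python sketch checks a subset of the filter patterns from this guide against their documented examples.
It is a sketch under stated assumptions, not part of the repository: the names `FILTERS` and `is_ignored` are hypothetical, and it assumes a pattern must match a whole token (`re.fullmatch`).
Vale's own matching behavior may differ, so treat the results as a sanity check on the regular expressions only.

```python
import re

# Patterns copied from the "Filter Patterns Explained" section of this guide.
# Assumption: a token is ignored only if a pattern matches the WHOLE token.
# Vale's actual matching semantics may differ.
FILTERS = {
    "identifier": r'(?:_*[a-z]+(?:[A-Z][a-z0-9]*)+(?:[A-Z][a-zA-Z0-9]*)*|[a-z_][a-z0-9]*_[a-z0-9_]*)',
    "constant": r'[A-Z_][A-Z0-9_]+',
    "version": r'\d+\.\d+(?:\.\d+)*',
    "hex": r'0[xX][0-9a-fA-F]+',
    "path": r'/[a-zA-Z0-9/_\-\.\{\}]+',
    "url": r'https?://[^\s\)\]>"]+',
}


def is_ignored(token):
    """Return the name of the first filter that fully matches token, else None."""
    for name, pattern in FILTERS.items():
        if re.fullmatch(pattern, token):
            return name
    return None


if __name__ == "__main__":
    samples = [
        "myVariable", "my_variable", "__dunder__", "API_KEY", "2.3.1",
        "0xFF", "/api/v2/write", "https://docs.influxdata.com",
        "provide", "v1.2.3",
    ]
    for token in samples:
        print(f"{token!r}: {is_ignored(token)}")
```

Under the full-match assumption, `provide` and `v1.2.3` both return `None`: normal prose is left for the spell checker, and the version pattern does not cover the `v` prefix.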