Documentation Build Scripts
html-to-markdown.js
Converts Hugo-generated HTML files to fully-rendered Markdown with evaluated shortcodes, dereferenced shared content, and removed comments.
Purpose
This script generates production-ready Markdown output for LLM consumption and user downloads. The generated Markdown:
- Has all Hugo shortcodes evaluated to text (e.g., {{% product-name %}} → "InfluxDB 3 Core")
- Includes dereferenced shared content in the body
- Removes HTML/Markdown comments
- Adds product context to frontmatter
- Mirrors the HTML version but in clean Markdown format
Usage
# Generate all markdown files (run after Hugo build)
yarn build:md
# Generate with verbose logging
yarn build:md:verbose
# Generate for specific path
node scripts/html-to-markdown.js --path influxdb3/core
# Generate limited number for testing
node scripts/html-to-markdown.js --limit 10
# Combine options
node scripts/html-to-markdown.js --path telegraf/v1 --verbose
Options
- --path <path>: Process specific path within public/ (default: process all)
- --limit <n>: Limit number of files to process (useful for testing)
- --verbose: Enable detailed logging of conversion progress
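The options above could be parsed along these lines. This is a minimal sketch, not the script's actual parser: `parseOptions` and the field names it returns are hypothetical.

```javascript
// Hypothetical sketch: parse --path, --limit, and --verbose from an argv-style array.
function parseOptions(argv) {
  const options = { path: null, limit: null, verbose: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === '--verbose') {
      options.verbose = true;
    } else if (arg === '--path') {
      // The value follows the flag; null means "process all".
      options.path = argv[++i] ?? null;
    } else if (arg === '--limit') {
      const n = Number.parseInt(argv[++i], 10);
      if (!Number.isInteger(n) || n <= 0) {
        throw new Error('--limit expects a positive integer');
      }
      options.limit = n;
    }
  }
  return options;
}

// In a real script the input would be process.argv.slice(2).
console.log(parseOptions(['--path', 'telegraf/v1', '--verbose']));
```

Rejecting a non-numeric `--limit` early avoids silently processing the whole site when a test run was intended.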
Build Process
1. Hugo generates HTML (with all shortcodes evaluated):
   npx hugo --quiet
2. Script converts HTML to Markdown:
   yarn build:md
3. Generated files:
   - Location: public/**/index.md (alongside index.html)
   - Git status: Ignored (entire public/ directory is gitignored)
   - Deployment: Generated at build time, like API docs
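Because each Markdown file sits alongside its HTML source, the output path can be derived from the input path alone. A minimal sketch, assuming a hypothetical helper named `markdownPathFor` (the script may compute this differently):

```javascript
// Hypothetical helper: derive the Markdown output path for a Hugo HTML file.
function markdownPathFor(htmlPath) {
  if (!htmlPath.endsWith('.html')) {
    throw new Error(`Not an HTML file: ${htmlPath}`);
  }
  // Swap only the trailing extension so the .md file lands next to the .html file.
  return htmlPath.slice(0, -'.html'.length) + '.md';
}

console.log(markdownPathFor('public/influxdb3/core/get-started/setup/index.html'));
```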
Features
Product Context Detection
Automatically detects and adds product information to frontmatter:
---
title: Set up InfluxDB 3 Core
description: Install, configure, and set up authorization...
url: /influxdb3/core/get-started/setup/
product: InfluxDB 3 Core
product_version: core
date: 2025-11-13
lastmod: 2025-11-13
---
Supported products:
- InfluxDB 3 Core, Enterprise, Cloud Dedicated, Cloud Serverless, Clustered
- InfluxDB v2, v1, Cloud (TSM), Enterprise v1
- Telegraf, Chronograf, Kapacitor, Flux
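Product detection can be pictured as a prefix lookup on the page URL. The sketch below is illustrative only: the shape of `PRODUCT_MAP`, the helper names, and every entry other than InfluxDB 3 Core are assumptions, not the script's actual data.

```javascript
// Hypothetical subset of a product map; the real PRODUCT_MAP may be shaped differently.
const PRODUCT_MAP = [
  { prefix: '/influxdb3/core/', product: 'InfluxDB 3 Core', version: 'core' },
  { prefix: '/influxdb3/enterprise/', product: 'InfluxDB 3 Enterprise', version: 'enterprise' },
  { prefix: '/telegraf/v1/', product: 'Telegraf', version: 'v1' },
];

// Return the product fields for a URL, or null when no prefix matches.
function detectProduct(url) {
  const entry = PRODUCT_MAP.find(({ prefix }) => url.startsWith(prefix));
  return entry ? { product: entry.product, product_version: entry.version } : null;
}

// Serialize flat key/value pairs as a frontmatter block.
// Values containing ": " or other YAML syntax would need quoting in real use.
function buildFrontmatter(fields) {
  const lines = Object.entries(fields).map(([key, value]) => `${key}: ${value}`);
  return ['---', ...lines, '---'].join('\n');
}

const exampleUrl = '/influxdb3/core/get-started/setup/';
console.log(
  buildFrontmatter({
    title: 'Set up InfluxDB 3 Core',
    url: exampleUrl,
    ...detectProduct(exampleUrl),
  })
);
```

Returning `null` for unmatched URLs keeps pages without a known product from getting misleading frontmatter; spreading `null` into the object adds no fields.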
Turndown Configuration
Custom Turndown rules for InfluxData documentation:
- Code blocks: Preserves language identifiers
- GitHub callouts: Converts to > [!Note] format
- Tables: GitHub-flavored markdown tables
- Lists: Preserves nested lists and formatting
- Links: Keeps relative links intact
- Images: Preserves alt text and paths
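A custom rule such as the callout conversion might look like the sketch below. Turndown rules are objects with a `filter` and a `replacement` function. The `block` and `warn` class names are assumptions about the rendered HTML, and `calloutRule` is a hypothetical name, so treat this as an illustration of the rule shape only.

```javascript
// Hypothetical rule in the { filter, replacement } shape that Turndown's addRule() accepts.
const calloutRule = {
  // Assumed markup: callouts render as <div class="block note"> or similar.
  filter: (node) =>
    node.nodeName === 'DIV' && /\bblock\b/.test(node.className || ''),
  replacement: (content, node) => {
    const kind = /\bwarn\b/.test(node.className || '') ? 'Warning' : 'Note';
    // Prefix every line so the whole callout stays inside one blockquote.
    const quoted = content
      .trim()
      .split('\n')
      .map((line) => (line ? `> ${line}` : '>'))
      .join('\n');
    return `\n\n> [!${kind}]\n${quoted}\n\n`;
  },
};

// With the turndown package installed, the rule would be registered like this:
//   const TurndownService = require('turndown');
//   const service = new TurndownService({ codeBlockStyle: 'fenced' });
//   service.addRule('callout', calloutRule);

// Stand-in for a DOM node, enough to exercise the rule without a parser.
const sampleCallout = { nodeName: 'DIV', className: 'block note' };
console.log(calloutRule.replacement('Run Hugo first.', sampleCallout));
```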
Content Extraction
Extracts only article content (removes navigation, footer, etc.):
- Target selector: article.article--content
- Skips files without article content (with warning)
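The extraction step can be sketched as a single selector lookup with a warning on a miss. `extractArticleHtml` and its parameters are hypothetical; the `document` argument is anything that exposes `querySelector()`, for example the `window.document` of a JSDOM instance.

```javascript
const ARTICLE_SELECTOR = 'article.article--content';

// Hypothetical helper: return the article's inner HTML, or null (with a warning)
// when the page has no matching element.
function extractArticleHtml(document, filePath, warn = console.warn) {
  const article = document.querySelector(ARTICLE_SELECTOR);
  if (!article) {
    warn(`⚠️ No article content found in ${filePath}`);
    return null;
  }
  return article.innerHTML;
}

// Stub standing in for a parsed page that has no article element.
const emptyPage = { querySelector: () => null };
console.log(extractArticleHtml(emptyPage, '/path/to/file.html'));
```

Passing the warning function as a parameter keeps the helper easy to test and lets a caller silence warnings when verbose logging is off.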
Integration
Local Development:
# After making content changes
npx hugo --quiet && yarn build:md
CircleCI Build Pipeline:
The script runs automatically in the CircleCI build pipeline after Hugo generates HTML:
# .circleci/config.yml
- run:
    name: Hugo Build
    command: yarn hugo --environment production --logLevel info --gc --destination workspace/public
- run:
    name: Generate LLM-friendly Markdown
    command: node scripts/html-to-markdown.js
Build order:
1. Hugo builds HTML → workspace/public/**/*.html
2. html-to-markdown.js converts HTML → workspace/public/**/*.md
3. All files deployed to S3
Production Build (Manual):
npx hugo --quiet
yarn build:md
Watch Mode: During development, run the Hugo server (which rebuilds HTML automatically) in one terminal and rerun the Markdown build in another after content changes:
# Terminal 1: Hugo server
npx hugo server
# Terminal 2: After making changes
yarn build:md
Performance
- Processing speed: ~10-20 files/second
- Full site: 5,581 HTML files in ~5 minutes
- Memory usage: Minimal (processes files sequentially)
- Caching: None (regenerates from HTML each time)
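The figures above can be cross-checked with simple arithmetic. This sketch only restates the published numbers; `estimateSeconds` is a hypothetical helper.

```javascript
// Rough estimate only: time to process a file count at a steady rate.
function estimateSeconds(fileCount, filesPerSecond) {
  return fileCount / filesPerSecond;
}

const fullSiteFiles = 5581;
const fastMinutes = estimateSeconds(fullSiteFiles, 20) / 60;
const slowMinutes = estimateSeconds(fullSiteFiles, 10) / 60;
console.log(`At 20 files/s: ${fastMinutes.toFixed(1)} min`);
console.log(`At 10 files/s: ${slowMinutes.toFixed(1)} min`);
```

A full-site run of about 5 minutes therefore corresponds to the upper end of the quoted range, roughly 18 to 19 files per second.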
Troubleshooting
No article content found:
⚠️ No article content found in /path/to/file.html
- File has no element matching the article.article--content selector
- Usually navigation pages or redirects
- Safe to ignore
Shortcodes still present:
- Run after Hugo has generated HTML, not before
- Hugo must complete its build first
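One way to spot this problem is to scan the generated Markdown for leftover shortcode delimiters. The check below is a hypothetical sketch, not part of the script; pages that intentionally document shortcode syntax in code samples would show up as false positives.

```javascript
// Matches both {{% name %}} and {{< name >}} forms, including closing tags like {{% /name %}}.
const SHORTCODE_PATTERN = /\{\{[%<]\s*\/?\s*[\w-]+[^}]*?[%>]\}\}/g;

// Hypothetical check: list shortcode calls that survived into Markdown output.
function findLeftoverShortcodes(markdown) {
  return markdown.match(SHORTCODE_PATTERN) ?? [];
}

console.log(findLeftoverShortcodes('Install {{% product-name %}} first.'));
```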
Missing product context:
- Check that URL path matches patterns in PRODUCT_MAP
- Add new products to the map if needed
See Also
- Plan document - Architecture decisions
- API docs generation - Similar pattern for API reference
- Package.json scripts - Build commands