docs-v2/scripts/README.md

4.5 KiB

Documentation Build Scripts

html-to-markdown.js

Converts Hugo-generated HTML files to fully-rendered Markdown with evaluated shortcodes, dereferenced shared content, and removed comments.

Purpose

This script generates production-ready Markdown output for LLM consumption and user downloads. The generated Markdown:

  • Has all Hugo shortcodes evaluated to text (e.g., {{% product-name %}} → "InfluxDB 3 Core")
  • Includes dereferenced shared content in the body
  • Removes HTML/Markdown comments
  • Adds product context to frontmatter
  • Mirrors the HTML version but in clean Markdown format

Usage

# Generate all markdown files (run after Hugo build)
yarn build:md

# Generate with verbose logging
yarn build:md:verbose

# Generate for specific path
node scripts/html-to-markdown.js --path influxdb3/core

# Generate limited number for testing
node scripts/html-to-markdown.js --limit 10

# Combine options
node scripts/html-to-markdown.js --path telegraf/v1 --verbose

Options

  • --path <path>: Process specific path within public/ (default: process all)
  • --limit <n>: Limit number of files to process (useful for testing)
  • --verbose: Enable detailed logging of conversion progress

Build Process

  1. Hugo generates HTML (with all shortcodes evaluated):

    npx hugo --quiet
    
  2. Script converts HTML to Markdown:

    yarn build:md
    
  3. Generated files:

    • Location: public/**/index.md (alongside index.html)
    • Git status: Ignored (entire public/ directory is gitignored)
    • Deployment: Generated at build time, like API docs

Features

Product Context Detection

Automatically detects and adds product information to frontmatter:

---
title: Set up InfluxDB 3 Core
description: Install, configure, and set up authorization...
url: /influxdb3/core/get-started/setup/
product: InfluxDB 3 Core
product_version: core
date: 2025-11-13
lastmod: 2025-11-13
---

Supported products:

  • InfluxDB 3 Core, Enterprise, Cloud Dedicated, Cloud Serverless, Clustered
  • InfluxDB v2, v1, Cloud (TSM), Enterprise v1
  • Telegraf, Chronograf, Kapacitor, Flux

Turndown Configuration

Custom Turndown rules for InfluxData documentation:

  • Code blocks: Preserves language identifiers
  • GitHub callouts: Converts to > [!Note] format
  • Tables: GitHub-flavored markdown tables
  • Lists: Preserves nested lists and formatting
  • Links: Keeps relative links intact
  • Images: Preserves alt text and paths

Content Extraction

Extracts only article content (removes navigation, footer, etc.):

  • Target selector: article.article--content
  • Skips files without article content (with warning)

Integration

Local Development:

# After making content changes
npx hugo --quiet && yarn build:md

CircleCI Build Pipeline:

The script runs automatically in the CircleCI build pipeline after Hugo generates HTML:

# .circleci/config.yml
- run:
    name: Hugo Build
    command: yarn hugo --environment production --logLevel info --gc --destination workspace/public
- run:
    name: Generate LLM-friendly Markdown
    command: node scripts/html-to-markdown.js

Build order:

  1. Hugo builds HTML → workspace/public/**/*.html
  2. html-to-markdown.js converts HTML → workspace/public/**/*.md
  3. All files deployed to S3

Production Build (Manual):

npx hugo --quiet
yarn build:md

Watch Mode: For development with auto-regeneration, run Hugo server and regenerate markdown after content changes:

# Terminal 1: Hugo server
npx hugo server

# Terminal 2: After making changes
yarn build:md

Performance

  • Processing speed: ~10-20 files/second
  • Full site: 5,581 HTML files in ~5 minutes
  • Memory usage: Minimal (processes files sequentially)
  • Caching: None (regenerates from HTML each time)

Troubleshooting

No article content found:

⚠️  No article content found in /path/to/file.html
  • File doesn't have article.article--content selector
  • Usually navigation pages or redirects
  • Safe to ignore

Shortcodes still present:

  • Run after Hugo has generated HTML, not before
  • Hugo must complete its build first

Missing product context:

  • Check that URL path matches patterns in PRODUCT_MAP
  • Add new products to the map if needed

See Also