4.5 KiB

Raw Permalink Blame History

Documentation Build Scripts

html-to-markdown.js

Converts Hugo-generated HTML files to fully-rendered Markdown with evaluated shortcodes, dereferenced shared content, and removed comments.

Purpose

This script generates production-ready Markdown output for LLM consumption and user downloads. The generated Markdown:

Has all Hugo shortcodes evaluated to text (e.g., {{% product-name %}} → "InfluxDB 3 Core")
Includes dereferenced shared content in the body
Removes HTML/Markdown comments
Adds product context to frontmatter
Mirrors the HTML version but in clean Markdown format

Usage

# Generate all markdown files (run after Hugo build)
yarn build:md

# Generate with verbose logging
yarn build:md:verbose

# Generate for specific path
node scripts/html-to-markdown.js --path influxdb3/core

# Generate limited number for testing
node scripts/html-to-markdown.js --limit 10

# Combine options
node scripts/html-to-markdown.js --path telegraf/v1 --verbose

Options

--path <path>: Process specific path within public/ (default: process all)
--limit <n>: Limit number of files to process (useful for testing)
--verbose: Enable detailed logging of conversion progress

Build Process

Hugo generates HTML (with all shortcodes evaluated):
```
npx hugo --quiet
```
Script converts HTML to Markdown:
```
yarn build:md
```
Generated files:
- Location: public/**/index.md (alongside index.html)
- Git status: Ignored (entire public/ directory is gitignored)
- Deployment: Generated at build time, like API docs

Features

Product Context Detection

Automatically detects and adds product information to frontmatter:

---
title: Set up InfluxDB 3 Core
description: Install, configure, and set up authorization...
url: /influxdb3/core/get-started/setup/
product: InfluxDB 3 Core
product_version: core
date: 2025-11-13
lastmod: 2025-11-13
---

Supported products:

InfluxDB 3 Core, Enterprise, Cloud Dedicated, Cloud Serverless, Clustered
InfluxDB v2, v1, Cloud (TSM), Enterprise v1
Telegraf, Chronograf, Kapacitor, Flux

Turndown Configuration

Custom Turndown rules for InfluxData documentation:

Code blocks: Preserves language identifiers
GitHub callouts: Converts to > [!Note] format
Tables: GitHub-flavored markdown tables
Lists: Preserves nested lists and formatting
Links: Keeps relative links intact
Images: Preserves alt text and paths

Content Extraction

Extracts only article content (removes navigation, footer, etc.):

Target selector: article.article--content
Skips files without article content (with warning)

Integration

Local Development:

# After making content changes
npx hugo --quiet && yarn build:md

CircleCI Build Pipeline:

The script runs automatically in the CircleCI build pipeline after Hugo generates HTML:

# .circleci/config.yml
- run:
    name: Hugo Build
    command: yarn hugo --environment production --logLevel info --gc --destination workspace/public
- run:
    name: Generate LLM-friendly Markdown
    command: node scripts/html-to-markdown.js

Build order:

Hugo builds HTML → workspace/public/**/*.html
html-to-markdown.js converts HTML → workspace/public/**/*.md
All files deployed to S3

Production Build (Manual):

npx hugo --quiet
yarn build:md

Watch Mode: For development with auto-regeneration, run Hugo server and regenerate markdown after content changes:

# Terminal 1: Hugo server
npx hugo server

# Terminal 2: After making changes
yarn build:md

Performance

Processing speed: ~10-20 files/second
Full site: 5,581 HTML files in ~5 minutes
Memory usage: Minimal (processes files sequentially)
Caching: None (regenerates from HTML each time)

Troubleshooting

No article content found:

⚠️  No article content found in /path/to/file.html

File doesn't have article.article--content selector
Usually navigation pages or redirects
Safe to ignore

Shortcodes still present:

Run after Hugo has generated HTML, not before
Hugo must complete its build first

Missing product context:

Check that URL path matches patterns in PRODUCT_MAP
Add new products to the map if needed

4.5 KiB Raw Permalink Blame History