docs-v2/scripts
Scott Anderson 7f7387ae9c feat(llms): LLM-friendly Markdown, ChatGPT and Claude links.
This enables LLM-friendly documentation for entire sections,
allowing users to copy complete documentation sections with a single click.

Lambda@Edge now generates .md files on demand with:
- Evaluated Hugo shortcodes
- Proper YAML frontmatter with product metadata
- Clean markdown without UI elements
- Section aggregation (parent + children in single file)

The llms.txt files are now generated automatically during build from
content structure and product metadata in data/products.yml, eliminating
the need for hardcoded files and ensuring maintainability.

**Testing**:
- Automate Markdown generation in test setup via cy.exec()
- Add dynamic content validation that extracts content from the HTML and
  verifies that it appears in the Markdown version

**Documentation**:
Document LLM-friendly Markdown generation

**Details**:
Add gzip decompression for S3 HTML files in Lambda markdown generator

HTML files stored in S3 are gzip-compressed but the Lambda was attempting
to parse compressed data as UTF-8, causing JSDOM to fail to find article
elements. This resulted in 404 errors for .md and .section.md requests.

- Add zlib gunzip decompression in s3-utils.js fetchHtmlFromS3()
- Detect gzip via ContentEncoding header or magic bytes (0x1f 0x8b)
- Add configurable DEBUG constant for verbose logging
- Add debug logging for buffer sizes and decompression in both files

The decompression adds ~1-5ms per request but is necessary to parse
HTML correctly. CloudFront caching minimizes Lambda invocations.
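
The detection logic described above can be sketched in a few lines of Node.js. This is a minimal illustration, not the code in s3-utils.js: the helper names `isGzipped` and `decodeHtml` are hypothetical, and the real fetchHtmlFromS3() may structure this differently.

```javascript
const zlib = require('node:zlib');

// Hypothetical helper: true if the S3 object looks gzip-compressed.
// Checks the ContentEncoding value first, then the gzip magic bytes (0x1f 0x8b).
function isGzipped(body, contentEncoding) {
  if (contentEncoding && contentEncoding.toLowerCase().includes('gzip')) {
    return true;
  }
  return body.length >= 2 && body[0] === 0x1f && body[1] === 0x8b;
}

// Hypothetical helper: return the HTML as a UTF-8 string,
// decompressing first when the body is gzip-compressed.
function decodeHtml(body, contentEncoding) {
  const raw = isGzipped(body, contentEncoding) ? zlib.gunzipSync(body) : body;
  return raw.toString('utf8');
}
```

Checking the magic bytes as well as the header means the sketch still works when an object was uploaded compressed but without ContentEncoding metadata.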

Await async markdown conversion functions

The convertToMarkdown and convertSectionToMarkdown functions are async
but weren't being awaited, causing the Lambda to return a Promise object
instead of a string. This resulted in CloudFront validation errors:
"The body is not a string, is not an object, or exceeds the maximum size"

**Troubleshooting**:

- Set the DEBUG constant to enable verbose logging when troubleshooting the Lambda
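
The missing-await bug described in the commit message can be reproduced in a few lines. The sketch below uses a stand-in `convertToMarkdown` (the real function does far more) and a simplified response object; both are assumptions for illustration only.

```javascript
// Stand-in for the async converter named in the commit message.
// Assumption: the real function is more involved; only its async nature matters here.
async function convertToMarkdown(html) {
  return html.replace(/<[^>]+>/g, '').trim();
}

// Buggy version: the Promise itself ends up in `body`.
function buildResponseWithoutAwait(html) {
  return { status: '200', body: convertToMarkdown(html) };
}

// Fixed version: await the conversion so `body` is a string.
async function buildResponse(html) {
  const body = await convertToMarkdown(html);
  return { status: '200', body };
}
```

In the buggy version `typeof body` is `'object'` (a Promise), which matches the kind of value the quoted CloudFront error rejects.
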
2025-11-21 13:49:36 -06:00
lib feat(llms): LLM-friendly Markdown, ChatGPT and Claude links. 2025-11-21 13:49:36 -06:00
schemas Jts agentsmd (#6486) 2025-10-28 07:20:13 -05:00
templates chore(docs): Redesign docs CLI tools for creating and editing content, add content/create.md tutorial page for the How to creat… (#6506) 2025-11-03 10:18:15 -06:00
README.md feat(llms): LLM-friendly Markdown, ChatGPT and Claude links. 2025-11-21 13:49:36 -06:00
add-placeholders.js chore(docs): Redesign docs CLI tools for creating and editing content, add content/create.md tutorial page for the How to creat… (#6506) 2025-11-03 10:18:15 -06:00
docs-cli.js chore(docs): Redesign docs CLI tools for creating and editing content, add content/create.md tutorial page for the How to creat… (#6506) 2025-11-03 10:18:15 -06:00
docs-create.js chore(docs): Redesign docs CLI tools for creating and editing content, add content/create.md tutorial page for the How to creat… (#6506) 2025-11-03 10:18:15 -06:00
docs-edit.js Jts agentsmd (#6486) 2025-10-28 07:20:13 -05:00
html-to-markdown.js feat(llms): LLM-friendly Markdown, ChatGPT and Claude links. 2025-11-21 13:49:36 -06:00
setup-local-bin.js chore(docs): Redesign docs CLI tools for creating and editing content, add content/create.md tutorial page for the How to creat… (#6506) 2025-11-03 10:18:15 -06:00

README.md

Documentation Build Scripts

html-to-markdown.js

Converts Hugo-generated HTML files to fully rendered Markdown: shortcodes are evaluated, shared content is dereferenced, and comments are removed.

Purpose

This script generates production-ready Markdown output for LLM consumption and user downloads. The generated Markdown:

  • Has all Hugo shortcodes evaluated to text (e.g., {{% product-name %}} → "InfluxDB 3 Core")
  • Includes dereferenced shared content in the body
  • Removes HTML/Markdown comments
  • Adds product context to frontmatter
  • Mirrors the HTML version but in clean Markdown format

Usage

# Generate all markdown files (run after Hugo build)
yarn build:md

# Generate with verbose logging
yarn build:md:verbose

# Generate for specific path
node scripts/html-to-markdown.js --path influxdb3/core

# Generate limited number for testing
node scripts/html-to-markdown.js --limit 10

# Combine options
node scripts/html-to-markdown.js --path telegraf/v1 --verbose

Options

  • --path <path>: Process specific path within public/ (default: process all)
  • --limit <n>: Limit number of files to process (useful for testing)
  • --verbose: Enable detailed logging of conversion progress
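
For readers who want to see how flags like these are typically handled, here is a minimal sketch using Node's built-in `util.parseArgs`. It is an assumption that the script parses its arguments this way; `parseCliOptions` is a hypothetical name, and only the three flag names come from this README.

```javascript
const { parseArgs } = require('node:util');

// Hypothetical parser for the three documented flags.
function parseCliOptions(argv) {
  const { values } = parseArgs({
    args: argv,
    options: {
      path: { type: 'string' },
      limit: { type: 'string' }, // parsed to an integer below
      verbose: { type: 'boolean' },
    },
  });

  const limit =
    values.limit === undefined ? null : Number.parseInt(values.limit, 10);
  if (limit !== null && (!Number.isInteger(limit) || limit < 1)) {
    throw new Error(`--limit expects a positive integer, got "${values.limit}"`);
  }

  return {
    path: values.path ?? null, // null means "process all of public/"
    limit,                     // null means "no limit"
    verbose: Boolean(values.verbose),
  };
}
```

A call such as `parseCliOptions(process.argv.slice(2))` would turn `--path telegraf/v1 --verbose` into an options object with `path` set and `verbose` true.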

Build Process

  1. Hugo generates HTML (with all shortcodes evaluated):

    npx hugo --quiet
    
  2. Script converts HTML to Markdown:

    yarn build:md
    
  3. Generated files:

    • Location: public/**/index.md (alongside index.html)
    • Git status: Ignored (entire public/ directory is gitignored)
    • Deployment: Generated at build time, like API docs
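
The file layout above (each Markdown file written next to its HTML source) can be sketched as follows. The function names are hypothetical and the traversal is an assumption; the actual script may discover files differently.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical: collect .html files under rootDir, stopping at `limit`.
function findHtmlFiles(rootDir, limit = Infinity) {
  const found = [];
  const stack = [rootDir];
  while (stack.length > 0 && found.length < limit) {
    const dir = stack.pop();
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        stack.push(full);
      } else if (entry.name.endsWith('.html') && found.length < limit) {
        found.push(full);
      }
    }
  }
  return found;
}

// Hypothetical: public/foo/index.html -> public/foo/index.md
function markdownPathFor(htmlPath) {
  return htmlPath.replace(/\.html$/, '.md');
}
```

Because the output path is derived from the input path, no separate output directory or manifest is needed in this sketch.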

Features

Product Context Detection

Automatically detects and adds product information to frontmatter:

---
title: Set up InfluxDB 3 Core
description: Install, configure, and set up authorization...
url: /influxdb3/core/get-started/setup/
product: InfluxDB 3 Core
product_version: core
date: 2025-11-13
lastmod: 2025-11-13
---

Supported products:

  • InfluxDB 3 Core, Enterprise, Cloud Dedicated, Cloud Serverless, Clustered
  • InfluxDB v2, v1, Cloud (TSM), Enterprise v1
  • Telegraf, Chronograf, Kapacitor, Flux
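
One way to picture the detection step is a prefix lookup followed by frontmatter assembly. The sketch below is illustrative only: PRODUCT_MAP is named in the Troubleshooting section, but its shape here is an assumption, the Telegraf entry is invented for the example, and `detectProduct` and `buildFrontmatter` are hypothetical names.

```javascript
// Assumed shape; the real PRODUCT_MAP in html-to-markdown.js may differ.
const PRODUCT_MAP = {
  '/influxdb3/core/': { product: 'InfluxDB 3 Core', product_version: 'core' },
  '/telegraf/v1/': { product: 'Telegraf', product_version: 'v1' }, // illustrative entry
};

// Hypothetical: return product metadata for a URL path, or null if no prefix matches.
function detectProduct(urlPath) {
  const prefix = Object.keys(PRODUCT_MAP).find((p) => urlPath.startsWith(p));
  return prefix ? PRODUCT_MAP[prefix] : null;
}

// Hypothetical: emit simple `key: value` frontmatter.
// Values are written unquoted, so values containing ": " or "#" would need quoting.
function buildFrontmatter(fields) {
  const lines = Object.entries(fields)
    .filter(([, value]) => value !== undefined && value !== null)
    .map(([key, value]) => `${key}: ${value}`);
  return ['---', ...lines, '---'].join('\n');
}
```

A path that matches no prefix returns `null`, which corresponds to the "Missing product context" case under Troubleshooting.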

Turndown Configuration

Custom Turndown rules for InfluxData documentation:

  • Code blocks: Preserves language identifiers
  • GitHub callouts: Converts to > [!Note] format
  • Tables: GitHub-flavored markdown tables
  • Lists: Preserves nested lists and formatting
  • Links: Keeps relative links intact
  • Images: Preserves alt text and paths
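
As an illustration of the callout rule's output, the helper below formats a callout body as a GitHub-style blockquote. It is a plain function, not the Turndown rule itself: `formatCallout` is a hypothetical name, and how the script maps HTML callout elements to a type and body is not shown in this README.

```javascript
// Hypothetical formatter: the kind of string a Turndown replacement
// function for callouts could return.
function formatCallout(type, content) {
  // "note" -> "Note", matching the "> [!Note]" format named above
  const label = type.charAt(0).toUpperCase() + type.slice(1).toLowerCase();
  const body = content
    .trim()
    .split('\n')
    .map((line) => (line ? `> ${line}` : '>')) // keep blank lines inside the quote
    .join('\n');
  return `> [!${label}]\n${body}`;
}
```

Prefixing blank lines with a bare `>` keeps multi-paragraph callouts inside a single blockquote.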

Content Extraction

Extracts only article content (removes navigation, footer, etc.):

  • Target selector: article.article--content
  • Skips files without article content (with warning)

Integration

Local Development:

# After making content changes
npx hugo --quiet && yarn build:md

CircleCI Build Pipeline:

The script runs automatically in the CircleCI build pipeline after Hugo generates HTML:

# .circleci/config.yml
- run:
    name: Hugo Build
    command: yarn hugo --environment production --logLevel info --gc --destination workspace/public
- run:
    name: Generate LLM-friendly Markdown
    command: node scripts/html-to-markdown.js

Build order:

  1. Hugo builds HTML → workspace/public/**/*.html
  2. html-to-markdown.js converts HTML → workspace/public/**/*.md
  3. All files deployed to S3

Production Build (Manual):

npx hugo --quiet
yarn build:md

Watch Mode: Markdown is not regenerated automatically. During development, run the Hugo server in one terminal and rerun yarn build:md in another after content changes:

# Terminal 1: Hugo server
npx hugo server

# Terminal 2: After making changes
yarn build:md

Performance

  • Processing speed: ~10-20 files/second
  • Full site: 5,581 HTML files in ~5 minutes (about 19 files/second)
  • Memory usage: Minimal (processes files sequentially)
  • Caching: None (regenerates from HTML each time)

Troubleshooting

No article content found:

⚠️  No article content found in /path/to/file.html
  • The file has no element matching the article.article--content selector
  • Usually navigation pages or redirects
  • Safe to ignore

Shortcodes still present:

  • Run the script only after Hugo has finished its build
  • Shortcodes are evaluated by Hugo when it generates HTML, not by this script

Missing product context:

  • Check that URL path matches patterns in PRODUCT_MAP
  • Add new products to the map if needed

See Also