
Jason Stirnaman c2093c8212 Feature: Generate documentation in LLM-friendly Markdown (#6555 ) * feat(llms): LLM-friendly Markdown, ChatGPT and Claude links. This enables LLM-friendly documentation for entire sections, allowing users to copy complete documentation sections with a single click. Lambda@Edge now generates .md files on-demand with: - Evaluated Hugo shortcodes - Proper YAML frontmatter with product metadata - Clean markdown without UI elements - Section aggregation (parent + children in single file) The llms.txt files are now generated automatically during build from content structure and product metadata in data/products.yml, eliminating the need for hardcoded files and ensuring maintainability. Testing: - Automated markdown generation in test setup via cy.exec() - Implement dynamic content validation that extracts HTML content and verifies it appears in markdown version Documentation: Documents LLM-friendly markdown generation Details: Add gzip decompression for S3 HTML files in Lambda markdown generator HTML files stored in S3 are gzip-compressed but the Lambda was attempting to parse compressed data as UTF-8, causing JSDOM to fail to find article elements. This resulted in 404 errors for .md and .section.md requests. - Add zlib gunzip decompression in s3-utils.js fetchHtmlFromS3() - Detect gzip via ContentEncoding header or magic bytes (0x1f 0x8b) - Add configurable DEBUG constant for verbose logging - Add debug logging for buffer sizes and decompression in both files The decompression adds ~1-5ms per request but is necessary to parse HTML correctly. CloudFront caching minimizes Lambda invocations. Await async markdown conversion functions The convertToMarkdown and convertSectionToMarkdown functions are async but weren't being awaited, causing the Lambda to return a Promise object instead of a string. 
This resulted in CloudFront validation errors: "The body is not a string, is not an object, or exceeds the maximum size" Troubleshooting: - Set DEBUG for troubleshooting in lambda * feat(llms): Add build-time LLM-friendly Markdown generation Implements static Markdown generation during Hugo build. Key Features: - Two-phase generation: HTML→MD (memory-bounded), MD→sections (fast) - Automatic redirect detection via file size check (skips Hugo aliases) - Product detection using compiled TypeScript product-mappings module - Token estimation for LLM context planning (4 chars/token heuristic) - YAML serialization with description sanitization Performance: - ~105 seconds for 5,000 pages + 500 sections - ~300MB peak memory (safe for 2GB CircleCI environment) - 23 files/sec conversion rate with controlled concurrency Configuration Parameters: - MIN_HTML_SIZE_BYTES (default: 1024) - Skip files below threshold - CHARS_PER_TOKEN (default: 4) - Token estimation ratio - Concurrency: 10 workers (CI), 20 workers (local) Output: - Single pages: public//index.md (with frontmatter + content) - Section bundles: public//index.section.md (aggregated child pages) Files Changed: - scripts/build-llm-markdown.js (new) - Main build script - scripts/lib/markdown-converter.cjs (renamed from .js) - Core conversion - scripts/html-to-markdown.js - Updated import path - package.json - Updated exports for .cjs module Related: Replaces Lambda@Edge on-demand generation (5s response time) with build-time static generation for production deployment. feat(deploy): Add staging deployment workflow and update CI Integrates LLM markdown generation into deployment workflows with a complete staging deployment solution. 
CircleCI Updates: - Switch from legacy html-to-markdown.js to optimized build:md - 2x performance improvement (105s vs 200s+ for 5000 pages) - Better memory management (300MB vs variable) - Enables section bundle generation (index.section.md files) Staging Deployment: - New scripts/deploy-staging.sh for local staging deploys - Complete workflow: Hugo build → markdown gen → S3 upload - Environment variable driven configuration - Optional step skipping for faster iteration - CloudFront cache invalidation support NPM Scripts: - Added deploy:staging command for convenience - Wraps deploy-staging.sh script Documentation: - Updated DOCS-DEPLOYING.md with comprehensive guide - Merged staging/production workflows with Lambda@Edge docs - Build-time generation now primary, Lambda@Edge fallback - Troubleshooting section with common issues - Environment variable reference - Performance metrics and optimization tips Benefits: - Manual staging validation before production - Consistent markdown generation across environments - Faster CI builds with optimized script - Better error handling and progress reporting - Section aggregation for improved LLM context Usage: ```bash export STAGING_BUCKET="test2.docs.influxdata.com" export AWS_REGION="us-east-1" export STAGING_CF_DISTRIBUTION_ID="E1XXXXXXXXXX" yarn deploy:staging ``` Related: Completes build-time markdown generation implementation refactor: Remove Lambda@Edge implementation Build-time markdown generation has replaced Lambda@Edge on-demand generation as the primary method. Removed Lambda code and updated documentation to focus on build-time generation and testing. Removed: - deploy/llm-markdown/ directory (Lambda@Edge code) - Lambda@Edge section from DOCS-DEPLOYING.md Added: - Testing and Validation section in DOCS-DEPLOYING.md - Focus on build-time generation workflow * feat: Add Rust HTML-to-Markdown prototype Implements core markdown-converter.cjs functions in Rust for performance comparison. 
Performance results: - Rust: ~257 files/sec (10× faster) - JavaScript: ~25 files/sec average Recommendation: Keep JavaScript for now, implement incremental builds first. Rust migration provides 10× speedup but requires 3-4 weeks integration effort. Files: - Cargo.toml: Rust dependencies (html2md, scraper, serde_yaml, clap) - src/main.rs: Core conversion logic + CLI benchmark tool - benchmark-comparison.js: Side-by-side performance testing - README.md: Comprehensive findings and recommendations * fix(ui): improve dropdown positioning on viewport resize - Ensure dropdown stays within viewport bounds (min 8px padding) - Reposition dropdown on window resize and scroll events - Clean up event listeners when dropdown closes * chore(deps): add remark and unified packages for markdown processing Add remark-parse, remark-frontmatter, remark-gfm, and unified for enhanced markdown processing capabilities. * fix(edge): add return to prevent trailing-slash redirect for valid extensions Without the return statement, the Lambda@Edge function would continue executing after the callback, eventually hitting the trailing-slash redirect logic. This caused .md files to redirect to URLs with trailing slashes, which returned 404 from S3. * fix(md): add built-in product mappings and full URL support - Add URL_PATTERN_MAP and PRODUCT_NAME_MAP constants directly in the CommonJS module (ESM product-mappings.js cannot be require()'d) - Update generateFrontmatter() to accept baseUrl parameter and construct full URLs for the frontmatter url field - Update generateSectionFrontmatter() similarly for section pages - Update all call sites to pass baseUrl parameter This fixes empty product fields and relative URLs in generated markdown frontmatter when served via Lambda@Edge. * feat(md): add environment flag for base URL control Add -e, --env flag to html-to-markdown.js to control the base URL in generated markdown frontmatter. 
This matches Hugo's -e flag behavior and allows generating markdown with staging or production URLs. Also update build-llm-markdown.js with similar environment support. * feat(md): add Rust markdown converter and improve validation - Add Rust-based HTML-to-Markdown converter with NAPI-RS bindings - Update Cypress markdown validation tests - Update deploy-staging.sh with force upload flag * deploy-staging.sh: - Defaults STAGING_URL to https://test2.docs.influxdata.com if not set - Exports it so yarn build:md -e staging can use it - Displays it in the summary * Delete scripts/prototypes/rust-markdown/benchmark-comparison.js * Delete scripts/prototypes directory * fix(llms): Include full URL for section page Markdown and list of child pages * feat(llms): clarify format selector text for AI use case Update button and dropdown text to make the AI/LLM purpose clearer: - Button: "Copy page for AI" / "Copy section for AI" - Sublabel: "Clean Markdown optimized for AI assistants" - Section sublabel: "{N} pages combined as clean Markdown for AI assistants" Cypress tests updated and passing (13/13). --------- Co-authored-by: Scott Anderson <scott@influxdata.com>		2025-12-01 12:32:28 -06:00
lib	Feature: Generate documentation in LLM-friendly Markdown (#6555 )	2025-12-01 12:32:28 -06:00
rust-markdown-converter	Feature: Generate documentation in LLM-friendly Markdown (#6555 )	2025-12-01 12:32:28 -06:00
schemas	Jts agentsmd (#6486 )	2025-10-28 07:20:13 -05:00
templates	chore(docs): Redesign docs CLI tools for creating and editing content, add content/create.md tutorial page for the How to creat… (#6506 )	2025-11-03 10:18:15 -06:00
README.md	Feature: Generate documentation in LLM-friendly Markdown (#6555 )	2025-12-01 12:32:28 -06:00
add-placeholders.js	chore(docs): Redesign docs CLI tools for creating and editing content, add content/create.md tutorial page for the How to creat… (#6506 )	2025-11-03 10:18:15 -06:00
build-llm-markdown.js	Feature: Generate documentation in LLM-friendly Markdown (#6555 )	2025-12-01 12:32:28 -06:00
deploy-staging.sh	Feature: Generate documentation in LLM-friendly Markdown (#6555 )	2025-12-01 12:32:28 -06:00
docs-cli.js	chore(docs): Redesign docs CLI tools for creating and editing content, add content/create.md tutorial page for the How to creat… (#6506 )	2025-11-03 10:18:15 -06:00
docs-create.js	chore(docs): Redesign docs CLI tools for creating and editing content, add content/create.md tutorial page for the How to creat… (#6506 )	2025-11-03 10:18:15 -06:00
docs-edit.js	Jts agentsmd (#6486 )	2025-10-28 07:20:13 -05:00
html-to-markdown.js	Feature: Generate documentation in LLM-friendly Markdown (#6555 )	2025-12-01 12:32:28 -06:00
setup-local-bin.js	chore(docs): Redesign docs CLI tools for creating and editing content, add content/create.md tutorial page for the How to creat… (#6506 )	2025-11-03 10:18:15 -06:00

README.md

Documentation Build Scripts

html-to-markdown.js

Converts Hugo-generated HTML files to fully-rendered Markdown with evaluated shortcodes, dereferenced shared content, and removed comments.

Purpose

This script generates production-ready Markdown output for LLM consumption and user downloads. The generated Markdown:

- Has all Hugo shortcodes evaluated to text (e.g., {{% product-name %}} → "InfluxDB 3 Core")
- Includes dereferenced shared content in the body
- Removes HTML/Markdown comments
- Adds product context to frontmatter
- Mirrors the HTML version but in clean Markdown format

Usage

# Generate all markdown files (run after Hugo build)
yarn build:md

# Generate with verbose logging
yarn build:md:verbose

# Generate for specific path
node scripts/html-to-markdown.js --path influxdb3/core

# Generate limited number for testing
node scripts/html-to-markdown.js --limit 10

# Combine options
node scripts/html-to-markdown.js --path telegraf/v1 --verbose

Options

- --path <path>: Process a specific path within public/ (default: process all)
- --limit <n>: Limit the number of files to process (useful for testing)
- --verbose: Enable detailed logging of conversion progress
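
As an illustration of how the three options above could be read from the command line, here is a minimal sketch. `parseOptions` is a hypothetical helper, not the script's actual parser, and the error messages are assumptions.

```javascript
// Hypothetical sketch: parse --path, --limit, and --verbose from an argv-style array.
// This is NOT the parser used by html-to-markdown.js; it only mirrors the documented options.
function parseOptions(argv) {
  const options = { path: null, limit: null, verbose: false };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === '--verbose') {
      options.verbose = true;
    } else if (arg === '--path' || arg === '--limit') {
      const value = argv[i + 1];
      if (value === undefined || value.startsWith('--')) {
        throw new Error(`Missing value for ${arg}`);
      }
      i++; // Skip the value on the next iteration.
      if (arg === '--path') {
        options.path = value;
      } else {
        const n = Number.parseInt(value, 10);
        if (!Number.isInteger(n) || n <= 0) {
          throw new Error(`--limit expects a positive integer, got "${value}"`);
        }
        options.limit = n;
      }
    } else {
      throw new Error(`Unknown option: ${arg}`);
    }
  }
  return options;
}

// Equivalent to: node scripts/html-to-markdown.js --path telegraf/v1 --verbose
const exampleOptions = parseOptions(['--path', 'telegraf/v1', '--verbose']);
console.log(exampleOptions);
```

In a real script the array would come from `process.argv.slice(2)`.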

Build Process

1. Hugo generates HTML (with all shortcodes evaluated):
```
npx hugo --quiet
```
2. Script converts HTML to Markdown:
```
yarn build:md
```
3. Generated files:
- Location: public/**/index.md (alongside index.html)
- Git status: Ignored (entire public/ directory is gitignored)
- Deployment: Generated at build time, like API docs
Features

Product Context Detection

Automatically detects and adds product information to frontmatter:

---
title: Set up InfluxDB 3 Core
description: Install, configure, and set up authorization...
url: /influxdb3/core/get-started/setup/
product: InfluxDB 3 Core
product_version: core
date: 2025-11-13
lastmod: 2025-11-13
---

Supported products:

- InfluxDB 3 Core, Enterprise, Cloud Dedicated, Cloud Serverless, Clustered
- InfluxDB v2, v1, Cloud (TSM), Enterprise v1
- Telegraf, Chronograf, Kapacitor, Flux
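
One way to picture product detection is a prefix lookup on the page URL, followed by frontmatter assembly. The sketch below is an assumption-laden illustration: `EXAMPLE_PRODUCT_PATTERNS`, `detectProduct`, and `buildFrontmatter` are hypothetical names, and only the `/influxdb3/core/` entry is taken from the frontmatter example above. The other entries are guesses at the URL layout.

```javascript
// Hypothetical pattern table. Only the /influxdb3/core/ row mirrors the
// documented frontmatter example; the other rows are illustrative guesses.
const EXAMPLE_PRODUCT_PATTERNS = [
  { prefix: '/influxdb3/core/', product: 'InfluxDB 3 Core', version: 'core' },
  { prefix: '/influxdb3/enterprise/', product: 'InfluxDB 3 Enterprise', version: 'enterprise' },
  { prefix: '/telegraf/v1/', product: 'Telegraf', version: 'v1' },
];

// Return product fields for a URL path, or null when no pattern matches.
function detectProduct(urlPath) {
  const match = EXAMPLE_PRODUCT_PATTERNS.find((p) => urlPath.startsWith(p.prefix));
  return match ? { product: match.product, product_version: match.version } : null;
}

// Build a YAML frontmatter block. Values are JSON-quoted, which YAML
// accepts as double-quoted strings.
function buildFrontmatter(page) {
  const fields = { title: page.title, url: page.url, ...(detectProduct(page.url) ?? {}) };
  const lines = Object.entries(fields).map(([key, value]) => `${key}: ${JSON.stringify(String(value))}`);
  return ['---', ...lines, '---'].join('\n');
}

console.log(buildFrontmatter({
  title: 'Set up InfluxDB 3 Core',
  url: '/influxdb3/core/get-started/setup/',
}));
```

A page whose URL matches no prefix gets frontmatter without the product fields, which corresponds to the "Missing product context" case in Troubleshooting.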

Turndown Configuration

Custom Turndown rules for InfluxData documentation:

- Code blocks: Preserves language identifiers
- GitHub callouts: Converts to > [!Note] format
- Tables: GitHub-flavored markdown tables
- Lists: Preserves nested lists and formatting
- Links: Keeps relative links intact
- Images: Preserves alt text and paths
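
To make the "Code blocks" rule concrete, here is a sketch of a rule object in the shape that Turndown's `addRule(name, rule)` accepts (a `filter` function and a `replacement` function). It is not the script's actual rule. The rule name and the assumption that the language appears as a `language-*` class on the `<code>` element are hypothetical.

```javascript
// Three backticks, built at runtime so this example can itself live in a fenced block.
const fence = '`'.repeat(3);

// Hypothetical rule: keep the language identifier when converting <pre><code> blocks.
const fencedCodeWithLanguage = {
  // Match <pre> elements whose first child is a <code> element.
  filter(node) {
    return node.nodeName === 'PRE' && !!node.firstChild && node.firstChild.nodeName === 'CODE';
  },
  // Emit a fenced block tagged with the language taken from a "language-xyz" class, if any.
  replacement(content, node) {
    const code = node.firstChild;
    const match = /(?:^|\s)language-(\S+)/.exec(code.className || '');
    const language = match ? match[1] : '';
    const body = code.textContent.replace(/\n$/, '');
    return `\n\n${fence}${language}\n${body}\n${fence}\n\n`;
  },
};

// With the turndown package installed, registration would look like:
//   const service = new TurndownService({ codeBlockStyle: 'fenced' });
//   service.addRule('fencedCodeWithLanguage', fencedCodeWithLanguage);
```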

Content Extraction

Extracts only article content (removes navigation, footer, etc.):

- Target selector: article.article--content
- Skips files without article content (with warning)
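
The extract-or-skip behavior can be sketched as follows. `extractArticleHtml` is a hypothetical function, not the script's own code. It accepts any object with a DOM-style `querySelector` method (for example, a parsed document) and an injectable `warn` callback, which is an assumption made for testability.

```javascript
const ARTICLE_SELECTOR = 'article.article--content';

// Hypothetical sketch: return the article's inner HTML, or null (with a warning)
// when the page has no matching element, such as a redirect or navigation page.
function extractArticleHtml(document, filePath, warn = console.warn) {
  const article = document.querySelector(ARTICLE_SELECTOR);
  if (!article) {
    warn(`⚠️  No article content found in ${filePath}`);
    return null;
  }
  return article.innerHTML;
}
```

A caller would skip conversion whenever the function returns null.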

Integration

Local Development:

# After making content changes
npx hugo --quiet && yarn build:md

CircleCI Build Pipeline:

The script runs automatically in the CircleCI build pipeline after Hugo generates HTML:

# .circleci/config.yml
- run:
    name: Hugo Build
    command: yarn hugo --environment production --logLevel info --gc --destination workspace/public
- run:
    name: Generate LLM-friendly Markdown
    command: node scripts/html-to-markdown.js

Build order:

1. Hugo builds HTML → workspace/public/**/*.html
2. html-to-markdown.js converts HTML → workspace/public/**/*.md
3. All files deployed to S3

Production Build (Manual):

npx hugo --quiet
yarn build:md

Watch Mode: Markdown is not regenerated automatically. During development, run the Hugo server in one terminal and rerun the script manually after content changes:

# Terminal 1: Hugo server
npx hugo server

# Terminal 2: After making changes
yarn build:md

Performance

- Processing speed: ~10-20 files/second
- Full site: 5,581 HTML files in ~5 minutes
- Memory usage: Minimal (processes files sequentially)
- Caching: None (regenerates from HTML each time)
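
As a quick consistency check on these figures, the arithmetic below estimates full-site duration from the quoted rate. `estimateMinutes` is a hypothetical helper. At 10-20 files/second, 5,581 files take roughly 9.3 to 4.7 minutes, so the ~5 minute figure corresponds to the upper end of the quoted rate.

```javascript
// Hypothetical helper: estimate run time in minutes for a given file count and rate.
function estimateMinutes(fileCount, filesPerSecond) {
  return fileCount / filesPerSecond / 60;
}

console.log(estimateMinutes(5581, 10).toFixed(1)); // prints "9.3"
console.log(estimateMinutes(5581, 20).toFixed(1)); // prints "4.7"
```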

Troubleshooting

No article content found:

⚠️  No article content found in /path/to/file.html

- The file has no element matching the article.article--content selector
- This usually happens on navigation pages or redirects
- Safe to ignore

Shortcodes still present:

- Run the script after Hugo has generated HTML, not before
- Hugo must complete its build first

Missing product context:

- Check that the page's URL path matches one of the patterns in PRODUCT_MAP
- Add new products to the map if needed