Compilation Passes

The pipeline consists of 15 passes organized into four phases. Each pass is a self-contained unit with clear inputs, outputs, and dependencies.

Pass Overview

Phase      Pass            Priority  Required  Dependencies
Frontend   ParsePass       10        Yes       (none)
Frontend   BuildPass       20        Yes       parse
Analysis   ValidatePass    22        No        build
Transform  SplitPass       25        No        build
Analysis   EnhancePass     30        No        build
Transform  EnrichPass      40        Yes       build
Backend    ReasoningPass   45        No        enrich
Backend    ConceptPass     47        No        reasoning_index
Backend    NavigationPass  50        No        enrich
Backend    RoutePass       52        No        navigation_index
Backend    ChainPass       54        No        enrich
Backend    OverlapPass     56        No        build
Backend    ScorePass       58        No        enrich
Backend    VerifyPass      55        Yes       concept_extraction
Backend    OptimizePass    60        No        enrich, navigation_index
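
The Priority and Dependencies columns can be read as a scheduling rule. The sketch below is a minimal illustration under two assumptions not stated in the table: lower priority values run first, and a pass may only run after every pass it depends on. PassInfo, schedule, and demo_schedule are hypothetical names, not the pipeline's actual API.

```rust
// Hypothetical descriptor mirroring three columns of the pass table.
#[derive(Debug, Clone)]
struct PassInfo {
    name: &'static str,
    priority: u32,
    depends_on: Vec<&'static str>,
}

/// Orders passes by ascending priority (assumption: lower runs first),
/// then checks that each dependency appears earlier in that order.
/// Returns the pass names in execution order, or an error message.
fn schedule(mut passes: Vec<PassInfo>) -> Result<Vec<&'static str>, String> {
    passes.sort_by_key(|p| p.priority);
    let mut seen: Vec<&'static str> = Vec::new();
    for p in &passes {
        for dep in &p.depends_on {
            if !seen.contains(dep) {
                return Err(format!("{} runs before its dependency {}", p.name, dep));
            }
        }
        seen.push(p.name);
    }
    Ok(seen)
}

/// Registers three passes out of order and lets the priorities sort them.
fn demo_schedule() -> Result<Vec<&'static str>, String> {
    schedule(vec![
        PassInfo { name: "enrich", priority: 40, depends_on: vec!["build"] },
        PassInfo { name: "parse", priority: 10, depends_on: vec![] },
        PassInfo { name: "build", priority: 20, depends_on: vec!["parse"] },
    ])
}
```

Under these assumptions every dependency listed in the table has a lower priority than the pass that needs it, so a plain sort by priority already yields a valid order.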

Frontend Phase

Frontend passes transform raw document bytes into a structured tree — analogous to lexing and parsing in a traditional compiler.

ParsePass (Priority 10)

Parses the source document into a flat list of RawNode values.

  • Input: CompilerInput (file path, content string, or bytes)
  • Output: ctx.raw_nodes: Vec<RawNode>, ctx.format, ctx.page_count
  • No dependencies
  • Supports Markdown and PDF formats
  • PDF parsing can optionally use an LLM client for better structure extraction
  • Each RawNode contains: title, content, hierarchy level, line range, page number, token count
Source bytes → ParsePass → [RawNode, RawNode, RawNode, ...]

BuildPass (Priority 20)

Constructs a hierarchical DocumentTree from the flat raw nodes.

  • Input: ctx.raw_nodes
  • Output: ctx.tree: DocumentTree
  • Depends on: parse
  • Applies thinning (merges nodes below token threshold into parents)
  • Calculates recursive total token counts
  • Assigns unique node IDs if generate_ids is enabled
[RawNode, ...] → BuildPass → DocumentTree (arena-based hierarchical structure)
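
As a rough sketch of the flat-to-tree step, the code below rebuilds parent/child links from hierarchy levels with a stack and then accumulates recursive token totals in an arena. The RawNode and TreeNode shapes here are simplified stand-ins (assumptions), and thinning and ID generation are left out.

```rust
// Simplified stand-in for the parser output (assumption).
struct RawNode {
    title: String,
    level: usize, // 1 = top-level heading
    tokens: usize,
}

// Simplified arena node (assumption); indices into the Vec act as node IDs.
struct TreeNode {
    title: String,
    tokens: usize,       // tokens in this node's own content
    total_tokens: usize, // own tokens plus all descendants
    parent: Option<usize>,
    children: Vec<usize>,
}

/// Builds an arena whose index 0 is a synthetic root at level 0.
fn build_tree(raw: &[RawNode]) -> Vec<TreeNode> {
    let mut arena = vec![TreeNode {
        title: "root".to_string(),
        tokens: 0,
        total_tokens: 0,
        parent: None,
        children: Vec::new(),
    }];
    // Stack of (level, arena index) for the currently open ancestors.
    let mut stack: Vec<(usize, usize)> = vec![(0, 0)];
    for node in raw {
        // Close every open node that is not a strict ancestor of `node`.
        while stack.len() > 1 && stack.last().unwrap().0 >= node.level {
            stack.pop();
        }
        let parent = stack.last().unwrap().1;
        let id = arena.len();
        arena.push(TreeNode {
            title: node.title.clone(),
            tokens: node.tokens,
            total_tokens: 0,
            parent: Some(parent),
            children: Vec::new(),
        });
        arena[parent].children.push(id);
        stack.push((node.level, id));
    }
    // Children always have larger indices than their parent, so a reverse
    // sweep visits every child before its parent.
    for id in (0..arena.len()).rev() {
        let own = arena[id].tokens;
        arena[id].total_tokens += own;
        if let Some(p) = arena[id].parent {
            let subtotal = arena[id].total_tokens;
            arena[p].total_tokens += subtotal;
        }
    }
    arena
}
```

Storing nodes in a Vec and linking them by index avoids reference cycles between parents and children, which is the usual motivation for an arena layout.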

Analysis Phase

Analysis passes validate and enrich the tree's semantic content.

ValidatePass (Priority 22)

Checks tree integrity and reports warnings.

  • Reads: ctx.tree
  • Checks:
    • Maximum tree depth (20 levels)
    • Empty titles on leaf nodes
    • Token count consistency (parent ≥ sum of children)
    • Content duplication detection
  • Optional — failures are skipped, not fatal
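
Three of the checks above can be sketched as a recursive walk that collects warnings instead of failing. The Node type and the validate function are hypothetical, the walk assumes the caller passes the root's depth, and content duplication detection is omitted.

```rust
// Hypothetical owned tree node used only for this illustration.
struct Node {
    title: String,
    total_tokens: usize,
    children: Vec<Node>,
}

const MAX_DEPTH: usize = 20;

/// Appends one warning per violated check; never aborts the walk.
fn validate(node: &Node, depth: usize, warnings: &mut Vec<String>) {
    if depth > MAX_DEPTH {
        warnings.push(format!("'{}' exceeds max depth {}", node.title, MAX_DEPTH));
    }
    if node.children.is_empty() && node.title.trim().is_empty() {
        warnings.push("leaf node with empty title".to_string());
    }
    // A parent's total must cover at least the sum of its children's totals.
    let child_sum: usize = node.children.iter().map(|c| c.total_tokens).sum();
    if node.total_tokens < child_sum {
        warnings.push(format!(
            "'{}' has {} tokens but its children sum to {}",
            node.title, node.total_tokens, child_sum
        ));
    }
    for child in &node.children {
        validate(child, depth + 1, warnings);
    }
}
```

Collecting warnings in a Vec matches the stated policy that validation problems are reported rather than treated as fatal.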

EnhancePass (Priority 30)

Generates LLM summaries for tree nodes. Only runs when an LLM client is available.

  • Reads: ctx.tree
  • Writes: ctx.tree (summaries added to nodes)
  • Depends on: build
  • Failure policy: Retry with backoff
  • Non-leaf nodes get structured summaries: OVERVIEW, QUESTIONS, TAGS
  • Leaf nodes get content summaries
  • Short nodes below the shortcut threshold use original content as summary (saves LLM cost)
  • Uses memoization cache to avoid regenerating summaries for unchanged content
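
The shortcut threshold and the memoization cache can be combined in one lookup path. The following is a minimal sketch, assuming the cache is keyed by a hash of the node content and that token counts can be approximated by whitespace-separated words; SummaryCache and its fields are hypothetical, and the LLM call is represented by a closure.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hypothetical cache keyed by a hash of the node content.
struct SummaryCache {
    entries: HashMap<u64, String>,
    llm_calls: usize, // counts how often the (expensive) closure ran
}

impl SummaryCache {
    fn new() -> Self {
        SummaryCache { entries: HashMap::new(), llm_calls: 0 }
    }

    /// Returns the content itself for short nodes, a cached summary for
    /// content seen before, and otherwise calls `llm` and caches the result.
    fn summarize(
        &mut self,
        content: &str,
        shortcut_threshold: usize,
        llm: impl Fn(&str) -> String,
    ) -> String {
        // Assumption: whitespace-separated words approximate tokens.
        if content.split_whitespace().count() < shortcut_threshold {
            return content.to_string();
        }
        let mut hasher = DefaultHasher::new();
        content.hash(&mut hasher);
        let key = hasher.finish();
        if let Some(hit) = self.entries.get(&key) {
            return hit.clone();
        }
        self.llm_calls += 1;
        let summary = llm(content);
        self.entries.insert(key, summary.clone());
        summary
    }
}
```

Keying on the content hash means an unchanged node maps to the same entry on a later run, while an edited node misses the cache and is summarized again.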

Transform Phase

Transform passes restructure the tree at the IR level.

SplitPass (Priority 25)

Splits oversized leaf nodes into smaller children.

  • Reads/writes: ctx.tree
  • Depends on: build
  • Optional — controlled by SplitConfig
  • Default max tokens per node: 4000
  • Uses natural split points (headings, paragraphs)
  • Pattern-based splitting enabled by default
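
To illustrate splitting at natural boundaries, the sketch below greedily packs paragraphs into chunks that stay within a token limit. It is an illustration only: it treats blank lines as the sole split point, approximates tokens by whitespace-separated words, and uses hypothetical function names rather than the real SplitConfig logic.

```rust
/// Assumption: whitespace-separated words approximate tokens.
fn count_tokens(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Packs consecutive paragraphs into chunks of at most `max_tokens` tokens.
/// A single paragraph longer than the limit is kept whole in its own chunk.
fn split_content(content: &str, max_tokens: usize) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    let mut current_tokens = 0usize;
    let paragraphs = content
        .split("\n\n")
        .map(str::trim)
        .filter(|p| !p.is_empty());
    for para in paragraphs {
        let n = count_tokens(para);
        // Start a new chunk when adding this paragraph would exceed the limit.
        if !current.is_empty() && current_tokens + n > max_tokens {
            chunks.push(std::mem::take(&mut current));
            current_tokens = 0;
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(para);
        current_tokens += n;
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

With the documented default, a call such as split_content(text, 4000) would return the children that replace an oversized leaf.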

EnrichPass (Priority 40)

Adds metadata and resolves cross-references.

  • Reads/writes: ctx.tree, ctx.description
  • Depends on: build
  • Required
  • Calculates page ranges (propagated from children up)
  • Generates Table of Contents view
  • Extracts and resolves in-document references ("see Section 2.1" → NodeId)
  • Generates document description from root summary
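
Reference resolution can be pictured as two steps: find section numbers in the text, then look them up in a map from section number to node. The sketch below assumes references always use the literal word "Section" followed by a dotted number, and that NodeId is a plain index; both are simplifications, and the function names are hypothetical.

```rust
use std::collections::HashMap;

// Assumption: a node ID is a plain index.
type NodeId = usize;

/// Collects the dotted numbers that follow the word "Section", e.g. "2.1".
fn extract_section_refs(text: &str) -> Vec<String> {
    let marker = "Section ";
    let mut refs = Vec::new();
    let mut rest = text;
    while let Some(pos) = rest.find(marker) {
        rest = &rest[pos + marker.len()..];
        let number: String = rest
            .chars()
            .take_while(|c| c.is_ascii_digit() || *c == '.')
            .collect();
        // Drop a sentence-ending period such as in "Section 3."
        let number = number.trim_end_matches('.').to_string();
        if !number.is_empty() {
            refs.push(number);
        }
    }
    refs
}

/// Pairs each reference with the node it points to, or None if unresolved.
fn resolve_refs(
    text: &str,
    sections: &HashMap<String, NodeId>,
) -> Vec<(String, Option<NodeId>)> {
    extract_section_refs(text)
        .into_iter()
        .map(|r| {
            let id = sections.get(&r).copied();
            (r, id)
        })
        .collect()
}
```

Keeping unresolved references as None, rather than dropping them, leaves room for a later pass or a warning to report dangling links.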

Backend Phase

Backend passes generate the final indexes — analogous to code generation in a traditional compiler.

ReasoningPass (Priority 45)

Builds the symbol table: keyword → node path mappings.

  • Reads: ctx.tree
  • Writes: ctx.reasoning_index
  • Depends on: enrich
  • Optional
  • Extracts keywords with weight normalization: title (2.0×), summary (1.5×), content (1.0×)
  • Builds section map for fast ToC lookup (depth-1 nodes)
  • Creates summary shortcut for overview queries
  • Optional LLM synonym expansion
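
The source weights can be applied as follows. This sketch takes one plausible reading of "weight normalization": each occurrence adds the weight of the field it appears in, and the totals are then scaled so the strongest keyword is 1.0. The tokenizer, the minimum word length, and the function name are assumptions.

```rust
use std::collections::HashMap;

/// Accumulates per-keyword weights from title (2.0), summary (1.5), and
/// content (1.0), then scales them so the largest weight equals 1.0.
fn keyword_weights(title: &str, summary: &str, content: &str) -> HashMap<String, f64> {
    let mut weights: HashMap<String, f64> = HashMap::new();
    let sources = [(title, 2.0), (summary, 1.5), (content, 1.0)];
    for &(text, weight) in sources.iter() {
        let words = text
            .split(|c: char| !c.is_alphanumeric())
            .filter(|word| word.len() > 2); // assumption: skip very short words
        for word in words {
            *weights.entry(word.to_lowercase()).or_insert(0.0) += weight;
        }
    }
    let max = weights.values().cloned().fold(0.0_f64, f64::max);
    if max > 0.0 {
        for value in weights.values_mut() {
            *value /= max;
        }
    }
    weights
}
```

A keyword that appears in the title, the summary, and the content therefore outranks one that appears only in the body text.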

ConceptPass (Priority 47)

Extracts key concepts from topics and summaries.

  • Reads: ctx.tree, ctx.reasoning_index
  • Writes: ctx.concepts
  • Depends on: reasoning_index
  • Optional
  • Uses LLM for structured concept extraction (max 15 concepts)
  • Falls back to keyword-based extraction without LLM

NavigationPass (Priority 50)

Builds the runtime navigation index for agent-based traversal.

  • Reads: ctx.tree
  • Writes: ctx.navigation_index
  • Depends on: enrich
  • Optional
  • Creates NavEntry for each non-leaf node (overview, hints, tags, leaf count)
  • Creates ChildRoute entries for children (title, description, leaf count)
  • Builds DocCard for document-level overview

RoutePass (Priority 52)

Builds the pre-computed query routing table for Agent acceleration.

  • Reads: ctx.tree
  • Writes: ctx.query_routes
  • Depends on: navigation_index
  • Optional
  • Phase 1: Builds intent routes from nodes with question_hints
  • Phase 2: Builds concept routes from routing_keywords tags
  • No LLM calls — uses existing tree metadata

ChainPass (Priority 54)

Builds reasoning chain index from in-document cross-references.

  • Reads: ctx.tree
  • Writes: ctx.chain_index
  • Depends on: enrich
  • Optional
  • Analyzes TreeNode.references to find premise→conclusion relationships
  • Classifies chains by type: Causal, Supporting, Contradicting, Elaboration, Sequence
  • Provides bidirectional node lookup (node → chains involving that node)
  • No LLM calls — uses reference types and tree structure

OverlapPass (Priority 56)

Detects content overlap between leaf nodes using Jaccard similarity.

  • Reads: ctx.tree
  • Writes: ctx.content_overlap
  • Depends on: build
  • Optional
  • Pairwise Jaccard similarity on leaf node content (word-level)
  • Classifies overlaps: Duplicate (≥0.9), Subset, Summary
  • Skips nodes with content shorter than 50 characters
  • No LLM calls — pure statistical comparison
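
Word-level Jaccard similarity is the size of the intersection of two word sets divided by the size of their union. A minimal sketch follows, using the 50-character minimum and the 0.9 duplicate threshold from the list above; lowercasing and whitespace tokenization are assumptions, and the Subset and Summary classes are not modeled because their criteria are not given here.

```rust
use std::collections::HashSet;

const MIN_CHARS: usize = 50;
const DUPLICATE_THRESHOLD: f64 = 0.9;

/// Assumption: words are whitespace-separated and compared case-insensitively.
fn word_set(text: &str) -> HashSet<String> {
    text.split_whitespace().map(|w| w.to_lowercase()).collect()
}

/// |A ∩ B| / |A ∪ B| over the two word sets; 0.0 when both are empty.
fn jaccard(a: &str, b: &str) -> f64 {
    let (set_a, set_b) = (word_set(a), word_set(b));
    let union = set_a.union(&set_b).count();
    if union == 0 {
        return 0.0;
    }
    set_a.intersection(&set_b).count() as f64 / union as f64
}

/// Returns None when either text is too short to be compared.
fn overlap(a: &str, b: &str) -> Option<f64> {
    if a.chars().count() < MIN_CHARS || b.chars().count() < MIN_CHARS {
        return None;
    }
    Some(jaccard(a, b))
}

fn is_duplicate(a: &str, b: &str) -> bool {
    overlap(a, b).map_or(false, |score| score >= DUPLICATE_THRESHOLD)
}
```

Because the comparison is pairwise, the cost grows quadratically with the number of leaf nodes, which is one reason to skip very short nodes early.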

ScorePass (Priority 58)

Computes per-node evidence quality scores.

  • Reads: ctx.tree
  • Writes: ctx.evidence_scores
  • Depends on: enrich
  • Optional
  • Three metrics per leaf node:
    • Density: unique tokens / total tokens (information density)
    • Data richness: presence of numbers, tables, code, lists
    • Specificity: ratio of domain terms to filler words
  • Composite score: density×0.4 + richness×0.3 + specificity×0.3
  • No LLM calls — pure content analysis
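
The density metric and the composite formula translate directly into code. In this sketch the richness and specificity values are taken as inputs in the range 0 to 1, since how they are measured is not detailed here; tokenization by whitespace is an assumption.

```rust
use std::collections::HashSet;

/// Unique tokens divided by total tokens; 0.0 for empty content.
fn density(content: &str) -> f64 {
    let tokens: Vec<String> = content
        .split_whitespace()
        .map(|t| t.to_lowercase())
        .collect();
    if tokens.is_empty() {
        return 0.0;
    }
    let unique: HashSet<&String> = tokens.iter().collect();
    unique.len() as f64 / tokens.len() as f64
}

/// Composite score: density×0.4 + richness×0.3 + specificity×0.3.
fn composite(density: f64, richness: f64, specificity: f64) -> f64 {
    density * 0.4 + richness * 0.3 + specificity * 0.3
}
```

For example, a leaf with density 0.5, richness 1.0, and specificity 0.0 scores 0.5×0.4 + 1.0×0.3 + 0.0×0.3 = 0.5.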

VerifyPass (Priority 55)

Validates the final output.

  • Reads: ctx.tree
  • Depends on: concept_extraction
  • Required
  • Checks that tree exists and has nodes
  • Verifies document summary is non-empty
  • Warns if no concepts were extracted

OptimizePass (Priority 60)

Performs final tree structure optimization.

  • Reads/writes: ctx.tree
  • Depends on: enrich, navigation_index
  • Optional
  • Merges adjacent small leaf nodes that are siblings
  • Removes empty intermediate nodes
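
Merging adjacent small siblings can be sketched as a single left-to-right sweep over one parent's leaf children. The Leaf type, the small_tokens cutoff, and the choice to keep the first leaf's title are assumptions made for illustration.

```rust
// Hypothetical leaf representation for one parent's children.
#[derive(Debug, Clone, PartialEq)]
struct Leaf {
    title: String,
    content: String,
    tokens: usize,
}

/// Merges a leaf into its left neighbour when both are below `small_tokens`.
/// The merged leaf keeps the left neighbour's title (assumption).
fn merge_small_siblings(siblings: Vec<Leaf>, small_tokens: usize) -> Vec<Leaf> {
    let mut merged: Vec<Leaf> = Vec::new();
    for leaf in siblings {
        let can_merge = merged
            .last()
            .map_or(false, |prev| prev.tokens < small_tokens && leaf.tokens < small_tokens);
        if can_merge {
            let prev = merged.last_mut().unwrap();
            prev.content.push_str("\n\n");
            prev.content.push_str(&leaf.content);
            prev.tokens += leaf.tokens;
        } else {
            merged.push(leaf);
        }
    }
    merged
}
```

Only neighbours are considered, so document order is preserved and a large leaf between two small ones keeps them apart.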

Data Flow

The following diagram shows how data flows through the passes and which CompileContext fields each pass reads and writes:

                  ┌───────────────┐
                  │ CompilerInput │
                  └───────┬───────┘
                          │
               ┌──────────▼──────────┐
               │      ParsePass      │  writes: raw_nodes, format, page_count
               └──────────┬──────────┘
                          │
               ┌──────────▼──────────┐
               │      BuildPass      │  reads: raw_nodes → writes: tree
               └──────────┬──────────┘
        ┌─────────────────┼────────────────┐
        │                 │                │
┌───────▼────────┐ ┌──────▼───────┐ ┌──────▼───────┐
│  ValidatePass  │ │  SplitPass   │ │ EnhancePass  │  reads: tree → writes: tree
│  (read-only)   │ │              │ │              │
└────────────────┘ └──────────────┘ └──────┬───────┘
                                           │
                                ┌──────────▼──────────┐
                                │     EnrichPass      │  reads: tree → writes: tree, description
                                └──────────┬──────────┘
        ┌─────────────────┬────────────────┼────────────────┬───────────────┐
        │                 │                │                │               │
┌───────▼────────┐ ┌──────▼───────┐ ┌──────▼───────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ ReasoningPass  │ │NavigationPass│ │ OptimizePass │ │  ChainPass  │ │  ScorePass  │
│writes:reasoning│ │writes:nav_idx│ │              │ │writes:chains│ │writes:scores│
└───────┬────────┘ └──────┬───────┘ └──────────────┘ └─────────────┘ └─────────────┘
        │                 │
┌───────▼────────┐ ┌──────▼───────┐ ┌──────────────┐
│  ConceptPass   │ │  RoutePass   │ │ OverlapPass  │
│writes:concepts │ │writes:routes │ │writes:overlap│
└───────┬────────┘ └──────────────┘ └──────────────┘
        │
┌───────▼────────┐
│   VerifyPass   │  reads: tree (validation only)
└────────────────┘