AST-Level Code Parsing

vectorless-code uses tree-sitter to parse source code into semantic nodes — functions, classes, methods — instead of treating files as flat text. This produces a structured tree that the vectorless engine can navigate with precision.

Why AST Parsing Matters

Naive code indexing treats each file as a single block of text. When you ask "how does authentication work", the engine has to scan entire files hoping to find relevant snippets. There's no understanding of what a function is, what a class contains, or how methods relate to their parent class.

AST parsing changes this. The engine receives a tree like:

src/auth.py
├── class_definition: AuthService
│   ├── function_definition: __init__
│   ├── function_definition: login
│   └── function_definition: verify_token
└── function_definition: create_session

Now the Orchestrator can cd into AuthService, ls to see its methods, and cat login to read the authentication logic. This is the same navigation model that works for documents — applied to code with structural precision.

How It Works

Per-Language Node Types

Each language defines which AST node types represent semantic units worth indexing:

SPLITTABLE_NODE_TYPES = {
    "python": {
        "function_definition",
        "class_definition",
        "decorated_definition",
        "async_function_definition",  # may never match: tree-sitter-python parses async defs as function_definition
    },
    "rust": {
        "function_item",
        "impl_item",
        "struct_item",
        "enum_item",
        "trait_item",
        "mod_item",
    },
    # ... 12 languages total
}

tree-sitter parses the source into an AST, and vectorless-code then walks the tree, extracting every node whose type is in the set for that language. Each extracted node becomes a CodeNode with:

name — the symbol name (e.g. AuthService, login)
node_type — the AST node type (e.g. class_definition)
content — the full source code of the node
children — nested definitions (methods inside classes)

Nested Extraction

When a class is extracted, its methods are extracted as children — not as separate top-level nodes. This preserves the parent-child relationship:

# Input: Python source
class AuthService:
    def login(self, username, password):
        token = self._create_token(username)
        return token

    def verify_token(self, token):
        return self._decode(token)

# Output: CodeNode tree
CodeNode(
    name="AuthService",
    node_type="class_definition",
    children=[
        CodeNode(name="login", node_type="function_definition", ...),
        CodeNode(name="verify_token", node_type="function_definition", ...),
    ],
)

This nesting produces the raw_node tree that vectorless builds into a navigable Document. Level 1 is the file, level 2 holds its top-level definitions, and level 3 holds definitions nested inside them.
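The nesting rule can be sketched in a few lines. This is only an illustration: it swaps tree-sitter for Python's standard-library ast module so it runs without any grammar installed. The extract helper, and its choice to descend through non-matching nodes, are assumptions rather than vectorless-code's implementation. Node types therefore appear as ast class names (ClassDef, FunctionDef) instead of tree-sitter names.

```python
import ast
from dataclasses import dataclass, field
from typing import List


@dataclass
class CodeNode:
    name: str
    node_type: str
    content: str
    children: List["CodeNode"] = field(default_factory=list)


# Stand-in for SPLITTABLE_NODE_TYPES["python"], expressed as ast classes.
SPLITTABLE = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def extract(parent: ast.AST, source: str) -> List[CodeNode]:
    """Collect splittable nodes under `parent`, nesting inner definitions."""
    nodes = []
    for child in ast.iter_child_nodes(parent):
        if isinstance(child, SPLITTABLE):
            nodes.append(CodeNode(
                name=child.name,
                node_type=type(child).__name__,
                content=ast.get_source_segment(source, child) or "",
                # Definitions found inside this node become its children.
                children=extract(child, source),
            ))
        else:
            # Assumption: keep descending so definitions inside other
            # statements (e.g. an `if` block) are still found.
            nodes.extend(extract(child, source))
    return nodes


SOURCE = '''
class AuthService:
    def login(self, username, password):
        return self._create_token(username)

    def verify_token(self, token):
        return self._decode(token)

def create_session(user):
    return {"user": user}
'''

tree = extract(ast.parse(SOURCE), SOURCE)
for node in tree:
    print(node.node_type, node.name, [c.name for c in node.children])
```

The methods appear only under AuthService, never as top-level entries, which is the parent-child relationship described above.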

Name Extraction

The parser extracts human-readable names from AST nodes by finding identifier children:

function_definition → looks for identifier child → "login"
class_definition → looks for identifier child → "AuthService"
decorated_definition → recurses into the decorated node
impl_item → looks for type_identifier → "impl UserService"
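A minimal sketch of these rules follows. It uses a hypothetical FakeNode stand-in with type, text, and children attributes, loosely mirroring tree-sitter nodes (where text is bytes rather than str). The extract_name function and the DEFINITION_TYPES set are illustrative assumptions, not the project's actual code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeNode:
    """Hypothetical stand-in for a tree-sitter node."""
    type: str
    text: str = ""
    children: List["FakeNode"] = field(default_factory=list)


# Assumed set of node types that a decorator can wrap.
DEFINITION_TYPES = {"function_definition", "class_definition"}


def extract_name(node: FakeNode) -> Optional[str]:
    if node.type == "decorated_definition":
        # Recurse into the definition that the decorators wrap.
        for child in node.children:
            if child.type in DEFINITION_TYPES:
                return extract_name(child)
        return None
    if node.type == "impl_item":
        # Rust impl blocks are named after the type they implement.
        for child in node.children:
            if child.type == "type_identifier":
                return f"impl {child.text}"
        return None
    # Default rule: the first identifier child is the symbol name.
    for child in node.children:
        if child.type == "identifier":
            return child.text
    return None


login = FakeNode("function_definition", children=[FakeNode("identifier", "login")])
decorated = FakeNode("decorated_definition", children=[FakeNode("decorator", "@cached"), login])
impl = FakeNode("impl_item", children=[FakeNode("type_identifier", "UserService")])

print(extract_name(login), extract_name(decorated), extract_name(impl))
```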

Fallback Strategy

When AST extraction is not possible (unsupported language, grammar not installed, parse error) or yields no nodes, vectorless-code falls back to line-based splitting: it splits the file into blocks at blank-line boundaries. This produces flat block nodes without nesting, but the file is still indexed.

The fallback is transparent to callers. The same parse_file() function handles both paths:

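def parse_file(file_path, content, language):
    """Return semantic nodes for one file, falling back to line-based blocks."""
    parser = _get_parser(language)  # cached per language
    if parser is None:
        # No grammar is available for this language
        return fallback_split(content, file_path, language)

    nodes = ast_extract(parser, content, language)
    if not nodes:
        # No splittable nodes were extracted; index the file as flat blocks
        return fallback_split(content, file_path, language)
    return nodes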
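For illustration, the blank-line splitting could look like the sketch below. Only the fallback_split name and argument order come from the snippet above. The dictionary shape, the "block" node type, and the path:line naming scheme are assumptions made for this example.

```python
def fallback_split(content: str, file_path: str, language: str) -> list:
    """Split content into flat blocks separated by one or more blank lines."""
    blocks = []
    current = []   # lines of the block being accumulated
    start = 0      # 1-based line number where the current block begins

    def flush():
        blocks.append({
            "name": f"{file_path}:{start}",  # hypothetical naming scheme
            "node_type": "block",
            "language": language,
            "content": "\n".join(current),
            "children": [],                  # fallback blocks are never nested
        })

    for line_no, line in enumerate(content.splitlines(), start=1):
        if line.strip():
            if not current:
                start = line_no
            current.append(line)
        elif current:
            # A blank line closes the block in progress.
            flush()
            current = []
    if current:
        flush()
    return blocks


blocks = fallback_split("a = 1\nb = 2\n\n\nprint(a + b)\n", "notes.txt", "text")
for block in blocks:
    print(block["name"], repr(block["content"]))
```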

Performance Considerations

Parser Caching

tree-sitter Parser instances are cached per language. A 10,000-file Python project creates exactly one Python parser, reused for every .py file. This avoids repeated memory allocation and grammar loading.
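One way to get this behavior is memoization keyed on the language name. The sketch below uses functools.lru_cache with a hypothetical StubParser in place of a real tree-sitter Parser, so it runs without any grammar installed. The get_parser name, the SUPPORTED set, and the builds list are illustrative assumptions.

```python
from functools import lru_cache

SUPPORTED = {"python", "rust"}  # hypothetical subset of languages
builds = []                     # records each construction, for demonstration only


class StubParser:
    """Stand-in for a tree-sitter Parser bound to one grammar."""

    def __init__(self, language: str):
        self.language = language


@lru_cache(maxsize=None)
def get_parser(language: str):
    """Build the parser for a language once; later calls reuse it."""
    if language not in SUPPORTED:
        return None  # the caller falls back to line-based splitting
    builds.append(language)
    return StubParser(language)


# Simulate a 10,000-file Python project: one construction, many lookups.
for _ in range(10_000):
    get_parser("python")

print(builds)
```

Because lru_cache also stores the None result, an unsupported language is looked up only once as well.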

Single-Pass File Scan

Files are read exactly once. A single pass computes:

File hash (SHA-256 for incremental detection)
Stats (line count, byte size, language distribution)
Content for parsing
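A minimal sketch of such a scan, assuming the file's bytes are read once and every derived value is computed from that same buffer. The scan_bytes and scan_file names and the record layout are hypothetical.

```python
import hashlib
from collections import Counter


def scan_bytes(raw: bytes, language: str) -> dict:
    """Derive hash, stats, and parse content from one in-memory buffer."""
    return {
        "hash": hashlib.sha256(raw).hexdigest(),
        "lines": len(raw.splitlines()),
        "bytes": len(raw),
        "language": language,
        "content": raw.decode("utf-8", errors="replace"),
    }


def scan_file(path: str, language: str) -> dict:
    """Read the file exactly once, then reuse the buffer for everything."""
    with open(path, "rb") as handle:
        return scan_bytes(handle.read(), language)


records = [
    scan_bytes(b"def f():\n    pass\n", "python"),
    scan_bytes(b"fn main() {}\n", "rust"),
]
# The language distribution is aggregated from the per-file records.
distribution = Counter(record["language"] for record in records)
print(records[0]["lines"], records[0]["bytes"], dict(distribution))
```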

Incremental Parsing

On subsequent compiles, only files whose hash changed are re-parsed. Unchanged files reuse cached raw_nodes directly. See Incremental Compilation.
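The hash comparison can be sketched as follows. The compile_incremental function, the in-memory cache layout (path mapped to a hash and its nodes), and the parse callback are assumptions for illustration. The actual cache format is not described here.

```python
import hashlib


def compile_incremental(sources: dict, cache: dict, parse):
    """Re-parse only files whose SHA-256 differs from the cached hash.

    sources: path -> file content (str)
    cache:   path -> (hash, raw_nodes), updated in place
    parse:   callable(path, content) -> raw_nodes
    """
    result = {}
    reparsed = []
    for path, content in sources.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        cached = cache.get(path)
        if cached is not None and cached[0] == digest:
            result[path] = cached[1]  # unchanged: reuse cached nodes
        else:
            nodes = parse(path, content)
            cache[path] = (digest, nodes)
            result[path] = nodes
            reparsed.append(path)
    return result, reparsed


def toy_parse(path, content):
    # Placeholder parser: treats each whitespace-separated token as a node.
    return content.split()


cache = {}
sources = {"a.py": "x = 1", "b.py": "y = 2"}
_, first = compile_incremental(sources, cache, toy_parse)

sources["b.py"] = "y = 3"  # only b.py changes between compiles
result, second = compile_incremental(sources, cache, toy_parse)
print(first, second)
```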

Adding a New Language

To add support for a new language:

Add the language to SPLITTABLE_NODE_TYPES with the relevant AST node types
Add the tree-sitter grammar package to pyproject.toml dependencies
Add the package mapping to _LANG_PACKAGE_MAP

For example, to add Zig:

# ast_parser.py
SPLITTABLE_NODE_TYPES["zig"] = {
    "FunctionDecl",
    "TopLevelDecl",
}

_LANG_PACKAGE_MAP["zig"] = "tree_sitter_zig"

# pyproject.toml
"tree-sitter-zig>=0.21",

No other code changes are needed. The parser, cache, fallback, and incremental systems pick up the new language automatically.
