Chunking Guide

Equal Chunking

Splits content into a fixed number of chunks.

  • Pros: Simple, predictable output.

  • Cons: May split mid-logic.

  • Use When: You need uniform chunk counts.

Example:

komodo src/ --equal-chunks 5

Size-Based Chunking

Limits chunks to a maximum size (tokens or lines with semantic chunking).

  • Pros: Controls chunk size.

  • Cons: Variable chunk count.

  • Use When: Size constraints matter (e.g., LLM context windows).

Example:

komodo src/ --max-chunk-size 1000

Semantic Chunking

Splits Python files by AST (functions/classes).

  • Pros: Preserves logical units.

  • Cons: Limited to Python, requires valid syntax.

  • Use When: Processing Python code for analysis or training.

Example:

komodo src/ --max-chunk-size 200 --semantic-chunks

PDF Chunking

Komodo integrates with PyMuPDF to parse text from PDF files:

  • Text Extraction: Uses multiple methods (plain text, HTML, structured blocks) to handle various PDF layouts, including multi-column and academic papers.

  • Splitting: Divides content by pages and paragraphs, aiming to keep paragraphs whole within --max-chunk-size (in tokens).

  • Output: Each chunk includes a header like --- Page N --- to indicate page boundaries.

  • If you set file_type="pdf", only .pdf files are processed; all other files are skipped.

Ignoring/Unignoring

You can exclude or re-include files via command-line flags like:

komodo . --equal-chunks 5 \
    --ignore "**/test/**" \
    --unignore "**/test/specific_test.py"

Pattern Syntax: - Patterns use Unix shell-style wildcards (e.g., *, ?, [seq], [!seq]). - Use ** to match directories recursively (e.g., **/test/** matches all files under any test directory).

Built-In Ignore Patterns

Komodo automatically ignores:

  • .git, .idea, __pycache__, node_modules

  • Common binary file extensions (exe, dll, etc.)

  • Image files like .png, .jpg, etc.

If you want to override, pass additional patterns with --ignore or --unignore.

Priority Rules

You can specify which files to process first using priority rules:

komodo . --max-chunk-size 200 \
    --priority "*.py,10" \
    --priority "*.md,5"

This means *.py files have priority 10, *.md has priority 5. Komodo processes them in descending order of priority.