Chunking Guide
Equal Chunking
Splits content into a fixed number of chunks.
Pros: Simple, predictable output.
Cons: May split in the middle of a function or other logical unit.
Use When: You need uniform chunk counts.
Example:
komodo src/ --equal-chunks 5
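To make the idea concrete, here is a minimal Python sketch of splitting a list of lines into a fixed number of contiguous chunks. It is an illustration only, not Komodo's actual implementation; the function name `equal_chunks` is hypothetical.

```python
def equal_chunks(lines, n):
    """Split `lines` into `n` contiguous chunks whose sizes differ by at most one.

    Hypothetical helper for illustration; not part of Komodo's API.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    base, extra = divmod(len(lines), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks each take one additional line.
        size = base + (1 if i < extra else 0)
        chunks.append(lines[start:start + size])
        start += size
    return chunks
```

Because the split points depend only on the line count, a chunk boundary can land inside a function body, which is the trade-off noted under Cons.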
Size-Based Chunking
Limits each chunk to a maximum size (in tokens, or in lines when used with semantic chunking).
Pros: Controls chunk size.
Cons: Variable chunk count.
Use When: Size constraints matter (e.g., LLM context windows).
Example:
komodo src/ --max-chunk-size 1000
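The sketch below shows one way size-based packing can work: lines are added to the current chunk until the next line would exceed the limit. It assumes a token is a whitespace-separated word, which is a simplification; Komodo's real tokenizer may count differently, and `size_chunks` is a hypothetical name.

```python
def size_chunks(lines, max_tokens):
    """Greedily pack lines into chunks of at most `max_tokens` tokens.

    Assumption: one token == one whitespace-separated word.
    A single line longer than the limit still gets its own chunk.
    """
    chunks, current, used = [], [], 0
    for line in lines:
        cost = len(line.split())
        if current and used + cost > max_tokens:
            # Adding this line would overflow, so close the current chunk.
            chunks.append(current)
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

The number of chunks falls out of the input size, which is why the chunk count is variable.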
Semantic Chunking
Splits Python files along AST boundaries (functions and classes).
Pros: Preserves logical units.
Cons: Limited to Python, requires valid syntax.
Use When: Processing Python code for analysis or training.
Example:
komodo src/ --max-chunk-size 200 --semantic-chunks
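As a rough illustration of AST-based splitting, the following sketch uses Python's standard `ast` module to extract each top-level function and class as its own unit. The helper name `semantic_units` is hypothetical, and Komodo's own handling of nested definitions, decorators, and module-level code may differ.

```python
import ast


def semantic_units(source):
    """Return (name, code) pairs for top-level functions and classes.

    Hypothetical helper for illustration. Requires Python 3.8+ for
    `end_lineno`. Raises SyntaxError if `source` is not valid Python.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    units = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based; end_lineno is inclusive.
            code = "\n".join(lines[node.lineno - 1:node.end_lineno])
            units.append((node.name, code))
    return units
```

Since `ast.parse` raises `SyntaxError` on malformed input, a file with invalid syntax cannot be split this way, which matches the limitation listed under Cons.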
PDF Chunking
Komodo integrates with PyMuPDF to parse text from PDF files:
Text Extraction: Uses multiple methods (plain text, HTML, structured blocks) to handle various PDF layouts, including multi-column and academic papers.
Splitting: Divides content by pages and paragraphs, aiming to keep paragraphs whole within --max-chunk-size (in tokens).
Output: Each chunk includes a header like --- Page N --- to indicate page boundaries.
If you set file_type="pdf", only .pdf files are processed; all other files are skipped.
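The page-and-paragraph splitting can be sketched as follows. This example assumes the page texts have already been extracted (for instance with PyMuPDF) and passed in as a list of strings, that paragraphs are separated by blank lines, and that a token is a whitespace-separated word. The function name `pdf_chunks` is hypothetical and this is not Komodo's actual code.

```python
def pdf_chunks(page_texts, max_tokens):
    """Chunk per-page text by paragraph, prefixing each chunk with a page header.

    Assumptions: `page_texts[i]` is the extracted text of page i + 1,
    paragraphs are separated by blank lines, and tokens are words.
    """
    chunks = []
    for number, text in enumerate(page_texts, start=1):
        header = f"--- Page {number} ---"
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        current, used = [], 0
        for para in paragraphs:
            cost = len(para.split())
            if current and used + cost > max_tokens:
                # Close the chunk rather than splitting the paragraph.
                chunks.append("\n\n".join([header] + current))
                current, used = [], 0
            current.append(para)
            used += cost
        if current:
            chunks.append("\n\n".join([header] + current))
    return chunks
```

In this sketch a chunk never spans two pages, and a paragraph larger than the limit is kept whole in its own chunk.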
Ignoring/Unignoring
You can exclude or re-include files via command-line flags like:
komodo . --equal-chunks 5 \
--ignore "**/test/**" \
--unignore "**/test/specific_test.py"
Pattern Syntax:
- Patterns use Unix shell-style wildcards (e.g., *, ?, [seq], [!seq]).
- Use ** to match directories recursively (e.g., **/test/** matches all files under any test directory).
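The interaction between ignore and unignore patterns can be approximated with Python's standard `fnmatch` module, as in the sketch below. It assumes that an unignore match always wins over an ignore match; that precedence, and the helper name `is_ignored`, are assumptions for illustration. Note that in `fnmatch` a `*` already matches path separators, so `**` behaves like `*` here.

```python
from fnmatch import fnmatchcase


def is_ignored(path, ignore, unignore):
    """Return True if `path` should be skipped.

    Assumption: any matching unignore pattern overrides all ignore patterns.
    Paths are expected to use forward slashes.
    """
    if any(fnmatchcase(path, pattern) for pattern in unignore):
        return False
    return any(fnmatchcase(path, pattern) for pattern in ignore)
```

With the patterns from the command above, `src/test/helper.py` would be skipped while `src/test/specific_test.py` would still be processed.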
Built-In Ignore Patterns
Komodo automatically ignores:
.git, .idea, __pycache__, node_modules
Common binary file extensions (exe, dll, etc.)
Image files like .png, .jpg, etc.
To override these defaults, pass additional patterns with --ignore or --unignore.
Priority Rules
You can specify which files to process first using priority rules:
komodo . --max-chunk-size 200 \
--priority "*.py,10" \
--priority "*.md,5"
Here, *.py files get priority 10 and *.md files get priority 5. Komodo processes files in descending order of priority, so the Python files come first.
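The ordering described here can be sketched in a few lines of Python. The example parses each "pattern,score" rule, gives every file the highest score among the rules it matches, and sorts in descending order. Treating unmatched files as priority 0 and keeping ties in their original order are assumptions, and the function names are hypothetical.

```python
from fnmatch import fnmatchcase


def parse_rule(rule):
    """Split a 'pattern,score' string on its last comma."""
    pattern, _, score = rule.rpartition(",")
    return pattern, int(score)


def order_by_priority(paths, rules):
    """Sort paths by their highest matching priority, highest first.

    Assumptions: unmatched paths get priority 0; ties keep input order.
    """
    parsed = [parse_rule(rule) for rule in rules]

    def priority(path):
        scores = [score for pattern, score in parsed if fnmatchcase(path, pattern)]
        return max(scores, default=0)

    # sorted() is stable, so equal-priority paths stay in input order.
    return sorted(paths, key=priority, reverse=True)
```

For instance, `order_by_priority(["README.md", "notes.txt", "app.py"], ["*.py,10", "*.md,5"])` places `app.py` first, then `README.md`, then `notes.txt`.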