Project Overview

Chunklet is a Python library for multilingual, context-aware text chunking optimized for large language model (LLM) and retrieval-augmented generation (RAG) pipelines. It splits long documents into manageable segments while preserving semantic boundaries, enabling efficient indexing, embedding, and inference.

Core Value Proposition

  • Context-Aware Splitting
    Leverages sentence boundaries, token counts, or a hybrid strategy to maintain coherence.
  • Multilingual Support
    Auto-detects or explicitly handles languages such as English, Spanish, Catalan, Haitian Creole, and more.
  • High Performance
    Sub-30 ms per single run and sub-0.2 s for 100-text batches in hybrid mode (English).
  • Extensible API
    Customize token counters, overlap, and sentence splitters via Pydantic models.
  • CLI & Programmatic Interfaces
    Use chunklet-py in shell scripts or import Chunklet in Python projects.

Primary Use Cases

  • RAG Pipelines
    Pre-chunk large corpora for vector stores (e.g., FAISS, Pinecone) and retrieve relevant passages.
  • LLM Prompt Engineering
    Break prompts or knowledge bases into token- or sentence-level chunks to fit context windows.
  • Multilingual Preprocessing
    Apply consistent chunking across diverse languages with automatic detection.
  • Batch Processing
    Efficiently split thousands of documents in seconds for ETL workflows.

Performance Highlights

  • Single-Run (100 runs on 1 KB English text)
    • sentence: ~0.035 s avg
    • token: ~0.019 s avg
    • hybrid: ~0.028 s avg
  • Batch-Run (100 × 1 KB texts)
    • sentence: ~0.210 s total
    • token: ~0.145 s total
    • hybrid: ~0.136 s total
  • Similar performance profiles apply to Catalan and Haitian Creole, with ±10% variance.

Refer to BENCHMARKS.md for full per-language tables.

Supported Modes & Languages

  • Modes
    • sentence: split on punctuation and whitespace
    • token: split by fixed token counts
    • hybrid: token-aware sentence splitting
  • Languages
    • Auto-detect (lang='auto')
    • Explicit codes: 'en', 'es', 'ca', 'ht', etc.
  • Advanced
    • Custom splitters via CustomSplitterConfig
    • Override token counters per call or globally

Quick Start

# Install and verify version
pip install chunklet-py
chunklet-py --version

from chunklet import Chunklet

# Initialize with caching and debug logging
chunker = Chunklet(verbose=True, use_cache=True)

# Chunk a single document in hybrid mode
text = "Once upon a time…"
chunks = chunker.chunk(
    text=text,
    mode="hybrid",
    max_sentences=3,
    max_tokens=150,
    overlap_percent=10.0,
    lang="en"
)

for i, seg in enumerate(chunks):
    print(f"[Chunk {i+1}] {seg}")

This overview equips you to select the right chunking strategy, integrate chunklet into LLM/RAG workflows, and scale text preprocessing across languages.

2. Getting Started

Kick off by installing Chunklet, verifying your setup, and running your first text-chunking call.

Installation

Install the latest release from PyPI:

pip install chunklet-py

Or install from source (requires Python 3.8+):

git clone https://github.com/speedyk-005/chunklet-py.git
cd chunklet-py
pip install .

Verify Your Environment

Confirm Chunklet is available and check its version:

python3 - <<EOF
import chunklet
print("Chunklet version:", chunklet.__version__)
EOF

If you see a version string, you’re ready to chunk.

First Chunking Call (Python)

Split a simple paragraph into 3-sentence chunks:

from chunklet import Chunklet

text = (
    "Chunklet makes multilingual text chunking simple. "
    "It supports sentence, token, and hybrid modes. "
    "You can configure overlap for context preservation. "
    "This helps when feeding chunks into LLMs or search indexes. "
    "Installation takes just one pip command."
)

# Initialize chunker (verbose=True prints debug info)
chunker = Chunklet(verbose=True)

# Chunk by sentences: max 3 sentences per chunk
chunks = chunker.chunk(
    text=text,
    mode="sentence",
    max_sentences=3
)

for i, c in enumerate(chunks, 1):
    print(f"Chunk {i}:\n{c}\n")

Using the CLI

Chunklet includes a chunklet command for quick experiments:

# Write your text to input.txt, then:
chunklet \
  --file input.txt \
  --mode hybrid \
  --max-sentences 5 \
  --max-tokens 200 \
  --overlap-percent 10

Key flags:

--mode chooses sentence, token, or hybrid.
--max-sentences and --max-tokens set your chunk limits.
--overlap-percent retains context between token/hybrid chunks.

Run chunklet --help for full options.

Exploring Examples

Use the bundled examples to explore each chunking strategy:

# Sentence-only mode
python examples/sentence_mode.py

# Token-based mode
python examples/token_mode.py

# Hybrid mode (sentence + token)
python examples/hybrid_mode.py

Each script shows realistic setup, imports, and output formatting. Modify parameters in these examples to fit your data.

3. Core Concepts

Chunklet breaks text into manageable pieces using three modes (sentence, token, hybrid) with configurable overlap, optional caching, automatic language detection, and pluggable splitters. This section explains how each feature works and how to tailor Chunklet for your workflow.

Chunking Modes

Sentence Mode

Groups contiguous sentences into chunks.

  • Splits text into sentences via custom splitters → pysbd (if installed) → universal_splitter fallback.
  • Parameters per call: max_sentences, overlap_sentences.

from chunklet import Chunklet
from chunklet.models import ChunkletInitConfig

cfg = ChunkletInitConfig(default_language="en")
chunker = Chunklet(cfg)

text = "Sentence one. Sentence two! Sentence three?"
# 2 sentences per chunk, 1-sentence overlap
chunks = chunker.chunk(
    text=text,
    mode="sentence",
    max_sentences=2,
    overlap_sentences=1
)
print(chunks)
# → ["Sentence one. Sentence two!", "Sentence two! Sentence three?"]

Token Mode

Splits text by approximate token counts.

  • Tokenization uses GPT-compatible counters.
  • Parameters: max_tokens, overlap_tokens.

# 50-token chunks with 10-token overlap
long_text = "..."  # any long input document
chunks = chunker.chunk(
    mode="token",
    max_tokens=50,
    overlap_tokens=10
)

Hybrid Mode

Combines sentence and token logic.

  • Splits into sentences, groups by max_sentences.
  • If a sentence exceeds max_tokens, it splits by token mode internally.
  • Supports both overlap_sentences and overlap_tokens.

# Hybrid: up to 3 sentences or 80 tokens, with overlaps
complex_doc = "..."  # any longer document
chunks = chunker.chunk(
    text=complex_doc,
    mode="hybrid",
    max_sentences=3,
    max_tokens=80,
    overlap_sentences=1,
    overlap_tokens=20
)

Overlap Logic

Overlap carries context into the next chunk to preserve continuity.

  • Sentence overlap: repeats the last N sentences of one chunk at the start of the next.
  • Token overlap: repeats the last M tokens similarly.

# 3-sentence chunks with 2-sentence overlap
chunks = chunker.chunk(text, mode="sentence", max_sentences=3, overlap_sentences=2)
# 100-token chunks with 15-token overlap
chunks = chunker.chunk(text, mode="token", max_tokens=100, overlap_tokens=15)

Caching

Enable caching to speed up repeated splits and token counts.

  • Toggle via use_cache in ChunkletInitConfig.
  • In-memory cache keys on (text, lang, mode, parameters).
  • Clear cache manually with chunker.clear_cache().

cfg = ChunkletInitConfig(use_cache=True)
chunker = Chunklet(cfg)

# first call: computes splits
chunks1 = chunker.chunk(text, mode="sentence", max_sentences=2)

# second call with same args: returns cached result
chunks2 = chunker.chunk(text, mode="sentence", max_sentences=2)

# clear cache when underlying text or params change
chunker.clear_cache()

Batch processing reuses cached pieces:

docs = ["Doc one...", "Doc two..."]
batches = chunker.batch_chunk(
    texts=docs,
    mode="hybrid",
    max_sentences=4,
    max_tokens=60
)

Language Detection

If you omit lang or set it to None, Chunklet calls detect_text_language for you.

from chunklet.utils.detect_text_language import detect_text_language

# manual detection
lang, score = detect_text_language("¡Hola mundo!")
print(lang, score)  # → "es", 0.98

# automatic detection in chunking
chunks = chunker.chunk(text="Bonjour tout le monde!", lang=None, mode="sentence")

Use the returned confidence to guard downstream logic:

lang, conf = detect_text_language(user_input)
if conf < 0.7:
    lang = "en"  # e.g., fall back to a default language or re-prompt the user

Customization

Default Parameters via Init Config

Set global defaults so you don’t repeat them per call:

cfg = ChunkletInitConfig(
    default_language="fr",
    max_sentences=5,
    overlap_sentences=1,
    max_tokens=100,
    overlap_tokens=20,
    use_cache=True
)
chunker = Chunklet(cfg)
# now chunk() uses these defaults if params are omitted
chunks = chunker.chunk(text, mode="hybrid")

Custom Sentence Splitters

Register language-specific splitters before built-ins:

from chunklet.models import CustomSplitterConfig

def my_splitter(text: str) -> list[str]:
    return text.split("|||")  # domain-specific separator

config = CustomSplitterConfig(
    name="pipe_split",
    languages=["en"],
    callback=my_splitter
)

cfg = ChunkletInitConfig(custom_splitters=[config])
chunker = Chunklet(cfg)

sentences, warnings = chunker.preview_sentences("A|||B|||C", lang="en")
# sentences → ["A", "B", "C"]

Custom splitters run in order; unmatched texts fall back to pysbd/universal.

4. Python API Reference

Authoritative reference for all public Python interfaces: constructor and method signatures, parameters, return types, and raised exceptions.


4.1 class Chunklet

Primary API for context-aware, multilingual text chunking.

Constructor

from chunklet import Chunklet, CustomSplitterConfig
from typing import Callable, Iterable, List, Optional, Tuple

def __init__(
    self,
    verbose: bool = False,
    cache_dir: Optional[str] = None,
    custom_splitters: Optional[List[CustomSplitterConfig]] = None,
) -> None

  • verbose (bool): enable detailed logging and warnings.
  • cache_dir (str, optional): directory path for sentence-split cache.
  • custom_splitters (List[CustomSplitterConfig], optional): override built-in sentence splitters.

chunk

def chunk(
    self,
    text: str,
    lang: str,
    mode: str = "sentence",
    max_sentences: Optional[int] = None,
    max_tokens: Optional[int] = None,
    overlap: int = 0,
    token_counter: Optional[Callable[[str], int]] = None,
) -> List[str]

Split text into chunks.

Parameters:

  • text (str): input document.
  • lang (str): ISO language code (e.g. "en", "fr", "markdown").
  • mode (str): "sentence", "token" or "hybrid".
  • max_sentences (int, optional): target sentences per chunk (sentence/hybrid).
  • max_tokens (int, optional): target tokens per chunk (token/hybrid).
  • overlap (int): number of sentences or tokens to overlap between chunks.
  • token_counter (callable, optional): function mapping a string to its token count; required for token modes.

Returns:

  • List[str]: list of text chunks.

Raises:

  • InvalidInputError: if text is empty or non-string.
  • TokenNotProvidedError: if mode includes "token" but token_counter is None.
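
A minimal usage sketch based on the signature above (the whitespace counter is illustrative, not the library's own tokenizer):

from chunklet import Chunklet
from chunklet.exceptions import TokenNotProvidedError

chunker = Chunklet()

# Token mode requires a token_counter.
chunks = chunker.chunk(
    "First sentence. Second sentence. Third sentence.",
    lang="en",
    mode="token",
    max_tokens=10,
    token_counter=lambda s: len(s.split()),
)

try:
    chunker.chunk("More text here.", lang="en", mode="token", max_tokens=10)
except TokenNotProvidedError:
    print("token mode needs a token_counter")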

preview_sentences

def preview_sentences(
    self,
    text: str,
    lang: str
) -> Tuple[List[str], List[str]]

Generate raw sentences and collect splitter warnings without forming chunks.

Parameters:

  • text (str): input document.
  • lang (str): ISO language code.

Returns:

  • Tuple[List[str], List[str]]:
    • first element: list of sentences.
    • second element: list of warning messages.

Raises:

  • InvalidInputError: if text is invalid.
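
For example, unpacking the two lists per the return type above (the sample output is illustrative):

chunker = Chunklet()
sentences, warnings = chunker.preview_sentences(
    "Dr. Smith arrived. He was late.", lang="en"
)
print(sentences)   # e.g. ["Dr. Smith arrived.", "He was late."]
for w in warnings:
    print("splitter warning:", w)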

batch_chunk

def batch_chunk(
    self,
    texts: Iterable[str],
    lang: str,
    **kwargs
) -> List[List[str]]

Apply chunk() over multiple documents.

Parameters:

  • texts (Iterable[str]): sequence of input strings.
  • lang (str): ISO language code.
  • **kwargs: passed directly to chunk().

Returns:

  • List[List[str]]: list of chunk lists per document.

Raises:

  • InvalidInputError: if any element in texts is invalid.
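
A short sketch of batching two documents:

chunker = Chunklet()
docs = ["First doc. It has two sentences.", "Second doc. Also two sentences."]
per_doc = chunker.batch_chunk(docs, lang="en", mode="sentence", max_sentences=1)
# per_doc[0] holds the chunks for docs[0], and so on.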

clear_cache

def clear_cache(self) -> None

Remove all cached sentence splits (if cache_dir was set).


4.2 chunklet.exceptions

Custom exception hierarchy for precise error handling.

ChunkletError

class ChunkletError(Exception):
    """Base class for all Chunklet-specific errors."""

InvalidInputError

class InvalidInputError(ChunkletError):
    def __init__(self, message: str = "Invalid input provided.") -> None

Raised when a caller provides malformed or empty input.

TokenNotProvidedError

class TokenNotProvidedError(ChunkletError):
    def __init__(self) -> None

Raised when a token-based chunking operation is invoked without a token_counter.
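
A sketch of error handling using this hierarchy: catch the specific subclass first, then fall back to the base class.

from chunklet import Chunklet
from chunklet.exceptions import ChunkletError, InvalidInputError

chunker = Chunklet()
try:
    chunker.chunk("", lang="en", mode="sentence", max_sentences=2)
except InvalidInputError as e:
    print("bad input:", e)       # empty or non-string text
except ChunkletError as e:
    print("chunklet error:", e)  # any other library-specific failure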


4.3 chunklet.utils

Utility functions for configuration and string handling.

from typing import Any, Dict, List, Optional

load_env

def load_env(
    env_file: str,
    required: Optional[List[str]] = None
) -> Dict[str, str]

Read key/value pairs from a .env file into a dict, validating required keys.

  • env_file (str): path to the .env file.
  • required (List[str], optional): list of keys that must be present.

Returns:

  • Dict[str, str]: loaded environment variables.

Raises:

  • FileNotFoundError: if env_file does not exist.
  • KeyError: if any required key is missing.

get_config

def get_config(
    key: str,
    default: Any = None
) -> Any

Fetch a configuration value from environment or default.

  • key (str): configuration key.
  • default (Any): fallback value if key is unset.

Returns:

  • Any: configuration value or default.
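
An illustrative sketch of both helpers, assuming they import from chunklet.utils as documented; the .env keys are hypothetical:

from chunklet.utils import load_env, get_config

# Raises FileNotFoundError if .env is missing, KeyError if API_KEY is absent.
env = load_env(".env", required=["API_KEY"])
api_key = env["API_KEY"]

# Returns "INFO" when LOG_LEVEL is unset.
log_level = get_config("LOG_LEVEL", default="INFO")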

slugify

def slugify(text: str) -> str

Convert text into a lowercase, URL-friendly slug.

  • text (str): input string.

Returns:

  • str: slugified string.

camel_to_snake

def camel_to_snake(text: str) -> str

Transform a CamelCase or mixed string into snake_case.

  • text (str): CamelCase input.

Returns:

  • str: snake_case string.
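
Illustrative calls; the outputs shown are what the descriptions imply, not verified library output:

from chunklet.utils import slugify, camel_to_snake

print(slugify("Chunklet: Fast Chunking!"))   # e.g. "chunklet-fast-chunking"
print(camel_to_snake("ChunkletInitConfig"))  # e.g. "chunklet_init_config"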

5. Command-Line Interface

Chunklet’s chunklet command processes text from files, directories, or STDIN and emits chunked output to STDOUT or into files. You can control chunk size, overlap, parallelism, caching, language-specific splitting, and external tokenization.

5.1 Basic Invocation

Process a single text string on STDIN and print chunks:

echo "This is a test. It has several sentences to chunk." \
  | chunklet \
    --mode sentence \
    --max-sentences 2

5.2 Input and Output

5.2.1 Files and Directories

Pass one or more paths. Directories are scanned non-recursively for .txt files by default.

chunklet docs/chapter1.txt docs/chapter2.txt \
  --mode token \
  --max-tokens 100 \
  --output-dir out_chunks \
  --extension .chunk

This writes chapter1.chunk and chapter2.chunk under out_chunks/.

To process a folder:

chunklet path/to/texts/ \
  --mode sentence \
  --max-sentences 5 \
  --output-dir chunks/

5.2.2 STDIN / STDOUT

  • If no paths are given, Chunklet reads from STDIN.
  • Without --output-dir, Chunklet writes all chunks to STDOUT, one per line.

cat report.txt | chunklet --mode token --max-tokens 50 > report.chunks

5.3 Key Flags

Flag                      Description
--mode {sentence,token}   Chunking mode. sentence requires --max-sentences.
--max-sentences N         Maximum sentences per chunk (with --mode sentence).
--max-tokens N            Maximum tokens per chunk (with --mode token).
--overlap-percent P       Keep the last P% of sentences/tokens from the previous chunk (0–100).
--lang CODE               Sentence-splitter locale (e.g. en, fr). Defaults to en.
--n-jobs N                Number of parallel workers (via mpire). Defaults to 1.
--output-dir DIR          Write per-input chunk files to DIR instead of STDOUT.
--extension EXT           File extension for chunk files (default .chunk).
--no-cache                Disable on-disk caching between runs.
--verbose / -v            Enable debug-level logging.
--tokenizer-command CMD   Shell command for custom token counting (see “External Tokenizer Integration”).
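
For example, wiring in an external counter (count_tokens.py is a hypothetical script; the exact I/O contract is defined by the CLI, so check chunklet --help):

# count_tokens.py stands in for any command that reports token counts
chunklet input.txt \
  --mode token \
  --max-tokens 100 \
  --tokenizer-command "python count_tokens.py"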

5.4 Example Workflows

Parallel Directory Chunking

chunklet ./raw_texts/ \
  --mode sentence \
  --max-sentences 10 \
  --overlap-percent 20 \
  --n-jobs 4 \
  --output-dir ./chunks/ \
  --verbose

  • Splits each .txt in ./raw_texts/ into 10-sentence chunks with 20% overlap.
  • Uses 4 processes in parallel.
  • Writes each output as <basename>.chunk under ./chunks/.

Re-processing Without Cache

chunklet large_corpus/ \
  --mode token \
  --max-tokens 200 \
  --no-cache \
  --output-dir cached_off/

Disables caching to ensure fresh chunking on every run.

Streaming Pipeline

# Preprocess then chunk on the fly
python preprocess.py data.txt \
  | chunklet --mode sentence --max-sentences 3 \
  | python postprocess_chunks.py

  • preprocess.py writes cleaned text to STDOUT.
  • chunklet reads that stream, emits chunks to STDOUT.
  • postprocess_chunks.py consumes each chunk for downstream tasks.

5.5 Under the Hood: Argument Parsing

In src/chunklet/cli.py:

parser.add_argument(
    "inputs", nargs="*",
    help="Input files or directories. Reads STDIN if empty."
)
parser.add_argument(
    "--mode", required=True, choices=["sentence","token"],
    help="Chunk by sentences or tokens."
)
parser.add_argument("--max-sentences", type=int)
parser.add_argument("--max-tokens", type=int)
parser.add_argument("--overlap-percent", type=float, default=0)
parser.add_argument("--lang", default="en")
parser.add_argument("--n-jobs", type=int, default=1)
parser.add_argument("--output-dir")
parser.add_argument("--extension", default=".chunk")
parser.add_argument("--no-cache", action="store_true")
parser.add_argument("--verbose", "-v", action="store_true")
parser.add_argument("--tokenizer-command", type=str)

Parsed args feed into:

chunker = Chunklet(
    verbose=args.verbose,
    use_cache=not args.no_cache,
    token_counter=external_tokenizer if args.tokenizer_command else None
)
results = chunker.batch_chunk(
    texts, mode=args.mode,
    max_sentences=args.max_sentences,
    max_tokens=args.max_tokens,
    overlap_percent=args.overlap_percent,
    n_jobs=args.n_jobs,
    lang=args.lang
)

Chunklet then writes results either to STDOUT or into files under --output-dir.

6. Advanced Usage & Recipes

batch_chunk_pages: Sentence-Based PDF Chunking

batch_chunk_pages offers a one-step way to extract text from every page of a PDF, clean it, and split it into sentence-level chunks with Chunklet. It is ideal for mobile-safe processing or for feeding data into LLMs without exceeding token limits.

Essential Details

  • Uses PdfReader (from pypdf) to extract raw text per page (sketched after this list).
  • Cleans spurious line breaks, preserves headings/lists, strips standalone numbers.
  • Delegates chunking to Chunklet.batch_chunk in "sentence" mode.
  • Returns List[List[str]]: pages → sentence chunks.
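
The per-page extraction step looks roughly like this (a sketch of the pypdf calls, not the example's exact code):

from pypdf import PdfReader

reader = PdfReader("docs/Your-Doc.pdf")
# One raw string per page; extract_text() can return None for image-only pages.
pages_text = [page.extract_text() or "" for page in reader.pages]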

Method Signature

def batch_chunk_pages(self, max_sentences: int = 5) -> List[List[str]]:
    ...

  • max_sentences: max sentences per chunk.
  • Uses n_jobs=1 internally to avoid parallelism issues.
  • Defaults to French (lang="fr"); override in code to switch languages.

Code Example

from examples.pdf_chunking import PDFProcessor

pdf_path = "docs/Your-Doc.pdf"
processor = PDFProcessor(pdf_path)

# Chunk pages into groups of up to 10 sentences each
pages_chunks = processor.batch_chunk_pages(max_sentences=10)

for page_index, chunks in enumerate(pages_chunks, start=1):
    print(f"Page {page_index} has {len(chunks)} chunks")
    for chunk in chunks:
        print("  •", chunk)

Practical Usage Tips

  • To switch to English or another language:
    all_chunks = self.chunker.batch_chunk(
        pages_text,
        mode="sentence",
        max_sentences=max_sentences,
        n_jobs=1,
        lang="en"       # switch to English
    )
    
  • Tune max_sentences to balance chunk size vs. API calls.
  • For very large PDFs, instantiate multiple PDFProcessor objects with disjoint page ranges.
  • For paragraph-level chunks, set mode="default" in batch_chunk.

Troubleshooting

  • If pypdf is missing:
    pip install pypdf
    
  • To customize text cleanup, modify _cleanup_text regex patterns in examples/pdf_chunking.py.

Hybrid Chunking Mode

Use Chunklet’s "hybrid" mode to split text by both sentence and token limits, with configurable overlap to preserve context.

Why Hybrid Mode?

  • Bounds chunks by sentences and tokens.
  • Overlaps content to maintain semantic coherence.
  • Ideal for RAG pipelines, summarization, or any workflow with size constraints.

Key Parameters

  • max_sentences (int): max sentences per chunk.
  • max_tokens (int): max tokens per chunk (uses your token_counter).
  • overlap_percent (float): percent of tokens to repeat in next chunk.
  • token_counter (callable): maps a string to its token count.
  • verbose (bool): logs internal decisions for tuning.

Code Example

from chunklet import Chunklet

def simple_token_counter(text: str) -> int:
    return len(text.split())

text = """
This is a long text to demonstrate hybrid chunking. It combines both sentence and token limits for flexible chunking.
Overlap helps maintain context between chunks by repeating some clauses. This mode is very powerful for maintaining semantic coherence.
It is ideal for applications like RAG pipelines where context is crucial.
"""

chunker = Chunklet(verbose=True, token_counter=simple_token_counter)

chunks = chunker.chunk(
    text,
    mode="hybrid",
    max_sentences=2,
    max_tokens=15,
    overlap_percent=20
)

print("--- Hybrid Mode Chunks ---")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:\n{chunk}\n")

Expected Behavior

  1. Reads up to 2 sentences or 15 tokens per chunk, whichever limit is hit first.
  2. Computes 20% of the last chunk’s tokens (rounded) and prepends them to the next chunk; for a 15-token chunk, that carries 3 tokens forward.
  3. Continues until the text is fully processed.

Practical Tips

  • Increase max_sentences when preserving full thoughts matters more.
  • Increase max_tokens for strict prompt-size control.
  • Use overlap_percent of 20–30% for stronger context continuity.
  • In production, use a model-specific tokenizer (e.g., tiktoken); see the sketch after this list.
  • Set verbose=True during development to inspect chunking decisions.
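
A minimal token counter backed by tiktoken, as a sketch (assumes the tiktoken package is installed; the encoding name is just an example):

import tiktoken

from chunklet import Chunklet

enc = tiktoken.get_encoding("cl100k_base")

def tiktoken_counter(text: str) -> int:
    # Count tokens exactly as a cl100k-based model would see them.
    return len(enc.encode(text))

chunker = Chunklet(verbose=False, token_counter=tiktoken_counter)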

Benchmarking Chunklet with benchmark.py

Measure Chunklet’s chunking performance (single vs. batch) across languages and modes, and customize benchmarks for your own datasets.

Essential Information

  • Dependencies:
    • chunklet (core library)
    • rich for tables and spinners
    • loguru (logs suppressed via logger.remove())
  • Sample texts for English (en), Catalan (ca), Haitian Creole (ht), each ×10.
  • Supported modes:
    • "sentence" (max_sentences=5)
    • "token" (max_tokens=50)
    • "hybrid" (max_sentences=3 & max_tokens=50)

Running the Benchmark

  1. Install dependencies:
    pip install chunklet rich loguru
    
  2. Run:
    python benchmark.py
    
  3. View two tables:
    • Single Run: avg time per run over N=100
    • Batch Run: total time for a batch of 100 texts

Customizing the Benchmark

  • Adjust iterations or batch size:
    number_of_runs = 200  # default 100
    batch_size = 500      # default 100
    
  • Use your own texts:
    TEXTS["custom"] = open("my_text.txt", "r").read()
    
  • Swap in a custom token counter:
    import re

    def my_token_counter(text: str) -> int:
        return len(re.findall(r"\w+", text))
    
    chunker = Chunklet(token_counter=my_token_counter)
    

Key Code Snippets

Single-run timing with timeit.Timer:

def bench_single():
    return chunker.chunk(
        text_content,
        lang=lang_code,
        mode=mode,
        max_sentences=(5 if mode!="token" else None),
        max_tokens=(50 if mode!="sentence" else None)
    )

time_single = timeit.Timer(bench_single).timeit(number=number_of_runs)
avg_time = time_single / number_of_runs

Batch timing with timeit.default_timer:

texts = [text_content] * batch_size
start = timeit.default_timer()
chunks = chunker.batch_chunk(
    texts,
    lang=lang_code,
    mode=mode,
    max_sentences=(5 if mode!="token" else None),
    max_tokens=(50 if mode!="sentence" else None)
)
total_time = timeit.default_timer() - start
total_chunks = sum(len(c) for c in chunks)

Interpreting Results

  • Input Chars: length of each text input
  • Runs/Num of texts: invocation count
  • Avg. Time (s/run) vs. Total Time (s): latency vs. throughput
  • Chunks: total output chunks

Practical Tips

  • Increase batch_size for large corpora to amortize overhead.
  • Use mode="hybrid" when you need both sentence and token bounds.
  • In production, pass a high-quality tokenizer (e.g., spaCy) as token_counter; a sketch follows.
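
A spaCy-backed counter, as a sketch (assumes spaCy is installed; spacy.blank builds a tokenizer-only pipeline with no model download):

import spacy

from chunklet import Chunklet

nlp = spacy.blank("en")  # tokenizer only; fast and dependency-light

def spacy_token_counter(text: str) -> int:
    # len(doc) is the number of tokens spaCy produced.
    return len(nlp(text))

chunker = Chunklet(token_counter=spacy_token_counter)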

7. Development Guide

This section covers setting up your local development environment, running tests, enforcing code style, and submitting pull requests to the speedyk-005/chunklet-py repository.

7.1 Prerequisites and Environment Setup

Ensure you have:

  • Python 3.8–3.11
  • Git

Steps:

  1. Clone your fork and enter the directory

git clone https://github.com/<your-username>/chunklet-py.git
cd chunklet-py

  2. Create and activate a virtual environment

python -m venv .venv
source .venv/bin/activate     # macOS/Linux
# .venv\Scripts\activate      # Windows

  3. Install the package and dependencies

pip install -e .
pip install -r requirements.txt
# Install dev tools
pip install black pytest

7.2 Running Tests

All pushes and PRs trigger the GitHub Actions workflow (.github/workflows/build-and-test.yml) to run tests on Python 3.8–3.11. You can run tests locally with pytest:

  • Run the full suite

pytest

  • Run a single test file

pytest tests/test_chunklet.py

  • Run a specific test method

pytest tests/test_chunklet.py::TestChunklet::test_acronym_preservation

7.3 Code Formatting

We enforce Black for consistent styling. Run before committing:

  • Check formatting without modifying files

black --check .

  • Reformat all files in place

black .

7.4 Forking, Branching, and Pull Requests

Fork and Branch

  1. Fork the repo on GitHub: https://github.com/speedyk-005/chunklet-py (click Fork)
  2. Clone your fork locally (see 7.1)
  3. Add the upstream remote and fetch updates

git remote add upstream https://github.com/speedyk-005/chunklet-py.git
git fetch upstream

  4. Create a descriptive branch

git checkout -b feat/short-description

Commit and Push

  • Stage and commit with a concise, scoped message

git add .
git commit -m "feat(parser): support multiline comments"

  • Rebase frequently to stay in sync

git fetch upstream
git rebase upstream/main

  • Push your branch

git push origin feat/short-description

Open a Pull Request

  1. On GitHub, click Compare & pull request on your fork
  2. Set the base repository to speedyk-005/chunklet-py:main
  3. Reference related issues (e.g., “Closes #123”)
  4. Describe your change, testing steps, and any migration notes
  5. Ensure all CI checks pass before requesting review

Adhering to these steps ensures smooth reviews and integration into the main codebase.