Project Overview

Chunklet is a Python library for multilingual, context-aware text chunking optimized for large language model (LLM) and retrieval-augmented generation (RAG) pipelines. It splits long documents into manageable segments while preserving semantic boundaries, enabling efficient indexing, embedding, and inference.

Core Value Proposition

  • Context-Aware Splitting
    Leverages sentence boundaries, token counts, or a hybrid strategy to maintain coherence.
  • Multilingual Support
    Auto-detects or explicitly handles languages such as English, Spanish, Catalan, Haitian Creole, and more.
  • High Performance
    Sub-30 ms per single run and sub-0.2 s for 100-text batches in hybrid mode (English).
  • Extensible API
    Customize token counters, overlap, and sentence splitters via Pydantic models.
  • CLI & Programmatic Interfaces
    Use chunklet-py in shell scripts or import Chunklet in Python projects.

Primary Use Cases

  • RAG Pipelines
    Pre-chunk large corpora for vector stores (e.g., FAISS, Pinecone) and retrieve relevant passages.
  • LLM Prompt Engineering
    Break prompts or knowledge bases into token- or sentence-level chunks to fit context windows.
  • Multilingual Preprocessing
    Apply consistent chunking across diverse languages with automatic detection.
  • Batch Processing
    Efficiently split thousands of documents in seconds for ETL workflows.

Performance Highlights

  • Single-Run (100 runs on 1 KB English text)
    • sentence: ~0.035 s avg
    • token: ~0.019 s avg
    • hybrid: ~0.028 s avg
  • Batch-Run (100 × 1 KB texts)
    • sentence: ~0.210 s total
    • token: ~0.145 s total
    • hybrid: ~0.136 s total
  • Similar performance profiles apply to Catalan and Haitian Creole, with ±10% variance.

Refer to BENCHMARKS.md for full per-language tables.

Supported Modes & Languages

  • Modes
    • sentence: split on punctuation and whitespace
    • token: split by fixed token counts
    • hybrid: token-aware sentence splitting
  • Languages
    • Auto-detect (lang='auto')
    • Explicit codes: 'en', 'es', 'ca', 'ht', etc.
  • Advanced
    • Custom splitters via CustomSplitterConfig
    • Override token counters per call or globally

Quick Start

# Install and verify version
pip install chunklet-py
chunklet-py --version

from chunklet import Chunklet

# Initialize with caching and debug logging
chunker = Chunklet(verbose=True, use_cache=True)

# Chunk a single document in hybrid mode
text = "Once upon a time…"
chunks = chunker.chunk(
    text=text,
    mode="hybrid",
    max_sentences=3,
    max_tokens=150,
    overlap_percent=10.0,
    lang="en"
)

for i, seg in enumerate(chunks):
    print(f"[Chunk {i+1}] {seg}")

This overview equips you to select the right chunking strategy, integrate chunklet into LLM/RAG workflows, and scale text preprocessing across languages.

2. Getting Started

Kick off by installing Chunklet, verifying your setup, and running your first text-chunking call.

Installation

Install the latest release from PyPI:

pip install chunklet-py

Or install from source (requires Python 3.8+):

git clone https://github.com/speedyk-005/chunklet-py.git
cd chunklet-py
pip install .

Verify Your Environment

Confirm Chunklet is available and check its version:

python3 - <<EOF
import chunklet
print("Chunklet version:", chunklet.__version__)
EOF

If you see a version string, you’re ready to chunk.

First Chunking Call (Python)

Split a simple paragraph into 3-sentence chunks:

from chunklet import Chunklet

text = (
    "Chunklet makes multilingual text chunking simple. "
    "It supports sentence, token, and hybrid modes. "
    "You can configure overlap for context preservation. "
    "This helps when feeding chunks into LLMs or search indexes. "
    "Installation takes just one pip command."
)

# Initialize chunker (verbose=True prints debug info)
chunker = Chunklet(verbose=True)

# Chunk by sentences: max 3 sentences per chunk
chunks = chunker.chunk(
    text=text,
    mode="sentence",
    max_sentences=3
)

for i, c in enumerate(chunks, 1):
    print(f"Chunk {i}:\n{c}\n")

Using the CLI

Chunklet includes a chunklet command for quick experiments:

# Write your text to input.txt, then:
chunklet \
  --file input.txt \
  --mode hybrid \
  --max-sentences 5 \
  --max-tokens 200 \
  --overlap-percent 10

Key flags:

--mode chooses sentence, token, or hybrid.
--max-sentences and --max-tokens set your chunk limits.
--overlap-percent retains context between token/hybrid chunks.

Run chunklet --help for full options.

Exploring Examples

Use the bundled examples to explore each chunking strategy:

# Sentence-only mode
python examples/sentence_mode.py

# Token-based mode
python examples/token_mode.py

# Hybrid mode (sentence + token)
python examples/hybrid_mode.py

Each script shows realistic setup, imports, and output formatting. Modify parameters in these examples to fit your data.

3. Core Concepts

Chunklet breaks text into manageable pieces using three modes (sentence, token, hybrid) with configurable overlap, optional caching, automatic language detection, and pluggable splitters. This section explains how each feature works and how to tailor Chunklet for your workflow.

Chunking Modes

Sentence Mode

Groups contiguous sentences into chunks.

  • Splits text into sentences via custom splitters → pysbd (if installed) → universal_splitter fallback.
  • Parameters per call: max_sentences, overlap_sentences.

from chunklet import Chunklet
from chunklet.models import ChunkletInitConfig

cfg = ChunkletInitConfig(default_language="en")
chunker = Chunklet(cfg)

text = "Sentence one. Sentence two! Sentence three?"
# 2 sentences per chunk, 1-sentence overlap
chunks = chunker.chunk(
    text=text,
    mode="sentence",
    max_sentences=2,
    overlap_sentences=1
)
print(chunks)
# → ["Sentence one. Sentence two!", "Sentence two! Sentence three?"]

Token Mode

Splits text by approximate token counts.

  • Tokenization uses GPT-compatible counters.
  • Parameters: max_tokens, overlap_tokens.

# 50-token chunks with 10-token overlap
long_text = "..."  # any long input document
chunks = chunker.chunk(
    mode="token",
    max_tokens=50,
    overlap_tokens=10
)

Hybrid Mode

Combines sentence and token logic.

  • Splits into sentences, groups by max_sentences.
  • If a sentence exceeds max_tokens, it splits by token mode internally.
  • Supports both overlap_sentences and overlap_tokens.

# Hybrid: up to 3 sentences or 80 tokens, with overlaps
complex_doc = "..."  # any longer document
chunks = chunker.chunk(
    text=complex_doc,
    mode="hybrid",
    max_sentences=3,
    max_tokens=80,
    overlap_sentences=1,
    overlap_tokens=20
)

Overlap Logic

Overlap carries context into the next chunk to preserve continuity.

  • Sentence overlap: repeats the last N sentences of one chunk at the start of the next.
  • Token overlap: repeats the last M tokens similarly.

# 3-sentence chunks with 2-sentence overlap
chunks = chunker.chunk(text, mode="sentence", max_sentences=3, overlap_sentences=2)
# 100-token chunks with 15-token overlap
chunks = chunker.chunk(text, mode="token", max_tokens=100, overlap_tokens=15)

Caching

Enable caching to speed up repeated splits and token counts.

  • Toggle via use_cache in ChunkletInitConfig.
  • In-memory cache keys on (text, lang, mode, parameters).
  • Clear cache manually with chunker.clear_cache().

cfg = ChunkletInitConfig(use_cache=True)
chunker = Chunklet(cfg)

# first call: computes splits
chunks1 = chunker.chunk(text, mode="sentence", max_sentences=2)

# second call with same args: returns cached result
chunks2 = chunker.chunk(text, mode="sentence", max_sentences=2)

# clear cache when underlying text or params change
chunker.clear_cache()

Batch processing reuses cached pieces:

docs = ["Doc one...", "Doc two..."]
batches = chunker.batch_chunk(
    texts=docs,
    mode="hybrid",
    max_sentences=4,
    max_tokens=60
)

Language Detection

If you omit lang or set it to None, Chunklet calls detect_text_language for you.

from chunklet.utils.detect_text_language import detect_text_language

# manual detection
lang, score = detect_text_language("¡Hola mundo!")
print(lang, score)  # → "es", 0.98

# automatic detection in chunking
chunks = chunker.chunk(text="Bonjour tout le monde!", lang=None, mode="sentence")

Use the returned confidence to guard downstream logic:

lang, conf = detect_text_language(user_input)
if conf < 0.7:
    lang = "en"  # e.g., fall back to a default language or re-prompt the user

Customization

Default Parameters via Init Config

Set global defaults so you don’t repeat them per call:

cfg = ChunkletInitConfig(
    default_language="fr",
    max_sentences=5,
    overlap_sentences=1,
    max_tokens=100,
    overlap_tokens=20,
    use_cache=True
)
chunker = Chunklet(cfg)
# now chunk() uses these defaults if params are omitted
chunks = chunker.chunk(text, mode="hybrid")

Custom Sentence Splitters

Register language-specific splitters before built-ins:

from chunklet.models import CustomSplitterConfig

def my_splitter(text: str) -> list[str]:
    return text.split("|||")  # domain-specific separator

config = CustomSplitterConfig(
    name="pipe_split",
    languages=["en"],
    callback=my_splitter
)

cfg = ChunkletInitConfig(custom_splitters=[config])
chunker = Chunklet(cfg)

sentences, warnings = chunker.preview_sentences("A|||B|||C", lang="en")
# sentences → ["A", "B", "C"]

Custom splitters run in order; unmatched texts fall back to pysbd/universal.

4. Python API Reference

Authoritative reference for all public Python interfaces: constructor and method signatures, parameters, return types, and raised exceptions.


4.1 class Chunklet

Primary API for context-aware, multilingual text chunking.

Constructor

from chunklet import Chunklet, CustomSplitterConfig
from typing import Callable, Iterable, List, Optional, Tuple

def __init__(
    self,
    verbose: bool = False,
    cache_dir: Optional[str] = None,
    custom_splitters: Optional[List[CustomSplitterConfig]] = None,
) -> None

  • verbose (bool): enable detailed logging and warnings.
  • cache_dir (str, optional): directory path for sentence-split cache.
  • custom_splitters (List[CustomSplitterConfig], optional): override built-in sentence splitters.

chunk

def chunk(
    self,
    text: str,
    lang: str,
    mode: str = "sentence",
    max_sentences: Optional[int] = None,
    max_tokens: Optional[int] = None,
    overlap: int = 0,
    token_counter: Optional[Callable[[str], int]] = None,
) -> List[str]

Split text into chunks.

Parameters:

  • text (str): input document.
  • lang (str): ISO language code (e.g. "en", "fr", "markdown").
  • mode (str): "sentence", "token" or "hybrid".
  • max_sentences (int, optional): target sentences per chunk (sentence/hybrid).
  • max_tokens (int, optional): target tokens per chunk (token/hybrid).
  • overlap (int): number of sentences or tokens to overlap between chunks.
  • token_counter (callable, optional): function mapping a string to its token count; required for token modes.

Returns:

  • List[str]: list of text chunks.

Raises:

  • InvalidInputError: if text is empty or non-string.
  • TokenNotProvidedError: if mode includes "token" but token_counter is None.
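
A minimal usage sketch based on the signature above (the whitespace counter is illustrative, not the library's own tokenizer):

from chunklet import Chunklet
from chunklet.exceptions import TokenNotProvidedError

chunker = Chunklet()

# Token mode requires a token_counter.
chunks = chunker.chunk(
    "First sentence. Second sentence. Third sentence.",
    lang="en",
    mode="token",
    max_tokens=10,
    token_counter=lambda s: len(s.split()),
)

try:
    chunker.chunk("More text here.", lang="en", mode="token", max_tokens=10)
except TokenNotProvidedError:
    print("token mode needs a token_counter")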

preview_sentences

def preview_sentences(
    self,
    text: str,
    lang: str
) -> Tuple[List[str], List[str]]

Generate raw sentences and collect splitter warnings without forming chunks.

Parameters:

  • text (str): input document.
  • lang (str): ISO language code.

Returns:

  • Tuple[List[str], List[str]]:
    • first element: list of sentences.
    • second element: list of warning messages.

Raises:

  • InvalidInputError: if text is invalid.
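
For example, unpacking the two lists per the return type above (the sample output is illustrative):

chunker = Chunklet()
sentences, warnings = chunker.preview_sentences(
    "Dr. Smith arrived. He was late.", lang="en"
)
print(sentences)   # e.g. ["Dr. Smith arrived.", "He was late."]
for w in warnings:
    print("splitter warning:", w)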

batch_chunk

def batch_chunk(
    self,
    texts: Iterable[str],
    lang: str,
    **kwargs
) -> List[List[str]]

Apply chunk() over multiple documents.

Parameters:

  • texts (Iterable[str]): sequence of input strings.
  • lang (str): ISO language code.
  • **kwargs: passed directly to chunk().

Returns:

  • List[List[str]]: list of chunk lists per document.

Raises:

  • InvalidInputError: if any element in texts is invalid.
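
A short sketch of batching two documents:

chunker = Chunklet()
docs = ["First doc. It has two sentences.", "Second doc. Also two sentences."]
per_doc = chunker.batch_chunk(docs, lang="en", mode="sentence", max_sentences=1)
# per_doc[0] holds the chunks for docs[0], and so on.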

clear_cache

def clear_cache(self) -> None

Remove all cached sentence splits (if cache_dir was set).


4.2 chunklet.exceptions

Custom exception hierarchy for precise error handling.

ChunkletError

class ChunkletError(Exception):
    """Base class for all Chunklet-specific errors."""

InvalidInputError

class InvalidInputError(ChunkletError):
    def __init__(self, message: str = "Invalid input provided.") -> None

Raised when a caller provides malformed or empty input.

TokenNotProvidedError

class TokenNotProvidedError(ChunkletError):
    def __init__(self) -> None

Raised when a token-based chunking operation is invoked without a token_counter.
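
A sketch of error handling using this hierarchy: catch the specific subclass first, then fall back to the base class.

from chunklet import Chunklet
from chunklet.exceptions import ChunkletError, InvalidInputError

chunker = Chunklet()
try:
    chunker.chunk("", lang="en", mode="sentence", max_sentences=2)
except InvalidInputError as e:
    print("bad input:", e)       # empty or non-string text
except ChunkletError as e:
    print("chunklet error:", e)  # any other library-specific failure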


4.3 chunklet.utils

Utility functions for configuration and string handling.

from typing import Any, Dict, List, Optional

load_env

def load_env(
    env_file: str,
    required: Optional[List[str]] = None
) -> Dict[str, str]

Read key/value pairs from a .env file into a dict, validating required keys.

  • env_file (str): path to the .env file.
  • required (List[str], optional): list of keys that must be present.

Returns:

  • Dict[str, str]: loaded environment variables.

Raises:

  • FileNotFoundError: if env_file does not exist.
  • KeyError: if any required key is missing.

get_config

def get_config(
    key: str,
    default: Any = None
) -> Any

Fetch a configuration value from environment or default.

  • key (str): configuration key.
  • default (Any): fallback value if key is unset.

Returns:

  • Any: configuration value or default.
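
An illustrative sketch of both helpers, assuming they import from chunklet.utils as documented; the .env keys are hypothetical:

from chunklet.utils import load_env, get_config

# Raises FileNotFoundError if .env is missing, KeyError if API_KEY is absent.
env = load_env(".env", required=["API_KEY"])
api_key = env["API_KEY"]

# Returns "INFO" when LOG_LEVEL is unset.
log_level = get_config("LOG_LEVEL", default="INFO")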

slugify

def slugify(text: str) -> str

Convert text into a lowercase, URL-friendly slug.

  • text (str): input string.

Returns:

  • str: slugified string.

camel_to_snake

def camel_to_snake(text: str) -> str

Transform a CamelCase or mixed string into snake_case.

  • text (str): CamelCase input.

Returns:

  • str: snake_case string.
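
Illustrative calls; the outputs shown are what the descriptions imply, not verified library output:

from chunklet.utils import slugify, camel_to_snake

print(slugify("Chunklet: Fast Chunking!"))   # e.g. "chunklet-fast-chunking"
print(camel_to_snake("ChunkletInitConfig"))  # e.g. "chunklet_init_config"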

5. Command-Line Interface

Chunklet’s chunklet command processes text from files, directories, or STDIN and emits chunked output to STDOUT or into files. You can control chunk size, overlap, parallelism, caching, language-specific splitting, and external tokenization.

5.1 Basic Invocation

Process a single text string on STDIN and print chunks:

echo "This is a test. It has several sentences to chunk." \
  | chunklet \
    --mode sentence \
    --max-sentences 2

5.2 Input and Output

5.2.1 Files and Directories

Pass one or more paths. Directories are scanned non-recursively for .txt files by default.

chunklet docs/chapter1.txt docs/chapter2.txt \
  --mode token \
  --max-tokens 100 \
  --output-dir out_chunks \
  --extension .chunk

This writes chapter1.chunk and chapter2.chunk under out_chunks/.

To process a folder:

chunklet path/to/texts/ \
  --mode sentence \
  --max-sentences 5 \
  --output-dir chunks/

5.2.2 STDIN / STDOUT

  • If no paths are given, Chunklet reads from STDIN.
  • Without --output-dir, Chunklet writes all chunks to STDOUT, one per line.

cat report.txt | chunklet --mode token --max-tokens 50 > report.chunks

5.3 Key Flags

Flag                      Description
--mode {sentence,token}   Chunking mode. sentence requires --max-sentences.
--max-sentences N         Maximum sentences per chunk (with --mode sentence).
--max-tokens N            Maximum tokens per chunk (with --mode token).
--overlap-percent P       Keep the last P% of sentences/tokens from the previous chunk (0–100).
--lang CODE               Sentence-splitter locale (e.g. en, fr). Defaults to en.
--n-jobs N                Number of parallel workers (via mpire). Defaults to 1.
--output-dir DIR          Write per-input chunk files to DIR instead of STDOUT.
--extension EXT           File extension for chunk files (default .chunk).
--no-cache                Disable on-disk caching between runs.
--verbose / -v            Enable debug-level logging.
--tokenizer-command CMD   Shell command for custom token counting (see “External Tokenizer Integration”).
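
For example, wiring in an external counter (count_tokens.py is a hypothetical script; the exact I/O contract is defined by the CLI, so check chunklet --help):

# count_tokens.py stands in for any command that reports token counts
chunklet input.txt \
  --mode token \
  --max-tokens 100 \
  --tokenizer-command "python count_tokens.py"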

5.4 Example Workflows

Parallel Directory Chunking

chunklet ./raw_texts/ \
  --mode sentence \
  --max-sentences 10 \
  --overlap-percent 20 \
  --n-jobs 4 \
  --output-dir ./chunks/ \
  --verbose

  • Splits each .txt in ./raw_texts/ into 10-sentence chunks with 20% overlap.
  • Uses 4 processes in parallel.
  • Writes each output as <basename>.chunk under ./chunks/.

Re-processing Without Cache

chunklet large_corpus/ \
  --mode token \
  --max-tokens 200 \
  --no-cache \
  --output-dir cached_off/

Disables caching to ensure fresh chunking on every run.

Streaming Pipeline

# Preprocess then chunk on the fly
python preprocess.py data.txt \
  | chunklet --mode sentence --max-sentences 3 \
  | python postprocess_chunks.py

  • preprocess.py writes cleaned text to STDOUT.
  • chunklet reads that stream, emits chunks to STDOUT.
  • postprocess_chunks.py consumes each chunk for downstream tasks.

5.5 Under the Hood: Argument Parsing

In src/chunklet/cli.py:

parser.add_argument(
    "inputs", nargs="*",
    help="Input files or directories. Reads STDIN if empty."
)
parser.add_argument(
    "--mode", required=True, choices=["sentence","token"],
    help="Chunk by sentences or tokens."
)
parser.add_argument("--max-sentences", type=int)
parser.add_argument("--max-tokens", type=int)
parser.add_argument("--overlap-percent", type=float, default=0)
parser.add_argument("--lang", default="en")
parser.add_argument("--n-jobs", type=int, default=1)
parser.add_argument("--output-dir")
parser.add_argument("--extension", default=".chunk")
parser.add_argument("--no-cache", action="store_true")
parser.add_argument("--verbose", "-v", action="store_true")
parser.add_argument("--tokenizer-command", type=str)

Parsed args feed into:

chunker = Chunklet(
    verbose=args.verbose,
    use_cache=not args.no_cache,
    token_counter=external_tokenizer if args.tokenizer_command else None
)
results = chunker.batch_chunk(
    texts, mode=args.mode,
    max_sentences=args.max_sentences,
    max_tokens=args.max_tokens,
    overlap_percent=args.overlap_percent,
    n_jobs=args.n_jobs,
    lang=args.lang
)

Chunklet then writes results either to STDOUT or into files under --output-dir.

6. Advanced Usage & Recipes

batch_chunk_pages: Sentence-Based PDF Chunking

batch_chunk_pages offers a one-step way to extract text from every page of a PDF, clean it, and split it into sentence-level chunks with Chunklet. It is ideal for mobile-safe processing or for feeding data into LLMs without exceeding token limits.

Essential Details

  • Uses PdfReader (from pypdf) to extract raw text per page (sketched after this list).
  • Cleans spurious line breaks, preserves headings/lists, strips standalone numbers.
  • Delegates chunking to Chunklet.batch_chunk in "sentence" mode.
  • Returns List[List[str]]: pages → sentence chunks.
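
The per-page extraction step looks roughly like this (a sketch of the pypdf calls, not the example's exact code):

from pypdf import PdfReader

reader = PdfReader("docs/Your-Doc.pdf")
# One raw string per page; extract_text() can return None for image-only pages.
pages_text = [page.extract_text() or "" for page in reader.pages]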

Method Signature

def batch_chunk_pages(self, max_sentences: int = 5) -> List[List[str]]:
    ...

  • max_sentences: max sentences per chunk.
  • Uses n_jobs=1 internally to avoid parallelism issues.
  • Defaults to French (lang="fr"); override in code to switch languages.

Code Example

from examples.pdf_chunking import PDFProcessor

pdf_path = "docs/Your-Doc.pdf"
processor = PDFProcessor(pdf_path)

# Chunk pages into groups of up to 10 sentences each
pages_chunks = processor.batch_chunk_pages(max_sentences=10)

for page_index, chunks in enumerate(pages_chunks, start=1):
    print(f"Page {page_index} has {len(chunks)} chunks")
    for chunk in chunks:
        print("  •", chunk)

Practical Usage Tips

  • To switch to English or another language:
    all_chunks = self.chunker.batch_chunk(
        pages_text,
        mode="sentence",
        max_sentences=max_sentences,
        n_jobs=1,
        lang="en"       # switch to English
    )
    
  • Tune max_sentences to balance chunk size vs. API calls.
  • For very large PDFs, instantiate multiple PDFProcessor objects with disjoint page ranges.
  • For paragraph-level chunks, set mode="default" in batch_chunk.

Troubleshooting

  • If pypdf is missing:
    pip install pypdf
    
  • To customize text cleanup, modify _cleanup_text regex patterns in examples/pdf_chunking.py.

Hybrid Chunking Mode

Use Chunklet’s "hybrid" mode to split text by both sentence and token limits, with configurable overlap to preserve context.

Why Hybrid Mode?

  • Bounds chunks by sentences and tokens.
  • Overlaps content to maintain semantic coherence.
  • Ideal for RAG pipelines, summarization, or any workflow with size constraints.

Key Parameters

  • max_sentences (int): max sentences per chunk.
  • max_tokens (int): max tokens per chunk (uses your token_counter).
  • overlap_percent (float): percent of tokens to repeat in next chunk.
  • token_counter (callable): maps a string to its token count.
  • verbose (bool): logs internal decisions for tuning.

Code Example

from chunklet import Chunklet

def simple_token_counter(text: str) -> int:
    return len(text.split())

text = """
This is a long text to demonstrate hybrid chunking. It combines both sentence and token limits for flexible chunking.
Overlap helps maintain context between chunks by repeating some clauses. This mode is very powerful for maintaining semantic coherence.
It is ideal for applications like RAG pipelines where context is crucial.
"""

chunker = Chunklet(verbose=True, token_counter=simple_token_counter)

chunks = chunker.chunk(
    text,
    mode="hybrid",
    max_sentences=2,
    max_tokens=15,
    overlap_percent=20
)

print("--- Hybrid Mode Chunks ---")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:\n{chunk}\n")

Expected Behavior

  1. Reads up to 2 sentences or 15 tokens per chunk, whichever limit is hit first.
  2. Computes 20% of the last chunk’s tokens (rounded) and prepends them to the next chunk; for a 15-token chunk, that carries 3 tokens forward.
  3. Continues until the text is fully processed.

Practical Tips

  • Increase max_sentences when preserving full thoughts matters more.
  • Increase max_tokens for strict prompt-size control.
  • Use overlap_percent of 20–30% for stronger context continuity.
  • In production, use a model-specific tokenizer (e.g., tiktoken); see the sketch after this list.
  • Set verbose=True during development to inspect chunking decisions.
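
A minimal token counter backed by tiktoken, as a sketch (assumes the tiktoken package is installed; the encoding name is just an example):

import tiktoken

from chunklet import Chunklet

enc = tiktoken.get_encoding("cl100k_base")

def tiktoken_counter(text: str) -> int:
    # Count tokens exactly as a cl100k-based model would see them.
    return len(enc.encode(text))

chunker = Chunklet(verbose=False, token_counter=tiktoken_counter)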

Benchmarking Chunklet with benchmark.py

Measure Chunklet’s chunking performance (single vs. batch) across languages and modes, and customize benchmarks for your own datasets.

Essential Information

  • Dependencies:
    • chunklet (core library)
    • rich for tables and spinners
    • loguru (logs suppressed via logger.remove())
  • Sample texts for English (en), Catalan (ca), Haitian Creole (ht), each ×10.
  • Supported modes:
    • "sentence" (max_sentences=5)
    • "token" (max_tokens=50)
    • "hybrid" (max_sentences=3 & max_tokens=50)

Running the Benchmark

  1. Install dependencies:
    pip install chunklet rich loguru
    
  2. Run:
    python benchmark.py
    
  3. View two tables:
    • Single Run: avg time per run over N=100
    • Batch Run: total time for a batch of 100 texts

Customizing the Benchmark

  • Adjust iterations or batch size:
    number_of_runs = 200  # default 100
    batch_size = 500      # default 100
    
  • Use your own texts:
    TEXTS["custom"] = open("my_text.txt", "r").read()
    
  • Swap in a custom token counter:
    import re

    def my_token_counter(text: str) -> int:
        return len(re.findall(r"\w+", text))
    
    chunker = Chunklet(token_counter=my_token_counter)
    

Key Code Snippets

Single-run timing with timeit.Timer:

def bench_single():
    return chunker.chunk(
        text_content,
        lang=lang_code,
        mode=mode,
        max_sentences=(5 if mode!="token" else None),
        max_tokens=(50 if mode!="sentence" else None)
    )

time_single = timeit.Timer(bench_single).timeit(number=number_of_runs)
avg_time = time_single / number_of_runs

Batch timing with timeit.default_timer:

texts = [text_content] * batch_size
start = timeit.default_timer()
chunks = chunker.batch_chunk(
    texts,
    lang=lang_code,
    mode=mode,
    max_sentences=(5 if mode!="token" else None),
    max_tokens=(50 if mode!="sentence" else None)
)
total_time = timeit.default_timer() - start
total_chunks = sum(len(c) for c in chunks)

Interpreting Results

  • Input Chars: length of each text input
  • Runs/Num of texts: invocation count
  • Avg. Time (s/run) vs. Total Time (s): latency vs. throughput
  • Chunks: total output chunks

Practical Tips

  • Increase batch_size for large corpora to amortize overhead.
  • Use mode="hybrid" when you need both sentence and token bounds.
  • In production, pass a high-quality tokenizer (e.g., spaCy) as token_counter; a sketch follows.
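
A spaCy-backed counter, as a sketch (assumes spaCy is installed; spacy.blank builds a tokenizer-only pipeline with no model download):

import spacy

from chunklet import Chunklet

nlp = spacy.blank("en")  # tokenizer only; fast and dependency-light

def spacy_token_counter(text: str) -> int:
    # len(doc) is the number of tokens spaCy produced.
    return len(nlp(text))

chunker = Chunklet(token_counter=spacy_token_counter)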

7. Development Guide

This section covers setting up your local development environment, running tests, enforcing code style, and submitting pull requests to the speedyk-005/chunklet-py repository.

7.1 Prerequisites and Environment Setup

Ensure you have:

  • Python 3.8–3.11
  • Git

Steps:

  1. Clone your fork and enter the directory

git clone https://github.com/<your-username>/chunklet-py.git
cd chunklet-py

  2. Create and activate a virtual environment

python -m venv .venv
source .venv/bin/activate     # macOS/Linux
# .venv\Scripts\activate      # Windows

  3. Install the package and dependencies

pip install -e .
pip install -r requirements.txt
# Install dev tools
pip install black pytest

7.2 Running Tests

All pushes and PRs trigger the GitHub Actions workflow (.github/workflows/build-and-test.yml) to run tests on Python 3.8–3.11. You can run tests locally with pytest:

  • Run the full suite

pytest

  • Run a single test file

pytest tests/test_chunklet.py

  • Run a specific test method

pytest tests/test_chunklet.py::TestChunklet::test_acronym_preservation

7.3 Code Formatting

We enforce Black for consistent styling. Run before committing:

  • Check formatting without modifying files

black --check .

  • Reformat all files in place

black .

7.4 Forking, Branching, and Pull Requests

Fork and Branch

  1. Fork the repo on GitHub: https://github.com/speedyk-005/chunklet-py (click Fork)
  2. Clone your fork locally (see 7.1)
  3. Add the upstream remote and fetch updates

git remote add upstream https://github.com/speedyk-005/chunklet-py.git
git fetch upstream

  4. Create a descriptive branch

git checkout -b feat/short-description

Commit and Push

  • Stage and commit with a concise, scoped message

git add .
git commit -m "feat(parser): support multiline comments"

  • Rebase frequently to stay in sync

git fetch upstream
git rebase upstream/main

  • Push your branch

git push origin feat/short-description

Open a Pull Request

  1. On GitHub, click Compare & pull request on your fork
  2. Set the base repository to speedyk-005/chunklet-py:main
  3. Reference related issues (e.g., “Closes #123”)
  4. Describe your change, testing steps, and any migration notes
  5. Ensure all CI checks pass before requesting review

Adhering to these steps ensures smooth reviews and integration into the main codebase.