1. Project Overview
Chunklet is a Python library for multilingual, context-aware text chunking optimized for large language model (LLM) and retrieval-augmented generation (RAG) pipelines. It splits long documents into manageable segments while preserving semantic boundaries, enabling efficient indexing, embedding, and inference.
Core Value Proposition
- Context-Aware Splitting: leverages sentence boundaries, token counts, or a hybrid strategy to maintain coherence.
- Multilingual Support: auto-detects or explicitly handles languages such as English, Spanish, Catalan, Haitian Creole, and more.
- High Performance: sub-30 ms per single run and sub-0.2 s for 100-text batches in hybrid mode (English).
- Extensible API: customize token counters, overlap, and sentence splitters via Pydantic models.
- CLI & Programmatic Interfaces: use `chunklet-py` in shell scripts or import `Chunklet` in Python projects.
Primary Use Cases
- RAG Pipelines: pre-chunk large corpora for vector stores (e.g., FAISS, Pinecone) and retrieve relevant passages (see the sketch after this list).
- LLM Prompt Engineering: break prompts or knowledge bases into token- or sentence-level chunks to fit context windows.
- Multilingual Preprocessing: apply consistent chunking across diverse languages with automatic detection.
- Batch Processing: efficiently split thousands of documents in seconds for ETL workflows.
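As a sketch of the RAG use case, the snippet below pre-chunks a tiny corpus and flattens the output into records ready for an embedding step. It assumes the `batch_chunk` batch API described later in this guide; the corpus strings are placeholders.

```python
from chunklet import Chunklet

# Placeholder corpus; in a real pipeline these come from your document loader.
corpus = [
    "First document. It has a few sentences. Enough to form chunks.",
    "Second document. Also short. Still chunkable.",
]

chunker = Chunklet()
per_doc_chunks = chunker.batch_chunk(
    corpus,
    mode="sentence",
    max_sentences=2,
    lang="en",
)

# Flatten to (doc_id, chunk) records for embedding and vector-store upserts.
records = [
    (doc_id, chunk)
    for doc_id, chunks in enumerate(per_doc_chunks)
    for chunk in chunks
]
```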
Performance Highlights
- Single-Run (100 runs on 1 KB English text)
  • sentence: ~0.035 s avg
  • token: ~0.019 s avg
  • hybrid: ~0.028 s avg
- Batch-Run (100 × 1 KB texts)
  • sentence: ~0.210 s total
  • token: ~0.145 s total
  • hybrid: ~0.136 s total
- Similar performance profiles apply to Catalan and Haitian Creole, with ±10% variance.
Refer to BENCHMARKS.md for full per-language tables.
Supported Modes & Languages
- Modes
  • sentence: split on punctuation and whitespace
  • token: split by fixed token counts
  • hybrid: token-aware sentence splitting
- Languages
  • Auto-detect (`lang="auto"`; see the sketch below)
  • Explicit codes: 'en', 'es', 'ca', 'ht', etc.
- Advanced
  • Custom splitters via `CustomSplitterConfig`
  • Override token counters per call or globally
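For instance, a minimal sketch of auto-detection (the sample text and parameters are illustrative):

```python
from chunklet import Chunklet

chunker = Chunklet()

# lang="auto" asks Chunklet to detect the language before splitting.
chunks = chunker.chunk(
    text="Hola mundo. ¿Cómo estás? Muy bien, gracias.",
    mode="sentence",
    max_sentences=2,
    lang="auto",
)
```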
Quick Start
Install and verify the package:

```bash
pip install chunklet-py
chunklet-py --version
```

Then chunk a document in hybrid mode:

```python
from chunklet import Chunklet

# Initialize with caching and debug logging
chunker = Chunklet(verbose=True, use_cache=True)

# Chunk a single document in hybrid mode
text = "Once upon a time…"
chunks = chunker.chunk(
    text=text,
    mode="hybrid",
    max_sentences=3,
    max_tokens=150,
    overlap_percent=10.0,
    lang="en"
)

for i, seg in enumerate(chunks):
    print(f"[Chunk {i+1}] {seg}")
```
This overview equips you to select the right chunking strategy, integrate chunklet into LLM/RAG workflows, and scale text preprocessing across languages.
2. Getting Started
Start by installing Chunklet, verifying your setup, and running your first text-chunking call.
Installation
Install the latest release from PyPI:

```bash
pip install chunklet-py
```

Or install from source (requires Python 3.8+):

```bash
git clone https://github.com/speedyk-005/chunklet-py.git
cd chunklet-py
pip install .
```
Verify Your Environment
Confirm Chunklet is available and check its version:
```bash
python3 - <<EOF
import chunklet
print("Chunklet version:", chunklet.__version__)
EOF
```
If you see a version string, you’re ready to chunk.
First Chunking Call (Python)
Split a simple paragraph into 3-sentence chunks:

```python
from chunklet import Chunklet

text = (
    "Chunklet makes multilingual text chunking simple. "
    "It supports sentence, token, and hybrid modes. "
    "You can configure overlap for context preservation. "
    "This helps when feeding chunks into LLMs or search indexes. "
    "Installation takes just one pip command."
)

# Initialize chunker (verbose=True prints debug info)
chunker = Chunklet(verbose=True)

# Chunk by sentences: max 3 sentences per chunk
chunks = chunker.chunk(
    text=text,
    mode="sentence",
    max_sentences=3
)

for i, c in enumerate(chunks, 1):
    print(f"Chunk {i}:\n{c}\n")
```
Using the CLI
Chunklet includes a `chunklet` command for quick experiments:

```bash
# Write your text to input.txt, then:
chunklet \
  --file input.txt \
  --mode hybrid \
  --max-sentences 5 \
  --max-tokens 200 \
  --overlap-percent 10
```

Key flags:
• `--mode` chooses `sentence`, `token`, or `hybrid`.
• `--max-sentences` and `--max-tokens` set your chunk limits.
• `--overlap-percent` retains context between token/hybrid chunks.

Run `chunklet --help` for full options.
Exploring Examples
Use the bundled examples to explore each chunking strategy:
```bash
# Sentence-only mode
python examples/sentence_mode.py

# Token-based mode
python examples/token_mode.py

# Hybrid mode (sentence + token)
python examples/hybrid_mode.py
```
Each script shows realistic setup, imports, and output formatting. Modify parameters in these examples to fit your data.
3. Core Concepts
Chunklet breaks text into manageable pieces using three modes (sentence, token, and hybrid), with configurable overlap, optional caching, automatic language detection, and pluggable splitters. This section explains how each feature works and how to tailor Chunklet to your workflow.
Chunking Modes
Sentence Mode
Groups contiguous sentences into chunks.
- Splits text into sentences via custom splitters → pysbd (if installed) → universal_splitter fallback.
- Per-call parameters: `max_sentences`, `overlap_sentences`.
```python
from chunklet import Chunklet
from chunklet.models import ChunkletInitConfig

cfg = ChunkletInitConfig(default_language="en")
chunker = Chunklet(cfg)

text = "Sentence one. Sentence two! Sentence three?"

# 2 sentences per chunk, 1-sentence overlap
chunks = chunker.chunk(
    text=text,
    mode="sentence",
    max_sentences=2,
    overlap_sentences=1
)
print(chunks)
# → ["Sentence one. Sentence two!", "Sentence two! Sentence three?"]
```
Token Mode
Splits text by approximate token counts.
- Tokenization uses GPT-compatible counters.
- Parameters: `max_tokens`, `overlap_tokens`.
```python
# 50-token chunks with 10-token overlap
chunks = chunker.chunk(
    text=long_text,
    mode="token",
    max_tokens=50,
    overlap_tokens=10
)
```
Hybrid Mode
Combines sentence and token logic.
- Splits into sentences, then groups by `max_sentences`.
- If a sentence exceeds `max_tokens`, it is split by token mode internally.
- Supports both `overlap_sentences` and `overlap_tokens`.
```python
# Hybrid: up to 3 sentences or 80 tokens, with overlaps
chunks = chunker.chunk(
    text=complex_doc,
    mode="hybrid",
    max_sentences=3,
    max_tokens=80,
    overlap_sentences=1,
    overlap_tokens=20
)
```
Overlap Logic
Overlap carries context into the next chunk to preserve continuity.
- Sentence overlap: repeats the last N sentences of one chunk at the start of the next.
- Token overlap: repeats the last M tokens similarly.
```python
# 3-sentence chunks with 2-sentence overlap
chunks = chunker.chunk(text, mode="sentence", max_sentences=3, overlap_sentences=2)

# 100-token chunks with 15-token overlap
chunks = chunker.chunk(text, mode="token", max_tokens=100, overlap_tokens=15)
```
Caching
Enable caching to speed up repeated splits and token counts.
- Toggle via `use_cache` in `ChunkletInitConfig`.
- In-memory cache keys on (text, lang, mode, parameters).
- Clear the cache manually with `chunker.clear_cache()`.
```python
cfg = ChunkletInitConfig(use_cache=True)
chunker = Chunklet(cfg)

# First call: computes splits
chunks1 = chunker.chunk(text, mode="sentence", max_sentences=2)

# Second call with same args: returns cached result
chunks2 = chunker.chunk(text, mode="sentence", max_sentences=2)

# Clear cache when the underlying text or params change
chunker.clear_cache()
```
Batch processing reuses cached pieces:
```python
docs = ["Doc one...", "Doc two..."]
batches = chunker.batch_chunk(
    texts=docs,
    mode="hybrid",
    max_sentences=4,
    max_tokens=60
)
```
Language Detection
If you omit `lang` or set it to `None`, Chunklet calls `detect_text_language` for you.
```python
from chunklet.utils.detect_text_language import detect_text_language

# Manual detection
lang, score = detect_text_language("¡Hola mundo!")
print(lang, score)  # → "es", 0.98

# Automatic detection in chunking
chunks = chunker.chunk(text="Bonjour tout le monde!", lang=None, mode="sentence")
```
Use the returned confidence to guard downstream logic:
```python
lang, conf = detect_text_language(user_input)
if conf < 0.7:
    lang = "en"  # fall back to a default language, or prompt the user
```
Customization
Default Parameters via Init Config
Set global defaults so you don’t repeat them per call:
```python
cfg = ChunkletInitConfig(
    default_language="fr",
    max_sentences=5,
    overlap_sentences=1,
    max_tokens=100,
    overlap_tokens=20,
    use_cache=True
)
chunker = Chunklet(cfg)

# chunk() now uses these defaults when params are omitted
chunks = chunker.chunk(text, mode="hybrid")
```
Custom Sentence Splitters
Register language-specific splitters before built-ins:
```python
from chunklet.models import CustomSplitterConfig

def my_splitter(text: str) -> list[str]:
    return text.split("|||")  # domain-specific separator

config = CustomSplitterConfig(
    name="pipe_split",
    languages=["en"],
    callback=my_splitter
)

cfg = ChunkletInitConfig(custom_splitters=[config])
chunker = Chunklet(cfg)

chunks = chunker.preview_sentences("A|||B|||C", lang="en")
# → ["A", "B", "C"]
```
Custom splitters run in order; unmatched texts fall back to pysbd/universal.
4. Python API Reference
Authoritative reference for all public Python interfaces: constructor and method signatures, parameters, return types, and raised exceptions.
4.1 class Chunklet
Primary API for context-aware, multilingual text chunking.
Constructor
```python
from chunklet import Chunklet, CustomSplitterConfig
from typing import Optional, List

def __init__(
    self,
    verbose: bool = False,
    cache_dir: Optional[str] = None,
    custom_splitters: Optional[List[CustomSplitterConfig]] = None,
) -> None
```
- verbose (bool): enable detailed logging and warnings.
- cache_dir (str, optional): directory path for sentence-split cache.
- custom_splitters (List[CustomSplitterConfig], optional): override built-in sentence splitters.
chunk
```python
def chunk(
    self,
    text: str,
    lang: str,
    mode: str = "sentence",
    max_sentences: Optional[int] = None,
    max_tokens: Optional[int] = None,
    overlap: int = 0,
    token_counter: Optional[Callable[[str], int]] = None,
) -> List[str]
```
Split `text` into chunks.
Parameters:
- text (str): input document.
- lang (str): ISO language code (e.g. "en", "fr").
- mode (str): "sentence", "token", or "hybrid".
- max_sentences (int, optional): target sentences per chunk (sentence/hybrid).
- max_tokens (int, optional): target tokens per chunk (token/hybrid).
- overlap (int): number of sentences or tokens to overlap between chunks.
- token_counter (callable, optional): function mapping a string to its token count; required for token modes.
Returns:
- List[str]: list of text chunks.
Raises:
- InvalidInputError: if `text` is empty or not a string.
- TokenNotProvidedError: if `mode` includes "token" but `token_counter` is None (see the example below).
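For example, a minimal sketch of recovering from the missing-counter case (the whitespace counter here is a stand-in for a real tokenizer):

```python
from chunklet import Chunklet
from chunklet.exceptions import TokenNotProvidedError

chunker = Chunklet()
try:
    chunker.chunk("Some text to split.", lang="en", mode="token", max_tokens=50)
except TokenNotProvidedError:
    # Token-based modes need an explicit counter; retry with one supplied.
    chunks = chunker.chunk(
        "Some text to split.",
        lang="en",
        mode="token",
        max_tokens=50,
        token_counter=lambda s: len(s.split()),
    )
```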
preview_sentences
```python
def preview_sentences(
    self,
    text: str,
    lang: str
) -> Tuple[List[str], List[str]]
```
Generate raw sentences and collect splitter warnings without forming chunks.
Parameters:
- text (str): input document.
- lang (str): ISO language code.
Returns:
- Tuple[List[str], List[str]]:
- first element: list of sentences.
- second element: list of warning messages.
Raises:
- InvalidInputError: if `text` is invalid.
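A short usage sketch of the returned tuple (the sample text is illustrative):

```python
from chunklet import Chunklet

chunker = Chunklet()
sentences, warnings = chunker.preview_sentences(
    "Dr. Smith arrived. He was late.", lang="en"
)
for w in warnings:
    print("splitter warning:", w)
print(f"{len(sentences)} sentences detected")
```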
batch_chunk

```python
def batch_chunk(
    self,
    texts: Iterable[str],
    lang: str,
    **kwargs
) -> List[List[str]]
```
Apply `chunk()` over multiple documents.
Parameters:
- texts (Iterable[str]): sequence of input strings.
- lang (str): ISO language code.
- **kwargs: passed directly to `chunk()`.
Returns:
- List[List[str]]: list of chunk lists per document.
Raises:
- InvalidInputError: if any element in `texts` is invalid.
clear_cache
```python
def clear_cache(self) -> None
```

Remove all cached sentence splits (if `cache_dir` was set).
4.2 chunklet.exceptions
Custom exception hierarchy for precise error handling.
ChunkletError
```python
class ChunkletError(Exception):
    """Base class for all Chunklet-specific errors."""
```
InvalidInputError
```python
class InvalidInputError(ChunkletError):
    def __init__(self, message: str = "Invalid input provided.") -> None
```
Raised when a caller provides malformed or empty input.
TokenNotProvidedError
```python
class TokenNotProvidedError(ChunkletError):
    def __init__(self) -> None
```

Raised when a token-based chunking operation is invoked without a `token_counter`.
4.3 chunklet.utils
Utility functions for configuration and string handling.
```python
from typing import Any, Dict, List, Optional
```

load_env

```python
def load_env(
    env_file: str,
    required: Optional[List[str]] = None
) -> Dict[str, str]
```
Read key/value pairs from a `.env` file into a dict, validating required keys.
- env_file (str): path to the `.env` file.
- required (List[str], optional): list of keys that must be present.
Returns:
- Dict[str, str]: loaded environment variables.
Raises:
- FileNotFoundError: if `env_file` does not exist.
- KeyError: if any `required` key is missing.
get_config
```python
def get_config(
    key: str,
    default: Any = None
) -> Any
```
Fetch a configuration value from environment or default.
- key (str): configuration key.
- default (Any): fallback value if key is unset.
Returns:
- Any: configuration value or `default`.
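A short usage sketch, assuming both helpers are importable from `chunklet.utils` as documented here (the key names are placeholders):

```python
from chunklet.utils import load_env, get_config

# Load variables from a local .env file; raises KeyError if a required key is absent.
env = load_env(".env", required=["CHUNKLET_LANG"])

# Read a single setting with a fallback default.
default_lang = get_config("CHUNKLET_LANG", default="en")
```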
slugify
```python
def slugify(text: str) -> str
```

Convert `text` into a lowercase, URL-friendly slug.
- text (str): input string.
Returns:
- str: slugified string.
camel_to_snake
```python
def camel_to_snake(text: str) -> str
```
Transform a CamelCase or mixed string into snake_case.
- text (str): CamelCase input.
Returns:
- str: snake_case string.
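For illustration (the expected outputs show typical slug and snake_case behavior, not verified output):

```python
from chunklet.utils import slugify, camel_to_snake

print(slugify("Chunklet: Fast & Simple!"))   # e.g. "chunklet-fast-simple"
print(camel_to_snake("ChunkletInitConfig"))  # e.g. "chunklet_init_config"
```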
5. Command-Line Interface
Chunklet’s `chunklet` command processes text from files, directories, or STDIN and emits chunked output via STDOUT or into files. You can control chunk size, overlap, parallelism, caching, language splitting, and external tokenization.
5.1 Basic Invocation
Process a single text string on STDIN and print chunks:
echo "This is a test. It has several sentences to chunk." \
| chunklet \
--mode sentence \
--max-sentences 2
5.2 Input and Output
5.2.1 Files and Directories
Pass one or more paths. Directories are scanned non-recursively for `.txt` files by default.
```bash
chunklet docs/chapter1.txt docs/chapter2.txt \
  --mode token \
  --max-tokens 100 \
  --output-dir out_chunks \
  --extension .chunk
```
This writes `chapter1.chunk` and `chapter2.chunk` under `out_chunks/`.
To process a folder:
```bash
chunklet path/to/texts/ \
  --mode sentence \
  --max-sentences 5 \
  --output-dir chunks/
```
5.2.2 STDIN / STDOUT
- If no paths are given, Chunklet reads from STDIN.
- Without `--output-dir`, Chunklet writes all chunks to STDOUT, one per line.
```bash
cat report.txt | chunklet --mode token --max-tokens 50 > report.chunks
```
5.3 Key Flags
| Flag | Description |
|---|---|
| `--mode {sentence,token}` | Chunking mode. `sentence` requires `--max-sentences`. |
| `--max-sentences N` | Maximum sentences per chunk (with `--mode sentence`). |
| `--max-tokens N` | Maximum tokens per chunk (with `--mode token`). |
| `--overlap-percent P` | Keep last P% of sentences/tokens from the previous chunk (0–100). |
| `--lang CODE` | Sentence-splitter locale (e.g. `en`, `fr`). Defaults to `en`. |
| `--n-jobs N` | Number of parallel workers (via `mpire`). Defaults to `1`. |
| `--output-dir DIR` | Write per-input chunk files to DIR instead of STDOUT. |
| `--extension EXT` | File extension for chunks (default `.chunk`). |
| `--no-cache` | Disable on-disk caching between runs. |
| `--verbose` / `-v` | Enable debug-level logging. |
| `--tokenizer-command CMD` | Shell command for custom token counting (see “External Tokenizer Integration”). |
5.4 Example Workflows
Parallel Directory Chunking
```bash
chunklet ./raw_texts/ \
  --mode sentence \
  --max-sentences 10 \
  --overlap-percent 20 \
  --n-jobs 4 \
  --output-dir ./chunks/ \
  --verbose
```
- Splits each `.txt` in `./raw_texts/` into 10-sentence chunks with 20% overlap.
- Uses 4 processes in parallel.
- Writes each output as `<basename>.chunk` under `./chunks/`.
Re-processing Without Cache
```bash
chunklet large_corpus/ \
  --mode token \
  --max-tokens 200 \
  --no-cache \
  --output-dir cached_off/
```
Disables caching to ensure fresh chunking on every run.
Streaming Pipeline
```bash
# Preprocess then chunk on the fly
python preprocess.py data.txt \
  | chunklet --mode sentence --max-sentences 3 \
  | python postprocess_chunks.py
```
- `preprocess.py` writes cleaned text to STDOUT.
- `chunklet` reads that stream and emits chunks to STDOUT.
- `postprocess_chunks.py` consumes each chunk for downstream tasks.
5.5 Under the Hood: Argument Parsing
In `src/chunklet/cli.py`:
```python
parser.add_argument(
    "inputs", nargs="*",
    help="Input files or directories. Reads STDIN if empty."
)
parser.add_argument(
    "--mode", required=True, choices=["sentence", "token"],
    help="Chunk by sentences or tokens."
)
parser.add_argument("--max-sentences", type=int)
parser.add_argument("--max-tokens", type=int)
parser.add_argument("--overlap-percent", type=float, default=0)
parser.add_argument("--lang", default="en")
parser.add_argument("--n-jobs", type=int, default=1)
parser.add_argument("--output-dir")
parser.add_argument("--extension", default=".chunk")
parser.add_argument("--no-cache", action="store_true")
parser.add_argument("--verbose", "-v", action="store_true")
parser.add_argument("--tokenizer-command", type=str)
```
Parsed args feed into:
```python
chunker = Chunklet(
    verbose=args.verbose,
    use_cache=not args.no_cache,
    token_counter=external_tokenizer if args.tokenizer_command else None
)

results = chunker.batch_chunk(
    texts, mode=args.mode,
    max_sentences=args.max_sentences,
    max_tokens=args.max_tokens,
    overlap_percent=args.overlap_percent,
    n_jobs=args.n_jobs,
    lang=args.lang
)
```
Chunklet then writes `results` either to STDOUT or into files under `--output-dir`.
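The `external_tokenizer` referenced above is constructed from `--tokenizer-command`. A minimal sketch of such a wrapper, assuming the command reads text on STDIN and prints a single integer token count:

```python
import subprocess

def external_tokenizer(text: str) -> int:
    # args.tokenizer_command comes from the parser above; the STDIN/STDOUT
    # contract here is an assumption, not the library's documented behavior.
    result = subprocess.run(
        args.tokenizer_command,
        input=text,
        capture_output=True,
        text=True,
        shell=True,
        check=True,
    )
    return int(result.stdout.strip())
```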
6. Advanced Usage & Recipes
batch_chunk_pages: Sentence-Based PDF Chunking
Provide a one-step method to extract text from every page of a PDF, clean it, and split it into sentence-level chunks using Chunklet. Ideal for mobile-safe processing or feeding data into LLMs without exceeding token limits.
Essential Details
- Uses `PdfReader` (from pypdf) to extract raw text per page.
- Cleans spurious line breaks, preserves headings/lists, strips standalone numbers.
- Delegates chunking to `Chunklet.batch_chunk` in "sentence" mode.
- Returns `List[List[str]]`: pages → sentence chunks.
Method Signature
```python
def batch_chunk_pages(self,
                      max_sentences: int = 5
                      ) -> List[List[str]]:
    ...
```
- `max_sentences`: max sentences per chunk.
- Uses `n_jobs=1` to avoid parallel issues.
- Defaults to French (`lang="fr"`); override in code to switch language.
Code Example
```python
from examples.pdf_chunking import PDFProcessor

pdf_path = "docs/Your-Doc.pdf"
processor = PDFProcessor(pdf_path)

# Chunk pages into groups of up to 10 sentences each
pages_chunks = processor.batch_chunk_pages(max_sentences=10)

for page_index, chunks in enumerate(pages_chunks, start=1):
    print(f"Page {page_index} has {len(chunks)} chunks")
    for chunk in chunks:
        print(" •", chunk)
```
Practical Usage Tips
- To switch to English or another language:

```python
all_chunks = self.chunker.batch_chunk(
    pages_text,
    mode="sentence",
    max_sentences=max_sentences,
    n_jobs=1,
    lang="en"  # switch to English
)
```

- Tune `max_sentences` to balance chunk size vs. API calls.
- For very large PDFs, instantiate multiple `PDFProcessor` objects with disjoint page ranges.
- For paragraph-level chunks, set `mode="default"` in `batch_chunk`.
Troubleshooting
- If `pypdf` is missing: `pip install pypdf`
- To customize text cleanup, modify the `_cleanup_text` regex patterns in `examples/pdf_chunking.py`.
Hybrid Chunking Mode
Use Chunklet’s `"hybrid"` mode to split text by both sentence and token limits, with configurable overlap to preserve context.
Why Hybrid Mode?
- Bounds chunks by sentences and tokens.
- Overlaps content to maintain semantic coherence.
- Ideal for RAG pipelines, summarization, or any workflow with size constraints.
Key Parameters
- `max_sentences` (int): max sentences per chunk.
- `max_tokens` (int): max tokens per chunk (uses your `token_counter`).
- `overlap_percent` (float): percent of tokens to repeat in the next chunk.
- `token_counter` (callable): maps a string to its token count.
- `verbose` (bool): logs internal decisions for tuning.
Code Example
```python
from chunklet import Chunklet

def simple_token_counter(text: str) -> int:
    return len(text.split())

text = """
This is a long text to demonstrate hybrid chunking. It combines both sentence and token limits for flexible chunking.
Overlap helps maintain context between chunks by repeating some clauses. This mode is very powerful for maintaining semantic coherence.
It is ideal for applications like RAG pipelines where context is crucial.
"""

chunker = Chunklet(verbose=True, token_counter=simple_token_counter)

chunks = chunker.chunk(
    text,
    mode="hybrid",
    max_sentences=2,
    max_tokens=15,
    overlap_percent=20
)

print("--- Hybrid Mode Chunks ---")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:\n{chunk}\n")
```
Expected Behavior
- Reads up to 2 sentences or 15 tokens, whichever comes first.
- Computes 20% of the last chunk’s tokens (rounded) and prepends them to the next chunk.
- Continues until the text is fully processed.
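As a quick sanity check of the overlap arithmetic described above (illustrative only):

```python
max_tokens = 15
overlap_percent = 20

# 20% of a 15-token chunk: about 3 tokens are carried into the next chunk.
overlap_tokens = round(max_tokens * overlap_percent / 100)
print(overlap_tokens)  # 3
```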
Practical Tips
- Increase `max_sentences` when preserving full thoughts matters more.
- Increase `max_tokens` for strict prompt-size control.
- Use an `overlap_percent` of 20–30% for stronger context continuity.
- In production, use a model-specific tokenizer (e.g., `tiktoken`), as sketched below.
- Set `verbose=True` during development to inspect chunking decisions.
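A minimal sketch of a `tiktoken`-based counter (assumes `pip install tiktoken`; the encoding name is a common default, not a Chunklet requirement):

```python
import tiktoken

from chunklet import Chunklet

enc = tiktoken.get_encoding("cl100k_base")

def tiktoken_counter(text: str) -> int:
    # Count tokens with the same BPE the target model uses.
    return len(enc.encode(text))

chunker = Chunklet(token_counter=tiktoken_counter)
```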
Benchmarking Chunklet with `benchmark.py`
Measure Chunklet’s chunking performance (single vs. batch) across languages and modes, and customize benchmarks for your own datasets.
Essential Information
- Dependencies:
  • `chunklet` (core library)
  • `rich` for tables and spinners
  • `loguru` (logs suppressed via `logger.remove()`)
- Sample texts for English (`en`), Catalan (`ca`), and Haitian Creole (`ht`), each ×10.
- Supported modes:
  • "sentence" (max_sentences=5)
  • "token" (max_tokens=50)
  • "hybrid" (max_sentences=3 and max_tokens=50)
Running the Benchmark
- Install dependencies:

```bash
pip install chunklet-py rich loguru
```

- Run:

```bash
python benchmark.py
```

- View two tables:
  • Single Run: avg time per run over N=100
  • Batch Run: total time for a batch of 100 texts
Customizing the Benchmark
- Adjust iterations or batch size:

```python
number_of_runs = 200  # default 100
batch_size = 500      # default 100
```

- Use your own texts:

```python
TEXTS["custom"] = open("my_text.txt", "r").read()
```

- Swap in a custom token counter:

```python
import re

def my_token_counter(text: str) -> int:
    return len(re.findall(r"\w+", text))

chunker = Chunklet(token_counter=my_token_counter)
```
Key Code Snippets
Single-run timing with `timeit.Timer`:
```python
def bench_single():
    return chunker.chunk(
        text_content,
        lang=lang_code,
        mode=mode,
        max_sentences=(5 if mode != "token" else None),
        max_tokens=(50 if mode != "sentence" else None)
    )

time_single = timeit.Timer(bench_single).timeit(number=number_of_runs)
avg_time = time_single / number_of_runs
```
Batch timing with `timeit.default_timer`:
```python
texts = [text_content] * batch_size

start = timeit.default_timer()
chunks = chunker.batch_chunk(
    texts,
    lang=lang_code,
    mode=mode,
    max_sentences=(5 if mode != "token" else None),
    max_tokens=(50 if mode != "sentence" else None)
)
total_time = timeit.default_timer() - start

total_chunks = sum(len(c) for c in chunks)
```
Interpreting Results
- Input Chars: length of each text input
- Runs/Num of texts: invocation count
- Avg. Time (s/run) vs. Total Time (s): latency vs. throughput
- Chunks: total output chunks
Practical Tips
- Increase `batch_size` for large corpora to amortize overhead.
- Use `mode="hybrid"` when you need both sentence and token bounds.
- In production, pass a high-quality tokenizer (e.g., spaCy) as `token_counter`.
7. Development Guide
This section covers setting up your local development environment, running tests, enforcing code style, and submitting pull requests to the speedyk-005/chunklet repository.
7.1 Prerequisites and Environment Setup
Ensure you have:
- Python 3.8–3.11
- Git
Steps:
- Clone your fork and enter the directory:

```bash
git clone https://github.com/<your-username>/chunklet-py.git
cd chunklet-py
```

- Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate   # macOS/Linux
# .venv\Scripts\activate    # Windows
```

- Install the package and dependencies:

```bash
pip install -e .
pip install -r requirements.txt

# Install dev tools
pip install black pytest
```
7.2 Running Tests
All pushes and PRs trigger the GitHub Actions workflow (`.github/workflows/build-and-test.yml`) to run tests on Python 3.8–3.11. You can run tests locally with pytest:
- Run the full suite:

```bash
pytest
```

- Run a single test file:

```bash
pytest tests/test_chunklet.py
```

- Run a specific test method:

```bash
pytest tests/test_chunklet.py::TestChunklet::test_acronym_preservation
```
7.3 Code Formatting
We enforce Black for consistent styling. Run before committing:
- Check formatting without modifying files:

```bash
black --check .
```

- Reformat all files in place:

```bash
black .
```
7.4 Forking, Branching, and Pull Requests
Fork and Branch
- Fork the repo on GitHub: https://github.com/speedyk-005/chunklet-py and click Fork
- Clone your fork locally (see 7.1)
- Add the upstream remote and fetch updates:

```bash
git remote add upstream https://github.com/speedyk-005/chunklet-py.git
git fetch upstream
```

- Create a descriptive branch:

```bash
git checkout -b feat/short-description
```
Commit and Push
- Stage and commit with a concise, scoped message:

```bash
git add .
git commit -m "feat(parser): support multiline comments"
```

- Rebase frequently to stay in sync:

```bash
git fetch upstream
git rebase upstream/main
```

- Push your branch:

```bash
git push origin feat/short-description
```
Open a Pull Request
- On GitHub, click Compare & pull request on your fork
- Set the base repository to `speedyk-005/chunklet-py:main`
- Reference related issues (e.g., “Closes #123”)
- Describe your change, testing steps, and any migration notes
- Ensure all CI checks pass before requesting review
Adhering to these steps ensures smooth reviews and integration into the main codebase.