Project Overview

This project provides a command-line tool and Python API for generating unique, realistic-sounding words or names. It supports both syllable-based and Markov-chain methods, with configurable language, length, prefixes, suffixes, and randomness control.

Main Features

  • Dual interface: CLI tool (nwg) and importable Python API
  • Generation methods:
    • Syllable-based composition
    • Markov-chain sequence modeling
  • Output customization: language selection, word length, prefixes, suffixes
  • Randomness control via seed for reproducible outputs
  • Batch generation with adjustable count

Typical Use Cases

  • Placeholder names in UI/UX mockups
  • Game asset or character naming
  • Brand, product or domain name prototyping
  • Automated test data generation

Quickstart Examples

CLI Usage

Generate five 8-letter Markov words in English:

nwg generate --method markov --language en --length 8 --count 5 --seed 42

Generate three 4-syllable words with a “pre” prefix:

nwg generate --method syllables --syllables 4 --prefix pre --count 3

Python API

from nonsense_word_generator import NonsenseWordGenerator

# Initialize generator with Markov method
gen = NonsenseWordGenerator(
    method="markov",
    language="en",
    length=7,
    seed=2025
)

# Generate a batch of words with optional affixes
names = gen.generate(
    count=10,
    prefix="neo",
    suffix="ia"
)

for name in names:
    print(name)
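
Because output is seeded, re-running the same configuration reproduces the same batch. A quick sanity check (a sketch, assuming the constructor and generate() shown above):

gen_a = NonsenseWordGenerator(method="markov", language="en", length=7, seed=2025)
gen_b = NonsenseWordGenerator(method="markov", language="en", length=7, seed=2025)
assert gen_a.generate(count=3) == gen_b.generate(count=3)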

License

This project is licensed under the MIT License. See the LICENSE file for full terms. You may use, modify, and distribute this software, including in derivative works, provided you include the original license and copyright notice.

Quick Start

Get up and running in minutes. Install the package, then generate pronounceable nonsense words via CLI or Python.

1. Install

Install from PyPI:

pip install nonsense-word-generator

Verify installation:

nonsense --version

2. Generate with the CLI

The nonsense command supports two methods: markov (chain-based) and syllable. Common options:

• --method markov or syllable
• --source word list (e.g. english, names.txt)
• --order chain order (Markov only)
• --min-length, --max-length character length range (Markov)
• --min-syllables, --max-syllables syllable count range
• --prefix, --suffix string constraints
• --count number of words
• --seed seed for reproducible output

Generate 5 Markov words (English, order 3, length 6–8):

nonsense \
  --method markov \
  --source english \
  --order 3 \
  --min-length 6 \
  --max-length 8 \
  --count 5 \
  --seed 42

Generate 5 syllable-based words (2–4 syllables, with “pre” prefix):

nonsense \
  --method syllable \
  --min-syllables 2 \
  --max-syllables 4 \
  --prefix pre \
  --count 5 \
  --seed 123

3. Generate in Python

Import and configure your generator. Both APIs use the same seed parameter for reproducibility.

3.1 Markov Generator

from markov_generator import MarkovGenerator

# Initialize: English source, chain order 2, with optional prefix/suffix
gen = MarkovGenerator(
    source="english",
    order=2,
    prefix="pro",
    suffix="ix",
    seed=2025
)

# Generate a batch of 10 words, each 5–7 characters long
words = [
    gen.generate(min_length=5, max_length=7)
    for _ in range(10)
]
print(words)
# e.g. ['protax', 'propax', 'prolix', ...]

3.2 Syllable Generator

from syllable_generator import SyllableGenerator

# Initialize: 1–3 syllables per word
syll_gen = SyllableGenerator(
    min_syllables=1,
    max_syllables=3,
    seed=2025
)

# Generate 8 words in one call
words = syll_gen.generate_batch(8)
print(words)
# e.g. ['ba', 'lopi', 'qumoro', ...]

4. Next Steps

  • Explore nonsense --help for full CLI options
  • Use word_loader.load_words() to supply custom dictionaries
  • Combine both generators for hybrid naming schemes (see the sketch below)
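
For example, a hybrid scheme can pair a short syllable stem with a Markov tail. A sketch using the classes and parameters from section 3 above:

from markov_generator import MarkovGenerator
from syllable_generator import SyllableGenerator

m = MarkovGenerator(source="english", order=2, seed=7)
s = SyllableGenerator(min_syllables=1, max_syllables=2, seed=7)

stems = s.generate_batch(5)
tails = [m.generate(min_length=3, max_length=5) for _ in range(5)]
hybrids = [stem + tail for stem, tail in zip(stems, tails)]
print(hybrids)  # e.g. ['balor', 'qumev', ...]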

CLI Reference

Quickly invoke nonsense_generator.py to produce random words, tokens, or names. It supports syllable-based and Markov-chain engines, custom word lists, prefixes/suffixes, length and count constraints, and verbose output.

Invocation

python nonsense_generator.py [MODE] [OPTIONS]

If no mode is given, batch mode runs.

Modes (mutually exclusive)

• --single
Generate one word.
• --token
Generate a three-word “token” joined by dashes.
• --name
Generate first + last names (capitalized, with retry logic to satisfy length constraints).
• (batch default)
Generate a grid of words (50 outputs by default).

Common Options

• --length MIN-MAX or --length=N
Restrict each output to the specified length range or exact length.
• --count N
Number of outputs (batch default 50; single/token/name default 1).
• --markov
Enable Markov-chain generator (default is syllable-based).
• --order K
Markov chain order (default 2).
• --cutoff P
Minimum transition probability (default 0.1).
• --words TYPE|URL
Choose built-in list (e.g. “en”, “names”, “surnames”) or remote HTTP(S) word list.
• --prefix STR
Force outputs to start with STR (auto-enables Markov).
• --suffix STR
Force outputs to end with STR (auto-enables Markov).
• --list
Print available built-in sources and exit.
• -v, --verbose
Show initialization details (loading dictionaries, building chains).

Defaults

• Engine: syllable-based unless --markov is given or any of {--order, --cutoff, --words, --prefix, --suffix} is used (these auto-enable Markov)
• Word list: “en” (English)
• Lengths (if unspecified):
– single: 8–12
– token: 5–8
– name: 6–20
– batch: 5–12
• Count: batch 50, others 1
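
With no arguments at all, these defaults apply directly:

python nonsense_generator.py
# 50 syllable-based words, 5–12 letters each, printed as a grid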

Practical Examples

Generate one syllable-based word of length 5–8:

python nonsense_generator.py --single --length=5-8
# e.g. “brinola”

Generate a 3-word Markov token, each 3–6 letters:

python nonsense_generator.py --token --markov --length=3-6
# e.g. “fru-cael-vino”

Generate 5 first+last names, each 4–7 letters, using Markov order 3:

python nonsense_generator.py --name --count=5 --length=4-7 --order=3
# e.g.
#   Arlen Tovik
#   Emrie Qelva
#   ...

Batch grid: 8 Markov words, 2–5 letters per word:

python nonsense_generator.py --markov --count=8 --length=2-5
# prints a grid, five words per row (8 words → rows of 5 and 3):
#   pa   thra   zu   lil   dro
#   xor  mev   hil

Force prefix+suffix with Markov:

python nonsense_generator.py --single --prefix=pre --suffix=ing
# e.g. “prebioning”

Use a remote word list:

python nonsense_generator.py --markov --words=https://example.com/animals.txt --single
# fetches list at runtime and builds model

List built-in sources:

python nonsense_generator.py --list

Common Workflows

Bulk Name Generation for Testing

python nonsense_generator.py --name --count=100 \
  --length=5-10 --order=3 --cutoff=0.05 -v

– Generates 100 realistic names, logs initialization details, and uses a relaxed cutoff for more varied output.

Prefixed Token Series

python nonsense_generator.py --token --count=10 \
  --markov --length=4-6 --prefix=alpha@

– Produces 10 tokens each starting with “alpha@”, suitable for autogenerated IDs.

Custom Dictionary Batch

python nonsense_generator.py --words=custom_list.txt --count=20

– Uses the local file custom_list.txt as the word source for a 20-word batch (--words auto-enables the Markov engine).


Refer to tests/test_cli.py for exhaustive coverage of modes, options, error cases and language support.

Python API Reference

This section documents classes and functions for integrating the nonsense-word generators, caching, and word loading into your Python projects.

cache_manager.CacheManager

Handle caching of word lists and Markov chains to speed up repeated operations.

Class Signature

class CacheManager:
    def __init__(self, cache_dir: Optional[str] = None, expiration_seconds: int = 86400)
    def get_cache_path(self, key: str) -> str
    def is_cache_valid(self, path: str) -> bool
    def load_cache(self, key: str) -> Any
    def save_cache(self, key: str, data: Any) -> None

Parameters

  • cache_dir (str, optional): Directory for cache files; defaults to ~/.cache/nwg.
  • expiration_seconds (int): Time in seconds before a cache entry expires.

Methods

  • get_cache_path(key): Return full file path for a given cache key (hashing applied).
  • is_cache_valid(path): Return True if cache file exists and is not expired.
  • load_cache(key): Deserialize and return cached data or raise FileNotFoundError.
  • save_cache(key, data): Serialize data (pickled or JSON) to disk.

Example

from cache_manager import CacheManager

cache = CacheManager(expiration_seconds=3600)
key = "en_markov_order3"
try:
    chain = cache.load_cache(key)
except FileNotFoundError:
    chain = build_markov_chain(words)  # your function and word list
    cache.save_cache(key, chain)
# use chain...

hunspell.parse_affix_rules

Parse .aff file content into rule structures for morphological expansion.

Signature

def parse_affix_rules(aff_content: str) -> Dict[str, List[Dict]]

Parameters

  • aff_content (str): Raw text of a Hunspell .aff file.

Returns

  • Dict mapping each flag (str) to a list of rule dicts:
    • { 'type': 'PFX'|'SFX', 'flag': str, 'strip': str, 'add': str, 'condition': str }

Example

from hunspell import parse_affix_rules

aff_text = open("en_US.aff", encoding="utf-8").read()
affix_rules = parse_affix_rules(aff_text)
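
Each flag maps to a list of rule dicts with the keys listed above, so individual rules can be inspected directly:

# 'S' is just an example flag; available flags depend on the .aff file
for rule in affix_rules.get("S", []):
    print(rule["type"], rule["strip"], rule["add"], rule["condition"])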

hunspell.apply_affix_rules

Generate word variants by applying affix rules to a base word.

Signature

def apply_affix_rules(word: str, flags: str, affix_rules: Dict[str, List[Dict]]) -> Set[str]

Parameters

  • word (str): Lowercased base form (alphabetic).
  • flags (str): Sequence of flag characters from the .dic entry.
  • affix_rules (dict): Output of parse_affix_rules().

Returns

  • Set of valid word forms including the original.

Example

from hunspell import apply_affix_rules, parse_affix_rules

affix_rules = parse_affix_rules(open("en_US.aff", encoding="utf-8").read())
variants = apply_affix_rules("make", "DG", affix_rules)
# variants includes the original plus rule-derived forms, e.g. {"make", "making", ...}

hunspell.get_hunspell_words

Download, parse, and optionally expand words from Hunspell dictionaries.

Signature

def get_hunspell_words(
    lang: str,
    dic_url: Optional[str] = None,
    aff_url: Optional[str] = None,
    expand_morphology: bool = False
) -> Set[str]

Parameters

  • lang (str): Language code (e.g. "en_US").
  • dic_url (str, optional): Override default .dic URL.
  • aff_url (str, optional): Override default .aff URL.
  • expand_morphology (bool): If True, apply affix rules to expand forms.

Returns

  • Set of words (lowercase strings).

Example

from hunspell import get_hunspell_words

# Basic word list
words = get_hunspell_words("en_US")

# With morphological expansion
full_set = get_hunspell_words("en_US", expand_morphology=True)

markov_generator.MarkovWordGenerator

Generate pronounceable nonsense words using Markov chains.

Class Signature

class MarkovWordGenerator:
    def __init__(
        self,
        order: int = 2,
        cutoff: float = 0.0,
        words: Union[str, Iterable[str]] = "en",
        cache_dir: Optional[str] = None,
        max_retries: int = 1000,
        reverse_mode: bool = False,
        verbose: bool = False
    )
    def generate(self, min_len: int = 3, max_len: int = 8) -> str
    def generate_batch(self, count: int, min_len: int = 3, max_len: int = 8) -> List[str]
    def generate_with_prefix(self, prefix: str, min_len: int, max_len: int) -> str
    def generate_batch_with_prefix(self, prefix: str, count: int, min_len: int, max_len: int) -> List[str]
    def generate_with_suffix(self, suffix: str, min_len: int, max_len: int) -> str
    def generate_with_prefix_and_suffix(self, prefix: str, suffix: str, min_len: int, max_len: int) -> str

Parameters

  • order (int): Chain order (number of look-back characters).
  • cutoff (float): Probability threshold to prune low-likelihood transitions.
  • words (str or iterable): "en" for built-in word list or any iterable of strings.
  • cache_dir (str, optional): Directory for caching chain data.
  • max_retries (int): Attempts per generation to meet constraints.
  • reverse_mode (bool): Train on reversed words (required for suffix).
  • verbose (bool): Print progress and diagnostics.

Examples

Basic usage:

from markov_generator import MarkovWordGenerator

gen = MarkovWordGenerator(order=3, words="en")
word = gen.generate(min_len=5, max_len=9)
print(word)  # e.g. "blenvor"

With prefix constraint:

word = gen.generate_with_prefix("pre", min_len=5, max_len=8)
print(word)  # e.g. "precuno"

With suffix constraint:

rev = MarkovWordGenerator(order=2, words="en", reverse_mode=True)
word = rev.generate_with_suffix("ing", min_len=6, max_len=10)
print(word)  # e.g. "florling"

With both prefix and suffix:

word = gen.generate_with_prefix_and_suffix("pro", "ia", min_len=6, max_len=10)
print(word)  # e.g. "prolacia"
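
Batch generation (generate_batch, per the signature above):

batch = gen.generate_batch(5, min_len=4, max_len=8)
print(batch)  # e.g. ['vanto', 'crelin', 'dorma', ...]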

syllable_generator.SyllableWordGenerator

Produce pronounceable words by concatenating predefined syllables.

Class Signature

class SyllableWordGenerator:
    def __init__(self, syllables: Optional[Dict[str, Any]] = None)
    def generate(self, min_len: int = 3, max_len: int = 10, prefix: Optional[str] = None) -> str
    def generate_batch(self, count: int, min_len: int = 3, max_len: int = 10) -> List[str]

Parameters

  • syllables (dict, optional): Overrides default onset-nucleus-coda patterns.

Examples

from syllable_generator import SyllableWordGenerator

sgen = SyllableWordGenerator()

# Single word
w = sgen.generate(min_len=4, max_len=7)
print(w)  # e.g. "craynt"

# Batch of 5 words
batch = sgen.generate_batch(5, min_len=3, max_len=6)
print(batch)
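
The syllables parameter can override the default patterns. The exact schema lives in syllable_generator; the keys below are illustrative assumptions only:

custom = {
    "onsets": ["br", "kl", "t"],   # hypothetical keys
    "nuclei": ["a", "ee", "o"],
    "codas": ["n", "rt", ""],
}
sgen2 = SyllableWordGenerator(syllables=custom)
print(sgen2.generate(min_len=3, max_len=6))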

word_loader.load_words

Load word lists from various sources with caching and size limits.

Signature

def load_words(
    source: Union[str, Iterable[str]],
    max_size: int = 100000,
    safe: bool = True,
    expand_hunspell: bool = False
) -> Set[str]

Parameters

  • source (str or iterable): URL, Hunspell language code (e.g., "en_US"), or list of words.
  • max_size (int): Maximum number of words to load.
  • safe (bool): If True, enforce size limits and simple sanitization.
  • expand_hunspell (bool): If True and source is a Hunspell code, expand morphology.

Returns

  • Set of unique, lowercase words.

Example

from word_loader import load_words

# Load from URL
words = load_words("https://example.com/wordlist.txt")

# Load built-in English words via Hunspell expansion
en_words = load_words("en_US", expand_hunspell=True)
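
The source parameter also accepts a plain iterable of words, normalized per the documented return type:

# From an in-memory iterable, capped by max_size
small = load_words(["Alpha", "beta", "GAMMA"], max_size=1000)
# expect {"alpha", "beta", "gamma"} per the documented lowercase normalization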

Use these APIs to integrate caching, Hunspell parsing, and both Markov- and syllable-based generators into your applications.

Architecture & Extensibility

This section describes the internal design, performance considerations, and extension points for the nonsense-word-generator project.

Module Overview

  • cache_manager.py
    Manages on-disk caching of word lists and chains. Provides safe path generation, validity checks, and serialization.

  • hunspell.py
    Downloads and parses Hunspell .dic/.aff files. Applies affix rules for morphological expansion and handles multiple encodings.

  • word_loader.py
    Exposes load_words(source, **kwargs) for loading word sets from URLs, Hunspell dictionaries, or built-in lists. Implements caching, size limits, and safety checks.

  • profile_generator.py
    Benchmarks different word generators. Includes a timer context manager and a registration mechanism for profiling new generators.

  • tests/analyze_names.py
    CLI and library tool for computing name-length statistics and Pearson correlation on “Firstname Lastname” datasets.


1. Extending the Word Loader

load_words(source, **kwargs) dispatches to handlers in SOURCE_HANDLERS. To add a new source:

  1. Define a loader function that returns Set[str].
  2. Register it under a unique key.

# my_loader.py
import requests

def load_custom_list(path_or_url: str, **kwargs) -> set[str]:
    # e.g. fetch a JSON array of words
    data = requests.get(path_or_url, timeout=10).json()
    return {w.lower() for w in data}

# In your application bootstrap
from word_loader import SOURCE_HANDLERS
SOURCE_HANDLERS['custom_json'] = load_custom_list

# Usage (assuming extra kwargs are forwarded to the handler)
from word_loader import load_words
words = load_words('custom_json', path_or_url='https://example.com/words.json')

2. Customizing CacheManager

Subclass CacheManager to alter expiration logic, serialization format, or storage location.

# ttl_cache.py
import os, time
from cache_manager import CacheManager

class TTLCacheManager(CacheManager):
    def __init__(self, cache_dir: str, ttl: float):
        super().__init__(cache_dir)
        self.ttl = ttl

    def is_cache_valid(self, path: str) -> bool:
        # Override: expire entries older than self.ttl seconds
        if not os.path.exists(path):
            return False
        return (time.time() - os.path.getmtime(path)) < self.ttl

# Usage
from ttl_cache import TTLCacheManager
cache = TTLCacheManager(cache_dir="cache", ttl=3600)
try:
    words = cache.load_cache("word_list_en")
except FileNotFoundError:
    words = fetch_and_cache()  # your function

3. Adding Hunspell Dictionaries

The HUNSPELL_DICT_URLS mapping in hunspell.py defines known language codes. To support a new code:

# Local override
from hunspell import HUNSPELL_DICT_URLS, get_hunspell_words
HUNSPELL_DICT_URLS['xx'] = (
    'https://example.com/xx.dic',
    'https://example.com/xx.aff'
)

# Load with morphological expansion
words = get_hunspell_words('xx', expand_morphology=True)

Under the hood:

  • Downloads .dic and .aff to cache.
  • Parses affix rules with parse_affix_rules().
  • Applies apply_affix_rules() per entry.
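
A minimal sketch of that expansion loop, assuming aff_text holds the .aff content and the .dic file has already been parsed into (word, flags) pairs (dic_entries is hypothetical):

rules = parse_affix_rules(aff_text)
expanded = set()
for word, flags in dic_entries:
    expanded |= apply_affix_rules(word, flags, rules)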

4. Profiling and Benchmarking Generators

profile_generator.py discovers generators via a registry. To benchmark a custom generator:

  1. Implement a class with .generate(min_len, max_len) and .generate_batch(n, min_len, max_len).
  2. Register it with ProfileRunner.

# my_gen.py
import random

class MyWordGenerator:
    def __init__(self, **opts):
        # accept e.g. a seed for reproducible output
        self.rng = random.Random(opts.get("seed"))

    def generate(self, min_len: int, max_len: int) -> str:
        # return a single random lowercase word
        length = self.rng.randint(min_len, max_len)
        return "".join(self.rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(length))

    def generate_batch(self, n: int, min_len: int, max_len: int) -> list[str]:
        return [self.generate(min_len, max_len) for _ in range(n)]

# register and run
from profile_generator import ProfileRunner
from my_gen import MyWordGenerator

ProfileRunner.register('MyGen', MyWordGenerator)
ProfileRunner().run_all(
    samples=1000,
    min_len=4,
    max_len=10
)

Use the timer context manager in your own scripts to measure blocks:

from profile_generator import timer

with timer("Custom batch"):
    result = MyWordGenerator().generate_batch(500, 5, 12)

5. Testing & Analysis Utilities

The tests/analyze_names.py script offers two functions:

  • parse_names(text: str) -> list[tuple[int,int]]
  • compute_correlation(pairs: list[tuple[int,int]]) -> float

You can integrate them as a library:

from tests.analyze_names import parse_names, compute_correlation

raw = open('names.txt').read()
pairs = parse_names(raw)
r = compute_correlation(pairs)
print(f"Pearson r = {r:.3f}")

To extend:

  • Add new metrics alongside compute_correlation (e.g. the Spearman sketch below).
  • Integrate into CI by invoking analyze_names.py in a pipeline step.
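
Since Spearman's rho is Pearson's r computed on ranks, a variant can reuse compute_correlation. A hedged sketch (ignores tie correction; compute_spearman is hypothetical):

def compute_spearman(pairs: list[tuple[int, int]]) -> float:
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        out = [0] * len(values)
        for rank, i in enumerate(order):
            out[i] = rank + 1
        return out
    xs, ys = zip(*pairs)
    return compute_correlation(list(zip(ranks(xs), ranks(ys))))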

Performance Considerations

  • Caching avoids repeated downloads and parsing. Tune the TTL or disable it via subclassing.
  • Morphological Expansion can multiply dictionary size 5–10×. Disable if memory-constrained.
  • Generator Algorithms: syllable-based vs Markov chain exhibit different time/memory profiles. Use profile_generator.py to compare.
  • Batch Operations (generate_batch) often outperform repeated single calls.
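
A quick way to check the batch-vs-single tradeoff with the bundled timer (a sketch; timings depend on your word list and machine):

from profile_generator import timer
from markov_generator import MarkovWordGenerator

gen = MarkovWordGenerator(order=2, words="en")
with timer("1000 single calls"):
    singles = [gen.generate(min_len=5, max_len=10) for _ in range(1000)]
with timer("one batch of 1000"):
    batch = gen.generate_batch(1000, min_len=5, max_len=10)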

Follow these guidelines to maintain high performance and keep extension points clear for future features.