Project Overview
This project provides a command-line tool and Python API for generating unique, realistic-sounding words or names. It supports both syllable-based and Markov-chain methods, with configurable language, length, prefixes, suffixes, and randomness control.
Main Features
- Dual interface: CLI tool (nwg) and Python importable API
- Generation methods:
  - Syllable-based splitting
  - Markov-chain sequence modeling
- Output customization: language selection, word length, prefixes, suffixes
- Randomness control via seed for reproducible outputs
- Batch generation with adjustable count
Typical Use Cases
- Placeholder names in UI/UX mockups
- Game asset or character naming
- Brand, product or domain name prototyping
- Automated test data generation
Quickstart Examples
CLI Usage
Generate five 8-letter Markov words in English:
nwg generate --method markov --language en --length 8 --count 5 --seed 42
Generate three 4-syllable words with a “pre” prefix:
nwg generate --method syllables --syllables 4 --prefix pre --count 3
Python API
from nonsense_word_generator import NonsenseWordGenerator

# Initialize generator with Markov method
gen = NonsenseWordGenerator(
    method="markov",
    language="en",
    length=7,
    seed=2025
)

# Generate a batch of words with optional affixes
names = gen.generate(
    count=10,
    prefix="neo",
    suffix="ia"
)

for name in names:
    print(name)
License
This project is licensed under the MIT License. See the LICENSE file for full terms. You may use, modify, and distribute this software, including in derivative works, provided you retain the original license and copyright notices.
Quick Start
Get up and running in minutes. Install the package, then generate pronounceable nonsense words via CLI or Python.
1. Install
Install from PyPI:
pip install nonsense-word-generator
Verify installation:
nonsense --version
2. Generate with the CLI
The nonsense command supports two methods: markov (chain-based) and syllable. Common options:
• --method: markov or syllable
• --source: word list (e.g. english, names.txt)
• --order: chain order (Markov only)
• --min-length, --max-length: character length (Markov)
• --min-syllables, --max-syllables: syllable count
• --prefix, --suffix: string constraints
• --count: number of words
• --seed: reproducible output
Generate 5 Markov words (English, order 3, length 6–8):
nonsense \
--method markov \
--source english \
--order 3 \
--min-length 6 \
--max-length 8 \
--count 5 \
--seed 42
Generate 5 syllable-based words (2–4 syllables, with “pre” prefix):
nonsense \
--method syllable \
--min-syllables 2 \
--max-syllables 4 \
--prefix pre \
--count 5 \
--seed 123
3. Generate in Python
Import and configure your generator. Both APIs use the same seed parameter for reproducibility.
3.1 Markov Generator
from markov_generator import MarkovGenerator

# Initialize: English source, chain order 2, with optional prefix/suffix
gen = MarkovGenerator(
    source="english",
    order=2,
    prefix="pro",
    suffix="ix",
    seed=2025
)

# Generate a batch of 10 words, each 5-7 characters long
words = [
    gen.generate(min_length=5, max_length=7)
    for _ in range(10)
]
print(words)
# e.g. ['protax', 'propax', 'prolix', ...]
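As a quick reproducibility check (a minimal sketch, assuming the MarkovGenerator constructor shown above): two generators built with the same seed should agree.

# Same seed, same configuration: identical output
gen_a = MarkovGenerator(source="english", order=2, seed=7)
gen_b = MarkovGenerator(source="english", order=2, seed=7)
assert gen_a.generate(min_length=5, max_length=7) == gen_b.generate(min_length=5, max_length=7)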
3.2 Syllable Generator
from syllable_generator import SyllableGenerator

# Initialize: 1-3 syllables per word
syll_gen = SyllableGenerator(
    min_syllables=1,
    max_syllables=3,
    seed=2025
)

# Generate 8 words in one call
words = syll_gen.generate_batch(8)
print(words)
# e.g. ['ba', 'lopi', 'qumoro', ...]
4. Next Steps
- Explore nonsense --help for full CLI options
- Use word_loader.load_words() to supply custom dictionaries (see the sketch below)
- Combine both generators for hybrid naming schemes
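For instance, a custom dictionary can be fed straight into a generator. A sketch, assuming load_words accepts a URL and the generator accepts any iterable of words (both documented in the API reference below); the URL is a placeholder:

from word_loader import load_words
from markov_generator import MarkovWordGenerator

# Load a custom word list and train a Markov generator on it
custom = load_words("https://example.com/custom_words.txt")
gen = MarkovWordGenerator(order=2, words=custom)
print(gen.generate_batch(5, min_len=4, max_len=8))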
CLI Reference
Quickly invoke nonsense_generator.py to produce random words, tokens, or names. It supports syllable-based and Markov-chain engines, custom word lists, prefixes/suffixes, length/count constraints, and verbose output.
Invocation
python nonsense_generator.py [MODE] [OPTIONS]
Omitting a mode runs batch mode.
Modes (mutually exclusive)
• --single: Generate one word.
• --token: Generate a three-word “token” joined by dashes.
• --name: Generate first + last names (capitalized, with retry logic for length).
• (batch, the default): Generate a grid of words (50 outputs by default).
Common Options
• --length MIN-MAX or --length=N: Restrict each output to the specified length range or exact length.
• --count N: Number of outputs (batch default 50; single/token/name default 1).
• --markov: Enable the Markov-chain generator (default is syllable-based).
• --order K: Markov chain order (default 2).
• --cutoff P: Minimum transition probability (default 0.1).
• --words TYPE|URL: Choose a built-in list (e.g. “en”, “names”, “surnames”) or a remote HTTP(S) word list.
• --prefix STR: Force outputs to start with STR (auto-enables Markov).
• --suffix STR: Force outputs to end with STR (auto-enables Markov).
• --list: Print available built-in sources and exit.
• -v, --verbose: Show initialization details (loading dictionaries, building chains).
Defaults
• Engine: syllable-based unless --markov or any of {--order, --cutoff, --words, --prefix, --suffix} is given
• Word list: “en” (English)
• Lengths (if unspecified):
– single: 8–12
– token: 5–8
– name: 6–20
– batch: 5–12
• Count: batch 50, others 1
Practical Examples
Generate one syllable-based word of length 5–8:
python nonsense_generator.py --single --length=5-8
# e.g. “brinola”
Generate a 3-word Markov token, each 3–6 letters:
python nonsense_generator.py --token --markov --length=3-6
# e.g. “fru-cael-vino”
Generate 5 first+last names, each 4–7 letters, using Markov order 3:
python nonsense_generator.py --name --count=5 --length=4-7 --order=3
# e.g.
# Arlen Tovik
# Emrie Qelva
# ...
Batch grid: 8 Markov words, 2–5 letters per word:
python nonsense_generator.py --markov --count=8 --length=2-5
# prints the 8 words in rows of up to 5:
# pa thra zu lil dro
# xor mev hil
Force prefix+suffix with Markov:
python nonsense_generator.py --single --prefix=pre --suffix=ing
# e.g. “prebioning”
Use a remote word list:
python nonsense_generator.py --markov --words=https://example.com/animals.txt --single
# fetches list at runtime and builds model
List built-in sources:
python nonsense_generator.py --list
Common Workflows
Bulk Name Generation for Testing
python nonsense_generator.py --name --count=100 \
--length=5-10 --order=3 --cutoff=0.05 -v
– Generates 100 realistic names, logs initialization details, and uses a relaxed cutoff for more varied output.
Prefixed Token Series
python nonsense_generator.py --token --count=10 \
--markov --length=4-6 --prefix=alpha@
– Produces 10 tokens each starting with “alpha@”, suitable for autogenerated IDs.
Custom Dictionary Batch
python nonsense_generator.py --words=custom_list.txt --count=20
– Uses the local file custom_list.txt as the word source; batch mode is the default, and --words auto-enables the Markov engine.
Refer to tests/test_cli.py for exhaustive coverage of modes, options, error cases and language support.
Python API Reference
This section documents classes and functions for integrating the nonsense-word generators, caching, and word loading into your Python projects.
cache_manager.CacheManager
Handle caching of word lists and Markov chains to speed up repeated operations.
Class Signature
class CacheManager:
    def __init__(self, cache_dir: Optional[str] = None, expiration_seconds: int = 86400)
    def get_cache_path(self, key: str) -> str
    def is_cache_valid(self, path: str) -> bool
    def load_cache(self, key: str) -> Any
    def save_cache(self, key: str, data: Any) -> None
Parameters
- cache_dir (str, optional): Directory for cache files; defaults to ~/.cache/nwg.
- expiration_seconds (int): Time in seconds before a cache entry expires.
Methods
- get_cache_path(key): Return the full file path for a given cache key (hashing applied).
- is_cache_valid(path): Return True if the cache file exists and is not expired.
- load_cache(key): Deserialize and return cached data, or raise FileNotFoundError.
- save_cache(key, data): Serialize data (pickled or JSON) to disk.
Example
from cache_manager import CacheManager

cache = CacheManager(expiration_seconds=3600)
key = "en_markov_order3"
try:
    chain = cache.load_cache(key)
except FileNotFoundError:
    chain = build_markov_chain(words)  # your function
    cache.save_cache(key, chain)
# use chain...
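The remaining two methods let you probe cache state without deserializing the payload, for example:

# Inspect where a key lives and whether it is still fresh
path = cache.get_cache_path(key)
if cache.is_cache_valid(path):
    print("cache hit:", path)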
hunspell.parse_affix_rules
Parse .aff file content into rule structures for morphological expansion.
Signature
def parse_affix_rules(aff_content: str) -> Dict[str, List[Dict]]
Parameters
- aff_content (str): Raw text of a Hunspell .aff file.
Returns
- Dict mapping each flag (str) to a list of rule dicts:
  { 'type': 'PFX'|'SFX', 'flag': str, 'strip': str, 'add': str, 'condition': str }
Example
from hunspell import parse_affix_rules
aff_text = open("en_US.aff", encoding="utf-8").read()
affix_rules = parse_affix_rules(aff_text)
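The returned mapping can then be inspected per flag. The flag "D" below is illustrative; actual flags depend on the .aff file:

# Print each rule registered under one (illustrative) flag
for rule in affix_rules.get("D", []):
    print(rule['type'], rule['flag'], rule['strip'], rule['add'], rule['condition'])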
hunspell.apply_affix_rules
Generate word variants by applying affix rules to a base word.
Signature
def apply_affix_rules(word: str, flags: str, affix_rules: Dict[str, List[Dict]]) -> Set[str]
Parameters
- word (str): Lowercased base form (alphabetic).
- flags (str): Sequence of flag characters from the .dic entry.
- affix_rules (dict): Output of parse_affix_rules().
Returns
- Set of valid word forms including the original.
Example
from hunspell import apply_affix_rules, parse_affix_rules
affix_rules = parse_affix_rules(open("en_US.aff", encoding="utf-8").read())
variants = apply_affix_rules("make", "DG", affix_rules)
# variants -> {"make", "made", "making"}
hunspell.get_hunspell_words
Download, parse, and optionally expand words from Hunspell dictionaries.
Signature
def get_hunspell_words(
    lang: str,
    dic_url: Optional[str] = None,
    aff_url: Optional[str] = None,
    expand_morphology: bool = False
) -> Set[str]
Parameters
- lang (str): Language code (e.g. "en_US").
- dic_url (str, optional): Override the default .dic URL.
- aff_url (str, optional): Override the default .aff URL.
- expand_morphology (bool): If True, apply affix rules to expand forms.
Returns
- Set of words (lowercase strings).
Example
from hunspell import get_hunspell_words
# Basic word list
words = get_hunspell_words("en_US")
# With morphological expansion
full_set = get_hunspell_words("en_US", expand_morphology=True)
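The resulting set plugs directly into the generators documented below, since the words parameter accepts any iterable of strings. A brief sketch:

from markov_generator import MarkovWordGenerator

# Train a Markov chain on the expanded Hunspell vocabulary
gen = MarkovWordGenerator(order=3, words=full_set)
print(gen.generate_batch(5, min_len=5, max_len=10))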
markov_generator.MarkovWordGenerator
Generate pronounceable nonsense words using Markov chains.
Class Signature
class MarkovWordGenerator:
    def __init__(
        self,
        order: int = 2,
        cutoff: float = 0.0,
        words: Union[str, Iterable[str]] = "en",
        cache_dir: Optional[str] = None,
        max_retries: int = 1000,
        reverse_mode: bool = False,
        verbose: bool = False
    )
    def generate(self, min_len: int = 3, max_len: int = 8) -> str
    def generate_batch(self, count: int, min_len: int = 3, max_len: int = 8) -> List[str]
    def generate_with_prefix(self, prefix: str, min_len: int, max_len: int) -> str
    def generate_batch_with_prefix(self, prefix: str, count: int, min_len: int, max_len: int) -> List[str]
    def generate_with_suffix(self, suffix: str, min_len: int, max_len: int) -> str
    def generate_with_prefix_and_suffix(self, prefix: str, suffix: str, min_len: int, max_len: int) -> str
Parameters
- order (int): Chain order (number of look-back characters).
- cutoff (float): Probability threshold to prune low-likelihood transitions.
- words (str or iterable): "en" for the built-in word list, or any iterable of strings.
- cache_dir (str, optional): Directory for caching chain data.
- max_retries (int): Attempts per generation to meet constraints.
- reverse_mode (bool): Train on reversed words (required for suffix generation).
- verbose (bool): Print progress and diagnostics.
Examples
Basic usage:
from markov_generator import MarkovWordGenerator
gen = MarkovWordGenerator(order=3, words="en")
word = gen.generate(min_len=5, max_len=9)
print(word) # e.g. "blenvor"
With prefix constraint:
word = gen.generate_with_prefix("pre", min_len=5, max_len=8)
print(word) # e.g. "precuno"
With suffix constraint:
rev = MarkovWordGenerator(order=2, words="en", reverse_mode=True)
word = rev.generate_with_suffix("ing", min_len=6, max_len=10)
print(word) # e.g. "florling"
With both prefix and suffix:
word = gen.generate_with_prefix_and_suffix("pro", "ia", min_len=6, max_len=10)
print(word) # e.g. "prolacia"
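Batch generation follows the same pattern, per the signatures above:

# 20 plain words, then 10 words sharing a prefix
words = gen.generate_batch(20, min_len=5, max_len=9)
prefixed = gen.generate_batch_with_prefix("neo", 10, min_len=6, max_len=9)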
syllable_generator.SyllableWordGenerator
Produce pronounceable words by concatenating predefined syllables.
Class Signature
class SyllableWordGenerator:
    def __init__(self, syllables: Optional[Dict[str, Any]] = None)
    def generate(self, min_len: int = 3, max_len: int = 10, prefix: Optional[str] = None) -> str
    def generate_batch(self, count: int, min_len: int = 3, max_len: int = 10) -> List[str]
Parameters
- syllables (dict, optional): Overrides default onset-nucleus-coda patterns.
Examples
from syllable_generator import SyllableWordGenerator
sgen = SyllableWordGenerator()
# Single word
w = sgen.generate(min_len=4, max_len=7)
print(w) # e.g. "craynt"
# Batch of 5 words
batch = sgen.generate_batch(5, min_len=3, max_len=6)
print(batch)
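The syllables parameter overrides the default inventory. Its exact schema is not documented in this section, so the keys below are hypothetical, shown only to illustrate the idea:

# Hypothetical syllable inventory; the real schema may differ
custom = {
    "onsets": ["br", "cl", "dr", "st"],
    "nuclei": ["a", "e", "io", "ou"],
    "codas": ["n", "rk", "st"],
}
custom_gen = SyllableWordGenerator(syllables=custom)
print(custom_gen.generate(min_len=4, max_len=8))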
word_loader.load_words
Load word lists from various sources with caching and size limits.
Signature
def load_words(
    source: Union[str, Iterable[str]],
    max_size: int = 100000,
    safe: bool = True,
    expand_hunspell: bool = False
) -> Set[str]
Parameters
- source (str or iterable): URL, Hunspell language code (e.g., "en_US"), or list of words.
- max_size (int): Maximum number of words to load.
- safe (bool): If True, enforce size limits and simple sanitization.
- expand_hunspell (bool): If True and source is a Hunspell code, expand morphology.
Returns
- Set of unique, lowercase words.
Example
from word_loader import load_words
# Load from URL
words = load_words("https://example.com/wordlist.txt")
# Load built-in English words via Hunspell expansion
en_words = load_words("en_US", expand_hunspell=True)
Use these APIs to integrate caching, Hunspell parsing, and both Markov- and syllable-based generators into your applications.
Architecture & Extensibility
This section describes the internal design, performance considerations, and extension points for the nonsense-word-generator project.
Module Overview
- cache_manager.py: Manages on-disk caching of word lists and chains. Provides safe path generation, validity checks, and serialization.
- hunspell.py: Downloads and parses Hunspell .dic/.aff files. Applies affix rules for morphological expansion and handles multiple encodings.
- word_loader.py: Exposes load_words(source, **kwargs) for loading word sets from URLs, Hunspell dictionaries, or built-in lists. Implements caching, size limits, and safety checks.
- profile_generator.py: Benchmarks the different word generators. Includes a timer context manager and a registration mechanism for profiling new generators.
- tests/analyze_names.py: CLI and library tool for computing name-length statistics and Pearson correlation on “Firstname Lastname” datasets.
1. Extending the Word Loader
load_words(source, **kwargs) dispatches to handlers in SOURCE_HANDLERS. To add a new source:
- Define a loader function that returns Set[str].
- Register it under a unique key.
# my_loader.py
def load_custom_list(path_or_url: str, **kwargs) -> set[str]:
    # e.g. fetch a JSON list of words
    import requests
    data = requests.get(path_or_url).json()
    return set(data)

# In your application bootstrap
from word_loader import SOURCE_HANDLERS
SOURCE_HANDLERS['custom_json'] = load_custom_list

# Usage
from word_loader import load_words
words = load_words('custom_json', 'https://example.com/words.json')
2. Customizing CacheManager
Subclass CacheManager to alter expiration logic, serialization format, or storage location.
# ttl_cache.py
import os
import time

from cache_manager import CacheManager

class TTLCacheManager(CacheManager):
    def __init__(self, cache_dir: str, ttl: float):
        super().__init__(cache_dir)
        self.ttl = ttl

    def is_cache_valid(self, path: str) -> bool:
        # Override expiration: an entry is valid only within the TTL window
        if not os.path.exists(path):
            return False
        return (time.time() - os.path.getmtime(path)) < self.ttl

# Usage
from ttl_cache import TTLCacheManager

cache = TTLCacheManager(cache_dir="cache", ttl=3600)
try:
    words = cache.load_cache('word_list_en')
except FileNotFoundError:
    words = fetch_and_cache()  # your function
3. Adding Hunspell Dictionaries
The HUNSPELL_DICT_URLS mapping in hunspell.py defines known language codes. To support a new code:
# Local override
from hunspell import HUNSPELL_DICT_URLS, get_hunspell_words

HUNSPELL_DICT_URLS['xx'] = (
    'https://example.com/xx.dic',
    'https://example.com/xx.aff'
)

# Load with morphological expansion
words = get_hunspell_words('xx', expand_morphology=True)
Under the hood, this:
- Downloads the .dic and .aff files to the cache.
- Parses affix rules with parse_affix_rules().
- Applies apply_affix_rules() to each entry (see the sketch below).
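A manual version of that pipeline might look like the following sketch. It assumes local copies of the files; the entry-count first line and word/FLAGS format are standard Hunspell .dic conventions:

from hunspell import parse_affix_rules, apply_affix_rules

# Parse the affix rules once
aff_rules = parse_affix_rules(open("xx.aff", encoding="utf-8").read())

# Expand each .dic entry (format: word/FLAGS; the first line is an entry count)
words = set()
with open("xx.dic", encoding="utf-8") as dic:
    next(dic)  # skip the entry-count line
    for line in dic:
        base, _, flags = line.strip().partition("/")
        if base:
            words |= apply_affix_rules(base.lower(), flags, aff_rules)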
4. Profiling and Benchmarking Generators
profile_generator.py discovers generators via a registry. To benchmark a custom generator:
- Implement a class with .generate(min_len, max_len) and .generate_batch(n, min_len, max_len).
- Register it with ProfileRunner.
# my_gen.py
class MyWordGenerator:
    def __init__(self, **opts):
        # setup
        pass

    def generate(self, min_len: int, max_len: int) -> str:
        # return a single word
        ...

    def generate_batch(self, n: int, min_len: int, max_len: int) -> list[str]:
        ...

# register and run
from profile_generator import ProfileRunner
from my_gen import MyWordGenerator

ProfileRunner.register('MyGen', MyWordGenerator)
ProfileRunner().run_all(
    samples=1000,
    min_len=4,
    max_len=10
)
Use the timer context manager in your own scripts to measure blocks:
from profile_generator import timer

with timer("Custom batch"):
    result = MyWordGenerator().generate_batch(500, 5, 12)
5. Testing & Analysis Utilities
The tests/analyze_names.py script offers two functions:
- parse_names(text: str) -> list[tuple[int, int]]
- compute_correlation(pairs: list[tuple[int, int]]) -> float
You can integrate them as a library:
from tests.analyze_names import parse_names, compute_correlation
raw = open('names.txt').read()
pairs = parse_names(raw)
r = compute_correlation(pairs)
print(f"Pearson r = {r:.3f}")
To extend:
- Add new metrics (e.g. Spearman) alongside compute_correlation (see the sketch below).
- Integrate into CI by invoking analyze_names.py in a pipeline step.
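As a sketch of the first point: Spearman's coefficient is Pearson's r computed on rank-transformed data, so it can reuse compute_correlation. The compute_spearman helper below is a hypothetical addition, not part of the script:

from tests.analyze_names import compute_correlation

def _average_ranks(values: list[int]) -> list[float]:
    # 1-based ranks, averaging ties (the standard midrank convention)
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def compute_spearman(pairs: list[tuple[int, int]]) -> float:
    xs, ys = zip(*pairs)
    rank_pairs = list(zip(_average_ranks(list(xs)), _average_ranks(list(ys))))
    # Reuse the library's Pearson routine on the ranks
    # (assumes compute_correlation accepts float pairs)
    return compute_correlation(rank_pairs)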
Performance Considerations
- Caching reduces repeated downloads and parsing. Tune the TTL or disable it via subclassing.
- Morphological expansion can multiply dictionary size 5–10×. Disable it if memory-constrained.
- Generator algorithms: the syllable-based and Markov-chain engines exhibit different time/memory profiles. Use profile_generator.py to compare them.
- Batch operations (generate_batch) often outperform repeated single calls (see the timing sketch below).
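For instance, the timer helper from profile_generator.py makes the batch-vs-single comparison easy to measure. A sketch using the APIs documented above:

from profile_generator import timer
from markov_generator import MarkovWordGenerator

gen = MarkovWordGenerator(order=2, words="en")

with timer("1000 single calls"):
    singles = [gen.generate(min_len=5, max_len=10) for _ in range(1000)]

with timer("one batch of 1000"):
    batch = gen.generate_batch(1000, min_len=5, max_len=10)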
Follow these guidelines to maintain high performance and keep extension points clear for future features.