Project Overview
This project implements an Aho–Corasick automaton on top of a Ternary Search Trie (TST) to provide fast, memory-efficient multi-pattern string matching. You can use it as a standalone command-line tool or embed the library in C++ applications.
Key Features
- Ternary Search Trie node layout for compact storage
- Aho–Corasick failure-link computation for linear-time matching
- Bulk pattern insertion with addPattern()
- Search API returning pattern occurrences with positions
- Debug utilities: trie traversal and queue operations
When to Use
- Matching thousands of patterns against large texts
- Memory-constrained environments (embedded or mobile)
- Real-time log scanning, intrusion detection, bioinformatics
- Embedding high-performance search in C++ services
Command-Line Quick Start
Compile and run the AC-TernarySearchTrie program:
g++ -std=c++11 main.cpp -o ac_tst
./ac_tst string.txt pattern.txt result.txt
# - string.txt : text to search
# - pattern.txt : one pattern per line
# - result.txt : matches in "position:pattern" format
Embedding the Library
Include the header and link your code against the compiled library:
#include "ternarySearchTrie.h"
#include <fstream>
#include <iostream>
int main() {
TernarySearchTrie trie;
// Load patterns
std::ifstream pfile("pattern.txt");
std::string pat;
while (std::getline(pfile, pat))
trie.addPattern(pat);
trie.buildFailureLinks();
// Read input text
std::ifstream tfile("string.txt");
std::string text((std::istreambuf_iterator<char>(tfile)),
std::istreambuf_iterator<char>());
// Search and print matches
auto results = trie.search(text);
for (auto &match : results)
std::cout << "Pos " << match.first << ": " << match.second << "\n";
return 0;
}
Use this API for seamless integration of high-throughput multi-pattern matching in your C++ projects.
Quick Start & CLI Usage
Get a working executable and reproduce the sample run in README.md in three steps: clone, build, run.
Prerequisites
- GCC or Clang
- Unix-style shell (bash, zsh, etc.)
1. Clone the Repository
git clone https://github.com/ChenyuGao/AC-TernarySearchTrie.git
cd AC-TernarySearchTrie
2. Build the Executable
Compile ternarySearchTrie.c
into str
with optimizations enabled:
gcc -O2 -o str ternarySearchTrie.c
If you prefer debug symbols:
gcc -g -O0 -o str ternarySearchTrie.c
3. Command-Line Usage
./str <text-file> <pattern-file> <result-file>
Arguments
• <text-file>
Path to the input text (e.g., string.txt)
• <pattern-file>
One pattern per line (e.g., pattern.txt)
• <result-file>
Output file for matches (e.g., result.txt)
Sample Run
Assuming string.txt
and pattern.txt
live in the repo root:
# Run multi-pattern search
./str string.txt pattern.txt result.txt
# Inspect results
cat result.txt
Expected result.txt
format:
pattern1: 15
pattern2: 42
pattern1: 78
...
Practical Tips
- The program overwrites
<result-file>
; back up existing files if needed. - For very large inputs or many patterns, monitor memory/CPU:
time ./str large.txt patterns.txt results.txt 2> performance.log
- Pre-sort or dedupe your patterns to reduce redundant matches.
- Ensure all input files use UTF-8 or ASCII encoding.
That’s it—now you can build and run the AC-TernarySearchTrie CLI for fast multi-pattern searches.
API Reference
Brief overview of functions to build, query, and free the Aho-Corasick ternary search trie in C.
ac_alloc: Initialize Automaton
Allocates and initializes the AC_STRUCT, setting up an empty ternary search trie and pattern storage.
Signature
AC_STRUCT *ac_alloc(void);
Return Value
• Pointer to a new AC_STRUCT on success
• NULL on allocation failure
Example
#include "ternarySearchTrie.h"
int main() {
AC_STRUCT *ac = ac_alloc();
if (!ac) {
fprintf(stderr, "Error: cannot allocate AC_STRUCT\n");
return 1;
}
// use ac...
ac_free(ac);
return 0;
}
ac_add_string: Insert Pattern into Trie
Registers a pattern in the trie, tags its end node with id
, and stores it in Patterns[id]
.
Signature
int ac_add_string(AC_STRUCT *ac, char *P, int M, int id);
Parameters
• ac – Pointer from ac_alloc()
• P – Null-terminated pattern string
• M – Length of P (≤ max_pattern)
• id – Unique pattern identifier (0 ≤ id < maxp_num)
Return Value
• 1 on success
• 0 on failure (duplicate id, M too large, or out of memory)
How It Works
- Traverse from root, comparing
P[i]
to node->data:
–<
→ lchild
–>
→ rchild
–==
→ next, i++ - On final character, set node->ID = id and copy P into
Patterns[id]
. - Increment
ac->pNum
.
Example
#include <string.h>
#include "ternarySearchTrie.h"
void add_patterns(AC_STRUCT *ac) {
const char *patterns[] = {"he", "she", "his", "hers"};
int n = sizeof(patterns)/sizeof(patterns[0]);
for (int i = 0; i < n; i++) {
int len = strlen(patterns[i]);
if (len >= max_pattern) continue;
if (!ac_add_string(ac, (char*)patterns[i], len, i)) {
fprintf(stderr, "ac_add_string failed for id=%d\n", i);
}
}
}
ac_implement: Build Failure Links
Constructs failure and output pointers for the AC automaton to enable linear-time search.
Signature
int ac_implement(AC_STRUCT *ac);
Return Value
1 on success
Usage
Call after inserting all patterns and before any search.
Example
// After all ac_add_string calls
if (!ac_implement(ac)) {
fprintf(stderr, "Error building failure links\n");
ac_free(ac);
exit(1);
}
search_init: Prepare for Search
Resets automaton state to start scanning a new text.
Signature
void search_init(AC_STRUCT *ac);
Usage
Call before each new call to ac_search if you scan multiple texts.
ac_search: Scan Text for Patterns
Finds all pattern occurrences in a text buffer and invokes a callback on each match.
Signature
void ac_search(AC_STRUCT *ac, char *T, int N,
void (*callback)(int id, int end_pos));
Parameters
• ac – Initialized AC_STRUCT with built failure links
• T, N – Text buffer and its length
• callback – Invoked per match with pattern id and ending position
Example
#include <stdio.h>
#include "ternarySearchTrie.h"
void report_match(int id, int end_pos) {
// Patterns[id] contains the matched string
printf("Match: \"%s\" at position %d\n",
Patterns[id], end_pos - strlen(Patterns[id]) + 1);
}
int main() {
AC_STRUCT *ac = ac_alloc();
add_patterns(ac);
ac_implement(ac);
const char *text = "ushers";
search_init(ac);
ac_search(ac, (char*)text, strlen(text), report_match);
ac_free(ac);
return 0;
}
ac_free: Release Resources
Frees all memory associated with the automaton, including the trie, queues, and pattern storage.
Signature
void ac_free(AC_STRUCT *ac);
Usage
Call after all searches to avoid memory leaks.
Algorithm & Implementation Notes
This section explains how the Aho–Corasick automaton uses a Ternary Search Trie (TST) for multi-pattern matching. You’ll see how patterns flow through insertion, failure-link construction, and search, plus tips to tune or extend the implementation.
Data Structures
TSNode (ternarySearchTrie.h)
Each trie node stores one byte (data
), an output ID (patId
), three child pointers, and a failure link:
typedef struct TSNode {
unsigned char data; // current character
int patId; // nonzero if a pattern ends here
struct TSNode *lchild; // < data
struct TSNode *next; // == data (advance in pattern)
struct TSNode *rchild; // > data
struct TSNode *fail; // Aho–Corasick fallback
} TSNode;
AC_STRUCT
Holds the root TST pointer and pattern count:
typedef struct {
TSNode *root;
int pNum; // number of patterns added
} AC_STRUCT;
Pattern Insertion (ac_add_string)
- Walks the TST comparing each byte to
node->data
. - Uses
lchild
/rchild
on<
/>
andnext
on==
. - Allocates new nodes when no match exists.
- Marks the terminal node’s
patId = id
. - Constraint: call in increasing length or sort first.
Failure Link Construction (ac_implement)
Builds failure links via breadth-first traversal:
- Initialize a queue with all non-null children of
root->next
. - For each dequeued node
u
:- For its
next
childv
, setv->fail
= followu->fail
via the same byte transition (fall back until match or root). - Propagate
patId
: ifv->fail->patId != 0
andv->patId == 0
, thenv->patId = v->fail->patId
. - Enqueue
v->lchild
,v->next
,v->rchild
if non-null.
- For its
- Continue until queue empties.
Code Snippet
void ac_implement(AC_STRUCT *ac) {
Queue q = que_init();
TSNode *r = ac->root->next; // first level
if (r) enqueue(&q, r);
// BFS over TST
while (!que_empty(&q)) {
TSNode *u = dequeue(&q);
for (TSNode **child = &u->lchild; child <= &u->rchild; ++child) {
TSNode *v = *child;
if (!v) continue;
// compute v->fail
TSNode *f = u->fail;
while (f && v->data != f->data) {
// follow fail or split branching
if (v->data < f->data) f = f->lchild;
else if (v->data > f->data) f = f->rchild;
else break;
}
v->fail = (f ? f->next : ac->root->next);
// inherit output
if (!v->patId && v->fail->patId) v->patId = v->fail->patId;
enqueue(&q, v);
}
}
que_free(&q);
}
Search Routine (ac_search)
Walk the input byte by byte, maintaining a pointer u
into the TST:
- While
u
and input bytec
doesn’t matchu->data
, setu = u->fail
. - If
c == u->data
, advanceu = u->next
; elseu = ac->root->next
. - If
u->patId != 0
, report a match at the current position.
Basic Usage
void ac_search(AC_STRUCT *ac, const unsigned char *text, int N) {
TSNode *u = ac->root->next;
for (int i = 0; i < N; ++i) {
unsigned char c = text[i];
while (u && c != u->data) u = u->fail;
if (u) u = u->next;
else u = ac->root->next;
if (u && u->patId) {
printf("Match id=%d at pos=%d\n", u->patId, i);
}
}
}
Performance & Optimization
• Node allocation: replace malloc
per node with a pre-allocated pool for cache locality.
• Queue: use a ring buffer of size ≈ total nodes to avoid dynamic resizing.
• Character set: if alphabet is small (ASCII), you can flatten TST into a fixed 2D array to remove lchild
/rchild
branches.
• Inline hot paths: mark ac_search
and failure resolution functions static inline
for speed.
• Bulk insertion: sort patterns by length once, then call ac_add_string
in order.
Example: Simple Node Pool
#define NODE_POOL_SIZE 1000000
static TSNode nodePool[NODE_POOL_SIZE];
static int poolIdx = 0;
TSNode *alloc_node(unsigned char c) {
if (poolIdx >= NODE_POOL_SIZE) return NULL;
TSNode *n = &nodePool[poolIdx++];
memset(n, 0, sizeof(*n));
n->data = c;
return n;
}
// In ac_add_string, replace malloc with alloc_node().
Extending & Customizing
• Dynamic pattern updates: buffer new patterns, call a partial ac_implement()
only over new nodes.
• Case-insensitive matching: normalize input and patterns to lowercase before insertion/search.
• Unicode support: encode UTF-8 code points into unique byte sequences or switch to wchar_t
.
• Memory cleanup: write a recursive free for TST nodes if you rebuild frequently.
These notes should guide you in modifying or optimizing the TST-based Aho–Corasick implementation in this repo.