Chat about this codebase

AI-powered code exploration

Online

Project Overview

This project implements an Aho–Corasick automaton on top of a Ternary Search Trie (TST) to provide fast, memory-efficient multi-pattern string matching. You can use it as a standalone command-line tool or embed the library in C++ applications.

Key Features

  • Ternary Search Trie node layout for compact storage
  • Aho–Corasick failure-link computation for linear-time matching
  • Bulk pattern insertion with addPattern()
  • Search API returning pattern occurrences with positions
  • Debug utilities: trie traversal and queue operations

When to Use

  • Matching thousands of patterns against large texts
  • Memory-constrained environments (embedded or mobile)
  • Real-time log scanning, intrusion detection, bioinformatics
  • Embedding high-performance search in C++ services

Command-Line Quick Start

Compile and run the AC-TernarySearchTrie program:

g++ -std=c++11 main.cpp -o ac_tst
./ac_tst string.txt pattern.txt result.txt
# - string.txt  : text to search
# - pattern.txt : one pattern per line
# - result.txt  : matches in "position:pattern" format

Embedding the Library

Include the header and link your code against the compiled library:

#include "ternarySearchTrie.h"
#include <fstream>
#include <iostream>

int main() {
  TernarySearchTrie trie;
  // Load patterns
  std::ifstream pfile("pattern.txt");
  std::string pat;
  while (std::getline(pfile, pat))
    trie.addPattern(pat);
  trie.buildFailureLinks();

  // Read input text
  std::ifstream tfile("string.txt");
  std::string text((std::istreambuf_iterator<char>(tfile)),
                    std::istreambuf_iterator<char>());

  // Search and print matches
  auto results = trie.search(text);
  for (auto &match : results)
    std::cout << "Pos " << match.first << ": " << match.second << "\n";
  return 0;
}

Use this API for seamless integration of high-throughput multi-pattern matching in your C++ projects.

Quick Start & CLI Usage

Get a working executable and reproduce the sample run in README.md in three steps: clone, build, run.

Prerequisites

  • GCC or Clang
  • Unix-style shell (bash, zsh, etc.)

1. Clone the Repository

git clone https://github.com/ChenyuGao/AC-TernarySearchTrie.git
cd AC-TernarySearchTrie

2. Build the Executable

Compile ternarySearchTrie.c into str with optimizations enabled:

gcc -O2 -o str ternarySearchTrie.c

If you prefer debug symbols:

gcc -g -O0 -o str ternarySearchTrie.c

3. Command-Line Usage

./str <text-file> <pattern-file> <result-file>

Arguments
<text-file> Path to the input text (e.g., string.txt)
<pattern-file> One pattern per line (e.g., pattern.txt)
<result-file> Output file for matches (e.g., result.txt)

Sample Run

Assuming string.txt and pattern.txt live in the repo root:

# Run multi-pattern search
./str string.txt pattern.txt result.txt

# Inspect results
cat result.txt

Expected result.txt format:

pattern1: 15
pattern2: 42
pattern1: 78
...

Practical Tips

  • The program overwrites <result-file>; back up existing files if needed.
  • For very large inputs or many patterns, monitor memory/CPU:
    time ./str large.txt patterns.txt results.txt 2> performance.log
    
  • Pre-sort or dedupe your patterns to reduce redundant matches.
  • Ensure all input files use UTF-8 or ASCII encoding.

That’s it—now you can build and run the AC-TernarySearchTrie CLI for fast multi-pattern searches.

API Reference

Brief overview of functions to build, query, and free the Aho-Corasick ternary search trie in C.

ac_alloc: Initialize Automaton

Allocates and initializes the AC_STRUCT, setting up an empty ternary search trie and pattern storage.

Signature
AC_STRUCT *ac_alloc(void);

Return Value
• Pointer to a new AC_STRUCT on success
• NULL on allocation failure

Example

#include "ternarySearchTrie.h"

int main() {
    AC_STRUCT *ac = ac_alloc();
    if (!ac) {
        fprintf(stderr, "Error: cannot allocate AC_STRUCT\n");
        return 1;
    }
    // use ac...
    ac_free(ac);
    return 0;
}

ac_add_string: Insert Pattern into Trie

Registers a pattern in the trie, tags its end node with id, and stores it in Patterns[id].

Signature
int ac_add_string(AC_STRUCT *ac, char *P, int M, int id);

Parameters
• ac – Pointer from ac_alloc()
• P – Null-terminated pattern string
• M – Length of P (≤ max_pattern)
• id – Unique pattern identifier (0 ≤ id < maxp_num)

Return Value
• 1 on success
• 0 on failure (duplicate id, M too large, or out of memory)

How It Works

  1. Traverse from root, comparing P[i] to node->data:
    < → lchild
    > → rchild
    == → next, i++
  2. On final character, set node->ID = id and copy P into Patterns[id].
  3. Increment ac->pNum.

Example

#include <string.h>
#include "ternarySearchTrie.h"

void add_patterns(AC_STRUCT *ac) {
    const char *patterns[] = {"he", "she", "his", "hers"};
    int n = sizeof(patterns)/sizeof(patterns[0]);
    for (int i = 0; i < n; i++) {
        int len = strlen(patterns[i]);
        if (len >= max_pattern) continue;
        if (!ac_add_string(ac, (char*)patterns[i], len, i)) {
            fprintf(stderr, "ac_add_string failed for id=%d\n", i);
        }
    }
}

ac_implement: Build Failure Links

Constructs failure and output pointers for the AC automaton to enable linear-time search.

Signature
int ac_implement(AC_STRUCT *ac);

Return Value
1 on success

Usage
Call after inserting all patterns and before any search.

Example

// After all ac_add_string calls
if (!ac_implement(ac)) {
    fprintf(stderr, "Error building failure links\n");
    ac_free(ac);
    exit(1);
}

search_init: Prepare for Search

Resets automaton state to start scanning a new text.

Signature
void search_init(AC_STRUCT *ac);

Usage
Call before each new call to ac_search if you scan multiple texts.

ac_search: Scan Text for Patterns

Finds all pattern occurrences in a text buffer and invokes a callback on each match.

Signature
void ac_search(AC_STRUCT *ac, char *T, int N,
void (*callback)(int id, int end_pos));

Parameters
• ac – Initialized AC_STRUCT with built failure links
• T, N – Text buffer and its length
• callback – Invoked per match with pattern id and ending position

Example

#include <stdio.h>
#include "ternarySearchTrie.h"

void report_match(int id, int end_pos) {
    // Patterns[id] contains the matched string
    printf("Match: \"%s\" at position %d\n",
           Patterns[id], end_pos - strlen(Patterns[id]) + 1);
}

int main() {
    AC_STRUCT *ac = ac_alloc();
    add_patterns(ac);
    ac_implement(ac);
    const char *text = "ushers";
    search_init(ac);
    ac_search(ac, (char*)text, strlen(text), report_match);
    ac_free(ac);
    return 0;
}

ac_free: Release Resources

Frees all memory associated with the automaton, including the trie, queues, and pattern storage.

Signature
void ac_free(AC_STRUCT *ac);

Usage
Call after all searches to avoid memory leaks.

Algorithm & Implementation Notes

This section explains how the Aho–Corasick automaton uses a Ternary Search Trie (TST) for multi-pattern matching. You’ll see how patterns flow through insertion, failure-link construction, and search, plus tips to tune or extend the implementation.

Data Structures

TSNode (ternarySearchTrie.h)

Each trie node stores one byte (data), an output ID (patId), three child pointers, and a failure link:

typedef struct TSNode {
    unsigned char data;      // current character
    int patId;               // nonzero if a pattern ends here
    struct TSNode *lchild;   // < data
    struct TSNode *next;     // == data (advance in pattern)
    struct TSNode *rchild;   // > data
    struct TSNode *fail;     // Aho–Corasick fallback
} TSNode;

AC_STRUCT

Holds the root TST pointer and pattern count:

typedef struct {
    TSNode *root;
    int pNum;        // number of patterns added
} AC_STRUCT;

Pattern Insertion (ac_add_string)

  • Walks the TST comparing each byte to node->data.
  • Uses lchild/rchild on </> and next on ==.
  • Allocates new nodes when no match exists.
  • Marks the terminal node’s patId = id.
  • Constraint: call in increasing length or sort first.

Failure Link Construction (ac_implement)

Builds failure links via breadth-first traversal:

  1. Initialize a queue with all non-null children of root->next.
  2. For each dequeued node u:
    • For its next child v, set v->fail = follow u->fail via the same byte transition (fall back until match or root).
    • Propagate patId: if v->fail->patId != 0 and v->patId == 0, then v->patId = v->fail->patId.
    • Enqueue v->lchild, v->next, v->rchild if non-null.
  3. Continue until queue empties.

Code Snippet

void ac_implement(AC_STRUCT *ac) {
    Queue q = que_init();
    TSNode *r = ac->root->next;  // first level
    if (r) enqueue(&q, r);
    // BFS over TST
    while (!que_empty(&q)) {
        TSNode *u = dequeue(&q);
        for (TSNode **child = &u->lchild; child <= &u->rchild; ++child) {
            TSNode *v = *child;
            if (!v) continue;
            // compute v->fail
            TSNode *f = u->fail;
            while (f && v->data != f->data) {
                // follow fail or split branching
                if (v->data < f->data) f = f->lchild;
                else if (v->data > f->data) f = f->rchild;
                else break;
            }
            v->fail = (f ? f->next : ac->root->next);
            // inherit output
            if (!v->patId && v->fail->patId) v->patId = v->fail->patId;
            enqueue(&q, v);
        }
    }
    que_free(&q);
}

Search Routine (ac_search)

Walk the input byte by byte, maintaining a pointer u into the TST:

  1. While u and input byte c doesn’t match u->data, set u = u->fail.
  2. If c == u->data, advance u = u->next; else u = ac->root->next.
  3. If u->patId != 0, report a match at the current position.

Basic Usage

void ac_search(AC_STRUCT *ac, const unsigned char *text, int N) {
    TSNode *u = ac->root->next;
    for (int i = 0; i < N; ++i) {
        unsigned char c = text[i];
        while (u && c != u->data) u = u->fail;
        if (u) u = u->next;
        else u = ac->root->next;
        if (u && u->patId) {
            printf("Match id=%d at pos=%d\n", u->patId, i);
        }
    }
}

Performance & Optimization

• Node allocation: replace malloc per node with a pre-allocated pool for cache locality.
• Queue: use a ring buffer of size ≈ total nodes to avoid dynamic resizing.
• Character set: if alphabet is small (ASCII), you can flatten TST into a fixed 2D array to remove lchild/rchild branches.
• Inline hot paths: mark ac_search and failure resolution functions static inline for speed.
• Bulk insertion: sort patterns by length once, then call ac_add_string in order.

Example: Simple Node Pool

#define NODE_POOL_SIZE 1000000
static TSNode nodePool[NODE_POOL_SIZE];
static int poolIdx = 0;

TSNode *alloc_node(unsigned char c) {
    if (poolIdx >= NODE_POOL_SIZE) return NULL;
    TSNode *n = &nodePool[poolIdx++];
    memset(n, 0, sizeof(*n));
    n->data = c;
    return n;
}
// In ac_add_string, replace malloc with alloc_node().

Extending & Customizing

• Dynamic pattern updates: buffer new patterns, call a partial ac_implement() only over new nodes.
• Case-insensitive matching: normalize input and patterns to lowercase before insertion/search.
• Unicode support: encode UTF-8 code points into unique byte sequences or switch to wchar_t.
• Memory cleanup: write a recursive free for TST nodes if you rebuild frequently.

These notes should guide you in modifying or optimizing the TST-based Aho–Corasick implementation in this repo.