ganesh-k13/Btree - GitHub Documentation Generator

Project Overview

This repository delivers a high-performance, disk-based B-Tree in C following CLRS. It supports building, inserting, deleting, and searching records on datasets that exceed main memory. Developers interact via a C API (header files), command-line tools, and optional Python scripts for benchmarking and tuning.

Why This Project Exists

• Persist large sorted indexes efficiently on disk
• Minimize I/O by aligning nodes to file-system pages
• Offer a drop-in C library for applications requiring on-disk key-value storage
• Provide a framework for experimenting with disk-based data structures

Core Components

• C library (src/*.c, include/btree.h)
– btree_create, btree_open, btree_insert, btree_delete, btree_search, btree_close
• Command-line tools (tools/)
– btree_build: bulk-load from text/csv
– btree_query: lookup or range scan
– btree_delete: remove keys
• Python scripts (tester.py, bench.py)
– Automate rebuild & benchmark across degree (t) values
– Plot build/search/delete performance

Typical Use-Cases

1. Embedding a Persistent B-Tree in C

#include "btree.h"
#include <stdio.h>

int main(void) {
    // Create or open tree file on disk with minimum degree t=32
    BTree *tree = btree_open("data/tree.idx", 32, sizeof(MyRecord));
    if (!tree) {
        perror("btree_open");
        return 1;
    }

    // Insert a record
    MyRecord r = { .id = 42, .value = "Hello, B-Tree" };
    if (btree_insert(tree, &r.id, &r) != 0) {
        fprintf(stderr, "Insert failed\n");
    }

    // Search for a key
    MyRecord found;
    if (btree_search(tree, &r.id, &found) == 0) {
        printf("Found record: id=%d, value=%s\n", found.id, found.value);
    } else {
        printf("Key not found\n");
    }

    // Close and flush to disk
    btree_close(tree);
    return 0;
}

2. Building and Querying via CLI

# Bulk-load 1M records into a B-Tree with t=32
./tools/btree_build --degree 32 --record-size 128 data/records.txt data/tree.idx

# Lookup a single key
./tools/btree_query --key 12345 data/tree.idx

# Delete a key
./tools/btree_delete --key 12345 data/tree.idx

3. Running Disk-Based Experiments

# Benchmark t in {4,8,16,…,512} on 1M records and plot results
python3 tester.py \
  --input data/records.txt \
  --output results/bench.png \
  --min-degree 4 \
  --max-degree 512 \
  --steps 8 \
  --records 1000000

This script recompiles the core for each t, measures build/search/delete times, and generates performance charts.

Integration Tips

• Align node size to your file-system page (e.g., 4 KB) by choosing t accordingly
• Recompile with make T=<degree> to embed t at compile time
• Adjust record size in btree_open or CLI to match your data structure
• Use Python scripts to fine-tune t when record or key types change

Getting Started

Follow these steps to obtain the code, install dependencies, build the C library/tests, and verify the setup on Linux or macOS.

1. Clone the Repository

Open a terminal and run:

git clone https://github.com/ganesh-k13/Btree.git
cd Btree

2. Install Prerequisites

On Ubuntu/Debian

sudo apt update
sudo apt install -y build-essential make python3 python3-pip

On macOS (with Homebrew)

brew update
brew install gcc make python3

Python Packages

Install fixed-version requirements:

pip3 install --user -r requirements.txt

Verify:

python3 -c "import matplotlib, pytest; print('Python OK')"

3. Build the C Library and Test Harness

The Makefile auto-selects clang or gcc. You must supply T (B-tree minimum degree).

# Example: build with t=4 (max 7 keys/node)
make T=4

This produces the test executable in the project root.

Direct Compilation (Alternative)

gcc -std=c11 -O2 -Iinclude \
    test.c src/Btree.c src/Node.c src/utils.c \
    -o bt_test

4. Verify the Build

Quick Smoke Test

./test -b 1000
# → "Tree built in: XX.XXXX ms"

Build + Search

./test -sb 10000 42
# → build + search times, confirms functionality

Run Python Automated Tests

pytest --maxfail=1 --disable-warnings -q

5. Clean Up and Reset Data

Remove object files & executable:

make clean

Reset on-disk tree file:

make remdat
# deletes *.dat and creates an empty tree.dat

Now you’re ready to develop, test, and benchmark the B-Tree implementation.

Core Concepts & Architecture

This section describes the core data structures, on-disk layout, and metadata organization of the disk-based B-Tree. Understanding these concepts helps you choose the right degree t, estimate node sizes for I/O, and manage the meta.dat file before using the high-level APIs.

1. B-Tree Basics

A B-Tree of minimum degree t (as in CLRS) satisfies:

Every node (except root) has between t−1 and 2t−1 keys.
A non-leaf node with n keys has n+1 children.
All leaves appear on the same level.

Choosing t affects:

Node size on disk (sizeof(Node)).
Tree height and number of I/O's per operation.
Cache and page utilization.

2. Node Structure (`include/Node.h`)

Define t at compile time:

// Choose minimum degree, e.g., 32
#define t 32
#include "Node.h"

Key types and layout:

// Record stored in each slot
typedef struct Data {
    int   key;
    char  country[5];
    char  Grate[10];
    int   Score;
    int   Rate;
} Data;

// On-disk / in-memory node
typedef struct Node {
    bool    isLeaf;           // true ⇒ no children
    int     pos;              // page index or byte‐offset
    int     n;                // number of keys
    Data    records[2*t - 1]; // max 2t−1
    int     children[2*t];    // max 2t child positions
} Node;

Guidelines:

pos indexes the node’s slot in tree.dat: offset = pos * sizeof(Node).
Unused children[i] must be set to -1.
n tracks valid records[0..n-1].

3. On-Disk Layout

We use two files under a tree directory:

tree.dat
- Fixed‐size pages: each sizeof(Node) bytes.
- Node at pos reads/writes at offset pos * sizeof(Node).

meta.dat

Stores B-Tree metadata: degree t, root_pos, next_pos.

Layout (binary):

typedef struct {
    int t;          // minimum degree
    int root_pos;   // position of root node
    int next_pos;   // next free page index
} Meta;

4. Metadata Management

Use utility routines from include/utils.h:

// Initialize a new B-Tree on disk
Btree* btree_open(const char *dir, int degree) {
    // create directory if needed
    // open tree.dat and meta.dat
    Btree *tree = malloc(sizeof(*tree));
    tree->fp      = fopen(dir/tree.dat, "w+b");
    tree->meta_fp = fopen(dir/meta.dat, "w+b");
    tree->t       = degree;
    tree->root    = -1;       // empty tree
    tree->next_pos= 0;
    // write initial Meta
    Meta m = { degree, tree->root, tree->next_pos };
    fwrite(&m, sizeof(Meta), 1, tree->meta_fp);
    fflush(tree->meta_fp);
    return tree;
}

// Load existing B-Tree
Btree* btree_load(const char *dir) {
    Btree *tree = malloc(sizeof(*tree));
    tree->fp      = fopen(dir/tree.dat, "r+b");
    tree->meta_fp = fopen(dir/meta.dat, "r+b");
    Meta m;
    fread(&m, sizeof(Meta), 1, tree->meta_fp);
    tree->t       = m.t;
    tree->root    = m.root_pos;
    tree->next_pos= m.next_pos;
    return tree;
}

// Persist updated Meta
void meta_write(Btree *tree) {
    Meta m = { tree->t, tree->root, tree->next_pos };
    fseek(tree->meta_fp, 0, SEEK_SET);
    fwrite(&m, sizeof(Meta), 1, tree->meta_fp);
    fflush(tree->meta_fp);
}

5. Node I/O (`include/utils.h`)

Centralize reads/writes to maintain consistency:

// Write a Node to disk at its pos (assign if -1)
void write_node(Btree *tree, Node *node) {
    if (node->pos < 0) {
        node->pos = tree->next_pos++;
        meta_write(tree);
    }
    fseek(tree->fp, node->pos * sizeof(Node), SEEK_SET);
    fwrite(node, sizeof(Node), 1, tree->fp);
}

// Read a Node from disk by pos
void read_node(Btree *tree, Node *node, int pos) {
    fseek(tree->fp, pos * sizeof(Node), SEEK_SET);
    fread(node, sizeof(Node), 1, tree->fp);
}

6. Node Initialization

Always initialize child pointers to -1 and n=0 before first write:

// Prepare a fresh node in memory
void node_init(Node *node, bool isLeaf, Btree *tree) {
    node->isLeaf = isLeaf;
    node->n      = 0;
    node->pos    = -1;           // assign on write_node
    for (int i = 0; i < 2*t; i++)
        node->children[i] = -1;
}

7. Practical Guidelines

Keep t consistent across all compilation units to avoid corrupt layout.
Estimate node size: sizeof(Node) ≈ 1 + 4 + 4 + (2t−1)*sizeof(Data) + 2t*4 bytes.
Align sizeof(Node) to your OS page size (common: 4 KB, 8 KB) for optimal I/O.
After split or merge operations, always call write_node for any modified node.
Use meta_write whenever you change root or next_pos in Btree.

Understanding these core structures and layout conventions ensures you can extend insertion/deletion policies, customize record types, and optimize performance based on Report.pdf’s analyses of node sizing and disk access patterns.

C API Reference & Usage

This section outlines the public C interface for persisting and querying large key-value datasets using a disk-backed B-Tree. It covers initialization, insertion, search, traversal, deletion, and utility routines.

B-Tree Initialization and Teardown

Functions

• Btree* BTree_init(const char* filename, bool load_existing)
• void BTree_destroy(Btree* tree)

Overview

On creation (load_existing == false), initializes metadata and empty root.
On load (load_existing == true), reads existing metadata and prepares file for I/O.

Example

#include "Btree.h"

int main() {
    // Create a new tree (overwrites if exists)
    Btree* tree = BTree_init("data.idx", false);
    if (!tree) return 1;

    // ... use tree ...

    BTree_destroy(tree);
    return 0;
}

Record Insertion

Function

• void insert(Btree* tree, Data* record)

Usage

Fill a Data record (see Utility Functions).
Call insert(tree, &record); splits nodes as needed.

Example

#include "Btree.h"
#include "utils.h"

int main() {
    Btree* tree = BTree_init("data.idx", false);

    Data rec;
    enter_data(&rec, 42, "Answer to Life");
    insert(tree, &rec);

    BTree_destroy(tree);
    return 0;
}

Key Search

Function

• Data* search(Btree* tree, int key)

Usage

Returns pointer to Data if found, otherwise NULL.

Example

#include <stdio.h>
#include "Btree.h"

int main() {
    Btree* tree = BTree_init("data.idx", true);
    int key = 42;
    Data* result = search(tree, key);
    if (result)
        printf("Found key=%d payload=\"%s\"\n", result->key, result->payload);
    else
        printf("Key %d not found\n", key);
    BTree_destroy(tree);
    return 0;
}

Tree Traversal

Function

• void traverse(Btree* tree, int pos)

Usage

In-order prints all keys/payloads starting at node position pos.
Call with tree->root to dump entire tree.

Example

#include "Btree.h"

int main() {
    Btree* tree = BTree_init("data.idx", true);
    traverse(tree, tree->root);
    BTree_destroy(tree);
    return 0;
}

Record Deletion

Function

• void removeNode(Btree* tree, int key)

Usage

Deletes key, rebalances nodes.
If key not present, no change.

Example

#include "Btree.h"

int main() {
    Btree* tree = BTree_init("data.idx", true);
    removeNode(tree, 42);
    traverse(tree, tree->root);
    BTree_destroy(tree);
    return 0;
}

Utility Functions

Include "utils.h" to manage node I/O, data entry, CSV loading, and timing.

Node Initialization & Data Entry

void node_init(Node* node, bool isLeaf, Btree* tree);
void enter_data(Data* rec, int key, const char* payload);

node_init sets node->pos = tree->next_pos++ and isLeaf.
enter_data fills rec->key and copies payload.

Disk I/O

void write_file(Btree* tree, Node* node, int pos);
void read_file (Btree* tree, Node* node, int pos);

Use pos = -1 in write_file to auto-assign a new slot (tree->next_pos).

CSV Bulk Load

void load_csv(Btree* tree, const char* csv_path);

Reads key,payload lines and calls insert for each record.

Timing

double get_time(void);

Returns current time in seconds (wall-clock) for performance measurements.

Bulk-Load Example with Timing

#include <stdio.h>
#include "Btree.h"
#include "utils.h"

int main() {
    Btree* tree = BTree_init("bulk.idx", false);
    double start = get_time();
    load_csv(tree, "data.csv");
    double elapsed = get_time() - start;
    printf("Bulk insert completed in %.3f seconds\n", elapsed);

    traverse(tree, tree->root);
    BTree_destroy(tree);
    return 0;
}
## CLI & Benchmark Tools

This section describes the ready-made C-based CLI harness (`test.c`) for B-tree operations and a Python toolkit (`tests/utils.py`) for large-scale benchmarking and visualization.

### test.c: Command-Line Test Harness for B-tree Operations

#### Purpose  
Compile and run performance tests (build, search, delete) on your B-tree using a simple CLI.

#### Compilation  
```bash
gcc -o btree_test test.c src/Node.c src/Btree.c src/utils.c -Iinclude -lrt

(Link with -lrt for clock_gettime.)

Available Flags

-b <N>
Build a new B-tree from the first N records in tmp/dataset.csv.
• Pages → tree.dat
• Metadata → meta.dat
• Prints build time (ms).
-s <key>
Load tree.dat + meta.dat, search for integer <key>.
• Prints matching record
• Prints search time (ms).
-sb <N> <key>
Build (as in -b), then search <key>.
• Prints both build and search times.
-d <key>
Load tree, print node for <key>, delete it, then print status and deletion time (ms).
-t <N> <key>
Debug stub (prints node size).

Examples

Build a 100 000-record tree:

./btree_test -b 100000

Search key 42:

./btree_test -s 42

Build + search (100k records, key 1234):

./btree_test -sb 100000 1234

Delete key 77:

./btree_test -d 77

Expected Output Patterns

Build:
```
Tree built in: 234.567890 ms
```

Search:

[record fields…]
Result found in: 0.012345 ms

Delete:

Node to delete:
[record fields…]
Delete successful
Result found in: 0.009876 ms

Practical Guidance

Ensure tmp/dataset.csv has ≥ N lines.
Use BTree_init(path, false) to create, true to load an existing tree.
After -b or -sb, meta.dat is written; subsequent -s/-d rely on it.
Extend tests by modifying run_tests() in test.c to handle new flags.

Performance Testing Utilities (tests/utils.py)

Purpose

Automate compilations with different minimum degrees (T), measure build/search/delete latencies, and plot results.

Dependencies

Python 3.x
matplotlib
pytest (optional)
Install:

pip install -r requirements.txt

Key Functions

build_tree(t: int, records_num: int) -> float
- Runs make clean && make T=<t>
- Invokes ./btree_test -b <records_num>
- Returns build time (ms).
test_search(key: int) -> float
- Runs ./btree_test -s <key>
- Returns search time (ms).
test_delete(key: int) -> float
- Runs ./btree_test -d <key>
- Returns delete time (ms).

Example Snippet

import subprocess, re

def build_tree(t, records_num):
    subprocess.run(['make', 'clean'], check=True)
    subprocess.run(['make', f'T={t}'], check=True)
    p = subprocess.run(['./btree_test', '-b', str(records_num)],
                       stdout=subprocess.PIPE, check=True)
    ms = float(re.search(r'\d+\.\d+', p.stdout.decode())[0])
    return ms

def test_search(key):
    p = subprocess.run(['./btree_test', '-s', str(key)],
                       stdout=subprocess.PIPE, check=True)
    return float(re.search(r'\d+\.\d+', p.stdout.decode())[0])

Practical Usage

Configure parameters at the top of tests/utils.py:

t_values = [2**i for i in range(2, 12)]
records_num = 1000

Run the benchmarks and plot:
```
python tests/utils.py
```
- Compiles and builds for each T
- Executes 10 random searches/deletes per T
- Displays build, search-avg, delete-avg charts
Custom benchmarks
- Mirror test_search/test_delete patterns for insert (-i <key>)
- Integrate into pytest by wrapping calls and asserting max latencies

Tips

Ensure ./btree_test is executable and in project root.
Increase records_num or sample size to reduce noise.
In headless CI, add at top of tests/utils.py:
```
import matplotlib
matplotlib.use('Agg')
```
Use pytest --maxfail=1 --disable-warnings -q for quick feedback.

Documentation

Contents

Quick Actions

Contents

Quick Actions

Chat about this codebase

Chat about this codebase

Project Overview

Why This Project Exists

Core Components

Typical Use-Cases

1. Embedding a Persistent B-Tree in C

2. Building and Querying via CLI

3. Running Disk-Based Experiments

Integration Tips

Getting Started

1. Clone the Repository

2. Install Prerequisites

On Ubuntu/Debian

On macOS (with Homebrew)

Python Packages

3. Build the C Library and Test Harness

Direct Compilation (Alternative)

4. Verify the Build

Quick Smoke Test

Build + Search

Run Python Automated Tests

5. Clean Up and Reset Data

Core Concepts & Architecture

1. B-Tree Basics

2. Node Structure (include/Node.h)

3. On-Disk Layout

4. Metadata Management

5. Node I/O (include/utils.h)

6. Node Initialization

7. Practical Guidelines

C API Reference & Usage

B-Tree Initialization and Teardown

Functions

Overview

Example

Record Insertion

Function

Usage

Example

Key Search

Function

Usage

Example

Tree Traversal

Function

Usage

Example

Record Deletion

Function

Usage

Example

Utility Functions

Node Initialization & Data Entry

Disk I/O

CSV Bulk Load

Timing

Bulk-Load Example with Timing

Available Flags

Examples

Expected Output Patterns

Practical Guidance

Performance Testing Utilities (tests/utils.py)

Purpose

Dependencies

Key Functions

Example Snippet

Practical Usage

Tips

2. Node Structure (`include/Node.h`)

5. Node I/O (`include/utils.h`)