Chat about this codebase

AI-powered code exploration

Online

Project Overview

This repository delivers a high-performance, disk-based B-Tree in C following CLRS. It supports building, inserting, deleting, and searching records on datasets that exceed main memory. Developers interact via a C API (header files), command-line tools, and optional Python scripts for benchmarking and tuning.

Why This Project Exists

• Persist large sorted indexes efficiently on disk
• Minimize I/O by aligning nodes to file-system pages
• Offer a drop-in C library for applications requiring on-disk key-value storage
• Provide a framework for experimenting with disk-based data structures

Core Components

• C library (src/*.c, include/btree.h)
– btree_create, btree_open, btree_insert, btree_delete, btree_search, btree_close
• Command-line tools (tools/)
– btree_build: bulk-load from text/csv
– btree_query: lookup or range scan
– btree_delete: remove keys
• Python scripts (tester.py, bench.py)
– Automate rebuild & benchmark across degree (t) values
– Plot build/search/delete performance

Typical Use-Cases

1. Embedding a Persistent B-Tree in C

#include "btree.h"
#include <stdio.h>

int main(void) {
    // Create or open tree file on disk with minimum degree t=32
    BTree *tree = btree_open("data/tree.idx", 32, sizeof(MyRecord));
    if (!tree) {
        perror("btree_open");
        return 1;
    }

    // Insert a record
    MyRecord r = { .id = 42, .value = "Hello, B-Tree" };
    if (btree_insert(tree, &r.id, &r) != 0) {
        fprintf(stderr, "Insert failed\n");
    }

    // Search for a key
    MyRecord found;
    if (btree_search(tree, &r.id, &found) == 0) {
        printf("Found record: id=%d, value=%s\n", found.id, found.value);
    } else {
        printf("Key not found\n");
    }

    // Close and flush to disk
    btree_close(tree);
    return 0;
}

2. Building and Querying via CLI

# Bulk-load 1M records into a B-Tree with t=32
./tools/btree_build --degree 32 --record-size 128 data/records.txt data/tree.idx

# Lookup a single key
./tools/btree_query --key 12345 data/tree.idx

# Delete a key
./tools/btree_delete --key 12345 data/tree.idx

3. Running Disk-Based Experiments

# Benchmark t in {4,8,16,…,512} on 1M records and plot results
python3 tester.py \
  --input data/records.txt \
  --output results/bench.png \
  --min-degree 4 \
  --max-degree 512 \
  --steps 8 \
  --records 1000000

This script recompiles the core for each t, measures build/search/delete times, and generates performance charts.

Integration Tips

• Align node size to your file-system page (e.g., 4 KB) by choosing t accordingly
• Recompile with make T=<degree> to embed t at compile time
• Adjust record size in btree_open or CLI to match your data structure
• Use Python scripts to fine-tune t when record or key types change

Getting Started

Follow these steps to obtain the code, install dependencies, build the C library/tests, and verify the setup on Linux or macOS.

1. Clone the Repository

Open a terminal and run:

git clone https://github.com/ganesh-k13/Btree.git
cd Btree

2. Install Prerequisites

On Ubuntu/Debian

sudo apt update
sudo apt install -y build-essential make python3 python3-pip

On macOS (with Homebrew)

brew update
brew install gcc make python3

Python Packages

Install fixed-version requirements:

pip3 install --user -r requirements.txt

Verify:

python3 -c "import matplotlib, pytest; print('Python OK')"

3. Build the C Library and Test Harness

The Makefile auto-selects clang or gcc. You must supply T (B-tree minimum degree).

# Example: build with t=4 (max 7 keys/node)
make T=4

This produces the test executable in the project root.

Direct Compilation (Alternative)

gcc -std=c11 -O2 -Iinclude \
    test.c src/Btree.c src/Node.c src/utils.c \
    -o bt_test

4. Verify the Build

Quick Smoke Test

./test -b 1000
# → "Tree built in: XX.XXXX ms"

Build + Search

./test -sb 10000 42
# → build + search times, confirms functionality

Run Python Automated Tests

pytest --maxfail=1 --disable-warnings -q

5. Clean Up and Reset Data

  • Remove object files & executable:
make clean
  • Reset on-disk tree file:
make remdat
# deletes *.dat and creates an empty tree.dat

Now you’re ready to develop, test, and benchmark the B-Tree implementation.

Core Concepts & Architecture

This section describes the core data structures, on-disk layout, and metadata organization of the disk-based B-Tree. Understanding these concepts helps you choose the right degree t, estimate node sizes for I/O, and manage the meta.dat file before using the high-level APIs.

1. B-Tree Basics

A B-Tree of minimum degree t (as in CLRS) satisfies:

  • Every node (except root) has between t−1 and 2t−1 keys.
  • A non-leaf node with n keys has n+1 children.
  • All leaves appear on the same level.

Choosing t affects:

  • Node size on disk (sizeof(Node)).
  • Tree height and number of I/O's per operation.
  • Cache and page utilization.

2. Node Structure (include/Node.h)

Define t at compile time:

// Choose minimum degree, e.g., 32
#define t 32
#include "Node.h"

Key types and layout:

// Record stored in each slot
typedef struct Data {
    int   key;
    char  country[5];
    char  Grate[10];
    int   Score;
    int   Rate;
} Data;

// On-disk / in-memory node
typedef struct Node {
    bool    isLeaf;           // true ⇒ no children
    int     pos;              // page index or byte‐offset
    int     n;                // number of keys
    Data    records[2*t - 1]; // max 2t−1
    int     children[2*t];    // max 2t child positions
} Node;

Guidelines:

  • pos indexes the node’s slot in tree.dat: offset = pos * sizeof(Node).
  • Unused children[i] must be set to -1.
  • n tracks valid records[0..n-1].

3. On-Disk Layout

We use two files under a tree directory:

  1. tree.dat

    • Fixed‐size pages: each sizeof(Node) bytes.
    • Node at pos reads/writes at offset pos * sizeof(Node).
  2. meta.dat

    • Stores B-Tree metadata: degree t, root_pos, next_pos.
    • Layout (binary):
      typedef struct {
          int t;          // minimum degree
          int root_pos;   // position of root node
          int next_pos;   // next free page index
      } Meta;
      

4. Metadata Management

Use utility routines from include/utils.h:

// Initialize a new B-Tree on disk
Btree* btree_open(const char *dir, int degree) {
    // create directory if needed
    // open tree.dat and meta.dat
    Btree *tree = malloc(sizeof(*tree));
    tree->fp      = fopen(dir/tree.dat, "w+b");
    tree->meta_fp = fopen(dir/meta.dat, "w+b");
    tree->t       = degree;
    tree->root    = -1;       // empty tree
    tree->next_pos= 0;
    // write initial Meta
    Meta m = { degree, tree->root, tree->next_pos };
    fwrite(&m, sizeof(Meta), 1, tree->meta_fp);
    fflush(tree->meta_fp);
    return tree;
}

// Load existing B-Tree
Btree* btree_load(const char *dir) {
    Btree *tree = malloc(sizeof(*tree));
    tree->fp      = fopen(dir/tree.dat, "r+b");
    tree->meta_fp = fopen(dir/meta.dat, "r+b");
    Meta m;
    fread(&m, sizeof(Meta), 1, tree->meta_fp);
    tree->t       = m.t;
    tree->root    = m.root_pos;
    tree->next_pos= m.next_pos;
    return tree;
}

// Persist updated Meta
void meta_write(Btree *tree) {
    Meta m = { tree->t, tree->root, tree->next_pos };
    fseek(tree->meta_fp, 0, SEEK_SET);
    fwrite(&m, sizeof(Meta), 1, tree->meta_fp);
    fflush(tree->meta_fp);
}

5. Node I/O (include/utils.h)

Centralize reads/writes to maintain consistency:

// Write a Node to disk at its pos (assign if -1)
void write_node(Btree *tree, Node *node) {
    if (node->pos < 0) {
        node->pos = tree->next_pos++;
        meta_write(tree);
    }
    fseek(tree->fp, node->pos * sizeof(Node), SEEK_SET);
    fwrite(node, sizeof(Node), 1, tree->fp);
}

// Read a Node from disk by pos
void read_node(Btree *tree, Node *node, int pos) {
    fseek(tree->fp, pos * sizeof(Node), SEEK_SET);
    fread(node, sizeof(Node), 1, tree->fp);
}

6. Node Initialization

Always initialize child pointers to -1 and n=0 before first write:

// Prepare a fresh node in memory
void node_init(Node *node, bool isLeaf, Btree *tree) {
    node->isLeaf = isLeaf;
    node->n      = 0;
    node->pos    = -1;           // assign on write_node
    for (int i = 0; i < 2*t; i++)
        node->children[i] = -1;
}

7. Practical Guidelines

  • Keep t consistent across all compilation units to avoid corrupt layout.
  • Estimate node size: sizeof(Node) ≈ 1 + 4 + 4 + (2t−1)*sizeof(Data) + 2t*4 bytes.
  • Align sizeof(Node) to your OS page size (common: 4 KB, 8 KB) for optimal I/O.
  • After split or merge operations, always call write_node for any modified node.
  • Use meta_write whenever you change root or next_pos in Btree.

Understanding these core structures and layout conventions ensures you can extend insertion/deletion policies, customize record types, and optimize performance based on Report.pdf’s analyses of node sizing and disk access patterns.

C API Reference & Usage

This section outlines the public C interface for persisting and querying large key-value datasets using a disk-backed B-Tree. It covers initialization, insertion, search, traversal, deletion, and utility routines.

B-Tree Initialization and Teardown

Functions

Btree* BTree_init(const char* filename, bool load_existing)
void BTree_destroy(Btree* tree)

Overview

  • On creation (load_existing == false), initializes metadata and empty root.
  • On load (load_existing == true), reads existing metadata and prepares file for I/O.

Example

#include "Btree.h"

int main() {
    // Create a new tree (overwrites if exists)
    Btree* tree = BTree_init("data.idx", false);
    if (!tree) return 1;

    // ... use tree ...

    BTree_destroy(tree);
    return 0;
}

Record Insertion

Function

void insert(Btree* tree, Data* record)

Usage

  1. Fill a Data record (see Utility Functions).
  2. Call insert(tree, &record); splits nodes as needed.

Example

#include "Btree.h"
#include "utils.h"

int main() {
    Btree* tree = BTree_init("data.idx", false);

    Data rec;
    enter_data(&rec, 42, "Answer to Life");
    insert(tree, &rec);

    BTree_destroy(tree);
    return 0;
}

Key Search

Function

Data* search(Btree* tree, int key)

Usage

  • Returns pointer to Data if found, otherwise NULL.

Example

#include <stdio.h>
#include "Btree.h"

int main() {
    Btree* tree = BTree_init("data.idx", true);
    int key = 42;
    Data* result = search(tree, key);
    if (result)
        printf("Found key=%d payload=\"%s\"\n", result->key, result->payload);
    else
        printf("Key %d not found\n", key);
    BTree_destroy(tree);
    return 0;
}

Tree Traversal

Function

void traverse(Btree* tree, int pos)

Usage

  • In-order prints all keys/payloads starting at node position pos.
  • Call with tree->root to dump entire tree.

Example

#include "Btree.h"

int main() {
    Btree* tree = BTree_init("data.idx", true);
    traverse(tree, tree->root);
    BTree_destroy(tree);
    return 0;
}

Record Deletion

Function

void removeNode(Btree* tree, int key)

Usage

  • Deletes key, rebalances nodes.
  • If key not present, no change.

Example

#include "Btree.h"

int main() {
    Btree* tree = BTree_init("data.idx", true);
    removeNode(tree, 42);
    traverse(tree, tree->root);
    BTree_destroy(tree);
    return 0;
}

Utility Functions

Include "utils.h" to manage node I/O, data entry, CSV loading, and timing.

Node Initialization & Data Entry

void node_init(Node* node, bool isLeaf, Btree* tree);
void enter_data(Data* rec, int key, const char* payload);
  • node_init sets node->pos = tree->next_pos++ and isLeaf.
  • enter_data fills rec->key and copies payload.

Disk I/O

void write_file(Btree* tree, Node* node, int pos);
void read_file (Btree* tree, Node* node, int pos);
  • Use pos = -1 in write_file to auto-assign a new slot (tree->next_pos).

CSV Bulk Load

void load_csv(Btree* tree, const char* csv_path);
  • Reads key,payload lines and calls insert for each record.

Timing

double get_time(void);
  • Returns current time in seconds (wall-clock) for performance measurements.

Bulk-Load Example with Timing

#include <stdio.h>
#include "Btree.h"
#include "utils.h"

int main() {
    Btree* tree = BTree_init("bulk.idx", false);
    double start = get_time();
    load_csv(tree, "data.csv");
    double elapsed = get_time() - start;
    printf("Bulk insert completed in %.3f seconds\n", elapsed);

    traverse(tree, tree->root);
    BTree_destroy(tree);
    return 0;
}
## CLI & Benchmark Tools

This section describes the ready-made C-based CLI harness (`test.c`) for B-tree operations and a Python toolkit (`tests/utils.py`) for large-scale benchmarking and visualization.

### test.c: Command-Line Test Harness for B-tree Operations

#### Purpose  
Compile and run performance tests (build, search, delete) on your B-tree using a simple CLI.

#### Compilation  
```bash
gcc -o btree_test test.c src/Node.c src/Btree.c src/utils.c -Iinclude -lrt

(Link with -lrt for clock_gettime.)

Available Flags

  • -b <N>
    Build a new B-tree from the first N records in tmp/dataset.csv.
    • Pages → tree.dat
    • Metadata → meta.dat
    • Prints build time (ms).

  • -s <key>
    Load tree.dat + meta.dat, search for integer <key>.
    • Prints matching record
    • Prints search time (ms).

  • -sb <N> <key>
    Build (as in -b), then search <key>.
    • Prints both build and search times.

  • -d <key>
    Load tree, print node for <key>, delete it, then print status and deletion time (ms).

  • -t <N> <key>
    Debug stub (prints node size).

Examples

Build a 100 000-record tree:

./btree_test -b 100000

Search key 42:

./btree_test -s 42

Build + search (100k records, key 1234):

./btree_test -sb 100000 1234

Delete key 77:

./btree_test -d 77

Expected Output Patterns

  • Build:
    Tree built in: 234.567890 ms
    
  • Search:
    [record fields…]
    Result found in: 0.012345 ms
    
  • Delete:
    Node to delete:
    [record fields…]
    Delete successful
    Result found in: 0.009876 ms
    

Practical Guidance

  • Ensure tmp/dataset.csv has ≥ N lines.
  • Use BTree_init(path, false) to create, true to load an existing tree.
  • After -b or -sb, meta.dat is written; subsequent -s/-d rely on it.
  • Extend tests by modifying run_tests() in test.c to handle new flags.

Performance Testing Utilities (tests/utils.py)

Purpose

Automate compilations with different minimum degrees (T), measure build/search/delete latencies, and plot results.

Dependencies

  • Python 3.x
  • matplotlib
  • pytest (optional)
    Install:
pip install -r requirements.txt

Key Functions

  • build_tree(t: int, records_num: int) -> float

    • Runs make clean && make T=<t>
    • Invokes ./btree_test -b <records_num>
    • Returns build time (ms).
  • test_search(key: int) -> float

    • Runs ./btree_test -s <key>
    • Returns search time (ms).
  • test_delete(key: int) -> float

    • Runs ./btree_test -d <key>
    • Returns delete time (ms).

Example Snippet

import subprocess, re

def build_tree(t, records_num):
    subprocess.run(['make', 'clean'], check=True)
    subprocess.run(['make', f'T={t}'], check=True)
    p = subprocess.run(['./btree_test', '-b', str(records_num)],
                       stdout=subprocess.PIPE, check=True)
    ms = float(re.search(r'\d+\.\d+', p.stdout.decode())[0])
    return ms

def test_search(key):
    p = subprocess.run(['./btree_test', '-s', str(key)],
                       stdout=subprocess.PIPE, check=True)
    return float(re.search(r'\d+\.\d+', p.stdout.decode())[0])

Practical Usage

  1. Configure parameters at the top of tests/utils.py:

    t_values = [2**i for i in range(2, 12)]
    records_num = 1000
    
  2. Run the benchmarks and plot:

    python tests/utils.py
    
    • Compiles and builds for each T
    • Executes 10 random searches/deletes per T
    • Displays build, search-avg, delete-avg charts
  3. Custom benchmarks

    • Mirror test_search/test_delete patterns for insert (-i <key>)
    • Integrate into pytest by wrapping calls and asserting max latencies

Tips

  • Ensure ./btree_test is executable and in project root.
  • Increase records_num or sample size to reduce noise.
  • In headless CI, add at top of tests/utils.py:
    import matplotlib
    matplotlib.use('Agg')
    
  • Use pytest --maxfail=1 --disable-warnings -q for quick feedback.