Project Overview
This repository delivers a high-performance, disk-based B-Tree in C following CLRS. It supports building, inserting, deleting, and searching records on datasets that exceed main memory. Developers interact via a C API (header files), command-line tools, and optional Python scripts for benchmarking and tuning.
Why This Project Exists
• Persist large sorted indexes efficiently on disk
• Minimize I/O by aligning nodes to file-system pages
• Offer a drop-in C library for applications requiring on-disk key-value storage
• Provide a framework for experimenting with disk-based data structures
Core Components
• C library (src/*.c, include/btree.h)
– btree_create, btree_open, btree_insert, btree_delete, btree_search, btree_close
• Command-line tools (tools/)
– btree_build: bulk-load from text/csv
– btree_query: lookup or range scan
– btree_delete: remove keys
• Python scripts (tester.py, bench.py)
– Automate rebuild & benchmark across degree (t) values
– Plot build/search/delete performance
Typical Use-Cases
1. Embedding a Persistent B-Tree in C
#include "btree.h"
#include <stdio.h>
int main(void) {
// Create or open tree file on disk with minimum degree t=32
BTree *tree = btree_open("data/tree.idx", 32, sizeof(MyRecord));
if (!tree) {
perror("btree_open");
return 1;
}
// Insert a record
MyRecord r = { .id = 42, .value = "Hello, B-Tree" };
if (btree_insert(tree, &r.id, &r) != 0) {
fprintf(stderr, "Insert failed\n");
}
// Search for a key
MyRecord found;
if (btree_search(tree, &r.id, &found) == 0) {
printf("Found record: id=%d, value=%s\n", found.id, found.value);
} else {
printf("Key not found\n");
}
// Close and flush to disk
btree_close(tree);
return 0;
}
2. Building and Querying via CLI
# Bulk-load 1M records into a B-Tree with t=32
./tools/btree_build --degree 32 --record-size 128 data/records.txt data/tree.idx
# Lookup a single key
./tools/btree_query --key 12345 data/tree.idx
# Delete a key
./tools/btree_delete --key 12345 data/tree.idx
3. Running Disk-Based Experiments
# Benchmark t in {4,8,16,…,512} on 1M records and plot results
python3 tester.py \
--input data/records.txt \
--output results/bench.png \
--min-degree 4 \
--max-degree 512 \
--steps 8 \
--records 1000000
This script recompiles the core for each t, measures build/search/delete times, and generates performance charts.
Integration Tips
• Align node size to your file-system page (e.g., 4 KB) by choosing t accordingly
• Recompile with make T=<degree>
to embed t at compile time
• Adjust record size in btree_open
or CLI to match your data structure
• Use Python scripts to fine-tune t when record or key types change
Getting Started
Follow these steps to obtain the code, install dependencies, build the C library/tests, and verify the setup on Linux or macOS.
1. Clone the Repository
Open a terminal and run:
git clone https://github.com/ganesh-k13/Btree.git
cd Btree
2. Install Prerequisites
On Ubuntu/Debian
sudo apt update
sudo apt install -y build-essential make python3 python3-pip
On macOS (with Homebrew)
brew update
brew install gcc make python3
Python Packages
Install fixed-version requirements:
pip3 install --user -r requirements.txt
Verify:
python3 -c "import matplotlib, pytest; print('Python OK')"
3. Build the C Library and Test Harness
The Makefile auto-selects clang or gcc. You must supply T
(B-tree minimum degree).
# Example: build with t=4 (max 7 keys/node)
make T=4
This produces the test
executable in the project root.
Direct Compilation (Alternative)
gcc -std=c11 -O2 -Iinclude \
test.c src/Btree.c src/Node.c src/utils.c \
-o bt_test
4. Verify the Build
Quick Smoke Test
./test -b 1000
# → "Tree built in: XX.XXXX ms"
Build + Search
./test -sb 10000 42
# → build + search times, confirms functionality
Run Python Automated Tests
pytest --maxfail=1 --disable-warnings -q
5. Clean Up and Reset Data
- Remove object files & executable:
make clean
- Reset on-disk tree file:
make remdat
# deletes *.dat and creates an empty tree.dat
Now you’re ready to develop, test, and benchmark the B-Tree implementation.
Core Concepts & Architecture
This section describes the core data structures, on-disk layout, and metadata organization of the disk-based B-Tree. Understanding these concepts helps you choose the right degree t
, estimate node sizes for I/O, and manage the meta.dat
file before using the high-level APIs.
1. B-Tree Basics
A B-Tree of minimum degree t
(as in CLRS) satisfies:
- Every node (except root) has between
t−1
and2t−1
keys. - A non-leaf node with
n
keys hasn+1
children. - All leaves appear on the same level.
Choosing t
affects:
- Node size on disk (
sizeof(Node)
). - Tree height and number of I/O's per operation.
- Cache and page utilization.
2. Node Structure (include/Node.h
)
Define t
at compile time:
// Choose minimum degree, e.g., 32
#define t 32
#include "Node.h"
Key types and layout:
// Record stored in each slot
typedef struct Data {
int key;
char country[5];
char Grate[10];
int Score;
int Rate;
} Data;
// On-disk / in-memory node
typedef struct Node {
bool isLeaf; // true ⇒ no children
int pos; // page index or byte‐offset
int n; // number of keys
Data records[2*t - 1]; // max 2t−1
int children[2*t]; // max 2t child positions
} Node;
Guidelines:
pos
indexes the node’s slot intree.dat
:offset = pos * sizeof(Node)
.- Unused
children[i]
must be set to-1
. n
tracks validrecords[0..n-1]
.
3. On-Disk Layout
We use two files under a tree directory:
tree.dat
- Fixed‐size pages: each
sizeof(Node)
bytes. - Node at
pos
reads/writes at offsetpos * sizeof(Node)
.
- Fixed‐size pages: each
meta.dat
- Stores B-Tree metadata: degree
t
,root_pos
,next_pos
. - Layout (binary):
typedef struct { int t; // minimum degree int root_pos; // position of root node int next_pos; // next free page index } Meta;
- Stores B-Tree metadata: degree
4. Metadata Management
Use utility routines from include/utils.h
:
// Initialize a new B-Tree on disk
Btree* btree_open(const char *dir, int degree) {
// create directory if needed
// open tree.dat and meta.dat
Btree *tree = malloc(sizeof(*tree));
tree->fp = fopen(dir/tree.dat, "w+b");
tree->meta_fp = fopen(dir/meta.dat, "w+b");
tree->t = degree;
tree->root = -1; // empty tree
tree->next_pos= 0;
// write initial Meta
Meta m = { degree, tree->root, tree->next_pos };
fwrite(&m, sizeof(Meta), 1, tree->meta_fp);
fflush(tree->meta_fp);
return tree;
}
// Load existing B-Tree
Btree* btree_load(const char *dir) {
Btree *tree = malloc(sizeof(*tree));
tree->fp = fopen(dir/tree.dat, "r+b");
tree->meta_fp = fopen(dir/meta.dat, "r+b");
Meta m;
fread(&m, sizeof(Meta), 1, tree->meta_fp);
tree->t = m.t;
tree->root = m.root_pos;
tree->next_pos= m.next_pos;
return tree;
}
// Persist updated Meta
void meta_write(Btree *tree) {
Meta m = { tree->t, tree->root, tree->next_pos };
fseek(tree->meta_fp, 0, SEEK_SET);
fwrite(&m, sizeof(Meta), 1, tree->meta_fp);
fflush(tree->meta_fp);
}
5. Node I/O (include/utils.h
)
Centralize reads/writes to maintain consistency:
// Write a Node to disk at its pos (assign if -1)
void write_node(Btree *tree, Node *node) {
if (node->pos < 0) {
node->pos = tree->next_pos++;
meta_write(tree);
}
fseek(tree->fp, node->pos * sizeof(Node), SEEK_SET);
fwrite(node, sizeof(Node), 1, tree->fp);
}
// Read a Node from disk by pos
void read_node(Btree *tree, Node *node, int pos) {
fseek(tree->fp, pos * sizeof(Node), SEEK_SET);
fread(node, sizeof(Node), 1, tree->fp);
}
6. Node Initialization
Always initialize child pointers to -1
and n=0
before first write:
// Prepare a fresh node in memory
void node_init(Node *node, bool isLeaf, Btree *tree) {
node->isLeaf = isLeaf;
node->n = 0;
node->pos = -1; // assign on write_node
for (int i = 0; i < 2*t; i++)
node->children[i] = -1;
}
7. Practical Guidelines
- Keep
t
consistent across all compilation units to avoid corrupt layout. - Estimate node size:
sizeof(Node) ≈ 1 + 4 + 4 + (2t−1)*sizeof(Data) + 2t*4
bytes. - Align
sizeof(Node)
to your OS page size (common: 4 KB, 8 KB) for optimal I/O. - After split or merge operations, always call
write_node
for any modified node. - Use
meta_write
whenever you changeroot
ornext_pos
inBtree
.
Understanding these core structures and layout conventions ensures you can extend insertion/deletion policies, customize record types, and optimize performance based on Report.pdf’s analyses of node sizing and disk access patterns.
C API Reference & Usage
This section outlines the public C interface for persisting and querying large key-value datasets using a disk-backed B-Tree. It covers initialization, insertion, search, traversal, deletion, and utility routines.
B-Tree Initialization and Teardown
Functions
• Btree* BTree_init(const char* filename, bool load_existing)
• void BTree_destroy(Btree* tree)
Overview
- On creation (
load_existing == false
), initializes metadata and empty root. - On load (
load_existing == true
), reads existing metadata and prepares file for I/O.
Example
#include "Btree.h"
int main() {
// Create a new tree (overwrites if exists)
Btree* tree = BTree_init("data.idx", false);
if (!tree) return 1;
// ... use tree ...
BTree_destroy(tree);
return 0;
}
Record Insertion
Function
• void insert(Btree* tree, Data* record)
Usage
- Fill a
Data
record (see Utility Functions). - Call
insert(tree, &record)
; splits nodes as needed.
Example
#include "Btree.h"
#include "utils.h"
int main() {
Btree* tree = BTree_init("data.idx", false);
Data rec;
enter_data(&rec, 42, "Answer to Life");
insert(tree, &rec);
BTree_destroy(tree);
return 0;
}
Key Search
Function
• Data* search(Btree* tree, int key)
Usage
- Returns pointer to
Data
if found, otherwiseNULL
.
Example
#include <stdio.h>
#include "Btree.h"
int main() {
Btree* tree = BTree_init("data.idx", true);
int key = 42;
Data* result = search(tree, key);
if (result)
printf("Found key=%d payload=\"%s\"\n", result->key, result->payload);
else
printf("Key %d not found\n", key);
BTree_destroy(tree);
return 0;
}
Tree Traversal
Function
• void traverse(Btree* tree, int pos)
Usage
- In-order prints all keys/payloads starting at node position
pos
. - Call with
tree->root
to dump entire tree.
Example
#include "Btree.h"
int main() {
Btree* tree = BTree_init("data.idx", true);
traverse(tree, tree->root);
BTree_destroy(tree);
return 0;
}
Record Deletion
Function
• void removeNode(Btree* tree, int key)
Usage
- Deletes
key
, rebalances nodes. - If key not present, no change.
Example
#include "Btree.h"
int main() {
Btree* tree = BTree_init("data.idx", true);
removeNode(tree, 42);
traverse(tree, tree->root);
BTree_destroy(tree);
return 0;
}
Utility Functions
Include "utils.h"
to manage node I/O, data entry, CSV loading, and timing.
Node Initialization & Data Entry
void node_init(Node* node, bool isLeaf, Btree* tree);
void enter_data(Data* rec, int key, const char* payload);
node_init
setsnode->pos = tree->next_pos++
andisLeaf
.enter_data
fillsrec->key
and copiespayload
.
Disk I/O
void write_file(Btree* tree, Node* node, int pos);
void read_file (Btree* tree, Node* node, int pos);
- Use
pos = -1
inwrite_file
to auto-assign a new slot (tree->next_pos
).
CSV Bulk Load
void load_csv(Btree* tree, const char* csv_path);
- Reads
key,payload
lines and callsinsert
for each record.
Timing
double get_time(void);
- Returns current time in seconds (wall-clock) for performance measurements.
Bulk-Load Example with Timing
#include <stdio.h>
#include "Btree.h"
#include "utils.h"
int main() {
Btree* tree = BTree_init("bulk.idx", false);
double start = get_time();
load_csv(tree, "data.csv");
double elapsed = get_time() - start;
printf("Bulk insert completed in %.3f seconds\n", elapsed);
traverse(tree, tree->root);
BTree_destroy(tree);
return 0;
}
## CLI & Benchmark Tools
This section describes the ready-made C-based CLI harness (`test.c`) for B-tree operations and a Python toolkit (`tests/utils.py`) for large-scale benchmarking and visualization.
### test.c: Command-Line Test Harness for B-tree Operations
#### Purpose
Compile and run performance tests (build, search, delete) on your B-tree using a simple CLI.
#### Compilation
```bash
gcc -o btree_test test.c src/Node.c src/Btree.c src/utils.c -Iinclude -lrt
(Link with -lrt
for clock_gettime
.)
Available Flags
-b <N>
Build a new B-tree from the first N records intmp/dataset.csv
.
• Pages →tree.dat
• Metadata →meta.dat
• Prints build time (ms).-s <key>
Loadtree.dat
+meta.dat
, search for integer<key>
.
• Prints matching record
• Prints search time (ms).-sb <N> <key>
Build (as in-b
), then search<key>
.
• Prints both build and search times.-d <key>
Load tree, print node for<key>
, delete it, then print status and deletion time (ms).-t <N> <key>
Debug stub (prints node size).
Examples
Build a 100 000-record tree:
./btree_test -b 100000
Search key 42:
./btree_test -s 42
Build + search (100k records, key 1234):
./btree_test -sb 100000 1234
Delete key 77:
./btree_test -d 77
Expected Output Patterns
- Build:
Tree built in: 234.567890 ms
- Search:
[record fields…] Result found in: 0.012345 ms
- Delete:
Node to delete: [record fields…] Delete successful Result found in: 0.009876 ms
Practical Guidance
- Ensure
tmp/dataset.csv
has ≥ N lines. - Use
BTree_init(path, false)
to create,true
to load an existing tree. - After
-b
or-sb
,meta.dat
is written; subsequent-s
/-d
rely on it. - Extend tests by modifying
run_tests()
intest.c
to handle new flags.
Performance Testing Utilities (tests/utils.py)
Purpose
Automate compilations with different minimum degrees (T), measure build/search/delete latencies, and plot results.
Dependencies
- Python 3.x
- matplotlib
- pytest (optional)
Install:
pip install -r requirements.txt
Key Functions
build_tree(t: int, records_num: int) -> float
- Runs
make clean && make T=<t>
- Invokes
./btree_test -b <records_num>
- Returns build time (ms).
- Runs
test_search(key: int) -> float
- Runs
./btree_test -s <key>
- Returns search time (ms).
- Runs
test_delete(key: int) -> float
- Runs
./btree_test -d <key>
- Returns delete time (ms).
- Runs
Example Snippet
import subprocess, re
def build_tree(t, records_num):
subprocess.run(['make', 'clean'], check=True)
subprocess.run(['make', f'T={t}'], check=True)
p = subprocess.run(['./btree_test', '-b', str(records_num)],
stdout=subprocess.PIPE, check=True)
ms = float(re.search(r'\d+\.\d+', p.stdout.decode())[0])
return ms
def test_search(key):
p = subprocess.run(['./btree_test', '-s', str(key)],
stdout=subprocess.PIPE, check=True)
return float(re.search(r'\d+\.\d+', p.stdout.decode())[0])
Practical Usage
Configure parameters at the top of
tests/utils.py
:t_values = [2**i for i in range(2, 12)] records_num = 1000
Run the benchmarks and plot:
python tests/utils.py
- Compiles and builds for each T
- Executes 10 random searches/deletes per T
- Displays build, search-avg, delete-avg charts
Custom benchmarks
- Mirror
test_search
/test_delete
patterns for insert (-i <key>
) - Integrate into pytest by wrapping calls and asserting max latencies
- Mirror
Tips
- Ensure
./btree_test
is executable and in project root. - Increase
records_num
or sample size to reduce noise. - In headless CI, add at top of
tests/utils.py
:import matplotlib matplotlib.use('Agg')
- Use
pytest --maxfail=1 --disable-warnings -q
for quick feedback.