Project Overview

This project delivers an end-to-end, AI-powered document assistant. Authenticated users upload PDFs or images, chat with an assistant backed by Retrieval-Augmented Generation (RAG), and receive prompt recommendations. It streamlines document QA, summarization, and exploration within a secure, user-centric interface.

Problems Solved

  • Disparate document search and QA workflows
  • Time-consuming manual prompt engineering
  • Lack of integrated history, prompts and file management

High-Level Architecture

Backend (FastAPI)

  • Entrypoint: app/backend/main.py
  • Exposes REST endpoints for:
    • User management: /register, /login
    • File operations: /upload/, /files/{user_id}, /file/{file_id}
    • Chat & RAG: /chat/
    • Prompt recommendations: /recommend_prompt/
    • Conversation history: /history/
  • Integrates:
    • Database ORM for users, files, history
    • File storage on disk (PDF/image)
    • ChromaDB for embedding index
    • LLM for generation

Start the backend:

pip install -r requirements.txt
uvicorn app.backend.main:app --reload --port 8000
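Once the backend is up, a client can exercise these endpoints over HTTP. The sketch below is illustrative only: the upload fields match the /upload/ endpoint documented later in this page, while the /register and /chat/ payload shapes are assumptions, not the confirmed schema.

import requests

API = "http://localhost:8000"

# Register a user (field names are assumed for illustration)
requests.post(f"{API}/register", data={"username": "alice", "password": "s3cret"})

# Upload a document for user 1 (matches the /upload/ endpoint shown later)
with open("report.pdf", "rb") as fh:
    requests.post(f"{API}/upload/", files={"file": fh}, data={"user_id": 1})

# Ask a question over the indexed documents (payload shape is hypothetical)
resp = requests.post(f"{API}/chat/", data={"user_id": 1, "message": "Summarize the report"})
print(resp.json())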

Frontend (Streamlit)

  • Entrypoint: app/frontend/app.py
  • Provides a web UI for:
    • Registration & login
    • Drag-and-drop file upload
    • Prompt selection & recommendations
    • Real-time chat with RAG responses
    • Session-based history view
  • Communicates with backend REST API

Run the frontend:

cd app/frontend
pip install -r requirements.txt
streamlit run app.py --server.port 8501

RAG Pipeline

  1. Ingestion: Uploaded files are converted to text (PDF parsing or image OCR).
  2. Embedding: The text is split into chunks; each chunk is embedded with a text-embedding model.
  3. Indexing: ChromaDB stores the embeddings for efficient similarity search.
  4. Retrieval: On each user query, the backend retrieves the top-k most similar chunks.
  5. Generation: The LLM (Google Gemini in this project) fuses the retrieved context with the user prompt.
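A condensed sketch of steps 4 and 5, assuming the ChromaDB collection name ("rag_docs") and Google embeddings described later in this document; the exact retrieval helper in the codebase may differ:

from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from app.backend.rag import generate_answer

# Open the persisted vector store built at ingestion time
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
store = Chroma(
    collection_name="rag_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_data",
)

query = "What does the Q2 report say about revenue?"
docs = store.similarity_search(query, k=4)        # step 4: top-k similar chunks
print(generate_answer(query, context_docs=docs))  # step 5: Gemini generation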

Database & Storage

  • Relational DB (configurable via SQLAlchemy) stores:
    • User credentials (hashed)
    • File metadata and ownership
    • Chat history entries
  • ChromaDB holds vector index for all ingested document chunks.
  • Filesystem persists raw uploads under user-scoped directories.

Quick Start

Get the Vibe Pi system running locally in five minutes: backend API, Vector DB, and Streamlit chat UI.

Prerequisites

• Python 3.10+
• PostgreSQL 13+ (or adjust DATABASE_URL for other engines)
• Tesseract-OCR (for document parsing)
• Git

1. Clone and Enter Repo

git clone https://github.com/riqalter/vibe-pi-releasecandidate1.git
cd vibe-pi-releasecandidate1

2. Create & Activate Virtual Environment

python3 -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows
.\.venv\Scripts\activate

3. Configure Environment

Copy the sample and fill in values:

cp .env.example .env

Edit .env:

DATABASE_URL=postgresql://user:password@localhost:5432/vibepi
GOOGLE_API_KEY=your-google-api-key
CHROMA_DB_DIR=./chroma_data
TESSDATA_PREFIX=/usr/share/tesseract/tessdata
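A minimal sketch of how the backend can read these values at startup, assuming python-dotenv (the rag.py module shown later uses the same load_dotenv pattern):

import os
from dotenv import load_dotenv

load_dotenv()  # loads key=value pairs from .env into the process environment

DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./app.db")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY", "")
CHROMA_DB_DIR = os.getenv("CHROMA_DB_DIR", "./chroma_data")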

4. Install Dependencies

pip install --upgrade pip
pip install -r requirements.txt

5. Initialize Database

Create the schema (uses SQLAlchemy's Base.metadata):

python scripts/init_db.py

If you prefer Alembic:

alembic upgrade head

6. Start Backend API

uvicorn app.backend.main:app \
  --reload \
  --host 0.0.0.0 \
  --port 8000

7. Start Streamlit Chat UI

streamlit run app/frontend/app.py --server.port 8501

Open http://localhost:8501 to register, upload files, and start chatting.

8. First Successful Chat

  1. Register or log in via the UI.
  2. Upload a PDF, image, or PPT file.
  3. Select the document in “Manage Files.”
  4. Open the “Chat” tab and send a prompt.

You should see responses powered by Gemini generation and ChromaDB retrieval.

Troubleshooting

  • “Database connection failed”: verify DATABASE_URL and that Postgres is running.
  • “Tesseract not found”: ensure tesseract is on $PATH or set TESSDATA_PREFIX.
  • “Chroma indexing errors”: confirm CHROMA_DB_DIR is writable.
  • Port conflicts: adjust --port flags for both uvicorn and Streamlit.

Now you have a running Vibe Pi environment. From here, explore advanced features like custom retrieval chains or OCR tuning.

User Guide

This guide explains how end users and integrators upload files via the Streamlit UI and interact with the FastAPI backend once the application runs.

Uploading Files (PDF and Images)

Purpose: Embed a Streamlit uploader to send PDFs, images, or PPTs to FastAPI, handle multipart uploads, and reset state for repeated uploads.

Essential Code

import streamlit as st
import requests

API_URL = "http://localhost:8000"

# Initialize session state
st.session_state.setdefault('upload_success', False)
st.session_state.setdefault('file_uploader_key', 0)

st.header("Upload File (PDF/Image)")
uploaded_file = st.file_uploader(
    "Choose a PDF or image file",
    type=["pdf", "png", "jpg", "jpeg", "ppt", "pptx"],
    key=st.session_state['file_uploader_key']
)

if uploaded_file and not st.session_state['upload_success']:
    files = {
        "file": (
            uploaded_file.name,
            uploaded_file.getvalue(),
            uploaded_file.type
        )
    }
    data = {"user_id": st.session_state['user_id']}  # user_id is set at login

    response = requests.post(f"{API_URL}/upload/", files=files, data=data)
    if response.ok:
        st.success(f"File {uploaded_file.name} uploaded successfully!")
        st.session_state['upload_success'] = True
        st.session_state['file_uploader_key'] += 1
        st.rerun()
    else:
        st.error("File upload failed.")

# Reset flag for next upload
if st.session_state['upload_success']:
    st.session_state['upload_success'] = False

Practical Usage Guidance

  • Accepted types: pdf, png, jpg, jpeg, ppt, pptx.
  • Use file_uploader_key to clear the widget after each successful upload.
  • Guard against duplicates with upload_success flag.
  • Build the files dict as (filename, raw bytes, MIME type) so requests sends a well-formed multipart part that FastAPI parses into an UploadFile.
  • Include metadata (e.g., user_id) in the data payload.
  • Call st.rerun() immediately after success to refresh UI and reset state.

File Upload Endpoint (/upload/)

Purpose: Accept user file uploads, save them to disk, persist metadata in the database, and index content into ChromaDB for RAG queries.

Key Behavior

  • Accepts multipart form data:
    • user_id (int)
    • file (UploadFile)
  • Saves file to UPLOAD_DIR with a timestamped filename
  • Creates a User record if none exists
  • Inserts a DBFile record (user_id, filename, MIME type, upload time)
  • Calls index_file(...) to ingest content into ChromaDB

Endpoint Definition

from datetime import datetime
import os
from fastapi import FastAPI, Form, File, UploadFile, Depends
from sqlalchemy.orm import Session
from .models import DBFile, User
from .database import get_db
from .indexer import index_file

app = FastAPI()
UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)  # ensure the upload directory exists

@app.post("/upload/")
def upload_file(
    user_id: int = Form(...),
    file: UploadFile = File(...),
    db: Session = Depends(get_db)
):
    # Construct a unique, timestamped filename
    timestamp = datetime.now().strftime('%Y%m%d%H%M%S')
    filename = f"{timestamp}_{file.filename}"
    file_path = os.path.join(UPLOAD_DIR, filename)

    # Persist the file to disk
    with open(file_path, "wb") as f:
        f.write(file.file.read())

    # Ensure the user exists before inserting rows that reference it
    user = db.query(User).filter(User.id == user_id).first()
    if not user:
        user = User(id=user_id, username=f"user{user_id}")
        db.add(user)
        db.flush()

    # Record metadata in the database
    db_file = DBFile(
        user_id=user_id,
        filename=filename,
        filetype=file.content_type,
        upload_time=datetime.now()
    )
    db.add(db_file)
    db.commit()
    db.refresh(db_file)

    # Index content in ChromaDB
    index_file(
        file_path,
        file.content_type,
        metadata={"user_id": user_id, "filename": filename}
    )

    return {"file_id": db_file.id, "filename": filename}

Practical Usage

  1. Configure UPLOAD_DIR (defaults to <project_root>/uploads).

  2. Ensure CORS, static files, and database are initialized.

  3. Use curl to test the endpoint:

    curl -X POST http://localhost:8000/upload/ \
      -F "user_id=42" \
      -F "file=@/path/to/report.pdf"
    
  4. Sample JSON response:

    {
      "file_id": 17,
      "filename": "20250715123045_report.pdf"
    }
    
  5. Access the file via:

    GET http://localhost:8000/uploads/20250715123045_report.pdf
    

Tips & Gotchas

  • For very large files, raise upload size limits at your ASGI server or reverse proxy (e.g., nginx's client_max_body_size).
  • Timestamp prefixes avoid filename collisions; ensure clock sync in distributed setups.
  • Store and filter by MIME type (content_type) in DBFile.
  • If higher throughput is needed, dispatch index_file to a background task, as sketched below.
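A minimal sketch of that background-task tip using FastAPI's built-in BackgroundTasks. It reuses the imports and models from the endpoint above; the /upload_async/ route name is hypothetical:

from fastapi import BackgroundTasks

@app.post("/upload_async/")
def upload_file_async(
    background_tasks: BackgroundTasks,
    user_id: int = Form(...),
    file: UploadFile = File(...),
    db: Session = Depends(get_db)
):
    timestamp = datetime.now().strftime('%Y%m%d%H%M%S')
    filename = f"{timestamp}_{file.filename}"
    file_path = os.path.join(UPLOAD_DIR, filename)
    with open(file_path, "wb") as f:
        f.write(file.file.read())

    db_file = DBFile(user_id=user_id, filename=filename,
                     filetype=file.content_type, upload_time=datetime.now())
    db.add(db_file)
    db.commit()
    db.refresh(db_file)

    # Indexing runs after the response is sent, so the client never blocks on it
    background_tasks.add_task(index_file, file_path, file.content_type,
                              metadata={"user_id": user_id, "filename": filename})
    return {"file_id": db_file.id, "filename": filename}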

Inside the Code

This section dives into the core modules of the project. You’ll learn how the file-indexing pipeline works, how the RAG answer generator integrates with Google’s Gemini model, and how the SQLAlchemy ORM maps your data.

Indexing Files with ChromaDB

Provide a single entry point (index_file) to extract text from supported file types (PDF, image, DOCX, PPTX), split it into chunks, generate embeddings using Google Generative AI, and persist them in a local ChromaDB vector store.

Function Signature

def index_file(file_path: str, filetype: str, metadata: dict = None) -> bool

How It Works

  1. Detects file type via MIME or extension.
  2. Routes to one of:
    • extract_text_from_pdf
    • extract_text_from_image
    • extract_text_from_docx
    • extract_text_from_pptx
  3. Splits text into 1,000-char chunks with 100-char overlap (RecursiveCharacterTextSplitter).
  4. Wraps each chunk in a LangChain Document (attaching optional metadata).
  5. Uses Google GenAI to embed documents and stores them in ChromaDB under collection "rag_docs".
  6. Logs progress at INFO level for observability.
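A condensed sketch of steps 3 through 5, assuming the langchain-google-genai integration; the actual helper names in app/backend/chroma.py may differ:

from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

def embed_and_store(raw_text: str, metadata: dict | None = None) -> None:
    # Step 3: split into 1,000-char chunks with 100-char overlap
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_text(raw_text)

    # Step 4: wrap each chunk in a Document, attaching optional metadata
    docs = [Document(page_content=c, metadata=metadata or {}) for c in chunks]

    # Step 5: embed via Google GenAI and persist under the "rag_docs" collection
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    Chroma.from_documents(docs, embeddings,
                          collection_name="rag_docs",
                          persist_directory="./chroma_data")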

Parameters

  • file_path: Local path to the file.
  • filetype: MIME type (e.g., "application/pdf"). Falls back to extension for unsupported MIME values.
  • metadata: Optional dict stored alongside each vector (e.g., {"project": "alpha"}).

Return Value

  • True on success.
  • False if file type is unsupported or an error occurs.

Basic Usage

from app.backend.chroma import index_file

# Index a quarterly finance report with metadata
success = index_file(
    file_path="data/quarterly_report.pdf",
    filetype="application/pdf",
    metadata={"department": "finance", "quarter": "Q2"}
)
if success:
    print("PDF indexed successfully.")

Common Advanced Usage

# Index a JPEG (OCR via Tesseract)
index_file(
    file_path="images/handwritten_notes.jpg",
    filetype="image/jpeg"
)

# Extend to support Markdown files
def extract_text_from_md(path: str) -> str:
    with open(path, "r") as f:
        return f.read()

# Then add a branch to index_file's type dispatch:
#   elif filetype == "text/markdown" or file_path.endswith(".md"):
#       raw = extract_text_from_md(file_path)

Practical Tips

  • Set GOOGLE_API_KEY before startup.
  • Monitor INFO logs to verify chunk counts and embedding stages.
  • To add new file types, extend the dispatch in index_file and implement an extract_text_from_* helper.
  • Tune chunk_size and chunk_overlap for your domain.
  • Backup the CHROMA_DIR folder before migrating or restoring your ChromaDB.

generate_answer: Contextual Answer Generation

Combine retrieved documents with Google’s Gemini model to produce concise, context-aware answers.

Function Signature

def generate_answer(query: str, context_docs: List[Document] = None) -> str

How It Works

  1. Loads GOOGLE_API_KEY from .env and instantiates genai.Client.
  2. Concatenates doc.page_content from each context_doc into a single “Context:” block.
  3. Appends the user’s question and sends both to gemini-2.0-flash-001.
  4. Returns response.text as the answer.

Implementation

# app/backend/rag.py
import os
from typing import List

from dotenv import load_dotenv
from google import genai
from langchain.schema import Document

load_dotenv()
client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY", ""))
MODEL_NAME = "gemini-2.0-flash-001"

def generate_answer(query: str, context_docs: List[Document] = None) -> str:
    context = "\n\n".join(doc.page_content for doc in context_docs) if context_docs else ""
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    response = client.models.generate_content(
        model=MODEL_NAME,
        contents=prompt
    )
    return response.text

Usage Pattern

The snippet below shows an illustrative retrieval-plus-generation loop. It uses FAISS and OpenAI embeddings for brevity; in this project, retrieval goes through the ChromaDB "rag_docs" collection instead.

from app.backend.rag import generate_answer
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# 1. Load vector store
embeddings = OpenAIEmbeddings()
faiss = FAISS.load_local("faiss_index", embeddings)

# 2. Fetch top-k docs
query = "What is the time complexity of quicksort?"
docs = faiss.similarity_search(query, k=5)  # List[Document]

# 3. Generate and display answer
answer = generate_answer(query, context_docs=docs)
print("AI Answer:", answer)

Best Practices

  • Limit total context length to avoid token-limit errors; truncate or summarize long documents.
  • Handle empty context_docs by returning a default prompt or asking for clarification.
  • Sanitize PII before passing documents to the model.
  • Cache frequent queries or enable streaming for long responses.
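A small sketch of the first two practices; the character budget and fallback message are illustrative choices, not project constants:

from typing import List
from langchain.schema import Document
from app.backend.rag import generate_answer

MAX_CONTEXT_CHARS = 12_000  # illustrative budget; tune to the model's token limit

def answer_safely(query: str, docs: List[Document]) -> str:
    if not docs:
        # Empty context: ask for clarification instead of risking a hallucinated answer
        return "I couldn't find relevant documents. Try rephrasing or uploading a file."
    context, used = [], 0
    for doc in docs:
        if used + len(doc.page_content) > MAX_CONTEXT_CHARS:
            break  # truncate the context rather than exceed the budget
        context.append(doc)
        used += len(doc.page_content)
    return generate_answer(query, context_docs=context)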

ORM Models in db.py

Define core SQLAlchemy models for users, messages, and uploaded files, and show how to initialize and interact with the database.

Model Definitions

# app/backend/db.py
import os

from sqlalchemy import (
    Column, DateTime, ForeignKey, Integer, String, Text, create_engine
)
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id            = Column(Integer, primary_key=True, index=True)
    username      = Column(String, unique=True, index=True)
    password_hash = Column(String)
    messages      = relationship('Message', back_populates='user')
    files         = relationship('File',    back_populates='user')

class Message(Base):
    __tablename__ = 'messages'
    id        = Column(Integer, primary_key=True, index=True)
    user_id   = Column(Integer, ForeignKey('users.id'))
    content   = Column(Text)
    timestamp = Column(DateTime)
    user      = relationship('User', back_populates='messages')

class File(Base):
    __tablename__ = 'files'
    id          = Column(Integer, primary_key=True, index=True)
    user_id     = Column(Integer, ForeignKey('users.id'))
    filename    = Column(String)
    filetype    = Column(String)
    upload_time = Column(DateTime)
    user        = relationship('User', back_populates='files')

# Database setup; DATABASE_URL can be overridden via the environment (see Practical Tips)
DATABASE_URL = os.environ.get("DATABASE_URL", "sqlite:///./app.db")
engine = create_engine(
    DATABASE_URL,
    # check_same_thread applies only to SQLite connections
    connect_args={"check_same_thread": False} if DATABASE_URL.startswith("sqlite") else {},
)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)

def init_db():
    Base.metadata.create_all(bind=engine)

Initializing the Database

Run once at startup to create tables:

python - <<EOF
from app.backend.db import init_db
init_db()
EOF

Working with a Session

from app.backend.db import SessionLocal, User, Message, File
from datetime import datetime

db = SessionLocal()
try:
    # 1. Create a user
    alice = User(username='alice', password_hash='hashed_pw')
    db.add(alice)
    db.commit()
    db.refresh(alice)

    # 2. Post a message
    msg = Message(user_id=alice.id, content='Hello, Vibe Pi!', timestamp=datetime.utcnow())
    db.add(msg)
    db.commit()

    # 3. Query back
    user = db.query(User).filter_by(username='alice').first()
    print(f"{user.username} has {len(user.messages)} messages.")
    for m in user.messages:
        print(f"- {m.content} at {m.timestamp}")
finally:
    db.close()

Practical Tips

  • Always commit() after writes and refresh(obj) to load autogenerated fields.
  • Use relationship(..., back_populates=...) for bidirectional navigation.
  • Close sessions in a finally block or use a context manager to prevent connection leaks.
  • Configure DATABASE_URL via environment variables for different environments.
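The get_db dependency used by the /upload/ endpoint earlier is not defined in this section; a sketch consistent with SessionLocal above, following FastAPI's standard yield pattern:

from app.backend.db import SessionLocal

def get_db():
    # Yield one session per request and always close it (FastAPI dependency)
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()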