Project Overview
This project delivers an end-to-end AI-powered document assistant. Authenticated users upload PDFs or images, interact with a chatbot that leverages Retrieval-Augmented Generation (RAG), and receive prompt recommendations. It streamlines document QA, summarization and exploration within a secure, user-centric interface.
Problems Solved
- Disparate document search and QA workflows
- Time-consuming manual prompt engineering
- Lack of integrated history, prompts and file management
High-Level Architecture
Backend (FastAPI)
- Entrypoint: app/backend/main.py
- Exposes REST endpoints for:
  - User management: /register, /login
  - File operations: /upload/, /files/{user_id}, /file/{file_id}
  - Chat & RAG: /chat/
  - Prompt recommendations: /recommend_prompt/
  - Conversation history: /history/
- Integrates:
- Database ORM for users, files, history
- File storage on disk (PDF/image)
- ChromaDB for embedding index
- LLM for generation
Start the backend:
pip install -r requirements.txt
uvicorn app.backend.main:app --reload --port 8000
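With the server up, a quick smoke test is possible from Python. The JSON field names below are assumptions, so confirm the actual request schemas at http://localhost:8000/docs:
# Hypothetical smoke test against the REST endpoints listed above
import requests

API = "http://localhost:8000"
requests.post(f"{API}/register", json={"username": "alice", "password": "s3cret"})
resp = requests.post(f"{API}/chat/", json={"user_id": 1, "query": "Summarize my latest upload"})
print(resp.json())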
Frontend (Streamlit)
- Entrypoint:
app/frontend/app.py
- Provides a web UI for:
- Registration & login
- Drag-and-drop file upload
- Prompt selection & recommendations
- Real-time chat with RAG responses
- Session-based history view
- Communicates with backend REST API
Run the frontend:
cd app/frontend
pip install -r requirements.txt
streamlit run app.py --server.port 8501
RAG Pipeline
- Ingestion: Uploaded files are converted to text (PDF parsing or image OCR).
- Embedding: The text is split into chunks; each chunk is embedded with a text-embedding model.
- Indexing: ChromaDB stores the embeddings for efficient similarity search.
- Retrieval: On a user query, the backend retrieves the top-k most similar chunks.
- Generation: The LLM (Google Gemini) fuses the retrieved context with the user prompt (see the sketch below).
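A minimal sketch of the retrieval-plus-generation path, assuming LangChain's Chroma wrapper and the generate_answer helper documented under "Inside the Code"; the embedding model name is an assumption:
# Retrieve top-k chunks from the "rag_docs" collection, then generate an answer
from langchain.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from app.backend.rag import generate_answer

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
store = Chroma(
    collection_name="rag_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_data",
)
query = "What does the uploaded report say about Q2 revenue?"
docs = store.similarity_search(query, k=5)  # top-k similar chunks
print(generate_answer(query, context_docs=docs))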
Database & Storage
- Relational DB (configurable via SQLAlchemy) stores:
- User credentials (hashed)
- File metadata and ownership
- Chat history entries
- ChromaDB holds vector index for all ingested document chunks.
- Filesystem persists raw uploads under user-scoped directories.
Quick Start
Get the Vibe Pi system running locally in five minutes: backend API, Vector DB, and Streamlit chat UI.
Prerequisites
- Python 3.10+
- PostgreSQL 13+ (or adjust DATABASE_URL for other engines)
- Tesseract-OCR (for document parsing)
- Git
1. Clone and Enter Repo
git clone https://github.com/riqalter/vibe-pi-releasecandidate1.git
cd vibe-pi-releasecandidate1
2. Create & Activate Virtual Environment
python3 -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows
.\.venv\Scripts\activate
3. Configure Environment
Copy the sample and fill in values:
cp .env.example .env
Edit .env:
DATABASE_URL=postgresql://user:password@localhost:5432/vibepi
OPENAI_API_KEY=sk-xxxxxx
GOOGLE_API_KEY=your-google-api-key
CHROMA_DB_DIR=./chroma_data
TESSDATA_PREFIX=/usr/share/tesseract/tessdata
4. Install Dependencies
pip install --upgrade pip
pip install -r requirements.txt
5. Initialize Database
Create the tables (uses SQLAlchemy metadata):
python scripts/init_db.py
If you prefer Alembic:
alembic upgrade head
6. Start Backend API
uvicorn app.backend.main:app \
  --reload \
  --host 0.0.0.0 \
  --port 8000
- API docs: http://localhost:8000/docs
- Health check: GET http://localhost:8000/health
7. Start Streamlit Chat UI
streamlit run app/frontend/app.py --server.port 8501
Open http://localhost:8501 to register, upload files, and start chatting.
8. First Successful Chat
- Register or log in via the UI.
- Upload a PDF, image, or PPT.
- Select the document in “Manage Files.”
- Open the “Chat” tab and send a prompt.
You should see responses powered by Gemini generation and ChromaDB retrieval.
Troubleshooting
- “Database connection failed”: verify DATABASE_URL and that Postgres is running.
- “Tesseract not found”: ensure tesseract is on $PATH or set TESSDATA_PREFIX.
- “Chroma indexing errors”: confirm CHROMA_DB_DIR is writable.
- Port conflicts: adjust --port flags for both uvicorn and Streamlit.
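The first three checks can be scripted; a minimal sketch, assuming pytesseract and SQLAlchemy are installed (both are implied by the requirements):
# Sanity-check Tesseract, the database, and the Chroma directory
import os
import shutil

import pytesseract
from sqlalchemy import create_engine, text

print("tesseract on PATH:", shutil.which("tesseract"))
print("tesseract version:", pytesseract.get_tesseract_version())

engine = create_engine(os.environ["DATABASE_URL"])
with engine.connect() as conn:
    print("db ok:", conn.execute(text("SELECT 1")).scalar() == 1)

chroma_dir = os.environ.get("CHROMA_DB_DIR", "./chroma_data")
print("chroma dir writable:", os.access(chroma_dir, os.W_OK))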
Now you have a running Vibe Pi environment—start exploring advanced features like custom retrieval chains or OCR tuning.
User Guide
This guide explains how end users and integrators upload files via the Streamlit UI and interact with the FastAPI backend once the application runs.
Uploading Files (PDF and Images)
Purpose: Embed a Streamlit uploader to send PDFs, images, or PPTs to FastAPI, handle multipart uploads, and reset state for repeated uploads.
Essential Code
import streamlit as st
import requests

API_URL = "http://localhost:8000"

# Initialize session state
st.session_state.setdefault('upload_success', False)
st.session_state.setdefault('file_uploader_key', 0)

st.header("Upload File (PDF/Image)")
uploaded_file = st.file_uploader(
    "Choose a PDF or image file",
    type=["pdf", "png", "jpg", "jpeg", "ppt", "pptx"],
    key=st.session_state['file_uploader_key']
)

if uploaded_file and not st.session_state['upload_success']:
    files = {
        "file": (
            uploaded_file.name,
            uploaded_file.getvalue(),
            uploaded_file.type
        )
    }
    data = {"user_id": st.session_state['user_id']}
    response = requests.post(f"{API_URL}/upload/", files=files, data=data)
    if response.ok:
        st.success(f"File {uploaded_file.name} uploaded successfully!")
        st.session_state['upload_success'] = True
        st.session_state['file_uploader_key'] += 1
        st.rerun()
    else:
        st.error("File upload failed.")

# Reset flag for next upload
if st.session_state['upload_success']:
    st.session_state['upload_success'] = False
Practical Usage Guidance
- Accepted types: pdf, png, jpg, jpeg, ppt, pptx.
- Use file_uploader_key to clear the widget after each successful upload.
- Guard against duplicate submissions with the upload_success flag.
- Build the files dict as (filename, raw bytes, MIME type) to satisfy FastAPI's UploadFile.
- Include metadata (e.g., user_id) in the data payload.
- Call st.rerun() immediately after success to refresh the UI and reset state.
File Upload Endpoint (/upload/)
Purpose: Accept user file uploads, save them to disk, persist metadata in the database, and index content into ChromaDB for RAG queries.
Key Behavior
- Accepts multipart form data: user_id (int) and file (UploadFile)
- Saves the file to UPLOAD_DIR with a timestamped filename
- Ensures a User record exists, creating one if needed
- Inserts a DBFile record (user_id, filename, MIME type, upload time)
- Calls index_file(...) to ingest content into ChromaDB
Endpoint Definition
from datetime import datetime
import os

from fastapi import FastAPI, Form, File, UploadFile, Depends
from sqlalchemy.orm import Session

from .models import DBFile, User
from .database import get_db
from .indexer import index_file

app = FastAPI()
UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)

@app.post("/upload/")
def upload_file(
    user_id: int = Form(...),
    file: UploadFile = File(...),
    db: Session = Depends(get_db)
):
    # Construct a unique, timestamped filename
    timestamp = datetime.now().strftime('%Y%m%d%H%M%S')
    filename = f"{timestamp}_{file.filename}"
    file_path = os.path.join(UPLOAD_DIR, filename)

    # Persist the file to disk
    with open(file_path, "wb") as f:
        f.write(file.file.read())

    # Ensure the user exists before referencing it via a foreign key
    user = db.query(User).filter(User.id == user_id).first()
    if not user:
        user = User(id=user_id, username=f"user{user_id}")
        db.add(user)
        db.commit()

    # Record metadata in the database
    db_file = DBFile(
        user_id=user_id,
        filename=filename,
        filetype=file.content_type,
        upload_time=datetime.now()
    )
    db.add(db_file)
    db.commit()
    db.refresh(db_file)

    # Index content in ChromaDB
    index_file(
        file_path,
        file.content_type,
        metadata={"user_id": user_id, "filename": filename}
    )

    return {"file_id": db_file.id, "filename": filename}
Practical Usage
Configure UPLOAD_DIR (defaults to <project_root>/uploads). Ensure CORS, static files, and the database are initialized.
Test the endpoint with curl:
curl -X POST http://localhost:8000/upload/ \
  -F "user_id=42" \
  -F "file=@/path/to/report.pdf"
Sample JSON response:
{ "file_id": 17, "filename": "20250715123045_report.pdf" }
Access the file via:
GET http://localhost:8000/uploads/20250715123045_report.pdf
Tips & Gotchas
- For very large files, adjust ASGI server limits (e.g., client_max_size).
- Timestamp prefixes avoid filename collisions; ensure clock sync in distributed setups.
- Store and filter by MIME type (content_type) in DBFile.
- If higher throughput is needed, dispatch index_file to a background task (see the sketch below).
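A minimal sketch of that dispatch using FastAPI's built-in BackgroundTasks; only the task wiring differs from the endpoint above, and the elided body is identical:
# Variant of the endpoint above that offloads indexing to a background task
from fastapi import BackgroundTasks

@app.post("/upload/")
def upload_file(
    background_tasks: BackgroundTasks,
    user_id: int = Form(...),
    file: UploadFile = File(...),
    db: Session = Depends(get_db)
):
    ...  # save the file and DB records exactly as in the original endpoint
    # Queue indexing so the response returns before embedding finishes
    background_tasks.add_task(
        index_file,
        file_path,
        file.content_type,
        metadata={"user_id": user_id, "filename": filename}
    )
    return {"file_id": db_file.id, "filename": filename}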
Inside the Code
This section dives into the core modules of the project. You’ll learn how the file-indexing pipeline works, how the RAG answer generator integrates with Google’s Gemini model, and how the SQLAlchemy ORM maps your data.
Indexing Files with ChromaDB
Purpose: Provide a single entry point (index_file) to extract text from supported file types (PDF, image, DOCX, PPTX), split it into chunks, generate embeddings using Google Generative AI, and persist them in a local ChromaDB vector store.
Function Signature
def index_file(file_path: str, filetype: str, metadata: dict = None) -> bool
How It Works
- Detects the file type via MIME or extension.
- Routes to one of: extract_text_from_pdf, extract_text_from_image, extract_text_from_docx, extract_text_from_pptx.
- Splits the text into 1,000-char chunks with 100-char overlap (RecursiveCharacterTextSplitter).
- Wraps each chunk in a LangChain Document (attaching optional metadata).
- Uses Google GenAI to embed the documents and stores them in ChromaDB under the "rag_docs" collection (these steps are sketched below).
- Logs progress at INFO level for observability.
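A sketch of the split-wrap-embed-store steps, assuming the classic LangChain splitter and vector-store APIs plus the langchain-google-genai embedding wrapper; the helper name and embedding model are illustrative, not the project's actual internals:
# Split raw text, wrap chunks as Documents, embed, and persist to Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

def chunk_and_index(raw_text: str, metadata: dict = None) -> int:
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    docs = [
        Document(page_content=chunk, metadata=metadata or {})
        for chunk in splitter.split_text(raw_text)
    ]
    Chroma.from_documents(
        docs,
        GoogleGenerativeAIEmbeddings(model="models/embedding-001"),
        collection_name="rag_docs",
        persist_directory="./chroma_data",
    )
    return len(docs)  # number of chunks indexed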
Parameters
- file_path: Local path to the file.
- filetype: MIME type (e.g., "application/pdf"). Falls back to the extension for unsupported MIME values.
- metadata: Optional dict stored alongside each vector (e.g., {"project": "alpha"}).
Return Value
- True on success.
- False if the file type is unsupported or an error occurs.
Basic Usage
from app.backend.chroma import index_file
# Index a quarterly finance report with metadata
success = index_file(
    file_path="data/quarterly_report.pdf",
    filetype="application/pdf",
    metadata={"department": "finance", "quarter": "Q2"}
)
if success:
    print("PDF indexed successfully.")
Common Advanced Usage
# Index a JPEG (OCR via Tesseract)
index_file(
    file_path="images/handwritten_notes.jpg",
    filetype="image/jpeg"
)

# Extend to support Markdown files
def extract_text_from_md(path: str) -> str:
    with open(path, "r") as f:
        return f.read()

# In index_file's type dispatch:
elif filetype in ("text/markdown",) or file_path.endswith(".md"):
    raw = extract_text_from_md(file_path)
Practical Tips
- Set GOOGLE_API_KEY before startup.
- Monitor INFO logs to verify chunk counts and embedding stages.
- To add new file types, extend the dispatch in index_file and implement an extract_text_from_* helper.
- Tune chunk_size and chunk_overlap for your domain.
- Back up the CHROMA_DIR folder before migrating or restoring your ChromaDB (a backup sketch follows below).
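A minimal backup sketch, assuming the persistence directory is ./chroma_data as in the Quick Start:
# Copy the Chroma persistence directory to a timestamped backup
import shutil
from datetime import datetime

stamp = datetime.now().strftime("%Y%m%d%H%M%S")
shutil.copytree("./chroma_data", f"./chroma_data.bak.{stamp}")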
generate_answer: Contextual Answer Generation
Combine retrieved documents with Google’s Gemini model to produce concise, context-aware answers.
Function Signature
def generate_answer(query: str, context_docs: List[Document] = None) -> str
How It Works
- Loads GOOGLE_API_KEY from .env and instantiates genai.Client.
- Concatenates doc.page_content from each context_doc into a single “Context:” block.
- Appends the user’s question and sends both to gemini-2.0-flash-001.
- Returns response.text as the answer.
Implementation
# app/backend/rag.py
import os
from typing import List

from dotenv import load_dotenv
from google import genai
from langchain.schema import Document

load_dotenv()
client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY", ""))
MODEL_NAME = "gemini-2.0-flash-001"

def generate_answer(query: str, context_docs: List[Document] = None) -> str:
    context = "\n\n".join(doc.page_content for doc in context_docs) if context_docs else ""
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    response = client.models.generate_content(
        model=MODEL_NAME,
        contents=prompt
    )
    return response.text
Usage Pattern
from app.backend.rag import generate_answer
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
# 1. Load vector store
embeddings = OpenAIEmbeddings()
faiss = FAISS.load_local("faiss_index", embeddings)
# 2. Fetch top-k docs
query = "What is the time complexity of quicksort?"
docs = faiss.similarity_search(query, k=5) # List[Document]
# 3. Generate and display answer
answer = generate_answer(query, context_docs=docs)
print("AI Answer:", answer)
Best Practices
- Limit total context length to avoid token-limit errors; truncate or summarize long documents (see the sketch below).
- Handle empty context_docs by returning a default prompt or asking for clarification.
- Sanitize PII before passing documents to the model.
- Cache frequent queries or enable streaming for long responses.
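A simple way to enforce the first point; a sketch with an assumed character budget (token-aware budgeting would be more precise):
# Cap the concatenated context at max_chars before building the prompt
def build_context(docs, max_chars=12000):
    parts, used = [], 0
    for doc in docs:
        text = doc.page_content
        if used + len(text) > max_chars:
            parts.append(text[: max_chars - used])  # keep a partial final chunk
            break
        parts.append(text)
        used += len(text)
    return "\n\n".join(parts)

# Usage: truncate before calling the model, e.g. by joining build_context(docs)
# into the prompt instead of concatenating every document in full.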
ORM Models in db.py
Define core SQLAlchemy models for users, messages, and uploaded files, and show how to initialize and interact with the database.
Model Definitions
# app/backend/db.py
from sqlalchemy import Column, Integer, String, DateTime, ForeignKey, Text, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship, sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True, index=True)
    username = Column(String, unique=True, index=True)
    password_hash = Column(String)
    messages = relationship('Message', back_populates='user')
    files = relationship('File', back_populates='user')

class Message(Base):
    __tablename__ = 'messages'
    id = Column(Integer, primary_key=True, index=True)
    user_id = Column(Integer, ForeignKey('users.id'))
    content = Column(Text)
    timestamp = Column(DateTime)
    user = relationship('User', back_populates='messages')

class File(Base):
    __tablename__ = 'files'
    id = Column(Integer, primary_key=True, index=True)
    user_id = Column(Integer, ForeignKey('users.id'))
    filename = Column(String)
    filetype = Column(String)
    upload_time = Column(DateTime)
    user = relationship('User', back_populates='files')

# Database setup
DATABASE_URL = "sqlite:///./app.db"
engine = create_engine(DATABASE_URL, connect_args={"check_same_thread": False})
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)

def init_db():
    Base.metadata.create_all(bind=engine)
Initializing the Database
Run once at startup to create tables:
python - <<EOF
from app.backend.db import init_db
init_db()
EOF
Working with a Session
from app.backend.db import SessionLocal, User, Message, File
from datetime import datetime
db = SessionLocal()
try:
    # 1. Create a user
    alice = User(username='alice', password_hash='hashed_pw')
    db.add(alice)
    db.commit()
    db.refresh(alice)

    # 2. Post a message
    msg = Message(user_id=alice.id, content='Hello, Vibe Pi!', timestamp=datetime.utcnow())
    db.add(msg)
    db.commit()

    # 3. Query back
    user = db.query(User).filter_by(username='alice').first()
    print(f"{user.username} has {len(user.messages)} messages.")
    for m in user.messages:
        print(f"- {m.content} at {m.timestamp}")
finally:
    db.close()
Practical Tips
- Always commit() after writes and refresh(obj) to load autogenerated fields.
- Use relationship(..., back_populates=...) for bidirectional navigation.
- Close sessions in a finally block or use a context manager to prevent connection leaks (a typical get_db dependency is sketched below).
- Configure DATABASE_URL via environment variables for different environments.
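For reference, the get_db dependency imported by the upload endpoint typically follows FastAPI's standard generator pattern; a minimal sketch, assuming it wraps SessionLocal from db.py:
# Yield one session per request and always close it, even on errors
from app.backend.db import SessionLocal

def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()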