Advanced · Updated Dec 17, 2025
RAG with Vector Databases
Build context-aware AI applications by combining Abstrakt's models with vector databases like Pinecone for retrieval augmented generation.
Dr. Marcus Webb
ML Research Lead
18 min read
What is RAG?
Retrieval Augmented Generation (RAG) enhances AI responses by:
- Retrieving relevant documents from a knowledge base
- Augmenting the prompt with this context
- Generating an informed response
Architecture Overview
text
User Query → Embed Query → Search Vector DB → Retrieve Context → Generate Response
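Each stage maps to a single API call. Here is a compact sketch of the loop, assuming the Abstrakt client and Pinecone index configured in the sections that follow; the rest of this guide builds out each step in detail:
python
def rag_answer(user_query):
    """Embed the query, retrieve context, and generate a grounded answer."""
    # 1. Embed the query (same embedding model used throughout this guide)
    query_vec = abstrakt.run("fal-ai/text-embedding", {"input": {"text": user_query}}).embedding

    # 2. Retrieve the closest documents from the vector database
    matches = index.query(vector=query_vec, top_k=5, include_metadata=True).matches
    context = "\n\n".join(m.metadata["content"] for m in matches)

    # 3. Generate a response conditioned on the retrieved context
    result = abstrakt.run("fal-ai/llm", {
        "input": {"prompt": f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:", "max_tokens": 500}
    })
    return result.text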
Setting Up Pinecone
python
import pinecone
from abstrakt import AbstraktClient

# Initialize clients
pinecone.init(api_key="PINECONE_KEY", environment="us-west1-gcp")
abstrakt = AbstraktClient()

# Create index
pinecone.create_index(
    name="knowledge-base",
    dimension=1536,  # Match embedding dimension
    metric="cosine"
)

index = pinecone.Index("knowledge-base")
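The index dimension has to match the embedding model's output size exactly, or upserts will be rejected. As an optional sanity check, you can embed a short probe string before indexing anything; this is a sketch assuming the fal-ai/text-embedding model used in the next section:
python
# Optional check: confirm the embedding size matches the index dimension above
probe = abstrakt.run("fal-ai/text-embedding", {"input": {"text": "dimension probe"}})
assert len(probe.embedding) == 1536, f"embedding dim {len(probe.embedding)} != index dim 1536"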
Creating Embeddings
python
def create_embedding(text):
    """Create embedding for text using Abstrakt."""
    result = abstrakt.run("fal-ai/text-embedding", {
        "input": {"text": text}
    })
    return result.embedding

def index_documents(documents):
    """Index documents into Pinecone."""
    vectors = []
    for doc in documents:
        embedding = create_embedding(doc["content"])
        vectors.append({
            "id": doc["id"],
            "values": embedding,
            "metadata": {
                "title": doc["title"],
                "content": doc["content"][:1000]
            }
        })
    # Batch upsert
    index.upsert(vectors=vectors, batch_size=100)
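With those helpers in place, indexing is just a matter of passing document dicts with id, title, and content keys. The documents below are hypothetical placeholders:
python
# Hypothetical sample documents; replace with your own content
docs = [
    {"id": "doc-1", "title": "Getting started", "content": "Abstrakt models are invoked with the run() method..."},
    {"id": "doc-2", "title": "Billing", "content": "Usage is billed per request..."},
]
index_documents(docs)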
Querying with RAG
python
def query_with_context(user_query, top_k=5):
    """Query with RAG context."""
    # 1. Embed the query
    query_embedding = create_embedding(user_query)

    # 2. Search vector database
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )

    # 3. Build context from results
    context_parts = []
    for match in results.matches:
        context_parts.append(match.metadata["content"])
    context = "\n\n".join(context_parts)

    # 4. Generate response with context
    response = abstrakt.run("fal-ai/llm", {
        "input": {
            "prompt": f"""Based on the following context, answer the question.

Context:
{context}

Question: {user_query}

Answer:""",
            "max_tokens": 500
        }
    })
    return response.text
Chunking Strategies
Large documents need to be split into overlapping chunks before indexing, so that each embedding covers a single, focused passage:
python
def chunk_document(text, chunk_size=500, overlap=50):
    """Split document into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]

        # Find natural break point
        if end < len(text):
            # Look for sentence end
            last_period = chunk.rfind('.')
            if last_period > chunk_size * 0.7:
                chunk = chunk[:last_period + 1]
                end = start + last_period + 1

        chunks.append(chunk.strip())
        start = end - overlap
    return chunks
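In practice, chunking and indexing go together: each chunk becomes its own vector with an id derived from the parent document. A short sketch using the helpers above (the id scheme is only an illustration):
python
def index_large_document(doc):
    """Chunk a long document and index each chunk as a separate vector."""
    chunks = chunk_document(doc["content"])
    pieces = [
        {"id": f"{doc['id']}-chunk-{i}", "title": doc["title"], "content": chunk}
        for i, chunk in enumerate(chunks)
    ]
    index_documents(pieces)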
Hybrid Search
Combine vector search with keyword filtering:
python
def hybrid_search(query, filters=None, top_k=10):
    """Search with both semantic and keyword matching."""
    query_embedding = create_embedding(query)

    search_params = {
        "vector": query_embedding,
        "top_k": top_k,
        "include_metadata": True
    }
    if filters:
        search_params["filter"] = filters
        # Example filter: category and date
        # filters = {
        #     "category": {"$eq": "technical"},
        #     "date": {"$gte": "2025-01-01"}
        # }
    return index.query(**search_params)
Visual RAG
Use RAG with image generation:
python
def generate_contextual_image(query):
    """Generate image based on retrieved visual context."""
    # Search for relevant visual descriptions
    results = index.query(
        vector=create_embedding(query),
        top_k=3,
        filter={"type": "visual_description"},
        include_metadata=True  # needed to read style_description below
    )

    # Build enhanced prompt
    style_context = " ".join([
        m.metadata["style_description"]
        for m in results.matches
    ])
    enhanced_prompt = f"{query}, {style_context}"

    # Generate image
    return abstrakt.run("fal-ai/flux/dev", {
        "input": {"prompt": enhanced_prompt}
    })
Caching Strategies
python
import hashlib

# Cache keyed by a hash of the text, so repeated texts don't trigger new API calls.
# (For long-running services, consider bounding or evicting this cache.)
_embedding_cache = {}

def get_embedding_cached(text):
    """Return a cached embedding, computing it only once per unique text."""
    text_hash = hashlib.md5(text.encode()).hexdigest()
    if text_hash not in _embedding_cache:
        _embedding_cache[text_hash] = create_embedding(text)
    return _embedding_cache[text_hash]
Monitoring & Metrics
python
from datetime import datetime

class RAGMetrics:
    def __init__(self):
        self.queries = []

    def log_query(self, query, results, response_time):
        self.queries.append({
            "query": query,
            "num_results": len(results),
            "response_time": response_time,
            "timestamp": datetime.now()
        })

    def get_avg_latency(self):
        times = [q["response_time"] for q in self.queries]
        return sum(times) / len(times) if times else 0
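A minimal way to feed the logger is to time each retrieval call; the query string below is hypothetical, and the Pinecone match score is shown only as one possible relevance signal:
python
import time

metrics = RAGMetrics()

start = time.time()
results = hybrid_search("how do I rotate my API key?")
metrics.log_query("how do I rotate my API key?", results.matches, time.time() - start)

# Each match carries a similarity score you can track as a rough relevance signal
if results.matches:
    print("top match score:", results.matches[0].score)
print("avg latency:", metrics.get_avg_latency())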
Best Practices
- Chunk wisely - Balance context and relevance
- Index metadata - Enable filtering
- Cache embeddings - Reduce latency and costs
- Monitor quality - Track relevance metrics
- Update regularly - Keep knowledge base fresh (see the sketch below)
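Because Pinecone upserts overwrite vectors that share an id, refreshing a changed document is just a re-embed and re-upsert. A minimal sketch using the helpers defined earlier (the document dict is hypothetical):
python
def refresh_document(doc):
    """Re-embed a changed document and overwrite its stale vector."""
    index.upsert(vectors=[{
        "id": doc["id"],  # same id -> overwrites the existing vector
        "values": create_embedding(doc["content"]),
        "metadata": {"title": doc["title"], "content": doc["content"][:1000]}
    }])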
Next Steps
- Implement content safety
- Learn fine-tuning
- Explore webhook patterns
#rag #vector-database #embeddings #pinecone