Advanced · Updated Dec 17, 2025

RAG with Vector Databases

Build context-aware AI applications by combining Abstrakt's models with vector databases like Pinecone for retrieval augmented generation.

Dr. Marcus Webb · ML Research Lead · 18 min read

What is RAG?

Retrieval Augmented Generation (RAG) enhances AI responses by:

  1. Retrieving relevant documents from a knowledge base
  2. Augmenting the prompt with this context
  3. Generating an informed response

Architecture Overview

text
User Query → Embed Query → Search Vector DB → Retrieve Context → Generate Response
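
Read left to right, the flow maps onto a handful of calls. The sketch below is illustrative only: `create_embedding` and `index` are built in the sections that follow, and `generate` is a placeholder for the LLM call shown later in `query_with_context`.

python
def answer(user_query):
    query_vector = create_embedding(user_query)          # Embed Query
    results = index.query(vector=query_vector,           # Search Vector DB
                          top_k=5, include_metadata=True)
    context = "\n\n".join(m.metadata["content"]          # Retrieve Context
                          for m in results.matches)
    return generate(user_query, context)                 # Generate Response (placeholder)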

Setting Up Pinecone

python
import pinecone
from abstrakt import AbstraktClient

# Initialize clients (this uses the classic pinecone-client v2 API)
pinecone.init(api_key="PINECONE_KEY", environment="us-west1-gcp")
abstrakt = AbstraktClient()

# Create the index only if it does not already exist
if "knowledge-base" not in pinecone.list_indexes():
    pinecone.create_index(
        name="knowledge-base",
        dimension=1536,  # Must match the embedding model's output dimension
        metric="cosine"
    )

index = pinecone.Index("knowledge-base")
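
Before indexing anything, it can be worth a quick sanity check that the index is reachable; `describe_index_stats()` reports the vector count and dimension, and a freshly created index should show zero vectors.

python
# Sanity check: a newly created index should report total_vector_count == 0
stats = index.describe_index_stats()
print(stats)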

Creating Embeddings

python
def create_embedding(text):
    """Create embedding for text using Abstrakt."""
    result = abstrakt.run("fal-ai/text-embedding", {
        "input": {"text": text}
    })
    return result.embedding

def index_documents(documents):
    """Index documents into Pinecone."""
    vectors = []
    
    for doc in documents:
        embedding = create_embedding(doc["content"])
        vectors.append({
            "id": doc["id"],
            "values": embedding,
            "metadata": {
                "title": doc["title"],
                "content": doc["content"][:1000]
            }
        })
    
    # Batch upsert
    index.upsert(vectors=vectors, batch_size=100)
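
As a usage example, the documents below are made up for illustration; each record only needs the `id`, `title`, and `content` keys referenced above.

python
docs = [
    {
        "id": "doc-001",
        "title": "Getting started",
        "content": "Abstrakt exposes hosted models through a single run() call..."
    },
    {
        "id": "doc-002",
        "title": "Vector search basics",
        "content": "Cosine similarity compares the angle between two embeddings..."
    }
]

index_documents(docs)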

Querying with RAG

python
def query_with_context(user_query, top_k=5):
    """Query with RAG context."""
    
    # 1. Embed the query
    query_embedding = create_embedding(user_query)
    
    # 2. Search vector database
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True
    )
    
    # 3. Build context from results
    context_parts = []
    for match in results.matches:
        context_parts.append(match.metadata["content"])
    
    context = "\n\n".join(context_parts)
    
    # 4. Generate response with context
    response = abstrakt.run("fal-ai/llm", {
        "input": {
            "prompt": f"""Based on the following context, answer the question.

Context:
{context}

Question: {user_query}

Answer:""",
            "max_tokens": 500
        }
    })
    
    return response.text
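
Calling it is then a one-liner (the question is just an example):

python
answer = query_with_context("How do I rotate my API keys?", top_k=3)
print(answer)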

Chunking Strategies

Large documents need to be split into chunks:

python
def chunk_document(text, chunk_size=500, overlap=50):
    """Split document into overlapping chunks."""
    chunks = []
    start = 0
    
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        
        # Find natural break point
        if end < len(text):
            # Look for sentence end
            last_period = chunk.rfind('.')
            if last_period > chunk_size * 0.7:
                chunk = chunk[:last_period + 1]
                end = start + last_period + 1
        
        chunks.append(chunk.strip())
        start = end - overlap
    
    return chunks
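
Chunks are indexed as individual records so each one can be retrieved on its own. A small sketch that reuses `index_documents` from above; the `{doc_id}-chunk-{i}` ID scheme is just one convention.

python
def index_chunked_document(doc_id, title, text):
    """Split a long document and index each chunk as its own record."""
    index_documents([
        {
            "id": f"{doc_id}-chunk-{i}",
            "title": title,
            "content": chunk
        }
        for i, chunk in enumerate(chunk_document(text))
    ])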

Hybrid Search

Combine semantic vector search with metadata filtering to narrow results:

python
def hybrid_search(query, filters=None, top_k=10):
    """Search with both semantic and keyword matching."""
    
    query_embedding = create_embedding(query)
    
    search_params = {
        "vector": query_embedding,
        "top_k": top_k,
        "include_metadata": True
    }
    
    if filters:
        search_params["filter"] = filters
    
    # Example filter: category and date
    # filters = {
    #     "category": {"$eq": "technical"},
    #     "date": {"$gte": "2025-01-01"}
    # }
    
    return index.query(**search_params)
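
For example, restricting retrieval to recent technical documents could look like this; the `category` and `date` fields are only meaningful if you stored them as metadata at upsert time.

python
results = hybrid_search(
    "how do I batch upsert vectors?",
    filters={
        "category": {"$eq": "technical"},
        "date": {"$gte": "2025-01-01"}
    },
    top_k=5
)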

Visual RAG

Use RAG with image generation:

python
def generate_contextual_image(query):
    """Generate image based on retrieved visual context."""
    
    # Search for relevant visual descriptions
    results = index.query(
        vector=create_embedding(query),
        top_k=3,
        include_metadata=True,  # Needed to read style_description below
        filter={"type": "visual_description"}
    )
    
    # Build enhanced prompt
    style_context = " ".join([
        m.metadata["style_description"] 
        for m in results.matches
    ])
    
    enhanced_prompt = f"{query}, {style_context}"
    
    # Generate image
    return abstrakt.run("fal-ai/flux/dev", {
        "input": {"prompt": enhanced_prompt}
    })

Caching Strategies

python
from functools import lru_cache

@lru_cache(maxsize=1000)
def get_embedding_cached(text):
    """Cache embeddings to avoid repeat API calls for identical text.

    lru_cache keys on the text string directly, so no separate
    hashing step is needed.
    """
    return create_embedding(text)

Monitoring & Metrics

python
from datetime import datetime

class RAGMetrics:
    def __init__(self):
        self.queries = []
    
    def log_query(self, query, results, response_time):
        self.queries.append({
            "query": query,
            "num_results": len(results),
            "response_time": response_time,
            "timestamp": datetime.now()
        })
    
    def get_avg_latency(self):
        times = [q["response_time"] for q in self.queries]
        return sum(times) / len(times) if times else 0
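
One way to wire this in is a small wrapper that times the retrieval step; `logged_retrieval` below is a hypothetical helper, not part of the earlier code.

python
import time

metrics = RAGMetrics()

def logged_retrieval(user_query, top_k=5):
    """Retrieve context for a query and record latency plus hit count."""
    start = time.perf_counter()
    results = index.query(
        vector=create_embedding(user_query),
        top_k=top_k,
        include_metadata=True
    )
    metrics.log_query(user_query, results.matches, time.perf_counter() - start)
    return results

Calling metrics.get_avg_latency() then reports the average retrieval latency across the logged queries.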

Best Practices

  1. Chunk wisely - Balance context and relevance
  2. Index metadata - Enable filtering
  3. Cache embeddings - Reduce latency and costs
  4. Monitor quality - Track relevance metrics
  5. Update regularly - Keep knowledge base fresh

Next Steps

#rag #vector-database #embeddings #pinecone