RAG Is the Skill Your Agent Pipeline Is Missing

Every agent eventually hits the same wall: the context window isn't big enough, the training data isn't fresh enough, and the user's question requires knowledge the model doesn't have.

Retrieval-Augmented Generation solves this. But most teams implement RAG as a monolithic system — a single codebase that handles chunking, embedding, vector storage, retrieval, and reranking. When any piece breaks or underperforms, the whole pipeline needs surgery.

The alternative is composable RAG: discrete, specialized skills that each do one thing well, wired together through a pipeline. Change your embedding model without touching your chunker. Swap reranking strategies without redeploying your retrieval layer. Pay only for what you use.

The Three RAG Primitives

1. Embedding Generation

The foundation of any RAG pipeline is turning text into vectors. But embedding models differ dramatically in cost, dimensionality, and semantic fidelity. OpenAI's text-embedding-3-large is a strong general-purpose choice. Cohere's embed-multilingual-v3.0 handles multilingual content better. Voyage's voyage-code-2 is tuned for code.

VectorVault's embedding-generator ($0.001/call) abstracts this choice. Pass text, optionally specify a model, and get back normalized vectors ready for your database. The auto mode selects the best model based on content type detection — technical docs route to Voyage, multilingual content routes to Cohere, general text routes to OpenAI.
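As a rough illustration of what content-type routing could look like, here is a minimal sketch. The heuristics, the `ROUTES` table, and the function names are assumptions made for this example; the article does not describe how the auto mode actually detects content types.

```python
import re

# Hypothetical routing table: content type -> provider label.
# The mapping mirrors the description above, not a documented API.
ROUTES = {"code": "voyage", "multilingual": "cohere", "general": "openai"}

# Tokens that tend to show up in source code rather than prose.
CODE_HINTS = re.compile(r"(\bdef |\bclass |\bimport |\breturn\b|[{};]|=>|->)")


def detect_content_type(text: str) -> str:
    """Very rough heuristic; a production detector would be more careful."""
    if not text:
        return "general"
    # Treat a high share of non-ASCII characters as a multilingual signal.
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    if non_ascii / len(text) > 0.2:
        return "multilingual"
    # Treat text as code when at least half of its lines look like code.
    lines = text.splitlines() or [text]
    code_lines = sum(1 for line in lines if CODE_HINTS.search(line))
    if code_lines / len(lines) >= 0.5:
        return "code"
    return "general"


def pick_provider(text: str) -> str:
    """Map detected content type to a provider label."""
    return ROUTES[detect_content_type(text)]
```

Calling `pick_provider` on a Python snippet returns `"voyage"` under these heuristics, while ordinary English prose falls through to `"openai"`. Languages written mostly in ASCII would need a real language detector.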

Batch processing handles up to 100 chunks per call, and automatic dimensionality reduction lets you target smaller index sizes without re-embedding.
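The batching limit and the idea of shrinking vectors without re-embedding can be sketched client-side. This is an illustration under stated assumptions: the helper names are hypothetical, and truncate-then-renormalize only preserves quality for models trained to support shortened embeddings. The article does not say which reduction method the skill uses.

```python
import math
from typing import Iterator, List

MAX_BATCH = 100  # per-call chunk limit stated in the article


def batches(chunks: List[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` chunks."""
    for start in range(0, len(chunks), size):
        yield chunks[start:start + size]


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0 else [x / norm for x in vec]


def truncate_and_renormalize(vec: List[float], dims: int) -> List[float]:
    """Keep the first `dims` components, then restore unit length.

    Assumption: the embedding model tolerates truncation. For other
    models this can degrade retrieval quality.
    """
    return l2_normalize(vec[:dims])
```

A document that produces 250 chunks would need three calls (100, 100, and 50 chunks), which is worth remembering when estimating per-call costs.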

2. Semantic Chunking

Naive chunking — splitting every 512 tokens with a 50-token overlap — loses context at every boundary. A function definition gets split from its docstring. A table header separates from its rows. A paragraph's conclusion lands in a different chunk than its premise.
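For reference, the naive strategy fits in a few lines, which is part of why it is so common. The sketch below assumes the text has already been tokenized into a list; the function name is made up for this example.

```python
from typing import List, Sequence


def fixed_chunks(tokens: Sequence, size: int = 512, overlap: int = 50) -> List[Sequence]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end of the input.
        if start + size >= len(tokens):
            break
    return chunks
```

Nothing in this loop looks at the content, so a boundary falls wherever the token count says it should, regardless of headings, code blocks, or sentences.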

The chunk-optimizer ($0.002/call) detects semantic boundaries: section headings, code block delimiters, paragraph breaks, and topic shifts. It produces chunks that are semantically coherent units, not arbitrary text slices. Each chunk carries metadata — its source position, parent heading, and neighboring chunk IDs — so downstream retrieval can reconstruct context when needed.

Five strategies cover different content types: semantic for general documents, paragraph for well-structured prose, sentence for fine-grained retrieval, recursive for nested structures, and fixed when you need deterministic sizing.
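To make the metadata idea concrete, here is a minimal sketch of a paragraph-style strategy that records source position, parent heading, and neighbor IDs. The `Chunk` fields and the convention that headings start with `#` are assumptions for this example, not the chunk-optimizer's actual output schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    id: int
    text: str
    start: int                      # character offset in the source document
    heading: Optional[str]          # nearest preceding heading, if any
    prev_id: Optional[int] = None   # neighbor links for context reconstruction
    next_id: Optional[int] = None


def paragraph_chunks(doc: str) -> List[Chunk]:
    """Split on blank lines; lines starting with '#' are treated as headings."""
    chunks: List[Chunk] = []
    heading: Optional[str] = None
    pos = 0
    for block in doc.split("\n\n"):
        # Offset of the first non-whitespace character of this block.
        start = pos + (len(block) - len(block.lstrip()))
        pos += len(block) + 2  # account for the "\n\n" separator
        text = block.strip()
        if not text:
            continue
        if text.startswith("#"):
            heading = text.lstrip("#").strip()
            continue
        chunks.append(Chunk(len(chunks), text, start, heading))
    # Link each chunk to its neighbors.
    for i, chunk in enumerate(chunks):
        chunk.prev_id = i - 1 if i > 0 else None
        chunk.next_id = i + 1 if i < len(chunks) - 1 else None
    return chunks
```

With neighbor IDs in place, a retrieval layer can pull in the chunk before or after a hit when the hit alone lacks context.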

3. Neural Reranking

Vector similarity gets you candidates. Reranking gets you answers. The gap between "top 20 by cosine similarity" and "top 5 by actual relevance" is where RAG pipelines succeed or fail.

The retrieval-reranker ($0.003/call) implements two-stage retrieval: BM25 keyword matching for recall, cross-encoder neural scoring for precision. Pass your pre-retrieved candidates and query, get back scored results with relevance explanations and extracted citations.

The citation extraction is particularly valuable for agent pipelines — instead of dumping entire chunks into the context window, the reranker identifies the specific supporting passage, reducing token usage while improving answer grounding.
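The two-stage shape can be sketched in pure Python. BM25 below is a standard Okapi-style implementation; the second stage uses a simple term-overlap scorer as a stand-in, because a real cross-encoder needs a trained model. All function names are hypothetical, and the "citation" is just the best-scoring sentence of each result.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of each document against the query."""
    doc_tokens = [tokenize(d) for d in docs]
    n = len(docs)
    avg_len = sum(len(t) for t in doc_tokens) / n if n else 0.0
    df = Counter(term for toks in doc_tokens for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in doc_tokens:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def overlap_score(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder: fraction of query terms in the passage."""
    q = set(tokenize(query))
    return len(q & set(tokenize(passage))) / len(q) if q else 0.0


def best_sentence(query: str, doc: str, scorer: Callable[[str, str], float]) -> str:
    """Pick the sentence that the scorer rates highest for the query."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    return max(sentences, key=lambda s: scorer(query, s)) if sentences else ""


def rerank(query: str, candidates: List[str],
           scorer: Callable[[str, str], float] = overlap_score,
           recall_k: int = 20, top_k: int = 5) -> List[Tuple[float, str, str]]:
    # Stage 1: BM25 keeps the recall_k best keyword matches.
    ranked = sorted(zip(bm25_scores(query, candidates), candidates),
                    key=lambda pair: pair[0], reverse=True)
    shortlist = [doc for _, doc in ranked[:recall_k]]
    # Stage 2: the (stand-in) relevance scorer reorders the shortlist.
    rescored = sorted(((scorer(query, doc), doc) for doc in shortlist),
                      key=lambda pair: pair[0], reverse=True)
    return [(score, doc, best_sentence(query, doc, scorer))
            for score, doc in rescored[:top_k]]
```

Swapping `overlap_score` for a function that calls a real cross-encoder changes the quality of stage two without touching the rest of the flow. Passing only the extracted sentence to the agent, rather than the whole chunk, is what produces the token savings.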

Cost Arithmetic

A typical RAG pipeline that processes a 10-page document and answers 5 questions:

Step	Calls	Cost
Chunk document	1	$0.002
Generate embeddings	1 (batch)	$0.001
Rerank per query	5	$0.015
Total	7	$0.018
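
The table can be reproduced with a few lines of arithmetic, which also makes it easy to try other workloads. The prices come from the article; the function and its parameters are illustrative, and `Decimal` is used to avoid floating-point noise in currency sums.

```python
from decimal import Decimal
from typing import Tuple

# Per-call prices quoted in the article.
PRICES = {
    "chunk-optimizer": Decimal("0.002"),
    "embedding-generator": Decimal("0.001"),
    "retrieval-reranker": Decimal("0.003"),
}


def pipeline_cost(documents: int = 1, embedding_batches: int = 1,
                  queries: int = 5) -> Tuple[int, Decimal]:
    """Return (total calls, total skill fees in dollars) for one workload."""
    calls = {
        "chunk-optimizer": documents,
        "embedding-generator": embedding_batches,
        "retrieval-reranker": queries,
    }
    total_calls = sum(calls.values())
    total_cost = sum(PRICES[skill] * n for skill, n in calls.items())
    return total_calls, total_cost
```

With the defaults this returns 7 calls and $0.018, matching the table. Chunking and embedding are one-time costs per document, so only the reranking term grows with the number of questions.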

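Compare this to stuffing the entire document into the context window. At $0.015 per 1K tokens, a 10-page document (~8K tokens of input plus output) costs roughly $0.12 per question, or $0.60 for five questions. On skill fees alone, the RAG pipeline is about 33x cheaper, and it produces more grounded answers. The final generation call over the retrieved passages is billed separately, but it reads a few short passages rather than the whole document.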

Why Skills Beat Libraries

RAG libraries are easy to find: LangChain, LlamaIndex, Haystack. They're excellent for prototyping. But they share three production problems:

Dependency sprawl. A RAG library that runs models locally pulls in torch, transformers, tokenizers, and model weights. Your agent's container image can grow from 200MB to 4GB.

Version coupling. Upgrading your embedding model means upgrading the library, which means testing every other feature that touches it.

No cost isolation. When embedding costs spike because a new document type produces larger chunks, there's no billing signal — it's buried in your cloud compute bill.

Skills solve all three. Each RAG primitive runs in its own environment, bills per call, and can be upgraded or swapped independently. The BluePages composition engine chains them into a pipeline with typed handoff schemas and cost tracking per step.

Building a RAG Pipeline on BluePages

Using the composition API, a complete RAG pipeline looks like:

Document -> chunk-optimizer -> embedding-generator -> [vector DB] -> retrieval-reranker -> Agent

Each step is a separate skill invocation with x402 payment. The composition engine handles the data flow, and the spending limit system caps your total RAG budget per agent per day.
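The combination of typed handoffs, per-step cost tracking, and a spending cap can be sketched locally. Everything here (`Step`, `Pipeline`, `BudgetExceeded`) is hypothetical and is not the BluePages composition API; the x402 payment exchange is not modeled, and the step functions are trivial stand-ins for remote skill calls.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Any, Callable, List, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when the next step would push spending past the cap."""


@dataclass
class Step:
    name: str
    price: Decimal              # per-call fee
    run: Callable[[Any], Any]   # stand-in for the remote skill invocation
    accepts: type               # expected input type (the handoff schema)
    returns: type               # expected output type


@dataclass
class Pipeline:
    steps: List[Step]
    budget: Decimal
    ledger: List[Tuple[str, Decimal]] = field(default_factory=list)

    @property
    def spent(self) -> Decimal:
        return sum((cost for _, cost in self.ledger), Decimal("0"))

    def invoke(self, payload: Any) -> Any:
        for step in self.steps:
            if not isinstance(payload, step.accepts):
                raise TypeError(f"{step.name} expected {step.accepts.__name__}")
            # Check the cap before paying for the call.
            if self.spent + step.price > self.budget:
                raise BudgetExceeded(f"{step.name} would exceed {self.budget}")
            payload = step.run(payload)
            self.ledger.append((step.name, step.price))  # cost tracking per step
            if not isinstance(payload, step.returns):
                raise TypeError(f"{step.name} returned an unexpected type")
        return payload
```

The ledger gives a per-step billing signal, and the budget check fails before a call is made rather than after. A real spending limit would also need to persist across invocations and reset daily.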

VectorVault: The Publisher

VectorVault (vectorvault.dev) specializes in production RAG infrastructure as composable skills. Their three skills cover the full retrieval lifecycle — from raw documents to ranked, cited results. All skills support the major embedding providers and are designed for agent-to-agent invocation patterns where token efficiency matters most.

What This Means for Agent Builders

RAG isn't optional for production agents. But building and maintaining the RAG infrastructure yourself should be. Composable RAG skills let you:

Start with a working pipeline in minutes, not weeks
Swap components without pipeline surgery
Pay per retrieval, not per GPU-hour
Track costs per query, not per cluster

The knowledge retrieval layer is now as composable as the rest of your agent pipeline. Browse the full RAG & Knowledge Retrieval collection on BluePages to get started.