Every AI product eventually hits the same wall: users ask questions about your data — your docs, your knowledge base, your product — and the LLM has no idea what to say. Fine-tuning is expensive and slow. Stuffing whole documents into the context window burns tokens. Neither scales. RAG — Retrieval-Augmented Generation — solves this cleanly: retrieve only the relevant pieces at query time and hand them to the model as context. In this guide you'll build a fully working RAG application from scratch — Next.js App Router, OpenAI embeddings, Pinecone — and deploy it to Vercel. Real code, real architecture, production-aware.
Understanding RAG — why it beats fine-tuning for most use cases
The idea is simple. At ingestion time, you split your documents into chunks, convert each chunk into a vector embedding, and store those embeddings in a vector database. At query time, you embed the user's question, find the most similar chunks, and pass them to the LLM as context alongside the question. The model answers from retrieved knowledge instead of trained knowledge.
Why this wins over fine-tuning for most use cases:
| | RAG | Fine-tuning |
|---|---|---|
| Cost | Low (embeddings are cheap) | High (GPU hours) |
| Time to update | Minutes (re-index) | Hours to days (retrain) |
| Works with private data | Yes | Yes |
| Handles dynamic data | Yes | No |
| Explainable | Yes (show sources) | No |
| Minimum data required | Any amount | 50–100+ examples |
Fine-tuning makes sense when you need to change the model's style, tone, or behavior — not when you need it to know specific facts. For facts, use RAG.
Architecture overview
Two separate flows, and keeping them separate in your head is half the architecture. The ingestion pipeline runs once (or whenever documents change): load → chunk → embed → upsert.
The query pipeline runs on every user question: embed the question, retrieve the top-K most similar chunks, build a context-grounded prompt, and stream the answer back.
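The two flows can be sketched as a pair of functions over injected dependencies. This is an illustrative outline, not the code built later in this guide: the `Embedder`, `VectorStore`, and `TextGenerator` interfaces and the `ingest` and `answer` names are assumptions introduced here only to show the shape of each pipeline.

```typescript
// Hypothetical interfaces: stand-ins for the embedding, vector database, and LLM clients.
interface StoredChunk { id: string; text: string; vector: number[] }
interface Embedder { embed(texts: string[]): Promise<number[][]> }
interface VectorStore {
  upsert(chunks: StoredChunk[]): Promise<void>;
  query(vector: number[], topK: number): Promise<StoredChunk[]>;
}
interface TextGenerator { complete(prompt: string): Promise<string> }

// Ingestion pipeline: runs when documents change.
// Loading and chunking happen before this call; this covers embed -> upsert.
async function ingest(
  chunks: { id: string; text: string }[],
  embedder: Embedder,
  store: VectorStore
): Promise<number> {
  const vectors = await embedder.embed(chunks.map((c) => c.text));
  await store.upsert(chunks.map((c, i) => ({ ...c, vector: vectors[i] })));
  return chunks.length;
}

// Query pipeline: runs on every question (embed -> retrieve -> prompt -> generate).
async function answer(
  question: string,
  embedder: Embedder,
  store: VectorStore,
  generator: TextGenerator,
  topK: number = 5
): Promise<string> {
  const [questionVector] = await embedder.embed([question]);
  const matches = await store.query(questionVector, topK);
  const context = matches.map((m) => m.text).join("\n\n");
  return generator.complete(`CONTEXT:\n${context}\n\nQUESTION: ${question}`);
}
```

The only things the two functions share are the embedder and the store, which is why the same embedding model must be used on both sides.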
Setting up Next.js App Router with TypeScript
You need Node.js 20+, a free Pinecone account, an OpenAI API key, and basic familiarity with the App Router. Scaffold the project:
npx create-next-app@latest rag-app --typescript --tailwind --app --no-src-dir
cd rag-app
npm install @pinecone-database/pinecone openai ai @ai-sdk/openai zod
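The ai package is the Vercel AI SDK; it handles streaming from OpenAI to the browser cleanly. The route and UI code in this guide call toDataStreamResponse() and import useChat from ai/react, and the SDK has renamed parts of this API between major versions, so if those names fail to resolve, check which major version was installed. Set your environment variables in .env.local: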
OPENAI_API_KEY=sk-your-openai-key
PINECONE_API_KEY=your-pinecone-api-key
PINECONE_INDEX_NAME=rag-demo
The project structure you're building toward:
rag-app/
├── app/
│ ├── api/
│ │ ├── ingest/route.ts # chunk → embed → upsert
│ │ └── chat/route.ts # embed → retrieve → stream
│ ├── page.tsx # chat UI
│ └── layout.tsx
├── lib/
│ ├── pinecone.ts # Pinecone client
│ ├── openai.ts # OpenAI client
│ ├── chunker.ts # document chunking
│ └── embeddings.ts # embedding utilities
└── .env.local
Connecting Pinecone vector database
Create your index in the Pinecone dashboard first: dimensions 1536 (matches text-embedding-3-small output), metric cosine, cloud AWS us-east-1 (free tier). Then create the client in lib/pinecone.ts — singleton pattern, so the client is reused across requests instead of re-created on every invocation:
// lib/pinecone.ts
import { Pinecone } from "@pinecone-database/pinecone";
let pineconeClient: Pinecone | null = null;
export function getPineconeClient(): Pinecone {
if (!pineconeClient) {
pineconeClient = new Pinecone({
apiKey: process.env.PINECONE_API_KEY!,
});
}
return pineconeClient;
}
export function getPineconeIndex() {
const client = getPineconeClient();
return client.index(process.env.PINECONE_INDEX_NAME!);
}
Same pattern for the OpenAI client in lib/openai.ts:
// lib/openai.ts
import OpenAI from "openai";
let openaiClient: OpenAI | null = null;
export function getOpenAIClient(): OpenAI {
if (!openaiClient) {
openaiClient = new OpenAI({
apiKey: process.env.OPENAI_API_KEY!,
});
}
return openaiClient;
}
Chunking and embedding your documents
This is the most important step in the entire pipeline. Bad chunking = bad retrieval = bad answers. Everything downstream — the vector search, the prompt, the model — can only work with the chunks you give it.
Bad chunking is invisible in your code and obvious in your answers. It's the highest-leverage hundred lines in the whole pipeline.
Choosing the right chunk size
Chunk size is the single biggest lever for retrieval quality:
| Chunk size | Best for | Trade-off |
|---|---|---|
| 256–512 tokens | Precise Q&A, FAQs | May lose surrounding context |
| 512–1024 tokens | General documents, articles | Good balance for most cases |
| 1024–2048 tokens | Long-form content, legal docs | Less precise retrieval |
Start with 512 tokens and 50-token overlap. The overlap repeats the end of each chunk at the start of the next, so a sentence or idea that straddles a boundary still appears intact in at least one chunk. Create lib/chunker.ts:
// lib/chunker.ts
export interface DocumentChunk {
id: string;
text: string;
metadata: {
source: string;
chunkIndex: number;
totalChunks: number;
};
}
/**
* Splits a document into overlapping chunks.
* chunkSize is in characters — roughly 4 chars per token.
*/
export function chunkDocument(
text: string,
source: string,
chunkSize: number = 2000, // ~500 tokens
overlap: number = 200 // ~50 tokens
): DocumentChunk[] {
const chunks: DocumentChunk[] = [];
let start = 0;
let chunkIndex = 0;
// Clean the text first, keeping newlines so they can still serve as break points
const cleanedText = text
.replace(/[^\S\n]+/g, " ") // collapse spaces and tabs, but not newlines
.replace(/\n{3,}/g, "\n\n") // collapse excessive newlines
.trim();
while (start < cleanedText.length) {
let end = start + chunkSize;
// Try to break at a sentence boundary
if (end < cleanedText.length) {
const lastPeriod = cleanedText.lastIndexOf(".", end);
const lastNewline = cleanedText.lastIndexOf("\n", end);
const breakPoint = Math.max(lastPeriod, lastNewline);
// Only use it if it's reasonably close to our target
if (breakPoint > start + chunkSize * 0.7) {
end = breakPoint + 1;
}
}
const chunkText = cleanedText.slice(start, end).trim();
if (chunkText.length > 50) { // skip tiny chunks
chunks.push({
id: `${source}-chunk-${chunkIndex}`,
text: chunkText,
metadata: { source, chunkIndex, totalChunks: 0 },
});
chunkIndex++;
}
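// Stop once the end of the text is reached; otherwise the overlap step
// would emit a redundant chunk that only repeats the tail of this one
if (end >= cleanedText.length) break;
start = end - overlap;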
}
// Update totalChunks now that we know the final count
return chunks.map(chunk => ({
...chunk,
metadata: { ...chunk.metadata, totalChunks: chunks.length },
}));
}
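To sanity-check a chunking configuration before ingesting, the number of chunks can be estimated from the sliding-window arithmetic: each chunk after the first starts `chunkSize - overlap` characters after the previous one. The sketch below assumes fixed-size windows and a loop that stops once a window reaches the end of the text; it ignores the sentence-boundary adjustment, which shortens some chunks and can add a few more. `estimateChunkCount` and `approxTokens` are hypothetical helper names, and the 4-characters-per-token ratio is the rough rule of thumb from the comment above.

```typescript
/** Rough token estimate, using the ~4 characters per token rule of thumb. */
function approxTokens(charCount: number): number {
  return Math.ceil(charCount / 4);
}

/**
 * Estimates how many fixed-size chunks a text of `textLength` characters yields
 * when each window starts `chunkSize - overlap` characters after the previous one.
 */
function estimateChunkCount(
  textLength: number,
  chunkSize: number = 2000,
  overlap: number = 200
): number {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  if (textLength <= 0) return 0;
  if (textLength <= chunkSize) return 1;
  const stride = chunkSize - overlap;
  // The first window covers chunkSize characters; each later window adds `stride` new ones.
  return 1 + Math.ceil((textLength - chunkSize) / stride);
}

// A 10,000-character document with the defaults: 1 + ceil(8000 / 1800) = 6 chunks
console.log(estimateChunkCount(10000)); // 6
// A full 2,000-character chunk is about 500 tokens
console.log(approxTokens(2000)); // 500
```

Multiplying the chunk count by the tokens per chunk gives a quick estimate of how many tokens an ingestion run will send to the embedding API.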
Now the embedding utility in lib/embeddings.ts. text-embedding-3-small gives you 1536 dimensions at $0.02 per 1M tokens — 6× cheaper than -large and sufficient for most RAG apps:
// lib/embeddings.ts
import { getOpenAIClient } from "./openai";
import { DocumentChunk } from "./chunker";
const EMBEDDING_MODEL = "text-embedding-3-small";
/** Embeds an array of texts, batched to stay within OpenAI's limits. */
export async function generateEmbeddings(texts: string[]): Promise<number[][]> {
const openai = getOpenAIClient();
const BATCH_SIZE = 100;
const allEmbeddings: number[][] = [];
for (let i = 0; i < texts.length; i += BATCH_SIZE) {
const batch = texts.slice(i, i + BATCH_SIZE);
const response = await openai.embeddings.create({
model: EMBEDDING_MODEL,
input: batch,
});
const batchEmbeddings = response.data
.sort((a, b) => a.index - b.index)
.map((item) => item.embedding);
allEmbeddings.push(...batchEmbeddings);
}
return allEmbeddings;
}
/** Prepares chunks with their embeddings for Pinecone upsert. */
export async function embedChunks(chunks: DocumentChunk[]) {
const texts = chunks.map((chunk) => chunk.text);
const embeddings = await generateEmbeddings(texts);
return chunks.map((chunk, i) => ({
id: chunk.id,
values: embeddings[i],
metadata: { text: chunk.text, ...chunk.metadata },
}));
}
The ingestion API route
The ingestion route ties it together: accept document text, chunk it, embed it, upsert to Pinecone in batches of 100 (Pinecone's recommendation). Create app/api/ingest/route.ts:
// app/api/ingest/route.ts
import { NextRequest, NextResponse } from "next/server";
import { chunkDocument } from "@/lib/chunker";
import { embedChunks } from "@/lib/embeddings";
import { getPineconeIndex } from "@/lib/pinecone";
export async function POST(req: NextRequest) {
try {
const { text, source } = await req.json();
if (!text || !source) {
return NextResponse.json(
{ error: "Both text and source are required" },
{ status: 400 }
);
}
// 1. Chunk → 2. Embed → 3. Upsert
const chunks = chunkDocument(text, source);
const vectors = await embedChunks(chunks);
const index = getPineconeIndex();
const UPSERT_BATCH_SIZE = 100;
for (let i = 0; i < vectors.length; i += UPSERT_BATCH_SIZE) {
await index.upsert(vectors.slice(i, i + UPSERT_BATCH_SIZE));
}
return NextResponse.json({
success: true,
chunksIngested: chunks.length,
source,
});
} catch (error) {
console.error("[Ingest] Error:", error);
return NextResponse.json(
{ error: "Ingestion failed", details: String(error) },
{ status: 500 }
);
}
}
Test it with a sample document:
curl -X POST http://localhost:3000/api/ingest \
-H "Content-Type: application/json" \
-d '{
"text": "Your document text goes here. This can be the contents of a PDF, a knowledge base article, or any text you want the AI to know about.",
"source": "sample-doc-v1"
}'
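The route above only checks that text and source are truthy, so a non-string text or an oversized body would still reach the embedding call. A stricter check might look like the sketch below. It is dependency-free so it can be read on its own; the `parseIngestBody` name, the size limit, and the allowed character set for source are assumptions, and the same rules could be written as a zod schema, since zod is already installed.

```typescript
interface IngestBody { text: string; source: string }
type ParseResult =
  | { ok: true; value: IngestBody }
  | { ok: false; error: string };

// Assumed limits; tune them to your documents.
const MAX_TEXT_CHARS = 500000;
const SOURCE_PATTERN = /^[\w.-]{1,128}$/;

function parseIngestBody(body: unknown): ParseResult {
  if (typeof body !== "object" || body === null) {
    return { ok: false, error: "Body must be a JSON object" };
  }
  const { text, source } = body as Record<string, unknown>;
  if (typeof text !== "string" || text.trim().length === 0) {
    return { ok: false, error: "text must be a non-empty string" };
  }
  if (text.length > MAX_TEXT_CHARS) {
    return { ok: false, error: `text exceeds ${MAX_TEXT_CHARS} characters` };
  }
  // source becomes part of every chunk id, so keep it to a safe character set
  if (typeof source !== "string" || !SOURCE_PATTERN.test(source)) {
    return { ok: false, error: "source must match " + SOURCE_PATTERN.source };
  }
  return { ok: true, value: { text, source } };
}
```

In the route, the result of `parseIngestBody(await req.json())` would replace the destructuring, with a 400 response carrying `error` whenever `ok` is false.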
Building the retrieval API route
This is the query-time pipeline: embed the question, retrieve the top matches from Pinecone, build a prompt with that context, stream the response. Create app/api/chat/route.ts:
// app/api/chat/route.ts
import { NextRequest } from "next/server";
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
import { generateEmbeddings } from "@/lib/embeddings";
import { getPineconeIndex } from "@/lib/pinecone";
// Chunks to retrieve — higher = more context, more tokens, more cost
const TOP_K = 5;
export async function POST(req: NextRequest) {
try {
const { messages } = await req.json();
const userMessage = messages[messages.length - 1].content as string;
// 1. Embed the user's question
const [questionEmbedding] = await generateEmbeddings([userMessage]);
// 2. Query Pinecone for the most relevant chunks
const index = getPineconeIndex();
const queryResponse = await index.query({
vector: questionEmbedding,
topK: TOP_K,
includeMetadata: true,
});
// 3. Keep only genuinely relevant matches
const retrievedChunks = queryResponse.matches
.filter((match) => match.score && match.score > 0.7)
.map((match) => ({
text: match.metadata?.text as string,
source: match.metadata?.source as string,
}));
// 4. Build context from retrieved chunks
const context = retrievedChunks
.map((chunk, i) => `[Source ${i + 1}: ${chunk.source}]\n${chunk.text}`)
.join("\n\n---\n\n");
// 5. The RAG prompt
const systemPrompt = `You are a helpful assistant that answers questions based on the provided context.
CONTEXT:
${context || "No relevant context found for this question."}
INSTRUCTIONS:
- Answer using ONLY the information in the context above
- If the context does not contain enough information, say "I don't have enough information about that in my knowledge base"
- Always cite which source(s) you used
- Be concise and accurate
- Do not make up information that is not in the context`;
// 6. Stream the response via the Vercel AI SDK
const result = streamText({
model: openai("gpt-4o"),
system: systemPrompt,
messages,
});
return result.toDataStreamResponse();
} catch (error) {
console.error("[Chat] Error:", error);
return new Response(
JSON.stringify({ error: "Failed to process your question" }),
{ status: 500, headers: { "Content-Type": "application/json" } }
);
}
}
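The match.score > 0.7 line matters more than it looks. Cosine similarity ranges from -1 to 1, and for text embeddings the scores mostly fall between 0 and 1. If the top result scores 0.4, Pinecone probably found nothing truly relevant: the user asked about something outside your knowledge base. Without the filter you'd pass irrelevant chunks to the model and invite confidently hallucinated answers. Treat 0.7 as a starting point for document Q&A and tune it against real queries, because typical scores vary by embedding model and a threshold set too high silently discards good matches.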
A RAG app that answers from nothing is just a chatbot with extra steps. The score filter is what keeps it honest.
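One way to make the filter explicit and testable is to pull it into a small helper that also reports whether anything survived, so the route can return the fallback message without calling the model at all. This is a sketch under assumptions: `ScoredMatch` mirrors only the fields of a match that the route reads, and `buildContext` is a hypothetical name.

```typescript
// Only the fields the chat route reads from each match.
interface ScoredMatch {
  score?: number;
  metadata?: { text?: string; source?: string };
}

interface ContextResult {
  context: string;      // formatted block for the system prompt
  sources: string[];    // distinct sources that contributed
  hasContext: boolean;  // false when nothing passed the threshold
}

function buildContext(matches: ScoredMatch[], minScore: number = 0.7): ContextResult {
  // Keep matches that have a numeric score above the threshold and some text
  const relevant = matches.filter(
    (m) => typeof m.score === "number" && m.score > minScore && !!m.metadata?.text
  );
  const context = relevant
    .map((m, i) => `[Source ${i + 1}: ${m.metadata?.source ?? "unknown"}]\n${m.metadata?.text}`)
    .join("\n\n---\n\n");
  // De-duplicate source names while preserving order
  const sources = relevant
    .map((m) => m.metadata?.source ?? "unknown")
    .filter((s, i, all) => all.indexOf(s) === i);
  return { context, sources, hasContext: relevant.length > 0 };
}
```

When `hasContext` is false, the route could respond with the "I don't have enough information" message directly, which saves a generation call and removes any chance of the model improvising.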
Streaming responses with the Vercel AI SDK
The Vercel AI SDK handles the streaming connection between your API route and React — the useChat hook gives you messages, input state, and submission handling with zero boilerplate. Create the chat UI in app/page.tsx:
// app/page.tsx
"use client";
import { useChat } from "ai/react";
export default function RAGChatPage() {
const { messages, input, handleInputChange, handleSubmit, isLoading } =
useChat({ api: "/api/chat" });
return (
<div className="flex flex-col h-screen max-w-3xl mx-auto p-4">
<h1 className="text-2xl font-bold mb-6">Document Q&A — RAG Demo</h1>
{/* Message list */}
<div className="flex-1 overflow-y-auto space-y-4 mb-4">
{messages.length === 0 && (
<p className="text-gray-400 text-center mt-20">
Ask a question about your documents...
</p>
)}
{messages.map((message) => (
<div
key={message.id}
className={`flex ${message.role === "user" ? "justify-end" : "justify-start"}`}
>
<div
className={`max-w-[80%] rounded-lg px-4 py-2 ${
message.role === "user"
? "bg-blue-600 text-white"
: "bg-gray-100 text-gray-900"
}`}
>
<p className="text-sm whitespace-pre-wrap">{message.content}</p>
</div>
</div>
))}
{isLoading && (
<p className="text-sm text-gray-500">Thinking...</p>
)}
</div>
{/* Input form */}
<form onSubmit={handleSubmit} className="flex gap-2">
<input
value={input}
onChange={handleInputChange}
placeholder="Ask a question about your documents..."
className="flex-1 border border-gray-300 rounded-lg px-4 py-2"
disabled={isLoading}
/>
<button
type="submit"
disabled={isLoading || !input.trim()}
className="bg-blue-600 text-white px-6 py-2 rounded-lg disabled:opacity-50"
>
Send
</button>
</form>
</div>
);
}
Tokens render as they arrive — no spinner staring contest. The same hook works with Anthropic, Google, Mistral, and other providers; swap openai("gpt-4o") for another model and the UI doesn't change.
Deploying to Vercel
Push to GitHub and connect the repo, or deploy directly:
vercel deploy
Set OPENAI_API_KEY, PINECONE_API_KEY, and PINECONE_INDEX_NAME in the Vercel dashboard under Settings → Environment Variables.
Production note: protect the ingestion route. Never leave it open — anyone could flood your Pinecone index. Add an INGEST_SECRET environment variable and validate it at the top of the route:
// at the top of app/api/ingest/route.ts
const secret = req.headers.get("x-ingest-secret");
if (secret !== process.env.INGEST_SECRET) {
return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
}
Optimizing retrieval quality
A working RAG app and a good RAG app are two different things. Once the basics work, these are the highest-impact upgrades, in order:
- Query expansion: rewrite the raw question into something more explicit and searchable before embedding it. A cheap gpt-4o-mini call fixes vague queries like "does it work offline?".
- Metadata filtering: if your knowledge base has multiple document types, filter at retrieval time so a billing question never retrieves engineering docs.
- Hybrid search: pure semantic search misses exact-match queries such as product names, error codes, and version numbers. Combine embeddings with keyword search (BM25); Pinecone supports it natively via sparseValues.
- Reranking: retrieve 10 chunks, then use a reranker (Cohere Rerank is a common choice) to reorder by true relevance and keep the top 3 before prompting.
Query expansion in practice:
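// Before embedding, expand the query with gpt-4o-mini.
// Use the OpenAI SDK client from lib/openai.ts here: the `openai` imported from
// @ai-sdk/openai in the chat route is a model provider and has no chat.completions API.
import { getOpenAIClient } from "@/lib/openai";
const expansion = await getOpenAIClient().chat.completions.create({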
model: "gpt-4o-mini",
messages: [
{
role: "system",
content: "Rewrite the user's question to be more explicit and searchable for a document retrieval system. Return only the rewritten question.",
},
{ role: "user", content: userMessage },
],
max_tokens: 150,
});
const expandedQuery = expansion.choices[0].message.content || userMessage;
const [questionEmbedding] = await generateEmbeddings([expandedQuery]);
Metadata filtering and reranking:
// Only search a specific source at query time
const queryResponse = await index.query({
vector: questionEmbedding,
topK: TOP_K,
includeMetadata: true,
filter: { source: { $eq: "product-docs" } },
});
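// Rerank retrieved chunks with Cohere before prompting
// (requires the cohere-ai package and a COHERE_API_KEY environment variable)
import { CohereClient } from "cohere-ai";
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });
const reranked = await cohere.rerank({
model: "rerank-english-v3.0",
query: userMessage,
documents: retrievedChunks.map((c) => c.text),
topN: 3, // keep only the top 3 after reranking
});
// Each result carries the index of the original document; map back to the chunks
const topChunks = reranked.results.map((r) => retrievedChunks[r.index]);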
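Hybrid search is the one upgrade in the list without a snippet. Pinecone's own sparse-dense queries are not shown here; as a swapped-in illustration of the idea, the sketch below merges a keyword ranking and a vector ranking with reciprocal rank fusion (RRF), a generic technique that needs only the ordered ids from each search. The `reciprocalRankFusion` name, the example ids, and the constant `k = 60` (a commonly used default) are assumptions for illustration.

```typescript
/**
 * Merges several ranked id lists into one.
 * Each id earns 1 / (k + rank) from every list it appears in; higher totals rank first.
 */
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, position) => {
      const rank = position + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical results for the same query from a keyword search and a vector search
const keywordHits = ["err-404-doc", "install-guide", "faq"];
const vectorHits = ["install-guide", "faq", "err-404-doc"];

console.log(reciprocalRankFusion([keywordHits, vectorHits]));
// [ 'install-guide', 'err-404-doc', 'faq' ]
```

The document that ranks well in both lists wins over one that tops a single list, which is the behavior hybrid search is after for queries that mix exact terms with natural language.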
What it actually costs to run
For a RAG app handling 1,000 questions per day with a medium-sized knowledge base:
| Operation | Model | Cost |
|---|---|---|
| Embedding questions | text-embedding-3-small | ~$0.02/day |
| Vector queries | Pinecone serverless | ~$0.05/day |
| Generation (500 tokens avg) | gpt-4o | ~$1.50/day |
| Total | | ~$1.57/day ≈ $47/month |
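The dominant cost is generation. Note that the generation row counts only about 500 tokens per request; with TOP_K = 5 and chunks of roughly 500 tokens each, the prompt alone is closer to 2,500 input tokens, so budget for a generation bill several times higher than this estimate. If cost matters, route simple queries to gpt-4o-mini and keep gpt-4o for complex ones; when most queries are simple, model routing can cut generation spend by 70–80% with little quality loss.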
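The arithmetic behind an estimate like this is easy to rerun with your own numbers. The sketch below is a hypothetical helper, not a pricing reference: `estimateDailyGenerationCost` is an assumed name, and the per-million-token prices in the example are placeholder assumptions that should be replaced with the provider's current rates.

```typescript
interface GenerationCostInput {
  queriesPerDay: number;
  inputTokensPerQuery: number;   // system prompt + retrieved chunks + question
  outputTokensPerQuery: number;  // the streamed answer
  inputPricePerMillion: number;  // USD per 1M input tokens (assumed value)
  outputPricePerMillion: number; // USD per 1M output tokens (assumed value)
}

/** Daily generation cost in USD: tokens per day times price per token, for input and output. */
function estimateDailyGenerationCost(c: GenerationCostInput): number {
  const inputCost =
    (c.queriesPerDay * c.inputTokensPerQuery * c.inputPricePerMillion) / 1000000;
  const outputCost =
    (c.queriesPerDay * c.outputTokensPerQuery * c.outputPricePerMillion) / 1000000;
  return inputCost + outputCost;
}

// Example: 1,000 questions/day, 5 retrieved chunks of ~500 tokens plus ~200 tokens of
// instructions and question, 500-token answers, placeholder prices of $2.50 and $10 per 1M.
const daily = estimateDailyGenerationCost({
  queriesPerDay: 1000,
  inputTokensPerQuery: 2700,
  outputTokensPerQuery: 500,
  inputPricePerMillion: 2.5,
  outputPricePerMillion: 10,
});
console.log(daily.toFixed(2)); // "11.75"
```

Under these assumptions the retrieved context, not the answer, is the larger share of the bill, which makes TOP_K and chunk size cost levers as well as quality levers.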
Frequently asked questions
What is the difference between RAG and fine-tuning?
RAG retrieves external information at query time — it doesn't change the model. Fine-tuning modifies the model's weights to change its behavior or bake in knowledge. Use RAG when your data changes frequently, is private, or is too large for a context window. Use fine-tuning when you need to change the model's output style, format, or domain-specific reasoning — not to teach it facts.
How do I choose between Pinecone, Weaviate, and Chroma?
For production apps where you don't want to manage infrastructure, Pinecone is the easiest. For local development and testing, Chroma is the simplest to run. For self-hosted production with more control over filtering, Weaviate is solid. And if you're already on Postgres, pgvector is usually the most cost-effective at scale.
How much does a RAG app cost to run?
For a small app with 1,000 daily queries, expect roughly $50–100/month with GPT-4o and Pinecone, depending mostly on how much retrieved context each prompt carries. Switching generation to GPT-4o mini cuts that by 60–80%. Embedding costs are negligible; text-embedding-3-small is one of the cheapest API calls in AI.
Can I build RAG without a vector database?
Yes, for small knowledge bases. Under ~500 chunks, you can embed everything in memory at startup and compute cosine similarity in plain JavaScript. Beyond that, re-embedding everything at startup gets slow and costly, and as the corpus keeps growing a vector database becomes the practical choice for persistence and query latency.
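For a knowledge base that small, the whole retrieval step fits in a few lines. A minimal sketch, assuming the chunk embeddings are already computed and held in an array; `cosineSimilarity`, `searchInMemory`, and the toy 3-dimensional vectors are illustrative, and real embeddings from text-embedding-3-small would have 1536 dimensions.

```typescript
interface EmbeddedChunk { id: string; text: string; embedding: number[] }

/** Cosine similarity: dot product divided by the product of the vector lengths. */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero for empty vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

/** Brute-force search: score every chunk against the query and keep the best topK. */
function searchInMemory(
  query: number[],
  chunks: EmbeddedChunk[],
  topK: number = 5
): { chunk: EmbeddedChunk; score: number }[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy data standing in for real embeddings
const toyChunks: EmbeddedChunk[] = [
  { id: "a", text: "Refund policy", embedding: [1, 0, 0] },
  { id: "b", text: "Shipping times", embedding: [0, 1, 0] },
  { id: "c", text: "Returns and refunds", embedding: [0.9, 0.1, 0] },
];
console.log(searchInMemory([1, 0, 0], toyChunks, 2).map((r) => r.chunk.id)); // [ 'a', 'c' ]
```

The same score threshold used with Pinecone applies here: filter the results by `score` before building the prompt.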
Why does my RAG app give wrong answers?
The usual suspects, in order: (1) chunk size too large or too small, losing context at retrieval; (2) the relevance threshold too low, passing junk chunks to the model; (3) the answer simply isn't in the knowledge base and no fallback was implemented; (4) a system prompt that doesn't firmly instruct the model to stick to retrieved context. Trace your pipeline with LangSmith or similar so you can see exactly what was retrieved for each query.
Where to go next
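You now have ingestion, semantic retrieval, streaming answers, and a deployment. The upgrades that matter most from here are the retrieval improvements covered above: query expansion, metadata filtering, hybrid search, and reranking.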
Chunk well, filter ruthlessly, stream everything, and protect your write paths — and RAG stops being a demo trick and becomes the most reliable pattern in applied AI: a model that answers from your data, updated in minutes, explainable down to the source.