Building a RAG Application with Next.js, OpenAI, and Pinecone — Step by Step

A complete, production-aware walkthrough of building a RAG application with Next.js App Router, OpenAI, and Pinecone — chunking, embeddings, vector search, streaming with the Vercel AI SDK, deployment, and the retrieval optimizations every tutorial skips.

Every AI product eventually hits the same wall: users ask questions about your data — your docs, your knowledge base, your product — and the LLM has no idea what to say. Fine-tuning is expensive and slow. Stuffing whole documents into the context window burns tokens. Neither scales. RAG — Retrieval-Augmented Generation — solves this cleanly: retrieve only the relevant pieces at query time and hand them to the model as context. In this guide you'll build a fully working RAG application from scratch — Next.js App Router, OpenAI embeddings, Pinecone — and deploy it to Vercel. Real code, real architecture, production-aware.

Understanding RAG — why it beats fine-tuning for most use cases

The idea is simple. At ingestion time, you split your documents into chunks, convert each chunk into a vector embedding, and store those embeddings in a vector database. At query time, you embed the user's question, find the most similar chunks, and pass them to the LLM as context alongside the question. The model answers from retrieved knowledge instead of trained knowledge.

Why this wins over fine-tuning for most use cases:

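|                        | RAG                        | Fine-tuning              |
| ---------------------- | -------------------------- | ------------------------ |
| Cost                   | Low (embeddings are cheap) | High (GPU hours)         |
| Time to update         | Minutes (re-index)         | Hours to days (retrain)  |
| Works with private data| Yes                        | Yes                      |
| Handles dynamic data   | Yes                        | No                       |
| Explainable            | Yes (show sources)         | No                       |
| Minimum data required  | Any amount                 | 50–100+ examples         |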

Fine-tuning makes sense when you need to change the model's style, tone, or behavior — not when you need it to know specific facts. For facts, use RAG.

Architecture overview

Two separate flows, and keeping them separate in your head is half the architecture. The ingestion pipeline runs once (or whenever documents change): load → chunk → embed → upsert.

Figure: the ingestion pipeline. Each document is scanned, split into overlapping 512-token chunks, embedded into 1536-dimension vectors with text-embedding-3-small, and upserted into the Pinecone index. When docs change it re-runs in minutes. No retraining.

The query pipeline runs on every user question: embed the question, retrieve the top-K most similar chunks, build a context-grounded prompt, and stream the answer back.

Figure: the query pipeline. A question is embedded, matched against the index, injected into a context-grounded prompt, and answered by GPT-4o as a stream with source citations.

Setting up Next.js App Router with TypeScript

You need Node.js 20+, a free Pinecone account, an OpenAI API key, and basic familiarity with the App Router. Scaffold the project:

npx create-next-app@latest rag-app --typescript --tailwind --app --no-src-dir --use-pnpm
cd rag-app
pnpm add @pinecone-database/pinecone openai ai @ai-sdk/openai zod

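The ai package is the Vercel AI SDK, which handles streaming from OpenAI to the browser. The names used later in this guide (toDataStreamResponse on the server, useChat imported from ai/react on the client) have changed between major versions of the SDK, so if an import or method is missing, check which major version was installed. Set your environment variables in .env.local: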

OPENAI_API_KEY=sk-your-openai-key
PINECONE_API_KEY=your-pinecone-api-key
PINECONE_INDEX_NAME=rag-demo

The project structure you're building toward:

rag-app/
├── app/
│   ├── api/
│   │   ├── ingest/route.ts     # chunk → embed → upsert
│   │   └── chat/route.ts       # embed → retrieve → stream
│   ├── page.tsx                # chat UI
│   └── layout.tsx
├── lib/
│   ├── pinecone.ts             # Pinecone client
│   ├── openai.ts               # OpenAI client
│   ├── chunker.ts              # document chunking
│   └── embeddings.ts           # embedding utilities
└── .env.local

Connecting Pinecone vector database

Create your index in the Pinecone dashboard first: dimensions 1536 (matches text-embedding-3-small output), metric cosine, cloud AWS us-east-1 (free tier). Then create the client in lib/pinecone.ts — singleton pattern, so the client is reused across requests instead of re-created on every invocation:

// lib/pinecone.ts
import { Pinecone } from "@pinecone-database/pinecone";

let pineconeClient: Pinecone | null = null;

export function getPineconeClient(): Pinecone {
  if (!pineconeClient) {
    pineconeClient = new Pinecone({
      apiKey: process.env.PINECONE_API_KEY!,
    });
  }
  return pineconeClient;
}

export function getPineconeIndex() {
  const client = getPineconeClient();
  return client.index(process.env.PINECONE_INDEX_NAME!);
}

Same pattern for the OpenAI client in lib/openai.ts:

// lib/openai.ts
import OpenAI from "openai";

let openaiClient: OpenAI | null = null;

export function getOpenAIClient(): OpenAI {
  if (!openaiClient) {
    openaiClient = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY!,
    });
  }
  return openaiClient;
}
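
Both clients read their keys with a non-null assertion (`!`), which satisfies TypeScript but lets a missing variable surface later as an opaque authentication error. A minimal sketch of a fail-fast alternative follows; `requireEnv` is a hypothetical helper, not part of the original project, and it takes the environment as a parameter so it can be tried with a plain object:

```typescript
// Hypothetical helper (an assumption, not from the original project):
// read a required environment variable and fail with a clear message.
type Env = Record<string, string | undefined>;

function requireEnv(name: string, env: Env): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage with a plain object standing in for process.env
const exampleEnv: Env = { PINECONE_INDEX_NAME: "rag-demo" };
const indexName = requireEnv("PINECONE_INDEX_NAME", exampleEnv);
console.log(indexName); // rag-demo
```

In lib/pinecone.ts this would replace `process.env.PINECONE_API_KEY!` with `requireEnv("PINECONE_API_KEY", process.env)`, so a misconfigured deployment fails on the first request with a message that names the missing variable.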

Chunking and embedding your documents

This is the most important step in the entire pipeline. Bad chunking = bad retrieval = bad answers. Everything downstream — the vector search, the prompt, the model — can only work with the chunks you give it.

Bad chunking is invisible in your code and obvious in your answers. It's the highest-leverage hundred lines in the whole pipeline.

Choosing the right chunk size

Chunk size is the single biggest lever for retrieval quality:

Chunk size is a dial between precision and context: small chunks hit precisely but fragment meaning; large chunks keep context but dilute relevance.

| Chunk size       | Best for                      | Trade-off                     |
| ---------------- | ----------------------------- | ----------------------------- |
| 256–512 tokens   | Precise Q&A, FAQs             | May lose surrounding context  |
| 512–1024 tokens  | General documents, articles   | Good balance for most cases   |
| 1024–2048 tokens | Long-form content, legal docs | Less precise retrieval        |

Start with 512 tokens and a 50-token overlap. The overlap means a sentence cut at a chunk boundary still appears whole in the neighboring chunk, so its meaning is not lost at the boundary. Create lib/chunker.ts:

// lib/chunker.ts
export interface DocumentChunk {
  id: string;
  text: string;
  metadata: {
    source: string;
    chunkIndex: number;
    totalChunks: number;
  };
}

/**
 * Splits a document into overlapping chunks.
 * chunkSize is in characters — roughly 4 chars per token.
 */
export function chunkDocument(
  text: string,
  source: string,
  chunkSize: number = 2000, // ~500 tokens
  overlap: number = 200     // ~50 tokens
): DocumentChunk[] {
  const chunks: DocumentChunk[] = [];
  let start = 0;
  let chunkIndex = 0;

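  // Clean the text first. Newlines are preserved so the sentence-boundary
  // search below can still use them as break points.
  const cleanedText = text
    .replace(/\r\n/g, "\n")       // normalize Windows line endings
    .replace(/[^\S\n]+/g, " ")    // collapse spaces and tabs, keep newlines
    .replace(/\n{3,}/g, "\n\n")   // collapse excessive newlines
    .trim();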

  while (start < cleanedText.length) {
    let end = start + chunkSize;

    // Try to break at a sentence boundary
    if (end < cleanedText.length) {
      const lastPeriod = cleanedText.lastIndexOf(".", end);
      const lastNewline = cleanedText.lastIndexOf("\n", end);
      const breakPoint = Math.max(lastPeriod, lastNewline);

      // Only use it if it's reasonably close to our target
      if (breakPoint > start + chunkSize * 0.7) {
        end = breakPoint + 1;
      }
    }

    const chunkText = cleanedText.slice(start, end).trim();

    if (chunkText.length > 50) { // skip tiny chunks
      chunks.push({
        id: `${source}-chunk-${chunkIndex}`,
        text: chunkText,
        metadata: { source, chunkIndex, totalChunks: 0 },
      });
      chunkIndex++;
    }

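    // Stop once the final chunk has been emitted; otherwise the loop would
    // add a redundant tail chunk made only of overlap text.
    if (end >= cleanedText.length) break;
    start = end - overlap;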
  }

  // Update totalChunks now that we know the final count
  return chunks.map(chunk => ({
    ...chunk,
    metadata: { ...chunk.metadata, totalChunks: chunks.length },
  }));
}
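
To see what the overlap does, it helps to look at the window arithmetic on its own, without the cleaning and sentence-boundary logic. The sketch below is a simplified illustration rather than the chunker above: `windowOffsets` is a hypothetical helper that returns the character ranges a fixed-size sliding window would cover.

```typescript
// Simplified illustration of sliding-window chunk offsets.
// windowOffsets is a hypothetical name, not part of lib/chunker.ts.
function windowOffsets(
  textLength: number,
  chunkSize: number,
  overlap: number
): Array<[number, number]> {
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be non-negative and smaller than chunkSize");
  }
  const windows: Array<[number, number]> = [];
  let start = 0;
  while (start < textLength) {
    const end = Math.min(start + chunkSize, textLength);
    windows.push([start, end]);
    if (end >= textLength) break; // the last window reached the end of the text
    start = end - overlap;        // step back so neighbors share `overlap` characters
  }
  return windows;
}

const offsets = windowOffsets(5000, 2000, 200);
console.log(offsets); // [[0, 2000], [1800, 3800], [3600, 5000]]
```

Each window starts 200 characters before the previous one ended, so any sentence shorter than the overlap that straddles a boundary is fully contained in at least one chunk.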

Now the embedding utility in lib/embeddings.ts. text-embedding-3-small gives you 1536 dimensions at $0.02 per 1M tokens, roughly 6× cheaper than text-embedding-3-large and sufficient for most RAG apps:

// lib/embeddings.ts
import { getOpenAIClient } from "./openai";
import { DocumentChunk } from "./chunker";

const EMBEDDING_MODEL = "text-embedding-3-small";

/** Embeds an array of texts, batched to stay within OpenAI's limits. */
export async function generateEmbeddings(texts: string[]): Promise<number[][]> {
  const openai = getOpenAIClient();
  const BATCH_SIZE = 100;
  const allEmbeddings: number[][] = [];

  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    const batch = texts.slice(i, i + BATCH_SIZE);
    const response = await openai.embeddings.create({
      model: EMBEDDING_MODEL,
      input: batch,
    });
    const batchEmbeddings = response.data
      .sort((a, b) => a.index - b.index)
      .map((item) => item.embedding);
    allEmbeddings.push(...batchEmbeddings);
  }

  return allEmbeddings;
}

/** Prepares chunks with their embeddings for Pinecone upsert. */
export async function embedChunks(chunks: DocumentChunk[]) {
  const texts = chunks.map((chunk) => chunk.text);
  const embeddings = await generateEmbeddings(texts);

  return chunks.map((chunk, i) => ({
    id: chunk.id,
    values: embeddings[i],
    metadata: { text: chunk.text, ...chunk.metadata },
  }));
}

The ingestion API route

The ingestion route ties it together: accept document text, chunk it, embed it, and upsert to Pinecone in batches of 100, a conservative batch size for 1536-dimension vectors that carry their text as metadata. Create app/api/ingest/route.ts:

// app/api/ingest/route.ts
import { NextRequest, NextResponse } from "next/server";
import { chunkDocument } from "@/lib/chunker";
import { embedChunks } from "@/lib/embeddings";
import { getPineconeIndex } from "@/lib/pinecone";

export async function POST(req: NextRequest) {
  try {
    const { text, source } = await req.json();

    if (!text || !source) {
      return NextResponse.json(
        { error: "Both text and source are required" },
        { status: 400 }
      );
    }

    // 1. Chunk → 2. Embed → 3. Upsert
    const chunks = chunkDocument(text, source);
    const vectors = await embedChunks(chunks);

    const index = getPineconeIndex();
    const UPSERT_BATCH_SIZE = 100;
    for (let i = 0; i < vectors.length; i += UPSERT_BATCH_SIZE) {
      await index.upsert(vectors.slice(i, i + UPSERT_BATCH_SIZE));
    }

    return NextResponse.json({
      success: true,
      chunksIngested: chunks.length,
      source,
    });
  } catch (error) {
    console.error("[Ingest] Error:", error);
    return NextResponse.json(
      { error: "Ingestion failed", details: String(error) },
      { status: 500 }
    );
  }
}
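
The embedding loop and the upsert loop both slice an array into fixed-size batches by hand. One way to factor that out is a small generic helper; `toBatches` below is a hypothetical name used for illustration, not something the original code defines.

```typescript
// Hypothetical helper: split an array into consecutive batches
// of at most `size` items, preserving order.
function toBatches<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const ids = Array.from({ length: 250 }, (_, i) => `chunk-${i}`);
const batches = toBatches(ids, 100);
console.log(batches.map((b) => b.length)); // [100, 100, 50]
```

With this in place, the upsert loop in the route could read `for (const batch of toBatches(vectors, 100)) await index.upsert(batch);`, and the same helper would serve the embedding loop.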

Test it with a sample document:

curl -X POST http://localhost:3000/api/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your document text goes here. This can be the contents of a PDF, a knowledge base article, or any text you want the AI to know about.",
    "source": "sample-doc-v1"
  }'

Building the retrieval API route

This is the query-time pipeline: embed the question, retrieve the top matches from Pinecone, build a prompt with that context, stream the response. Create app/api/chat/route.ts:

// app/api/chat/route.ts
import { NextRequest } from "next/server";
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
import { generateEmbeddings } from "@/lib/embeddings";
import { getPineconeIndex } from "@/lib/pinecone";

// Chunks to retrieve — higher = more context, more tokens, more cost
const TOP_K = 5;

export async function POST(req: NextRequest) {
  try {
    const { messages } = await req.json();
    const userMessage = messages[messages.length - 1].content as string;

    // 1. Embed the user's question
    const [questionEmbedding] = await generateEmbeddings([userMessage]);

    // 2. Query Pinecone for the most relevant chunks
    const index = getPineconeIndex();
    const queryResponse = await index.query({
      vector: questionEmbedding,
      topK: TOP_K,
      includeMetadata: true,
    });

    // 3. Keep only genuinely relevant matches
    const retrievedChunks = queryResponse.matches
      .filter((match) => match.score && match.score > 0.7)
      .map((match) => ({
        text: match.metadata?.text as string,
        source: match.metadata?.source as string,
      }));

    // 4. Build context from retrieved chunks
    const context = retrievedChunks
      .map((chunk, i) => `[Source ${i + 1}: ${chunk.source}]\n${chunk.text}`)
      .join("\n\n---\n\n");

    // 5. The RAG prompt
    const systemPrompt = `You are a helpful assistant that answers questions based on the provided context.

CONTEXT:
${context || "No relevant context found for this question."}

INSTRUCTIONS:
- Answer using ONLY the information in the context above
- If the context does not contain enough information, say "I don't have enough information about that in my knowledge base"
- Always cite which source(s) you used
- Be concise and accurate
- Do not make up information that is not in the context`;

    // 6. Stream the response via the Vercel AI SDK
    const result = streamText({
      model: openai("gpt-4o"),
      system: systemPrompt,
      messages,
    });

    return result.toDataStreamResponse();
  } catch (error) {
    console.error("[Chat] Error:", error);
    return new Response(
      JSON.stringify({ error: "Failed to process your question" }),
      { status: 500, headers: { "Content-Type": "application/json" } }
    );
  }
}

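The match.score > 0.7 line matters more than it looks. Cosine similarity ranges from -1 to 1, though scores for text embeddings are typically positive, and a low top score means Pinecone found nothing truly relevant: the user asked about something outside your knowledge base. Without the filter you'd pass irrelevant chunks to the model and get confidently hallucinated answers. Absolute scores vary by embedding model, so treat 0.7 as a starting point for document Q&A and calibrate it against a few queries you know should and should not match; if relevant chunks are being filtered out, lower it.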

A RAG app that answers from nothing is just a chatbot with extra steps. The score filter is what keeps it honest.
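
Pulling the filter into a pure function makes the threshold easy to test and tune away from the network calls. The sketch below is one possible shape, assuming a reduced match type (`ScoredMatch`) and a hypothetical function name (`selectRelevant`); real Pinecone matches carry more fields than shown here.

```typescript
// Reduced shape of a vector-search match, for illustration only.
interface ScoredMatch {
  id: string;
  score?: number;
  metadata?: { text?: string; source?: string };
}

interface RetrievedChunk {
  text: string;
  source: string;
  score: number;
}

// Keep matches at or above minScore that actually carry text.
// A missing score is treated as 0, so the match is dropped.
function selectRelevant(matches: ScoredMatch[], minScore: number): RetrievedChunk[] {
  const selected: RetrievedChunk[] = [];
  for (const match of matches) {
    const score = match.score ?? 0;
    const text = match.metadata?.text;
    if (score < minScore || !text) continue;
    selected.push({ text, source: match.metadata?.source ?? "unknown", score });
  }
  return selected;
}

const sample: ScoredMatch[] = [
  { id: "a", score: 0.82, metadata: { text: "Refunds take 5 days.", source: "billing" } },
  { id: "b", score: 0.41, metadata: { text: "Unrelated chunk.", source: "blog" } },
  { id: "c", metadata: { text: "No score returned.", source: "faq" } },
];
const kept = selectRelevant(sample, 0.7);
console.log(kept.map((c) => c.source)); // ["billing"]
```

When the function returns an empty array, the route can either fall back to the "No relevant context found" line in the prompt or skip the model call and return a fixed answer.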

Streaming responses with the Vercel AI SDK

The Vercel AI SDK handles the streaming connection between your API route and React — the useChat hook gives you messages, input state, and submission handling with zero boilerplate. Create the chat UI in app/page.tsx:

// app/page.tsx
"use client";

import { useChat } from "ai/react";

export default function RAGChatPage() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } =
    useChat({ api: "/api/chat" });

  return (
    <div className="flex flex-col h-screen max-w-3xl mx-auto p-4">
      <h1 className="text-2xl font-bold mb-6">Document Q&A — RAG Demo</h1>

      {/* Message list */}
      <div className="flex-1 overflow-y-auto space-y-4 mb-4">
        {messages.length === 0 && (
          <p className="text-gray-400 text-center mt-20">
            Ask a question about your documents...
          </p>
        )}
        {messages.map((message) => (
          <div
            key={message.id}
            className={`flex ${message.role === "user" ? "justify-end" : "justify-start"}`}
          >
            <div
              className={`max-w-[80%] rounded-lg px-4 py-2 ${
                message.role === "user"
                  ? "bg-blue-600 text-white"
                  : "bg-gray-100 text-gray-900"
              }`}
            >
              <p className="text-sm whitespace-pre-wrap">{message.content}</p>
            </div>
          </div>
        ))}
        {isLoading && (
          <p className="text-sm text-gray-500">Thinking...</p>
        )}
      </div>

      {/* Input form */}
      <form onSubmit={handleSubmit} className="flex gap-2">
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Ask a question about your documents..."
          className="flex-1 border border-gray-300 rounded-lg px-4 py-2"
          disabled={isLoading}
        />
        <button
          type="submit"
          disabled={isLoading || !input.trim()}
          className="bg-blue-600 text-white px-6 py-2 rounded-lg disabled:opacity-50"
        >
          Send
        </button>
      </form>
    </div>
  );
}

Tokens render as they arrive — no spinner staring contest. The same hook works with Anthropic, Google, Mistral, and other providers; swap openai("gpt-4o") for another model and the UI doesn't change.

Deploying to Vercel

Push to GitHub and connect the repo, or deploy directly:

vercel deploy

Set OPENAI_API_KEY, PINECONE_API_KEY, and PINECONE_INDEX_NAME in the Vercel dashboard under Settings → Environment Variables.

Production note: protect the ingestion route. Never leave it open — anyone could flood your Pinecone index. Add an INGEST_SECRET environment variable and validate it at the top of the route:

// at the top of app/api/ingest/route.ts
const secret = req.headers.get("x-ingest-secret");
if (secret !== process.env.INGEST_SECRET) {
  return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
}
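
The strict inequality above does the job. A slightly more defensive variant compares the two values in constant time and rejects the request outright when the secret is not configured. This is a sketch using Node's built-in crypto module; `isAuthorized` is a hypothetical helper name, and it assumes the route runs on the Node.js runtime rather than the Edge runtime.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hypothetical helper: compare the provided header with the configured
// secret in constant time. Hashing both sides first yields equal-length
// buffers, which timingSafeEqual requires.
function isAuthorized(provided: string | null, expected: string | undefined): boolean {
  if (!provided || !expected) return false; // missing header or unset secret
  const a = createHash("sha256").update(provided).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}

console.log(isAuthorized("s3cret", "s3cret")); // true
console.log(isAuthorized("guess", "s3cret"));  // false
console.log(isAuthorized(null, undefined));    // false
```

In the route this would be called as `isAuthorized(req.headers.get("x-ingest-secret"), process.env.INGEST_SECRET)` before any chunking or embedding work starts.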

Optimizing retrieval quality

A working RAG app and a good RAG app are two different things. Once the basics work, these are the highest-impact upgrades, in order:

  • Query expansion: rewrite the raw question into something more explicit and searchable before embedding it. A cheap gpt-4o-mini call can fix vague queries like "does it work offline?".
  • Metadata filtering: if your knowledge base has multiple document types, filter at retrieval time so a billing question never retrieves engineering docs.
  • Hybrid search: pure semantic search misses exact-match queries such as product names, error codes, and version numbers. Combine embeddings with keyword search (BM25); Pinecone supports it natively via sparseValues. Note that sparse-dense vectors require an index created with the dotproduct metric, so the cosine index from earlier would have to be recreated.
  • Reranking: retrieve 10 chunks, then use a reranker (Cohere Rerank is a common choice) to reorder by true relevance and keep the top 3 before prompting.

Query expansion in practice:

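// Before embedding, expand the query with gpt-4o-mini.
// Use the client from lib/openai.ts here: the `openai` imported from
// @ai-sdk/openai in the chat route is a model provider, not an API client.
// Requires: import { getOpenAIClient } from "@/lib/openai";
const expansion = await getOpenAIClient().chat.completions.create({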
  model: "gpt-4o-mini",
  messages: [
    {
      role: "system",
      content: "Rewrite the user's question to be more explicit and searchable for a document retrieval system. Return only the rewritten question.",
    },
    { role: "user", content: userMessage },
  ],
  max_tokens: 150,
});

const expandedQuery = expansion.choices[0].message.content || userMessage;
const [questionEmbedding] = await generateEmbeddings([expandedQuery]);

Metadata filtering and reranking:

// Only search a specific source at query time
const queryResponse = await index.query({
  vector: questionEmbedding,
  topK: TOP_K,
  includeMetadata: true,
  filter: { source: { $eq: "product-docs" } },
});

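// Rerank retrieved chunks with Cohere before prompting.
// Requires the cohere-ai package; place the import at the top of the file.
import { CohereClient } from "cohere-ai";
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

const reranked = await cohere.rerank({
  model: "rerank-english-v3.0",
  query: userMessage,
  documents: retrievedChunks.map((c) => c.text),
  topN: 3, // keep only the top 3 after reranking
});

// Each result points back to its position in the documents array
const topChunks = reranked.results.map((r) => retrievedChunks[r.index]);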
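
Hybrid search is the one upgrade in the list without a snippet. A common pattern for sparse-dense queries is to weight the two query vectors with a single parameter before sending them, so one number controls the balance between semantic and keyword matching. The sketch below shows only that weighting step. `weightHybridQuery` and `SparseVector` are hypothetical names, producing the sparse vector (for example with a BM25 encoder) is out of scope, and how the weighted vectors are passed to the query call depends on the SDK version in use.

```typescript
// Sparse vector in index/value form, as used for keyword-style signals.
interface SparseVector {
  indices: number[];
  values: number[];
}

// Scale the dense vector by alpha and the sparse vector by (1 - alpha).
// alpha = 1 is purely semantic, alpha = 0 is purely keyword-based.
function weightHybridQuery(dense: number[], sparse: SparseVector, alpha: number) {
  if (alpha < 0 || alpha > 1) {
    throw new Error("alpha must be between 0 and 1");
  }
  return {
    dense: dense.map((v) => v * alpha),
    sparse: {
      indices: [...sparse.indices],
      values: sparse.values.map((v) => v * (1 - alpha)),
    },
  };
}

const weighted = weightHybridQuery(
  [0.5, 1, -0.5],
  { indices: [3, 17], values: [2, 4] },
  0.75
);
console.log(weighted.dense);         // [0.375, 0.75, -0.375]
console.log(weighted.sparse.values); // [0.5, 1]
```

Starting near the semantic end (a high alpha) and lowering it while watching exact-match queries such as error codes is a reasonable way to tune the balance.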

What it actually costs to run

For a RAG app handling 1,000 questions per day with a medium-sized knowledge base:

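| Operation                   | Model                  | Cost                   |
| --------------------------- | ---------------------- | ---------------------- |
| Embedding questions         | text-embedding-3-small | ~$0.02/day             |
| Vector queries              | Pinecone serverless    | ~$0.05/day             |
| Generation (500 tokens avg) | gpt-4o                 | ~$1.50/day             |
| Total                       |                        | ~$1.57/day ≈ $47/month |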

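The dominant cost is almost always generation, and it grows with the amount of retrieved context you send as input tokens on every request. If cost matters, route simple queries to gpt-4o-mini and keep gpt-4o for complex ones; model routing can cut spend by 70–80% with little quality loss.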
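
Because the retrieved chunks are billed as input tokens on every question, it is worth recomputing the generation line for your own prompt size. The sketch below does that arithmetic. The function name `dailyGenerationCost` is hypothetical, and the token counts and per-million prices in the example are placeholder values chosen for easy arithmetic, not quoted prices; substitute the provider's current rates.

```typescript
interface GenerationCostInputs {
  questionsPerDay: number;
  inputTokensPerQuestion: number;  // system prompt + retrieved chunks + question
  outputTokensPerQuestion: number;
  inputPricePerMillion: number;    // USD per 1M input tokens (placeholder)
  outputPricePerMillion: number;   // USD per 1M output tokens (placeholder)
}

// Daily generation cost in USD for the given traffic and token profile.
function dailyGenerationCost(c: GenerationCostInputs): number {
  const perQuestion =
    c.inputTokensPerQuestion * c.inputPricePerMillion +
    c.outputTokensPerQuestion * c.outputPricePerMillion;
  return (c.questionsPerDay * perQuestion) / 1_000_000;
}

const daily = dailyGenerationCost({
  questionsPerDay: 1000,
  inputTokensPerQuestion: 3000, // assumed: five chunks of ~500 tokens plus prompt
  outputTokensPerQuestion: 500,
  inputPricePerMillion: 1,      // placeholder price
  outputPricePerMillion: 4,     // placeholder price
});
console.log(daily.toFixed(2)); // "5.00"
```

In this example the input side accounts for 3 of the 5 dollars, which is why lowering TOP_K or reranking down to fewer chunks is a cost lever as well as a quality lever.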

Frequently asked questions

What is the difference between RAG and fine-tuning?

RAG retrieves external information at query time — it doesn't change the model. Fine-tuning modifies the model's weights to change its behavior or bake in knowledge. Use RAG when your data changes frequently, is private, or is too large for a context window. Use fine-tuning when you need to change the model's output style, format, or domain-specific reasoning — not to teach it facts.

How do I choose between Pinecone, Weaviate, and Chroma?

For production apps where you don't want to manage infrastructure, Pinecone is the easiest. For local development and testing, Chroma is the simplest to run. For self-hosted production with more control over filtering, Weaviate is solid. And if you're already on Postgres, pgvector is usually the most cost-effective at scale.

How much does a RAG app cost to run?

For a small app with 1,000 daily queries, expect roughly $50–100/month with GPT-4o and Pinecone. Switching generation to GPT-4o mini cuts that by 60–80%. Embedding costs are negligible — text-embedding-3-small is one of the cheapest API calls in AI.

Can I build RAG without a vector database?

Yes, for small knowledge bases. Under ~500 chunks, you can embed everything in memory at startup and compute cosine similarity in plain JavaScript. As the corpus grows well beyond that, or once you need persistence, metadata filtering, or several server instances sharing one index, a vector database becomes the practical choice.
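
For a sense of how little code the in-memory approach takes, here is a minimal sketch of brute-force cosine search. The names `StoredChunk`, `cosineSimilarity`, and `topK` are illustrative, and toy 3-dimensional vectors stand in for real 1536-dimension embeddings.

```typescript
interface StoredChunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query and return the k best, highest first.
function topK(query: number[], chunks: StoredChunk[], k: number) {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const store: StoredChunk[] = [
  { id: "pricing", text: "Plans start at $10.", embedding: [1, 0, 0] },
  { id: "offline", text: "Works offline.", embedding: [0, 1, 0] },
  { id: "mixed", text: "Offline mode is on paid plans.", embedding: [1, 1, 0] },
];
const results = topK([0, 1, 0], store, 2);
console.log(results.map((r) => r.chunk.id)); // ["offline", "mixed"]
```

The query vector would come from the same embedding model as the stored chunks, and the result list can go through the same score threshold as the Pinecone matches.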

Why does my RAG app give wrong answers?

The usual suspects, in order: (1) chunk size too large or too small, losing context at retrieval; (2) the relevance threshold too low, passing junk chunks to the model; (3) the answer simply isn't in the knowledge base and no fallback was implemented; (4) a system prompt that doesn't firmly instruct the model to stick to retrieved context. Trace your pipeline with LangSmith or similar so you can see exactly what was retrieved for each query.

Where to go next

You now have ingestion, semantic retrieval, streaming answers, and a deployment.

Chunk well, filter ruthlessly, stream everything, and protect your write paths — and RAG stops being a demo trick and becomes the most reliable pattern in applied AI: a model that answers from your data, updated in minutes, explainable down to the source.
