Caching Strategies for AI
LLM requests are slow and expensive. Caching is the best way to fix both.
1. Standard API Caching (Vercel Data Cache)
If the prompt is static (e.g., "Generate the daily horoscope"), cache the response.
```typescript
// Next.js App Router route handler (e.g. app/api/horoscope/route.ts)
import OpenAI from "openai";
const openai = new OpenAI();

export const revalidate = 3600; // Cache for 1 hour

export async function GET() {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o", // model and prompt are illustrative
    messages: [{ role: "user", content: "Generate the daily horoscope" }],
  });
  return Response.json(completion);
}
```
2. Semantic Caching (The Holy Grail)
Users rarely type the exact same thing twice.
- User A: "Who is Elon Musk?"
- User B: "Tell me about Elon Musk"
A standard key-based cache misses the second request; a semantic cache hits.
How it works:
- Embed the incoming prompt.
- Search your vector DB (Redis/Pinecone) for similar past prompts (similarity threshold > 0.95).
- If found, return the stored answer.
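Putting the steps together, here is a minimal sketch in TypeScript. It assumes an OpenAI embeddings client and a hypothetical `VectorStore` interface whose `query`/`upsert` methods stand in for whatever Redis, Pinecone, or Upstash actually expose:

```typescript
// Semantic cache sketch: not production code, the store interface is illustrative.
import OpenAI from "openai";

const openai = new OpenAI();
const SIMILARITY_THRESHOLD = 0.95;

// Hypothetical vector store; Redis, Pinecone, or Upstash provide equivalents.
interface VectorStore {
  query(vector: number[], topK: number): Promise<{ score: number; answer: string }[]>;
  upsert(vector: number[], answer: string): Promise<void>;
}

async function cachedCompletion(prompt: string, store: VectorStore): Promise<string> {
  // 1. Embed the incoming prompt.
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: prompt,
  });
  const vector = embedding.data[0].embedding;

  // 2. Look for a similar past prompt in the vector DB.
  const [match] = await store.query(vector, 1);
  if (match && match.score > SIMILARITY_THRESHOLD) {
    return match.answer; // 3. Hit: return the stored answer, no LLM call.
  }

  // Miss: call the model, then store the answer for future similar prompts.
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }],
  });
  const answer = completion.choices[0].message.content ?? "";
  await store.upsert(vector, answer);
  return answer;
}
```

The threshold is the knob to tune: set it too low and users get answers to someone else's question; set it too high and the cache rarely hits.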
Libraries:
- GPTCache: Python library.
- Upstash Semantic Cache: Serverless solution.
3. Edge Caching (CDN)
For AI-generated assets (images, audio), always cache them at the edge (CDN). Don't serve generated images from your database; upload them to S3/R2 and serve them via Cloudflare/Vercel Edge.
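A sketch of the upload step, assuming a Cloudflare R2 bucket reached through its S3-compatible API and fronted by a CDN (the endpoint, bucket name, and CDN domain are placeholders):

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// R2 speaks the S3 API; endpoint and credentials here are placeholders.
const r2 = new S3Client({
  region: "auto",
  endpoint: process.env.R2_ENDPOINT,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID!,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY!,
  },
});

// Upload a generated image once with a long-lived Cache-Control header,
// then serve it through the CDN instead of hitting your app or database.
export async function storeGeneratedImage(key: string, image: Buffer): Promise<string> {
  await r2.send(
    new PutObjectCommand({
      Bucket: "ai-assets", // hypothetical bucket
      Key: key,
      Body: image,
      ContentType: "image/png",
      CacheControl: "public, max-age=31536000, immutable",
    })
  );
  return `https://cdn.example.com/${key}`; // CDN domain pointed at the bucket
}
```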
Cache Invalidation
AI models change (OpenAI regularly ships model updates), so cached answers eventually go stale.
- Time-based: Expire cache entries every week.
- Model-based: Invalidate the entire cache when you upgrade from gpt-3.5 to gpt-4o (e.g., by namespacing cache keys with the model name, as sketched below).
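A minimal sketch combining both rules; the key format, TTL, and hashing choice are illustrative:

```typescript
import { createHash } from "node:crypto";

const MODEL = "gpt-4o"; // bump this on upgrade; keys written under gpt-3.5 stop matching
const CACHE_TTL_SECONDS = 7 * 24 * 60 * 60; // time-based expiry: one week

// Namespace cache keys by model so a model upgrade invalidates the whole cache.
function cacheKey(prompt: string): string {
  const hash = createHash("sha256").update(prompt).digest("hex");
  return `llm:${MODEL}:${hash}`;
}

// Store entries under cacheKey(prompt) with a TTL of CACHE_TTL_SECONDS
// in whatever key-value store backs your cache (Redis, Upstash, etc.).
```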