The 2026 open-source AI stack we'd actually build on

Every week brings a new "AI-native" everything and a fresh pile of tools that will supposedly change how you build. Most of it you'll never touch. This is the opposite: the open-source stack we'd actually reach for to ship a real AI product today, named layer by layer, with the reason we'd pick each piece.

It's opinionated on purpose — a list of every option is just noise, and cutting noise is the whole point. It's open-source-first on purpose too: you get to see how it works, you can run it on infrastructure you own, and you're not renting your core capability from one vendor's roadmap. One honest caveat, the brand way — the specific names below will drift over the next year. The shape of the stack won't. There are six layers, and they've been the same six for a while.

Layer	The pick	Reach for more when…
Models	Qwen or Llama (open-weight)	frontier reasoning: DeepSeek-R1-class · strict licensing: Mistral
Serving	vLLM	local & dev: Ollama or llama.cpp
Orchestration	a thin, typed loop (Pydantic AI)	durable multi-step state: LangGraph
Retrieval	pgvector	scale or latency demands it: Qdrant
Evals & observability	Langfuse + promptfoo	RAG-specific metrics: Ragas
The glue	LiteLLM + Pydantic	structured output: Instructor, Outlines

Models: open-weight first

Everything else sits on this layer, so start here. The open-weight field is good enough now that "hosted frontier or nothing" is a false choice for most products. Our default is Qwen for the best quality per parameter across a wide range of sizes, or Llama when you want the deepest ecosystem — the most fine-tunes, quantizations, and tooling built around a single family. When a task is genuinely reasoning-heavy, reach for a DeepSeek-R1-class reasoning model. When licensing matters — you're shipping in the EU, or you just want a clean, permissive license — Mistral's Apache-2.0 models are the easy answer.

So when does open actually beat the hosted frontier? Four cases: control (the model won't change under you mid-quarter), privacy (data and weights stay on hardware you own), cost at scale (past some volume, your own GPUs beat per-token pricing), and customization (a fine-tune on your data is yours to keep). When none of those apply, a hosted API is often the right call — saying so is what no-hype means. Open-source-first is a default, not a religion.

Inference and serving: without lighting money on fire

A model you can't serve efficiently is a science project. For production GPU serving, vLLM is the answer: continuous batching and paged attention give you the throughput that makes self-hosting economical, and it speaks the OpenAI API, so the rest of your stack doesn't need to know or care. SGLang is worth benchmarking if you're chasing latency; TGI if you already live in that ecosystem. For local development, prototyping, and edge, Ollama and llama.cpp run quantized models on a laptop with near-zero ceremony. The rule of thumb: prototype on Ollama, ship on vLLM, and measure tokens per dollar before you scale anything.

Orchestration: less framework than you think

This is where the most money and sanity get burned. The hype says you need an agent framework with a graph, a DSL, and a dozen abstractions. For most applications, you don't. Our default is a thin, typed loop you own — built on the provider SDK, with Pydantic AI supplying typed inputs and outputs without seizing your control flow. You can read it, you can step through it in a debugger, and it does exactly what you wrote and nothing you didn't.

Reach for LangGraph when you genuinely need what it offers: durable, stateful, multi-step graphs with checkpointing and human-in-the-loop pauses. That's a real problem and a real answer to it. If your application is fundamentally retrieval over your own data, LlamaIndex earns its keep. Be slow to adopt the heavyweight multi-agent frameworks — CrewAI, AutoGen — until you've felt the specific pain they solve. A framework you added on speculation is a dependency you'll spend next quarter fighting. (For how the agent inside that loop actually behaves, the field guide to AI coding agents is the companion piece.)

Retrieval: the boring parts are the ones that matter

Everyone reaches for a dedicated vector database first. Most teams don't need one. If you already run Postgres — and you probably do — pgvector keeps your embeddings next to your data, in one system you already know how to back up, query, and operate. Graduate to Qdrant when scale or latency genuinely demand it: it's fast, written in Rust, and its metadata filtering is excellent. Weaviate is a fine alternative in the same tier.

But the database is the easy part. Retrieval quality lives in the unglamorous details. Hybrid search — vector similarity combined with plain keyword/BM25 ranking, and Postgres full-text search is right there — beats pure-vector search on real queries far more often than the demos let on. Put a reranker (a cross-encoder) on top to reorder the shortlist and you've fixed most of what makes RAG feel stupid. Chunk well, search hybrid, then rerank. That sequence is the actual work.

Evals and observability: how you know it works

You can't improve what you can't see, and "it looked fine when I tried it" is not a measurement. Langfuse, self-hostable, gives you tracing on every call — prompts, completions, latencies, costs — so production is observable instead of a black box. For evaluation, promptfoo is a CI-friendly harness: write test cases, run them on every change, and catch regressions before your users do. If you're doing RAG, Ragas scores the retrieval-specific things — faithfulness, answer relevance — that generic evals miss. Phoenix from Arize is a strong pick if you'd rather have evaluation and observability in one tool.

The discipline matters more than the brand name: trace everything in production, and never ship a prompt or model change you didn't evaluate. This is the layer teams skip, and the one that separates a demo from a product.

The glue: the small tools that keep it standing

The unglamorous layer that quietly decides whether the whole thing stays maintainable. LiteLLM gives you one API across every provider and open model, so swapping Qwen for Llama — or a self-hosted endpoint for a hosted one — is a config change, not a rewrite. Pydantic is the backbone of trustworthy I/O; Instructor and Outlines turn "the model usually returns valid JSON" into "the model returns valid JSON," with structured and constrained generation. Wrap the app in FastAPI, package it with Docker, and you have something a team can actually run on a Tuesday.

The throughline

Read the picks back to back and the pattern is obvious: every one favors control and legibility over magic. Open weights you can host, a server you can afford, a loop you can read, retrieval you can debug, evals you can trust, and glue thin enough to swap any piece without a rewrite. None of it is the flashiest choice on offer — which is exactly the point. The flashy layer is the one you rip out in six months; the boring one is still standing.

The names will drift; some are already on the clock. The architecture won't. Pick open where it buys you something real, measure everything, and keep the glue thin.