Starting Echostra: Building an LLM Runtime from Scratch in Zig

I’m building an LLM inference runtime from scratch. No wrappers, no llama.cpp bindings, no Python. Pure Zig, from the GGUF file reader up to the HTTP server. It’s called Echostra, and this is where I document how it goes.

Why

The honest answer is two reasons that happened to point at the same project.

The first is that I want to understand how this technology actually works. Not at the API level — at the kernel level. Attention, KV cache, quantization, GPU dispatch. The work that happens before a token arrives at your screen.

The second is that I think there’s room for a better tool than Ollama. Smaller binary, cleaner embeddability, the full stack speaking one protocol. Whether that’s true is something you can only find out by building it.

The project is also a vehicle for learning Zig seriously. I come from Go and Rust. Zig is smaller than both, stricter in some ways, more honest about what it is. It’s the right language for something that needs to run fast and be understood completely.

What it actually is

Echostra is five tools that will eventually compose into one:

  • A runtime — loads and runs models, serves an API (like Ollama)
  • A base SDK — typed client for OpenAI-compatible endpoints
  • A vector store — HNSW index, semantic search (like FAISS)
  • An agent framework — tools, agent loop, MCP (like LangGraph)
  • A harness — a coding assistant (like Claude Code)

The runtime comes first. Everything else depends on it.

Where I’m starting

Not with the transformer. With Karpathy’s micrograd — a scalar autograd engine in a hundred lines. I built it in Zig instead of Python, which forced me to learn the language at the same time as the concept.

From a Go/Rust background, the things that surprised me about Zig:

Allocators are explicit everywhere. There’s no hidden allocation. You pass an Allocator to anything that touches the heap, and you decide when memory is freed. For a computation graph, the right pattern is an arena allocator — you allocate freely into it during the forward pass and free everything in one call when you’re done. Clean.
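
A minimal sketch of the arena pattern, to make the lifetime concrete. The Value struct and the newValue and sumInArena helpers are hypothetical names chosen for illustration and are not taken from the Echostra source. The only standard-library pieces assumed are std.mem.Allocator, std.heap.ArenaAllocator, and std.heap.page_allocator.

```zig
const std = @import("std");

// Hypothetical node type for illustration; a real autograd node carries more.
const Value = struct {
    data: f64,
    grad: f64 = 0.0,
};

// Allocates one node from whichever allocator the caller passes in.
fn newValue(allocator: std.mem.Allocator, data: f64) !*Value {
    const v = try allocator.create(Value);
    v.* = Value{ .data = data };
    return v;
}

// Builds a tiny "forward pass" inside an arena and returns the result.
fn sumInArena(backing: std.mem.Allocator) !f64 {
    var arena = std.heap.ArenaAllocator.init(backing);
    defer arena.deinit(); // frees every node allocated below in one call

    const allocator = arena.allocator();
    const a = try newValue(allocator, 2.0);
    const b = try newValue(allocator, 3.0);
    const c = try newValue(allocator, a.data + b.data);
    return c.data;
}

pub fn main() !void {
    const total = try sumInArena(std.heap.page_allocator);
    std.debug.print("sum = {d}\n", .{total});
}
```

Note that newValue never frees anything and never needs to: the caller that owns the arena decides when the whole graph goes away.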

Tagged unions fill the role Rust enums do. union(enum) gives you a discriminated union with exhaustive switch checking. I used this to represent which operation created each node in the autograd graph — the backward step is just a switch on the tag. The |payload| capture pulls out the value.
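
Here is roughly what that looks like, as a sketch under assumptions: the Op variants, their payload fields, and the localGrads function are hypothetical and simplified (payloads hold plain numbers rather than pointers to parent nodes), so treat it as an illustration of the switch-on-tag shape rather than the actual implementation.

```zig
const std = @import("std");

// Hypothetical op set. Each payload carries what the backward step needs.
const Op = union(enum) {
    leaf, // an input value with no parents
    add: struct { lhs: f64, rhs: f64 },
    mul: struct { lhs: f64, rhs: f64 },
    tanh: struct { out: f64 },
};

// Gradients flowing to the two operands (rhs stays 0 for unary ops).
const Grads = struct { lhs: f64, rhs: f64 };

// Chain rule for a single node: local derivative times the upstream gradient.
fn localGrads(op: Op, upstream: f64) Grads {
    // The switch must cover every variant or the compiler rejects it.
    return switch (op) {
        .leaf => Grads{ .lhs = 0.0, .rhs = 0.0 },
        .add => Grads{ .lhs = upstream, .rhs = upstream },
        .mul => |m| Grads{ .lhs = m.rhs * upstream, .rhs = m.lhs * upstream },
        .tanh => |t| Grads{ .lhs = (1.0 - t.out * t.out) * upstream, .rhs = 0.0 },
    };
}

pub fn main() void {
    const g = localGrads(Op{ .mul = .{ .lhs = 2.0, .rhs = 3.0 } }, 1.0);
    std.debug.print("d/dlhs = {d}, d/drhs = {d}\n", .{ g.lhs, g.rhs });
}
```

Adding a new variant to Op without handling it in localGrads is a compile error, which is the exhaustiveness guarantee doing its job.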

There is no hidden control flow. No destructors, no RAII, no goroutines. Cleanup is explicit: defer runs at scope exit, and errdefer runs only when the function returns an error. Once you stop expecting hidden behavior, the code is easier to reason about than either Go or Rust.
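
A small sketch of what explicit cleanup looks like in practice. The makeBuffer and useBuffer functions and the EmptyBuffer error are hypothetical, made up for this example. It uses defer alongside its sibling errdefer, which runs only when the function exits with an error.

```zig
const std = @import("std");

// Hypothetical helper: returns a zeroed heap buffer, or fails if len is zero.
fn makeBuffer(allocator: std.mem.Allocator, len: usize) ![]u8 {
    const buf = try allocator.alloc(u8, len);
    errdefer allocator.free(buf); // runs only if an error is returned below

    if (len == 0) return error.EmptyBuffer;
    for (buf) |*byte| byte.* = 0;
    return buf; // success: ownership passes to the caller, errdefer is skipped
}

// The caller owns the buffer and releases it with a plain defer.
fn useBuffer(allocator: std.mem.Allocator) !usize {
    const buf = try makeBuffer(allocator, 16);
    defer allocator.free(buf); // runs on every exit from this scope

    return buf.len;
}

pub fn main() !void {
    const n = try useBuffer(std.heap.page_allocator);
    std.debug.print("used {d} bytes\n", .{n});
}
```

Every cleanup is visible at the line where the resource is acquired, so there is nothing to look up in a type definition to know what happens at scope exit.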

The micrograd exercise took a few hours and produced something that passes all five tests, generates correct gradients, and trains a small MLP to classify four points in 20 steps. It’s a toy. But it’s a working toy, and it taught me enough Zig to start the real work.

What’s next

The first real milestone is echostra-inspect — a CLI that parses a GGUF file and prints its full metadata and tensor table. No math, just binary parsing. It teaches mmap, packed struct, endianness, and the file format that every subsequent component reads from.
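
To give a sense of the parsing involved, here is a sketch of reading the fixed part of a GGUF header. It assumes the layout used by GGUF version 2 and later: a 4-byte magic "GGUF", a u32 version, a u64 tensor count, and a u64 metadata key-value count, all little-endian. The Header struct and the readLe and parseHeader functions are hypothetical names, and the sample bytes are made up; the real tool would read from the mapped file instead.

```zig
const std = @import("std");

// Hypothetical in-memory view of the fixed-size part of the header.
const Header = struct {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
};

// Decodes a little-endian unsigned integer (u32 or u64 here) from the start
// of `bytes`, walking from the most significant byte down to the least.
fn readLe(comptime T: type, bytes: []const u8) T {
    var value: T = 0;
    var i: usize = @sizeOf(T);
    while (i > 0) {
        i -= 1;
        value = (value << 8) | bytes[i];
    }
    return value;
}

fn parseHeader(bytes: []const u8) !Header {
    if (bytes.len < 24) return error.TruncatedHeader;
    if (!std.mem.eql(u8, bytes[0..4], "GGUF")) return error.BadMagic;
    return Header{
        .version = readLe(u32, bytes[4..8]),
        .tensor_count = readLe(u64, bytes[8..16]),
        .metadata_kv_count = readLe(u64, bytes[16..24]),
    };
}

pub fn main() !void {
    // Made-up header bytes for illustration, not taken from a real model file.
    const sample = [_]u8{
        'G', 'G', 'U', 'F', // magic
        3, 0, 0, 0, // version
        2, 1, 0, 0, 0, 0, 0, 0, // tensor count
        5, 0, 0, 0, 0, 0, 0, 0, // metadata key-value count
    };
    const h = try parseHeader(&sample);
    std.debug.print("version={d} tensors={d} kv={d}\n", .{ h.version, h.tensor_count, h.metadata_kv_count });
}
```

Decoding byte by byte keeps the result independent of the host's endianness, which is the same concern a struct-overlay approach has to handle separately.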

After that: a tensor library, a tokenizer, a forward pass. The first time the runtime generates real text is still months away. That’s fine. The point is to understand every line of it when it does.

I’ll write here when something interesting happens — when something breaks in a way that teaches me something, when a design decision has a non-obvious answer, when a paper or a reference implementation changes how I think about the problem.

The Echostra source is on GitHub.