The idea
Andrej Karpathy just released autoresearch, a project built around a simple question: what happens if you let an AI agent do machine learning research on its own overnight?
The setup: an agent (Claude Code, Codex, or similar) receives a small LLM training codebase. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards the change, and repeats. You sleep. In the morning, you get an experiment log and, hopefully, a better model.
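The keep-or-discard loop above can be sketched in a few lines. This is a hedged illustration, not the repo's actual harness: the random draw stands in for "agent edits train.py, trains for 5 minutes, reports val_bpb", and all names here are invented.

```python
import random

def keep_or_discard(best_bpb, new_bpb):
    """Lower val_bpb is better; keep a change only if it beats the baseline."""
    return new_bpb < best_bpb

def mock_experiment(rng):
    # Stand-in for the real step: the agent edits train.py, the run trains
    # for 5 wall-clock minutes, and a val_bpb score comes back.
    return rng.uniform(0.9, 1.2)

rng = random.Random(0)
best = mock_experiment(rng)
history = [best]
for _ in range(12):  # roughly one hour of real experiments at 5 minutes each
    candidate = mock_experiment(rng)
    if keep_or_discard(best, candidate):
        best = candidate  # keep: the edited train.py becomes the new baseline
    # else: discard, i.e. revert train.py to the previous snapshot
    history.append(best)

print(f"best val_bpb after {len(history) - 1} experiments: {best:.3f}")
```

The point of the structure is that the baseline can only improve or stay flat; a bad overnight run costs time, never progress.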
Three files, nothing more
The architecture is deliberately minimal:
- prepare.py: data prep, tokenizer, dataloader, evaluation. Fixed; the agent doesn't touch it.
- train.py: the full GPT model, optimizer (Muon + AdamW), and training loop. This is the only file the agent modifies. Architecture, hyperparameters, batch size: everything is fair game.
- program.md: instructions for the agent. This is the only file the human modifies.
Clean separation. The human programs program.md. The agent programs train.py. One file each.
A fixed 5-minute budget
Every experiment runs for exactly 5 minutes of wall-clock time, regardless of configuration. That’s roughly 12 experiments per hour, about 100 overnight.
The fixed budget serves two purposes. First, all experiments are directly comparable: whether the agent changes the architecture, model size, or batch size, the time constraint is identical. Second, autoresearch naturally converges toward the optimal model for your hardware within that budget: not the best model in absolute terms, but the best your GPU can produce in 5 minutes.
The single metric is val_bpb (validation bits per byte); lower is better. It's independent of vocabulary size, which keeps architectural comparisons fair.
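Bits per byte can be derived from the mean cross-entropy loss: divide by ln 2 to convert nats to bits, then rescale from per-token to per-byte using the validation set's token and byte counts. A minimal sketch (the helper name and the example numbers are illustrative, not taken from the repo):

```python
import math

def val_bpb(mean_loss_nats, total_tokens, total_bytes):
    """Convert mean cross-entropy (nats per token) into bits per byte."""
    total_bits = (mean_loss_nats / math.log(2)) * total_tokens
    return total_bits / total_bytes

# e.g. a loss of 1.2 nats/token on a val set averaging 4.2 bytes per token
print(val_bpb(1.2, 100_000, 420_000))
```

Because the denominator is bytes rather than tokens, an agent can't game the metric by switching to a tokenizer with fewer, larger tokens.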
What’s actually new
Automating ML experiments isn’t new. What’s new is the medium. You’re not writing a hyperparameter search script. You’re writing a Markdown document describing a research strategy, and an LLM agent interprets it freely.
The default program.md is intentionally bare-bones. Karpathy says so explicitly: the real game is iterating on this file to find the “research org code” that drives the fastest progress. How many agents? What exploration strategy? What constraints? It’s meta-research: you’re no longer searching for the right hyperparameters. You’re searching for the right instructions for the agent that searches for the right hyperparameters.
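To make the idea concrete, a program.md fragment might look something like this. Every line below is invented for illustration; the repo's actual default is more bare-bones.

```markdown
# Goal
Minimize val_bpb within the fixed 5-minute budget. Modify only train.py.

# Strategy
- Make one change per experiment; keep it only if val_bpb improves.
- Try cheap wins first (learning rate, batch size) before architecture.
- Log every experiment: the diff, the resulting val_bpb, and the hypothesis.

# Constraints
- Never touch prepare.py or the evaluation code.
- Keep train.py a single readable file; add no new dependencies.
```

Iterating on a document like this, rather than on the training code itself, is exactly the meta-research loop described above.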
Design decisions that matter
Single modifiable file. The agent only touches train.py. No fragmentation, no unnecessary refactoring. Diffs stay readable and auditable.
Self-contained. PyTorch and a handful of dependencies. One GPU, one file, one metric. No distributed training, no nested YAML configs. Setup is 4 commands.
Single GPU. By design, not by limitation. The goal isn’t to train the world’s best model. It’s to create an experimentation ground where an agent can iterate fast. An H100 is recommended, but community forks exist for Apple Silicon (MLX) and macOS.
What this means going forward
Autoresearch is a prototype, but it points somewhere concrete. If an agent can iterate on a training loop overnight and find improvements, the ML researcher’s role shifts. Less time modifying code and relaunching runs. More time formulating hypotheses, structuring instructions, and analyzing results.
It’s also a clear example of the pattern emerging everywhere with AI agents: the human no longer codes directly. They code the instructions that guide the agent that codes. The program.md is exactly the equivalent of CLAUDE.md in a standard software project using Claude Code. The skill that matters is no longer “writing PyTorch”. It’s “writing instructions that make an agent write good PyTorch.”
To borrow Karpathy’s words: “Research is now entirely the domain of autonomous swarms of AI agents.” We’re not there yet. But with autoresearch, you can see what the first step looks like.
Further reading
- autoresearch GitHub repo
- Karpathy’s announcement tweet
- nanochat, the LLM training framework autoresearch is derived from
Want your dev team coding with AI? 2-day Claude Code training →