The Math Has Opinions.
Most ML content is written for the timeline.
This is written for the shelf.
Every week, a dozen newsletters summarize the same arXiv papers in the same breathless register. They tell you what was published. They do not tell you why it matters, what failed before it, which assumptions it quietly inherits, or where the next three years of citation graphs will probably lead.
Cipher is built on a different premise. Each issue takes one breakthrough — an architecture, a training regime, a theoretical result — and traces it backward through the decisions that produced it. The false starts. The prior work that was cited but not understood. The grad student whose ablation study quietly changed everything.
The result reads less like a newsletter and more like a chapter from a monograph you wish existed. Dense. Argued. Annotated in the margins by someone who has been thinking about this longer than the hype cycle has been running.
"Understanding is not the same as being current."
Attention Is Not All You Need: The Residual Stream as Information Bottleneck
The original Transformer paper presents attention as a mechanism for relating positions in a sequence. What it does not present, and what took the field roughly four years to articulate clearly, is that the residual stream is doing something structurally different from the attention heads reading it. [1]
Consider the residual pathway as a shared workspace. Each layer writes to it additively. The attention heads and MLPs are not transforming a state so much as bidding for bandwidth in a channel with fixed capacity. When you frame it this way, the information bottleneck literature — Tishby et al., 2000; Saxe et al., 2019 — stops looking like a theoretical curiosity and starts looking like a design specification that was accidentally implemented before it was named.
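The shared-workspace picture can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than anything from the Transformer paper: the function names, the tanh "layers" standing in for attention heads and MLPs, and the sizes. The only point it makes is that every layer reads the same fixed-width stream and adds to it, so the final state is the input plus the sum of all writes.

```python
import numpy as np

def make_toy_layer(rng, d_model, scale=0.1):
    """Hypothetical stand-in for an attention head or MLP: reads the stream, returns a write."""
    w = rng.normal(scale=scale, size=(d_model, d_model))
    return lambda h: np.tanh(h @ w)

def run_residual_stream(x, layers):
    """Treat the residual stream as a shared workspace that layers write to additively."""
    stream = x.copy()
    writes = []
    for layer in layers:
        delta = layer(stream)      # the layer reads the current stream
        writes.append(delta)
        stream = stream + delta    # and writes back by addition, never by replacement
    return stream, writes

rng = np.random.default_rng(0)
d_model = 16                       # fixed channel width shared by every layer
x = rng.normal(size=(4, d_model))  # 4 token positions
layers = [make_toy_layer(rng, d_model) for _ in range(3)]

final, writes = run_residual_stream(x, layers)

# The final stream is the input plus the sum of every layer's write.
reconstructed = x + sum(writes)
```

Because the width stays at d_model no matter how many layers write, each additional write has to share the same dimensions with everything already stored there.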
"The superposition hypothesis is not a hypothesis about neurons. It is a hypothesis about what happens when you have more concepts than dimensions."
This matters for practitioners because it reframes what d_model actually controls. Increasing the embedding dimension is not just giving the model more room to think; it is relaxing the bottleneck constraint. The empirical scaling laws Hoffmann et al. derived in the Chinchilla paper are, under this lens, a measurement of how information density scales with parameter count when the bottleneck is the binding factor. [2]
The failure mode this predicts — and which practitioners who have debugged transformers at scale will recognize — is representational collapse in the middle layers. Not gradient vanishing. Not rank collapse in the weight matrices. A quieter failure: the residual stream converges to a low-dimensional manifold before the final layers have had a chance to write their computation. The loss looks fine. The attention patterns look reasonable. The model is simply not using most of its capacity.
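One way to look for this kind of collapse is to track an effective-dimensionality score for the residual activations at each layer. The sketch below uses the participation ratio of the covariance eigenvalues, which is a choice made here for illustration and not a diagnostic named in the text. The synthetic "healthy" and "collapsed" activations are also assumptions; in practice you would substitute activations captured from your own model.

```python
import numpy as np

def participation_ratio(acts):
    """Effective dimensionality: (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    acts has shape (n_samples, d_model). The score ranges from 1 (all variance
    on one direction) up to d_model (variance spread evenly).
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (acts.shape[0] - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # clip tiny negative round-off
    return float(eig.sum() ** 2 / (eig ** 2).sum())

rng = np.random.default_rng(0)
d_model = 32

# Synthetic stand-ins for residual activations at two layers (assumed data).
healthy = rng.normal(size=(512, d_model))
# Collapsed case: activations confined to a 2-dimensional subspace of the stream.
collapsed = rng.normal(size=(512, 2)) @ rng.normal(size=(2, d_model))

pr_healthy = participation_ratio(healthy)
pr_collapsed = participation_ratio(collapsed)
```

A score that drops sharply in the middle layers while the loss stays flat would be consistent with the quiet failure described above, though a low score alone does not prove the model is wasting capacity.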
1. Elhage et al., A Mathematical Framework for Transformer Circuits, 2021. Anthropic Transformer Circuits Thread.
2. Hoffmann et al., Training Compute-Optimal Large Language Models, 2022. The paper that made GPT-3 look undertrained.
Notes left in the margins by people who read it closely.
“I've read every major ML blog for six years. Cipher is the first one I actually annotate.”
“The issue on RLHF traced the actual failure modes nobody else was writing about. I forwarded it to my entire team.”
“I came for the transformers piece. I stayed because every issue makes me realize how much I've been pattern-matching instead of understanding.”
“The grant proposal I submitted last fall cited the Cipher piece on diffusion model provenance. The reviewers thought I'd read the original papers. I had — because Cipher told me which ones mattered.”
Next issue ships in 18 days.
One email. Monthly. Unsubscribe with a single click.
What arrives in your inbox once a month, without apology.
One Breakthrough
Every issue covers exactly one development in depth. Not a roundup. Not five things to know. One argument, fully made.
The Lineage
Where did this idea come from? Which papers were cited without being read? Which grad student's thesis quietly shaped the entire direction?
The Math, Explained
Not avoided, not hidden in appendices. The relevant equations are typeset and walked through — with the intuition built first, then the formalism.
The Failure Record
What was tried before this? Why didn't it work? The negative results that don't make it into abstracts but shape every decision that follows.
A Working Question
Each issue ends with one open question — not a call to action, not a summary. A question the field hasn't settled, stated precisely.
"The timeline moves fast. The shelf doesn't move at all. That's the point."
Start with the current issue.
Free to read. No account. Subscribe only if it earns it.