Cipher
Read Current Issue →
Vol. I · Issue 12 · Feb 2026
A Monthly Dispatch

The Math
Has Opinions.

On Method

Most ML content is written for the timeline.
This is written for the shelf.

Every week, a dozen newsletters summarize the same arXiv papers in the same breathless register. They tell you what was published. They do not tell you why it matters, what failed before it, which assumptions it quietly inherits, or where the next three years of citation graphs will probably lead.

Cipher is built on a different premise. Each issue takes one breakthrough — an architecture, a training regime, a theoretical result — and traces it backward through the decisions that produced it. The false starts. The prior work that was cited but not understood. The grad student whose ablation study quietly changed everything.

The result reads less like a newsletter and more like a chapter from a monograph you wish existed. Dense. Argued. Annotated in the margins by someone who has been thinking about this longer than the hype cycle has been running.

"Understanding is not the same as being current."

From the Archive · Issue 09

Attention Is Not All You Need:
The Residual Stream as Information Bottleneck

The original Transformer paper presents attention as a mechanism for relating positions in a sequence. What it does not present, and what took the field roughly four years to articulate clearly, is that the residual stream is doing something structurally different from the attention heads reading it. [1]

Consider the residual pathway as a shared workspace. Each layer writes to it additively. The attention heads and MLPs are not transforming a state so much as bidding for bandwidth in a channel with fixed capacity. When you frame it this way, the information bottleneck literature — Tishby et al., 2000; Saxe et al., 2019 — stops looking like a theoretical curiosity and starts looking like a design specification that was accidentally implemented before it was named.
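One way to make the shared-workspace picture concrete is a toy forward pass in which every layer reads the stream and adds its output back. This is a minimal sketch in plain Python, not the architecture from any particular paper: `residual_forward`, `toy_attention`, and `toy_mlp` are hypothetical names, and real blocks also apply normalization, which is omitted here.

```python
def residual_forward(x, layers):
    """Run a toy residual stack: each layer reads the stream and adds to it.

    Returns the final stream plus the list of per-layer writes, so the
    additive structure can be inspected directly.
    """
    stream = list(x)
    writes = []
    for layer in layers:
        update = layer(stream)  # the layer reads the current stream
        writes.append(update)
        # ...and writes its result back additively; the width never changes.
        stream = [s + u for s, u in zip(stream, update)]
    return stream, writes


# Two hypothetical stand-ins for an attention head and an MLP.
def toy_attention(stream):
    return [1.0 for _ in stream]


def toy_mlp(stream):
    return [2.0 * v for v in stream]


if __name__ == "__main__":
    x = [1.0, 2.0]
    final, writes = residual_forward(x, [toy_attention, toy_mlp])
    # The final stream equals the input plus the sum of every write.
    total = [xi + sum(w[i] for w in writes) for i, xi in enumerate(x)]
    print(final, total)
```

Because every write lands in the same fixed-width vector, whatever the last layer reads is the input plus the sum of all earlier writes. That is the sense in which components share, and compete for, one channel.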

"The superposition hypothesis is not a hypothesis about neurons. It is a hypothesis about what happens when you have more concepts than dimensions."

This matters for practitioners because it reframes what d_model actually controls. Increasing embedding dimension is not just giving the model more room to think; it is relaxing the bottleneck constraint. The empirical scaling laws Hoffmann et al. derived in Chinchilla are, under this lens, a measurement of how information density scales with parameter count when the bottleneck is the binding factor. [2]
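A toy calculation shows what relaxing the constraint buys. The sketch below packs more feature directions than there are dimensions and measures how much they overlap on average. It assumes features are random unit vectors, which is an illustrative simplification rather than a claim about trained models, and `mean_interference` is a hypothetical helper name.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly at random on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def mean_interference(n_features, dim, seed=0):
    """Average |cosine| between distinct random feature directions."""
    rng = random.Random(seed)
    feats = [random_unit_vector(dim, rng) for _ in range(n_features)]
    total, pairs = 0.0, 0
    for i in range(n_features):
        for j in range(i + 1, n_features):
            total += abs(sum(a * b for a, b in zip(feats[i], feats[j])))
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    # 64 features squeezed into 4 dimensions, then into 32.
    for dim in (4, 32):
        print(dim, round(mean_interference(64, dim), 3))
```

With the feature count held fixed, the average overlap falls as the dimension grows, so each feature can be read back with less crosstalk from the others.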

The failure mode this predicts — and which practitioners who have debugged transformers at scale will recognize — is representational collapse in the middle layers. Not gradient vanishing. Not rank collapse in the weight matrices. A quieter failure: the residual stream converges to a low-dimensional manifold before the final layers have had a chance to write their computation. The loss looks fine. The attention patterns look reasonable. The model is simply not using most of its capacity.
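One hedged way to look for this kind of collapse is to estimate how many directions a layer's activations actually occupy. The sketch below uses the participation ratio, (tr C)^2 / tr(C^2) for the activation covariance C. This is a common proxy for effective dimensionality, chosen here for illustration; the excerpt does not name a specific diagnostic. `participation_ratio` and `toy_activations` are hypothetical names, and the synthetic data stands in for residual-stream activations captured from a real model.

```python
import random


def participation_ratio(activations):
    """Effective dimensionality of a batch of vectors.

    Computes (tr C)^2 / tr(C^2), where C is the covariance matrix.
    The value ranges from 1 (one dominant direction) up to the width.
    """
    n = len(activations)
    d = len(activations[0])
    means = [sum(row[j] for row in activations) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in activations]
    cov = [
        [sum(r[i] * r[j] for r in centered) / n for j in range(d)]
        for i in range(d)
    ]
    trace = sum(cov[i][i] for i in range(d))
    # For a symmetric matrix, tr(C^2) is the sum of squared entries.
    trace_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    return trace ** 2 / trace_sq


def toy_activations(n, dim, rank, seed=0):
    """Synthetic batch in which only the first `rank` coordinates vary.

    The participation ratio is rotation-invariant, so an axis-aligned
    subspace is enough for a toy example.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 1.0) if j < rank else 0.0 for j in range(dim)]
        for _ in range(n)
    ]


if __name__ == "__main__":
    healthy = toy_activations(400, 16, rank=16)
    collapsed = toy_activations(400, 16, rank=2)
    print(participation_ratio(healthy), participation_ratio(collapsed))
```

Tracked layer by layer, a sharp drop in this number partway through the stack would be consistent with the collapse described above, even when the loss and the attention maps show nothing unusual.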

1. Elhage et al., A Mathematical Framework for Transformer Circuits, 2021. Anthropic Transformer Circuits Thread.

2. Hoffmann et al., Training Compute-Optimal Large Language Models, 2022. The paper that made GPT-3 look undertrained.

Read the Current Issue
Free · No account required
Marginalia

Notes left in the margins by people who read it closely.

I've read every major ML blog for six years. Cipher is the first one I actually annotate.

Dr. Priya Nair
Research Lead, ML Systems · Toronto
Issue 07 reader since launch

The issue on RLHF traced the actual failure modes nobody else was writing about. I forwarded it to my entire team.

Marcus Okonkwo
Principal Engineer, NLP Infrastructure · London
Shared internally at two orgs

I came for the transformers piece. I stayed because every issue makes me realize how much I've been pattern-matching instead of understanding.

Yuki Tanaka
Self-taught practitioner · Seoul
No CS degree. Reads every footnote.

The grant proposal I submitted last fall cited the Cipher piece on diffusion model provenance. The reviewers thought I'd read the original papers. I had — because Cipher told me which ones mattered.

Sofía Herrera
PhD Candidate, Computational Neuroscience · Madrid
Grant funded.

Next issue ships in 18 days.

One email. Monthly. Unsubscribe with a single click.

Structure

What arrives in your inbox
once a month, without apology.

01

One Breakthrough

Every issue covers exactly one development in depth. Not a roundup. Not five things to know. One argument, fully made.

02

The Lineage

Where did this idea come from? Which papers were cited without being read? Which grad student's thesis quietly shaped the entire direction?

03

The Math, Explained

Not avoided, not hidden in appendices. The relevant equations are typeset and walked through — with the intuition built first, then the formalism.

04

The Failure Record

What was tried before this? Why didn't it work? The negative results that don't make it into abstracts but shape every decision that follows.

05

A Working Question

Each issue ends with one open question — not a call to action, not a summary. A question the field hasn't settled, stated precisely.

"The timeline moves fast. The shelf doesn't move at all. That's the point."

Start with the current issue.

Free to read. No account. Subscribe only if it earns it.

Read the Current Issue
or subscribe below
Monthly · No sponsors · No affiliate links · Ever