Vol. I · Issue 12 · Feb 2026

Cipher

A Monthly Dispatch

The Math
Has Opinions.

Machine learning,
traced to its roots.

Est. 2024
12 issues deep.

On Method

Most ML content is written for the timeline.
This is written for the shelf.

Every week, a dozen newsletters summarize the same arXiv papers in the same breathless register. They tell you what was published. They do not tell you why it matters, what failed before it, which assumptions it quietly inherits, or where the next three years of citation graphs will probably lead.

Cipher is built on a different premise. Each issue takes one breakthrough — an architecture, a training regime, a theoretical result — and traces it backward through the decisions that produced it. The false starts. The prior work that was cited but not understood. The grad student whose ablation study quietly changed everything.

The result reads less like a newsletter and more like a chapter from a monograph you wish existed. Dense. Argued. Annotated in the margins by someone who has been thinking about this longer than the hype cycle has been running.

"Understanding is not the same as being current."

¶ Written for the engineer who still has the textbook on the desk.

From the Archive · Issue 09

Attention Is Not All You Need:
The Residual Stream as Information Bottleneck

The original Transformer paper presents attention as a mechanism for relating positions in a sequence. What it does not present — and what took the field roughly four years to articulate clearly — is that the residual stream is doing something structurally different from the attention heads reading it.¹

Consider the residual pathway as a shared workspace. Each layer writes to it additively. The attention heads and MLPs are not transforming a state so much as bidding for bandwidth in a channel with fixed capacity. When you frame it this way, the information bottleneck literature — Tishby et al., 2000; Saxe et al., 2019 — stops looking like a theoretical curiosity and starts looking like a design specification that was accidentally implemented before it was named.
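The additive-workspace framing can be made concrete with a toy sketch. Everything here is an assumption for illustration: the "layers" are fixed random linear maps standing in for attention heads and MLPs, there is no normalization, and names such as `make_toy_layer` and `run_residual_stream` are hypothetical. The point it shows is only that, when every layer writes additively, the final stream is the embedding plus the sum of all writes.

```python
import random


def make_toy_layer(d_model, seed):
    """Return a toy layer: a fixed random linear map (stand-in for a head or MLP)."""
    rng = random.Random(seed)
    scale = 0.1 / d_model ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(d_model)] for _ in range(d_model)]

    def layer(stream):
        # Read the current stream and compute what this layer wants to write.
        return [sum(w * s for w, s in zip(row, stream)) for row in weights]

    return layer


def run_residual_stream(embedding, layers):
    """Apply each layer in order, adding its output to the shared stream."""
    stream = list(embedding)
    writes = []
    for layer in layers:
        write = layer(stream)                              # read
        writes.append(write)
        stream = [s + w for s, w in zip(stream, write)]    # additive write
    return stream, writes


d_model = 8
rng = random.Random(0)
embedding = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
layers = [make_toy_layer(d_model, seed) for seed in range(4)]
final_stream, writes = run_residual_stream(embedding, layers)

# The final stream should equal the embedding plus every write, up to float error.
reconstructed = [e + sum(w[i] for w in writes) for i, e in enumerate(embedding)]
max_gap = max(abs(a - b) for a, b in zip(final_stream, reconstructed))
print(f"max reconstruction gap: {max_gap:.2e}")
```

Because nothing is ever overwritten, every layer's contribution has to share the same `d_model` coordinates, which is where the fixed-capacity channel picture comes from.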

"The superposition hypothesis is not a hypothesis about neurons. It is a hypothesis about what happens when you have more concepts than dimensions."

This matters for practitioners because it reframes what d_model actually controls. Increasing the embedding dimension does not just give the model more room to think; it relaxes the bottleneck constraint. The empirical scaling laws that Hoffmann et al. derived in the Chinchilla paper are, under this lens, a measurement of how information density scales with parameter count when the bottleneck is the binding constraint.²
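One rough way to see what relaxing the bottleneck buys is to pack more concept directions than dimensions into a space and measure how much they overlap. The sketch below is an illustration under loud assumptions: concept directions are drawn at random rather than learned, interference is summarized as the mean absolute cosine between distinct directions, and the function names are hypothetical.

```python
import math
import random


def random_unit_vectors(n_concepts, d_model, rng):
    """Draw n_concepts random directions on the unit sphere in d_model dimensions."""
    vectors = []
    for _ in range(n_concepts):
        v = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors


def mean_abs_interference(vectors):
    """Average |cosine| between all pairs of distinct concept directions."""
    total, pairs = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += abs(sum(a * b for a, b in zip(vectors[i], vectors[j])))
            pairs += 1
    return total / pairs


rng = random.Random(0)
n_concepts = 128  # more concepts than dimensions in both settings below
interference = {
    d_model: mean_abs_interference(random_unit_vectors(n_concepts, d_model, rng))
    for d_model in (16, 64)
}
for d_model, value in interference.items():
    print(f"d_model={d_model:3d}  mean |cos| between concepts: {value:.3f}")
```

With the concept count held fixed, the wider space shows lower average overlap. A trained model arranges its directions far less naively than this, so treat the numbers as intuition only.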

The failure mode this predicts — and which practitioners who have debugged transformers at scale will recognize — is representational collapse in the middle layers. Not gradient vanishing. Not rank collapse in the weight matrices. A quieter failure: the residual stream converges to a low-dimensional manifold before the final layers have had a chance to write their computation. The loss looks fine. The attention patterns look reasonable. The model is simply not using most of its capacity.
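A diagnostic for this kind of collapse could look like the sketch below. It uses the participation ratio of the activation covariance as an effective-dimensionality estimate. That metric is a choice made here for illustration, not one the issue names, and the "healthy" and "collapsed" activations are synthetic stand-ins for residual-stream snapshots you would capture from a real model.

```python
import random


def participation_ratio(activations):
    """Effective dimensionality of a list of activation vectors.

    PR = (sum of covariance eigenvalues)^2 / (sum of squared eigenvalues),
    computed as trace(C)^2 / trace(C @ C), so no eigendecomposition is needed.
    """
    n = len(activations)
    d = len(activations[0])
    means = [sum(row[j] for row in activations) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in activations]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    trace = sum(cov[i][i] for i in range(d))
    # For a symmetric matrix, trace(C @ C) equals the sum of squared entries.
    trace_of_square = sum(c * c for row in cov for c in row)
    return trace * trace / trace_of_square


rng = random.Random(0)
d_model, n_tokens = 16, 400

# Synthetic "healthy" stream: variance spread across all dimensions.
healthy = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(n_tokens)]

# Synthetic "collapsed" stream: every token lies in a 2-dimensional subspace.
basis = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(2)]
collapsed = []
for _ in range(n_tokens):
    a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    collapsed.append([a * u + b * v for u, v in zip(basis[0], basis[1])])

pr_healthy = participation_ratio(healthy)
pr_collapsed = participation_ratio(collapsed)
print(f"healthy:   {pr_healthy:.1f} of {d_model} dimensions")
print(f"collapsed: {pr_collapsed:.1f} of {d_model} dimensions")
```

Run per layer on real activations, a sharp drop in this number in the middle of the network would be one sign of the quiet failure described above, even while the loss curve looks unremarkable.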

1. Elhage et al., A Mathematical Framework for Transformer Circuits, 2021. Anthropic Transformer Circuits Thread.

2. Hoffmann et al., Training Compute-Optimal Large Language Models, 2022. The paper that made GPT-3 look undertrained.

→ This excerpt is from Issue 09, Feb 2025. 4,200 words with full derivations.

Read the Current Issue · Free · No account required

Marginalia

Notes left in the margins by people who read it closely.

“I've read every major ML blog for six years. Cipher is the first one I actually annotate.”

Dr. Priya Nair, Research Lead, ML Systems · Toronto

Issue 07 reader since launch

“The issue on RLHF traced the actual failure modes nobody else was writing about. I forwarded it to my entire team.”

Marcus Okonkwo, Principal Engineer, NLP Infrastructure · London

Shared internally at two orgs

“I came for the transformers piece. I stayed because every issue makes me realize how much I've been pattern-matching instead of understanding.”

Yuki Tanaka, Self-taught practitioner · Seoul

No CS degree. Reads every footnote.

“The grant proposal I submitted last fall cited the Cipher piece on diffusion model provenance. The reviewers thought I'd read the original papers. I had — because Cipher told me which ones mattered.”

Sofía Herrera, PhD Candidate, Computational Neuroscience · Madrid

Grant funded.

Next issue ships in 18 days.

One email. Monthly. Unsubscribe with a single click.

Structure

What arrives in your inbox
once a month, without apology.

One Breakthrough

Every issue covers exactly one development in depth. Not a roundup. Not five things to know. One argument, fully made.

4,000–6,000 words

The Lineage

Where did this idea come from? Which papers were cited without being read? Which grad student's thesis quietly shaped the entire direction?

10–20 cited sources

The Math, Explained

Not avoided, not hidden in appendices. The relevant equations are typeset and walked through — with the intuition built first, then the formalism.

Assumes calculus, linear algebra

The Failure Record

What was tried before this? Why didn't it work? The negative results that don't make it into abstracts but shape every decision that follows.

The part conferences don't publish

A Working Question

Each issue ends with one open question — not a call to action, not a summary. A question the field hasn't settled, stated precisely.

For the 2 AM debug sessions

"The timeline moves fast. The shelf doesn't move at all. That's the point."

Start with the current issue.

Free to read. No account. Subscribe only if it earns it.

Read the Current Issue, or subscribe below

Monthly · No sponsors · No affiliate links · Ever
