Cascade

Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation

A structure-aware convolutional decoder that exploits the geometric regularity of quantum error-correcting codes. Trained at high noise rates, Cascade generalizes reliably to the low-noise regime needed for fault tolerance — revealing a waterfall regime of error suppression and achieving logical error rates up to ${\sim}17\times$ lower than existing decoders at 3–5 orders of magnitude higher throughput, with no error floor down to $P_L \approx 10^{-11}$.

Andi Gu · J. Pablo Bonilla Ataides · Mikhail D. Lukin · Susanne F. Yelin
Harvard University
Accuracy-latency Pareto frontier for Cascade on the [[144,12,12]] BB code
Accuracy vs. Latency Logical error rate versus decoding latency on a bivariate bicycle (BB) code — a high-rate quantum error-correcting code encoding 12 logical qubits into 144 physical qubits — at physical error rate $p=0.2\%$ (lower-left is better). Each blue point is a Cascade model of different size; colored diamonds are prior decoders — orders of magnitude slower and less accurate.

The Decoding Problem

Errors Accumulate
Stabilizers Detect
Decoder
Errors Corrected
3D surface code with error chains
Measure
Syndrome detection events
Decode
Did an error occur?
Correct
Clean lattice after correction
QEC Pipeline Errors accumulate on data qubits; stabilizer measurements produce a spacetime syndrome; the decoder determines whether a logical error occurred; the logical state is protected.

Quantum error correction produces a continuous stream of noisy, structured measurements — syndromes — that must be classified in real time. The decoder maps each syndrome to a binary label (did a logical error occur?). A wrong answer silently corrupts the computation; a slow answer bottlenecks the processor.

What is quantum error correction?

Individual qubits are fragile — environmental noise constantly introduces random bit-flip and phase-flip errors. Quantum error correction protects against this by spreading a single logical qubit across many physical qubits and continuously running local parity checks (called stabilizer measurements) to detect errors without disturbing the encoded information.

The outcomes of these checks form a syndrome: a structured grid of binary values indicating which checks were violated. Crucially, the syndrome reveals the presence of errors but not their exact location — multiple different error patterns can produce the same syndrome. The decoder's job is to infer the most likely correction from this ambiguous signal, and it must do so in real time before new errors accumulate.

At operational error rates, errors are sparse — only a small fraction of qubits are affected. The syndrome provides a set of local parity checks on this sparse error vector: fewer bits than there are potential error locations, but enough to recover the error given the sparsity constraint — a structure analogous to compressed sensing, but over $\mathbb{F}_2$. Crucially, sparse errors produce localized clusters of syndrome violations, so the problem has natural spatial structure that a decoder can exploit.
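The parity-check picture can be made concrete with a toy example. The matrix below is random and purely illustrative — it is not a real QEC code — but it shows the shape of the problem: fewer check bits than error locations, recoverable only because the error is sparse.

```python
import numpy as np

# Toy illustration: a syndrome is a set of parity checks over F2 on a
# sparse error vector -- fewer check bits than potential error locations.
rng = np.random.default_rng(0)
n_qubits, n_checks = 12, 8
H = rng.integers(0, 2, size=(n_checks, n_qubits))  # hypothetical check matrix

e = np.zeros(n_qubits, dtype=int)
e[[2, 7]] = 1                 # sparse error: only 2 of 12 locations flipped

s = (H @ e) % 2               # syndrome: which checks are violated
# Ambiguity: any e' differing from e by a vector in H's null space over F2
# produces the identical syndrome -- the decoder must resolve this.
```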

Our Approach

Cascade is a convolutional neural network decoder that exploits the spatial structure of the syndrome: each layer resolves nearby violations, and successive layers integrate over progressively larger regions until the global logical class can be determined. The architecture is designed around two principles: structure — the convolutions encode the geometric regularity of the code — and scale — sufficient model capacity enables generalization from high-noise training to the low-noise regime fault tolerance demands.

Scaling neural decoders to the large codes and low error rates needed for fault tolerance raises two challenges. First, extreme class imbalance: at operational error rates, logical failures occur with probability $10^{-12}$ to $10^{-6}$, so naive training requires exponentially many samples to encounter enough failures for useful gradient signal. Second, real-time inference: each syndrome must be decoded before the next one arrives — on the order of 1 µs for superconducting qubits, ~1 ms for trapped ions and neutral atoms — favoring local, parallelizable operations.
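The imbalance is easy to quantify. A back-of-envelope calculation at the failure probabilities quoted above shows why naive training at the deployment noise rate is hopeless:

```python
# Samples needed to observe ~100 logical failures when failures occur with
# probability P_L per shot: the count scales as 100 / P_L.
target_failures = 100
needed = {p: target_failures / p for p in (1e-6, 1e-9, 1e-12)}
# ranges from ~1e8 samples at P_L = 1e-6 up to ~1e14 at P_L = 1e-12
```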

Cascade
Embed → $\mathbb{R}^H$
Geometry-Aware Convolution ($\times L$)
Aggregate
Classify
Embedding detection events
Convolutional layers on syndrome
or: BB code torus convolution
Pooling over logical operators
error?
Decoder Architecture Syndromes are embedded into learned representations, processed by geometry-aware convolutional layers that respect the code's local structure, aggregated over each logical operator's support, and classified.

The key design choices are which inductive biases to build in, and how large the model needs to be. The convolutional structure exploits three geometric regularities of QEC codes, and the model's capacity determines how much of the code's error-correcting power it can actually access.

Structure + Scale

Structure + Scale Left: geometry-aware local convolutions apply the same rules at every site and time step (structure). Right: larger models generalize to low noise (scale): trained at $p = 0.5\%$, a small model ($H = 16$) overfits while a large model ($H = 256$) generalizes to deployment at $p = 0.01\%$.

Structure

QEC codes have three geometric regularities that Cascade exploits by construction.

  • Translation equivariance: the same local rules at every site.
  • Locality: successive layers resolve errors at increasing scales.
  • Anisotropy: distinct learned weights for each geometric offset, because information from different directions carries different meaning.

Convolution encodes all three. For codes on a torus, we implement custom Triton kernels for the required non-Euclidean convolutions.
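For intuition, a plain periodic grid can already be handled with circular padding in standard frameworks; the custom kernels are needed because BB-code neighborhoods also mix stabilizer types at code-specific offsets. A minimal sketch of just the wrap-around part (shapes illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 16, 12, 12)   # (batch, channels, l, m) on a torus
w = torch.randn(32, 16, 3, 3)    # distinct learned weight per geometric offset

x_pad = F.pad(x, (1, 1, 1, 1), mode="circular")  # wrap both cycles
y = F.conv2d(x_pad, w)           # spatial size is preserved: 12 x 12
```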

Scale

We train at a single, relatively high physical error rate where gradient signal is abundant, then deploy across a range extending orders of magnitude lower. The geometric inductive bias ensures the learned representations transfer — the same local error patterns recur at any noise rate, just less frequently.

Sufficient model capacity is needed to exploit this prior. Small models memorize the training distribution; large models learn the underlying structure and generalize reliably to the low-noise regime that fault tolerance demands. These are the ML ingredients that made the waterfall regime accessible.

The Waterfall Effect

Cascade probes an under-explored regime in which quantum error-correcting codes suppress errors far more aggressively than standard distance scaling predicts. The standard model predicts that logical error rates scale as $P_L \sim p^{\lfloor(d+1)/2\rfloor}$, determined by the code distance $d$ — this implicitly assumes that minimum-weight uncorrectable errors dominate. In practice, these minimum-weight failure modes are extremely rare; the logical error rate is actually dominated by far more numerous higher-weight failure modes.

The result is two distinct regimes. At moderate physical error rates below the code's critical threshold (the rate below which error correction starts working), the abundant high-weight failure modes drive a steep waterfall in error rate ($\sim p^{10.8}$ for the $\llbracket 144, 12, 12 \rrbracket$ code). Only at very low noise do the rare minimum-weight modes take over, producing a shallower distance-limited floor ($\sim p^{6.4} \approx p^{\lfloor(d+1)/2\rfloor}$). This two-regime structure is well known for classical LDPC codes and has been noted in the quantum coding theory literature, but was not widely appreciated or clearly characterized in practice — because existing decoders are not accurate enough to resolve it cleanly.
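The two-regime picture can be summarized as a sum of power laws. The sketch below uses the exponents quoted above for the $\llbracket 144, 12, 12 \rrbracket$ code, but the prefactors $A$ and $B$ are hypothetical — the true values are code- and decoder-specific:

```python
import math

A, B = 1e12, 1e2               # hypothetical prefactors
m_wf, m_floor = 10.8, 6.4      # waterfall and floor exponents

def P_L(p):
    # waterfall term dominates at moderate p; floor term at very low p
    return A * p**m_wf + B * p**m_floor

# Crossover: A p^m_wf = B p^m_floor  =>  p* = (B/A)^(1/(m_wf - m_floor))
p_star = (B / A) ** (1 / (m_wf - m_floor))
```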

Cascade exposes the waterfall by correctly handling the complex high-weight error patterns that simpler decoders systematically fail on. BP+OSD (belief propagation with ordered statistics decoding, the standard algorithmic decoder) misses the waterfall entirely ($\sim p^{5.4}$), leaving logical error rates ${\sim}4000\times$ above Cascade at $p = 0.1\%$. There is no error floor: exponential error suppression persists down to $P_L \approx 2 \times 10^{-11}$.

Waterfall error suppression on the [[144,12,12]] bivariate bicycle code
Waterfall Error Suppression Error suppression on the $\llbracket 144, 12, 12 \rrbracket$ BB code. The logical error rate decomposes into two power-law contributions: a steep waterfall ($\sim p^{10.8}$) where the numerous high-weight failure modes dominate, and a distance-limited floor ($\sim p^{6.4}$) that emerges only at very low noise. BP+OSD ($\sim p^{5.4}$) never accesses the waterfall regime. Insets show representative failure modes at each weight scale.

Practical Impact

The waterfall has direct implications for how many physical qubits fault-tolerant quantum computation actually requires. Standard resource estimates model logical error rates using the distance-limited formula $P_L \sim \Lambda^{-\lfloor(d+1)/2\rfloor}$ with $\Lambda \approx 10$, calibrated to MWPM (minimum-weight perfect matching, the dominant classical decoding algorithm). A decoder that achieves steeper-than-distance scaling reduces the code distance $d$ — and therefore the physical qubit count — needed for any given target error rate.
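Under this distance-limited model, the required code distance for a target error rate follows from a one-line search. The sketch below assumes a unit prefactor `C`, which is a simplification — the $d = 19$ vs. $d = 15$ figures quoted on this page come from a calibration not reproduced here:

```python
def min_distance(target, Lam, C=1.0, d_max=51):
    """Smallest odd d with C * Lam**(-floor((d+1)/2)) <= target."""
    for d in range(3, d_max + 1, 2):
        if C * Lam ** -((d + 1) // 2) <= target:
            return d
    return None

d_needed = min_distance(target=1e-9, Lam=10)  # MWPM-calibrated Lambda ~ 10
```

A decoder with steeper effective scaling shifts the whole search left, cutting the distance (and the $\sim d^2$ physical qubit count) needed for the same target.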

~40%
fewer physical qubits to reach a target logical error rate of ${\sim}10^{-9}$. Cascade achieves this at code distance $d=15$, compared to $d=19$ for MWPM. The advantage grows with stricter targets.

Complementing the waterfall's reduction of space overhead, Cascade's well-calibrated confidence estimates reduce time overhead. Many fault-tolerant protocols require operations that must be retried until they succeed; discarding low-confidence predictions achieves a ${\sim}20\times$ higher acceptance rate than cluster-based post-selection at matched error rates — directly reducing the number of retries required.

Results

BB Codes: State-of-the-Art Across Three Code Sizes

Cascade achieves lower logical error rates than all prior decoders — BP+OSD, Relay (a learned variant of belief propagation), and Tesseract (a near-optimal but computationally expensive decoder) — across all three bivariate bicycle codes tested, with 3–5 orders of magnitude higher throughput. On the largest code ($\llbracket 288, 12, 18 \rrbracket$, encoding 12 logical qubits into 288 physical qubits), the decoder achieves $P_L \sim 10^{-10}$ per logical qubit per cycle at $p = 0.2\%$. Unlike belief propagation, whose fixed update rules cause convergence failures when multiple error patterns produce the same syndrome, Cascade learns flexible message-passing rules that circumvent these failure modes.

Distance scaling of BB code decoders under circuit-level noise
BB Code Distance Scaling (a–c) Logical error rate versus physical error rate for $\llbracket 72, 12, 6 \rrbracket$, $\llbracket 144, 12, 12 \rrbracket$, and $\llbracket 288, 12, 18 \rrbracket$. (d) Accuracy vs. latency at $p = 0.2\%$ for all three codes. Cascade (GPU inference on NVIDIA H200) spans a range of latencies while achieving lower error rates than all prior decoders.

Surface Codes: Approaching Near-Optimal Decoding

On surface codes under realistic noise, Cascade achieves an error suppression factor of $\Lambda \approx 8.4$ — a single number summarizing how aggressively errors are suppressed per unit of code distance (higher is better). This substantially exceeds MWPM ($\Lambda \approx 5.0$) and its correlated variant ($\Lambda \approx 7.8$), and approaches Tesseract ($\Lambda \approx 9.1$), whose computational cost (up to 1 s per shot) makes it impractical for real-time use. Compared to AlphaQubit, a transformer-based decoder that achieves comparable accuracy on surface codes, Cascade uses a simpler training pipeline (binary cross-entropy at a single noise level, no auxiliary losses or multi-stage fine-tuning) and an order of magnitude fewer training examples ($3 \times 10^8$ versus $2$–$3 \times 10^9$).

Distance scaling of surface code decoders at p=0.2%
Surface Code Performance (a) Logical error rate per round versus code distance. (b) Accuracy–latency tradeoff on NVIDIA H200. (c) Error suppression factor $\Lambda$ across decoders.

Calibration and Post-Selection

Where the waterfall reduces space costs, calibrated confidence estimates reduce time costs. On the $\llbracket 72, 12, 6 \rrbracket$ BB code, Cascade's predicted probabilities remain well-calibrated across physical error rates far below the training distribution — the predicted probability of a logical error closely matches the true frequency. This means low-confidence predictions can be discarded (post-selected) to achieve much lower error rates at the cost of throughput. At $p = 0.55\%$, the decoder reaches an error rate $\sim 2 \times 10^{-3}$ while keeping ${\sim}95\%$ of predictions, compared to ${\sim}5\%$ for prior methods — a roughly $20\times$ improvement.
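Confidence-based post-selection itself is only a few lines. The sketch below uses synthetic predictions from an idealized perfectly calibrated decoder, not real decoder output, but shows the mechanics: threshold on confidence, then measure acceptance and the conditional error rate.

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.uniform(size=200_000)            # predicted P(logical error)
outcome = rng.uniform(size=q.size) < q   # idealized calibrated labels

confidence = np.maximum(q, 1.0 - q)
accept = confidence >= 0.9               # keep only confident predictions

errors = (q >= 0.5) != outcome           # prediction disagrees with truth
rate_all = errors.mean()
rate_kept = errors[accept].mean()        # lower on the accepted subset
acceptance = accept.mean()
```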

Calibration and post-selection performance on the [[72,12,6]] BB code
Confidence-Aware Decoding (a) Logical error rate versus acceptance rate at $p = 0.55\%$. (b) Logical error rate versus physical error rate for different discard rates. (c) Reliability diagram: predicted probabilities closely match observed frequencies across noise levels.

Scaling and Generalization

How well a decoder exploits a code's error-correcting capability depends strongly on its capacity. We trained surface code decoders of varying width $H$ (with fixed depth $L=8$) at a single noise level, then evaluated across a range of physical error rates extending well below the training distribution. All models see the same data — the differences reflect only their capacity to learn generalizable representations.

Small models ($H < 64$) achieve reasonable performance near the training noise rate but fail to extrapolate, exhibiting error suppression worse than even uncorrelated MWPM. Large models ($H \geq 64$) recover the correct error-suppression exponents across the entire range — approaching the performance of Tesseract. Insufficient capacity is a key reason existing decoders fail to access the waterfall regime: as a model's expressive capacity grows, it gains the ability to recognize and correctly classify increasingly complex error patterns that simpler decoders systematically mishandle.
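The error-suppression exponent $m$ is extracted by fitting $P_L \propto p^m$, i.e. a linear regression in log-log space. A minimal recipe on synthetic, noiseless data with a known exponent:

```python
import numpy as np

p = np.logspace(-3, -2, 8)     # physical error rates
m_true, C = 6.4, 1e3           # hypothetical ground truth: P_L = C * p^m
P_L = C * p ** m_true

# Fit log P_L = m * log p + log C
m_fit, logC_fit = np.polyfit(np.log(p), np.log(P_L), 1)
```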

Logical error rate and error suppression exponent versus model size
Capacity and Generalization (a) Logical error rate versus physical error rate at $d=15$ for models with varying hidden dimension $H$. (b) Error suppression exponent $m$ (from fitting $P_L \propto p^m$) versus hidden dimension. Small models fall below MWPM; large models approach Tesseract (near-optimal).

Inside Cascade

Cascade runs inference live in your browser below. Click qubits to introduce errors and watch the network's internal activations respond — early layers detect local syndrome patterns while deeper layers integrate information across larger regions. Hover over any activation to see its receptive field traced back through the network to the surface code.

How Cascade Works

Architecture

Cascade follows an encoder–backbone–readout design. A lightweight embedding projects the syndrome into a hidden dimension $H$, followed by $L$ pre-activation bottleneck blocks with local convolutions and residual connections. A final convolution scatters representations to data qubits, global average pooling aggregates over each logical operator's support, and a small MLP head outputs logits for logical observables. We use 3D convolutions for surface codes (2D space + time) and torus-equivariant convolutions for BB codes.

Bottleneck Block $h^{(l)}$ → BN → SiLU → project down ($H \to H/4$, reduce) → BN → SiLU → code-specific conv (message passing) → BN → SiLU → project up ($H/4 \to H$, restore) → residual add scaled by $1/\sqrt{2L}$ → $h^{(l+1)}$. The code-specific convolution is the only component that changes between code families.

We train across scales using MuP to preserve training dynamics while increasing width, and scale residual connections by $1/\sqrt{2L}$ for stability.
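A minimal PyTorch sketch of one such block, with a plain 3×3 convolution standing in for the code-specific convolution. Names and hyperparameters here are illustrative, not the paper's exact implementation:

```python
import math
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Pre-activation bottleneck with a 1/sqrt(2L)-scaled residual."""
    def __init__(self, H: int, L: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(H), nn.SiLU(),
            nn.Conv2d(H, H // 4, kernel_size=1),       # project down
            nn.BatchNorm2d(H // 4), nn.SiLU(),
            nn.Conv2d(H // 4, H // 4, 3, padding=1),   # stand-in for the
            nn.BatchNorm2d(H // 4), nn.SiLU(),         # code-specific conv
            nn.Conv2d(H // 4, H, kernel_size=1),       # project up
        )
        self.res_scale = 1 / math.sqrt(2 * L)          # stabilizes deep stacks

    def forward(self, x):
        return x + self.res_scale * self.body(x)

block = BottleneckBlock(H=64, L=8)
y = block(torch.randn(2, 64, 12, 12))
```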

Custom Triton Kernels for Torus Convolutions

Standard deep learning frameworks provide convolutions for regular grids — 1D sequences, 2D images, 3D volumes. BB codes live on a torus: the stabilizer connectivity wraps around both cycles with modular arithmetic, and each stabilizer's neighbors include a mix of same-type and cross-type stabilizers at code-specific offsets. No built-in operation handles this.

Each output position $(b, t, p, i, j)$ — batch, time, plane (X or Z), and grid coordinates on the $\ell \times m$ torus — is computed as:

$$y_p[t, i, j] = \sum_{\Delta t}\; \sum_{(p', \Delta i, \Delta j) \in \mathcal{N}_p} x_{p'}[t{+}\Delta t,\; (i{+}\Delta i) \bmod \ell,\; (j{+}\Delta j) \bmod m] \;\cdot\; W_p^{(\Delta t, p', \Delta i, \Delta j)}$$

The neighbor set $\mathcal{N}_p$ includes both same-plane deltas ($p' = p$) and cross-plane deltas ($p' \neq p$), at code-specific offsets that wrap toroidally. In PyTorch, a naive implementation gathers neighbors with torch.roll, concatenates them, and runs a matrix multiply:

# x: (batch, time, 2, l, m, channels)
# deltas: list of (plane, delta_i, delta_j) offsets
# W: (channels * time_kernel * len(deltas), out_channels)
import torch

# 1. Gather time neighbors over the valid (non-wrapping) window
t_out = x.shape[1] - time_kernel + 1
x = torch.stack([x[:, t:t + t_out] for t in range(time_kernel)], dim=-1).flatten(-2)

# 2. Gather spatial neighbors on the torus (per plane); after selecting a
#    plane, the (l, m) axes sit at dims (2, 3)
gathered = torch.cat([
    torch.roll(x[:, :, plane], shifts=(di, dj), dims=(2, 3))
    for plane, di, dj in deltas
], dim=-1)

# 3. Linear projection
y = gathered @ W

This is an explicit im2col: steps 1–2 gather all neighbor features into a single matrix of shape (batch*t_out*2*l*m, channels*num_deltas*time_kernel), and step 3 is a standard GEMM. The gathered matrix is large and transient — it exists only to be multiplied and discarded.

Our Triton kernel performs an implicit im2col instead: rather than materializing the full gathered matrix, each thread block loads neighbor values on-the-fly inside the matmul loop, accumulating partial products directly into the output. This avoids the intermediate allocation and fuses the entire operation into a single kernel launch.

Kernel factorization. Each check has 22 spatial neighbors in the check-to-check graph across 3 temporal offsets, giving 66 distinct relations per layer. But the check-to-check connectivity is the square of the Tanner graph — two checks are neighbors precisely when they share a data qubit. This means the kernel factors into two bipartite steps: check→data followed by data→check, each with only 6 spatial neighbors across 2 temporal offsets (12 relations) — more than a 5× reduction in kernel size. This mirrors one round of belief propagation on the Tanner graph, but with learned weights instead of the BP update rules. Stacking two smaller layers reproduces the same receptive field with fewer parameters and FLOPs per layer — analogous to how two stacked 3×3 convolutions replace a 5×5 in conventional architectures (same receptive field, but $2 \times 3^2 = 18$ parameters per channel instead of $5^2 = 25$).
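The stacking analogy is easy to verify numerically: probe the receptive field of two stacked 3×3 convolutions by backpropagating from a single output pixel. All-ones weights are used so no gradient entry cancels to zero.

```python
import torch
import torch.nn as nn

conv_a = nn.Conv2d(1, 1, 3, bias=False)
conv_b = nn.Conv2d(1, 1, 3, bias=False)
with torch.no_grad():
    conv_a.weight.fill_(1.0)
    conv_b.weight.fill_(1.0)

x = torch.zeros(1, 1, 9, 9, requires_grad=True)
y = conv_b(conv_a(x))        # 9x9 -> 7x7 -> 5x5
y[0, 0, 2, 2].backward()     # pick the center output pixel

receptive = int((x.grad != 0).sum())  # inputs that influence that pixel
weights_stacked = 2 * 3 * 3           # 18 weights per channel
weights_single = 5 * 5                # 25 for the equivalent single 5x5
```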

Kernel view:
Torus Convolution Kernels The [[144, 12, 12]] bivariate bicycle code on a torus. Click any stabilizer to see its convolution kernel. The direct check→check kernel has 22 entries — but it factors into two smaller kernels of 6 entries each, passing through data qubits. Blue/orange: X/Z stabilizers. Green/purple: data qubits.

Ablation Studies

We compare three architectures at identical model size ($L=8$, $H=256$) on distance-15 surface codes, plotting logical error rate as a function of training compute (PFLOPs) for a fair comparison across architectures with different per-step costs. Convolution achieves the lowest final error rate, reaching parity with Tesseract. Full attention performs worst: it not only saturates at a higher error rate than local attention, but its entire learning curve is shifted rightward, requiring more total compute to reach any given accuracy. Adding flexibility by moving from local to global attention actually degrades performance — the additional degrees of freedom dilute rather than enhance the structural prior.

Architecture ablation study
Convolution Local attention Full attention
Architecture Ablation (a) Logical error rate versus training compute (PFLOPs) for convolution, local attention, and full attention at fixed model size, evaluated at $p = 8\%$. (b) Architectural inductive biases: convolution applies identical directional weights at every position (colors = direction-specific weights); local attention learns position-dependent weights (varying thickness); full attention connects globally with position-dependent weights.

Citation

@misc{gu2026scalableneuraldecoderspractical,
  title={Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation},
  author={Andi Gu and J. Pablo Bonilla Ataides and Mikhail D. Lukin and Susanne F. Yelin},
  year={2026},
  eprint={2604.08358},
  archivePrefix={arXiv},
  primaryClass={quant-ph},
  url={https://arxiv.org/abs/2604.08358},
}