
How Normalizing Flows Detect New Physics at the LHC

A look at the density estimation model I built for my dissertation at CERN — what normalizing flows are, why they work for anomaly detection, and how knowledge distillation makes them fast enough to matter.

Uday Sharma May 2026 12 min read

At the Large Hadron Collider, proton bunches cross roughly 40 million times per second, and each crossing counts as one collision event. A hardware trigger reduces that to around 100,000 events per second, and a software trigger then selects about 1,000 per second for permanent storage: roughly one event in every 40,000. Everything else is gone. If new physics is hiding in those discarded events, no amount of analysis will find it later.

This is the anomaly detection problem in high-energy physics: find the unusual events before throwing them away, without knowing in advance what “unusual” looks like. My dissertation tried to solve a version of this problem using normalizing flows.

The Setup

Why density estimation?

Most classification-based search methods need labels, that is, examples of the thing you’re looking for. In HEP, we mostly do not have those. We know what Standard Model background events look like, but beyond-Standard-Model (BSM) signals are, by definition, hypothetical. Any supervised approach bakes in assumptions about what we’re searching for.

A cleaner approach: learn the density of normal events, then flag anything with low probability under that model. No labels required. This is what normalizing flows do well.

If you can accurately model what ordinary looks like, you get anomaly detection almost for free — anything the model considers unlikely is a candidate.
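To make the scoring rule concrete, here is a minimal sketch of density-based flagging in plain Python. Everything in it is an assumption for illustration: the 1-D `background` sample is synthetic, a single fitted Gaussian stands in for the flow, and the 1% flag rate is arbitrary.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical 1-D "background" sample standing in for Standard Model events.
background = [random.gauss(0.0, 1.0) for _ in range(5000)]

# Stand-in density model: a single Gaussian fitted to the background.
# (A normalizing flow would play this role for realistic, high-dimensional data.)
mu = statistics.fmean(background)
sigma = statistics.stdev(background)

def anomaly_score(x):
    """Negative log-likelihood under the fitted density; higher means more unusual."""
    z = (x - mu) / sigma
    return 0.5 * z * z + math.log(sigma) + 0.5 * math.log(2.0 * math.pi)

# Pick the threshold so that roughly 1% of background events are flagged.
scores = sorted(anomaly_score(x) for x in background)
threshold = scores[int(0.99 * len(scores))]

def is_anomalous(x):
    return anomaly_score(x) > threshold

flagged_fraction = sum(is_anomalous(x) for x in background) / len(background)
```

The threshold is set purely from background data, so no signal examples are needed at any point. The price is that the false-positive rate is fixed by construction, while the sensitivity to any particular signal is whatever the density model happens to give.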

What is a normalizing flow?

A normalizing flow is a generative model that learns an invertible mapping between a simple base distribution (usually a standard Gaussian) and the complex data distribution you care about. Given a data point $x$, you can compute its exact log-likelihood using the change-of-variables formula:

$$\log p_X(x) = \log p_Z(f(x)) + \log \left| \det \frac{\partial f}{\partial x} \right|$$

Here $f$ is the learned bijection and $p_Z$ is the base density over the latent variable $z = f(x)$. The Jacobian determinant term is the key: it accounts for how $f$ stretches or compresses space. Making this term tractable is what the architecture of the flow is designed to do.
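The formula is easy to check by hand in one dimension. The sketch below assumes the simplest possible flow, an affine map $f(x) = (x - \mu)/\sigma$ with made-up values for `mu` and `sigma`, and compares the change-of-variables result against the Gaussian log-density computed directly.

```python
import math
from statistics import NormalDist

def standard_normal_logpdf(z):
    """log p_Z(z) for a standard Gaussian base distribution."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def flow_log_prob(x, mu, sigma):
    """log p_X(x) via change of variables for the affine flow f(x) = (x - mu) / sigma."""
    z = (x - mu) / sigma            # f(x)
    log_det = -math.log(sigma)      # log |df/dx| = log(1 / sigma)
    return standard_normal_logpdf(z) + log_det

# Illustrative values only.
x, mu, sigma = 4.5, 2.0, 3.0
via_flow = flow_log_prob(x, mu, sigma)
direct = math.log(NormalDist(mu, sigma).pdf(x))
```

The two numbers agree up to floating-point error. The `log_det` term is what corrects for the rescaling by `sigma`; without it the density would not integrate to one.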

The Architecture

Coupling layers and the Jacobian trick

Computing the Jacobian determinant of an arbitrary neural network costs $O(D^3)$ in the input dimension $D$, which is far too expensive to repeat for every event at every training step. Coupling layers avoid this with a structural constraint: split the input dimensions into two halves, leave the first half unchanged, and transform the second half with a scale and shift computed from the first.

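# RealNVP-style affine coupling layer
import torch
import torch.nn as nn

class CouplingLayer(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        # conditioner network (illustrative sizes; assumes dim is even)
        self.net = nn.Sequential(
            nn.Linear(dim // 2, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # outputs dim/2 scales and dim/2 shifts
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        # scale and translate computed from the first half
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)  # bound the log-scale for numerical stability
        # transform the second half; the first half passes through unchanged
        y2 = x2 * torch.exp(s) + t
        # log-det is just the sum of the log-scale terms
        log_det = s.sum(dim=-1)
        return torch.cat([x1, y2], dim=-1), log_det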

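Because only x2 changes, the Jacobian is lower-triangular, so its determinant is the product of the diagonal elements: ones for the untouched half and $\exp(s_i)$ for the transformed half. The log-determinant is therefore just the sum of the scale terms, which costs $O(D)$ instead of $O(D^3)$.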
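The triangular-Jacobian claim can be checked numerically. The sketch below is a two-dimensional toy in plain Python, not the layer above: `s_fn` and `t_fn` are hypothetical closed-form stand-ins for the conditioner network, and the full 2×2 Jacobian is estimated with central finite differences so that its log-determinant can be compared with the scale term.

```python
import math

def s_fn(x1):
    # Hypothetical bounded log-scale, standing in for the tanh-squashed network output.
    return math.tanh(0.7 * x1 - 0.2)

def t_fn(x1):
    # Hypothetical shift, standing in for the network's translation output.
    return 1.5 * x1 + 0.3

def coupling(x1, x2):
    """2-D affine coupling: x1 passes through, x2 is scaled and shifted."""
    s = s_fn(x1)
    y2 = x2 * math.exp(s) + t_fn(x1)
    return (x1, y2), s  # analytic log-det is s (only one transformed dimension)

def numerical_log_det(x1, x2, eps=1e-6):
    """log |det J| from a finite-difference estimate of the full Jacobian."""
    (a_p, c_p), _ = coupling(x1 + eps, x2)
    (a_m, c_m), _ = coupling(x1 - eps, x2)
    (b_p, d_p), _ = coupling(x1, x2 + eps)
    (b_m, d_m), _ = coupling(x1, x2 - eps)
    dy1_dx1 = (a_p - a_m) / (2 * eps)
    dy2_dx1 = (c_p - c_m) / (2 * eps)
    dy1_dx2 = (b_p - b_m) / (2 * eps)   # zero: y1 does not depend on x2
    dy2_dx2 = (d_p - d_m) / (2 * eps)
    det = dy1_dx1 * dy2_dx2 - dy1_dx2 * dy2_dx1
    return math.log(abs(det))

_, analytic = coupling(0.4, -1.3)
numeric = numerical_log_det(0.4, -1.3)
```

The off-diagonal entry `dy2_dx1` can be arbitrarily complicated, but it is multiplied by `dy1_dx2`, which is zero, so it never reaches the determinant. That is why the conditioner network can be as expressive as you like without making the log-determinant any harder to compute.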

Knowledge Distillation

Making it fast enough to matter

A flow with enough capacity to model LHC jet data accurately is expensive to evaluate. The real-time trigger systems at the LHC have microsecond latency budgets and run on FPGAs. A 20-layer coupling flow is never going to fit that.

Knowledge distillation compresses the knowledge of a large “teacher” model into a smaller “student” model. The student does not train on raw data alone; it also tries to match the teacher’s soft outputs, which carry richer information than hard labels.

Key insight: For anomaly detection, we only need the student to reproduce the teacher’s anomaly scores accurately, not its full generative capability. This relaxation makes distillation much more effective.
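As a toy illustration of score-only distillation (not the setup from the dissertation), the sketch below fits a tiny student to a teacher's anomaly scores by least squares. The teacher is an assumed stand-in, the negative log-density of a two-component Gaussian mixture, and the student is a hypothetical two-parameter model `a + b * x**2`. The check at the end asks whether both models flag the same events.

```python
import math
import random

random.seed(1)

def teacher_score(x):
    """Stand-in teacher: negative log-density of an equal mixture of N(-1, 1) and N(+1, 1)."""
    norm = 1.0 / math.sqrt(2.0 * math.pi)
    p = 0.5 * norm * (math.exp(-0.5 * (x + 1.0) ** 2) + math.exp(-0.5 * (x - 1.0) ** 2))
    return -math.log(p)

# Synthetic "background" drawn from the same mixture the teacher models.
xs = [random.choice((-1.0, 1.0)) + random.gauss(0.0, 1.0) for _ in range(2000)]
targets = [teacher_score(x) for x in xs]   # the student never sees labels, only these scores

# Student: score(x) = a + b * x^2, fitted to the teacher's scores by ordinary least squares.
feats = [x * x for x in xs]
n = len(xs)
mean_f = sum(feats) / n
mean_t = sum(targets) / n
cov = sum((f - mean_f) * (t - mean_t) for f, t in zip(feats, targets))
var = sum((f - mean_f) ** 2 for f in feats)
b = cov / var
a = mean_t - b * mean_f

def student_score(x):
    return a + b * x * x

mse = sum((student_score(x) - t) ** 2 for x, t in zip(xs, targets)) / n

# Compare the events each model would flag as the most anomalous 1%.
k = n // 100
top_teacher = set(sorted(range(n), key=lambda i: targets[i])[-k:])
top_student = set(sorted(range(n), key=lambda i: student_score(xs[i]))[-k:])
overlap = len(top_teacher & top_student) / k
```

The student here cannot generate samples or return a normalized density, and its fit to the teacher's scores is not exact. What matters for triggering is the ranking of events, which a much smaller model can match far more easily than the full distribution.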

We used hls4ml and QKeras to quantise and synthesise the distilled student model into FPGA-deployable HLS code. The resulting model fits in a realistic trigger budget while preserving most of the teacher’s detection performance.
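To give a feel for what quantisation does to a model's parameters, here is a small plain-Python sketch of rounding onto a signed fixed-point grid. It does not use the hls4ml or QKeras APIs. The bit widths, the round-to-nearest and saturating behaviour, and the `weights` values are all assumptions chosen for illustration.

```python
def quantize_fixed(value, total_bits=8, int_bits=3):
    """Round to a signed fixed-point grid.

    `total_bits` is the full word width and `int_bits` the number of bits
    (including sign) before the binary point. Values outside the representable
    range saturate at the limits.
    """
    frac_bits = total_bits - int_bits
    step = 2.0 ** -frac_bits                 # spacing of representable values
    lo = -(2.0 ** (int_bits - 1))            # most negative representable value
    hi = 2.0 ** (int_bits - 1) - step        # most positive representable value
    q = round(value / step) * step
    return min(max(q, lo), hi)

# Hypothetical parameters of a tiny student model: score(x) = a + b * x^2.
weights = {"a": 1.4189, "b": 0.3127}
q_weights = {name: quantize_fixed(w) for name, w in weights.items()}

def score(x, w):
    return w["a"] + w["b"] * x * x

# Worst-case change in the anomaly score over a few test inputs.
max_err = max(abs(score(x, weights) - score(x, q_weights)) for x in (0.0, 1.0, 2.0, 3.0))
```

With 8 bits and a step of 1/32, the score shifts by a little over 0.01 on these inputs. Whether a shift like that is acceptable depends on how close events sit to the trigger threshold, which is the motivation for tools that train with quantisation in the loop, rather than rounding a finished model afterwards.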

Results and What’s Next

The distilled model achieved strong anomaly detection performance on benchmark BSM signal injections (stop pair production, W′ boson) while maintaining a manageable false-positive rate on Standard Model background. The full results are in the paper currently under review at Springer Nature IJSA.

What I find most interesting about this line of work is how transferable it is. The same combination — density estimation for anomaly scoring, distillation for deployment efficiency — applies directly to financial time series, network intrusion detection, and any domain with abundant “normal” data and rare, unlabelled anomalies. The physics is specific; the method is general.