The Large Hadron Collider produces roughly 40 million proton–proton bunch crossings per second. A hardware trigger reduces that to around 100,000 events per second, and a software system then selects about 1,000 for permanent storage, roughly one in 40,000 of the original rate. Everything else is gone. If new physics is hiding in those discarded events, no amount of analysis will find it later.
This is the anomaly detection problem in high-energy physics: find the unusual events before throwing them away, without knowing in advance what “unusual” looks like. My dissertation tried to solve a version of this problem using normalizing flows.
The Setup
Why density estimation?
Supervised search methods need labels: examples of the thing you’re looking for. In HEP, we mostly do not have those. We know what Standard Model background events look like, but beyond-Standard-Model (BSM) signals are, by definition, hypothetical. Any supervised approach bakes in assumptions about what we’re searching for.
A cleaner approach: learn the density of normal events, then flag anything with low probability under that model. No labels required. This is what normalizing flows do well.
If you can accurately model what ordinary looks like, you get anomaly detection almost for free — anything the model considers unlikely is a candidate.
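As a minimal sketch of that scoring rule, the snippet below swaps the flow for a one-dimensional Gaussian fitted to toy "background" values. The data, the `nll` helper, and the 99th-percentile threshold are illustrative assumptions, not the pipeline from the dissertation. The anomaly score is the negative log-likelihood, and anything above the threshold is flagged.

```python
import math
import random
import statistics

random.seed(0)

# Toy "background": 5,000 draws from a unit Gaussian (stand-in for Standard Model events).
background = [random.gauss(0.0, 1.0) for _ in range(5000)]

# "Train" the density model: here just a Gaussian fit (a flow would replace this step).
mu = statistics.fmean(background)
sigma = statistics.pstdev(background)

def nll(x):
    """Anomaly score: negative log-likelihood of x under the fitted Gaussian."""
    z = (x - mu) / sigma
    return 0.5 * z * z + math.log(sigma) + 0.5 * math.log(2.0 * math.pi)

# Threshold chosen so roughly 1% of background is flagged (hypothetical working point).
scores = sorted(nll(x) for x in background)
threshold = scores[int(0.99 * len(scores))]

def is_anomalous(x):
    """Flag x when its score exceeds the background-derived threshold."""
    return nll(x) > threshold

typical_event, odd_event = 0.1, 6.0
print(is_anomalous(typical_event), is_anomalous(odd_event))
```

The threshold directly sets the false-positive rate on background, which is the quantity a trigger has to budget for.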
What is a normalizing flow?
A normalizing flow is a generative model that learns an invertible mapping between a simple base distribution (usually a standard Gaussian) and the complex data distribution you care about. Given a data point $x$, you can compute its exact log-likelihood using the change-of-variables formula:
$$\log p_X(x) = \log p_Z(f(x)) + \log \left| \det \frac{\partial f}{\partial x} \right|$$
where $f$ is the learned bijection from data space to latent space and $p_Z$ is the base density. The Jacobian determinant term is the key: it accounts for how $f$ stretches or compresses space. Keeping this term tractable is what the architecture of the flow is designed to do.
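A one-dimensional sanity check makes the formula concrete. The sketch below assumes a hand-picked affine bijection $f(x) = (x - \mu)/\sigma$ (the values of `MU` and `SIGMA` are illustrative, not learned) and confirms that the change-of-variables expression reproduces the closed-form log-density of $\mathcal{N}(\mu, \sigma^2)$.

```python
import math

MU, SIGMA = 2.0, 0.5  # illustrative parameters of the "data" Gaussian

def f(x):
    """Bijection from data space to latent space: standardise x."""
    return (x - MU) / SIGMA

def log_abs_det_jacobian(x):
    """log|df/dx|; for this affine map it is the constant log(1/SIGMA)."""
    return -math.log(SIGMA)

def log_p_z(z):
    """Log-density of the standard Gaussian base distribution."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def log_p_x(x):
    """Change of variables: log p_X(x) = log p_Z(f(x)) + log|det df/dx|."""
    return log_p_z(f(x)) + log_abs_det_jacobian(x)

def log_normal_pdf(x, mu, sigma):
    """Closed-form log-density of N(mu, sigma^2), used only as a reference."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

for x in (1.0, 2.0, 3.5):
    assert math.isclose(log_p_x(x), log_normal_pdf(x, MU, SIGMA))
```

Because $\sigma < 1$ here, $f$ stretches space and the Jacobian term is positive: the data density is more concentrated than the base density.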
The Architecture
Coupling layers and the Jacobian trick
Computing the Jacobian determinant of an arbitrary neural network costs $O(D^3)$ in the input dimension $D$, which is prohibitive when it has to be evaluated for every event at every training step. Coupling layers avoid this with a structural constraint: partition the input dimensions into two halves, pass the first half through unchanged, and transform the second half as a function of the first.
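Written out, with the input split as $x = (x_1, x_2)$ and with $s$ and $t$ denoting the two outputs of a conditioner network (notation chosen here to match RealNVP-style affine coupling), the layer and its Jacobian are:

$$y_1 = x_1, \qquad y_2 = x_2 \odot \exp\big(s(x_1)\big) + t(x_1)$$

$$\frac{\partial y}{\partial x} = \begin{pmatrix} I & 0 \\ \frac{\partial y_2}{\partial x_1} & \operatorname{diag}\big(\exp(s(x_1))\big) \end{pmatrix}, \qquad \log \left| \det \frac{\partial y}{\partial x} \right| = \sum_i s_i(x_1)$$

The off-diagonal block $\partial y_2 / \partial x_1$ can be arbitrarily complicated without affecting the determinant, which is why the conditioner can be an unconstrained neural network.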
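# RealNVP-style coupling layer
import torch
import torch.nn as nn

class CouplingLayer(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        half = dim // 2  # assumes an even input dimension
        # conditioner network (illustrative MLP): first half -> scale and shift
        self.net = nn.Sequential(
            nn.Linear(half, hidden), nn.ReLU(), nn.Linear(hidden, 2 * half))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        # scale and translate from first half
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)  # bound the log-scale for numerical stability
        # transform second half; first half passes through unchanged
        y2 = x2 * torch.exp(s) + t
        # log-det is just the sum of the log-scale terms
        log_det = s.sum(dim=-1)
        return torch.cat([x1, y2], dim=-1), log_det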
Because only x2 changes, the Jacobian is lower-triangular, so its determinant is the product of the diagonal elements $\exp(s_i)$ and the log-determinant is just the sum of the scale terms $s_i$. That costs $O(D)$ instead of $O(D^3)$.
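The claim is easy to check numerically. The sketch below is self-contained and makes its own assumptions: a small `AffineCoupling` class with an arbitrary conditioner MLP, a 6-dimensional random input, and an `inverse` method that is not in the snippet above. It compares the cheap log-determinant against a brute-force one computed from the full autograd Jacobian, and confirms that the layer inverts exactly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class AffineCoupling(nn.Module):
    """Affine coupling layer; the conditioner MLP and its sizes are illustrative."""

    def __init__(self, dim, hidden=16):
        super().__init__()
        half = dim // 2  # assumes an even input dimension
        self.net = nn.Sequential(
            nn.Linear(half, hidden), nn.Tanh(), nn.Linear(hidden, 2 * half)
        )

    def _scale_shift(self, x1):
        s, t = self.net(x1).chunk(2, dim=-1)
        return torch.tanh(s), t

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self._scale_shift(x1)
        y2 = x2 * torch.exp(s) + t
        return torch.cat([x1, y2], dim=-1), s.sum(dim=-1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        s, t = self._scale_shift(y1)  # y1 equals x1, so s and t can be recomputed
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, x2], dim=-1)

layer = AffineCoupling(dim=6).double()
x = torch.randn(6, dtype=torch.float64)

y, cheap_log_det = layer(x)

# Brute-force reference: build the full 6x6 Jacobian with autograd.
full_jacobian = torch.autograd.functional.jacobian(lambda v: layer(v)[0], x)
_, reference_log_det = torch.linalg.slogdet(full_jacobian)

log_det_matches = torch.allclose(cheap_log_det, reference_log_det)
round_trip_ok = torch.allclose(layer.inverse(y), x)
print(log_det_matches, round_trip_ok)
```

Since a single layer leaves half the dimensions untouched, practical flows stack several layers and swap or permute the halves between them so that every dimension is eventually transformed.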
Knowledge Distillation
Making it fast enough to matter
A flow with enough capacity to model LHC jet data accurately is expensive to evaluate. The real-time trigger systems at the LHC have microsecond latency budgets and run on FPGAs. A 20-layer coupling flow is never going to fit that.
Knowledge distillation compresses the knowledge of a large “teacher” model into a smaller “student” model. The student does not train on raw data alone; it also tries to match the teacher’s soft outputs, which carry richer information than hard labels.
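For a density model, one way to make this concrete is to have the student regress the teacher's anomaly score directly. The sketch below is an illustration under stated assumptions, not the training setup from the dissertation: `teacher_score` is a hypothetical stand-in for a trained flow's negative log-likelihood (here simply that of a standard Gaussian), the student is a tiny MLP, and mean-squared error is an arbitrary choice of matching loss.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM = 4  # illustrative feature dimension

def teacher_score(x):
    """Hypothetical teacher: negative log-likelihood under a standard Gaussian.
    In practice this would be -log p(x) from the trained flow."""
    return 0.5 * (x ** 2).sum(dim=-1) + 0.5 * DIM * math.log(2.0 * math.pi)

# Small student network (sizes are assumptions chosen for the example).
student = nn.Sequential(nn.Linear(DIM, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

# Fixed toy "background" sample and the teacher's scores on it.
data = torch.randn(2048, DIM)
with torch.no_grad():
    targets = teacher_score(data)

def distillation_loss():
    """Mean-squared error between student output and teacher score."""
    return loss_fn(student(data).squeeze(-1), targets)

initial_loss = distillation_loss().item()
for _ in range(300):
    optimizer.zero_grad()
    loss = distillation_loss()
    loss.backward()
    optimizer.step()
final_loss = distillation_loss().item()

print(f"distillation loss: {initial_loss:.3f} -> {final_loss:.3f}")
```

At inference time only `student` runs, so its size, rather than the teacher's, sets the latency.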
We used QKeras to quantise the distilled student model and hls4ml to convert it into FPGA-deployable HLS code. The resulting model fits in a realistic trigger budget while preserving most of the teacher’s detection performance.
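To give a feel for what quantisation does to a single weight, the sketch below rounds values onto a signed fixed-point grid similar in spirit to the `ap_fixed<W, I>` types used in HLS. It is a hand-rolled illustration in plain Python, not the QKeras or hls4ml API. The 8-bit width with 2 integer bits, and the use of round-to-nearest with saturation, are arbitrary choices for the example rather than the configuration used for the student model.

```python
def quantize_fixed(value, total_bits=8, int_bits=2):
    """Round `value` to a signed fixed-point grid, saturating at the range limits.

    total_bits counts every bit; int_bits counts the integer bits
    including the sign bit, so the step size is 2 ** -(total_bits - int_bits).
    """
    frac_bits = total_bits - int_bits
    step = 2.0 ** -frac_bits
    lowest = -(2.0 ** (int_bits - 1))
    highest = 2.0 ** (int_bits - 1) - step
    rounded = round(value / step) * step
    return min(max(rounded, lowest), highest)

weights = [0.7312, -1.2049, 0.0041, 3.5]  # made-up example weights
quantized = [quantize_fixed(w) for w in weights]
step = 2.0 ** -6  # grid spacing for 8 total bits, 2 integer bits

for w, q in zip(weights, quantized):
    print(f"{w:+.4f} -> {q:+.6f}")
```

In-range weights move by at most half a step, small weights can collapse to zero, and out-of-range weights are clipped. Those are the three effects a quantised student has to tolerate.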
Results and What’s Next
The distilled model achieved strong anomaly detection performance on benchmark BSM signal injections (stop pair production, W′ boson) while maintaining a manageable false-positive rate on Standard Model background. The full results are in the paper currently under review at Springer Nature IJSA.
What I find most interesting about this line of work is how transferable it is. The same combination — density estimation for anomaly scoring, distillation for deployment efficiency — applies directly to financial time series, network intrusion detection, and any domain with abundant “normal” data and rare, unlabelled anomalies. The physics is specific; the method is general.