Aidan Ewart

Updated 12/09/2024


Hi! I'm an undergrad studying maths at the University of Bristol, and I do research into ensuring the safety of ML systems in my free time.

I am currently interning at Haize Labs, where we make automated adversarial attacks for frontier language models.

Publications

Sparse Autoencoders Find Highly Interpretable Features in Language Models
Hoagy Cunningham*, Aidan Ewart*, Logan Riggs*, Robert Huben, Lee Sharkey
Demonstrates an unsupervised method for finding human-understandable decompositions of LM activations.

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
Abhay Sheshadri*, Aidan Ewart*, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
Develops a new method for cheaply adversarially training LMs.

Eight Methods to Evaluate Robust Unlearning in LLMs
Aengus Lynch*, Phillip Guo*, Aidan Ewart*, Stephen Casper, Dylan Hadfield-Menell
Rigorously evaluates the machine unlearning done in Eldan and Russinovich (2023).

Interesting Projects