Understanding Search in Transformers

Modern AI systems are black boxes whose capabilities are growing far faster than our ability to understand them. I believe that mechanistic interpretability is the key to understanding the AI systems we have today and to building safer ones in the future.

Previously, I was at Conjecture, working with Janus and Nicholas Kees on mechanistic interpretability for LLMs. Now, as part of my thesis and my role as a research lead for AI Safety Camp, I'm continuing this work, focusing on toy models trained on spatial tasks, with the goal of finding, understanding, and re-targeting the search process implemented internally in transformer networks. This project is taking on new collaborators; please reach out if you're interested!

Please see github.com/understanding-search and unsearch.org for the latest updates.

Some work we’ve put out so far: