Tom Henighan

I work on large language model interpretability at Anthropic. Prior to that, I worked on scaling laws at OpenAI and ML engineering at Beehive AI. I did my PhD in Physics at Stanford.

Some selected publications and projects can be found below.

Github  /  Scholar  /  LinkedIn

Anthropic

I work on the interpretability team, which is focused on mechanistically understanding large language models. As one of Anthropic's first employees, I also helped build out and manage some of our core infrastructure for training and evaluating large language models.

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Adly Templeton*, Tom Conerly*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan
transformer-circuits , 2024  
Blogpost / Paper

We scaled Dictionary Learning to extract millions of features from Claude 3 Sonnet, Anthropic's medium-scale production model at the time of writing. The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references. There appears to be a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them. Features can be used to steer large models (see e.g. Influence on Behavior). We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content.

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Trenton Bricken*, Adly Templeton*, Joshua Batson*, Brian Chen*, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, Chris Olah
transformer-circuits , 2023  
Blogpost / Paper / Feature Visualization

In a transformer language model, we decompose a layer with 512 neurons into more than 4000 features which separately represent things like DNA sequences, legal language, HTTP requests, Hebrew text, nutrition statements, and much, much more. Most of these model properties are invisible when looking at the activations of individual neurons in isolation.
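
Roughly, the dictionary-learning setup can be pictured as a one-layer sparse autoencoder trained on a layer's activations. The snippet below is an illustrative sketch, not the code used in the paper; the layer sizes, sparsity penalty, and training details are placeholders.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder: decompose d_model activations into n_features directions."""
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts):
        f = torch.relu(self.encoder(acts))  # sparse, non-negative feature activations
        recon = self.decoder(f)             # reconstruction of the original activations
        return recon, f

def loss_fn(acts, recon, f, l1_coef=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse feature use.
    return ((recon - acts) ** 2).mean() + l1_coef * f.abs().mean()

# Usage: one training step on a batch of (fake) activations.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(1024, 512)  # placeholder activations
opt.zero_grad()
recon, f = sae(acts)
loss = loss_fn(acts, recon, f)
loss.backward()
opt.step()
```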

Superposition, Memorization, and Double Descent
Tom Henighan*, Shan Carter*, Tristan Hume*, Nelson Elhage*, Robert Lasenby, Stanislav Fort, Nicholas Schiefer, Christopher Olah
transformer-circuits , 2023  
Paper

We extend our previous toy-model work to the finite-data regime, revealing how and when these models memorize training examples.

Toy Models of Superposition
N Elhage*, T Hume*, C Olsson*, N Schiefer*, T Henighan, S Kravec, ZH Dodds, R Lasenby, D Drain, C Chen, R Grosse, S McCandlish, J Kaplan, D Amodei, M Wattenberg*, C Olah
transformer-circuits , 2022  
Paper

Neural networks often pack many unrelated concepts into a single neuron, a puzzling phenomenon known as 'polysemanticity' that makes interpretability much more challenging. In this work, we build toy models where the origins of polysemanticity can be fully understood.
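
As a rough sketch of the kind of toy model involved (not the paper's exact code; the dimensions and sparsity are placeholders, and the feature-importance weighting is omitted), sparse features are squeezed through a small linear bottleneck and then reconstructed:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Toy model: n sparse features squeezed into an m-dimensional bottleneck (m < n)."""
    def __init__(self, n_features=20, d_hidden=5):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = self.W @ x.T                               # compress: (d_hidden, batch)
        return torch.relu((self.W.T @ h).T + self.b)   # reconstruct each feature

# Sparse synthetic features: most entries are zero, so the model is pushed to
# store more features than it has dimensions ("superposition").
x = torch.rand(256, 20) * (torch.rand(256, 20) < 0.05).float()
model = ToyModel()
loss = ((model(x) - x) ** 2).mean()
```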

Language Models (Mostly) Know What They Know
S Kadavath*, T Conerly, A Askell, T Henighan, D Drain, E Perez, N Schiefer, ZH Dodds, N DasSarma, E Tran-Johnson, S Johnston, S El-Showk, A Jones, N Elhage, T Hume, A Chen, Y Bai, S Bowman, S Fort, D Ganguli, D Hernandez, J Jacobson, J Kernion, S Kravec, L Lovitt, K Ndousse, C Olsson, S Ringer, D Amodei, T Brown, J Clark, N Joseph, B Mann, S McCandlish, C Olah, J Kaplan*
Arxiv , 2022  
Paper

We show that language models can evaluate whether what they say is true, and predict ahead of time whether they'll be able to answer questions correctly.

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Y Bai*, A Jones*, K Ndousse*, A Askell, A Chen, N DasSarma, D Drain, S Fort, D Ganguli, T Henighan, N Joseph, S Kadavath, J Kernion, T Conerly, S El-Showk, N Elhage, Z Hatfield-Dodds, D Hernandez, T Hume, S Johnston, S Kravec, L Lovitt, N Nanda, C Olsson, D Amodei, T Brown, J Clark, S McCandlish, C Olah, B Mann, J Kaplan
Arxiv , 2022  
Paper

Anthropic's second AI alignment paper. We trained a natural-language assistant to be more helpful and harmless using reinforcement learning from human feedback (RLHF).

In-context Learning and Induction Heads
C Olsson*, N Elhage*, N Nanda*, N Joseph†, N DasSarma†, T Henighan†, B Mann†, A Askell, Y Bai, A Chen, T Conerly, D Drain, D Ganguli, Z Hatfield-Dodds, D Hernandez, S Johnston, A Jones, J Kernion, L Lovitt, K Ndousse, D Amodei, T Brown, J Clark, J Kaplan, S McCandlish, C Olah
transformer-circuits , 2022  
Paper

Anthropic's second interpretability paper explores the hypothesis that induction heads (discovered in our first interpretability paper) are the mechanism driving in-context learning.

Predictability and Surprise in Large Generative Models
D Ganguli*, D Hernandez*, L Lovitt*, N DasSarma†, T Henighan†, A Jones†, N Joseph†, J Kernion†, B Mann†, A Askell, Y Bai, A Chen, T Conerly, D Drain, N Elhage, S El Showk, S Fort, Z Hatfield-Dodds, S Johnston, S Kravec, N Nanda, K Ndousse, C Olsson, D Amodei, D Amodei, T Brown, J Kaplan, Sam McCandlish, Chris Olah, Jack Clark
Arxiv , 2022  
Paper

Anthropic's first societal impacts paper explores the technical traits of large generative models and the motivations and challenges people face in building and deploying them.

A Mathematical Framework for Transformer Circuits
N Elhage*†, N Nanda*, C Olsson*, T Henighan†, N Joseph†, B Mann†, A Askell, Y Bai, A Chen, T Conerly, N DasSarma, D Drain, D Ganguli, Z Hatfield-Dodds, D Hernandez, A Jones, J Kernion, L Lovitt, K Ndousse, D Amodei, T Brown, J Clark, J Kaplan, S McCandlish, C Olah
transformer-circuits , 2021  
Paper

Anthropic's first interpretability paper. We try to mechanistically understand some small, simplified transformers in detail, as a first step toward understanding large transformer language models.

A General Language Assistant as a Laboratory for Alignment
A Askell*, Y Bai*, A Chen*, D Drain*, D Ganguli*, T Henighan†, A Jones†, N Joseph†, B Mann*, N DasSarma, N Elhage, Z Hatfield-Dodds, D Hernandez, J Kernion, K Ndousse, C Olsson, D Amodei, T Brown, J Clark, S McCandlish, Chris Olah, Jared Kaplan
Arxiv , 2021  
Paper

Anthropic's first AI alignment paper, focused on simple baselines and investigations. We compare scaling trends for alignment from prompting, imitation learning, and preference modeling, and find ways to simplify these techniques and improve their sample efficiency.

OpenAI

My research at OpenAI focused on scaling laws. I also contributed to the GPT-3 project.

Scaling Laws for Transfer
D Hernandez, J Kaplan, T Henighan, S McCandlish
Arxiv , 2021  
Paper

We studied the empirical scaling of pre-training on one dataset and then fine-tuning on another, finding that the "effective data transferred" from pre-training to fine-tuning follows predictable trends.
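
In shorthand (my notation here, not necessarily the paper's), the fitted trend is roughly of the form

$$ D_T \;\approx\; k \, (D_F)^{\alpha} \, N^{\beta}, $$

where $D_T$ is the effective data transferred, $D_F$ is the size of the fine-tuning dataset, $N$ is the model size, and $k$, $\alpha$, $\beta$ are fitted constants.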

Scaling Laws for Autoregressive Generative Modeling
T Henighan*, J Kaplan*, M Katz*, M Chen, C Hesse, J Jackson, H Jun, T Brown, P Dhariwal, S Gray, C Hallacy, B Mann, A Radford, A Ramesh, N Ryder, D Ziegler, J Schulman, D Amodei, S McCandlish
Arxiv , 2020  
Paper / Podcast Interview

We studied empirical scaling laws in four domains: image modeling, video modeling, multimodal image+text modeling, and mathematical problem solving. In all cases, autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law.
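
Concretely, the "power-law plus constant" form can be written (in schematic notation) as

$$ L(x) \;=\; L_\infty + \left(\frac{x_0}{x}\right)^{\alpha_x}, $$

where $x$ is, for example, model size or compute, $L_\infty$ is the irreducible loss, and $x_0$ and $\alpha_x$ are fitted constants.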

Language Models are Few-Shot Learners
T Brown*, B Mann*, N Ryder*, M Subbiah*, J Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sastry, A Askell, S Agarwal, A Herbert-Voss, G Krueger, T Henighan, R Child, A Ramesh, D Ziegler, J Wu, C Winter, C Hesse, M Chen, E Sigler, M Litwin, S Gray, B Chess, J Clark, C Berner, S McCandlish, A Radford, I Sutskever, D Amodei
Arxiv , 2020  
Paper / Wikipedia Article

This is the paper describing GPT-3, a 175-billion-parameter language model whose performance was competitive with the state of the art on a wide variety of benchmarks.

Scaling Laws for Neural Language Models
J Kaplan*, S McCandlish*, T Henighan, T Brown, B Chess, R Child, S Gray, A Radford, J Wu, D Amodei
Arxiv , 2020  
Paper

We found that language modeling loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude.
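
Schematically, each of these trends is a simple power law (the notation here is shorthand; the constants and exponents are fit empirically):

$$ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}, $$

with $N$ the number of parameters, $D$ the dataset size in tokens, and $C$ the training compute.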

Projects

Some of these were projects for classes I took while at Stanford, while others were just for fun.

Deep Reinforcement Learning with OpenAI Gym
Tom Henighan
Weekend Project , 2019  
Github

I read through OpenAI's Spinning Up materials and was inspired to implement some of the algorithms myself. The gif shows the 'HalfCheetah-v2' environment, in which I trained a proximal policy optimization (PPO) agent to make the cheetah run.
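
The core of PPO is its clipped surrogate objective; a minimal NumPy sketch of just that piece (not the full agent in the repo) looks like:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO (a quantity to maximize)."""
    ratio = np.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    # Pessimistic minimum of the clipped and unclipped objectives.
    return np.minimum(ratio * advantages, clipped * advantages).mean()

# Dummy batch of 4 transitions.
logp_old = np.log(np.array([0.2, 0.5, 0.1, 0.7]))
logp_new = np.log(np.array([0.25, 0.45, 0.2, 0.6]))
adv = np.array([1.0, -0.5, 2.0, 0.3])
print(ppo_clip_loss(logp_new, logp_old, adv))
```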

Blackjack Reinforcement Learning
Tom Henighan
Weekend Project , 2019  
Github

Built a little Python package from scratch for using reinforcement learning to find the optimal blackjack strategy. The gif to the left shows how the randomly-initialized strategy evolves toward the optimal one as the agent trains over more episodes.
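
The general idea, separate from the actual package in the repo (the environment name and hyperparameters here are placeholders), is tabular Q-learning over blackjack states:

```python
import random
from collections import defaultdict
import gym  # classic gym API circa 2019; newer gym/gymnasium return extra values from reset/step

env = gym.make("Blackjack-v0")  # state: (player sum, dealer card, usable ace); actions: 0=stick, 1=hit
Q = defaultdict(float)
alpha, gamma, eps = 0.05, 1.0, 0.1

for _ in range(200_000):
    state, done = env.reset(), False
    while not done:
        # epsilon-greedy action selection
        if random.random() < eps:
            action = env.action_space.sample()
        else:
            action = max((0, 1), key=lambda a: Q[(state, a)])
        next_state, reward, done, _info = env.step(action)
        # one-step Q-learning update toward reward plus best future value
        target = reward + gamma * max(Q[(next_state, 0)], Q[(next_state, 1)]) * (not done)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state

# Greedy policy per state: 1 = hit, 0 = stick.
states = {s for (s, _a) in Q}
policy = {s: int(Q[(s, 1)] > Q[(s, 0)]) for s in states}
```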

MNIST Tensorflow.js webapp
Tom Henighan
Weekend Project , 2018  
App

Trained a convolutional neural network to recognize digits from the MNIST dataset. Deployed this network using tensorflow.js, so the network actually runs in your browser (and saves the server costs of hosting it :)). Built a little webapp so you can write a digit in the box and get the network's prediction. Tuned the network's hyperparameters using the Bayesian optimization implementation from skopt.
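
The tuning step looked roughly like the sketch below; the objective here is a stand-in (the real one trained the CNN and returned its validation error), and the search space is illustrative.

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

def objective(params):
    learning_rate, n_filters = params
    # Stand-in for the real objective, which trained the CNN on MNIST and
    # returned its validation error for these hyperparameters.
    return (learning_rate - 0.01) ** 2 + abs(n_filters - 32) / 1000.0

search_space = [
    Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
    Integer(8, 64, name="n_filters"),
]

# Bayesian optimization: fit a Gaussian process to past trials and pick
# promising hyperparameters to try next.
result = gp_minimize(objective, search_space, n_calls=25, random_state=0)
print("best hyperparameters:", result.x, "best objective value:", result.fun)
```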

Spatial Control of Style Blending in Neural Style Transfer
Tom Henighan
CS231n: Convolutional Neural Networks for Visual Recognition , 2017  
PDF / Poster / Examples

Implemented an algorithm for neural style transfer, which takes in one or more "style" images (usually paintings) and a "content" image (usually a photograph) and renders the content image in the style of the style images. Inspired by the work of Gatys et al., my implementation allows for spatial control when blending multiple styles, enabling smooth transitions from one style to another. See some examples here.
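
The spatial-control piece boils down to computing style (Gram-matrix) losses under per-region masks. Below is a rough PyTorch sketch of that idea, not the original implementation; the feature shapes and masks are placeholders.

```python
import torch

def gram_matrix(features):
    """Gram matrix of a (channels, height, width) feature map."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def masked_style_loss(generated_feats, style_feats_list, masks):
    """Match each masked region of the generated features to a different style."""
    loss = 0.0
    for style_feats, mask in zip(style_feats_list, masks):
        # mask: (height, width) weights in [0, 1] selecting where this style applies
        loss = loss + ((gram_matrix(generated_feats * mask) -
                        gram_matrix(style_feats * mask)) ** 2).sum()
    return loss

# Dummy feature maps standing in for CNN activations (e.g. from VGG).
generated = torch.randn(64, 32, 32)
styles = [torch.randn(64, 32, 32), torch.randn(64, 32, 32)]
left = torch.zeros(32, 32); left[:, :16] = 1.0  # style 1 applies to the left half
right = 1.0 - left                               # style 2 applies to the right half
print(masked_style_loss(generated, styles, [left, right]))
```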

Iterative Attention Networks for Question Answering
Tom Henighan
CS224n: Natural Language Processing with Deep Learning , 2017  
PDF / Poster / Example Predictions

Designed a deep learning model that takes in a paragraph from Wikipedia and answers a question based on that paragraph, trained on the SQuAD dataset. My poster was recognized as outstanding by the course staff. Check out some example answers produced by the model here.

Predicting Bill Votes in the House of Representatives
Tom Henighan, Scott Kravitz
CS229: Machine Learning, 2015
Interactive Visualization / PDF / Poster

We created a model for predicting how a member of Congress would vote based not on their voting history, but on their party and campaign contributions. Check out the interactive visualization, which shows funding by district and economic sector.

PhD Research

I completed my PhD in the Physics department at Stanford. Under my advisor, David Reis, I studied atomic motion in solids using the Linac Coherent Light Source.

Direct Measurement of Anharmonic Decay Channels of a Coherent Phonon
S W Teitelbaum, T Henighan, Y Huang, H Liu, M P Jiang, D Zhu, M Chollet, T Sato, E D Murray, S Fahy, S O'Mahony, T P Bailey, C Uher, M Trigo, and D A Reis
Physical Review Letters , 2018  
Phys Rev Lett

We made time- and wavevector-resolved measurements of phonon decay with X-ray diffraction. More specifically, we measured how an optically excited coherent zone-center phonon parametrically drives mean-square displacements in lower-frequency phonons across the Brillouin zone.

Dissertation: Couplings of Phonons to Light and One Another Studied with LCLS
Tom Henighan, advisor: David Reis
Stanford University Department of Physics , 2016  
Defense / Thesis

The Linac Coherent Light Source (LCLS) is the first X-ray source of its kind, providing a combination of atomic-scale wavelengths, temporally short pulses, and high flux. This allows for previously impossible time-domain measurements of phonons. My collaborators and I demonstrated techniques that allow for measurement not only of phonon dispersions and lifetimes, but also of momentum-resolved phonon-phonon coupling.

Control of two-phonon correlations and the mechanism of high-wavevector phonon generation by ultrafast light pulses
T Henighan, M Trigo , M Chollet, J N Clark, S Fahy, J M Glownia, M P Jiang, M Kozina, H Liu, S Song, D Zhu, and D A Reis
Physical Review B Rapid Communications , 2016  
Phys Rev B / arXiv

We showed that in Fourier-Transform Inelastic X-ray Scattering (FTIXS) measurements on high-quality crystals, the pump laser couples to high-wavevector phonons primarily through second-order processes.

Generation mechanism of terahertz coherent acoustic phonons in Fe
T Henighan, M Trigo, Stefano Bonetti, P Granitzka, D Higley, Z Chen, M P Jiang, R Kukreja, A Gray, A H Reid, E Jal, M C Hoffmann, M Kozina, S Song, M Chollet, D Zhu, P F Xu, J Jeong, K Carva, P Maldonado, P M Oppeneer, M G Samant, S P Parkin, D A Reis, and H A Durr
Physical Review B Rapid Communications , 2016  
Phys Rev B / arXiv

We were able to make time-resolved measurements of acoustic phonons with frequencies up to 3.5 THz in iron using LCLS.

Fulbright Research

I spent a year at Delft University of Technology (TU Delft) as a Fulbright Scholar doing biophysics research in the lab of Cees Dekker.

Magnetic Forces and DNA Mechanics in Multiplexed Magnetic Tweezers
I De Vlaminck*, T Henighan*, M T J van Loenhout, D Burnham, C Dekker (*authors contributed equally)
PLOS ONE , 2012  
PLOS ONE

We demonstrated ways of parallelizing single-molecule measurements with magnetic tweezers, allowing for simultaneous measurement of hundreds of molecules instead of just a few.

Highly Parallel Magnetic Tweezers by Targeted DNA Tethering
I De Vlaminck, T Henighan, M T J van Loenhout, I Pfeiffer, J Huijts, J W J Kerssemakers, A J Katan, A van Langen-Suurling, E van der Drift, C Wyman, C Dekker
Nano Letters , 2011  
Nano Lett

Patterning the tether sites of the DNA strands allowed for a further improvement in parallelization capacity.

Undergraduate Research

I did my bachelor's at The Ohio State University, where I was advised by Prof. Sooryakumar. I majored in engineering physics with a focus in electrical engineering.

Undergraduate Thesis: Patterned magnetic traps for magnetophoretic assembly and actuation of microrotor pumps
T Henighan, D Giglio, A Chen, G Vieira, and R Sooryakumar
Applied Physics Letters , 2011  
App Phys Lett / Undergraduate Thesis

We demonstrated a magnetically controlled microfluidic pump. The pump consisted of a magnetic microsphere trapped by the magnetic field gradient produced by a patterned paramagnetic film on the floor of the microchannel. Time-varying magnetic fields positioned and spun the microsphere, activating the pump.

Manipulation of Magnetically Labeled and Unlabeled Cells with Mobile Magnetic Traps
T Henighan, A Chen, G Vieira, A J Hauser, F Y Yang, J J Chalmers, and R Sooryakumar
Biophysical Journal , 2011  
Biophys / Dancing Microspheres / Patent

Using patterned paramagnetic disks of micron scale diameter and 10s of nm thickness and externally applied weak (~10's of Oe) magnetic fields, we could control the position of magnetic microspheres on a lab-on-chip device.