Some selected publications and projects can be found below.
I work on the interpretability team, which is focused on mechanistically understanding large language models. As one of Anthropic's first employees, I also helped build and manage some of our core infrastructure for training and evaluating large language models.
Adly Templeton*, Tom Conerly*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan
transformer-circuits, 2024
Blogpost / Paper
We scaled Dictionary Learning to extract millions of features from Claude 3 Sonnet, Anthropic's medium-scale production model at the time of writing. The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references. There appears to be a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them. Features can be used to steer large models (see e.g. Influence on Behavior). We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content.
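For intuition, here is a minimal sketch of the steering idea (not the paper's actual code; the function and tensor names are hypothetical): once a feature has been learned, its decoder direction can be added to the model's residual-stream activations to amplify the corresponding behavior.

```python
# Illustrative sketch only: steer a model by adding a learned feature's
# decoder direction to the residual stream. `decoder_direction` would come
# from a trained sparse autoencoder; names and scales here are hypothetical.
import torch

def steer_activations(resid: torch.Tensor,
                      decoder_direction: torch.Tensor,
                      strength: float = 5.0) -> torch.Tensor:
    """resid: (batch, seq, d_model) residual-stream activations.
    decoder_direction: (d_model,) direction for one feature.
    Returns activations with the feature direction boosted."""
    direction = decoder_direction / decoder_direction.norm()
    return resid + strength * direction  # broadcasts over batch and sequence

# In practice this would be applied with a forward hook at a chosen layer,
# so that all subsequent layers see the modified activations.
```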
Trenton Bricken*, Adly Templeton*, Joshua Batson*, Brian Chen*, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, Chris Olah
transformer-circuits, 2023
Blogpost / Paper / Feature Visualization
In a transformer language model, we decompose a layer with 512 neurons into more than 4000 features which separately represent things like DNA sequences, legal language, HTTP requests, Hebrew text, nutrition statements, and much, much more. Most of these model properties are invisible when looking at the activations of individual neurons in isolation.
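The core of the dictionary-learning approach is a sparse autoencoder trained to reconstruct MLP activations from a sparse, overcomplete set of features. A minimal sketch (not the paper's implementation; dimensions, hyperparameters, and training details are illustrative):

```python
# Minimal sparse-autoencoder sketch: decompose 512-dimensional activations
# into a larger number of sparsely-active features. Illustrative only.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_act: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_act, n_features)
        self.decoder = nn.Linear(n_features, d_act)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # weight on the sparsity penalty

def training_step(activations: torch.Tensor) -> float:
    recon, feats = sae(activations)
    loss = ((recon - activations) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example usage with random stand-in data in place of real MLP activations:
loss = training_step(torch.randn(64, 512))
```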
Tom Henighan*, Shan Carter*, Tristan Hume*, Nelson Elhage*, Robert Lasenby, Stanislav Fort, Nicholas Schiefer, Christopher Olah
transformer-circuits, 2023
Paper
We extend our previous toy-model work to the finite-data regime, revealing how and when these models memorize training examples.
N Elhage*, T Hume*, C Olsson*, N Schiefer*, T Henighan, S Kravec, ZH Dodds, R Lasenby, D Drain, C Chen, R Grosse, S McCandlish, J Kaplan, D Amodei, M Wattenberg*, C Olah
transformer-circuits, 2022
Paper
Neural networks often pack many unrelated concepts into a single neuron, a puzzling phenomenon known as 'polysemanticity' that makes interpretability much more challenging. In this work, we build toy models where the origins of polysemanticity can be fully understood.
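One simple instance of such a toy model (a sketch in the spirit of the paper, with illustrative dimensions rather than the exact experimental setup) squeezes many sparse features into a smaller hidden space and reconstructs them with a ReLU:

```python
# Toy-model sketch: n_features sparse features compressed into d_hidden
# dimensions and reconstructed as ReLU(W^T W x + b). Dimensions are illustrative.
import torch
import torch.nn as nn

n_features, d_hidden = 20, 5

W = nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
b = nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

def sample_batch(batch_size=1024, sparsity=0.9):
    """Features are uniform in [0, 1] but are usually zero (sparse)."""
    x = torch.rand(batch_size, n_features)
    mask = (torch.rand(batch_size, n_features) > sparsity).float()
    return x * mask

for step in range(5000):
    x = sample_batch()
    hidden = x @ W.T                     # compress: (batch, d_hidden)
    x_hat = torch.relu(hidden @ W + b)   # reconstruct: (batch, n_features)
    loss = ((x_hat - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# With sparse enough features, the learned columns of W end up sharing
# directions: multiple features are "superposed" in the same hidden dimensions.
```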
S Kadavath*, T Conerly, A Askell, T Henighan, D Drain, E Perez, N Schiefer, ZH Dodds, N DasSarma, E Tran-Johnson, S Johnston, S El-Showk, A Jones, N Elhage, T Hume, A Chen, Y Bai, S Bowman, S Fort, D Ganguli, D Hernandez, J Jacobson, J Kernion, S Kravec, L Lovitt, K Ndousse, C Olsson, S Ringer, D Amodei, T Brown, J Clark, N Joseph, B Mann, S McCandlish, C Olah, J Kaplan*
arXiv, 2022
Paper
We show that language models can evaluate whether what they say is true, and predict ahead of time whether they'll be able to answer questions correctly.
Y Bai*, A Jones*, K Ndousse*, A Askell, A Chen, N DasSarma, D Drain, S Fort, D Ganguli, T Henighan, N Joseph, S Kadavath, J Kernion, T Conerly, S El-Showk, N Elhage, Z Hatfield-Dodds, D Hernandez, T Hume, S Johnston, S Kravec, L Lovitt, N Nanda, C Olsson, D Amodei, T Brown, J Clark, S McCandlish, C Olah, B Mann, J Kaplan
arXiv, 2022
Paper
Anthropic's second AI Alignment paper. We've trained a natural language assistant to be more helpful and harmless by using reinforcement learning with human feedback (RLHF).
C Olsson*, N Elhage*, N Nanda*, N Joseph†, N DasSarma†, T Henighan†, B Mann†, A Askell, Y Bai, A Chen, T Conerly, D Drain, D Ganguli, Z Hatfield-Dodds, D Hernandez, S Johnston, A Jones, J Kernion, L Lovitt, K Ndousse, D Amodei, T Brown, J Clark, J Kaplan, S McCandlish, C Olah
transformer-circuits, 2022
Paper
Anthropic's second interpretability paper explores the hypothesis that induction heads (discovered in our first interpretability paper) are the mechanism driving in-context learning.
D Ganguli*, D Hernandez*, L Lovitt*, N DasSarma†, T Henighan†, A Jones†, N Joseph†, J Kernion†, B Mann†, A Askell, Y Bai, A Chen, T Conerly, D Drain, N Elhage, S El Showk, S Fort, Z Hatfield-Dodds, S Johnston, S Kravec, N Nanda, K Ndousse, C Olsson, D Amodei, D Amodei, T Brown, J Kaplan, S McCandlish, C Olah, J Clark
arXiv, 2022
Paper
Anthropic's first societal impacts paper explores the technical traits of large generative models and the motivations and challenges people face in building and deploying them.
N Elhage*†, N Nanda*, C Olsson*, T Henighan†, N Joseph†, B Mann†, A Askell, Y Bai, A Chen, T Conerly, N DasSarma, D Drain, D Ganguli, Z Hatfield-Dodds, D Hernandez, A Jones, J Kernion, L Lovitt, K Ndousse, D Amodei, T Brown, J Clark, J Kaplan, S McCandlish, C Olah
transformer-circuits, 2021
Paper
Anthropic's first interpretability paper. We try to mechanistically understand some small, simplified transformers in detail, as a first step toward understanding large transformer language models.
A Askell*, Y Bai*, A Chen*, D Drain*, D Ganguli*, T Henighan†, A Jones†, N Joseph†, B Mann*, N DasSarma, N Elhage, Z Hatfield-Dodds, D Hernandez, J Kernion, K Ndousse, C Olsson, D Amodei, T Brown, J Clark, S McCandlish, C Olah, J Kaplan
arXiv, 2021
Paper
Anthropic's first AI alignment paper, focused on simple baselines and investigations. We compare scaling trends for alignment from prompting, imitation learning, and preference modeling, and find ways to simplify these techniques and improve their sample efficiency.
My research at OpenAI focused on scaling laws. I also contributed to the GPT-3 project.
D Hernandez, J Kaplan, T Henighan, S McCandlish
arXiv, 2021
Paper
We studied the empirical scaling of pre-training on one dataset and then fine-tuning on another. We find that the "effective data transferred" from pre-training to fine-tuning follows predictable trends.
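If I'm summarizing the functional form correctly (see the paper for the fitted constants), the effective data transferred grows as a power law in both the fine-tuning dataset size and the model size:

$$D_T \approx k \, (D_F)^{\alpha} \, N^{\beta}$$

where $D_F$ is the size of the fine-tuning dataset, $N$ the model size, and $k$, $\alpha$, $\beta$ fitted constants; so for a fixed fine-tuning dataset, larger models effectively carry over more data from pre-training.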
T Henighan*, J Kaplan*, M Katz*, M Chen, C Hesse, J Jackson, H Jun, T Brown, P Dhariwal, S Gray, C Hallacy, B Mann, A Radford, A Ramesh, N Ryder, D Ziegler, J Schulman, D Amodei, S McCandlish
arXiv, 2020
Paper / Podcast Interview
We studied empirical scaling laws in four domains: image modeling, video modeling, multimodal image+text modeling, and mathematical problem solving. In all cases, autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law.
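As a concrete illustration of the "power-law plus constant" form, one can fit L(x) = L_inf + (x0 / x)^alpha to observed (compute, loss) pairs. The data points below are made up for illustration; they are not the paper's measurements.

```python
# Sketch: fit a power-law-plus-constant scaling curve L(x) = L_inf + (x0/x)**alpha
# to (compute, loss) observations. The data here are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, l_inf, x0, alpha):
    return l_inf + (x0 / x) ** alpha

compute = np.array([1e15, 1e16, 1e17, 1e18, 1e19])   # hypothetical training FLOPs
loss = np.array([2.98, 2.84, 2.73, 2.65, 2.59])      # hypothetical test losses

params, _ = curve_fit(scaling_law, compute, loss, p0=[2.0, 1e13, 0.1], maxfev=10000)
l_inf, x0, alpha = params
print(f"irreducible loss ~ {l_inf:.2f}, exponent ~ {alpha:.3f}")
```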
T Brown*, B Mann*, N Ryder*, M Subbiah*, J Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sastry, A Askell, S Agarwal, A Herbert-Voss, G Krueger, T Henighan, R Child, A Ramesh, D Ziegler, J Wu, C Winter, C Hesse, M Chen, E Sigler, M Litwin, S Gray, B Chess, J Clark, C Berner, S McCandlish, A Radford, I Sutskever, D Amodei
arXiv, 2020
Paper / Wikipedia Article
This is the paper describing GPT-3, a 175-billion-parameter language model that achieved performance competitive with the state of the art on a wide variety of benchmarks.
J Kaplan*, S McCandlish*, T Henighan, T Brown, B Chess, R Child, S Gray, A Radford, J Wu, D Amodei
arXiv, 2020
Paper
We found that language modeling loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude.
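Schematically, each trend takes a pure power-law form when the other resources are not the bottleneck (see the paper for the fitted constants):

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $N$ is the number of model parameters, $D$ the dataset size in tokens, and $C$ the training compute.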
Some of these were projects for classes I took while at Stanford, while others were just for fun.
Tom Henighan
Weekend Project, 2019
GitHub
Read through OpenAI's Spinning Up materials and was inspired to implement some of the algorithms myself. The gif shows the 'HalfCheetah-v2' environment; I trained a proximal policy optimization (PPO) agent to make it run.
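This is not the from-scratch implementation described above, but a quick way to train a comparable agent is the off-the-shelf PPO from Stable-Baselines3:

```python
# Off-the-shelf alternative to a from-scratch PPO, using Stable-Baselines3.
# Note: "HalfCheetah-v2" requires the older mujoco-py bindings; on newer
# Gym/Gymnasium installs, use "HalfCheetah-v4" instead.
import gym
from stable_baselines3 import PPO

env = gym.make("HalfCheetah-v2")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("ppo_halfcheetah")
```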
Tom Henighan
Weekend Project, 2019
GitHub
Built from scratch a little Python package that uses reinforcement learning to find the optimal strategy for blackjack. The gif to the left shows how the randomly-initialized strategy evolves toward the optimal one as the agent trains over more episodes.
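For flavor (this is not the package itself), here's a minimal tabular Q-learning loop on the Blackjack environment from Gymnasium that learns a similar hit/stick strategy:

```python
# Minimal sketch: tabular Q-learning on Gymnasium's Blackjack environment.
import random
from collections import defaultdict

import gymnasium as gym

env = gym.make("Blackjack-v1")
q = defaultdict(lambda: [0.0, 0.0])  # state -> [value of stick, value of hit]
alpha, gamma, epsilon = 0.05, 1.0, 0.1

for episode in range(200_000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = max((0, 1), key=lambda a: q[state][a])
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # One-step Q-learning update.
        target = reward + (0.0 if done else gamma * max(q[next_state]))
        q[state][action] += alpha * (target - q[state][action])
        state = next_state

# The greedy policy over q approximates the optimal blackjack strategy.
policy = {s: ("hit" if q[s][1] > q[s][0] else "stick") for s in q}
```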
Tom Henighan
Weekend Project, 2018
App
Trained a convolutional neural network to recognize digits from the MNIST dataset. Deployed the network with TensorFlow.js, so it actually runs in your browser, saving the server costs of hosting it :). Built a little web app so you can write a digit in the box and get the network's prediction. Tuned the network's hyperparameters using the Bayesian optimization implementation from skopt.
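A minimal sketch of the kind of small MNIST CNN described above (not the project's exact architecture or hyperparameters):

```python
# Small MNIST CNN sketch in Keras; architecture and hyperparameters illustrative.
import tensorflow as tf
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0  # (60000, 28, 28, 1), scaled to [0, 1]
x_test = x_test[..., None] / 255.0

model = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))

# A Keras model like this can then be converted for in-browser inference,
# e.g. with `tensorflowjs_converter --input_format keras ...`
```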
Tom Henighan
CS231n: Convolutional Neural Networks for Visual Recognition, 2017
PDF / Poster / Examples
Implemented an algorithm for neural style transfer, which takes in one or more "style" images (usually paintings) and a "content" image (usually a photograph) and renders the content image in the style of the style images. Inspired by the work of Gatys et al., my implementation allows for spatial control when blending multiple styles, enabling smooth transitions from one style to another. See some examples here.
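The heart of Gatys-style transfer is a Gram-matrix style loss; here's a minimal sketch (illustrative only; the project's implementation additionally supports spatially-masked blending of multiple styles):

```python
# Gram-matrix style loss sketch for neural style transfer.
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Channel-by-channel feature correlations for one conv layer.

    features: (batch, channels, height, width) activations from a CNN layer.
    """
    b, c, h, w = features.shape
    flat = features.view(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)

def style_loss(generated_feats, style_feats):
    """Sum of squared differences between Gram matrices across layers."""
    loss = 0.0
    for gen, sty in zip(generated_feats, style_feats):
        loss = loss + torch.mean((gram_matrix(gen) - gram_matrix(sty)) ** 2)
    return loss
```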
Tom Henighan
CS224n: Natural Language Processing with Deep Learning, 2017
PDF / Poster / Example Predictions
Designed a deep learning model that takes in a paragraph from Wikipedia and answers a question based on that paragraph, trained on the SQuAD dataset. My poster was recognized as outstanding by the course staff. Check out some example answers produced by the model here.
Tom Henighan, Scott Kravitz
CS229: Machine Learning, 2015
Interactive Visualization / PDF / Poster
We created a model for predicting how a member of Congress would vote, based not on their voting history but on their party and campaign contributions. Check out the interactive visualization, which shows funding by district and economic sector.
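An illustrative sketch of this kind of setup (hypothetical features and data, not the project's actual pipeline or feature set):

```python
# Illustrative only: predict a vote from party plus per-sector contribution totals.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical features: [is_republican, energy_$, labor_$, finance_$, agriculture_$]
X = np.array([
    [1, 120_000,  5_000,  80_000, 30_000],
    [0,  10_000, 90_000,  40_000,  5_000],
    [1,  95_000,  8_000, 120_000, 45_000],
    [0,  20_000, 70_000,  60_000, 10_000],
])
y = np.array([1, 0, 1, 0])  # 1 = voted "yea" on a particular bill

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
print(model.predict_proba([[1, 50_000, 20_000, 70_000, 15_000]]))
```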
I completed my PhD in the Physics department at Stanford. Under my advisor, David Reis, I studied atomic motion in solids using the Linac Coherent Light Source.
S W Teitelbaum, T Henighan, Y Huang, H Liu, M P Jiang, D Zhu, M Chollet, T Sato, E D Murray, S Fahy, S O'Mahony, T P Bailey, C Uher, M Trigo, and D A Reis
Physical Review Letters, 2018
Phys Rev Lett
We made time- and wavevector-resolved measurements of phonon decay with X-ray diffraction. More specifically, we observed an optically excited coherent zone-center phonon parametrically driving mean-square displacements in lower-frequency phonons across the Brillouin zone.
Tom Henighan, advisor: David Reis
Stanford University Department of Physics, 2016
Defense / Thesis
The Linac Coherent Light Source (LCLS) is the first X-ray source of its kind, providing a combination of atomic-scale wavelengths, temporally short pulses, and high flux. This enables previously impossible time-domain measurements of phonons. My collaborators and I demonstrated techniques that allow not only for measurement of phonon dispersions and lifetimes, but also of momentum-resolved phonon-phonon coupling.
T Henighan, M Trigo, M Chollet, J N Clark, S Fahy, J M Glownia, M P Jiang, M Kozina, H Liu, S Song, D Zhu, and D A Reis
Physical Review B Rapid Communications, 2016
Phys Rev B / arXiv
We showed that in Fourier-Transform Inelastic X-ray Scattering (FTIXS) measurements on high-quality crystals, the pump laser couples to high-wavevector phonons primarily through second-order processes.
T Henighan, M Trigo, S Bonetti, P Granitzka, D Higley, Z Chen, M P Jiang, R Kukreja, A Gray, A H Reid, E Jal, M C Hoffmann, M Kozina, S Song, M Chollet, D Zhu, P F Xu, J Jeong, K Carva, P Maldonado, P M Oppeneer, M G Samant, S P Parkin, D A Reis, and H A Durr
Physical Review B Rapid Communications, 2016
Phys Rev B / arXiv
We were able to make time-resolved measurements of acoustic phonons with frequencies up to 3.5 THz in iron using LCLS.
I spent a year at Delft University of Technology (TU Delft) as a Fulbright Scholar doing biophysics research in the lab of Cees Dekker.
I De Vlaminck*, T Henighan*, M T J van Loenhout, D Burnham, C Dekker (*authors contributed equally)
PLOS ONE, 2012
PLOS ONE
We demonstrated ways of parallelizing single-molecule measurements with magnetic tweezers, allowing for simultaneous measurement of hundreds of molecules instead of just a few.
I De Vlaminck, T Henighan, M T J van Loenhout, I Pfeiffer, J Huijts, J W J Kerssemakers, A J Katan, A van Langen-Suurling, E van der Drift, C Wyman, C Dekker
Nano Letters, 2011
Nano Lett
Patterning the tether sites of the DNA strands allowed for further improvement in parallelization capacity.
I did my bachelor's at The Ohio State University, where I was advised by Prof. Sooryakumar. I majored in engineering physics with a focus on electrical engineering.
T Henighan, D Giglio, A Chen, G Vieira, and R Sooryakumar
Applied Physics Letters, 2011
App Phys Lett / Undergraduate Thesis
We demonstrated a magnetically controlled microfluidic pump. The pump consisted of a magnetic microsphere trapped by the magnetic field gradient produced by a patterned paramagnetic film on the floor of the microchannel. Time-varying magnetic fields positioned and spun the microsphere, activating the pump.
T Henighan, A Chen, G Vieira, A J Hauser, F Y Yang, J J Chalmers, and R Sooryakumar
Biophysical Journal, 2011
Biophys / Dancing Microspheres / Patent
Using patterned paramagnetic disks of micron-scale diameter and tens-of-nanometers thickness, together with weak (tens of Oe) externally applied magnetic fields, we could control the positions of magnetic microspheres on a lab-on-chip device.