User profiles for Stephen Casper

Stephen Casper

PhD student, MIT
Verified email at mit.edu
Cited by 552

Clusterability in neural networks

D Filan, S Casper, S Hod, C Wild, A Critch… - arXiv preprint arXiv …, 2021 - arxiv.org
The learned weights of a neural network have often been considered devoid of scrutable
internal structure. In this paper, however, we look for structure in the form of clusterability: how …

Concussion: A History of Science and Medicine, 1870‐2005

ST Casper - Headache: The Journal of Head and Face Pain, 2018 - Wiley Online Library
Objective: To review the intellectual history of concussion from the mid‐19th century to the
opening decade of the 21st century. Background: Head injuries (HI) and their acute and long‐…

Open problems and fundamental limitations of reinforcement learning from human feedback

S Casper, X Davies, C Shi, TK Gilbert… - arXiv preprint arXiv …, 2023 - arxiv.org
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state-…

The punch-drunk boxer and the battered wife: Gender and brain injury research

ST Casper, K O'Donnell - Social Science & Medicine, 2020 - Elsevier
This essay uses gender as a category of historical and sociological analysis to situate two
populations—boxers and victims of domestic violence—in context and explain the temporal …

Toward transparent AI: A survey on interpreting the inner structures of deep neural networks

T Räuker, A Ho, S Casper… - 2023 IEEE Conference …, 2023 - ieeexplore.ieee.org
The last decade of machine learning has seen drastic increases in scale and capabilities.
Deep neural networks (DNNs) are increasingly being deployed in the real world. However, …

Robust feature-level adversaries are interpretability tools

S Casper, M Nadeau… - Advances in Neural …, 2022 - proceedings.neurips.cc
The literature on adversarial attacks in computer vision typically focuses on pixel-level
perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent …

Explore, establish, exploit: Red teaming language models from scratch

S Casper, J Lin, J Kwon, G Culp… - arXiv preprint arXiv …, 2023 - arxiv.org
Stephen Casper received support for this work from … [5] Isabelle Augenstein, Christina
Lioma, Dongsheng Wang, Lucas Chaves Lima, Casper Hansen, Christian Hansen, and Jakob …

Red teaming deep neural networks with feature synthesis tools

S Casper, T Bu, Y Li, J Li, K Zhang… - Advances in …, 2023 - proceedings.neurips.cc
Interpretable AI tools are often motivated by the goal of understanding model behavior in out-of-distribution
(OOD) contexts. Despite the attention this area of study receives, there are …

Scalable and transferable black-box jailbreaks for language models via persona modulation

R Shah, S Pour, A Tagade, S Casper… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite efforts to align large language models to produce harmless responses, they are still
vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate …

Frivolous units: Wider networks are not really that wide

S Casper, X Boix, V D'Amario, L Guo… - Proceedings of the …, 2021 - ojs.aaai.org
A remarkable characteristic of overparameterized deep neural networks (DNNs) is that their
accuracy does not degrade when the network width is increased. Recent evidence suggests …