User profiles for Stephen Casper
Stephen Casper, PhD student, MIT. Verified email at mit.edu. Cited by 552.
Clusterability in neural networks
The learned weights of a neural network have often been considered devoid of scrutable
internal structure. In this paper, however, we look for structure in the form of clusterability: how …
Concussion: A History of Science and Medicine, 1870‐2005
ST Casper - Headache: The Journal of Head and Face Pain, 2018 - Wiley Online Library
Objective To review the intellectual history of concussion from the mid‐19th century to the
opening decade of the 21st century. Background Head injuries (HI) and their acute and long‐…
Open problems and fundamental limitations of reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) is a technique for training AI systems
to align with human goals. RLHF has emerged as the central method used to finetune state-…
The punch-drunk boxer and the battered wife: Gender and brain injury research
ST Casper, K O'Donnell - Social Science & Medicine, 2020 - Elsevier
This essay uses gender as a category of historical and sociological analysis to situate two
populations—boxers and victims of domestic violence—in context and explain the temporal …
Toward transparent ai: A survey on interpreting the inner structures of deep neural networks
The last decade of machine learning has seen drastic increases in scale and capabilities.
Deep neural networks (DNNs) are increasingly being deployed in the real world. However, …
Robust feature-level adversaries are interpretability tools
The literature on adversarial attacks in computer vision typically focuses on pixel-level
perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent …
Explore, establish, exploit: Red teaming language models from scratch
… Stephen Casper received support for this work from … [5] Isabelle Augenstein, Christina
Lioma, Dongsheng Wang, Lucas Chaves Lima, Casper Hansen, Christian Hansen, and Jakob …
Red teaming deep neural networks with feature synthesis tools
Interpretable AI tools are often motivated by the goal of understanding model behavior in out-of-distribution
(OOD) contexts. Despite the attention this area of study receives, there are …
Scalable and transferable black-box jailbreaks for language models via persona modulation
Despite efforts to align large language models to produce harmless responses, they are still
vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate …
Frivolous units: Wider networks are not really that wide
A remarkable characteristic of overparameterized deep neural networks (DNNs) is that their
accuracy does not degrade when the network width is increased. Recent evidence suggests …