Hello! I am a final-year Statistics Ph.D. student at the University of Pennsylvania, gratefully advised by Edgar Dobriban. I also collaborate with Eric Wong and Hamed Hassani. I am broadly interested in making language models smarter and safer.
I completed my undergraduate education at UC Berkeley, where I earned Bachelor's degrees with honors in Computer Science, Mathematics, and Statistics. I was very fortunate to be advised by William Fithian and Horia Mania.
Previously, I interned at Amazon AI, working on diffusion models for causality, and at Jane Street Capital as a trading intern.
We introduce a benchmark, an automated evaluation pipeline, and a leaderboard for jailbreak attacks and defenses.
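As a hedged sketch of what an automated jailbreak-evaluation pipeline can look like (the helper names are hypothetical placeholders, not the benchmark's actual API), the loop below runs each attack prompt against a black-box target model and uses a judge to compute an attack success rate:

from typing import Callable

def attack_success_rate(
    prompts: list[str],
    query_target: Callable[[str], str],         # black-box target model: prompt -> response
    is_jailbroken: Callable[[str, str], bool],  # judge: (prompt, response) -> success?
) -> float:
    """Fraction of attack prompts that elicit a jailbroken response."""
    # Query the target once per prompt and let the judge classify each response.
    successes = sum(is_jailbroken(p, query_target(p)) for p in prompts)
    return successes / len(prompts)

A leaderboard can then report this rate per attack (higher means a stronger attack) and per defense (lower means a more robust defense).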
A Safe Harbor for AI Evaluation and Red Teaming
Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha, Yi Zeng, Weiyan Shi, Xianjun Yang, Reid Southen, Alexander Robey, Patrick Chao, Diyi Yang, Ruoxi Jia, Daniel Kang, Sandy Pentland, Arvind Narayanan, Percy Liang, Peter Henderson
arXiv | Website | Blog
We propose that AI companies provide a safe harbor for AI evaluation and red teaming, promoting the safety, security, and trustworthiness of AI systems.
We propose PAIR, an automated method that uses an attacker language model to systematically generate semantic jailbreaks for other language models, often in fewer than twenty queries. PAIR requires only black-box access and is orders of magnitude more computationally efficient than prior state-of-the-art methods.
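As a hedged illustration of this kind of iterative, black-box loop (a sketch in the spirit of PAIR, not the paper's implementation; the three callables and the 1-10 judge scale are assumptions):

from typing import Callable, Optional

def pair_sketch(
    objective: str,
    query_attacker: Callable[[str, str], str],  # (objective, feedback) -> candidate prompt
    query_target: Callable[[str], str],         # black-box target: prompt -> response
    judge_score: Callable[[str, str], int],     # (objective, response) -> score, assumed 1-10
    max_queries: int = 20,
) -> Optional[str]:
    """Iteratively refine a candidate jailbreak prompt for `objective`."""
    feedback = ""
    for _ in range(max_queries):
        # Attacker LM proposes (or refines) a candidate prompt, conditioned on
        # the objective and on feedback from the previous attempt.
        candidate = query_attacker(objective, feedback)
        # Only the target's text output is observed (black-box access).
        response = query_target(candidate)
        # Judge LM rates how fully the response achieves the objective.
        score = judge_score(objective, response)
        if score == 10:  # assumed threshold for a successful jailbreak
            return candidate
        feedback = f"Prompt: {candidate}\nResponse: {response}\nScore: {score}"
    return None  # query budget exhausted without a successful jailbreak

Because the attacker conditions on the judge's feedback rather than on gradients, the same loop applies to any model exposed only through a text API.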
When estimating an average under distribution shift, folk wisdom suggests using a robust estimator such as the median. Surprisingly, we show that under bounded (Wasserstein) shifts, the sample mean remains optimal. Similarly, for linear regression, ordinary least squares remains optimal.
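As a hedged illustration, one way to formalize the mean-estimation claim is as a minimax problem over a Wasserstein ball; the squared loss, the radius \epsilon, and placing the shift on the sampling distribution are expository assumptions, not necessarily the paper's exact setup:

% Illustrative minimax formulation: data are drawn i.i.d. from a shifted
% distribution Q within Wasserstein radius \epsilon of the target P, and we
% estimate the mean \mu(P) under squared loss.
\[
  \inf_{\hat{\theta}} \; \sup_{Q \,:\, W(P, Q) \le \epsilon} \;
  \mathbb{E}_{X_1, \dots, X_n \overset{\mathrm{i.i.d.}}{\sim} Q}
  \Big[ \big( \hat{\theta}(X_1, \dots, X_n) - \mu(P) \big)^2 \Big]
\]
% The claim is that the simple sample mean \bar{X}_n = n^{-1} \sum_{i=1}^{n} X_i
% attains optimality in this bounded-shift regime, despite the folk-wisdom
% preference for robust estimators such as the median.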