[LR] Steering Language Models With Activation Engineering
[This review is intended solely for my personal learning]

Paper Info
- arXiv: 2308.10248
- Title: Steering Language Models With Activation Engineering
- Authors: Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

Prior Knowledge
- Steering Language Models: Traditionally achieved through techniques such as prompt engineering, fine-tuning, and reinforcement learning from human feedback (RLHF), all of which typically require substantial compute and specialized datasets.
- Internal Activation Manipulation: Directly manipulating a network's internal activations can offer fine-grained control over its outputs without the cost of retraining or extensive model tuning.

Goal
The paper introduces and validates a lightweight inference-time technique, Activation Addition (ActAdd), for steering the output of large language models (LLMs). The method targets latent capabilities of LLMs, such as controlled sentiment expression or reduced output toxicity, without any additional training.

...
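To make the idea concrete, here is a minimal sketch of the ActAdd recipe on a toy network: record activations for a contrast pair of prompts (e.g. "Love" vs. "Hate") at one layer, form a scaled difference vector, and add it into the forward pass at that layer during inference. The toy model, the `embed` helper, the layer index, and the coefficient are all illustrative assumptions, not the paper's actual implementation (which operates on the residual stream of a real transformer).

```python
import hashlib
import numpy as np

# Toy stand-in for a transformer: a stack of random tanh-linear layers.
rng = np.random.default_rng(0)
D, L = 8, 4  # hidden size, number of layers (arbitrary choices)
weights = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(L)]

def embed(prompt):
    """Toy deterministic 'embedding': one fixed vector per prompt string."""
    seed = int.from_bytes(hashlib.sha256(prompt.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).standard_normal(D)

def forward(x, steering=None, layer=None):
    """Run the stack; optionally add a steering vector after one layer."""
    h = x
    for i, W in enumerate(weights):
        h = np.tanh(h @ W)
        if steering is not None and i == layer:
            h = h + steering  # the ActAdd-style intervention
    return h

def activations_at(x, layer):
    """Record the hidden activation after `layer` (no intervention)."""
    h = x
    for W in weights[: layer + 1]:
        h = np.tanh(h @ W)
    return h

# 1) Choose a contrast pair of prompts, a target layer, and a coefficient.
layer, coeff = 2, 4.0
h_pos = activations_at(embed("Love"), layer)
h_neg = activations_at(embed("Hate"), layer)

# 2) Steering vector = coeff * (activations of p+ minus activations of p-).
steer = coeff * (h_pos - h_neg)

# 3) Add it during the forward pass on an unrelated input.
x = embed("I think dogs are")
plain = forward(x)
steered = forward(x, steering=steer, layer=layer)
```

In a real LLM the same three steps would be done with forward hooks on a chosen transformer block, and the only knobs are the contrast pair, the injection layer, and the coefficient; no gradients or parameter updates are involved, which is what makes the method cheap at inference time.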