[LR] Representation Engineering: Top-Down AI Transparency

[This review is intended solely for my personal learning] Paper Info arXiv: 2310.01405v4 Title: Representation Engineering: A Top-Down Approach to AI Transparency Authors: Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks Prior Knowledge Interpretability in AI: Traditional approaches often focus on low-level features such as neurons or circuits. However, such bottom-up analysis struggles with high-level abstractions like honesty or power-seeking. Representation Learning: Neural networks form internal embeddings of concepts during training, enabling powerful generalization and emergent behavior. Mechanistic Interpretability: Aims to reverse-engineer model internals to understand reasoning processes, but often lacks tools for directly modifying internal representations to improve safety. Transparency and Alignment: Transparency is critical for ensuring model behavior is aligned with human intentions and values, particularly to avoid deceptive alignment and inner misalignment. Goal The paper proposes and develops Representation Engineering (RepE), a new top-down framework for interpretability and model control. Rather than focusing on neurons or circuits, RepE centers on representations of high-level cognitive concepts and aims to understand, detect, and manipulate them. The core goal is to advance transparency methods that are directly useful for improving AI safety, including controlling undesirable behaviors like deception or power-seeking. ...

March 25, 2025 · 3 min

[LR] LLMs Enhancement via Negative Emotional Stimuli

[This review is intended solely for my personal learning] Paper Info arXiv: 2405.02814 Title: NegativePrompt: Leveraging Psychology for Large Language Models Enhancement via Negative Emotional Stimuli Authors: Xu Wang, Cheng Li, Yi Chang, Jindong Wang, Yuan Wu Prior Knowledge Emotion and Cognition in AI: Previous research has shown that positive emotional stimuli improve LLM performance, raising the question of whether negative emotional stimuli can also have an impact. Psychological Theories: The study draws on Cognitive Dissonance Theory, Social Comparison Theory, and Stress and Coping Theory to design negative emotional prompts that may influence LLM responses. Goal To explore whether negative emotional stimuli, integrated into prompts, can enhance the performance of LLMs across various NLP tasks. The study introduces NegativePrompt, a novel prompting technique that applies negative emotional cues to improve LLM output quality. ...

November 2, 2024 · 3 min

[LR] Unveiling Theory of Mind in LLMs

[This review is intended solely for my personal learning] Paper Info arXiv: 2309.01660 Title: Unveiling Theory of Mind in Large Language Models: A Parallel to Single Neurons in the Human Brain Authors: Mohsen Jamali, Ziv M. Williams, Jing Cai Prior Knowledge Theory of Mind (ToM): A complex cognitive capacity, related to our conscious mind and mental states, that allows us to infer another’s beliefs and perspective. Through ToM, humans can create intricate mental representations of other agents and realize that others may hold beliefs that differ from our own or from objective reality. True- and False-belief Tasks True-belief task: assesses whether someone understands that another person’s belief is correctly aligned with reality. False-belief task: assesses whether someone understands that another person’s belief is not aligned with reality (e.g., the belief diverges from reality after a change to the environment that the person did not witness). The false-belief task is a critical test for ToM. Both tasks are evaluated by giving the participant a scenario and asking “fact questions” and “belief questions”, which concern reality and the belief of a character in the scenario, respectively. These tasks are designed to test whether an individual can attribute mental states (including potentially false beliefs) to others. ToM in the human brain Human brain imaging studies have provided substantial evidence for the brain network that supports our ToM ability, including the temporoparietal junction, superior temporal sulcus and the dorsal medial prefrontal cortex (dmPFC) ...

October 29, 2024 · 6 min