Open Concept Steering is an open-source library for discovering and manipulating interpretable features in large language models using Sparse Autoencoders (SAEs). Inspired by Anthropic's work on Scaling Monosemanticity and Golden Gate Claude, this project aims to make concept steering accessible to the broader research community.
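As a rough illustration of the idea, the sketch below shows a toy SAE in NumPy and one simplified form of steering: adding a scaled copy of a feature's decoder direction to a model activation. This is not this library's API. The names `encode`, `decode`, `steer`, `W_enc`, and `W_dec`, the toy dimensions, and the random weights standing in for a trained SAE are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # toy sizes, chosen only for illustration

# Random weights stand in for a trained SAE (assumption: no real checkpoint is loaded).
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions
b_dec = np.zeros(d_model)


def encode(x):
    """ReLU encoder: model activation -> non-negative feature activations."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def decode(f):
    """Linear decoder: feature activations -> reconstructed model activation."""
    return f @ W_dec + b_dec


def steer(x, feature_idx, strength):
    """Shift activation x along one feature's decoder direction (hypothetical helper)."""
    return x + strength * W_dec[feature_idx]


x = rng.normal(size=d_model)          # stand-in for a residual-stream activation
features = encode(x)                  # sparse-ish feature activations
x_hat = decode(features)              # reconstruction of x from the features
x_steered = steer(x, feature_idx=3, strength=5.0)
```

In a real setup, the steered activation would be written back into the model's forward pass at the layer the SAE was trained on, and the feature index would come from the discovered features rather than being picked arbitrarily.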
We provide pre-trained SAEs and discovered features for popular models on HuggingFace:
Each model repository includes:
See the examples/ directory for detailed notebooks demonstrating:
This project is licensed under the MIT License.
If you feel compelled to cite this library in your work, feel free to do so however you please.
This project builds upon the work described in "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" by Anthropic, and this project absolutely would not have been possible without it.