Idea:

Training SAEs to find features in attention activations:

on GPT-2: https://github.com/openai/sparse_autoencoder and the accompanying paper

  • a lens for inspecting random activations (see the training sketch below)
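A minimal sketch of what that training could look like, assuming activations have already been collected into a `(num_tokens, d_model)` tensor; the `SparseAutoencoder` class, `train_sae` helper, and all hyperparameters are illustrative assumptions, not the OpenAI implementation:

```python
# Minimal sketch of training a sparse autoencoder (SAE) on cached activations.
# Assumes `acts` is a (num_tokens, d_model) float tensor of residual-stream or
# attention-output activations collected elsewhere; sizes are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activation
        return x_hat, f

def train_sae(acts, d_hidden=4096, l1_coeff=1e-3, epochs=5, lr=1e-4):
    sae = SparseAutoencoder(acts.shape[-1], d_hidden)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    loader = torch.utils.data.DataLoader(acts, batch_size=1024, shuffle=True)
    for _ in range(epochs):
        for batch in loader:
            x_hat, f = sae(batch)
            # reconstruction error plus L1 sparsity penalty on the features
            loss = ((x_hat - batch) ** 2).mean() + l1_coeff * f.abs().mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return sae
```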

Visualization / tooling: LSTMVis (https://lstmvis.vizhub.ai/) and TransformerLens (https://github.com/TransformerLensOrg/TransformerLens)
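A rough sketch of pulling attention and residual-stream activations out of GPT-2 with TransformerLens, to feed the SAE sketch above; the layer and hook choices are illustrative:

```python
# Sketch: caching attention activations from GPT-2 with TransformerLens.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
text = "The quick brown fox jumps over the lazy dog."
logits, cache = model.run_with_cache(text)

# Attention head outputs at layer 5: shape (batch, seq, n_heads, d_head)
z = cache["blocks.5.attn.hook_z"]
# Post-attention residual stream at layer 5: shape (batch, seq, d_model)
resid = cache["blocks.5.hook_resid_post"]
# Flatten to (num_tokens, d_model) for SAE training
acts = resid.reshape(-1, resid.shape[-1])
```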

Attribute al

  • What features does it capture?

How do we steer: feature-based or token-based?

Alignment with human representations?

Feature representations? Correctness w.r.t. the model's internal representation

Accurate mappings between human and machine features?

Context: reduction into a shared (third) space

Representation based on the user's text (build features)
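One possible way to build such a feature profile, reusing the `model` and `sae` objects from the sketches above; the `user_text_features` helper, layer index, and `top_k` are hypothetical choices:

```python
# Sketch: summarizing a user's text as its strongest SAE feature activations.
import torch

def user_text_features(text, model, sae, layer=5, top_k=10):
    _, cache = model.run_with_cache(text)
    resid = cache[f"blocks.{layer}.hook_resid_post"]     # (1, seq, d_model)
    with torch.no_grad():
        _, f = sae(resid.reshape(-1, resid.shape[-1]))   # (seq, d_hidden)
    # Aggregate over tokens and keep the strongest features as the user's profile.
    mean_acts = f.mean(dim=0)
    top_vals, top_idx = mean_acts.topk(top_k)
    return list(zip(top_idx.tolist(), top_vals.tolist()))
```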

Use SAEs for steerable generation
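A hedged sketch of what feature-based steering could look like: add a scaled SAE decoder direction into the residual stream via a TransformerLens forward hook during generation. The `steer_with_feature` helper, feature index, scale, and layer are illustrative assumptions, not a validated recipe:

```python
# Sketch: feature-based steering by injecting an SAE decoder direction
# into the residual stream while generating.
import torch

def steer_with_feature(model, sae, prompt, feature_idx, scale=5.0, layer=5,
                       max_new_tokens=40):
    # Column of the decoder weight = the feature's direction in model space.
    direction = sae.decoder.weight[:, feature_idx].detach()  # (d_model,)
    hook_name = f"blocks.{layer}.hook_resid_post"

    def add_feature(resid, hook):
        # Broadcast the feature direction over batch and sequence positions.
        return resid + scale * direction.to(resid.device, resid.dtype)

    with model.hooks(fwd_hooks=[(hook_name, add_feature)]):
        return model.generate(prompt, max_new_tokens=max_new_tokens)
```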

User feedback

Actionable steering of attention-based models