Interpretability of deep learning models: assessing the feasibility of controlling the model beyond the prompt
Anastasia Borovykh (Imperial College London)
Thursday 13th March, 14:00-15:00, fully online
Abstract
Zoom link: https://uofglasgow.zoom.us/j/84371158896
State-of-the-art foundation models are often seen as black boxes: we send in a prompt and get back an answer, often a useful one. But what happens inside the system as the prompt is processed remains a bit of a mystery, and our ability to steer that processing in specific directions is limited.
In this talk we will discuss two concepts that can help us add additional controls to these models. First, LogitLens, a method for gaining more insight into the representations held in hidden layers. Second, steering vectors: by computing a vector that represents a particular feature or concept, we can use it to steer the model toward including a desired property in its output. We end the talk by discussing whether the approach of ‘build a good model, then interpret it’ can work, or whether we should shift our focus to ‘build an a priori interpretable model’, and discuss ways of potentially achieving this.
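For readers who want a concrete picture before the talk, the sketch below illustrates both ideas on GPT-2 using the Hugging Face transformers library. It is only a minimal illustration of the general techniques, not the speaker's implementation; the layer index, prompts, and scaling factor are assumptions chosen for the example.

    # Illustrative sketch (not from the talk): LogitLens-style inspection and a
    # simple steering vector on GPT-2 via Hugging Face transformers.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    LAYER = 6  # assumed layer to inspect / steer

    # --- LogitLens: decode an intermediate hidden state with the unembedding ---
    prompt = "The capital of France is"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[LAYER][:, -1, :]              # last-token state at LAYER
    logits = model.lm_head(model.transformer.ln_f(hidden))   # project onto the vocabulary
    print("LogitLens top token at layer", LAYER, ":", tokenizer.decode(logits.argmax(-1)))

    # --- Steering vector: difference of mean activations for two contrastive prompts ---
    def layer_activation(text):
        ids = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            o = model(**ids, output_hidden_states=True)
        return o.hidden_states[LAYER].mean(dim=1)            # average over token positions

    steer = layer_activation("I am very happy.") - layer_activation("I am very sad.")

    def add_steering(module, inputs, output):
        # GPT-2 blocks return a tuple; element 0 holds the hidden states.
        return (output[0] + 4.0 * steer,) + output[1:]       # 4.0 is an assumed scale

    handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
    steered = model.generate(**tokenizer("The weather today is", return_tensors="pt"),
                             max_new_tokens=20, do_sample=False)
    handle.remove()
    print(tokenizer.decode(steered[0]))

The LogitLens step simply reuses the model's own final layer norm and unembedding to read an intermediate state as a distribution over tokens; the steering step adds a concept direction (here a hypothetical happy-minus-sad contrast) into one layer's activations during generation.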