Interpretability of deep learning models: assessing the feasibility of controlling the model beyond the prompt
Anastasia Borovykh (Imperial College London)
Thursday 13th March, 14:00-15:00, fully online
Abstract
Zoom link: https://uofglasgow.zoom.us/j/84371158896
State-of-the-art foundation models are often seen as black boxes: we send in a prompt and get back an answer, often a useful one. But what happens inside the system as the prompt is processed remains a bit of a mystery, and our ability to steer that processing in specific directions is limited.
In this talk we will discuss two concepts that can help us add additional controls to these models. First, LogitLens, a method for gaining more insight into the representations held in hidden layers. Second, steering vectors: by computing a vector that represents a particular feature or concept, we can use it to steer the model toward including a desired property in its output. We end the talk by discussing whether the approach of ‘build a good model, then interpret it’ can work, or whether we should shift our focus to ‘build an a priori interpretable model’, and discuss ways of potentially achieving this.
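For readers who want a concrete picture before the talk, the sketch below illustrates both ideas on GPT-2 using the Hugging Face transformers library. It is only a minimal illustration of the general techniques, not the speaker's implementation; the layer index, prompts, and scaling factor are assumptions chosen for the example.

    # Illustrative sketch (not from the talk): LogitLens-style inspection and a
    # simple steering vector on GPT-2 via Hugging Face transformers.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    LAYER = 6  # assumed layer to inspect / steer

    # --- LogitLens: decode an intermediate hidden state with the unembedding ---
    prompt = "The capital of France is"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[LAYER][:, -1, :]              # last-token state at LAYER
    logits = model.lm_head(model.transformer.ln_f(hidden))   # project onto the vocabulary
    print("LogitLens top token at layer", LAYER, ":", tokenizer.decode(logits.argmax(-1)))

    # --- Steering vector: difference of mean activations for two contrastive prompts ---
    def layer_activation(text):
        ids = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            o = model(**ids, output_hidden_states=True)
        return o.hidden_states[LAYER].mean(dim=1)            # average over token positions

    steer = layer_activation("I am very happy.") - layer_activation("I am very sad.")

    def add_steering(module, inputs, output):
        # GPT-2 blocks return a tuple; element 0 holds the hidden states.
        return (output[0] + 4.0 * steer,) + output[1:]       # 4.0 is an assumed scale

    handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
    steered = model.generate(**tokenizer("The weather today is", return_tensors="pt"),
                             max_new_tokens=20, do_sample=False)
    handle.remove()
    print(tokenizer.decode(steered[0]))

The LogitLens step simply reuses the model's own final layer norm and unembedding to read an intermediate state as a distribution over tokens; the steering step adds a concept direction (here a hypothetical happy-minus-sad contrast) into one layer's activations during generation.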