Overview of me:
Hello! My name is Ole, and I am an AI Safety researcher focussing on language model evaluations and interpretability. I recently completed an MSc in Artificial Intelligence from Imperial College London, and previously received an MMath in Mathematics from the University of St Andrews.
I am involved with the AI Safety and Effective Altruism communities.
What I'm Working on:
I have just published a paper on evaluating large language models, alongside my group from AI Safety Camp! We investigated the self-consistency of various OpenAI models under ambiguity, using the novel setting of integer sequences. The paper will appear at the BlackboxNLP workshop at EMNLP 2023; the arXiv version is here.
I completed my dissertation on investigating latent spaces of transformer models. My supervisors were Murray Shanahan, Dylan Cope, and Nandi Schoots, all of whom I have enjoyed working with immensely. Specifically, we investigated when transformer models represent features geometrically, and how this can be used to better control models (such as via activation additions and feature detection). You can see my dissertation here. We are currently extending the experiments to Llama models, and will be submitting to a conference in the near future. In general, I am interested in using an improved understanding of model activations to better control language models.