
Speaker: Arthur Spirling (Princeton University)
Date: June 26, 2025, 15:00–16:40 (JST)
Location: Room 104, Conference Room 2, 1F, Institute of Social Science, Hongo Campus, the University of Tokyo
https://www.iss.u-tokyo.ac.jp/guide/
Language: English
Target: Open to the public
Abstract: Large Language Models (LMs) are exciting tools: they require minimal researcher input but make it possible to annotate and generate large quantities of data. Yet there has been almost no systematic research into the reproducibility of research using LMs. This is a potential problem for scientific integrity. We give a theoretical framework for replication in the discipline and show that LM work is perhaps uniquely problematic. We demonstrate the problem empirically using a rolling iterated replication design in which we compare crowdsourcing and LMs on multiple repeated tasks over many months. We find that LMs can be accurate, but the observed variance in performance is often unacceptably high. Strict "temperature" control does not resolve these issues. This affects downstream results. In many cases the LM findings cannot be re-run, let alone replicated. We conclude with recommendations for best practice, including the use of locally versioned 'open' LMs.
To register, please visit here.