
Lookahead Bias in Large Language Model Training Data

Steve LeCompte | Posted in: Investing Expertise

Can large language models (LLMs) inject lookahead bias into backtests when LLM training samples are generated without rigor? In their preliminary and incomplete March 2024 paper entitled "Lookahead Bias in Pretrained Language Models", Suproteem Sarkar and Keyon Vafa examine the potential for lookahead bias in backtests that use the Llama-2 LLM to identify future firm risks based on the content of earnings calls. They consider cases for which: (1) the backtest period falls within the LLM training sample, but the researcher instructs the LLM to consider only information from before the test period; and (2) the researcher specifies a training sample that ends before the backtest but generates it long after the end of the training sample. Using Llama-2 to interpret transcripts of selected firm earnings calls from 2018, they find that:
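The two backtest configurations described above can be sketched in Python. This is a minimal illustration, not the paper's code; the function names and prompt wording are hypothetical. Case (1) relies only on a prompt instruction to respect a cutoff date, while case (2) is structurally clean only if the model's training data actually ends before the backtest begins.

```python
from datetime import date

def prompted_cutoff_query(transcript: str, cutoff: date) -> str:
    """Case 1: the model's training data may extend past the backtest,
    so the prompt merely *asks* the model to ignore later information.
    Any lookahead protection depends on instruction-following."""
    return (
        f"Using only information available before {cutoff.isoformat()}, "
        f"identify the main risks facing this firm.\n\n{transcript}"
    )

def training_cutoff_is_clean(model_data_cutoff: date, backtest_start: date) -> bool:
    """Case 2: lookahead bias is ruled out structurally only when the
    model's training data ends strictly before the backtest begins."""
    return model_data_cutoff < backtest_start
```

For example, a 2018 backtest run with a model whose training data extends through 2023 falls under case (1) no matter how the prompt is worded, whereas `training_cutoff_is_clean(date(2017, 12, 31), date(2018, 1, 1))` describes a case (2) setup.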

Subscribe to Keep Reading
