Flat Circle LLM Benchmark - Methodology
Assessing LLMs' ability to play the hardest game in the world
Flat Circle is a live, rolling benchmark assessing LLMs’ ability to correctly call company earnings.
The problem
Language models are scored against each other with benchmarks like MMLU, HumanEval, GPQA or MATH that proxy the models’ skills in domains like reasoning, math or coding.
Two problems with these benchmarks:
1. They eventually become saturated. Either the models are engineered to score well on the tests, or in some cases, have the answers as part of their training data. Inevitably, the models score so well that they outperform PhD-level humans, and the test loses its ability to stump them.
2. Limited real-world application. Just because an LLM can earn a PhD-level score on an exam doesn't mean it can help with everyday tasks.
Introducing the Flat Circle benchmark
Flat Circle is a new benchmark assessing LLMs’ ability to play the hardest game in the world: predicting company earnings results.
Most publicly traded companies report operating results to the market every three months. Vast teams of analysts, quant algorithms, data vendors, expert networks and conference organizers compete for an edge in predicting whether a company will "beat" or "miss" its earnings, and are rewarded with gains if they're correct or punished with losses if they're wrong.
Earnings are the primary thunderdome of the financial industry. As a benchmark, earnings prediction can never become saturated or contaminated, since the answer is not yet known at the time of the test. And the smartest, most well-financed people and algorithms are constantly pricing stocks closer to their expected value, making the game harder and harder.
The Flat Circle benchmark will see which LLMs are ready to play this game.
How it works
When using LLMs, it’s typical to provide a prompt like “please summarize this document” and context such as the document you want summarized. You can do this via a web interface like chat.openai.com or send similar information via an API request.
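For illustration, a minimal sketch of that kind of API request is below, using the OpenAI Python SDK; this is not the benchmark's actual client code, and other providers expose similar interfaces.

```python
# Minimal sketch of a prompt + context API request (OpenAI Python SDK shown
# for illustration only; this is not the benchmark's actual client code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "Please summarize this document."
context = open("document.txt").read()  # the document to be summarized

response = client.chat.completions.create(
    model="gpt-4o",  # any chat-capable model
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": context},
    ],
)
print(response.choices[0].message.content)
```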
Context
The day before a company reports earnings, the system generates a long document (the “Context”) containing information that a professional investor uses to evaluate whether to buy or sell into earnings. This includes information that has come out since the company last reported earnings (the “Setup”), such as:
Share price return vs the S&P since the last reported earnings
Sell-side upgrades/downgrades
Company press releases
Transcripts for recent peer earnings
The system supplies the same Prompt and Context to each model for any given earnings. Over time, the goal is to improve the Prompt and expand/optimize the Context to increase the models' aggregate accuracy.
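As a rough sketch of what assembling the Context might look like, assuming the Setup is stored as a set of named text sections (the section titles and placeholder contents below are illustrative, not the benchmark's actual pipeline):

```python
# Hypothetical sketch of how a Context document might be assembled the day
# before earnings. The Setup sections and their contents are placeholders,
# not the benchmark's actual data pipeline.
def build_context(setup_sections: dict[str, str]) -> str:
    """Concatenate each Setup section under a header so every model
    receives an identical Context document."""
    return "\n\n".join(f"## {title}\n{body}" for title, body in setup_sections.items())

context = build_context({
    "Share price return vs the S&P 500": "...",
    "Sell-side upgrades/downgrades": "...",
    "Company press releases": "...",
    "Recent peer earnings transcripts": "...",
})
```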
Models
The system generates the Prompts and Contexts for most public companies the day before they report earnings, then provides them in an API request to each of the following models:
These models have consistently scored the highest on reasoning tests, and thus seem most suited to attempt to compete against professional investors in predicting upcoming earnings. Over time the goal is to expand the number of models in the benchmark.
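A hypothetical fan-out of the same Prompt and Context to each model might look like the sketch below; the model identifiers and the query_model helper are placeholders, not the benchmark's actual roster or client code.

```python
# Hypothetical fan-out: every model receives the identical Prompt and Context.
# The model identifiers and query_model helper are placeholders.
MODELS = ["model-a", "model-b", "model-c"]  # stand-ins for the benchmarked models

def query_model(model: str, prompt: str, context: str) -> str:
    # Placeholder for a provider-specific API call (see the request sketch above);
    # a real implementation would parse the model's BUY/SELL Call from its response.
    return "BUY"

def collect_calls(prompt: str, context: str) -> dict[str, str]:
    """Return each model's Call for one upcoming earnings report."""
    return {model: query_model(model, prompt, context) for model in MODELS}
```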
If you have suggestions for other models to test, or would like to include your own models in the benchmark (either privately or publicly), contact me via X or LinkedIn.
Calls
The day before each Earnings, the system supplies each model with the same Prompt and Context, asking it to return a BUY or SELL recommendation (a “Call”). The day after Earnings, the system assesses whether each model's BUY or SELL call was correct, as measured by whether the stock closes up or down that day.
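A minimal sketch of that scoring rule, assuming the post-earnings move is measured against the prior day's close (field names are illustrative):

```python
# Minimal sketch of the scoring rule: a BUY Call is correct if the stock closes
# up the day after earnings, a SELL Call if it closes down. Assumes the move is
# measured against the prior day's close.
def call_is_correct(call: str, prior_close: float, post_earnings_close: float) -> bool:
    went_up = post_earnings_close > prior_close
    return (call == "BUY" and went_up) or (call == "SELL" and not went_up)

print(call_is_correct("BUY", prior_close=100.0, post_earnings_close=103.5))   # True
print(call_is_correct("SELL", prior_close=100.0, post_earnings_close=103.5))  # False
```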
Limitations
These models have many limitations in their ability to make accurate earnings calls, including:
the Context lacks information available to professional investors, such as access to management, sell-side research, financial news, expert networks and expert transcripts, alternative data, conference commentary, and insider selling
the Prompt doesn’t instruct the model to create its own financial model or forecast
the Prompt doesn’t instruct the model to consider historical patterns that a professional analyst would, such as whether companies tend to beat or miss following a new CEO or in advance of an options grant
We can address each of these limitations over time with expanded access to information and a more robust RAG architecture to query that information.
Finally, there is the problem that accurately calling company earnings may not equate to the highest level of reasoning. Investors are known to complain that the market is “not behaving rationally” when investments do not go their way. Over time this experiment should surface models with the highest average accuracy in calling earnings. If this level even approaches the accuracy of professional analysts, it will have real implications for markets. And even if calling earnings is not the best overall measure of a model’s ability to reason, at least it’s a fair and objective benchmark.
Opportunities
Still, there’s reason to believe that LLMs may eventually be able to call company earnings more accurately than a human analyst:
LLMs reason without emotion
LLMs may be able to analyze a larger amount of unstructured information than any human. For example, the system may eventually be able to ingest all written and spoken statements from the company, its competitors, suppliers and customers in every language
LLMs may be able to extract patterns from longer time periods than a human analyst would be able to reasonably consider
LLMs may be able to incorporate a larger corpus of “real-time” information in their call
LLMs may eventually be able to make deductive leaps that a human being cannot
Disclaimer
Of course, none of the content produced by any of these models is investment advice. While we're hopeful some models may occasionally provide interesting insights, they will also make mistakes or even hallucinate, so they should not be relied upon for trading decisions. There’s also no consideration of trading-specific factors like borrow costs or liquidity. Furthermore, the Context provided to these models does not yet contain the totality of information available to institutional asset managers, so they are operating at a disadvantage to the largest, best-financed professional investors.
Suggestions? Feedback?
Please contact us with any feedback or suggestions to improve this process. Our goal is to identify the best possible methodology and model for using LLMs to perform investment research. Reply to any email or contact me via X or LinkedIn.