Flat Circle - Can the best LLMs predict company earnings?
Benchmarking how the latest models from OpenAI, Anthropic, Google and xAI play the hardest game in the world
Introducing the Flat Circle Benchmark
The Flat Circle Benchmark compares the top language models’ ability to make accurate buy / sell recommendations in advance of a company reporting earnings.
Current LLM benchmarks are based on tests like MMLU, HumanEval, GPQA or MATH. This approach has two limitations:
The benchmarks eventually become saturated. Models are either engineered to score well on the tests or, in some cases, have the answers in their training data.
Limited real-world application. An LLM earning a PhD-level score on an exam doesn't mean it can help with everyday tasks.
Testing whether models can forecast if a company will beat or miss earnings addresses both limitations. The answer is not known at the time of the test, and the real-world application is the test itself: the smartest, best-financed analysts and algorithms already play the same game every day.
The Flat Circle Benchmark gives the top reasoning models (OpenAI’s o1-preview, Anthropic’s Claude 3 Opus, Google AI’s Gemini 2.0 Flash Experimental and xAI’s Grok-2) extensive information on companies that are about to report earnings and asks each model to make a BUY or SELL call. A call is scored by whether the company’s stock closes up or down the day after it reports.
Please see the methodology for more details, limitations and opportunities.
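The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the function name `is_correct`, the choice of the last close before the report as the reference price, and the treatment of an unchanged close as "no score" are all assumptions that the post does not specify.

```python
from typing import Optional


def is_correct(call: str, pre_close: float, post_close: float) -> Optional[bool]:
    """Score one BUY/SELL call against the post-earnings close.

    Assumptions (not stated in the post): `pre_close` is the last close
    before the report, `post_close` is the close the day after the report,
    and an unchanged close is left unscored (returns None).
    """
    if call not in ("BUY", "SELL"):
        raise ValueError(f"call must be 'BUY' or 'SELL', got {call!r}")
    if post_close == pre_close:
        return None  # flat close: neither up nor down
    went_up = post_close > pre_close
    # A BUY is right when the stock closes up, a SELL when it closes down.
    return went_up if call == "BUY" else not went_up


# Hypothetical prices, for illustration only.
print(is_correct("BUY", pre_close=100.0, post_close=103.1))   # True
print(is_correct("SELL", pre_close=100.0, post_close=103.1))  # False
```

Returning `None` for a flat close keeps ties out of both the hit and miss columns; a real implementation would need to pick its own convention.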
Upcoming earnings
The full reasoning output returned by each model in support of its call is available here.
Franklin Covey Co (FC)
OpenAI o1-preview (BUY)
“Given the company's history of delivering strong financial results and positive earnings surprises, it's likely that FC will once again exceed expectations. Additionally, the company's ongoing investments aimed at accelerating growth may start showing positive traction…”
xAI grok-2-1212 (SELL)
“The magnitude of these positive reactions has not been consistent, and the highest reaction was in the most recent quarter, which might suggest a peak in positive sentiment that could be difficult to sustain… The company's focus on making substantial growth investments… may lead to short-term volatility and lower-than-expected earnings…”
Jefferies Financial Group (JEF)
OpenAI o1-preview (SELL)
“Historical patterns… show that when JEF’s share price outperforms the market substantially before earnings… the stock often reacts negatively… This negative reaction is likely due to elevated market expectations that are difficult to exceed.”
Google AI Gemini 2.0 Flash Experimental (BUY)
“This strong market performance is accompanied by a wave of positive analyst upgrades… This suggests growing confidence in JEF’s performance… likely to beat earnings expectations.”
The Simply Good Foods Co. (SMPL)
OpenAI o1-preview (BUY)
“The launch of Quest’s new ‘Sweet on Protein’ campaign…is expected to drive stronger-than-anticipated sales growth…These positive developments may not be fully reflected in the current share price.”
Anthropic Claude 3 Opus (SELL)
“Atkins sales declined 5%…the company expects this brand to decline high-single digits…So…slowing Atkins sales, supply constraints on Quest, and margin pressures are likely to weigh on Q1 results.”
Recent earnings
Highlights
Commercial Metals Company (CMC)
Outcome: The stock rose 3.1% post-earnings.
Correct Calls (BUY):
OpenAI o1-preview stated, “In each of the past four quarters, CMC’s stock price reacted positively after earnings announcements…indicating the market may have underestimated CMC’s performance.” This prediction aligned with the actual 3.1% gain.
xAI grok-2-1212 argued that CMC was “poised to exceed market expectations,” referencing positive trends in steel shipments and core EBITDA margins.
Incorrect Calls (SELL):
Google AI Gemini 2.0 Flash Experimental warned that “negative sentiment” and “recent analyst downgrades” pointed to a miss, which proved overly bearish given the positive post-earnings reaction.
Anthropic Claude 3 Opus highlighted “seasonal weakness” and “pricing/margin pressure,” but the market instead rewarded CMC’s underlying performance and one-time charge adjustment.
RPM International (RPM)
Outcome: Shares closed up 1.1% post-earnings.
Correct Calls (BUY):
OpenAI o1-preview cited RPM’s “strong execution on MAP 2025 initiatives” and potential “positive share price reaction” if the company exceeded modest expectations. RPM indeed reported record EBIT and EPS.
Anthropic Claude 3 Opus emphasized “impressive execution” and “resiliency” in Construction Products and Performance Coatings, expecting an upside surprise.
Incorrect Calls (SELL):
Google AI Gemini 2.0 Flash Experimental reasoned that the stock’s underperformance relative to the S&P and a “flat second quarter” outlook would weigh on the shares, but RPM’s beat on both revenue and EPS drove a slight gain.
xAI grok-2-1212 pointed to negative consumer demand trends and mixed analyst sentiment; however, RPM’s strong balance sheet and improved margins helped lift the shares after earnings.
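The two highlights above can be tallied by model. The sketch below uses only the calls and post-earnings moves quoted in this section; the `tally` helper and the shortened model labels are illustrative, not part of the benchmark. Two reports are far too few to rank the models, so treat this as a worked example of the bookkeeping rather than a result.

```python
from collections import defaultdict

# Post-earnings moves in percent, as reported in the highlights.
moves = {"CMC": 3.1, "RPM": 1.1}

# (ticker, model, call) as reported in the highlights.
calls = [
    ("CMC", "o1-preview", "BUY"),
    ("CMC", "grok-2-1212", "BUY"),
    ("CMC", "Gemini 2.0 Flash Experimental", "SELL"),
    ("CMC", "Claude 3 Opus", "SELL"),
    ("RPM", "o1-preview", "BUY"),
    ("RPM", "Claude 3 Opus", "BUY"),
    ("RPM", "Gemini 2.0 Flash Experimental", "SELL"),
    ("RPM", "grok-2-1212", "SELL"),
]


def tally(calls, moves):
    """Return {model: (correct_calls, total_calls)}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ticker, model, call in calls:
        went_up = moves[ticker] > 0
        totals[model] += 1
        # BUY is correct when the stock rose, SELL when it fell.
        if (call == "BUY") == went_up:
            hits[model] += 1
    return {model: (hits[model], totals[model]) for model in totals}


for model, (correct, total) in tally(calls, moves).items():
    print(f"{model}: {correct}/{total}")
```

On these two reports the tally comes out to 2/2 for o1-preview, 1/2 each for grok-2-1212 and Claude 3 Opus, and 0/2 for Gemini 2.0 Flash Experimental.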
Limitations and Opportunities
These models have many limitations in their ability to make accurate earnings calls. Most importantly, the system lacks much of the information available to professional investors, such as access to management, sell-side research, expert networks and alternative data. The plan is to address each of these limitations over time with expanded access to information and a more robust architecture for querying it.
Still, there’s reason to believe that LLMs may eventually be able to call company earnings more accurately than a human analyst:
LLMs reason without emotion
LLMs lack the conflicts of interest often present in other sources of stock analysis, e.g., sell-side analysts may also have banking relationships, and buy-side analysts who publish presentations already have a position
LLMs may be able to analyze a larger amount of unstructured information than any human. For example, the system may eventually be able to ingest all written and spoken statements from the company, its competitors, suppliers and customers in every language
LLMs may be able to extract patterns from longer time periods than a human analyst would be able to reasonably consider
LLMs may be able to incorporate a larger corpus of “real-time” information in their call
LLMs may eventually be able to make deductive leaps that a human being cannot
Please see the methodology for more detail on limitations, opportunities, and a disclaimer.
Follow the progress of LLMs learning to make accurate earnings calls
To track the progress of LLMs in investment research, subscribe for free:
If you have any questions, suggestions or would like to participate in this project, reply to this email or reach out via X.