Learnings from new model
Plus: Calls on APLD and KBH, reconciling DAL, STZ, TLRY and WBA hits and misses
The Flat Circle Benchmark measures the top language models’ ability to predict company earnings results. See our methodology for detail and disclaimers. If you haven’t already subscribed, join investors and engineers interested in LLMs+investment research here:
Key takeaways
We received feedback that Anthropic’s latest model, Claude 3.5 Sonnet, may be superior to its Opus reasoning model, so we added it to our pool and are now benchmarking two models from Anthropic
For Applied Digital’s earnings tomorrow, Opus recommends BUY and Sonnet recommends SELL. Both models highlight the same key issues but come to different conclusions
Investors often interpret the same issues differently, and this project may allow us to analyze reasoning skill at the issue level
Thank you to
for the suggestion. If you would like us to add other models, please reach out
Grok is still in the lead with a two-thirds hit rate and +39% total return over nine calls. Gemini lags with a 44% hit rate and -18% total return
All models recommended SELL ahead of Walgreens (WBA) earnings last week and all were wrong, losing 27.5%. We break down the hits and misses from recent earnings and potential ways to improve the system below
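As a rough illustration of how the hit rate and total return figures above could be tallied, here is a minimal Python sketch. The `Call` record, the function names, and the choice to sum per-call returns (rather than compound them) are assumptions for illustration, not the benchmark's documented methodology. The only figure taken from this post is the WBA call, where a SELL lost 27.5%.

```python
from dataclasses import dataclass

@dataclass
class Call:
    ticker: str
    side: str          # "BUY" or "SELL"
    stock_move: float  # post-earnings price change, e.g. 0.275 for +27.5%

def call_return(call: Call) -> float:
    # A BUY earns the stock move; a SELL earns its negative.
    return call.stock_move if call.side == "BUY" else -call.stock_move

def hit_rate(calls: list[Call]) -> float:
    # Share of calls that made money.
    if not calls:
        raise ValueError("no calls to score")
    hits = sum(1 for c in calls if call_return(c) > 0)
    return hits / len(calls)

def total_return(calls: list[Call]) -> float:
    # Assumption: total return is the simple sum of per-call returns.
    return sum(call_return(c) for c in calls)

# The WBA call from this post: SELL ahead of a 27.5% rise.
wba = Call("WBA", "SELL", 0.275)
print(call_return(wba))  # -0.275
```

Under this scoring, a two-thirds hit rate over nine calls corresponds to six profitable calls and three losing ones.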
Upcoming earnings
The full reasoning of the models is available here.
Applied Digital (APLD) - Claude Opus and Sonnet’s diverging calls
For tomorrow’s APLD call, Claude’s Opus (reasoning model) and Sonnet (newest model) highlighted the same key issues but came to different conclusions
Ellendale facility
Opus (BUY)
"The company appears very close to finalizing a significant 400 MW data center lease with a major U.S. hyperscaler customer. This would provide substantial long-term, high-margin revenue"
Sonnet (SELL)
"Ongoing delays in finalizing the key hyperscaler lease agreement for the Ellendale facility continue to create uncertainty. While management expressed confidence in Q1 about finalizing this "in a matter of days to 8 weeks," we still don't have an announcement with earnings approaching"
Convertible debt offering
Opus (BUY)
"APLD recently raised over $450 million via convertible debt offerings, showing investor confidence and providing capital to fund growth initiatives. The company's balance sheet has strengthened significantly"
Sonnet (SELL)
"The recent substantial capital raises ($450M convertible notes, $150M senior secured note) suggest significant near-term cash needs and potential dilution concerns"
Additionally, Sonnet identified a key risk that Opus did not:
"The company faces significant costs without corresponding revenue, including: - $4.1M in Q1 expenses for unutilized data center leases in the cloud business - High depreciation and amortization costs ($34.4M in Q1) - Ongoing construction costs for the Ellendale facility"
Which interpretation of the issues is correct? We’ll see tomorrow and more broadly over a larger sample of earnings. Comparing new model generations against predecessors and using earnings accuracy to score them will be an opportunity to see how reasoning improves on business issues
Recent earnings
Walgreens (WBA) - All models were wrong
Key reasons for the beat were strong execution in cost management and outperformance in the retail pharmacy segment.
Cost management: “We've begun our footprint optimization program and are pleased with the early results. We're currently exceeding historical script retention rates and have retained the majority of store and pharmacy team members. We expect to significantly ramp the pace of our store closures from the first quarter level.”
Retail pharmacy services: “Pharmacy comp sales increased 12.7% driven by brand inflation and script volume, partly offset by lower vaccine volume. Comm scripts excluding immunizations grew 3.5% in the quarter, and we held script market share. Pharmacy services performed better than our expectations during the quarter. As higher margin for COVID-19 vaccines was offset by the lower overall vaccine market volume due to the weaker cough, cold, and flu season.”
Each model’s pre-earnings calls and the post-earnings summary are available here. None of the models anticipated these factors.
Delta Air Lines (DAL) - The only incorrect call was OpenAI’s o1-preview SELL
Delta’s beat was driven by (i) favorable supply/demand factors and (ii) lower-than-expected non-fuel unit costs.
o1-preview’s incorrect SELL reasoning
"Delta's stock has significantly outperformed the market this quarter, raising expectations. However, in the past four quarters, similar outperformance led to negative price reactions after earnings. This pattern suggests that the heightened expectations may not be met, likely resulting in a stock price decline following the upcoming earnings report"
Note: o1-preview uses a hidden chain of thought, so while its reasoning output appears the most simplistic, we do not see the alternate reasoning approaches the model pursued and then abandoned. Still, the above analysis seems much too narrow.
Grok’s correct BUY reasoning
"...The airline industry has shown signs of rationalizing capacity, particularly in the domestic market, which has led to improved unit revenue trends. Delta's management has expressed optimism about the industry's focus on improving financial performance, which should benefit Delta given its strong market position and diversified revenue streams.."
None of the models anticipated the better-than-expected operational execution, as detailed in the earnings call: “…as the year progresses, we should see improvement in non-fuel cost growth as we continue to drive efficiency.”
Constellation Brands (STZ) and Tilray (TLRY) - All models correctly called SELL
STZ missed due to macroeconomic headwinds on lower-income and Hispanic demographics, and pressure in the convenience store channel and in high-end light beer. The models generally predicted the macroeconomic pressure but not the weakness in specific channels and categories.
For TLRY, the models focused on profitability concerns, which were generally borne out.
Potential areas for improvement
The press releases fed to the models currently include shareholder lawsuit announcements, which are typically a symptom of prior underperformance and rarely an investor concern going forward. We’ll likely include only company-issued press releases in the Context going forward
We’ll explore including short-interest and holders in the Context to provide more information to the model on whether earnings offer asymmetric upside or downside
We’ll explore additional information sources for company-specific operational factors, e.g., store rationalizations and other cost efficiencies, and pharmacy services demand trends such as COVID vaccine uptake
WBA has a relatively new CEO who announced a big turnaround last October. We’ll explore how to feed patterns of similar turnarounds announced by new CEOs joining from outside the company into the Context
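The first item above, keeping only company-issued press releases, could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the `issuer` and `headline` fields, the marker phrases, and the `ExampleCo` records are all hypothetical, and the actual pipeline may store and filter releases differently.

```python
# Hypothetical phrases that tend to appear in law-firm lawsuit notices.
LAWSUIT_MARKERS = (
    "class action",
    "shareholder alert",
    "investor alert",
    "lead plaintiff",
)

def is_company_issued(release: dict, company_name: str) -> bool:
    # Drop anything whose headline looks like a lawsuit solicitation.
    headline = release["headline"].lower()
    if any(marker in headline for marker in LAWSUIT_MARKERS):
        return False
    # Keep only releases whose issuer is the company itself.
    return release["issuer"].lower() == company_name.lower()

def filter_context(releases: list[dict], company_name: str) -> list[dict]:
    return [r for r in releases if is_company_issued(r, company_name)]

# Hypothetical example records.
releases = [
    {"issuer": "ExampleCo", "headline": "ExampleCo Announces Quarterly Dividend"},
    {"issuer": "Example Law Firm", "headline": "Shareholder Alert: Class Action Filed Against ExampleCo"},
]
kept = filter_context(releases, "ExampleCo")
print(len(kept))  # 1
```

Checking both the issuer and the headline is a deliberate redundancy here, since newswire feeds do not always label the issuer consistently.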
If you have feedback or would like to participate in this project, please reply to this email or reach out via X or LinkedIn.