Flat Circle - Contrasting good vs poor reasoning
Plus: Grok and o1 share the lead, 10 billion times more compute, more deep researchers
Flat Circle measures the ability of language models to predict company earnings results. See our methodology for details and disclaimers. If you haven’t already subscribed, join investors and engineers interested in LLMs+investment research here:
Key takeaways
After 344 earnings, Grok-2 and o1 share the lead with a ~54% hitrate, earning ~1.1% per earnings
We contrast the reasoning approach of stronger vs weaker models regarding shareholder lawsuits
Perplexity and Grok announced their own competitors to ChatGPT Deep Research
Agentic research systems create another dimension on which LLMs may compete with human investors: they could either reason better or research better
Since Deep Researchers search the live web, a backtest would let them see information published after the decision date (including the outcome itself), so it’s impossible to backtest their ability to make investment decisions. You need to test them live
Model accuracy
Are all models converging on a 50/50 coinflip? Two standard deviations above the mean for a random coinflipper is ~55% after 345 flips, so a ~54% hitrate is not yet distinguishable from chance, and it seems worrisome that the hitrate has been trending down.
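For reference, that 55% figure is just the standard binomial approximation; a quick sanity check in Python:

```python
import math

# Hitrate of a random coinflipper over n earnings calls:
# mean p = 0.5, standard error sigma = sqrt(p * (1 - p) / n)
n = 345
p = 0.5
sigma = math.sqrt(p * (1 - p) / n)  # ~0.027

# A hitrate more than 2 sigma above 50% would be hard to explain by luck
threshold = p + 2 * sigma
print(f"2-sigma threshold after {n} flips: {threshold:.1%}")  # -> 55.4%
```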
However, the mean profit per earnings remains very strong for o1 (113 bps) and Grok-2 (107 bps). o1 is by far the most advanced model according to other benchmarks, and by far the most expensive in terms of tokens, so it’s promising that it’s also earning the most. We’ll see.
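For clarity on the two headline metrics: hitrate is the share of profitable calls, and profit per earnings is the simple mean of per-earnings returns. A minimal sketch with invented numbers (the real figures come from the trade log described in our methodology):

```python
# Hypothetical per-earnings returns for one model, in basis points.
# These numbers are invented purely for illustration.
returns_bps = [250, -120, 80, 310, -40, -200, 150]

hitrate = sum(r > 0 for r in returns_bps) / len(returns_bps)
mean_profit_bps = sum(returns_bps) / len(returns_bps)

print(f"hitrate: {hitrate:.0%}")  # share of winning calls
print(f"mean profit per earnings: {mean_profit_bps:.0f} bps")
```

The two metrics can diverge: a model hovering near a 50% hitrate can still earn meaningful bps per earnings if its winners are larger than its losers, which is roughly the pattern o1 and Grok-2 are showing.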
An example of good vs poor reasoning
This project provides a way to quantify correct vs incorrect investor judgement. As an example, Pagaya (PGY) recently reported better-than-expected revenue and margins and gave investors greater certainty around its balance sheet.
Gemini Flash Thinking, which loses 1.1% per earnings, had issued a SELL:
The emergence of securities fraud investigations significantly overshadows any positive historical trends or recent ABS deal announcements. While Pagaya has shown a pattern of exceeding guidance and achieving strong growth, the serious nature of securities fraud allegations creates substantial uncertainty and risk.
We feed the models recent press releases, and Gemini Flash Thinking apparently fixated on shareholder lawsuits related to a short-seller report.
o1, which earns 1.1% per earnings, was fed the same information. It accounted for the shareholder suits but concluded BUY:
While the wave of shareholder lawsuits and the Q3 share‐price plunge indicate heightened uncertainty, the company’s expanding network of top-tier lending partners (including new bank and auto relationships) and its successful capital raises suggest solid underlying demand. In addition, management’s repeated emphasis on profitability in 2025, along with evidence of improving loan performance and rising fee margins, points to a business now positioned for stronger operating leverage as interest rates stabilize. With shares trading well off recent highs yet fundamentals and guidance still largely intact, the risk/reward profile appears favorable going into the February earnings release—making PGY a buy.
We don’t specify in the prompt whether to ignore shareholder lawsuit press releases. The returns teach us which reasoning is correct.
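To make the setup concrete, here is a minimal sketch of how such a prompt could be assembled and sent. Everything here is hypothetical: the model name, the wording, and the press-release snippets are illustrative stand-ins, not the actual Flat Circle prompt:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative press-release snippets (placeholders, not real filings).
# Note what is absent from the prompt below: no instruction to weight or
# ignore lawsuit press releases. The model must make that call itself.
press_releases = "\n\n".join([
    "Pagaya announces new auto lending partnership ...",
    "Law firm announces securities fraud investigation into Pagaya ...",
    "Pagaya closes new ABS deal ...",
])

prompt = (
    "You are predicting the stock reaction to an upcoming earnings report.\n"
    f"Recent press releases for PGY:\n\n{press_releases}\n\n"
    "Reply with BUY or SELL and a short justification."
)

resp = client.chat.completions.create(
    model="o1",  # or whichever model is under test
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```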
Interesting articles
The amount of compute per request is going to skyrocket. This has major implications for datacenters and hyperscalers, and also for what can and will be spent on investment decisions:
“…this single process from a single human interaction would involve 10 billion times more compute than a single human writing into ChatGPT today, at the exact same model size. That is the incredible expansion dynamic in inference compute that is playing out today and over the next few years!”
(See “Inference Compute Scaling” on Attune Research)
An extensive thread on using ChatGPT Deep Research to create an investment thesis around DoorDash (DASH). Lots of great detail; I particularly like the multiple rounds with ChatGPT to craft the optimal prompt:
“I asked ChatGPT to build me a prompt for Deep Research to do Deep Research on Deep Research prompting. It read all the blogs and literature on best practices and gave me a thorough report. Then I asked for this to be turned into a prompt template for Deep Research. I've added it below. This routinely creates 3-5 page prompts that are generating 60-100 page, very thorough reports”
(@BuccoCapital on X)
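The workflow itself is a simple two-stage meta-prompt, and the first stage is easy to reproduce. A rough sketch, assuming the OpenAI chat API for the prompt-writing stage (Deep Research runs inside the ChatGPT UI, so the final step stays manual):

```python
from openai import OpenAI

client = OpenAI()

# Stage 1: ask a chat model to write the prompt you will later paste into
# Deep Research. The wording here is illustrative, not the author's.
meta_prompt = (
    "Summarize best practices for prompting ChatGPT Deep Research, then "
    "turn them into a reusable prompt template for building an investment "
    "thesis on a given stock. Leave {TICKER} as a placeholder."
)

resp = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model
    messages=[{"role": "user", "content": meta_prompt}],
)
template = resp.choices[0].message.content

# Stage 2: fill in the ticker and paste the result into Deep Research.
print(template.replace("{TICKER}", "DASH"))
```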
Grok-3 with DeepSearch announced (TechCrunch)
Perplexity launches Deep Research (Perplexity)
Hedge Fund that replaced analysts with AI beat the market (Bloomberg)
Follow the progress of LLM investment research
If you have feedback or would like to participate in this project, please reply to this email or reach out via X or LinkedIn.