
SubQ is a new large language model (LLM) developed by Subquadratic, a Miami-based AI startup. It’s being called the first frontier-class LLM built on a fully subquadratic architecture, meaning its compute costs grow linearly with context length rather than quadratically, as in traditional transformer models. With a research-grade context window of 12 million tokens (roughly 9 million words, or about 120 books), SubQ could represent a fundamental shift in what’s practically possible with AI, if its claims hold up under independent scrutiny.
In this article, we’ll discuss what SubQ is and how its novel Subquadratic Selective Attention (SSA) architecture works, why it matters for developers and businesses, how it performed on key benchmarks, and what the AI research community is saying about it. We’ll also explore the potential real-world applications this kind of technology could unlock and what to watch for as SubQ moves from private beta to broader availability.
TL;DR Snapshot
Subquadratic, a 13-person AI startup based in Miami, has emerged from stealth with SubQ, a large language model that claims to process up to 12 million tokens of context while cutting compute costs by as much as 1,000x compared to leading frontier models. Backed by $29 million in seed funding and built on a novel architecture called Subquadratic Selective Attention (SSA), the model has delivered strong early benchmark results, but the AI community is still waiting on independent verification before declaring it a breakthrough.
Key takeaways:
- SubQ’s SSA architecture scales compute linearly rather than quadratically with context length, enabling a 12 million token context window with attention that is reportedly over 50x faster than FlashAttention and dramatically cheaper than comparable frontier models at scale.
- On multiple benchmarks, SubQ has posted competitive or superior results to models from Anthropic, OpenAI, and Google, including an 83% score for its research model on the MRCR v2 retrieval benchmark and 81.8% on SWE-Bench Verified.
- The AI research community remains divided on whether SubQ’s claims will hold up under independent testing, with some researchers suggesting it may be built on top of existing open-source model weights rather than being entirely novel.
Who should read this: AI engineers, startup founders, enterprise architects, and anyone following the frontier of LLM development.
How SubQ’s Architecture Breaks the Quadratic Ceiling
Every transformer-based model since the original 2017 “Attention Is All You Need” paper shares a core limitation: attention cost scales quadratically with context length. In practical terms, doubling the input quadruples the compute required. This is why most frontier models top out at around 1 million tokens of context, and even at that length, The New Stack reports that performance tends to degrade significantly.
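To make that scaling concrete, here’s a quick back-of-envelope calculation (plain Python, no special libraries) showing how the number of pairwise token comparisons balloons as the context doubles:

```python
# Full self-attention compares every token with every other token,
# so the work grows with the square of the context length.
for n_tokens in (1_000, 2_000, 4_000, 8_000):
    comparisons = n_tokens ** 2
    print(f"{n_tokens:>5} tokens -> {comparisons:>10,} comparisons")

# 1,000 tokens ->  1,000,000 comparisons
# 2,000 tokens ->  4,000,000 comparisons   (2x the input, 4x the work)
# 4,000 tokens -> 16,000,000 comparisons
# 8,000 tokens -> 64,000,000 comparisons
```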
SubQ’s architecture, Subquadratic Selective Attention (SSA), takes a fundamentally different approach. Rather than comparing every token against every other token, SSA dynamically identifies which token relationships actually matter and ignores the rest. According to Subquadratic’s official announcement, this allows compute to grow linearly with context length instead of quadratically.
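Subquadratic hasn’t published SSA’s internals, so any code can only gesture at the general idea. The sketch below is a toy top-k sparse attention in NumPy, illustrating the family of techniques SSA belongs to, not the company’s actual mechanism:

```python
import numpy as np

def toy_selective_attention(q, k, v, top_k=8):
    """Toy sparse attention: each query attends only to its top_k
    highest-scoring keys instead of all n keys.

    Illustrative only -- this is NOT Subquadratic's unpublished SSA.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # (n, n) score matrix

    # Threshold each row at its k-th largest score, masking the rest.
    kth_largest = np.sort(scores, axis=-1)[:, -top_k]
    scores = np.where(scores < kth_largest[:, None], -np.inf, scores)

    # Softmax over the surviving scores, then weight the values.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(128, 32)) for _ in range(3))
print(toy_selective_attention(q, k, v).shape)     # (128, 32)
```

Note that this demo still materializes the full n × n score matrix before masking; a genuinely subquadratic method has to find the relevant keys without ever building that matrix, which is precisely the hard part SSA claims to have solved.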
The company says that at 1 million tokens, SubQ’s sparse attention mechanism is 52x faster than FlashAttention while requiring 63% less compute. At the full 12 million token context window, the architecture reduces attention compute by nearly 1,000x compared to standard frontier models. If validated, these numbers would represent a generational leap rather than an incremental improvement.
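Those specific figures can’t be verified from the outside, but the shape of the claim is easy to sanity-check. Under the hypothetical assumption that each token attends to a roughly fixed budget of other tokens, the advantage over full attention grows linearly with context length:

```python
# Hypothetical sanity check (my numbers, not Subquadratic's math):
# if each token attends to a fixed budget of w other tokens, cost is
# n * w instead of n * n, so the saving ratio n / w grows with n.
w = 12_000  # assumed per-token attention budget, chosen for illustration
for n in (1_000_000, 12_000_000):
    ratio = (n * n) / (n * w)  # full attention cost / sparse cost
    print(f"{n:>12,} tokens -> ~{ratio:,.0f}x less attention compute")

#    1,000,000 tokens -> ~83x less attention compute
#   12,000,000 tokens -> ~1,000x less attention compute
```

The point isn’t the exact numbers, which depend entirely on the assumed budget, but that any fixed-budget scheme’s advantage scales with context length, which is why the claimed gap is so much larger at 12 million tokens than at 1 million.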
It’s worth noting that subquadratic attention isn’t a new idea. Approaches like Longformer’s fixed-pattern sparse attention, state space models such as Mamba, and hybrid architectures have all attempted to solve this problem before. The difference Subquadratic claims is that their SSA achieves linear scaling without sacrificing the accuracy and retrieval capabilities that those earlier approaches traded away.
Benchmark Results: Strong Numbers, Pending Verification
SubQ’s early benchmark results are notable. According to SiliconANGLE’s coverage, the model scored 95% accuracy on the RULER 128K long-context benchmark at a cost of roughly $8, compared to 94% accuracy and approximately $2,600 for Claude Opus on the same test. That cost difference alone, if it holds, is staggering.

On MRCR v2, a benchmark that tests a model’s ability to retrieve and reason over multiple pieces of information spread across long context, SubQ’s research model scored 83%. Its production model, which was third-party verified, scored 65.9%. For reference, Subquadratic’s blog post notes that GPT 5.5 scores 74% on the same test, while Claude Opus 4.7 scores 32.2% and Gemini 3.1 Pro scores 26.3%. On SWE-Bench Verified, a coding benchmark, SubQ scored 81.8%, edging out Opus 4.6 at 80.8% and DeepSeek 4.0 Pro at 80.0%.
These are impressive numbers, but context (no pun intended) matters. The benchmarks Subquadratic chose to highlight emphasize long-context retrieval and coding, which are precisely the areas where its architecture should have the largest advantage. Broader evaluations across general reasoning, math, and other standard LLM tasks haven’t been released yet. The AI community has been vocal about wanting more comprehensive and independently reproducible results before drawing conclusions.
Community Reaction: Breakthrough or Hype?
The response from the AI research world has been sharply divided. As VentureBeat reported, AI commentator Dan McAteer captured the prevailing mood when he described SubQ as “either the biggest breakthrough since the Transformer… or it’s AI Theranos.” The comparison to the infamous fraudulent biotech company reflects how extraordinary the claims are, not necessarily that fraud is suspected.
That said, several specific technical concerns have emerged. AI engineer Will Depue suggested publicly that SubQ may be a sparse attention fine-tune built on top of existing open-source model weights from Kimi or DeepSeek, rather than a fully novel architecture trained from scratch. According to Fello AI’s review, Subquadratic has neither confirmed nor denied this, and the company has not yet released model weights or a full technical report.
A Glitchwire analysis also pointed to a widely circulated LessWrong post arguing that most subquadratic attention mechanisms to date are better understood as constant-factor improvements rather than true architectural shifts. The pattern historically has been that theoretical complexity claims don’t always survive contact with real hardware constraints.
On the other side, the company’s credentials aren’t trivial. Subquadratic raised $29 million in seed funding at a reported $500 million valuation from investors including Tinder co-founder Justin Mateen, former SoftBank Vision Fund partner Javier Villamizar, and early backers of Anthropic, OpenAI, Stripe, and Brex. The research team includes 11 PhD researchers with backgrounds from Meta, Google, Oxford, Cambridge, ByteDance, and Adobe.
What This Could Enable if the Claims Hold

If (and that’s a big if) SubQ’s performance claims survive independent scrutiny, the practical implications are significant. A 12 million token context window at linear cost would unlock application categories that current models can’t economically support.
Full codebase analysis is one of the most immediately tangible use cases. Today, developers working with AI coding assistants are limited by context windows that can only hold a fraction of a large repository. SubQ Code, the company’s CLI-based coding agent, is designed to load an entire codebase into a single context window, enabling planning, execution, and review across a full repository in one pass.
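SubQ Code’s interface hasn’t been publicly documented, but a rough feasibility check is straightforward. Here’s a sketch that estimates whether a repository fits in a 12 million token window, using a crude four-characters-per-token heuristic (real tokenizers vary):

```python
from pathlib import Path

CHARS_PER_TOKEN = 4          # crude heuristic; real tokenizers vary
CONTEXT_BUDGET = 12_000_000  # SubQ's claimed context window

def estimate_repo_tokens(root=".", exts=(".py", ".js", ".ts", ".go")):
    """Roughly estimate the token count of all source files under root."""
    total_chars = sum(
        len(p.read_text(errors="ignore"))
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in exts
    )
    return total_chars // CHARS_PER_TOKEN

tokens = estimate_repo_tokens()
print(f"~{tokens:,} tokens ({tokens / CONTEXT_BUDGET:.1%} of a 12M window)")
```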
Enterprise document processing is another major area. Organizations currently rely on complex retrieval-augmented generation (RAG) pipelines to work around context limitations, breaking documents into chunks, building search indices, and hoping the right chunks get retrieved at inference time. A model that can reliably reason over millions of tokens in a single pass could dramatically simplify these workflows.
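To see the complexity a large context window could remove, here’s a deliberately minimal sketch of the chunk-and-retrieve loop at the heart of RAG; toy keyword scoring stands in for a real embedding index, but the failure mode is the same:

```python
def chunk(text, size=500):
    """Step 1 of a typical RAG pipeline: split documents into chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def retrieve(query, chunks, k=3):
    """Step 2: score and select chunks. Toy keyword overlap stands in
    for a real embedding index, but the risk is identical: if the right
    chunk isn't retrieved, the model never sees it."""
    q_words = set(query.lower().split())
    ranked = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return ranked[:k]

# With a 12M token window, both steps could collapse into a single
# prompt containing the entire document set.
documents = "...megabytes of contracts, reports, and filings..."
context = "\n".join(retrieve("revenue by region", chunk(documents)))
```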
Long-horizon autonomous agents represent a more forward-looking application. Agents that maintain state across days of work, process ongoing interaction histories, or manage persistent tasks would benefit enormously from context windows measured in millions rather than thousands of tokens.
Subquadratic has announced three products entering private beta: the SubQ API with the full 12 million token context window, SubQ Code for coding tasks, and SubQ Search for deep research capabilities. The company has also stated it’s targeting a 50 million token context window by Q4 2026.
Frequently Asked Questions
What is Subquadratic?
Subquadratic is a Miami-based AI startup founded in 2024. The company focuses on building large language models with a novel architecture that scales linearly with context length instead of quadratically. It was previously called Aldea and initially focused on voice models before pivoting to attention architecture research. The company has 13 employees and is led by CEO Justin Dangel and CTO Alexander Whedon.
What is SubQ?
SubQ is Subquadratic’s first large language model. It’s built on an architecture called Subquadratic Selective Attention (SSA) and claims a context window of up to 12 million tokens, which is over 10x larger than what most frontier models currently offer. It’s available in private beta through an API, a coding agent (SubQ Code), and a search tool (SubQ Search).
What does “subquadratic” mean?
In computing, “quadratic” describes a relationship where doubling the input quadruples the processing cost. Traditional transformer models have quadratic attention, meaning the cost of processing an input grows with the square of its length. “Subquadratic” means the processing cost grows more slowly than that. In SubQ’s case, the company claims linear scaling, where doubling the input only doubles the cost.
What is Subquadratic Selective Attention (SSA)?
SSA is the core architecture behind SubQ. Instead of calculating relationships between every pair of tokens (which is how standard transformer attention works), SSA dynamically selects which token relationships matter and skips the rest. This selective approach is what allows SubQ to process much longer inputs without the compute costs spiraling out of control.
What is FlashAttention?
FlashAttention is a widely used optimization technique that makes standard transformer attention faster by improving how memory is accessed during computation. It doesn’t change the quadratic nature of attention itself, but it significantly speeds up the operation. Subquadratic claims that SubQ’s SSA mechanism is 52x faster than FlashAttention at 1 million tokens.
What is retrieval-augmented generation (RAG)?
RAG is a common technique used to work around context window limitations. Instead of feeding an entire document set to a model, RAG systems use a search step to find the most relevant chunks of information first, then pass only those chunks to the model. While effective, RAG adds complexity and can miss relevant information that falls outside the retrieved chunks. A much larger context window could reduce or eliminate the need for RAG in some workflows.
What is a context window?
A context window refers to the maximum amount of text (measured in tokens) that an AI model can process in a single interaction. Larger context windows allow the model to consider more information at once, which is useful for tasks like analyzing long documents, reviewing entire codebases, or maintaining extended conversations. Most frontier models currently offer context windows of up to 1 million tokens.