
A Tensor Processing Unit, or TPU, is a custom-built chip that Google designs specifically to accelerate artificial intelligence workloads. Unlike general-purpose processors, TPUs are optimized for the high-volume, low-precision math that powers machine learning. Google has been developing and refining these chips for over a decade, and they’ve quietly become a cornerstone of the infrastructure behind products like Gemini, Google Search, and many of the AI services millions of people use every day. Now, with the announcement of its eighth-generation TPUs, Google is making its boldest architectural decision yet: splitting its chip lineup into two specialized processors, one for training AI models and one for running them.
In this article, we’ll discuss what Google’s new TPU 8t and TPU 8i chips are, why the company decided to separate training and inference into distinct processors, and what this move signals about the broader AI hardware landscape. We’ll also look at how this positions Google in its ongoing competition with Nvidia, which currently dominates the AI accelerator market, and explore what the rise of custom silicon means for businesses and developers building on cloud infrastructure.
TL;DR Snapshot
Google has unveiled two new chips, one for training and one for inference, marking the first time it has split these workloads across purpose-built processors. The decision reflects the growing complexity of AI systems and the distinct performance demands that training and serving models place on hardware. Both chips were co-designed with Google DeepMind and will be available later this year through Google Cloud.
Key takeaways include…
- Google’s eighth-generation TPU lineup splits into the TPU 8t and TPU 8i, each optimized for fundamentally different workloads with specialized architectures.
- The TPU 8t delivers nearly 3x the per-pod compute performance of the previous generation, while the TPU 8i offers 80% better performance-per-dollar for inference tasks.
- This move represents a direct challenge to Nvidia’s dominance in AI chips, as the custom ASIC market is projected to grow 45% in 2026 compared to 16% growth in GPU shipments.
Who should read this: Cloud architects, AI engineers, tech investors, and anyone tracking the evolving competition in AI infrastructure.
Why Google Is Splitting Training and Inference Apart
For years, Google’s TPUs served double duty. A single chip architecture handled both training (the process of teaching an AI model by feeding it massive amounts of data) and inference (the process of running a trained model to generate responses). But as AI models have grown dramatically in size and complexity, the performance requirements for these two tasks have diverged significantly.
Training is all about raw compute throughput. It involves processing enormous datasets across thousands of chips simultaneously, and the bottleneck is usually how fast data can move between processors. Inference, on the other hand, is about speed and efficiency at the point of delivery. When a user asks an AI assistant a question, the response needs to come back in seconds, not minutes. Latency and cost-per-query matter far more than peak computational power.
According to Google’s official blog post, the company anticipated this divergence several years ago during the hardware development cycle for the eighth-generation TPU. As Amin Vahdat, Google’s SVP and Chief Technologist for AI and Infrastructure, put it, the key insight behind the TPU program has always been that co-designing silicon with hardware, networking, and software delivers better power efficiency and absolute performance. The split into two specialized chips is the natural evolution of that philosophy.
The timing also aligns with the emergence of agentic AI, where multiple specialized models work together in complex, iterative loops to complete tasks. These “swarms” of agents place unique demands on inference hardware because even small inefficiencies get amplified when agents interact at scale.
Inside the TPU 8t and TPU 8i

The TPU 8t is Google’s training powerhouse. According to Data Center Dynamics, a single TPU 8t superpod scales to 9,600 chips, offering two petabytes of high-bandwidth memory and 121 exaflops of FP4 compute performance. That’s nearly triple the per-pod compute performance of Google’s previous generation, Ironwood.
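To put the superpod figures in perspective, here is a back-of-envelope calculation dividing the published pod totals by the chip count. The per-chip numbers below are our own arithmetic from the figures above, not official per-chip specifications:

```python
# Back-of-envelope per-chip figures derived from the published TPU 8t
# superpod specs: 9,600 chips, 2 PB of HBM, 121 exaflops of FP4 compute.
# These derived per-chip values are illustrative, not official specs.

CHIPS_PER_SUPERPOD = 9_600
HBM_TOTAL_BYTES = 2e15          # 2 petabytes
FP4_POD_FLOPS = 121e18          # 121 exaflops

hbm_per_chip_gb = HBM_TOTAL_BYTES / CHIPS_PER_SUPERPOD / 1e9
fp4_per_chip_pflops = FP4_POD_FLOPS / CHIPS_PER_SUPERPOD / 1e15

print(f"HBM per chip:  ~{hbm_per_chip_gb:.0f} GB")
print(f"FP4 per chip:  ~{fp4_per_chip_pflops:.1f} PFLOPS")
```

Dividing through suggests roughly 200 GB of HBM and on the order of 12–13 FP4 petaflops per chip, which gives a sense of how much of the headline performance comes from sheer scale-out rather than any single die.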
Google has also introduced two new technologies to reduce data bottlenecks. TPUDirect RDMA allows data to move directly between memory and network interface cards without routing through the host CPU, cutting latency. TPU Direct Storage does something similar for storage access, which Google says effectively doubles bandwidth for large data transfers.
On the networking side, Google’s new Virgo Network supports a fourfold increase in data center bandwidth, and when combined with JAX and Pathways software, the system can scale to over one million TPU chips in a single training cluster. Google is also targeting over 97% “goodput,” a measure of how much of the cluster’s time is spent doing useful work rather than dealing with hardware failures or restarts.
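The goodput metric described above is simple to state precisely. The sketch below uses illustrative names and numbers of our own, not Google’s tooling, to show how the 97% target translates into tolerable downtime on a long training run:

```python
# Sketch of the "goodput" metric as described in the text: the fraction of
# wall-clock time a training cluster spends doing useful work rather than
# recovering from failures or restarts. Function and figures are illustrative.

def goodput(useful_seconds: float, total_seconds: float) -> float:
    """Fraction of total cluster time spent on useful training work."""
    return useful_seconds / total_seconds

# Example: a 30-day run that loses 18 hours to restarts and checkpoint
# restores still clears the 97% goodput target.
total = 30 * 24 * 3600.0
lost = 18 * 3600.0
print(f"goodput = {goodput(total - lost, total):.3f}")  # 0.975
```

In other words, at this scale a cluster can afford less than a day of cumulative recovery time per month before it falls below the stated target.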
The TPU 8i tackles a different problem entirely. It’s designed for inference workloads where low latency and cost efficiency are paramount. According to Google’s blog post, the chip includes several innovations aimed at eliminating the “waiting room” effect, where user requests are intentionally queued to maximize hardware utilization at the expense of response time.
The TPU 8i pairs 288 GB of high-bandwidth memory with 384 MB of on-chip SRAM (three times more than the previous generation) to keep a model’s active working set entirely on-chip. It also features a new Collectives Acceleration Engine that reduces on-chip latency by up to 5x. Google reports that these innovations deliver 80% better performance-per-dollar compared to Ironwood, meaning businesses could serve nearly twice the customer volume at the same cost.
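The jump from “80% better performance-per-dollar” to “nearly twice the customer volume” is straightforward arithmetic, sketched below. The query and dollar figures are made up for illustration; only the 1.8x ratio comes from Google’s claim:

```python
# Illustrative arithmetic behind "80% better performance-per-dollar":
# at a fixed budget, 1.8x the performance per dollar means 1.8x the
# queries served, i.e. "nearly twice the customer volume at the same cost."
# Baseline and budget values are hypothetical.

baseline_queries_per_dollar = 1_000.0   # hypothetical Ironwood figure
improvement = 1.80                      # 80% better performance-per-dollar

tpu8i_queries_per_dollar = baseline_queries_per_dollar * improvement

budget = 500.0  # dollars, arbitrary
print(f"Ironwood: {baseline_queries_per_dollar * budget:,.0f} queries")
print(f"TPU 8i:   {tpu8i_queries_per_dollar * budget:,.0f} queries")
```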
The chip is scalable to 1,152 chips per pod and delivers 11.6 exaflops of FP8 compute performance per pod. Both the TPU 8t and 8i run on Google’s custom Axion Arm-based CPUs and use fourth-generation liquid cooling technology.
What This Means for the AI Chip Wars
Google’s decision to split its TPU lineup is as much a competitive statement as it is a technical one. Nvidia currently commands roughly 80% of the AI accelerator market by revenue, according to Silicon Analysts, which estimates that figure peaked near 87% in 2024 and projects it will decline to around 75% by 2026. The decline isn’t because Nvidia is shrinking; it’s because the total market is growing faster than any single vendor can supply.
Google isn’t alone in pursuing custom chips. Amazon has its Trainium and Inferentia processors, and Microsoft has its Maia chip. But Google’s approach stands out for its ambition and its supply chain strategy. According to The Next Web, Google is working with a few design partners: Broadcom is building the TPU 8t training chip (codenamed “Sunfish”), MediaTek is designing the TPU 8i inference chip (codenamed “Zebrafish”), and Marvell is in talks for additional components. The same report notes that TrendForce projects custom chip sales will increase 45% in 2026, compared to 16% growth in GPU shipments.
Google’s projected TPU shipments tell a story of massive scale. The company is expected to ship around 4.3 million TPU units in 2026, with plans to scale beyond 35 million by 2028. And major customers are already lining up. Anthropic, for example, has committed to up to a million TPU chips, with access to roughly 3.5 gigawatts of TPU-based compute starting in 2027.
The broader implication here is that the AI chip market is fragmenting. Nvidia’s CUDA ecosystem still represents a powerful moat, but the hyperscalers are building alternatives that offer compelling economics for their specific workloads. For businesses evaluating cloud AI infrastructure, this means more options, more competitive pricing, and a growing reason to think carefully about vendor lock-in.
Frequently Asked Questions
What is a TPU?
A TPU, or Tensor Processing Unit, is a type of custom chip designed by Google specifically for machine learning workloads. Unlike general-purpose GPUs, TPUs are optimized for the matrix math operations that underpin neural networks. Google first announced the TPU in 2016, and the chip has gone through eight generations since then. TPUs power many of Google’s internal AI services and are also available to external customers through Google Cloud.
What’s the difference between training and inference?
Training is the process of building an AI model by feeding it large datasets and adjusting its parameters until it learns to perform a specific task. It’s computationally intense and can take weeks or months on large clusters of chips. Inference is what happens after a model is trained. It’s the process of using the finished model to generate predictions, answers, or actions in real time. Training prioritizes raw compute throughput, while inference prioritizes low latency and cost efficiency.
What is agentic AI?
Agentic AI refers to AI systems that can autonomously reason through problems, execute multi-step workflows, and learn from their own actions. Rather than simply responding to a single prompt, agentic systems often involve multiple specialized models (sometimes called “agents”) working together in iterative loops. This places unique demands on infrastructure because the agents need to communicate rapidly and efficiently.
What roles do Broadcom and MediaTek play?
Broadcom and MediaTek are semiconductor companies that design and manufacture chips for a wide range of applications. In the context of Google’s TPU 8 program, Broadcom is responsible for designing the TPU 8t training chip (codenamed “Sunfish”), while MediaTek is designing the TPU 8i inference chip (codenamed “Zebrafish”). Both chips are expected to be manufactured using TSMC’s 2-nanometer process and are targeted for availability in late 2027.
What is a superpod?
A superpod is a large-scale cluster of TPU chips connected together to function as a single, unified computing system. Google’s TPU 8t superpod, for example, connects up to 9,600 chips with shared high-bandwidth memory, allowing massive AI models to train across all of them simultaneously. Superpods are designed to maximize both performance and reliability at scale.
Other Enterprise AI Articles You May Be Interested In
Can Flexible Data Centers Fix the AI Energy Crisis? A New Santa Clara Pilot Aims to Find Out
What 3DIC Is and Why It Matters for AI Chips: Alchip’s New Platform Explained
OpenAI’s GPT-Rosalind: A New AI Model Purpose-Built for Life Sciences Research
Claude Opus 4.7: Everything You Need to Know About Anthropic’s Latest AI Model
What Is Composable AI Decisioning? GrowthLoop’s New Platform Explained
