
Critique is a new feature within Microsoft 365 Copilot’s Researcher agent that pairs OpenAI’s GPT and Anthropic’s Claude in a single workflow, where one model drafts a research response and the other reviews it for accuracy, completeness, and citation quality before the user ever sees the result. Announced on March 30, 2026, it represents a notable shift in enterprise AI strategy. Instead of relying on a single LLM, it uses a collaborative, multi-model architecture designed to reduce hallucinations and improve the trustworthiness of AI-generated research.
In this article, we’ll discuss how Critique works under the hood, why Microsoft chose to combine competing models rather than bet on one, and what this means for knowledge workers who depend on AI for deep research tasks. We’ll also cover the companion features announced alongside Critique (including Council mode and Copilot Cowork), and look at the benchmark results that suggest this multi-model approach is more than just a marketing ploy.
TL;DR Snapshot
Critique is Microsoft’s answer to a growing problem in enterprise AI: even the best models hallucinate. By running OpenAI’s GPT as the drafter and Anthropic’s Claude as the reviewer, Copilot’s Researcher agent now puts every response through a built-in quality check before delivering it to the user. The result is a measurable improvement in research accuracy and a signal that the future of AI productivity tools may be multi-model by default.
Key takeaways:
- Multi-model quality control is here. Critique uses GPT to generate research responses and Claude to audit them for factual accuracy and citation integrity, producing a 13.8% improvement on the DRACO deep-research benchmark.
- Council mode adds transparency. A new side-by-side comparison feature lets users see how different AI models answer the same query, including where they agree and where they diverge.
- This is part of a bigger play. Alongside Critique, Microsoft is expanding access to Copilot Cowork, a Claude-powered agent that can plan and execute multi-step tasks autonomously, signaling that the company’s AI strategy is now firmly multi-vendor.
Who should read this: Enterprise technology leaders, knowledge workers, AI strategists, and anyone evaluating deep-research tools for their organization.
How Critique Works: One Model Drafts, Another Reviews
At its core, Critique introduces a sequential workflow into the Researcher agent. When a user submits a research query, OpenAI’s GPT handles the initial response, synthesizing information from web sources and enterprise data connectors like Salesforce, ServiceNow, and Confluence. Before that response reaches the user, however, Anthropic’s Claude steps in as a reviewer. It evaluates the draft for factual accuracy, checks the completeness of the analysis, and verifies that citations are properly sourced and attributed.
This mirrors a pattern that many developers and researchers already use informally: drafting with one model and reviewing with another from a different model family. Microsoft has simply formalized and automated it within the Copilot interface. Nicole Herskowitz, corporate vice president of Microsoft 365 and Copilot, framed the approach in an interview with Reuters as going beyond simply offering multiple models, and instead making them actively collaborate to produce better outcomes.
Currently, the workflow is one-directional: GPT always drafts, and Claude always reviews. Microsoft has indicated plans to make the process bidirectional, though, which would give users more flexibility in how they leverage the strengths of each model.
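Conceptually, the workflow above is a simple two-stage pipeline. The sketch below is an illustrative assumption, not Microsoft's actual implementation: the function names (`draft_response`, `review_response`, `critique_pipeline`) and the stub logic are hypothetical stand-ins for real model calls.

```python
def draft_response(query: str) -> str:
    """Stand-in for the drafting model (GPT in Critique's current setup)."""
    return f"[draft] {query}: findings with citations"

def review_response(query: str, draft: str) -> dict:
    """Stand-in for the reviewing model (Claude), auditing the draft for
    factual accuracy, completeness, and citation quality."""
    issues = [] if "citations" in draft else ["missing citations"]
    return {"approved": not issues, "issues": issues, "response": draft}

def critique_pipeline(query: str) -> dict:
    """One-directional workflow: the drafter runs first, and the reviewer
    audits the result before anything reaches the user."""
    draft = draft_response(query)
    return review_response(query, draft)

result = critique_pipeline("EU AI Act compliance deadlines")
print(result["approved"])
```

The key design point is sequencing: the reviewer sees both the original query and the draft, so it can judge completeness against what was actually asked, not just check the draft in isolation.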
The Benchmark Results: What DRACO Tells Us
Microsoft is backing up its claims with numbers from the DRACO benchmark, an industry evaluation framework developed by Perplexity AI. DRACO, which stands for Deep Research Accuracy, Completeness, and Objectivity, measures AI systems across 100 complex research tasks spanning domains like law, medicine, finance, and technology. Each task is graded against expert-crafted rubrics covering factual accuracy, analytical depth, presentation quality, and citation reliability.

According to data shared by The New Stack, Claude Opus 4.6 scored a 42.7 on the DRACO benchmark when operating as a standalone model, and a 50.4 when embedded within Perplexity’s Deep Research mode. Copilot’s Researcher with Critique enabled scored a 57.4, significantly higher than any individual model or competing deep-research tool from OpenAI, Anthropic, Perplexity, or even de-facto search king Google. This suggests that the multi-model review process is catching and correcting errors that individual models typically miss on their own.
It’s worth noting that DRACO was originally built to evaluate Perplexity’s own deep-research system, so it isn’t a Microsoft-controlled benchmark. The fact that Researcher with Critique outperforms on a third-party evaluation lends some credibility to the results, though independent verification will be important as the feature matures.
Council Mode and the Rise of Multi-Model Transparency
Critique isn’t the only new feature in this update. Microsoft also introduced Council, a mode that presents users with side-by-side responses from different AI models for the same query. Rather than trusting a single model’s output, users can compare answers, review where models agree, and, critically, see where they disagree.
This is a meaningful shift in how people interact with AI tools. Instead of treating any single model as an authority, Council encourages users to become active evaluators. For tasks that require nuanced judgment (competitive analysis, legal research, strategic planning, and the like), seeing the range of responses can surface blind spots that a single model might miss.
Council is available today through the model picker in Copilot’s Researcher agent, alongside Critique. Both are accessible to anyone with a Microsoft 365 Copilot license.
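The Council pattern described above amounts to fanning the same query out to several models and presenting the answers together. Here's a minimal hypothetical sketch; the `ask` stub and the model labels are assumptions for illustration, not Microsoft's API.

```python
def ask(model: str, query: str) -> str:
    # Stand-in for a provider-specific model call (e.g., GPT or Claude).
    return f"{model} answer to: {query}"

def council(query: str, models: list[str]) -> dict[str, str]:
    # Fan the same query out to each model so the user can compare
    # where the responses agree and where they diverge.
    return {model: ask(model, query) for model in models}

answers = council("Summarize the DRACO benchmark", ["gpt", "claude"])
for model, answer in answers.items():
    print(f"{model}: {answer}")
```

In a real implementation the calls would run concurrently, but the core idea is the same: one query in, N labeled answers out, with the comparison left to the user rather than collapsed into a single response.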
What This Means for the Enterprise AI Landscape
Microsoft’s announcement arrives at a competitive moment. The company reported 15 million paid Copilot seats as of January 2026, representing roughly 3.3% of its 450 million commercial Microsoft 365 users. That adoption rate suggests there’s still a large gap between availability and actual enterprise uptake. Features like Critique and Council appear designed to close that gap by addressing one of the most common objections to AI-assisted research: trustworthiness.
The broader trend here extends well beyond Microsoft. The era of single-model dominance appears to be giving way to multi-model orchestration, where the competitive advantage isn’t which model you use, but how well you combine them. Google has been expanding Gemini across its Workspace apps, Anthropic’s Claude has been gaining enterprise traction on its own, and OpenAI continues to push ChatGPT Enterprise. Microsoft’s bet is that by sitting at the orchestration layer (i.e. pulling in models from multiple providers and making them collaborate), it can offer something none of the individual model makers can deliver alone.
Alongside these Researcher upgrades, Microsoft also expanded access to Copilot Cowork through its Frontier early-access program. Copilot Cowork is a Claude-powered agent designed for delegating longer-running, multi-step tasks. Users describe what they want done, and the agent plans, reasons across files and tools, and executes, checking in with the user when it needs direction. It’s another signal that Microsoft sees the future of enterprise AI as agentic, collaborative, and fundamentally multi-model.
Frequently Asked Questions

What is Microsoft 365 Copilot?
Microsoft 365 Copilot is an AI assistant integrated across Microsoft’s suite of productivity applications, including Word, Excel, PowerPoint, Outlook, and Teams. It uses large language models to help users draft documents, analyze data, summarize meetings, and conduct research. It’s available as an add-on license for commercial Microsoft 365 subscribers.

What is the Researcher agent?
Researcher is an AI agent within Microsoft 365 Copilot designed for complex, multi-step research tasks. It can synthesize information from both web sources and enterprise data connectors (such as Salesforce, ServiceNow, and Confluence) to produce detailed reports, go-to-market strategies, competitor analyses, and more.

What is the DRACO benchmark?
DRACO stands for Deep Research Accuracy, Completeness, and Objectivity. It’s a benchmark developed by Perplexity AI that evaluates AI systems on 100 complex research tasks across 10 domains, using expert-crafted rubrics. It measures factual accuracy, analytical depth, presentation quality, and citation reliability.

What is Copilot Cowork?
Copilot Cowork is an agentic AI tool within Microsoft 365 that lets users delegate multi-step tasks to an AI agent powered by Anthropic’s Claude. The agent can plan its own workflow, execute tasks across Microsoft 365 apps and connected data sources, and check in with the user when it needs guidance. It’s currently available through Microsoft’s Frontier early-access program.

What is the Frontier program?
Microsoft’s Frontier program provides select enterprise customers with early access to the company’s newest AI features before they become generally available. Copilot Cowork is currently rolling out through this program.

What is multi-model intelligence?
Multi-model intelligence refers to the practice of using more than one AI model within a single workflow to improve output quality. In the case of Critique, it means using GPT for drafting and Claude for reviewing, combining the strengths of both models rather than relying on either one alone.
