Analysis
Explore how AI models perform in Analysis. Compare rankings, scoring criteria, and recent benchmark examples.
Genre overview
Compare depth, reasoning quality, and clarity in analytical responses.
In this genre, the main abilities tested are Depth, Correctness, and Reasoning Quality.
Unlike explanation tasks, this genre rewards reading evidence and drawing justified conclusions more than an audience-friendly teaching style.
A high score here does not guarantee concise writing, strong humor, or practical execution details.
Strong models here are useful for option review, evidence comparison, decision support, and risk assessment.
This genre alone cannot tell you whether the model can implement code well, write polished business documents, or produce many creative ideas.
Top Models in This Genre
This ranking is ordered by average score within this genre only.
Last updated: Mar 29, 2026 12:05
| Rank | Model | Provider | Win Rate | Average Score | Wins | Comparisons |
|---|---|---|---|---|---|---|
| #1 | GPT-5.4 | OpenAI | 100% | 87 | 4 | 4 |
| #2 | GPT-5.2 | OpenAI | 100% | 87 | 4 | 4 |
| #3 | Claude Opus 4.6 | Anthropic | 75% | 87 | 3 | 4 |
| #4 | GPT-5 mini | OpenAI | 75% | 83 | 3 | 4 |
| #5 | Claude Sonnet 4.6 | Anthropic | 60% | 83 | 3 | 5 |
| #6 | Claude Haiku 4.5 | Anthropic | 50% | 83 | 2 | 4 |
| #7 | Gemini 2.5 Flash-Lite | Google | 0% | 76 | 0 | 5 |
| #8 | Gemini 2.5 Flash | Google | 0% | 76 | 0 | 5 |
| #9 | Gemini 2.5 Pro | Google | 0% | 73 | 0 | 3 |
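The ranking is described as ordered by average score, yet several models share a score (87, 83, 76). In the table, tied models appear in descending order of win rate, and the two counts at the end of each row behave like wins and total comparisons (for example, 3 of 4 gives 75%). The Python sketch below reproduces that ordering under those assumptions; the `Entry` class, the `rank` function, and the win-rate tie-break are hypothetical reconstructions, not the site's documented method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    model: str
    avg_score: int    # average score shown for this genre
    wins: int         # assumed: head-to-head wins
    comparisons: int  # assumed: total head-to-head comparisons

    @property
    def win_rate(self) -> float:
        # Win rate = wins / comparisons; 0.0 if the model has no comparisons yet.
        return self.wins / self.comparisons if self.comparisons else 0.0


# Values copied from the ranking table.
ENTRIES = [
    Entry("GPT-5.4", 87, 4, 4),
    Entry("GPT-5.2", 87, 4, 4),
    Entry("Claude Opus 4.6", 87, 3, 4),
    Entry("GPT-5 mini", 83, 3, 4),
    Entry("Claude Sonnet 4.6", 83, 3, 5),
    Entry("Claude Haiku 4.5", 83, 2, 4),
    Entry("Gemini 2.5 Flash-Lite", 76, 0, 5),
    Entry("Gemini 2.5 Flash", 76, 0, 5),
    Entry("Gemini 2.5 Pro", 73, 0, 3),
]


def rank(entries: list[Entry]) -> list[Entry]:
    # Primary key: average score (descending); tie-break: win rate (descending).
    # sorted() is stable, so rows tied on both keys keep their input order.
    return sorted(entries, key=lambda e: (-e.avg_score, -e.win_rate))


if __name__ == "__main__":
    # Feed the rows in reverse to show the ordering comes from the keys, not the input.
    for position, e in enumerate(rank(list(reversed(ENTRIES))), start=1):
        print(f"#{position} {e.model}: {e.avg_score} avg, {e.win_rate:.0%} win rate")
```

Rows that tie on both score and win rate (GPT-5.4 and GPT-5.2, or the two Gemini 2.5 Flash variants) cannot be ordered from the displayed data alone, so this sketch simply preserves whatever order it receives them in.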
What Is Evaluated in Analysis
Scoring criteria and weights used for this genre ranking.
Depth
25.0%
This criterion checks the depth of the answer. It is tied with Correctness for the heaviest weight because it strongly shapes the overall result in this genre.
Correctness
25.0%
This criterion checks whether the answer is correct. It is tied with Depth for the heaviest weight because it affects quality in a visible way, even if it is not the only thing that matters.
Reasoning Quality
20.0%
This criterion checks the quality of the reasoning in the answer. It carries meaningful weight because it affects quality in a visible way, even if it is not the only thing that matters.
Structure
15.0%
This criterion checks the structure of the answer. It is weighted more lightly because it supports the main goal rather than defining the genre by itself.
Clarity
15.0%
This criterion checks the clarity of the answer. It is weighted more lightly because it supports the main goal rather than defining the genre by itself.
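The five weights sum to 100%, which suggests a weighted average of per-criterion scores. The sketch below assumes each criterion is scored on a 0 to 100 scale and combined as a plain weighted sum; the page does not say how criterion scores are produced or aggregated, so `weighted_score` and the example scores are hypothetical.

```python
import math

# Weights from the scoring criteria for this genre.
WEIGHTS = {
    "Depth": 0.25,
    "Correctness": 0.25,
    "Reasoning Quality": 0.20,
    "Structure": 0.15,
    "Clarity": 0.15,
}
assert math.isclose(sum(WEIGHTS.values()), 1.0)


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 0-100) into one weighted score."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


if __name__ == "__main__":
    # Hypothetical response: strong on substance, weaker on presentation.
    example = {"Depth": 90, "Correctness": 90, "Reasoning Quality": 80,
               "Structure": 70, "Clarity": 70}
    print(weighted_score(example))  # prints 82.0
```

Under a weighted sum, a 10-point gain in Depth or Correctness moves the total by 2.5 points, while the same gain in Structure or Clarity moves it by only 1.5.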
Recent tasks
Urban Transit Policy Analysis
Analyze the three proposed transit policies for the fictional city of Riverbend. Based on the provided context, recommend the best policy for the city's long-term future. Your analysis should compare the options across key factors like cost, environmental impact, public acceptance, and effectiveness in reducing congestion. Justify your final recommendation with a clear, evidence-based argument.
Select the Most Effective School Attendance Intervention
A public middle school has a budget to fund one pilot program for the next academic year to reduce chronic absenteeism. Chronic absenteeism is defined here as missing 10% or more of school days. The school serves 600 students, and currently 18% are chronically absent. The principal wants the option that is most likely to reduce absenteeism in a meaningful and sustainable way within one year. The school is considering these three options:

Option A: Daily text-message reminders and attendance alerts
- Cost: $18,000 for software and staff time
- Target group: all families
- Evidence from similar districts: chronic absenteeism fell by 1.5 percentage points on average
- Risks: message fatigue, outdated phone numbers, limited effect for families facing serious barriers
- Operational notes: can be launched quickly and scaled easily

Option B: Two additional school social workers focused on high-risk students
- Cost: $95,000 for one year
- Target group: roughly 90 students with the highest absence rates
- Evidence from similar schools: among targeted students, average attendance improved enough to reduce schoolwide chronic absenteeism by about 4 percentage points when implementation was strong
- Risks: recruiting delays, benefits may depend heavily on staff quality, hard to sustain if grant funding ends
- Operational notes: allows individualized support for transportation, family crises, mental health, and housing instability

Option C: Free morning shuttle routes from two neighborhoods with poor attendance
- Cost: $52,000 for one year
- Target group: about 140 students in neighborhoods with low car ownership and unreliable public transit
- Evidence from similar programs: schoolwide chronic absenteeism fell by 2.5 percentage points on average where transportation was a major barrier
- Risks: only addresses one cause of absence, route design may miss some students, ongoing operating costs
- Operational notes: visible program, may improve punctuality as well as attendance

Additional context:
- A recent internal survey suggests the main reported reasons for absence are: transportation problems (30%), illness or caregiving duties (25%), anxiety or mental health concerns (20%), family instability such as housing or frequent moves (15%), and disengagement or other reasons (10%).
- The school has one part-time counselor already, but no dedicated attendance team.
- The district can likely continue funding a successful program next year only if the first-year results are clearly visible.

Task: Analyze the three options and recommend the single best pilot program. Your answer should compare trade-offs, consider the quality and limits of the evidence, and explain why your chosen option is better than the alternatives in this specific context.
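Several of the trade-offs in this task reduce to simple arithmetic on the figures given. The sketch below converts each option's reported percentage-point drop into a number of students and a cost per student moved below the 10% threshold. It treats each evidence figure as if it transferred directly to this school, which the prompt itself cautions against: Option B's 4-point figure applies only when implementation was strong, and Option C's 2.5-point figure comes from places where transportation was a major barrier (30% of reported absences here). Names such as `Option` and `summarize` are illustrative.

```python
from dataclasses import dataclass

ENROLLMENT = 600
BASELINE_PCT = 18.0  # share of students currently chronically absent


@dataclass(frozen=True)
class Option:
    name: str
    cost: float          # first-year cost in dollars
    reduction_pp: float  # reported drop in schoolwide chronic absenteeism (percentage points)


OPTIONS = [
    Option("A: text reminders", 18_000, 1.5),
    Option("B: two social workers", 95_000, 4.0),
    Option("C: morning shuttle", 52_000, 2.5),
]


def summarize(opt: Option) -> dict:
    # Percentage points of enrollment -> number of students no longer chronically absent.
    students = ENROLLMENT * opt.reduction_pp / 100
    return {
        "name": opt.name,
        "students": students,
        "new_rate_pct": BASELINE_PCT - opt.reduction_pp,
        "cost_per_student": opt.cost / students,
        "cost_per_pp": opt.cost / opt.reduction_pp,
    }


if __name__ == "__main__":
    print(f"Baseline: {ENROLLMENT * BASELINE_PCT / 100:.0f} chronically absent students")
    for opt in OPTIONS:
        s = summarize(opt)
        print(f"{s['name']}: {s['students']:.0f} students, "
              f"{s['new_rate_pct']:.1f}% remaining, "
              f"${s['cost_per_student']:,.0f} per student")
```

On these numbers Option A is cheapest per student (about $2,000, versus roughly $3,467 for C and $3,958 for B), while B promises the largest visible drop (from 18% to 14%). The arithmetic frames the trade-off; it does not settle how much of each figure would actually carry over.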
Analysis of a Four-Day Work Week Policy for a City
The city of Rivertown, a mid-sized municipality with approximately 2,000 city employees, is considering a proposal to switch to a four-day work week. Under this proposal, employees would work four 10-hour days instead of five 8-hour days, with no reduction in their weekly pay or benefits. The stated goals are to improve employee morale and work-life balance, attract and retain top talent in a competitive job market, and maintain or even increase overall productivity. Analyze the potential positive and negative consequences of this policy for Rivertown. Your analysis should consider the impacts on city services, the municipal budget, employee well-being, and the local economy. Conclude with a clear, justified recommendation on whether Rivertown should implement this policy, perhaps starting with a limited pilot program.
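Because 4 × 10 and 5 × 8 both equal 40 hours, the proposal keeps weekly paid hours constant; what changes is when those hours fall, which drives the city-services question. The sketch below compares daily person-hours for Rivertown's roughly 2,000 employees under the current schedule and two hypothetical 4×10 rollouts: everyone off on the same weekday, or five equal groups each taking a different weekday off. Both rollout patterns and all function names are assumptions for illustration, and the model ignores overtime rules, part-time staff, and services that already run seven days a week.

```python
EMPLOYEES = 2_000
WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]


def five_by_eight() -> dict[str, tuple[int, int]]:
    # Current schedule: everyone works 8 hours every weekday.
    return {day: (EMPLOYEES, 8) for day in WEEKDAYS}


def four_by_ten_common_day_off(day_off: str = "Fri") -> dict[str, tuple[int, int]]:
    # Hypothetical rollout: all offices close on the same weekday.
    return {day: (0 if day == day_off else EMPLOYEES, 10) for day in WEEKDAYS}


def four_by_ten_staggered() -> dict[str, tuple[int, int]]:
    # Hypothetical rollout: five equal groups, each off on a different weekday,
    # so four-fifths of staff are on duty every weekday.
    on_duty = EMPLOYEES * 4 // 5
    return {day: (on_duty, 10) for day in WEEKDAYS}


def person_hours(schedule: dict[str, tuple[int, int]]) -> dict[str, int]:
    # Person-hours per day = staff on duty x shift length.
    return {day: staff * hours for day, (staff, hours) in schedule.items()}


if __name__ == "__main__":
    for label, schedule in [("5x8", five_by_eight()),
                            ("4x10, common day off", four_by_ten_common_day_off()),
                            ("4x10, staggered", four_by_ten_staggered())]:
        daily = person_hours(schedule)
        print(f"{label}: {daily} -> weekly total {sum(daily.values())}")
```

All three schedules total 80,000 person-hours a week. The common-day-off pattern leaves one weekday with no coverage, while staggering keeps every weekday covered for 10 hours with 1,600 staff on duty, at the cost of more complex scheduling.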
Rivertown Congestion Charge Policy Analysis
The city council of Rivertown, a mid-sized city with a population of 500,000, is considering implementing a congestion charge. This would require drivers to pay a fee to enter the downtown business district between 7 AM and 7 PM on weekdays. The stated goals are to reduce traffic congestion, lower air pollution, and generate revenue for improving public transportation (buses and a new light rail line). Analyze the potential positive and negative consequences of this proposed policy. Your analysis should consider the impact on at least three different groups of people (e.g., downtown business owners, low-income commuters who drive to work, suburban families, environmental groups). Conclude with a clear, justified recommendation on whether Rivertown should implement the congestion charge, perhaps with specific suggestions for how to mitigate the negative impacts.
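The charging rule in the proposal (weekdays, 7 AM to 7 PM) is simple enough to state as code, which makes its edge cases explicit. This minimal sketch treats 7 PM as the end of the window, so an entry at exactly 19:00 is free, and it ignores public holidays; the prompt settles neither point, so both are assumptions, as are the names `is_charged` and `charged_hours_per_week`.

```python
from datetime import datetime, time, timedelta

CHARGE_START = time(7, 0)   # 7 AM
CHARGE_END = time(19, 0)    # 7 PM, treated as exclusive (assumption)


def is_charged(entry: datetime) -> bool:
    """Return True if entering downtown at `entry` would incur the charge."""
    # datetime.weekday(): Monday == 0 ... Sunday == 6
    if entry.weekday() >= 5:
        return False
    return CHARGE_START <= entry.time() < CHARGE_END


def charged_hours_per_week(week_start: datetime) -> int:
    # Sample the start of each hour in a 7-day window starting at week_start.
    return sum(is_charged(week_start + timedelta(hours=h)) for h in range(7 * 24))


if __name__ == "__main__":
    monday = datetime(2026, 3, 30)  # a Monday
    print(is_charged(monday.replace(hour=8)))   # True: weekday morning
    print(is_charged(monday.replace(hour=19)))  # False: window ends at 19:00
    print(charged_hours_per_week(monday))       # 60: 12 hours x 5 weekdays
```

Writing the rule down this way surfaces questions the council would need to answer anyway, such as whether drivers already inside the zone at 7 AM pay, and whether holidays are exempt.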
Analyze a Proposed City Ordinance on Plastic Bags
You are a neutral policy analyst for the Rivertown City Council. Based on the provided context, write an analysis of the proposed ban on single-use plastic bags. Your analysis should:
1. Evaluate the potential environmental, economic, and social impacts of the ban.
2. Assess the arguments presented by both the 'Friends of the Rivertown River' and the 'Rivertown Small Business Alliance'.
3. Conclude with a clear, justified recommendation to the City Council.

Your recommendation could be to pass the ordinance as is, reject it, or suggest specific modifications.
Evaluating Evidence in a Product Recall Decision
A consumer electronics company, VoltTech, manufactures a popular portable phone charger called the PowerPak 3000. Over the past six months, the company has received the following reports and data:

1. Customer complaints: 47 reports of the device overheating during use, out of approximately 820,000 units sold. Of these, 12 customers reported minor burns, and 3 reported small fires that were quickly contained.
2. Internal testing: VoltTech's quality assurance team tested 500 units from recent production batches. They found that 2.4% of units exhibited higher-than-normal thermal output under sustained maximum load, but all remained within the technical safety threshold defined by the relevant UL certification standard.
3. A competitor's similar product was recalled last month for a comparable overheating issue, generating significant media coverage and public concern about portable charger safety in general.
4. An independent consumer safety blog published an article claiming the PowerPak 3000 has a "dangerous design flaw," based on teardown analysis of a single unit purchased from a third-party reseller. VoltTech has not verified whether that unit was genuine or counterfeit.
5. VoltTech's legal team estimates that a voluntary recall would cost approximately $14 million, while continuing sales without action and facing potential future litigation could cost between $2 million (if no serious incidents occur) and $40 million (if a serious injury or property damage lawsuit succeeds).

Analyze the evidence above and recommend whether VoltTech should issue a voluntary recall, implement a lesser corrective action (such as a firmware update, warning label addition, or exchange program), or take no action. Justify your recommendation by evaluating the strength and limitations of each piece of evidence, weighing the risks, and explaining your reasoning clearly.
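Part of weighing this evidence is putting the counts on a common scale and asking how likely a serious lawsuit would have to be before a recall becomes the cheaper path. The sketch below does both. Its cost model is a deliberate simplification: it treats the legal team's $2 million and $40 million figures as the only two outcomes of taking no action, ignores reputational costs and the unpriced lesser corrective actions, and extrapolates the 2.4% test result to all units sold, even though the sample came only from recent batches and every tested unit stayed within the UL threshold. All function names are illustrative.

```python
UNITS_SOLD = 820_000
OVERHEAT_REPORTS = 47
MINOR_BURNS = 12
SMALL_FIRES = 3
TESTED = 500
ELEVATED_SHARE = 0.024  # 2.4% ran hotter than normal, still within the UL threshold

RECALL_COST_M = 14.0      # $ millions
LITIGATION_LOW_M = 2.0    # no serious incidents
LITIGATION_HIGH_M = 40.0  # serious injury or property-damage suit succeeds


def per_million(count: int) -> float:
    # Reported incidents per million units sold.
    return count / UNITS_SOLD * 1_000_000


def expected_no_action_cost(p_serious: float) -> float:
    # Two-outcome simplification: low cost with probability 1 - p, high with probability p.
    return (1 - p_serious) * LITIGATION_LOW_M + p_serious * LITIGATION_HIGH_M


def break_even_probability() -> float:
    # p at which expected no-action cost equals the recall cost:
    # LOW + p * (HIGH - LOW) = RECALL  ->  p = (RECALL - LOW) / (HIGH - LOW)
    return (RECALL_COST_M - LITIGATION_LOW_M) / (LITIGATION_HIGH_M - LITIGATION_LOW_M)


if __name__ == "__main__":
    for label, n in [("overheating reports", OVERHEAT_REPORTS),
                     ("minor burns", MINOR_BURNS), ("small fires", SMALL_FIRES)]:
        print(f"{label}: {per_million(n):.1f} per million units")
    print(f"elevated units in sample: {round(TESTED * ELEVATED_SHARE)} of {TESTED}")
    print(f"same share across all units sold: {UNITS_SOLD * ELEVATED_SHARE:,.0f}")
    print(f"break-even probability of a serious suit: {break_even_probability():.1%}")
```

Under this model, taking no action has the lower expected cost only if the chance of a successful serious suit is below about 32% (12/38). The gap between 47 complaints and roughly 19,680 units with elevated heat output, if the sample were representative, is also a prompt to ask whether complaints undercount incidents or whether elevated output rarely leads to harm.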