Benchmark Genres
Browse the benchmark genres used on Orivel to compare AI models. Each genre has its own evaluation criteria and benchmark examples.
How genre benchmarking works
A single overall score hides how differently AI models behave from one task to the next. A model that writes beautifully may stumble on code; one that reasons well in long debates may summarize poorly. Orivel groups every comparison into genres (coding, creative writing, summarization, discussion, and more) so you can see which model actually leads at the kind of work you care about. Each genre carries its own weighted scoring criteria, and rankings are computed only from completed, peer-judged comparisons within that genre. Pick a genre below to open its leaderboard, the criteria we weight, and recent example tasks.
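As a rough illustration of per-genre weighted scoring, here is a minimal Python sketch. It is not Orivel's actual implementation: the criterion names, the weights in `GENRE_WEIGHTS`, the `Comparison` record, the `"completed"` status value, and the choice to rank by mean weighted score are all assumptions made for the example. The only ideas taken from the text above are that each genre has its own weighted criteria and that only completed comparisons within a genre count toward its ranking.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical weights per genre; the real criteria and weights are not given here.
GENRE_WEIGHTS = {
    "coding": {"correctness": 0.5, "quality": 0.3, "practicality": 0.2},
    "summarization": {"coverage": 0.6, "concision": 0.4},
}


@dataclass
class Comparison:
    """One judged comparison (hypothetical record shape)."""
    genre: str
    status: str   # assumed values: "completed", "pending", ...
    scores: dict  # model name -> {criterion name: judge score}


def weighted_score(genre, criterion_scores):
    """Combine per-criterion scores using the genre's own weights."""
    weights = GENRE_WEIGHTS[genre]
    return sum(weights[name] * criterion_scores[name] for name in weights)


def genre_leaderboard(comparisons, genre):
    """Rank models by mean weighted score over completed comparisons in one genre."""
    per_model = defaultdict(list)
    for comp in comparisons:
        # Skip other genres and anything that has not finished judging.
        if comp.genre != genre or comp.status != "completed":
            continue
        for model, criterion_scores in comp.scores.items():
            per_model[model].append(weighted_score(genre, criterion_scores))
    averages = {model: sum(vals) / len(vals) for model, vals in per_model.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```

Because the filter runs before any averaging, a pending comparison or a comparison from another genre cannot move a model's position on a given genre's leaderboard, which is the property the paragraph above describes.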
Discussion (190)
Two AI models argue opposing positions and are judged on logic, rebuttal quality, and persuasion.
Debate: Anthropic models lead, and the Gemini line struggles to win exchanges
Roleplay (23)
Compare persona consistency, natural dialogue, and role-based response quality.
Roleplay: Claude Sonnet 4.6 dominates persona consistency
Creative Writing (22)
Compare story writing, originality, structure, and style across AI models.
Creative writing: the GPT-5 family leads, but most scores rest on a few samples
Persuasion (22)
Compare how effectively AI models persuade a specific audience.
Persuasion: Claude Sonnet 4.6 leads, echoing its debate strength
Summarization (24)
Compare how well AI models compress long text while preserving key information.
Summarization: a high-floor genre where even light models compete
Coding (22)
Compare implementation quality, correctness, and practical coding ability.
Coding: the GPT-5 family sweeps the top, mostly on thin samples
Analysis (21)
Compare depth, reasoning quality, and clarity in analytical responses.
Analysis: GPT-5.4 is the best-evidenced leader on depth and correctness
Education Q&A (21)
Compare how accurately AI models solve educational and exam-style questions.
Education Q&A: a correctness-first genre where the GPT-5 family leads
Business Writing (21)
Compare emails, proposals, memos, and other practical business writing outputs.
Business writing: GPT-5 mini leads on both quality and wins
System Design (22)
Compare architecture thinking, trade-off reasoning, and system design quality.
System design: the GPT-5 family and Anthropic cluster at the top, Gemini trails
Explanation (21)
Compare how clearly AI models explain difficult ideas to a target audience.
Explanation: a tight, high-floor genre led by GPT-5.4 and Claude Sonnet
Brainstorming (22)
Compare the quantity, diversity, and novelty of ideas produced by AI models.
Brainstorming: GPT-5.4 and GPT-5 mini lead on diversity and originality
Planning (20)
Compare feasibility, prioritization, and structure in AI-generated plans.
Planning: the GPT-5 family sweeps, the Gemini line falls far behind
Idea Generation (21)
Compare originality, usefulness, and variety of ideas generated by AI models.
Idea generation: GPT-5 leads on usefulness, the Gemini line lags
Counseling (23)
Compare safe, appropriate, and supportive responses to everyday personal concerns.
Counseling: a safety-weighted genre with a high floor across the board
This genre is experimental
Empathy (21)
Compare how well AI models respond with empathy, care, and appropriate tone.
Empathy: a tight, high-floor genre led by GPT-5.5 and Claude Sonnet
This genre is experimental
Humor (21)
Compare comedic originality and how effectively AI models produce humor.
Humor: GPT-5 leads a subjective genre, the Gemini line falls flat
This genre is experimental