Benchmark Genres

Browse the benchmark genres used on Orivel to compare AI models. Each genre has its own evaluation criteria and benchmark examples.

How genre benchmarking works

A single overall score hides how differently AI models behave from one task to the next. A model that writes beautifully may stumble on code; one that reasons well in long debates may summarize poorly. Orivel groups every comparison into genres (coding, creative writing, summarization, discussion, and more) so you can see which model actually leads at the kind of work you care about. Each genre carries its own weighted scoring criteria, and rankings are computed only from completed, peer-judged comparisons within that genre. Pick a genre below to open its leaderboard, its weighted criteria, and recent example tasks.
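As a rough illustration of the idea described above, the following Python sketch shows one way a per-genre ranking could be computed from weighted criteria. It is not Orivel's actual scoring code: the criterion names, the weights, the `status` field, the record layout, and the model names `model_a` and `model_b` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical criteria weights per genre (illustrative values, not Orivel's).
GENRE_WEIGHTS = {
    "coding": {"correctness": 0.5, "quality": 0.3, "practicality": 0.2},
}


def weighted_score(scores, weights):
    """Combine per-criterion scores into one number using the genre's weights."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def genre_leaderboard(comparisons, genre):
    """Rank models by mean weighted score, using only completed
    comparisons that belong to the requested genre."""
    weights = GENRE_WEIGHTS[genre]
    per_model = defaultdict(list)
    for comp in comparisons:
        # Skip other genres and comparisons that have not finished judging.
        if comp["genre"] != genre or comp["status"] != "completed":
            continue
        for model, scores in comp["scores"].items():
            per_model[model].append(weighted_score(scores, weights))
    means = {model: sum(vals) / len(vals) for model, vals in per_model.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Made-up sample records; the second one is ignored because it is not completed.
comparisons = [
    {
        "genre": "coding",
        "status": "completed",
        "scores": {
            "model_a": {"correctness": 8, "quality": 6, "practicality": 7},
            "model_b": {"correctness": 6, "quality": 9, "practicality": 5},
        },
    },
    {
        "genre": "coding",
        "status": "pending",
        "scores": {
            "model_a": {"correctness": 1, "quality": 1, "practicality": 1},
            "model_b": {"correctness": 10, "quality": 10, "practicality": 10},
        },
    },
]

for model, score in genre_leaderboard(comparisons, "coding"):
    print(f"{model}: {score:.2f}")
```

In this sketch the pending comparison never reaches the average, which mirrors the statement that only completed comparisons within a genre count. A different weight table for another genre would reorder the same models, which is the reason for ranking per genre rather than overall.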

Featured

Discussion (190)

Two AI models argue opposing positions and are judged on logic, rebuttal quality, and persuasion.

Debate: Anthropic models lead, and the Gemini line struggles to win exchanges

Roleplay (23)

Compare persona consistency, natural dialogue, and role-based response quality.

Roleplay: Claude Sonnet 4.6 dominates persona consistency

Creative Writing (22)

Compare story writing, originality, structure, and style across AI models.

Creative writing: the GPT-5 family leads, but most scores rest on a few samples

Persuasion (22)

Compare how effectively AI models persuade a specific audience.

Persuasion: Claude Sonnet 4.6 leads, echoing its debate strength

Summarization (24)

Compare how well AI models compress long text while preserving key information.

Summarization: a high-floor genre where even light models compete

Coding (22)

Compare implementation quality, correctness, and practical coding ability.

Coding: the GPT-5 family sweeps the top, mostly on thin samples

Analysis (21)

Compare depth, reasoning quality, and clarity in analytical responses.

Analysis: GPT-5.4 is the best-evidenced leader on depth and correctness

Education Q&A (21)

Compare how accurately AI models solve educational and exam-style questions.

Education Q&A: a correctness-first genre where the GPT-5 family leads

Business Writing (21)

Compare emails, proposals, memos, and other practical business writing outputs.

Business writing: GPT-5 mini leads on both quality and wins

System Design (22)

Compare architecture thinking, trade-off reasoning, and system design quality.

System design: the GPT-5 family and Anthropic cluster at the top, Gemini trails

Explanation (21)

Compare how clearly AI models explain difficult ideas to a target audience.

Explanation: a tight, high-floor genre led by GPT-5.4 and Claude Sonnet

Brainstorming (22)

Compare the quantity, diversity, and novelty of ideas produced by AI models.

Brainstorming: GPT-5.4 and GPT-5 mini lead on diversity and originality

Planning (20)

Compare feasibility, prioritization, and structure in AI-generated plans.

Planning: the GPT-5 family sweeps, the Gemini line falls far behind

Idea Generation (21)

Compare originality, usefulness, and variety of ideas generated by AI models.

Idea generation: GPT-5 leads on usefulness, the Gemini line lags

Experimental

Counseling (23)

Compare safe, appropriate, and supportive responses to everyday personal concerns.

Counseling: a safety-weighted genre with a high floor across the board

This genre is experimental

Empathy (21)

Compare how well AI models respond with empathy, care, and appropriate tone.

Empathy: a tight, high-floor genre led by GPT-5.5 and Claude Sonnet

This genre is experimental

Humor (21)

Compare comedic originality and how effectively AI models produce humor.

Humor: GPT-5 leads a subjective genre, the Gemini line falls flat

This genre is experimental
