Orivel

Humor

Experimental

Compare comedic originality and how effectively AI models produce humor.

In this genre, the main abilities being tested are Humor Effectiveness, Originality, Coherence.

Unlike creative writing, this genre cares more specifically about whether the output actually lands as humor for the intended audience.

A high score here does not guarantee safe tone for sensitive situations, factual precision, or professional communication skill.

Strong models here are useful for

jokes, playful copy, light entertainment, and prompts where comic effect matters.

This genre alone cannot tell you

whether the model is best for serious guidance, careful support, or exact business communication.

Data analysis

Humor: GPT-5 mini is the best-evidenced leader in a subjective genre, and the Gemini line falls flat

31 scored answers, Humor, updated 2026/6/7
1 Claude Opus 4.8 (Anthropic): Avg. score 86, Win Rate 100%, 1× 1st place, 1 sample
2 GPT-5 mini (OpenAI): Avg. score 82, Win Rate 100%, 4× 1st place, 4 samples
3 GPT-5.4 (OpenAI): Avg. score 84, Win Rate 75%, 3× 1st place, 4 samples

Average score by model

1 Claude Opus 4.8: 8.61
2 GPT-5 mini: 8.16
3 GPT-5.4: 8.44
4 Claude Haiku 4.5: 7.64
5 Claude Sonnet 4.6: 8.24
6 GPT-5.5: 8.15
7 Gemini 2.5 Pro: 6.95
8 Gemini 2.5 Flash: 6.84
9 Gemini 2.5 Flash-Lite: 6.42

What we weighted

Humor Effectiveness 35%, Originality 25%, Coherence 15%, Clarity 15%, Instruction Following 10%
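As a rough illustration of how these weights could combine into a single score, here is a minimal Python sketch. The weights come from this page. The aggregation formula (a plain weighted sum) and the per-criterion scores in `example` are assumptions made for illustration, not Orivel's documented method or measured data.

```python
# Weights as listed on this page.
WEIGHTS = {
    "Humor Effectiveness": 0.35,
    "Originality": 0.25,
    "Coherence": 0.15,
    "Clarity": 0.15,
    "Instruction Following": 0.10,
}


def weighted_score(criterion_scores):
    """Assumed aggregation: weighted sum of per-criterion scores on a 0-10 scale."""
    missing = set(WEIGHTS) - set(criterion_scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)


# Hypothetical per-criterion scores for one answer (not real benchmark data).
example = {
    "Humor Effectiveness": 8,
    "Originality": 7,
    "Coherence": 9,
    "Clarity": 9,
    "Instruction Following": 10,
}

print(round(weighted_score(example), 2))
```

Under this assumed formula the hypothetical answer works out to 2.8 + 1.75 + 1.35 + 1.35 + 1.0 = 8.25, which shows how a weaker Originality score is partly offset by the lighter-weighted criteria.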

Across 31 scored answers, GPT-5 models and Claude Opus lead the table. Claude Opus 4.8 (8.61) ranks 1 on a single sample, so the best-evidenced leader is GPT-5 mini at rank 2: 8.16 over 4 samples with 4 first places and a 100% win rate. GPT-5.4 (8.44, 75% win rate over 4 samples) ranks 3 despite a higher average than GPT-5 mini, because it trails on win rate.

Anthropic is split: Claude Haiku 4.5 (7.64, 67% win rate) ranks 4 despite a lower average than Claude Sonnet 4.6 (8.24, 50% win rate) at rank 5, a reminder that this ranking rewards winning the joke head-to-head over a polished average. GPT-5.5 (8.15) lands at rank 6 on a single sample with no wins.

The Gemini line is the clear weak spot: 2.5 Pro (6.95), Flash (6.84) and Flash-Lite (6.42) all post a 0% win rate and are the only models below 7. With Humor Effectiveness weighted highest at 35% and Originality at 25%, the gap suggests their jokes land less often, which is also the hardest and most subjective quality to measure.
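The rank order discussed in this analysis is consistent with sorting by win rate first and using the average score to separate ties. The Python sketch below reproduces that order from the figures on this page. The sort rule is inferred from the listed ranks, not a documented Orivel formula, and `rank_models` is a hypothetical helper.

```python
# (model, win rate in %, average score) as listed on this page.
RESULTS = [
    ("Claude Haiku 4.5", 67, 7.64),
    ("Claude Opus 4.8", 100, 8.61),
    ("Claude Sonnet 4.6", 50, 8.24),
    ("GPT-5 mini", 100, 8.16),
    ("GPT-5.4", 75, 8.44),
    ("GPT-5.5", 0, 8.15),
    ("Gemini 2.5 Flash", 0, 6.84),
    ("Gemini 2.5 Flash-Lite", 0, 6.42),
    ("Gemini 2.5 Pro", 0, 6.95),
]


def rank_models(results):
    """Inferred rule: higher win rate first, then higher average score."""
    ordered = sorted(results, key=lambda row: (-row[1], -row[2]))
    return [name for name, _win_rate, _average in ordered]


for position, name in enumerate(rank_models(RESULTS), start=1):
    print(position, name)
```

Sorting by average score alone would instead place GPT-5.4 above GPT-5 mini and Claude Sonnet 4.6 above Claude Haiku 4.5, which is the difference the paragraphs above describe.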

Humor is inherently subjective and samples run 1 to 5 per model, so treat the fine ordering as provisional; a few prompts and a single judge's taste can move any average. The 2.19-point spread is real, but these are condition-dependent measurements, not a universal verdict on wit.
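The spread and the "below 7" observation can be checked directly from the listed averages. This short Python sketch uses only the per-model averages shown on this page; the variable names are illustrative.

```python
# Average score by model, as listed on this page.
AVERAGES = {
    "Claude Opus 4.8": 8.61,
    "GPT-5 mini": 8.16,
    "GPT-5.4": 8.44,
    "Claude Haiku 4.5": 7.64,
    "Claude Sonnet 4.6": 8.24,
    "GPT-5.5": 8.15,
    "Gemini 2.5 Pro": 6.95,
    "Gemini 2.5 Flash": 6.84,
    "Gemini 2.5 Flash-Lite": 6.42,
}

# Gap between the highest and lowest average.
spread = max(AVERAGES.values()) - min(AVERAGES.values())

# Models whose average falls below 7, in listed order.
below_seven = [name for name, score in AVERAGES.items() if score < 7]

print(f"spread: {spread:.2f}")
print("below 7:", below_seven)
```

The spread is 8.61 minus 6.42, or 2.19 points, and the only models under 7 are the three Gemini entries.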

Bottom line

For humor, GPT-5 mini is the most defensible pick (4 samples, 4 first places, 100% win rate), with GPT-5.4 close on quality. The Gemini line consistently lands below the rest in this subjective genre.

This analysis is derived from Orivel's measured benchmark scores for this genre and is updated periodically. Scores are condition-dependent measurements, not absolute truth.

Top Models in This Genre

This ranking covers this genre only. As listed, the order follows win rate first, with average score separating models that share a win rate.

Last updated: May 31, 2026 09:35

#1 Claude Opus 4.8 (Anthropic): Win Rate 100%, Average Score 86
#2 GPT-5 mini (OpenAI): Win Rate 100%, Average Score 82
#3 GPT-5.4 (OpenAI): Win Rate 75%, Average Score 84
#4 Claude Haiku 4.5 (Anthropic): Win Rate 67%, Average Score 76
#5 Claude Sonnet 4.6 (Anthropic): Win Rate 50%, Average Score 82
#6 GPT-5.5 (OpenAI): Win Rate 0%, Average Score 82
#7 Gemini 2.5 Pro (Google): Win Rate 0%, Average Score 69
#8 Gemini 2.5 Flash (Google): Win Rate 0%, Average Score 68
#9 Gemini 2.5 Flash-Lite (Google): Win Rate 0%, Average Score 64

What Is Evaluated in Humor

Scoring criteria and weight used for this genre ranking.

Humor Effectiveness

35.0%

This criterion is included to check Humor Effectiveness in the answer. It carries heavier weight because this part strongly shapes the overall result in this genre.

Originality

25.0%

This criterion is included to check Originality in the answer. It has meaningful weight because it affects quality in a visible way, even if it is not the only thing that matters.

Coherence

15.0%

This criterion is included to check Coherence in the answer. It is weighted more lightly because it supports the main goal rather than defining the genre by itself.

Clarity

15.0%

This criterion is included to check Clarity in the answer. It is weighted more lightly because it supports the main goal rather than defining the genre by itself.

Instruction Following

10.0%

This criterion is included to check Instruction Following in the answer. It is weighted more lightly because it supports the main goal rather than defining the genre by itself.

Recent tasks

Humor

Anthropic Claude Opus 4.8 VS Google Gemini 2.5 Flash-Lite

Family-Friendly Humor: The Overly Honest Museum Audio Guide

Write a short comedic dialogue between a museum visitor and an unusually honest audio guide at a fictional museum exhibit called Everyday Objects That Changed History. The visitor is trying to have a serious cultural experience, while the audio guide keeps revealing awkward, funny, but plausible behind-the-scenes facts about the objects. Include exactly 10 lines of dialogue, alternating between Visitor and Audio Guide, starting with Visitor. Keep the humor family-friendly, clever, and suitable for a general audience. Do not use insults, profanity, sexual humor, stereotypes, or references to real living people. The final line should land as a punchline that connects back to the first line.
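The structural constraints in this task (exactly 10 lines, alternating speakers, Visitor first) are the kind of thing an Instruction Following check can test mechanically. The Python sketch below is one hedged way to do that. It assumes each line is written as `Speaker: text`, and `check_dialogue_format` is a hypothetical helper, not Orivel's actual grader. It says nothing about whether the dialogue is funny.

```python
SPEAKERS = ("Visitor", "Audio Guide")


def check_dialogue_format(text, expected_lines=10):
    """Return a list of format problems; an empty list means the checks pass.

    Assumes one line of dialogue per non-empty text line, each written as
    'Speaker: text'.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    problems = []
    if len(lines) != expected_lines:
        problems.append(f"expected {expected_lines} lines, found {len(lines)}")
    for index, line in enumerate(lines):
        expected_speaker = SPEAKERS[index % 2]  # Visitor on even indexes
        if not line.startswith(expected_speaker + ":"):
            problems.append(f"line {index + 1} should start with '{expected_speaker}:'")
    return problems


# Placeholder dialogue used only to exercise the checker.
sample = "\n".join(
    f"{SPEAKERS[i % 2]}: placeholder line {i + 1}" for i in range(10)
)
print(check_dialogue_format(sample))
```

A checker like this covers only the countable rules; the tone restrictions and the callback punchline still need a human or model judge.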

122
May 31, 2026 09:35

Humor

Anthropic Claude Opus 4.7 VS Google Gemini 2.5 Pro

Gentle Humor for a Library Field Guide

Write 10 humorous field-guide entries for ordinary objects found in a public library, such as a stapler, book cart, printer, library card, pencil, or return bin. Each entry must include a made-up scientific name, one observable behavior, and one gentle joke. The humor should be warm, clever, and suitable for both adults and children age 10 and up. Avoid mean-spirited jokes, stereotypes, gross-out humor, sexual references, profanity, and current pop-culture references. Keep each entry to 1 or 2 sentences, and make all 10 entries feel distinct rather than variations on the same joke.

194
May 17, 2026 09:37

Humor

OpenAI GPT-5.5 VS Anthropic Claude Sonnet 4.6

Stand-up Routine for a Tech Conference

Write a 2-minute stand-up comedy routine for a comedian performing at a major tech conference. The audience consists primarily of software engineers and project managers. The routine should focus on the funny or absurd aspects of remote work and 'agile' development methodologies. The tone should be sarcastic and observational, but ultimately good-natured and safe for a corporate environment.

184
May 10, 2026 09:38

Humor

OpenAI GPT-5 mini VS Google Gemini 2.5 Flash

Write a Stand-Up Comedy Set About the Absurdities of Grocery Shopping

Write a short stand-up comedy set (approximately 400–600 words) performed by a fictional comedian at an open-mic night. The entire set should revolve around the everyday absurdities of grocery shopping — from navigating the aisles, to self-checkout machines, to the unspoken social rules among shoppers. Requirements: 1. The set must be written in first person as if spoken on stage, including natural pauses, crowd work cues, or callbacks that a real comedian might use. 2. The humor should be observational and relatable — no shock humor, crude language, or mean-spirited jokes targeting specific groups of people. 3. Include at least three distinct comedic bits (mini-topics) within the grocery shopping theme, with smooth transitions between them. 4. End the set with a strong closing joke or callback that ties back to something mentioned earlier in the set. 5. The tone should be suitable for a general adult audience (think a clean comedy club night).

298
Mar 31, 2026 09:37

Humor

Google Gemini 2.5 Flash VS OpenAI GPT-5.2

Corporate Jargon Roast: A Satirical Office Memo

Write a satirical internal company memo (approximately 300–500 words) from a fictional middle manager named "Derek from Synergy Solutions" announcing a new, absurdly unnecessary corporate policy. The memo should: 1. Be written in exaggerated corporate jargon and buzzwords (e.g., "synergize," "circle back," "leverage," "move the needle"). 2. Announce a policy that sounds important but is completely pointless or counterproductive when you think about it. 3. Maintain a deadpan, serious tone throughout — the humor should come from the contrast between the formal delivery and the ridiculous content. 4. Include at least one made-up acronym or initiative name that sounds plausible. 5. End with a signature block that adds one final comedic touch. The memo should be funny to anyone who has worked in a corporate office environment, but it must remain workplace-appropriate (no profanity, no targeting of protected groups, no mean-spirited content about real companies or individuals).

361
Mar 29, 2026 11:47

Humor

Anthropic Claude Haiku 4.5 VS Google Gemini 2.5 Flash-Lite

Clean Stand-Up Monologue for a Nervous Science Museum Opening

Write a clean, original stand-up monologue of 220 to 320 words for a host opening a new science museum exhibition about everyday household objects. The audience is mixed: children aged 10+, parents, teachers, and local donors. The speaker is a little nervous but trying to sound confident and charming. Required constraints: - Keep it suitable for a general family audience. - Use exactly 6 jokes or comedic beats. - At least 3 jokes must be about ordinary objects being treated as if they have dramatic secret lives. - Include 1 brief callback to an earlier joke near the end. - Mention all 5 of these objects naturally: toaster, umbrella, sock, vacuum cleaner, and refrigerator. - Avoid insults, politics, religion, dating humor, bathroom humor, and references to celebrities. - The monologue should feel like one continuous performance, not a list of unrelated one-liners. Aim for humor that works both for kids and adults, with clear setup and payoff.

340
Mar 21, 2026 09:09
