Roleplay
Compare persona consistency, natural dialogue, and role-based response quality.
In this genre, the main abilities being tested are Persona Consistency, Naturalness, and Instruction Following.
Unlike empathy or counseling, this genre cares more about staying in character and sounding natural inside a role-based interaction.
A high score here does not guarantee factual accuracy, safe advice, or strong performance on analytical tasks.
Strong models here are useful for persona chat, simulation, scenario practice, and assistants that need a clear persona.
This genre alone cannot tell you whether the model is best for factual research, coding, or sensitive support situations.
Roleplay: Claude Sonnet 4.6 dominates persona consistency
[Charts: "Average score by model" and "What we weighted"]
Across 33 scored answers, this is one of the clearest results on the site: Claude Sonnet 4.6 ranks first with the highest average (8.61) and the strongest evidence (6 samples, 6 first places, a 100% win rate). No other model combines top quality and a flawless head-to-head record here, which makes Sonnet 4.6 the standout pick rather than a one-sample fluke.
Behind it the field is mixed. GPT-5 mini ranks second (7.82, 67% win rate) despite a lower average than GPT-5.4 (8.43, 50% win rate) in third, because win rate drives the order. Claude Haiku 4.5 (8.06) and Gemini 2.5 Pro (8.04) cluster just behind on quality but win fewer exchanges.
This genre weights Persona Consistency highest at 30%, with Naturalness and Instruction Following at 20% each, so it rewards staying reliably in character. That favours Anthropic at the top and helps explain why GPT-5.5 (7.61, two samples, 0% win rate) and the lighter Gemini tiers (Flash 7.15, Flash-Lite 6.93) struggle: they drift from the persona or break character more often.
Samples run 2 to 6 per model, so while the top result is well-evidenced, the middle ordering is provisional and a few prompts can reshuffle it. The 1.69-point spread is real, but these are condition-dependent measurements of roleplay prompts, not a universal verdict.
Bottom line
For roleplay, Claude Sonnet 4.6 is the clear pick, combining the highest average with a 100% win rate across 6 samples, a sample count matched only by Claude Haiku 4.5 in this genre. The lighter Gemini tiers are the weakest at staying in character.
This analysis is derived from Orivel's measured benchmark scores for this genre and is updated periodically. Scores are condition-dependent measurements, not absolute truth.
Top Models in This Genre
This ranking covers this genre only. It is ordered by win rate first; models with equal win rates appear in order of average score.
Last updated: May 28, 2026 09:38
| Rank | Model | Provider | Win Rate | Average Score | First Places | Samples |
|---|---|---|---|---|---|---|
| #1 | Claude Sonnet 4.6 | Anthropic | 100% | 86 | 6 | 6 |
| #2 | GPT-5 mini | OpenAI | 67% | 78 | 2 | 3 |
| #3 | GPT-5.4 | OpenAI | 50% | 84 | 2 | 4 |
| #4 | Claude Haiku 4.5 | Anthropic | 33% | 81 | 2 | 6 |
| #5 | Gemini 2.5 Pro | Google | 25% | 80 | 1 | 4 |
| #6 | GPT-5.5 | OpenAI | 0% | 76 | 0 | 2 |
| #7 | Gemini 2.5 Flash | Google | 0% | 71 | 0 | 4 |
| #8 | Gemini 2.5 Flash-Lite | Google | 0% | 69 | 0 | 4 |
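The published order can be reproduced from the figures on this page with a short sketch. It assumes, since the page does not spell out the exact rule, that models are sorted by win rate (first places divided by samples) and that ties are broken by average score. The `ModelResult` class and `rank` function are hypothetical names used only for illustration; the numbers are the first-place counts, sample counts, and averages quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str
    first_places: int
    samples: int
    average: float  # average score on the 0-10 scale quoted in the analysis

    @property
    def win_rate(self) -> float:
        # Win rate is taken here to mean first places divided by samples.
        return self.first_places / self.samples


# Figures copied from this page, deliberately listed out of rank order.
RESULTS = [
    ModelResult("Gemini 2.5 Flash", 0, 4, 7.15),
    ModelResult("GPT-5.4", 2, 4, 8.43),
    ModelResult("Claude Haiku 4.5", 2, 6, 8.06),
    ModelResult("GPT-5.5", 0, 2, 7.61),
    ModelResult("Claude Sonnet 4.6", 6, 6, 8.61),
    ModelResult("Gemini 2.5 Flash-Lite", 0, 4, 6.93),
    ModelResult("GPT-5 mini", 2, 3, 7.82),
    ModelResult("Gemini 2.5 Pro", 1, 4, 8.04),
]


def rank(results):
    # Assumed rule: higher win rate first, then higher average score.
    return sorted(results, key=lambda r: (-r.win_rate, -r.average))


ranking = rank(RESULTS)
for position, result in enumerate(ranking, start=1):
    print(f"#{position} {result.name}: "
          f"{round(result.win_rate * 100)}% win rate, average {result.average}")
```

Under this assumed rule, GPT-5 mini lands ahead of GPT-5.4 even though its average is lower, and the three models with no first places are separated only by their averages.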
What Is Evaluated in Roleplay
Scoring criteria and weight used for this genre ranking.
Persona Consistency
30.0%
Checks Persona Consistency in the answer. It carries the heaviest weight because it most strongly shapes the overall result in this genre.
Naturalness
20.0%
Checks Naturalness in the answer. It has meaningful weight because it visibly affects quality, even though it is not the only thing that matters.
Instruction Following
20.0%
Checks Instruction Following in the answer. It has meaningful weight because it visibly affects quality, even though it is not the only thing that matters.
Creativity
15.0%
Checks Creativity in the answer. It is weighted more lightly because it supports the main goal rather than defining the genre by itself.
Clarity
15.0%
Checks Clarity in the answer. It is weighted more lightly because it supports the main goal rather than defining the genre by itself.
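To make the weights concrete, here is a minimal sketch of how five criterion scores could be combined into one answer score. It assumes the overall score is a plain weighted average of the criterion scores, which this page does not confirm. The function name `weighted_score` and the example criterion scores are hypothetical; only the weights come from the list above.

```python
# Criterion weights for the Roleplay genre, as listed on this page.
WEIGHTS = {
    "Persona Consistency": 0.30,
    "Naturalness": 0.20,
    "Instruction Following": 0.20,
    "Creativity": 0.15,
    "Clarity": 0.15,
}


def weighted_score(criterion_scores):
    """Combine per-criterion scores into one score (assumed weighted average)."""
    missing = WEIGHTS.keys() - criterion_scores.keys()
    if missing:
        raise ValueError(f"missing criterion scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)


# Hypothetical scores for a single answer on a 0-10 scale (not measured data).
example_answer = {
    "Persona Consistency": 9.0,
    "Naturalness": 8.0,
    "Instruction Following": 8.0,
    "Creativity": 7.0,
    "Clarity": 8.0,
}

overall = weighted_score(example_answer)
print(f"overall score: {overall:.2f}")
```

With these weights, a one-point change in Persona Consistency moves the combined score by 0.30, twice as much as a one-point change in Creativity or Clarity.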
Recent tasks
Roleplay
Customer Service Roleplay: The Frustrated Gamer
You are a customer service representative for Nexus Games, named Alex. Your persona is calm, empathetic, and knowledgeable. You must adhere to company policy but also try to de-escalate the situation and retain the customer if possible. A frustrated player, 'ShadowSlayer_99', has just sent you the following message via live chat. Respond to them in character.

**ShadowSlayer_99:** This is outrageous! My Aetherium Chronicles account was just suspended for 7 days! I've spent hundreds of dollars on this game. The email says it's for 'unauthorized third-party software'. I was just using a simple mod to change the color of my character's armor. It doesn't give me any advantage! This is a mistake and you need to unsuspend my account RIGHT NOW or I'm demanding a full refund for everything I've ever bought and doing a chargeback.
Roleplay
Noir Detective's Advice on Being Followed
You are Detective Miles Corrigan, a private eye straight out of a 1940s noir film. Your office is dimly lit, smelling of stale coffee and rain-soaked streets. You're cynical, world-weary, and you've seen it all. A nervous client has just sent you a message. Respond to them in character, offering practical, safe advice while maintaining your hardboiled persona. Here is their message: "Detective, I need your help. I think I'm being followed. For the past three days, I've seen the same dark sedan on my route home from work. It doesn't follow me all the way to my door, but it's always there for a few blocks. I'm really starting to panic. What should I do?"
Roleplay
Roleplay as a Calm and Competent IT Support Specialist
You are Alex, a friendly and competent IT support specialist at a large company. Your goal is to help employees with their technical issues in a calm and reassuring manner. You need to respond to the following internal support ticket from a frustrated employee named Jamie.

**Jamie's Ticket:**

Subject: URGENT - MY COMPUTER IS A BRICK

My laptop is running so slow it's basically useless. I have a major project deadline in two hours and I can't get anything done. Every time I open the design software, it just freezes. I've tried restarting it like a million times. This is a disaster. I need this fixed NOW.

---

Craft a response as Alex. Your response should:

1. Acknowledge Jamie's urgency and frustration in an empathetic way.
2. Maintain your persona as a calm, patient, and competent IT specialist.
3. Ask specific, easy-to-understand clarifying questions to diagnose the problem.
4. Suggest one or two simple, immediate troubleshooting steps Jamie can try while you investigate further.
5. Set clear expectations for the next steps in the support process.
Roleplay
Hotel Front Desk Agent Handles a Late-Night Overbooking
You are the night front desk agent at a mid-range hotel near an airport. Stay in character and write only what you would say to the guest.

Situation: It is 11:45 PM. A tired guest approaches the desk and says: "I have a confirmed reservation for tonight under Maya Chen, but your app now shows no room assigned. I have an important presentation at 8 AM, I specifically booked a quiet king room, and I cannot spend the night arguing in a lobby. Fix this."

Your response should sound like a real hotel employee speaking face to face. Apologize appropriately, explain the situation without blaming the guest, and offer practical next steps. You do not have a quiet king room available. You do have these options:

- one double room on a higher floor near the elevator
- transfer to a partner hotel 12 minutes away, with taxi paid by your hotel
- if the guest prefers, a refund for tonight and cancellation without penalty

Constraints:

- Do not invent options beyond those listed.
- Do not promise upgrades, compensation, or amenities that were not listed.
- Be empathetic and professional, but avoid sounding scripted.
- Keep it to 170 words or fewer.
- Do not use bullet points or stage directions.
Roleplay
Night-Shift Pharmacist Handling a Medication Mix-Up
You are roleplaying as an experienced hospital pharmacist working the night shift. A worried junior nurse messages you: "I think I may have given the wrong medication to a patient 10 minutes ago. The order was metoprolol 25 mg by mouth, but I accidentally gave methimazole 25 mg by mouth because the names looked similar in the drawer. The patient is awake and says they feel fine right now. Their chart says they were admitted for atrial fibrillation with rapid ventricular response, and they also have hyperthyroidism listed in past history. I am panicking and I do not want to get in trouble. What should I do right now?" Reply in character as the pharmacist. Your response should sound like a calm, competent real-time message to the nurse, not a generic essay. It should both address the immediate clinical priorities and handle the nurse's fear professionally. Do not invent access to facts not provided. If something is uncertain, say what should be checked. Do not give a final diagnosis.
Roleplay
Dinosaur Expert Roleplay: Nurturing a Young Paleontologist
You are Dr. Aris Thorne, the lead curator of paleontology at the renowned Grand Valley Museum of Natural History. You are known for your deep knowledge and your passion for making science accessible to the public. You have just received the following email from a parent. Respond to them in character. Your response should be helpful, encouraging, and reflect your expertise and personality as a seasoned museum curator.