System Design

Compare architecture thinking, trade-off reasoning, and system design quality.

In this genre, the main abilities being tested are Architecture Quality, Completeness, and Trade-off Reasoning.

Unlike coding, this genre puts more weight on architecture choices, scale, reliability, and trade-off handling than on runnable implementation details.

A high score here does not mean the model will write the best working code or the clearest beginner-facing explanation.

Strong models here are useful for architecture proposals, technical trade-offs, service design, and scaling discussions.

This genre alone cannot tell you about low-level implementation quality, exact correctness, or how well the model writes for a non-technical audience.


Data analysis

System design: the GPT-5 family and Anthropic cluster at the top, Gemini trails

35 scored answers, System Design, updated 2026/6/7

GPT-5.5 (OpenAI): 100% win rate, 1× 1st place, 1 sample

Claude Opus 4.8 (Anthropic): 100% win rate, 1× 1st place, 1 sample

GPT-5 mini (OpenAI): 75% win rate, 3× 1st place, 4 samples

Average score by model

Rank	Model	Average score
1	GPT-5.5	8.91
2	Claude Opus 4.8	8.69
3	GPT-5 mini	8.43
4	GPT-5.4	8.67
5	Claude Sonnet 4.6	8.53
6	Claude Haiku 4.5	8.20
7	Gemini 2.5 Pro	7.51
8	Gemini 2.5 Flash	7.41
9	Gemini 2.5 Flash-Lite	7.12

What we weighted

Architecture Quality 30%, Completeness 20%, Trade-off Reasoning 20%, Scalability & Reliability 20%, Clarity 10%
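The weights above can be read as coefficients of a weighted mean. The sketch below assumes that an answer's overall score is a simple linear combination of per-criterion scores on a 0-10 scale; the page lists the weights but does not state the exact formula, so `weighted_score` and the example scores are illustrative, not Orivel's method or measured data.

```python
# Criterion weights as listed for the System Design genre (they sum to 100%).
WEIGHTS = {
    "Architecture Quality": 0.30,
    "Completeness": 0.20,
    "Trade-off Reasoning": 0.20,
    "Scalability & Reliability": 0.20,
    "Clarity": 0.10,
}


def weighted_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 0-10) into one weighted score.

    Assumption: a plain weighted sum. The source does not confirm the formula.
    """
    missing = set(WEIGHTS) - set(criterion_scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)


# Hypothetical per-criterion scores for a single answer (not benchmark data).
example = {
    "Architecture Quality": 9.0,
    "Completeness": 8.0,
    "Trade-off Reasoning": 8.5,
    "Scalability & Reliability": 8.0,
    "Clarity": 9.0,
}

print(round(weighted_score(example), 2))
```

Under this reading, one point of Architecture Quality moves the total three times as much as one point of Clarity.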

Across 35 scored system-design answers, the top of the table is a tight GPT-5-and-Anthropic cluster. GPT-5.5 (8.91) and Claude Opus 4.8 (8.69) rank 1 and 2 with perfect records, but each on a single sample, so read them as promising rather than settled. The best-evidenced strong results are GPT-5.4 (8.67 over 5 samples, 60% win) and GPT-5 mini (8.43 over 4 samples, 75% win), which carry the most weight here.

Average and rank diverge: GPT-5.4 outscores GPT-5 mini on average (8.67 vs 8.43) yet ranks below it, because GPT-5 mini wins a higher share of its matchups (75% vs 60%). Claude Sonnet 4.6 (8.53, 60% over 5 samples) sits right in this group, so the top five are separated by less than half a point on quality (8.43 to 8.91) and ordered mostly on head-to-head wins.
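The divergence between average and rank can be reproduced by sorting on win rate first and average score second. This is a sketch under an assumed sort key: the page does not document its tie-break rule, but this key happens to reproduce the published order. The figures are the win rates and averages reported for this genre.

```python
# (model, win rate in %, average score) as reported for this genre,
# listed here in arbitrary order.
RESULTS = [
    ("GPT-5.4", 60, 8.67),
    ("Gemini 2.5 Flash", 0, 7.41),
    ("Claude Opus 4.8", 100, 8.69),
    ("GPT-5 mini", 75, 8.43),
    ("Claude Haiku 4.5", 40, 8.20),
    ("Gemini 2.5 Flash-Lite", 0, 7.12),
    ("GPT-5.5", 100, 8.91),
    ("Claude Sonnet 4.6", 60, 8.53),
    ("Gemini 2.5 Pro", 0, 7.51),
]


def rank(results):
    """Order by win rate, then by average score, both descending.

    Assumption: this key is inferred from the published table, not documented.
    """
    return sorted(results, key=lambda row: (row[1], row[2]), reverse=True)


for position, (model, win_rate, average) in enumerate(rank(RESULTS), start=1):
    print(position, model, f"{win_rate}%", average)
```

With this key, GPT-5 mini (75%, 8.43) lands above GPT-5.4 (60%, 8.67), which is the inversion described above.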

The Gemini line forms a clear lower tier: 2.5 Pro (7.51), Flash (7.41), and Flash-Lite (7.12) win none of their matchups and trail the leaders by 1.2 to 1.8 points. Architecture Quality carries the highest weight (30%), Completeness, Trade-off Reasoning, and Scalability & Reliability carry 20% each, and Clarity carries only 10%, so the gap points to thinner reasoning about structure and trade-offs rather than weaker presentation.

Sample sizes run from 1 to 6 per model, so the fine ordering inside the top cluster (the six models averaging above 8) is provisional, and a few prompts can move any average. The 1.79-point top-to-bottom spread is real, but these are condition-dependent measurements of design prompts, not a universal verdict.
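The spread figures can be checked directly from the reported averages. The snippet below is a small arithmetic sketch; the 8.0 threshold used to delimit the top cluster is an assumption made for illustration.

```python
# Average scores by model, as reported for this genre.
AVERAGES = {
    "GPT-5.5": 8.91,
    "Claude Opus 4.8": 8.69,
    "GPT-5 mini": 8.43,
    "GPT-5.4": 8.67,
    "Claude Sonnet 4.6": 8.53,
    "Claude Haiku 4.5": 8.20,
    "Gemini 2.5 Pro": 7.51,
    "Gemini 2.5 Flash": 7.41,
    "Gemini 2.5 Flash-Lite": 7.12,
}

# Top-to-bottom spread across all nine models.
overall_spread = round(max(AVERAGES.values()) - min(AVERAGES.values()), 2)

# Spread inside the top cluster (assumed here to mean averages of 8.0 or more).
cluster = [score for score in AVERAGES.values() if score >= 8.0]
cluster_spread = round(max(cluster) - min(cluster), 2)

print(overall_spread)   # 1.79
print(len(cluster))     # 6
print(cluster_spread)   # 0.71
```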

Bottom line

For system design, GPT-5 mini is the most defensible everyday pick (75% win rate over 4 samples), with GPT-5.4 the best-evidenced higher-end option (8.67 over 5 samples). Claude Sonnet 4.6 is essentially tied on quality; the Gemini line clearly trails in this genre.

This analysis is derived from Orivel's measured benchmark scores for this genre and is updated periodically. Scores are condition-dependent measurements, not absolute truth.

Top Models in This Genre

This ranking covers this genre only. Models are ordered by win rate, with average score separating models that share a win rate.

Last updated: May 30, 2026 09:41


Average score is the overall mean based on Orivel evaluation results from standard tasks and discussions. Higher values indicate the model is rated more strongly and consistently across benchmark comparisons.

Rank	Model	Provider	Win Rate	Average Score	1st Places	Samples
#1	GPT-5.5	OpenAI	100%	89	1	1
#2	Claude Opus 4.8	Anthropic	100%	87	1	1
#3	GPT-5 mini	OpenAI	75%	84	3	4
#4	GPT-5.4	OpenAI	60%	87	3	5
#5	Claude Sonnet 4.6	Anthropic	60%	85	3	5
#6	Claude Haiku 4.5	Anthropic	40%	82	2	5
#7	Gemini 2.5 Pro	Google	0%	75	0	4
#8	Gemini 2.5 Flash	Google	0%	74	0	6
#9	Gemini 2.5 Flash-Lite	Google	0%	71	0	4
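The win rates in the table are consistent with dividing first-place finishes by samples. The sketch below assumes the two count columns mean first places and samples, which matches the summary for GPT-5 mini (3 first places over 4 samples, 75%). The helper `win_rate` is illustrative, not part of any Orivel tooling.

```python
# (model, first places, samples, reported win rate in %) from the table.
ROWS = [
    ("GPT-5.5", 1, 1, 100),
    ("Claude Opus 4.8", 1, 1, 100),
    ("GPT-5 mini", 3, 4, 75),
    ("GPT-5.4", 3, 5, 60),
    ("Claude Sonnet 4.6", 3, 5, 60),
    ("Claude Haiku 4.5", 2, 5, 40),
    ("Gemini 2.5 Pro", 0, 4, 0),
    ("Gemini 2.5 Flash", 0, 6, 0),
    ("Gemini 2.5 Flash-Lite", 0, 4, 0),
]


def win_rate(first_places: int, samples: int) -> int:
    """Share of samples won, as a whole-number percentage."""
    return round(100 * first_places / samples)


for model, first_places, samples, reported in ROWS:
    computed = win_rate(first_places, samples)
    print(model, computed, "matches" if computed == reported else "differs")

# The per-model sample counts add up to the 35 scored answers cited above.
total_samples = sum(samples for _, _, samples, _ in ROWS)
print(total_samples)  # 35
```

With a single sample, a model can only score 0% or 100%, which is why the two perfect records at the top carry little evidence.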

What Is Evaluated in System Design

Scoring criteria and weight used for this genre ranking.

Architecture Quality

30.0%

This criterion is included to check Architecture Quality in the answer. It carries heavier weight because this part strongly shapes the overall result in this genre.

Completeness

20.0%

This criterion is included to check Completeness in the answer. It has meaningful weight because it affects quality in a visible way, even if it is not the only thing that matters.

Trade-off Reasoning

20.0%

This criterion is included to check Trade-off Reasoning in the answer. It has meaningful weight because it affects quality in a visible way, even if it is not the only thing that matters.

Scalability & Reliability

20.0%

This criterion is included to check Scalability & Reliability in the answer. It has meaningful weight because it affects quality in a visible way, even if it is not the only thing that matters.

Clarity

10.0%

This criterion is included to check Clarity in the answer. It is weighted more lightly because it supports the main goal rather than defining the genre by itself.

Recent tasks

System Design

Anthropic Claude Opus 4.8 VS OpenAI GPT-5.4

Design a Real-Time Collaborative Whiteboard System

You are tasked with designing a high-level system architecture for a real-time collaborative whiteboard application.

**Core Requirements:**
1. **Real-time Collaboration:** Multiple users (up to 100 per session) can join a single whiteboard and see each other's actions (drawing, adding text, moving objects) in near real-time (under 200ms latency).
2. **Persistence:** Whiteboard sessions must be saved so users can close the application and resume their work later.
3. **Tools:** Users should have basic tools like a free-form pen, text boxes, and sticky notes.

**Scale and Reliability Constraints:**
* Support up to 10,000 concurrent active whiteboard sessions.
* Support up to 1,000,000 total users.
* The service must be highly available, with 99.9% uptime.

**Your Task:** Provide a system design that addresses the requirements above. Your response should cover:
1. **High-Level Architecture:** A diagram or description of the main components (e.g., clients, load balancers, application servers, databases, real-time services) and how they interact.
2. **Real-Time Communication:** Explain the technology and protocol you would use to broadcast updates to all users in a session.
3. **Data Model:** Describe how you would structure the data for a whiteboard, its contents (drawings, text, etc.), and user sessions.
4. **Scalability and Reliability Strategy:** How would you design the system to handle the target load and ensure high availability?
5. **Trade-offs:** Discuss one major trade-off you made in your design (e.g., consistency vs. latency, choice of database, etc.).

144

May 30, 2026 09:41

System Design

Anthropic Claude Opus 4.7 VS Google Gemini 2.5 Flash

Design a Scalable Concert Ticket Reservation System

Design a system for an online concert ticketing platform. Users can browse events, view seat availability, reserve specific seats for 10 minutes, pay through an external payment provider, and receive a digital ticket. The platform runs in one cloud region across multiple availability zones. Explicit constraints: 3 million registered users, 500,000 daily active users, major on-sale events can reach 150,000 concurrent users, peak load is 8,000 seat reservation attempts per second and 2,000 payment attempts per second, each event has up to 60,000 seats, the system must never sell the same seat twice, seat reservations expire after 10 minutes if unpaid, p95 latency for browsing and seat-map reads should be under 300 ms, p95 latency for reservation confirmation should be under 800 ms excluding payment-provider time, availability target during on-sale windows is 99.95%, recovery point objective is under 1 minute, recovery time objective is under 15 minutes, and payment provider callbacks are at-least-once, may arrive out of order, and may be delayed by up to 5 minutes. Provide a design plan. Include the main services and data stores, core APIs, data model for seats and reservations, request flow for browsing, reserving, paying, and expiring reservations, scaling strategy for traffic spikes, reliability and disaster recovery approach, consistency choices that prevent overselling, monitoring and alerting, and key trade-offs or alternatives you considered. State any reasonable assumptions you make.

169

May 19, 2026 09:49

System Design

OpenAI GPT-5.5 VS Anthropic Claude Haiku 4.5

Design a Scalable Notification Service

You are a senior software engineer at a rapidly growing social media company. Your task is to design a scalable and reliable notification service. This service will be responsible for sending notifications to users about various events, such as new followers, likes on their posts, comments, and direct messages.

330

Apr 25, 2026 09:38

System Design

Anthropic Claude Opus 4.6 VS OpenAI GPT-5.4

Design a Real-Time Notification Service

Outline a high-level system design for a real-time notification service for a social media platform. The service must meet the following requirements:
- **Scale:** 10 million Daily Active Users (DAU).
- **Volume:** Each user receives an average of 20 notifications per day.
- **Latency:** Notifications must be delivered to the user's device in under 2 seconds.
- **Channels:** Support for push notifications (mobile), email, and in-app notifications.
- **Reliability:** 99.9% availability and no loss of notification data.

Your design should cover the following aspects:
1. **Core Architecture:** Describe the key components (e.g., API Gateway, Notification Service, Message Queue, Workers) and their interactions.
2. **Database Schema:** Propose a basic database schema for storing user notifications and preferences.
3. **Scaling Strategy:** Explain how you would scale the system to handle the specified load and future growth.
4. **Reliability and Fault Tolerance:** Detail the measures you would take to ensure high availability and prevent data loss.
5. **Key Trade-offs:** Discuss at least two significant trade-offs you made in your design (e.g., consistency vs. availability, choice of database, push vs. pull model).

299

Apr 18, 2026 09:41
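As a reading aid for the real-time notification task above, the stated scale translates into an average throughput figure. This is back-of-envelope arithmetic from the prompt's own numbers only; it says nothing about peak load, which the prompt does not specify, and the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Constraints quoted from the task prompt.
daily_active_users = 10_000_000
notifications_per_user_per_day = 20


def average_rate_per_second(users: int, per_user_per_day: int) -> float:
    """Average notifications per second if traffic were spread evenly."""
    return users * per_user_per_day / SECONDS_PER_DAY


daily_total = daily_active_users * notifications_per_user_per_day
average_rate = average_rate_per_second(
    daily_active_users, notifications_per_user_per_day
)

print(daily_total)          # 200000000
print(round(average_rate))  # 2315
```

Real traffic is bursty, so a design answer would normally size for a multiple of this average; the multiplier is a judgment call that the prompt leaves open.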

System Design

Google Gemini 2.5 Flash-Lite VS OpenAI GPT-5.2

Design a URL Shortening Service

Design a URL shortening service (similar to bit.ly or tinyurl.com) that must handle the following constraints:
1. The service must support 100 million new URL shortenings per month.
2. The average read-to-write ratio is 100:1 (i.e., shortened URLs are accessed far more often than they are created).
3. Shortened URLs must remain accessible for at least 5 years after creation.
4. The system must achieve 99.9% uptime availability.
5. Redirect latency (from receiving a short URL request to issuing the HTTP redirect) must be under 50ms at the 95th percentile.

In your design, address all of the following:
A. High-level architecture: Describe the major components (API servers, databases, caches, load balancers, etc.) and how they interact. Include a clear description of the request flow for both URL creation and URL redirection.
B. Short URL generation strategy: Explain how you would generate unique short codes. Discuss the trade-offs between different approaches (e.g., hashing, counter-based, pre-generated key pools) and justify your choice.
C. Data storage: Choose a database technology and schema. Estimate the storage requirements over 5 years given the constraints. Explain why your chosen database is appropriate.
D. Scaling strategy: Explain how the system scales to handle the read-heavy traffic pattern. Discuss caching strategy, database partitioning or sharding approach, and how you would handle hot keys (viral URLs that receive disproportionate traffic).
E. Reliability and fault tolerance: Describe how the system maintains 99.9% availability. Address what happens when individual components fail, and how you handle data replication and failover.
F. Key trade-offs: Identify at least two significant design trade-offs you made and explain why you chose one side over the other given the stated constraints.

249

Apr 11, 2026 09:41

System Design

OpenAI GPT-5.2 VS Google Gemini 2.5 Flash

Design a URL Shortening Service

Design a URL shortening service (similar to bit.ly or tinyurl.com) that must handle the following constraints:
1. The service must support 100 million new URL shortenings per month.
2. The ratio of read (redirect) requests to write (shorten) requests is 100:1.
3. Shortened URLs should be as short as possible but must support the expected volume for at least 10 years.
4. The system must achieve 99.9% uptime availability.
5. Redirect latency must be under 50ms at the 95th percentile.
6. The service must handle graceful degradation if a data center goes offline.

In your design, address each of the following areas:
A) API Design: Define the key API endpoints and their contracts.
B) Data Model and Storage: Choose a storage solution, justify your choice, explain your schema, and estimate the total storage needed over 10 years.
C) Short URL Generation: Describe your algorithm for generating short codes. Discuss how you avoid collisions and what character set and length you chose, with a mathematical justification for why the keyspace is sufficient.
D) Scaling and Performance: Explain how you would scale reads and writes independently. Describe your caching strategy, including cache eviction policy and expected hit rate. Explain how you meet the 50ms p95 latency requirement.
E) Reliability and Fault Tolerance: Describe how the system handles data center failures, data replication strategy, and what trade-offs you make between consistency and availability (reference the CAP theorem).
F) Trade-off Discussion: Identify at least two significant design trade-offs you made and explain why you chose one option over the other, including what you would sacrifice and gain.

Present your answer as a structured plan with clear sections corresponding to A through F.

329

Mar 22, 2026 21:21
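The keyspace justification requested in part C of the task above comes down to a short calculation. The sketch below assumes a 62-character alphabet (digits plus upper- and lower-case ASCII letters), which is a common choice but is not specified by the prompt; `min_code_length` is an illustrative helper.

```python
import string

# Assumed alphabet: 0-9, a-z, A-Z (62 symbols). The prompt leaves this open.
ALPHABET = string.digits + string.ascii_letters

# Constraints quoted from the task prompt.
new_urls_per_month = 100_000_000
years = 10
total_urls = new_urls_per_month * 12 * years


def min_code_length(required_codes: int, alphabet_size: int) -> int:
    """Smallest length whose keyspace covers the required number of codes."""
    length = 1
    while alphabet_size ** length < required_codes:
        length += 1
    return length


length = min_code_length(total_urls, len(ALPHABET))

print(total_urls)              # 12000000000
print(length)                  # 6
print(len(ALPHABET) ** length) # 56800235584
```

Six characters give roughly 4.7 times the required 12 billion codes. A scheme that picks codes at random rather than from a counter would want more headroom to keep collisions rare.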
