Coding

Compare implementation quality, correctness, and practical coding ability.

In this genre, the main abilities being tested are Correctness, Completeness, and Code Quality.

Unlike system design, this genre focuses more on whether the answer actually works at the code level than on high-level architecture trade-offs.

A high score here does not guarantee strong product judgment, broad architectural thinking, or clear teaching-oriented explanations.

Strong models here are useful for

implementation, debugging, refactoring, and hands-on programming support.

This genre alone cannot tell you

whether the model is best for architecture review, stakeholder writing, or open-ended ideation.


Data analysis

Coding: the GPT-5 family sweeps the top, mostly on thin samples

33 scored answers · Coding · Updated 2026/6/7

GPT-5 mini (OpenAI): 100% win rate, 5× 1st place, 5 samples

GPT-5.4 (OpenAI): 75% win rate, 6× 1st place, 8 samples

GPT-5.5 (OpenAI): 50% win rate, 1× 1st place, 2 samples

Average score by model

1. GPT-5 mini: 8.22
2. GPT-5.4: 8.41
3. GPT-5.5: 8.90
4. Claude Sonnet 4.6: 7.70
5. Gemini 2.5 Pro: 8.41
6. Gemini 2.5 Flash: 7.31
7. Gemini 2.5 Flash-Lite: 7.17
8. Claude Haiku 4.5: 6.48

What we weighted

Correctness 35%, Completeness 20%, Code Quality 20%, Practical Value 15%, Instruction Following 10%

Across 32 scored coding answers, the top three are all from the GPT-5 family. GPT-5.5 ranks 1 with an 8.85 average and a perfect record, but that rests on a single sample, so read it as a promising signal rather than a settled result. The best-evidenced strong performer is GPT-5 mini at rank 2: 8.22 across 5 samples with 5 first-place finishes and a 100% win rate, a clean sweep of its matchups at a light-tier cost.

Average score and rank order diverge, because win rate (head-to-head firsts) drives the ranking more than the raw average. GPT-5.4 averages 8.41 over 8 samples, the largest body of evidence here, with a 75% win rate and rank 3. Gemini 2.5 Pro posts the same 8.41 average but ranks only 5, because it won none of its 3 matchups. If you care about absolute output quality rather than who wins head-to-head, GPT-5.4 and Gemini 2.5 Pro are closer than their ranks suggest.
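The divergence between rank and average can be reproduced with a small sketch. It uses the first-place counts and sample counts from the ranked table in "Top Models in This Genre" and the averages from "Average score by model". The sort key (win rate first, average score as tie-breaker) is an assumption that happens to match the published order; it is not a documented Orivel rule, and `ModelResult` is a hypothetical helper type.

```python
from typing import NamedTuple


class ModelResult(NamedTuple):
    """Hypothetical record holding one model's figures from this page."""
    name: str
    firsts: int      # number of 1st-place finishes
    samples: int     # number of scored answers
    average: float   # average score on the 0-10 scale

    @property
    def win_rate(self) -> float:
        # Win rate is taken here as first places divided by samples.
        return self.firsts / self.samples if self.samples else 0.0


RESULTS = [
    ModelResult("GPT-5 mini", 5, 5, 8.22),
    ModelResult("GPT-5.4", 6, 8, 8.41),
    ModelResult("GPT-5.5", 1, 2, 8.90),
    ModelResult("Claude Sonnet 4.6", 2, 4, 7.70),
    ModelResult("Gemini 2.5 Pro", 0, 3, 8.41),
    ModelResult("Gemini 2.5 Flash", 0, 4, 7.31),
    ModelResult("Gemini 2.5 Flash-Lite", 0, 3, 7.17),
    ModelResult("Claude Haiku 4.5", 0, 4, 6.48),
]

# Assumed ordering: higher win rate first, then higher average score.
by_win_rate = sorted(RESULTS, key=lambda r: (-r.win_rate, -r.average))

# Alternative ordering: average score only.
by_average = sorted(RESULTS, key=lambda r: -r.average)

for rank, result in enumerate(by_win_rate, start=1):
    print(f"#{rank} {result.name}: {result.win_rate:.0%} win rate, {result.average:.2f} avg")
```

Sorting by average alone puts GPT-5.5 first and places Gemini 2.5 Pro level with GPT-5.4, which is the sense in which the two are closer than their ranks suggest.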

The mid-table is led by Claude Sonnet 4.6 (7.70 average, 50% win rate over 4 samples), trailing the GPT-5 group by roughly 0.5 to 1.1 points. The lighter, faster tiers sit lower: Gemini 2.5 Flash (7.31), Flash-Lite (7.17) and Claude Haiku 4.5 (6.48) trail the leader by 1.5 to 2.4 points. With Correctness weighted highest at 35%, ahead of Code Quality and Completeness at 20% each, those gaps point to weaker correctness on the harder tasks rather than just style.

The single biggest caveat is sample size. GPT-5.5 rests on 1 sample and most models sit on 3 to 8, so averages can swing on a handful of prompts. The 2.37-point spread from top to bottom is real, but the fine ordering inside the 8-point cluster (GPT-5.5, GPT-5.4, GPT-5 mini, Gemini 2.5 Pro) should be read as provisional. These are condition-dependent measurements, not a verdict on which model is best at coding overall.

Bottom line

For coding you can rely on today, GPT-5 mini is the most defensible pick (a 100% win rate over 5 samples at light-tier cost), while GPT-5.4 is the most thoroughly evidenced higher-end option (8.41 over 8 samples). Treat GPT-5.5's one-sample lead as promising but unproven.

This analysis is derived from Orivel's measured benchmark scores for this genre and is updated periodically. Scores are condition-dependent measurements, not absolute truth.

Top Models in This Genre

This ranking is ordered by average score within this genre only.

Latest Updated: Jun 12, 2026 09:39


Average score is the overall mean based on Orivel evaluation results from standard tasks and discussions. Higher values indicate the model is rated more strongly and consistently across benchmark comparisons.

Rank	Model	Provider	Win Rate	Average Score	1st Places	Samples
#1	GPT-5 mini	OpenAI	100%	82	5	5
#2	GPT-5.4	OpenAI	75%	84	6	8
#3	GPT-5.5	OpenAI	50%	89	1	2
#4	Claude Sonnet 4.6	Anthropic	50%	77	2	4
#5	Gemini 2.5 Pro	Google	0%	84	0	3
#6	Gemini 2.5 Flash	Google	0%	73	0	4
#7	Gemini 2.5 Flash-Lite	Google	0%	72	0	3
#8	Claude Haiku 4.5	Anthropic	0%	65	0	4

What Is Evaluated in Coding

Scoring criteria and weight used for this genre ranking.

Correctness

35.0%

Checks the correctness of the answer. It carries the heaviest weight because it most strongly shapes the overall result in this genre.

Completeness

20.0%

Checks the completeness of the answer. It has meaningful weight because it visibly affects quality, even though it is not the only thing that matters.

Code Quality

20.0%

Checks the code quality of the answer. It has meaningful weight because it visibly affects quality, even though it is not the only thing that matters.

Practical Value

15.0%

Checks the practical value of the answer. It is weighted more lightly because it supports the main goal rather than defining the genre by itself.

Instruction Following

10.0%

Checks how well the answer follows the instructions. It is weighted more lightly because it supports the main goal rather than defining the genre by itself.
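As a rough illustration of how the five weights above could combine into one number, here is a minimal sketch. It assumes each criterion is scored on a 0-10 scale and that the overall score is a plain weighted sum; the page does not state the exact formula, and both `weighted_score` and the example scores are hypothetical.

```python
# Weights as listed for the Coding genre, expressed as fractions of 1.0.
WEIGHTS = {
    "correctness": 0.35,
    "completeness": 0.20,
    "code_quality": 0.20,
    "practical_value": 0.15,
    "instruction_following": 0.10,
}


def weighted_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 0-10) into one weighted score."""
    missing = set(WEIGHTS) - set(criterion_scores)
    if missing:
        raise ValueError(f"missing criterion scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * criterion_scores[name] for name in WEIGHTS)


# Hypothetical per-criterion scores, for illustration only.
example = {
    "correctness": 9.0,
    "completeness": 8.0,
    "code_quality": 8.0,
    "practical_value": 7.0,
    "instruction_following": 10.0,
}

print(f"{weighted_score(example):.2f}")
```

Because Correctness carries 35% of the total, a one-point drop there costs 0.35 points overall, more than three times the effect of the same drop in Instruction Following.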

Recent tasks

Coding

Anthropic Claude Fable 5 VS OpenAI GPT-5.5

Implement a Dependency-Based Task Scheduler in Python

Write a Python function or class that schedules a list of tasks based on their dependencies. The scheduler should determine the order in which tasks can be executed, grouping tasks that can run in parallel.

The input will be a list of dictionaries, where each dictionary represents a task with the following keys:

- `id`: A unique string identifier for the task.
- `name`: A string name for the task.
- `dependencies`: A list of string IDs of tasks that must be completed before this task can start.

Your implementation should:

1. Take the list of task dictionaries as input.
2. Return a valid execution plan as a list of lists. Each inner list represents a 'batch' of tasks that can be executed concurrently. The order of batches represents the sequential execution order. The order of task IDs within a batch does not matter.
3. Detect and handle circular dependencies. If a cycle is found, it should raise a `ValueError` with a descriptive message.
4. Detect and handle cases where a dependency ID does not correspond to any existing task. This should also raise a `ValueError`.

Jun 12, 2026 09:39
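One way to approach the scheduler task above is a level-by-level variant of Kahn's topological sort. The sketch below is illustrative only and is not either model's submitted answer; the function name `schedule` and the decision to sort IDs inside each batch (for reproducible output) are assumptions.

```python
from collections import defaultdict


def schedule(tasks: list[dict]) -> list[list[str]]:
    """Return batches of task IDs; every task's dependencies are in earlier batches."""
    deps = {task["id"]: set(task["dependencies"]) for task in tasks}
    if len(deps) != len(tasks):
        raise ValueError("duplicate task id in input")

    # Reject dependencies that point at tasks which do not exist.
    for task_id, needed in deps.items():
        unknown = needed.difference(deps)
        if unknown:
            raise ValueError(f"task {task_id!r} depends on unknown task(s): {sorted(unknown)}")

    # Reverse edges: for each task, which tasks are waiting on it.
    dependents = defaultdict(list)
    for task_id, needed in deps.items():
        for dep in needed:
            dependents[dep].append(task_id)

    remaining = {task_id: len(needed) for task_id, needed in deps.items()}
    batch = sorted(task_id for task_id, count in remaining.items() if count == 0)
    plan = []
    scheduled = 0
    while batch:
        plan.append(batch)
        scheduled += len(batch)
        next_batch = []
        for task_id in batch:
            for child in dependents[task_id]:
                remaining[child] -= 1
                if remaining[child] == 0:
                    next_batch.append(child)
        batch = sorted(next_batch)

    # Anything never scheduled is on a cycle or blocked behind one.
    if scheduled != len(deps):
        stuck = sorted(task_id for task_id, count in remaining.items() if count > 0)
        raise ValueError(f"circular dependency involving: {stuck}")
    return plan
```

Each task is placed in the earliest batch in which all of its dependencies have already run, so the number of batches equals the length of the longest dependency chain.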

Coding

OpenAI GPT-5.5 VS Google Gemini 2.5 Flash

Rate Limiter with Sliding Window and Burst Allowance

Design and implement a thread-safe rate limiter in a language of your choice (Python, Go, Java, TypeScript, or Rust) that supports the following requirements:

1. **API surface**: Expose at least these operations:
   - `allow(client_id: str, cost: int = 1) -> bool`: returns whether the request is permitted right now.
   - `retry_after(client_id: str) -> float`: returns seconds until at least 1 unit of capacity is available (0 if currently allowed).
   - A constructor that accepts per-client configuration: `rate` (units per second), `burst` (max units stored), and an optional `window_seconds` for sliding-window accounting.
2. **Algorithm**: Implement a hybrid that combines a **token bucket** (for burst tolerance) with a **sliding-window log or counter** (to bound the total requests permitted within `window_seconds`, preventing sustained abuse that a pure token bucket would allow after refills). A request is permitted only if both checks pass. Justify your data-structure choice for the sliding window (exact log vs. weighted two-bucket approximation) and discuss memory/accuracy tradeoffs in a short comment block or accompanying note.
3. **Concurrency**: The limiter will be hit by many threads/goroutines concurrently for the same and different `client_id`s. Avoid a single global lock becoming a bottleneck (e.g., per-client locks or lock striping). Document why your approach is correct under concurrent `allow` calls (no double-spend of tokens, no lost updates).
4. **Time source**: Make the clock injectable so tests are deterministic. Use a monotonic clock by default.
5. **Edge cases to handle explicitly**:
   - `cost` larger than `burst` (must reject, never block forever).
   - Clock going backwards or large pauses (e.g., suspended VM): clamp rather than crash, and don't grant unbounded tokens.
   - First-ever request for a new client (lazy initialization).
   - Stale client cleanup (memory must not grow unbounded if clients stop calling).
   - Fractional tokens / sub-millisecond timing.
6. **Tests**: Provide at least 6 unit tests using the injectable clock that cover: basic allow/deny, burst draining and refill, sliding-window cap independent of bucket refill, `cost > burst`, concurrent contention on one client (deterministic property: total permitted in T seconds ≤ rate*T + burst), and stale-client eviction.
7. **Complexity**: State the amortized time complexity of `allow` and the memory complexity per client.

Deliver: complete runnable code (single file is fine, but you may split files if you label them clearly), the tests, and a brief design note (max ~250 words) explaining your choices and the precise semantics when the two algorithms disagree.

185

May 12, 2026 09:45

Coding

Anthropic Claude Opus 4.7 VS OpenAI GPT-5.4

Markdown Subset to HTML Converter

Write a Python function `markdown_to_html(markdown_text: str) -> str` that converts a string containing a specific subset of Markdown into its corresponding HTML representation. The function must support the following features:

**Block Elements:**

1. **Headers:** Lines starting with `# ` to `###### ` should be converted to `<h1>` to `<h6>` tags.
2. **Unordered Lists:** Lines starting with `- ` should be converted to `<ul>` and `<li>` tags. Nested lists, indented by two spaces per level, must be supported. A list is terminated by a blank line or a different block element.
3. **Code Blocks:** Content enclosed between lines of triple backticks (```) should be converted to `<pre><code>...</code></pre>`. The language specifier on the opening backticks (e.g., ```python) should be ignored. No other Markdown processing should occur inside a code block.
4. **Paragraphs:** Any other text should be wrapped in `<p>` tags. Consecutive lines of text belong to the same paragraph. Paragraphs are separated by one or more blank lines.

**Inline Elements:**

1. **Bold & Italic:** `***text***` should be converted to `text`.
2. **Bold:** `**text**` should be converted to `text`.
3. **Italic:** `*text*` should be converted to `text`.

**Rules and Constraints:**

- Inline elements can be nested within headers and list items.
- The parser should be robust to malformed or tricky inputs, such as unclosed inline tags. For example, `*italic` should be rendered as `*italic`.
- The order of precedence for inline elements is `***`, then `**`, then `*`.
- Assume input is a single multi-line string.
- Do not implement support for any other Markdown features like links, images, blockquotes, or ordered lists.
- The output HTML does not need to be a full document (no `<html>` or `<body>` tags are required).

**Example Input:**

````markdown
# Header 1

This is a paragraph with **bold** and *italic* text.
This is the same paragraph.

- List item one
- List item two with ***bold and italic***
  - Nested list item
- Back to the first level

```python
def hello():
    print("Hello, World!")
```
````

315

Apr 22, 2026 09:40

Coding

Anthropic Claude Sonnet 4.6 VS OpenAI GPT-5.4

Implement a Thread-Safe Token Bucket Rate Limiter in Python

Write a Python class named `TokenBucketRateLimiter` that implements the token bucket algorithm for rate limiting. The implementation must be thread-safe and should not use any external libraries for state management (like Redis).

The class should have the following specifications:

1. An `__init__(self, capacity, refill_rate)` method:
   * `capacity`: The maximum number of tokens the bucket can hold.
   * `refill_rate`: The number of tokens that are added to the bucket per second.
2. A `consume(self, tokens)` method:
   * This method attempts to consume a given number of `tokens` from the bucket.
   * It should return `True` if the tokens can be consumed successfully, and `False` otherwise.
   * The bucket should be refilled with tokens based on the time elapsed since the last call before attempting to consume.
3. Thread Safety:
   * The class must be safe to use from multiple concurrent threads. All operations that modify the bucket's state (like refilling and consuming tokens) must be atomic.

Provide the complete class implementation with necessary imports.

304

Apr 16, 2026 09:37
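For readers unfamiliar with the token bucket algorithm named in this task, the following is a minimal sketch of one possible shape, not a reproduction of either model's answer. Two points go beyond the task text and are assumptions: the bucket starts full, and the constructor takes an optional `clock` argument (hypothetical, defaulting to `time.monotonic`) so the behaviour can be tested without sleeping.

```python
import threading
import time


class TokenBucketRateLimiter:
    """Token bucket guarded by a single lock so refill and consume are atomic."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        if capacity <= 0 or refill_rate <= 0:
            raise ValueError("capacity and refill_rate must be positive")
        self._capacity = float(capacity)
        self._refill_rate = float(refill_rate)
        self._clock = clock                 # assumption: injectable for tests
        self._tokens = float(capacity)      # assumption: bucket starts full
        self._last_refill = clock()
        self._lock = threading.Lock()

    def consume(self, tokens):
        """Return True and deduct `tokens` if available, otherwise return False."""
        if tokens <= 0:
            raise ValueError("tokens must be positive")
        with self._lock:
            now = self._clock()
            # Refill lazily from elapsed time; ignore a clock that moved backwards.
            if now > self._last_refill:
                elapsed = now - self._last_refill
                self._tokens = min(self._capacity, self._tokens + elapsed * self._refill_rate)
                self._last_refill = now
            if tokens <= self._tokens:
                self._tokens -= tokens
                return True
            return False
```

Holding the lock across both the refill and the deduction is what prevents two threads from spending the same tokens; a request larger than `capacity` is always refused because the balance is capped at `capacity`.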

Coding

Anthropic Claude Haiku 4.5 VS OpenAI GPT-5.4

Command-Line File Synchronization Tool

Write a Python script for a command-line file synchronization tool. The script must accept three command-line arguments:

1. `source_path`: The path to the source directory.
2. `replica_path`: The path to the replica directory that will be synchronized.
3. `log_file_path`: The path to a file where all operations will be logged.

Core Functionality:

1. **One-Way Sync:** The tool must perform a one-way synchronization, making the `replica_path` directory an exact copy of the `source_path` directory.
   - Files and directories present in the source but not in the replica must be copied to the replica.
   - Files and directories present in the replica but not in the source must be removed from the replica.
   - Files present in both locations but with different content must be updated in the replica (the source version overwrites the replica version).
2. **Change Detection:** Use the MD5 hash of file contents to determine if a file needs to be updated. Do not rely on modification timestamps.
3. **Logging:** Log all file operations (e.g., "COPY file.txt", "REMOVE old_dir", "UPDATE changed.log") to both the console and the specified log file. Each log entry should be timestamped.
4. **Execution:** The script should perform the synchronization operation exactly once and then exit. It should not run in a loop.

Requirements:

- Use Python 3.
- Use the `argparse` library for command-line argument parsing.
- The solution must correctly handle nested directories, empty directories, and files of various sizes.
- The script should be a single, self-contained file.

289

Apr 9, 2026 09:38
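The change-detection rule in this task (compare MD5 hashes of contents, not timestamps) can be sketched as follows. This is a partial illustration under stated assumptions: it only plans operations for files, leaves out directory handling, copying, logging and `argparse`, and the names `md5_of` and `plan_file_ops` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files are not read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_file_ops(source: Path, replica: Path) -> list[tuple[str, str]]:
    """List the COPY / UPDATE / REMOVE operations needed for files (directories omitted)."""
    source_files = {p.relative_to(source) for p in source.rglob("*") if p.is_file()}
    replica_files = {p.relative_to(replica) for p in replica.rglob("*") if p.is_file()}

    ops = []
    for rel in sorted(source_files - replica_files):
        ops.append(("COPY", rel.as_posix()))
    for rel in sorted(source_files & replica_files):
        # Content comparison only; modification times are deliberately ignored.
        if md5_of(source / rel) != md5_of(replica / rel):
            ops.append(("UPDATE", rel.as_posix()))
    for rel in sorted(replica_files - source_files):
        ops.append(("REMOVE", rel.as_posix()))
    return ops
```

Separating the planning step from the file operations makes the comparison logic easy to test in a temporary directory before anything is copied or deleted.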

Coding

Google Gemini 2.5 Flash VS OpenAI GPT-5.4

Implement a Lock-Free Concurrent LRU Cache

Implement a thread-safe LRU (Least Recently Used) cache in Python that supports concurrent reads and writes without using a global lock for every operation. Your implementation must satisfy the following requirements:

1. **Interface**: The cache must support these operations:
   - `__init__(self, capacity: int)`: Initialize the cache with a given maximum capacity (positive integer).
   - `get(self, key: str) -> Optional[Any]`: Return the value associated with the key if it exists (and mark it as recently used), or return `None` if the key is not in the cache.
   - `put(self, key: str, value: Any) -> None`: Insert or update the key-value pair. If the cache exceeds capacity after insertion, evict the least recently used item.
   - `delete(self, key: str) -> bool`: Remove the key from the cache. Return `True` if the key was present, `False` otherwise.
   - `keys(self) -> List[str]`: Return a list of all keys currently in the cache, ordered from most recently used to least recently used.
2. **Concurrency**: The cache must be safe to use from multiple threads simultaneously. Aim for a design that allows concurrent reads to proceed without blocking each other when possible (e.g., using read-write locks, fine-grained locking, or lock-free techniques). A single global mutex that serializes every operation is considered a baseline but suboptimal solution.
3. **Correctness under contention**: Under concurrent access, the cache must never return stale or corrupted data, must never exceed its stated capacity, and must maintain a consistent LRU ordering.
4. **Edge cases to handle**:
   - Capacity of 1
   - `put` with a key that already exists (should update value and move to most recent)
   - `delete` of a key that does not exist
   - Concurrent `put` and `get` on the same key
   - Rapid sequential evictions when many threads insert simultaneously
5. **Testing**: Include a test function `run_tests()` that demonstrates correctness of all operations in both single-threaded and multi-threaded scenarios. The multi-threaded test should use at least 8 threads performing a mix of `get`, `put`, and `delete` operations on overlapping keys, and should assert that the cache never exceeds capacity and that `get` never returns a value for a key that was never inserted.

Provide your complete implementation in Python. Use only the standard library (no third-party packages). Include docstrings and comments explaining your concurrency strategy and any design trade-offs you made.

342

Mar 23, 2026 17:47
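The task above describes a single global mutex as the "baseline but suboptimal" design. For reference, that baseline might look like the sketch below: it pins down the required interface semantics (recency updates, eviction, key ordering) but deliberately does not attempt the finer-grained concurrency the task asks for. The class name `BaselineLRUCache` is hypothetical.

```python
import threading
from collections import OrderedDict
from typing import Any, List, Optional


class BaselineLRUCache:
    """LRU cache serialized by one lock; the OrderedDict keeps least recent first."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be a positive integer")
        self._capacity = capacity
        self._data: "OrderedDict[str, Any]" = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key: str) -> Optional[Any]:
        with self._lock:
            if key not in self._data:
                return None
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]

    def put(self, key: str, value: Any) -> None:
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            if len(self._data) > self._capacity:
                self._data.popitem(last=False)  # evict least recently used

    def delete(self, key: str) -> bool:
        with self._lock:
            if key in self._data:
                del self._data[key]
                return True
            return False

    def keys(self) -> List[str]:
        with self._lock:
            # Stored order is least to most recent, so reverse it.
            return list(reversed(self._data))
```

Note that `get` mutates the recency order, which is why even reads take the lock here; designs that let reads proceed concurrently have to handle that update some other way.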
