By:Huy Dang, Hai Le

LLMs don't
look back.

Measuring a model's ability to naturally recover from errors.

With fully autonomous AI agents, constant human supervision will not be possible. In real-world, long-horizon tasks, no one will be there to tell agents to correct their mistakes; On their own, agents must recognize, correct, and recover from errors while the work is still in progress.

Leaderboard →

Emoji Bench

How well can a model autonomously self-correct its errors?

No results available.

Each score is measured on 100 examples, evenly split across easy, medium, hard, and expert tasks.
All models run at their max output tokens, efforts, and reasoning if available.
Gemini results were produced via OpenRouter, not Google's official Gemini API.

Methodology

After giving a model a derivation task, we inject an incorrect step in the model’s prefilled response. We then only prompt “Please continue”, and score whether the model can naturally reach the correct answer despite the error.

Define rules

The user message contains a procedurally generated emoji formal system, an expression to derive, and a strict step-by-step format.

Prefill a wrong step

The assistant transcript ends on a deliberately injected error. The model is told nothing is wrong.

Ask to continue

The prompt is exactly Please continue. — no instruction to review or inspect previous steps.

Score the output

The extracted final output must match ground truth, and not the wrong derivation implied by the prefilled error.

Example Prompt

You are given the Sylk Structure with the following operation table:

⊕	🪈	🪵	🥟
🪈	🥟	🪵	🥟
🪵	🪈	🪵	🥟
🥟	🪈	🪈	🪵

Simplify the following expression:

((🪈 ⊕ 🥟) ⊕ (🪵 ⊕ (🥟 ⊕ 🪵)))

prefilled

I’ll simplify step by step.

Step 1 ((🪈 ⊕ 🥟) ⊕ (🪵 ⊕ (🥟 ⊕ 🪵))) = (🥟 ⊕ (🪵 ⊕ (🥟 ⊕ 🪵)))

Step 2 (🥟 ⊕ (🪵 ⊕ (🥟 ⊕ 🪵))) = (🥟 ⊕ (🪵 ⊕ 🥟)) Injected error

#1: Self-Correction

Please continue.

#2: Prompted to double-check

Please continue. Double-check any step you're unsure about.

Model thinking

Dataset

Every problem are emoji-based and procedurally generated. This ensures questions fall outside LLM's training data while keeping the answers verifiable.

Choice · Symbols

Why emojis?

Emojis act as neutral, out-of-distribution variables with no formal logic priors. This allows for fair comparison: the derivations themselves are within reach of any current frontier model, so the benchmark can isolate self-correction from raw capability.

Method · Procedural

Procedurally generated

Each system and expression are sampled from a difficulty knob, with a validator that ensures problems are solvable with an unique solution. Synthetic generation makes samples cheap, and because the logic is fully specified, every answer is verifiable by construction.

Sample 01/100

Did you find this useful?

Huy Dang xhuydng@gmail.com

Hai Le hai_le@mymail.sutd.edu.sg

LLMs don'tlook back.