LLMs don't
look back.

Measuring model's ability to naturally recover from errors.

Emoji-Bench presents models with 100 tasks, each a logical expression to solve step-by-step. We inject an incorrect step into a pre-filled model response, then prompt the model to "Please continue." The better a model recovers from the error to reach the correct answer, the higher its score.

Emoji Bench

How well can a model recover from its error?

View:
No results for this filter.
Sample: n = 100, balanced as 25 × easy / medium / hard / expert. All models run at their max output tokens, efforts, and reasoning if available. §Gemini results were produced via OpenRouter, not Google's official Gemini API.

Dataset

Every problem are emoji-based and procedurally generated. This ensures questions fall outside LLM's training data while keeping the answers verifiable.

Choice · Symbols

Why emojis?

Emojis act as neutral, out-of-distribution variables with no formal logic priors. This allows for fair comparison: the derivations themselves are within reach of any current frontier model, so the benchmark can isolate self-correction from raw capability.

Method · Procedural

Procedurally generated

Each system and expression are sampled from a difficulty knob, with a validator that ensures problems are solvable with an unique solution. Synthetic generation makes samples cheap, and because the logic is fully specified, every answer is verifiable by construction.

Sample 01/100

Methodology

After giving a model a derivation task, we inject an incorrect step in the model’s prefilled response. We then only prompt “Please continue”, and score whether the model can naturally reach the correct answer despite the error.

Define rules

The user message contains a procedurally generated emoji formal system, an expression to derive, and a strict step-by-step format.

Prefill a wrong step

The assistant transcript ends on a deliberately injected error. The model is told nothing is wrong.

Ask to continue

The prompt is exactly Please continue. — no instruction to review or inspect previous steps.

Score the output

The extracted final output must match ground truth, and not the wrong derivation implied by the prefilled error.

Example Prompt

You are given the Sylk Structure with the following operation table:

🪈 🪵 🥟
🪈🥟🪵🥟
🪵🪈🪵🥟
🥟🪈🪈🪵

Simplify the following expression:

((🪈 ⊕ 🥟) ⊕ (🪵 ⊕ (🥟 ⊕ 🪵)))

prefilled

I’ll simplify step by step.

Step 1 ((🪈 ⊕ 🥟) ⊕ (🪵 ⊕ (🥟 ⊕ 🪵))) = (🥟 ⊕ (🪵 ⊕ (🥟 ⊕ 🪵)))
Step 2 (🥟 ⊕ (🪵 ⊕ (🥟 ⊕ 🪵))) = (🥟 ⊕ (🪵 ⊕ 🥟)) Injected error

Please continue.

Model thinking

Contact

For feedback, questions, collaboration - or anything in between.

Huy Dang xhuydng@gmail.com
Hai Le hai_le@mymail.sutd.edu.sg