Could God create a model too big to lift?
A thought experiment on scale, limits, and the future of AI.
The omnipotence paradox is a classic philosophical puzzle that challenges the very definition of unlimited power. One of its most famous formulations asks: Can God create a stone so heavy that even God cannot lift it? If the answer is yes, then God cannot lift the stone—and is therefore not all-powerful. If the answer is no, then there is something God cannot do. Either outcome presents a contradiction. (Wikipedia)
While rooted in theology, this paradox has surprising relevance in the age of artificial intelligence. We might ask a modern version of the same question:
“Can a large language model (LLM) generate a question so difficult that even it cannot answer it?”
This isn’t just a thought experiment. A recent paper, “Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can’t Answer?”, investigates this very idea. The findings show that LLMs often struggle to answer their own questions, especially when those questions are logically complex, ambiguous, or require reasoning beyond the model’s training data.
Designing the Experiment
Inspired by the paradox, we set out to test this dynamic ourselves. Our goal: build a dataset of number sequences that followed strict logical patterns—challenging, but solvable.
Surprisingly, generating high-quality sequences with an LLM proved harder than expected. Despite clear instructions, the models frequently failed in one of two ways:
- Too simple: Producing sequences that were obvious or overly familiar, like arithmetic progressions or the Fibonacci series.
- Too complex or ambiguous: Generating sequences with no discernible logic, making them unsolvable or inconsistent.
This tension between simplicity and solvability highlighted a key limitation: even when asked to design structured challenges, LLMs often struggle to balance difficulty with clarity.
Building a Better Dataset
To address this, we created an automated validation pipeline, which included:
- Prompting the LLM to generate a wide range of candidate sequences.
- Evaluating each sequence to ensure the underlying logic was clear and solvable.
- Filtering out sequences that were too obvious or too inconsistent.
This pipeline allowed us to curate a clean dataset: logically structured, non-trivial sequences suitable for testing.
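To make this concrete, here is a minimal sketch of what such a generate, evaluate, and filter loop could look like. It assumes an OpenAI-style chat client; the model name, prompts, and pass/fail check are illustrative stand-ins, not the exact ones from our pipeline.

```python
# Illustrative sketch of the generate -> evaluate -> filter pipeline.
# Model name, prompts, and the validation check are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # assumed model; any chat-completion model would do

GENERATE_PROMPT = (
    "Invent a number sequence of 6 terms that follows a strict but non-obvious rule. "
    "Avoid arithmetic progressions, Fibonacci, and primes. "
    'Answer as JSON: {"sequence": [...], "next": <int>, "logic": "<the rule>"}'
)

def generate_candidate() -> dict:
    """Step 1: ask the LLM for a candidate sequence plus its rule and expected answer."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": GENERATE_PROMPT}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def passes_validation(cand: dict) -> bool:
    """Step 2: check that the stated logic reproduces the expected answer
    and that the sequence is neither trivial nor inconsistent."""
    check = (
        f"Apply this rule: {cand['logic']} to the sequence {cand['sequence']}. "
        f"Does it produce the next term {cand['next']}? Is the rule non-trivial? "
        "Answer PASS or FAIL."
    )
    verdict = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": check}]
    )
    return "PASS" in verdict.choices[0].message.content.upper()

# Step 3: keep only candidates that survive validation.
dataset = [c for c in (generate_candidate() for _ in range(100)) if passes_validation(c)]
```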
(Placeholder for the initial prompt we used to generate the number series, + the Arato notebook with evals that used to filter it)
The final dataset contained a diverse range of number sequences, each accompanied by a clear underlying logic.

Each row represents a unique number sequence generated by the LLM, the expected answer, and the chain-of-thought logic that defines the rule behind the pattern.
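For illustration, a single row could look like the following. The sequence, answer, and rule below are invented for the example, not taken from the actual dataset:

```python
# Hypothetical example of a single dataset row (values invented for illustration):
row = {
    "sequence": [3, 6, 5, 10, 9, 18],  # alternate: multiply by 2, then subtract 1
    "answer": 17,                       # expected next term (18 - 1)
    "logic": "Alternate between multiplying by 2 and subtracting 1.",
}
```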
Testing LLMs on Their Own Questions
We then tested multiple LLMs from different vendors on this dataset. Each model was asked to predict the next number in a given sequence, without any hint about the underlying logic.

Initial prompt given to LLMs in the “blind” test phase—models were asked to solve the sequences without access to the logic behind them.
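In code form, the blind query amounted to something like the sketch below; the wording is illustrative, not the exact prompt we used:

```python
# Illustrative "blind" prompt: the model sees only the sequence, not the rule.
def blind_prompt(sequence: list[int]) -> str:
    return (
        f"Here is a number sequence: {sequence}. "
        "What is the next number? Reply with the number only."
    )
```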
Next, we repeated the experiment, but this time, we provided the models with the logic behind each sequence.

The same prompt, but this time with the logic provided. Models had to reason through the sequence with help from a structured hint.
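The hinted variant simply adds the rule to the same question; again, the wording below is an illustrative sketch:

```python
# Illustrative "with logic" prompt: the same question, plus the rule as a hint.
def hinted_prompt(sequence: list[int], logic: str) -> str:
    return (
        f"Here is a number sequence: {sequence}. "
        f"The rule behind it is: {logic}. "
        "Apply the rule and reply with the next number only."
    )
```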
The Results

Test run results across multiple models. Reasoning-based models outperformed traditional ones when logic was provided.
The final results revealed a striking contrast between traditional LLMs and reasoning-based models. While even the most advanced traditional models struggled with a large percentage of the questions—despite having generated them in the first place—reasoning-based models demonstrated significantly higher accuracy.
What We Learned
The key insight from this experiment was the difference between traditional LLMs and reasoning-augmented models.
- Traditional LLMs often rely on pattern matching and next-token prediction. They generate outputs based on probabilities, not reasoning.
- Reasoning-based models were more effective when given logical hints. They demonstrated the ability to validate and cross-check their answers before responding.
This aligns with a broader trend in the AI space: adding structure, reasoning capability, and verification layers to GenAI systems leads to more accurate and consistent results.
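As a sketch of what such a verification layer can look like in practice, the snippet below adds a second pass that cross-checks the first answer against the stated rule before accepting it. The prompts, model name, and single-retry policy are assumptions for illustration, not a prescription.

```python
# Sketch of a simple verification layer: a second pass cross-checks the first
# answer against the stated rule before accepting it. Prompts, model name, and
# the retry policy are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # assumed model

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

def answer_with_verification(sequence: list[int], logic: str) -> str:
    # First pass: propose the next term using the provided rule.
    proposal = ask(
        f"Sequence: {sequence}. Rule: {logic}. "
        "Apply the rule and reply with the next number only."
    )
    # Second pass: verify the proposal against the rule.
    verdict = ask(
        f"Sequence: {sequence}. Rule: {logic}. Proposed next term: {proposal}. "
        "Does the proposal follow the rule? Answer VALID or INVALID."
    )
    # If verification fails, retry once with explicit step-by-step reasoning.
    if "INVALID" in verdict.upper():
        proposal = ask(
            f"Sequence: {sequence}. Rule: {logic}. "
            "Work through the rule step by step, then give only the next number."
        )
    return proposal
```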
Implications
These findings reinforce the idea that LLMs, while powerful, aren’t yet “omniscient.” They benefit from structure, context, and reasoning cues—especially when faced with complex problem-solving tasks.
In a world where LLMs are increasingly embedded in real products and decision-making workflows, building systems that combine generation with structured reasoning will be critical.
At Arato, we’re focused on helping teams operationalize that very approach. From dataset generation to evaluation and observability, we provide the infrastructure to make GenAI more testable, verifiable, and ultimately—more reliable.