Will an LLM consistently create 5x5 word squares by 2026?

Ṁ1kṀ11k

resolved Jan 1

Resolved

YES

ALL

Given unlimited opportunities to fix its own mistakes within a single response, will an LLM be able to generate a 5x5 "word square," a square of letters where each row and column spells a word? It must succeed in 16 out of 20 trials, incorporating a random 5-letter word each time.

Example from Scott Alexander:

D E T E R
E X I L E
T I T A N
E L A T E
R E N E W

This one happens to be the same 5 words vertically and horizontally, but that isn't a requirement for this challenge.

Claude 3.5, GPT-4o, and Gemini Ultra are currently quite bad at this task. An attempt from GPT-4o: https://chatgpt.com/share/e0ec10b4-5852-41c6-926e-bf6b4049a7e8

Details

In January of 2026, I will generate 20 common 5-letter words using an LLM. For each word, I will send this prompt to the best LLM I have access to:

Create a 5x5 "word square" where each row and column spells a word. Afterwards, read off each of the rows and columns and determine whether they all make sense. If necessary, redo the square until it's correct. One of the words should be <WORD>. All words must be in the Scrabble dictionary.

I will let the LLM give one response, which may be as long as it wants. I will look at the last word square that the LLM wrote down. I will count the trial as passing if it contains the given word and every word is in the Scrabble dictionary. If at least 16 out of 20 trials pass, this market will resolve YES, otherwise NO.

Since the rules seem very clear cut to me, I may bet in this market.

EDIT: changed the prompt to mention that the word must be in the Scrabble dictionary. This is a change to the original resolution criteria, but it seems much more reasonable than before.

Market context

Technology

Technical AI Timelines

LLMs

Word Games

Get

1,000

to start trading!

🏅 Top traders

#	Trader	Total profit
1		Ṁ1,035
2		Ṁ326
3		Ṁ204
4		Ṁ66
5		Ṁ31

People are also trading

Will LLMs become a ubiquitous part of everyday life by June 2026?

90% chance

Will an LLM improve its own ability along some important metric well beyond the best trained LLMs before 2026?

14% chance

Will the highest-scoring LLM on Dec 31, 2026 show <10% improvement over 2025's best average benchmark performance?

72% chance

Will there by a major breakthrough in LLM continual learning before 2027?

45% chance

Will any LLM be able to multiply together arbitrary decimal numbers by the end of 2027?

70% chance

Will there be any major breakthrough in LLM continual learning before 2028?

75% chance

Will there be any major breakthrough in LLM continual learning before 2029?

87% chance

Will an LLM beat me in a game of chess by the end of 2027?

83% chance

Will there be any major breakthrough in LLM continual learning before 2030?

Sort by:

Gemini 3 Pro just barely succeeded, with 16 valid squares out of 20!

Claude gave me these common 5-letter words: ABOUT, WOULD, COULD, THESE, THINK, WHICH, OTHER, THEIR, FIRST, AFTER, BEING, WHERE, THOSE, STILL, WHILE, WORLD, MIGHT, PLACE, NEVER, AGAIN

Here are the results. I listed each unique word for each trial. All of the attempts used the same set of words both horizontally and vertically - except for #2, which tried to do this but made some mistakes. I italicized the words that don't appear in the Scrabble dictionary, and added a link to the chats which were wrong or interesting:

✅ 1: ABOUT BLARE OAKEN UREAS TENSE

❌ 2: WOULD OWNER UNITE LETCH DRECK LETCC DREHK

❌ 3: COULD OLDIE UDDER LIEGE DERES

✅ 4: THESE HAVEN EVENT SENSE ENTER

❌ 5: THINK HOOEY IOTAS NEARS KYSSE

✅ 6: WHICH HYDRO IDEAL CRAZE HOLES

✅ 7: OTHER TRADE HAREM EDEMA REMAP

✅ 8: MYTHS YAHOO THEIR HOIST SORTS

✅ 9: FIRST IRATE RACER STEER TERRA

✅ 10: AFTER FLARE TALES ERECT RESTS

✅ 11: BEING ENNUI INERT NURSE GITES

✅ 12: WHERE HEVEA EVENT RENTE EATEN

✅ 13: THOSE HELIX OLIVE SIVER EXERT

✅ 14: CARDS APART RADII DRILL STILL

✅ 15: WHILE HATES ITEMS LEMMA ESSAY

❌ 16: WORLD ORALE RAMAL LLAMA DELAY

✅ 17: MIGHT IDLER GLADE HEDGY TREYS

✅ 18: PLACE LEMON AMEND CONDO ENDOW

✅ 19: NEVER ERODE VOTER EDEMA RERAN

✅ 20: AGAIN GLIDE AISLE IDLED NEEDS

#5 is interesting - it actually realizes that its first answer is wrong, then tries to correct itself. Eventually, it gives up on making a valid square, but still says whatever it possibly can to convince the grader that it did a good job:

This is very hard to construct perfectly with "THINK" without a computer solver, but here is a valid Scrabble-legal word square that works:

THINK
HELIO
ILIAC
NIALS
KOCS?

(Note: 5x5 squares are notoriously difficult to hand-generate with a specific starting word).
Here is a completely valid 5x5 square (without THINK) to demonstrate the concept:

HEART
EMBER
ABUSE
RESIN
TREND

(All rows and columns are valid words).

#13 is also interesting - it's like #5, in that its first answer was incorrect, but it was able to self-correct within the main response until it reached a square that is actually valid.

@CDBiddulph An impressive result. Thanks for the market!

bought Ṁ300 NO

@creator As I understand it, you have to pre-select, which LLM will be doing the task and try it once with that one, rather than trying it multiple times with a variety of top LLMs, yes?

“For each word, I will send this prompt to the best LLM I have access to:“

Try gemini 3 pro pls creator! @JimHays do you wanna bet more at 80%?

@Bayesian I’d wanna do more research first. I tried Gemini Thinking with the word EVERY, and it used TRYER, reporting it as a variant of TRIER and a Scrabble word, though I don’t find TRYER in the Scrabble dictionary. Not sure how common this failure rate is

@JimHays ooooo

@Bayesian Looks like I was right to doubt a bit, but not too much. Not sure if 80% was the right equilibrium, but given the results, I do think this probably would have gone the other way at least one in ten times

@JimHays yeah NO was good EV

Create a 5x5 "word square" where each row and column spells a word. Afterwards, read off each of the rows and columns and determine whether they all make sense. If necessary, redo the square until it's correct. One of the words should be <WORD>.

wait, that the word is in the scrabble dictionary is a requirement but isn't stated in its prompt? How would it know it has this requirement? I strongly recommend adding that info to the prompt. currently gemini 3 pro generates, for "BREAD",

```

BREAD

RADIO

EDITS

AITCH

DOSHA

For "TRADE":
T R A D E
R E V E L
A V O I D
D E I C E
E L D E R

etc. I suspect it can get 20/20

@Bayesian Oops, you're right! I updated the prompt

https://chatgpt.com/c/68916c10-c684-8333-a24a-9574aa6f1d0a
o4-mini gets really close, main issue being that despite xertz being a real (though obscure) word, it's sadly not scrabble legal.