Will there be an AI language model that strongly surpasses ChatGPT and other OpenAI models before the end of 2024?


resolved Dec 31

Resolved


The question is about any models by competitors vs any current or future OpenAI models.

To surpass ChatGPT, it cannot just be more popular. If a language model exists that is undoubtedly the most accurate, reliable, capable, and powerful, that model will win regardless of popularity (provided at least some members of the public have access).

If there is dispute as to which is more powerful, a significant popularity/accessibility advantage will decide the winner. There must be public access for it to be eligible.

Explaining the three main metrics for assessment:

1. Popularity and Accessibility

Assessed in real terms. Total users and engagement (if public), otherwise I will defer to Google Trends or other publicly available data which can provide relative measures of popularity.

This metric is arguably the most important, but it is not the only factor: if an AI gets pushed to an existing OS such as Android or iOS, the existing assistant leaders in those areas (e.g. Siri, Google Assistant) will have a meaningful advantage, irrespective of product quality.

The popularity and userbase elements are essentially being used as a substitute for "usefulness".

2. Accuracy and Reliability

Assessed against a common set of questions or data, to be determined. I expect there will be studies or other academic materials published which I hope to refer to. If these papers conclude that "ExampleRacistBot" is more accurate than ChatGPT because of censoring, etc., then "ExampleRacistBot" would be the more accurate model.

3. Power & Capability

Total computational ability, based on whichever metrics seem most relevant at the time. Right now those might be:

Number of Parameters
Tokens/words processed
Problem-solving abilities
Trainability (an example of an ongoing trainability exercise with GPT4: /Mira/will-a-prompt-that-enables-gpt4-to)

If there are any recommendations for Power & Capability metrics, please let me know. The biggest issue I see is that if I establish which metrics are most important (say, number of parameters, like everyone did for GPT3) we may end up in a situation like we are in now, where GPT4 parameters are not disclosed. Expect these criteria to change over time.

Resolution Assessment

When the final assessment is being made, if there aren't reliable competitive grounds for comparison, the following series of checks is how I see the decision-making process going.

First check: Popularity. If one model is evidently significantly more popular (like >50% market share, or more popular by Google Trends metrics, etc.), then it will be considered the best unless there is evidence suggesting an alternative public AI is more accurate and powerful. If multiple models are within a similar frame of power and accuracy, popularity will determine the winner.

Second check: If there are multiple models of similar popularity, accuracy and power will determine the winner. If there are no good academic comparisons for accuracy, I will endeavour to conduct my own, but I'm really hoping someone else figures that out before I have to. Accuracy seems more tied to "usefulness" than power, but if there is a significant breakthrough in power such that one has a capability advantage over the other (like advanced logic problem solving) while maintaining accuracy, and for whatever reason it's not more popular but still public, that will win.

Third check: If there are multiple models of comparable popularity (or there is terrible data available) and there is no clear difference in capability, power, or accuracy, the decision will be deferred to more specific considerations, like a Google Trends comparison (relative search popularity) or the number of parameters (provided this is public, and still a respected metric).

If the decision becomes highly subjective, I will defer judgement to someone I deem to be an expert (and have reasonable access to) who will make the final call. I'll probably just email professors until I get a response, or ask someone enrolled in a related course at the time to ask their professor to respond.
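
One way to read the checks above is as a small decision procedure. The following is only a rough sketch in Python, not the actual resolution method: the `Model` fields, the single "technical" score standing in for accuracy and power, and both margin thresholds are hypothetical stand-ins for judgement calls that the description deliberately leaves open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Model:
    name: str
    is_public: bool      # there must be public access to be eligible
    popularity: float    # hypothetical relative score, e.g. market share in [0, 1]
    technical: float     # hypothetical combined accuracy + power score


def pick_winner(
    models: List[Model],
    tech_margin: float = 0.1,   # assumed threshold for "clearly better technically"
    pop_margin: float = 0.2,    # assumed threshold for "significantly more popular"
) -> Optional[str]:
    """Return the winning model's name, or None if the call would be
    deferred to tie-breakers or an outside expert."""
    eligible = [m for m in models if m.is_public]
    if not eligible:
        return None
    if len(eligible) == 1:
        return eligible[0].name

    by_tech = sorted(eligible, key=lambda m: m.technical, reverse=True)
    by_pop = sorted(eligible, key=lambda m: m.popularity, reverse=True)

    # A clear technical leader wins regardless of popularity.
    if by_tech[0].technical - by_tech[1].technical > tech_margin:
        return by_tech[0].name

    # Technically close: a significant popularity advantage decides.
    if by_pop[0].popularity - by_pop[1].popularity > pop_margin:
        return by_pop[0].name

    # No clear difference on either axis: deferred (third check / expert).
    return None
```

With made-up inputs, a model that is only slightly ahead technically loses to a much more popular rival, while a model that is far ahead technically wins even with a small user base. Whether any real gap counts as "clear" is exactly the subjective part that the sketch hides inside the two margins.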

[ Taking advice for updates or any proposed criteria changes ]

Competing for @AmmonLam's subsidy

[Changelog]

01/05/2023: Description updated to reflect thoughts expanded on in comments




sold Ṁ17 YES

Unfortunately will take the 200 mana loss.

Isn't this already the case?

@DavidOman Which model strongly surpasses ChatGPT and other OpenAI models?

@Gen Claude 3.5 Sonnet and/or the Gemini models, with the latter occupying #1 and #2 in Chatbot Arena

@xeophon I think they get dwarfed in popularity/accessibility and power/capability. If they are ahead on arena I’ll take a serious challenger for the winner, otherwise I think it’ll be obvious

@Gen Google is making their Gemini models accessible (indirectly) in a lot of products, I just got a notification on my Smart TV that there will be summaries powered by Gemini. Hard/Impossible to measure this vs. the popularity of a single website. Other proxies (like API usage through OpenRouter, where Gemini also leads) are similarly opaque.

And power/capability is mentioned in numbers we do not have access to, like # of params.

@xeophon also, Gemini replaced(?) assistant in Android, whereas ChatGPT is an option in iOS these days - those deep integrations are not accessible to outsiders.

Is the new (upgraded) Claude 3.5 Sonnet, and/or GPT-o1, considered sufficient to YES-resolve this prompt?

@eiis1000 no, I don't think so. OpenAI still holds the arena leaderboards top 1 & 2 so it's hard to argue anything has strongly surpassed it.

If there's another clear way of demonstrating one is better than the other I'd be open to hearing it, but it needs to be obvious for this market to resolve early.

bought Ṁ50 YES

Does it have to be from someone besides OpenAI?

@Ernie “and other OpenAI models” implies yes to this I think

bought Ṁ500 NO

I would love to know how this is looking in the eyes of the judges.

To my reading... many models are "as good" as ChatGPT but none are clearly better on technical comparisons... and none are nearly "as popular" right now.

If the judges think differently -- about the state right now -- would be curious to know.

Of course many more months left, both for OpenAI and for the competition.

@Moscow25 I agree, and I think my statement from 2 months ago still applies:

If it closed right now, there is no standout model that definitively surpasses technically, and chatGPT is more popular -- would resolve NO.

Grok, Claude, and Gemini (maybe Llama?) are technically competitive, but none clearly surpass GPT4o on language capabilities. If it closed now, it would come down to popularity.

Apologies for the slow response!

The weighting of these features has a large impact.

If user count has any kind of serious weighting then ChatGPT will clearly remain on top.

Otherwise, it is kinda a bet on whether OpenAI releases something that beats Sonnet by EOY, but LMSYS suggests that 4o already does this.

if a language model exists that is undoubtedly the most accurate, reliable, capable, and powerful, that model will win regardless of popularity (provided, at least some members of the public have access).

Popularity is only relevant if it is disputed which is better technically, e.g. right now.

@Gen for what it's worth, I think if the top 2 models are from different companies (and likely even if they are from the same company, e.g., GPT-4 and GPT-4o), there will be disputes about which is better technically. Measuring LLM performance is a very contentious and difficult topic, especially with how easy it is for LLMs to be trained to the test (i.e., overfit).

Isn't Claude 3.5 good enough to resolve this?

I think it will be assessed at EOY, so it would still have to be the best then

But the question states before the end, so anytime it happens, even for a short time, it should resolve to yes.

Going off of this comment: https://manifold.markets/Gen/will-there-be-an-ai-language-model#rltduH4nnjsSdl0fGoVl I think it shouldn't


Should change the title then. Shouldn't have to read a book to decide whether to bet on a market or not.

What do you think should be added to the title? I'm open to revisions that add clarity, "surpass" in this context covers multiple criteria, and yeah, it's gross how long the description is + the comments etc., so if anything contradicts the title/description I will update it.

I don't believe Claude 3.5 has surpassed the best OpenAI models enough to resolve this market

OpenAI is still #1 on the LMSYS leaderboard
OpenAI is still #1 for market share (last report I saw they were estimated at ~65%)

I don't want this to be some Betamax situation where it is technically better but unadopted (peaked at ~25% market share), and my comment there was clarifying that if someone did release a model that beat them technically, it wouldn't be sufficient unless it was a huge step, or the competitor could leverage it into market share. E.g. OpenAI lost their #1 on LMSYS briefly earlier in the year, but it was extremely close, most people didn't even know about it, and ChatGPT was back on top almost immediately.

I have been using claude 3.5 sonnet though, and for the language model capabilities, I think I prefer it.

Write strongly surpass unequivocally

I added unequivocally to the title: “unequivocally passes” but instantly reverted it, I think this is a good idea but might not exactly fit the spirit - it might, but it’s late here and I need to sleep, so I’ll spend more time on it in the morning and make any changes then

I like this suggestion tho, I’ll get back to this in about 8-9hours

I changed it to, "Strongly surpasses".

"Unequivocally" is close to what I want, but as I have outlined it doesn't exactly need to be unequivocal to resolve yes. If it's unequivocally better it will resolve early, but if at the end of the year there is a model that is meaningfully technically better, it will win. If the power/accuracy capabilities are close, it will resolve based on popularity.

If it closed right now, there is no standout model that definitively surpasses technically, and chatGPT is more popular -- would resolve NO.
