This question is about any model by a competitor vs. any current or future OpenAI model.
To surpass ChatGPT, a model cannot just be more popular. If a language model exists that is undoubtedly the most accurate, reliable, capable, and powerful, that model will win regardless of popularity (provided at least some members of the public have access).
If there is dispute as to which is more powerful, a significant popularity/accessibility advantage will decide the winner. There must be public access for it to be eligible.
Explaining the three main metrics for assessment:
1. Popularity and Accessibility
Assessed in real terms: total users and engagement (if public). Otherwise I will defer to Google Trends or other publicly available data which can provide relative measures of popularity.
This metric is arguably the most important, but it is not the only factor. If an Android or iPhone AI gets pushed to the existing OS, the existing assistant leaders in those areas (e.g. Siri, Google Assistant) will have a meaningful advantage, irrespective of product quality.
The popularity and userbase elements are essentially being used as a substitute for "usefulness".
2. Accuracy and Reliability
Assessed against a common set of questions or data, to be determined. I expect there will be studies or other academic materials published which I hope to refer to. If these papers conclude that "ExampleRacistBot" is more accurate than ChatGPT because of censoring, etc., then "ExampleRacistBot" would be the more accurate model.
3. Power & Capability
Total computational ability, based on whichever metrics seem most relevant at the time. Right now those might be:
Number of Parameters
Tokens/words processed
Problem-solving abilities
Trainability (an example of an ongoing trainability exercise with GPT4: /Mira/will-a-prompt-that-enables-gpt4-to
If there are any recommendations for Power & Capability metrics, please let me know. The biggest issue I see is that if I establish which metrics are most important (say, number of parameters, like everyone did for GPT3), we may end up in a situation like the one we are in now, where GPT4 parameters are not disclosed. Expect these criteria to change over time.
Resolution Assessment
When the final assessment is being made, if there aren't reliable competitive grounds for comparison, the following series of checks is how I see the decision-making process going.
First check: Popularity. If one model is evidently significantly more popular (like >50% market share, or more popular by Google Trends metrics, etc.), then it will be considered the best unless there is evidence suggesting an alternative public AI is more accurate and powerful. If multiple models are within a similar frame of power and accuracy, popularity will determine the winner.
Second check: If there are multiple models of similar popularity, accuracy and power will determine the winner. If there are no good academic comparisons for accuracy, I will endeavour to conduct my own, but I'm really hoping someone else figures that out before I have to. Accuracy seems more tied to "usefulness" than power, but if there is a significant breakthrough in power such that one has a capability advantage over the other (like advanced logic problem solving) while maintaining accuracy, and for whatever reason it's not more popular but still public, that will win.
Third check: If there are multiple models of comparable popularity (or there is terrible data available) and there is no real clear difference in capability, power, or accuracy, the decision will be deferred to more specific considerations, like a Google Trends comparison (relative search popularity) or the number of parameters (provided this is public, and still a respected metric).
If the decision becomes highly subjective, I will defer judgement to someone I deem to be an expert (and have reasonable access to) who will make the final call. I'll probably just email professors until I get a response, or ask someone enrolled in a related course at the time to ask their professor to respond.
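For anyone who prefers to read the checks above as logic, here is a rough sketch of how they might fit together. It is not a formal resolution rule: the `Model` fields, the normalized 0 to 1 scores, the `SIMILAR_MARGIN` threshold, and the example numbers are all hypothetical placeholders, and both models are assumed to be publicly accessible.

```python
from dataclasses import dataclass

# Hypothetical margin below which two scores are treated as "similar".
# The description does not fix a number; 0.05 is only a placeholder.
SIMILAR_MARGIN = 0.05


@dataclass
class Model:
    name: str
    popularity: float  # assumed: estimated market share, 0..1
    accuracy: float    # assumed: normalized benchmark score, 0..1
    power: float       # assumed: normalized capability score, 0..1


def clearly_ahead(x: float, y: float) -> bool:
    """True if x beats y by more than the similarity margin."""
    return x - y > SIMILAR_MARGIN


def technically_better(a: Model, b: Model) -> bool:
    """a clearly leads on accuracy or power and is not clearly behind on the other."""
    leads = clearly_ahead(a.accuracy, b.accuracy) or clearly_ahead(a.power, b.power)
    trails = clearly_ahead(b.accuracy, a.accuracy) or clearly_ahead(b.power, a.power)
    return leads and not trails


def resolve(a: Model, b: Model) -> str:
    """Return the winner's name, or a note that the call is deferred."""
    # A clear technical lead wins regardless of popularity.
    if technically_better(a, b):
        return a.name
    if technically_better(b, a):
        return b.name
    # Technically similar (or disputed): popularity decides.
    if clearly_ahead(a.popularity, b.popularity):
        return a.name
    if clearly_ahead(b.popularity, a.popularity):
        return b.name
    # Nothing separates them: fall back to tie-breakers or an expert.
    return "defer"


# Made-up numbers, for illustration only.
chatgpt = Model("ChatGPT", popularity=0.65, accuracy=0.80, power=0.80)
close_rival = Model("Rival", popularity=0.10, accuracy=0.82, power=0.81)
strong_rival = Model("Rival", popularity=0.10, accuracy=0.80, power=0.95)

print(resolve(chatgpt, close_rival))   # ChatGPT (similar technically, more popular)
print(resolve(chatgpt, strong_rival))  # Rival (clear power lead, accuracy maintained)
```

In practice the hard part is the inputs, not the logic: the scores would come from whatever studies or leaderboards are available at the time, and "defer" stands in for the Google Trends, parameter count, or expert fallback.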
[ Taking advice for updates or any proposed criteria changes ]
Competing for @AmmonLam's subsidy
[Changelog]
01/05/2023: Description updated to reflect thoughts expanded on in comments
@eiis1000 no, I don't think so. OpenAI still holds the top 1 & 2 spots on the arena leaderboard, so it's hard to argue anything has strongly surpassed it.
If there's another clear way of demonstrating one is better than the other I'd be open to hearing it, but it needs to be obvious for this market to resolve early.
I would love to know how this is looking in the eyes of the judges.
To my reading... many models are "as good" as ChatGPT but none are clearly better on technical comparisons... and none are nearly "as popular" right now.
If the judges think differently -- about the state right now -- would be curious to know.
Of course many more months left, both for OpenAI and for the competition.
@Moscow25 I agree, and I think my statement from 2 months ago still applies:
If it closed right now, there is no standout model that definitively surpasses technically, and chatGPT is more popular -- would resolve NO.
Grok, Claude, and Gemini (maybe Llama?) are technically competitive, but none clearly surpass GPT4o on language capabilities. If it closed now, it would come down to popularity.
Apologies for the slow response!
@Gen for what it's worth, I think if the top 2 models are from different companies (and likely even if they are from the same company, e.g., GPT-4 and GPT-4o), there will be disputes about which is better technically. Measuring LLM performance is a very contentious and difficult topic, especially with how easy it is for LLMs to be trained to the test (i.e., overfit).
Going off of this comment: https://manifold.markets/Gen/will-there-be-an-ai-language-model#rltduH4nnjsSdl0fGoVl I think it shouldn't
What do you think should be added to the title? I'm open to revisions that add clarity. "Surpass" in this context covers multiple criteria, and yeah, it's gross how long the description is + the comments etc., so if anything contradicts the title/description I will update it.
I don't believe Claude 3.5 has surpassed the best OpenAI models enough to resolve this market
OpenAI is still #1 on the LMSYS leaderboard
OpenAI is still #1 for market share (last report I saw they were estimated at ~65%)
I don't want this to be some Betamax situation where a model is technically better but unadopted (Betamax peaked at ~25% market share). My comment there was clarifying that if someone did release a model that beat them technically, it wouldn't be sufficient unless it was a huge step, or the competitor could leverage it into market share. E.g. OpenAI lost their #1 on LMSYS briefly earlier in the year, but it was extremely close, most people didn't even know about it, and ChatGPT was back on top almost immediately.
I have been using claude 3.5 sonnet though, and for the language model capabilities, I think I prefer it.
I added "unequivocally" to the title ("unequivocally passes") but instantly reverted it. I think this is a good idea but it might not exactly fit the spirit. It might, but it's late here and I need to sleep, so I'll spend more time on it in the morning and make any changes then.
I like this suggestion tho, I'll get back to this in about 8-9 hours
I changed it to "Strongly surpasses".
"Unequivocally" is close to what I want, but as I have outlined, it doesn't exactly need to be unequivocal to resolve yes. If it's unequivocally better it will resolve early, but if at the end of the year there is a model that is meaningfully technically better, it will win. If the power/accuracy capabilities are close, it will resolve based on popularity.
If it closed right now, there is no standout model that definitively surpasses technically, and chatGPT is more popular -- would resolve NO.
Zvi writes, "There is a new clear best (non-tiny) LLM" ... Read more: https://thezvi.substack.com/p/on-claude-35-sonnet