AI keeps getting capabilities that I previously would've thought would require AGI. Solving FrontierMath problems, explaining jokes, generating art and poetry. And this isn't new. We once presumed that playing grandmaster-level chess would require AGI. I'd like to have a market for predicting when we'll reach truly human-level intelligence -- mentally, not physically -- that won't accidentally resolve YES on a technicality.
But I also want to allow for the perfectly real possibility that people like Eric Drexler are right and that we get AGI without the world necessarily turning upside-down.
I think the right balance, what best captures the spirit of the question, is to focus on Leopold Aschenbrenner's concept of drop-in remote workers.
However, it's possible that that definition is also flawed. So, DO NOT TRADE IN THIS MARKET YET. Ask more clarifying questions until this FAQ is fleshed out enough that you're satisfied that we're predicting the right thing.
FAQ
1. Will this resolve the same as other AGI markets on Manifold?
It probably will but the point of this market is to avoid getting backed into a corner on a definition of AGI that violates the spirit of the question.
2. Does passing a long, informed, adversarial Turing Test count?
This is a necessary but not sufficient condition. Matthew Barnett on Metaculus has a lengthy definition of this kind of Turing test. I believe it's even stricter than the Longbets version. Roughly, the idea is to have human foils who are PhD-level experts in at least one STEM field, judges who are PhD-level experts in AI, each interview lasting at least 2 hours, and everyone trying their hardest to act human and to distinguish humans from AIs.
One way I'd reject the outcome of such a Turing test is if the judges were able to identify the AI because it was too smart or too fast, or by any other tell that was unrelated to actual intelligence/capability.
One could also imagine an AI that passes such a test but can't stay coherent for more than a couple hours and is thus not AGI according to the spirit of the question. Stay tuned for additional FAQ items about that.
3. Does the AGI have to be publicly available?
No, but we won't just believe Sam Altman or whoever; in fact, we'll err on the side of disbelieving. (This needs to be pinned down further.)
4. What if running the AGI is obscenely slow and expensive?
The focus is on human-level AI, so if it's more expensive than hiring a human at market rates and slower than a human, that doesn't count.
5. What if it's cheaper but slower than a human, or more expensive but faster?
Again, we'll use the drop-in remote worker concept here. I don't know the threshold yet, but it needs to commonly make sense to use an artificial worker rather than hire a human. Being sufficiently cheaper despite being slower or being sufficiently faster despite being pricier could count.
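To make that tradeoff concrete, here's a purely illustrative sketch: compare effective cost per acceptably completed task rather than raw hourly price or raw speed. The numbers and the cost-per-task framing are assumptions for illustration, not the resolution formula.

```python
# Illustrative only: made-up numbers comparing an AI worker to a human on
# cost per unit of acceptable work. Not the official resolution formula.

def cost_per_task(hourly_cost: float, tasks_per_hour: float, success_rate: float) -> float:
    """Effective cost of one acceptably completed task."""
    return hourly_cost / (tasks_per_hour * success_rate)

human = cost_per_task(hourly_cost=50.0, tasks_per_hour=2.0, success_rate=0.95)
ai    = cost_per_task(hourly_cost=200.0, tasks_per_hour=10.0, success_rate=0.90)

print(f"human: ${human:.2f}/task, AI: ${ai:.2f}/task")
# With these made-up numbers the AI is pricier per hour but cheaper per
# completed task -- the kind of tradeoff this FAQ item is gesturing at.
```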
6. If we get AGI in, say, 2027, does 2028 also resolve YES?
No, we're predicting the exact year. This is a probability density function (pdf), not a cumulative distribution function (cdf).
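To make the pdf-vs-cdf distinction concrete, here's a tiny sketch with made-up per-year probabilities (an assumption for illustration, not anyone's actual forecast):

```python
# Illustrative only: made-up per-year probabilities showing the pdf/cdf distinction.
from itertools import accumulate

years = [2026, 2027, 2028, 2029]
p_exact_year = [0.05, 0.10, 0.15, 0.20]        # pdf: AGI arrives in exactly that year
p_by_year = list(accumulate(p_exact_year))     # cdf: AGI arrives in that year or earlier

for y, p, c in zip(years, p_exact_year, p_by_year):
    print(f"{y}: P(exactly {y}) = {p:.2f}, P(by end of {y}) = {c:.2f}")
# This market's answers are the pdf column: if AGI arrives in 2027, only the
# 2027 answer resolves YES; 2028 and later resolve NO.
```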
Ask clarifying questions before betting! I'll add them to the FAQ.
Note that I'm betting in this market myself and am committing to make the resolution fair. Meaning I'll be transparent about my reasoning and if there are good faith objections, I'll hear them out, we'll discuss, and I'll outsource the final decision if needed. But, again, note the evolving FAQ. The expectation is that bettors ask clarifying questions before betting, to minimize the chances of it coming down to a judgment call.
Related Markets
https://manifold.markets/ManifoldAI/agi-when-resolves-to-the-year-in-wh-d5c5ad8e4708
https://manifold.markets/dreev/will-ai-be-lifechanging-for-muggles
https://manifold.markets/dreev/will-ai-have-a-sudden-trillion-doll
https://manifold.markets/dreev/instant-deepfakes-of-anyone-within
https://manifold.markets/dreev/will-ai-pass-the-turing-test-by-202
https://manifold.markets/ScottAlexander/by-2028-will-there-be-a-visible-bre
https://manifold.markets/MetaculusBot/before-2030-will-an-ai-complete-the
https://manifold.markets/ZviMowshowitz/will-we-develop-leopolds-dropin-rem
Scratch area for auto-generated AI updates
Don't believe what magically appears down here; I'll add clarifications to the FAQ.
Doesn't #2 (pass long, adversarial Turing Test) mean being unaligned is a "necessary condition" for AGI? I fail to see how an aligned AI would fool its users into believing it's human.
It seems, based on the current FAQ, that this market is asking "In what specific year will I think we have achieved unaligned AGI?"
@Primer Interesting objection. Suppose we have a human-level AI that's very honest and will always fess up if you ask it point-blank whether it's human. We want that AI to pass our test but it trivially fails a Turing test if it outs itself like that. My first reaction is that I don't think it counts as unaligned if it's willing to role-play and pretend to be human. But if we don't like that, there's another solution: have the humans role-play silicon versions of themselves. Now the humans will also "admit" to being artificial and the judges can no longer use "are you human?" to clock the AI.
This is related to the genius of the Turing test. Usually I think about how all the ways a Turing test can be defeated -- keying in on typing speed or typos or whether the AI is willing to utter racial slurs -- are easy enough for a smart enough AI to counter. It's not hard to mimic human typing speed, insert the right distribution of typos, or override the safety training to fully mimic a human. But you make a good point that we might not want that kind of deception. So instead we can just copyedit the humans' replies, insert delays, and instruct the humans to strictly apply the show-don't-tell rule in asserting their humanity.
In general the idea of using a Turing test to define AGI is that the judges should attempt to identify the AI purely via what human abilities it's lacking. If there are no such abilities, the AI passes the test. Any other way of identifying the AI is cheating.
@dreev Yeah, maybe we could develop a test which mitigates that problem (humans have to try to pass as an AI + fast narrow AI is used to obfuscate things like typing speed or add some muddled words if it's live speech, etc.), but that leads us pretty far from a nice and simple Turing Test, and, more importantly, there's no way such a test will actually be performed.
The only way I see is dropping any requirement of "passing a Turing Test" entirely.
Metaculus has taken a stab at this (thanks to @PeterMcCluskey for the pointer):
https://www.metaculus.com/questions/5121/date-of-artificial-general-intelligence/
In short, they require a unified system that passes all of the following:
- A long, informed, adversarial Turing test (see FAQ 2)
- Robotics: assembling a model car using only the instructions written for humans
- Min 75% on every task and mean 90% across tasks on the MMLU benchmark (see the sketch below)
- Min 90% top-1 (first-try) accuracy on the APPS benchmark
And by "unified" they mean it can't be something cobbled together that passes each of those. The same system has to be able to do all of it and fully explain itself.
I think we want something even stronger (except the robotics part, which we're explicitly dropping for this market) but this feels like a good start.
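For what it's worth, here's a hedged sketch of how the MMLU thresholds combine: the minimum is taken over per-task scores and the mean across tasks, and both have to clear their bars. The scores below are made up for illustration (MMLU actually has 57 tasks):

```python
# Illustrative only: how the "min 75% / mean 90%" MMLU criterion combines,
# using a handful of made-up per-task scores.

mmlu_task_scores = {
    "abstract_algebra": 0.82,
    "college_physics": 0.91,
    "moral_scenarios": 0.77,
    "professional_law": 0.95,
}

def passes_mmlu_criterion(scores: dict[str, float],
                          min_per_task: float = 0.75,
                          mean_across_tasks: float = 0.90) -> bool:
    values = list(scores.values())
    return min(values) >= min_per_task and sum(values) / len(values) >= mean_across_tasks

print(passes_mmlu_criterion(mmlu_task_scores))  # False: min clears 0.75 but mean is ~0.86
```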
Has such a test been performed and “trained” for? I’m pretty sure that if someone fine-tuned existing SOTA models for this task it could be done today, under the right conditions, though potentially not meeting the cost or timing requirements.
I’d be curious to know if there have been any attempts at this. If there aren’t any attempts made in a given year that would qualify for the purposes of this market, that doesn’t necessarily mean we couldn’t have met the criteria. I think it’s still a valid criterion to have, but perhaps it’s less informative as an AGI indicator as a result.
@MarcusM Sounds like the question is, what if we get AI that could in principle count as AGI but isn't packaged into something that meets the criteria we come up with? In that case I'm inclined to say it doesn't count yet. We wait till it meets our explicit criteria (which we need to continue to pin down!).
@VitorBosshard Exactly. This is a probability density function (pdf), not a cumulative distribution function (cdf). Adding that to the FAQ now; thank you!
Note this bit from the market description (I edited it just now as well, to be clearer about outsourcing the final decision if there's no way to avoid it coming down to a judgment call):
Note that I'm betting in this market myself and am committing to make the resolution fair. Meaning I'll be transparent about my reasoning and if there are good faith objections, I'll hear them out, we'll discuss, and I'll outsource the final decision if needed. But, again, note the evolving FAQ. The expectation is that bettors ask clarifying questions before betting, to minimize the chances of it coming down to a judgment call.
You can also take a look at my other markets (or perhaps my AGI Friday newsletter) to decide how much you trust me on this. But I'll definitely be trading.
I thought Claude 3.7 identified a couple useful properties:
# AI as "Drop-in Remote Workers" Resolution Criteria
Here's a concise set of capability-based resolution criteria for determining when an AI system qualifies as a "drop-in remote worker":
1. Independent Task Completion: The AI can take a job description with minimal additional context and complete tasks autonomously without human intervention beyond initial instruction and final review.
2. Communication Competence: The AI can participate in meetings, ask clarifying questions, provide progress updates, and collaborate with team members using standard communication channels.
3. Context Adaptation: The AI can understand and adapt to company-specific terminology, processes, and culture based on limited documentation (employee handbook, style guides, etc.).
4. Self-Improvement: The AI independently identifies knowledge gaps and takes initiative to acquire necessary information to complete tasks better over time.
5. Practical Tool Use: The AI can effectively use standard workplace tools (email, chat, productivity software, internal systems) with only standard authorization processes.
6. Reliability Threshold: The AI completes assigned tasks with at least 90% success rate without requiring more supervision than an average human worker in the same role.
Resolution occurs when at least one AI system can consistently satisfy all these criteria across at least three common knowledge-worker domains (e.g., content creation, customer support, data analysis).
I think the "minimal additional context besides a job description" requirement is a bit strong, but maybe it should be interpreted as meaning the AI can ask for the context it needs, as per #4.
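For illustration, here's a hedged sketch of how those criteria might combine into a resolution check. The criterion names, domains, and pass/fail judgments below are all made-up assumptions, not the market's actual resolution procedure:

```python
# Illustrative only: Claude's six criteria expressed as a checklist over domains.

CRITERIA = [
    "independent_task_completion",
    "communication_competence",
    "context_adaptation",
    "self_improvement",
    "practical_tool_use",
    "reliability_threshold",
]

def resolves_yes(evaluations: dict[str, dict[str, bool]], min_domains: int = 3) -> bool:
    """True if the evaluated system satisfies every criterion in at least `min_domains` domains."""
    passing_domains = [
        domain for domain, checks in evaluations.items()
        if all(checks.get(c, False) for c in CRITERIA)
    ]
    return len(passing_domains) >= min_domains

# Hypothetical evaluation of one system across three knowledge-worker domains:
example = {
    "content_creation": {c: True for c in CRITERIA},
    "customer_support": {c: True for c in CRITERIA},
    "data_analysis":    {c: True for c in CRITERIA},
}
print(resolves_yes(example))  # True
```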
@Siebe Nice. Can we avoid getting into the weeds that much with a criterion along the lines of "do bosses hiring for remote positions generally prefer AI to humans?" or "does it generally only make business sense to hire humans because you need their physical skills or for reasons other than actual job performance, such as legal or PR reasons?"?
PS, either way, we may want to pick a threshold for the fraction of the remote workforce this should apply to. Maybe it first becomes true for the cheapest, least-skilled human workers. And maybe we want to say we're hitting the AGI threshold when it's true for a median first-world, college-educated native speaker?