Will we reverse-engineer a language model into an interpretable (python) program by 2027?
31 traders · Ṁ2,836 · closes 2027 · 8% chance

One of the most ambitious goals of mechanistic interpretability would be achieved if we could train a neural network and then distill it into an interpretable algorithm that closely resembles the intermediate computations done by the model.
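To illustrate the idea (a toy example, not what this market requires): mechanistic interpretability work has reverse-engineered small networks trained on tasks like modular addition, and the "distilled" result of such an analysis could be ordinary, human-editable Python rather than weight matrices.

```python
# Toy illustration of "distilling" a learned computation into readable code.
# Suppose analysis of a small network trained on addition mod 113 shows it
# implements that function; the interpretable, human-editable replacement
# could simply be:

def add_mod_113(a: int, b: int) -> int:
    """Interpretable stand-in for the network's learned computation."""
    return (a + b) % 113
```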

Some argue this is unlikely to happen (e.g. https://www.lesswrong.com/posts/d52aS7jNcmi6miGbw/take-1-we-re-not-going-to-reverse-engineer-the-ai), while others are actively working toward it.

For the market to resolve YES, a model at least as capable as llama2-7B must be distilled into Python code that humans can understand and edit, and this distilled version must perform at least 95% as well as the original model on every benchmark, except adversarially constructed benchmarks designed specifically to highlight differences between the distilled and original models.
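For concreteness, here is a minimal sketch of the check the 95% criterion implies. The benchmark names and the `score` helper are hypothetical and are not specified by the market's terms.

```python
# Hypothetical sketch of the resolution criterion described above.
# `score(model, bench)` is an assumed helper returning a benchmark score;
# benchmark names are illustrative only.

def resolves_yes(original_model, distilled_program, benchmarks, score) -> bool:
    """True if the distilled program scores at least 95% of the original
    model's score on every (non-adversarial) benchmark."""
    for bench in benchmarks:
        if score(distilled_program, bench) < 0.95 * score(original_model, bench):
            return False
    return True

# Example usage (made-up benchmark list):
# resolves_yes(llama2_7b, distilled_python_program, ["mmlu", "hellaswag", "arc"], score)
```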
