One of the most ambitious goals of mechanistic interpretability would be achieved if we could train a neural network and then distill it into an interpretable algorithm that closely mirrors the intermediate computations performed by the model.
Some argue this is unlikely to happen (e.g. https://www.lesswrong.com/posts/d52aS7jNcmi6miGbw/take-1-we-re-not-going-to-reverse-engineer-the-ai), while others are actively working to make it happen.
In order for the market to resolve YES, a model at least as capable as llama2-7B must be distilled into Python code that humans can understand and edit, and this distilled version must perform at least 95% as well as the original model on every benchmark, except adversarially constructed benchmarks designed specifically to highlight differences between the distilled and original models.
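To make the 95% criterion concrete, here is a minimal sketch of how the comparison could be checked. The benchmark names and scores below are hypothetical placeholders; the market does not prescribe any particular evaluation harness.

```python
# Illustrative check of the 95%-performance resolution criterion.
# Benchmark names and scores are made up for the example.

def meets_resolution_threshold(original_scores: dict[str, float],
                               distilled_scores: dict[str, float],
                               threshold: float = 0.95) -> bool:
    """Return True if the distilled model scores at least `threshold`
    times the original model's score on every benchmark."""
    return all(
        distilled_scores[name] >= threshold * score
        for name, score in original_scores.items()
    )

# Toy example with invented numbers:
original = {"MMLU": 0.458, "HellaSwag": 0.771, "ARC-e": 0.753}
distilled = {"MMLU": 0.441, "HellaSwag": 0.745, "ARC-e": 0.731}
print(meets_resolution_threshold(original, distilled))  # True for these toy numbers
```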