Or maybe no one's applied it to that benchmark yet.

To operationalize this, the question will resolve based on the LeanDojo benchmark, in particular the Pass@1 metric, where "The prover is given only one attempt and must find the proof within a wall time limit of 10 minutes."

GPT-4 is reported to achieve an accuracy of 28.8% on the "random" split of the test data in Table 2 of the LeanDojo paper.

This question closes when an evaluation of Gemini's performance on this task is brought to my attention.
