Note that this still doesn't seem as good at formal theorem proving as some of the specialized prover models that aren't LLM-based. In particular, they show about a 3% increase in theorems proved over COPRA on the MiniF2F Lean dataset, but COPRA's own paper shows it proves about 18.5% fewer theorems than Proverbot9001 on the CompCert dataset (their pitch is that they are faster, not better at proving, than other tools). Assuming COPRA isn't just much better at Lean than at Coq (there should be some difference, but since it's the same algorithm run over a uniform interface to Coq and Lean, it's unlikely to be large), Llemma should be proving 10-15% fewer theorems than Proverbot9001's algorithm. There's some trickiness to this kind of cross-benchmark comparison, and COPRA picks a somewhat odd subset to compare against Proverbot9001 on to save costs, while also using a slightly older version of Proverbot9001 with worse results. But it probably still washes out to the best LLM-based provers being about 10-15% worse at proving theorems than the best specialized models.
EDIT: To be clear, that's 10-15 percentage points of the total theorems in a test set, not a relative 10-15% improvement. Given that solve rates for these tools are in the 10-30% range, that's much more than a relative 10-15% difference.
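As a rough illustration with made-up numbers (the 24% vs. 12% solve rates below are purely hypothetical):

    # Hypothetical solve rates, just to show absolute vs. relative gaps.
    specialized = 0.24   # specialized prover solves 24% of the test set
    llm_based = 0.12     # LLM-based prover solves 12%
    absolute_gap = specialized - llm_based
    relative_gap = absolute_gap / specialized
    print(f"absolute: {absolute_gap:.0%}, relative: {relative_gap:.0%}")
    # -> absolute: 12%, relative: 50%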
I don't think the right way to think about its utility is as a replacement. I see it more as an NNUE-for-Stockfish style augmentation, where a small neural network supercharges search. Small, because no LLM, not even GPT-4, is good enough that the time spent evaluating with it is won back in disproportionately less search.
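A minimal sketch of what I mean, with `score`, `expand`, and `is_goal` as hypothetical callbacks supplied by the prover integration:

    import heapq

    def guided_proof_search(root, score, expand, is_goal, max_nodes=10_000):
        # Best-first search over proof states where a small learned model
        # (`score`) ranks what to expand next; the symbolic search still does
        # the heavy lifting, the network just steers it.
        counter = 0
        frontier = [(-score(root), counter, root)]
        visited = {root}
        while frontier and counter < max_nodes:
            _, _, state = heapq.heappop(frontier)
            if is_goal(state):
                return state                  # proof found
            for nxt in expand(state):         # apply each applicable tactic
                if nxt not in visited:
                    visited.add(nxt)
                    counter += 1
                    heapq.heappush(frontier, (-score(nxt), counter, nxt))
        return None                           # search budget exhausted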
Other uses: better autocomplete from comments in the Coq and Lean VS Code environments than generalist tools provide.
Translation from an NL sketch to a formal proof. This is different from autocomplete in that it should generate long attempts as an automated spitballer: leave it running for a long time to see if it finds anything when you're stuck. This works for formal and informal proofs, but the latter gets no feedback (this is imagining future tunes able to make better use of interactive prover feedback).
Translating formal proofs to natural language.
Combine it with RAG and have it pull in and summarize references from your personal paper collection and the internet (a rough retrieval sketch follows below). Particularly useful in place of search when you don't have the vocabulary for a concept, and as a basis for non-code autocomplete of mathematical work.
I see unbounded potential for this combination. And between the cost of search and the symbolic prover doing the heavy lifting, it's one of those areas that naturally lends itself to specialist open models over API models.
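For the RAG piece, the retrieval step is nothing exotic; here's a toy sketch (the embeddings and the downstream prompt assembly are assumed to exist elsewhere):

    import numpy as np

    def retrieve(query_vec, doc_vecs, docs, k=3):
        # Rank the personal paper collection by cosine similarity to the
        # query embedding and return the top-k snippets to stuff into the
        # model's context. How the embeddings are produced is left open.
        q = query_vec / np.linalg.norm(query_vec)
        d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        top = np.argsort(-(d @ q))[:k]
        return [docs[i] for i in top]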
Oh, I think you might misunderstand what I'm comparing it to. The other tools, like Proverbot9001, are exactly the NNUE scenario you describe, where a small neural network guides a search procedure to find proofs; they are more effective at finding formal proofs than Llemma. For other tasks, like non-formal proof generation, Llemma has novel results as far as I know; it's just in producing formal proofs that it currently seems to lag behind the state of the art.
Ah, you're right, that makes sense. The autocomplete, informal-proof, translation/autoformalization, reference-search, and feedback-driven interactive-assistant use cases do seem promising though.
One of the main points of Llemma is being an open-source reproduction of Minerva, so that's basically the only comparison they "need" to make. Besides, isn't WizardMath trained on GPT-4 output? They may want to compare only among models that can be used commercially.
I wonder if I could use this to explore mathematics. I have been attempting to use existing LLMs, but results vary widely. I could imagine getting intuitions and insights about an equation, where the responses don't have to be watertight algebraic manipulations, but rather just deeply informed descriptions from several perspectives.
>> Before training LLEMMA 7B, we contract the RoPE (Su et al., 2022) base period of the Code Llama 7B initialization from θ = 1,000,000 to θ = 10,000. This is so that the long context finetuning procedure described in Peng et al. (2023) and Rozière et al. (2023) can be repeated on the trained LLEMMA 7B (we leave actually doing so to future work).
This doesn't explain the reason for the RoPE contraction in the first place. Can anyone elaborate on what the reason could be?
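For context (my own gloss, not from the paper): the base period θ sets RoPE's per-dimension rotation frequencies, $\omega_i = \theta^{-2i/d}$ for $i = 0, \dots, d/2 - 1$, with a token at position $m$ rotated by angle $m\,\omega_i$ in the $i$-th dimension pair. Contracting θ from $10^6$ to $10^4$ makes the rotations faster (shorter wavelengths), i.e. it undoes Code Llama's long-context base and restores the original Llama value, but the quote doesn't say why that's desirable before their fine-tuning.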
In the absence of a very strong justification, my assumption for any random technique in an AI paper is that they tried a bunch of different things and whatever gave the highest evals made it into the paper (even though that gain is likely random rather than a genuine improvement).
For what it's worth, I hallucinated a hell of a lot on paper when taking 4xx-level math classes as an undergrad. My homework was marked with so many "no"s in red ink... although I got away with some handwaving.
Thanks! First, thank you for all the work to develop a math LLM. Those prompts are helpful. I shrank the weights from f32 down to f16 to fit on my system, using the procedure at ggerganov/llama.cpp, then tried some of those prompts. From line 34:
> Problem:
> If $\det \mathbf{A} = 2$ and $\det \mathbf{B} = 12,$ then find $\det (\mathbf{A} \mathbf{B}).$
> Solution:
> We know that for a matrix \mathbf{M}, the determinant of its inverse is given by $\frac{1}{\det \mathbf{M}}.$ We also know that the determinant of a product of matrices is equal to the product of their respective determinants. Therefore, we have:
> \begin{align} \det (\mathbf{A}^{-1}\mathbf{B}) &= \frac{\det(\mathbf{AB})}{\det\mathbf{A}}\\ &= \frac{\det\mathbf{A}}{\det\mathbf{B}} \end{align}
> Therefore, we have:
> $\det(\mathbf{AB}) = 24$
The f32 answer:
> We have that $\det (\mathbf{A} \mathbf{B}) = (\det \mathbf{A})(\det \mathbf{B}) = (2)(12) = \boxed{24}.$
> Final Answer: The final answer is $24$. I hope it is correct.
Final answer is the same, which is encouraging for quantization to expand hardware options.
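As a toy illustration of why that's plausible (not the llama.cpp procedure itself, just the idea that the f32 → f16 cast loses very little per weight):

    import numpy as np

    # Cast toy f32 "weights" to f16 and measure the rounding error.
    w32 = np.random.randn(8).astype(np.float32)
    w16 = w32.astype(np.float16)
    print(np.abs(w16.astype(np.float32) - w32).max())  # small, ~1e-3 or less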
Can you please ask it if algebraic numbers are closed under addition? ChatGPT-4 just told me no, then I asked it again and it said yes, and then it apologized profusely for saying no.
Not directly, but COPRA, which they compare to (and show about 3% more theorems proved than), is based on GPT-4 using an agent framework. And in the COPRA paper, they compare against using GPT-4 directly as a one-shot prover and find it only gets 10% of theorems in miniF2F, as opposed to 23% with their agent-based approach. So if the evals line up correctly, we would see GPT-4 one-shot proving 10.6% of theorems in miniF2F, as opposed to Llemma-7b proving 26.23%, a pretty significant improvement. Still not as good as other specialized tools though, especially when you look at tools for other theorem-proving languages (see my other comment for more detail about a cross-language comparison).
This seems to be about as open as a proprietary game engine, as no mention of a model license is present in the paper, code, or model card. I wonder whether their incorrect and misleading use of "open" will be caught during peer review.
Maybe the authors also believe that Windows is open. Its source code is also publicly available after all.
Every time someone wants to communicate something new the person(s) should take a crap and use OCR software to generate glyphs for a new alphabet used solely to communicate the new thing. That's the only way to possibly communicate ideas in a reasonable manner. Using existing things that have some notoriety can only be used to utterly confuse and distract people from new ideas. /s