Note that this still doesn't seem as good at formal theorem proving as some of the specialized prover models that aren't LLM-based. In particular, they show about a 3% increase in theorems proved over COPRA on the MiniF2F Lean dataset, but COPRA's own paper shows it proves about 18.5% fewer theorems than Proverbot9001 on the CompCert dataset (their pitch is that they are faster, not better at proving, than other tools). Assuming COPRA isn't just much better at Lean than at Coq (there should be some difference, but since it's the same algorithm run over a uniform interface to Coq and Lean, it's unlikely to be large), Llemma should be proving 10-15% fewer theorems than Proverbot9001's algorithm. There's some trickiness to this kind of cross-benchmark comparison, and COPRA picks a somewhat odd subset to compare against Proverbot9001 on to save costs, while also using a slightly older version of Proverbot9001 with worse results. But it probably still washes out to the best LLM-based provers being about 10-15% worse at proving theorems than the best specialized models.
EDIT: To be clear, that's 10-15 percentage points of the total theorems in a test set, not a relative 10-15% improvement. Given that solve rates for these tools are in the 10-30% range, that's much more than a relative 10-15% difference.
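As a rough illustration with made-up numbers (the 24% vs. 12% solve rates below are purely hypothetical):

    # Hypothetical solve rates, just to show absolute vs. relative gaps.
    specialized = 0.24   # specialized prover solves 24% of the test set
    llm_based = 0.12     # LLM-based prover solves 12%
    absolute_gap = specialized - llm_based
    relative_gap = absolute_gap / specialized
    print(f"absolute: {absolute_gap:.0%}, relative: {relative_gap:.0%}")
    # -> absolute: 12%, relative: 50%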
I don't think the right way to think about its utility is as a replacement. I see it more as an NNUE-for-Stockfish style augmentation, where a small neural network supercharges search. Small, because no LLM, not even GPT-4, is good enough that the time spent evaluating with it is won back in disproportionately less search.
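A minimal sketch of what I mean, with `score`, `expand`, and `is_goal` as hypothetical callbacks supplied by the prover integration:

    import heapq

    def guided_proof_search(root, score, expand, is_goal, max_nodes=10_000):
        # Best-first search over proof states where a small learned model
        # (`score`) ranks what to expand next; the symbolic search still does
        # the heavy lifting, the network just steers it.
        counter = 0
        frontier = [(-score(root), counter, root)]
        visited = {root}
        while frontier and counter < max_nodes:
            _, _, state = heapq.heappop(frontier)
            if is_goal(state):
                return state                  # proof found
            for nxt in expand(state):         # apply each applicable tactic
                if nxt not in visited:
                    visited.add(nxt)
                    counter += 1
                    heapq.heappush(frontier, (-score(nxt), counter, nxt))
        return None                           # search budget exhausted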
Other uses: better autocomplete from comments in the Coq and Lean VS Code environments than generalist tools provide.
Translation from an NL sketch to a formal proof. This is different from autocomplete in that it should generate long attempts as an automated spitballer: leave it running for a long time to see if it finds anything when you're stuck. This works for formal and informal proofs, but the latter gets no feedback (this is imagining future tunes able to make better use of interactive prover feedback).
Translating formal proofs to natural language.
Combine it with RAG and have it pull in and summarize references from your personal paper collection and the internet (a rough retrieval sketch follows below). Particularly useful in place of search when you don't have the vocabulary for a concept, and as a basis for non-code autocomplete of mathematical work.
I see unbounded potential for this combination. And between the cost of search and the symbolic prover doing the heavy lifting, it's one of those areas that naturally lends itself to specialist open models over API models.
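For the RAG piece, the retrieval step is nothing exotic; here's a toy sketch (the embeddings and the downstream prompt assembly are assumed to exist elsewhere):

    import numpy as np

    def retrieve(query_vec, doc_vecs, docs, k=3):
        # Rank the personal paper collection by cosine similarity to the
        # query embedding and return the top-k snippets to stuff into the
        # model's context. How the embeddings are produced is left open.
        q = query_vec / np.linalg.norm(query_vec)
        d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        top = np.argsort(-(d @ q))[:k]
        return [docs[i] for i in top]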
Oh, I think you might misunderstand what I'm comparing it to. The other tools, like Proverbot9001, are exactly the NNUE scenario you describe, where a small neural network guides a search procedure to find proofs; they are more effective at finding formal proofs than Llemma. For other tasks, like non-formal proof generation, Llemma has novel results as far as I know; it's just in producing formal proofs that it currently seems to lag behind the state of the art.
Ah, you're right, that makes sense. The autocomplete, informal-proof, translation/autoformalization, reference-search, and feedback-driven interactive-assistant use cases do seem promising though.
One of the main points of Llemma is being an open-source reproduction of Minerva, so that's basically the only comparison they "need" to make. Besides, isn't WizardMath trained on GPT-4 output? They may want to compare only among models that can be used commercially.
I wonder if I could use this to explore mathematics. I have been attempting to use existing LLMs, but results vary widely. I could imagine getting intuitions and insights about an equation, where the responses don't have to be watertight algebraic manipulations, but rather just deeply informed descriptions from several perspectives.
>> Before training LLEMMA 7B, we contract the RoPE (Su et al., 2022) base period of the Code Llama 7B initialization from θ = 1,000,000 to θ = 10,000. This is so that the long context finetuning procedure described in Peng et al. (2023) and Rozière et al. (2023) can be repeated on the trained LLEMMA 7B (we leave actually doing so to future work).
This doesn't explain the reason for the RoPE contraction in the first place. Can anyone elaborate on what the reason could be?
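For context (my own gloss, not from the paper): the base period θ sets RoPE's per-dimension rotation frequencies, $\omega_i = \theta^{-2i/d}$ for $i = 0, \dots, d/2 - 1$, with a token at position $m$ rotated by angle $m\,\omega_i$ in the $i$-th dimension pair. Contracting θ from $10^6$ to $10^4$ makes the rotations faster (shorter wavelengths), i.e. it undoes Code Llama's long-context base and restores the original Llama value, but the quote doesn't say why that's desirable before their fine-tuning.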
In the absence of a very strong justification, my assumption for any random technique in an AI paper is that they tried a bunch of different things and whatever gave the highest evals made it into the paper (even though that gain is likely random rather than a genuine improvement).
For what it's worth, I hallucinated a hell of a lot on paper when taking 4xx-level math classes as an undergrad. My homework was marked with so many "no"s in red ink... although I got away with some handwaving.
Thanks! First, thank you for all the work to develop a math LLM. Those prompts are helpful. I shrank the weights from f32 down to f16 to fit on my system, using the procedure at ggerganov/llama.cpp, then tried some of those prompts. From line 34:
> Problem:
> If $\det \mathbf{A} = 2$ and $\det \mathbf{B} = 12,$ then find $\det (\mathbf{A} \mathbf{B}).$
> Solution:
> We know that for a matrix \mathbf{M}, the determinant of its inverse is given by $\frac{1}{\det \mathbf{M}}.$ We also know that the determinant of a product of matrices is equal to the product of their respective determinants. Therefore, we have:
> \begin{align} \det (\mathbf{A}^{-1}\mathbf{B}) &= \frac{\det(\mathbf{AB})}{\det\mathbf{A}}\\ &= \frac{\det\mathbf{A}}{\det\mathbf{B}} \end{align}
> Therefore, we have:
> $\det(\mathbf{AB}) = 24$
The f32 answer:
> We have that $\det (\mathbf{A} \mathbf{B}) = (\det \mathbf{A})(\det \mathbf{B}) = (2)(12) = \boxed{24}.$
> Final Answer: The final answer is $24$. I hope it is correct.
Final answer is the same, which is encouraging for quantization to expand hardware options.
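As a toy illustration of why that's plausible (not the llama.cpp procedure itself, just the idea that the f32 → f16 cast loses very little per weight):

    import numpy as np

    # Cast toy f32 "weights" to f16 and measure the rounding error.
    w32 = np.random.randn(8).astype(np.float32)
    w16 = w32.astype(np.float16)
    print(np.abs(w16.astype(np.float32) - w32).max())  # small, ~1e-3 or less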
Can you please ask it if algebraic numbers are closed under addition? ChatGPT-4 just told me no, then I asked it again and it said yes, and then it apologized profusely for saying no.
Not directly, but COPRA, which they compare to (and show about 3% more theorems proved than), is based on GPT-4 using an agent framework. And in the COPRA paper, they compare against using GPT-4 directly as a one-shot prover and find it only gets 10% of theorems in miniF2F, as opposed to 23% with their agent-based approach. So if the evals line up correctly, we would see GPT-4 one-shot proving 10.6% of theorems in miniF2F, as opposed to Llemma-7b proving 26.23%, a pretty significant improvement. Still not as good as other specialized tools though, especially when you look at tools for other theorem-proving languages (see my other comment for more detail about a cross-language comparison).
This seems to be about as open as a proprietary game engine, as no mention of a model license is present in the paper, code, or model card. I wonder whether their incorrect and misleading use of "open" will be caught during peer review.
Maybe the authors also believe that Windows is open. Its source code is also publicly available after all.
Every time someone wants to communicate something new the person(s) should take a crap and use OCR software to generate glyphs for a new alphabet used solely to communicate the new thing. That's the only way to possibly communicate ideas in a reasonable manner. Using existing things that have some notoriety can only be used to utterly confuse and distract people from new ideas. /s