Frankly, this reads as a lot of words that amount to an excuse for using only LMArena, and the rationale is quite clear: it's for an unrelated argument that isn't going to ring true to people, especially an audience of programmers who just spent the last year watching AI go from barely making coherent file edits to doing multi-hour work.
LMArena is, de facto, a sycophancy and Markdown usage detector.
Two others you can trust, off the top of my head, are LiveBench.ai and Artificial Analysis. Or even Humanity's Last Exam results. (Though, frankly, I'm a bit suspicious of those. Can't put my finger on why; it was just a rather rapid hill climb for a private benchmark over the last year.)
FWIW GPT 5.2 unofficial marketing includes the Erdos thing you say isn’t happening.
I've always found LiveBench a bit confusing to compare over time, since the dataset isn't meant to be compared over time. It also currently claims GPT-5 Mini High from last summer is within ~15% of Claude 4.5 Opus Thinking High Effort on the average score, but I'll wait with bated breath for the millions of amazing apps which couldn't be coded before to start showing up (or, more likely, be told in 6 months how these 2 benchmarks weren't the ones that should matter either). Artificial Analysis at least has the same gap at 20% from the top, so maybe that's the one we all agree to use for now, since it implies faster growth.
> FWIW GPT 5.2 unofficial marketing includes the Erdos thing you say isn’t happening.
Certainly not, unless you're about to tell me I can pop into ChatGPT and pop out Erdos proofs regularly. #728 was massaged out with multiple prompts and external tooling a few weeks ago, which is what I was writing about. It was great, it was exciting, but it's exactly the slow growth I'm talking about.
I like using LLMs, I use them regularly, and I'm hoping they continue to get better for a long time... but this is in no way the GPT 3 -> 3.5 -> 4 era of mind-boggling growth of frontier models anymore. At best, people are finding out how to attach various tooling to the models to eke more out as the models themselves very slowly improve.
I never claimed people don't make apps with AI. Of course they do - I can do that in a few clicks and some time with most any provider. You've been able to do that for a few years now, and that (linear) trend line starts over a year ago.
I can guarantee that if you restricted yourself to just that 60%, you wouldn't be responding to me doubting that AI apps are already the amazing things people are supposed to be so excited about using, though.
See peer reply re: yes, your self-chosen benchmark has been reached.
Generally, I've learned to warn myself off a take when I start writing emotionally charged stuff like [1] without any prompting (who mentioned apps? and why bring them up without checking?), or when I catch myself reading minds and assigning weak arguments, both now and in my imagination of the future. [2]
At the very least, [2] is a signal to let the keyboard have a rest, and ideally my mind.
Bailey:
> "If [there were] new LLMs...consistently solving Erdos problems at rapidly increasing rates then they'd be showing...that"
Motte:
> "I can['t] pop into ChatGPT and pop out Erdos proofs regularly"
No less than Terence Tao, a month ago, pointed out that your bailey was newly happening with the latest generation: https://mathstodon.xyz/@tao/115788262274999408.
Not sure how you only saw one Erdos problem.
[1] "I'll wait with bated breath for the millions of amazing apps which couldn't be coded before to start showing up"
[2] "...or, more likely, be told in 6 months how these 2 benchmarks weren't the ones that should matter either"
I'm going to stick to the stuff around Tao, as even well tempered discussion about the rest would be against the guidelines anyways.
I had a very different read of Tao's post last month. To me, he opens by noting that there have been many claims of novel solutions which turn out to be known solutions from publications buried for years, but says nothing about a rapid increase in the rates, or even claims that mathematicians using LLMs are having most of the work done by them yet.
He speculates, and I assume correctly, that contamination is not the only reason. Indeed, we've seen at least 1 novel solution which couldn't have come from a low-interest publication being in the training data alone. How many of the 3 examples at the top end up actually falling that way is not really something anyone can know, but I agree it should be safe to assume the answer will not be 0, and even if it were, it would seem unreasonable to think it stayed that way. These solutions are coming out of systems of which the LLM is one part, and very often a mathematician is still actually orchestrating.
None of these are just popping in a prompt and hoping for the answer, nor will you get an unknown solution out of an LLM by going to ChatGPT 5.2 Pro and asking it without the rest of the story (and even then, you still will not get such a solution regularly, consistently, or at a massively higher rate than several months ago). They are multishot efforts from experts with tools. Tao makes a very balanced note of this in a reply to his main message:
> The nature of these contributions is rather nuanced; individually and collectively, they do not meet the hyped up goal of AI autonomously solving major mathematical open problems, but they also cannot all be dismissed as inconsequential trickery.
It's exciting, and helpful, but it's slow, and he doesn't even think we're truly at "AI solves some Erdos problems" yet, let alone "AI solves Erdos problems regularly and at a rapidly increasing rate".
"...as even well tempered discussion about the rest would be against the guidelines anyways."
Didn't bother reading after that. I deeply respect that you have the self-awareness to notice and spare us; that's rare. But it also means we all have to have conversations purely on your terms, and because it's async, the rules constantly change post-hoc.
And that's on top of the post-hoc motte / bailey instances, of which we have multiple. I was stunned (stunned!!) by the attempted retcon of the app claim once there were numbers.
Anyways, all your bêtes noires aside, all your Red Team vs. Blue Team signalling aside, using LMArena alone as a benchmark is a bad idea.
The conversation is certainly not on "my terms", as I didn't write the guidelines (nor do they benefit me more than anyone else). If you are genuinely concerned with the conversation, please flag it and/or email hn@ycombinator.com and they will (genuinely) handle it appropriately. Otherwise there is not much else that can be said about this here.
If not, then continuing the conversation can only happen if we want to discuss the recent growth rate of AI and take the time to read what each other writes. Similarly, async conversation can be as clear and consistent as we want it to be - we just have to take the time to ask for clarification before writing a response on something we feel could be a movable understanding. Nothing is meant to be unclear as a "gotcha", and I'll always be glad to clarify before moving on.
I also agree nobody should rely solely on LMArena for benchmarks; starting a conversation by using it in an example was not meant to imply we should. I'd love to continue chatting about other benchmarks and how you see Tao's comments, as you seem to have walked away from reading them with a very different understanding than I did.
Let's say you have a $SOCIETAL_TABOO streak and let it out via a SoundCloud account that isn't identifiable as you without your email.
Now it is.
Now I can blackmail you or haunt you.
(I'm sure there are other examples; tl;dr people get deanonymized, and there are uncountable reasons why people choose anonymity.)
> The data in the leak (other than follower count, etc) was already available for purchase from Zoominfo, 8sense, or a variety of other data brokers or other legal marketplaces for PII.
You are 100% correct based on the article. It's not good that you're grayed out while your parent comment of "who cares, it was already available and scraped" is the top comment.
Kinda sad to see a "Recommended Actions" section with only sponsors, with ad copy that would be understood by HN readers but not our non-technical friends. (i.e. a simple "Nothing. No passwords have been leaked yet, only metadata" would do in this case)
I'm confused about what you're asking (404 CAFFEINE_MISSING), and it helped me to reframe it in terms of what the parent and grandparent wrote.
My reframe was, "If you're a Dem, don't you think Brockman should donate $25M to Trump, because I'm told I have to vote Dem if I don't like the GOP, because Dems are the lesser evil, and thus Dems believe it is okay to support evil if it is in your self-interest?"
Assuming that, then turning back to theory: "lesser evil" is a constraint on imperfect choices, not a moral voucher that turns any tactic into virtue. If you can justify writing a $25M check to someone you think is dangerous because it helps your side, then your issue was never "good vs. bad" - it was "my team wins," and you're just shopping for a cleaner-sounding label.
Interesting reaction to that story, I'm fascinated: why do you think it's fake?
(My guess: differences in Soviet-style repression between the USSR and its satellites; it reads as fake to you because the non-USSR states were more lax, i.e. you'd be fine speaking honestly in private, just not in public.)
You’re a student of history, thus I think you understand how “commander in chief of the armed forces” is a constitutional duty without needing further explanation of why.
I think you intended to communicate the Supreme Court would balk at it happening.
Yes.
Much like Kavanaugh balking at ethnicity-based stops after allowing language + skin color based stops. By then, it’s too late.
> Two survivors of the initial attack later appeared to wave at the aircraft after clambering aboard an overturned piece of the hull, before the military killed them in a follow-up strike that also sank the wreckage. It is not clear whether the initial survivors knew that the explosion on their vessel had been caused by a missile attack.
> The Pentagon’s own manual on the laws of war describes a scenario similar to the Sept. 2 boat strike when discussing when service members should refuse to comply with unlawful orders. “For example,” the manual says, “orders to fire upon the shipwrecked would be clearly illegal.”
It seems extremely relevant. Your argument suggests the president need only appoint a subordinate who will themselves give the desired illegal order without the president's public command. In the unlikely event the subordinate is called to account, the president can simply pardon them.
This is certainly not a hypothetical "parade of horribles", since Trump has already pardoned military officers convicted of war crimes.[1]
"War crimes" sounds scary as a whole mess of badness, but which one is kind of material. E.g., Obama's drone strikes and CIA torture likely count as war crimes, though no court has actually tried him for them, so it's hard to get worked up about Navy SEALs (whose job it is to go into war zones and do war-type things) having generically committed war crimes. Did they rape women and babies, or did they shoot the wrong person in the dark of night who turned out not to actually be a threat?
> Gallagher was the subject of a number of reports from fellow SEAL team members, stating that his actions were not in keeping with the rules of war, but these reports were dismissed by the SEAL command structure.
> Other snipers said they witnessed Gallagher taking at least two militarily pointless shots, shooting and killing an unarmed elderly man in a white robe as well as a young girl walking with other girls.
Murdered a prisoner, and was shitty enough that his fellow SEALs were uncomfortable and complained. Eventually pardoned, by Trump.
I've been in the tech industry for 45 years. Layoffs happen regularly. Well, not regularly; it's a chaotic system. There will be good times and bad times. The best way to deal with it is to immediately save, at a minimum, 6 months of runway. Preferably a year.
When you're in between jobs, work on:
1. improving your job skills
2. networking
3. building your resume by contributing to open source
I don't intend to be dismissive by sharing a bunch; I ate a bunch of downvotes, so I should share something. But there's no singular, like, Wikipedia article for "tech layoffs spiked significantly in 2022 and have stayed elevated" - so this is a mix of informal, formal, academic, and business news that treats that knowledge as implicit while discussing it.
(I am deeply curious which Valhalla you're at that skipped this so completely that it was a foreign idea! N or A, it must be one of those two.)
Sure, it waxes and wanes. 2022-2023 were probably above average layoff years, while 2020-2021 before that were probably below average years. I think layoffs have fallen since 2023 rather than staying elevated, but I haven't attempted to quantify that.
Q: You know what investors and shareholders love more than having 1 billion dollars?
A: Having 2 billion dollars. And with all the money being burned on AI, having 2 billion is better than 1.
If mass layoffs cause the stock to go from 1 to 2, then guess what's gonna happen?
In the ZIRP era, companies would hire needlessly to get the stock up, because that signaled growth to investors. Now it's the opposite: you trim because that gets the stock up, not because companies conspire together to lay people off.
Why is the highest and best use of a company's free cash paying the least productive employees, instead of returning cash to shareholders or investing it in something more productive?