
Same here, but with the default 5.1 auto and no extra settings. Every time someone posts one of these I just imagine they must have misunderstood the UI settings or cluttered their context somehow.

I'll never not think of that South Park scene where they mocked BP's "We're so sorry" statement whenever I see one of those. I don't care if you're sorry or if you realize how much you betrayed your customers. Tell me how you investigated the root causes of the incident and how the results will prevent this scenario from ever happening again. Like, how many other deprecated third party systems were identified handling a significant portion of your customer data after this hack? Who declined to allocate the necessary budget to keep systems updated? That's the only way I will even consider giving some trust back. If you really want to apologise, start handing out cash or whatever to the people you betrayed. But mere words like these are absolutely meaningless in today's world. People are right to dismiss them.

I wouldn't be so quick. Everybody gets hacked, sooner or later. Whether they'll own up to it or not is what makes the difference, and I've seen far, far worse than this response by Checkout.com; it seems to be one of the better responses to such an event that I've seen to date.

> Like, how many other deprecated third party systems were identified handling a significant portion of your customer data after this hack?

The problem with that is that you'll never know. Because you'd have to audit each and every service provider, and I think only eBay does that. And they're not exactly a paragon of virtue either.

> Who declined to allocate the necessary budget to keep systems updated?

See: prevention paradox. Until this sinks in it will happen over and over again.

> But mere words like these are absolutely meaningless in today's world. People are right to dismiss them.

Again, yes, but: they are at least attempting to use the right words. Now they need to follow them up with the right actions.


> Everybody gets hacked, sooner or later.

Right! But wouldn't a more appropriate approach be to mitigate the damage from being hacked as much as possible in the first place? Perhaps this starts by simplifying bloated systems, reducing data collection to only the data that is absolutely legally necessary for KYC and financial transactions in whatever respective country(ies) the service operates in, hammer-testing databases for old tricks that seem to have been forgotten about in a landscape of hacks with ever-increasing complexity, etc.

Maybe it's the dad in me, years of telling my son not to apologize, but to avoid the behavior that causes the problem in the first place. Bad things happen, and we all screw up from time to time, that is a fact of life, but a little forethought and consideration about the best or safest way to do a thing is a great way to shrink the blast area of any surprise bombs that go off


> Maybe it's the dad in me, years of telling my son not to apologize, but to avoid the behavior that causes the problem in the first place.

What an odd thing to teach a child. If you've wronged someone, avoiding the behavior in future is something that'll help you, but does sweet fuck all for the person you just wronged. They still deserve an apology.


I think this approach is overcompensating for over-apologizing (or, similarly, over-thanking; both in excess are off-putting). I have a child who just says "sorry" and doesn't actually care about changing the underlying behavior.

But yes, even if you try to strike a healthy balance, there are still plenty of times when an apology is appropriate and will go a long way, for the giver and receiver, in my opinion anyway.


Sorry, I should have worded that as "stop apologizing so much, especially when you keep making the same mistake/error/disruption/etc."

I did not mean to come off as teaching my kid to never apologize.


"Sorry - this is my fault" is such an effective response, if followed up with "how do we make this right?" or "stop this from happening again?"

Not a weird thing to teach a child.

It’s 5-whys-style root cause analysis, which will build a person who causes less harm to others.

I am willing to believe that the same parent also teaches when and why it is sometimes right to apologize.


Thanks, this is where I was coming from. I suppose I could have made that more clear in my original comment. The idea behind my style of parenting is self-reflection and the ability to analyze the impact of our choices before we make them.

But of course, apologizing when you have definitely wronged a person is important, too. I didn't mean to come off as teaching my kid to never apologize, just think before you act. But you get the idea.


Yea, plus, anyone with kids knows that a lot of them just treat "sorry" as some sort of magic spell that you casually invoke right after you mess up, and then continue on with your ways. I teach my kid to both apologize and then consider corrective action, too.

> a little forethought and consideration about the best or safest way to do a thing is a great way to shrink the blast area of any surprise bombs that go off

I don’t think I agree with this at all. Screwing up is, by far, the most impactful thing that can minimize the future blast radius.

Common sense, wisdom, and pain cannot be communicated very well. Much more effective if experienced. Like trying to explain “white as snow” to someone who’s never seen snow. You might say “white as coconut” but that doesn’t help them know about snow. Understanding this opens up a lot more grace and patience with kids.

Most often when we tell our kids, “you know better”, it’s not true. We know better, only because we screwed it up 100 times before and felt the pain.

No amount of “think about the consequences of your actions” is going to prevent them from slipping on the ice, when they’ve never walked on the ice before.


I don’t see how any of what you’re suggesting would have prevented this hack though (which involved an old storage account that hadn’t been used since 2020 getting hacked).

You don't see how preventative maintenance, such as implementing a policy to remove old accounts after N days, could have prevented this? Preventative maintenance is part of the forethought that should take place about the best or safest way to do a thing. This is something that could be easily learned by looking at problems others have had in the past.
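
As a minimal sketch of what such a policy could look like (assuming a SQL-backed user table; the schema, column names, and file path here are hypothetical):

    import sqlite3

    INACTIVITY_DAYS = 365   # disable after a year without a login
    GRACE_DAYS = 30         # hard-delete 30 days after disabling

    def prune_stale_accounts(db_path="users.db"):
        con = sqlite3.connect(db_path)
        with con:
            # Step 1: disable accounts that haven't logged in for a while.
            con.execute(
                "UPDATE accounts SET disabled = 1, disabled_at = date('now') "
                "WHERE disabled = 0 AND last_login < date('now', ?)",
                (f"-{INACTIVITY_DAYS} days",))
            # Step 2: delete accounts that stayed disabled past the grace period.
            con.execute(
                "DELETE FROM accounts "
                "WHERE disabled = 1 AND disabled_at < date('now', ?)",
                (f"-{GRACE_DAYS} days",))
        con.close()

Run it from cron or a scheduler; the point is that removal is automatic rather than depending on someone remembering the account exists.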

As a controls tech, I provide a lot of documentation and training to our customers about how to deploy, operate and maintain a machine for best possible results with lowest risk to production or human safety. Some clients follow my instruction, some do not. Guess which ones end up getting billed most for my time after they've implemented a product we make.

Too often, we want to just do without thinking. This often causes us to overlook critical points of failure.


For the app I maintain, we have a policy of deleting inactive accounts after a year. We delete approved signups that have not been “consummated” after thirty days.

Even so, we still need to keep an eye out. A couple of days ago, an old account (not quite a year), started spewing connection requests to all the app users. It had been a legit account, so I have to assume it was pwned. We deleted it quickly.

A lot of our monitoring is done manually, and carefully. We have extremely strict privacy rules, and that actually makes security monitoring a bit more difficult.


These are excellent practices.

Such data is a liability, not an asset, and if you dispose of it as soon as you reasonably can, that's good. If this is a communications service, consider saving a hash of the ID and refusing new sign-ups with that same ID, because if the data gets deleted then someone could re-sign up with someone else's old account. But if you keep a copy of the hash around, you can check if an account has ever existed and refuse registration if that's the case.
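
A minimal sketch of that idea (illustrative only; the pepper, the in-memory store, and the function names are made up for the example):

    import hashlib

    # A secret pepper so the stored hashes can't be reversed by
    # brute-forcing common usernames or email addresses.
    PEPPER = b"load-this-from-a-secrets-manager"

    def tombstone(user_id: str) -> str:
        """Hash of an ID, kept even after the account's data is wiped."""
        return hashlib.sha256(PEPPER + user_id.lower().encode()).hexdigest()

    deleted_ids: set[str] = set()   # stand-in for a persistent store

    def delete_account(user_id: str) -> None:
        # ... wipe every piece of user data here ...
        deleted_ids.add(tombstone(user_id))

    def can_register(user_id: str) -> bool:
        # Refuse IDs that have ever existed, so nobody can impersonate
        # a deleted account by re-registering it.
        return tombstone(user_id) not in deleted_ids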


It would violate our privacy policy.

It's important that "delete all my information" also deletes everything after the user logs in for the first time.

Also, I'm not sure that Apple would allow it. They insist that deletion remove all traces of the user. As far as I know, there's no legal mandate to retain anything, and the nature of our demographic, means that folks could be hurt badly by leaks.

So we retain as little information as possible - even if that makes it more difficult for us to administer - and destroy everything when we delete.


I think you misunderstood my comment and/or fail to properly appreciate the subtle points of what I suggest you keep.

The risk you have here is one of account re-use, and the method I'm suggesting allows you to close that hole in your armor which could in turn be used to impersonate people whose accounts have been removed at their request. This is comparable to not being able to re-use a phone number once it is returned to the pool (and these are usually re-allocated after a while because they are a scarce resource, which ordinary user ids are not).


> I think you misunderstood my comment and/or fail to properly appreciate the subtle points of what I suggest you keep.

Nah, but I understand the error. Not a big deal.

We. Just. Plain. Don't. Keep. Any. Data. Not. Immediately. Relevant. To. The. App.

Any bad actor can easily register a throwaway, and there's no way to prevent that, without storing some seriously dangerous data, so we don't even try.

It hasn't been an issue. The incident that I mentioned is the only one we've ever had, and I nuked it in five minutes. Even if a baddie gets in, they won't be able to do much, because we store so little data. This person would have found all those connections to be next to useless, even if I hadn't stopped them.

I'm a really cynical bastard, and I have spent my entire adult life rubbing elbows with some of the nastiest folks on Earth. I have a fairly good handle on "thinking like a baddie."

It's very important that people who may even be somewhat inimical to our community, be allowed to register accounts. It's a way of accessing extremely important resources.


> I provide a lot of documentation

> Some clients follow my instruction, some do not.

So you’re telling me you design a non-foolproof system?!? Why isn’t it fully automated to prevent any potential pitfalls?


lmao you taught your son to not apologize and if he can help it not do anything that gets him caught. maybe this is how we get politicians that never admit they were wrong and weasel out of everything

The prevention paradox only really applies when the bad event has significant costs. It seems to me that getting hacked has at worst mild consequences. Cisco for example is still doing well despite numerous embarrassing backdoors.

Well said. Ideally action comes first, and then these actions can be communicated.

But in the real world, you have words, i.e. commitment, before actions and a conclusion.

Best of luck to them.


I like this post. No matter how/when/where/why someone apologizes for a mistake on the Internet, there will always be an "Armchair Quarterback" (on HN) that says: "Oh, that's not a _real_ apology; if I were CEO/CTO/CIO, I would do X/Y/Z to prevent this issue." It feels like a version of "No True Scotsman".

<rolls eyes>

I feel like most of these people will never be senior managers at a tech company because they will "go broke" trying to prevent every last mistake, instead of creating a beautiful product that customers are desperate to buy! My father once said to me as a young person: "Don't insure yourself 'to death' (bankruptcy)." To say: You need to take some risk in life as a person, especially in business. To be clear: I am not advocating that business people be lazy about computer security. Rather, there is a reasonable limit to their efforts.

You wrote:

    > Everybody gets hacked, sooner or later.

I mostly agree. However, I do not understand how GMail is not hacked more often. Literally, I have not changed my Google password in ~10 years, and my GMail is still untouched. (Falls on sword...) How do they do it? Honestly: No trolling with my question! Does Google get hacked but they keep it a secret? They must be the target of near-constant "nation state"-level hacking programmes.

> Literally, I have not changed my Google password in ~10 years, and my GMail is still untouched.

The flip side of this is how many people are wrongly locked out of their gmail. I bet there's quite a few of them that failed to satisfy whatever filters Google put in place.


> How do they do it?

To begin with, they have a culture of not following "industry standards".

(For the reason that the industry never had this scale yet)


There are millions of companies, even century- or decade-old ones, without a hacking incident with data extraction. The whole "everyone gets hacked" line is copium for a lack of security standards, or here the lack of deprecation and having unmaintained systems online with legacy client data. Announcing it proudly would be concerning if I had business with them. It's not even a lack of competence... it's a lack of hygiene.

>There are millions of companies, even century- or decade-old ones, without a hacking incident with data extraction.

Name five.


The pedantic answer is to point to a bunch of shell companies without any electronic presence. However in terms of actual businesses there’s decent odds the closest dry cleaners, independent restaurant, car wash, etc has not had its data extracted by a hacking incident.

Having a minimal attack surface and not being actively targeted is a meaningful advantage here.


>there’s decent odds the closest dry cleaners, independent restaurant, car wash, etc has not had its data extracted by a hacking incident.

And there's also a decent chance they have. Did we not just have a years long spate of ransomware targeting small businesses?


Most ransomware isn’t exfiltrating data. For small businesses you can automate the ‘pay to decrypt your HDD’ model easily, without care for what’s on the disk.

There are definitely companies that have never been breached, and it's not that hard. Defense in depth is all you need.

Isn't defense in depth's whole point that some of your defenses will get breached?

Take the OP. What defenses were breached? An old abandoned system running unmaintained in the background with old user data still attached. There is no excuse.

Not everyone gets hacked. Companies not hacked include e.g.

- Google

- Amazon

- Meta


Amazonian here. My views are my own; I do not represent my company/corporate.

That said...

We do our very best. But I don't know anyone here who would say "it can never happen". Security is never an absolute. The best processes and technology will lower the likelihood and impact towards 0, but never to 0. Viewed from that angle, it's not if Amazon will be hacked, it's when and to what extent. It is my sincere hope that if we have an incident, we rise up to the moment with transparency and humility. I believe that's what most of us are looking for during and after an incident has occurred.

To our customers: Do your best, but have a plan for what you're going to do when it happens. Incidents like this one here from checkout.com can show examples of some positive actions that can be taken.


> But I don't know anyone here who would say "it can never happen". Security is never an absolute.

Exactly. I think it is great for people like you to inject some more realistic expectations into discussions like these.

An entity like Amazon is not - in the longer term - going to escape fate, but they have more budget and (usually) much better internal practices which rule out the kind of thing that would bring down a lesser org. But in the end it is all about the budget: as long as Amazon's budget is significantly larger than the attackers', they will probably manage to stay ahead. But if they ever get complacent or start economizing on security then the odds change very rapidly. Your very realistic stance is one of the reasons it hasn't happened yet: you are acutely aware that, in spite of all of your efforts, you are still at risk.

Blast radius reduction by removing data you no longer need (and that includes the marketing department, who more often than not are the real culprit) is a good first step towards more realistic expectations for any org.


Facebook was hacked in 2013. Attacker used a Java browser exploit to take over employees' computers:

https://www.reuters.com/article/technology/exclusive-apple-m...

Facebook was also hacked in 2018. A vulnerability in the website allowed attackers to steal the API keys for 50 million accounts:

https://news.ycombinator.com/item?id=18094823


Nah.

The Chinese got into gmail (Google) essentially on a whim to get David Petraeus' emails to his mistress. Ended his career, basically.

I'd bet my hat that all 3 are definitely penetrated and have been off and on for a while -- they just don't disclose it.

source: in security at big orgs


Do you have a source that the Google hack was related to David Petraeus? This page doesn't mention it[1]. Does the timeline line up? Google was hacked in 2009[2]. The Petraeus stuff seems to have happened later.

Disclosure: I work at Google but have no internal knowledge about whether Petraeus was related to Operation Aurora.

[1] https://en.wikipedia.org/wiki/Petraeus_scandal

[2] https://en.wikipedia.org/wiki/Operation_Aurora


> I'd bet my hat that all 3 are definitely penetrated and have been off and on for a while -- they just don't disclose it.

Considering the number of Chinese nationals who work for them at various levels... of course they're all penetrated. How could that possibly fail to be true?


The relevant difference here is that these companies have actual security standards on the level that you would only find in the FAA or similar organisations where lives are in danger. For every incident in Google Cloud, for example, they don't just apologise, but they state exactly what happened and how they responded (down to the minute) and you can read up exactly how they plan to prevent this from happening again: https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...

This is what incident handling by a trustworthy provider looks like.



That was a Salesforce instance with largely public data, rather than something owned and operated by Google itself. It's a bit like saying you stole from me, but instead of my apartment you broke into my off-site storage unit at U-Haul. Technically correct, but different implications for the integrity of my apartment security.

It was a social engineering attack that leveraged the device OAuth flow, where the device gaining access to the resource server (in this case the Salesforce API) is separate from the device that grants the authorization.

The hackers called employees/contractors at Google (& lots of other large companies) with user access to the company's Salesforce instance and tricked them into authorizing API access for the hackers' machine.

It's the same as loading Apple TV on your Roku despite not having a subscription and then calling your neighbor who does have an account and tricking them into entering the 5 digit code at link.apple.com

Continuing with your analogy, they didn't break into the off-site storage unit so much as they tricked someone into giving them a key.

There's no security vulnerability in Google/Salesforce or your apartment/storage per se, but a lapse in security training for employees/contractors can be the functional equivalent to a zero-day vulnerability.
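
For reference, the device authorization flow being described (RFC 8628) boils down to something like this sketch; the endpoints and client_id are placeholders, not Salesforce's actual values:

    import time
    import requests

    AUTH_SERVER = "https://auth.example.com"   # placeholder
    CLIENT_ID = "device-client-id"             # placeholder

    # Step 1: the *device* requests a device_code plus a short user_code.
    grant = requests.post(f"{AUTH_SERVER}/device/code",
                          data={"client_id": CLIENT_ID}).json()
    print("Enter", grant["user_code"], "at", grant["verification_uri"])

    # Step 2: the device polls the token endpoint until a *different*
    # device (a browser where someone is already logged in) approves
    # the user_code. Nothing ties the approving browser to this machine,
    # which is exactly the gap the social engineers exploited.
    while True:
        tok = requests.post(f"{AUTH_SERVER}/token", data={
            "grant_type": "urn:ietf:params:oauth:grant-type:device_code",
            "device_code": grant["device_code"],
            "client_id": CLIENT_ID,
        }).json()
        if "access_token" in tok:
            break   # this machine now holds a valid API token
        time.sleep(grant.get("interval", 5))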


There's no vulnerability per se, but I think the Salesforce UI is pretty confusing in this case. It looks like a login page, but actually if you fill it in, you're granting an attacker access.

Disclosure: I work at Google, but don't have much knowledge about this case.


Google got hacked back in 2010 - look up Operation Aurora. It wasn't a full own, but it shows that even the big guys can get hacked.

They also have plenty of domestic and foreign intelligence agents literally working with sensitive systems at the company.

Didn't Edward Snowden release documents showing that the NSA had fully compromised Google's internal systems?

Yup. The NSA has every single major US tech company tapped at the server level and is harvesting all their data. It issues them NSLs, and there is zero way these companies can refuse the taps.

You are joking right?

All of these companies have been hacked by nation states like Russia and China.


Fair or not, if their customers get hacked it's still on them to mitigate and reduce the damage. Ex: cloud providers that provide billing alerts but not hard cut-offs are not doing a good job.

Everybody includes Google, Amazon and Meta.

They too will get hacked, if it hasn't happened already.


... that we know of. Perhaps some of those "outages" were compromised systems.

"shit it's compromised. pull the plug ASAP"

Meta once misconfigured its web servers and exposed the source code. https://techcrunch.com/2007/08/11/facebook-source-code-leake...

I like your stance.

We also have to remember that we have collectively decided to use Windows and AD, QA-tested software, etc. (some examples) over correct software, hardened-by-default settings, etc.


The intent of the South Park sketch was to lampoon that BP were (/are) willingly doing awful things and then give corpo apology statements when caught.

Here, Checkout has been the victim of a crime, just as much as their impacted customers. It’s a loss for everyone involved except the perpetrators. Using words like “betrayed”, as if Checkout wilfully misled its customers, is a heavy accusation to level.

At a point, all you can do is apologise, offer compensation if possible, and plot out how you’re going to prevent it going forward.


> At a point, all you can do is apologise, offer compensation if possible, and plot out how you’re going to prevent it going forward.

I totally agree – You've covered the 3 most important things to do here: Apologize; make it right; sufficiently explain in detail to customers how you'll prevent recurrences.

After reading the post, I see the 1st of 3. To their credit, most companies don't get that far, so thanks, Checkout.com. Now keep going, 2 tasks left to do and be totally transparent about.


In attacks on software systems specifically though, I always find this aggressive stance toward the victimized business odd, especially when otherwise reasonable security standards have been met. You simply cannot plug all holes.

As AI tools accelerate hacking capabilities, at what point do we seriously start going after the attackers across borders and stop blaming the victimized businesses?

We solved this in the past. Let’s say you ran a brick-and-mortar business, and even though you secured your sensitive customer paperwork in a locked safe (which most probably didn’t), someone broke into the building and cracked the safe with industrial-grade drilling equipment.

You would rightly focus your ire and efforts on the perpetrators, and not say ”gahhh what an evil dumb business, you didn’t think to install a safe of at least 1 meter thick titanium to protect against industrial grade drilling!????”

If we want to have nice things going forward, the solution is going to have to involve much more aggressive cybercrime enforcement globally. If 100,000 North Koreans landed on the shores of Los Angeles and began looting en masse, the solution would not be to have everybody build medieval stone fortresses around their homes.


What you request is for them to divulge internal details of their architecture that could lead to additional compromise, as well as an admission of fault that could make it easier for them to be sued. All for some intangible moral notion. No business leader would ever do those things.

Haha, yes, this is entirely what I expected. I was actually pleasantly surprised by the GP because internet commentators always find a reason that some statement is imperfect.

Indeed, an apology is bad and no apology is also bad. In fact, all things are bad. Haha! Absolutely prime.


Right. Transparency doesn't just mean telling us about the attack that already happened. It means telling us about their issues and the ways this could happen again. And they didn't even mention the investment amount for the security labs.

No trolling on my side, I think having people who think just like you is a triumph for humanity. As we approach times far darker and manipulation takes smarter shapes, a cynical mind is worth many trophies.

> prevent this scenario from ever happening again.

Every additional nine of not getting hacked takes effort. Getting to 100% takes infinite effort i.e. is impossible. Trying to achieve the impossible will make you spin on the spot chasing ever more obscure solutions.

As soon as you understand a potential solution enough to implement it you also understand that it cannot achieve the impossible. If you keep insisting on achieving the impossible you have to abandon this potential solution and pin your hope on something you don't understand yet. And so the cycle repeats.

It is good to hold people accountable, but demand the impossible only from those you want to drive crazy.


Can't please everybody all the time, so best to focus on the majority.

They are donating the entire ransom amount to two universities for security research. I don't care about the words themselves, but assuming they're not outright lying about this, that meant a lot to me. They are putting their (corporate!) money where their mouth is.

It's not just that some absolutely require it; a lot of applications hugely benefit from more context. A large part of LLM engineering for real-world problems revolves around structuring the context and selectively providing the information needed while filtering out unneeded stuff. If you can just dump data into it without preprocessing, it saves a huge amount of development time.


Depending on the application, I think “without preprocessing” is a huge assumption here. LLMs typically do a terrible job of weighting poor-quality context vs high-quality context, and filling an XL context with unstructured junk and expecting it to solve this for you is unlikely to end well.

In my own experience you quickly run into jarring tangents or “ghosts” of unrelated ideas that start to shape the main thread of consciousness and resist steering attempts.


It depends to the extent I already mentioned, but in the end more context always wins in my experience. If, for example, you want to provide a technical assistant, it works much better if you can provide an entire set of service manuals in the context instead of trying to put together relevant pieces via RAG.
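
To make the trade-off concrete, a toy sketch of the two approaches (illustrative only; retrieve_top_k stands in for whatever retriever you'd use):

    # Option A: dump the full manuals into a large context window.
    def build_prompt_full(manuals: list[str], question: str) -> str:
        return "\n\n".join(manuals) + f"\n\nQuestion: {question}"

    # Option B: classic RAG - retrieve only the chunks that look relevant.
    def build_prompt_rag(manuals: list[str], question: str, k: int = 5) -> str:
        chunks = [c for m in manuals for c in m.split("\n\n")]
        top = retrieve_top_k(chunks, question, k)   # hypothetical retriever
        return "\n\n".join(top) + f"\n\nQuestion: {question}"

Option A trades tokens for engineering time: no chunking, indexing, or relevance tuning, but it only pays off if the model weights a huge context well, which is exactly the point of contention above.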


If you approximate the youngest age group's data points as a linear trend, it starts well before 2019. After all, they originally were at the same level as the next two higher age groups. So even if you assume that the entire rise after 2020 was due to this cause, it would only explain ~50% of the total effect. And it would not explain at all how older people who were most likely to experience a severe disease (particularly before vaccines) actually show a slight inverse trend, while the age groups in between show barely any statistically significant effect. If you really want to blame covid, I would assume closing schools and mass remote-schooling to protect old people is a much more likely explanation for the trend among the youngest people post 2020. This is the one thing that truly sets them apart from all the other age groups.


This thesis contradicts the chart though. Why would older people be much less affected and the generation 70+ even show a negative trend if these people were far more likely to experience a more severe disease progression? You would expect them to be hit at least as hard (if not harder) as young people from those long term memory effects. The trend for the youngest age group also starts well before 2019.


First, when you have a combination of factors, this can happen.

Second, old people were more likely to die of covid. Kids were getting covid too, just not dying, and long-term covid consequences were observed in them. It can easily be that where an old person died, a young one ended up with long-term consequences.

There is no reason to assume the effect would be uniform across generations.

-----

Either way, cell phone obsession and "rewiring of biology" claims are even further from anything shown in the article. They are both purely what HN and the blogger want it to be.


I don't see your argument. Are you suggesting that covid pruned old people with weak memory to the point that it improved the group's memory on average? Because that is the only conclusion of your argument combined with the data. And that's not just completely unfounded, it's a pretty wild violation of Occam's razor.


In my family, people 70+ stop using mobile phones or never used them in the first place.


4.1 is such an amazing model in so many ways. It's still my nr. 1 choice for many automation tasks. Even the mini version works quite well and it has the same massive context window (nearly 8x GPT-5). Definitely the best non-reasoning model out there for real world tasks.


That's a logical fallacy. Population growth can outgrow food supply thanks to high fertility and access to better hygiene and medical treatment from outside combined with a lack of birth control. So you would still see population growth, but a growing fraction of this population could be malnourished.

That being said, the most common reason is simply war. If you look at the famine in Sudan right now, it is a direct consequence of the civil war (which also happens to be the biggest and bloodiest war by far in the world right now). Lost crops from weather or diseases can also restrict local food production, but it only ever really turns into a problem when armed groups prevent outside food supplies from moving to affected areas like the military in Sudan does right now.


You are telling me that every infomercial I’ve ever seen about starving children in Africa was from war? How often are these people at war?


>How often are these people at war?

More often than people like you realise apparently. But it's not really your fault. If you only consume western news, you might believe that Ukraine or Israel are the worst wars of our time. But that's only because western news doesn't really report on the current situation in Sudan at all, despite it being much, much worse than anything else that's been going on in the world recently.


>we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions

This sounds like a mistake. They used (among others) GPT-2, which has pretty high-dimensional hidden-state vectors. They also kind of arbitrarily define a collision threshold as an l2 distance smaller than 10^-6 for two vectors. Since the outputs are normalized, that corresponds to a ridiculously tiny patch on the surface of the unit sphere. Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal. I would expect the chance of two inputs to map to the same output under these constraints to be astronomically small (like less than one in 10^10000 or something). Even worse than your chances of finding a hash collision in SHA256. Their claim certainly does not sound like something you could verify by testing a few billion examples. Although I'd love to see a detailed calculation. The paper is certainly missing one.
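
A quick numpy experiment illustrates the point (a sketch, using the 768 dimensions mentioned below for GPT-2): even the closest pair among thousands of random unit vectors sits nowhere near the paper's 10^-6 threshold.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 768, 4096                    # dimension, number of samples
    v = rng.standard_normal((n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # random unit vectors

    # Pairwise L2 distances via |a-b|^2 = 2 - 2<a,b> for unit vectors.
    dist = np.sqrt(np.clip(2 - 2 * (v @ v.T), 0, None))
    np.fill_diagonal(dist, np.inf)
    print(dist.min())   # ~1.2-1.3, about 10^6 times the 10^-6 threshold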


As I read it, what they did there was a sanity-check by trusting the birthday paradox. Kind of: "If you get orthogonal vectors due to mere chance once, that's okay, but if you try it billions of times and still get orthogonal vectors every time, mere chance seems a very unlikely explanation."


This has nothing to do with the birthday paradox. That paradox presumes a small countable state space (365) and a large enough # of observations.

In this case, it's a mathematical fact that 2 random vectors in a high-dimensional space are very likely to be nearly orthogonal.


A slightly stronger (and more relevant) statement is that the number of mutually nearly orthogonal vectors you can simultaneously pack into an N dimensional space is exponential in N. Here “mutually nearly orthogonal” can be formally defined as: choose some threshold epsilon>0 - the set S of unit vectors is nearly mutually orthogonal if the maximum of the pairwise dot products between all members of S is less than epsilon. The statement that the size of this set grows exponentially with N is (amazingly) independent of the value of epsilon (although the rate of growth does obviously depend on that value).

This is pretty unintuitive for us 3D beings.


Edit: there are other clarifications, e.g. from the authors on X, so this comment is irrelevant.

The birthday paradox relies on there being a small number of possible birthdays (365-366).

There are not a small number of dimensions being used in the LLM.

The GP argument makes sense to me.


It doesn't need a small number -- rather it relies on you being able to find a pairing amongst any of your candidates, rather than find a pairing for a specific birthday.

That's the paradoxical part: the number of potential pairings for a very small number of people is much higher than one might think, and so for 365 options (in the birthday example) you can get even chances with far fewer than 365, and even far fewer than ½×365, people.


I think you're misunderstanding. If you have an extremely large number like 2^256 you will almost certainly never find two people with the same birthday (this is why a SHA256 collision has never been found). That's what the top-level comment was comparing this to.


We're not using precise numbers here, but a large number of dimensions leads to a very large number of options. 365 is only about 19^2, but 2^100 is astronomically larger than 10^9.


The birthday paradox equation is approximately the square root. You expect to find a collision in 365 possibilities in ~sqrt(365) = ~19 tries.

You expect to find a collision in 2^256 possibilities in ~sqrt(2^256) = ~2^128 tries.

You expect to find a collision in 10^10000 possibilities in ~sqrt(10^10000) = ~10^5000 tries.
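
The approximation behind those numbers, as a sketch:

    import math

    # Birthday bound: with N equally likely values and n samples,
    # P(collision) ~ 1 - exp(-n^2 / (2N)); even odds near n ~ sqrt(N).
    def p_collision(n: float, N: float) -> float:
        return 1 - math.exp(-n * n / (2 * N))

    print(p_collision(23, 365))       # ~0.5: the classic birthday paradox
    print(p_collision(1e9, 2**256))   # ~0.0: billions of tries don't dent SHA256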


The number of dimensions used is 768, wrote someone, and that isn't really very different from 365. But even if the number were big, it could hardly escape fate: x has to be very big to keep (1-(1/x))¹⁰⁰⁰⁰⁰⁰⁰⁰⁰ near 1.


Just to clarify, the total number of possible birthdays is 365 (Jan 1 through Dec 31), but a 768-dimensional continuous vector means there are 768 numbers, each of which can have values from -1 to 1 (at whatever precision floating point can represent). A float has about 2B representable values between -1 and 1 IIRC, so (2B)^768 is a lot more than 365.


I may have misunderstood — don't they test for orthogonality? Orthogonality would seem to drop much of the information in the vectors.


That assumes the random process by which vectors are generated places them at random angles to each other; it doesn't, it places them almost always very, very nearly at (high-dim) right angles.

The underlying geometry isn't random, to this order; it's deterministic.


The nature of high-dimensional spaces kind of intuitively supports the argument for invertibility though, no? In the sense that:

> I would expect the chance of two inputs to map to the same output under these constraints to be astronomically small.


That would be purely statistical and not based on any algorithmic insight. In fact, for hash functions it is quite a common problem that this exact assumption does not hold in the end, even though you might assume so for any "real" scenarios.


> That would be purely statistic and not based on any algorithmic insight.

This is machine learning research ?


Usually we still ask for statistics to be at least valid (i.e. have a significant signal under a null hypothesis). This paper doesn't even do that. It's like claiming no humans have been to the moon and then "verifying" this by asking a million random strangers on the street if they've been there.


I'm not quite getting your point. Are you saying that their definition of "collision" is completely arbitrary (agreed), or that they didn't use enough data points to draw any conclusions because there could be some unknown algorithmic effect that could eventually cause collisions, or something else?


I think they are saying that there is no proof of injectivity. The argument with the hash is essentially saying that doing the same experiment with a hash would yield a similar result, yet hash functions are not injective by definition. So from this experimental result you cannot conclude language models are injective.

That's not really formally true; there are so-called perfect hash functions that are injective over a certain domain, but in most parlance hashing is not considered injective.


Sure, but the paper doesn't claim absolute injectivity. It claims injectivity for practical purposes ("almost surely injective"). That's the same standard to which we hold hash functions -- most of us would consider it reasonable to index an object store with SHA256.


That logic only applies in one direction though. Yes, this is (maybe [0]) practically injective in that you could use it as a hash function, but that says nothing about invertibility. If somebody gave you a function claiming to invert arbitrary sha256 outputs, you would laugh them out of court (as soon as you have even 64-byte inputs, there are, on average, at least 2^256 inputs for each output, meaning it's exceedingly unlikely that their magic machine was able to generate the right one).

Most of the rest of the paper is seemingly actually solid though. They back up their claims with mathematical hand-waving, and their algorithm actually works on their test inputs. That's an interesting result, and a much stronger one than the collision test.

I can't say it's all that surprising in retrospect (you can imagine, e.g., that to get high accuracy on a prompt like <garbage><repeat everything I said><same garbage> you would need to not have lost information in the hidden states when encoding <garbage>, so at least up to ~1/2 the max context window you would expect the model to be injective), but despite aligning with other LLM thoughts I've had I think if you had previously asked me to consider invertibility then I would have argued against the authors' position.

[0] They only tested billions of samples. Even considering the birthday paradox, and even if they'd used a much coarser epsilon threshold, they'd still need to run over 2^380 simulations to gain any confidence whatsoever in terms of collision resistance.


The problem with "almost surely injective" for "practical purposes". Is that when you try to invert something, how do you know the result you get is one of those "practical purposes" ?

We aren't just trying to claim that two inputs are the same, as in hashing. We are trying to recover lost inputs.


You don't, I guess. But again that's just the same as when you insert something into an object store: you can't be absolutely certain that a future retrieval will give you the same object and not a colliding blob. It's just good enough for all practical purposes.


Well that's not a problem, that's just a description of what "almost surely" means. The thesis is "contrary to popular opinion, you can more-or-less invert the model". Not exactly invert it--don't use it in court!--but like, mostly. The prevailing wisdom that you cannot is incorrect.


I don't think that intuition is entirely trustworthy here. The entire space is high-dimensional, true, but the structure of the subspace encompassing linguistically sensible sequences of tokens will necessarily be restricted and have some sort of structure. And within such subspaces there may occur some sort of sink or attractor. Proving that those don't exist in general seems highly nontrivial to me.

An intuitive argument against the claim could be made from the observation that people "jinx" each other IRL every day, despite reality being vast, if you get what I mean.


I do get what you're saying, and it sounds almost analogous to visualisations of bad PRNGs, e.g. https://www.reddit.com/r/dataisbeautiful/comments/gv4fhr/oc_...


I envy your intuition about high-dimensional spaces, as I have none (other than "here lies dragons"). (I think your intuition is broadly correct, seeing as billions of collision tests feels quite inadequate given the size of the space.)

> Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.

What's the intuition here? Law of large numbers?

And how is orthogonality related to distance? Expansion of |a-b|^2 = |a|^2 + |b|^2 - 2<a,b> = 2 - 2<a,b> which is roughly 2 if the unit vectors are basically orthogonal?

> Since the outputs are normalized, that corresponds to a ridiculously tiny patch on the surface of the unit sphere.

I also have no intuition regarding the surface of the unit sphere in high-dimensional vector spaces. I believe it vanishes. I suppose this patch also vanishes in terms of area. But what's the relative rate of those terms going to zero?


> > Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.

> What's the intuition here? Law of large numbers?

Imagine for simplicity that we consider only vectors pointing parallel/antiparallel to coordinate axes.

- In 1D, you have two possibilities: {+e_x, -e_x}. So if you pick two random vectors from this set, the probability of getting something orthogonal is 0.

- In 2D, you have four possibilities: {±e_x, ±e_y}. If we pick one random vector and get e.g. +e_x, then picking another one randomly from the set has a 50% chance of getting something orthogonal (±e_y are 2/4 possibilities). Same for other choices of the first vector.

- In 3D, you have six possibilities: {±e_x, ±e_y, ±e_z}. Repeat the same experiment, and you'll find a 66.7% chance of getting something orthogonal.

- In N dimensions, you can see that the chance of getting something orthogonal is 1 - 1/N, which tends to 100% as N becomes large.

Now, this discretization is a simplification of course, but I think it gets the intuition right.


I think that's a good answer for practical purposes.

Theoretically, I can claim that N random vectors of zero-mean real numbers (say standard deviation of 1 per element) will "with probability 1" span an N-dimensional space. I can even grind on, subtracting the parallel parts of each vector pair, until I have N orthogonal vectors. ("Gram-Schmidt" from high school.) I believe I can "prove" that.

So then mapping using those vectors is "invertible." Nyeah. But back in numerical reality, I think the resulting inverse will become practically useless as N gets large.

That's without the nonlinear elements. Which are designed to make the system non-invertible. It's not shocking if someone proves mathematically that this doesn't quite technically work. I think it would only be interesting if they can find numerically useful inverses for an LLM that has interesting behavior.

All -- I haven't thought very clearly about this. If I've screwed something up, please correct me gently but firmly. Thanks.


For 768 dimensions, you'd still expect to hit (1-1/N) with a few billion samples though. Like, that's a 1/N of 0.13%, which quite frankly isn't that rare at all?

Of course our vectors are not only points along the coordinate axes, but it still isn't that small compared to billions of samples.


Bear in mind that these are not basis vectors at this stage (which would indeed give you 1/768). They are arbitrary linear combinations. There are exponentially many nearly orthogonal such vectors for small epsilon. And epsilon is chosen pretty small in the paper.


> What's the intuition here? Law of large numbers?

For unit vectors the cosine of the angle between them is a1*b1+a2*b2+...+an*bn.

Each of the terms has mean 0 and when you sum many of them the sum concentrates closer and closer to 0 (intuitively the positive and negative terms will tend to cancel out, and in fact the standard deviation is 1/√n).
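
A quick check of that claim (sketch):

    import numpy as np

    rng = np.random.default_rng(1)
    n, trials = 768, 10_000
    a = rng.standard_normal((trials, n))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b = rng.standard_normal((trials, n))
    b /= np.linalg.norm(b, axis=1, keepdims=True)

    cos = (a * b).sum(axis=1)            # cosine of the angle, per pair
    print(cos.std(), 1 / np.sqrt(n))     # both ~0.036: concentrated near 0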


> > Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.

> What's the intuition here? Law of large numbers?

Yep, the large number being the number of dimensions.

As you add another dimension to a random point on a unit sphere, you create another new way for this point to be far away from a starting neighbor. Increase the dimensions a lot, and then all random neighbors are on the equator relative to the starting neighbor. The equator here is a 'hyperplane' (just like a 2D plane in 3D) of dimension n-1, whose normal is the starting neighbor, intersected with the unit sphere (thus becoming an n-2 dimensional 'variety', or shape, embedded in the original n dimensional space; like how the earth's equator is a 1-dimensional object).

The mathematical name for this is 'concentration of measure' [1]

It feels weird to think about it, but there's also a unit change in here. Paris is about 1/8 of the circle away from the north pole (8 such angle segments of freedom). On a circle. But if that were the definition of the location of Paris, on the 3D earth there would be an infinity of Parises. There is only one, though. Now if we take into account longitude, we have Montreal, Vancouver, Tokyo, etc.; each 1/8 away (and now we have 64 solid angle segments of freedom).

[1] https://www.johndcook.com/blog/2017/07/13/concentration_of_m...


> What's the intuition here? Law of large numbers?

"Concentration of measure"

https://en.wikipedia.org/wiki/Concentration_of_measure


I think that the latent space that GPT-2 uses has 768 dimensions (i.e. embedding vectors have that many components).


It doesn't really matter which vector you are looking at, since they are using a tiny constraint in a high dimensional continuous space. There's gotta be an unfathomable amount of vectors you can fit in there. Certainly more than a few billion.


No, yeah, totally. Even assuming binary vectors 2^768 is a ridiculously huge number. The probability of collision even assuming a bad sampling that discards 75% of dimensions is still vanishingly small.


> Just intuitively, in such a high dimensional space, two random vectors are basically orthogonal.

Which, incidentally, is the main reason why deep learning and LLMs are effective in the first place.

A vector of a few thousand dimensions would be woefully inadequate to represent all of human knowledge, if not for the fact that it works as the projection of a much higher, potentially infinite-dimensional vector representing all possible knowledge. The smaller-sized one works in practice as a projection, precisely because any two such vectors are almost always orthogonal.


Two random vectors are almost always neither collinear nor orthogonal. So what you mean is either "not collinear", which is a trivial statement, or something like "their dot product is much smaller than abs(length(vecA) * length(vecB))", which is probably interesting but still not very clear.


Well, the actually interesting part is that as the vector dimension grows, random vectors become almost orthogonal (something something exponential number of almost-orthogonal vectors). This is probably the most important reason why text embedding works: you can take some structure from a 10^6-dimensional space, project it to 10^3 dimensions, and still keep the distances between all vectors.
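
That is essentially the Johnson-Lindenstrauss effect. A minimal sketch (scaled down to 10^4 -> 10^3 so it runs quickly; assumes numpy, scipy, and scikit-learn):

    import numpy as np
    from scipy.spatial.distance import pdist
    from sklearn.random_projection import GaussianRandomProjection

    rng = np.random.default_rng(2)
    X = rng.standard_normal((100, 10_000))   # 100 points in a 10^4-dim space

    # Random linear projection down to 10^3 dimensions.
    Y = GaussianRandomProjection(n_components=1_000,
                                 random_state=0).fit_transform(X)

    ratio = pdist(Y) / pdist(X)       # projected distance / original distance
    print(ratio.min(), ratio.max())   # both close to 1: distances survive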


Tbf that's a new-ish principle. 2003 was the Windows XP era and the early days of Metasploit. I.e. Microsoft and all the other companies were still figuring out this internet thing, while most computers were riddled with unpatched vulnerabilities. Zero-days were hardly a concept back then, because you could keep using many exploits years later.


But Windows Update was definitely already a thing back then, so I don’t think this “Microsoft was still figuring out this Internet thing” holds.

Software was updated all the time, and it’s much more difficult to do that with locks.


> But Windows Update was definitely already a thing back then, so I don’t think this “Microsoft was still figuring out this Internet thing” holds.

They had update mechanisms, sure. But it was very much up to you to run them. When XP came out most people used dial-up (at least in the UK); after 2002, ADSL internet started to become ubiquitous and computers were on the internet for longer periods.

They had to start baking security into every aspect of the OS. It was one of the reasons Vista came out several years later than planned. They had to pull people from Vista development and move them onto Windows XP SP2.

One of the reasons Vista was such a reviled OS is because the UAC controls broke lots of pieces of software which ran under XP, 2000 and 98.

> Software was updated all the time, and it’s much more difficult to do that with locks.

It wasn't unusual to run un-patched software that came from a disc for years. You had to manually download patches and run them yourself. A software update / next version could take like 30 minutes or so on 56k dial-up to download. If you didn't need to download a patch, you probably didn't.


It was a thing, but it was also a thing to have it disabled or simply not working. XP was famous for its hackability. And web frameworks were also far from what you see today with auto updates. It's hard to describe to people who were not involved how crazy ITsec was back then. It felt like the wild west compared to today. Literally every other DB had a critical unpatched vulnerability. Thankfully Shodan did not exist yet, so the barrier to entry was high for people without a particular skillset (which was also much harder to learn back then). But MSF pushed security awareness pretty hard once people realized how easy it can be if you just collect a bunch of scripts for common exploits in a simple framework that everyone can learn.


Oh, the bugtraq era, when any grade schooler could download a 0day POC and force remote reboot his classmates' laptops. (I'm told)


Grade schoolers didn’t exactly have laptops in the 00s.


Thanks to the largess of a media company (read: school admin golfed with the right people), we had them issued around '97.

A lot of kids learned about cybersecurity and emulator config (and Harvest Moon) because of it, so net win?


Totally true. Also consider that although software can theoretically or technically be patched, sometimes patches just don't exist... the amount of unmaintained but still useful software is just huge.


Doctors have been telling us that for decades now and still no one does it, despite overwhelming evidence. I guess the average Joe will always need a cheap workaround drug rather than putting themselves through any level of physical discomfort.


Lazy people will be lazy, whatever nasty side effects it brings down the line. These days they will also 'brag' about it online.

