So we are in agreement that "weights" are not source code. Training data might n...

ecb_penguin · 2025-12-28T16:15:05 1766938505

It is absurd to think that releasing open source code also requires releasing thousands of terabytes of Twitter and Reddit posts.

You already have access to all the training data everyone else is using.... You can download an offline version of Wikipedia. Here's every Reddit comment for a decade: https://academictorrents.com/details/ba051999301b109eab37d16...

_flux · 2025-12-28T20:50:09 1766955009

I mean no, you don't need to be open source at all. Just don't release the data and call the release "open weights". Or do release the data, and the training process, and call yourself "open source".

Though, I do think it's still acceptable to just point to how to get the data (e.g. if it was the offline version of Wikipedia, then a URL to that) if actually providing the source data is overwhelming. Offering to provide a copy at cost would also be quite acceptable (i.e. I deliver the media to you to make a copy).

But if there's no way another person can acquire that data, even in theory, then I think it's pretty clear the source was not open. Just use the more appropriate term and everyone is on the same page about what the release is.