As well, Google is, because of these things, somewhat acting as a library. An li...

kmeisthax · on Jan 6, 2023

Not really. Google won because Google Books was not actually a new concept; someone else had already built a book search engine the same way Google did, also got sued by the Authors Guild, and also prevailed. The only thing different about Google Books was that it'd give you two pages worth of excerpt out of the book. So it was very easy for a court to extend the fair use logic that they had already weaved into the law.

I still think "training is fair use" still has a leg to stand on, though. But it doesn't save GitHub Copilot because they're not merely training a model; they're selling access to its outputs and telling people they have "full commercial rights" to its outputs (i.e. sublicensing). Fair use is not transitive; if I make 100 Google Books searches to get all the pages out of a book, I don't suddenly own the book. There is no "copyright laundry" here.

gnramires · on Jan 6, 2023

> I still think "training is fair use" still has a leg to stand on, though

If that's the case, we need to serious re-consider how we reward Open Source as a society (I think that would be fantastic anyway!) -- we have people producing knowledge and others profiting directly from this material, producing new content and new code that's incompatible with the original license.

You make GPL code, a make an AI that learns from GPL code, shouldn't its output be GPL licensed as well?

I think meanwhile the most reasonable solution is that an AI should always produce content compatible with the training material licenses. So if you want to use GPL training sets, you can only use that to create GPL-compatible code. If you use public domain (or e.g. 0BSD?) training sets, you can produce any code I guess.

kmeisthax · on Jan 6, 2023

> You make GPL code, a make an AI that learns from GPL code, shouldn't its output be GPL licensed as well?

If the output (not just the model) can be determined to be a derivative work of the input, or the model is overfit and regurgitating training set data, then yes. It should. And a court would make the same demands, because fair use is intransitive - you cannot reach through a fair use to make an unfair use. So each model invocation creates a new question of "did I just copy GPL code or not".

gnramires · on Jan 6, 2023

It would be an essential feature, imo, to have this 'near-verbatim check' for copyleft code.

Overall it feels like it's a bit too much of specialized learning on GPL/Copyleft code to be fair. It's not like a human that reads some source code and gets an idea how it works. It's really learning code from scratch on Copyleft code, without which it would likely perform much worse and not generate a number of examples. It's not just copy-paste, but it's closer on the spectrum to copy paste than just super-abstract inspiration to feel fair.

As others have said, I don't think it would be fine (specially from big companies pov.) to decompile proprietary code (or just grab publicly available but illegal to reproduce code) and have AIs learn from that in a way that seems different in scope and ability to human research and reverse engineering.

I think we need a good tradeoff that isn't ludditism (that would reject a benefit for us all, i.e. that is good for everyone), but that still promotes and maintains open source software. In this case it's really a public "good" that's being seized and commercialized, that doesn't seem quite right: make copilot public, or use only permitted code (or share your revenue with developers -- although that would seem more complicated and up to each copyright holder to re-license for this usage). I remember not long ago MS declaring Open Source was a kind of "Cancer", now they're relying on it to sell their programming AIs. I personally think Open Source is quite the opposite of cancer, it is usually an unmitigated social good.

Much of the same could be said for the case of artists an generative AI art.

And this isn't even starting on how we move forward as a society that has highly automated most jobs and needs to distribute the resources and wealth in a good way to enable greatest wellbeing for all beings.

bbarnett · on Jan 6, 2023

You make GPL code, a make an AI that learns from GPL code, shouldn't its output be GPL licensed as well?

I think, for the desired outcome to occur, you should instead ask:

You write close sourced code, then a make an AI that learns from that code, shouldn't its output be licensed as well?

Ask the above, and suddenly Microsoft will agree.

andybak · on Jan 6, 2023

Depends if you think the GPL means "copyright is great!" vs "let's use their biggest weapon against them..."

It's a surprisingly subtle distinction.

EDIT - if I squint hard enough in exactly the right way, there's a sense in which CoPilot etc aligns perfectly with the goals of the free software movement. A world in which you can use it as a code copyright laundry might be a world where code is actually free.

Is that any weirder than bizarre legal contortions such as the Google/Oracle "9 lines of code"? Or the whole dance around reverse engineering: "It's OK if you never saw the actual code but you're allowed to read comprehensive notes from someone who did"..?

There's a ton of examples like this. Tell me with a straight face that there's a clear moral line in either copyright or patent law as it relates to software.

IP is a mess and it's not clear who benefits. Is a world where code isn't subject to copyright so bad?

kmeisthax · on Jan 7, 2023

If Copilot was released as FOSS with trained model weights, I don't think the Free Software movement would have "shot first" in the resulting copyright fight.

It is specifically the idea of using copyright to eat itself that is harmed by AI training. In the world where we currently live in, only source code can be trained on. If I want to train an AI on, say, the NT kernel; I have to decompile it first, and even then it's not going to be good training data because there's no comments or variable names to guide the AI. The whole point of the GPL was to force other companies to not lock down programs and withhold source code, after all.

Keep in mind too that AI is basically proprietary software's final form. Not even the creator of an AI program has anything that resembles "source code"; and a good chunk of AI safety research boils down to "here's a program you can't comprehend except through gradient descent, how do we design it to have an incentive to not do bad things".

If you like copyright licensing and just view the GPL as an exception sales vehicle, then AI is less of a threat, because it's just another thing to sell licenses for.

skissane · on Jan 6, 2023

> You write close sourced code, then a make an AI that learns from that code, shouldn't its output be licensed as well?

> Ask the above, and suddenly Microsoft will agree.

Does Microsoft actually agree? Many people have posted leaked/stolen Microsoft code (such as Windows, MS-DOS 6) to GitHub. Microsoft doesn't seem to make a very serious effort to stop it – sometimes they DMCA repos hosting it, but others have stayed up for ages. They could easily build some system to automatically detect and takedown leaks of their own code, but they haven't. Given this reality, if they trained GitHub Copilot on all public GitHub repos, it seems likely that its training included leaked Microsoft source code. If true, that means Microsoft doesn't actually have a problem with people using the outputs of an AI trained on their own closed source code.

emodendroket · on Jan 6, 2023

> If that's the case, we need to serious re-consider how we reward Open Source as a society (I think that would be fantastic anyway!) -- we have people producing knowledge and others profiting directly from this material, producing new content and new code that's incompatible with the original license.

Is that new? If I include some excerpt from copyrighted material in my own work and it's deemed to be fair use, that doesn't limit my right to profit from the work, sell the copyright to someone else, and so on, does it?

nradov · on Jan 6, 2023

If open source code authors (and other content creators) don't want their IP to be used in AI training data sets then they can simply change the license terms to prohibit that use. And if they really want to control how their IP is used then they shouldn't host it on GitHub in the first place. Of course Microsoft is going to look for ways to monetize that data.

b3morales · on Jan 6, 2023

> they can simply change the license terms to prohibit that use

GitHub's argument* is not that they're following the license but that the license does not apply to their use. So they would continue to ignore any provision that says they can't use the material for training.

Previously discussed: https://news.ycombinator.com/item?id=27740001

Moving off GitHub is a better step at a practical level. But again they claim the license doesn't matter, so even if it's hosted publicly elsewhere they would (presumably) maintain that they can still scoop it up. It just becomes more work, for them, to do so.

*Which is completely wrong in my opinion, for the record

bbarnett · on Jan 6, 2023

And if they really want to control how their IP is used then they shouldn't host it on GitHub in the first place

No. It's called copyright, it is enforceable, and that's the control.

GPL source code is available everywhere, in all formats, in textbooks, on CDs, on websites, but it is still gpl.

And Microsoft doesn't get to scrub the license.

naasking · on Jan 6, 2023

> But it doesn't save GitHub Copilot because they're not merely training a model; they're selling access to its outputs and telling people they have "full commercial rights" to its outputs (i.e. sublicensing).

But if you read the source code of 100 different projects to learn how they worked and then someone hired you to write a program that uses this knowledge, that should be legit. I'm not sure if the law currently makes a distinction between learning vs. remixing, and if Copilot would qualify as learning.

marginalia_nu · on Jan 6, 2023

That's not necessarily true at all. There's even techniques designed to demonstrably avoid such knowledge-contamination.

https://en.m.wikipedia.org/wiki/Clean_room_design

corysama · on Jan 6, 2023

That kind of legal ass-covering is expedient when you are going to explicitly reproduce someone else’s source-available work. It’s cheaper in that case to go through the whole clean room hassle than to risk getting into an intractable argument in court about how your code that does exactly the same thing as someone else’s code came to resemble the other people’s code so much.

But, for the general case, the argument still stands. I have looked at GPL code before. I might have even learned something from it. Is my brain infected? Am I required by law to license everything I ever make as GPL for the remainder of my days?

naasking · on Jan 6, 2023

Yes, it will sometimes depend on the unique qualities of the code. For instance, if you learned a new sorting algorithm from a C repo, and then wrote a comparable imperative solution in OCaml, that might be a derivative work. But if you wrote a purely functional equivalent of that algorithm, I don't think that could be considered a derivative work.

jacquesm · on Jan 6, 2023

Precisely, that is the key point.

Cthulhu_ · on Jan 6, 2023

And Google kept the copyright notices and attributions; probably not super relevant, but it's a difference between the two cases.

I mean in essence Github is a library; they did have a license to a point to do with the code as they pleased, but they then started to create a derivative work in the form of an AI, without correctly crediting the source materials.

I mean I think they made a gamble on it; as far as I'm aware, AI training sets were yet unchallenged in a court of law, so legally not fully defined yet. These lawsuits - and the ones (if any) aimed at the image generators, using CC artwork from e.g. artstation - will lay the legal groundwork for future AI / ML development.

ghaff · on Jan 6, 2023

Libraries are really not very special. They mostly exist on the basis of first-sale doctrine and have to subscribe to electronic services like everyone else.

Entities like the Internet Archive skate by (at least before their book lending stunt during COVID) by being non-profit and bending over backwards to respect even retrospective robots.txt instructions, meaning that it's not really worth suing them given they'll mostly do what you ask anyway.

But I guarantee you that if I set up a best comic strips of all time library I'll probably be in court.

toomuchtodo · on Jan 6, 2023

What if a library builds AI models from the content they host? The Internet Archive, for example.