Hacker News

Both Napster and the Pirate Bay founders argued that only users could be held responsible, since it was the user who requested the infringing files. It did not stop the courts.

Anyone could use those tools to download Creative Commons files and Linux ISOs, but those arguments did not succeed in the legal system. BitTorrent as a technology was, however, not made illegal, as can be seen in games that use it to distribute patches.




Napster and the Pirate Bay struggled because the vast majority of the content on them was pirated. You would be hard pressed to say a significant minority of generative AI output has any copyright issues, much less copyright issues as blatant as straight piracy.


I’m not convinced any of the output of these generative AIs is free from copyright issues. Consider: a ROT13 copy of a book may at first glance look nothing like the original, but distributing digital copies of it would be clear copyright infringement.
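To illustrate why surface dissimilarity proves nothing, here is a minimal sketch of the ROT13 example using Python's built-in `codecs` module: the encoded text is unrecognizable yet carries every bit of the original.

```python
import codecs

original = "Distributing this would be infringement."

# ROT13 rotates each letter 13 places; the result looks nothing like the input.
encoded = codecs.encode(original, "rot13")
print(encoded)  # Qvfgevohgvat guvf jbhyq or vasevatrzrag.

# Applying the same transform again restores the original: no information was lost.
assert codecs.decode(encoded, "rot13") == original
```

The transform is trivially reversible, which is exactly the point: infringement turns on what information is retained and distributed, not on whether the bytes look similar.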

Feature extraction is literally a form of lossy compression. You can prod DALL-E into making obvious copies of some of the works it was trained on, but even seemingly novel images could contain enough similarity to training material to be problematic.


I think this is basically the same as the sentiment that there is no such thing as a truly novel idea.

The standard isn’t “I think that looks like an Andy Warhol picture”, it is “That is substantially similar to a specific Andy Warhol piece”. Copyright doesn’t protect style.

> Feature extraction is literally a form of lossy compression.

This is one way to think of neural nets; another is that they learn the topology of the space of pictures.

But these are just models of computation, which aren’t especially relevant in the same way that it isn’t relevant what produces an infringing image, just that it is produced.

Which brings me back to my original point: there are two different barriers for generative AI: is the model itself transformative, and is the primary purpose of the model to generate copyright-infringing material?

With respect to the first… I have no idea how someone could argue that the model itself isn’t transformative enough. It isn’t “substantially similar” to any of the works that it is trained on. It might be able to generate things that are “substantially similar”, but the model itself isn’t… it’s just a bunch of numbers and code.

Regarding the second: I have less experience with image models, but I use ChatGPT regularly without even trying to violate copyright, and I don’t think I’m alone, so I doubt you could make an argument that LLMs have a primary purpose of committing copyright infringement.


> it’s just a bunch of numbers and code

That really doesn’t fly legally because any digital format is ‘just’ numbers.


Yes, you are correct, I was being flippant.

But I think the greater point still stands. In order to call the model itself a copyright violation you would have to say it is "substantially similar" to thousands of original works. Then in order to make it illegal you would have to say it wasn't "transformative enough" to be considered fair use.

I can't come up with an argument for either one of those points that holds any water at all.


Copyright is not cooties. For something to be infringing it has to be beyond the de minimis threshold. It’s not enough to show that a copyrighted work influenced another work, there needs to be some substantial level of copying.

The music industry has been going through exactly this for the last few years, and the courts have recognized that the creative process necessarily involves copying and that a small amount of copying is not infringement.


The de minimis threshold is shockingly low as seen in various successful lawsuits.

Critically, it’s not just a question of what percentage of a work is a copy of the original, but which parts of the original work were copied. I.e., copying 3 lines of a book is a tiny fraction of the book, but if you copied half of a poem, it’s well past the de minimis threshold.

Similarly, only a small percentage of a giant library of MP3s comes from any one work, but that’s not relevant.


Exactly, but the kinds of things that Copilot is taking are necessarily very generic. It’s not going to be taking the “special sauce” from an open source project, because that is very unlikely to be the most probable continuation of any prompt that would occur in normal usage.

Copilot is taking things like “reverse a string” or “escape HTML tags”, which have very little originality to start with. This kind of common language is analogous to the musical motifs that have also been found to be under the threshold.
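For illustration (these are my own hypothetical examples, not actual Copilot output), the kind of generic snippet at issue looks like this in Python: there is essentially one idiomatic way to write each, which is why originality claims are weak.

```python
import html

def reverse_string(s: str) -> str:
    # The one-liner nearly every Python programmer converges on.
    return s[::-1]

def escape_html_tags(text: str) -> str:
    # Delegates to the standard library; any independent author
    # would plausibly produce the same line.
    return html.escape(text)

print(reverse_string("copilot"))      # tolipoc
print(escape_html_tags("<b>hi</b>"))  # &lt;b&gt;hi&lt;/b&gt;
```

When the most probable continuation of a prompt is also the only sensible implementation, reproducing it is hard to distinguish from independent creation.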


The foundational models that most coding models are built on may have such comments and code in them. They’re almost certainly built on a number of legal violations, including copyright infringement. I have details in the “Proving Wrongdoing” section here:

https://www.heswithjesus.com/tech/exploringai/index.html

I’ve also seen GPT spit out, word for word, proprietary content that isn’t licensed for commercial use as far as I’m aware. They probably got it from web crawling without checking licenses.

What I want more than anything in this space right now are two models: one trained on all public-domain books (e.g., Gutenberg), and one trained on permissively licensed code in at least Python, JavaScript, HTML/CSS, C, C++, and assembly. One layered on the other, but released individually. We could keep using those to generate everything from synthetic data to code to revenue-producing deliverables, all with nearly zero legal risk.


The problem with this line of thinking is that a person can also cut and paste code that they don’t have a license to use… but until they do, they haven’t done anything wrong by reading the code.

So either we carve out an explicit exception that machines aren’t allowed to do things that are remarkably similar to what humans do… which would be a massive setback for AI in the US.

Or we agree that generative models are subject to the same rules that humans are — they can’t commit copyright infringement, but are able to appropriately consume copyrighted material that a human would be able to consume.

The second option seems to me to be much simpler, nicer, and more appropriate than the first.


I agree with you. In fact, the latter is in my Alternative Models section.


Yeah, they seem to be applying legal theory like programming code. If what they said about Napster applied to every website, then literally every site with user-generated content would be instantly wiped off the internet.


> You would be hard pressed to say a significant minority of generative AI output has any copyright issues, much less copyright issues as blatant as straight piracy.

Where generative AI ingests copyrighted works in order to work and bases its output on them, it is copyright infringement, equivalent to 'straight piracy' of all that it ingested, unless it's deemed fair use.

What Google does with its search engine, for example, is fair use, what Napster did was not.


I suspect that your extreme position must come with a sincere belief that nothing even close to human intelligence will ever be achieved. Imagine a robot with an AI brain. It would have to be blindfolded because just by learning to recognize, or even just viewing, the label on a can of Coke it would have “copied” it and become “illegal,” especially if it was capable of sketching it on demand. Any kind of intelligence cannot even simply view or listen to the world without encountering something IP-encumbered.

Learning, by human or machine, means extracting a copy of the essence of something and yes, storing that essence in a lossy way. It seems like learning from copyright-encumbered material ought to either be illegal for both, or legal. I know which world I would rather live in.


> Where generative AI ingests copyrighted works in order to work and bases its output on them, it is copyright infringement

This is an absurd standard. Is it copyright infringement when a human "ingests" copyrighted work and bases their output on it? Because that's commonly called inspiration and is how every artist creates their work - through experiencing other works and using that cumulative inspiration to form their own product.

Copyright law is already ridiculously restrictive as it is; this proposal not only fundamentally misunderstands how generative AI works but penalises AI for doing what humans do every day.


Yes, but I believe there are two questions here:

* Does the model itself violate copyright?

* Does the output of the model violate copyright?

I don't know how you could make an argument that the ingestion of information into a model through a training procedure in order to create something that can generate truly unique outputs isn't transformative of the original works. The legal standard for a new work to be considered a copyright violation of an original work is "substantial similarity". I don't know how you can make an argument that a generative model is "substantially similar" to thousands of original works...

Honestly, I'm not even sure if "fair use" comes into play for the model itself. In order for fair use to come into play, the model has to be deemed to be violating some copyright. Only once it is found to be violating does "fair use" come into play in order to figure out if it is illegal or not.

The second question is the one where fair use is likely to come into play more. And this question has to be asked of each output. The model's legality only becomes an issue here if, like Napster, you can't argue that the model has much point other than violating copyright. Napster didn't violate copyright (the code for Napster wasn't infringing on anything), but it enabled the violation of copyright and didn't have much point other than that.

I don't think you can make that argument though. I use ChatGPT most days, and I've never gotten copyrighted material out of it. I could ask it to write me some Disney fan fiction, which would violate a copyright. And I think there is a valid legal question here about who is responsible for preventing me from doing that. This is where I think the gray area is.


Those tools were overwhelmingly used for piracy; nearly no one was downloading actually legal public-domain songs or Linux ISOs from them. In contrast, people are using LLMs and generative AI for actual work, not for piracy, which will be seen differently by the courts.





