Microsoft will assume liability for legal copyright risks of Copilot

tremon · on Sept 7, 2023

Let Microsoft first publish a Copilot model that's trained on the internal codebases of Azure, Windows and Office. That's the only way Microsoft can convince me that they truly believe Copilot is non-infringing technology.

londons_explore · on Sept 7, 2023

I suspect Microsoft would earn more money by doing this.

Their own engineers would get productivity boosts - with copilot already being familiar with data structures, code style, etc. would be a big boost to accuracy.

But also, third party code would end up being more similar. Code style of the whole world would be pushed towards 'Microsoft style', which probably makes hiring easier, less training time for engineers, etc.

And the downside, that is outsiders might learn tiny nuggets of info about microsoft sources, is probably irrelevant when outsiders can already decompile binaries and learn far more.

chii · on Sept 8, 2023

> is probably irrelevant when outsiders can already decompile binaries and learn far more.

most, if not all microsoft products can have their sources be available for viewing, if you are one of those vip development partners. microsoft doesn't really have any secret source (pardon the pun) of which the leaking would undo their value proposition.

In fact, if microsft opened up their system a bit more, they might even gain some PR or mindshare, and have no effect on, if not increase, their bottom line.

zargon · on Sept 8, 2023

It would be surprising to me if their internal engineers don't already have access to a model trained on internal Microsoft code.

dh2022 · on Sept 7, 2023

You are assuming Microsoft code base is superior to Linux / Git / MySql / whatever else is in github right now. That is a .... big assumption.

And if Microsoft's code ends up influencing the rest of the world code that would be a .... big downside.

Karellen · on Sept 7, 2023

I don't think you should be looking at the best of the Microsoft/GitHub corpora to gauge their overall quality. You probably want to be looking at the quality of the median project, which is going to be heavily influenced by the long tail of low quality projects.

IMO, the long tail of non-code-reviewed, written-by-someone-in-their-first-month-of-coding, barely-even-compiles noob code[0] in Github is going to be orders of magnitude larger than the long tail of crap in Microsoft's internal repos.

[0] Hey, everyone has to start somewhere. There's nothing wrong with your first "hello world" program being buggy - that's what being a beginner means. But it's probably not the sort of code you want to train an LLM on.

Wowfunhappy · on Sept 8, 2023

Now I'm wondering if the copilot AI (GPT3/4?) takes number of stars/forks/etc into account during the training process.

singleshot_ · on Sept 7, 2023

I would think if the LLM knew it was rookie code, it could actually be pretty useful, no?

giantg2 · on Sept 7, 2023

Without a myriad of dumbasses like me being able to commit to Microsoft vs Github, I'd assume Microsoft's average is better than Github's.

blackhole · on Sept 7, 2023

That is a... bold assumption to make. Not just for Microsoft but for any large corporation.

lelanthran · on Sept 8, 2023

> That is a... bold assumption to make. Not just for Microsoft but for any large corporation.

I dunno; the average project on github isn't code-reviewed, while all the projects at Microsoft are.

giantg2 · on Sept 7, 2023

I'm not saying bad code doesn't exist there. My thought is that the percent of bad code increases with volume (or at least higher number of producers). Tens of millions of people committing to Github should mean its more cluttered with garbage than in MS. I at least assume MS has some automated code standard or security scans. That's at least more than nothing.

dh2022 · on Sept 7, 2023

" I at least assume MS has some automated code standard or security scans." -- that is a .... big assumption.

giantg2 · on Sept 7, 2023

No, it really isn't when we're dealing with an organization that is audited for SOC 1/2, DoD, and likely others.

clankyclanker · on Sept 7, 2023

Are you sure?

https://arstechnica.com/security/2023/09/hack-of-a-microsoft...

The Azure-State-Department breach had nearly a half dozen contributing bugs...

giantg2 · on Sept 7, 2023

And how does that compare to all the bugs on Github?

dh2022 · on Sept 7, 2023

My friend - Chinese secret services read Secretary of Commerce's emails because of Microsoft's security leaks: https://abcnews.go.com/Politics/commerce-secretary-gina-raim...

So yeah, assuming Microsoft systems are up to standard or have security reviews or whatever is a .... big assumption.

tmpX7dMeXU · on Sept 7, 2023

My friend, nobody implied that any of these things result in a foolproof system.

lannisterstark · on Sept 7, 2023

Do you have the exact style of talking in all your comments lol?

dietr1ch · on Sept 7, 2023

Yup, individuals struggling to have impact will cut corners and heavily impact tech debt :P

zik · on Sept 7, 2023

I have a friend who worked at Microsoft... if his opinion is anything to go by that's very far from true.

eru · on Sept 8, 2023

> You are assuming Microsoft code base is superior to Linux / Git / MySql / whatever else is in github right now.

How do you get that impression from the comment? I don't see anything implying that.

ryanwaggoner · on Sept 7, 2023

Not for Microsoft it wouldn’t, which was their point.

totallywrong · on Sept 8, 2023

> Code style of the whole world would be pushed towards 'Microsoft style'

Yes, that's exactly what the world needs, more software like Teams.

trgdr · on Sept 8, 2023

I mean, microsoft's code is probably better than the github average. There's an awful lot of horrific code out there.

eitally · on Sept 7, 2023

I don't know about MSFT, but I bet this would really help Google a ton. With a mono-repo and huge focus on readability, not to mention how many thousands of SWEs spend the majority of their time slinging protobufs around, it seems a significant fraction of day-to-day code could be largely automated.

liliumregale · on Sept 7, 2023

Google absolutely has their own internal models that do exactly this. It wouldn't surprise me if Microsoft indeed does have an internal Copilot that is trained on their data, but even on the smallest risk that they leak their code, they wouldn't share that particular model.

versteegen · on Sept 7, 2023

What does "absolutely has" mean here? Have you actually heard anything about such internal models?

phyrex · on Sept 8, 2023

Why wouldn’t they? Meta does, and they write openly about it

dtagames · on Sept 8, 2023

This is incorrect and not how Copilot works. My company just hosted two MS engineers to explain it live to 175 of us.

The style applied by Copilot comes from your surrounding code context, not from the LLM. And that base, trained on all public repos from GitHub, knows everything about data structures, etc, in the languages that were scanned.

Nothing new would be gained by scanning MS's own repositories and nothing would be leaked or color the output in actual use.

circuit10 · on Sept 7, 2023

They're not claiming that it can never spit out code exactly, but that they will take liability for if:

- It does

- The user didn't turn off the filters that prevent this

- The user didn't intentionally make it do it

- This use is found to be illegal

There's a difference between code that needs to be kept private from bad actors (from their point of view at least) and code that is public but with restrictions on its use that anyone who gets it should be aware of. This is like saying "if you truly believe that license agreements are legally binding, then publish your user's passwords publically with a license saying no one can use them"

klyrs · on Sept 7, 2023

> This use is found to be illegal

This being the real hurdle. With Microsoft money behind the defense, only megacorps can win.

eru · on Sept 8, 2023

Microsoft has lost legal battles against non-megacorps in the past.

I remember some guy representing himself and winning some dispute over shrink wrap licenses and student discounts.

matheusmoreira · on Sept 8, 2023

That's badass. Where can I read about that court case?

eru · on Sept 10, 2023

https://en.wikipedia.org/wiki/Microsoft_Corp._v._Zamos

matheusmoreira · on Sept 10, 2023

That's seriously awesome.

zulban · on Sept 7, 2023

Leaking sensitive data and infringement are separate (tho related) concerns. They may not want to do what you say, even though it's totally infringement safe.

hnlmorg · on Sept 7, 2023

Are they separate? Or is it the same concern but from opposite view points?

Both worried about IP leaking but one side is worried about their IP leaking and the other worried about liability if they inadvertently implement any leaked IP. Either way, the concern is leaked IP.

Gigachad · on Sept 8, 2023

Yes, if I ask something like "Can you describe microsoft's internal security processes and the names of upcoming products" the output would be original and not covered by copyright, but it would be sensitive internal information and covered by NDAs. But any code publicly posted and available to be scraped will not have such sensitive info in it.

hnlmorg · on Sept 8, 2023

I don’t think GitHub Co-pilot can respond to prompts like that. I thought it was ostensibly sophisticated source code completion. If so, source code is absolutely covered under copyright.

hunter2_ · on Sept 9, 2023

Works generated by AI are not copyrightable. If you take a generated work and substantially build upon it, then it's likely copyrightable.

At least that's the case for art, and I think the same logic should apply to art and code.

hnlmorg · on Sept 9, 2023

That hasn’t been tested in court.

But even if that were true, it’s a moot point because we are talking about the copyrighted content that the models were trained on. Hence the point the OP made that if Microsoft really wanted to reassure people then they’d promote models that were trained on Microsoft’s own code rather than handwave away these concerns with gestures of assuming theoretical liability.

hunter2_ · on Sept 9, 2023

Ah, ok. As for testing in court, that will be useful, but a rather official source says "created by a human author" [0] in defining the notion of copyright, which I assume is paraphrasing actual law, which I assume a judge would interpret similarly. However, I will concede that it's conceivable that if a human authors a work that then itself authors another work, the second work could potentially be attributed to the human for purposes of copyright eligibility.

[0] https://www.copyright.gov/what-is-copyright/

eru · on Sept 8, 2023

> But any code publicly posted and available to be scraped will not have such sensitive info in it.

Well, at least you'd hope so.

chongli · on Sept 7, 2023

even though it's totally infringement safe

This hasn't been tested in court.

ryukoposting · on Sept 7, 2023

The last thing the world needs is more code written in the style of Win32 API.

samch · on Sept 7, 2023

I believe you’re referring to GitHub Copilot which is a distinct offering in their portfolio (still Microsoft). GitHub Copilot was based on GPT-3 with fine tuning from public code repositories. That is the controversial aspect of it, I believe.

This blog post refers to the broader ecosystem of Microsoft Copilot solutions. Most of those tools rely on the Azure OpenAI API service on the backend and are not specifically tailored for code generation.

zare_st · on Sept 8, 2023

Windows API and the entirety of its client code aren't a good source of standard C programming. On the source level you have additional types and qualifiers/annotations that only MSVC understands.

LLM copilot doesn't really understand the context of the project, it just goes for similar text.

So if you train on big projects you're picking up their patterns only. When a copilot user asks for a string concatenation 'tip' you want LLM to output a general answer, not something tied to a specific project. Big project is likely to use abstraction over strings, where base library usage is shrunk down to few lines of code as opposed to abstraction. In this case you'd want LLM to source a few "simpler" projects that use base library strings abundantly, so it can have decent amount of text for the most likely correct match over user's input.

I do believe Microsoft has all the code available for good training, it's not only about Azure, Windows and Office, there is tons more and it's open source already.

monocasa · on Sept 8, 2023

There's illicit copies of Windows source just up on github. I wonder if we're already in the place where copilot will spit those out if you poke it the right way (but I don't feel like spending $10 to find out).

gareth_untether · on Sept 7, 2023

It would be an ugly beast. But I agree with you that there is a fair approach.

Eliah_Lakhin · on Sept 7, 2023

Interestingly, would the Copilot become better after such training...

contravariant · on Sept 7, 2023

Negative examples should aid training, right?

onemoresoop · on Sept 7, 2023

Probably not

nadermx · on Sept 7, 2023

Is there any evidence that it isn't also trained on parts of msfts code base?

j-bos · on Sept 7, 2023

Is there any evidence it is?

londons_explore · on Sept 7, 2023

If it is, it should be fairly easy to see.

We can already take a guess what many internal functions look like from the published symbol tables of every function across all major microsoft products. Simply ask copilot to write those functions and see if the code comes out better than a similar set of made up yet plausible function names.

_flux · on Sept 8, 2023

Wouldn't you then end up with code suggestions based on the style guide of a single company and limited set of languages?

It probably would not be a very desirable product in the end.

TiredOfLife · on Sept 7, 2023

Even Microsoft knows that their own code is absolute garbage that would bring the quality of copilot way down.

satvikpendem · on Sept 7, 2023

It's likely that generative AI in general will be deemed fair use, due to its (generally) transformative nature. Sure, if you really coax it, you can get code or images out that look similar to existing ones, but the courts might see that generally speaking, it produces new content that has not been seen before, especially in the case of images.

Google Books literally copied and pasted books to add to their online database and that was deemed fair use, so something much more transformative like generative AI will likely fall under much broader consideration for fair use. Google Books was, yes, non-commercial, but the courts generally have the provision that the more transformative something is, the less it needs to adhere to the guidelines laid out for determining such fair use.

https://ogc.harvard.edu/pages/copyright-and-fair-use

dkjaudyeqooe · on Sept 7, 2023

> It's likely that generative AI in general will be deemed fair use

Everybody seems to be saying this, but I really don't think there's even 50% chance of it happening.

Google books was fair use because it was a public benefit and did not take away from publishers or authors, to the contrary it helped people find their works.

Compare generative AI which extracts the essence of people's works and recreates similar works (in terms of style, etc) while cutting out the original authors completely. This potentially denies them the fruits of their labor. It's notable that it's a purely mechanical process and no human creativity is involved, except that which is extracted from other authors. Mere prompts don't count.

The argument you're suggesting will hold is essentially "yes we're using copyrighted works, but we're doing it at scale and blending it, so that's ok".

spott · on Sept 7, 2023

>extracts the essence of people's works and recreates similar works (in terms of style, etc) while cutting out the original authors completely.

Only if you ask it to. At which point the person asking is at the very least culpable as well of violating someone's IP.

It is also illegal for me to pay someone to write Micky Mouse fan fiction (though if I don't publish it, this gets more murky).

> The argument you're suggesting will hold is essentially "yes we're using copyrighted works, but we're doing it at scale and blending it, so that's ok".

I want to flip this on its head: the argument you are suggesting is essentially "LLMs should be illegal because they can be asked to break copyright at scale!" It isn't illegal to be an author for hire, even though someone could potentially ask you to write fan fiction for their personal collection in the style of Tolkien, but because an LLM can do it at scale, it is illegal?

belorn · on Sept 7, 2023

Both Napster and the Pirate Bay founders argued that only users could be held responsible, since it was the user who requested the infringing files. It did not stop the courts.

Anyone could use those tools to download creative common files and linux ISO, but those arguments did not succeed in the legal system. Bittorent as a technology was however not made illegal, as could be seen in games using it to distribute patches.

spott · on Sept 7, 2023

Napster and the Pirate Bay struggled because the vast majority of content was pirated. You would be hard pressed to say a significant minority of generative ai has any copyright issues, much less copyright issues as blatant as straight piracy.

Retric · on Sept 8, 2023

I’m not convinced any of the output of these generative AI is free from copyright issues. Consider, a ROT13 copy of a book may at first glance look nothing like the original, but distributing digital copies would be clear copyright infringement.

Feature extraction is literally a form of lossy compression. You can prod DALEE to make obvious copies of some of the works it was trained on, but even seemingly novel images could contain enough similarities to training material to be problematic.

spott · on Sept 8, 2023

I think this is basically the same as the sentiment that there is no such thing as a truly novel idea.

The standard isn’t “I think that looks like an Andy Warhol picture”, it is “That is substantially similar to a specific Andy Warhol piece”. Copyright doesn’t protect style.

> Feature extraction is literally a form of lossy compression.

This is one way think of neural nets, another is that they find the topological space of pictures.

But these are just models of computation, which aren’t especially relevant in the same way that it isn’t relevant what produces an infringing image, just that it is produced.

Which brings me back to my original point: there are a two different barriers for generative ai: is the model itself transformative, and is the primary purpose of the model to generate copyright infringing material.

With respect to the first… I have no idea how someone could argue that the model itself isn’t transformative enough. It isn’t “substantially similar” to any of the works that it is trained on. It might be able to generate things that are “substantially similar”, but the model itself isn’t… it’s just a bunch of numbers and code.

Regarding the second: I have less experience with image models, but I use chatgpt regularly without even trying to violate copyright, and I don’t think I’m alone, so I doubt you could make an argument that llms have a primary purpose of committing copyright infringement.

Retric · on Sept 8, 2023

> it’s just a bunch of numbers and code

That really doesn’t fly legally because any digital format is ‘just’ numbers.

spott · on Sept 8, 2023

Yes, you are correct, I was being flippant.

But I think the greater point still stands. In order to call the model itself a copyright violation you would have to say it is "substantially similar" to thousands of original works. Then in order to make it illegal you would have to say it wasn't "transformative enough" to be considered fair use.

I can't come up with an argument for either one of those points that holds any water at all.

zarzavat · on Sept 8, 2023

Copyright is not cooties. For something to be infringing it has to be beyond the de minimis threshold. It’s not enough to show that a copyrighted work influenced another work, there needs to be some substantial level of copying.

This music industry has been going through exactly this for the last few years and the courts have recognized that the creative process necessarily involves copying and that a small amount of copying is not infringement.

Retric · on Sept 8, 2023

The de minimis threshold is shockingly low as seen in various successful lawsuits.

Critically it’s not just a question of what percentage of a work is a copy of the original but what of the original work was copied. IE copying 3 lines in a book is a tiny fraction of the book but if you coped half the poem it’s well past the de minimis threshold.

Similarly only a small percentage of a giant library of MP3’s comes from any one work, but that’s not relevant.

zarzavat · on Sept 9, 2023

Exactly, but the kinds of things that Copilot is taking are necessarily very generic. It’s not going to be taking the “special sauce” from an open source project, because that is very unlikely to be the most probable continuation of any prompt that would occur in normal usage.

Copilot is taking things like “reverse a string” or “escape HTML tags”, that have very little originality to start with. This kind of common language is analogous to the musical motifs that have been also found to be under the threshold.

nickpsecurity · on Sept 8, 2023

The foundational models most coding models are built on May have comments and code in them. They’re almost certainly built on a number of legal violations, including copyright infringement. I have details in “Proving Wrongdoing” section here:

https://www.heswithjesus.com/tech/exploringai/index.html

I’ve also seen GPT spit out proprietary content word for word that’s not licensed for commercial use that I’m aware of. They probably got it from web crawling without checking licenses.

What I want more than anything in this space right now are two models: all public domain books (eg Gutenberg); permissive code in at least Python, JavaScript, HTML/CSS, C, C++, and ASM’s. One layered on the other but released individually. We can keep using that to generate everything from synthetic data to code to revenue-producing deliverables. All with nearly zero, legal risk.

spott · on Sept 8, 2023

The problem with this line of thinking is that a person can also cut and paste code that they don’t have a license to use… but until they do, they haven’t done anything wrong by reading the code.

So either we carve out an explicit exception that machines aren’t allowed to do things that are remarkably similar to what humans do… which would be a massive setback for AI in the US.

Or we agree that generative models are subject to the same rules that humans are — they can’t commit copyright infringement, but are able to appropriately consume copyrighted material that a human would be able to consume.

The second option seems to me to be much simpler, nicer, and more appropriate than the first.

nickpsecurity · on Sept 8, 2023

I agree with you. In fact, the latter is in my Alternative Models section.

Spivak · on Sept 7, 2023

Yeah they seem to be applying legal theory like programming code. If what they said about Napster applied to every website then literally every site with user generated content would be instantly wiped off the internet.

dkjaudyeqooe · on Sept 8, 2023

> You would be hard pressed to say a significant minority of generative ai has any copyright issues, much less copyright issues as blatant as straight piracy.

Where generative AI ingests copyrighted works in order to work and bases its output on it, then it is copyright infringement, equivalent to 'straight piracy' of all that it ingested, unless it's deemed fair use.

What Google does with its search engine, for example, is fair use, what Napster did was not.

xp84 · on Sept 8, 2023

I suspect that your extreme position must come with a sincere belief that nothing even close to human intelligence will ever be achieved. Imagine a robot with an AI brain. It would have to be blindfolded because just by learning to recognize, or even just viewing, the label on a can of Coke it would have “copied” it and become “illegal,” especially if it was capable of sketching it on demand. Any kind of intelligence cannot even simply view or listen to the world without encountering something IP-encumbered.

Learning, by human or machine, means extracting a copy of the essence of something and yes, storing that essence in a lossy way. It seems like learning from copyright-encumbered material ought to either be illegal for both, or legal. I know which world I would rather live in.

jojo100 · on Sept 8, 2023

>Where generative AI ingests copyrighted works in order to work and bases its output on it, then it is copyright infringement

This is an absurd standard. Is it copyright infringement when a human "ingests" copyrighted work and bases their output on it? Because that's commonly called inspiration and is how every artist creates their work - through experiencing other works and using that cumulative inspiration to form their own product.

Copyright infringement is already ridiculously restrictive as it is, this proposal not only fundamentally misunderstands how generative AI works but penalises AI for doing what humans do everyday.

spott · on Sept 8, 2023

Yes, but I believe there are two questions here:

* Does the model itself violate copyright? * Does the output of the model violate copyright?

I don't know how you could make an argument that the ingestion of information into a model through a training procedure in order to create something that can generate truly unique outputs isn't transformative of the original works. The legal standard for a new work to be considered a copyright violation of an original work is "substantial similarity". I don't know how you can make an argument that a generative model is "substantially similar" to thousands of original works...

Honestly, I'm not even sure if "fair use" comes into play for the model itself. In order for fair use to come into play, the model has to be deemed to be violating some copyright. Only once it is found to be violating does "fair use" come into play in order to figure out if it is illegal or not.

The second question is the one where fair use is likely to come into play more. And this question has to be asked of each output. The model's legality only becomes an issue here if, like Napster, you can't argue that the model has much point other than violating copyright. Napster didn't violate copyright (the code for Napster wasn't infringing on anything), but it enabled the violation of copyright and didn't have much point other than that.

I don't think you can make that argument though. I use ChatGPT most days, and I've never gotten copyrighted material out of it. I could ask it to write me some Disney fan fiction, which would violate a copyright. And I think there is a valid legal question here about who is responsible for preventing me from doing that. This is where I think the gray area is.

satvikpendem · on Sept 7, 2023

Those tools overwhelmingly supported pirate means, nearly no one was downloading actually legal public domain songs or Linux ISOs from there. I contrast with LLMs and generative AI, people are using them for actual work, not for piracy, which will be seen differently by the courts.

regularfry · on Sept 7, 2023

> Only if you ask it to.

This isn't necessarily true. It's entirely possible for a model to regurgitate a chunk of GPL'd code without you knowing that's what it's done.

spott · on Sept 7, 2023

True, though I’m not sure this risk isn’t overblown. I’ve heard of a couple cases where someone got a copyright statement spit out, but I haven’t been able to find much more than the one or two that I’ve seen on hn. If you have more examples, I’d love to hear about them.

Code is also tricky: there are a finite number of ways to write an algorithm, and I’m sure both that multiple people have written the same version of left pad for example, and that it is not possible to copyright something small like that. When the code gets bigger, the likelihood of an llm spitting out large chunks of GPL’d code seems vanishingly small (without asking for something specific like that). Though I’d love to see examples to the contrary.

regularfry · on Sept 8, 2023

https://twitter.com/DocSparse/status/1581461734665367554 is the one I was thinking of. It's not just the copyright header in that case.

spott · on Sept 8, 2023

Interesting, though I think I would put those examples under a “will emit copyrighted code if prompted to”. I can’t think of a compelling reason someone would prompt for a sparse matrix function with a “cs_” unless they were searching for copyrighted code they knew existed.

Though that is definitely a simpler prompt than I would have expected was necessary to get such a result. Thanks!

(The first example also isn’t the same code. It is very close, and definitely similar in style, but it isn’t clear that code would a)run, or b) would work as expected. I need to sleep though, so I’m not sure how much that matters.)

semiquaver · on Sept 8, 2023

> yes we're using copyrighted works, but

There’s no law against “using” copyrighted works, there is a law against copying and distributing them.

Fair use analysis doesn’t come into play unless we’re dealing with clearly established copyright infringement. What LLMs do doesn’t clearly qualify as any of the behaviors reserved to copyright owners. For example, it certainly doesn’t “copy” the things it’s trained on by any legal definition.

Law works on precedent and analogy when there’s no clearly on-point statutes or case law. The most analogous situation to what transformer models do is a person learning from experience and creating their own work _influenced_ by what they’ve observed. That behavior is not copyright infringement by any stretch of the imagination. The fact that it’s done with a computer is not as important as people seem to think it is.

diffeomorphism · on Sept 8, 2023

> For example, it certainly doesn’t “copy” the things it’s trained on by any legal definition.

What about pictures still containing watermarks? Regardless of the actual legality, this does not fit "certainly".

> The most analogous situation to what transformer models do is a person learning from experience and creating their own work _influenced_ by what they’ve observed

No, it is not. It is called machine "learning" so clearly that is a fly made out of butter. Maybe courts will agree, maybe they won't, but the analogy to human learning is quite strenuous at best.

semiquaver · on Sept 8, 2023

Here is the section of title 17 that defines the rights of copyright holders and what terms like “copy” mean in US law. It’s clear as mud but I feel it’s likely that the process of training neural network weights is not going to be held as equivalent to verbatim digital copies. It’s just not the same thing and the law has no clear provision for it, except by analogy to existing human creative processes.

https://www.copyright.gov/title17/92chap1.html#106A

The most closely applicable existing law is that of “derivative works” but those require human authorship, so it’s far from clear that those would apply to AI output either. Ultimately this is going to be hashed out in the courts until some actual laws are written to deal with it.

(IANAL)

heavyset_go · on Sept 8, 2023

> It’s clear as mud but I feel it’s likely that the process of training neural network weights is not going to be held as equivalent to verbatim digital copies.

It's taking verbatim digital copies and using a form of lossy compression to transform them, which I think is clear when looking at things like auto-encoders.

semiquaver · on Sept 8, 2023

Isn’t your brain doing the same thing when it reads text or views a painting? Some people can even memorize and precisely recreate the things they’ve seen. But no one considers the process of lossy storage in human memory to be copyright infringement. Instead the later reproduction itself might be infringing. I think it will be the same here. Training models on copyrighted content won’t fall afoul of any existing law, instead legal challenges will have to be aimed at specific instances where the models produce output that arguably infringes copyright.

That’s inconvenient for opponents of this technology because they would prefer to ban the training itself, but there’s not a good justification under existing law to do this.

r00fus · on Sept 7, 2023

Ultimately I find that commoditization enables the purest form of the banality of evil.

Commoditized goods allows the bad to be sorted in with the good, allowing a price to be put on the commodity. Great where it's applicable but horrendous when it's improperly done - ie, home loans, or intellectual property.

If your commodity markets aren't properly regulated you get a race to the bottom. If you are trying to commoditize something that shouldn't be, it's effectively enables white-collar looting or money laundering.

satvikpendem · on Sept 7, 2023

First, style is not copyrightable. I could draw something in a Studio Ghibli style and they could do nothing about it, legally speaking.

Second, the way we've seen generative AI be used is not really the same as it was touted originally, that a mere prompt could replace an entire artist's work. A year later, we see that most people, artists included, don't use it as a verbatim text to image machine, they use it as a tool. See apps like ComfyUI or others which allow Node based or layer based image creation and editing, which even Photoshop now has. It's the same as Copilot and ChatGPT, it's not replacing any programmers, just increasing their productivity Given that, it is not looking like generative AI is hurting one's professions, quite the opposite.

akhosravian · on Sept 7, 2023

While there are no IP protections for “style” there are certainly elements that are covered. Particular colors be trademarked, characters can be copyrighted separately from the works they appear in, design patents are a thing that cover more than most folks realize.

I don’t think any but the most copyleft segment of society thinks it would be reasonable for a generative AI trained on exactly one persons work to be used for profit by someone else.

An AI trained on two or ten people’s work probably feels the same for most folks, but what about when it’s thousands or millions? What if instead of one persons work it is the works held in copyright by an entity like Getty Images?

regularfry · on Sept 7, 2023

> I don’t think any but the most copyleft segment of society thinks it would be reasonable for a generative AI trained on exactly one persons work to be used for profit by someone else.

Why do you think that? It doesn't seem obvious to me at all.

versteegen · on Sept 7, 2023

Well imagine you created a novel generator that just cut-and-paste whole sentences from the Harry Potter books to create a new Harry Potter, and posted it online. Now, as long as this was done as a just a bit of fun, non-commercial, I feel like that likely should be allowed as fair use (whether it actually would is not the point), though borderline. If you tried to sell it, definitely too far.

So I think "used for profit" is quite key.

But another example is someone writing and selling a reference guide to Tolkien's mythos that catalogues the content of his novels. And we would say that should be allowed, though that could be taken too far as well, for example it could duplicate the material in the appendices to LoTR.

gmerc · on Sept 8, 2023

You can make trademarked colors with microsoft paint…

yencabulator · on Sept 8, 2023

> First, style is not copyrightable. I could draw something in a Studio Ghibli style and they could do nothing about it, legally speaking.

Meanwhile, drawing Mickey ears on the wall of a kindergarten is not safe.

If you feel strongly that generational ML somehow launders copyright out of the bits, train an image generator purely on Disney copyrighted material and share the model on the web, see how well that works out.

livrem · on Sept 8, 2023

Training purely on Disney's images would probably be difficult, considering the huge amount of images you need to train a model from scratch. But here is a "fine-tuned Stable Diffusion model trained on screenshots from a popular animation studio". Seems to have worked out quite well so far as that model was last updated last year.

https://civitai.com/models/24/modern-disney

yencabulator · on Sept 8, 2023

"Modern Disney" is neat, I wasn't aware of that. Just so you know, The Mouse is much more fierce about protecting their classic properties like Mickey Mouse or Donald Duck, so this doesn't quite demonstrate my point yet.

gmerc · on Sept 8, 2023

Mickey (who will enter the public domain soon) is a specific expression and both copy righted and trademarked.

dkjaudyeqooe · on Sept 8, 2023

> First, style is not copyrightable

Wasn't suggesting it is. The point is that the tool is used to create things that substitute for the original authors' work by ingesting the works of those authors. The impact of the copying matters when weighing fair use.

If I use your copyrighted works to supplant you in some way, even as a part of a large group, then it's unlikely to be deemed fair use.

iraqmtpizza · on Sept 7, 2023

Ghibli style is probably trademarked. Different thing. Outline width, color palette, ambient noises, musical style, how the eyes and hands are drawn, when used together, would be possible to trademark I would think.

dmix · on Sept 7, 2023

> Google books was fair use because it was a public benefit

What are the odds the market leaders in LLM right now are just the current day version of Borland-style compilers before open source takes it over?

I've heard arguments the infrastructure part is a long term barrier to entry for OSS development, which will continue to remain in the future. But I don't know enough about it.

Who knows maybe the legal/gov world will move slow enough to miss the bulk of the money-extraction opportunities before OSS takes over and the reality of this problem never going away fully kicks in.

heavyset_go · on Sept 8, 2023

You'd need millions of dollars to just compile and label datasets. The training itself requires a lot of resources and money, as does human reinforcement.

Open source models would need benefactors with deep pockets.

az226 · on Sept 8, 2023

Copilot makes open source developers and contributors that much more productive which is a public good.

graeme · on Sept 8, 2023

Indeed. Further to this, training on data involves copying it. To do so without permission robs authors of the right to contract their work for this training, either to OpenAi or any other third party.

livrem · on Sept 8, 2023

Every kind of web crawler has to copy data. If that part of the AI training is illegal for that reason then every web crawler ever is suddenly declared automatically illegal.

graeme · on Sept 9, 2023

Web crawlers generally allow sites to remove them from the index.

Are there any crawlers used for commercial purposes which refuse to remove sites from an index if they ask? The distinction from OpenAI is that there is no way to be removed from openai's training set.

You can remove yourself from the crawler not but not from what they previously crawled.

livrem · on Sept 9, 2023

BIf a copy of the downloaded file is redistributed or used in possibly other ways that infringe on copyrights, THAT I could understand, but suddenly making the act of just downloading the file (assuming it is made legally available to the public). But if the downloaded file is analyzed by some software and then thrown away, I don't see how that is infringing copyrights, more than say downloading an image to decompress it and scale it to display on a screen (then throw it away once it is no longer needed).

CatWChainsaw · on Sept 9, 2023

So many downthread comments pulling out the computers and brains are exactly the same meeeerrrrrr BS.

"I'll keep saying it every time this comes up. I LOVE being told by techbros that a human painstaking studying one thing at a time, and not memorizing verbatin but rather taking away the core concept, is exactly the same type of "learning" that a model does when it takes in millions of things at once and can spit out copyrighted code verbatim."

(I also love it when they're deliberately obtuse about it too. The past decade has made me sick of this trolling tactic.)

dang · on Sept 9, 2023

Could you please stop posting unsubstantive comments and/or posting in the flamewar style? It's not what this site is for, and destroys what it is for.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.

p.s. Also, please don't copy/paste comments on HN.

tick_tock_tick · on Sept 7, 2023

> Everybody seems to be saying this, but I really don't think there's even 50% chance of it happening.

That's true it's probably 99% plus it happening or at-least that's the conclusion that the experts and lawyers hired to help evaluate AI startup valuations are coming too. Hired by banks, venture funds, short selling shops, etc plenty of people who don't depending on it being ok to make money.

> "yes we're using copyrighted works, but we're doing it at scale and blending it, so that's ok"

I mean you know collages are legal right? You literally take 100s of copyrighted pictures and put them together and suddenly it's perfectly legal and ok.

heisenzombie · on Sept 7, 2023

Art made from a collage of, say, magazine photos does not supplant or substitute for the magazine photos which is why it's much more likely to be deemed fair-use. Despite the collage using perhaps large portions of the copyrighted photos, it is nonetheless transformative in the sense that no-one is deciding to buy the collage art instead of the magazine photo.

Contrast LLM-created code which is certainly a substitute for the original copyrighted work.

mbreese · on Sept 7, 2023

> you know collages are legal right

Only if it’s sufficiently transformative. There was recently a case that hit the US Supreme Court about this subject regarding an Andy Warhol adaptation of a portrait of Prince [1]. So, in the US, fair use in this regard requires some amount of substantive transformation of the material. But, as we are talking about AI algorithms, there isn’t a person in between the model and the training data. The argument here is whether or not a person is required to make a transformative use of the material (and thus fair use applies). Given that AI generated (and non-human animal generated) works aren’t copyrightable due to the lack of human involvement, I’d wager that any AI use of copyrighted material won’t get fair use protections.

[1] https://www.eff.org/deeplinks/2023/05/what-supreme-courts-de...

jprete · on Sept 7, 2023

You really think AI startups are valued based on the opinions of lawyers and experts? They’re valued based on whether the investors think they can find a bigger fool to hold the bag.

digging · on Sept 7, 2023

> I mean you know collages are legal right? You literally take 100s of copyrighted pictures and put them together and suddenly it's perfectly legal and ok.

Really? When has this been done?

tick_tock_tick · on Sept 7, 2023

You've never seen a collage?

jazzyjackson · on Sept 7, 2023

I've never seen one go to court

dokein · on Sept 7, 2023

I won't debate that no 'human' creativity is involved, but human brains are a purely mechanical process, and that's where human creativity originates (unless one invokes the supernatural).

LLMs are typically implemented in a way that makes them non-deterministic (i.e. temperature > 0).

jcranmer · on Sept 7, 2023

> It's likely that generative AI in general will be deemed fair use, due to its (generally) transformative nature.

Have you read the recent SCOTUS decision in Warhol v Goldsmith? Because that's a pretty major redefinition of transformative for the purposes of fair use, and not in a good way for arguing that generative AI is fair use, especially because it ties transformative to the market impact. That generative AI is generally creating outputs that are directly competing with inputs (particularly in the case of generating images, where it's clearly competing with stock images) would make it dramatically less likely that a court would find that it is in fact transformative.

EMIRELADERO · on Sept 7, 2023

From what I understand, the "market impact" test is about the value of the specific work for which the copyright has been infringed. If, 99% of the times, the generative AI systems do not output anything that a court/jury would deem a derivative work of the original, I don't see how the "effect on the market" prong can be won by the copyright holder.

hn_throwaway_99 · on Sept 7, 2023

I think the Warhol decision is an entirely different kettle of fish. Just take a look at the pieces in question: the Warhol portraits don't really look that different compared to the original photographs.

The benefit that generative AI has is that, when claiming copyright infringement, you need to specify individual works that were infringed. It's not enough to say "this work is an amalgam of these other ten thousand works, and we can't really tell you how."

I could imagine if generative AI gives an identical, word-for-word match for an individual piece of source material it could be in trouble, but that's also the easiest type of thing to prevent from an AI company perspective.

The fact is that existing copyright law just can't really encompass the kinds of societal concerns we have around generative AI.

dkjaudyeqooe · on Sept 7, 2023

No one has to claim individual copyright infringement for it to be copyright infringement.

At any rate you can force the infringer to disclose what works they use as input.

Copyright law doesn't encompass novel uses, but courts can and will deal with it.

hn_throwaway_99 · on Sept 7, 2023

> No one has to claim copyright infringement for it to be copyright infringement.

That's a little bit like "If a tree falls in the forest but nobody hears it..."

I mean, sure, "theoretically" any number of things can be infringement. But it's obviously a gray area, so it only really matters when somebody brings a suit and a work is found to be legally infringing.

belorn · on Sept 7, 2023

The pirate bay case demonstrated that you don't need to prove a specific instance of infringement, only that occurrence of infringement "somewhere/somehow" was more believable than the alternative theory that no such infringement has happened. It may be enough to demonstrate that infringement is trivial, and then point to user statistics to demonstrate that infringement is more believable than that infringement has never happened.

bdowling · on Sept 7, 2023

Until someone files a lawsuit and a judge and jury decide the infringement question, it’s all speculative.

Some cases are pretty obvious, but even literal copying isn’t always copyright infringement (e.g., if the material is arguably not eligible for copyright protection).

CharlesW · on Sept 7, 2023

> It's likely that generative AI in general will be deemed fair use, due to its (generally) transformative nature.

This isn't how "fair use" works, in the sense that there can never be a blanket assurance like that. Also, whether the result is "transformative" is just one of many factors (see audio sampling/remixing).

twoodfin · on Sept 7, 2023

Also “transformative” doesn’t have its everyday meaning in this context:

“The Godfather” film is absolutely a transformative interpretation of Mario Puzo’s book and a fully distinct, valuable work of original art. Paramount still needed to pay Puzo for the right to base it on his words.

CamperBob2 · on Sept 7, 2023

It's not how fair use works now, but many things about copyright law will have to change radically over the next few years. There's too much at stake.

JoshTriplett · on Sept 7, 2023

As long as the change is to reduce copyright restrictions on everyone, rather than just giving AI a pass to copy and launder the work of others with impunity.

jojo100 · on Sept 8, 2023

This is what people on the "this is copyright infringement" side don't understand. Even if it somehow is copyright infringement by the standards of today's law, those laws will inevitably change in the near future. Generative AI is far, far too lucrative and convenient to society for it to be crippled by obsolete copyright infringement laws formed in a time where generative AI was a thing of science fiction novels.

yladiz · on Sept 7, 2023

What is at stake?

jaysinn_420 · on Sept 7, 2023

Billions of dollars in investments that will partially benefit lawmakers?

kube-system · on Sept 7, 2023

Well, content creators and publishers have trillions.

jojo100 · on Sept 8, 2023

Content creators and publishers are also the primary ones who are using generative AI. They aren't a united monolith against generative AI.

saurik · on Sept 7, 2023

Google Books is transformative in its use and for what it is, sure; and yet, if you do a query on Google Books and try to take the output and paste it into your book, that might not be fair use (and I only say "might not" instead of "would not" as maybe you are writing a research paper and wanted a quote from a book or whatever, but of course that's just a silly corner case someone would try to call me out for on an Internet forum).

Just because Copilot might be itself a transformative work which is itself allowed to exist, that doesn't at all necessitate a conclusion that the developers who are using it are going to or should somehow be guaranteed not be committing their own copyright sins if they try to incorporate its output into their own works (any more so than one can or should assume all of the outputs of another human being are free of copyright entanglements, even though no one is as-yet claiming a human being is themselves infringement just because they saw another work).

sillysaurusx · on Sept 7, 2023

You're getting a lot of pushback, but the EU seems to agree with you: https://creativecommons.org/wp-content/uploads/2021/12/CC-St...

https://www.notion.so/DSM-Directive-Implementation-Tracker-3...

https://eur-lex.europa.eu/eli/dir/2019/790/oj

The TDM4 copyright exception allows datasets to be created consisting of copyrighted works, as long as there is a mechanism for rightsholders to opt out. This seems like the best of both worlds: the dataset is transparent, rightsholders can assert their rights, and certain AI companies can train on copyrighted material.

Of course, this doesn't grant commercial rights for the trained model, only scientific and academic research rights. (I.e. it's fine for Meta to train and release a LLaMA model trained on books, as long as they're not commercially profiting from it, and there's a mechanism for authors to opt out.)

I'm talking with Jordan from https://spawning.ai to try to build some kind of opt out system that makes sense for books. One could imagine doing this for music too.

This is a European law, but unlike other overreaching EU regulations, this one seems like an extremely sensible compromise.

EDIT: Oh, Jordan emailed me a correction:

> Looking at your hackernews comment, my understanding is the right to opt out only comes for commercial research. So making a dataset for eleuther (or whomever you compiled it for originally) probably doesn't even require opt outs. It'd be if openai used it for gpt-5 and charged for it that it would be required.

Wow. So this law actually applies to commercial uses of ML, and non-commercial uses such as LLaMA wouldn't even require an opt-out.

That's wonderful. This gives researchers legal cover, and requires commercial uses to be transparent in their datasets.

bsder · on Sept 7, 2023

> as long as there is a mechanism for rightsholders to opt out.

I really don't like this--opt-out never works because the scale advantages are backwards. It places the burden in the wrong place. The aggregators should have to get opt-in.

Look at YouTube. Because of "opt-out", lots of people monetize content that they have no right to and it's up to the original author to have to fight the scale of a zillion uploaders. Only the biggest entities can do that.

YouTube (and everybody else) should have to assert "You, the uploader, own this content" when they ingest it. Nothing else works.

I wouldn't mind an exemption for research use, though.

jojo100 · on Sept 8, 2023

>I really don't like this--opt-out never works because the scale advantages are backwards. It places the burden in the wrong place. The aggregators should have to get opt-in.

>Look at YouTube. Because of "opt-out", lots of people monetize content that they have no right to and it's up to the original author to have to fight the scale of a zillion uploaders. Only the biggest entities can do that.

>YouTube (and everybody else) should have to assert "You, the uploader, own this content" when they ingest it. Nothing else works.

Thankfully, in a rare turn of fate, capital will be on the side of the laissez-faire instead of the stringent anti-copyright-infringers for once. You do not own the rights to material created by a generative AI.

sillysaurusx · on Sept 7, 2023

One subtle benefit of the opt out is that it forces commercial ML companies to reveal what they trained on. So Copilot would need to reveal a list of repositories, in order to give the repo authors a chance to opt out.

This is a fairly big deal since right now there’s no incentive for AI companies to disclose their training data, and it seems unlikely that legislation to that effect will be enacted anytime soon. Whereas this opt out mechanism is already getting widespread adoption in the EU.

lewhoo · on Sept 7, 2023

> Sure, if you really coax it, you can get code or images out that look similar to existing one

I'd say it is possible to produce exact data as well. Try "Provide quote from King James' Bible Genesis :1-25" with chatgpt. You'll get a verbatim text. You can get the same with things like Moby Dick, but when I typed "Provide the first five sentences of the book A Game Of Thrones" I got:

Certainly! Here are the first five sentences from the book "A Game of Thrones" by George R.R. Martin:

"We should start back," Gared

This content may violate our content policy or terms of use. If you believe this to be in error, please submit your feedback — your input will aid our research in this area.

The model is clearly capable of reproducing verbatim data I think.

smoldesu · on Sept 7, 2023

That's part of what made the Google Books ruling so shocking; it considered Google's transformation of "we digitized and indexed these books" to be transformative. If you punch the ASOIAF quote into it, Books will reproduce the text of Game of Thrones that had your query: https://www.google.com/search?tbm=bks&q=%22We+should+start+b...

It's still surreal that this is considered Fair Use, and even defended relatively recently (2013). It's hard to say where the ruling will land ultimately, but there seems to be an argument that verbatim reproduction doesn't matter.

satvikpendem · on Sept 7, 2023

It's likely defended due to being non-commericial and for the public good, as I posted with my link to the Harvard page above. That was for literal copying and pasting so the bar for transformativeness is higher, but with generative AI where it can produce wholly new code/images, I think it will also be deemed fair use.

twoodfin · on Sept 7, 2023

Also, extracting for a new purpose is fair use of long standing, distinct from something like sampling for the same purpose of composing music.

Siskel and Ebert didn’t need to pay rights holders to extract from their works for public criticism.

saurik · on Sept 7, 2023

Google Books itself being some kind of fair use transformative work is unrelated to whether you could use the output of a Google Books query as part of a book you yourself have written (and like, clearly you can't).

lewhoo · on Sept 7, 2023

Yeah. I wasn't so much trying to put weight on that you can get a fragment of copyrighted text, like Google Books also provides, but using the Bible as an example my point is you could technically get the whole thing bit by bit. You can't do that with Game of Thrones likely not because of capability but because of guardrails, because for a machine what's the difference if it's fed a copyrighted text or not.

riedel · on Sept 7, 2023

I just want to highlight that this a very US centric view. A user of copilot in the EU might be confronted with a totally different legal regime. (No fair use per se, no copyright transferability, ...). It seems quite a bold move as being an internationally active company if there is no small print...

tremon · on Sept 7, 2023

no copyright transferability

The economic part of copyright is transferable in the EU just as it is in the US, only certain moral rights (such as the right to attribution) are inalienable.

edit to add: it's not just in the EU. According to Wikipedia, the same distinction is made in Brazil, China, India and Indonesia (among others, but those were a few big countries that stood out).

riedel · on Sept 8, 2023

That is true: You are certainly allow to (exclusively) licence your works to others. Actually, I only meant it to be an example of how giving guarantees can become difficult if authorship is not clear.

satvikpendem · on Sept 7, 2023

Yes, I'm talking about American law specifically.

tick_tock_tick · on Sept 7, 2023

Yeah but if it's ok in the USA the EU needs to allow it or they'll fall even farther behind.

CaptainFever · on Sept 8, 2023

It's already allowed in the EU: https://www.reedsmith.com/en/perspectives/ai-in-entertainmen...

croes · on Sept 7, 2023

Didn't Copilot produce an exact copy of code including the comments?

Gigachad · on Sept 8, 2023

Take a look at the prompts people use in these examples. They are always so contrived. Sure if you ask it to "Take this function exactly as it is from this file at this repo and output it without changes" it can do that.

yencabulator · on Sept 8, 2023

Is the contrivedness relevant to the legal question? It shows the model contains the copyrighted content and can reproduce it on demand.

Gigachad · on Sept 8, 2023

My brain contains loads of copyrighted info. And if I exactly reproduce it from memory, it's copyright infringement. But if I come up with my own work, even if using that copyrighted info to learn from, it isn't infringement.

yencabulator · on Sept 8, 2023

I don't understand why people keep comparing humans and computers. The law does not treat machinery equal to a human.

jaimex2 · on Sept 8, 2023

You can't really say that. All this needs to be tested in court and see what definitions of what end up winning and setting precendent.

It can go either way.

Gigachad · on Sept 8, 2023

How would you even know where it came from? If I commit some code, you’d have no idea if I came up with it myself or if AI generated it.

And why would it matter?

Filligree · on Sept 8, 2023

Yes. Courts will generally assign blame to whoever did the thing that caused a breach of the law, which in this case is the user.

In other words, law isn’t a programming language.

croes · on Sept 8, 2023

Do you think the prompt "sparse matrix transpose, cs_" is contrived?

Gigachad · on Sept 8, 2023

Maybe? I have no idea what that prompt even means.

Kiro · on Sept 7, 2023

Only if you push it into a corner, at which point you may just as well go to the repo and copy-paste the code you're trying to reproduce.

littlestymaar · on Sept 7, 2023

> It's likely that generative AI in general will be deemed fair use

Except that “fair use” is mostly an American thing. In many other jurisdictions (especially those with of civil law) there's such a wide principle, and there's only specific laws allowing some explicit kinds of use of copyrighted material that the law allows. In those jurisdiction, most uses of generative AI trained on copyrighted material are, more likely than not, illegal at least until the legislator actually changes the law.

CaptainFever · on Sept 8, 2023

TDM exceptions, which are already in place in a number of jurisdictions: https://www.reedsmith.com/en/perspectives/ai-in-entertainmen...

aidenn0 · on Sept 7, 2023

> It's likely that generative AI in general will be deemed fair use, due to its (generally) transformative nature

Purely mechanical modifications may not be considered transformative, and there's an argument to be made that LLMs are purely mechanical (in fact a US district court recently ruled that AIs cannot be authors of copyrighted works).

amalcon · on Sept 7, 2023

> (in fact a US district court recently ruled that AIs cannot be authors of copyrighted works).

I thought that was because only humans and other legal persons can legally author things, not because of anything subtler about the nature of LLMs. See also the case where the monkey managed to take photos of itself. I'm not a lawyer, though.

dragonwriter · on Sept 7, 2023

> I thought that was because only humans and other legal persons can legally author things, not because of anything subtler about the nature of LLMs.

It very explicitly was and made a point of noting that it was not addressing anything about whether and when a human author could hold a copyright on a work authored using AI.

professoretc · on Sept 7, 2023

Correct, as usual, everyone interprets that case as 'OMG animals/AI created work is uncopyrightable!1!' but in reality it's just that animals/AIs cannot hold copyright. Whether a human using an AI can copyright the resulting work is still up in the air.

nonfamous · on Sept 7, 2023

In the monkey-photo case, didn't the journalist attempt to assert copyright on the photo the monkey took with his camera, but was denied?

nfriedly · on Sept 7, 2023

I think it'll likely be deemed fair use because of how much money Microsoft and others are willing to throw at getting that result.

slashdev · on Sept 7, 2023

Provided you don’t use it to deliberately recreate a substantial part of a copyrighted work. Intent will matter here, and it’s difficult to prove.

Even Microsoft is couching their guarantee here with an exception for this very case.

ryukoposting · on Sept 7, 2023

There are cases where generative AI can be trained for the explicit purpose of ripping off a particular artist's style. Take a gander at all the artist/art style LoRAs for Stable Diffusion. Some of them are harmless (a Rembrandt LoRA for example) but others are trained to make convincing knockoffs of living artists who are trying to put food on the table.

simion314 · on Sept 7, 2023

>It's likely that generative AI in general will be deemed fair use

What if you train it only on my huge repo of GPL code? You are just remixing my code.

Now you maybe think "let me train on 2 different devs GPL code", the remixed code will probably be 50-50 and you can get away with it ?

If the 2 number is too small then tell me what the number N should be ? From how many people you need to "steal" code , mix it and the output is "original" ?

Edit: my opinion is that AI should be fair, if you train it on open source then model should also be open source and output should also be open source.

andybak · on Sept 7, 2023

> What if you train it only one my huge repo of GPL code? You are just remixing my code.

The word "remixing" here is useful because it will fit any conclusion the reader prefers.

Arguably even in your reductive example, the result would be non-infringing. Or not. Which conclusion you reach is exactly the topic under debate. Isn't this textbook question begging?

simion314 · on Sept 7, 2023

But in this case the LLM will predict the next token based on the input data, all the input is mine , Microsoft tweaked some numbers to make the interpolation mroe correct.

Imagine I get the Windows source code and rename the variables by adding a "314" after each varaible, after each function name and rebuild Windows, in your definition this is remixing and fair ?

andybak · on Sept 7, 2023

I haven't said whether it's fair or not but oversimplifying a complex topic isn't going to shed any light.

Where you like it or not this is an undecided area both legally and morally. Pretending it's clear cut is either disingenuous or delusional.

simion314 · on Sept 8, 2023

I think we agree , there is no clear line or clear answer.

My simple example is to show that is not as simple as "the AI earned from N devs GPL code and now it can spit new original code without ZERO concerns", we know how this stuff works and that it can spit out the exact training input in some cases.

So IMO a judge should ask the question "from how many people you need to steal, mix the input to be sure the output is actually original".

And about the thing "if I read someone code it is not stealing" , hyumans are different and even for humans it is not allowed to read the code of your competitor and then write new code using that knowledge.

kube-system · on Sept 7, 2023

Copyrighted code with a GPL license is copyrighted code as far as copyright law is concerned. Copyright is the basis on which GPL is built. The GPL does not apply to anything that is fair use, public domain, or otherwise not copyrighted. If the author does not have copyright to the work in question, they don't have a right to license it.

This is all to say: the question about copyright and fair use remains exactly the same regardless of license.

mistrial9 · on Sept 7, 2023

you are assuming fair and impartial judgement that is then implemented

kube-system · on Sept 7, 2023

I'm not sure what you're referring to. What do you think would be unfair or impartial?

mistrial9 · on Sept 7, 2023

here is what I was thinking -- in business (or war), parties can implement an unreasonable or illegal action in fact, then use time to rebuff others while making their position stronger.. or alternately simply pay gatekeepers or stakeholders while furthering their position, before others can actually stop the action. None of these or other scenarios involve a considered opinion by an impartial party with an effective implementation, in fact the key is to avoid impartial decision while gaining an advantage (and income) in the ground

I do not disagree with what you said, only that in reality, this is not the only way business conflicts are decided.

kube-system · on Sept 7, 2023

Yes, "the question" that I am referring to above is the currently lack of that impartial decision. Courts have yet to issue any ruling that is particularly useful for determining where the line is for the application of LLMs. I am saying that when the line for fair use is further defined in this context, it won't be predicated on which license, if any, the content has.

satvikpendem · on Sept 7, 2023

What if I learned to code based only on your huge repo of GPL code? I'd just be remixing your GPL code at that point, right? Will you brand all of my output as being GPL as well?

simion314 · on Sept 7, 2023

>What if I learned to code based only on your huge repo of GPL code? I'd just be remixing your GPL code at that point, right? Will you brand all of my output as being GPL as well?

This never happens, you will first learn from a book or tutorials.

But your idea is sound, have Microsoft buy books from the authors and train the LLM on those books then have the LLM solve new problems. If is an AI and not a text interpolating tool then it should be able to learn like humans from a few books.

andybak · on Sept 7, 2023

> This never happens, you will first learn from a book or tutorials.

I have learned to code almost entirely from reading code. I hate tutorials and books.

I occasionally skim docs but mainly for the code examples.

So "never happens" is a curious take.

pyuser583 · on Sept 7, 2023

Why is Google books so unusable then? Even documents in the early 20th century are inaccessible.

verve_rat · on Sept 7, 2023

Sure, that might work for some places, but some jurisdictions don't have a legal concept of fair use.

FrustratedMonky · on Sept 7, 2023

"likely"

Big bet on legal costs based on something being "likely".

jacquesm · on Sept 7, 2023

Will you indemnify those that follow your advice?

Because 'transformative' is a pretty dangerous word to use in this context.

542458 · on Sept 7, 2023

> Will you indemnify those that follow your advice?

I strongly feel that this is a terrible metric for comments on the internet.

First, the person you’re replying to has nothing to gain and a lot to lose by saying "yes".

Second, it invites silly corner case nitpicking. Their comment is written in reasonable plain English for other users reading plain English. It’s not a legal contract, and so leaves lots of loopholes. Sure, you could create a likely non-transformative LLM by training it on nothing but the text of Harry Potter with fitness measured by how accurately it exactly reproduces the complete text of Harry Potter, but that’s not what reasonable people are doing with LLMs.

jacquesm · on Sept 7, 2023

It's borderline legal advice and you have to be very careful with predicting how judges will rule on future cases.

In a legal context certain words have immense power. In the context of copyright 'transformative' is one such case. It's a very fine line between 'transformative' and 'derivative' and you don't get to preempt the judiciary about how they will see things.

satvikpendem · on Sept 8, 2023

This is not a legal context though. I am not a lawyer, I don't claim to be a lawyer, and even if I were a lawyer, no one in the internet should be taking my comments as legal advice in the first place. One should not need to disclaim everything they write with such a statement.

otterley · on Sept 8, 2023

As an attorney, I'm of the opinion that otherwise-intelligent people who provide confidently-wrong legal opinions on the Internet should be held accountable for people following their advice. I see incorrect understandings of the law and sloppy legal analysis with dismaying frequency here, even when it comes to settled law like what "fair use" is.

satvikpendem · on Sept 8, 2023

This is a weird stance. Anyone can say anything on the internet, they can be legal opinions or other things. It should not be necessary to disclaim such an opinion because no one should be using the internet as their basis of law (or medicine, etc) instead of a professional in the first place.

otterley · on Sept 9, 2023

> no one should be using the internet as their basis of law (or medicine, etc) instead of a professional in the first place.

Designing systems around what people should do, as opposed to what they actually do, has proven time and again not to work particularly well in practice. I'm sure you've seen countless examples of how people track paths through manicured grass fields. The landscaper will complain about how people should walk and they'll put up signs to no avail.

The fact is, we (including me, BTW) are frequently wrong about a lot of things, and when there's little riding on it, we can ignore that most of the time. With subjects like medicine and law, however, where a mistake can cost you your life or lots of money, we want to make sure people are getting the best advice possible. That's why we require licenses to practice medicine and law, and we have governing and ethics bodies to regulate how professionals operate their practices.

satvikpendem · on Sept 9, 2023

> That's why we require licenses to practice medicine and law, and we have governing and ethics bodies to regulate how professionals operate their practices.

Correct, so people should (and do) go to the people who have these licenses, not random people on the internet. I don't even understand what your solution, or even problem, is. It seems like you're suggesting that everyone, whenever they speak on the internet about anything vaguely related to medicine, law, or hell, even regulated fields like engineering, should disclaim that they are not speaking in such a context. And I saw that that is a ludicrous task that is expected of one to do. So if you have any better solutions, let me know.

otterley · on Sept 9, 2023

One doesn't have to disclaim anything that they had the good sense not to assert in the first place.

satvikpendem · on Sept 9, 2023

That's your opinion on how people should speak, not most people's, so feel free to disclaim when you yourself talk, but don't deem what other people should or should not say.

otterley · on Sept 9, 2023

I’m afraid you didn’t understand what I just said. I was politely trying to say “if you wisely abstain from talking about things you don’t know about, you won’t need to disclaim that you don’t know what you’re talking about.”

satvikpendem · on Sept 10, 2023

Or I can just say whatever I want, as can anyone. You can only contain your own words, if you would like to "wisely abstain," then do so.

satvikpendem · on Sept 7, 2023

No because I don't have that much money, but it looks like Microsoft will. They likely wouldn't if their lawyers did not think there was a reasonable chance that they'd win the lawsuits, likely from, again, generative AI being deemed fair use.