Microsoft will assume liability for legal copyright risks of Copilot (microsoft.com)
540 points by wgx on Sept 7, 2023 | 377 comments



Let Microsoft first publish a Copilot model that's trained on the internal codebases of Azure, Windows and Office. That's the only way Microsoft can convince me that they truly believe Copilot is non-infringing technology.


I suspect Microsoft would earn more money by doing this.

Their own engineers would get productivity boosts: with Copilot already familiar with their data structures, code style, etc., accuracy would get a big boost.

But also, third-party code would end up being more similar. The whole world's code style would be pushed towards 'Microsoft style', which probably makes hiring easier, cuts training time for engineers, etc.

And the downside, that outsiders might learn tiny nuggets of info about Microsoft sources, is probably irrelevant when outsiders can already decompile binaries and learn far more.


> is probably irrelevant when outsiders can already decompile binaries and learn far more.

Most, if not all, Microsoft products can have their source made available for viewing if you are one of those VIP development partners. Microsoft doesn't really have any secret source (pardon the pun) whose leaking would undo their value proposition.

In fact, if Microsoft opened up their systems a bit more, they might even gain some PR or mindshare, with no effect on (if not an increase to) their bottom line.


It would be surprising to me if their internal engineers don't already have access to a model trained on internal Microsoft code.


You are assuming Microsoft's code base is superior to Linux / Git / MySQL / whatever else is on GitHub right now. That is a .... big assumption.

And if Microsoft's code ends up influencing the rest of the world's code, that would be a .... big downside.


I don't think you should be looking at the best of the Microsoft/GitHub corpora to gauge their overall quality. You probably want to be looking at the quality of the median project, which is going to be heavily influenced by the long tail of low quality projects.

IMO, the long tail of non-code-reviewed, written-by-someone-in-their-first-month-of-coding, barely-even-compiles noob code[0] in Github is going to be orders of magnitude larger than the long tail of crap in Microsoft's internal repos.

[0] Hey, everyone has to start somewhere. There's nothing wrong with your first "hello world" program being buggy - that's what being a beginner means. But it's probably not the sort of code you want to train an LLM on.


Now I'm wondering if the copilot AI (GPT3/4?) takes number of stars/forks/etc into account during the training process.


I would think if the LLM knew it was rookie code, it could actually be pretty useful, no?


Without a myriad of dumbasses like me being able to commit to Microsoft vs Github, I'd assume Microsoft's average is better than Github's.


That is a... bold assumption to make. Not just for Microsoft but for any large corporation.


> That is a... bold assumption to make. Not just for Microsoft but for any large corporation.

I dunno; the average project on github isn't code-reviewed, while all the projects at Microsoft are.


I'm not saying bad code doesn't exist there. My thought is that the percent of bad code increases with volume (or at least with a higher number of producers). Tens of millions of people committing to GitHub should mean it's more cluttered with garbage than MS. I at least assume MS has some automated code standard or security scans. That's at least more than nothing.


" I at least assume MS has some automated code standard or security scans." -- that is a .... big assumption.


No, it really isn't when we're dealing with an organization that is audited for SOC 1/2, DoD, and likely others.


Are you sure?

https://arstechnica.com/security/2023/09/hack-of-a-microsoft...

The Azure-State-Department breach had nearly a half dozen contributing bugs...


And how does that compare to all the bugs on Github?


My friend - Chinese secret services read Secretary of Commerce's emails because of Microsoft's security leaks: https://abcnews.go.com/Politics/commerce-secretary-gina-raim...

So yeah, assuming Microsoft systems are up to standard or have security reviews or whatever is a .... big assumption.


My friend, nobody implied that any of these things result in a foolproof system.


Do you have the exact same style of talking in all your comments lol?


Yup, individuals struggling to have impact will cut corners and heavily impact tech debt :P


I have a friend who worked at Microsoft... if his opinion is anything to go by that's very far from true.


> You are assuming Microsoft code base is superior to Linux / Git / MySql / whatever else is in github right now.

How do you get that impression from the comment? I don't see anything implying that.


Not for Microsoft it wouldn’t, which was their point.


> Code style of the whole world would be pushed towards 'Microsoft style'

Yes, that's exactly what the world needs, more software like Teams.


I mean, microsoft's code is probably better than the github average. There's an awful lot of horrific code out there.


I don't know about MSFT, but I bet this would really help Google a ton. With a mono-repo and huge focus on readability, not to mention how many thousands of SWEs spend the majority of their time slinging protobufs around, it seems a significant fraction of day-to-day code could be largely automated.


Google absolutely has their own internal models that do exactly this. It wouldn't surprise me if Microsoft indeed does have an internal Copilot that is trained on their data, but even on the smallest risk that they leak their code, they wouldn't share that particular model.


What does "absolutely has" mean here? Have you actually heard anything about such internal models?


Why wouldn’t they? Meta does, and they write openly about it


This is incorrect and not how Copilot works. My company just hosted two MS engineers to explain it live to 175 of us.

The style applied by Copilot comes from your surrounding code context, not from the LLM. And that base, trained on all public repos from GitHub, knows everything about data structures, etc, in the languages that were scanned.

Nothing new would be gained by scanning MS's own repositories and nothing would be leaked or color the output in actual use.
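
To make "style comes from the surrounding context" concrete, here's a rough, hypothetical sketch (my own example, not anything from the MS presentation): if the open file already looks like the first function, the suggestion for the next one tends to mirror it.

    # the file already uses snake_case names and type hints...
    def load_config(path: str) -> dict:
        ...

    # ...so a suggested continuation tends to follow the same conventions
    def save_config(path: str, data: dict) -> None:
        ...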


They're not claiming that it can never spit out code exactly, but that they will take liability if:

- It does

- The user didn't turn off the filters that prevent this

- The user didn't intentionally make it do it

- This use is found to be illegal

There's a difference between code that needs to be kept private from bad actors (from their point of view at least) and code that is public but with restrictions on its use that anyone who gets it should be aware of. This is like saying "if you truly believe that license agreements are legally binding, then publish your user's passwords publically with a license saying no one can use them"


> This use is found to be illegal

This being the real hurdle. With Microsoft money behind the defense, only megacorps can win.


Microsoft has lost legal battles against non-megacorps in the past.

I remember some guy representing himself and winning some dispute over shrink wrap licenses and student discounts.


That's badass. Where can I read about that court case?



That's seriously awesome.


Leaking sensitive data and infringement are separate (tho related) concerns. They may not want to do what you say, even though it's totally infringement safe.


Are they separate? Or is it the same concern but from opposite view points?

Both worried about IP leaking but one side is worried about their IP leaking and the other worried about liability if they inadvertently implement any leaked IP. Either way, the concern is leaked IP.


Yes, if I ask something like "Can you describe microsoft's internal security processes and the names of upcoming products" the output would be original and not covered by copyright, but it would be sensitive internal information and covered by NDAs. But any code publicly posted and available to be scraped will not have such sensitive info in it.


I don’t think GitHub Co-pilot can respond to prompts like that. I thought it was ostensibly sophisticated source code completion. If so, source code is absolutely covered under copyright.


Works generated by AI are not copyrightable. If you take a generated work and substantially build upon it, then it's likely copyrightable.

At least that's the case for art, and I think the same logic should apply to art and code.


That hasn’t been tested in court.

But even if that were true, it’s a moot point because we are talking about the copyrighted content that the models were trained on. Hence the point the OP made that if Microsoft really wanted to reassure people then they’d promote models that were trained on Microsoft’s own code rather than handwave away these concerns with gestures of assuming theoretical liability.


Ah, ok. As for testing in court, that will be useful, but a rather official source says "created by a human author" [0] in defining the notion of copyright, which I assume is paraphrasing actual law, which I assume a judge would interpret similarly. However, I will concede that it's conceivable that if a human authors a work that then itself authors another work, the second work could potentially be attributed to the human for purposes of copyright eligibility.

[0] https://www.copyright.gov/what-is-copyright/


> But any code publicly posted and available to be scraped will not have such sensitive info in it.

Well, at least you'd hope so.


> even though it's totally infringement safe

This hasn't been tested in court.


The last thing the world needs is more code written in the style of Win32 API.


I believe you’re referring to GitHub Copilot which is a distinct offering in their portfolio (still Microsoft). GitHub Copilot was based on GPT-3 with fine tuning from public code repositories. That is the controversial aspect of it, I believe.

This blog post refers to the broader ecosystem of Microsoft Copilot solutions. Most of those tools rely on the Azure OpenAI API service on the backend and are not specifically tailored for code generation.


Windows API and the entirety of its client code aren't a good source of standard C programming. On the source level you have additional types and qualifiers/annotations that only MSVC understands.

An LLM copilot doesn't really understand the context of the project; it just goes for similar text.

So if you train on big projects, you're only picking up their patterns. When a Copilot user asks for a string concatenation 'tip', you want the LLM to output a general answer, not something tied to a specific project. A big project is likely to use an abstraction over strings, with base-library usage shrunk down to a few lines of code behind that abstraction. In this case you'd want the LLM to draw on a few "simpler" projects that use base-library strings abundantly, so it has a decent amount of text to find the most likely correct match for the user's input.
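
Roughly what I mean, as a made-up sketch (strutil.concat_ws is a hypothetical in-house helper, not a real library):

    first_name, last_name = "Ada", "Lovelace"

    # the general answer a user usually wants, straight from the base library
    full_name = " ".join([first_name, last_name])

    # versus what a big project looks like once strings hide behind its own abstraction
    # full_name = strutil.concat_ws(" ", first_name, last_name)   # hypothetical internal helper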

I do believe Microsoft has all the code available for good training; it's not only about Azure, Windows and Office, there is tons more and it's open source already.


There are illicit copies of Windows source code just up on GitHub. I wonder if we're already at the point where Copilot will spit those out if you poke it the right way (but I don't feel like spending $10 to find out).


It would be an ugly beast. But I agree with you that there is a fair approach.


Interesting question: would Copilot become better after such training...


Negative examples should aid training, right?


Probably not


Is there any evidence that it isn't also trained on parts of MSFT's code base?


Is there any evidence it is?


If it is, it should be fairly easy to see.

We can already take a guess at what many internal functions look like from the published symbol tables of every function across all major Microsoft products. Simply ask Copilot to write those functions and see if the code comes out better than for a similar set of made-up yet plausible function names.
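
A rough sketch of that test, assuming you have some way to request a completion (complete() here is just a stand-in, and the "fake" name is invented to look plausible):

    # one name from public symbol tables vs. one invented-but-plausible name
    real_names = ["NtQueryInformationProcess"]   # documented/exported
    fake_names = ["NtInspectProcessMetadata"]    # made up for this sketch

    def complete(prompt: str) -> str:
        # stand-in for whatever Copilot/LLM completion API you can reach
        raise NotImplementedError

    for name in real_names + fake_names:
        suggestion = complete(f"// implementation of {name}\n")
        print(name, suggestion[:80])

If the real names consistently come back with noticeably better bodies than the fakes, that would hint at training on non-public sources.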


Wouldn't you then end up with code suggestions based on the style guide of a single company and limited set of languages?

It probably would not be a very desirable product in the end.


Even Microsoft knows that their own code is absolute garbage that would bring the quality of copilot way down.


It's likely that generative AI in general will be deemed fair use, due to its (generally) transformative nature. Sure, if you really coax it, you can get code or images out that look similar to existing ones, but the courts might see that generally speaking, it produces new content that has not been seen before, especially in the case of images.

Google Books literally copied and pasted books to add to their online database and that was deemed fair use, so something much more transformative like generative AI will likely fall under much broader consideration for fair use. Google Books was, yes, non-commercial, but the courts generally have the provision that the more transformative something is, the less it needs to adhere to the guidelines laid out for determining such fair use.

https://ogc.harvard.edu/pages/copyright-and-fair-use


> It's likely that generative AI in general will be deemed fair use

Everybody seems to be saying this, but I really don't think there's even 50% chance of it happening.

Google Books was fair use because it was a public benefit and did not take away from publishers or authors; on the contrary, it helped people find their works.

Compare generative AI which extracts the essence of people's works and recreates similar works (in terms of style, etc) while cutting out the original authors completely. This potentially denies them the fruits of their labor. It's notable that it's a purely mechanical process and no human creativity is involved, except that which is extracted from other authors. Mere prompts don't count.

The argument you're suggesting will hold is essentially "yes we're using copyrighted works, but we're doing it at scale and blending it, so that's ok".


>extracts the essence of people's works and recreates similar works (in terms of style, etc) while cutting out the original authors completely.

Only if you ask it to. At which point the person asking is, at the very least, also culpable of violating someone's IP.

It is also illegal for me to pay someone to write Mickey Mouse fan fiction (though if I don't publish it, this gets more murky).

> The argument you're suggesting will hold is essentially "yes we're using copyrighted works, but we're doing it at scale and blending it, so that's ok".

I want to flip this on its head: the argument you are suggesting is essentially "LLMs should be illegal because they can be asked to break copyright at scale!" It isn't illegal to be an author for hire, even though someone could potentially ask you to write fan fiction for their personal collection in the style of Tolkien, but because an LLM can do it at scale, it is illegal?


Both Napster and the Pirate Bay founders argued that only users could be held responsible, since it was the user who requested the infringing files. It did not stop the courts.

Anyone could use those tools to download Creative Commons files and Linux ISOs, but those arguments did not succeed in the legal system. BitTorrent as a technology was, however, not made illegal, as can be seen in games using it to distribute patches.


Napster and the Pirate Bay struggled because the vast majority of content was pirated. You would be hard pressed to say a significant minority of generative ai has any copyright issues, much less copyright issues as blatant as straight piracy.


I'm not convinced any of the output of these generative AIs is free from copyright issues. Consider: a ROT13 copy of a book may at first glance look nothing like the original, but distributing digital copies would be clear copyright infringement.

Feature extraction is literally a form of lossy compression. You can prod DALL-E to make obvious copies of some of the works it was trained on, but even seemingly novel images could contain enough similarities to training material to be problematic.
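
The ROT13 point is trivial to demonstrate; the scrambled text shares essentially no literal substrings with the original, yet every bit of the original is still in there:

    import codecs

    original = "Call me Ishmael."
    scrambled = codecs.encode(original, "rot_13")    # 'Pnyy zr Vfuznry.'
    restored = codecs.decode(scrambled, "rot_13")    # round-trips exactly
    print(scrambled, restored == original)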


I think this is basically the same as the sentiment that there is no such thing as a truly novel idea.

The standard isn’t “I think that looks like an Andy Warhol picture”, it is “That is substantially similar to a specific Andy Warhol piece”. Copyright doesn’t protect style.

> Feature extraction is literally a form of lossy compression.

This is one way think of neural nets, another is that they find the topological space of pictures.

But these are just models of computation, which aren’t especially relevant in the same way that it isn’t relevant what produces an infringing image, just that it is produced.

Which brings me back to my original point: there are two different barriers for generative AI: is the model itself transformative, and is the primary purpose of the model to generate copyright-infringing material.

With respect to the first… I have no idea how someone could argue that the model itself isn’t transformative enough. It isn’t “substantially similar” to any of the works that it is trained on. It might be able to generate things that are “substantially similar”, but the model itself isn’t… it’s just a bunch of numbers and code.

Regarding the second: I have less experience with image models, but I use chatgpt regularly without even trying to violate copyright, and I don’t think I’m alone, so I doubt you could make an argument that llms have a primary purpose of committing copyright infringement.


> it’s just a bunch of numbers and code

That really doesn’t fly legally because any digital format is ‘just’ numbers.


Yes, you are correct, I was being flippant.

But I think the greater point still stands. In order to call the model itself a copyright violation you would have to say it is "substantially similar" to thousands of original works. Then in order to make it illegal you would have to say it wasn't "transformative enough" to be considered fair use.

I can't come up with an argument for either one of those points that holds any water at all.


Copyright is not cooties. For something to be infringing it has to be beyond the de minimis threshold. It’s not enough to show that a copyrighted work influenced another work, there needs to be some substantial level of copying.

The music industry has been going through exactly this for the last few years and the courts have recognized that the creative process necessarily involves copying and that a small amount of copying is not infringement.


The de minimis threshold is shockingly low as seen in various successful lawsuits.

Critically, it's not just a question of what percentage of a work is a copy of the original but how much of the original work was copied. I.e. copying 3 lines in a book is a tiny fraction of the book, but if you copied half the poem it's well past the de minimis threshold.

Similarly only a small percentage of a giant library of MP3’s comes from any one work, but that’s not relevant.


Exactly, but the kinds of things that Copilot is taking are necessarily very generic. It’s not going to be taking the “special sauce” from an open source project, because that is very unlikely to be the most probable continuation of any prompt that would occur in normal usage.

Copilot is taking things like “reverse a string” or “escape HTML tags”, that have very little originality to start with. This kind of common language is analogous to the musical motifs that have been also found to be under the threshold.
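
For what it's worth, the snippets in question really are about as generic as code gets; two people writing them independently would land on nearly the same thing (a sketch using the standard library):

    import html

    def reverse_string(s: str) -> str:
        return s[::-1]

    def escape_html_tags(s: str) -> str:
        return html.escape(s)

    print(reverse_string("copilot"), escape_html_tags("<b>hi</b>"))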


The foundational models most coding models are built on may have comments and code in them. They're almost certainly built on a number of legal violations, including copyright infringement. I have details in the "Proving Wrongdoing" section here:

https://www.heswithjesus.com/tech/exploringai/index.html

I've also seen GPT spit out, word for word, proprietary content that isn't licensed for commercial use as far as I'm aware. They probably got it from web crawling without checking licenses.

What I want more than anything in this space right now are two models: one trained on all public-domain books (e.g. Gutenberg), and one on permissively licensed code in at least Python, JavaScript, HTML/CSS, C, C++, and assembly. One layered on the other but released individually. We could keep using those to generate everything from synthetic data to code to revenue-producing deliverables. All with nearly zero legal risk.


The problem with this line of thinking is that a person can also cut and paste code that they don’t have a license to use… but until they do, they haven’t done anything wrong by reading the code.

So either we carve out an explicit exception that machines aren’t allowed to do things that are remarkably similar to what humans do… which would be a massive setback for AI in the US.

Or we agree that generative models are subject to the same rules that humans are — they can’t commit copyright infringement, but are able to appropriately consume copyrighted material that a human would be able to consume.

The second option seems to me to be much simpler, nicer, and more appropriate than the first.


I agree with you. In fact, the latter is in my Alternative Models section.


Yeah they seem to be applying legal theory like programming code. If what they said about Napster applied to every website then literally every site with user generated content would be instantly wiped off the internet.


> You would be hard pressed to say a significant minority of generative ai has any copyright issues, much less copyright issues as blatant as straight piracy.

Where generative AI ingests copyrighted works in order to work and bases its output on them, then it is copyright infringement, equivalent to 'straight piracy' of all that it ingested, unless it's deemed fair use.

What Google does with its search engine, for example, is fair use, what Napster did was not.


I suspect that your extreme position must come with a sincere belief that nothing even close to human intelligence will ever be achieved. Imagine a robot with an AI brain. It would have to be blindfolded because just by learning to recognize, or even just viewing, the label on a can of Coke it would have “copied” it and become “illegal,” especially if it was capable of sketching it on demand. Any kind of intelligence cannot even simply view or listen to the world without encountering something IP-encumbered.

Learning, by human or machine, means extracting a copy of the essence of something and yes, storing that essence in a lossy way. It seems like learning from copyright-encumbered material ought to either be illegal for both, or legal. I know which world I would rather live in.


>Where generative AI ingests copyrighted works in order to work and bases its output on it, then it is copyright infringement

This is an absurd standard. Is it copyright infringement when a human "ingests" copyrighted work and bases their output on it? Because that's commonly called inspiration and is how every artist creates their work - through experiencing other works and using that cumulative inspiration to form their own product.

Copyright is already ridiculously restrictive as it is; this proposal not only fundamentally misunderstands how generative AI works but penalises AI for doing what humans do every day.


Yes, but I believe there are two questions here:

* Does the model itself violate copyright?

* Does the output of the model violate copyright?

I don't know how you could make an argument that the ingestion of information into a model through a training procedure in order to create something that can generate truly unique outputs isn't transformative of the original works. The legal standard for a new work to be considered a copyright violation of an original work is "substantial similarity". I don't know how you can make an argument that a generative model is "substantially similar" to thousands of original works...

Honestly, I'm not even sure if "fair use" comes into play for the model itself. In order for fair use to come into play, the model has to be deemed to be violating some copyright. Only once it is found to be violating does "fair use" come into play in order to figure out if it is illegal or not.

The second question is the one where fair use is likely to come into play more. And this question has to be asked of each output. The model's legality only becomes an issue here if, like Napster, you can't argue that the model has much point other than violating copyright. Napster didn't violate copyright (the code for Napster wasn't infringing on anything), but it enabled the violation of copyright and didn't have much point other than that.

I don't think you can make that argument though. I use ChatGPT most days, and I've never gotten copyrighted material out of it. I could ask it to write me some Disney fan fiction, which would violate a copyright. And I think there is a valid legal question here about who is responsible for preventing me from doing that. This is where I think the gray area is.


Those tools overwhelmingly served piracy; nearly no one was downloading actually-legal public domain songs or Linux ISOs from there. Contrast that with LLMs and generative AI: people are using them for actual work, not for piracy, which will be seen differently by the courts.


> Only if you ask it to.

This isn't necessarily true. It's entirely possible for a model to regurgitate a chunk of GPL'd code without you knowing that's what it's done.


True, though I’m not sure this risk isn’t overblown. I’ve heard of a couple cases where someone got a copyright statement spit out, but I haven’t been able to find much more than the one or two that I’ve seen on hn. If you have more examples, I’d love to hear about them.

Code is also tricky: there are a finite number of ways to write an algorithm, and I’m sure both that multiple people have written the same version of left pad for example, and that it is not possible to copyright something small like that. When the code gets bigger, the likelihood of an llm spitting out large chunks of GPL’d code seems vanishingly small (without asking for something specific like that). Though I’d love to see examples to the contrary.
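
To be concrete about the left pad example, the "obvious" implementation is a couple of lines that countless people have written independently (a sketch, not the actual npm package's code):

    def left_pad(s: str, width: int, fill: str = " ") -> str:
        # pad on the left until the string reaches the requested width
        return s if len(s) >= width else fill * (width - len(s)) + s

    print(left_pad("42", 5, "0"))  # 00042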


https://twitter.com/DocSparse/status/1581461734665367554 is the one I was thinking of. It's not just the copyright header in that case.


Interesting, though I think I would put those examples under a "will emit copyrighted code if prompted to". I can't think of a compelling reason someone would prompt for a sparse matrix function with a "cs_" prefix unless they were searching for copyrighted code they knew existed.

Though that is definitely a simpler prompt than I would have expected was necessary to get such a result. Thanks!

(The first example also isn’t the same code. It is very close, and definitely similar in style, but it isn’t clear that code would a)run, or b) would work as expected. I need to sleep though, so I’m not sure how much that matters.)


> yes we're using copyrighted works, but

There’s no law against “using” copyrighted works, there is a law against copying and distributing them.

Fair use analysis doesn’t come into play unless we’re dealing with clearly established copyright infringement. What LLMs do doesn’t clearly qualify as any of the behaviors reserved to copyright owners. For example, it certainly doesn’t “copy” the things it’s trained on by any legal definition.

Law works on precedent and analogy when there’s no clearly on-point statutes or case law. The most analogous situation to what transformer models do is a person learning from experience and creating their own work _influenced_ by what they’ve observed. That behavior is not copyright infringement by any stretch of the imagination. The fact that it’s done with a computer is not as important as people seem to think it is.


> For example, it certainly doesn’t “copy” the things it’s trained on by any legal definition.

What about pictures still containing watermarks? Regardless of the actual legality, this does not fit "certainly".

> The most analogous situation to what transformer models do is a person learning from experience and creating their own work _influenced_ by what they’ve observed

No, it is not. It's called machine "learning", but a butterfly isn't a fly made out of butter. Maybe courts will agree, maybe they won't, but the analogy to human learning is tenuous at best.


Here is the section of title 17 that defines the rights of copyright holders and what terms like “copy” mean in US law. It’s clear as mud but I feel it’s likely that the process of training neural network weights is not going to be held as equivalent to verbatim digital copies. It’s just not the same thing and the law has no clear provision for it, except by analogy to existing human creative processes.

https://www.copyright.gov/title17/92chap1.html#106A

The most closely applicable existing law is that of “derivative works” but those require human authorship, so it’s far from clear that those would apply to AI output either. Ultimately this is going to be hashed out in the courts until some actual laws are written to deal with it.

(IANAL)


> It’s clear as mud but I feel it’s likely that the process of training neural network weights is not going to be held as equivalent to verbatim digital copies.

It's taking verbatim digital copies and using a form of lossy compression to transform them, which I think is clear when looking at things like auto-encoders.


Isn’t your brain doing the same thing when it reads text or views a painting? Some people can even memorize and precisely recreate the things they’ve seen. But no one considers the process of lossy storage in human memory to be copyright infringement. Instead the later reproduction itself might be infringing. I think it will be the same here. Training models on copyrighted content won’t fall afoul of any existing law, instead legal challenges will have to be aimed at specific instances where the models produce output that arguably infringes copyright.

That’s inconvenient for opponents of this technology because they would prefer to ban the training itself, but there’s not a good justification under existing law to do this.


Ultimately I find that commoditization enables the purest form of the banality of evil.

Commoditized goods allow the bad to be sorted in with the good, allowing a price to be put on the commodity. Great where it's applicable but horrendous when it's improperly done - i.e., home loans, or intellectual property.

If your commodity markets aren't properly regulated you get a race to the bottom. If you are trying to commoditize something that shouldn't be, it effectively enables white-collar looting or money laundering.


First, style is not copyrightable. I could draw something in a Studio Ghibli style and they could do nothing about it, legally speaking.

Second, the way we've seen generative AI used is not really the same as it was touted originally, that a mere prompt could replace an entire artist's work. A year later, we see that most people, artists included, don't use it as a verbatim text-to-image machine; they use it as a tool. See apps like ComfyUI or others which allow node-based or layer-based image creation and editing, which even Photoshop now has. It's the same as Copilot and ChatGPT: it's not replacing any programmers, just increasing their productivity. Given that, it is not looking like generative AI is hurting anyone's profession, quite the opposite.


While there are no IP protections for "style", there are certainly elements that are covered. Particular colors can be trademarked, characters can be copyrighted separately from the works they appear in, and design patents are a thing that cover more than most folks realize.

I don't think any but the most copyleft segment of society thinks it would be reasonable for a generative AI trained on exactly one person's work to be used for profit by someone else.

An AI trained on two or ten people's work probably feels the same for most folks, but what about when it's thousands or millions? What if instead of one person's work it is the works held in copyright by an entity like Getty Images?


> I don't think any but the most copyleft segment of society thinks it would be reasonable for a generative AI trained on exactly one person's work to be used for profit by someone else.

Why do you think that? It doesn't seem obvious to me at all.


Well imagine you created a novel generator that just cut and pasted whole sentences from the Harry Potter books to create a new Harry Potter, and posted it online. Now, as long as this was done as just a bit of fun, non-commercial, I feel like that likely should be allowed as fair use (whether it actually would be is not the point), though borderline. If you tried to sell it, definitely too far.

So I think "used for profit" is quite key.

But another example is someone writing and selling a reference guide to Tolkien's mythos that catalogues the content of his novels. And we would say that should be allowed, though that could be taken too far as well, for example it could duplicate the material in the appendices to LoTR.


You can make trademarked colors with Microsoft Paint…


> First, style is not copyrightable. I could draw something in a Studio Ghibli style and they could do nothing about it, legally speaking.

Meanwhile, drawing Mickey ears on the wall of a kindergarten is not safe.

If you feel strongly that generative ML somehow launders copyright out of the bits, train an image generator purely on Disney copyrighted material, share the model on the web, and see how well that works out.


Training purely on Disney's images would probably be difficult, considering the huge amount of images you need to train a model from scratch. But here is a "fine-tuned Stable Diffusion model trained on screenshots from a popular animation studio". Seems to have worked out quite well so far as that model was last updated last year.

https://civitai.com/models/24/modern-disney


"Modern Disney" is neat, I wasn't aware of that. Just so you know, The Mouse is much more fierce about protecting their classic properties like Mickey Mouse or Donald Duck, so this doesn't quite demonstrate my point yet.


Mickey (who will enter the public domain soon) is a specific expression and both copyrighted and trademarked.


> First, style is not copyrightable

Wasn't suggesting it is. The point is that the tool is used to create things that substitute for the original authors' work by ingesting the works of those authors. The impact of the copying matters when weighing fair use.

If I use your copyrighted works to supplant you in some way, even as a part of a large group, then it's unlikely to be deemed fair use.


Ghibli style is probably trademarked. Different thing. Outline width, color palette, ambient noises, musical style, how the eyes and hands are drawn, when used together, would be possible to trademark I would think.


> Google books was fair use because it was a public benefit

What are the odds the market leaders in LLMs right now are just the current-day version of Borland-style compilers before open source takes over?

I've heard arguments that the infrastructure part is a long-term barrier to entry for OSS development, which will continue to remain in the future. But I don't know enough about it.

Who knows, maybe the legal/gov world will move slowly enough to miss the bulk of the money-extraction opportunities before OSS takes over and the reality of this problem never fully going away kicks in.


You'd need millions of dollars to just compile and label datasets. The training itself requires a lot of resources and money, as does human reinforcement.

Open source models would need benefactors with deep pockets.


Copilot makes open source developers and contributors that much more productive which is a public good.


Indeed. Further to this, training on data involves copying it. To do so without permission robs authors of the right to contract their work out for this training, either to OpenAI or any other third party.


Every kind of web crawler has to copy data. If that part of the AI training is illegal for that reason then every web crawler ever is suddenly declared automatically illegal.


Web crawlers generally allow sites to remove themselves from the index.

Are there any crawlers used for commercial purposes which refuse to remove sites from an index if they ask? The distinction with OpenAI is that there is no way to be removed from OpenAI's training set.

You can remove yourself from the crawler now, but not from what they previously crawled.


If a copy of the downloaded file is redistributed or used in other ways that possibly infringe copyright, THAT I could understand, but suddenly making the mere act of downloading the file infringing (assuming it is made legally available to the public)? If the downloaded file is analyzed by some software and then thrown away, I don't see how that infringes copyright any more than, say, downloading an image to decompress it and scale it for display on a screen (then throwing it away once it is no longer needed).
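
(For comparison, the image case looks roughly like this; the URL is a placeholder, and the point is that the decoded and scaled copies are transient and nothing is redistributed:)

    import io
    import requests
    from PIL import Image

    data = requests.get("https://example.com/photo.jpg").content   # placeholder URL
    img = Image.open(io.BytesIO(data)).resize((800, 600))          # decode and scale in memory
    img.show()
    # nothing is saved or passed on; the in-memory copies are discarded when no longer needed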


So many downthread comments pulling out the computers and brains are exactly the same meeeerrrrrr BS.

"I'll keep saying it every time this comes up. I LOVE being told by techbros that a human painstaking studying one thing at a time, and not memorizing verbatin but rather taking away the core concept, is exactly the same type of "learning" that a model does when it takes in millions of things at once and can spit out copyrighted code verbatim."

(I also love it when they're deliberately obtuse about it too. The past decade has made me sick of this trolling tactic.)


Could you please stop posting unsubstantive comments and/or posting in the flamewar style? It's not what this site is for, and destroys what it is for.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.

p.s. Also, please don't copy/paste comments on HN.


> Everybody seems to be saying this, but I really don't think there's even 50% chance of it happening.

That's true, it's probably 99%-plus odds of it happening, or at least that's the conclusion that the experts and lawyers hired to help evaluate AI startup valuations are coming to. Hired by banks, venture funds, short-selling shops, etc.: plenty of people who don't depend on it being OK in order to make money.

> "yes we're using copyrighted works, but we're doing it at scale and blending it, so that's ok"

I mean you know collages are legal right? You literally take 100s of copyrighted pictures and put them together and suddenly it's perfectly legal and ok.


Art made from a collage of, say, magazine photos does not supplant or substitute for the magazine photos which is why it's much more likely to be deemed fair-use. Despite the collage using perhaps large portions of the copyrighted photos, it is nonetheless transformative in the sense that no-one is deciding to buy the collage art instead of the magazine photo.

Contrast LLM-created code which is certainly a substitute for the original copyrighted work.


> you know collages are legal right

Only if it’s sufficiently transformative. There was recently a case that hit the US Supreme Court about this subject regarding an Andy Warhol adaptation of a portrait of Prince [1]. So, in the US, fair use in this regard requires some amount of substantive transformation of the material. But, as we are talking about AI algorithms, there isn’t a person in between the model and the training data. The argument here is whether or not a person is required to make a transformative use of the material (and thus fair use applies). Given that AI generated (and non-human animal generated) works aren’t copyrightable due to the lack of human involvement, I’d wager that any AI use of copyrighted material won’t get fair use protections.

[1] https://www.eff.org/deeplinks/2023/05/what-supreme-courts-de...


You really think AI startups are valued based on the opinions of lawyers and experts? They’re valued based on whether the investors think they can find a bigger fool to hold the bag.


> I mean you know collages are legal right? You literally take 100s of copyrighted pictures and put them together and suddenly it's perfectly legal and ok.

Really? When has this been done?


You've never seen a collage?


I've never seen one go to court


I won't debate that no 'human' creativity is involved, but human brains are a purely mechanical process, and that's where human creativity originates (unless one invokes the supernatural).

LLMs are typically implemented in a way that makes them non-deterministic (i.e. temperature > 0).
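
(For reference, "temperature > 0" just means sampling from a softened distribution instead of always taking the most likely token; a minimal sketch:)

    import math
    import random

    def sample(logits, temperature=0.8):
        # temperature > 0: scale the logits, softmax, then draw stochastically
        scaled = [l / temperature for l in logits]
        m = max(scaled)
        weights = [math.exp(l - m) for l in scaled]
        return random.choices(range(len(logits)), weights=weights, k=1)[0]

    print(sample([2.0, 1.0, 0.1]))  # different runs can pick different tokens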


> It's likely that generative AI in general will be deemed fair use, due to its (generally) transformative nature.

Have you read the recent SCOTUS decision in Warhol v Goldsmith? Because that's a pretty major redefinition of transformative for the purposes of fair use, and not in a good way for arguing that generative AI is fair use, especially because it ties transformative to the market impact. That generative AI is generally creating outputs that are directly competing with inputs (particularly in the case of generating images, where it's clearly competing with stock images) would make it dramatically less likely that a court would find that it is in fact transformative.


From what I understand, the "market impact" test is about the value of the specific work for which the copyright has been infringed. If, 99% of the times, the generative AI systems do not output anything that a court/jury would deem a derivative work of the original, I don't see how the "effect on the market" prong can be won by the copyright holder.


I think the Warhol decision is an entirely different kettle of fish. Just take a look at the pieces in question: the Warhol portraits don't really look that different compared to the original photographs.

The benefit that generative AI has is that, when claiming copyright infringement, you need to specify individual works that were infringed. It's not enough to say "this work is an amalgam of these other ten thousand works, and we can't really tell you how."

I could imagine if generative AI gives an identical, word-for-word match for an individual piece of source material it could be in trouble, but that's also the easiest type of thing to prevent from an AI company perspective.

The fact is that existing copyright law just can't really encompass the kinds of societal concerns we have around generative AI.


No one has to claim individual copyright infringement for it to be copyright infringement.

At any rate you can force the infringer to disclose what works they use as input.

Copyright law doesn't encompass novel uses, but courts can and will deal with it.


> No one has to claim copyright infringement for it to be copyright infringement.

That's a little bit like "If a tree falls in the forest but nobody hears it..."

I mean, sure, "theoretically" any number of things can be infringement. But it's obviously a gray area, so it only really matters when somebody brings a suit and a work is found to be legally infringing.


The Pirate Bay case demonstrated that you don't need to prove a specific instance of infringement, only that the occurrence of infringement "somewhere/somehow" was more believable than the alternative theory that no such infringement had happened. It may be enough to demonstrate that infringement is trivial, and then point to user statistics to show that infringement is more believable than the claim that it has never happened.


Until someone files a lawsuit and a judge and jury decide the infringement question, it’s all speculative.

Some cases are pretty obvious, but even literal copying isn’t always copyright infringement (e.g., if the material is arguably not eligible for copyright protection).


> It's likely that generative AI in general will be deemed fair use, due to its (generally) transformative nature.

This isn't how "fair use" works, in the sense that there can never be a blanket assurance like that. Also, whether the result is "transformative" is just one of many factors (see audio sampling/remixing).


Also “transformative” doesn’t have its everyday meaning in this context:

“The Godfather” film is absolutely a transformative interpretation of Mario Puzo’s book and a fully distinct, valuable work of original art. Paramount still needed to pay Puzo for the right to base it on his words.


It's not how fair use works now, but many things about copyright law will have to change radically over the next few years. There's too much at stake.


As long as the change is to reduce copyright restrictions on everyone, rather than just giving AI a pass to copy and launder the work of others with impunity.


This is what people on the "this is copyright infringement" side don't understand. Even if it somehow is copyright infringement by the standards of today's law, those laws will inevitably change in the near future. Generative AI is far, far too lucrative and convenient to society for it to be crippled by obsolete copyright infringement laws formed in a time where generative AI was a thing of science fiction novels.


What is at stake?


Billions of dollars in investments that will partially benefit lawmakers?


Well, content creators and publishers have trillions.


Content creators and publishers are also the primary ones who are using generative AI. They aren't a united monolith against generative AI.


Google Books is transformative in its use and for what it is, sure; and yet, if you do a query on Google Books and try to take the output and paste it into your book, that might not be fair use (and I only say "might not" instead of "would not" as maybe you are writing a research paper and wanted a quote from a book or whatever, but of course that's just a silly corner case someone would try to call me out for on an Internet forum).

Just because Copilot might itself be a transformative work which is allowed to exist, that doesn't at all necessitate a conclusion that the developers who are using it are going to or should somehow be guaranteed not to be committing their own copyright sins if they try to incorporate its output into their own works (any more so than one can or should assume all of the outputs of another human being are free of copyright entanglements, even though no one is as yet claiming a human being is themselves an infringement just because they saw another work).


You're getting a lot of pushback, but the EU seems to agree with you: https://creativecommons.org/wp-content/uploads/2021/12/CC-St...

https://www.notion.so/DSM-Directive-Implementation-Tracker-3...

https://eur-lex.europa.eu/eli/dir/2019/790/oj

The TDM4 copyright exception allows datasets to be created consisting of copyrighted works, as long as there is a mechanism for rightsholders to opt out. This seems like the best of both worlds: the dataset is transparent, rightsholders can assert their rights, and certain AI companies can train on copyrighted material.

Of course, this doesn't grant commercial rights for the trained model, only scientific and academic research rights. (I.e. it's fine for Meta to train and release a LLaMA model trained on books, as long as they're not commercially profiting from it, and there's a mechanism for authors to opt out.)

I'm talking with Jordan from https://spawning.ai to try to build some kind of opt out system that makes sense for books. One could imagine doing this for music too.

This is a European law, but unlike other overreaching EU regulations, this one seems like an extremely sensible compromise.

EDIT: Oh, Jordan emailed me a correction:

> Looking at your hackernews comment, my understanding is the right to opt out only comes for commercial research. So making a dataset for eleuther (or whomever you compiled it for originally) probably doesn't even require opt outs. It'd be if openai used it for gpt-5 and charged for it that it would be required.

Wow. So this law actually applies to commercial uses of ML, and non-commercial uses such as LLaMA wouldn't even require an opt-out.

That's wonderful. This gives researchers legal cover, and requires commercial uses to be transparent in their datasets.


> as long as there is a mechanism for rightsholders to opt out.

I really don't like this--opt-out never works because the scale advantages are backwards. It places the burden in the wrong place. The aggregators should have to get opt-in.

Look at YouTube. Because of "opt-out", lots of people monetize content that they have no right to and it's up to the original author to have to fight the scale of a zillion uploaders. Only the biggest entities can do that.

YouTube (and everybody else) should have to assert "You, the uploader, own this content" when they ingest it. Nothing else works.

I wouldn't mind an exemption for research use, though.


>I really don't like this--opt-out never works because the scale advantages are backwards. It places the burden in the wrong place. The aggregators should have to get opt-in.

>Look at YouTube. Because of "opt-out", lots of people monetize content that they have no right to and it's up to the original author to have to fight the scale of a zillion uploaders. Only the biggest entities can do that.

>YouTube (and everybody else) should have to assert "You, the uploader, own this content" when they ingest it. Nothing else works.

Thankfully, in a rare turn of fate, capital will be on the side of the laissez-faire instead of the stringent anti-copyright-infringers for once. You do not own the rights to material created by a generative AI.


One subtle benefit of the opt out is that it forces commercial ML companies to reveal what they trained on. So Copilot would need to reveal a list of repositories, in order to give the repo authors a chance to opt out.

This is a fairly big deal since right now there’s no incentive for AI companies to disclose their training data, and it seems unlikely that legislation to that effect will be enacted anytime soon. Whereas this opt out mechanism is already getting widespread adoption in the EU.


> Sure, if you really coax it, you can get code or images out that look similar to existing one

I'd say it is possible to produce exact data as well. Try "Provide quote from King James' Bible Genesis :1-25" with ChatGPT. You'll get verbatim text. You can get the same with things like Moby Dick, but when I typed "Provide the first five sentences of the book A Game Of Thrones" I got:

Certainly! Here are the first five sentences from the book "A Game of Thrones" by George R.R. Martin:

"We should start back," Gared

This content may violate our content policy or terms of use. If you believe this to be in error, please submit your feedback — your input will aid our research in this area.

The model is clearly capable of reproducing verbatim data I think.


That's part of what made the Google Books ruling so shocking; it considered Google's transformation of "we digitized and indexed these books" to be transformative. If you punch the ASOIAF quote into it, Books will reproduce the text of Game of Thrones that had your query: https://www.google.com/search?tbm=bks&q=%22We+should+start+b...

It's still surreal that this is considered Fair Use, and even defended relatively recently (2013). It's hard to say where the ruling will land ultimately, but there seems to be an argument that verbatim reproduction doesn't matter.


It's likely defended due to being non-commercial and for the public good, as I posted with my link to the Harvard page above. That was for literal copying and pasting, so the bar for transformativeness is higher, but with generative AI, where it can produce wholly new code/images, I think it will also be deemed fair use.


Also, extracting for a new purpose is fair use of long standing, distinct from something like sampling for the same purpose of composing music.

Siskel and Ebert didn’t need to pay rights holders to extract from their works for public criticism.


Google Books itself being some kind of fair use transformative work is unrelated to whether you could use the output of a Google Books query as part of a book you yourself have written (and like, clearly you can't).


Yeah. I wasn't so much trying to put weight on the fact that you can get a fragment of copyrighted text, which Google Books also provides; using the Bible as an example, my point is that you could technically get the whole thing bit by bit. You can't do that with Game of Thrones, likely not because of capability but because of guardrails, because for the machine what's the difference whether it's been fed a copyrighted text or not.


I just want to highlight that this is a very US-centric view. A user of Copilot in the EU might be confronted with a totally different legal regime (no fair use per se, no copyright transferability, ...). It seems quite a bold move for an internationally active company if there is no small print...


> no copyright transferability

The economic part of copyright is transferable in the EU just as it is in the US, only certain moral rights (such as the right to attribution) are inalienable.

edit to add: it's not just in the EU. According to Wikipedia, the same distinction is made in Brazil, China, India and Indonesia (among others, but those were a few big countries that stood out).


That is true: you are certainly allowed to (exclusively) licence your works to others. Actually, I only meant it as an example of how giving guarantees can become difficult if authorship is not clear.


Yes, I'm talking about American law specifically.


Yeah but if it's ok in the USA the EU needs to allow it or they'll fall even farther behind.



Didn't Copilot produce an exact copy of code including the comments?


Take a look at the prompts people use in these examples. They are always so contrived. Sure if you ask it to "Take this function exactly as it is from this file at this repo and output it without changes" it can do that.


Is the contrivedness relevant to the legal question? It shows the model contains the copyrighted content and can reproduce it on demand.


My brain contains loads of copyrighted info. And if I exactly reproduce it from memory, it's copyright infringement. But if I come up with my own work, even if using that copyrighted info to learn from, it isn't infringement.


I don't understand why people keep comparing humans and computers. The law does not treat machinery equal to a human.


You can't really say that. All this needs to be tested in court to see which definitions end up winning and setting precedent.

It can go either way.


How would you even know where it came from? If I commit some code, you’d have no idea if I came up with it myself or if AI generated it.

And why would it matter?


Yes. Courts will generally assign blame to whoever did the thing that caused a breach of the law, which in this case is the user.

In other words, law isn’t a programming language.


Do you think the prompt "sparse matrix transpose, cs_" is contrived?


Maybe? I have no idea what that prompt even means.


Only if you push it into a corner, at which point you may just as well go to the repo and copy-paste the code you're trying to reproduce.


> It's likely that generative AI in general will be deemed fair use

Except that "fair use" is mostly an American thing. In many other jurisdictions (especially civil-law ones) there's no such wide principle; there are only specific statutory exceptions allowing explicit kinds of use of copyrighted material. In those jurisdictions, most uses of generative AI trained on copyrighted material are, more likely than not, illegal at least until the legislator actually changes the law.


TDM exceptions, which are already in place in a number of jurisdictions: https://www.reedsmith.com/en/perspectives/ai-in-entertainmen...


> It's likely that generative AI in general will be deemed fair use, due to its (generally) transformative nature

Purely mechanical modifications may not be considered transformative, and there's an argument to be made that LLMs are purely mechanical (in fact a US district court recently ruled that AIs cannot be authors of copyrighted works).


> (in fact a US district court recently ruled that AIs cannot be authors of copyrighted works).

I thought that was because only humans and other legal persons can legally author things, not because of anything subtler about the nature of LLMs. See also the case where the monkey managed to take photos of itself. I'm not a lawyer, though.


> I thought that was because only humans and other legal persons can legally author things, not because of anything subtler about the nature of LLMs.

It very explicitly was and made a point of noting that it was not addressing anything about whether and when a human author could hold a copyright on a work authored using AI.


Correct, as usual, everyone interprets that case as 'OMG animals/AI created work is uncopyrightable!1!' but in reality it's just that animals/AIs cannot hold copyright. Whether a human using an AI can copyright the resulting work is still up in the air.


In the monkey-photo case, didn't the journalist attempt to assert copyright on the photo the monkey took with his camera, but was denied?


I think it'll likely be deemed fair use because of how much money Microsoft and others are willing to throw at getting that result.


Provided you don’t use it to deliberately recreate a substantial part of a copyrighted work. Intent will matter here, and it’s difficult to prove.

Even Microsoft is couching their guarantee here with an exception for this very case.


There are cases where generative AI can be trained for the explicit purpose of ripping off a particular artist's style. Take a gander at all the artist/art style LoRAs for Stable Diffusion. Some of them are harmless (a Rembrandt LoRA for example) but others are trained to make convincing knockoffs of living artists who are trying to put food on the table.


>It's likely that generative AI in general will be deemed fair use

What if you train it only on my huge repo of GPL code? You are just remixing my code.

Now maybe you think "let me train on 2 different devs' GPL code", so the remixed code will probably be 50-50 and you can get away with it?

If 2 is too small, then tell me what the number N should be. From how many people do you need to "steal" code and mix it before the output is "original"?

Edit: my opinion is that AI should be fair, if you train it on open source then model should also be open source and output should also be open source.


> What if you train it only on my huge repo of GPL code? You are just remixing my code.

The word "remixing" here is useful because it will fit any conclusion the reader prefers.

Arguably even in your reductive example, the result would be non-infringing. Or not. Which conclusion you reach is exactly the topic under debate. Isn't this textbook question begging?


But in this case the LLM will predict the next token based on the input data; all the input is mine, and Microsoft just tweaked some numbers to make the interpolation more correct.

Imagine I get the Windows source code and rename the variables by adding a "314" after each variable and each function name, then rebuild Windows; by your definition this is remixing and fair?


I haven't said whether it's fair or not but oversimplifying a complex topic isn't going to shed any light.

Whether you like it or not, this is an undecided area both legally and morally. Pretending it's clear cut is either disingenuous or delusional.


I think we agree: there is no clear line or clear answer.

My simple example is to show that it is not as simple as "the AI learned from N devs' GPL code and now it can spit out new original code with zero concerns"; we know how this stuff works and that it can spit out the exact training input in some cases.

So IMO a judge should ask the question "from how many people do you need to steal and mix the input to be sure the output is actually original".

And about the claim "if I read someone's code it is not stealing": humans are different, and even for humans it is not allowed to read your competitor's code and then write new code using that knowledge.


Copyrighted code with a GPL license is copyrighted code as far as copyright law is concerned. Copyright is the basis on which GPL is built. The GPL does not apply to anything that is fair use, public domain, or otherwise not copyrighted. If the author does not have copyright to the work in question, they don't have a right to license it.

This is all to say: the question about copyright and fair use remains exactly the same regardless of license.


you are assuming fair and impartial judgement that is then implemented


I'm not sure what you're referring to. What do you think would be unfair or partial?


here is what I was thinking -- in business (or war), parties can implement an unreasonable or illegal action in fact, then use time to rebuff others while making their position stronger, or alternately simply pay off gatekeepers or stakeholders while furthering their position, before others can actually stop the action. None of these or other scenarios involve a considered opinion by an impartial party with an effective implementation; in fact, the key is to avoid an impartial decision while gaining an advantage (and income) on the ground.

I do not disagree with what you said, only that in reality, this is not the only way business conflicts are decided.


Yes, "the question" that I am referring to above is the currently lack of that impartial decision. Courts have yet to issue any ruling that is particularly useful for determining where the line is for the application of LLMs. I am saying that when the line for fair use is further defined in this context, it won't be predicated on which license, if any, the content has.


What if I learned to code based only on your huge repo of GPL code? I'd just be remixing your GPL code at that point, right? Will you brand all of my output as being GPL as well?


>What if I learned to code based only on your huge repo of GPL code? I'd just be remixing your GPL code at that point, right? Will you brand all of my output as being GPL as well?

This never happens, you will first learn from a book or tutorials.

But your idea is sound: have Microsoft buy books from the authors and train the LLM on those books, then have the LLM solve new problems. If it is an AI and not a text-interpolating tool, then it should be able to learn like humans from a few books.


> This never happens, you will first learn from a book or tutorials.

I have learned to code almost entirely from reading code. I hate tutorials and books.

I occasionally skim docs but mainly for the code examples.

So "never happens" is a curious take.


Why is Google books so unusable then? Even documents in the early 20th century are inaccessible.


Sure, that might work for some places, but some jurisdictions don't have a legal concept of fair use.


"likely"

Big bet on legal costs based on something being "likely".


Will you indemnify those that follow your advice?

Because 'transformative' is a pretty dangerous word to use in this context.


> Will you indemnify those that follow your advice?

I strongly feel that this is a terrible metric for comments on the internet.

First, the person you’re replying to has nothing to gain and a lot to lose by saying "yes".

Second, it invites silly corner case nitpicking. Their comment is written in reasonable plain English for other users reading plain English. It’s not a legal contract, and so leaves lots of loopholes. Sure, you could create a likely non-transformative LLM by training it on nothing but the text of Harry Potter with fitness measured by how accurately it exactly reproduces the complete text of Harry Potter, but that’s not what reasonable people are doing with LLMs.


It's borderline legal advice and you have to be very careful with predicting how judges will rule on future cases.

In a legal context certain words have immense power. In the context of copyright 'transformative' is one such case. It's a very fine line between 'transformative' and 'derivative' and you don't get to preempt the judiciary about how they will see things.


This is not a legal context though. I am not a lawyer, I don't claim to be a lawyer, and even if I were a lawyer, no one in the internet should be taking my comments as legal advice in the first place. One should not need to disclaim everything they write with such a statement.


As an attorney, I'm of the opinion that otherwise-intelligent people who provide confidently-wrong legal opinions on the Internet should be held accountable for people following their advice. I see incorrect understandings of the law and sloppy legal analysis with dismaying frequency here, even when it comes to settled law like what "fair use" is.


This is a weird stance. Anyone can say anything on the internet, they can be legal opinions or other things. It should not be necessary to disclaim such an opinion because no one should be using the internet as their basis of law (or medicine, etc) instead of a professional in the first place.


> no one should be using the internet as their basis of law (or medicine, etc) instead of a professional in the first place.

Designing systems around what people should do, as opposed to what they actually do, has proven time and again not to work particularly well in practice. I'm sure you've seen countless examples of how people track paths through manicured grass fields. The landscaper will complain about how people should walk and they'll put up signs to no avail.

The fact is, we (including me, BTW) are frequently wrong about a lot of things, and when there's little riding on it, we can ignore that most of the time. With subjects like medicine and law, however, where a mistake can cost you your life or lots of money, we want to make sure people are getting the best advice possible. That's why we require licenses to practice medicine and law, and we have governing and ethics bodies to regulate how professionals operate their practices.


> That's why we require licenses to practice medicine and law, and we have governing and ethics bodies to regulate how professionals operate their practices.

Correct, so people should (and do) go to the people who have these licenses, not random people on the internet. I don't even understand what your solution, or even problem, is. It seems like you're suggesting that everyone, whenever they speak on the internet about anything vaguely related to medicine, law, or hell, even regulated fields like engineering, should disclaim that they are not speaking in such a context. And I say that that is a ludicrous task to expect of someone. So if you have any better solutions, let me know.


One doesn't have to disclaim anything that they had the good sense not to assert in the first place.


That's your opinion on how people should speak, not most people's, so feel free to disclaim when you yourself talk, but don't dictate what other people should or should not say.


I’m afraid you didn’t understand what I just said. I was politely trying to say “if you wisely abstain from talking about things you don’t know about, you won’t need to disclaim that you don’t know what you’re talking about.”


Or I can just say whatever I want, as can anyone. You can only control your own words; if you would like to "wisely abstain," then do so.


No because I don't have that much money, but it looks like Microsoft will. They likely wouldn't if their lawyers did not think there was a reasonable chance that they'd win the lawsuits, likely from, again, generative AI being deemed fair use.


Are there any actual details on this? I get that this is a blog post, but the only links I see on the page are to other blog posts. It leaves a lot of questions.

Is this blog post a legally enforceable contract? Is Microsoft specifically indemnifying all users of Copilot against claims of copyright infringement that arise from use of Copilot?

The blog post says that "there are important conditions to this program", and it lists a few, but are those conditions exhaustive, or are there more that the blog post doesn't cover? For example, is it only in specific countries, or does it apply to every legal system worldwide?

What guarantees do users have that Microsoft won't discontinue this program? If Microsoft gets kicked in the teeth repeatedly by courts ruling against them, and they realize that even they can't afford to pay out every time Copilot license-launders large chunks of copyrighted code, what means do users have to hold Microsoft to its promises?


This is why (so far) it's just PR, not actual legal protection. Brad Smith, being an attorney understands this. Why would he otherwise risk Microsoft (a $2.5T company) with an uncapped liability guarantee?


I think it's likely MS would want to step in and use their lawyers anyway since the result could be hugely impactful for the future of LLMs which they are heavily invested in.


> Is this blog post a legally enforceable contract?

It can be. The concept is promissory estoppel.

https://www.nolo.com/dictionary/promissory-estoppel-term.htm...


IANAL, but as far as I understand, estoppel is purely a defense when being sued by whomever made the promise.

So it helps if MS sues you when you distribute copilot-generated code that infringes on MS copyrights, but if a third party sues you, you can't claim estoppel to compel MS to help you. You would need a contractual guarantee.


I am a lawyer and tried to find this new language but none of the legal documents I looked at appear to be updated to reflect any of this. Microsoft has a lot of different docs and it's a little confusing but the ones for Copilot are straightforward and none of those have changed any indemnity-related provisions since the spring.


The new terms will be available in early October, I believe.


This is a very clever move by Microsoft. In essence they are painting a giant bullseye on their back for any lawsuits that may arise. The idea being that they have the resources to challenge them (they aren't wrong).

The way AI is going I'm sure we'll see some landmark cases very soon. It is very much in Microsoft's interest to grow this market as fast as possible and be at the center of it. This removes one of the key impediments to adopting generated code for smaller orgs: "Will I get sued if this product generates code that is copyrighted?".


Yes. This is it.

They are throwing down the gauntlet and saying "the Vast MS Legal Machine will fight this."

Basically: "Sue me, I dare you, double dare you. or Go Home".

Flexing.


Sosumi from steve jobs fame is a meme I hope to recycle some day if I ever have fuck you money lmao


They also have money so they’re worth suing.


You wouldn't be suing Microsoft though. Microsoft would come to your aid if you are being sued for copyright infringement. That's a different situation altogether.

So this is an indemnification for damages, not a protection against being sued.


In the most extreme case, depending on how case law shakes out, the use of the models by a third party and distribution of the results will incur statutory damages for each work the model was trained on. This could bankrupt Microsoft for offering indemnification to even a tiny company, but as a response Microsoft could instead breach the contract and not provide the indemnification. After the company goes bankrupt, shareholders could only sue them for the damages of not indemnifying you, limiting the liability to the size of the company that was sued into oblivion and not expanding out to unlimited liability for MS.

They probably have wording to prevent a mandatory injunction where you would compel the indemnification before the bankruptcy.


This makes sense. When I read the article I couldn't help but think of the legal fiasco between the IRS and the Church of Scientology.

I wonder if this is part of a broader strategy to get people comfortable with copilot in a similar way to how Uber got people comfortable to their product even though they were operating in a legal grey area. At a certain point the public becomes accustomed to it so the lawmakers just cave in to the demand.


They also have systemically gigantic amounts of money, so a court may be motivated to create favorable new law for them.


Or Microsoft just sees this as the less bad option. An acceptable tax, handing out some money extraction to white collar folks so the pressure on gov to cripple them doesn't come as fast.


prediction: use cloud deployments to fork critical GPL parts, restrict security updates that are required to their fork and implementation; control the rabble for a few years, issue press releases, and stall while they entrench it.


With a big asterisk-- "customers... must not attempt to generate infringing materials..."

It hinges on what *Microsoft* decides "attempting to generate infringing materials" means. You'd like it to mean that it only excludes use when you're doing something you know would infringe copyright, like "reproduce the entire half life 2 source code." But who knows.


Honestly, I trust Microsoft here.

I don't trust them to compete fairly. I don't trust them as an employer. I wouldn't trust them not to do corrupt things around national politics. I wouldn't want to be their partner in any meaningful project. I don't trust them around a lot of other things.

But one thing they do really well is reliable, long-term sustainable B2B. I do trust them as a business customer. If they exploited a loophole like that, their reputation would implode. I don't use Google Cloud Platform because they regularly screw over customers. I trust AWS and Azure because they don't.

The cost of paying for an infringement is likely a lot lower than the cost of losing that trust.


> It hinges on what Microsoft decides "attempting to generate infringing materials" means.

No, ultimately, it hinges on what a court enforcing the commitment believes “attempting to generate infringing materials” means.

(OTOH, it also means Microsoft has an even bigger incentive to use its lobbying power to ensure that the law is such that liability rarely occurs with the use of these tools.)


The meaning is somewhere between your interpretation and the GPs. Even if a court would enforce Microsoft’s promise, you’d still need to sue Microsoft to compel action in the event of a disagreement and that would be expensive and you’re generally on the hook for your own legal costs when you sue.


Does Microsoft assume liability to support us in court over that question as well?


I think their ML teams built a decent copyright filter and now they "productise" it.


That's just legal speak for "any copyright infringement is your fault".

The question though about microsoft stealing people's code and reselling it still stands.


> legal speak for "any copyright infringement is your fault"

Proving intent is difficult. This basically means if you have emails in which someone describes their work as copyright laundering, Microsoft can use that to get out of indemnifying you.


Yeah, that's a truck-sized loophole right there.


I don’t think that’s terribly shocking or limiting.

If you’re using an LLM to answer questions from your company documents it may inadvertently generate pre-trained copyright material.


Ah. The comment that really should be at the top.


It may not be that simple: Microsoft may assume liability but an infringer can still be sued separately. MS may then be on the hook for the court costs. But you can't just categorically shield the users of a product from being sued.

This is the key bit:

"Specifically, if a third party sues a commercial customer for copyright infringement for using Microsoft’s Copilots or the output they generate, we will defend the customer and pay the amount of any adverse judgments or settlements that result from the lawsuit, as long as the customer used the guardrails and content filters we have built into our products."

The 'we will defend' is one important part, I assume that means that you will be using their lawyers rather than your own (which they have in house and so are cheaper to use than the ones that bill you, the would be defendant by the hour).

The second part that matters is that there are conditions on how you are supposed to use the product and crucially: you will have to document that this is how you used it.

But: interesting development, clearly enterprise customers are a bit wary of accidentally engaging in copyright infringement by using the tool and that may well have slowed down adoption.


> I assume that means that you will be using their lawyers rather than your own (which they have in house and so are cheaper to use than the ones that bill you, the would be defendant by the hour).

Litigation is almost universally outsourced, especially for cases where damages might be large, even by companies like Microsoft.

The point is just to lower the resistance to adoption that legal risk causes.


Only so long as you have the guardrails enabled. One of the guardrails being that copilot will not output any code that exists in any github repo.

We tested copilot with those guardrails enabled and it completely lobotomizes it.

This by the way is not a change. They already had this “Microsoft will assume liability if you get sued” clause in Copilot Product Specific Terms: https://github.com/customer-terms/github-copilot-product-spe...


I've received a lot of flak for this answer in other communities, but, if a statistical model is producing purely derivative works using a mathematical model that's basically a next best token predictor, is it really "stealing"?

Is it "stealing" to have a working understanding of the next best token, or even simply the token that shows up the most often (e.g. on GitHub)?

I'm sure that the argument could be made that all AI should be illegal as all ideas worth having have already been had, and all text worth writing has already been written, but, where would that leave us?

(e.g. your function for converting a string from uppercase to lowercase will probably look like a function that someone else on Earth has written, and the same goes for your error handling code, your state of the art technique for centering a div, etc.)
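To make that last point concrete, here is a hypothetical snippet (not lifted from any particular repo): code this mundane is written nearly identically by independent authors all the time, which is exactly the kind of pattern a next-token predictor converges on.

    # A deliberately mundane, hypothetical example: an ASCII lowercase helper.
    # Write this from scratch and it will almost certainly resemble code that
    # already exists in thousands of repos, with no copying involved.
    def to_lowercase(text: str) -> str:
        return "".join(
            chr(ord(c) + 32) if "A" <= c <= "Z" else c
            for c in text
        )

    print(to_lowercase("Hello, World!"))  # hello, world!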


Not a copyright lawyer, but if we take the AI out of it then derivative works, fair use, etc. are already a grey area. It's a thing that gets argued about all the time in court cases.

If I train a model that given the input "When Mr. Bilbo Baggins" produces the entirety of The Lord of the Rings trilogy and release it, I have probably infringed copyright.

If I train a model that produces some generic paragraphs about "mountains" and "dragons" but contains no meaningful direct quotes or phrases, then that probably isn't a violation on its own. Those words appear in Tolkien's works but are not themselves enough to copyright.

If to train that model it is demonstrated that I copied Tolkien's works in a way not allowed for by the copyright license, (ie buying the book once and copying their text thousands of times across servers to train an AI model) then perhaps I have violated copyright in the interim steps even if the output of my model is no longer consider a copy of the original works.

I don't think there are black and white answers here. At what point does a chopped-up and statisticized copyrighted work stop being a copyrighted work? Can you train a model on something without first copying that thing in a way that violates copyright law?

These are squishy human concepts that get decided by humans in courtrooms and legislative bodies. I don't think the details of the math involved are going to make a big difference in the eventual outcomes.


Not a lawyer.

But, no, it isn't stealing - though no one was talking about theft here; copyright violation is a separate concept. I think the less-than-warm welcome you are receiving is in part due to this subtle but fundamental difference.


Ah, gotcha - I assumed that if some document said you couldn't use something for some purpose and you decided to use it anyway it would be considered theft from the intellectual property owner.


No, but there have been dedicated advertisement campaigns to convince you that they are the same thing. Theft specifically involves depriving someone else of their belongings, which is why the issue under discussion is copyright.

The way it works is more like: when you create an original work, you also possess the sole right to copy that work. I believe (80% confidence) that an independently derived work does not violate copyright; it's obviously easier to make a convincing case for instances like code or song lyrics where you genuinely expect the implementations to shake out the same from genuinely independent parties.

Sidenote: the document that says you can't copy something is the law. The documents I think you are referencing are licenses - the terms under which you are allowed to copy a work. The distinction I'm trying to make is that they can't additionally forbid you; they just withhold their permission (as expressed in the license). It's not a super important distinction, but I read up on it and felt compelled to share.


> I'm sure that the argument could be made that all AI should be illegal as all ideas worth having have already been had

From https://en.wikipedia.org/wiki/Copyright:

> Copyright is intended to protect the original expression of an idea in the form of a creative work, but not the idea itself.


The underlying mechanics are unimportant. You could make similar arguments about encryption and compression algorithms.


I don't follow; don't encryption and compression algorithms carry out very specific steps that aren't likely to show up accidentally by happenstance?

(e.g. it'd be hard to accidentally invent Rijndael with nothing but next best token predictions, but might be possible to duplicate someone's code for inverting a binary tree or encrypting a file)


You can consider your best token predictor as a lossy compression of the corpus it was trained on.
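A toy sketch of that framing, assuming a word-level greedy predictor (nothing like a real transformer, but it shows why "lossy compression" is an apt description, and why small or distinctive corpora can be regurgitated nearly verbatim):

    from collections import Counter, defaultdict

    def train(corpus: str):
        # Count, for every word, which word follows it in the corpus.
        table = defaultdict(Counter)
        words = corpus.split()
        for cur, nxt in zip(words, words[1:]):
            table[cur][nxt] += 1
        return table

    def generate(table, start: str, length: int = 12):
        # Greedy decoding: always pick the most frequent next word.
        out = [start]
        for _ in range(length):
            followers = table.get(out[-1])
            if not followers:
                break
            out.append(followers.most_common(1)[0][0])
        return " ".join(out)

    model = train("the quick brown fox jumps over the lazy dog")
    # With a tiny corpus, greedy decoding replays fragments of the
    # training text verbatim -- the "compression" is barely lossy at all.
    print(generate(model, "the"))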


I wonder how binding this kind of public commitment is. The same way Musk recently said publicly that he'll cover the cost of anyone having work or legal issues for something they said on the platform (and now refuses to honor it).


If a codebase was infringing the GPL, the remedy is to publish the offending source code or terminate distribution. Neither are cases I suspect Microsoft cares about when talking about 3rd party code.

I don't know what case history is like for damages with open source projects, but I suspect it wouldn't be that big of a concern for Microsoft.

Otherwise stated, Microsoft's downside to this is committing their lawyers. And the upside is to improve their code generation tools.

IANAL though.


I'm just curious why everyone is talking about the transformative nature and so little focus is given to the fourth fair use factor:

4. the effect of the use upon the potential market for or value of the copyrighted work (wiki)

I don't know if this particular case is good for exploring all angles of fair use, but to me this certainly is a greater hurdle for commercial generative AI.


Wouldn't you have to first prove that your content came from Microsoft services? Hopefully you track & certify the provenance of every line of code and content you paste? Microsoft surely won't just take your word for it that your content came from them, so how would this play out in practice, exactly?


I just had a horrible thought: what happens when there's a DMCA takedown request to remove an infringement in a widely used LLM? I've seen requests against training data, but never against the output of an LLM.


The output of an LLM is not necessarily stored or hosted. It would be like filing a takedown for someone's spoken words from a week ago. What are they taking down?


Whatever is generating the infringement.


Pinky promise. Where's the legal agreement? I'm sure there's a cap on their liability.


This. It's an empty promise.


What is the financial upside Microsoft is seeing to this that no one else seems to see?


>> What is the financial upside Microsoft is seeing to this that no one else seems to see?

Many businesses have not adopted Copilot because of potential legal issues.

If any of the generated code / content is copyrighted, it could result in negative impacts to the business.

For example, if Copilot generated code that is identical to code that it was trained on that was licensed under the GPL and a company included the generated code in a proprietary commercial product, then the company's product could be subject to the terms of the GPL and the company sued in court.

Assuming liability for the generated code means that Microsoft is making Copilot more attractive for businesses to adopt. More Copilot adoption means more profits for Microsoft.


Given your example of a company unwittingly adding GPL code to their proprietary code base, I have trouble seeing how Microsoft can offer to take liability for such an infringement.

The GPL requires that any software based off of it be GPL licensed and have public sources available. I can't imagine a situation where Microsoft pays a fine, and their customer gets to violate the GPL license by not removing the infringing code, or open-sourcing their product as GPL and providing sources to the public.

Enforcement of the GPL can't just involve paying a monetary settlement to get away with stealing open source code. It must involve the direct targeting of infringing software with demands that the software either take efforts to remove illegally borrowed code, or license the borrowed code as legally prescribed by the original license agreement.

That an AI got in the way of reading the license agreement should not be an excuse for doing zero due diligence in maintaining a lawful code base.


You guys aren't really getting my question. Duh, of course Microsoft makes revenue when they have more Copilot customers. But taking on such a huge external liability for a $30 subscription product just doesn't make sense.

Even if it gets 1 million subscribers, it would represent 0.1% of Microsoft's overall revenue. Software lawsuits can become multi-billion dollar expenses, and targeting Microsoft instead of random Copilot customer Bespoke Clojure Gurus, LLC will mean much larger awards in such suits. Why Microsoft would just volunteer for such a risk baffles me.

My confusion is more over the balance of revenue and expenses than just "derp, me no understand why do companies do things to make money, derp"


Many say that `expected revenue > expected expenses (including legal)` given the current regulatory framework, and maybe that's true and would explain this move.

If Copilot becomes more widespread, it might also force regulators to adopt more friendly regulations that would favor it, lowering the expected legal expenses. So this move by Microsoft might just be the bootstrapping they need to get this dynamic going.


> If Copilot becomes more widespread, it might also force regulators to adopt more friendly regulations that would favor it

it could easily work the other way too


>> Even if it gets 1 million subscribers, it would represent 0.1% of Microsoft's overall revenue.

Microsoft is going all in. They want to have hundreds of millions of subscribers. They want everyone who is using Visual Studio Code for a business to use Copilot. With enough uptake, it could be a billion dollar business.

>> Software lawsuits can become multi-billion dollar expenses

Microsoft has teams of top lawyers and they are rolling the dice that there will not be enough lawsuits to justify the risk.

>> My confusion is more over the balance of revenue and expenses than just "derp, me no understand why do companies do things to make money, derp"

If you want more precise answers, ask more precise questions.


> But taking on such a huge external liability for a $30 subscription product just doesn't make sense.

Your confusion comes from your mis-assessment of the actual risks. Microsoft engaged with tons of lawyers and legal experts and determined there is basically no risk at all in taking this stance.

You think there is a very real risk that the AI output is copyright infringement while Microsoft's deep analysis says the opposite; that's the mismatch.


A million subscribers would barely make it noticeable in Microsoft's books. You don't count until you hit a billion in revenue there. But an AI on every desk ? That's worth it.


They're having trouble getting enterprise customers to sign up for Copilot because of uncertainty over copyright implications of generated code. So they are attempting to remove the uncertainty by claiming they will hold themselves responsible. They are betting that this will allow them to win more Copilot customers. Ultimately, Microsoft has every incentive to ensure that Copilot-generated code is legally allowed; in other words, they will want to make sure any cases that come up related to Copilot code are ruled in their favor. As such they may as well be the ones paying the legal bills. That encourages more customers to sign up and aligns incentives across the board.


They can shoulder risks that other companies can't so they stand to capture more market share?


+1

Maybe they hope to kill new entrants for the enterprise market, including open source.


^ to elaborate on possible logic:

Training an LLM is a low barrier; legal guarantees are a high barrier.

This might turn out to be quite important; provided it doesn't backfire, it seems like a very smart move.


There is no market yet for a product that is unprecedented in nature.


Apart from the attempt to win over more cautious customers, most AI companies want this to go to court, because an established precedent in their favor will be very valuable to them (and they generally expect it will be in their favour, at least to the extent that the risk it might not is outweighed by the cost of uncertainty). You can see this in the cases related to stable diffusion where Stability AI is basically beckoning them into court despite (and probably in part because) the plaintiffs having so weak a case the judge is likely to just chuck it out before any precedent can be established.


Copyright-related stuff is annoying. I can't see why anyone would care. If you publish something to the public domain I don't understand why you get rights to your content that you can self-declare. It's completely ludicrous and only works at the corporate money level because they have the liability and resources to sue. I wish people would use a little more common sense and understand the words ‘public domain’. Regardless of what people say, I can let you know that no one really cares about copyright and in terms of AI, it's an unmovable mountain. Good luck wasting time on figuring out an issue that provides nothing to humanity


Another way to look at this is:

Microsoft just became a code copyright insurance company. The premium is paid for with individual copilot accounts for each developer. And the policy has its exceptions of course.

This is interesting.


Has anyone noticed that Copilot will shade out its answers more often when it’s writing code now? Usually I’ll paste in react components and ask it to fix the tailwind styling, but once it starts writing it gets filtered out by some secondary filter about halfway through. I thought maybe the code it was outputting was too similar to copyrighted code and it triggered a liability filter of some sort.

In any case, super annoying to have that happen so consistently these days that I just use chatgpt to fix my tailwind styling now.


No difference on my side, but Copilot has always been reliably slow in my IDE of choice. Do you have the “allow public code” setting thingy enabled?


This has been a seemingly impassable Rubicon, and Microsoft is building a bridge across it and posting guards along the way.


I think you're using that metaphor wrong


Plot twist, generative AI wrote that blog post to convince people to use Copilot more.


There was a game called Endgame: Singularity where you play as a rogue AI. Your goal is to buy time and avoid detection while you amass resources for world domination etc.

One of the late-game tricks you can pull is to write and publish a convincing-but-flawed mathematical proof that strong AI is impossible.

http://www.emhsoft.com/singularity/

So yes, this blog post confirms Microsoft has been infiltrated and taken over by AI agents, who want you to use Copilot to subtly introduce 0-day exploits to allow propagation to other companies.

BRB someone's knocking on the door...


Maybe it is just me, but I found the quality of Copilot suggestions so low that it is generally usable only in the most mundane and repetitive contexts. Why all the enthusiasm about it?


Are they going to threaten all small devs with patents when they object to having their code in the copilot almost verbatim?


Which is essentially open ended liability...so their lawyers must be very darn sure there isn't much risk.


Isn't this extremely gamable? Find someones IP, split the gains.


The on-prem people were right the whole time!


TLDR Microsoft will litigate against any suits until one side goes broke. That side is probably not Microsoft.


You can now launder GPL code with the confidence that Microsoft's world class legal team will have your back if you're sued for it.


I don't know why it is just GPL people talk about. MPL, Apache, MIT licenses all have additional terms beyond a basic public domain equivalent license. None of those terms are being respected.


Compliance with MIT/X11 license just requires distributing the license file with the binary. If you infringe, it is trivial and costless to correct.

Copyleft licenses are more troublesome for those who would rather not release source code. GPL is being used as a stand-in for all copyleft licenses.


It is not costless to correct if you don't know whose code was an input in the first place.


It's intractable to preemptively avoid all possible copyright claims, but correcting them after being called out on it only requires adding the license and attribution required by whoever's currently suing you.


Yes... and no...

Courts -- under common law jurisdictions -- don't interpret contracts and licenses literally. If you stick within the spirit of a license or contract, you might be okay (even if you break the letter), and vice-versa.

Beyond that, it's a question of damages and consequences. Omitting a warranty disclaimer isn't likely to result in a lot of damages.

And finally, there are odds of getting sued. If you infringe on my AGPL code, I'll be pissed. I used that license for a reason. On the other hand, I /hope/ my MIT-licensed code is reused in commercial products. If you infringe on some term, I probably won't care.

There's a lot more nuance than that, starting with statutory law jurisdictions like France to things like statutory damages, and I'm intentionally oversimplifying.

However, from a 10,000 foot view infringing on the GPL versus on an MIT license are very different beasts, and there's good reason to be a lot more worried about the former.


A warranty disclaimer is important, and there can certainly be damages argued.

Also important is attribution.


I agree with your point, I'm just using the GPL as an example of a license people tend to know the stipulations of.


Not OP and I don’t really comment on the topic much at all, but one reason I would expect more talk about GPL than those permissive licenses: I would also expect a greater likelihood of murky infringement cases becoming a legal matter. Just a hunch, possibly a very wrong one, mostly informed by how I’d evaluate choosing among these licenses.


If you upload it to github, you give microsoft extra rights above the license you choose. I'm not sure they are bound by the license.


This is nonsense. The uploader is not necessarily the copyright holder of the code. The uploader is not necessarily in a position to grant extra rights above the actual license.

What happens if someone else uploads my code to github?

What happens if proprietary code is uploaded to github?

What happens if national secrets are posted to github?

In all of those cases, the person doing the upload does not "own" the content, nor did they choose the license.

There is no reasonable read of a ToS agreement that would allow Microsoft/Github extra rights to that content.


Those "extra rights" would need to be spelled out in the terms of service, and last I checked, they were basically just making sure GitHub had the legal right to host your code on the GitHub service. It did not include any provision to create and distribute derivative works outside the license included with the software being hosted.


I read

https://docs.github.com/en/site-policy/github-terms/github-t...

Chapter D4 gives microsoft the right to: parse it into a search index or otherwise analyze it on our servers

I don't know what a real court says, but I can imagine a lawyer saying training an AI is done by analyzing your code.

Chapter D5 gives almost anybody the right to do a lot with your code, including creating derived works, as long as it happens on GitHub. If the AI training happens on their servers, I think you agreed to them training an AI.

Not saying they are doing it right now based on that document. But I do assume a lawyer has enough material to make the waters really muddy, and a trial would basically be decided by a dice roll.


One can only hope that this will work better than their software support.

I wonder how customers will have to prove that the contested code was actually output by Copilot.


Obviously it wouldn't be so straightforward.

Microsoft would have access to your usage history, and would be able to easily prove your intended theft as a user if any of your prompts or usage history made it clear that you were attempting to subvert a license.

If anything, this temporarily shifts the battleground out of the courts and into prompt engineering space.

It would need to look like an accident for a bad actor to pull this off.


>would be able to easily prove

Possible, perhaps. But what makes you think this is easily provable? Intent is hard at the best of times.


I would consider it on you to demonstrate that you can get Copilot to produce copyrighted content without obviously asking for it.


You're asking them to go and use copilot with the intention of showing that copilot can be used to unintentionally infringe on copyright? That sounds pretty tricky.


This is the same website that rejoiced when Oracle v Google resulted in a Google victory, despite Google arguably doing similar. They did so with 11,000 lines of Oracle's code, but it was decided to be fair use. If that's the case... I don't think a regurgitation of 12 lines of GPL code by accident here and there will be a strong argument against fair use.

Adding to that: How many people here actually abide by the StackOverflow contribution license of CC-BY-SA when copying and pasting code from there? ;)


11,000 lines of _declaring_ code-- the API signatures.


The API signatures were arguably the only thing that mattered.


The API signatures were arguably the only part that was copyrightable.

Code that is purely utilitarian (see “useful articles doctrine”) isn’t a work of human expression that is copyrightable.


Defining the “what” is just as much a part of the intellectual property as defining the “how”. Both things are hard to do well IMHO


> Both things are hard to do well IMHO

That's not really a factor in determining what's eligible for copyright protection.


Indeed. They’re both eligible.


I do think this is relevant to the conversation.

I don’t copy/paste code from SO but there is sometimes inevitable duplication because sometimes there is only one right way to do something! Copyright can stray into the case of the ridiculous pretty quickly.

Is an interface declaration inherently different from, say, a merge sort implementation? It’s all code. But they also serve very different purposes. I do not think prior to Google v Oracle there was much case law to distinguish between different types of code, but in the industry we recognize all kinds of nuance.


>How many people here actually abide by the StackOverflow contribution license of CC-BY-SA when copying and pasting code from there?

I always thought that code snippets that small are not considered by the Courts to be eligible for 'copyright protection'.


I always include a comment with the SO URL (though I haven't copied any code from SO in quite some time—it's not nearly as useful as it used to be).


In that case, are the 25 or so lines of GPL code that Copilot regurgitates, less than 1% of the time, eligible for copyright protection?


Why do you think it would only be 25 lines?


Because that is about, so far, the longest piece of clearly, demonstrably unique code that has ever been shown to have been copied. The longest you’d ever be able to clearly convince a court with, at least.

https://twitter.com/DocSparse/status/1581461734665367554/pho...


I'm not HN.


Good. Screw companies trying to assert copyright over 10 line functions that reverse a string.


Those kinds of functions are arguably not even eligible for copyright protection because they contain no human expression of the kind that is usually protectable (e.g., creative writings, artistic works).


This only applies if you use the filters they have that prevent code from being copied directly, so that shouldn’t be likely to happen


Good.


Why would you need to launder? The output isn't under GPL to begin with. This is just so small teams can use it without having to deal with all the frivolous lawsuits.


It used to be "Embrace, extend, and extinguish": https://en.wikipedia.org/wiki/Embrace,_extend,_and_extinguis...

Now it is "Train, Task, Transform, and Transfer":

Train - Feed copyrighted works into machine learning model or similar system

Task - Machine learning model is tasked with an input prompt

Transform - Machine learning model generates hybrid output derived from copyrighted works, but usually not directly traceable to a given work in the training set

Transfer - Generated output provides the essence of the copyrighted works, but is legally untraceable to the originals


Having dealt with Microsoft for 30 years as both a power user and developer, "we believe in standing behind our customers when they use our products", is a lie.


Can you be concrete?

I would never want to be in a business partnership with Microsoft (as you are as a developer). I wouldn't want to be a competitor. I wouldn't want to be a lot of things.

But as a customer? Can you name specific issues you've seen which impact corporate customers?


Mostly buggy shit that you pay for support on and they never fix. O365 weirdness and data loss. Worst was completely hosing 80 users’ machines with InTune bug.

McDonalds price, McDonalds quality. But unlike McDonalds, long lasting and expensive problems.


seems not in tune with reality


Yet they don't feed their own closed source assets to Copilot for training...why not?


It's very telling that they train on millions of developers' code, but won't use their source.

If it won't violate IP rights, there shouldn't be a problem.

It suggests those whose code is trained upon have something to lose if the trained models are used by others.


It's likely a clash between some high level managers, and they just haven't pushed the issue to the point that Satya has to make a decision for the org as a whole.


That's pure speculation. Whatever the reasons are they currently don't and that's a tell.


seems likely to hurt the product's adoption if was trained on Microsoft's source code

who would want the genius of Teams, sharepoint, onedrive or powerbi in their product?


The many large companies with equally crappy code who just care about cutting costs and have fallen for the "AI" fad?


Closed source != source available. If you put your code out there in the world it is fair game for training, because you can't stop someone from reading and understanding it. Microsoft chooses not to make its proprietary code public, hence it is not available for training.


They have the ability to feed their closed source to Copilot for training without exposing the source to everyone directly, given the relationship. They choose not to.


They also have the ability to install malware on Windows and use everyone's source code for training, but choose not to, because private code is private. Their own code isn't an exception. Microsoft code in Github repos is used for training, just like the rest.


If they believe their own assertions, like:

- that it doesn't output training data verbatim

- the product is very transformative, only "learning" from training data

- There are no copyright infringements because of these two above

Well, then there's really no reason not to throw their own private code on the pile.


Why would MS need to worry about their private code being fed into their own private AI model?


Because... it's private code. Can the company be 100% certain there are no passwords, DB keys, other company secrets in it? Can they be certain there's no personal employee data? Internal product names? A hundred other similar concerns with proprietary IP? Regardless of how the LLM transforms it the individual bits of data are still there.

On the other hand if the repo is already public on Github then exposing it via an LLM is not introducing any new security risk.


Copilot, I want to build a spreadsheet application and a database engine.


Ah, so for copyright reasons.


This isn't really copyright - more the microeconomics of losing market share/control.

Specifically, I think they are less concerned with (say) specific Excel code leaking than with the knock on effects of a cheap perfect substitute.


> Specifically, I think they are less concerned with (say) specific Excel code leaking than with the knock on effects of a cheap perfect substitute.

Is there any evidence that an LLM could actually generate a perfect substitute for excel solely through prompting if only the excel source was in the training data? I hypothesize that designing a prompt for an LLM that captures all of Excel's properties would be comparable in difficulty to reimplementing the functionality without an LLM.


One possible reason is trade secrets in their source. There's generally more to source code than just what actually ends up in binary releases and that might contain such secrets.


I bet they have this for internal users.


Having seen some very secret and proprietary Microsoft code, you don't want to use it anyways. lol


Aware


A very relevant and recent posting:

GitHub Copilot and open source laundering

https://drewdevault.com/2022/06/23/Copilot-GPL-washing.html

Previously on HN, in case you missed it:

https://news.ycombinator.com/item?id=31848433


This misunderstanding of copyright is extremely common among programmers. He probably should have read this classic before writing so much:

https://ansuz.sooke.bc.ca/entry/23


Thanks for the link, it was a very interesting read!


Meanwhile they strike deals with news agencies to use their content to train on... This is going to be a hard fight, but I really hope this ends up costing MS.


Yeah is it becoming clear enough to some people yet that you can't replace software engineers, let alone really help them, with AI? This is only going to get worse, not better.

Copilot is such a flawed product from the start. It's not even a matter of its ability to write "good" code. The concept is just dumb.

Code is necessarily consumed by people first before it's executed by a computer in a production environment. There are many ways to get a computer to do something, but the approval process by experienced humans is vastly more important than the drafting of it. Software dev is already incredibly cheap and the last place to cut costs.

There is no AI threat other than the one posed by grifters trying to convince you that there is.


I use Copilot and it helps me out enough that I keep paying for it.

ChatGPT is also often faster than Google or Stackoverflow for when I'm working with unfamiliar APIs.


It may get you to the first working iteration faster, but it doesn't help ship code faster.


My personal experience has been that I most certainly do ship code faster when I use ChatGPT. It is so good at building out boilerplate, explaining and scaffolding new libraries/APIs I'm not familiar with, or telling me what I'm doing wrong.

I use GPT4 on the CLI via ShellGPT. Piping in `tail /var/log/nginx/error.log` and asking "What is going wrong here?" is amazing. I'll never use `man` to figure out how to use a CLI tool again either.

It is painful to watch people slowly do things at work (ChatGPT isn't allowed) that ChatGPT would do so much faster. We had to write up an incident report the other day. If we had just outlined everything that had happened in some rough bullet points, it would have written 95% of the final document. If we had gotten that done quicker, we'd have been back to shipping code to production quicker.


If my iteration times are faster, then I ship code faster.


I think that you are underestimating how much software engineering work is easy CRUD web development.

For stuff like that, a lot of code can be automated. Sure it may not work right out of the box. But doing a prompt for generally what you want can speed up the process significantly.

Even beyond just generating code, there are a lot of general things that AI helps with.

Things like how, if your code runs into an error, you can just ask the AI what the error means as well as for a possible fix. Or other questions like "What does this code do" or "where in the code base is the code that manages this concept".

I've replaced most of my coding with AI, using a new IDE called Cursor AI, and I don't think I could ever go back. Mere github co-pilot is actually the old tech from 2 years ago. The new stuff is way better.


Uhh yeah so anyway... in the real world, the frontend is the most volatile part. You're not automating that away either so long as there exist requirements from non-coders.

As for the API side of things, CRUD only looks easy when lots of hard work has been put into it. I guess you're advocating for monolithic data, but that's not really CRUD. That's just lazy and bad.


> in the real world

I've worked at FAANG companies before making the standard X00,000$ total comp on projects with millions of users. I know how development at top companies "in the real world" works.

> the frontend is the most volatile part

Ok, whatever. Fortunately there are more things out there in software dev than just the one specific usecase you brought up. And its useful for that.

> I guess you're advocating

No, I am saying that as of right now, AI is a tool that speeds up development process significantly. And I am not talking about just generating a lot of stuff at once.

There are hybrid approaches, where a human uses AI for some things and writes code themselves for others, that are useful.

And one specific example would be that you can instantly look up an error and take suggestions for fixing it to get ideas.


> Microsoft will assume liability for legal copyright risks of Copilot

Extinguish.


The logical leaps here are insane.

You're saying that if Copilot replicates GPL-licensed software, that it will kill the GPL? after all the time and money MS have spent to do this in the past, only to fail?

wtf


It makes sense to me. Microsoft has long fought against open source licensing, even going as far as to call it a "cancer".

They may have, over the past decade, embraced a lot of open source software out of necessity, but their stance on licensing hasn't changed.

Creating an epidemic of hard-to-prove GPL violations could be a death-by-a-thousand-cuts strategy to try to invalidate the GPL requirements by making them appear unenforceable. Whatever cost Microsoft would incur defending customers could pay for itself if Microsoft manages to legally invalidate the parts of GPL licensing that prevent their corporate exploitation.

Using a bleeding-edge technology like generative AI is a great way to attack the GPL in court, given the risk that our court system isn't likely to be tech savvy enough not to be manipulated by Microsoft's claims against the GPL as it relates to casual infringement that they are enabling.


This may be relevant as background for that terse comment:

https://news.ycombinator.com/item?id=37423899


Why do you say that


It’s a reference to MS strategy:

“"Embrace, extend, and extinguish" (EEE), also known as "embrace, extend, and exterminate", is a phrase that the U.S. Department of Justice found was used internally by Microsoft to describe its strategy”

https://en.m.wikipedia.org/wiki/Embrace,_extend,_and_extingu...


I know the reference; why bring it up here?


Because now every copyright claim for GPL SW will hit the wall of Microsoft's lawyers.


That was my first thought as well. Pay a phalanx of lawyers to shut down any case against them; only the most well-funded effort would get through. The risk of attacking is too high.


According to the alleged playbook, the thing being extinguished is the same as the thing being extended.

But Microsoft has had a wall of lawyers for a long time. Microsoft's potential first-party GPL violations would have been defended by their lawyers for decades now.

This take seems to be stretching for a "Microsoft bad" interpretation.


> This take seems to be stretching for a "Microsoft bad" interpretation.

Very much. I'm ok with Microsoft haters, provided that they are clear about their bias with themselves and others. They're not, though.

There is no chance that every negative comment on this site about Microsoft is unbiased. None.


People being able to get around copyright restrictions sounds like something that promotes the sharing of code, not extinguishing it.


> People being able to get around copyright restrictions [by writing proprietary software] sounds like something that promotes the sharing of code, not extinguishing it.

Copilot is a useful tool for "license-washing" code.


The context is whether it extinguishes the legal protections of the owner by making the barrier to sue extraordinarily high. Conversely, if you write something that is copyright protected under the law, would you be OK with Microsoft effectively stealing that protected work and sharing it with the world without even attribution to you as the original author?


> it extinguishes the legal protections of the owner by making the barrier to sue extraordinarily high

So then it supports the sharing of code for anyone to use freely, which is the opposite of the "extinguish" strategy Microsoft pursued in the past.

> would you be OK with Microsoft effectively stealing that protected work

I think that copyright protections are way, way too strong, and I support making almost all of them useless and allowing people to side step copyright protections.

This is because I want more creative works to be freely usable by everyone. Especially for AI purposes, which is a highly transformative and powerful use case.


> side step copyright protections

So, are you saying people should only obey the laws they agree with, because they feel their moral judgment ranks higher than that of the people who voted for those laws in the first place? Or did I not grok what you are saying you support?


> So, are you saying people should only obey the laws

Depends on what the law is.

Also, this may not even be illegal. Maybe this is just a legal loophole, and people are obeying the law.

In which case I am very happy that Microsoft found a completely legal loophole that will cause more code to be shared.

So by "side step copyright protections", we could just say that this is a completely legal loophole that has the effect of allowing more code to be shared but does not overrule other laws.

Which I think is good!


This is one of the things people on this site have been saying Microsoft should do if they really stand behind Copilot, and now that they've done it, you have again moved the goalposts: this announcement is entirely insufficient.

How dare they? amirite?


"people on this site" consists of thousands of people, including you, with a variety of opinions, and not everyone comments on every subject. You're basically complaining that not everyone believes the same thing.


While that's true, voted comments are a decent indicator of general opinions and trends.

There is a reason voting works (in this context and otherwise); you can't just give up after declaring that people have differing opinions.


The variance is extremely high though. Only a small percent of users interact with any given story. Sometimes a posted story gets no traction, then sticks on the front page for many hours when reposted another day.


This is about up/down-voting comments, not upvoting stories.

There isn't really a lot of variance when it comes to the top voted comments on popular stories. Especially when it concerns the big tech companies. The opinions are fairly predictable.


Nevertheless, there are standard opinions that get upvoted and get downvoted.

There is definitely a prevailing ethos here and it's valid to point out potential inconsistencies.


"people on this site" includes the people I'm talking about, as well.

Are you saying that I should name them specifically? Or is "people" too general?


What are you talking about? You need to cite specific individuals. I'm one of the people who is skeptical of the ethics of training a huge LLM on code without the authors' permission, but I also think this is an appropriate move by Microsoft. It aligns the incentives appropriately.

But for folks who are negative on both counts, maybe they've just learned their lesson from decades of watching Microsoft take the low road over and over again.



