The complexity described seems to be resting on the unestablished idea that weights are copyrightable in the first place. If they're not, then presumably "available weights", "ethical weights", and "open weights" are all the same: open weights. Either your weights are under NDA and presumably considered to be a trade secret, or they are public, and the words in your "license" mean absolutely nothing? That seems like a rather important point to bring up when discussing the licensing landscape for weights...
> The complexity described seems to be resting on the unestablished idea that weights are copyrightable in the first place.
Yes. Weights probably aren't copyrightable in the US. See Feist v. Rural Telephone, in which the Supreme Court ruled that telephone directories are not copyrightable. The copyright clause in the Constitution ("To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.") is understood to require human authorship. The US does not have database copyright, or "sweat of the brow" copyright. That it was expensive to produce some collection of data does not make it copyrightable.
Outputs from LLMs, machine generated art, and machine generated music probably are not copyrightable either. US Copyright Office: "Based on the Office's understanding of the generative AI technologies currently available, users do not exercise ultimate creative control over how such systems interpret prompts and generate material. Instead, these prompts function more like instructions to a commissioned artist."[1]
> Outputs from LLMs, machine generated art, and machine generated music probably are not copyrightable either.
I don't have a strong sense of whether this is reasonable (I see arguments both ways) but I do think it's pretty strongly at odds with how we treat photographs. There are a bunch of photos on my phone where I unquestionably own the copyright, despite putting in much less creativity than I did for some AI images I've generated.
I don't think it's clear how to resolve this, but I do think that if we are going to protect photos and not prompted AI images, the distinction needs to turn on something other than whether "sufficient creativity" was applied to the input of the mechanical system.
Edited to add: It's probably also worth calling out that the question of whether we protect the work produced by a person's use of mechanical system is a separate one from whether we protect the work of others when it is (in various ways, to various degrees, with various likelihoods) reproduced by use of those mechanical systems.
Your prompt for the AI image generation is copyrightable.
The output is not.
The photo you take involved choices of composition and timing and equipment choice.
Just because you don't feel you put in a lot of consideration doesn't mean you didn't still make creative choices that give the resulting product copyright protection.
But if you took that photo and put it into software which made a derivative image without human creativity, then while the original image would be copyrightable the resulting derivative output would not.
The ideal path forward in copyright would be no infringement in use of materials for training and no protection in AI output with infringement possible against output too close/derivative of protected images.
It's not infringement if you learned to draw by tracing Mickey Mouse; your Gerry Gerbil cartoons are fine. But if you draw Mickey Mouse and distribute it, you'll hear from Disney's lawyers.
AI should be the same with the exception that the Gerry Gerbil cartoons would not be copyrightable.
If the prompt is copyrightable, then why wouldn't that copyright flow through to the output?
I can't legally pirate Windows just because the source code was run through a compiler. Even though the compiler itself adds no additional creativity, the underlying source code is still a creative work[0], so pirating the binaries still infringes a copyright. Just one that's in a slightly different place than what we're normally used to thinking about.
Just to drive the point home, there's a few other situations in which copyright "flows through" to things not subject to copyright. Back in the days of copyright formalities, if you published before properly registering something, your work would be born into the public domain. And this occasionally happened to serial media - e.g. someone might just forget to register the third season of a TV show. In that particular case[1], seasons one and two are still copyrighted, and because season three is a derivative work of the prior season, nobody but the original owner can actually make any use of season three. The only practical difference is that the company that owns that TV show lost one year of copyright ownership over the third season.
[0] By definitions of law. I honestly think most software shouldn't have been made copyrightable, but once Congress said "software is copyrightable" that put that question to bed.
[1] I don't remember the name of the TV show or the court case, but this IS a thing that happened and this theory IS court tested.
In general, if you pay someone to paint a picture, they own the copyright and it needs to be assigned back to you even if you give them quite specific instructions. The instructions lack sufficient control over the outcome to create some form of dual copyright.
Presumably that general rule would also prevent your instructions to DALLE from giving you copyright ownership of the output. The AI isn’t getting ownership, so it’s either in the public domain or a derivative work of the artists who created the training data.
Right. And furthermore, I don't actually think prompts alone are copyrightable in most cases - I just wanted to propose an argumentum ad absurdum. Certainly, you can't argue creativity when you're also keyword-stuffing your prompts to call up various feature sets that the model just so happens to associate with them.
The output of a compiler (i.e., a translation program) is created via a prompt (the source code). The output object code is very much copyrighted. People keyword stuff their source code all the time (pragmas) in order to influence the generated output. Why does that object code deserve copyright protection except when the compiler is an AI model (i.e., a translation program)? Compilers use genetic algorithms and weights from profiling in order to generate better output. Where does the output stop being capable of copyright protection because it's no longer "creative"?
If a museum can include a small portion of a frame around a public domain painting and claim new copyright as a result - surely any smallest spark or creative influence qualifies, including choosing a single word and choosing the model and time and which output is selected does as well.
The idea of work for hire, and the notion of copyright assignment, applies to people and not machines or processes employed in the creation of a work. Your brush manufacturer would never dare try to claim that their creative selection of fibres and thus their contribution to the unique brush patterns in your painting constitutes a creative contribution to your work. Why is a complex digital model which does the same any different?
Perhaps it is copyright as a whole that is wrong and is nothing to do with AI. This is what we get for creating imaginary property as a means to finance speculative creative endeavours in a capitalist system. So, yeah. Fun times there - once again technology challenges another economic status quo.
Multiple independently created compilers can directly translate source code to unoptimized machine code that works in a completely straightforward fashion based on the definition of the language. There’s a great deal of complexity involved in creating more optimized output, but the goal is to have functionally equivalent programs.
There’s no way to map DALLE prompts into any kind of obvious picture from the input. Even DALLE itself can produce a wide range of outputs from a single input.
There is: the input is the description of the image so produced, plus the hidden elements and parameters (randomness, etc.) that users often don't see - with these, there is a deterministic input-to-output relationship. The fitness of the model is in how closely the output matches what we expect to see given the inputs we give. That's the point of them. Models are compilers. The distinction is really only in the complexity and ambiguity of the language specifications they implement - not in any fundamental aspect of their function. There isn't a single person alive who understands how a non-trivial compiler works in its entirety, just as nobody really knows how LLMs work yet. That's not the point.
That’s not “independently created”; you’re suggesting reimplementing a process not from first principles but from the output of the process. I can make a compiler in a programming language without it being a derivative work of any other compiler.
Further, people programmed in languages before any compilers were created, and that code worked after the compilers were created.
The CPU is a compiler for programs written in the machine instruction set architecture the CPU claims to implement which happens to output real world effects just as a compiler outputs program code. So, no, you can't.
Words have meanings, and the instruction pipeline consists of electrical signals - and those early CPUs were almost all microcoded or had multiphase clocks or some other implementation abstraction which they did not expose to their architectural state... so yes, they were in a very real sense compilers.
Simply because I didn't state "for all and every" doesn't invalidate my point, nor does it support yours as true - further, "heavily favored" suffers from the same problem. The point is, there's a system which takes as input formatted in a specification (a program) and some transformed output (a set of actions to be taken or another program input for another compiler). So, there you go. If a hot dog on a bun could be considered a sandwich, then a CPU could be considered to be a compiler. shrug disagree all you like.
If you say X is Y, but it’s not true for all X, then the statement is false. I.e., “All integers are even.” is false.
As to your point, that’s not what CPUs do though: they have both a set of instructions and a set of IO with the outside world. A compiler always produces the same output from a given set of instructions, but with CPUs you can run the same code and get wildly different output due to that IO.
The only way you can call a CPU a compiler is as a subset of its capabilities. They may have internal microcode where a given instruction gets translated into a different internal representation, but that’s not the end of it: the CPU also executes that microcode.
> The photo you take involved choices of composition and timing and equipment choice.
And the AI image involved choices of prompt and model, and subsequent selection from among several generated images.
I recognize that what you said here:
> Your prompt for the AI image generation is copyrightable.
> The output is not.
... probably represents the state of the law at the moment (with meaningful amounts of uncertainty), but I don't think there's a principled difference based on the amount or nature of creativity involved. IMO the equivalent would be "you own the specification of (position, equipment, relevant world state) but not the photo" which obviously doesn't do anything we want for photography. And I guess that's a part of my point. We should pick the policy we want to make sure we capture the incentives we want. Maybe it is best that AI assisted art (past some point?) not be copyrightable. But I don't think basing the distinction on the amount or nature or... propagation (I guess?) of creativity makes any sense in distinguishing flippant and bullshit photographs (at least a third of my photos, although I would hesitate to apply the labels to any particular photo by someone else) from prompt-driven generative works.
> then while the original image would be copyrightable the resulting derivative output would not.
Excellent! I'll put the Inheritance Cycle through a synonymiser, and have a copyright-free (if somewhat degraded) version. Take that, Christopher Paolini!
… wait.
What you say might well be correct: the law is often foolish. But I'd imagine the creativity-free derivative work still counts as a derivative work of the original, copyright-eligible work.
Yes, it'd be a derivative work owned by Paolini. Paolini would have copyright to the derivative work, to the extent that he has rights over derivative material. The prompter, however, would have nothing.
If I write a poem and put it into Stable Diffusion, how is what is produced not a derivative work of my poem? We can argue that it's a derivative work of many other things, but that doesn't make it not a derivative work of the poem.
One way it might not be is if Stable Diffusion is seen more like a hash algorithm than a synonymiser. But I don't see why it should be, because there's a meaningful correspondence between the input and the output of the system.
On photography, the argument was condensed into "who pushed the button". We saw it in the monkey self-portrait copyright fight, where copyright was not granted to the photographer, and in other nature photography using photo traps, where the copyright stuck with the human basically because they were the last operator of the camera.
The interesting part is, those controversial cases are pretty recent when the art of photography is a century (centuries?) old now. I wouldn't expect super clear guidelines regarding AI art before a few decades of weird cases fought tooth and nail in court.
As discussed in Section 306, the Copyright Act protects “original works of authorship.” 17 U.S.C. § 102(a) (emphasis added). To qualify as a work of “authorship” a work must be created by a human being. See Burrow-Giles Lithographic Co., 111 U.S. at 58. Works that do not satisfy this requirement are not copyrightable.
The U.S. Copyright Office will not register works produced by nature, animals, or plants. Likewise, the Office cannot register a work purportedly created by divine or supernatural beings, although the Office may register a work where the application or the deposit copy(ies) state that the work was inspired by a divine spirit.
Examples:
• A photograph taken by a monkey.
• A mural painted by an elephant.
...
Similarly, the Office will not register works produced by a machine or mere mechanical process that operates randomly or automatically without any creative input or intervention from a human author. The crucial question is “whether the ‘work’ is basically one of human authorship, with the computer [or other device] merely being an assisting instrument, or whether the traditional elements of authorship in the work (literary, artistic, or musical expression or elements of selection, arrangement, etc.) were actually conceived and executed not by man but by a machine.” U.S. COPYRIGHT OFFICE, REPORT TO THE LIBRARIAN OF CONGRESS BY THE REGISTER OF COPYRIGHTS 5 (1966).
I'm not sure I understand the "source" term here. All the AI images I've seen so far were generated by humans using software tools like neural networks.
This is mostly right - it depends on what the weights represent and how they were generated, so I would not go as far as the initial claim.
A collection of numbers is copyrightable if it's the encoded result of a creative process. Just because it's represented as a bunch of numbers does not make it non-copyrightable. That's why it says "original works of authorship fixed in any tangible medium of expression, now known or later developed, from which they can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device."
You can't just classify the weights as facts simply because they are numbers. If they are creatively made by a human, they would be copyrightable. Mechanically computed from random numbers? No. Somewhere in the middle? Harder.
I'm not a lawyer, but it seems like you stood up a straw man there.
>Just because it's represented as a bunch of numbers does not make it non copyrightable.
Can you give an example of where the bunch of numbers is copyrightable when it's not just a numeric encoding of something that was already copyrightable? Taking music and encoding it as a wav file is not a creative work, but it's a representation of a copyrighted work.
Maybe you could create a long list of numbers and call it an artistic impression, but that's clearly not what AI weights are. I'm interested to hear an example of your copyrightable numbers.
"Can you give an example of where the bunch of numbers is copyrightable when it's not just a numeric encoding of something that was already copyrightable?"
Sure, there are "poems" that consist of just groups of numbers that are copyrighted. They are not encodings; it's just a string of numbers, indistinguishable from any other bunch of numbers. This is just one example; there are lots.
They are enforceable to the degree it's creative, and to the degree the infringing use is also creative.
So you would not be able to sue me for using those numbers in a math equation. You would be able to sue me for reproducing your poem in a book of poems :)
As Feist says, the creativity required for copyright is quite minimal.
But it's still only as protectable as it is creative.
Look - AI is not the first thing to have this "issue". The answer remains the same as it always was - it's mostly about the process not the output.
The output mostly matters if the output is not intended to be creative (or it's de minimis, or ...).
Copyright as it currently exists is weird.
Like if you go to the copyright office and try to register your ssh public key and say "this was generated by ssh-keygen i had nothing to do with it" you may get a different result than if you said "this is my new visually stunning masterpiece, my ssh public key, which was generated with computer help but I used 37 precisely timed keyboard smashes to do it. Prints are available from my gallery for $500"
I fully agree with what you say, with one bit of nuance to point out:
> Like if you go to the copyright office and try to register your ssh public key and say "this was generated by ssh-keygen i had nothing to do with it" you may get a different result than if you said "this is my new visually stunning masterpiece, my ssh public key, which was generated with computer help but I used 37 precisely timed keyboard smashes to do it. Prints are available from my gallery for $500"
The important thing, of course, isn't whether the copyright office denies to register your copyright, but instead what courts will ultimately do when you attempt to enforce your copyright.
We know the current administrative algorithms used by the copyright offices. We have less clarity on what courts will ultimately do.
The key factor of Feist v Rural is whether there was any original or creative process in the way the facts were arranged.
Here, there's a whole lot of creative decisions in labelling and guiding of training that produces the weights, so it's reasonable to think it might be copyrightable.
That is, the numbers are a whole lot more original than the issuance of phone numbers or part numbers.
The requirement for expertise doesn’t imply that setting up parameters for training an AI is necessarily copyrightable. A normal brick wall, for example, needs skill to create but doesn’t qualify because the goal is not creative. If so, the mechanical output of a process that doesn’t qualify for copyright is not going to qualify either.
Labeling training data may qualify for copyright, but if the underlying training data doesn’t taint the output as a derivative work then labeling isn’t going to qualify by itself.
Thus without some new and very generous interpretation AI companies are at best not going to benefit from copyright and at worst may be forced to create all training data in house. My suspicion is this generation of AI companies are in a very difficult situation.
> but if the underlying training data doesn’t taint the output as a derivative work then labeling isn’t going to qualify by itself.
It depends. If each individual training item has a small impact on the output coefficients, then perhaps it's not a derivative work of them. But if there's a large creative process in determining model training procedure, deciding labelling strategies, and applying those-- perhaps those numbers are strongly derived from those things.
That sounds like wishful thinking, individual training items have significant impact on the result.
Anyway, suppose you’re building an AI to walk, there’s nothing creative about selecting 9.8m/s/s for gravity that’s simply the ideal value to achieve a desired goal. Labeling an elephant as “Elephant” rather than “coat hanger” is similarly a functional choice.
Just because a person is holding a camera and taking a photo doesn’t mean the result is copyrightable.
> Anyway, suppose you’re building an AI to walk, there’s nothing creative about selecting 9.8m/s/s for gravity that’s simply the ideal value to achieve a desired goal.
Suppose you're not building a strawman, but instead building an AI to be an LLM. The exact sequence of what you choose to do for instruction tuning, and the metrics and labels that you choose, the prompt/response pairs you write, and the loss functions you employ are quite creative. They greatly affect the coefficients and are not simple mechanical steps and are the result of a large amount of creative choice.
We are nowhere near a point where they are an uncreative, mechanical recipe to follow.
> Just because a person is holding a camera and taking a photo doesn’t mean the result is copyrightable.
No, but in the overwhelming majority of circumstances it is. What it depends upon is whether the person holding the camera is making a significant, original creative choice.
I am not sure what courts will decide, but I am certain that there is more creativity and originality employed than you are giving OpenAI et al. credit for.
> not simple mechanical steps and are the result of creative choice
Creative choices require intentional control over the output across a meaningfully different range of viable possibilities. A bricklayer has a huge range of viable options in the specific brick and its alignment in a wall, but none of those choices are artistically meaningful.
The coefficients are also not in any meaningful sense chosen based on instruction tuning. It’s no more under direct control than the specific arrangements of atoms in the brick wall and is instead the output of a purely mechanical process.
> We are nowhere near a point where they are an uncreative, mechanical recipe to follow.
Thus: We are nowhere near the point where the output is under creative control rather than being the result of a poorly understood mechanical recipe.
> Factual compilations, on the other hand, may possess the requisite originality. The compilation author typically chooses which facts to include, in what order to place them, and how to arrange the collected data so that they may be used effectively by readers. These choices as to selection and arrangement, so long as they are made independently by the compiler and entail a minimal degree of creativity, are sufficiently original that Congress may protect such compilations through the copyright laws. Nimmer ss 2.11[D], 3.03; Denicola 523, n. 38. Thus, even a directory that contains absolutely no protectible written expression, only facts, meets the constitutional minimum for copyright protection if it features an original selection or arrangement.
Alphabetical order wasn't quite enough. But people directing the work that produces the coefficients are doing considerably more creative work than that.
> Thus: We are nowhere near the point where the output is under creative control rather than being the result of a poorly understood mechanical recipe.
No one requires complete creative control of the output. I can spatter paint and have relatively poor control of what's happening, but I am certainly generating a copyrightable work when I engage in creative choices as part of this.
> Alphabetical order wasn’t quite enough, But people directing the work that produces the coefficients are doing considerably more creative work than that.
I agree it’s more effort, but the metric isn’t effort, so I disagree that this qualifies the coefficients as copyrightable. The SHA256 hash of a movie isn’t copyrightable even though the movie itself was.
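The hash comparison can be made concrete with a few lines of Python (an illustrative sketch; the byte string here is a stand-in for the movie file, not real data):

```python
import hashlib

# A cryptographic hash is a purely mechanical function of its input:
# the same bytes always produce the same digest, with no room for
# authorial choice anywhere in the computation.
movie_bytes = b"stand-in for the bytes of a two-hour film"
digest = hashlib.sha256(movie_bytes).hexdigest()

# Re-running the computation yields the identical digest.
assert digest == hashlib.sha256(movie_bytes).hexdigest()
```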
> No one requires complete creative control
That’s a strawman, there are requirements for creative control. You don’t own copyright to your normal dumps, but you can get copyright from looking down and selecting to take a picture. That’s the low bar for a creativity requirement, but it exists.
> I agree it’s more effort but the metric isn’t effort so I disagree that qualifies the coefficients as copyrightable.
I know the metric is no longer effort. But there's a lot of creative choices that I've mentioned that greatly affect the coefficients, even if we don't know what those creative choices are going to do to each film grain in the photograph or coefficient in the matrix.
> You don’t own copyright to your normal dumps
Yes, there's an explicit exemption in LOC's guidelines for things that are the direct output of natural processes.
If you have a lot of choices affecting output, then the output is subject to copyright. Indeed, the Supremes above said that factual compilations can qualify if they involve a "minimal degree of creativity".
IANAL, but I’d wonder whether ‘creativity’ is really present in labelling - and indeed, mightn’t it be the last thing you want? I’d argue labelling should be strictly factual and reproducible, and ideally following a logical structure… maybe akin to how addresses of buildings might appear in a phone directory…
(Agree that the skill in knowing how to code and guide the training of a model is probably very different though. It’s not just access to compute time that separates me from OpenAI :) )
> Here, there's a whole lot of creative decisions in labelling and guiding of training that produces the weights
Often, labelling is part of large public datasets that are chosen for use for that exact reason, and/or is otherwise not the work of the party claiming copyright in the model.
> Can you give an example of where the bunch of numbers is copyrightable when it's not just a numeric encoding of something that was already copyrightable? Taking music and encoding it as a wav file is not a creative work, but it's a representation of a copyrighted work.
There are random number books and I know one of them has a copyright registration [1].
Below is an implementation of Marsaglia's invention, from p. 348, courtesy of the infamous NR[1]. It's an MWC (multiply-with-carry) random number generator, with two parameters: multiplier a and base b=2^32.
---
For a, "The values below are recommended with no particular ordering."
ID a
B1 4294957665
B2 4294963023
B3 4162943475
B4 3947008974
B5 3874257210
B6 2936881968
B7 2811536238
B8 2654432763
B9 1640531364
---
As we all now know, the whole thing is copyrighted - you can't redistribute that code and can't use those specific numbers to generate random numbers without purchasing a license, which only allows you to use it once on your personal machine; that's why GSL[2] exists. The pseudorandom numbers you would get from MWC using the above values are also copyrighted, since they are work-product.
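For the curious, the generator described above can be sketched in a few lines of Python. This is an illustrative sketch of the multiply-with-carry scheme using the "B1" multiplier from the table; the seed values are arbitrary choices of mine, not from the book:

```python
# Multiply-with-carry (MWC) generator with base b = 2**32 and the
# "B1" multiplier from the table above.
A = 4294957665   # multiplier a (row B1)
B = 2 ** 32      # base b

def mwc(x, c):
    """Yield 32-bit pseudorandom values from seed state (x, c)."""
    while True:
        t = A * x + c    # one multiply-with-carry step
        x = t % B        # low 32 bits: the output and the new state
        c = t // B       # high bits: the new carry
        yield x

gen = mwc(x=123456789, c=362436)
sample = [next(gen) for _ in range(5)]
```

Note that the output is fully determined by the multiplier and the seed, which is exactly why the "copyrighted pseudorandom numbers" claim above is being mocked.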
I would say that is uncertain. Model weights are always going to effectively be a huge collection of statistics about the training corpus. Unless you are envisioning artisanal, hand-crafted, free-range model weights where a person used a non-mathematical method to purposely and creatively choose each one?
IANAL, but weights are probably not copyrightable because they are “useful”, i.e. they have intrinsic utilitarian function: they are necessary for a model to function as described. Useful things cannot be copyrighted, only patented. This is the key difference between the two. For works that are both useful and creative, only the non-functional aspects can be copyrighted.
Maybe a published copy of the weights might be copyrightable, in the exact form of a “creatively” ordered listing, but the weights themselves would almost certainly not be if the US judicial system rules consistently.
This bypasses the entire argument of whether it is human authorship as weights themselves in bulk are just straight up non-copyrightable regardless of origin according to this reading of the law and precedent.
> Weights probably aren't copyrightable in the US. ... is understood to require human authorship.
Are you arguing here that because the weights come from an optimization program, they are not "human authored"? If so, I find that a strange assertion. If I'm working every day on my model and training algorithm to ensure it produces the best weights possible to solve my problem, I would be very surprised if someone told me I have no ownership of those weights because they were generated by a program I wrote and data that I own.
In addition to what you mentioned, I'd like to add that photos are copyrightable by the photographer, who merely tweaked some "hyper-parameters" (exposure, ISO, aperture, whatever) and decided when and where to press a button. If such "minimal" amounts of authorship are considered sufficient, I won't be as confident as GP in claiming that training an LLM, which involves orders of magnitude more hyperparameters, labor, and capital investment, is uncopyrightable due to a lack of human authorship.
It might end up uncopyrightable due to other reasons, but probably not this.
Btw, the phone directories are quite different -- they're just compilations of raw factual public domain information. The LLM weights are anything but. In fact, one of the leading theories as to why LLMs are not copyrightable is that they infringe upon the copyrights of the source training materials. (I also don't want to guess whether that argument holds.)
But is it enforceable? Companies can put in contracts all kind of nonsense, it doesn't mean all of it is unconditionally enforceable, right?
I.e., if somebody creates a company that sells milkshakes and they say you can't use them to feed employees of competing milkshake companies - it wouldn't fly, would it?
Perhaps you're thinking of the EULA kind of situation. EULAs are probably not enforceable because they require the user to agree to additional terms after the main contract of sale is complete. It's considered a one-sided agreement because the user doesn't get anything in return (the contract of sale of the software already grants the right to use the software).
For LLMs it really depends on the situation. If it's presented in a EULA scenario, where you already bought the rights to use the LLM and ClosedAI gave you the EULA with additional terms afterwards, then the logic above applies. But then everyone knows EULAs aren't very enforceable, and nobody buys packaged software any more, so this scenario is quite unlikely these days.
So, if the clause is just one of the many conditions in their main contract of service, of which you had ample opportunity to review before purchasing/agreeing to use their service, then as long as the terms are legal (eg. don't contradict some law), parties are generally free to agree to whatever they want in a contract, and courts will generally uphold those terms.
"can't use them to feed employees of competing milkshakes companies" is probably enforceable. Sounds silly, but I can't think of any reason why it wouldn't be upheld. Unless there's antitrust factors involved.
"Can't use output of their API to train competitive models" is most likely enforceable. Unless there's antitrust factors involved. These kinds of terms are pretty common too. Nobody seriously thinks they're unenforceable per se.
Of course there are practical barriers to enforcing a contract -- the aggrieved party has to discover the breach, gather sufficient evidence, and file a lawsuit. As an average Joe individual, you're probably not worrying about getting sued by a company for trivial breaches of service agreements. Most likely the service provider will just cut off the service instead of spending thousands of dollars tracking you down (and risk taking a PR hit for going after the little guy). But between businesses, the risk of getting sued by a competitor is real, and no sane lawyer would advise the business to ignore such contract terms.
(Btw, I am not a lawyer. I've studied these things a bit though.)
Would strongly depend on the contract. It probably wouldn't fly in a post-sale terms-of-service agreement, but you'd likely be in breach of a normal contract, yeah.
Exactly. A lot of the difficulty here is how they skip the hugely important issue:
An entirely reasonable, if not fully tested, statement is the following:
Every single one of these AI weight things is itself the result of unencumbered, massive, law-breaking, right-violating copyright infringement -- accordingly, it's extremely difficult to say anything morally justifiable or authoritative about anyone else's "rights" downstream, and trying to inject the word "ethical" makes the whole thing even more ridiculous.
> is a result of unencumbered, massive, law-breaking, right-violating copyright infringement
Why? Copyright covers expression, not information; AIs can learn information from any source regardless of copyright. They should just not regurgitate copyrighted content, that's all. And much of the organic content online is common knowledge, and thus can't be copyright-controlled.
These are not bad arguments, but I don't think they're conclusive. I am a lawyer, and I could absolutely see this going the other way. "You can't make these machine things without literally feeding this copyrighted information into them, therefore they do contain a copy. You can see this when they reproduce things, e.g. the 'Getty Images' situation."
*this is not legal advice, dangit commenter person below
> you can't make these machine things without literally feeding this copyrighted information into them, therefore they do contain a copy.
They don't necessarily. Think about it: you can take some copyrighted material and transform the information contained in it (for instance, a fictional book). You can then write a summary. The summary contains information that was present in the original, but it has been transformed and hence is not a copy. The ML model contains information that has been generalized to some degree. So it's a grey area IMO.
I'm not saying that "they do or don't objectively" because that doesn't matter as much as people think it does. I'm thinking of what a "jury" COULD decide. I think average joe on a jury is very likely to see that process as "feeding them in."
Moreover, you are clearly not in violation of copyright if you are talking about statistics about the material. In your example, printing out "there were 7000 instances of the word 'the'" is certainly not a violation. A ML model is just a huge pile of these statistics.
However, saying "the first word of the book is 'The'" would not be a violation, while repeating that for every word in the book, as a whole, would be one.
I agree with you but I think it's important to have some nuance. Imagine I build a statistical model for 10-word sequences (10-grams) and then I trained it on a single book. I probably could pick some starting words and get most of the book back from the "statistics" I compiled. If I trained the same model on a giant dataset, the one book would just contribute to the stats.
All that to say: the models have the potential to memorize, but generally they don't, and when they do it's an undesirable failure mode, not deliberate copying.
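The parent's n-gram point can be sketched concretely. Below is a toy next-word model (using 4-grams instead of 10-grams for brevity; the "book" is an invented placeholder sentence, not real copyrighted text). Trained on a single text, nearly every prefix has exactly one observed continuation, so "sampling from the statistics" just replays the source verbatim:

```python
import random
from collections import defaultdict

def train_ngram(text, n=4):
    """Map each (n-1)-word prefix to the words observed after it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - n + 1):
        prefix = tuple(words[i:i + n - 1])
        model[prefix].append(words[i + n - 1])
    return model

def generate(model, prefix, length=20):
    """Sample a continuation; stops when a prefix was never seen."""
    out = list(prefix)
    for _ in range(length):
        nxt = model.get(tuple(out[-(len(prefix)):]))
        if not nxt:
            break
        out.append(random.choice(nxt))
    return " ".join(out)

# A single, short "book" (invented stand-in text).
book = ("the boy who lived had never even heard of the castle "
        "until the letter arrived on his eleventh birthday")
model = train_ngram(book, n=4)

# With one book, every 3-word prefix has a unique continuation,
# so generation replays the source word for word.
print(generate(model, ("the", "boy", "who")))
```

With a large corpus, the same prefixes accumulate many competing continuations, and any single book's exact wording dissolves into the aggregate, which is the parent's dilution point.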
I like this argument a lot; but again -- how does this play out in the real world? It's pretty easy to see what will happen in real life. Think, e.g., Batman. I could write a very new and original "Batman" comic that doesn't strongly resemble anything -- movie, toy, comic, whatever -- that exists, but would be recognizable to fans.
Once it starts doing well, will DC come after me? You bet.
These models can definitely be used to intentionally store and recall content that is copyrighted in a way that's not subject to fair use. (eg: trivially, I can very easily train a large model that has a small subnetwork which encodes a compressed or even lossless copy of a picture, and if I were to intentionally train a model in that way then this would be no less a copyright violation than distributing a JPEG of the same image embedded in some large binary).
But also, an unintentional copy of a copyrighted image is not a violation of copyright. (eg: an executable binary which happens to contain the bits corresponding to a picture of Batman -- but which are actually instruction sequences and were provably not intended to encode the picture -- clearly doesn't infringe.)
LLMs are somewhere in-between #1 and #2, and the intent can happen both in the training and also the prompting.
Stack on top of this the fact that the models can also definitely generate content that counts as fair use, or which isn't copyrighted.
It's the multitude of possible outputs, across the copyright spectrum, combined with the function of intent in training and/or prompting, which make this such a thorny legal issue for which existing copyright statute and jurisprudence is ill-suited.
Taking your Batman example: DC would come after you for trademark as well as copyright, and the copyright claims would be very carefully evaluated with respect to your very specific work. But here we are talking about a large model that can generate tons of different work which isn't subject to copyright or which is possibly fair use.
I don't think that existing jurisprudence (or even statute?!) can handle this situation very well, at all, without tons of arbitrary interpretative work on the parts of juries/judges, because of the multitude and vague intent issues described above.
(...Also presumably the merits of the DC case wouldn't matter because your victory would be Pyrrhic unless you are a mega-corp. Which from a legal theory perspective is neither here nor there, but from a legal practicality perspective may inform how companies go about enforcing copyright claims on model weights/outputs.)
Anyways. I think we have a right mess on our hands and the legislature needs to do their damn jobs. Welcome to America, I guess :)
Honestly, your second to last sentence is literally the kind of thing I hate hearing most from non-lawyers; the whole "if the legislature were just smarter" thing is just a weird pie-in-the-sky concept that is more-or-less like saying "the world would be better if CEOs were less greedy."
Like, yes, but it's not very likely to happen and it's not a particularly horrible thing if it doesn't; the law is slow and little-c conservative and you're just expecting it to be something it MOST often just ain't.
Let's say you take the harry potter books and create a spreadsheet with each word in it as a column, and the number of times that word appears. Would that violate the copyright? I'd be interested in the rationale if someone thinks it would.
If your table was the number of times a word was followed by a chain of other words, that would be a closer comparison to AI weights. In that case it would be possible with reasonable accuracy to reconstruct passages from the harry potter books (see GitHub Copilot).
The copyright aspect makes more sense when you start thinking of AI training models as lossy compression for the original works. Is a downsampled copy of the new Star Wars movie still protected under copyright?
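The difference between the two tables can be made concrete with a toy sketch (the passage below is an invented stand-in, not text from any actual book): a bare count table destroys word order, while a "which word follows which" table preserves enough order that the passage can be walked back out of it:

```python
from collections import Counter, defaultdict

# An invented stand-in passage (not from any real book).
text = "it was the best of times it was the worst of times"
words = text.split()

# Table 1: bare word counts. Pure facts; order is destroyed and
# the passage cannot be recovered from it.
counts = Counter(words)

# Table 2: which word follows which, in order of occurrence.
followers = defaultdict(list)
for a, b in zip(words, words[1:]):
    followers[a].append(b)

def reconstruct(start, table):
    """Replay the successor table, consuming entries in order."""
    out = [start]
    while table[out[-1]]:
        out.append(table[out[-1]].pop(0))
    return " ".join(out)

rebuilt = reconstruct("it", followers)
assert rebuilt == text  # exact recovery from "just statistics"
```

The count table is a pile of facts; the successor table, consumed in order, reconstructs the passage exactly, which is what makes it feel closer to a lossy copy than to mere statistics.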
Just tabulating the word counts would not violate copyright as it is considered facts and figures.
It resembles lossy compression in some ways, but in other important ways I think it doesn’t?
Like, if one has access to such a model, and doesn’t count it towards the size cost of a compression/decompression program nor as part of the compressed size of the compressed images, then that should allow for compressing images to have substantially fewer bits than one would otherwise be able to achieve (at least, assuming that one doesn’t care about the amount of time used to compress/decompress. Idk if this is actually practical.)
But unlike say, a zip file, the model doesn’t give you a representation of like, a list of what images (or image/caption pairs) it was trained on.
Or, like, in your analogy with the lower-resolution movie: it still tells you how long the movie is (though maybe not as precisely due to a lower framerate, but that's just going to be off by less than a second, unless you have an exceedingly low framerate, at which point it's hardly a video).
There is a sense in which any model of some data yields a way to compress data-points from it, where better models generally give a smaller size. But, like, any (precisely stated) description counts as a model?
So, whether it is “like lossy compression” in a way that matters to copyright, I would think depends a lot on things like,
Well, for one thing, isn’t there some kind of “might someone consume the allegedly infringing work as a substitute for the original work, e.g. if cheaper?” test?
For a lower resolution version of Star Wars movie, people clearly would.
But if one wanted to view some particular artwork that is in the training set, I would think that one couldn’t really obtain such a direct substitute? (Well, without using the work as an input to the trained model, asking it to make a variation, but in that case one already has the work separate from the model, so that’s not really relevant.)
If I wanted to know what happened in minute 33 of the Star Wars movie, I could look at minute 33 of the compressed version.
What is a 'copy'? Byte-accurate, or 'something with general resemblance'? Would a badly compressed "copy" of a copyrighted image still be 'a copy', or would it be some other thing? Would low-quality image compression be enough to skirt around copyright claims? Image formats and viewers just 'reproduce' an impression of the original data from derived compressed data. That is also just 'information that's been generalized by some degree', for space-saving purposes and so on. So what if image generators could be thought of as a 'very good multi-image compression algorithm' that can output multiple images, to a 'somewhat recognizable degree'?
Badly compressed still counts.
I think if the data allows you to reconstruct a recognizable recreation of the original work, you have a good chance of it being considered a derivative copy.
A mono audio version of Star Wars, compressed down to 320x240, filmed from the back of a theater on a VHS camera, converted to Video CD, would under any reasonable interpretation be just a copy of the original.
I assume it starts getting murky when there's some sort of transformation done to it. What if I run motion capture on it, and use that motion-capture data to create a cartoon version of Star Paws (my puppies-in-space epic)?
What if I do a scene for scene recreation as the animated cartoon (removing any mentions to copyrighted names -- Luke Skywalker is now Duke Dogwalker, for example)? In this case, there's been no actual data transfer -- all the sprites are hand drawn, backgrounds etc.
What would be an interesting exercise would be to try and create a series of artifacts that each on their own are considered non-derivatives, but can be used together to reconstitute the original. For example, create a compression method that relies heavily on transforms / macroblocks, but strip out any of the actual pixel data from the film. That info might be supplied as palette files which are themselves not really copyrighted data, but together with the compressed transform stream can be used to recreate the original video.
This is a great example. Summarizing or paraphrasing copyrighted content, or simply using it as a seed to generate input-output pairs - this kind of data transformation prior to training could solve the issues with copyright. It cleanly separates form from content.
> It's just a race for which test case gets to the supreme court first really...
Not really for practical purposes. In the long term, the Supreme Court can and does overrule its own precedent, so the first case on the specific issue to get to the Supreme Court doesn’t end the discussion.
In the short-term, cases get resolved by lower courts and parties either lack funds to do the maximum level of appeals, or the Supreme Court chooses not to hear appeals (they tend to prefer an issue to be well-developed with circuit case law, often waiting till there is a conflict between the Circuit Courts of Appeal, before taking it up), so the state of the law prior to any specific ruling on the narrow topic by the Supreme Court matters quite a bit.
...and the Supreme Court could rule however it likes. It doesn't matter what anyone else says, or what any law says, what any lawyer or other judge says.
They could be completely biased, could completely ignore everyone and everything else and rule however they want.
I'm almost surprised they still bother to write any kind of "legal reasoning" in their ruling and don't simply focus on what the ruling is rather than why they ruled that way. But I guess such "reasoning" still serves a propaganda purpose and still provides a fig leaf for those who still believe in the quaint absurdity that "we are a nation of laws, not men."
Supreme court precedent seems to impact a lot of decisions...
Plenty of companies who have legal teams will keep an eye on the legal landscape of court decisions, and use them to decide if our T&C's or contracts need rewriting, or if any precedent puts us at legal risk.
Sure - the supreme court could overthrow its precedent anytime, but until it does, a lot of people will act as if what they say is the law.
Yeah I didn't even think that was controversial. I'd always been taught that copyright and patents exist to explicitly restrict what people can do by granting a monopoly to the owners in order to encourage invention and creative work.
Edit to add I'm not saying I agree with the justification or am trying to argue for it, only that the point above is commonly raised as the justification, implying that the intrusion on a person's rights is known and accepted.
Natural rights are a fiction to pretend that someone’s moral code is a privileged aspect of physical reality in a way every competing moral code is not.
That's going a bit far. They're just fictions that are privileged over certain other fictions--it's like how you can often cast magic missile in D&D but you can't usually cast expelliarmus, it comes down to which fiction we agree to inhabit.
Even so, you can ask whether a given moral code is more principled than another (e.g. in the sense of having some algebraic structure), and use that to investigate what might be considered "more natural". For example, one might argue that if a "natural" right exists, then it ought to be symmetric under exchange of humans (or sentient beings or whatever). It's then "more natural" to conclude that you have a right to perform actions that have no interaction or consequences for other humans (e.g. to sing a copyrighted song to yourself in an empty room or downloading a song that you already have on CD but don't feel like ripping yourself) than those that do (e.g. taking food from someone because you'd otherwise starve).
> Even so, you can ask whether a given moral code is more principled than another
What does “more principled” mean of a moral code? How does one quantify “degree of principledness”?
> and use that to investigate what might be considered "more natural".
What does the preceding (being “more principled”) have to do with being “more natural”? And what significance does being “more natural” have?
And none of that has any relevance to what is usually described as "natural rights"; it's like taking existing words, coming up with entirely novel meanings, and then building a whole architecture around them, which is pretty advanced equivocation.
When the web was young, there was a lot of information considered "public" like criminal record, marriage records, birth certificates, property records, etc. But those were still fairly veiled because of the amount of effort required to see them. Suddenly these were getting blasted all over the internet because now that was an easy thing to do, and everyone had to rethink what "public" meant.
I suspect we're going to see the same kind of rethink about intellectual property in the age of AI.
Copyright is for things that are the result of human creativity. If the weights come from running an algorithm on a training set (that one does not have a copyright to) then how can the weights then be copyrightable? They might be a derivative work, but that just means they infringe copyright, not that they are copyrightable themselves.
Note that the requirements for copyright are not consistent between nations.
The US has the "threshold of originality" as its principle. Under that doctrine, it requires some human (and this has been emphasized many times over the years) originality in order for something to be copyrighted. It's a low bar for how original it needs to be, but it must be human (monkeys taking selfies are not human).
> Under a "sweat of the brow" doctrine, the creator of a work, even if it is completely unoriginal, is entitled to have that effort and expense protected; no one else may use such a work without permission, but must instead recreate the work by independent research or effort.
The definitive case for this in the US that set the two apart is Feist Publications, Inc., v. Rural Telephone Service Co. ( https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._R.... ) where it was deemed that a telephone directory is not copyrightable in the US as there is no originality in it... but under the sweat of the brow doctrine it would have been.
So the "[c]opyright is for things that are the result of human creativity" gets an "it depends" and it would be curious to see if companies that are firmly in the "models are valuable" camp go to the UK for what I believe would be a more favorable copyright protection.
... However there are other IP laws around trade secrets that may be better for it in the US (I'm not as familiar in that domain - I would be curious to find out).
A photo is presumed to be copyrightable. Even horrible photos taken by somebody without any aesthetic sense are presumed to be copyrightable. The argument (AFAIK) is that the photographer chooses the time, location, object, and tweaks various settings of the camera (exposure, aperture, etc.), and these choices are considered sufficient for a photo to be copyrightable.
How about LLMs?
The hyperparameters of LLMs are hugely important in training LLMs, as is the choice of source training data. To me the "degrees of freedom" (and hence room for "creativity") in training LLMs are larger than that of a photographer taking a photo. And as of today, training a good LLM is probably objectively harder than taking a good photo, even if we forget about hardware costs for a moment.
It's easy to convince judges and juries that copying phone numbers into a phone book doesn't require human creativity. But we're talking about the most bleeding edge tech companies producing a bleeding edge new product here. I think it's going to be really hard to convince judges and juries that making this new shiny thing doesn't require human creativity. Maybe in say 20 years when even a 10 year old can train a LLM the situation might change, but as of today, quite unlikely IMHO.
The answer would be if the weights are transformative enough, and the copyright would come from the person who decided what images to include in the training set.
The act of choosing to place images in a certain arrangement, such as a collage, can be copyrightable. The same could be said for the "act" of choosing what images to include in a training set and which parameters to use to train the model.
The legal system is not like a computer program. The line between what is "creative" and what is not isn't concrete; it is up to the interpretation of the judge who rules on it.
So your phonebook modifications may or may not be considered "creative" depending on the judge and your ability to convince them. The more you modify it, though, the more likely you are to convince a judge it is a creative work.
It would depend on how transformative the work is.
There is in fact a whole art form where people cut out words from different newspapers and books, for example, and re-arrange those words to form new and interesting art.
So there are ways in which such a work would be a creative work, and ways it which it would not, and it would depend on the particular instance and example.
It seems very difficult to ensure that a model will never output any of the copyrighted content that it was trained on. I can only think of three ways, but perhaps there are others
1. Evaluate every output from the model to ensure that none of the outputs are copyrighted
2. Evaluate every input to a model to ensure that the inputs are either not copyrighted or properly licensed
3. Change the definition of copyright so that ML models can do whatever they want
Nobody is doing #1, because that makes the business models not work. Established brands (like Adobe) are doing #2. I get the feeling that there are a lot of ML startups that are hoping that #3 will happen, but it seems unlikely
Ensuring a model never outputs copyrighted content is unimportant and tangential. It's irrelevant. You don't look for a way to make humans output no copyrighted content, you address each time they do case by case.
A model training being rendered fair use doesn't mean any of its output can be used for whatever regardless.
I think when GP says "address each time case by case", they mean "you sue them when they infringe", instead of "this human has an illegal brain because it remembers Taylor Swift's songs".
PS: your "#1" is really hard to do and I'd guess it is infeasible. Even Google (esp. Youtube) with their vast data capabilities, often gets it wrong.
My issue with this take is that machines are not people. We only have lax rules for humans precisely because they are humans, not on the basis that they can learn. Copyrighted works are produced for people and given how human learning works, applying the derivative works rule to humans would be completely impractical and destroy the point of works with copyright. The same cannot be said for AI companies treating everything on the internet as fair use for training AI.
Interesting. I wonder what they mean by "Ethical" -- instead of e.g. saying "definitely free and open." I'm willing to bet "stuff they gathered from likely unwitting Adobe users."
They at least say they trained on their own licensed stock images and other copyright-free artworks. They are a big company and know that they would get sued, so it should be in their interest to do it this way.
It means trained on data from stock photo sites they own, for which all uploaders agreed to terms of service which state that uploaded materials can be fed into AI training.
IANAL, but I rather think the weights are not copyrightable anyway; and if I built a model on copyrighted data, I would conclude that the inseparable but reproducible parts of the weights retain their copyrights, despite the weights as a whole not having their own.
No, that doesn’t follow at all. The argument is that either the training or the expression violated existing copyrights through the making of unlicensed copies. It’s not based on open source licensing. Although OSS viral licensing may well apply if fair use is not a successful defense.
If it's public domain, then no copyright is violated. I'm not talking about public-domain data; the G-G-GP specifically mentioned the possible legal interpretation that training on large amounts of publicly visible (but not public domain) data is itself a copyright violation.
There is no official ruling... yet. We are very early in this rapid public development. Laws and rulings take years or decades.
They are trained on a lot of text. News sites, comments, books etc. Most books and news sites fall under copyright. Is this fair use? Who knows. Fair use is also an American thing. ChatGPT can be used in the EU, which doesn't have such a broad view of fair use.
If you make a game only out of a lot of copyrighted assets without paying it isn't fair use. Are LLMs different?
What about image generation, which you can prompt the models for specific styles of artists, which works are all copyrighted, but still used for training?
That's also my understanding: either the weights are copyrightable, and then all the models need explicit agreements for any work they include (because the models become derivatives), or they are not copyrightable, being just machine data (the most likely scenario in my opinion). They can't have it both ways.
There is also a (IMO less likely, but still conceivable) scenario where weights ARE copyrightable, but represent fair use of the training data on grounds of being "sufficiently transformative".
I consider that one super likely, but then using the model to make works competing with one of the artists, in their own style, is a non-fair-use derivative work.
Your case wouldn't be about style, it would be about specific elements that you posit were memorized and regurgitated by the model. The fact that you're creating art in the same style/medium as the author is what negates the "sufficiently transformative" fair use defense.
Basically, that world ignores the AI model completely. If your resulting work wouldn't be fair use if you directly were working with something from the training set, it wouldn't be fair use if you fed it through an AI model first.
I think there could be an argument that it's copyrightable but not a derivative work.
If I read a few books about a subject as research, and then I write an article about the subject, it's my own copyright. The fact that I did research doesn't make it derivative of those books (correct me if I'm wrong, IANAL).
Perhaps a model created from copyrighted material be treated in the same way?
> If I read a few books about a subject as research, and then I write an article about the subject, it's my own copyright.
Yes, because in that case you'd be the "author" doing "creative work".
> Perhaps a model created from copyrighted material be treated in the same way?
Who would be the author doing creative work in this case? The people who decided what training material to use? Perhaps, but it seems a stretch for the people who selected the training material to be authors but not the people who created the training material.
The inputs could be copyrighted and the weights could be copyrighted if creating the weights from the inputs is (legally) regarded as a transformative use. And I think it could reasonably be considered to be transformative - the weights don't look anything like the input data.
Disclaimer: IANAL. So far as I know, no court has ruled on whether this qualifies as a transformative use. I take no position on how the courts will actually rule. I merely say that they could regard this as transformative use. (But see jerf's "creativity" argument for another hurdle that weights must pass to be copyrightable.)
Transformative use doesn't necessarily mean copyrightable.
Google's thumbnails are a purely mathematical transformation on images (no copyright themselves), and yet are considered a transformative use.
I believe that trained models are similarly a purely mathematical transformation of {data}, but is transformative in what that can be used for going forward.
"Can" bearing a lot of weight in that sentence.
It's how the human, with agency, uses the model that may be a derivative or copyright infringing use - not the model itself nor necessarily the output.
The output of a generative AI may be similar enough to an existing work that it is derivative of that work. It is possible to construct a prompt that infringes on an existing work even if that work wasn't part of the training data.
For that case, consider you drew a picture. That picture that you just drew isn't part of any training data. I could presumably look at it and describe it with sufficient detail that something similar enough would be generated... and that may be considered a derivative work. The same test could be applied to me describing it to someone on Fiverr with the same outcome.
If I were to publish that work by the generative AI or Fiverr - who would be infringing on copyright? me? or the black box that may be AI or Fiverr that created a picture based on my prompts?
Another way to look at it: if a thing reproduces data subjectively resembling the originals and you use it anyhow, then it's a non-transformative use, and the methods used are just extra details.
No, weights are not just data fed but also the training process itself. I think the whole argument hinges on how much human thought and action is needed in training the model.
On the other end of the spectrum, AI generated content couldn't be copyrighted if there is no human involvement. If someone asks GPT to write 1000 poems, it couldn't be copyrighted.
In a sane legal system a new copyright law would be passed to clarify all of this. In ours, the poor copyright office needs to make things up on the fly.
Their recent decision that implies that anything that AI is used to produce is non-copyrightable is silly, sad, and not sustainable.
Compiled object code of a bunch of code you didn't write. I don't know why programmers are so eager to forget that copyright is not at all about what something is, and all about where it came from. It'd be hard to assert that you hold copyright over object code compiled from code you didn't write!
>> And very much of intellectual property law comes down to rules regarding intangible attributes of bits - Who created the bits? Where did they come from? Where are they going? Are they copies of other bits?
Or is the fact that compiled code enjoys copyright protection, even though it is not human generated, evidence that being generated by a human is not overly important for copyright protection?
An mp3 encoding of a wav file of a copyrighted song is still copyrighted, despite those exact bits never having existed before, and being created entirely by a computer.
What happens if we train a neural network on a single, copyrighted work? Say it has one input node (or even zero, if you like), and regardless of this input, its output is always exactly the copyrighted work it was trained on. What do its weights represent? Clearly, its weights represent a direct encoding of the original work. Those weights are copyrightable, but not by the person who trained the neural network -- the copyright is held by the owner of the original work.
What if we train the neural network on just two copyrighted works? If its one input node is 0, it outputs the first, and if it's 1, it outputs the 2nd. Almost certainly, its weights are a complicated, tangled mix encoding both, like a compression algorithm that completely rearranged its input. Who owns the copyright to those weights? To whatever extent the weights can be "factored out" into a set representing the first work and a set representing the second, clearly the copyright holder of the first work holds the copyright on the first "factored set", and the 2nd on the 2nd. It seems obvious that we must be able to do this "factoring out" somehow (even if the topology of the factored networks is different), because we know both works are exactly represented by the weights, and the neural network itself can use this information to reconstruct them both, so they're in there ... somewhere. So is there a sort of "joint copyright" on the combined weights, where nobody is really allowed to do anything with it without approval of the other? Regardless, it's still clear that whoever trained the neural network has no claim on any copyright.
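The one- and two-work thought experiments are easy to make literal. In this deliberately degenerate sketch (the "works" are invented placeholder bytes, and the "network" is just a stored weight matrix with a selector bit, not a trained model), the weights are nothing but a re-encoding of the two training works, and the single input bit selects which one to regurgitate:

```python
import numpy as np

# Two stand-in "copyrighted works" (invented placeholder bytes).
work_a = b"Work A: some protected text."
work_b = b"Work B: another protected text."

# A degenerate "network": its weight matrix is just the two works,
# stacked and stored as floats, padded to equal length with zeros.
pad = max(len(work_a), len(work_b))
weights = np.zeros((2, pad), dtype=np.float32)
weights[0, :len(work_a)] = np.frombuffer(work_a, dtype=np.uint8)
weights[1, :len(work_b)] = np.frombuffer(work_b, dtype=np.uint8)

def forward(bit):
    """The one input bit selects which memorized work to emit."""
    row = np.round(weights[bit]).astype(np.uint8)
    return bytes(row[row != 0])  # strip the zero padding

assert forward(0) == work_a
assert forward(1) == work_b
```

Real models tangle their training data far more than this, but the sketch shows why "the weights can reproduce the work" implies the work is encoded in the weights in at least some form, even when no individual weight resembles the original bytes.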
Where is the breaking point extending this from 2 works to a billion? People make arguments like "drawing a car from memory isn't infringing on copyright design of that car", which ... are you sure? Reproducing a piece of music from memory (and selling it) is usually copyright infringement. You're allowed to learn a Taylor Swift song as part of your musical training, but you're not usually allowed to then play it back from memory and sell that recording (I'm not sure I morally agree with this treatment of covers, nor if it's globally applicable). So the argument that "surely neural networks are allowed to learn from copyrighted works" misses the point: they can learn all they want, but as soon as they reproduce verbatim (or close enough) a copyrighted work, they're infringing. And if they're representing a complete copy of the work within their weights (which they obviously are if they can reproduce it), then the original copyright holder has a claim on those weights. And never in this process has the trainer of the NN acquired any copyright to anything. The real trainer is a bunch of GPUs, after all.
If the neural network cannot reproduce any of the copyrighted works verbatim, then we're getting closer to "fair use" territory. Yes, it's permissible to write a summary of a copyrighted work. That is so lossy as to not "compete" with the original work in any meaningful way. If it could be demonstrated that neural networks do not encode completed works (no matter how hard the factorization would be), then one could make this argument. Unfortunately, the evidence is that LLMs are more than happy to completely regurgitate copyrighted works verbatim. It seems to me the copyright holder of the original work therefore must hold a share of the claim on the weights. Still, the GPUs that trained the network do not magically acquire copyright over anything.
I wonder if the real answer is that the weights are copyrighted, and that copyright is held jointly by hundreds of millions of people, and nobody can do anything with those weights without the approval of all the others. I'm not saying I like that universe, but I am saying it's the most internally consistent answer I can think of, and seems to follow from the above argument.
In fact, the "factoring out" process shouldn't even be that hard: find the input vector that forces the ANN to output the copyrighted work verbatim. There should be some simple method of "baking in" the first step of the feedforward algorithm, applying that vector to the first layer of weights, and then considering the input layer as the first hidden layer of a network with 0 input nodes. It is now equivalent to a neural network that can only ever output a single copyrighted work, and therefore its weights exactly encode (bloatedly!) that work. The owner of the work holds copyright on those weights. Importantly, if I'm thinking about this right, the weights of this derived network are exactly the same as the original except in the first layer.
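For what it's worth, that baking-in step is mechanically trivial for a plain feedforward net: the fixed input times the first weight matrix just folds into the first layer's bias, and every other weight is untouched. A made-up numpy sketch (random weights, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer net: h = relu(W1 @ x + b1); y = W2 @ h + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def net(x):
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

# "Bake in" a fixed input vector: fold W1 @ x into the first-layer bias.
x_fixed = np.array([0.5, -1.0, 2.0])
b1_baked = W1 @ x_fixed + b1  # the derived network's first "bias"

def net_zero_input():
    # Equivalent network with zero input nodes; W2 and b2 are unchanged.
    h = np.maximum(b1_baked, 0.0)
    return W2 @ h + b2

assert np.allclose(net(x_fixed), net_zero_input())
```

So yes, under this sketch the derived zero-input network shares every weight with the original except the first layer's bias.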
On the other hand, we need the original input vector for this to work, and one could argue that the network weights are simply the algorithm for decoding the input vector into the copyrighted work. So the originator holds copyright on the input vector, not the weights. Does it matter if the input vector has smaller information content than the original work? Clearly this argument relies on the input vector being the "actual encoding", and therefore must have at least as much information. If the input vector is an embedding of "please show me the latest Tom Clancy novel in full", this argument breaks down.
No, a document's contents aren't inherently copyrightable. They have to be a creative work or a method of production, and part of that basically means it has to be human generated content (as opposed to computer or animal generated).
AI weights might be considered a method of production, but that isn't clear yet.
> Was there no work put into their creation by someone?
This is the “sweat of the brow” theory of copyrightability, which courts have rejected (for good reason based on the statute.)
“Someone did work to enable this thing to exist” is not sufficient to make a thing copyright-protected.
> There's no fundamental difference between an image, code, or weights.
And neither images, code, nor weights that are mechanically produced with no creative input by a particular author are subject to copyright in their own right (depending on their relation to the source material on which the mechanical process rests, they may be covered by the copyright on the source material.)
The best argument for weights being copyrightable (and it probably applies better to some models than others) is that the assembly of source material is a creative work subject to a compilation copyright, and that the model weights themselves are just a mechanical translation of that compilation, covered by that same copyright.
> Was there no work put into their creation by someone?
Putting work into something is not a sufficient criterion for copyright.
> All of it is just a stream of bytes that the computer can interpret somehow
This is also not a sufficient, or at all relevant, criterion for assigning copyright.
Also, in the sense you presented, those files are not fundamentally different from random noise. Which is not a particularly useful reduction for this exercise.
Weights are the output of a mechanical process run over the training set with no element of human authorship, just as the output a model produces from a prompt is, and the Copyright Office has already declared such output outside of copyright.
> Is a document not copyrightable based on its contents?
Creative process is the bigger issue.
> Weights are just a different kind of a document.
And who sits down and writes this document of weights?
Copyright is not for "documents", it is for works that have creativity in them. The legal bar for that level of creativity is low, so low that it is easy to come away thinking that anything that can be cast as a "document" must be copyrightable, but the bar is in fact not zero.
In particular, taking other documents and shoving them through a process that generates a lot of other numbers with no human or creative interaction is definitely something I'd be concerned the courts would judge as not sufficiently creative to be copyrightable. The process itself would certainly consist of copyrightable code, but the output doesn't necessarily. This would be somewhat similar to the observation that there is no copyright to be had in a big table of files and their MD5 hashes (or other hashes), such as a Linux distro might use for integrity checking. Lots of copyright in the original file contents, copyright available on the process for producing these tables, but the tables themselves would likely be ruled not itself copyrightable as there is no creativity in that output.
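To make the hash-table analogy concrete, here's roughly what generating such a manifest looks like (file names and contents invented): the output is fully determined by the inputs and the algorithm, with no creative choice anywhere.

```python
import hashlib

# Toy stand-ins for files on disk.
files = {"README": b"hello\n", "app.py": b"print('hi')\n"}

# A purely mechanical manifest, like a distro's integrity-check table:
# every line is forced by the input bytes and the MD5 algorithm.
manifest = "\n".join(
    f"{hashlib.md5(data).hexdigest()}  {name}"
    for name, data in sorted(files.items())
)
print(manifest)
```

Nobody "authored" those hex digits; swap in a different hash function and the table changes completely, with zero human judgment involved either way.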
Note this also has absolutely nothing to do with the question of whether AI output is copyrightable, this is about the huge table of numbers that make up the neural net weights being copyrightable. (Though it would be sort of an interesting question for the legal system to grapple with as to how a non-copyrightable set of numbers could then produce something copyrightable. Call it a philosophical variation on the "copyright washing" argument; can copyright spring from a non-copyrightable source other than a human brain, thus somehow "flowing uphill"? Would a human brain be copyrightable? Stay tuned for those questions, I guess, or if not you, your grandchildren.)
Per your other comments, "work" is not the bar, "creativity" is. "Size" is not the bar either. Merely being a much larger table of numbers than a list of hashes or a phone book is not the question. No human is in that table of numbers creatively saying "no, wait, this neural weight should be -1.5 instead of 2.0 to produce this creative effect". No human is even capable of working in the medium of neural net weights in a creative manner.
If you want to go the "novel legal theory" route, you could play with claiming creativity in the selection of input material and claim the resulting neural weights has a copyright in compilation: https://en.wikipedia.org/wiki/Copyright_in_compilation That's a long way from a slam dunk though. Way out on a legal limb there. It isn't entirely clear to me what exact rights would result from such a claim either. It would be a landmark copyright court case for sure.
IANAL, but I suspect that the "novel legal theory" in your last paragraph would fail. It might succeed if you gave GPT a hand-curated list of materials; hoovering up the entire internet is not that.
Recipes, for example, are not copyrightable in the US. Neither are some of the concepts behind creating a fillable form. It's not an all-or-nothing system.
AI licensing is extremely complex. Unlike software licensing, AI isn’t as simple as applying current proprietary/open source software licenses. AI has multiple components—the source code, weights, data, etc.—that are licensed differently.
Are you joking? This isn't wrong, per se, but it's worded as though written by someone with only the most casual / cursory interaction and knowledge of this area of law / commerce (e.g., including licensing, copyright, trademark / service mark, patent, etc.) ... until perhaps quite recently.
Yes, the AREA IS complicated. No, so-called "AI" is not introducing all sorts of novel issues, structures, etc. "AI" has some nuances distinct from much of what has come before (happens basically every time more significant tech comes along) and some possibly more unique questions related to economics, ethics, philosophy, and the like, but the relevant areas of law and practice have often been complicated and sort of "bleeding edge", even going back before the industrial revolution.
Big money, powerful tech, large-scale economic forces, etc. = lots of maneuvering, legislation, litigation, etc. = complicated "rules of the game".
Drawing the distinction vs. software in general is reasonable - but, the rather click-baity headline and "I just learned about 'IP' law and bah gawd y'all are doin' it wrong" tone to the start of this article suggest, to me, that this isn't likely to be the best article to use as a reference to learn about these issues.
The article makes a good point: we should prevent “open-washing” and draw a distinction between well-intentioned restrictive licenses like “Open”RAIL and true open source. However, I worry the name “ethical source” is itself a bit question-begging. While outfits like Bloom may believe in good-faith ethical principles, their definition of ethics isn’t necessarily everyone’s. If restricted models are “ethical”, is releasing open weights “unethical”? Conversely, is releasing a model with PII or artist styles in it “ethical” if a few known use cases are forbidden? There’s no one right answer. Labeling any one set of restrictions as “ethical” off the bat makes discussion harder and puts open source on the back foot to justify “not being ethical”. Better to just call them “restricted models” or “guarded models”, and leave it to individuals to decide if these restrictions are beneficial or not.
I think the more interesting aspect of all this is that the confusion created by this new business model (not sure how to classify it, so "business model" will have to do) appears to be largely intentional. The subject matter is complicated to begin with, experts being a niche of a niche of a niche, and the assumption that the general public can even understand it (or that it can even be dumbed down to digestible sound bites) is, in my mind, very optimistic. Now, courts are not typically stacked with dummies, but again, how many are well versed in issues of technology?
All in all, I don't disagree with the point you raised, but I worry that all this will only further muddy the water for the general population.
"Now, courts are not typically stacked with dummies, but again how many are well versed in issues of technology?"
Even if they are well versed in issues of technology that does not mean they'll make what any given one of would consider a good decision, as plenty of people well versed in issues of technology disagree with each other on these issues.
Nothing guarantees that on any issue, really, as you can always find people who disagree... and if they happen to be judges, they get to decide unless a higher judge overrules them... and that judge has the same problem as the first.
Sure. My point is that I would so much rather have a decision handed down that was considered on actual merits (we might disagree, but at least I would be able to see some sort of real consideration, and not what amounts to talking points from various lobbyists). A judge with zero exposure to the area is at best 50/50, and regardless of the ruling I will be annoyed that a person with zero knowledge is declaring how something he knows little to nothing about can be used (just like I am more and more annoyed with the political class in Washington, but I am more inclined to believe these days they know exactly what they are doing -- serving their own interests).
To your point, it is absolutely not a panacea (new blood inevitably ends up in government, and the results so far are in line with what you said), but it would at least be a starting point.
Hmm. Makes a few unsubstantiated claims, with hand-wavy appeals to risks that our private corp overlords are presumably protecting us humble users from, now that they've built their product on open source and data by closing it down and changing terminology to suit.
There's an intelligent discussion to be had, and I think this otherwise-reasonable article could be part of it if it toned down the presumption and condescension a little.
Weights might be copyrightable but in no universe are they copyrightable by OpenAI, Google, etc just because they did the training and spent money on GPUs.
The only people who can possibly own the copyright, if any such copyright exists, are the authors of the training data.
I find this whole discussion about copyright of weights almost absurd; the incredible amount of deference given to our corporate lords is such that we are "hallucinating" new forms of IP protection for NN weights that have never existed in any kind of statute or case law, and that cut completely against the grain of all the law that currently exists.
>Weights might be copyrightable but in no universe are they copyrightable by OpenAI, Google, etc just because they did the training and spent money on GPUs.
I don't see why not. If you took all the same training data, you would not get the same weights. Especially if RLHF was used to tune those weights. The weights are not a set of facts, they are the result of work, sometimes millions of dollars of work. Surely they deserve copyright protection if they are ever "published".
If not, then they are a trade secret, and other rules apply.
> The ethical license category applies to licenses that allow commercial use of the component but includes field of endeavor and/or behavioral use restrictions set by the licensor.
I don’t love the name, “ethical license” sounds like a description of the license: this license is ethical. Really this sort of license imposes a particular ethical framework on the user.
Not to throw shade, though. It is actually hard to come up with a neutral-sounding name for this sort of license, I think. I keep thinking of things like “morality encumbered license,” but that sounds ridiculously euphemistic in a weird way.
Yes I was going to say the same thing. It's a branding that has been applied by the license's proponents, and I personally reject a lot of what they call "ethics" as well as the idea of whatever monitoring and enforcement the restrictions entail - maybe calling it a religious license would be better.
One thing I don't see discussed enough is that, ok let's say the weights are unencumbered, and the source is under an OSI license: the point of open source licenses and free software was to expose the *human understandable* meaning of the final program.
That's why distributing binaries isn't allowed even though technically all of the functionality is present in the machine code. AI weights are basically binary blobs. We don't know what they mean, there is really no source code for them. The best we can do is various black box manipulations on them like LoRA, etc, similar to what we can do to a binary blob.
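For instance, a LoRA-style edit really is just arithmetic on an opaque matrix; nothing about it requires understanding what any individual weight means. A toy sketch with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(1)

# The inscrutable blob: an 8x8 weight matrix we cannot "read".
W = rng.normal(size=(8, 8))

# Low-rank adapter, the core idea behind LoRA: learn small matrices
# B (8 x r) and A (r x 8) and add their product to W.
r = 2
A = rng.normal(size=(r, 8)) * 0.01
B = rng.normal(size=(8, r)) * 0.01

W_adapted = W + B @ A  # W is modified, never explained

assert W_adapted.shape == W.shape
```

Same shape, slightly different behavior, and at no point did anyone interpret a single entry of W -- which is exactly the binary-blob situation.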
>> AI weights are basically binary blobs. We don't know what they mean, there is really no source code for them.
No. You can do further training on them. If they are something less than code I don't think it's going to warrant all this talk about licensing. GPL, MIT, or some proprietary should cover it.
You can do further training on them, just like you can patch a binary blob. There are some surgeries you can do to the weights, and there are analyses you can do to poke at them and try to understand them, but ultimately they weren't created from a human understandable spec, and without a ton of reverse engineering work the weights by themselves aren't human understandable: hence the "source" component is missing.
The source code that generated the weights is one step removed from the kind of source code we'd need to interpret a bunch of AI weights. It's really meta-source code
> Some people have the perspective that if a license isn’t open source, it’s proprietary. I think it’s more nuanced than that and believe there are three more license types worth naming: non-commercial NDA, non-commercial public, and ethical.
It’s very useful to remember the U.S. government definition of commercial software: it is software that “Has been sold, leased, or licensed to the general public” [1]
This means that a “non-commercial license” is a bit of an oxymoron to a lot of people. Their definition of commercial includes all software with a license, and does not depend on whether the software costs money. (Perhaps not entirely unlike how FSF does not define “free software” based on whether it costs money.)
I'm disappointed that the article is only making the (somewhat pedantic) distinction between source code and weights. From the quotation marks in the headline I hoped that it would instead be making the distinction between human-readable source code and machine-readable compiled form.
For example, IMHO (IANAL) an AI code-completion tool that had been trained on GPL software is (or should be) only legal to distribute if it is accompanied by the training code _and all the code ingested during training_ (or an offer to provide such code upon request).
This is an interesting point. If you read the OSI open source definition, specifically on source code (quoted below) I'm inclined to treat the training data as part of the source code for the purpose of determining whether to consider any model open source.
2. Source Code
The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
A computer program is a set of instructions that may be executed. Weights are values that may be loaded by a program, but are not a program in and of themselves.
Would you consider a Python program to be data rather than program just because it is text input to the python interpreter instead of machine code for the CPU?
Weights are literally numbers computed as output. They are not instructions. The semantics of those numbers even when emplaced (loaded) in an artificial neural net is such that they do not execute. They are not instructions. LLM engines and diffusers perform searches where the weights are used to calculate additional output.
Is source code, like Python text, data? Yes. All code is data. But not all data are source code.
If I gave you a web request log, you would not assert it is a program. If I gave you a CSV file with time-series values from a sensor, you would not assert it is a program. If I handed you a database of contact information, you would not assert it is a program. Weight files are the equivalent of CSV files. They are a dump of parameter values computed from training.
They are not a program.
The definition of computer program is well worn. So is the definition of source code, and the definition of parameters. Weights are parameters.
The difference between code and data only exists in our minds. There is no distinction. Both code and data make the computer do things (and, yes, both code and data only make the computer do things if other conditions are permitting, for example if executed with the right interpreter, or loaded with the right type of viewer). Anything that can be expressed as code can be expressed as data, and vice versa.
> Weights are literally numbers computed as output. They are not instructions.
They are instructions if you consider the LLM system itself to be a kind of weird, indirect virtual machine. Each number can be mapped to a set of instructions that are executed. Even your CPU uses numbers (machine codes) to execute.
Join me in saying: ...code is data is code is data is code is data...
No, declarative programs exist. They are not instructions.
There is no real line between code and data. This is an observation that runs all the way from Turing Machines in computability theory to the Von Neumann architecture and homoiconicity in Lisp.
What we call 'data' is just code that needs a cleverer interpreter.
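A trivial Python illustration of the point: the same string is inert data or executable code depending purely on which interpreter you hand it to.

```python
# "2 + 2" sitting in a variable is just bytes...
payload = "2 + 2"

as_data = len(payload)   # treated as data: five characters
as_code = eval(payload)  # treated as code: an interpreter runs it

assert as_data == 5
assert as_code == 4
```

Nothing in the bytes themselves marks them as one or the other; the distinction lives entirely in how they're consumed.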
Less into theory and more into "wait wtf": Some of the older projects I've worked on were written by people who loved database-driven stuff, to the point they did things like put perl code into one table column (with sentinel values you had to find/replace before `eval`ing the code) and sql into another table that retrieved values for those find/replaces, both retrieved and executed by some really generic code.
They’re not data though, they’re coefficients. They are the only thing that significantly differentiates one model from another.
If I told you the economy can be accurately modelled by
GDP(x) = Ax + B
But I don’t define A and B for you because they’re proprietary, so you haven’t learned anything other than what you can glean from the structure of the model itself (it’s linear, there’s only a single input, etc.)
If most of these models are similarly structured, I’d say the weights are the program.
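To belabor the point with made-up coefficients: two models with identical structure differ only in their parameters, so the parameters carry essentially all the value.

```python
# Two "models" sharing one structure; only the coefficients differ.
def make_model(A, B):
    return lambda x: A * x + B

gdp_model = make_model(1.7, 3.2)     # invented "proprietary" coefficients
rival_model = make_model(0.4, -1.0)  # same architecture, different weights

# Knowing only the structure (linear, one input) tells you nothing
# about what either model actually predicts.
assert abs(gdp_model(10) - 20.2) < 1e-9
assert gdp_model(10) != rival_model(10)
```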
But not "raw" data. They are derived from other data and a program. If this was a collaboration where one collaborator did the processing and one sourced the data, they would likely both claim some amount of ownership of the trained weights.
At a minimum, it would be an active area of negotiation that the attorneys would take notice of. Source: have negotiated these agreements.
I imagine it is not settled law, but there's a clear argument to be made that regardless of the difficulty in curating the data set, it's still a data set.
Can it be licensed and sold? Yes, surely. Is it proper to pretend an open source license is sufficient protection? Probably not.
When people talk about weights, they talk about a network of weights that takes an input and computes an output. There is really not much difference between a saved model and a program.
That's the least of it. In Lisp the distinction between code n data is blurred all the time. In F18 assembly i frequently have "double entendres" which are used as code or as literals depending on the entry point. I think at least once there was code and data in the same entry point. Assembly n Lisp are both homoiconic, after all. N verb at the end of the sentence, are you transliterating German, or a two-foot green Jedi master full of wisdom?
So, Open Data. Got it. This is the same category as config files that are kept up to date by a program as it runs.
- Is it "a program"? Very clearly not.
- Is it source code? You can argue either way. The program won't work without it, but "this specific one" is not required for the program to do something, and that ambiguity means you probably don't want to call it "source code" because it's too vague.
- Is it data used by a program in order to perform its task? Absolutely. It even uniquely defines the program behaviour, and so is a thing onto itself within the context of the program it's used by.
Agreed, output weights are target code, and no one would argue the contrary. Companies pretending to publish source code is nothing new.
Stallman defines source code as "the preferred way in which developers modify the program"
I wrote for wikipedia once that
"Stallman's definition thus contemplates JavaScript and HTML's source-target ambivalence, as well as contemplating possible future forms of software production, like visual programming languages, or datasets in Machine Learning."
So the datasets could be a form of source code, but the most appropriate source code would be the code that crawls or downloads the dataset and modifies it.
>While the RAIL organization suggests adding the word “Open” to RAIL licenses that include similar open-access and free-use as open source (i.e. OpenRAIL-M), this is confusing since the license is not open source so long as it includes usage restrictions. A better name would be EthicalRAIL-M. Using the term “ethical” to describe this category license clearly indicates its functional difference from open source licenses.
I don't even think we should be using the word "ethical" because it implies that anything more permissive is unethical. We should call these morality clause licenses.
The question of whether or not we should have morality clauses involved is complicated. Most bad actors do not give a shit about the licensing status of the code they are using. And these licenses also cause headaches for people who want to follow the rules[0] and avoid copyleft trolling[1]. On the other hand, the morality clauses in OpenRAIL-M are relatively straightforward and non-obnoxious.
[0] This also applies to "non-commercial" licensing, since that is a concept entirely foreign to copyright law. As far as I'm concerned the 'NC' clause in Creative Commons just means 'OK to torrent'.
[1] A practice in which people abuse copyleft licenses to try and extract licensing agreements for minor license violations. The forgiveness periods added to GPLv3 and later versions of Creative Commons are specifically to prevent this behavior.
If anything - this entire conversation just highlights (Over and Over and Over and Over again) how absolutely bonkers abusive our current copyright laws are.
The vast majority of small individuals are compelled by contract to surrender their rights to large corporations. Those large corporations then abuse the ever loving fuck out of those rights.
The express intent of copyright is now a sad joke.
Personally - I'm pretty over the entire show. This system is generating an incredible amount of inequality. New and novel content is absolutely NOT getting made, and these laws are creating vicious infights that drain resources from well intentioned companies & individuals and pass them along to complete scam corporations.
We are told stories as children that we cannot retell in our own voices decades later to our own children.
I am firmly ready to burn this copyright system to the fucking ground. It's been 300 years since the Statute of Anne - I'm ready for a different game.
Fully agree that the existing copyright and intellectual property systems are in dire need of deep reform. But to get people on board, you can't just propose burning it all down, you need to point to a viable alternative. Say, limiting copyrights to something sane like 15 or 30 years. Or making it easier to invalidate obvious or trivial patents.
Or do you really want to do away with notions of intellectual property altogether? You can make an argument for that, but that would lead to deep economic changes, and you need to anticipate what the end result would look like. You still need some way to encourage the creation of new content.
Pointing out that our copyright/IP system is broken is easy. And you're right, it's totally broken! Coming up with a fix is hard work.
>Pointing out that our copyright/IP system is broken is easy. And you're right, it's totally broken! Coming up with a fix is hard work.
The problem I have with these arguments is they ultimately tend to boil down to the devil you know or the devil you don't know.
We keep claiming when something is broken we must provide a "fix" and the assumption is that fix has to be better than the current approach. There's pretty much no way to guarantee this because the systems in place are the only systems with evidence. So, because we have other ideas, we dare not try them because they have to "fix" the problem. The amount of inertia that keeps corruption in motion bothers me and at a fundamental level most of the inertia comes down, ironically, back to property ownership. If we abolish copyright or change it we have to make sure things are fair/equitable. Well sure, that's ideal, but what we have isn't even remotely fair and equitable anymore, so even something broken is likely an improvement.
We have no willingness as a society to try some modifications and be willing to accept failure, then shift to the next modification and iterate around until we get something sane in place. As such, the systems in place remain in place and more and more holes are found to exploit as time progress.
Our systems need to be more adaptable. Founders of the country understood that which is why they made the legal system a legal adaptable system. The question has always been though, what is the threshold? We've played it safe so long that much of the entire system designed to adapt to fix these issues has itself been targeted and gummed up intentionally to prevent that.
> You still need some way to encourage the creation of new content.
Do you? What's the argument for this? Is there some sort of extreme shortage of creative work that the state should find it necessary to encourage it? How about we end copyright, and if there's ever a problem, we offer copyrights for a short period to fluff the commons up again. A copyright anti-holiday, as it were.
Instead we do the opposite: automatically copyright everything anyone produces, and make it very difficult to surrender your copyright (unless Google or Microsoft want it, then if you object you're literally a Luddite caveman who is trying to turn back the clock on modernity because you're old, stupid, and afraid of fire.)
Pointing out that IP is broken is _not_ easy because most people believe in the contradictory notion of intellectual property, you included, not knowing the legal history of IP, the legal and economic history of the concept of property, and so on. If it were easy, it would be obvious to everybody that 1) IP law is immoral and 2) nothing bad would happen if it's abolished outright.
> 2) nothing bad would happen if it's abolished outright.
It is interesting that people living in the places with the weakest IP laws will pay a premium to import baby formula from the places with the strictest laws.
> Here's a free ebook on the subject
Of course it is some right-libertarian wonk piece.
This has nothing to do with IP laws, and everything to do with baby formula laws. You don't seriously think that without the ability to sell the recipe, nobody would invent a safe and effective baby formula, right?
I'm not sure you fully grasp all the dimensions of IP law.
If you have two brands of baby formula, Death brand that kills babies, and OK brand that is perfectly fine, and you start putting Death brand in fake cans labeled OK brand - that is absolutely an IP enforcement issue. The desire of OK brand to protect their brand, and profits, combined with reasonable IP laws allows them to lead enforcement actions and protect consumers.
That is absolutely not the kind of intellectual property that anyone hates.
The kind of intellectual property we are talking about is the one where Death isn't allowed to make baby formula that doesn't kill babies, because OK patented making baby formula that doesn't kill babies and won't give them a license.
I don’t think we need to use the legal system to encourage creation of new content! That’s a natural thing people do. In fact there’s a lot of artistic remixing that is illegal or ambiguously legal under the current copyright regime that can be a powerful form of expression.
I really don’t think we need government policy to encourage artists to create art. (At least not of this sort - I am all for art grants.)
There is also the whole patent / copyright trolling issue too. The fact that $BIG_CORP can hire armies of lawyers to freeze competitors and beat them to market by filing frivolous lawsuits is yet another example of insanity in the whole system.
It's a problem with the legal system (not unique to any specific country, mind you; the problem is global), not the patent or copyright system specifically. It has grown so complex that representing yourself pro se has become a sad joke in all but the simplest cases, and there's no incentive to fix it - quite the opposite: everyone in the system is all for keeping the status quo, because it generates money.
Personally I don’t think patents do what people believe they do (encourage innovation). It’s a bigger discussion but briefly, the only literal function of a patent is to discourage innovation by legally barring anyone from using a patented idea as part of a new innovation. The idea we have is that the secondary effects of this will be increased profits for inventors and therefore more innovation. But actually there’s loads of secondary effects and often many of them outweigh the effect of increased profit. For every one inventor that gets a patent there might be 100 prevented from using that idea in a different and innovative way.
A classic example is 3D printers. Stratasys spent 15 years selling printers that cost tens of thousands of dollars. It wasn’t until the patent expired that people figured out how to make them for $250. Those cheaper printers are enabling mechanical engineers and designers to accelerate their process and make other new innovations faster. Stratasys had such a powerful patent they never bothered innovating down in price, instead rested on their laurels selling $25k printers to big customers.
So how many inventions were delayed or shelved because the inventors couldn’t afford a $25,000 3D printer, and $250 printers didn’t exist yet? Both Stratasys and IBM held patents related to 3D printing and they had to cross license to go into production, so how many others would have come up with 3D printing in the 1990s if it had not been patented? Would first mover advantage in a free market have been enough to stimulate development of 3D printers? Could we have had $2000 3D printers in the early 2000s (Stratasys sold theirs for $30k) instead of ten years later? How many engineers would have invented new gadgets faster if they had a 3D printer ten years earlier?
Another possible example is the x86 and x86-64 ISAs locked between AMD and Intel. I don’t think Intel would have become complacent had there been more competitors…
… or the whole “oracle vs google” over the java API.
I recently watched the documentary Fire in the Blood (2013) [1] about the use, by big pharma, of patents and WIPO to obstruct access to affordable antiretrovirals (ARVs) in Africa during the worst years of the AIDS epidemic, leading to over ten million deaths. All of this when the African market for these medications represented less than 1% of the total market, in dollars. It’s absolutely infuriating!
So you don't like that large corporations can abuse current copyright laws, and your solution when they find a way to launder what they don't control yet (e.g. open source code) is to burn the copyright system to the ground, such that now they can also abuse what remains?
> I am firmly ready to burn this copyright system to the fucking ground.
Same, but the issue is not copyright, which is simply an effort to wield the state to control intellectual property in the same way the state is wielded to control physical property.
The compounding problem arises when property is capital, defined as the means to convert labor into new value. Capitalism is specifically a system in which one can wield control of capital (intellectual or otherwise) to extract profit from labor then trade that profit for more capital. As a result, capital accumulates infinitely, independent of the value produced by the labor which is provided to society.
Artists require capital to convert their labor into value just as any other worker would, so where should that capital come from if not from control of the value they produce? Society must solve this problem or we will not have art to begin with. Only looking at the demand side obfuscates such issues that arise on the supply side, and the only reason we're talking about them now is that digital technology has solved the scarcity problem on the supply side. It has not solved the scarcity problem on the demand side, however.
Finally, art, just like all technological progress, is always the product of entire societies and the history of all mankind that came before it. For this reason, all copyright and patents have no rational basis and are merely bandaids for the ill side effects of controlling capital to extract profit from labor to begin with.
> AI also poses socio-ethical consequences that don’t exist on the same scale as computer software, necessitating more restrictions like behavioral use restrictions
There's plenty of software that has, or could have, similar restrictions. Consider software that allows you to plan vantage points for a shooting or estimate the impact of using explosives at various locations. And the government regulates all sorts of software for export/download because it has military use--everything from development tools to high performance chips that could be used to crunch numbers for a nuclear program, CAD software that can help you build (or destroy) a bridge, etc. The CPUs and GPUs themselves are regulated at certain performance levels, I think.
People, from early school, all the way up to university, use copyrighted materials to learn various topics and obtain degrees. This trains our brains using the work of others.
The same is true as we navigate life. We learn various skills and subjects consuming the work of others.
And, yes, in the case of most people, we use that training to pursue various careers, obtain work and get paid for it.
How can there be a claim of infringement on the part of LLMs and not on every person who has ever used a book, website, article, video or publication to learn something?
Because humans are not machines. Otherwise we would send machines to prison when they kill someone, right?
I see it like this:
Say you write a book. I assume you would find it obvious that I am not allowed to copy your book, replace your name by mine, and sell it, right? That's the point of copyright.
Now say I don't just copy-paste your book, but I run it through a program that replaces some words with synonyms (without losing quality or meaning), and I sell it all the same. Are you fine with that? I would tend to say that I am still infringing the copyright on your book.
Generative AIs can do exactly that, and the people using generative AIs don't have a simple way to check if the output they got is a slightly-modified copy of copyrighted material or not. All we know is that the AI is a machine taking the words in your book, processing them automatically, and generating a new text. Those are not humans who learned about the world and write down their thoughts, but machines that copy-pasted-and-modified words.
Copyrighted materials are either licensed specifically for a human or it's implied that a human will use them to learn.
Naturally, human memory is going to distort and change that information over time. But as soon as you use it in an AI, which has superhuman capabilities of memory, that would go out the window.
Imho the weights are the real meat for most typical models: you can run with them and continue training them with your own code. It's not even guaranteed that the original code would be very useful for that.
But if you are going to make that distinction, for which you can make a case I think, shouldn't you include a third dimension, 'data'? The code alone is hardly useful if you want to rebuild the weights, but all it tells you is that they're loading their proprietary data and then using PyTorch to set up and train the model. You can't reproduce anything using just that. So the real equivalent of open source would be imho either open weights, or open data plus code plus weights (the latter are arguably redundant, but still practical to include). Given that the size of that repo will typically be gigantic, I think open weights is the case we should really be focusing on. I'd rather have a paper explaining the model together with the weights, rather than code that I can't run anyway, if I'm designing an algorithm to continue training the model.
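The point that released weights plus generic code suffice to continue training can be sketched in plain Python. This is a toy stand-in, not any real release: a single-parameter linear model plays the role of a published checkpoint, and a generic gradient-descent loop plays the role of "your own code"; all names and values are illustrative.

```python
# Toy stand-in for "open weights": one released parameter for y = w * x,
# trained by the original authors on their (unreleased) data.
released_w = 1.7  # hypothetical published weight

# Our own data and a generic training loop - no original code or data needed.
our_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # targets follow y = 2x

def mse(w, data):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

w = released_w   # continue training from the released weight
lr = 0.01
for _ in range(200):
    # analytic gradient of the MSE with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in our_data) / len(our_data)
    w -= lr * grad

print(round(w, 3))  # converges toward 2.0, the value fitting our data
```

The same logic applies at scale: a checkpoint plus any compatible training loop is enough to fine-tune, with or without the authors' original pipeline.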
Just a thought: what would happen if all copyrights were abolished, or if the generative AI revolution we’ve been seeing continues to the point where almost everything is machine generated and therefore open season (assuming machine-generated derivative work isn’t protected)? What would actually happen to the US economy?
I don’t buy the argument that people just won’t innovate anymore once the incentive is gone; it just doesn’t cut it. There are multiple motivations that exist simultaneously. For example, governments have a motivation to stay technologically advanced compared to peer nations, humans have an inherent desire to create, plus power, notoriety, etc.
So in that case, let’s just abolish that first layer of incentive - which, through absurd copyright laws, uncovers more greed than anything - open the flood gates, and get rid of all copyright. We need some real innovation, and all this babble about who owns an ‘idea’ is way too restricting.
> Unlike software licensing, AI isn’t as simple as applying current proprietary/open source software licenses. AI has multiple components—the source code, weights, data, etc.—that are licensed differently.
Software also has multiple components, often the same as the ones listed by the author. But what do I know, to me AI is just another example of software.
Weights are just matrices with values between a certain range. So are digital images - just matrices with values. Images are covered by copyright laws, so why shouldn't weights also be?
Might want to get ahead of the curve on this one. How would this work? Would I get a tattoo with a license spelling out the terms covering the contents of my body?
The lack of freedom to modification makes it not "open" either.
Comparing to traditional software, weights are actually worse than binary. You can't "decompile" the weights into the training source code so there is no way for the community to make useful changes to them.
Just as you can't decompile a binary without loss of information, you can't recover the training setup from the weights. "Source" means that you can reconstruct the artifact, so the training data should be available, as well as the code that was used to train it and the build script that invoked it.
A focus on licensing ignores that there are security incentives not to run just any weights you find floating around the net. Getting exploited through misaligned networks is a very real threat, and really hard to combat.
OK but I mean it's functionally a kind of machine code for a strange machine with a neural transformer architecture, like a 'binary blob'. It's outside of the paradigm where machine code is created only by compilation of copyrightable source code written by humans following their creative "aha moment".
Copyright protection is what gives you protection when you put something out into the public. The desire not to publish something is evidence against these protections applying: people keep weights private, for that reason among others, because they know the weights are not copyrightable. You just presented evidence against your position.
This post did cover many of the same ideas I have been ruminating on concerning model weights and the nomenclature of current efforts. That's also why I generally tend to stick with calling these[0] "local/self hosted models" for the time being. A major reason for my reluctance is that I see weights far closer to binary than code, making a distinction important and current FOSS concepts not really applicable.
Of course, this all hinges on the idea that weights by themselves are inherently protected by current copyright, which still seems to be an unsettled topic, hotly debated by both laypeople and legal professionals. Authors are generally afforded copyright on their work by default, and weights raise so many questions concerning authorship that have never been considered.
This being such a contested issue, which will require new laws and/or precedent (depending on the legal system), is very problematic. Regardless of where you live, courts and government entities are generally not famous for their speedy reaction to new things, so clarity may take a while - at which point the industry might have already settled on some agreement that then gets adopted as a basis for actual legislation, which would likely favor financially well-backed entities already actively lobbying for their interests, such as OpenAI.
Some have also pointed out that this is arguing semantics, and I am tempted to agree in principle, but also want to emphasize that I feel this is a situation where that can be valuable. Should weights in some way be afforded copyright protection, clear nomenclature will be needed. Putting some thought into this now is definitely not the worst idea.
I very strongly feel that the specific word "ethical" as part of defining licenses is not the best idea, though. "Ethical" can carry vastly different connotations, depending on a myriad of factors, many of which would go beyond the use-focused definition laid out in the post. Due to this, I'd argue for "behavioral" or "restricted use" over "ethical", as both more clearly state what the intended effect is in cases such as Open RAIL-M[1].
Part of my strong feelings on the use of the word "ethical" come from the fact that with weights and training data, there has been a lot of discussion concerning both rights of and considerations for creators whose published works have been used to create those weights. Due to this, the use of "ethical" referring to a group of licenses could give some the impression that this may indicate that the training data used was "ethically sourced", i.e. in agreement with the original creator. This is something that in my eyes should also have clear labeling, though with weights being very hard to reliably trace back to source data, it currently seems impossible to verify, making this essentially just a good faith effort.
I'm not sure OCV gets to decide any of this, just like I don't think OSI trying to be the sole dictator of the term "Open Source" works out long term. My opinion on things like this is always received controversially, but terms evolve to meet the common usage of the people. If people are calling this "Open Source", and there are more people who want to call it "Open Source" than people who don't, then unless you intend to legally bar them from using the term with actual action, like a lawsuit, this will eventually also be encompassed by the term "Open Source" as people know it, like it or not.
Yes, I know this term is currently defined explicitly by OSI. No, I don't think language prescriptivism wins out regardless of how hard they try, and since I haven't seen any of the hundreds of quasi-Open-Source, but not really, companies get dragged to court over usage of the term, this is all toothless complaining in my view.
As to their actual point, I might actually agree with them if it were only the weights being shared. In most cases the configuration is also shared, which allows popular frameworks to instantiate the model and then execute it for either inference or further training, making the release fully suitable for modification and re-release. I don't need the exact implementation of FlashAttention they used if I can load the model into Huggingface and use theirs, or mine, or whatever.
Edit: This obviously doesn't apply to the models who have restrictions placed on usage just in case people think I mean every instance of sharing a model. Those are obviously restricted use and I agree it muddies the term.
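The configuration-plus-weights point can be sketched in plain Python: a minimal "config" describing layer shapes is enough for entirely generic code to rebuild the network and run inference, without the authors' original implementation. Everything here is a toy illustration, not any real model format.

```python
import json

# Hypothetical release: a config describing the architecture,
# plus the weights as (matrix, bias) pairs, one per layer.
config = json.loads('{"layers": [[2, 2], [2, 1]]}')  # (inputs, outputs) shapes
weights = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0], [1.0]], [0.1]),
]

def relu(v):
    return [max(0.0, x) for x in v]

def forward(x, weights):
    """Generic inference code: matrix-multiply through each layer."""
    for i, (mat, bias) in enumerate(weights):
        x = [sum(xi * mat[r][c] for r, xi in enumerate(x)) + bias[c]
             for c in range(len(bias))]
        if i < len(weights) - 1:  # hidden layers get a nonlinearity
            x = relu(x)
    return x

print(forward([1.0, 2.0], weights))  # → [2.1]
```

This is the same design choice frameworks like Huggingface make at scale: the config tells generic code what to build, the weights fill it in, and the original training code never enters the picture.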