The complexity described seems to be resting on the unestablished idea that weights are copyrightable in the first place. If they're not, then presumably "available weights", "ethical weights", and "open weights" are all the same: open weights. Either your weights are under NDA and presumably considered to be a trade secret, or they are public, and the words in your "license" mean absolutely nothing? That seems like a rather important point to bring up when discussing the licensing landscape for weights...
> The complexity described seems to be resting on the unestablished idea that weights are copyrightable in the first place.
Yes. Weights probably aren't copyrightable in the US. See Feist v. Rural Telephone, in which the Supreme Court ruled that telephone directories are not copyrightable. The copyright clause in the Constitution ("To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.") is understood to require human authorship. The US does not have database copyright, or "sweat of the brow" copyright. That it was expensive to produce some collection of data does not make it copyrightable.
Outputs from LLMs, machine generated art, and machine generated music probably are not copyrightable either. US Copyright Office: "Based on the Office's understanding of the generative AI technologies currently available, users do not exercise ultimate creative control over how such systems interpret prompts and generate material. Instead, these prompts function more like instructions to a commissioned artist."[1]
> Outputs from LLMs, machine generated art, and machine generated music probably are not copyrightable either.
I don't have a strong sense of whether this is reasonable (I see arguments both ways) but I do think it's pretty strongly at odds with how we treat photographs. There are a bunch of photos on my phone where I unquestionably own the copyright, despite putting in much less creativity than I did for some AI images I've generated.
I don't think it's clear how to resolve this, but I do think that if we are going to protect photos and not prompted AI images, the distinction needs to turn on something other than whether "sufficient creativity" was applied to the input of the mechanical system.
Edited to add: It's probably also worth calling out that the question of whether we protect the work produced by a person's use of mechanical system is a separate one from whether we protect the work of others when it is (in various ways, to various degrees, with various likelihoods) reproduced by use of those mechanical systems.
Your prompt for the AI image generation is copyrightable.
The output is not.
The photo you take involved choices of composition and timing and equipment choice.
Just because you don't feel you put in a lot of consideration doesn't mean you didn't still make creative choices that give the resulting product copyright protection.
But if you took that photo and put it into software which made a derivative image without human creativity, then while the original image would be copyrightable the resulting derivative output would not.
The ideal path forward in copyright would be no infringement in use of materials for training and no protection in AI output with infringement possible against output too close/derivative of protected images.
It's not infringement if you learned to draw by tracing Mickey Mouse; your Gerry Gerbil cartoons are fine. But if you draw Mickey Mouse and distribute it, you'll hear from Disney's lawyers.
AI should be the same with the exception that the Gerry Gerbil cartoons would not be copyrightable.
If the prompt is copyrightable, then why wouldn't that copyright flow through to the output?
I can't legally pirate Windows just because the source code was run through a compiler. Even though the compiler itself adds no additional creativity, the underlying source code is still a creative work[0], so pirating the binaries still infringes a copyright. Just one that's in a slightly different place than what we're normally used to thinking about.
Just to drive the point home, there's a few other situations in which copyright "flows through" to things not subject to copyright. Back in the days of copyright formalities, if you published before properly registering something, your work would be born into the public domain. And this occasionally happened to serial media - e.g. someone might just forget to register the third season of a TV show. In that particular case[1], seasons one and two are still copyrighted, and because season three is a derivative work of the prior season, nobody but the original owner can actually make any use of season three. The only practical difference is that the company that owns that TV show lost one year of copyright ownership over the third season.
[0] By definitions of law. I honestly think most software shouldn't have been made copyrightable, but once Congress said "software is copyrightable" that put that question to bed.
[1] I don't remember the name of the TV show or the court case, but this IS a thing that happened and this theory IS court tested.
In general, if you pay someone to paint a picture, they own the copyright and it needs to be assigned back to you even if you give them quite specific instructions. The instructions lack sufficient control over the outcome to create some form of dual copyright.
Presumably that general rule would also prevent your instructions to DALLE from giving you copyright ownership of the output. The AI isn’t getting ownership, so it’s either in the public domain or a derivative work of the artists who created the training data.
Right. And furthermore, I don't actually think prompts alone are copyrightable in most cases - I just wanted to propose an argumentum ad absurdum. Certainly, you can't argue creativity when you're also keyword-stuffing your prompts to call up various feature sets that the model just so happens to associate with them.
The output of a compiler (i.e., a translation program) is created via a prompt (the source code). The output object code is very much copyrighted. People keyword stuff their source code all the time (pragmas) in order to influence the generated output. Why does that object code deserve copyright protection except when the compiler is an AI model (i.e., a translation program)? Compilers use genetic algorithms and weights from profiling in order to generate better output. Where does the output stop being capable of copyright protection because it's no longer "creative"?
If a museum can include a small portion of a frame around a public domain painting and claim new copyright as a result - surely any smallest spark or creative influence qualifies, including choosing a single word and choosing the model and time and which output is selected does as well.
The idea of work for hire, and the notion of copyright assignment, applies to people and not machines or processes employed in the creation of a work. Your brush manufacturer would never dare try to claim that their creative selection of fibres and thus their contribution to the unique brush patterns in your painting constitutes a creative contribution to your work. Why is a complex digital model which does the same any different?
Perhaps it is copyright as a whole that is wrong and is nothing to do with AI. This is what we get for creating imaginary property as a means to finance speculative creative endeavours in a capitalist system. So, yeah. Fun times there - once again technology challenges another economic status quo.
Multiple independently created compilers can directly translate source code to unoptimized machine code that works in a completely straightforward fashion based on the definition of the language. There’s a great deal of complexity involved in creating more optimized output, but the goal is to have functionally equivalent programs.
There’s no way to map DALLE prompts into any kind of obvious picture from the input. Even DALLE itself can produce a wide range of outputs from a single input.
There is: the input is the description of the image so produced, plus the hidden elements and parameters (randomness, etc.) that users often don't see - with these, there is a deterministic input-to-output relationship. The fitness of the model is in how closely the output matches what we expect to see given the inputs we give. That's the point of them. Models are compilers. The distinction is really only in the complexity and ambiguity of the language specifications they implement - not in any fundamental aspect of their function. There isn't a single person alive who understands how a non-trivial compiler works in its entirety, just as nobody really knows how LLMs work yet. That's not the point.
That’s not “independently created”; you’re suggesting reimplementing a process not from first principles but from the output of the process. I can make a compiler in a programming language without it being a derivative work of any other compiler.
Further, people programmed in languages before any compilers were created, and that code worked after the compilers were created.
The CPU is a compiler for programs written in the machine instruction set architecture the CPU claims to implement which happens to output real world effects just as a compiler outputs program code. So, no, you can't.
Words have meanings, and the instruction pipeline consists of electrical signals - and those early CPUs were almost all microcoded or had multiphase clocks or some other implementation abstraction which they did not expose to their architectural state... so yes, they were in a very real sense compilers.
Simply because I didn't state "for all and every" doesn't invalidate my point, nor does it support yours as true - further, "heavily favored" suffers from the same problem. The point is, there's a system which takes as input formatted in a specification (a program) and some transformed output (a set of actions to be taken or another program input for another compiler). So, there you go. If a hot dog on a bun could be considered a sandwich, then a CPU could be considered to be a compiler. shrug disagree all you like.
If you say X is Y, but it’s not true for all X, then the statement is false. I.e., “All integers are even.” is false.
As to your point, that’s not what CPUs do though: they have both a set of instructions and a set of IO with the outside world. A compiler always produces the same output from a given set of instructions, but with CPUs you can run the same code and get wildly different output due to that IO.
The only way you can call a CPU a compiler is as a subset of its capabilities. They may have internal microcode where a given instruction gets translated into a different internal representation, but that’s not the end of it: the CPU also executes that microcode.
> The photo you take involved choices of composition and timing and equipment choice.
And the AI image involved choices of prompt and model, and subsequent selection from among several generated images.
I recognize that what you said here:
> Your prompt for the AI image generation is copyrightable.
> The output is not.
... probably represents the state of the law at the moment (with meaningful amounts of uncertainty), but I don't think there's a principled difference based on the amount or nature of creativity involved. IMO the equivalent would be "you own the specification of (position, equipment, relevant world state) but not the photo" which obviously doesn't do anything we want for photography. And I guess that's a part of my point. We should pick the policy we want to make sure we capture the incentives we want. Maybe it is best that AI assisted art (past some point?) not be copyrightable. But I don't think basing the distinction on the amount or nature or... propagation (I guess?) of creativity makes any sense in distinguishing flippant and bullshit photographs (at least a third of my photos, although I would hesitate to apply the labels to any particular photo by someone else) from prompt-driven generative works.
> then while the original image would be copyrightable the resulting derivative output would not.
Excellent! I'll put the Inheritance Cycle through a synonymiser, and have a copyright-free (if somewhat degraded) version. Take that, Christopher Paolini!
… wait.
What you say might well be correct: the law is often foolish. But I'd imagine the creativity-free derivative work still counts as a derivative work of the original, copyright-eligible work.
Yes, it'd be a derivative work owned by Paolini. Paolini would have copyright to the derivative work, to the extent that he has rights over derivative material. The prompter, however, would have nothing.
If I write a poem and put it into Stable Diffusion, how is what is produced not a derivative work of my poem? We can argue that it's a derivative work of many other things, but that doesn't make it not a derivative work of the poem.
One way it might not be is if Stable Diffusion is seen more like a hash algorithm than a synonymiser. But I don't see why it should be, because there's a meaningful correspondence between the input and the output of the system.
On photography, the argument was condensed into "who pushed the button". We saw it in the monkey self-portrait copyright fight, where copyright was not granted to the photographer, and in other nature photography using photo traps, where the copyright stuck with the human basically because they were the last operator of the camera.
The interesting part is, those controversial cases are pretty recent when the art of photography is a century (centuries?) old now. I wouldn't expect super clear guidelines regarding AI art before a few decades of weird cases fought tooth and nail in court.
As discussed in Section 306, the Copyright Act protects “original works of authorship.” 17 U.S.C. § 102(a) (emphasis added). To qualify as a work of “authorship” a work must be created by a human being. See Burrow-Giles Lithographic Co., 111 U.S. at 58. Works that do not satisfy this requirement are not copyrightable.
The U.S. Copyright Office will not register works produced by nature, animals, or plants. Likewise, the Office cannot register a work purportedly created by divine or supernatural beings, although the Office may register a work where the application or the deposit copy(ies) state that the work was inspired by a divine spirit.
Examples:
• A photograph taken by a monkey.
• A mural painted by an elephant.
...
Similarly, the Office will not register works produced by a machine or mere mechanical process that operates randomly or automatically without any creative input or intervention from a human author. The crucial question is “whether the ‘work’ is basically one of human authorship, with the computer [or other device] merely being an assisting instrument, or whether the traditional elements of authorship in the work (literary, artistic, or musical expression or elements of selection, arrangement, etc.) were actually conceived and executed not by man but by a machine.” U.S. COPYRIGHT OFFICE, REPORT TO THE LIBRARIAN OF CONGRESS BY THE REGISTER OF COPYRIGHTS 5 (1966).
I'm not sure I understand the "source" term here. All the AI images I've seen so far were generated by humans using software tools like neural networks.
This is mostly right - it depends on what the weights represent and how they were generated, so I would not go as far as the initial claim.
A collection of numbers is copyrightable if it's the encoded result of a creative process. Just because it's represented as a bunch of numbers does not make it non-copyrightable. That's why it says "original works of authorship fixed in any tangible medium of expression, now known or later developed, from which they can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device."
You can't just classify the weights as facts simply because they are numbers. If they are creatively made by a human, they would be copyrightable. Mechanically computed from random numbers? No. Somewhere in the middle? Harder.
I'm not a lawyer, but it seems like you stood up a straw man there.
>Just because it's represented as a bunch of numbers does not make it non copyrightable.
Can you give an example of where the bunch of numbers is copyrightable when it's not just a numeric encoding of something that was already copyrightable? Taking music and encoding it as a wav file is not a creative work, but it's a representation of a copyrighted work.
Maybe you could create a long list of numbers and call it an artistic impression, but that's clearly not what AI weights are. I'm interested to hear an example of your copyrightable numbers.
"Can you give an example of where the bunch of numbers is copyrightable when it's not just a numeric encoding of something that was already copyrightable?"
Sure, there are "poems" that consist of just groups of numbers that are copyrighted. They are not encodings; it's just a string of numbers, indistinguishable from any other bunch of numbers. This is just one example; there are lots.
They are enforceable to the degree it's creative, and to the degree the infringing use is also creative.
So you would not be able to sue me for using those numbers in a math equation. You would be able to sue me for reproducing your poem in a book of poems :)
As Feist says, the creativity required for copyright is quite minimal.
But it's still only as protectable as it is creative.
Look - AI is not the first thing to have this "issue". The answer remains the same as it always was - it's mostly about the process not the output.
The output mostly matters if the output is not intended to be creative (or it's de minimis, or ...).
Copyright as it currently exists is weird.
Like if you go to the copyright office and try to register your ssh public key and say "this was generated by ssh-keygen i had nothing to do with it" you may get a different result than if you said "this is my new visually stunning masterpiece, my ssh public key, which was generated with computer help but I used 37 precisely timed keyboard smashes to do it. Prints are available from my gallery for $500"
I fully agree with what you say, with one bit of nuance to point out:
> Like if you go to the copyright office and try to register your ssh public key and say "this was generated by ssh-keygen i had nothing to do with it" you may get a different result than if you said "this is my new visually stunning masterpiece, my ssh public key, which was generated with computer help but I used 37 precisely timed keyboard smashes to do it. Prints are available from my gallery for $500"
The important thing, of course, isn't whether the copyright office denies to register your copyright, but instead what courts will ultimately do when you attempt to enforce your copyright.
We know the current administrative algorithms used by the copyright offices. We have less clarity on what courts will ultimately do.
The key factor of Feist v Rural is whether there was any original or creative process in the way the facts were arranged.
Here, there's a whole lot of creative decisions in labelling and guiding of training that produces the weights, so it's reasonable to think it might be copyrightable.
That is, the numbers are a whole lot more original than the issuance of phone numbers or part numbers.
The requirement for expertise doesn’t imply that setting up parameters for training an AI is necessarily copyrightable. A normal brick wall, for example, needs skill to create but doesn’t qualify because the goal is not creative. If so, the mechanical output of a process that doesn’t qualify for copyright is not going to qualify either.
Labeling training data may qualify for copyright, but if the underlying training data doesn’t taint the output as a derivative work then labeling isn’t going to qualify by itself.
Thus without some new and very generous interpretation AI companies are at best not going to benefit from copyright and at worst may be forced to create all training data in house. My suspicion is this generation of AI companies are in a very difficult situation.
> but if the underlying training data doesn’t taint the output as a derivative work then labeling isn’t going to qualify by itself.
It depends. If each individual training item has a small impact on the output coefficients, then perhaps it's not a derivative work of them. But if there's a large creative process in determining model training procedure, deciding labelling strategies, and applying those-- perhaps those numbers are strongly derived from those things.
That sounds like wishful thinking, individual training items have significant impact on the result.
Anyway, suppose you’re building an AI to walk, there’s nothing creative about selecting 9.8m/s/s for gravity that’s simply the ideal value to achieve a desired goal. Labeling an elephant as “Elephant” rather than “coat hanger” is similarly a functional choice.
Just because a person is holding a camera and taking a photo doesn’t mean the result is copyrightable.
> Anyway, suppose you’re building an AI to walk, there’s nothing creative about selecting 9.8m/s/s for gravity that’s simply the ideal value to achieve a desired goal.
Suppose you're not building a strawman, but instead building an AI to be an LLM. The exact sequence of what you choose to do for instruction tuning, and the metrics and labels that you choose, the prompt/response pairs you write, and the loss functions you employ are quite creative. They greatly affect the coefficients and are not simple mechanical steps and are the result of a large amount of creative choice.
We are nowhere near a point where they are an uncreative, mechanical recipe to follow.
> Just because a person is holding a camera and taking a photo doesn’t mean the result is copyrightable.
No, but in the overwhelming majority of circumstances it is. What it depends upon is whether the person holding the camera is making a significant, original creative choice.
I am not sure what courts will decide, but I am certain that there is more creativity and originality employed than you are giving OpenAI et al. credit for.
> not simple mechanical steps and are the result of creative choice
Creative choices require intentional control over the output across a meaningfully different range of viable possibilities. A bricklayer has a huge range of viable options in the specific brick and its alignment in a wall, but none of those choices are artistically meaningful.
The coefficients are also not in any meaningful sense chosen based on instruction tuning. It’s no more under direct control than the specific arrangements of atoms in the brick wall and is instead the output of a purely mechanical process.
> We are nowhere near a point where they are an uncreative, mechanical recipe to follow.
Thus: We are nowhere near the point where the output is under creative control rather than being the result of a poorly understood mechanical recipe.
> Factual compilations, on the other hand, may possess the requisite originality. The compilation author typically chooses which facts to include, in what order to place them, and how to arrange the collected data so that they may be used effectively by readers. These choices as to selection and arrangement, so long as they are made independently by the compiler and entail a minimal degree of creativity, are sufficiently original that Congress may protect such compilations through the copyright laws. Nimmer ss 2.11[D], 3.03; Denicola 523, n. 38. Thus, even a directory that contains absolutely no protectible written expression, only facts, meets the constitutional minimum for copyright protection if it features an original selection or arrangement.
Alphabetical order wasn't quite enough. But people directing the work that produces the coefficients are doing considerably more creative work than that.
> Thus: We are nowhere near the point where the output is under creative control rather than being the result of a poorly understood mechanical recipe.
No one requires complete creative control of the output. I can spatter paint and have relatively poor control of what's happening, but I am certainly generating a copyrightable work when I engage in creative choices as part of this.
> Alphabetical order wasn’t quite enough, But people directing the work that produces the coefficients are doing considerably more creative work than that.
I agree it’s more effort, but the metric isn’t effort, so I disagree that this qualifies the coefficients as copyrightable. The SHA256 hash of a movie isn’t copyrightable even though the movie itself was.
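The hash comparison can be made concrete with a few lines of Python (an illustrative sketch; the byte string here is a stand-in for the movie file, not real data):

```python
import hashlib

# A cryptographic hash is a purely mechanical function of its input:
# the same bytes always produce the same digest, with no room for
# authorial choice anywhere in the computation.
movie_bytes = b"stand-in for the bytes of a two-hour film"
digest = hashlib.sha256(movie_bytes).hexdigest()

# Re-running the computation yields the identical digest.
assert digest == hashlib.sha256(movie_bytes).hexdigest()
```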
> No one requires complete creative control
That’s a strawman, there are requirements for creative control. You don’t own copyright to your normal dumps, but you can get copyright from looking down and selecting to take a picture. That’s the low bar for a creativity requirement, but it exists.
> I agree it’s more effort but the metric isn’t effort so I disagree that qualifies the coefficients as copyrightable.
I know the metric is no longer effort. But there's a lot of creative choices that I've mentioned that greatly affect the coefficients, even if we don't know what those creative choices are going to do to each film grain in the photograph or coefficient in the matrix.
> You don’t own copyright to your normal dumps
Yes, there's an explicit exemption in LOC's guidelines for things that are the direct output of natural processes.
If you have a lot of choices affecting output, then the output is subject to copyright. Indeed, the Supremes above said that factual compilations can qualify if they involve a "minimal degree of creativity".
IANAL, but I’d wonder whether ‘creativity’ is really present in labelling - and indeed, mightn’t it be the last thing you want? I’d argue labelling should be strictly factual and reproducible, and ideally following a logical structure… maybe akin to how addresses of buildings might appear in a phone directory…
(Agree that the skill in knowing how to code and guide the training of a model is probably very different though. It’s not just access to compute time that separates me from OpenAI :) )
> Here, there's a whole lot of creative decisions in labelling and guiding of training that produces the weights
Often, labelling is part of large public datasets that are chosen for use for that exact reason, and/or is otherwise not the work of the party claiming copyright in the model.
> Can you give an example of where the bunch of numbers is copyrightable when it's not just a numeric encoding of something that was already copyrightable? Taking music and encoding it as a wav file is not a creative work, but it's a representation of a copyrighted work.
There are random number books and I know one of them has a copyright registration [1].
Below is an implementation of Marsaglia's invention, from p. 348, courtesy of the infamous NR[1]. It's an MWC (multiply-with-carry) random number generator, with two parameters: multiplier a and base b=2^32.
---
For a, "The values below are recommended with no particular ordering."
ID a
B1 4294957665
B2 4294963023
B3 4162943475
B4 3947008974
B5 3874257210
B6 2936881968
B7 2811536238
B8 2654432763
B9 1640531364
---
As we all now know, the whole thing is copyrighted - you can't redistribute that code and can't use those specific numbers to generate random numbers without purchasing a license, which only allows you to use it once on your personal machine; that's why GSL[2] exists. The pseudorandom numbers you would get from MWC using the above values are also copyrighted, since they are work-product.
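For the curious, the generator described above can be sketched in a few lines of Python. This is an illustrative sketch of the multiply-with-carry scheme using the "B1" multiplier from the table; the seed values are arbitrary choices of mine, not from the book:

```python
# Multiply-with-carry (MWC) generator with base b = 2**32 and the
# "B1" multiplier from the table above.
A = 4294957665   # multiplier a (row B1)
B = 2 ** 32      # base b

def mwc(x, c):
    """Yield 32-bit pseudorandom values from seed state (x, c)."""
    while True:
        t = A * x + c    # one multiply-with-carry step
        x = t % B        # low 32 bits: the output and the new state
        c = t // B       # high bits: the new carry
        yield x

gen = mwc(x=123456789, c=362436)
sample = [next(gen) for _ in range(5)]
```

Note that the output is fully determined by the multiplier and the seed, which is exactly why the "copyrighted pseudorandom numbers" claim above is being mocked.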
I would say that is uncertain. Model weights are always going to effectively be a huge collection of statistics about the training corpus. Unless you are envisioning artisanal, hand-crafted, free-range model weights where a person used a non-mathematical method to purposely and creatively choose each one?
IANAL, but weights are probably not copyrightable because they are “useful”, i.e. they have intrinsic utilitarian function: they are necessary for a model to function as described. Useful things cannot be copyrighted, only patented. This is the key difference between the two. For works that are both useful and creative, only the non-functional aspects can be copyrighted.
Maybe a published copy of the weights might be copyrightable, in the exact form of a “creatively” ordered listing, but the weights themselves would almost certainly not be if the US judicial system rules consistently.
This bypasses the entire argument of whether it is human authorship as weights themselves in bulk are just straight up non-copyrightable regardless of origin according to this reading of the law and precedent.
> Weights probably aren't copyrightable in the US. ... is understood to require human authorship.
Are you arguing here that because the weights come from an optimization program, they are not "human authored"? If so, I find that a strange assertion. If I'm working every day on my model and training algorithm to ensure it produces the best weights possible to solve my problem, I would be very surprised if someone told me I have no ownership of those weights because they were generated by a program I wrote and data that I own.
In addition to what you mentioned, I'd like to add that photos are copyrightable by the photographer, who merely tweaked some "hyper-parameters" (exposure, ISO, aperture, whatever) and decided when and where to press a button. If such "minimal" amounts of authorship are considered sufficient, I won't be as confident as GP in claiming that training an LLM, which involves orders of magnitude more hyperparameters, labor, and capital investment, is uncopyrightable due to a lack of human authorship.
It might end up uncopyrightable due to other reasons, but probably not this.
Btw, the phone directories are quite different -- they're just compilations of raw factual public domain information. The LLM weights are anything but. In fact, one of the leading theories as to why LLMs are not copyrightable is that they infringe upon the copyrights of the source training materials. (I also don't want to guess whether that argument holds.)
But is it enforceable? Companies can put in contracts all kind of nonsense, it doesn't mean all of it is unconditionally enforceable, right?
I.e., if somebody creates a company that sells milkshakes and they say you can't use them to feed employees of competing milkshake companies - it wouldn't fly, would it?
Perhaps you're thinking of the EULA kind of situation. EULAs are probably not enforceable because they require the user to agree to additional terms after the main contract of sale is complete. It's considered a one-sided agreement because the user doesn't get anything in return (the contract of sale of the software already grants the right to use the software).
For LLMs it really depends on the situation. If it's presented in a EULA scenario, where you already bought the rights to use the LLM and ClosedAI gave you the EULA with additional terms afterwards, then the logic above applies. But then everyone knows EULAs aren't very enforceable, and nobody buys packaged software any more, so this scenario is quite unlikely these days.
So, if the clause is just one of the many conditions in their main contract of service, of which you had ample opportunity to review before purchasing/agreeing to use their service, then as long as the terms are legal (eg. don't contradict some law), parties are generally free to agree to whatever they want in a contract, and courts will generally uphold those terms.
"can't use them to feed employees of competing milkshakes companies" is probably enforceable. Sounds silly, but I can't think of any reason why it wouldn't be upheld. Unless there's antitrust factors involved.
"Can't use output of their API to train competitive models" is most likely enforceable. Unless there's antitrust factors involved. These kinds of terms are pretty common too. Nobody seriously thinks they're unenforceable per se.
Of course there are practical barriers to enforcing a contract -- the aggrieved party has to discover the breach, gather sufficient evidence, and file a lawsuit. As an average Joe individual, you're probably not worrying about getting sued by a company for trivial breaches of service agreements. Most likely the service provider will just cut off the service instead of spending thousands of dollars tracking you down (and risk taking a PR hit for going after the little guy). But between businesses, the risk of getting sued by a competitor is real, and no sane lawyer would advise the business to ignore such contract terms.
(Btw, I am not a lawyer. I've studied these things a bit though.)
Would strongly depend on the contract. It probably wouldn't fly in a post-sale terms-of-service agreement, but you'd likely be in breach of a normal contract, yeah.
Exactly. A lot of the difficulty here is how they skip the hugely important issue:
An entirely reasonable, if not fully tested, statement is the following:
Every single one of these AI weight things is itself the result of unencumbered, massive, law-breaking, right-violating copyright infringement -- accordingly, it's extremely difficult to say anything morally justifiable or authoritative about anyone else's "rights" downstream, and trying to inject the word "ethical" makes the whole thing even more ridiculous.
> is a result of unencumbered, massive, law-breaking, right-violating copyright infringement
Why? Copyright covers expression, not information; AIs can learn information from any source regardless of copyright. They should just not regurgitate copyrighted content, that's all. And much of the organic content online is common knowledge, and thus can't be copyright-controlled.
These are not bad arguments, but I don't think they're conclusive. I am a lawyer, and I could absolutely see this going the other way. "You can't make these machine things without literally feeding this copyrighted information into them, therefore they do contain a copy. You can see this when they reproduce things, e.g. the 'Getty Images' situation."
*this is not legal advice, dangit commenter person below
> you can't make these machine things without literally feeding this copyrighted information into them, therefore they do contain a copy.
They don't necessarily. Think about it: you can take some copyrighted material and transform the information contained in it (for instance, a fictional book). You can then write a summary. The summary contains information that was present in the original, but it has been transformed and hence is not a copy. The ML model contains information that has been generalized to some degree. So it's a grey area IMO.
I'm not saying that "they do or don't objectively" because that doesn't matter as much as people think it does. I'm thinking of what a "jury" COULD decide. I think average joe on a jury is very likely to see that process as "feeding them in."
Moreover, you are clearly not in violation of copyright if you are talking about statistics about the material. In your example, printing out "there were 7000 instances of the word 'the'" is certainly not a violation. A ML model is just a huge pile of these statistics.
However, saying "the first word of the book is 'The'" would not be a violation, while repeating that for every word in the book, as a whole, would be one.
I agree with you but I think it's important to have some nuance. Imagine I build a statistical model for 10-word sequences (10-grams) and then I trained it on a single book. I probably could pick some starting words and get most of the book back from the "statistics" I compiled. If I trained the same model on a giant dataset, the one book would just contribute to the stats.
All that to say: the models have the potential to memorize, but generally they don't, and when they do it's an undesirable failure mode, not deliberate copying.
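The parent's n-gram point can be sketched concretely. Below is a toy next-word model (using 4-grams instead of 10-grams for brevity; the "book" is an invented placeholder sentence, not real copyrighted text). Trained on a single text, nearly every prefix has exactly one observed continuation, so "sampling from the statistics" just replays the source verbatim:

```python
import random
from collections import defaultdict

def train_ngram(text, n=4):
    """Map each (n-1)-word prefix to the words observed after it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - n + 1):
        prefix = tuple(words[i:i + n - 1])
        model[prefix].append(words[i + n - 1])
    return model

def generate(model, prefix, length=20):
    """Sample a continuation; stops when a prefix was never seen."""
    out = list(prefix)
    for _ in range(length):
        nxt = model.get(tuple(out[-(len(prefix)):]))
        if not nxt:
            break
        out.append(random.choice(nxt))
    return " ".join(out)

# A single, short "book" (invented stand-in text).
book = ("the boy who lived had never even heard of the castle "
        "until the letter arrived on his eleventh birthday")
model = train_ngram(book, n=4)

# With one book, every 3-word prefix has a unique continuation,
# so generation replays the source word for word.
print(generate(model, ("the", "boy", "who")))
```

With a large corpus, the same prefixes accumulate many competing continuations, and any single book's exact wording dissolves into the aggregate, which is the parent's dilution point.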
I like this argument a lot; but again -- how does this play out in the real world? It's pretty easy to see what will happen in real life. Think, e.g., Batman. I could write a very new and original "Batman" comic that doesn't strongly resemble anything -- movie, toy, comic, whatever -- that exists, but would be recognizable to fans.
Once it starts doing well, will DC come after me? You bet.
These models can definitely be used to intentionally store and recall content that is copyrighted in a way that's not subject to fair use. (eg: trivially, I can very easily train a large model that has a small subnetwork which encodes a compressed or even lossless copy of a picture, and if I were to intentionally train a model in that way then this would be no less a copyright violation than distributing a JPEG of the same image embedded in some large binary).
But also, an unintentional copy of a copyrighted image is not a violation of copyright. (eg: an executable binary which happens to contain the bits corresponding to a picture of Batman -- but which are actually instruction sequences and were provably not intended to encode the picture -- clearly doesn't infringe.)
LLMs are somewhere in-between #1 and #2, and the intent can happen both in the training and also the prompting.
Stack on top of this the fact that the models can also definitely generate content that counts as fair use, or which isn't copyrighted.
It's the multitude of possible outputs, across the copyright spectrum, combined with the function of intent in training and/or prompting, which make this such a thorny legal issue for which existing copyright statute and jurisprudence is ill-suited.
Taking your Batman example: DC would come after you for trademark as well as copyright, and the copyright claims would be very carefully evaluated with respect to your very specific work. But here we are talking about a large model that can generate tons of different work which isn't subject to copyright or which is possibly fair use.
I don't think that existing jurisprudence (or even statute?!) can handle this situation very well, at all, without tons of arbitrary interpretative work on the parts of juries/judges, because of the multitude and vague intent issues described above.
(...Also presumably the merits of the DC case wouldn't matter because your victory would be Pyrrhic unless you are a mega-corp. Which from a legal theory perspective is neither here nor there, but from a legal practicality perspective may inform how companies go about enforcing copyright claims on model weights/outputs.)
Anyways. I think we have a right mess on our hands and the legislature needs to do their damn jobs. Welcome to America, I guess :)
Honestly, your second to last sentence is literally the kind of thing I hate hearing most from non-lawyers; the whole "if the legislature were just smarter" thing is just a weird pie-in-the-sky concept that is more-or-less like saying "the world would be better if CEOs were less greedy."
Like, yes, but it's not very likely to happen and it's not a particularly horrible thing if it doesn't; the law is slow and little-c conservative and you're just expecting it to be something it MOST often just ain't.
Let's say you take the harry potter books and create a spreadsheet with each word in it as a column, and the number of times that word appears. Would that violate the copyright? I'd be interested in the rationale if someone thinks it would.
If your table was the number of times a word was followed by a chain of other words, that would be a closer comparison to AI weights. In that case it would be possible with reasonable accuracy to reconstruct passages from the harry potter books (see GitHub Copilot).
The copyright aspect makes more sense when you start thinking of AI training models as lossy compression for the original works. Is a downsampled copy of the new Star Wars movie still protected under copyright?
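The difference between the two tables can be made concrete with a toy sketch (the passage below is an invented stand-in, not text from any actual book): a bare count table destroys word order, while a "which word follows which" table preserves enough order that the passage can be walked back out of it:

```python
from collections import Counter, defaultdict

# An invented stand-in passage (not from any real book).
text = "it was the best of times it was the worst of times"
words = text.split()

# Table 1: bare word counts. Pure facts; order is destroyed and
# the passage cannot be recovered from it.
counts = Counter(words)

# Table 2: which word follows which, in order of occurrence.
followers = defaultdict(list)
for a, b in zip(words, words[1:]):
    followers[a].append(b)

def reconstruct(start, table):
    """Replay the successor table, consuming entries in order."""
    out = [start]
    while table[out[-1]]:
        out.append(table[out[-1]].pop(0))
    return " ".join(out)

rebuilt = reconstruct("it", followers)
assert rebuilt == text  # exact recovery from "just statistics"
```

The count table is a pile of facts; the successor table, consumed in order, reconstructs the passage exactly, which is what makes it feel closer to a lossy copy than to mere statistics.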
Just tabulating the word counts would not violate copyright as it is considered facts and figures.
It resembles lossy compression in some ways, but in other important ways I think it doesn’t?
Like, if one has access to such a model, and doesn’t count it towards the size cost of a compression/decompression program nor as part of the compressed size of the compressed images, then that should allow for compressing images to have substantially fewer bits than one would otherwise be able to achieve (at least, assuming that one doesn’t care about the amount of time used to compress/decompress. Idk if this is actually practical.)
But unlike say, a zip file, the model doesn’t give you a representation of like, a list of what images (or image/caption pairs) it was trained on.
Or, like, in your analogy with the lower-resolution movie: it still tells you how long the movie is (though maybe not as precisely due to a lower framerate, but that's just going to be off by less than a second, unless you have an exceedingly low framerate, at which point it's hardly a video).
There is a sense in which any model of some data yields a way to compress data-points from it, where better models generally give a smaller size. But, like, any (precisely stated) description counts as a model?
So, whether it is “like lossy compression” in a way that matters to copyright, I would think depends a lot on things like,
Well, for one thing, isn’t there some kind of “might someone consume the allegedly infringing work as a substitute for the original work, e.g. if cheaper?” test?
For a lower resolution version of Star Wars movie, people clearly would.
But if one wanted to view some particular artwork that is in the training set, I would think that one couldn’t really obtain such a direct substitute? (Well, without using the work as an input to the trained model, asking it to make a variation, but in that case one already has the work separate from the model, so that’s not really relevant.)
If I wanted to know what happened in minute 33 of the Star Wars movie, I could look at minute 33 of the compressed version.
What is a 'copy'? Byte-accurate, or 'something with general resemblance'? Would a badly compressed "copy" of a copyrighted image still be 'a copy', or would it be some other thing? Would low-quality image compression be enough to skirt around copyright claims? Image formats and viewers just 'reproduce' an impression of the original data from derived compressed data. That is also just 'information that's been generalized by some degree', for space-saving purposes and so on. So what if image generators could be thought of as a 'very good multi-image compression algorithm' that can output multiple images, to a 'somewhat recognizable degree'?
Badly compressed still counts.
I think if the data allows you to reconstruct a recognizable recreation of the original work, you have a good chance of it being considered a derivative copy.
A mono audio version of Star Wars, compressed down to 320x240, filmed from the back of a theater on a VHS camera, converted to Video CD, would under any reasonable interpretation be just a copy of the original.
I assume it starts getting murky when there's some sort of transformation done to it. What if I run motion capture on it, and use that motion-capture data to create a cartoon version of Star Paws (my puppies-in-space epic)?
What if I do a scene for scene recreation as the animated cartoon (removing any mentions to copyrighted names -- Luke Skywalker is now Duke Dogwalker, for example)? In this case, there's been no actual data transfer -- all the sprites are hand drawn, backgrounds etc.
What would be an interesting exercise would be to try and create a series of artifacts that each on their own are considered non-derivatives, but can be used together to reconstitute the original. For example, create a compression method that relies heavily on transforms / macroblocks, but strip out any of the actual pixel data from the film. That info might be supplied as palette files which are themselves not really copyrighted data, but together with the compressed transform stream can be used to recreate the original video.
This is a great example. Summarizing or paraphrasing copyrighted content, or simply using it as a seed to generate input-output pairs - this kind of data transformation prior to training could solve the issues with copyright. It cleanly separates form from content.
> It's just a race for which test case gets to the supreme court first really...
Not really for practical purposes. In the long term, the Supreme Court can and does overrule its own precedent, so the first case on the specific issue to get to the Supreme Court doesn’t end the discussion.
In the short-term, cases get resolved by lower courts and parties either lack funds to do the maximum level of appeals, or the Supreme Court chooses not to hear appeals (they tend to prefer an issue to be well-developed with circuit case law, often waiting till there is a conflict between the Circuit Courts of Appeal, before taking it up), so the state of the law prior to any specific ruling on the narrow topic by the Supreme Court matters quite a bit.
...and the Supreme Court could rule however it likes. It doesn't matter what anyone else says, or what any law says, what any lawyer or other judge says.
They could be completely biased, could completely ignore everyone and everything else and rule however they want.
I'm almost surprised they still bother to write any kind of "legal reasoning" in their ruling and don't simply focus on what the ruling is rather than why they ruled that way. But I guess such "reasoning" still serves a propaganda purpose and still provides a fig leaf for those who still believe in the quaint absurdity that "we are a nation of laws, not men."
Supreme court precedent seems to impact a lot of decisions...
Plenty of companies who have legal teams will keep an eye on the legal landscape of court decisions, and use them to decide if our T&C's or contracts need rewriting, or if any precedent puts us at legal risk.
Sure - the supreme court could overthrow its precedent anytime, but until it does, a lot of people will act as if what they say is the law.
Yeah I didn't even think that was controversial. I'd always been taught that copyright and patents exist to explicitly restrict what people can do by granting a monopoly to the owners in order to encourage invention and creative work.
Edit to add I'm not saying I agree with the justification or am trying to argue for it, only that the point above is commonly raised as the justification, implying that the intrusion on a person's rights is known and accepted.
Natural rights are a fiction to pretend that someone’s moral code is a privileged aspect of physical reality in a way every competing moral code is not.
That's going a bit far. They're just fictions that are privileged over certain other fictions--it's like how you can often cast magic missile in D&D but you can't usually cast expelliarmus, it comes down to which fiction we agree to inhabit.
Even so, you can ask whether a given moral code is more principled than another (e.g. in the sense of having some algebraic structure), and use that to investigate what might be considered "more natural". For example, one might argue that if a "natural" right exists, then it ought to be symmetric under exchange of humans (or sentient beings or whatever). It's then "more natural" to conclude that you have a right to perform actions that have no interaction or consequences for other humans (e.g. to sing a copyrighted song to yourself in an empty room or downloading a song that you already have on CD but don't feel like ripping yourself) than those that do (e.g. taking food from someone because you'd otherwise starve).
> Even so, you can ask whether a given moral code is more principled than another
What does “more principled” mean of a moral code? How does one quantify “degree of principledness”?
> and use that to investigate what might be considered "more natural".
What does the preceding (being “more principled”) have to do with being “more natural”? And what significance does being “more natural” have?
And none of that has any relevance to what is usually described as "natural rights"; it's like taking existing words, coming up with entirely novel meanings, and then building a whole architecture around them, which is pretty advanced equivocation.
When the web was young, there was a lot of information considered "public" like criminal record, marriage records, birth certificates, property records, etc. But those were still fairly veiled because of the amount of effort required to see them. Suddenly these were getting blasted all over the internet because now that was an easy thing to do, and everyone had to rethink what "public" meant.
I suspect we're going to see the same kind of rethink about intellectual property in the age of AI.
Copyright is for things that are the result of human creativity. If the weights come from running an algorithm on a training set (that one does not have a copyright to) then how can the weights then be copyrightable? They might be a derivative work, but that just means they infringe copyright, not that they are copyrightable themselves.
Note that the requirements for copyright are not consistent between nations.
The US has the "threshold of originality" as its principle. Under that doctrine, it requires some human (and this has been emphasized many times over the years) originality in order for something to be copyrighted. It's a low bar for how original it needs to be, but it must be human (monkeys taking selfies are not human).
> Under a "sweat of the brow" doctrine, the creator of a work, even if it is completely unoriginal, is entitled to have that effort and expense protected; no one else may use such a work without permission, but must instead recreate the work by independent research or effort.
The definitive case for this in the US that set the two apart is Feist Publications, Inc., v. Rural Telephone Service Co. ( https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._R.... ) where it was deemed that a telephone directory is not copyrightable in the US as there is no originality in it... but under the sweat of the brow doctrine it would have been.
So the "[c]opyright is for things that are the result of human creativity" gets an "it depends" and it would be curious to see if companies that are firmly in the "models are valuable" camp go to the UK for what I believe would be a more favorable copyright protection.
... However there are other IP laws around trade secrets that may be better for it in the US (I'm not as familiar in that domain - I would be curious to find out).
A photo is presumed to be copyrightable. Even horrible photos taken by somebody without any aesthetic sense are presumed to be copyrightable. The argument (AFAIK) is that the photographer chooses the time, location, object, and tweaks various settings of the camera (exposure, aperture, etc.), and these choices are considered sufficient for a photo to be copyrightable.
How about LLMs?
The hyperparameters of LLMs are hugely important in training LLMs, as is the choice of source training data. To me the "degrees of freedom" (and hence room for "creativity") in training LLMs are larger than that of a photographer taking a photo. And as of today, training a good LLM is probably objectively harder than taking a good photo, even if we forget about hardware costs for a moment.
It's easy to convince judges and juries that copying phone numbers into a phone book doesn't require human creativity. But we're talking about the most bleeding edge tech companies producing a bleeding edge new product here. I think it's going to be really hard to convince judges and juries that making this new shiny thing doesn't require human creativity. Maybe in say 20 years when even a 10 year old can train a LLM the situation might change, but as of today, quite unlikely IMHO.
The answer would be if the weights are transformative enough, and the copyright would come from the person who decided what images to include in the training set.
The act of choosing to place images in a certain arrangement, such as a collage, can be copyrightable. The same could be said for the "act" of choosing what images to include in a training set and which parameters to use to train the model.
The legal system is not like a computer program. The line between what is "creative" and what is not isn't concrete; it is up to the interpretation of the judge who rules on it.
So your phonebook modifications may or may not be considered "creative" depending on the judge and your ability to convince them. The more you modify it, though, the more likely you are to convince a judge it is a creative work.
It would depend on how transformative the work is.
There is in fact a whole art form where people cut out words from different newspapers and books, for example, and re-arrange those words to form new and interesting art.
So there are ways in which such a work would be a creative work, and ways it which it would not, and it would depend on the particular instance and example.
It seems very difficult to ensure that a model will never output any of the copyrighted content that it was trained on. I can only think of three ways, but perhaps there are others
1. Evaluate every output from the model to ensure that none of the outputs are copyrighted
2. Evaluate every input to a model to ensure that the inputs are either not copyrighted or properly licensed
3. Change the definition of copyright so that ML models can do whatever they want
Nobody is doing #1, because that makes the business models not work. Established brands (like Adobe) are doing #2. I get the feeling that there are a lot of ML startups that are hoping that #3 will happen, but it seems unlikely
Ensuring a model never outputs copyrighted content is unimportant and tangential. It's irrelevant. You don't look for a way to make humans output no copyrighted content, you address each time they do case by case.
A model training being rendered fair use doesn't mean any of its output can be used for whatever regardless.
I think when GP says "address each time case by case", they mean "you sue them when they infringe", instead of "this human has an illegal brain because it remembers Taylor Swift's songs".
PS: your "#1" is really hard to do and I'd guess it is infeasible. Even Google (esp. Youtube) with their vast data capabilities, often gets it wrong.
My issue with this take is that machines are not people. We only have lax rules for humans precisely because they are humans, not on the basis that they can learn. Copyrighted works are produced for people and given how human learning works, applying the derivative works rule to humans would be completely impractical and destroy the point of works with copyright. The same cannot be said for AI companies treating everything on the internet as fair use for training AI.
Interesting. I wonder what they mean by "Ethical" -- instead of e.g. saying "definitely free and open." I'm willing to bet "stuff they gathered from likely unwitting Adobe users."
They at least say they trained on their own licensed stock images and other copyright-free artworks. They are a big company and know that they would get sued, so it should be in their interest to do it this way.
It means trained on data from stock photo sites they own, for which all uploaders agreed to terms of service which state that uploaded materials can be fed into AI training.
IANAL, but I rather think the weights are not copyrightable anyway; and if I built a model on copyrighted data, I would conclude that the inseparable but reproducible parts of the weights retain their copyrights, despite the weights as a whole not having their own.
No, that doesn’t follow at all. The argument is that either the training or the expression violated existing copyrights through the making of unlicensed copies. It’s not based on open source licensing. Although OSS viral licensing may well apply if fair use is not a successful defense.
If it's public domain, then no copyright is violated. I'm not talking about public-domain data; the G-G-GP specifically mentioned the possible legal interpretation that training on large amounts of publicly visible (but not public domain) data is itself a copyright violation.
There is no official ruling... yet. We are very early in this rapid public development. Laws and rulings take years or decades.
They are trained on a lot of text. News sites, comments, books etc. Most books and news sites fall under copyright. Is this fair use? Who knows. Fair use is also an American thing. ChatGPT can be used in the EU, which doesn't have such a broad view of fair use.
If you make a game only out of a lot of copyrighted assets without paying it isn't fair use. Are LLMs different?
What about image generation, which you can prompt the models for specific styles of artists, which works are all copyrighted, but still used for training?
That's also my understanding: either the weights are copyrightable, and then all the models need explicit agreements for any work they include (because the models become derivatives), or they are not copyrightable, being just machine data (the most likely scenario in my opinion). They can't have it both ways.
There is also a (IMO less likely, but still conceivable) scenario where weights ARE copyrightable, but represent fair use of the training data on grounds of being "sufficiently transformative".
I consider that one super likely, but then using the model to make works competing with one of the artists, in their own style, is a non-fair-use derivative work.
Your case wouldn't be about style, it would be about specific elements that you posit were memorized and regurgitated by the model. The fact that you're creating art in the same style/medium as the author is what negates the "sufficiently transformative" fair use defense.
Basically, that world ignores the AI model completely. If your resulting work wouldn't be fair use if you directly were working with something from the training set, it wouldn't be fair use if you fed it through an AI model first.
I think there could be an argument that it's copyrightable but not a derivative work.
If I read a few books about a subject as research, and then I write an article about the subject, it's my own copyright. The fact that I did research doesn't make it derivative of those books (correct me if I'm wrong, IANAL).
Perhaps a model created from copyrighted material be treated in the same way?
> If I read a few books about a subject as research, and then I write an article about the subject, it's my own copyright.
Yes, because in that case you'd be the "author" doing "creative work".
> Perhaps a model created from copyrighted material be treated in the same way?
Who would be the author doing creative work in this case? The people who decided what training material to use? Perhaps, but it seems a stretch for the people who selected the training material to be authors but not the people who created the training material.
The inputs could be copyrighted and the weights could be copyrighted if creating the weights from the inputs is (legally) regarded as a transformative use. And I think it could reasonably be considered to be transformative - the weights don't look anything like the input data.
Disclaimer: IANAL. So far as I know, no court has ruled on whether this qualifies as a transformative use. I take no position on how the courts will actually rule. I merely say that they could regard this as transformative use. (But see jerf's "creativity" argument for another hurdle that weights must pass to be copyrightable.)
Transformative use doesn't necessarily mean copyrightable.
Google's thumbnails are a purely mathematical transformation on images (no copyright themselves), and yet are considered a transformative use.
I believe that trained models are similarly a purely mathematical transformation of {data}, but is transformative in what that can be used for going forward.
"Can" bearing a lot of weight in that sentence.
It's how the human, with agency, uses the model that may be a derivative or copyright infringing use - not the model itself nor necessarily the output.
The output of a generative AI may be similar enough to an existing work that it is derivative of that work. It is possible to construct a prompt that infringes on an existing work even if that work wasn't part of the training data.
For that case, consider you drew a picture. That picture that you just drew isn't part of any training data. I could presumably look at it and describe it with sufficient detail that something similar enough would be generated... and that may be considered a derivative work. The same test could be applied to me describing it to someone on Fiverr with the same outcome.
If I were to publish that work by the generative AI or Fiverr - who would be infringing on copyright? me? or the black box that may be AI or Fiverr that created a picture based on my prompts?
Another way to look at it: if a thing reproduces data subjectively resembling the originals and you use it anyhow, then it's a non-transformative use, and the methods used are just extra details.
No, weights are not just data fed but also the training process itself. I think the whole argument hinges on how much human thought and action is needed in training the model.
On the other end of the spectrum, AI generated content couldn't be copyrighted if there is no human involvement. If someone asks GPT to write 1000 poems, it couldn't be copyrighted.
In a sane legal system a new copyright law would be passed to clarify all of this. In ours, the poor copyright office needs to make things up on the fly.
Their recent decision that implies that anything that AI is used to produce is non-copyrightable is silly, sad, and not sustainable.
Compiled object code of a bunch of code you didn't write. I don't know why programmers are so eager to forget that copyright is not at all about what something is, and all about where it came from. It'd be hard to assert that you hold copyright over object code compiled from code you didn't write!
>> And very much of intellectual property law comes down to rules regarding intangible attributes of bits - Who created the bits? Where did they come from? Where are they going? Are they copies of other bits?
Or is the fact that compiled code enjoys copyright protection, even though it is not human generated, evidence that being generated by a human is not overly important for copyright protection?
An mp3 encoding of a wav file of a copyrighted song is still copyrighted, despite those exact bits never having existed before, and being created entirely by a computer.
What happens if we train a neural network on a single, copyrighted work? Say it has one input node (or even zero, if you like), and regardless of this input, its output is always exactly the copyrighted work it was trained on. What do its weights represent? Clearly, its weights represent a direct encoding of the original work. Those weights are copyrightable, but not by the person who trained the neural network -- the copyright is held by the owner of the original work.
What if we train the neural network on just two copyrighted works? If its one input node is 0, it outputs the first, and if it's 1, it outputs the 2nd. Almost certainly, its weights are a complicated, tangled mix encoding both, like a compression algorithm that completely rearranged its input. Who owns the copyright to those weights? To whatever extent the weights can be "factored out" into a set representing the first work and a set representing the second, clearly the copyright holder of the first work holds the copyright on the first "factored set", and the 2nd on the 2nd. It seems obvious that we must be able to do this "factoring out" somehow (even if the topology of the factored networks is different), because we know both works are exactly represented by the weights, and the neural network itself can use this information to reconstruct them both, so they're in there ... somewhere. So is there a sort of "joint copyright" on the combined weights, where nobody is really allowed to do anything with it without approval of the other? Regardless, it's still clear that whoever trained the neural network has no claim on any copyright.
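The one- and two-work thought experiments are easy to make literal. In this deliberately degenerate sketch (the "works" are invented placeholder bytes, and the "network" is just a stored weight matrix with a selector bit, not a trained model), the weights are nothing but a re-encoding of the two training works, and the single input bit selects which one to regurgitate:

```python
import numpy as np

# Two stand-in "copyrighted works" (invented placeholder bytes).
work_a = b"Work A: some protected text."
work_b = b"Work B: another protected text."

# A degenerate "network": its weight matrix is just the two works,
# stacked and stored as floats, padded to equal length with zeros.
pad = max(len(work_a), len(work_b))
weights = np.zeros((2, pad), dtype=np.float32)
weights[0, :len(work_a)] = np.frombuffer(work_a, dtype=np.uint8)
weights[1, :len(work_b)] = np.frombuffer(work_b, dtype=np.uint8)

def forward(bit):
    """The one input bit selects which memorized work to emit."""
    row = np.round(weights[bit]).astype(np.uint8)
    return bytes(row[row != 0])  # strip the zero padding

assert forward(0) == work_a
assert forward(1) == work_b
```

Real models tangle their training data far more than this, but the sketch shows why "the weights can reproduce the work" implies the work is encoded in the weights in at least some form, even when no individual weight resembles the original bytes.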
Where is the breaking point extending this from 2 works to a billion? People make arguments like "drawing a car from memory isn't infringing on copyright design of that car", which ... are you sure? Reproducing a piece of music from memory (and selling it) is usually copyright infringement. You're allowed to learn a Taylor Swift song as part of your musical training, but you're not usually allowed to then play it back from memory and sell that recording (I'm not sure I morally agree with this treatment of covers, nor if it's globally applicable). So the argument that "surely neural networks are allowed to learn from copyrighted works" misses the point: they can learn all they want, but as soon as they reproduce verbatim (or close enough) a copyrighted work, they're infringing. And if they're representing a complete copy of the work within their weights (which they obviously are if they can reproduce it), then the original copyright holder has a claim on those weights. And never in this process has the trainer of the NN acquired any copyright to anything. The real trainer is a bunch of GPUs, after all.
If the neural network cannot reproduce any of the copyrighted works verbatim, then we're getting closer to "fair use" territory. Yes, it's permissible to write a summary of a copyrighted work. That is so lossy as to not "compete" with the original work in any meaningful way. If it could be demonstrated that neural networks do not encode completed works (no matter how hard the factorization would be), then one could make this argument. Unfortunately, the evidence is that LLMs are more than happy to completely regurgitate copyrighted works verbatim. It seems to me the copyright holder of the original work therefore must hold a share of the claim on the weights. Still, the GPUs that trained the network do not magically acquire copyright over anything.
I wonder if the real answer is that the weights are copyrighted, and that copyright is held jointly by hundreds of millions of people, and nobody can do anything with those weights without the approval of all the others. I'm not saying I like that universe, but I am saying it's the most internally consistent answer I can think of, and seems to follow from the above argument.
In fact, the "factoring out" process shouldn't even be that hard: find the input vector that forces the ANN to output the copyrighted work verbatim. There should be some simple method of "baking in" the first step of the feedforward algorithm, applying that vector to the first layer of weights, and then considering the input layer as the first hidden layer of a network with 0 input nodes. It is now equivalent to a neural network that can only ever output a single copyrighted work, and therefore its weights exactly encode (bloatedly!) that work. The owner of the work holds copyright on those weights. Importantly, if I'm thinking about this right, the weights of this derived network are exactly the same as the original except in the first layer.
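For what it's worth, that baking-in step is mechanically trivial for a plain feedforward net: the fixed input times the first weight matrix just folds into the first layer's bias, and every other weight is untouched. A made-up numpy sketch (random weights, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer net: h = relu(W1 @ x + b1); y = W2 @ h + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def net(x):
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

# "Bake in" a fixed input vector: fold W1 @ x into the first-layer bias.
x_fixed = np.array([0.5, -1.0, 2.0])
b1_baked = W1 @ x_fixed + b1  # the derived network's first "bias"

def net_zero_input():
    # Equivalent network with zero input nodes; W2 and b2 are unchanged.
    h = np.maximum(b1_baked, 0.0)
    return W2 @ h + b2

assert np.allclose(net(x_fixed), net_zero_input())
```

So yes, under this sketch the derived zero-input network shares every weight with the original except the first layer's bias.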
On the other hand, we need the original input vector for this to work, and one could argue that the network weights are simply the algorithm for decoding the input vector into the copyrighted work. So the originator holds copyright on the input vector, not the weights. Does it matter if the input vector has smaller information content than the original work? Clearly this argument relies on the input vector being the "actual encoding", and therefore must have at least as much information. If the input vector is an embedding of "please show me the latest Tom Clancy novel in full", this argument breaks down.
No, a document's contents aren't inherently copyrightable. They have to be a creative work or a method of production, and part of that basically means it has to be human generated content (as opposed to computer or animal generated).
AI weights might be considered a method of production, but that isn't clear yet.
> Was there no work put into their creation by someone?
This is the “sweat of the brow” theory of copyrightability, which courts have rejected (for good reason based on the statute.)
“Someone did work to enable this thing to exist” is not sufficient to make a thing copyright-protected.
> There's no fundamental difference between an image, code, or weights.
And neither images, code, nor weights that are mechanically produced with no creative input by a particular author are subject to copyright in their own right (depending on their relation to the source material on which the mechanical process rests, they may be covered by the copyright on the source material.)
The best argument for weights being copyrightable (and it probably applies better to some models than others) is that the assembly of source material is a creative work subject to a compilation copyright, and that the model weights themselves are just a mechanical translation of that compilation, covered by that same copyright.
> Was there no work put into their creation by someone?
Putting work into something is not a sufficient criterion for copyright.
> All of it is just a stream of bytes that the computer can interpret somehow
This is also not a sufficient, or at all relevant, criterion for assigning copyright.
Also, in the sense you presented, those files are not fundamentally different from random noise. Which is not a particularly useful reduction for this exercise.
Weights are the output of a mechanical process run over the training set with no element of human authorship, just as the output a model produces from a prompt is, and the Copyright Office has already declared such output outside of copyright.
> Is a document not copyrightable based on its contents?
Creative process is the bigger issue.
> Weights are just a different kind of a document.
And who sits down and writes this document of weights?
Copyright is not for "documents", it is for works that have creativity in them. The legal bar for that level of creativity is low, so low that it is easy to come away thinking that anything that can be cast as a "document" must be copyrightable, but the bar is in fact not zero.
In particular, taking other documents and shoving them through a process that generates a lot of other numbers with no human or creative interaction is definitely something I'd be concerned the courts would judge as not sufficiently creative to be copyrightable. The process itself would certainly consist of copyrightable code, but the output doesn't necessarily. This would be somewhat similar to the observation that there is no copyright to be had in a big table of files and their MD5 hashes (or other hashes), such as a Linux distro might use for integrity checking. Lots of copyright in the original file contents, copyright available on the process for producing these tables, but the tables themselves would likely be ruled not itself copyrightable as there is no creativity in that output.
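To make the hash-table analogy concrete, here's roughly what generating such a manifest looks like (file names and contents invented): the output is fully determined by the inputs and the algorithm, with no creative choice anywhere.

```python
import hashlib

# Toy stand-ins for files on disk.
files = {"README": b"hello\n", "app.py": b"print('hi')\n"}

# A purely mechanical manifest, like a distro's integrity-check table:
# every line is forced by the input bytes and the MD5 algorithm.
manifest = "\n".join(
    f"{hashlib.md5(data).hexdigest()}  {name}"
    for name, data in sorted(files.items())
)
print(manifest)
```

Nobody "authored" those hex digits; swap in a different hash function and the table changes completely, with zero human judgment involved either way.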
Note this also has absolutely nothing to do with the question of whether AI output is copyrightable, this is about the huge table of numbers that make up the neural net weights being copyrightable. (Though it would be sort of an interesting question for the legal system to grapple with as to how a non-copyrightable set of numbers could then produce something copyrightable. Call it a philosophical variation on the "copyright washing" argument; can copyright spring from a non-copyrightable source other than a human brain, thus somehow "flowing uphill"? Would a human brain be copyrightable? Stay tuned for those questions, I guess, or if not you, your grandchildren.)
Per your other comments, "work" is not the bar, "creativity" is. "Size" is not the bar either. Merely being a much larger table of numbers than a list of hashes or a phone book is not the question. No human is in that table of numbers creatively saying "no, wait, this neural weight should be -1.5 instead of 2.0 to produce this creative effect". No human is even capable of working in the medium of neural net weights in a creative manner.
If you want to go the "novel legal theory" route, you could play with claiming creativity in the selection of input material and claim the resulting neural weights has a copyright in compilation: https://en.wikipedia.org/wiki/Copyright_in_compilation That's a long way from a slam dunk though. Way out on a legal limb there. It isn't entirely clear to me what exact rights would result from such a claim either. It would be a landmark copyright court case for sure.
IANAL, but I suspect that the "novel legal theory" in your last paragraph would fail. It might succeed if you gave GPT a hand-curated list of materials; hoovering up the entire internet is not that.
Recipes, for example, are not copyrightable in the US. Neither are some of the concepts behind creating a fillable form. It's not an all-or-nothing system.
AI licensing is extremely complex. Unlike software licensing, AI isn’t as simple as applying current proprietary/open source software licenses. AI has multiple components—the source code, weights, data, etc.—that are licensed differently.
Are you joking? This isn't wrong, per se, but it's worded as though written by someone with only the most casual / cursory interaction and knowledge of this area of law / commerce (e.g., including licensing, copyright, trademark / service mark, patent, etc.) ... until perhaps quite recently.
Yes, the AREA IS complicated. No, so-called "AI" is not introducing all sorts of novel issues, structures, etc. "AI" has some nuances distinct from much of what has come before (happens basically every time more significant tech comes along) and some possibly more unique questions related to economics, ethics, philosophy, and the like, but the relevant areas of law and practice have often been complicated and sort of "bleeding edge", even going back before the industrial revolution.
Big money, powerful tech, large-scale economic forces, etc. = lots of maneuvering, legislation, litigation, etc. = complicated "rules of the game".
Drawing the distinction vs. software in general is reasonable - but, the rather click-baity headline and "I just learned about 'IP' law and bah gawd y'all are doin' it wrong" tone to the start of this article suggest, to me, that this isn't likely to be the best article to use as a reference to learn about these issues.
The article makes a good point: we should prevent “open-washing” and draw a distinction between well-intentioned restrictive licenses like “Open”RAIL and true open source. However, I worry the name “ethical source” is itself a bit question-begging. While outfits like Bloom may believe in good-faith ethical principles, their definition of ethics isn’t necessarily everyone’s. If restricted models are “ethical”, is releasing open weights “unethical”? Conversely, is releasing a model with PII or artist styles in it “ethical” if a few known use cases are forbidden? There’s no one right answer. Labeling any one set of restrictions as “ethical” off the bat makes discussion harder and puts open source on the back foot to justify “not being ethical”. Better to just call them “restricted models” or “guarded models”, and leave it to individuals to decide if these restrictions are beneficial or not.
I think the more interesting aspect of all this is that the confusion created by this new business model (not sure how to classify it, so "business model" will have to do) appears to be largely intentional. The subject matter is complicated to begin with, experts being a niche of a niche of a niche, and the assumption that the general public can even understand it (or that it can even be dumbed down to digestible sound bites) is, in my mind, very optimistic. Now, courts are not typically stacked with dummies, but again, how many are well versed in issues of technology?
All in all, I don't disagree with the point you raised, but I worry that all this will only further muddy the water for the general population.
"Now, courts are not typically stacked with dummies, but again how many are well versed in issues of technology?"
Even if they are well versed in issues of technology that does not mean they'll make what any given one of would consider a good decision, as plenty of people well versed in issues of technology disagree with each other on these issues.
Nothing guarantees that on any issue, really, as you can always find people who disagree... and if they happen to be judges, they get to decide unless a higher judge overrules them... and that judge has the same problem as the first.
Sure. My point is that I would so much rather have a decision handed down that was considered on actual merits (we might disagree, but at least I would be able to see some sort of real consideration, and not what amounts to talking points from various lobbyists). A judge with zero exposure to the area is at best 50/50, and regardless of the ruling I will be annoyed that a person with zero knowledge is declaring how something he knows little to nothing about can be used (just like I am more and more annoyed with the political class in Washington, but I am more inclined to believe these days they know exactly what they are doing -- serving their own interests).
To your point, it is absolutely not a panacea (new blood inevitably ends up in government, and the results so far are in line with what you said), but it would at least be a starting point.
Hmm. Makes a few unsubstantiated claims, with hand-wavy appeals to risks that our private corp overlords are presumably protecting us humble users from, now that they've built their product on open source and data by closing it down and changing terminology to suit.
There's an intelligent discussion to be had, and I think this otherwise-reasonable article could be part of it if it toned down the presumption and condescension a little.
Weights might be copyrightable but in no universe are they copyrightable by OpenAI, Google, etc just because they did the training and spent money on GPUs.
The only people who can possibly own the copyright, if any such copyright exists, are the authors of the training data.
I find this whole discussion about copyright of weights almost absurd; the incredible amount of deference given to our corporate lords is such that we are "hallucinating" new forms of IP protection for NN weights that have never existed in any kind of statute or case law, and that cut completely against the grain of all the law that currently exists.
>Weights might be copyrightable but in no universe are they copyrightable by OpenAI, Google, etc just because they did the training and spent money on GPUs.
I don't see why not. If you took all the same training data, you would not get the same weights. Especially if RLHF was used to tune those weights. The weights are not a set of facts, they are the result of work, sometimes millions of dollars of work. Surely they deserve copyright protection if they are ever "published".
If not, then they are a trade secret, and other rules apply.
> The ethical license category applies to licenses that allow commercial use of the component but includes field of endeavor and/or behavioral use restrictions set by the licensor.
I don’t love the name, “ethical license” sounds like a description of the license: this license is ethical. Really this sort of license imposes a particular ethical framework on the user.
Not to throw shade, though. It is actually hard to come up with a neutral-sounding name for this sort of license, I think. I keep thinking of things like “morality encumbered license,” but that sounds ridiculously euphemistic in a weird way.
Yes I was going to say the same thing. It's a branding that has been applied by the license's proponents, and I personally reject a lot of what they call "ethics" as well as the idea of whatever monitoring and enforcement the restrictions entail - maybe calling it a religious license would be better.
One thing I don't see discussed enough is that, ok let's say the weights are unencumbered, and the source is under an OSI license: the point of open source licenses and free software was to expose the *human understandable* meaning of the final program.
That's why distributing binaries isn't allowed even though technically all of the functionality is present in the machine code. AI weights are basically binary blobs. We don't know what they mean, there is really no source code for them. The best we can do is various black box manipulations on them like LoRA, etc, similar to what we can do to a binary blob.
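For instance, a LoRA-style edit really is just arithmetic on an opaque matrix; nothing about it requires understanding what any individual weight means. A toy sketch with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(1)

# The inscrutable blob: an 8x8 weight matrix we cannot "read".
W = rng.normal(size=(8, 8))

# Low-rank adapter, the core idea behind LoRA: learn small matrices
# B (8 x r) and A (r x 8) and add their product to W.
r = 2
A = rng.normal(size=(r, 8)) * 0.01
B = rng.normal(size=(8, r)) * 0.01

W_adapted = W + B @ A  # W is modified, never explained

assert W_adapted.shape == W.shape
```

Same shape, slightly different behavior, and at no point did anyone interpret a single entry of W -- which is exactly the binary-blob situation.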
>> AI weights are basically binary blobs. We don't know what they mean, there is really no source code for them.
No. You can do further training on them. If they are something less than code I don't think it's going to warrant all this talk about licensing. GPL, MIT, or some proprietary should cover it.
You can do further training on them, just like you can patch a binary blob. There are some surgeries you can do to the weights, and there are analyses you can do to poke at them and try to understand them, but ultimately they weren't created from a human understandable spec, and without a ton of reverse engineering work the weights by themselves aren't human understandable: hence the "source" component is missing.
The source code that generated the weights is one step removed from the kind of source code we'd need to interpret a bunch of AI weights. It's really meta-source code
> Some people have the perspective that if a license isn’t open source, it’s proprietary. I think it’s more nuanced than that and believe there are three more license types worth naming: non-commercial NDA, non-commercial public, and ethical.
It’s very useful to remember the U.S. government definition of commercial software: it is software that “Has been sold, leased, or licensed to the general public” [1]
This means that a “non-commercial license” is a bit of an oxymoron to a lot of people. Their definition of commercial includes all software with a license, and does not depend on whether the software costs money. (Perhaps not entirely unlike how FSF does not define “free software” based on whether it costs money.)
I'm disappointed that the article is only making the (somewhat pedantic) distinction between source code and weights. From the quotation marks in the headline I hoped that it would instead be making the distinction between human-readable source code and machine-readable compiled form.
For example, IMHO (IANAL) an AI code-completion tool that had been trained on GPL software is (or should be) only legal to distribute if it is accompanied by the training code _and all the code ingested during training_ (or an offer to provide such code upon request).
This is an interesting point. If you read the OSI open source definition, specifically on source code (quoted below) I'm inclined to treat the training data as part of the source code for the purpose of determining whether to consider any model open source.
2. Source Code
The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
A computer program is a set of instructions that may be executed. Weights are values that may be loaded by a program, but are not a program in and of themselves.
Would you consider a Python program to be data rather than program just because it is text input to the python interpreter instead of machine code for the CPU?
Weights are literally numbers computed as output. They are not instructions. The semantics of those numbers even when emplaced (loaded) in an artificial neural net is such that they do not execute. They are not instructions. LLM engines and diffusers perform searches where the weights are used to calculate additional output.
Is source code, like Python text, data? Yes. All code is data. But not all data are source code.
If I gave you a web request log, you would not assert it is a program. If I gave you a CSV file with time-series values from a sensor, you would not assert it is a program. If I handed you a database of contact information, you would not assert it is a program. Weight files are the equivalent of CSV files. They are a dump of parameter values computed from training.
They are not a program.
The definition of computer program is well worn. So is the definition of source code, and the definition of parameters. Weights are parameters.
The difference between code and data only exists in our minds. There is no distinction. Both code and data make the computer do things (and, yes, both code and data only make the computer do things if other conditions are permitting, for example if executed with the right interpreter, or loaded with the right type of viewer). Anything that can be expressed as code can be expressed as data, and vice versa.
> Weights are literally numbers computed as output. They are not instructions.
They are instructions if you consider the LLM system itself to be a kind of weird, indirect virtual machine. Each number can be mapped to a set of instructions that are executed. Even your CPU uses numbers (machine codes) to execute.
Join me in saying: ...code is data is code is data is code is data...
No, declarative programs exist. They are not instructions.
There is no real line between code and data. This is an observation that runs all the way from Turing Machines in computability theory to the Von Neumann architecture and homoiconicity in Lisp.
What we call 'data' is just code that needs a cleverer interpreter.
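A trivial Python illustration of the point: the same string is inert data or executable code depending purely on which interpreter you hand it to.

```python
# "2 + 2" sitting in a variable is just bytes...
payload = "2 + 2"

as_data = len(payload)   # treated as data: five characters
as_code = eval(payload)  # treated as code: an interpreter runs it

assert as_data == 5
assert as_code == 4
```

Nothing in the bytes themselves marks them as one or the other; the distinction lives entirely in how they're consumed.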
Less into theory and more into "wait wtf": Some of the older projects I've worked on were written by people who loved database-driven stuff, to the point they did things like put perl code into one table column (with sentinel values you had to find/replace before `eval`ing the code) and sql into another table that retrieved values for those find/replaces, both retrieved and executed by some really generic code.
They’re not data though, they’re coefficients. They are the only thing that significantly differentiates one model from another.
If I told you the economy can be accurately modelled by
GDP(x) = Ax + B
But I don’t define A and B for you because they’re proprietary, so you haven’t learned anything other than what you can glean from the structure of the model itself (it’s linear, there’s only a single input, etc.)
If most of these models are similarly structured, I’d say the weights are the program.
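To belabor the point with made-up coefficients: two models with identical structure differ only in their parameters, so the parameters carry essentially all the value.

```python
# Two "models" sharing one structure; only the coefficients differ.
def make_model(A, B):
    return lambda x: A * x + B

gdp_model = make_model(1.7, 3.2)     # invented "proprietary" coefficients
rival_model = make_model(0.4, -1.0)  # same architecture, different weights

# Knowing only the structure (linear, one input) tells you nothing
# about what either model actually predicts.
assert abs(gdp_model(10) - 20.2) < 1e-9
assert gdp_model(10) != rival_model(10)
```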
But not "raw" data. They are derived from other data and a program. If this was a collaboration where one collaborator did the processing and one sourced the data, they would likely both claim some amount of ownership of the trained weights.
At a minimum, it would be an active area of negotiation that the attorneys would take notice of. Source: have negotiated these agreements.
I imagine it is not settled law, but there's a clear argument to be made that regardless of the difficulty in curating the data set, it's still a data set.
Can it be licensed and sold? Yes, surely. Is it proper to pretend an open source license is sufficient protection? Probably not.
When people talk about weights, they talk about a network of weights that takes an input and computes an output. There is really not much difference between a saved model and a program.
That's the least of it. In Lisp the distinction between code n data is blurred all the time. In F18 assembly i frequently have "double entendres" which are used as code or as literals depending on the entry point. I think at least once there was code and data in the same entry point. Assembly n Lisp are both homoiconic, after all. N verb at the end of the sentence, are you transliterating German, or a two-foot green Jedi master full of wisdom?
So, Open Data. Got it. This is the same category as config files that are kept up to date by a program as it runs.
- Is it "a program"? Very clearly not.
- Is it source code? You can argue either way. The program won't work without it, but "this specific one" is not required for the program to do something, and that ambiguity means you probably don't want to call it "source code" because it's too vague.
- Is it data used by a program in order to perform its task? Absolutely. It even uniquely defines the program behaviour, and so is a thing onto itself within the context of the program it's used by.
Agreed, output weights are target code, and no one would argue the contrary. Companies pretending to publish source code is nothing new.
Stallman defines source code as "the preferred way in which developers modify the program"
I wrote for wikipedia once that
"Stallman's definition thus contemplates JavaScript and HTML's source-target ambivalence, as well as contemplating possible future forms of software production, like visual programming languages, or datasets in Machine Learning."
So the datasets could be a form of source code, but the most appropriate source code would be the code that crawls or downloads the dataset and modifies it.
>While the RAIL organization suggests adding the word “Open” to RAIL licenses that include similar open-access and free-use as open source (i.e. OpenRAIL-M), this is confusing since the license is not open source so long as it includes usage restrictions. A better name would be EthicalRAIL-M. Using the term “ethical” to describe this category license clearly indicates its functional difference from open source licenses.
I don't even think we should be using the word "ethical" because it implies that anything more permissive is unethical. We should call these morality clause licenses.
The question of whether or not we should have morality clauses involved is complicated. Most bad actors do not give a shit about the licensing status of the code they are using. And these licenses also cause headaches for people who want to follow the rules[0] and avoid copyleft trolling[1]. On the other hand, the morality clauses in OpenRAIL-M are relatively straightforward and non-obnoxious.
[0] This also applies to "non-commercial" licensing, since that is a concept entirely foreign to copyright law. As far as I'm concerned the 'NC' clause in Creative Commons just means 'OK to torrent'.
[1] A practice in which people abuse copyleft licenses to try and extract licensing agreements for minor license violations. The forgiveness periods added to GPLv3 and later versions of Creative Commons are specifically to prevent this behavior.
If anything - this entire conversation just highlights (Over and Over and Over and Over again) how absolutely bonkers abusive our current copyright laws are.
The vast majority of small individuals are compelled by contract to surrender their rights to large corporations. Those large corporations then abuse the ever loving fuck out of those rights.
The express intent of copyright is now a sad joke.
Personally - I'm pretty over the entire show. This system is generating an incredible amount of inequality. New and novel content is absolutely NOT getting made, and these laws are creating vicious infights that drain resources from well intentioned companies & individuals and pass them along to complete scam corporations.
We are told stories as children that we cannot retell in our own voices decades later to our own children.
I am firmly ready to burn this copyright system to the fucking ground. It's been 300 years since the Statute of Anne - I'm ready for a different game.
Fully agree that the existing copyright and intellectual property systems are in dire need of deep reform. But to get people on board, you can't just propose burning it all down, you need to point to a viable alternative. Say, limiting copyrights to something sane like 15 or 30 years. Or making it easier to invalidate obvious or trivial patents.
Or do you really want to do away with notions of intellectual property altogether? You can make an argument for that, but that would lead to deep economic changes, and you need to anticipate what the end result would look like. You still need some way to encourage the creation of new content.
Pointing out that our copyright/IP system is broken is easy. And you're right, it's totally broken! Coming up with a fix is hard work.
>Pointing out that our copyright/IP system is broken is easy. And you're right, it's totally broken! Coming up with a fix is hard work.
The problem I have with these arguments is they ultimately tend to boil down to the devil you know or the devil you don't know.
We keep claiming when something is broken we must provide a "fix" and the assumption is that fix has to be better than the current approach. There's pretty much no way to guarantee this because the systems in place are the only systems with evidence. So, because we have other ideas, we dare not try them because they have to "fix" the problem. The amount of inertia that keeps corruption in motion bothers me and at a fundamental level most of the inertia comes down, ironically, back to property ownership. If we abolish copyright or change it we have to make sure things are fair/equitable. Well sure, that's ideal, but what we have isn't even remotely fair and equitable anymore, so even something broken is likely an improvement.
We have no willingness as a society to try some modifications and be willing to accept failure, then shift to the next modification and iterate around until we get something sane in place. As such, the systems in place remain in place and more and more holes are found to exploit as time progress.
Our systems need to be more adaptable. Founders of the country understood that which is why they made the legal system a legal adaptable system. The question has always been though, what is the threshold? We've played it safe so long that much of the entire system designed to adapt to fix these issues has itself been targeted and gummed up intentionally to prevent that.
> You still need some way to encourage the creation of new content.
Do you? What's the argument for this? Is there some sort of extreme shortage of creative work that the state should find it necessary to encourage it? How about we end copyright, and if there's ever a problem, we offer copyrights for a short period to fluff the commons up again. A copyright anti-holiday, as it were.
Instead we do the opposite: automatically copyright everything anyone produces, and make it very difficult to surrender your copyright (unless Google or Microsoft want it, then if you object you're literally a Luddite caveman who is trying to turn back the clock on modernity because you're old, stupid, and afraid of fire.)
Pointing out that IP is broken is _not_ easy because most people believe in the contradictory notion of intellectual property, you included, not knowing the legal history of IP, the legal and economic history of the concept of property, and so on. If it were easy, it would be obvious to everybody that 1) IP law is immoral and 2) nothing bad would happen if it's abolished outright.
> 2) nothing bad would happen if it's abolished outright.
It is interesting that people living in the places with the weakest IP laws will pay a premium to import baby formula from the places with the strictest laws.
> Here's a free ebook on the subject
Of course it is some right-libertarian wonk piece.
This has nothing to do with IP laws, and everything to do with baby formula laws. You don't seriously think that without the ability to sell the recipe, nobody would invent a safe and effective baby formula, right?
I'm not sure you fully grasp all the dimensions of IP law.
If you have two brands of baby formula, Death brand that kills babies, and OK brand that is perfectly fine, and you start putting Death brand in fake cans labeled OK brand - that is absolutely an IP enforcement issue. The desire of OK brand to protect their brand, and profits, combined with reasonable IP laws allows them to lead enforcement actions and protect consumers.
That is absolutely not the kind of intellectual property that anyone hates.
The kind of intellectual property we are talking about is the one where Death isn't allowed to make baby formula that doesn't kill babies, because OK patented making baby formula that doesn't kill babies and won't give them a license.
I don’t think we need to use the legal system to encourage creation of new content! That’s a natural thing people do. In fact there’s a lot of artistic remixing that is illegal or ambiguously legal under the current copyright regime that can be a powerful form of expression.
I really don’t think we need government policy to encourage artists to create art. (At least not of this sort - I am all for art grants.)
There is also the whole patent / copyright trolling issue too. The fact that $BIG_CORP can hire armies of lawyers to freeze competitors and beat them to market by filing frivolous lawsuits is yet another example of insanity in the whole system.
It's a problem with the legal system (not unique to any specific country, mind you; the problem is global), not the patent or copyright system specifically. It has grown so complex that representing yourself pro se has become a sad joke in all but the simplest cases, and there's no incentive to fix it - quite the opposite: everyone in the system is all for keeping the status quo, because it generates money.
Personally I don’t think patents do what people believe they do (encourage innovation). It’s a bigger discussion but briefly, the only literal function of a patent is to discourage innovation by legally barring anyone from using a patented idea as part of a new innovation. The idea we have is that the secondary effects of this will be increased profits for inventors and therefore more innovation. But actually there’s loads of secondary effects and often many of them outweigh the effect of increased profit. For every one inventor that gets a patent there might be 100 prevented from using that idea in a different and innovative way.
A classic example is 3D printers. Stratasys spent 15 years selling printers that cost tens of thousands of dollars. It wasn’t until the patent expired that people figured out how to make them for $250. Those cheaper printers are enabling mechanical engineers and designers to accelerate their process and make other new innovations faster. Stratasys had such a powerful patent they never bothered innovating down in price, instead rested on their laurels selling $25k printers to big customers.
So how many inventions were delayed or shelved because the inventors couldn’t afford a $25,000 3D printer, and $250 printers didn’t exist yet? Both Stratasys and IBM held patents related to 3D printing and they had to cross license to go into production, so how many others would have come up with 3D printing in the 1990s if it had not been patented? Would first mover advantage in a free market have been enough to stimulate development of 3D printers? Could we have had $2000 3D printers in the early 2000s (Stratasys sold theirs for $30k) instead of ten years later? How many engineers would have invented new gadgets faster if they had a 3D printer ten years earlier?
Another possible example is the x86 and x86-64 ISAs locked between AMD and Intel. I don’t think Intel would have become complacent had there been more competitors…
… or the whole “oracle vs google” over the java API.
I recently watched the documentary Fire in the Blood (2013) [1] about the use, by big pharma, of patents and WIPO to obstruct access to affordable antiretrovirals (ARVs) in Africa during the worst years of the AIDS epidemic, leading to over ten million deaths. All of this when the African market for these medications represented less than 1% of the total market, in dollars. It’s absolutely infuriating!
So you don't like that large corporations can abuse current copyright laws, and your solution when they find a way to launder what they don't control yet (e.g. open source code) is to burn the copyright system to the ground, such that now they can also abuse what remains?
> I am firmly ready to burn this copyright system to the fucking ground.
Same, but the issue is not copyright, which is simply an effort to wield the state to control intellectual property in the same way the state is wielded to control physical property.
The compounding problem arises when property is capital, defined as the means to convert labor into new value. Capitalism is specifically a system in which one can wield control of capital (intellectual or otherwise) to extract profit from labor then trade that profit for more capital. As a result, capital accumulates infinitely, independent of the value produced by the labor which is provided to society.
Artists require capital to convert their labor into value just as any other worker would, so where should that capital come from if not from control of the value they produce? Society must solve this problem or we will not have art to begin with. Only looking at the demand side obfuscates such issues that arise on the supply side, and the only reason we're talking about them now is that digital technology has solved the scarcity problem on the supply side. It has not solved the scarcity problem on the demand side, however.
Finally, art, just like all technological progress, is always the product of entire societies and the history of all mankind that came before it. For this reason, all copyright and patents have no rational basis and are merely bandaids for the ill side effects of controlling capital to extract profit from labor to begin with.
> AI also poses socio-ethical consequences that don’t exist on the same scale as computer software, necessitating more restrictions like behavioral use restrictions
There's plenty of software that has, or could have, similar restrictions. Consider software that allows you to plan vantage points for a shooting or estimate the impact of using explosives at various locations. And the government regulates all sorts of software for export/download because it has military use--everything from development tools to high performance chips that could be used to crunch numbers for a nuclear program, CAD software that can help you build (or destroy) a bridge, etc. The CPUs and GPUs themselves are regulated at certain performance levels, I think.
People, from early school, all the way up to university, use copyrighted materials to learn various topics and obtain degrees. This trains our brains using the work of others.
The same is true as we navigate life. We learn various skills and subjects consuming the work of others.
And, yes, in the case of most people, we use that training to pursue various careers, obtain work and get paid for it.
How can there be a claim of infringement on the part of LLMs and not on every person who has ever used a book, website, article, video or publication to learn something?
Because humans are not machines. Otherwise we would send machines to prison when they kill someone, right?
I see it like this:
Say you write a book. I assume you would find it obvious that I am not allowed to copy your book, replace your name by mine, and sell it, right? That's the point of copyright.
Now say I don't just copy-paste your book, but I run it through a program that replaces some words with synonyms (without losing quality or meaning), and I sell it all the same. Are you fine with that? I would tend to say that I am still infringing the copyright on your book.
Generative AIs can do exactly that, and the people using generative AIs don't have a simple way to check if the output they got is a slightly-modified copy of copyrighted material or not. All we know is that the AI is a machine taking the words in your book, processing them automatically, and generating a new text. Those are not humans who learned about the world and write down their thoughts, but machines that copy-pasted-and-modified words.
Copyrighted materials are either licensed specifically for a human or it's implied that a human will use them to learn.
Naturally, human memory is going to distort and change that information over time. But as soon as you use it in an AI, which has superhuman capabilities of memory, that would go out the window.
Imho the weights are the real meat for most typical models: you can run with them and continue training them with your own code. It's not even guaranteed that the original code would be very useful for that.
But if you are going to make that distinction, for which you can make a case I think, shouldn't you include a third dimension, 'data'? The code alone is hardly useful if you want to rebuild the weights, but all it tells you is that they're loading their proprietary data and then using PyTorch to set up and train the model. You can't reproduce anything using just that. So the real equivalent of open source would be imho either open weights, or open data plus code plus weights (the latter are arguably redundant, but still practical to include). Given that the size of that repo will typically be gigantic, I think open weights is the case we should really be focusing on. I'd rather have a paper explaining the model together with the weights, rather than code that I can't run anyway, if I'm designing an algorithm to continue training the model.
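The point that released weights plus generic code suffice to continue training can be sketched in plain Python. This is a toy stand-in, not any real release: a single-parameter linear model plays the role of a published checkpoint, and a generic gradient-descent loop plays the role of "your own code"; all names and values are illustrative.

```python
# Toy stand-in for "open weights": one released parameter for y = w * x,
# trained by the original authors on their (unreleased) data.
released_w = 1.7  # hypothetical published weight

# Our own data and a generic training loop - no original code or data needed.
our_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # targets follow y = 2x

def mse(w, data):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

w = released_w   # continue training from the released weight
lr = 0.01
for _ in range(200):
    # analytic gradient of the MSE with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in our_data) / len(our_data)
    w -= lr * grad

print(round(w, 3))  # converges toward 2.0, the value fitting our data
```

The same logic applies at scale: a checkpoint plus any compatible training loop is enough to fine-tune, with or without the authors' original pipeline.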
Just a thought: what would happen if all copyrights were abolished, or if the generative AI revolution we’ve been seeing continues to the point where almost everything is machine generated and therefore open season (assuming machine-generated derivative work isn’t protected)? What would actually happen to the US economy?
I don’t buy the argument that people just won’t innovate anymore once the incentive is gone; it just doesn’t cut it. There are multiple motivations that exist simultaneously. For example, governments have a motivation to stay technologically advanced compared to peer nations, humans have an inherent desire to create, plus power, notoriety, etc.
So in that case, let’s just abolish that first layer of incentive - which, through absurd copyright laws, uncovers more greed than anything - open the flood gates, and get rid of all copyright. We need some real innovation, and all this babble about who owns an ‘idea’ is way too restricting.
> Unlike software licensing, AI isn’t as simple as applying current proprietary/open source software licenses. AI has multiple components—the source code, weights, data, etc.—that are licensed differently.
Software also has multiple components, often the same as the ones listed by the author. But what do I know, to me AI is just another example of software.
Weights are just matrices with values between a certain range. So are digital images - just matrices with values. Images are covered by copyright laws, so why shouldn't weights also be?
Might want to get ahead of the curve on this one. How would this work? Would I get a tattoo with a license spelling out the terms covering the contents of my body?
The lack of freedom to modification makes it not "open" either.
Comparing to traditional software, weights are actually worse than binary. You can't "decompile" the weights into the training source code so there is no way for the community to make useful changes to them.
Just as you can't decompile a binary without loss of information, you can't recover the training setup from the weights. "Source" means that you can reconstruct the artifact, so the training data should be available, as well as the code that was used to train it and the build script that invoked it.
A focus on licensing ignores that there are security incentives not to run just any weights you find floating around the net. Getting exploited through misaligned networks is a very real threat, and really hard to combat.
OK but I mean it's functionally a kind of machine code for a strange machine with a neural transformer architecture, like a 'binary blob'. It's outside of the paradigm where machine code is created only by compilation of copyrightable source code written by humans following their creative "aha moment".
Copyright protection is what gives you protection when you put something out into the public. The desire not to publish something is evidence against these protections applying: people keep weights private, for that reason among others, because they know the weights are not copyrightable. You just presented evidence against your position.
This post did cover many of the same ideas I have been ruminating on concerning model weights and the nomenclature of current efforts. That's also why I generally tend to stick with calling these[0] "local/self hosted models" for the time being. A major reason for my reluctance is that I see weights far closer to binary than code, making a distinction important and current FOSS concepts not really applicable.
Of course, this all hinges on the idea that weights by themselves are inherently protected by current copyright, which still seems to be an unsettled topic, hotly debated by both laypeople and legal professionals. Authors are generally afforded copyright on their work by default, and weights raise so many questions concerning authorship that have never been considered.
This being such a contested issue, which will require new laws and/or precedent (depending on the legal system), is very problematic. Regardless of where you live, courts and government entities are generally not famous for their speedy reaction to new things, so clarity may take a while - at which point the industry might have already settled on some agreement that then gets adopted as a basis for actual legislation, which would likely favor financially well-backed entities already actively lobbying for their interests, such as OpenAI.
Some have also pointed out that this is arguing semantics, and I am tempted to agree in principle, but also want to emphasize that I feel this is a situation where that can be valuable. Should weights in some way be afforded copyright protection, clear nomenclature will be needed. Putting some thought into this now is definitely not the worst idea.
I very strongly feel that the specific word "ethical" as part of defining licenses is not the best idea, though. "Ethical" can carry vastly different connotations, depending on a myriad of factors, many of which would go beyond the use-focused definition laid out in the post. Due to this, I'd argue for "behavioral" or "restricted use" over "ethical", as both more clearly state what the intended effect is in cases such as Open RAIL-M[1].
Part of my strong feelings on the use of the word "ethical" come from the fact that with weights and training data, there has been a lot of discussion concerning both rights of and considerations for creators whose published works have been used to create those weights. Due to this, the use of "ethical" referring to a group of licenses could give some the impression that this may indicate that the training data used was "ethically sourced", i.e. in agreement with the original creator. This is something that in my eyes should also have clear labeling, though with weights being very hard to reliably trace back to source data, it currently seems impossible to verify, making this essentially just a good faith effort.
I'm not sure OCV gets to decide any of this, just like I don't think OSI trying to be the sole dictator of the term "Open Source" works out long term. My opinion on things like this is always received controversially, but terms evolve to meet the common usage of the people. If people are calling this "Open Source", and there are more people who want to call it "Open Source" than people who don't, then unless you intend to legally bar them from using the term with actual action, like a lawsuit, this will eventually also be encompassed by the term "Open Source" as people know it, like it or not.
Yes, I know this term is currently defined explicitly by OSI. No, I don't think language prescriptivism wins out regardless of how hard they try, and since I haven't seen any of the hundreds of quasi-Open-Source, but not really, companies get dragged to court over usage of the term, this is all toothless complaining in my view.
As to their actual point, I might actually agree with them if it were only the weights being shared. In most cases the configuration is also shared, which allows popular frameworks to instantiate the model and then execute it for either inference or further training, making the release fully suitable for modification and re-release. I don't need the exact implementation of FlashAttention they used if I can load the model into Huggingface and use theirs, or mine, or whatever.
Edit: This obviously doesn't apply to the models who have restrictions placed on usage just in case people think I mean every instance of sharing a model. Those are obviously restricted use and I agree it muddies the term.
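The configuration-plus-weights point can be sketched in plain Python: a minimal "config" describing layer shapes is enough for entirely generic code to rebuild the network and run inference, without the authors' original implementation. Everything here is a toy illustration, not any real model format.

```python
import json

# Hypothetical release: a config describing the architecture,
# plus the weights as (matrix, bias) pairs, one per layer.
config = json.loads('{"layers": [[2, 2], [2, 1]]}')  # (inputs, outputs) shapes
weights = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0], [1.0]], [0.1]),
]

def relu(v):
    return [max(0.0, x) for x in v]

def forward(x, weights):
    """Generic inference code: matrix-multiply through each layer."""
    for i, (mat, bias) in enumerate(weights):
        x = [sum(xi * mat[r][c] for r, xi in enumerate(x)) + bias[c]
             for c in range(len(bias))]
        if i < len(weights) - 1:  # hidden layers get a nonlinearity
            x = relu(x)
    return x

print(forward([1.0, 2.0], weights))  # → [2.1]
```

This is the same design choice frameworks like Huggingface make at scale: the config tells generic code what to build, the weights fill it in, and the original training code never enters the picture.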