I hope I'm not too late to your thread to ask questions. I find your field absolutely fascinating, though I've never taken my interest further than basic undergrad image processing or simple machine learning. I wish I had more time to follow all of my interests.
Anyway, this one particular gray area has been bugging me, and I've been hoping to run into a researcher or someone with the appropriate context to clarify it for me. It's not a technical question, per se--it's a set of questions that build up to an uncertainty related to model training, reuse, and sharing. (I hope I'm asking the relevant questions...) I don't expect answers to all of this, as it would take way too much of anyone's time, but maybe you can pick a few key points and discuss. I'm also really dying to get feedback on the last part as to its feasibility.
1. Do computer vision / modeling researchers typically share common training and evaluation data sets, or is it mostly kept proprietary? If good data sets do exist out in the open, do they undergo continual improvement, or are they frozen? Are these training sets typically amenable to only one type of training--i.e., black box, offline vs. online, etc.? Does it take a lot of practice/skill to know how much to hold back for post-training evaluation? Do you have an estimate for how much time and effort is involved in manually annotating and curating these data sets?
2. Once an algorithm is developed, does a model get trained under different parameters? Does tweaking those parameters lead to vastly different results? Are there typically distinct optima for a given classification task, or can it vary? Does the training set have a big impact on the performance of the algorithm? Which is more important, the training data or the algorithm?
3. Once models undergo offline training, are they fast to run? Is there a typical runtime complexity, or do different types of models operate significantly differently under the hood (by principle or by computational complexity)? Can you run an extremely complicated and robust model with a ton of classification outputs in sub-millisecond time on commodity hardware?
4. Can trained models be packaged and redistributed as a kind of "open source"? Are there any obvious barriers preventing this, such as the existence (or proliferation) of patents in your field? Do computer vision researchers like to share their code / results? If it's not a common practice to share code, would a large number of researchers be downright opposed to having their (patentable) algorithms and (copyrightable?) models made available to others?
5. Are trained models too complicated for the layman to use and produce good, consistent results? (For our purposes, I would consider a layman to be someone with at least some understanding of basic computer science, including exposure to data structures, algorithms, and a little bit of mathematical ability. Perhaps none of this too deep.)
6. Highly related to the last question, would there be a lot of specialized knowledge required to tweak model input parameters? Would these parameters correlate to mathematical operations? Would there be any arcane and seemingly arbitrary weights to adjust that are deeply encoded into the model itself? (Not sure if I'm way out in left field here or not.)
I think what I'm ultimately hitting at is that it would be freaking awesome if there were a good set of robust pre-trained classifier models available for the layman programmer. Models that continually undergo development as improvements are made to training sets, algorithms, the literature--what have you. Please tell me if this kind of thing already exists.
Anyway, I feel like I'm about to go on a long-winded talk about a technology I've been imagining. It's something along the lines of a specification that allows for the broad spectrum sharing of reusable, generic, containerized ML and classifier training results with others. In the world I envision, you would share trained models and import them just as you would code libraries.
Let me reiterate: if this sort of thing happens to exist in the wild already, please point me to it! I can think of tons of uses. :)
Note: replied to my own comment because original text was too long.
Hi, I didn't notice this comment for a long time, but I felt bad not replying to something so long :)
1. Yes, good datasets exist. They grow over time and they are open. Building them takes a lot of time. ILSVRC is a good example. There are many more: Pascal VOC and COCO are probably the best known among them.
2. The optima vary. The training set has a huge impact and is super, super important. It has to be large and varied, otherwise you've lost even before you choose a model.
3. They can be very fast to run. With a good GPU, the most recent convnets can run in a few milliseconds per image.
4. Everything is open and BSD/MIT licensed, no issues at all. Models are often distributed; for example, the ILSVRC 2012 model is included with the Caffe framework.
5. The trained models are TRIVIAL for a layman to use. But they are not trivial to train.
6. Yes, there is a lot of specialized knowledge needed to tweak model params.
The summary is that these models are now trivial to use--you literally give it an image and it gives you very good predictions for what's inside in a few milliseconds (or a few seconds on a CPU rather than a GPU). They are not trivial to train and understand, though. That needs time, practice, and mathematical understanding. Caffe is the best framework to look at. Pretrained models are available for 1000 ILSVRC classes for classification and for 200 ILSVRC classes for detection, but not for many other tasks (e.g. scene classification).
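To give a flavor of the "trivial to use" part, here is a rough sketch of classifying one image with the pretrained ILSVRC model that ships with Caffe, using its Python bindings. The file paths and the image are placeholders standing in for the stock Caffe example files, so treat it as an illustration rather than an exact recipe:

import numpy as np
import caffe

# Placeholder paths for the reference ILSVRC model shipped with Caffe.
model_def = 'models/bvlc_reference_caffenet/deploy.prototxt'
model_weights = 'models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel'

# Wrap the pretrained network in Caffe's simple classification interface.
net = caffe.Classifier(model_def, model_weights,
                       mean=np.load('ilsvrc_2012_mean.npy').mean(1).mean(1),
                       channel_swap=(2, 1, 0),   # Caffe expects BGR channel order
                       raw_scale=255,
                       image_dims=(256, 256))

# Load an image and get class probabilities over the 1000 ILSVRC classes.
image = caffe.io.load_image('cat.jpg')
probs = net.predict([image])[0]
print('Top class index:', probs.argmax(), 'probability:', probs.max())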
(Note: Posted as self-reply because original comment was too long)
I picture a website kind of like Github, or perhaps a language package index like npm's. Except this website is concerned with classification instead of code. You would see the following kinds of downloads:
* Highly optimized, performant, pre-trained classifier models.
Perhaps numbering in the thousands if the site were popular.
You'd see a wide variety of classifiers: general ones,
specific things like "dog breeds" and "celebrities" and "car types".
* Common library code capable of running the models against user data.
Available in a variety of languages. The model classifier format would
have to be generic enough that it can run under libraries in any
environment.
There might also be downloads directed at supporting researchers. Stuff like:
* Well put-together training data sets
* Human-curated annotations, categorizations, ontologies, etc.
available as metadata that can be paired with the training sets in any
way desired. Not all of it may be useful for any given classifier.
Do you know if there is already an existing "standard format" for encoding or serializing pre-trained models for the purpose of sharing and exchanging them? There are likely a few algorithm-specific serialization formats for persisting internal graphs and weights and so forth. But some of these results are left trapped entirely within the confines of the internal data structures of a particular implementation...
In any case, I'm not aware of the existence of one universal format encoding all the things. Because why would there be? What would be "standard" about it? The algorithm space is rather wide and different algorithms encode different things, so there wouldn't even be cross-cutting similarities to take advantage of. It would be like inventing a file format for "text files and PNG images and fonts together!". Arbitrary, pointless. An absurd idea.
Assume it's not pointless, though, and start forming a picture of a universal data format or scheme for encoding and sharing all possible training results, irrespective of the source algorithm that produced them or the one required to apply them. We can't just encode the "training result" on its own, because we've already established how useless that would be. Instead, the universal scheme would have to encode at least three things: the computer-generated "training result" (a perception encoding), the predetermined set of possible "classification results", and a "language descriptor" written in an abstract machine language whose task is to bridge from the perception encoding to a classification result when a user input is provided.
descriptor(input, perception encoding) => classification result
Apart from the user input, everything else is encoded into our data serialization format. The "descriptor" would ordinarily have been some C or MATLAB (or whatever) code. It's the part that would have told us "this picture is of a KITTEN" or "this text was written by STEPHEN KING" given all the other inputs. Now it is an encoding of an abstract circuit, state machine, or some other language grammar. Notice also how this has become entirely self-hosting.
If there are other arguments, then,
descriptor(input, A, B, C..., perception encoding) => classification result
where metadata concerning the purpose, names, types, ranges, and defaults for `A, B, C...` is also encoded in the data format. The classification types, ranges, etc. must also be encoded,
classification result ∈ (class P, Q, R...)
Instead of being compiled to a reduced representation and inlined into the body of the "descriptor", the classification set could be provided as a parameter, adding a further degree of indirection. I won't show any further notation.
To quickly summarize: the language descriptor parses the model, accepts an arbitrary input (a classification set, possible parameters, and the subject material), and ultimately produces an output.
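To make that concrete, here's a purely hypothetical sketch of what one such container might hold, written as a Python dictionary for readability. Every field name is made up for illustration; the point is only to show the three pieces described above plus the parameter and class metadata:

# Hypothetical layout for the imagined "universal" classifier container.
# All field names here are invented for illustration.
model_container = {
    "format_version": "0.1",
    "descriptor": {
        # The "language descriptor": an abstract program (bytecode, circuit,
        # state machine...) mapping (input, parameters, perception encoding)
        # to a classification result.
        "language": "abstract-vm-0.1",
        "program": b"...",          # opaque blob, interpreted by a client library
    },
    "perception_encoding": {
        # The raw training result: weights, support vectors, trees, etc.
        "payload": b"...",
    },
    "parameters": [
        # Metadata for the extra arguments A, B, C... in the notation above.
        {"name": "threshold", "type": "float", "range": [0.0, 1.0], "default": 0.5},
    ],
    "classification_set": [
        # The possible classification results P, Q, R...
        {"id": "BOB-ROSS", "label": "Bob Ross painting"},
        {"id": "HAPPY-LITTLE-TREES", "label": "Happy little trees"},
    ],
}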
By now you've likely noticed that all of this mess isn't technically different from stuffing the executable of the classifier program itself into the serialization of the training results. You'd be correct, of course. It might seem arbitrary here, but I think I'll demonstrate a few nifty results later on. Besides, I'm not really suggesting they be contained in the same file.
To make use of these abstractions, there would need to be some client libraries (C++, Java, Python, etc.) provided that make it trivially easy to load and evaluate any of the classifiers from your own code. Since we went to the trouble of encoding the aforementioned "language descriptor" as an abstract grammar, the whole classifier (training and all) can be hosted and run from anywhere there is a client library provided, essentially making the classifier available from any language. What's more is the client libraries would not require constant updates to support new algorithm variations--it's baked into the data format, so we get the capability for free simply by swapping files.
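In Python, a hypothetical client library along these lines might expose little more than a load-and-classify pair of calls. "modelhub" and its functions are invented names, sketched here only to show the intended shape of the API:

# Hypothetical client library; "modelhub" and its API are invented for illustration.
import modelhub

# Load a shared container file. The library interprets the embedded descriptor,
# so new algorithm variations arrive by swapping files, not by updating the library.
model = modelhub.load("bob-ross-paintings-1.0.classifier")

with open("painting.jpg", "rb") as f:
    result = model.classify(f.read(), threshold=0.5)

print(result.top())   # e.g. ("BOB-ROSS", 0.95)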
Another cool thing we could do is define shared sets of "classification results". Instead of defining classes and categories and whatnot on a situation-to-situation basis, perhaps we could draw from a global pool of concepts and ideas pulled from the world. We could impart stable names to as many different things and concepts as possible: classes like "CAR", "PERSON", "BOY", "DOG", "TANK", etc. -- all designed to be globally unique and robust identifiers that can continue to evolve over time without breaking our classifier algorithms. A side benefit is that all classifiers would begin to speak the same shared language. (Granted, not all classification outputs would be amenable to this. Some would. Photos seem like a great use case.)
Now, if we were to build that category database as an ontology database instead... perhaps you could begin to semantically infer things?
{Dexter's Lab} implies {Cartoon}
Might we infer the person who uploaded it is a 90's kid?
Or for a more graph topology-based, semantic kind of result,
{Velociraptor} implies {Dinosaur}
               implies {Predator}
               implies {Extinct Animals}
               implies {Seen on film} {Jurassic Park}
Coincident occurrence with {Person} -> {Man} -> {Sam Neill}
{Sam Neill} was {Seen on film} {Jurassic Park}
We can probably be sure that you're looking at a still from {Jurassic Park} at this point.
I'm not claiming searching the graph like that would be efficient, of course. But if you've got ontological overlap that is cheap to check, it might be fun...
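As a toy illustration of "cheap to check" overlap, here is a minimal sketch with a hand-written, made-up ontology fragment. It walks the implication edges from each detected label and reports the concepts where the chains meet:

# Toy ontology fragment, hard-coded purely for illustration.
ONTOLOGY = {
    "Velociraptor": ["Dinosaur", "Seen on film: Jurassic Park"],
    "Dinosaur": ["Predator", "Extinct Animals"],
    "Sam Neill": ["Man", "Seen on film: Jurassic Park"],
    "Man": ["Person"],
}

def implications(concept, seen=None):
    """Collect everything a concept transitively implies."""
    seen = set() if seen is None else seen
    for parent in ONTOLOGY.get(concept, []):
        if parent not in seen:
            seen.add(parent)
            implications(parent, seen)
    return seen

# Labels produced by classifiers running over the same image.
detected = ["Velociraptor", "Sam Neill"]

# Concepts implied by every detected label at once.
overlap = set.intersection(*(implications(label) for label in detected))
print(overlap)   # {'Seen on film: Jurassic Park'}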
The classifier ontology would be versioned, but probably very slow moving in terms of changes. It might be impractical to package the entire database with an app. With ontologies, you can plug in subsets and wire them to other ones later on.
The pre-trained classifier data can be versioned. Say that someone else did all the work of training it to recognize Bob Ross paintings or whatever.
classifier = new Classifier("algorithm").forModel("BOB-ROSS-PAINTINGS-0.1")
classifier.classify(new Image("http://i.imgur.com..."));
Oops, someone forgot to include happy little trees. But we can fix that,
// Happy Little Trees edition.
classifier = new Classifier("algorithm").forModel("BOB-ROSS-PAINTINGS-1.0")
classifier.classify(new Image("http://i.imgur.com..."));
Produces results:
95% Bob Ross
100% Happy Little Trees
75% Happy Little Clouds
If this kind of tooling and ecosystem existed, do you even know how much fun I could have on Reddit?
But in all seriousness, think of the practicality of reusability. Downloading and running classifiers other people trained, from the language of your choice? That's powerful and empowering. It takes the tech out of the realm of "Google playtoy" and puts it in our collective hands.
Think of the kinds of novel apps that the Average Joe programmer could develop. And if this type of thing truly got the support of the image processing crowd, I can't fathom how much improvement we'd witness on a year-to-year basis.
Does anything like this exist in your field for researchers now? If so, could it be made to be usable by laymen? (Or does it already exist for general audiences? Am I living under a rock?)
If this kind of thing doesn't exist or isn't shared, what steps could be taken toward making something like this a reality? Are there critical gating pieces that need to come together first in order to make all of it work? Or conversely, do you feel strongly that something like this just isn't feasible?
A possible complication in building an "open source" set of classifiers is the deep knowledge required to contribute. And what about patents? It's my understanding that universities like to patent research (e.g. SIFT), and AFAIK there must be broad coverage of this space by the universities. That would be a major setback.
Anyway, I've rambled on far too much. If anyone managed to read all of that, please forgive me for inundating you with such a crazy, ill-informed, and long-winded diatribe. I hope I made sense.
The neural architectures change, so a set of parameters for one network won't run on another. The input image size changes too, which changes the network behind it.
There is not much work on mapping one set of parameters onto another. Transfer learning might have a little applicability, but it's unlikely to help at the bleeding edge of vision research.
Worth noting that a neural network IS a general-purpose function encoding.
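For what it's worth, the common feature-reuse flavor of transfer learning can be sketched in a few lines. This uses torchvision as one possible toolkit (my choice of example, not something mentioned above): the pretrained network is frozen and only a new final layer is trained for a new set of classes.

import torch
import torch.nn as nn
from torchvision import models

# Start from a network pretrained on ILSVRC/ImageNet.
net = models.resnet18(pretrained=True)

# Freeze the pretrained parameters; only the new head will be trained.
for param in net.parameters():
    param.requires_grad = False

# Replace the 1000-class ILSVRC head with a head for, say, 10 new classes.
num_new_classes = 10
net.fc = nn.Linear(net.fc.in_features, num_new_classes)

# Only the new layer's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(net.fc.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()
# ...followed by an ordinary training loop over the new dataset.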