Building a deeper understanding of images (googleresearch.blogspot.com)
220 points by xtacy on Sept 5, 2014 | 51 comments



I am one of the people who helped analyze the results of the mentioned ILSVRC challenge. In particular, a week ago I ran an experiment comparing Google's performance to that of a human, and wrote up the results in this blog post:

http://karpathy.github.io/2014/09/02/what-i-learned-from-com...

TLDR is that it's very exciting that the models are starting to perform on par with humans (on ILSVRC classification at least), and to do so in a matter of milliseconds. The linked page also has a link to our annotation interface, where you can try to compete against their model yourself and see its predictions and mistakes.


So what is the process for handling misclassifications? http://cs.stanford.edu/people/karpathy/ilsvrc/val/ILSVRC2012... is definitely a grand piano, not an upright, which is the "official" answer.

Edit: http://cs.stanford.edu/people/karpathy/ilsvrc/val/ILSVRC2012... is also misclassified. There isn't a bee house there. Closest thing is a barbecue.

Edit 2: http://cs.stanford.edu/people/karpathy/ilsvrc/val/ILSVRC2012... appears to be a building based on scaling clues.



Thanks for being here. Why do you think the results improved so much this year?


Google hasn't released details about their model yet, but the VGG team in close 2nd place had a very simple and beautiful "vanilla" ConvNet model, with more careful hyperparameter settings and training/testing protocol. In other words, the source of improvement is tweaks to the original Krizhevsky architecture from 2012, not completely new and unexpected ideas. That is not to belittle the contribution - these experiments take forever to run and require very good practices and intuitions for what to try next. Karen released the details of the model only yesterday on arXiv (http://arxiv.org/pdf/1409.1556v1.pdf)

More specifically, as can be seen in the paper, it seems that very deep stacks of conv/conv/pool modules with tiny 3x3 filters work well (which is satisfying because it's really simple and beautiful), and that further gains come from being more thorough with the training and testing protocol (data augmentation, averaging, multiscale approaches at both train and test time, etc.).
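A rough way to see why the tiny filters work: stacking 3x3 convolutions grows the receptive field almost as fast as using a big filter, but with fewer weights and more nonlinearities in between. Here's a back-of-the-envelope sketch (my own, not from the paper):

    # Rough sketch (not from the paper): effective receptive field and weight
    # count of a stack of stride-1 3x3 convs, vs. a single conv layer whose
    # one big filter covers the same region (biases ignored).
    def stacked_3x3(num_layers, channels):
        rf = 1          # receptive field of a single pixel
        weights = 0
        for _ in range(num_layers):
            rf += 2     # each stride-1 3x3 conv grows the receptive field by 2
            weights += 3 * 3 * channels * channels
        return rf, weights

    for n in (1, 2, 3):
        rf, w = stacked_3x3(n, channels=256)
        single = rf * rf * 256 * 256   # one conv with an rf x rf filter instead
        print(f"{n} x 3x3 -> receptive field {rf}x{rf}, {w} weights vs {single}")

So three stacked 3x3 layers see a 7x7 window with roughly half the weights of a single 7x7 filter, plus two extra nonlinearities.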

Google will release details of their method on Sept 12 so we'll know more. From their abstract, it seems they have a more significant departure from a basic convnet architecture.


I hope I'm not too late to your thread to ask questions. I find your field absolutely fascinating, though I've never taken my interest further than basic undergrad image processing or simple machine learning. I wish I had more time to follow all of my interests.

Anyway, this one particular gray area has been bugging me, and I've been hoping to run into a researcher or someone with the appropriate context to clarify it for me. It's not a technical question per se -- it's a set of questions that build up to an uncertainty related to model training, reuse, and sharing. (I hope I'm asking the relevant questions...) I don't expect answers to all of this, as it would take way too much of anyone's time, but maybe you can pick a few key points and discuss. I'm also really dying to get feedback on the last part as to its feasibility.

1. Do computer vision / modeling researchers typically share common training and evaluation data sets, or are they mostly kept proprietary? If good data sets do exist out in the open, do they undergo continual improvement, or are they frozen? Are these training sets typically amenable to only one type of training--i.e., black box, offline vs. online, etc.? Does it take a lot of practice/skill to know how much to hold back for post-training evaluation? Do you have an estimate for how much time and effort is involved in manually annotating and curating these data sets?

2. Once an algorithm is developed, does a model get trained under different parameters? Does tweaking those parameters lead to vastly different results? Are there typically distinct optima for a given classification task, or can it vary? Does the training set have a big impact on the performance of the algorithm? Which is more important, the training data or the algorithm?

3. Once models undergo offline training, are they fast to run? Is there a typical runtime complexity, or do different types of models operate significantly differently under the hood (by principle or by computational complexity)? Can you run an extremely complicated and robust model with a ton of classification outputs in sub-millisecond time on commodity hardware?

4. Can trained models be packaged and redistributed as a kind of "open source"? Are there any obvious barriers preventing this, such as the existence (or proliferation) of patents in your field? Do computer vision researchers like to share their code / results? If it's not a common practice to share code, would a large number of researchers be downright opposed to having their (patentable) algorithms and (copyrightable?) models made available to others?

5. Are trained models too complicated for the layman to use and produce good, consistent results? (For our purposes, I would consider a layman to be someone with at least some understanding of basic computer science, including exposure to data structures, algorithms, and a little bit of mathematical ability. Perhaps none of this too deep.)

6. Highly related to the last question, would there be a lot of specialized knowledge required to tweak model input parameters? Would these parameters correlate to mathematical operations? Would there be any arcane and seemingly arbitrary weights to adjust that are deeply encoded into the model itself? (Not sure if I'm way out in left field here or not.)

I think what I'm ultimately getting at is that it would be freaking awesome if there were a good set of robust pre-trained classifier models available to the layman programmer. Models that continually undergo development as improvements are made to training sets, algorithms, the literature--what have you. Please tell me if this kind of thing already exists.

Anyway, I feel like I'm about to go on a long-winded talk about a technology I've been imagining. It's something along the lines of a specification that allows for the broad spectrum sharing of reusable, generic, containerized ML and classifier training results with others. In the world I envision, you would share trained models and import them just as you would code libraries.

Let me reiterate: if this sort of thing happens to exist in the wild already, please point me to it! I can think of tons of uses. :)

Note: replied to my own comment because original text was too long.


Hi, I didn't notice this reply for a long time, but felt bad not to reply to something so long :)

1. Yes, good datasets exist. They grow over time and they are open. Annotating them takes a lot of time. ILSVRC is a good example. There are many more: Pascal VOC and COCO are probably the best known among them.

2. The optima vary. The training set has a huge impact and is super, super important. It has to be large and varied, otherwise you've lost even before you choose a model.

3. They can be very fast to run. With a good GPU, the most recent convnets can run in a few milliseconds per image.

4. Everything is open and BSD/MIT, no issues at all. Models are often distributed; for example, the ILSVRC 2012 model is included with the Caffe framework.

5. The trained models are TRIVIAL for a layman to use. But they are not trivial to train.

6. Yes there is a lot of specialized knowledge needed to tweak model params.

The summary is that these models are now trivial to use - you literally give them an image and they give you very good predictions for what's inside, in a few milliseconds (or a few seconds on a CPU instead of a GPU). They are not trivial to train and understand, though. That needs time, practice, and mathematical understanding. Caffe is the best framework to look at. Pretrained models are available for the 1000 ILSVRC classification classes and the 200 ILSVRC detection classes, but not for many other tasks (e.g. scene classification).
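To give a flavor of "trivial to use", here is a rough sketch of classifying one image with a pretrained ImageNet model via Caffe's Python wrapper as it ships around now. The file names are placeholders and mean subtraction is omitted; see Caffe's classification example for the real prototxt/caffemodel and preprocessing values.

    # Rough sketch, not a drop-in script.
    import numpy as np
    import caffe

    net = caffe.Classifier('deploy.prototxt', 'imagenet.caffemodel',
                           image_dims=(256, 256), raw_scale=255,
                           channel_swap=(2, 1, 0))   # Caffe expects BGR
    img = caffe.io.load_image('cat.jpg')             # HxWx3, RGB, values in [0, 1]
    probs = net.predict([img])[0]                    # 1000 class probabilities
    print('top-5 class indices:', np.argsort(probs)[::-1][:5])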


(Note: Posted as self-reply because original comment was too long)

I picture a website kind of like Github, or perhaps a language package index like npm's. Except this website is concerned with classification instead of code. You would see the following kinds of downloads:

  * Highly optimized, performant, pre-trained classifier models. 
    Perhaps numbering in the thousands if the site were popular. 
    You'd see a wide variety of classifiers: general ones,
    specific things like "dog breeds" and "celebrities" and "car types".

  * Common library code capable of running the models against user data.
    Available in a variety of languages. The model classifier format would 
    have to be generic enough that it can run under libraries in any 
    environment. 
There might also be downloads directed at supporting researchers. Stuff like:

  * Well put-together training data sets

  * Human-curated annotations, categorizations, ontologies, etc. 
    available as metadata that can be paired with the training sets in any 
    way desired. Not all of it may be useful for any given classifier. 
Do you know if there is already an existing "standard format" for encoding or serializing pre-trained models for the purpose of sharing and exchanging them? There are likely a few algorithm-specific serialization formats for persisting internal graphs and weights and so forth. But some of these results are left trapped entirely within the confines of the internal data structures of a particular implementation...

In any case, I'm not aware of the existence of one universal format encoding all the things. Because why would there be? What would be "standard" about it? The algorithm space is rather wide and different algorithms encode different things, so there wouldn't even be cross-cutting similarities to take advantage of. It would be like inventing a file format for "text files and PNG images and fonts together!". Arbitrary, pointless. An absurd idea.

Assume it's not pointless, though, and start forming a picture of a universal data format or scheme for encoding and sharing all possible training results, irrespective of the source algorithm that produced them or the one required to produce classification results from them. We can't just encode the "training result" alone, because we've already established how useless that would be on its own. Instead, the universal scheme would have to encode at least three things: the computer-generated "training result" (a perception encoding), the set of predetermined "classification results", and a "language descriptor" written in an abstract machine language whose task is to bridge from the training result to a classification result when a user input is provided.

    descriptor(input, perception encoding) => classification result 

Apart from the user input, everything else is encoded into our data serialization format. The "descriptor" would ordinarily have been some C or MATLAB (or whatever) code. It's the part that would have told us "this picture is of a KITTEN" or "this text was written by STEPHEN KING" given all the other inputs. Now it is an encoding of an abstract circuit, state machine, or some other language grammar. Notice also how this has become entirely self-hosting.

If there are other arguments, then,

    descriptor(input, A, B, C..., perception encoding) => classification result

where metadata concerning the purpose, names, types, ranges, and defaults for `A, B, C...` is also encoded in the data format. Classification types, ranges, etc. must also be encoded,

    classification result ∈ (class P, Q, R...)

Instead of being compiled to a reduced representation and inlined into the body of the "descriptor", they could be provided as a parameter, adding a further degree of indirection. I won't show any further notation.

To quickly summarize: the language descriptor parses the model, accepts an arbitrary input (a classification set, possible parameters, and the subject material), and ultimately produces an output.
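To make that concrete, here's a hypothetical sketch of the contract in Python (every name below is invented for illustration; nothing here exists as a library):

    # Hypothetical sketch -- all names invented. A serialized "classifier
    # package" bundles the three encoded pieces: the language descriptor
    # (an abstract program), the perception encoding (the training result),
    # and the set of possible classification results.
    class ClassifierPackage:
        def __init__(self, descriptor, perception_encoding, classes):
            self.descriptor = descriptor         # abstract evaluator, not native code
            self.encoding = perception_encoding  # computer-generated training result
            self.classes = classes               # e.g. ["KITTEN", "DOG", ...]

        def classify(self, user_input, **params):
            # descriptor(input, A, B, C..., perception encoding) => result
            scores = self.descriptor.evaluate(user_input, self.encoding, params)
            return dict(zip(self.classes, scores))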

By now you've likely noticed that all of this mess isn't technically different from stuffing the executable of the classifier program itself into the serialization of the training results. You'd be correct, of course. It might seem arbitrary here, but I think I'll demonstrate a few nifty results later on. Besides, I'm not really suggesting they be contained in the same file.

To make use of these abstractions, there would need to be some client libraries (C++, Java, Python, etc.) provided that make it trivially easy to load and evaluate any of the classifiers from your own code. Since we went to the trouble of encoding the aforementioned "language descriptor" as an abstract grammar, the whole classifier (training and all) can be hosted and run from anywhere a client library is provided, essentially making the classifier available from any language. What's more, the client libraries would not require constant updates to support new algorithm variations--it's baked into the data format, so we get the capability for free simply by swapping files.

           import pyclassifier 

           #include <classification>

           etc...


           classifier = classifiers.standardSvm()


           classifier = classifiers.load("my_novel_algorithm")


           classifier = classifiers.load("clustering_doe_et_al_09")

Another cool thing we could do is define shared sets of "classification results". Instead of defining classes and categories and whatnot on a situation-to-situation basis, perhaps we could draw from a global pool of concepts and ideas pulled from the world. We could impart stable names to as many different things and concepts as possible: classes like "CAR", "PERSON", "BOY", "DOG", "TANK", etc. -- all designed to be globally unique and robust identifiers that can continue to evolve over time without breaking our classifier algorithms. A side benefit is that all classifiers would begin to speak the same shared language. (Granted, not all classification outputs would be amenable to this. Some would. Photos seem like a great use case.)

Now, if we were to build that category database as an ontology database instead... Perhaps you could begin to semantically infer things?

          {Dexter's Lab} implies {Cartoon}

          Might we infer the person who uploaded it is a 90's kid? 

Or for a more graph topology-based, semantic kind of result,

         {Velociraptor} implies {Dinosaur}
                        implies {Predator}
                        implies {Extinct Animals}
                        implies {Seen on film} {Jurassic Park}

         Coincident occurrence with

         {Person}, -> 
         {Man}, -> 
         {Sam Neill} 

         {Sam Neill} was {Seen on film} {Jurassic Park}

         We can probably be sure that you're looking at a still from {Jurassic Park} at this point.

I'm not claiming searching the graph like that would be efficient, of course. But if you've got ontological overlap that is cheap to check, it might be fun...
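As a toy example of what a cheap overlap check could look like (the data and class names here are entirely made up):

    # Toy sketch of "cheap ontological overlap" -- data and names invented.
    ontology = {
        "VELOCIRAPTOR": {"DINOSAUR", "PREDATOR", "EXTINCT", "FILM:JURASSIC_PARK"},
        "SAM_NEILL":    {"PERSON", "MAN", "ACTOR", "FILM:JURASSIC_PARK"},
    }

    def shared_context(labels):
        """Intersect the ontology entries for the classes a classifier reported."""
        sets = [ontology[l] for l in labels if l in ontology]
        return set.intersection(*sets) if sets else set()

    print(shared_context(["VELOCIRAPTOR", "SAM_NEILL"]))
    # -> {'FILM:JURASSIC_PARK'}, i.e. probably a still from Jurassic Park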

The classifier ontology would be versioned, but probably very slow moving in terms of changes. It might be impractical to package the entire database with an app. With ontologies, you can plug in subsets and wire them to other ones later on,

          "AKC-STANDARD-DOG-BREEDS_CLASSES-1.0.1"

          "CARTOONS-OF-THE-90S-0.1.0"

The pre-trained classifier data can be versioned. Say that someone else did all the work of training it to recognize Bob Ross paintings or whatever.

          classifier = new Classifier("algorithm").forModel("BOB-ROSS-PAINTINGS-0.1")

          classifier.classify(new Image("http://i.imgur.com..."));

Oops, someone forgot to include happy little trees. But we can fix that,

          // Happy Little Trees edition.

          classifier = new Classifier("algorithm").forModel("BOB-ROSS-PAINTINGS-1.0")  

          classifier.classify(new Image("http://i.imgur.com..."));

          Produces results:

             95% Bob Ross
             100% Happy Little Trees
             75% Happy Little Clouds

If this kind of tooling and ecosystem existed, do you even know how much fun I could have on Reddit?

But in all seriousness, think of the practicality of reusability. Downloading and running classifiers other people trained, from the language of your choice? That's powerful and empowering. It takes the tech out of the realm of "Google playtoy" and puts it in our collective hands.

Think of the kinds of novel apps that the Average Joe programmer could develop. And if this type of thing truly got the support of the image processing crowd, I can't fathom how much improvement we'd witness on a year-to-year basis.

Does anything like this exist in your field for researchers now? If so, could it be made to be usable by laymen? (Or does it already exist for general audiences? Am I living under a rock?)

If this kind of thing doesn't exist or isn't shared, what steps could be made toward making something like this a reality? Are there critical gating pieces that need to come together first in order to make all of it work? Or conversely, do you feel strongly that something like this just isn't feasible?

One possible complication in building an "open source" set of classifiers is the deep knowledge required to contribute. And what about patents? It's my understanding that universities like to patent research (e.g. SIFT), and AFAIK there must be broad coverage of this space by the universities. That would be a major setback.

Anyway, I've rambled on far too much. If anyone managed to read all of that, please forgive me for inundating you with such a crazy, ill-informed, and long-winded diatribe. I hope I made sense.


The neural architectures change, so a set of parameters for one network won't run on another. Image size changes too, which changes the NN behind it. There is not much work on mapping one set of parameters onto another. Transfer learning might have a little applicability, but that's unlikely at the bleeding edge of vision research.

Worth noting that a neural network IS a general-purpose function encoding.
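(In the most literal sense: strip away the framework and a trained feed-forward net is just a bag of weights plus a fixed function, something like this sketch.)

    # Sketch: a trained feed-forward net reduced to "weights plus a fixed function".
    import numpy as np

    def forward(x, layers):
        """layers is a list of (W, b) pairs; ReLU between layers, linear output."""
        for W, b in layers[:-1]:
            x = np.maximum(0, W @ x + b)
        W, b = layers[-1]
        return W @ x + b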


Now if only Google could develop a way to serve static text content without using JavaScript!

(All I get is a B with twirling gears in it...)


This is a longstanding Blogger bug that happens when cookies are blocked. They haven't fixed it because you and I are the only people on the planet who whitelist cookies.


It's also incompatible with the Readability plugin. :(


I believe I use the standard Firefox cookie policies and I have to explicitly allow a bunch of domains in NoScript to see anything on the blogspot site. I do use Ghostery, though.


"typical incarnations of which consist of over 100 layers with a maximum depth of over 20 parameter layers)" Anyone know exactly what that means? I'm guessing that that there are 100 layers total, 20 of which have tunable parameters, and the other 80 of which don't--e.g., max pooling and normalization.


That's pretty amazing. It seems like we're at a point where we could build really practical robots with this?

Robots to do dishes, weed crops, pick fruit? Why isn't this being applied to more tasks?


I think one of the main reasons is that the improvements have been dramatic and very recent; there hasn't been enough time to convert the research code into open source code and libraries. But you can confidently expect these models to become pervasive over the next few years, not just in robotics but in all perception systems.

A few good open source libraries out there that have the building blocks for putting together similar models:

- Caffe (C++) http://caffe.berkeleyvision.org/

- cuda-convnet2 (C++) https://code.google.com/p/cuda-convnet2/

- DeepBeliefSDK (iOS, Android, OS X, Raspberry Pi, JS) https://github.com/jetpacapp/DeepBeliefSDK


karpathy is being too modest here. He's also created his own convnet.js [1], which you can play with online [2] (complete with relevant, working demos), etc.

[1] https://github.com/karpathy/convnetjs [2] http://cs.stanford.edu/people/karpathy/convnetjs/


I would eat a "hat with a wide brim" if Google isn't going to release a robot that can do basic household chores (laundry, dishes, dusting) within the next 3 years.

Google has been gobbling up robotics startups, and given how Google also loves gobbling up personal data, having robot "boots on the ground" in every home must be extremely appealing to them.


The idea is right, but if by "release" you mean "sell to consumers", I think you're a little over-optimistic about the timeline.

Google's driverless car project is a direct descendant of the tech developed for the DARPA Grand/Urban Challenges. Those took place in 2005-2007 [1]. It's been ~9 years since the first Grand Challenge, and while there has been great progress in this area no one will sell me an autonomous car just yet.

The first DARPA robotics challenge was held in December of 2013 [2]. If you check out some of the videos from that competition [3] [4] (or some of the PR2 videos another user posted), you'll see that we can make general-purpose humanoids that can do some pretty neat stuff, but there's still a lot of work to be done before I can buy one that will reliably do many different household tasks.

The Atlas platform that many of the competitors used in the DRC was developed by Boston Dynamics, who were subsequently bought by Google. So yes, they are definitely in this space. Maybe progress in humanoids will go faster just by virtue of the fact that we've got an extra decade of research to build on, but the journey from "works in a lab/demo, barely/slowly" to a viable commercial product is a long one.

[1] http://en.wikipedia.org/wiki/DARPA_Grand_Challenge [2] http://en.wikipedia.org/wiki/DARPA_Robotics_Challenge [3] http://www.youtube.com/watch?v=hzmMVHGNXvI (sped-up highlights) [4] http://www.youtube.com/watch?v=mwWm3HaDbnQ (full 10-hour recording of competition live-stream)


I'll congratulate you on not specifying a time duration for your meal. Indigestible material can cause a blockage in your colon, which can cause severe pain, damage to your colon, and even death. I suggest chopping the hat up into very, very small pieces, and eating them over a very long period, such as a month or longer.


If you'll excuse me, I have to write some letters to the legal department at the hat factory where I work.


I doubt it. The easy parts of laundry and dishes are already done by simple robots sold in every white goods department. The remaining parts are very demanding indeed.

Research labs are not robustly demonstrating these capabilities yet, even with very expensive robots.


Another robot researcher agrees. Classifying images is totally different from enabling a robot to perform a difficult task. Object recognition is a supervised learning problem, mapping image -> vector of probabilities. Robotic manipulation is a control problem, mapping a long sequence of images -> a long sequence of actions.
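Put in rough type-signature terms (just a sketch of the distinction, not anyone's actual code):

    # Sketch of the distinction, in type-signature terms (illustrative only).
    from typing import Callable, List, Sequence
    import numpy as np

    Image = np.ndarray    # HxWx3 pixel array
    Action = np.ndarray   # e.g. a vector of joint torques

    # Object recognition: one image in, one vector of class probabilities out.
    Recognizer = Callable[[Image], List[float]]

    # Robotic manipulation: a closed loop -- the images seen so far map to the
    # next action, repeated over a long horizon, with feedback from the world.
    Controller = Callable[[Sequence[Image]], Action]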


I had this thought the other day - not robots washing the laundry, but folding it. Folding is actually not that simple once you think about it.


https://www.youtube.com/watch?v=gy5g33S0Gzo

Work is progressing. That's a $300K robot.

(edit: no affiliation. Video shows PR2 robot at Berkeley folding towels competently but very slowly in 2010)


... Do you want me to note that in my calendar?



Robots are being applied as fast as (i) they really work, and (ii) they make economic sense.

Precise, powerful mechanical actuation is really expensive. High quality motors use expensive rare-earth magnets and require very accurate manufacturing.

Right now, only very valuable manipulation tasks can justify using a robot. Hence, robots built most of my car, but I take out my own trash.

Non-physical AI stuff will catch on much faster than robots.

edit: The exception is clever things like Roomba, where the mechanical parts are dirt cheap, and the robot's behaviour compensates for its lack of precision. Neat.


I only partially agree. Just recently, pretty decent stepper motors have become very cheap (~$6 for NEMA-17), thanks to the 3D printer boom, and lots of things are now finally possible to implement on a budget.

But I agree that good servos with high torque are still super-expensive.


Some manipulation tasks require force control, too.


Surprisingly, this is also getting solved. For example, there are demos of cheap, self-calibrating delta robots that use force sensors on the plate: https://groups.google.com/forum/#!topic/deltabot/6fxnM20nYKc

Sure, it's not there yet, but the space of what's feasible is growing fast.


I'm not an expert on 3D printing, but it looks like this uses force sensing to make sure the effector velocity vector is nicely parallel to the base plate. The effector is not exerting any force on an object. So it's velocity control, not (the more demanding) force control.

Let me know if I have misunderstood what's happening here.


You're right, I was slightly off in my example.


I wonder if using imprecise motors would be acceptable with sufficient AI, so it could account for errors and imprecision in its movement.


Example project partly motivated by that idea:

http://cswww.essex.ac.uk/staff/owen/machine/cronos.html

(no affiliation)


We don't really classify it this way, but Google's self driving car is the practical robotic implementation of this. There has been a bunch of press around how the Google self-driving car has to have streets mapped out for it, etc. This research will go directly to issues like: "Is that a 'domestic cat' or 'paper bag' in the street up ahead?"


What if it's a domestic cat in a paper bag?


Because it was just invented over the past 2 years? And it is being used, but mostly by Google and Facebook.



Just a small list of other things you need:

1. Natural language understanding

2. 3D object understanding

3. Planning

4. Object manipulation

5. Navigation

6. Speech recognition


You can realistically skip 1 and 6 and control it with your smartphone instead. Or you can simplify the problem by using more constrained verbal commands. I was told that speech recognition works very well when you have a simple grammar instead of open-ended language.


Really, just for pulling weeds? Maybe you thought I meant a robot that can do arbitrary household chores. I actually meant specialized robots for specific tasks.


You can skip 1 and 6 then. 3 and 5 aren't bad right now. 4 might be tricky (depending on the task at hand - weed pulling can be surprisingly difficult depending on context). 2 would likely be difficult.

The results for the actual competition are here. You can take a quick look at the error rates of the first-place approaches. For some tasks they're down to 5-10%, which might be acceptable for some applications but completely unacceptable for others. shrug

The day is probably coming, but it probably won't be soon. Not least because robots are actually ridiculous to work with / develop.

http://www.image-net.org/challenges/LSVRC/2014/results


I wonder whether some of the intermediate layers in these models might correspond to something like "living room" or other locations that provide additional information about the objects that might be in the scene. For example, I suspect it was much easier for me to identify the preamp and the wii in one of the pictures because I knew it was a living room/den instead of an office or study.


In this case, no, they didn't have pre-labeled location/setting information available. You can see one of the datasets they used here: http://image-net.org/challenges/LSVRC/2013/#data.

Generally speaking, neural networks are black boxes. The layers interact with each other, but not in a defined categorical manner like that. Layer size/depth are parameters you provide when setting up the network; they involve tradeoffs in accuracy, space, time spent, etc., much like JPEG quality.


I wish this was available as a translation app. You point your phone at a fruit stand and it names every single item, and you can then ask the vendor for the item by name.

It isn't that crazy; in fact, that's exactly what they have right now, just in English only.


Not exactly what you're talking about, but you should take a look at Word Lens (which Google just bought).

Basically you hold your phone up and position it over a piece of text using the camera. It then OCRs the text, translates it, and replaces it in realtime in the camera feed. It's pretty remarkable.


These classifications are amazing but the fact that the first image in the article is classified as "a dog wearing a wide-brimmed hat" and not as "a chihuahua wearing a sombrero" is telling of how far we are from true understanding of images.

Only a human possessed of the relevant cultural stereotypes (chihuahua implies Mexican, ergo the hat must be a sombrero) could draw that conclusion.

Even so, I firmly believe that at this rate of improvement, we're not far from that kind of deep understanding.


I'm not sure how your example is more true.


How big is the model? Training these kinds of networks is expert work and requires enormous infrastructure; but if they released the model, I'm sure people like us could come up with all sorts of very useful applications.


If you're interested, Caffe (http://caffe.berkeleyvision.org/) comes with some pre-trained models for ImageNet, which were close to the state of the art a year or two ago.



