Ten Years of Image Synthesis (zentralwerkstatt.org)
120 points by theorize_ai on Nov 29, 2022 | 22 comments



Image generative models have made significant progress over the last few years, but an often-overlooked confounding variable is that during this period, the predominant benchmarking task used by researchers (specifically those working on large-scale image synthesis) switched from unconditional image synthesis to text-to-image synthesis. Text-to-image synthesis is a fundamentally easier task as the text-conditioned distribution is much more predictable than the unconditional counterpart. Additionally, text-to-image plays to the strengths of diffusion models, which benefit from having a conditioning signal to guide the sampling process (via classifier(-free) guidance). As someone who is observing this progress from an adjacent field, it has been hard to tease out how much of this seemingly abrupt progress can be attributed to the fact that the preferred benchmarks and datasets have been shifting behind the scenes. I suspect that if these variables were well controlled, progress over the last few years would look more like the steady, incremental improvements that occurred in the field in the years prior.
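For readers unfamiliar with classifier-free guidance, the mechanism is roughly the following (a hedged sketch only; `model`, `x_t`, `t`, and `text_emb` are placeholders, not any particular library's API):

    def guided_noise_prediction(model, x_t, t, text_emb, guidance_scale=7.5):
        # Predict the noise with and without the text conditioning signal.
        eps_cond = model(x_t, t, cond=text_emb)
        eps_uncond = model(x_t, t, cond=None)
        # Push the estimate away from the unconditional prediction and toward
        # the text-conditioned one; larger scales follow the prompt more closely.
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

The conditioning signal narrows the distribution the sampler has to cover, which is part of why the conditional task is easier to benchmark well on.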


Maybe I'm misreading the tone of your comment - but is that a bad thing? Reframing the approach to a problem and getting state-of-the-art results strikes me as nothing if not progress. I don't mean to come across as negative here - I just don't see why anyone would want to control those variables.

You list reasons why text-to-image is easier than unconditional image gen - but isn't that the point? They're both under the umbrella of image generation, which is ultimately pushed forward.


You can give a blank prompt to Stable Diffusion and it will give you reasonable images: https://www.reddit.com/r/StableDiffusion/comments/xu0dt6/no_...
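If you want to try this yourself, here is a minimal sketch with the Hugging Face diffusers library (the model id and defaults are my assumptions, not taken from the linked post):

    import torch
    from diffusers import StableDiffusionPipeline

    # Model id is an assumption; any Stable Diffusion checkpoint should behave similarly.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # An empty prompt is still embedded by the text encoder, so the model
    # samples something plausible from its learned distribution rather than noise.
    image = pipe(prompt="").images[0]
    image.save("blank_prompt_sample.png")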


The good old GAN models were trained on single-domain datasets or on ImageNet, and those datasets only cover a small slice of the real distribution, which makes them easier to fit. I also don't think GANs are flexible enough to be used effectively for text-to-image generation. So I don't think text-to-image is the easier task; it was the harder one, and it took the development of autoregressive and diffusion models to actually make it feasible.


BigGAN was trained on JFT-300M and it worked fine. In fact, JFT-300M was easier than ImageNet. In the authors' own words:

"Interestingly, unlike models trained on ImageNet, where training tends to collapse without heavy regularization, the models trained on JFT-300M remain stable over many hundreds of thousands of iterations. This suggests that moving beyond ImageNet to larger datasets may partially alleviate GAN stability issues."

This suggests GAN will work even better on, say, LAION-5B. It's just that nobody tried.

"GANs are not flexible enough to be effectively used in text-to-image generation" is an absurd statement. GANs have latents, there is no reason why just feeding CLIP to GANs shouldn't work. Again, nobody tried because most GAN works preceded CLIP.

We should try GANs again. It was abandoned without any evidence whatsoever that diffusion model etc is required. GANs have genuine advantages over diffusion models, like much faster sampling.
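To make the "feed CLIP to GANs" idea concrete, here is a rough sketch of a text-conditional generator that simply concatenates a CLIP text embedding with the latent code; the dimensions and layers are illustrative placeholders, not any published architecture:

    import torch
    import torch.nn as nn

    class TextConditionalGenerator(nn.Module):
        def __init__(self, z_dim=128, clip_dim=512, img_channels=3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(z_dim + clip_dim, 4 * 4 * 512),
                nn.Unflatten(1, (512, 4, 4)),
                nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1),  # 4x4 -> 8x8
                nn.ReLU(),
                nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # 8x8 -> 16x16
                nn.ReLU(),
                nn.ConvTranspose2d(128, img_channels, 4, stride=2, padding=1),  # 16x16 -> 32x32
                nn.Tanh(),
            )

        def forward(self, z, clip_text_emb):
            # The frozen CLIP text embedding plays the same role the class
            # label does in a class-conditional GAN like BigGAN.
            return self.net(torch.cat([z, clip_text_emb], dim=1))

A serious attempt would of course need BigGAN-scale capacity and LAION-scale data, but nothing about the conditioning itself requires a diffusion model.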


"It was abandoned without any evidence whatsoever that diffusion model etc is required.", diffusion models beats GAN models in quality on all datasets it was tested on, I suggest to read (again probably): "Diffusion Models Beat GANs on Image Synthesis" [1], the difference in quality is astounding, also the fact that GANs seems to work well with small one domain dataset, then they became incrementally more difficult to train with Imagenet and then it became easier again with 300M images only demonstrates how sensible these models are to hyperparameters and unstable they are to train (some distributions are better than others), these are the two main reasons why researchers have switched from GAN to Diffusion and autoregressive models, I like your attitude towards protecting the legacy of GANs but I don't put too much faith in them if not for autoencoders and upscalers (although some NN such as SwinIR have completely abandoned the adversarial loss). I bet the future for generative models will be distilled diffusion models for fast sampling. Btw you comment has been clearly influenced by an article [2] from Gwern blog but you didn't link it.

[1] https://arxiv.org/abs/2105.05233
[2] https://www.gwern.net/GANs


I thought this was an excellent summary for someone like me, who has an interest in the subject but only a small amount of knowledge about it. Linking to the papers was a good touch.


There's some older work as well. Geoffrey Hinton's team used deep belief networks in 2006 to generate "handwritten" digits: http://www.cs.toronto.edu/~hinton/digits.html


What's next though? To paraphrase Hemingway - It all happened gradually and then suddenly.


In the arts, music, graphic design, animation, 3D models, and eventually realistic looking video seem to be the clear next steps.

The key to AI art is that while the universe is infinitely complex, the visual form of most things humans can recognize follows a collection of natural patterns. Anything humans can do can be computed; the only issue is being able to set up the scope for the patterns we seek to replicate. For games like chess, that's relatively easy; for images that represent labelable things, harder but still manageable. The main problem with going beyond this is giving a valid scope to the whole problem: what patterns, precisely, do we intend to replicate?


My mind is blown away by Midjourney v4, but the open problems are still plentiful (even before getting into video/motion, 3d etc).

After a few dozen prompts, you quickly learn how the image generation is simultaneously amazing and dumb. The dumb part is the next 10 years of improvements.

This is like Google 1999.


Just getting the number of fingers (and, to a lesser degree, arms and legs) correct would solve 80% of the issues I encounter.


Right. And imagine being able to refine an image with the instruction to move the left arm down a little, and lift the chin up just a tad. Today that's impossible.



Very cool, thanks.


Seems really close to reality https://prompt-to-prompt.github.io/



I have generated some images with sound hands. However, I've had even more come out with defective hands.


Some of those hands are ok/good, but even the second image in that collection has messed-up hands.


Legs, arms, and heads correctly oriented too.


This title might be better as "Ten Years of AI Image Synthesis" or something like that. Because my first thought was container image synthesis.


I think that by now the term has been hijacked.



