I wouldn't necessarily call it a major breakthrough as such, but I do think some architectural change is needed. There are concepts in images that we don't rely on purely visual understanding for - words, for example: when we see text in an image, we lean on a language model rather than raw visual memory. I think our models need the same thing to reach the next level of capability, by combining models across different domains. I don't know whether that manifests as pre-training a language model and then expanding and updating its tensors as part of image training, or some more complicated merger of the models.
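To make that concrete, here's a minimal sketch (PyTorch, with made-up module names and assumed output shapes) of one way such a merger could look: keep a pre-trained language model frozen and let the image model cross-attend to its token representations. This is just an illustration of the idea, not a claim about how any existing system does it.

    import torch
    import torch.nn as nn

    class TextConditionedImageModel(nn.Module):
        def __init__(self, language_model, image_backbone, lm_dim, img_dim):
            super().__init__()
            # Pre-trained language model, kept frozen so its understanding
            # of language is reused rather than relearned from pixels.
            self.language_model = language_model
            for p in self.language_model.parameters():
                p.requires_grad = False
            # Image backbone trained as usual on image data.
            self.image_backbone = image_backbone
            # Cross-attention lets image features query the language model's
            # representation of the prompt.
            self.cross_attn = nn.MultiheadAttention(
                embed_dim=img_dim, num_heads=8,
                kdim=lm_dim, vdim=lm_dim, batch_first=True)

        def forward(self, images, token_ids):
            # Assumed to return per-token hidden states, shape (B, T, lm_dim).
            text_states = self.language_model(token_ids)
            # Assumed to return per-patch features, shape (B, N, img_dim).
            img_feats = self.image_backbone(images)
            fused, _ = self.cross_attn(img_feats, text_states, text_states)
            return fused

Whether you then fine-tune the language model jointly or leave it frozen is exactly the open question above.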
Learning logical concepts purely from images seems entirely impractical - we can't rely on there being enough images for a model to understand words coherently as language. You could draw a picture of a sign that says "children crossing" not because you remember exactly what an image of such a sign looks like, but because you have an understanding of English and its character set that lets you reproduce it. If you tried to draw the same sign in Arabic you'd either need to see a huge number of such signs to learn from or (more likely) build up a language model for Arabic.
The kind of abstract understanding that we know we can train into language models just isn't learned by image transformers at this scale (or likely any practical scale). A language model can easily parse "A red cube is stacked on top of a blue plate, a green pyramid is balanced on the red cube" and infer things like the position of the pyramid relative to the blue plate; image models quickly fall over on examples like that.
An interesting nascent (and hacky) example of the benefit of combining models is that people are already using language models like GPT-3 to write better prompts for image models.
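Roughly, that trick looks like the sketch below. The API shape here is the older pre-1.0 openai Python client, the engine name is just one that was available at the time, and generate_image() is a hypothetical stand-in for whatever text-to-image model you'd actually call.

    import openai

    openai.api_key = "YOUR_API_KEY"  # placeholder

    def expand_prompt(rough_idea):
        instruction = (
            "Rewrite the following idea as a detailed, vivid prompt for an "
            "image generation model, describing style, lighting and "
            "composition:\n" + rough_idea + "\nPrompt:"
        )
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt=instruction,
            max_tokens=100,
            temperature=0.7,
        )
        return response.choices[0].text.strip()

    detailed_prompt = expand_prompt("a sign that says 'children crossing'")
    # image = generate_image(detailed_prompt)  # hypothetical image-model call
    print(detailed_prompt)

The language model supplies the compositional detail the image model can't reason out for itself, which is the whole point of the comment above.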