I feel that Midjourney v5 really lets you explore different worlds.
One recent feature the guide missed is the permutation and repeat features [1]. They're quite helpful for power users who want to explore multiple styles quickly.
Last week I tried putting together a short film using GPT-4 and Midjourney v5. I was stunned by the cinematic frames Midjourney v5 was able to create:
https://youtu.be/6O_tOuUcG9s
I (human) wrote the prompts for Midjourney, though.
[1] https://docs.midjourney.com/docs/permutations
Damn. It's no Harry Potter by Balenciaga, but it's surprisingly compelling given that most of it was generated (granted, with prompting) by AI tools. I notice you credited GPT-4, Midjourney, and Metavoice, but was the music AI generated as well?
I've gotta say, I've seen worse storytelling and cinematography come from actual, serious humans who were getting paid to do it. And the Balenciaga thing is obviously a joke, because that whole genre only works because the style of those videos is beyond parody and sails right over the uncanny valley. This is different. This is interesting. I like it.
I'm glad you pointed out the music.
The music was also AI generated with a tool called AIVA [1].
I'd never composed a piece of music before, and I was pretty surprised by what I could "create". I spent 30-60 minutes max creating the score.
Some parts of their product still feel janky, but as an overall concept, it's quite fascinating. One of the interactions I enjoyed was that AIVA creates scores with different tracks (layers). So I was able to edit tracks I didn't like (e.g., change a piano track to brass) or have AIVA completely regenerate certain sections of the score (e.g., redo the bridge, regenerate the chorus sections).
One difference from Midjourney is that there's no text-based prompting. Instead, you "prompt" through music inspiration.
So, basically, if I want it to compose Baroque music, I can give it Vivaldi, Bach, and maybe a little early classical, tell it to go to work, and end up with something that sounds like it came out in 1765?
I wonder what the limitations are on that whole "musical prompting" deal.
Your video was better in my opinion because it has a real story. All the Balenciaga videos out there are really just realistically rendered parodies with little or no emotion to them.
It is, really. There are a bunch of imitators out there now, but they're less funny to me. I don't think it's because they're generally less well done. While most of them that I've seen are less well done, in the sense that the voices are more "computer-y" sounding, and the rendering not as good, I think it's because the original really is a parody, and doing a parody of a parody just starts getting more ridiculous without getting funnier.
I ended up subscribing to the $60 plan, mainly to get access to the Stealth Mode.
I used about 4 hours of fast time during the project. That said, I could have created this with the $10 plan (3.3 hours) if I had to.
These are what came up via:
The interior of a future-brutalist megastructure. Single image comics style sci-fi robotic architecture, 70s, jet era Jetson cantilever modular space station modules, flag patch on walls, pastel glowing symbols, sci-fi towers architecture, foggy, pastel volumetric light and atmosphere, glass and steel pastel, The 5th element, The Matrix, Blade Runner, photography award, ultra elegant wide angle, volumetric light at mid day nice sky, trending on artstation, Unreal engine hyper realistic photography magazine cover, no people --v 5 --q 2 --upbeta
Strongly disagree. These are 3 sets I just generated with v5 using the following simplified version of your prompt: "interior of a future-brutalist megastructure, comic pastel style, sci-fi robotic architecture, 70s".
Specific words and phrases that accurately translate what you want into the latent space are way more important than complex prompts.
If you've played around with MJ much you'll know what I mean. Some words and phrases overpower everything else in the prompt. I've found that good MJ prompts are very vibe based.
I've been experimenting a lot with both very simple and complex prompts. I think they both have their use cases. In the example you've replicated, the desired image is quite generic. You aren't aiming for a very particular subject doing a particular thing (perhaps both quite obscure), with a particular composition and positional relation, a precise palette, etc.
But some of the most thought-provoking results I get from Midjourney are with short, quasi-macaronic prompts [1]. I know enough botanical and zoological nomenclature to conjure fantastic lifeforms close to what's in my imagination by contriving Latin binomials that just sound right. I can also give certain subjects the cultural flavor I want by making up nonsense words that sound like Swahili, French, Finnish, or whatever, in situations where actually including the language name, or real words from the language, takes the image in the wrong direction.
I had some GPU hours left to burn up the other day and even figured out how to make a short Midjourney animation with dozens of images in a smooth sequence. There are ways of very carefully stepping around local bits of the latent space.
> Some words and phrases overpower everything else in the prompt. I've found that good MJ prompts are very vibe based.
Are you talking about stuff like the trick of adding "artstation," or "in the style of..." or "taken with a blah blah lens/camera?" That all sounds like it generally comes from bias in the training data set to me. It doesn't make it any less valid an observation or any less useful, but I wonder if A) that is the case, and B) if it's hinting at the idea that training these things on source data that was initially generated by humans is going to create some kind of limitation in what they can do.
No, I mean sometimes particular subjects or adjectives will overpower everything else in the prompt.
For example, I tried some weird stonehenge prompts like "mcdonalds in stonehenge" and "stonehenge as mcdonalds store". Stonehenge completely overpowers McDonalds in generated images.
Another example, when I prompt with "squad of space marines" I get tabletop warhammer figurines. However, when I prompt with just "space marines" I get really nice art. It's easy to accidentally slip into a different part of the latent space.
The same thing tends to happen with particular adjectives and descriptors as well (e.g., colour, material, texture). This is a good thing if you can figure out words that translate your idea into the latent space well, but it can make controlling the output tricky sometimes.
That is interesting. My immediate hypothesis would still be that these effects and artifacts are due to biases in the training data. A group of WH40K figures is probably more likely to be identified as a "squad" of Space Marines than just "space marines." Likewise, I suspect there are probably many more pictures of Stonehenge up on the internet than there are of McDonald's. Or maybe the training process just somehow found Stonehenge more interesting than McDonald's (I couldn't blame it lol).
One thing I've noticed is that sometimes, the exact same prompt will generate very different images. Once, I got an image of a blue sports car, off to the side of the road, a kind of craftsman-ish looking interior of a house, and two pictures of some random outdoor forest type area. I don't remember the prompt I used, unfortunately, because I ended up writing it off as a glitch, but it would be interesting to see if anybody else has had it happen.
> words that translate your idea into the latent space well,
So, this phrase gave me a sort of a thought: what if certain words or phrases, like, say, "McDonald's" and "Stonehenge," are just so far apart in the model space (again, likely due to biases in the training set, or just the fact that there aren't many McDonald's restaurants at Stonehenge) that the more interesting, common, or unusual one serves as a kind of attractor and dominates the generation process most of the time?
Do you know if these effects are documented anywhere in the prompt engineering literature?
You might want to experiment with repeating the parts of the prompt you want to emphasize, and putting those parts in the front of the prompt, while putting the parts you want to deemphasize near the end.
I think the really long prompts persist because it's a lot of iterating and waiting (and/or fast hours) to try to pare a prompt down and figure out which parts worked, time you could instead spend making more images.
Honestly I'm not sure why people use such long prompts. I find it way easier to iterate on prompts when they're short and you just change a few key words to find the words that have the biggest impact on the generated images. It's really easy to fire off a bunch of prompts and queue a bunch of MJ jobs.
Seems to be a lot of superstition around prompts. Using longer prompts makes it feel like you're really casting a complex spell that makes you more advanced than a new user.
There is also an issue in many Stable Diffusion model files that makes them quite chaotic with short prompts. With some of those, I've found I get much better results by simply copy-paste-paste-paste-paste-paste-pasting a short positive and negative prompt, even though in theory that should have negligible effect. Not all model files suffer from this, but it leads to superstition.
Then there's the whole "+masterpiece, +best quality, -bad hands, -extra limbs" superstition that comes from Danbooru-based models, but people coming on board just assume it's necessary everywhere.
That’s because we’ve sucked all joy out of creating something, so now we’re trying to create interesting prompts to feel creative again. It’s quite simple, no?
I think people can't believe it's really as simple as it is. If it's that simple, then there is no sacred knowledge, nothing to gatekeep, nothing to be the "expert" on. It means there are no novel methods or processes to discover. There is nothing to "learn" other than the syntax and a large enough vocabulary to describe what you're imagining.
There are prompt books and prompt sites where you can buy prompts a dollar at a time. The hustle culture has built this fog of fake complexity and hidden methods in order to prop up their little cottage industry. Every AI Guy on TikTok has a way to "maximize your productivity with GPT" so you can "Start an AI company".
We haven't sucked the joy out of creating things, we've just tried to further commodify the process of creating and in turn spun a web of myths and lore where there didn't need to be any.
>We haven't sucked the joy out of creating things, we've just tried to further commodify the process of creating and in turn spun a web of myths and lore where there didn't need to be any.
> That’s because we’ve sucked all joy out of creating something
Did we? Go grab a brush and some paint and create. The joy always comes from the inside. If you can't seem to find it anymore, inward is where you should look for it.
I don’t think those results are actually any better than what can be achieved with much simpler prompts. This seems like an example of the “Ikea effect”, where once you’ve put in effort to assemble something, you perceive it to have more value.
You can add a prompt guide to a ChatGPT system prompt using the OpenAI Playground. Then you can go back to writing simple prompts and ChatGPT will spit out all the magic keywords and model parameters, based on its knowledge and your system instructions.
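A minimal sketch of that setup, assuming the current openai Python client (the guide text, model name, and function name here are placeholders, not anything official):

    # Hypothetical sketch: wrap a prompt guide in a system message so the model
    # expands a plain description into a keyword-rich Midjourney-style prompt.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPT_GUIDE = """You are a Midjourney prompt assistant.
    Expand the user's plain description into a single Midjourney prompt,
    adding style keywords, lighting, and lens/camera hints, plus parameters
    such as --v 5 --q 2 where appropriate. Output only the prompt."""

    def expand_prompt(description: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4",  # any chat model works here
            messages=[
                {"role": "system", "content": PROMPT_GUIDE},
                {"role": "user", "content": description},
            ],
        )
        return response.choices[0].message.content

    print(expand_prompt("interior of a future-brutalist megastructure"))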
I find that the more complicated the prompt, the less detail I see in people. With simple prompts I get photorealistic, but with complicated I get animation quality.
Not sure if this is just my prompts or something general.
I wonder how accurate this approach is.
One thing is being trained on someone's art, another is getting hints from the rest of the whole collective knowledge of how that artist works.
Checking the V5 tab, I almost immediately found Akira Toriyama, and the art is... well, sure, on one of the four drawings there is some Goku-like hair on... some cheeks (not the face ones), drawn in the style of Terry Gilliam?
Then of course, for older artists the match is much stronger, and for some of the modern and famous ones too. But it seems to me that the whole method could be heavily biased.
Is the latest Midjourney now considered better than Stable Diffusion? If so, will Stable Diffusion catch up? I strongly prefer the open source nature and ability to run locally of Stable Diffusion.
SD can be better than MJ if you dig really deep into the open source nature of SD. Open source UI and plugins, community-made models of many types and uses, tribal knowledge of techniques... It’s very complicated and requires a lot of searching, reading, installing and experimentation.
The base tech and models straight from Stability AI give pretty crap results if you just plainly describe a scene.
MJ, in contrast, provides great results out of the box. Say anything, get a beautiful picture of something. From there you need to figure out specifically what you actually want.
However, if you really want to iterate on just specific details of a scene with a specific layout, with specific characters in specific poses, with specific style elements, MJ is too chaotic to control at that fine a level of detail. So too is SD out of the box. But if you take the time to learn how to install and use ControlNet, highly specialized models/LoRAs/textual inversions from wildly varying sources, inpainting, latent upscaling, hook up Photoshop/Krita/Blender integrations, etc., etc., you can eventually get very precise control of SD’s results. And then new, better tech releases next week! :D
> Is the latest Midjourney now considered better than Stable Diffusion?
It's better in terms of having a no-user-configuration service available that gets you from zero to decent results with nothing more than prompting.
SD is better in available specialized customized models (finetuned checkpoints and the various kinds of mix-and-match auxiliary models and embeddings that can be used with them), not having banned topics because it is self-hostable, available tooling and UIs that expose tuning parameters and incorporate support for integrating techniques like guided generation with various types of ControlNet models, animation, inpainting, outpainting, prompting-by-region, etc., with image generation.
Midjourney provides much better images by default. It's really impressive.
Stable Diffusion's advantage is in the huge amount of open source activity around it. Most recently that resulted in ControlNet, which is far more powerful than anything Midjourney can currently do - if you know how to use it.
Look around a bit for info on ControlNet. You can use depth maps, scribble in where you want objects to be, or place human poses in a scene, and SD will use them to generate an image. You can combine multiple ControlNet models and control how much they contribute to the scene. The level of control available is pretty awesome. I say that as someone who was in the DALL-E beta and used Midjourney for a few months (though I guess I don’t know what advancements they’ve made in the last few months).
Both StabilityAI and the open source community are working on improvements to Stable Diffusion.
Keep in mind StabilityAI is also pursuing LLMs and a host of other model types, whereas text-to-image is Midjourney's single core competency and value prop. Midjourney is very focused on staying ahead.
edit: I wanted to add that the extensive training costs can be prohibitive for the OSS community to fully participate. Coordination via groups such as LAION can help, but gone are the days of individual OSS participants contributing directly to core foundational model training.
In fact here's a list of painters whose style is immune to direct mimicry in Midjourney because their name is banned:
Ambreen Butt
Jan Cox
Constance Gordon-Cumming
Dai Xi
Jessie Alexandra Dick
Dong Qichang
Dong Yuan
Willy Finch
Spencer Gore
Ernő Grünbaum
Guo Xi
Elena Guro
Adolf Hitler
Prince Hoare
William Hoare
Fanny McIan
Willy Bo Richardson
Shang Xi
Wang Duo
Wang E
Wang Fu
Wang Guxiang
Wang Hui
Wang Jian
Wang Lü
Wang Meng
Wang Mian
Wang Shimin
Wang Shishen
Victor Wang
Wang Wei
Wang Wu
Wang Ximeng
Wang Yi
Wang Yuan
Wang Yuanqi
Wang Zhenpeng
Wang Zhongyu
Xi Gang
Xie Shichen
Xu Xi
I have this list because I recently made a site [1] that displays the 4 images from a prompt of "Lotus, in the style of <paintername> <birth-death dates> [nation of origin]" for every painter listed on Wikipedia's "List of painters" -- except for those in the above list.
The fact that they banned both Xi and Jinping separately to prevent Xi Jinping was surprising to me. Twice as banned as Adolf Hitler.
[1] https://lotuslotuslotus.com - small chance you get an NSFW image if you hit upon Fernando Botero or John Armstrong, perhaps there's more.
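A minimal sketch of how that kind of batch can be scripted (hypothetical names throughout; painters.txt, the abbreviated banned set, and the print step stand in for the real scrape and bot automation, and this is not the actual code behind the site):

    # Hypothetical sketch: build one "Lotus, in the style of ..." prompt per
    # painter, skipping names Midjourney bans outright.
    BANNED = {"Adolf Hitler", "Wang Wei", "Xi Gang"}  # abbreviated; see the list above

    def load_painters(path="painters.txt"):
        # One painter per line: "Name|birth-death|nation" (placeholder format)
        with open(path, encoding="utf-8") as f:
            for line in f:
                name, dates, nation = line.strip().split("|")
                yield name, dates, nation

    def build_prompt(name, dates, nation):
        return f"Lotus, in the style of {name} {dates} [{nation}]"

    for name, dates, nation in load_painters():
        if name in BANNED:
            continue
        # In practice, hand this string to whatever bot or automation submits jobs.
        print(build_prompt(name, dates, nation))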
So if you're an artist and don't want your style to be used in an AI product, all you have to do is change your name to one that includes a variation of a controversial leader's name (Xi, Adolf) or a slightly offensive name (Dick, Gore)?
> So if you're an artist and don't want your style to be used in an AI product, all you have to do is change your name to one that includes a variation of a controversial leader's name (Xi, Adolf) or a slightly offensive name (Dick, Gore)?
Nope, that doesn't stop models from being trained on your art. It makes it somewhat more difficult for people to prompt specifically for your style, but your art still influences output, and there may be other ways (e.g., titles of specific works) to deliberately and specifically evoke it in particular.
ChatGPT made me a python script for automating pasting results, as well as most other tasks related to this project.
I already had a discord bot I wrote by hand before for downloading the images.
I thought of the project in the morning while my kid was getting ready for school and had it running jobs before we were out the door, worked on it a little before I left for work and a little more after work, and it was done before dinner time.
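For reference, the download side can be surprisingly small with discord.py. A rough sketch, not the actual bot (the token, folder, and filename filter are placeholders):

    # Hypothetical sketch: save every image attachment posted by bots (e.g. the
    # Midjourney bot) in channels this client can see. Requires the
    # message-content intent to be enabled for the bot.
    import os
    import discord

    intents = discord.Intents.default()
    intents.message_content = True
    client = discord.Client(intents=intents)

    @client.event
    async def on_message(message):
        if message.author.bot and message.attachments:
            os.makedirs("downloads", exist_ok=True)
            for attachment in message.attachments:
                if attachment.filename.lower().endswith(".png"):
                    await attachment.save(f"downloads/{attachment.filename}")

    client.run("YOUR_BOT_TOKEN")  # placeholder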
I listened in to their weekly chat today and it sounded like they'd be happy if they could just ban all political mimicry. I think the AI images of Trump being arrested (which looked like Midjourney to me) were disappointing.
Maybe that just means banning images that are meant to fool people, and obvious satire would be ok, but they might be ok erring on the side of caution.
(Nothing in the talk was this explicit, but this was my read of the subtext)
It's been noticeably better than Stable Diffusion since v3 at least (I wasn't paying attention before that). It's on v5 now, and I think MJ has continued to get better faster than SD through this time period.
ControlNet for Stable Diffusion may be an exception to this.
There's lots of versions of Stable Diffusion so I've had a hard time knowing what to compare exactly. But from what I've seen none of them come close to Midjourney.
Stable Diffusion does more things though, like in-painting where you can erase part of an image and then have it recreated. I've seen videos of people doing impressive things with in-painting and extensive regeneration of each portion of an image until it's just right. Seems like a ton of work though. Still I've had some fun using it to modify images or extend images.
Midjourney v4 was already better than Stable Diffusion. I also find the new DALL-E (which you can use on Bing) better...
The main difference with Stable Diffusion is that you can fine-tune it with your own dataset. There's img2img and a bunch of other tools. But the base model is really worse than the competitors right now.
Also SD can do porn, which Midjourney forbids for some reason. They're leaving an assload of money on the table and somebody will nab it sooner than later.
The DALL-E API sucks so much right now. I’ve been experimenting with it the past few days and it produces a lot of horrors. I even used the DALL-E prompt book as a guide, but still so many more misses than hits. Even when it gives a non-horrifying image, it’s just decent. 5/10 rating.
I started testing out the official Stable Diffusion API and it already gives you way more control than the DALL-E API and seems to produce less horrifying images that are better quality, but I feel like DALL-E understands the prompts better. 7/10
I would love to try Midjourney but I uninstalled Discord years ago and have no plans to ever reinstall it. So I’ll wait for API access if they ever do it. 0/10 (only for being Discord-only)
Standard DALL-E 2.0 is worse than Stable Diffusion... DALL-E Experimental is available on Bing, which I guess is DALL-E 3... Similar to the approach they took with GPT-4 on Bing, I guess...
From what I understand, SD doesn't handle color space correctly (or at all), hence all the weird saturated blue-magenta-orange-beige gradients in a lot of its example outputs. And why its output often feels more like a bad Photoshop collage than a proper blend. It's probably trained on unmanaged sRGB. In which case the SD model is fundamentally flawed, since doing math in sRGB space is nonsense and causes bias (specifically those saturated gradients are a sign of just that...). Although I don't know for sure, I didn't find any color management code in their scripts when I looked for it, so I'm assuming this is the case.
I'd be happy to be corrected if anyone knows the details on this in SD.
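For context on the "math in sRGB is nonsense" point, here is the kind of bias being described: averaging two channel values with and without the standard sRGB decode gives visibly different midpoints (illustrative only; I don't know what SD's training pipeline actually does):

    # Illustrative: averaging in sRGB vs. linear light gives different results.
    def srgb_to_linear(c):  # c in [0, 1]
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    def linear_to_srgb(c):
        return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055

    a, b = 0.1, 0.9                  # two sRGB-encoded channel values
    naive = (a + b) / 2              # averaged directly in sRGB: 0.5
    correct = linear_to_srgb((srgb_to_linear(a) + srgb_to_linear(b)) / 2)
    print(naive, round(correct, 3))  # the linear-light midpoint comes out noticeably brighter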
The founder gave an interesting interview about this.
In short when they did user testing with a Web UI, people would write a basic prompt "dog", "cat" and then get stuck.
Discord makes it social by default, you can see what people are prompting and get inspiration.
I also found it a bit weird, but now that I've onboarded, it's great having a multi-platform native experience (i.e., even ChatGPT still has no mobile app).
I don't know anyone who's used Midjourney more than a day and stuck around in one of the newbies channels; pretty much everyone slides into the bot's DMs.
The social angle is not what's fueling Midjourney's success. The quality of its output certainly is.
I dunno, even if it only matters for a couple of hours starting out, those hours might easily be critical. There are many technically strong products that fail because they didn't know how to get users to take those first timid steps.
I agree that the social learning component is very useful. Initially, I was taken aback that they force you to use discord but it is great for learning from others.
I created my own Discord "server" (which is not truly a server per se; more like your server account [1]), and then installed the MJ bot. This way I can keep all my creations in a single place, with the added benefit that I have my own channels for things like a compilation of links, a prompt database, separate channels for different types of creations, and so on.
I even invited a couple of family members to join my Discord server, so they can use MJ if they want, without having to share my Discord account (but of course it still counts towards my MJ credits).
All in all, I have been incredibly (and surprisingly) pleased with Discord experience.
I have done this too. How did you get the bot to use your MJ credits? I found that the bot still uses the credits of each person who invokes the command.
I take it back. I tested it and you're right: it uses the caller's account, not the server owner's. So it doesn't share your MJ credits in any way.
There must be a way of creating a Discord bot wrapper which holds your credentials, but on a quick search I wasn't able to find anything.
What a big bummer. That would be the ideal environment.
Discord is all kinds of nightmares. Every time I open it I have to close multiple popovers trying to sell me things, something about "Nitro", stuff about running my own Discord, etc. Then there's the fact that there are red dots on channels I've never been in, and there's no way of working out why I have a red dot there or of viewing just my mentions. I hate it.
Discord is actually a really great experience for power users. I can generate new image sets as fast as I can type because MJ's Discord bot takes care of everything else for me.
Are there SD applications that have UX as good as that?
But then you have to scroll through the entire chat to find your images (unless there’s an easy way to jump to your results without trawling through the whole thing).
I’m guessing you are referring to archive.org style public archiving.
But I’ll mention anyway that for your own generations there is already a web interface to search and bulk-download your images, and a full web interface is in the works. You lose out on the social aspect of prompting with friends, but gain some UX for working solo. Or you can work solo in a DM conversation with the bot.
Yeah, base SD is pretty bad. That said, the new Kandinsky 2.1 model is really good: much better than any of the base SD models, and it's open source. Try it some time.
Bing image creator is also really good and free with a hundred generations per day
There are a lot of Stable Diffusion models, though. I would say the opposite: it's hard for Midjourney to compete when you have 20 different fine-tuned Stable Diffusion models to choose from depending on what you want, along with ControlNet, automatic1111, and ComfyUI.
I agree with the parent comment. The quality of output images from MJ v5 is vastly better than anything I’ve been able to get out of SD regardless of which tuning model I’ve tried. SD’s underlying model on top of which everything is built just isn’t that good.
They seem about the same if you know how to use them, imho. I’ll say one thing: Midjourney images are always way more "basic" in terms of aesthetic. Everything looks like the ArtStation front page, which, while very competent, is a tacky look.
Tried to sign up, but won't bother after they falsely claimed a "security issue with my browser" and then demanded a phone number on that basis.
I hate this hostile, intrusive UX pattern. I verified my email, I never use VPN, and my browser's ad blockers were disabled. Yet they claim "security issue". Discord can shove their dishonesty.
And yes, I tried one of those free number websites, but it didn't work because millions of other people use those same numbers.
Every month or two, I need to log in to Discord for something like MJ. Discord's whole login sequence is more painful than that of any other website ever created. It has multiple borderline-crazy CAPTCHA challenges and still makes you verify by email and/or phone. This is on the same browser and device I logged into a month or two ago.
I don't get anything like this from other services on the same device.
The crazy thing about generative AI models is that they provide a service that's significantly better in both price per output and best possible output.
In addition to being able to create completely novel, high-quality images, they can do so at speeds that no human could ever hope to match. And image models of this quality have only been around for a year; imagine where they'll be in five.
I empathize with all the artists who feel cheated by these models training on all their data, but the sad reality is these AI models are just far too useful to ever go away. The world's standards for art and text have gone up faster than they ever have in world history over the course of the last 10 months.
Honestly I don’t understand the doom and gloom with AI.
It’s going to make humans hella productive.
For artists that means less staring at blank screens and fishing for inspiration.
The future is that anyone will be able to whip up anything without needing that particular skill that is a blocker. Instead they can pull from collective intelligence and piece something new together.
If an artist wants to make games but can’t code they can make games like a boss in the future because they can funnel their productivity into all the things they were never capable of before.
It’s literally infinite possibilities. Like I’m a software engineer. I’m sure these AI systems will be boss tier at a lot of stuff I know.
That’s good. That means I can finally stop doing the things I hate the most when it comes to making software and do the fun part: solve problems. Just like you use a calculator and formulas to solve problems instead of doing every calculation by hand from scratch.
Anyway, just my 2 cents. All this doomer talk is being pushed by people to make people afraid. Honestly, if you’re at the forefront of technology you should be striving to shape the future for the better, where everyone can make hella money.
Imagine how much stuff there will be to sell to everyone, and everyone can channel their true potential and creativity into building cool stuff. You like cars and want to make body mods? Can’t code? Don’t know CAD? No worries, someone has set up a service which helps you easily get those molds delivered to you. And if that’s your thing, you can put that workflow together and make money.
You’re not crazy, it’s just that people are freaking out because there are many folks that do a lot of this for a living. If anyone can code a game, do all the graphics, music, voice acting, script generation etc. then no one is going to pay those people anymore.
So ideally a single guy with ultra-prompts can create a whole AAA game on his own, while all the other people you used to need don't have to do crunch time but can do their own great stuff instead? Of course the prices for stuff also go down rapidly, but it only needs to be enough for a single guy instead of a whole publisher/development-studio chain, so there are several orders of magnitude of wiggle room to go.
Because it's gonna eat people's lunch if it gets going. Many in the elite saw their lunch get eaten by the internet and are now against any further innovation, in stark contrast to the culture of the 90s and 2000s. Thus the AI doomerism and the broader techlash you are seeing today.
Some think entire categories of careers are at risk of being less valuable or eliminated. In a society that didn't treat obsoleted professionals like used condoms, that would be less daunting. We don't live in that society and won't by the time many people have to deal with the ramifications of this.
> The world's standards for art and text have gone up faster than they ever have in world history over the course of the last 10 months
This is a funny sentiment because to me, the exact opposite has occurred. Now that everyone can create work of comparable technical ability, never has it been more clear that taste and other conceptual skills are lacking in those who haven’t spent hours upon hours engaged with and making art.
I’m sure many will decry this as snobbery, but I think this is also obvious in the reliance on various famous artists names in prompts. I expect a similar phenomenon will emerge with AI generated code where things that look superficially impressive are terrible in other dimensions (architecture, efficiency, security, etc)
I agree, but according to some people this is just statistics too. Not sure I agree with this but yeah. Don’t be surprised if you get a response like that :)
Don’t get me wrong, what’s technically been accomplished with the creation of generative AI is amazing. It’s just that it doesn’t make your average person off the street suddenly think with the novelty or acumen of a Max Ernst or John Carmack. But I also doubt that said average person has those aspirations (Max Ernst or John Carmack wouldn’t be exceptional if they did) and also expect that those that do will get buried under the onslaught of pedestrian schlock.
The big problem is the variation. If you want a specific, complex scene it's basically impossible to get exactly what you want in my experience.
I've also found that some words and phrases will overpower the rest of the prompt, so it can be hard to get to specific areas of the latent space.
I think they're really good for mood board kind of work and exploring ideas, but you'll probably still need an artist to create something that is specifically what you want.
If working in Stable Diffusion, with ControlNet, Latent Couple, SegmentAnything, and inpainting, you can get incredibly complex scene composition. It's not just typing in a description and getting what you want in one step, but it's still a significantly easier skillset to learn than what would be needed to create the same level of artwork manually.
One-shot random prompt stuff is better on Midjourney, and I'm still subscribed to it, but I would say I do 99% of my playing around inside Stable Diffusion these days.
I drew literal stick figures in GIMP yesterday and went from a weird lumpy dragon on a castle drawn by a 3rd grader to an epic, terrifying dragon in about ten minutes. The value of both img2img and then ControlNet cannot be overstated for defining what you want in a scene. Literally my eight-year-old can do it, given I help her with the prompts.
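If you're scripting it with the diffusers library rather than a webui, the scribble-to-image step looks roughly like this (a sketch under assumptions: the model IDs are the common public ones, and the input drawing path is a placeholder):

    # Hypothetical sketch: turn a rough scribble into a guided generation with
    # a scribble-conditioned ControlNet via the diffusers library.
    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from PIL import Image

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    scribble = Image.open("stick_figure_dragon.png")  # white lines on black, placeholder
    image = pipe(
        "epic terrifying dragon perched on a castle, dramatic lighting",
        image=scribble,
        num_inference_steps=25,
    ).images[0]
    image.save("dragon.png")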
> The big problem is the variation. If you want a specific, complex scene it's basically impossible to get exactly what you want in my experience.
The UI/UX of a proper workflow is still very janky (even in user-friendly UIs like auto1111), but you can get very specific, complicated scenes with a mixture of ControlNet + inpainting.
> I've also found that some words and phrases will overpower the rest of the prompt, so it can be hard to get to specific areas of the latent space.
I've also had this problem (particularly with multiple colored objects, like "blue eyes, brown hair"). Apparently Cutoff (https://github.com/hnmr293/sd-webui-cutoff) is very good at addressing this "leakage", but I haven't implemented it yet.
Right now, things are really good for mood-board kind of work, but there's already a lot of tech available for enabling a lot of back-and-forth "work" on images to get them into exactly what you're imagining. There aren't a lot of good UIs for it all yet, though; as soon as they get a little more user-friendly, I'd expect another boom in generative AI for artists (this time, with artists being the primary beneficiaries).
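As one concrete example of that back-and-forth, masked regeneration with an inpainting checkpoint in diffusers looks roughly like this (a sketch; the model ID is the common public inpainting checkpoint, and the image/mask paths and prompt are placeholders):

    # Hypothetical sketch: regenerate only the masked region of an image
    # (white = repaint, black = keep) with a Stable Diffusion inpainting model.
    import torch
    from diffusers import StableDiffusionInpaintPipeline
    from PIL import Image

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    init = Image.open("scene.png").convert("RGB")
    mask = Image.open("mask_eyes.png").convert("RGB")  # white where you want changes

    result = pipe(
        prompt="piercing blue eyes, detailed portrait",
        image=init,
        mask_image=mask,
        num_inference_steps=30,
    ).images[0]
    result.save("scene_fixed.png")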
If img2img works in Midjourney the same way it does in Stable Diffusion, probably not. Img2img doesn’t put up “guard rails” to keep everything where you intend it to be, and it will bleed over and make changes to the rest of the image. Civitai.com has some really great models I’ve been using to get some really incredible outputs lately. It generally takes me 1 to 1.5 hours to get something exactly how I want it, aside from things the model just absolutely doesn’t understand (getting a dragon to breathe fire, or trying to make a woman have tusks like an orc; even with an image showing what it looks like, the AI has no concept of what the heck I’m talking about).
Counterpoint: they only generate novel differentiation but won't generate new detailed products.
This basically means you get thousands of different nostalgia-triggering visuals, but it's a visual trick, the same way Marvel movies all follow the same premise and eventually satiate the space.
Partially true. I'm currently working with multiple clients in China where AI-generated art is heavily utilised. But we are acutely aware of the limitations of AI, so humans were hired to clean up the AI art.
Given the cost saving potential of AI art generation, I suspect it'll only be a matter of time before it becomes mainstream in the west.
Instead of having an "us" against "them" mentality, there might be a future where human artists can work alongside AI to push human creativity to a new level.
AI art is also currently cliched in an anti-novelty way. High-level image composition and low-level style are based on the training set and while you can mix and match and tween between different points in the latent space, genuinely novel stuff doesn't come out; you're stuck within a bounded space. They're kaleidoscopes of kitsch. If you want a seven legged demon with three faces all around its head, a monster that doesn't exist and hasn't been drawn in lots of poses in extant art, you can't reach it with prompts.
There's also a "terrain" in the latent space, with valleys of popularity and mountain ranges of novel intersections. Generated images can flip from one valley to the other with a small change in prompt, and it's hard to stay in the intersection. Fine tuning helps alter this terrain - that's mostly what fine tuning does. Rather than teaching genuinely new things, it mostly warps and tweaks what's already there to make it easier to target, at the cost of making everything else a tiny bit harder to get to. Teaching the model new things requires including lots of regularization images to ensure it doesn't forget all the other stuff, which is more expensive than fine tuning or dreambooth.
I like your mental model here. I’ve described it to friends as feeling like beachcombing, looking for the perfect seashell. I have a vague idea of where to go to find nice pictures through prompts but am not totally sure what I’ll find when I get there; a lot of them are weird and broken, but some can be cleaned up, polished, and made to look really nice. I can’t really affect what comes out of the “ocean”, though, like you said.
It certainly should not be an us-against-them mentality, but artists aren't the ones that need to change their tune. Being both a career artist and a career developer, I'm privy to voices from both professions. Frankly, I find many developers' tenor disrespectful or even sadistic: gleefully proclaiming and gloating about the supposed obsolescence of artists like they're talking about their favorite sports team beating a long-time rival.
Firstly, this ignorantly trivializes the value, even commercially, of artistic talent (hint: tool usage is a pretty small part of it). The most important ingredient and biggest time sink in bigger high-level projects is the artistic minds that decide what goes on the screen to begin with... how the images are made is an implementation detail. The images these algorithms pump out are dazzling to amateurs, but they're not close to precise, reliable, or consistent enough to make content for these projects. People who do this stuff at a high level know that this tech will be relegated to supporting tools for the foreseeable future: tools that copy one manually made element into many different contexts, mood-board or storyboard panel generators, photo or compositing filters, etc. Insisting otherwise is like saying GitHub Copilot is imminently about to replace developers, as if coding is the only important thing that developers contribute to software projects. Sure, it will eventually replace a lot of lower-end utility developers because the higher-level developers will be so much more productive, but that's a very different thing. Speaking of that...
Secondly, in the market for higher-volume, lower-artistic-effort commodity commercial art, tool mastery IS the big selling point, and that market will take a giant hit. These are people with mortgages, kids approaching college age, maybe caring for sick relatives or relying on employer-sponsored health care for insulin. And it's not like you're getting laid off and can get another job; your entire category of employment is toast. Someone flamboyantly dancing on your grave in public and smugly telling you to find a new profession is a pretty fucking good reason to get defensive or offended.
Marvel movies however make more money on average than any other kind of movie. If that's where the money is, that's where the capital efforts of artistic creation will tend towards. Whether the art is trite or original is up to the highest brow critics to determine.
Once you arrive at Ant Man 3, the box office lemon has been squeezed. There is far more money in serialized content, which is also where all the creativity goes.
Yes, it's basically true. For true novelty, you need to create a bunch of examples and teach a model (fine tuning). Depending on how novel it is, it may be hard to teach and teaching may damage a model's generality (over training), so you need to include examples of everything else and train at a lower rate, and it'll take longer and cost more. But it's possible.
I have a hunch, based on no insight or rational thought, that the wall for AI in art/creation will be genericness. It will be doomed to uninventiveness.
The speed of hardware compute improvements has been on a constant exponential curve for the past 50 years. Model capability has improved at a consistent linear rate for the last 10. It would be the exception, not the rule, for this progress to slow in the near future.
I've played around with v5 and it generates some insane images. Vast improvement over the early days in Sept 2022 when Stable Diffusion had just come out as well, and Midjourney still had lots of room for improvement.
I've done this and have gotten great results. I made a base prompt that includes all of Midjourney's documentation, and as GPT gives me prompts, I plug them into Midjourney and give each of them a rating. I'd like to know if you've done something similar or if you have different methods.
GPT, LLMs, etc. are designed to output strings that look like real language and follow the rules of grammar and word construction that people use. Image-generating models can be manipulated by using prompts with all kinds of strange combinations of words and letters that would never come out of a language-generating model.
I see. Well, to clarify my point: I use slight misspellings, concatenations, repetitions, specific word orders, or layers of completely different concepts to great effect to pull and nudge the current image generators. This is just an intuitive feedback loop that follows the logic of the images being generated; it's a new way to use language and letters. I can't imagine an LLM can do this with purpose.