> Current -ve prompts:
ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy
If you want a really good picture of an imaginary person it helps if you use "extra limbs, extra legs, extra arms" as negative prompts!
I've always thought all the modern AI tools have an "AI illustration" style, but the realistic images in that tweet are amazing and only barely in the uncanny valley. I could be fooled until I really take a good look. I guess it's the same with the illustrative stuff that looks really good until you notice the shadows are coming from different light sources in different parts of the image, or that there are only three fingers on a hand.
All in all I hate it because the prompts I see are things like "cyberpunk forest by Salvador Dali". You've got a tool that gives you the power of Gandalf and you prompt that?
> All in all I hate it because the prompts I see are things like "cyberpunk forest by Salvador Dali". You've got a tool that gives you the power of Gandalf and you prompt that?
That's one of the better prompts I've seen. Dissimilar but really strong aesthetic styles that a skilled human could mesh pretty well; it produces interesting images and shows up both the strengths (some of the forests are really good, and the ones without trees are pleasantly foresty nevertheless) and the weaknesses (it fails completely on 'cyberpunk' and 'Dali' once you start adding other parameters that influence the visual style) of the model.
Plus I'd be much more likely to end up with a calendar of "cyberpunk forest by Salvador Dali" images on my wall than "Mickey Mouse in a tuxedo with a cigar"
If you mean that, locally, a finger is always adjacent to a finger, but holistically the model doesn't know to stop after four of them and will happily generate eight fingers, then yes.
If you mean locally as in the size of a hand being right while holistically the person is wrong, no.
The overall images "tend" to be right (once you grasp prompting), and elements even appear right at first, but if you focus your attention on those elements, they are often not quite right.
So perhaps it's the definition of local and holistic.
I'd ask (1) what the zipper on her left chest is for, (2) where the necklace for the charm hanging along her centerline is (or the zipper, if it's a zipper pull), and (3) how the geometry / gravity works on the patch on her left chest.
Which is about what you'd expect from a generator that understands patterns, but not meanings.
They miss on things that cannot be, because they don't understand things or rules, only patterns.
They still have an "AI illustration" style. Supposedly photorealistic images tend to look extremely photoshopped, and the humans in them look like they've been 3D rendered (albeit at very high quality). They look like the heavily edited "plastic-like" images on magazine covers.
I'm pretty sure this problem is not hard to fix in the long run, though.
A friend of mine and I (independently) spent a day or so playing around with Stable Diffusion recently. We both came to the conclusion that, as things stand now, creating images in the style of impressionists/surrealists/cubists etc. works best because you're not really expecting realism, anatomical correctness etc.
I was able to come up with someone paddling a canoe in a Turner seascape. The only thing I couldn't get right was a proper canoe paddle and paddling motion but everything else was pretty much perfect.
Most of these tricks don’t work on SD; they’re cargo culting from a different model (NovelAI) whose data genuinely has those keywords in it. SD is trained off the whole internet so those aren’t super common captions.
Yeah – I tested these sorts of "bad image" negative prompts a lot in 1.5 and found they had almost no impact whatsoever. It may be different in 2.0, like the tweet author says, but it is also pretty telling that in that tweet they're using "blurry", "blurred", and "grainy" and are rendering images with heavily blurred backgrounds and obvious film grain.
Specific common keywords like "amputated" may have a positive impact, though. Hard to tell. Doing apples-to-apples comparisons with negative keywords is challenging because even a single extra keyword tends to completely change the image.
One thing that really impressed me about SD, though, is its understanding of symmetry. "Symmetrical composition" is an incredibly powerful phrase: https://imgur.com/a/lioJ8ak
And it does, indeed, extend to anatomy as well – "symmetrical eyes" can help a lot, while "symmetrical arms" renders people with their arms raised or outstretched.
I did some tests on SD 1.5 with certain challenging prompts, such as gymnasts doing a handstand. With no negative prompt, they became amorphous blobs. I'm guessing because gymnasts are often in dynamic poses which are hard for SD to understand.
I decided to add a negative prompt. With a bit of experimentation I realised all the "bad ..." keywords had no effect. However, "blob" actually made most of the deformities go away, and "amputee" did help against partial limbs being generated.
Something that worked even better was replacing "gymnast" with "athletic man"/"athletic woman" in the positive prompt.
Welcome to the latent space where you can add, subtract and operate on words like they are mathematical objects. I suppose people are going to intuitively learn how the latent space works by exercising prompts.
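That word arithmetic can be sketched in a few lines. This is a toy illustration only: real latent spaces (CLIP, word2vec) have hundreds or thousands of dimensions, and these 3-D vectors and the vocabulary are invented purely to show the mechanics.

```python
import numpy as np

# Invented toy vectors; a real model would produce these from training.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.8, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
}

def closest(v, vocab):
    # Nearest word by cosine similarity.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vocab, key=lambda w: cos(vocab[w], v))

# The classic example: king - man + woman lands nearest "queen"
# in this made-up space.
result = closest(vecs["king"] - vecs["man"] + vecs["woman"], vecs)
```

Prompting works the same way in spirit: each phrase picks out a point (or direction) in the embedding space, and combining phrases combines those directions.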
Symmetry is effective for compression, so that makes sense - when messing with NovelAI I actually couldn't get it to generate asymmetrical hairstyles like Lain's.
I think they work, but not in the way that people seem to be using them.
Take the negative prompt "bad hands". The AI doesn't know what bad hands are, that's a human concept. But it does know what hands are, so it hides them. In the example image the hands, arms, and feet are all hidden.
In theory, using the negative prompt "hands" would be just as effective.
I'm not an expert, but I was given the above explanation by someone who knows a lot more than me and it makes sense.
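For what it's worth, the mechanism behind negative prompts in Stable Diffusion is classifier-free guidance: the negative prompt's embedding replaces the empty-string "unconditional" embedding, so each denoising step is pushed away from whatever the negative prompt encodes. A minimal numerical sketch of that guidance formula (the 2-D vectors are made up for illustration; real noise predictions are image-sized tensors):

```python
import numpy as np

def guided_noise(eps_cond, eps_neg, scale=7.5):
    # Classifier-free guidance: start from the negative-prompt
    # prediction and step past the positive-prompt prediction,
    # i.e. away from the negative direction.
    return eps_neg + scale * (eps_cond - eps_neg)

eps_cond = np.array([1.0, 0.0])  # noise predicted given the positive prompt
eps_neg  = np.array([0.0, 1.0])  # noise predicted given e.g. "hands"

out = guided_noise(eps_cond, eps_neg)
# out = [7.5, -6.5]: pushed toward the prompt and away from "hands"
```

On this view, a negative prompt of "hands" steers away from the concept the model actually knows, which is consistent with the explanation above.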
The model can be taught what "bad hands" look like by feeding it some samples of that with the "bad hands" tag. And existing image archives do have pictures tagged things like "bad hands" and "bad anatomy" because actual artists do draw things wrong sometimes.
I imagine that in the long term people will start making archives of AI mistakes and train the AI on those to try to make them less common.
Danbooru and other "booru" websites have those exact tags. Booru tags are something like a flattened, deduplicated object-detector output, but better, since they're manually assigned; that's what NovelAI exploited. cf.[1]
That's not true. The community was using negative prompts before NovelAI came out with their model. And personally, I've seen negative prompts make a big difference, especially when finetuning the outputs.
Keywords? These are embedding models. CLIP puts those phrases into an embedding that points to a location in the space you want to avoid. The "keywords" don't need to be in the image dataset.
The problem with that is you're visualising the space using only points that exist in the image dataset. The language embedding carries information that comes from the language and isn't contained in the images.
It handles "bad", and it handles "anatomy". Even if no single image covers both, that's exactly what language embeddings solve for.
If my office is a vat of liquid spice, sign me up. (A reference to the Guild space navigators of Dune (watch?v=AGqdE1NdMTg), but in this case we'd be the grotesque quasi-immortal 'media' navigators, and instead of navigating through space, we'd be generating prompts to steer the thoughts of the masses and navigate society through a dystopic future where AI art rules every conceivable variation of human creative thought.)
> the dystopic future where AI art rules every conceivable variation of human creative thought
...This part admittedly trips me up: how is a system that mirrors every variation of human creativity dystopic? Human creativity is softly bounded by the environments we interact with & the techniques created/(taught to us) for creating such works, along with the knowledge & philosophies that were also created/(taught to us). Ultimately, human creativity is limited in terms of contextual data. The entire art genre of retrofuturism showcases this intentional lack of data in practice.
Scenario: A HASDMLASKD drive doesn't mean anything at first, until a general guiding focus is given to the concept of this drive. Only when it's been given some context do our imaginations fill in the gaps (e.g. space, or storage, or for submarines).
If there's a system that encompasses/surpasses the area that human creativity exists in, that doesn't warrant the "oh woe to humanity" doomerism that comes in quick reaction to such a system. It just means there's a system that can be systematically learnt from & can help augment current creation capabilities, leading to more works in the future. Such doomerism is only warranted within a nihilistic context of "humanity will never surpass X", when a more appropriate "X will help increase the area that humanity lives within" could be slotted in.
> Human creativity is softly bounded by the environments we interact with & the techniques created/(taught to us) for creating such works
Exactly this. I'm personally awful at drawing. I'm also awful at most design software. I'm really bad at bringing a concept that lives in my head into the world. I've noticed that playing with Stable Diffusion has allowed me to create things I wouldn't have been able to otherwise. It allows me to create art for projects that I otherwise would've made a lame logo for, or used some stock photography. I don't have the money to hire a skilled artist anyway, so this gives me new possibilities.
> Such doomerism is only warranted within a nihilistic context of "humanity will never surpass X", when a more appropriate "X will help increase the area that humanity lives within" could be slotted in.
I think it's a lack of imagination. It's hard to imagine the jobs of the future. We assume work is a fixed sum game, but given new resources we would take different goals and make different plans. It always depends on what is possible, not what was possible.
I don't like the inbuilt censorship going on in this version. People should be able to opt-in or opt-out of that. Sometimes you want a little bit of the old ultraviolence...
I think it's smart. Best to keep it out of the hands of freaks for now, otherwise people will start thinking of image generators as porn generators.
First they should sort out the legal question of training the AI on copyrighted material and propose use cases that the general public will find value in. Censorship can be dealt with later.
When I first tried inpainting, I used an image of this old poster for a game called "Myth II: Soulblighter" to see if I could expand it into a larger version.
The request was blocked because "violence was detected". It was a hand-drawn image of a video game boss attacking others with a scythe (it looked like this: https://www.mobygames.com/game/myth-ii-soulblighter/cover-ar...)... there wasn't even any gore, he was mid-swing. I'm a 50 year old guy, I'm not a 10 year old boy, and I don't need to have "violence" (seriously? a hand drawn painting of a video game scene??) censored from me. This nanny-state upstream censorship is BS... I'm just a nostalgic old nerd who wanted a UWQHD version of this image, for sentimental reasons, and this misguided rule stopped my joy.
There is absolutely no evidence that plain nudity or hand-drawn violence (which has pervaded comics, video games and movies for decades) has a detrimental effect on human psychology. And yet... the Puritan influence still exists!
At least, if I ran SD 1.5 locally, I could render whatever I wanted to again, but now I can no longer do even that if I use the 2.0 model. This is dumb. Apparently, I'm a "freak" for thinking this.
Whatever content is avoided in AI training is going to end in the bin. It won't be referenced, the ideas won't propagate, it will get much less attention. I bet artists will start tracking how many times their names have been conjured up in prompts to demonstrate their impact.
This is why, although I can see my work benefiting from AI tremendously a bit further down the line, I don’t feel like it’s a good use of my time to learn to write prompts right now.
Things are changing so fast it feels better to just wait until we're no longer in this phase of having to relearn the tool constantly. In other tech, getting in early is important to keep up; with AI generators, I feel the promise is that as the tech gets better, you'll need to know less and less to use it.
The user perception of an OS is mostly third-party software availability and driver support. I've used all three, and as far as the "operating system" itself goes, Linux is the only one that feels good to use.
But I suppose the analogy isn't entirely inaccurate either. MJ is entirely owned by someone else; it can be taken away at any time, including anyone's access to it. SD, meanwhile, is flexible, provides a stable foundation to build on, and you can extend it any way you want...
I couldn't have built a gRPC-based wrapper around MJ, set up a bot that listens for prompts sent through Telegram, and posts back images. Hm... or I suppose one could do the same with the MJ API... so, bad example :D
Does anyone know how much Stable Diffusion 2.0 cost to train? Also, is their model open source? Can I take the code and train it myself, assuming I had the money and resources?
"A lot". There were some numbers on how many A100s were used to train SD 1.5 and what the training data size was, discussed here:
https://news.ycombinator.com/item?id=33727467
People in the SD subreddit have been fine-tuning the SD models, so depending on what you want to do, it should be doable.
https://twitter.com/emostaque/status/1596864150134984705