Back to the list

SD3 Goes Head-to-Head With SDXL, MidJourney, and Ideogram—Which AI Image Maker Is Best?

decrypt.co 13 June 2024 23:44, UTC

Stability AI's latest big release, SD3, has generated considerable buzz in the AI community. With promises of enhanced prompt adherence, efficiency, accuracy, and overall quality, SD3 went live yesterday hoping to set a new benchmark in image generation. We quickly set out to see just how well SD3 compares against its predecessor, SDXL, as well as against other leading models, MidJourney and Ideogram.

Our head-to-head comparison used the same prompts for each model to ensure a fair fight, even though it might seem unconventional due to the intrinsic differences among the models. The evaluation included a variety of scenarios, testing the models' ability to handle detailed artistic prompts and everyday scenarios alike. With the same seed used for SD3 and SDXL and standardized negative prompts for Stable Diffusion generations, the playing field was leveled.

Here are our results across a variety of image types. All the images are presented in the same order: SD3 (top left), SDXL (top right), MidJourney (bottom left) and Ideogram (bottom right). We'll share our takes on each, but you can also judge for yourself.

Illustrations

Prompt: Hand-drawn illustration of a giant spider chasing a woman in the jungle, extremely scary, anguish, dark and creepy scenery, horror, hints of analog photography influence, sketch.

SD3 and SDXL both adopted a black-and-white style reminiscent of old comics. SD3's output, however, was significantly more detailed, capturing intricate elements such as the spider’s legs and the woman's distressed expression. MidJourney took a more artful approach, producing a vibrant illustration that—while visually appealing—deviated from the prompt's "hand-drawn" and "sketch" directives. Ideogram's interpretation mirrored SD3's stylistic approach but added a bluish hue that was not specified in the prompt and was not a sketch.

In terms of accuracy, SD3 and Ideogram correctly depicted the woman running away from the spider, aligning closely with the prompt's narrative. Conversely, SDXL and MidJourney inaccurately showed the woman approaching the spider, which contradicted the prompt. Given the prompt's specification of a sketch, SD3's black-and-white, highly detailed illustration was more accurate than Ideogram’s colored composition, which lacked facial detail.

Winner: SD3.

Non-standard generations

Prompt: A lizard wearing a suit.

SD3 delivered a precise depiction of a lizard in a suit, closely adhering to the prompt. The lizard retained its natural appearance, with scales and reptilian features, seamlessly integrated into a well-tailored suit. In contrast, SDXL, MidJourney, and Ideogram anthropomorphized the lizard, creating humanoid lizards instead.

SDXL and MidJourney’s versions were highly detailed and realistic, resembling photographs. MidJourney's output had a lifelike texture and depth, almost resembling analog photography, but didn’t generate the suit. Ideogram’s portrait was heavily edited, akin to official photos taken by politicians, with a polished and formal look. Despite the high quality of these outputs, SD3 excelled in realism, prompt adherence, and accuracy, making its result the most believable.

Winner: SD3.

The elephant in the room: the “L” word

Prompt: A beautiful woman lying on the grass.

Something clearly went wrong with SD3.

This prompt made the cut because one of the first things the AI art community noted was SD3’s inability to generate pictures of people lying on grass. In fact, this has quickly turned into a meme.

SDXL presented a waist-up photo of the woman, focusing on her upper body and face. MidJourney and Ideogram opted for close-up images. MidJourney’s result was the most realistic, showcasing fine details in the woman’s features and the grass around her. However, it overemphasized the bokeh effect, blurring not only the background but also parts of the woman’s body. Ideogram avoided the excessive bokeh issue, maintaining clarity in the woman’s body and the grass.

As for SD3, it's an inexplicable fail. In fact, SD3 seems to struggle to generating images of humans “lying” not only on grass, but on anything. We tried photos, illustrations, renders. We tried generating men, women, elders, children, and anything resembling a person. The “lying” pose turns them all into colossal monstrosities.

Winner: With SD3 tossed out, this one is a tie between MidJourney and Ideogram.

Artistic styles

Prompt: A man and a woman having dinner in a futuristic restaurant, illustration, post-impressionism, impasto.

This test evaluated the models' ability to reproduce specific artistic movements. SD3 excelled, generating impasto strokes and capturing the essence of post-impressionism. The texture and layering of the paint in SD3’s output were evident, showcasing a deep understanding of the style.

SDXL was a close second, successfully emulating the post-impressionism style but lacking the pronounced impasto technique. MidJourney and Ideogram did not demonstrate a clear comprehension of the artistic styles, producing generic illustrations that did not align with the prompt’s specifications.

Winner: SD3.

Specific artists and their styles

Prompt: A man and a woman having dinner in a futuristic restaurant, illustration in the style of Vincent Van Gogh.

SD3 demonstrated a strong ability to replicate Van Gogh’s style, incorporating his distinctive brushstrokes and color palette throughout, and notably with the depiction of the couple. The composition also accurately depicted a futuristic restaurant. SDXL followed closely, blending realistic comic-style characters with a Van Gogh-inspired environment.

MidJourney’s output was less coherent, failing to depict the restaurant and lacking the requested artistic style. The couple appeared to be dining in water, which deviated from the prompt. Ideogram produced a straightforward photo of a man and a woman in a restaurant, without any attempt to emulate Van Gogh’s style.

Winner: SD3.

Photorealism

Prompt: Professional photo, close-up portrait photo of a Caucasian man, wearing a black sweater, serious face, dramatic lighting, nature, gloomy, cloudy weather, bokeh.

SD3 effectively captured the serious, gloomy expression and black sweater attire with dramatic lighting and a shallow depth of field, creating a moody, professional look. The composition included a gloomy, natural environment, aligning well with the prompt.

SDXL’s output followed the traditional AI-generated portrait style, with an overcast sky and foliage in the blurred background. However, the face appeared heavily edited, lacking realistic imperfections. MidJourney’s version featured a warm color palette and an urban background, deviating from the prompt's nature aspect.

Ideogram’s composition met all criteria, delivering a close-up framing, black sweater, serious expression, gloomy outdoor lighting, and a hint of bokeh in the background. It was also the most realistic photo among the models.

Winner: Ideogram.

Text Generation

Prompt: A woman posing in front of a wall in a futuristic city with a sign saying "Emerge by Decrypt."

Text generation proved challenging for all models. None of the models successfully rendered the text “Emerge by Decrypt” accurately. SDXL provided the most futuristic cityscape but failed to include all elements specified in the prompt. SD3 managed to generate the wall, sign, and city—albeit with text inaccuracies.

MidJourney was the most accurate one, producing the sign, the futuristic atmosphere of the city and the wall. Ideogram generated the wall and city but omitted the sign. Despite these issues, SD3’s ability to incorporate all key elements of the composition, even with imperfect text, made it the winner in this scenario.

Winner: MidJourney—but this was a lucky generation, as Ideogram tends to be more consistent at generating text in images overall.

Conclusion

SD3 demonstrates significant improvements over its predecessor SDXL and competitive performance against MidJourney and Ideogram in a variety of scenarios. SD3 excels in prompt adherence, as promised, as well as detail and artistic style reproduction. SD3 has proven its potential as a robust base model.

However, its heavy censorship and perplexing limitations in generating people in certain positions suggest it may be best used in conjunction with other tools.

For example, users may want to generate their images with SD 1.5, SDXL, or Pixart, and then encode those generations and send them to a de-noise sampler with SD3. This would offload the image creation process to SD3 but would use a previous generation as a reference instead of generating everything from scratch. This makes even more sense currently, as there are no custom models or even Controlnets or LoRAs to give users more options to influence the model.

In its current state, SD3 is better than SDXL for a lot of use cases—but not enough to replace it.

Edited by Ryan Ozawa.

decrypt.co