The Technology Behind AI Image Generators

Over the past few years, artificial intelligence (AI) has made remarkable progress, and among its newest products are AI image generators: tools that convert input text into images. Many text-to-image tools exist, but the most prominent are DALL-E 2, Stable Diffusion, and Midjourney.

DALL-E 2 was developed by OpenAI, the same lab behind ChatGPT, and generates images from a paragraph of text description. Its GPT-3-based transformer model, trained with more than 10 billion parameters, interprets natural-language input and generates corresponding images.

DALL-E 2 consists of two main parts: one that converts user input into a representation of an image (called the prior), and one that converts this representation into an actual photo (called the decoder).

The text and images it uses are embedded by another network called CLIP (Contrastive Language-Image Pre-training), also developed by OpenAI. CLIP is a neural network that returns the best caption for an input image. It does the opposite of what DALL-E 2 does: it maps images to text, while DALL-E 2 maps text to images. CLIP is introduced to learn the connection between the visual and textual representations of objects.
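
To make this concrete, here is a minimal sketch of CLIP-style image-text matching using the Hugging Face transformers library and OpenAI's public ViT-L/14 checkpoint; the image path and candidate captions are illustrative placeholders.

```python
# A minimal sketch of CLIP image-text matching with Hugging Face transformers,
# using OpenAI's public ViT-L/14 checkpoint (the same text encoder family
# Stable Diffusion uses).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("photo.jpg")  # any local image
captions = ["a photo of a cat", "a photo of a dog", "a city skyline at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the similarity between the image and each caption;
# the highest-scoring caption is CLIP's "best caption" for the image.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))
```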

DALL-E 2's job is to train two models. The first is the prior, which accepts a text caption and creates a CLIP image embedding. The second is the decoder, which accepts a CLIP image embedding and generates an image. Once the models are trained, inference proceeds as follows (a runnable sketch of the full pipeline appears after the list):

  • The input text is converted into a CLIP text embedding by a neural network.

  • Principal component analysis (PCA) is used to reduce the dimensionality of the embedding.

  • The prior creates an image embedding from the text embedding.

  • In the decoder step, a diffusion model converts the image embedding into an image.

  • The image is upsampled from 64×64 to 256×256, and finally to 1024×1024, using convolutional neural networks.
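
Below is a schematic, runnable sketch of that pipeline. OpenAI has not released these models, so the stub functions here are hypothetical placeholders that only return tensors of plausible shapes; none of the names correspond to a real OpenAI API. The point is to make the data flow from the list above visible.

```python
# A schematic sketch of DALL-E 2's two-stage inference pipeline.
# All stubs below are hypothetical stand-ins for unreleased models.
import torch

def encode_text(prompt: str) -> torch.Tensor:
    return torch.randn(1, 768)         # stand-in for a CLIP text embedding

def prior(text_emb: torch.Tensor) -> torch.Tensor:
    return torch.randn(1, 768)         # stand-in for the learned prior

def decoder(image_emb: torch.Tensor) -> torch.Tensor:
    return torch.randn(1, 3, 64, 64)   # stand-in for the diffusion decoder

def upsample(image: torch.Tensor, size: int) -> torch.Tensor:
    # stand-in for the convolutional upsampler networks
    return torch.nn.functional.interpolate(image, size=(size, size))

def generate_image(prompt: str) -> torch.Tensor:
    text_emb = encode_text(prompt)       # 1. text -> CLIP text embedding
    image_emb = prior(text_emb)          # 2. prior -> CLIP image embedding
    image_64 = decoder(image_emb)        # 3. diffusion decoder -> 64x64 image
    image_256 = upsample(image_64, 256)  # 4. first upsampling stage
    return upsample(image_256, 1024)     # 5. final 1024x1024 output

print(generate_image("an astronaut riding a horse").shape)
```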

Stable Diffusion is a text-to-image model that uses the CLIP ViT-L/14 text encoder to condition the model on text prompts. At runtime it unfolds image generation as a "diffusion" process: starting from pure noise, it gradually refines the image until no noise remains, moving ever closer to the provided text description.
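
In practice this whole loop is a few lines with the open-source diffusers library. The sketch below assumes the widely used Stable Diffusion v1.5 checkpoint (hosted as runwayml/stable-diffusion-v1-5 at the time of writing) and a CUDA-capable GPU; the prompt is illustrative.

```python
# A minimal sketch of running Stable Diffusion through the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# num_inference_steps controls how many denoising iterations run; more steps
# mean a more gradual refinement from pure noise toward the text description.
image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    num_inference_steps=50,
).images[0]
image.save("lighthouse.png")
```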

Stable Diffusion is based on the Latent Diffusion Model (LDM), a state-of-the-art text-to-image synthesis technique. Before looking at how LDMs work, let's review what a diffusion model is and why LDMs are needed.

A diffusion model (DM) is a generative model that takes a sample of data (such as an image) and gradually adds noise over time until the data is unrecognizable. The model then tries to return the image to its original form, and in the process it learns how to generate pictures or other data.
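
The forward (noising) half of this process has a simple closed form. The sketch below uses a conventional linear noise schedule (the schedule values are common defaults, not taken from Stable Diffusion specifically) to show how a clean image is progressively destroyed into Gaussian noise; training teaches a network to reverse these steps.

```python
# A minimal sketch of the forward noising process a diffusion model inverts:
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def add_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample the noised image x_t given a clean image x0 at timestep t."""
    noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * noise

x0 = torch.rand(1, 3, 64, 64)  # a stand-in "image"
print(add_noise(x0, 10).std(), add_noise(x0, 999).std())
# At t = 999 the sample is essentially pure Gaussian noise; the model learns
# to step backward from that noise toward a clean image.
```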

The problem with DMs is that powerful ones consume enormous GPU resources, and inference is expensive because of its sequential evaluations. To train DMs on limited computing resources without sacrificing quality or flexibility, Stable Diffusion runs the DM in the latent space of a powerful pretrained autoencoder.
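
Here is a minimal sketch of that latent-space idea, assuming the VAE weights bundled with the Stable Diffusion v1.5 diffusers checkpoint: the pretrained autoencoder compresses a 512×512 image into a 4×64×64 latent, roughly 48× fewer values, and diffusion then runs in that compressed space instead of in pixel space.

```python
# A minimal sketch of encoding an image into Stable Diffusion's latent space
# with the pretrained VAE shipped alongside the v1.5 checkpoint.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
)

image = torch.rand(1, 3, 512, 512) * 2 - 1  # stand-in image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()

# 4 x 64 x 64 = 16,384 values versus 786,432 pixels: ~48x compression,
# which is why diffusion in latent space is so much cheaper.
print(latents.shape)  # torch.Size([1, 4, 64, 64])
```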

Training the diffusion model in this latent space achieves a near-optimal balance between reducing complexity and preserving data details, and significantly improves visual fidelity. Cross-attention layers are introduced into the model architecture, which turns the diffusion model into a powerful and flexible generator and enables convolution-based high-resolution image generation.
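
A minimal, self-contained sketch of such a cross-attention layer follows. The dimensions are illustrative, loosely following Stable Diffusion's U-Net rather than reproducing it exactly: queries come from the image (latent) features, while keys and values come from the text encoder's output, so every spatial position can attend to the prompt.

```python
# A minimal single-head cross-attention layer, as used in latent diffusion to
# inject text conditioning into the image pathway.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim_img: int, dim_txt: int, dim_head: int = 64):
        super().__init__()
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(dim_img, dim_head, bias=False)  # from image features
        self.to_k = nn.Linear(dim_txt, dim_head, bias=False)  # from text tokens
        self.to_v = nn.Linear(dim_txt, dim_head, bias=False)
        self.to_out = nn.Linear(dim_head, dim_img)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        q = self.to_q(img_feats)               # (B, N_pixels, d)
        k = self.to_k(txt_feats)               # (B, N_tokens, d)
        v = self.to_v(txt_feats)
        attn = (q @ k.transpose(-1, -2) * self.scale).softmax(dim=-1)
        return self.to_out(attn @ v)           # text-conditioned image features

layer = CrossAttention(dim_img=320, dim_txt=768)
img = torch.randn(1, 64 * 64, 320)  # flattened latent feature map
txt = torch.randn(1, 77, 768)       # CLIP text-encoder output
print(layer(img, txt).shape)        # torch.Size([1, 4096, 320])
```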

Midjourney is also an AI-driven tool that generates images from user prompts. Midjourney excels at adapting real art styles and can create images with any combination of effects the user wants. It is particularly good at environments, especially fantasy and science-fiction scenes that resemble game art.

DALL-E 2 was trained on millions of images, and its output tends to be more polished, making it well suited to enterprise use. When an image contains more than two characters, DALL-E 2 produces much better results than Midjourney or Stable Diffusion.

Midjourney is a tool famous for its artistic style. Midjourney uses a Discord bot to send requests to and receive results from its AI servers, and almost everything happens on Discord. The resulting images rarely look like photos; they feel more like paintings.

Stable Diffusion is an open-source model that anyone can use. It has a good grasp of contemporary art imagery and can produce artwork full of detail, although it often needs detailed, carefully crafted prompts. Stable Diffusion is best suited to complex, creative illustration, but it has some shortcomings when creating more generic images.