After completing the work on ELLA (Equip Diffusion Models with LLM for Enhanced Semantic Alignment), my objective shifted towards a lightweight, cost-effective way of turning the Stable Diffusion series of models into image generation models conditioned on cross-modal sequences of text and images. I explored various approaches for integrating text and image information from the field of Multimodal Large Language Models (MLLMs), and ultimately developed the first version of my solution, which I have named EMMA (Efficient Multi Modal Adapter).
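To give a concrete picture of what "conditioned on cross-modal sequences of text and images" means, the sketch below shows one generic way of building such a condition: project text tokens and image tokens into a shared dimension and concatenate them into a single sequence for the UNet's cross-attention. This is only an illustrative mental model under my own assumptions (the module names, dimensions, and simple concatenation are placeholders), not EMMA's actual connector design, which this post does not describe.

```python
# Illustrative only: NOT the actual EMMA architecture, just a generic way to
# condition a diffusion UNet on a joint sequence of text and image tokens.
import torch
import torch.nn as nn

class CrossModalCondition(nn.Module):
    def __init__(self, text_dim=768, image_dim=1024, cond_dim=768):
        super().__init__()
        # Project both modalities into the dimension the UNet cross-attends to.
        self.text_proj = nn.Linear(text_dim, cond_dim)
        self.image_proj = nn.Linear(image_dim, cond_dim)

    def forward(self, text_tokens, image_tokens):
        # text_tokens:  (batch, n_text, text_dim),   e.g. from a text/LLM encoder
        # image_tokens: (batch, n_image, image_dim), e.g. from a vision encoder
        cond = torch.cat(
            [self.text_proj(text_tokens), self.image_proj(image_tokens)], dim=1
        )
        # The joint sequence replaces the usual text-only condition:
        #   unet(noisy_latents, t, encoder_hidden_states=cond)
        return cond
```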
This blog post merely collects the results of EMMA; it is not intended as a formal report.
EMMA Results
The results presented here are all from EMMA-SD 1.5, the model trained on top of SD 1.5. I have not yet trained EMMA-SDXL; I am currently borrowing machines to do so.
Given a character image, generate a short story featuring that character as the protagonist:
(The images were cropped from the ComfyUI preview node; the four images correspond to four random seeds, with no cherry-picking.)
EMMA does not require a high-quality image of the main character; even the subject of a doodle will suffice.
The text description for EMMA can be progressive: as more detailed descriptions are added incrementally on top of a basic one, EMMA still follows the prompt effectively, with minimal impact on unrelated elements.
EMMA does not simply paste the character image into a specific region of the generated image; instead, it integrates text and image information at a deep level. For instance, when the textual and visual information conflict, EMMA tends to follow the textual description while extracting the remaining necessary information from the image. This allows for a certain level of editing capability, as demonstrated in the following example:
Changing the blue skirt in the character image to a red skirt:
The capabilities of EMMA, such as character consistency and editability, can be directly inherited by various community models, even video generation models like AnimateDiff.
From top to bottom: results with AnimateDiff, a realistic-style model, and an anime-style model; from left to right: the effect of editing the skirt color.
What kind of data does EMMA use?
The EMMA-person model actually undergoes two stages of training. In the first stage, it uses a dataset of over 30 million image-text pairs, the same dataset used to train ELLA. During this stage, I directly employ the embeddings of randomly cropped image regions as the visual conditions. Even though these randomly cropped regions are derived from the original images, they do not introduce any significant bias. Consequently, the EMMA model does not learn to simply decode a region's information back into pixels and paste it onto the original image; instead, it integrates the image and text information in a deep and sophisticated manner.
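For readers who want a concrete picture of "using the embeddings of randomly cropped image regions as the visual conditions", here is a minimal sketch of that general recipe. The choice of vision encoder (CLIP ViT-L/14), the crop-size range, and the use of patch-level hidden states are my own assumptions for illustration, not confirmed details of EMMA's training pipeline.

```python
# Minimal sketch of the recipe described above: embed a random crop of the
# training image and use it as the visual condition. The encoder choice and
# crop parameters below are assumptions, not EMMA's confirmed settings.
import torch
from torchvision import transforms
from transformers import CLIPImageProcessor, CLIPVisionModel

vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
random_crop = transforms.RandomResizedCrop(224, scale=(0.2, 0.8))

@torch.no_grad()
def visual_condition(pil_image):
    """Return a sequence of image tokens from a random crop of the image."""
    crop = random_crop(pil_image)
    pixels = processor(images=crop, return_tensors="pt").pixel_values
    # Patch-level hidden states serve as the image-token sequence that gets
    # fused with the text tokens during training.
    return vision_encoder(pixels).last_hidden_state  # (1, n_patches, hidden)
```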
The results I showcased on Twitter are from this first-stage model.
The young girl adorned with a pearl necklace was not simply copy-pasted onto the generated image; instead, EMMA responded to the prompt and deeply integrated the visual information with the text.
Subsequently, I gathered data featuring the same character in various actions and contexts and briefly fine-tuned the first-stage model on it. The amount of data used in this stage was not large, as the first-stage model had already undergone extensive training.
Where can you get EMMA?
I learned a hard lesson with ELLA when I prematurely promised something that was beyond my control. This time, I make no promises regarding EMMA's open-source status. However, given that EMMA is highly compatible with the existing SD ecosystem, I believe that if the open-source community takes an interest, it could develop applications far beyond my imagination once EMMA is open-sourced.
If you are very interested in EMMA, you are welcome to contact me through the contact information listed on the About - wrong.wang page, or via X (@wangbudui).