The i2L (Image-to-LoRA) model is built around a bold idea: it takes an image as input and directly outputs a LoRA model trained on that image.
We are open-sourcing four models:

- Qwen-Image-i2L-Style
- Qwen-Image-i2L-Coarse
- Qwen-Image-i2L-Fine
- Qwen-Image-i2L-Bias
These models still have many limitations, with significant room for improvement in generalization and detail preservation; we release them in the hope of inspiring more innovative research.
The Qwen-Image-i2L-Style model can quickly produce a style LoRA from just a few images that share a consistent style. The results below were all generated with random seed 0.
Input Images:
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|---|
Generated Images:
| a cat | a dog | a girl |
|---|---|---|
| ![]() | ![]() | ![]() |
Input Images:
| ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|
Generated Images:
| a cat | a dog | a girl |
|---|---|---|
| ![]() | ![]() | ![]() |
Input Images:
| ![]() | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|
Generated Images:
| a cat | a dog | a girl |
|---|---|---|
| ![]() | ![]() | ![]() |
Input Images:
| ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|
Generated Images:
| a cat | a dog | a girl |
|---|---|---|
| ![]() | ![]() | ![]() |
Combining Qwen-Image-i2L-Coarse, Qwen-Image-i2L-Fine, and Qwen-Image-i2L-Bias yields LoRA weights that preserve image content and fine details. These weights can serve as an initialization for LoRA training to accelerate convergence, as shown in the comparisons below.
Training Data:
| ![]() | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|
Sample Generation During Training:
| | Steps: 100 | Steps: 200 | Steps: 300 | Steps: 400 | Steps: 500 |
|---|---|---|---|---|---|
| Random Init | ![]() | ![]() | ![]() | ![]() | ![]() |
| Image to LoRA Init | ![]() | ![]() | ![]() | ![]() | ![]() |
Training Data:
| ![]() | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|
Sample Generation During Training:
| | Steps: 100 | Steps: 200 | Steps: 300 | Steps: 400 | Steps: 500 |
|---|---|---|---|---|---|
| Random Init | ![]() | ![]() | ![]() | ![]() | ![]() |
| Image to LoRA Init | ![]() | ![]() | ![]() | ![]() | ![]() |
Training Data:
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|---|
Sample Generation During Training:
| | Steps: 100 | Steps: 200 | Steps: 300 | Steps: 400 | Steps: 500 |
|---|---|---|---|---|---|
| Random Init | ![]() | ![]() | ![]() | ![]() | ![]() |
| Image to LoRA Init | ![]() | ![]() | ![]() | ![]() | ![]() |
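
The examples below show how to generate these LoRAs with DiffSynth-Studio and then use them for image generation.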
Install DiffSynth-Studio:
```shell
git clone https://github.com/modelscope/DiffSynth-Studio.git
cd DiffSynth-Studio
pip install -e .
```
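
The following script loads the SigLIP2 and DINOv3 image encoders together with Qwen-Image-i2L-Style, encodes a set of example style images (downloaded from the model repository), and saves the resulting LoRA to `model_style.safetensors`: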
```python
from diffsynth.pipelines.qwen_image import (
    QwenImagePipeline, ModelConfig,
    QwenImageUnit_Image2LoRAEncode, QwenImageUnit_Image2LoRADecode
)
from modelscope import snapshot_download
from safetensors.torch import save_file
import torch
from PIL import Image

# VRAM management config: keep idle weights offloaded to disk and run computation in bfloat16 on the GPU.
vram_config_disk_offload = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": "disk",
    "onload_device": "disk",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}

# Load models
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="DiffSynth-Studio/General-Image-Encoders", origin_file_pattern="SigLIP2-G384/model.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/General-Image-Encoders", origin_file_pattern="DINOv3-7B/model.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-i2L", origin_file_pattern="Qwen-Image-i2L-Style.safetensors", **vram_config_disk_offload),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,  # total GPU memory in GB, minus 0.5 GB headroom
)

# Load images
snapshot_download(
    model_id="DiffSynth-Studio/Qwen-Image-i2L",
    allow_file_pattern="assets/style/1/*",
    local_dir="data/examples"
)
images = [
    Image.open("data/examples/assets/style/1/0.jpg"),
    Image.open("data/examples/assets/style/1/1.jpg"),
    Image.open("data/examples/assets/style/1/2.jpg"),
    Image.open("data/examples/assets/style/1/3.jpg"),
    Image.open("data/examples/assets/style/1/4.jpg"),
]

# Model inference: encode the reference images, then decode the embeddings into LoRA weights
with torch.no_grad():
    embs = QwenImageUnit_Image2LoRAEncode().process(pipe, image2lora_images=images)
    lora = QwenImageUnit_Image2LoRADecode().process(pipe, **embs)["lora"]
save_file(lora, "model_style.safetensors")
```
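
To generate a content-preserving LoRA, run the Coarse and Fine models on the input images and then merge in the Bias weights: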
```python
from diffsynth.pipelines.qwen_image import (
    QwenImagePipeline, ModelConfig,
    QwenImageUnit_Image2LoRAEncode, QwenImageUnit_Image2LoRADecode
)
from diffsynth.utils.lora import merge_lora
from diffsynth import load_state_dict
from modelscope import snapshot_download
from safetensors.torch import save_file
import torch
from PIL import Image

# VRAM management config: keep idle weights offloaded to disk and run computation in bfloat16 on the GPU.
vram_config_disk_offload = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": "disk",
    "onload_device": "disk",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}

# Load models
pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/General-Image-Encoders", origin_file_pattern="SigLIP2-G384/model.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/General-Image-Encoders", origin_file_pattern="DINOv3-7B/model.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-i2L", origin_file_pattern="Qwen-Image-i2L-Coarse.safetensors", **vram_config_disk_offload),
        ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-i2L", origin_file_pattern="Qwen-Image-i2L-Fine.safetensors", **vram_config_disk_offload),
    ],
    processor_config=ModelConfig(model_id="Qwen/Qwen-Image-Edit", origin_file_pattern="processor/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,  # total GPU memory in GB, minus 0.5 GB headroom
)

# Load images
snapshot_download(
    model_id="DiffSynth-Studio/Qwen-Image-i2L",
    allow_file_pattern="assets/lora/3/*",
    local_dir="data/examples"
)
images = [
    Image.open("data/examples/assets/lora/3/0.jpg"),
    Image.open("data/examples/assets/lora/3/1.jpg"),
    Image.open("data/examples/assets/lora/3/2.jpg"),
    Image.open("data/examples/assets/lora/3/3.jpg"),
    Image.open("data/examples/assets/lora/3/4.jpg"),
    Image.open("data/examples/assets/lora/3/5.jpg"),
]

# Model inference: encode the images and decode them into coarse/fine LoRA weights
with torch.no_grad():
    embs = QwenImageUnit_Image2LoRAEncode().process(pipe, image2lora_images=images)
    lora = QwenImageUnit_Image2LoRADecode().process(pipe, **embs)["lora"]

# Merge in the Bias LoRA and save the combined weights
lora_bias = ModelConfig(model_id="DiffSynth-Studio/Qwen-Image-i2L", origin_file_pattern="Qwen-Image-i2L-Bias.safetensors")
lora_bias.download_if_necessary()
lora_bias = load_state_dict(lora_bias.path, torch_dtype=torch.bfloat16, device="cuda")
lora = merge_lora([lora, lora_bias])
save_file(lora, "model_coarse_fine_bias.safetensors")
```
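
The LoRA files produced above are ordinary LoRA state dicts; as described earlier, the merged coarse/fine/bias LoRA can also serve as an initialization for LoRA training. To generate images with a produced LoRA, load it into the standard Qwen-Image pipeline: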
```python
from diffsynth.pipelines.qwen_image import QwenImagePipeline, ModelConfig
import torch

# VRAM management config: offload idle weights to disk/CPU and run computation in bfloat16 on the GPU.
vram_config = {
    "offload_dtype": "disk",
    "offload_device": "disk",
    "onload_dtype": torch.bfloat16,
    "onload_device": "cpu",
    "preparing_dtype": torch.bfloat16,
    "preparing_device": "cuda",
    "computation_dtype": torch.bfloat16,
    "computation_device": "cuda",
}

pipe = QwenImagePipeline.from_pretrained(
    torch_dtype=torch.bfloat16,
    device="cuda",
    model_configs=[
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="transformer/diffusion_pytorch_model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="text_encoder/model*.safetensors", **vram_config),
        ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="vae/diffusion_pytorch_model.safetensors", **vram_config),
    ],
    tokenizer_config=ModelConfig(model_id="Qwen/Qwen-Image", origin_file_pattern="tokenizer/"),
    vram_limit=torch.cuda.mem_get_info("cuda")[1] / (1024 ** 3) - 0.5,  # total GPU memory in GB, minus 0.5 GB headroom
)

# Apply the generated style LoRA to the DiT, then generate an image
pipe.load_lora(pipe.dit, "model_style.safetensors")
image = pipe("a cat", seed=0, height=1024, width=1024, num_inference_steps=50)
image.save("image.jpg")
```
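
The content-preserving LoRA from the second example can presumably be loaded the same way; in a fresh pipeline, replace the `load_lora` call above with the merged file. The sketch below is an assumption rather than part of the original examples, and the prompt and output filename are only illustrative:

```python
# Hedged variant (not from the original example): load the merged
# coarse/fine/bias LoRA instead of the style LoRA into the DiT.
pipe.load_lora(pipe.dit, "model_coarse_fine_bias.safetensors")
image = pipe("a dog", seed=0, height=1024, width=1024, num_inference_steps=50)
image.save("image_coarse_fine_bias.jpg")
```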