Z-Image-Turbo-Fun-Controlnet-Union-2.1

Github

Update

  • [2026.02.26] Update to version 2602, with support for Gray Control.
  • [2026.01.12] Update to version 2601, with support for Scribble Control. Added lite models (1.9GB, 5 layers). Retrained Control and Tile models with enriched mask varieties, improved training schedules, and multi-resolution control images (512–1536) to fix mask pattern leakage and large control_context_scale artifacts.
  • [2025.12.22] Performed 8-step distillation on v2.1 to restore acceleration lost when applying ControlNet. Uploaded a tile model for super-resolution.
  • [2025.12.17] Fixed v2.0 typo (control_layers used instead of control_noise_refiner), which caused double forward pass and slow inference. Speed restored in v2.1.
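The 2601 retraining replaces single-resolution (512) control images with a 512–1536 range. The sizing arithmetic can be illustrated with a small helper; this is a hedged sketch (the helper name and the multiple-of-64 snapping are assumptions, not code from this repository):

```python
def bucket_size(width, height, short_side, multiple=64):
    """Scale (width, height) so the short side hits `short_side`,
    snapping both dimensions to a multiple of `multiple`."""
    scale = short_side / min(width, height)
    snap = lambda v: max(multiple, round(v * scale / multiple) * multiple)
    return snap(width), snap(height)

# Control images anywhere in the 512-1536 range described above:
for side in (512, 1024, 1536):
    print(bucket_size(1920, 1080, side))
```

Sampling the short side across this range during training is what exposes the model to multiple control resolutions.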

Model Card

a. 2602 Models

| Name | Description |
|------|-------------|
| Z-Image-Turbo-Fun-Controlnet-Union-2.1-2602-8steps.safetensors | Supports multiple control conditions (Canny, Depth, Pose, MLSD, HED, Scribble, and Gray). |
| Z-Image-Turbo-Fun-Controlnet-Union-2.1-lite-2602-8steps.safetensors | Same training scheme as the 2601 version, but with control applied to fewer layers. Supports multiple control conditions (Canny, Depth, Pose, MLSD, HED, Scribble, and Gray). |

b. 2601 Models

| Name | Description |
|------|-------------|
| Z-Image-Turbo-Fun-Controlnet-Union-2.1-2601-8steps.safetensors | Compared to the old version, this model uses more diverse masks, a more reasonable training schedule, and multi-resolution control images (512–1536) instead of a single resolution (512). This reduces artifacts and mask information leakage while improving robustness. Supports multiple control conditions (Canny, Depth, Pose, MLSD, HED, and Scribble). |
| Z-Image-Turbo-Fun-Controlnet-Tile-2.1-2601-8steps.safetensors | Compared to the old version, uses higher training resolution and a more refined distillation schedule, reducing bright spots and artifacts. |
| Z-Image-Turbo-Fun-Controlnet-Union-2.1-lite-2601-8steps.safetensors | Same training scheme as the 2601 version, but with control applied to fewer layers, resulting in weaker control. This allows larger control_context_scale values with more natural results, and is also better suited for lower-spec machines. Supports multiple control conditions (Canny, Depth, Pose, MLSD, HED, and Scribble). |
| Z-Image-Turbo-Fun-Controlnet-Tile-2.1-lite-2601-8steps.safetensors | Same training scheme as the 2601 version, but with control applied to fewer layers, resulting in weaker control. Allows larger control_context_scale values with more natural results, and better suits lower-spec machines. |

c. Models Before 2601

| Name | Description |
|------|-------------|
| Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps.safetensors | Distilled from version 2.1 using an 8-step distillation algorithm. Compared to version 2.1, 8-step prediction yields clearer images with more reasonable composition. Supports Canny, Depth, Pose, MLSD, and HED. |
| Z-Image-Turbo-Fun-Controlnet-Tile-2.1-8steps.safetensors | A Tile model trained on high-definition datasets (up to 2048×2048) for super-resolution, distilled using an 8-step algorithm. 8-step prediction is recommended. |
| Z-Image-Turbo-Fun-Controlnet-Union-2.1.safetensors | A retrained model fixing the typo in version 2.0, with faster single-step speed. Supports Canny, Depth, Pose, MLSD, and HED. However, like version 2.0, some acceleration capability was lost during training, requiring more steps and cfg. |
| Z-Image-Turbo-Fun-Controlnet-Union-2.0.safetensors | ControlNet weights for Z-Image-Turbo. Compared to version 1.0, more layers are modified with longer training. However, a code typo caused layer blocks to forward twice, resulting in slower speed. Supports Canny, Depth, Pose, MLSD, and HED. Some acceleration capability was lost during training, requiring more steps. |
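The Tile model targets super-resolution, where large images are typically processed as overlapping crops. A minimal sketch of the tiling arithmetic (the tile size and overlap defaults here are illustrative assumptions, not the model's actual settings):

```python
import math

def tile_boxes(width, height, tile=1024, overlap=128):
    """Overlapping (left, top, right, bottom) crops that cover the image.
    Tiles are spread evenly, so the effective overlap is >= `overlap`."""
    def positions(size):
        if size <= tile:
            return [0]
        n = math.ceil((size - tile) / (tile - overlap)) + 1
        return [round(i * (size - tile) / (n - 1)) for i in range(n)]
    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in positions(height) for x in positions(width)]

boxes = tile_boxes(2048, 2048)
print(len(boxes), boxes[0], boxes[-1])  # 9 tiles covering the 2048x2048 image
```

Blending the overlapping regions after per-tile generation avoids visible seams.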

Model Features

  • This ControlNet is applied to 15 layer blocks and 2 refiner layer blocks (Lite models: 3 layer blocks and 2 refiner layer blocks). It supports multiple control conditions including Canny, HED, Depth, Pose, and MLSD (supporting Scribble in 2601 models and Gray in 2602 models).
  • Inpainting mode is also supported; in inpaint mode, use a larger control_context_scale for better image continuity.
  • Training Process:
    • 2.0: Trained from scratch for 70,000 steps on 1M high-quality images (general and human-centric content) at 1328 resolution with BFloat16 precision, batch size 64, learning rate 2e-5, and text dropout ratio 0.10.
    • 2.1: Continued training from 2.0 weights for 11,000 additional steps after fixing a typo, using the same parameters and dataset.
    • 2.1-8-steps: Distilled from version 2.1 using an 8-step distillation algorithm for 5,500 steps.
  • Note on Steps:
    • 2.0 and 2.1: Higher control_context_scale values may require more inference steps for better results, likely because the control model has not been distilled.
    • 2.1-8-steps: Use 8 steps for inference.
  • Adjust control_context_scale (optimal range: 0.65–1.00) for stronger control and better detail preservation. A detailed prompt is highly recommended for stability.
  • In versions 2.0 and 2.1, applying ControlNet to Z-Image-Turbo caused loss of acceleration capability and blurry images. For strength and step count testing details, refer to Scale Test Results (generated with version 2.0).
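The scale and inpaint recommendations above can be summarised with two tiny helpers. This is an illustrative sketch of the roles that control_context_scale and the inpaint mask play, not code from the repository:

```python
def clamp_scale(scale, lo=0.65, hi=1.00):
    """Keep control_context_scale inside the recommended 0.65-1.00 range."""
    return min(max(scale, lo), hi)

def composite(original, generated, mask):
    """Per-pixel inpaint blend: keep `original` where mask == 0 and
    take `generated` where mask == 1 (fractional values blend)."""
    return [[o * (1 - m) + g * m for o, g, m in zip(ro, rg, rm)]
            for ro, rg, rm in zip(original, generated, mask)]

print(clamp_scale(1.2))                    # -> 1.0
print(composite([[10]], [[20]], [[0.5]]))  # -> [[15.0]]
```

In the actual pipeline the blend happens in latent space, but the boundary-continuity intuition is the same: a larger scale keeps the generated region consistent with the control signal across the mask edge.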

Results

a. Difference between 2.1-8steps and 2.1-2601-8steps.

The old 8-step model produced bright spots and artifacts when control_context_scale was too large; the new version does not.

Left: Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps · Right: Z-Image-Turbo-Fun-Controlnet-Union-2.1-2601-8steps

The old 8-step model sometimes learned the mask pattern itself and tended to completely fill the masked region during removal; the new version does not.

Left: Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps · Right: Z-Image-Turbo-Fun-Controlnet-Union-2.1-2601-8steps

b. Difference between 2.1 and 2.1-8steps.

8-step results:

Left: Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps · Right: Z-Image-Turbo-Fun-Controlnet-Union-2.1

c. Generation Results With 2.1-lite-2601-8steps

Shares the same training scheme as the 2601 version, but with control applied to fewer layers, resulting in weaker control. This allows for larger control_context_scale values with more natural results, and is also better suited for lower-spec machines.

Pose Output
Pose Output
Canny Output

d. Generation Results With 2.1-2601-8steps

Depth Output
Pose + Inpaint Output
Pose + Inpaint Output
Pose Output
Pose Output
Pose Output
Canny Output
HED Output
Depth Output
Low Resolution | High Resolution

e. Gray Control Results with 2602 Models

Low Resolution | High Resolution

Inference

Go to the VideoX-Fun repository for more details.

Please clone the VideoX-Fun repository and create the required directories:

# Clone the code
git clone https://github.com/aigc-apps/VideoX-Fun.git

# Enter VideoX-Fun's directory
cd VideoX-Fun

# Create model directories
mkdir -p models/Diffusion_Transformer
mkdir -p models/Personalized_Model

Then download the weights into models/Diffusion_Transformer and models/Personalized_Model.

📦 models/
├── 📂 Diffusion_Transformer/
│   └── 📂 Z-Image-Turbo/
├── 📂 Personalized_Model/
│   ├── 📦 Z-Image-Turbo-Fun-Controlnet-Union-2.1.safetensors
│   ├── 📦 Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps.safetensors
│   └── 📦 Z-Image-Turbo-Fun-Controlnet-Tile-2.1-8steps.safetensors
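A quick sanity check of the layout above can save a failed run. This helper is not part of VideoX-Fun, just a hedged convenience sketch using the paths listed in the tree:

```python
from pathlib import Path

EXPECTED = [
    "models/Diffusion_Transformer/Z-Image-Turbo",
    "models/Personalized_Model/Z-Image-Turbo-Fun-Controlnet-Union-2.1.safetensors",
    "models/Personalized_Model/Z-Image-Turbo-Fun-Controlnet-Union-2.1-8steps.safetensors",
    "models/Personalized_Model/Z-Image-Turbo-Fun-Controlnet-Tile-2.1-8steps.safetensors",
]

def missing_weights(root="."):
    """List the expected model paths that do not exist under `root`."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]

for p in missing_weights():
    print("missing:", p)
```

Run it from the VideoX-Fun directory after downloading; an empty result means the layout matches.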

Then run examples/z_image_fun/predict_t2i_control_2.1.py for text-to-image control, or examples/z_image_fun/predict_i2i_inpaint_2.1.py for inpainting.

(Obsolete) Scale Test Results

The table below shows the generation results under different combinations of Diffusion steps and Control Scale strength:

| Diffusion Steps | Scale 0.65 | Scale 0.70 | Scale 0.75 | Scale 0.8 | Scale 0.9 | Scale 1.0 |
|---|---|---|---|---|---|---|
| 9 | | | | | | |
| 10 | | | | | | |
| 20 | | | | | | |
| 30 | | | | | | |
| 40 | | | | | | |

Parameter Description:

  • Diffusion Steps: number of iteration steps for the diffusion model (9, 10, 20, 30, 40)
  • Control Scale: control strength coefficient (0.65–1.0)
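The table's cells are just the Cartesian product of the two parameter lists, which is convenient when reproducing a sweep like this:

```python
from itertools import product

steps = [9, 10, 20, 30, 40]
scales = [0.65, 0.70, 0.75, 0.8, 0.9, 1.0]

# One generation per (steps, scale) cell of the table above.
grid = list(product(steps, scales))
print(len(grid))  # 30 cells
```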
