ZImageTransformer2DModel
A Transformer model for image-like data from Z-Image.
ZImageTransformer2DModel
class diffusers.ZImageTransformer2DModel
( all_patch_size = (2,), all_f_patch_size = (1,), in_channels = 16, dim = 3840, n_layers = 30, n_refiner_layers = 2, n_heads = 30, n_kv_heads = 30, norm_eps = 1e-05, qk_norm = True, cap_feat_dim = 2560, siglip_feat_dim = None, rope_theta = 256.0, t_scale = 1000.0, axes_dims = [32, 48, 48], axes_lens = [1024, 512, 512] )
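The defaults above are related by simple arithmetic: dim / n_heads is 128, and the entries of axes_dims also sum to 128. The sketch below checks only that arithmetic on values copied from the signature. That the rotary axes are meant to partition the per-head dimension is an assumption made here for illustration, not something this page states, and the helper names are hypothetical.

```python
# Defaults copied from the constructor signature above.
config = {
    "dim": 3840,
    "n_heads": 30,
    "n_kv_heads": 30,
    "axes_dims": [32, 48, 48],
    "axes_lens": [1024, 512, 512],
}


def head_dim(cfg):
    """Per-head width, assuming dim is split evenly across n_heads."""
    if cfg["dim"] % cfg["n_heads"] != 0:
        raise ValueError("dim must be divisible by n_heads")
    return cfg["dim"] // cfg["n_heads"]


def rope_axes_match(cfg):
    """True when the axes_dims entries sum to the per-head width.

    Assumption: the positional axes jointly cover one attention head.
    """
    return sum(cfg["axes_dims"]) == head_dim(cfg)


print(head_dim(config), rope_axes_match(config))
```

A check like this can be useful before trying a non-default dim or n_heads, since it shows at a glance whether axes_dims would also need to change under the stated assumption.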
forward
( x: list, t, cap_feats: list, return_dict: bool = True, controlnet_block_samples: dict[int, torch.Tensor] | None = None, siglip_feats: list[list[torch.Tensor]] | None = None, image_noise_mask: list[list[int]] | None = None, patch_size: int = 2, f_patch_size: int = 1 )
Parameters
- x (list of torch.Tensor or nested list of torch.Tensor) — Input latents. A flat list in standard mode, or a nested list in omni mode.
- t (torch.Tensor) — Indicates the current denoising step.
- cap_feats (list of torch.Tensor or nested list of torch.Tensor) — Conditional caption embeddings, computed from the input conditions such as prompts.
- return_dict (bool, optional, defaults to True) — Whether to return a Transformer2DModelOutput instead of a plain tuple.
- controlnet_block_samples (dict of int to torch.Tensor, optional) — A mapping from block index to tensor. If specified, each tensor is added to the residual of the corresponding transformer block.
- siglip_feats (list of list of torch.Tensor, optional) — SigLIP image features used as additional conditioning.
- image_noise_mask (list of list of int, optional) — Per-image noise masks indicating noisy vs. clean tokens in omni mode.
- patch_size (int, optional, defaults to 2) — Spatial patch size used to patchify the input latents.
- f_patch_size (int, optional, defaults to 1) — Temporal patch size used to patchify the input latents.
The ZImageTransformer2DModel forward method.
Flow: patchify -> t_embed -> x_embed -> x_refine -> cap_embed -> cap_refine -> [siglip_embed -> siglip_refine] -> build_unified -> main_layers -> final_layer -> unpatchify
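The stage order in the flow line can be written out as a small sketch. It assumes the bracketed stages (siglip_embed, siglip_refine) run only when siglip_feats is provided, which is a reading of the bracket notation and not confirmed by this page. The function name forward_stages is hypothetical, and the stage names are plain labels, not callable library methods.

```python
def forward_stages(use_siglip: bool) -> list[str]:
    """Return the stage labels from the flow line, in order.

    Assumption: the bracketed SigLIP stages are skipped when no
    SigLIP features are passed to forward.
    """
    stages = [
        "patchify",
        "t_embed",
        "x_embed",
        "x_refine",
        "cap_embed",
        "cap_refine",
    ]
    if use_siglip:
        stages += ["siglip_embed", "siglip_refine"]
    stages += ["build_unified", "main_layers", "final_layer", "unpatchify"]
    return stages


print(" -> ".join(forward_stages(use_siglip=False)))
```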
patchify_and_embed
( all_image: list, all_cap_feats: list, patch_size: int, f_patch_size: int )
Patchify for basic mode: single image per batch item.
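To give a feel for what patchifying does to sizes, the following sketch computes how many tokens one latent would produce and how wide each token would be. It assumes a latent laid out as (channels, frames, height, width) and non-overlapping patches, and it ignores any padding the library may apply. The function patchify_shape is a hypothetical helper, not part of diffusers.

```python
def patchify_shape(channels, frames, height, width, patch_size=2, f_patch_size=1):
    """Return (num_tokens, token_dim) for non-overlapping patches.

    Assumption: each token flattens one f_patch_size x patch_size x
    patch_size block across all channels.
    """
    if frames % f_patch_size or height % patch_size or width % patch_size:
        raise ValueError("latent dimensions must be divisible by the patch sizes")
    num_tokens = (frames // f_patch_size) * (height // patch_size) * (width // patch_size)
    token_dim = f_patch_size * patch_size * patch_size * channels
    return num_tokens, token_dim


# A 16-channel, single-frame, 128 x 128 latent with the default patch sizes.
print(patchify_shape(16, 1, 128, 128))
```

With the default in_channels = 16, patch_size = 2 and f_patch_size = 1, each token under this assumption carries 2 * 2 * 1 * 16 = 64 values.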
patchify_and_embed_omni
( all_x: list, all_cap_feats: list, all_siglip_feats: list, patch_size: int, f_patch_size: int, images_noise_mask: list )
Patchify for omni mode: multiple images per batch item with noise masks.
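The nested inputs for omni mode can be pictured as one inner list per batch item. The sketch below checks that a nested all_x and a nested images_noise_mask line up, assuming one mask entry per image. That alignment is inferred from the parameter types (list of list of int) and is not confirmed by this page. The helper check_omni_inputs is hypothetical, and strings stand in for tensors.

```python
def check_omni_inputs(all_x, images_noise_mask):
    """Return the number of images per batch item after basic checks.

    Assumption: images_noise_mask[i][j] describes image j of batch item i.
    """
    if len(all_x) != len(images_noise_mask):
        raise ValueError("one mask list is expected per batch item")
    for i, (images, mask) in enumerate(zip(all_x, images_noise_mask)):
        if len(images) != len(mask):
            raise ValueError(f"batch item {i}: {len(images)} images but {len(mask)} mask entries")
    return [len(images) for images in all_x]


# Two batch items: the first holds two images, the second holds one.
all_x = [["latent_a", "latent_b"], ["latent_c"]]
images_noise_mask = [[0, 1], [1]]
print(check_omni_inputs(all_x, images_noise_mask))
```

Which mask value marks a noisy image is not stated on this page, so the example values here are arbitrary.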