Diffusers documentation

ErnieImageTransformer2DModel

You are viewing the main version, which requires installation from source. For a regular pip install, check out the latest stable version (v0.39.0).

A Transformer model for image-like data, used by ERNIE-Image and ERNIE-Image-Turbo.

ErnieImageTransformer2DModel

class diffusers.ErnieImageTransformer2DModel

( hidden_size: int = 3072, num_attention_heads: int = 24, num_layers: int = 24, ffn_hidden_size: int = 8192, in_channels: int = 128, out_channels: int = 128, patch_size: int = 1, text_in_dim: int = 2560, rope_theta: int = 256, rope_axes_dim: typing.Tuple[int, int, int] = (32, 48, 48), eps: float = 1e-06, qk_layernorm: bool = True )
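The default values in this signature can be cross-checked with a little arithmetic. The sketch below assumes the conventional multi-head layout, where each attention head gets hidden_size / num_attention_heads channels, and assumes that the three rope_axes_dim entries together span that per-head width. Neither assumption is stated on this page; the numbers are simply consistent with it.

```python
# Default values copied from the constructor signature above.
hidden_size = 3072
num_attention_heads = 24
rope_axes_dim = (32, 48, 48)

# Assumption: each head receives an equal slice of the hidden size.
head_dim = hidden_size // num_attention_heads

# Assumption: the rotary embedding axes partition the per-head width.
rope_total = sum(rope_axes_dim)

print(head_dim)    # 128
print(rope_total)  # 128
```

If you change hidden_size or num_attention_heads, it is probably worth rechecking that rope_axes_dim still adds up to the resulting per-head width.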

forward

( hidden_states: Tensor, timestep: Tensor, text_bth: Tensor, text_lens: Tensor, return_dict: bool = True )

Parameters

  • hidden_states (torch.Tensor of shape (batch_size, in_channels, height, width)) — Input hidden states.
  • timestep (torch.LongTensor) — The current denoising step.
  • text_bth (torch.Tensor of shape (batch_size, text_length, embed_dims)) — Conditional text embeddings, computed from input conditions such as prompts.
  • text_lens (torch.Tensor) — Per-sample text sequence lengths, used to build the attention mask.
  • return_dict (bool, optional, defaults to True) — Whether to return a Transformer2DModelOutput instead of a plain tuple.

The ErnieImageTransformer2DModel forward method.
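To make the documented shapes concrete, here is a minimal sketch that builds dummy tensors for each forward argument and derives a padding mask from text_lens. It does not call the model. The helper lengths_to_mask is hypothetical and not part of Diffusers, the mask convention (True marks a real token) is an assumption, and so are the one-entry-per-sample timestep shape and the specific sizes used.

```python
import torch

# Illustrative sizes; in_channels and embed_dims follow the documented defaults
# (in_channels=128, text_in_dim=2560). The rest are arbitrary.
batch_size, in_channels, height, width = 2, 128, 8, 8
text_length, embed_dims = 6, 2560

hidden_states = torch.randn(batch_size, in_channels, height, width)
# Assumption: one integer timestep per sample.
timestep = torch.tensor([500, 500], dtype=torch.long)
text_bth = torch.randn(batch_size, text_length, embed_dims)
# The second prompt is shorter, so its last two positions are padding.
text_lens = torch.tensor([6, 4], dtype=torch.long)


def lengths_to_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """Hypothetical helper: True where a position holds a real token."""
    positions = torch.arange(max_len, device=lengths.device)
    # Broadcast (1, max_len) against (batch, 1) to get a (batch, max_len) mask.
    return positions.unsqueeze(0) < lengths.unsqueeze(1)


text_mask = lengths_to_mask(text_lens, text_length)
print(text_mask.tolist())
# [[True, True, True, True, True, True], [True, True, True, True, False, False]]
```

The tensor names match the forward arguments above, so they could be passed by keyword to a loaded model. How the model turns text_lens into its internal attention mask may differ from this sketch.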
