Diffusers documentation

Krea 2

Krea 2 (K2) is a flow-matching text-to-image model built around a single-stream MMDiT with grouped-query attention. A Qwen3-VL text encoder provides the conditioning: instead of the last hidden state, hidden states from twelve decoder layers are tapped per token and fused inside the transformer by a small text-fusion stage. Images are decoded with the Qwen-Image VAE.

Two checkpoints are released, sharing the same architecture but with different recommended sampler settings:

  • Base (midtrain) — use the full sampler with classifier-free guidance: num_inference_steps=28, guidance_scale=4.5.
  • TDM (distilled) — distilled for few-step sampling; run it with num_inference_steps=8 and guidance disabled (guidance_scale=0.0).
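
The two recommended configurations can be kept in a small lookup so that the matching pair of arguments is always passed together. This is only a convenience sketch: `SAMPLER_PRESETS` and `sampler_kwargs` are hypothetical names that are not part of diffusers, and the values are the ones listed above.

```python
# Hypothetical helper, not part of diffusers: maps a checkpoint kind to the
# sampler arguments recommended in the list above.
SAMPLER_PRESETS = {
    "base": {"num_inference_steps": 28, "guidance_scale": 4.5},
    "tdm": {"num_inference_steps": 8, "guidance_scale": 0.0},
}


def sampler_kwargs(checkpoint: str) -> dict:
    """Return a copy of the recommended sampler settings for `checkpoint`."""
    try:
        # Copy so callers can tweak the result without mutating the presets.
        return dict(SAMPLER_PRESETS[checkpoint])
    except KeyError:
        raise ValueError(f"unknown checkpoint kind: {checkpoint!r}") from None
```

With a loaded pipeline, this would be used as `pipe(prompt, **sampler_kwargs("tdm"))`.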

guidance_scale follows the Krea 2 convention: the velocity is computed as cond + guidance_scale * (cond - uncond), and guidance is enabled whenever guidance_scale > 0. This is equivalent to the usual CFG formulation, uncond + s * (cond - uncond), with s = 1 + guidance_scale.
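
The equivalence between the two conventions can be checked numerically. The sketch below uses plain Python floats standing in for a single element of the velocity tensors; `krea2_guidance` and `standard_cfg` are illustrative names, not pipeline functions.

```python
import math


def krea2_guidance(cond: float, uncond: float, guidance_scale: float) -> float:
    """Krea 2 convention: a scale of 0 returns the conditional prediction."""
    return cond + guidance_scale * (cond - uncond)


def standard_cfg(cond: float, uncond: float, cfg_scale: float) -> float:
    """Usual CFG convention: a scale of 1 returns the conditional prediction."""
    return uncond + cfg_scale * (cond - uncond)


cond, uncond = 0.8, 0.2
for g in (0.0, 1.0, 4.5):
    # Krea 2 scale g corresponds to the usual CFG scale 1 + g.
    assert math.isclose(krea2_guidance(cond, uncond, g), standard_cfg(cond, uncond, 1.0 + g))
```

At `guidance_scale=0.0` the result is just the conditional prediction, which is why the unconditional branch is not needed for the TDM checkpoint.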

Text-to-image

import torch
from diffusers import Krea2Pipeline

# Load from a local directory produced by the Krea 2 conversion (no hub repo yet).
pipe = Krea2Pipeline.from_pretrained("path/to/krea2-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "a fox in the snow"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.5,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("krea2.png")

Krea2Pipeline

class diffusers.Krea2Pipeline

( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKLQwenImage text_encoder: Qwen3VLModel tokenizer: AutoTokenizer transformer: Krea2Transformer2DModel text_encoder_select_layers: tuple[int, ...] | list[int] | None = None is_distilled: bool = False patch_size: int = 2 )

Parameters

  • scheduler (FlowMatchEulerDiscreteScheduler) — Euler flow-matching scheduler. The Krea 2 sigma schedule is the resolution-aware exponential time shift, so the scheduler config is expected to set use_dynamic_shifting=True together with the Krea 2 shift parameters (base_shift=0.5, max_shift=1.15, base_image_seq_len=256, max_image_seq_len=6400).
  • vae (AutoencoderKLQwenImage) — The Qwen-Image variational auto-encoder (f8, 16 latent channels) used to decode latents to images.
  • text_encoder (PreTrainedModel) — A Qwen3-VL model (e.g. the Qwen3VLModel from Qwen/Qwen3-VL-4B-Instruct). The pipeline consumes a stack of hidden states tapped from several decoder layers rather than the last hidden state.
  • tokenizer (AutoTokenizer) — The tokenizer paired with the text encoder.
  • transformer (Krea2Transformer2DModel) — The Krea 2 single-stream MMDiT that predicts the flow-matching velocity.
  • text_encoder_select_layers (tuple[int, ...], optional) — Indices into the text encoder’s hidden_states tuple (0 is the embedding output) whose states are stacked per token as the transformer’s text conditioning. Must have transformer.config.num_text_layers entries.
  • is_distilled (bool, optional, defaults to False) — Whether the transformer is the few-step distilled (TDM/turbo) checkpoint. When True a fixed timestep shift mu=1.15 is used; otherwise mu is computed from the image resolution.
  • patch_size (int, optional, defaults to 2) — Side length of the square patches the latents are packed into before entering the transformer. The effective pixel-to-token downsampling factor is vae_scale_factor * patch_size.

The Krea 2 pipeline for text-to-image generation.
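
As a rough illustration of how the resolution feeds the timestep shift, the sketch below counts image tokens from `height` and `width` and derives `mu`. The token count follows from the parameters above (f8 VAE, `patch_size=2`, so 16 pixels per token side). The linear interpolation between `base_shift` and `max_shift` is an assumption borrowed from other flow-matching pipelines in diffusers; this page does not state the exact formula. `image_seq_len` and `compute_mu` are hypothetical names.

```python
DISTILLED_MU = 1.15  # fixed shift used when is_distilled=True


def image_seq_len(height: int, width: int, vae_scale_factor: int = 8, patch_size: int = 2) -> int:
    """Number of image tokens after VAE downsampling and patch packing."""
    factor = vae_scale_factor * patch_size  # pixels covered by one token side
    # Ceiling division: each side is rounded up to a multiple of `factor`.
    rows = -(-height // factor)
    cols = -(-width // factor)
    return rows * cols


def compute_mu(
    seq_len: int,
    is_distilled: bool = False,
    base_shift: float = 0.5,
    max_shift: float = 1.15,
    base_image_seq_len: int = 256,
    max_image_seq_len: int = 6400,
) -> float:
    """Resolution-aware shift (assumed linear in the token count)."""
    if is_distilled:
        return DISTILLED_MU
    slope = (max_shift - base_shift) / (max_image_seq_len - base_image_seq_len)
    return base_shift + slope * (seq_len - base_image_seq_len)
```

Under these assumptions, a 1024x1024 image gives 4096 tokens and a `mu` of about 0.906, while the distilled checkpoint always uses 1.15.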

__call__

( prompt: str | list[str] | None = None negative_prompt: str | list[str] | None = None height: int = 1024 width: int = 1024 num_inference_steps: int = 28 sigmas: list[float] | None = None guidance_scale: float = 4.5 num_images_per_prompt: int = 1 generator: torch._C.Generator | list[torch._C.Generator] | None = None latents: torch.Tensor | None = None prompt_embeds: torch.Tensor | None = None prompt_embeds_mask: torch.Tensor | None = None negative_prompt_embeds: torch.Tensor | None = None negative_prompt_embeds_mask: torch.Tensor | None = None output_type: str | None = 'pil' return_dict: bool = True callback_on_step_end: typing.Optional[typing.Callable[[int, int, dict], NoneType]] = None callback_on_step_end_tensor_inputs: list = ['latents'] attention_kwargs: dict[str, typing.Any] | None = None max_sequence_length: int = 512 ) Krea2PipelineOutput or tuple

Parameters

  • prompt (str or list[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds.
  • negative_prompt (str or list[str], optional) — The prompt or prompts describing what the image generation should be steered away from. Ignored when guidance_scale <= 0; defaults to an empty prompt when guidance is enabled.
  • height (int, defaults to 1024) — The height in pixels of the generated image. Rounded up to a multiple of 16 if needed.
  • width (int, defaults to 1024) — The width in pixels of the generated image. Rounded up to a multiple of 16 if needed.
  • num_inference_steps (int, defaults to 28) — The number of denoising steps. Use 28 for the base (midtrain) checkpoint and 8 for the few-step distilled (TDM) checkpoint.
  • sigmas (list[float], optional) — Custom sigmas for the scheduler. If not defined, the default linspace(1.0, 1/num_inference_steps, num_inference_steps) grid is used (the resolution-aware shift is applied inside the scheduler).
  • guidance_scale (float, defaults to 4.5) — Classifier-free guidance scale, following the Krea 2 convention: the velocity is computed as cond + guidance_scale * (cond - uncond) and guidance is enabled whenever guidance_scale > 0 (this equals the usual CFG formulation with scale 1 + guidance_scale). Set to 0.0 to disable (e.g. for the TDM checkpoint).
  • num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt.
  • generator (torch.Generator or list[torch.Generator], optional) — One or more torch generator(s) to make generation deterministic.
  • latents (torch.Tensor, optional) — Pre-generated noisy latents in packed form (batch_size, image_seq_len, in_channels), sampled from a Gaussian distribution, to be used as inputs for image generation.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings of shape (batch_size, text_seq_len, num_text_layers, text_hidden_dim). If not provided, embeddings are generated from prompt.
  • prompt_embeds_mask (torch.Tensor, optional) — Boolean mask for prompt_embeds; required when prompt_embeds is passed.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings; same layout as prompt_embeds.
  • negative_prompt_embeds_mask (torch.Tensor, optional) — Boolean mask for negative_prompt_embeds; required when negative_prompt_embeds is passed.
  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil", "np", "pt" or "latent".
  • return_dict (bool, optional, defaults to True) — Whether or not to return a Krea2PipelineOutput instead of a plain tuple.
  • callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step with callback_on_step_end(self, step, timestep, callback_kwargs).
  • callback_on_step_end_tensor_inputs (list[str], optional, defaults to ["latents"]) — The list of tensor inputs for the callback_on_step_end function. Must be a subset of the pipeline's _callback_tensor_inputs attribute.
  • attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
  • max_sequence_length (int, defaults to 512) — Fixed text sequence length consumed by the transformer; prompts are padded or truncated to it.
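
The default sigma grid and the exponential time shift mentioned for `sigmas` can be sketched in plain Python. `default_sigmas` mirrors the `linspace` expression above. `exponential_time_shift` uses the form exp(mu) / (exp(mu) + (1/sigma - 1)), which is assumed here to match the exponential shift applied inside `FlowMatchEulerDiscreteScheduler`; treat it as an illustration rather than the scheduler's code. Both function names are hypothetical.

```python
import math


def default_sigmas(num_inference_steps: int) -> list[float]:
    """Pure-Python equivalent of linspace(1.0, 1/num_inference_steps, num_inference_steps)."""
    n = num_inference_steps
    if n == 1:
        return [1.0]
    step = (1.0 / n - 1.0) / (n - 1)
    return [1.0 + i * step for i in range(n)]


def exponential_time_shift(sigma: float, mu: float) -> float:
    """Shift a sigma in (0, 1] towards 1 when mu > 0 (assumed form of the shift)."""
    return math.exp(mu) / (math.exp(mu) + (1.0 / sigma - 1.0))


# Unshifted grid for the 8-step distilled setting, then shifted with mu=1.15.
grid = default_sigmas(8)
shifted = [exponential_time_shift(s, 1.15) for s in grid]
```

A positive `mu` keeps sigma=1 fixed and pushes every other sigma upward, so more of the step budget is spent at high noise levels.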

Returns

Krea2PipelineOutput or tuple

Krea2PipelineOutput if return_dict is True, otherwise a tuple, whose first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import Krea2Pipeline

>>> # Load from a local directory produced by the Krea 2 conversion (no hub repo yet).
>>> pipe = Krea2Pipeline.from_pretrained("path/to/krea2-diffusers", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "a fox in the snow"
>>> # Base (midtrain) checkpoint defaults. For the few-step distilled (TDM) checkpoint use
>>> # `num_inference_steps=8, guidance_scale=0.0` instead.
>>> image = pipe(prompt, num_inference_steps=28, guidance_scale=4.5).images[0]
>>> image.save("krea2.png")
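
A step-end callback can be sketched as follows. It takes the arguments in the order given for `callback_on_step_end` and returns `callback_kwargs`, which is the convention in other diffusers pipelines and is assumed to apply here as well. `log_step` and `seen_steps` are illustrative names.

```python
seen_steps = []


def log_step(pipe, step, timestep, callback_kwargs):
    """Record which tensors were exposed at each step, then hand them back unchanged."""
    seen_steps.append((step, sorted(callback_kwargs)))
    return callback_kwargs
```

It would be passed as `pipe(prompt, callback_on_step_end=log_step, callback_on_step_end_tensor_inputs=["latents"])`.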

encode_prompt

( prompt: str | list[str] device: torch.device | None = None num_images_per_prompt: int = 1 prompt_embeds: torch.Tensor | None = None prompt_embeds_mask: torch.Tensor | None = None max_sequence_length: int = 512 )

Parameters

  • prompt (str or list[str], optional) — The prompt or prompts to encode.
  • device (torch.device, optional) — The torch device.
  • num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt.
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings of shape (batch_size, text_seq_len, num_text_layers, text_hidden_dim). Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • prompt_embeds_mask (torch.Tensor, optional) — Pre-generated boolean mask marking valid text tokens, of shape (batch_size, text_seq_len). Required when prompt_embeds is passed.
  • max_sequence_length (int, defaults to 512) — Fixed text sequence length consumed by the transformer; prompts are padded or truncated to it.
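
The fixed-length layout implied by `max_sequence_length` and `prompt_embeds_mask` can be illustrated on token ids. This is a toy sketch: right-side padding and `pad_id=0` are assumptions made for illustration, and `pad_or_truncate` is a hypothetical helper, not the pipeline's tokenization code.

```python
def pad_or_truncate(token_ids: list[int], max_sequence_length: int = 512, pad_id: int = 0):
    """Return (ids, mask), both of length max_sequence_length; mask is True on real tokens."""
    ids = token_ids[:max_sequence_length]  # truncate prompts that are too long
    num_valid = len(ids)
    num_pad = max_sequence_length - num_valid
    padded = ids + [pad_id] * num_pad  # assumed right-side padding
    mask = [True] * num_valid + [False] * num_pad
    return padded, mask
```

The mask plays the same role as `prompt_embeds_mask`: it marks which positions of the fixed-length sequence hold real text tokens.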

get_text_hidden_states

( prompt: str | list[str] max_sequence_length: int = 512 device: torch.device | None = None )

Tokenize prompt into the fixed-length Krea 2 layout and tap the selected encoder hidden states.

Returns a (hidden_states, attention_mask) tuple of shapes (batch_size, text_seq_len, num_text_layers, text_hidden_dim) and (batch_size, text_seq_len) (bool).
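
The returned layout can be pictured with nested lists in place of tensors. The sketch below takes one hidden-state sequence per encoder layer (for a single prompt) and regroups the selected layers per token, giving a `[text_seq_len][num_text_layers][text_hidden_dim]` structure. `stack_selected_layers` and the layer indices used with it are made up for illustration; the real method works on batched tensors.

```python
def stack_selected_layers(hidden_states, select_layers):
    """Regroup per-layer sequences into per-token stacks.

    hidden_states[layer][token] is a feature vector; the result is indexed as
    result[token][k], where k runs over the selected layers in order.
    """
    selected = [hidden_states[i] for i in select_layers]
    # zip(*selected) walks the tokens, yielding one vector per selected layer.
    return [list(per_token) for per_token in zip(*selected)]
```

With index 0 being the embedding output, `select_layers=(1, 3)` would pick the outputs of the first and third decoder layers in this toy setup.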

prepare_position_ids

( text_seq_len: int grid_height: int grid_width: int device: torch.device )

Build the (text_seq_len + grid_height * grid_width, 3) rotary coordinates for the combined sequence: text tokens sit at the origin, image tokens carry their (0, h, w) latent-grid coordinates.
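
The coordinate layout described above can be sketched with plain tuples. Placing text tokens first and enumerating image tokens in row-major order are assumptions, and `build_position_ids` is an illustrative stand-in for the method, which returns a tensor on `device`.

```python
def build_position_ids(text_seq_len: int, grid_height: int, grid_width: int) -> list[tuple[int, int, int]]:
    """Return one (t, h, w) coordinate per token of the combined sequence."""
    # Every text token sits at the origin.
    text_ids = [(0, 0, 0)] * text_seq_len
    # Image tokens carry their latent-grid row and column (assumed row-major order).
    image_ids = [(0, h, w) for h in range(grid_height) for w in range(grid_width)]
    return text_ids + image_ids
```

For a 1024x1024 image with the default 512 text tokens, this would give 512 + 64 * 64 coordinate triples.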

Krea2PipelineOutput

class diffusers.pipelines.krea2.Krea2PipelineOutput

( images: list[PIL.Image.Image] | numpy.ndarray )

Parameters

  • images (list[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size or numpy array of shape (batch_size, height, width, num_channels).

Output class for the Krea 2 pipeline.
