Krea 2

Krea 2 (K2) is a flow-matching text-to-image model built around a single-stream MMDiT with grouped-query attention. A Qwen3-VL text encoder provides the conditioning: instead of the last hidden state, hidden states from twelve decoder layers are tapped per token and fused inside the transformer by a small text-fusion stage. Images are decoded with the Qwen-Image VAE.

Two checkpoints are released, sharing the same architecture but with different recommended sampler settings:

Base (midtrain): use the full sampler with classifier-free guidance, i.e. num_inference_steps=28 and guidance_scale=4.5.
TDM (distilled): distilled for few-step sampling; run it with num_inference_steps=8 and guidance disabled (guidance_scale=0.0).
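
The two recommended settings can be kept in a small lookup so that the step count and guidance scale are always passed as a pair. This is only a minimal sketch: `SAMPLER_PRESETS` and `sampler_kwargs` are hypothetical helper names and not part of diffusers; only the numbers come from the list above.

```python
# Hypothetical helper (not part of diffusers): recommended call arguments per checkpoint.
SAMPLER_PRESETS = {
    "base": {"num_inference_steps": 28, "guidance_scale": 4.5},  # midtrain, CFG enabled
    "tdm": {"num_inference_steps": 8, "guidance_scale": 0.0},    # distilled, CFG disabled
}


def sampler_kwargs(checkpoint: str) -> dict:
    """Return a copy of the recommended sampler arguments for `checkpoint`."""
    if checkpoint not in SAMPLER_PRESETS:
        raise ValueError(
            f"unknown checkpoint {checkpoint!r}; expected one of {sorted(SAMPLER_PRESETS)}"
        )
    return dict(SAMPLER_PRESETS[checkpoint])
```

With a loaded pipeline this would be used as `pipe(prompt, **sampler_kwargs("tdm"))`.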

guidance_scale follows the Krea 2 convention: the velocity is computed as cond + guidance_scale * (cond - uncond) and guidance is enabled whenever guidance_scale > 0 (this equals the usual CFG formulation with scale 1 + guidance_scale).
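
The equivalence with the usual formulation follows from cond + g * (cond - uncond) = uncond + (1 + g) * (cond - uncond). The sketch below checks it numerically on toy values; plain Python lists stand in for the velocity tensors, and the function names are illustrative rather than part of the pipeline.

```python
def krea2_guidance(cond, uncond, guidance_scale):
    """Krea 2 convention: cond + g * (cond - uncond); g <= 0 disables guidance."""
    if guidance_scale <= 0:
        return list(cond)
    return [c + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def standard_cfg(cond, uncond, scale):
    """Usual CFG formulation: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy velocity values, chosen only for illustration.
cond = [0.2, -1.0, 0.5]
uncond = [0.1, -0.4, 0.9]
g = 4.5

krea = krea2_guidance(cond, uncond, g)
usual = standard_cfg(cond, uncond, 1.0 + g)
# The two conventions agree up to floating-point rounding.
assert all(abs(a - b) < 1e-9 for a, b in zip(krea, usual))
```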

Text-to-image

import torch
from diffusers import Krea2Pipeline

# Load from a local directory produced by the Krea 2 conversion (no hub repo yet).
pipe = Krea2Pipeline.from_pretrained("path/to/krea2-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "a fox in the snow"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    num_inference_steps=28,
    guidance_scale=4.5,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("krea2.png")

Krea2Pipeline

class diffusers.Krea2Pipeline

( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKLQwenImage text_encoder: Qwen3VLModel tokenizer: AutoTokenizer transformer: Krea2Transformer2DModel text_encoder_select_layers: tuple[int, ...] | list[int] | None = None is_distilled: bool = False patch_size: int = 2 )

Parameters

scheduler (FlowMatchEulerDiscreteScheduler) — Euler flow-matching scheduler. The Krea 2 sigma schedule is the resolution-aware exponential time shift, so the scheduler config is expected to set use_dynamic_shifting=True together with the Krea 2 shift parameters (base_shift=0.5, max_shift=1.15, base_image_seq_len=256, max_image_seq_len=6400).
vae (AutoencoderKLQwenImage) — The Qwen-Image variational auto-encoder (f8, 16 latent channels) used to decode latents to images.
text_encoder (PreTrainedModel) — A Qwen3-VL model (e.g. Qwen3VLModel of Qwen/Qwen3-VL-4B-Instruct). The pipeline consumes a stack of hidden states tapped from several decoder layers rather than the last hidden state.
tokenizer (AutoTokenizer) — The tokenizer paired with the text encoder.
transformer (Krea2Transformer2DModel) — The Krea 2 single-stream MMDiT that predicts the flow-matching velocity.
text_encoder_select_layers (tuple[int, ...], optional) — Indices into the text encoder’s hidden_states tuple (0 is the embedding output) whose states are stacked per token as the transformer’s text conditioning. Must have transformer.config.num_text_layers entries.
is_distilled (bool, optional, defaults to False) — Whether the transformer is the few-step distilled (TDM/turbo) checkpoint. When True a fixed timestep shift mu=1.15 is used; otherwise mu is computed from the image resolution.
patch_size (int, optional, defaults to 2) — Side length of the square patches the latents are packed into before entering the transformer. The effective pixel-to-token downsampling factor is vae_scale_factor * patch_size.
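
To make the resolution dependence concrete, the sketch below derives the image token count from the pixel size and then a shift value mu from the scheduler parameters listed above. It is written under two assumptions that are not confirmed by this page: that sizes are rounded up to the next multiple of vae_scale_factor * patch_size before patching, and that mu is a linear interpolation between base_shift and max_shift over the sequence length, as in other flow-matching pipelines in diffusers. The helper names are hypothetical.

```python
import math


def image_seq_len(height, width, vae_scale_factor=8, patch_size=2):
    """Number of image tokens for a (height, width) image in pixels."""
    stride = vae_scale_factor * patch_size  # 16 pixels per token side
    grid_h = math.ceil(height / stride)     # sizes are rounded up to a multiple of 16
    grid_w = math.ceil(width / stride)
    return grid_h * grid_w


def resolution_mu(seq_len, base_shift=0.5, max_shift=1.15,
                  base_image_seq_len=256, max_image_seq_len=6400):
    """ASSUMPTION: mu is linear in seq_len between the two configured anchor points."""
    slope = (max_shift - base_shift) / (max_image_seq_len - base_image_seq_len)
    return base_shift + slope * (seq_len - base_image_seq_len)


tokens = image_seq_len(1024, 1024)  # 64 * 64 tokens
mu = resolution_mu(tokens)
print(tokens, round(mu, 5))
```

Under these assumptions a 1280x1280 image has 6400 tokens and reaches mu = max_shift = 1.15, which matches the fixed value used when is_distilled is True.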

The Krea 2 pipeline for text-to-image generation.

__call__

( prompt: str | list[str] | None = None negative_prompt: str | list[str] | None = None height: int = 1024 width: int = 1024 num_inference_steps: int = 28 sigmas: list[float] | None = None guidance_scale: float = 4.5 num_images_per_prompt: int = 1 generator: torch.Generator | list[torch.Generator] | None = None latents: torch.Tensor | None = None prompt_embeds: torch.Tensor | None = None prompt_embeds_mask: torch.Tensor | None = None negative_prompt_embeds: torch.Tensor | None = None negative_prompt_embeds_mask: torch.Tensor | None = None output_type: str | None = 'pil' return_dict: bool = True callback_on_step_end: typing.Callable[[int, int, dict], None] | None = None callback_on_step_end_tensor_inputs: list = ['latents'] attention_kwargs: dict[str, typing.Any] | None = None max_sequence_length: int = 512 ) → Krea2PipelineOutput or tuple

Parameters

prompt (str or list[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds.
negative_prompt (str or list[str], optional) — The prompt or prompts not to guide the image generation. Ignored when guidance_scale <= 0; defaults to an empty prompt when guidance is enabled.
height (int, defaults to 1024) — The height in pixels of the generated image. Rounded up to a multiple of 16 if needed.
width (int, defaults to 1024) — The width in pixels of the generated image. Rounded up to a multiple of 16 if needed.
num_inference_steps (int, defaults to 28) — The number of denoising steps. Use 28 for the base (midtrain) checkpoint and 8 for the few-step distilled (TDM) checkpoint.
sigmas (list[float], optional) — Custom sigmas for the scheduler. If not defined, the default linspace(1.0, 1/num_inference_steps, num_inference_steps) grid is used (the resolution-aware shift is applied inside the scheduler).
guidance_scale (float, defaults to 4.5) — Classifier-free guidance scale, following the Krea 2 convention: the velocity is computed as cond + guidance_scale * (cond - uncond) and guidance is enabled whenever guidance_scale > 0 (this equals the usual CFG formulation with scale 1 + guidance_scale). Set to 0.0 to disable (e.g. for the TDM checkpoint).
num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt.
generator (torch.Generator or list[torch.Generator], optional) — One or more torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents in packed form (batch_size, image_seq_len, in_channels), sampled from a Gaussian distribution, to be used as inputs for image generation.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings of shape (batch_size, text_seq_len, num_text_layers, text_hidden_dim). If not provided, embeddings are generated from prompt.
prompt_embeds_mask (torch.Tensor, optional) — Boolean mask for prompt_embeds; required when prompt_embeds is passed.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings; same layout as prompt_embeds.
negative_prompt_embeds_mask (torch.Tensor, optional) — Boolean mask for negative_prompt_embeds; required when negative_prompt_embeds is passed.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil", "np", "pt" or "latent".
return_dict (bool, optional, defaults to True) — Whether or not to return a Krea2PipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step with callback_on_step_end(self, step, timestep, callback_kwargs).
callback_on_step_end_tensor_inputs (list[str], optional, defaults to ["latents"]) — The list of tensor inputs for the callback_on_step_end function. Must be a subset of ._callback_tensor_inputs.
attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
max_sequence_length (int, defaults to 512) — Fixed text sequence length consumed by the transformer; prompts are padded or truncated to it.

Returns

Krea2PipelineOutput or tuple

Krea2PipelineOutput if return_dict is True, otherwise a tuple, whose first element is a list with the generated images.

Function invoked when calling the pipeline for generation.
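
As an illustration of the default sigmas grid described above, the sketch below builds linspace(1.0, 1/num_inference_steps, num_inference_steps) in plain Python and applies an exponential time shift to it. The shift formula exp(mu) / (exp(mu) + (1/sigma - 1)) is the exponential form used by FlowMatchEulerDiscreteScheduler in diffusers as far as I am aware; treating it as exactly what Krea 2 applies is an assumption, and the function names are illustrative.

```python
import math


def default_sigmas(num_inference_steps):
    """Equivalent of linspace(1.0, 1/N, N) without NumPy."""
    n = num_inference_steps
    if n == 1:
        return [1.0]
    step = (1.0 / n - 1.0) / (n - 1)  # works out to -1/N
    return [1.0 + i * step for i in range(n)]


def exponential_time_shift(sigma, mu):
    """ASSUMPTION: exponential shift, exp(mu) / (exp(mu) + (1/sigma - 1))."""
    return math.exp(mu) / (math.exp(mu) + (1.0 / sigma - 1.0))


sigmas = default_sigmas(8)  # few-step (TDM) grid
shifted = [exponential_time_shift(s, mu=1.15) for s in sigmas]  # fixed mu of the distilled checkpoint
for raw, new in zip(sigmas, shifted):
    print(f"{raw:.3f} -> {new:.3f}")
```

A positive mu pushes every sigma below 1.0 upward, so more of the step budget is spent at high noise levels.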

Examples:

>>> import torch
>>> from diffusers import Krea2Pipeline

>>> # Load from a local directory produced by the Krea 2 conversion (no hub repo yet).
>>> pipe = Krea2Pipeline.from_pretrained("path/to/krea2-diffusers", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "a fox in the snow"
>>> # Base (midtrain) checkpoint defaults. For the few-step distilled (TDM) checkpoint use
>>> # `num_inference_steps=8, guidance_scale=0.0` instead.
>>> image = pipe(prompt, num_inference_steps=28, guidance_scale=4.5).images[0]
>>> image.save("krea2.png")
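
A step-end callback can be sketched as follows. It assumes Krea2Pipeline follows the usual diffusers convention, in which the callback receives (pipeline, step, timestep, callback_kwargs) and must return the callback_kwargs dictionary; `make_step_recorder` is a hypothetical helper name.

```python
def make_step_recorder():
    """Build a callback that records (step, timestep, latents shape) for each denoising step."""
    records = []

    def on_step_end(pipe, step, timestep, callback_kwargs):
        latents = callback_kwargs["latents"]  # present because of the default tensor inputs
        shape = tuple(getattr(latents, "shape", ()))
        records.append((step, float(timestep), shape))
        # ASSUMPTION: as in other diffusers pipelines, the dict must be returned
        # so the pipeline can read back any tensors the callback modified.
        return callback_kwargs

    return on_step_end, records
```

It would be passed as `callback, records = make_step_recorder()` followed by `pipe(prompt, callback_on_step_end=callback)`, after which `records` holds one entry per step.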

encode_prompt

( prompt: str | list[str] device: torch.device | None = None num_images_per_prompt: int = 1 prompt_embeds: torch.Tensor | None = None prompt_embeds_mask: torch.Tensor | None = None max_sequence_length: int = 512 )

Parameters

prompt (str or list[str], optional) — The prompt or prompts to be encoded.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
num_images_per_prompt (int, defaults to 1) — The number of images to generate per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings of shape (batch_size, text_seq_len, num_text_layers, text_hidden_dim). Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
prompt_embeds_mask (torch.Tensor, optional) — Pre-generated boolean mask marking valid text tokens, of shape (batch_size, text_seq_len). Required when prompt_embeds is passed.
max_sequence_length (int, defaults to 512) — Fixed text sequence length consumed by the transformer; prompts are padded or truncated to it.

get_text_hidden_states

( prompt: str | list[str] max_sequence_length: int = 512 device: torch.device | None = None )

Tokenize prompt into the fixed-length Krea 2 layout and tap the selected encoder hidden states.

Returns a (hidden_states, attention_mask) tuple of shapes (batch_size, text_seq_len, num_text_layers, text_hidden_dim) and (batch_size, text_seq_len) (bool).
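
The per-token layer stacking behind this shape can be illustrated with NumPy arrays standing in for the encoder outputs. The sizes and layer indices below are toy values chosen for the example, not the ones Krea 2 uses, and `stack_selected_layers` is a hypothetical name rather than the pipeline's own code.

```python
import numpy as np


def stack_selected_layers(hidden_states, select_layers):
    """Stack the chosen (batch, seq, dim) layer outputs into (batch, seq, num_layers, dim)."""
    return np.stack([hidden_states[i] for i in select_layers], axis=2)


# Toy stand-in for the encoder's hidden_states tuple: index 0 is the embedding
# output, followed by one entry per decoder layer. Each array is filled with its index.
batch, seq, dim, num_decoder_layers = 2, 5, 8, 6
hidden_states = tuple(
    np.full((batch, seq, dim), float(i)) for i in range(num_decoder_layers + 1)
)

stacked = stack_selected_layers(hidden_states, select_layers=(1, 3, 6))
print(stacked.shape)  # (batch_size, text_seq_len, num_text_layers, text_hidden_dim)
```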

prepare_position_ids

( text_seq_len: int grid_height: int grid_width: int device: torch.device )

Build the (text_seq_len + grid_height * grid_width, 3) rotary coordinates for the combined sequence: text tokens sit at the origin, image tokens carry their (0, h, w) latent-grid coordinates.
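
The layout described above can be sketched with plain tuples. Two details are assumptions rather than facts from this page: that text tokens come first in the combined sequence, and that image tokens are enumerated in row-major order. The function name is illustrative.

```python
def position_ids_sketch(text_seq_len, grid_height, grid_width):
    """Return text_seq_len + grid_height * grid_width coordinate triples."""
    # Every text token sits at the origin.
    text_ids = [(0, 0, 0)] * text_seq_len
    # ASSUMPTION: row-major order over the latent grid, first axis fixed at 0.
    image_ids = [(0, h, w) for h in range(grid_height) for w in range(grid_width)]
    return text_ids + image_ids


ids = position_ids_sketch(text_seq_len=3, grid_height=2, grid_width=3)
for row in ids:
    print(row)
```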

Krea2PipelineOutput

class diffusers.pipelines.krea2.Krea2PipelineOutput

( images: list[PIL.Image.Image] | numpy.ndarray )

Parameters

images (list[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size or numpy array of shape (batch_size, height, width, num_channels).

Output class for the Krea 2 pipeline.
