JoyAI-Image-Edit
JoyAI-Image is a unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT). A central principle of JoyAI-Image is the closed-loop collaboration between understanding, generation, and editing.
JoyAI-Image-Edit supports general image editing as well as spatial editing capabilities including object move, object rotation, and camera control.
| Model | Description | Download |
|---|---|---|
| JoyAI-Image-Edit | Instruction-guided image editing with precise and controllable spatial manipulation | Hugging Face |
import torch
from diffusers import JoyImageEditPipeline
from diffusers.utils import load_image

pipeline = JoyImageEditPipeline.from_pretrained(
    "jdopensource/JoyAI-Image-Edit-Diffusers", torch_dtype=torch.bfloat16
)
pipeline.to("cuda")

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg")
prompt = "Add wings to the astronaut."
output = pipeline(
    image=image,
    prompt=prompt,
    num_inference_steps=40,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
output.save("joyimage_edit_output.png")

Spatial editing
JoyAI-Image supports three spatial editing prompt patterns: Object Move, Object Rotation, and Camera Control. For best results, follow the prompt templates below as closely as possible. For more information, refer to SpatialEdit.
Object Move
Move a target object into a specified region marked by a red box in the input image.
Move the <object> into the red box and finally remove the red box.
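The Object Move template expects the target region to be drawn onto the input image as a red box. The sketch below shows one way to do that preprocessing with Pillow. The helper names `draw_red_box` and `object_move_prompt`, the box coordinates, and the line width are illustrative assumptions, not values prescribed by the model authors.

```python
from PIL import Image, ImageDraw


def draw_red_box(image, box, width=4):
    """Return a copy of `image` with a red rectangle outline at `box`.

    `box` is (left, top, right, bottom) in pixels. Hypothetical helper.
    """
    marked = image.convert("RGB")  # convert() returns a new image, so the input is untouched
    ImageDraw.Draw(marked).rectangle(box, outline=(255, 0, 0), width=width)
    return marked


def object_move_prompt(obj):
    """Fill the Object Move template with the name of the target object."""
    return f"Move the {obj} into the red box and finally remove the red box."
```

The marked image and the filled template can then be passed as `image` and `prompt` to the pipeline call shown earlier.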
Object Rotation
Rotate an object to a specific canonical view. Supported <view> values: front, right, left, rear, front right, front left, rear right, rear left.
Rotate the <object> to show the <view> side view.
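Because only eight view names are supported, it can help to validate the view before building the prompt. This is a minimal sketch; `SUPPORTED_VIEWS` and `object_rotation_prompt` are hypothetical names introduced here for illustration.

```python
SUPPORTED_VIEWS = (
    "front", "right", "left", "rear",
    "front right", "front left", "rear right", "rear left",
)


def object_rotation_prompt(obj, view):
    """Fill the Object Rotation template, rejecting unsupported views."""
    if view not in SUPPORTED_VIEWS:
        raise ValueError(f"Unsupported view {view!r}; expected one of {SUPPORTED_VIEWS}")
    return f"Rotate the {obj} to show the {view} side view."


print(object_rotation_prompt("chair", "front left"))
# prints: Rotate the chair to show the front left side view.
```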
Camera Control
Change the camera viewpoint while keeping the 3D scene unchanged.
Move the camera.
- Camera rotation: Yaw {y_rotation}°, Pitch {p_rotation}°.
- Camera zoom: in/out/unchanged.
- Keep the 3D scene static; only change the viewpoint.

JoyImageEditPipeline
class diffusers.JoyImageEditPipeline
(
    scheduler: FlowMatchEulerDiscreteScheduler,
    vae: AutoencoderKLWan,
    text_encoder: Qwen3VLForConditionalGeneration,
    tokenizer: Qwen2Tokenizer,
    transformer: JoyImageEditTransformer3DModel,
    processor: Qwen3VLProcessor,
    text_token_max_length: int = 2048
)
Diffusion pipeline for image editing using the JoyImage architecture.
The pipeline encodes text and image conditioning via a Qwen3-VL text encoder, denoises latents with a 3-D transformer, and decodes the result with a WAN VAE.
Model offloading order: text_encoder -> transformer -> vae.
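The offloading order matters when model CPU offloading is enabled. The sketch below assumes that JoyImageEditPipeline exposes the standard `enable_model_cpu_offload()` method of diffusers pipelines (which requires `accelerate`). The wrapper name `edit_with_cpu_offload` is hypothetical.

```python
def edit_with_cpu_offload(pipeline, image, prompt, **call_kwargs):
    """Run one edit with model CPU offloading enabled.

    Assumes `pipeline` is a loaded JoyImageEditPipeline that has not been
    moved to the GPU with `.to("cuda")`; offloading manages device placement.
    """
    # Each component is moved to the accelerator only while it is needed,
    # following the order text_encoder -> transformer -> vae.
    pipeline.enable_model_cpu_offload()
    result = pipeline(image=image, prompt=prompt, **call_kwargs)
    return result.images[0]
```

This trades some speed for lower peak GPU memory, which may be useful given the combined size of the text encoder and transformer.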
__call__
(
    image: PIL.Image.Image | numpy.ndarray | torch.Tensor | list[PIL.Image.Image] | list[numpy.ndarray] | list[torch.Tensor] | None = None,
    prompt: str | list[str] = None,
    height: int | None = None,
    width: int | None = None,
    num_inference_steps: int = 40,
    timesteps: typing.List[int] = None,
    sigmas: typing.List[float] = None,
    guidance_scale: float = 4.0,
    negative_prompt: typing.Union[str, typing.List[str], NoneType] = None,
    num_images_per_prompt: typing.Optional[int] = 1,
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None,
    latents: typing.Optional[torch.Tensor] = None,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    prompt_embeds_mask: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds_mask: typing.Optional[torch.Tensor] = None,
    output_type: typing.Optional[str] = 'pil',
    return_dict: bool = True,
    callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None,
    callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'],
    max_sequence_length: int = 4096,
    enable_denormalization: bool = True
) → JoyImageEditPipelineOutput or torch.Tensor
Parameters
- prompt (str or List[str]) — The prompt or prompts to guide generation.
- height (int) — Height of the generated output in pixels.
- width (int) — Width of the generated output in pixels.
- image (PipelineImageInput, optional) — Reference image used for conditioning. When provided, the pipeline operates in image-editing mode with num_items=2.
- num_inference_steps (int, optional, defaults to 40) — Number of denoising steps. More steps generally improve quality at the cost of slower inference.
- timesteps (List[int], optional) — Custom timesteps for the denoising process. When provided, num_inference_steps is inferred from the list length.
- sigmas (List[float], optional) — Custom sigmas for the denoising process. Mutually exclusive with timesteps.
- guidance_scale (float, optional, defaults to 4.0) — Classifier-free guidance scale.
- negative_prompt (str or List[str], optional) — Negative prompt(s) used to suppress undesired content.
- num_images_per_prompt (int, optional, defaults to 1) — Number of generated samples per prompt.
- generator (torch.Generator or List[torch.Generator], optional) — RNG generator(s) for deterministic sampling.
- latents (torch.Tensor, optional) — Pre-generated noisy latents for the target slot. Sampled from a Gaussian distribution when not provided. Can be used to seed generation from a specific starting noise tensor.
- prompt_embeds (torch.Tensor, optional) — Pre-computed prompt embeddings. When provided, prompt can be omitted.
- prompt_embeds_mask (torch.Tensor, optional) — Attention mask for prompt_embeds.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-computed negative prompt embeddings.
- negative_prompt_embeds_mask (torch.Tensor, optional) — Attention mask for negative_prompt_embeds.
- output_type (str, optional, defaults to "pil") — Output format. Pass "latent" to return raw latents.
- return_dict (bool, optional, defaults to True) — Whether to return a JoyImageEditPipelineOutput or a plain tensor.
- callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — Callback invoked at the end of each denoising step with signature (self, step: int, timestep: int, callback_kwargs: Dict).
- callback_on_step_end_tensor_inputs (List[str], optional, defaults to ["latents"]) — Tensor keys included in callback_kwargs for callback_on_step_end.
- max_sequence_length (int, optional, defaults to 4096) — Maximum sequence length for prompt encoding.
- enable_denormalization (bool, optional, defaults to True) — Denormalise latents before VAE decoding.
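To illustrate callback_on_step_end, here is a minimal sketch of a callback that records which step ran and which tensor keys were exposed. It follows the signature listed above and returns callback_kwargs unchanged; returning the dict is the convention in other diffusers pipelines and is assumed, not confirmed, to apply here. The factory name `make_step_logger` is hypothetical.

```python
def make_step_logger(log):
    """Build a callback that appends (step, timestep, tensor keys) to `log`."""

    def on_step_end(pipeline, step, timestep, callback_kwargs):
        # callback_kwargs holds the tensors named in callback_on_step_end_tensor_inputs.
        log.append((step, timestep, sorted(callback_kwargs)))
        # Hand the dict back unmodified (assumed convention, see lead-in).
        return callback_kwargs

    return on_step_end
```

A call would then look like `pipeline(..., callback_on_step_end=make_step_logger(history), callback_on_step_end_tensor_inputs=["latents"])`, where `history` is an ordinary list.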
Returns
JoyImageEditPipelineOutput or torch.Tensor
If return_dict is True, returns a pipeline output object containing the generated image(s). Otherwise returns the image tensor directly.
Generate an edited image conditioned on a reference image and a text prompt.
Examples:
>>> import torch
>>> from diffusers import JoyImageEditPipeline
>>> from diffusers.utils import load_image
>>> model_id = "jdopensource/JoyAI-Image-Edit-Diffusers"
>>> pipe = JoyImageEditPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/astronaut.jpg")
>>> output = pipe(
...     image=image,  # pass an image for editing; omit for text-to-image generation
...     prompt="Add wings to the astronaut.",
...     num_inference_steps=40,
...     guidance_scale=4.0,
...     generator=torch.manual_seed(0),
... )
>>> output.images[0].save("joyimage_edit.png")

check_inputs
(
    prompt,
    height,
    width,
    negative_prompt = None,
    prompt_embeds = None,
    negative_prompt_embeds = None,
    prompt_embeds_mask = None,
    negative_prompt_embeds_mask = None,
    callback_on_step_end_tensor_inputs = None
)
Raises
ValueError — On any invalid combination of arguments.
Validate pipeline inputs before the forward pass.
Invert normalize_latents to recover the original latent scale.
encode_prompt
(
    prompt: typing.Union[str, typing.List[str]],
    device: typing.Optional[torch.device] = None,
    num_images_per_prompt: int = 1,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    prompt_embeds_mask: typing.Optional[torch.Tensor] = None,
    max_sequence_length: int = 1024,
    template_type: str = 'image'
)
Parameters
- prompt — Prompt string or list of prompt strings.
- device — Target device.
- num_images_per_prompt — Number of outputs to generate per prompt.
- prompt_embeds — Pre-computed prompt embeddings.
- prompt_embeds_mask — Attention mask for pre-computed embeddings.
- max_sequence_length — Maximum output sequence length.
- template_type — Prompt template key ("image" or "multiple_images").
Encode a text prompt into embeddings (text-only path).
Pre-computed prompt_embeds bypass encoding entirely.
encode_prompt_multiple_images
(
    prompt: typing.Union[str, typing.List[str]],
    device: typing.Optional[torch.device] = None,
    num_images_per_prompt: int = 1,
    images: typing.Optional[torch.Tensor] = None,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    prompt_embeds_mask: typing.Optional[torch.Tensor] = None,
    template_type: typing.Optional[str] = 'multiple_images',
    max_sequence_length: typing.Optional[int] = None
)
Parameters
- prompt — Prompt string(s), optionally containing <image>\n tokens.
- device — Target device.
- num_images_per_prompt — Number of outputs to generate per prompt.
- images — Pixel tensors corresponding to the inline image tokens.
- prompt_embeds — Pre-computed prompt embeddings.
- prompt_embeds_mask — Attention mask for pre-computed embeddings.
- template_type — Must be "multiple_images".
- max_sequence_length — If set, truncate the output to this length (keeping the last max_sequence_length tokens).
Encode prompts that contain inline image tokens via the Qwen processor.
<image>\n placeholders in each prompt string are replaced by the Qwen vision special tokens before being
fed to the multimodal encoder.
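Note that max_sequence_length keeps the last tokens, so truncation drops the beginning of the sequence, not the end. The sketch below illustrates that behaviour with plain Python lists standing in for the sequence dimension of the embeddings. `keep_last_tokens` is a hypothetical helper, not part of the pipeline.

```python
def keep_last_tokens(token_rows, max_sequence_length=None):
    """Truncate each row to its last `max_sequence_length` items.

    Rows shorter than the limit are returned unchanged; `None` disables truncation.
    """
    if max_sequence_length is None:
        return [list(row) for row in token_rows]
    truncated = []
    for row in token_rows:
        row = list(row)
        start = max(len(row) - max_sequence_length, 0)
        truncated.append(row[start:])
    return truncated


print(keep_last_tokens([[1, 2, 3, 4, 5]], max_sequence_length=3))
# prints: [[3, 4, 5]]
```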
normalize_latents
( latent: Tensor )
Normalise latents using per-channel statistics from the VAE config.
Uses (latent - mean) / std when the VAE exposes latents_mean and latents_std; otherwise falls back to
scaling by scaling_factor.
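The two branches described above can be sketched with NumPy as follows. This is an illustration, not the pipeline's implementation: the (batch, channels, frames, height, width) layout, the function names `normalize_sketch` and `denormalize_sketch`, and the choice to multiply (not divide) by scaling_factor in the fallback are all assumptions.

```python
import numpy as np


def normalize_sketch(latent, mean=None, std=None, scaling_factor=1.0):
    """Normalise a (B, C, T, H, W) latent array per channel."""
    if mean is not None and std is not None:
        # Reshape per-channel statistics so they broadcast over B, T, H, W.
        mean = np.asarray(mean, dtype=latent.dtype).reshape(1, -1, 1, 1, 1)
        std = np.asarray(std, dtype=latent.dtype).reshape(1, -1, 1, 1, 1)
        return (latent - mean) / std
    # Fallback branch; multiplication is an assumption (see lead-in).
    return latent * scaling_factor


def denormalize_sketch(latent, mean=None, std=None, scaling_factor=1.0):
    """Invert normalize_sketch to recover the original latent scale."""
    if mean is not None and std is not None:
        mean = np.asarray(mean, dtype=latent.dtype).reshape(1, -1, 1, 1, 1)
        std = np.asarray(std, dtype=latent.dtype).reshape(1, -1, 1, 1, 1)
        return latent * std + mean
    return latent / scaling_factor
```

Applying `denormalize_sketch` to the output of `normalize_sketch` with the same statistics returns the input up to floating-point error.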
prepare_latents
(
    batch_size: int,
    num_channels_latents: int,
    height: int,
    width: int,
    video_length: int,
    dtype: dtype,
    device: device,
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType],
    latents: typing.Optional[torch.Tensor] = None,
    image: typing.Optional[typing.List[PIL.Image.Image]] = None,
    enable_denormalization: bool = True
)
Parameters
- batch_size — Number of samples in the batch.
- num_channels_latents — Latent channel dimension from the transformer config.
- height — Spatial height in pixels.
- width — Spatial width in pixels.
- video_length — Number of frames (1 for image inference).
- dtype — Floating-point dtype for the latent tensor.
- device — Target device.
- generator — RNG generator(s) for reproducible sampling.
- latents — Optional user-provided initial noise for the target slot. When None, random noise is sampled.
- image — Optional list of PIL reference images to VAE-encode as conditioning slots.
- enable_denormalization — Whether to normalise encoded reference latents.
Raises
ValueError — If generator is a list whose length differs from batch_size.
Prepare the initial noisy latent tensor for the denoising loop.
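The generator check described under Raises can be sketched as below. This is a hypothetical stand-alone helper (`check_generator_count`) that mirrors the documented condition; the pipeline's actual error message may differ.

```python
def check_generator_count(generator, batch_size):
    """Raise ValueError if `generator` is a list whose length differs from `batch_size`."""
    if isinstance(generator, list) and len(generator) != batch_size:
        raise ValueError(
            f"Got {len(generator)} generators for a batch of size {batch_size}; "
            "pass one generator per sample or a single shared generator."
        )
```

In practice this means a list of generators should contain exactly one entry per sample in the batch, while a single generator is accepted for any batch size.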
JoyImageEditPipelineOutput
class diffusers.JoyImageEditPipelineOutput
( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )
Output class for JoyImageEdit generation pipelines.