Instruction-based image editing enables robust image modification via natural language prompts, yet current methods face a precision-efficiency tradeoff: fine-tuning methods demand significant computational resources and large datasets, while training-free techniques struggle with instruction comprehension and edit quality. We resolve this dilemma by leveraging the enhanced generation capacity and native contextual awareness of large-scale Diffusion Transformers (DiTs). Our solution introduces three contributions: (1) an in-context editing framework that achieves zero-shot instruction compliance through in-context prompting, avoiding structural changes to the model; (2) a LoRA-MoE hybrid tuning strategy that enhances flexibility through efficient adaptation and dynamic expert routing, without extensive retraining; and (3) an early-filter inference-time scaling method that uses vision-language models (VLMs) to select better initial noise early in the denoising process, improving edit quality. Extensive evaluations demonstrate our method's superiority: it outperforms state-of-the-art approaches while requiring only 0.1% of the training data and 1% of the trainable parameters of baseline methods. This work establishes a new paradigm for high-precision yet efficient instruction-guided editing.
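As a concrete (and deliberately simplified) illustration of contributions (1) and (3), the sketch below performs diptych-style in-context editing with a FLUX inpainting pipeline from diffusers and then keeps the best of several seeds according to a VLM scorer. The prompt template, panel resolution, guidance settings, and the `score_with_vlm` helper are assumptions for illustration, not the exact recipe from the paper (which filters candidate noise early in the denoising process rather than scoring full generations).

```python
# Simplified sketch, not the released implementation: diptych in-context
# editing with FLUX.1 Fill, plus naive seed selection via a VLM scorer.
import torch
from PIL import Image
from diffusers import FluxFillPipeline

pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("cuda")


def in_context_edit(source: Image.Image, instruction: str, seed: int = 0) -> Image.Image:
    """Edit `source` per `instruction` by inpainting the right panel of a diptych."""
    w, h = 512, 512                                    # panel size (assumed)
    source = source.convert("RGB").resize((w, h))

    # Diptych canvas: source image on the left, blank right half to be generated.
    canvas = Image.new("RGB", (2 * w, h))
    canvas.paste(source, (0, 0))

    # Mask only the right panel so the model reproduces the left panel in context.
    mask = Image.new("L", (2 * w, h), 0)
    mask.paste(Image.new("L", (w, h), 255), (w, 0))

    # In-context prompt (paraphrased template, not the paper's exact wording).
    prompt = (
        "A diptych with two side-by-side images of the same scene. "
        f"The right image is exactly the same as the left, but {instruction}."
    )

    result = pipe(
        prompt=prompt,
        image=canvas,
        mask_image=mask,
        height=h,
        width=2 * w,
        guidance_scale=30.0,
        num_inference_steps=50,
        generator=torch.Generator("cpu").manual_seed(seed),
    ).images[0]
    return result.crop((w, 0, 2 * w, h))               # keep only the edited panel


def edit_with_seed_selection(source, instruction, seeds=(0, 1, 2, 3), score_with_vlm=None):
    """Naive inference-time scaling: generate several candidates and keep the one
    a VLM rates highest. `score_with_vlm(source, candidate, instruction)` is a
    hypothetical callable (e.g., a GPT-4o or Qwen-VL grading prompt)."""
    candidates = [in_context_edit(source, instruction, seed=s) for s in seeds]
    if score_with_vlm is None:
        return candidates[0]
    scores = [score_with_vlm(source, c, instruction) for c in candidates]
    return candidates[scores.index(max(scores))]
```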
We first implement a training-free in-context editing paradigm based on DiTs such as FLUX, in which the model generates the edited output (the right panel of a diptych) by processing an "in-context prompt" alongside the source image (the left panel). Although persistent failure cases remain, the resulting advantages establish a strong baseline that enables efficient fine-tuning for further precision. On top of this baseline, we add parameter-efficient LoRA adapters with mixture-of-experts (MoE) routing inside the DiT, which dynamically activate task-specific experts during editing. Trained on a small amount of publicly available data (50K samples), this improves editing success rates across diverse scenarios without architectural modifications or large-scale retraining. We also design an inference-time scaling strategy to further improve editing quality; a LoRA-MoE sketch follows below. For more details, please refer to the paper.
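To make the LoRA-MoE idea concrete, here is a minimal PyTorch sketch of a linear layer augmented with a mixture of LoRA experts and a token-level router. The expert count, rank, top-1 routing, and initialization are illustrative assumptions; the exact placement inside the DiT blocks and the routing design used in the paper may differ.

```python
# Minimal LoRA-MoE sketch under assumed hyperparameters; written for clarity,
# not efficiency (per-token expert gathering is memory-heavy).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAMoELinear(nn.Module):
    def __init__(self, base: nn.Linear, num_experts: int = 4, rank: int = 16, top_k: int = 1):
        super().__init__()
        self.base = base                     # frozen pretrained projection
        self.base.requires_grad_(False)
        self.top_k = top_k
        # Each expert is an independent low-rank (A, B) pair; B starts at zero
        # so the adapted layer initially matches the pretrained one.
        self.lora_A = nn.Parameter(torch.randn(num_experts, base.in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, rank, base.out_features))
        # Lightweight router that picks experts per token.
        self.router = nn.Linear(base.in_features, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, in_features)
        out = self.base(x)
        gate = F.softmax(self.router(x), dim=-1)          # (B, T, E)
        weights, idx = gate.topk(self.top_k, dim=-1)      # sparse routing
        weights = weights / weights.sum(dim=-1, keepdim=True)
        for k in range(self.top_k):
            A = self.lora_A[idx[..., k]]                  # (B, T, in, r)
            B = self.lora_B[idx[..., k]]                  # (B, T, r, out)
            delta = torch.einsum("bti,btir,btro->bto", x, A, B)
            out = out + weights[..., k : k + 1] * delta
        return out
```

In practice one would wrap selected attention or MLP projections of the DiT blocks with such a layer and train only the LoRA parameters and the router, leaving the base weights frozen.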
Compared with commercial models such as Gemini and GPT-4o, our method is comparable to, and in some cases superior to, them in character identity preservation and instruction following. Unlike these closed-source systems, our model is fully open-source, with lower cost, faster inference (about 9 seconds per image), and strong performance.
We open-source this project for academic research. The vast majority of images
used in this project are either generated or licensed. If you have any concerns,
please contact us, and we will promptly remove any inappropriate content.
Any models derived from the FLUX base model must adhere to its original licensing terms.
This research aims to advance the field of generative AI. Users are free to
create images using this tool, provided they comply with local laws and exercise
responsible usage. The developers are not liable for any misuse of the tool by users.
@article{zhang2025ICEdit,
  title={In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer},
  author={Zhang, Zechuan and Xie, Ji and Lu, Yu and Yang, Zongxin and Yang, Yi},
  journal={arXiv},
  year={2025}
}
Thanks to
for the page template.