Instruction-based image editing enables robust image modification via natural language prompts, yet current methods face a precision-efficiency tradeoff: fine-tuning methods demand significant computational resources and large datasets, while training-free techniques struggle with instruction comprehension and edit quality. We resolve this dilemma by leveraging the enhanced generation capacity and native contextual awareness of large-scale Diffusion Transformers (DiTs). Our solution introduces three contributions: (1) an in-context editing framework that achieves zero-shot instruction compliance through in-context prompting, avoiding structural changes to the model; (2) a LoRA-MoE hybrid tuning strategy that enhances flexibility through efficient adaptation and dynamic expert routing, without extensive retraining; and (3) an early-filter inference-time scaling method that uses vision-language models (VLMs) to select better initial noise early in the denoising process, improving edit quality. Extensive evaluations demonstrate our method's superiority: it outperforms state-of-the-art approaches while requiring only 0.1% of the training data and 1% of the trainable parameters of baseline methods. This work establishes a new paradigm for high-precision yet efficient instruction-guided editing.
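As a concrete (and deliberately simplified) illustration of contributions (1) and (3), the sketch below performs diptych-style in-context editing with a FLUX inpainting pipeline from diffusers and then keeps the best of several seeds according to a VLM scorer. The prompt template, panel resolution, guidance settings, and the `score_with_vlm` helper are assumptions for illustration, not the exact recipe from the paper (which filters candidate noise early in the denoising process rather than scoring full generations).

```python
# Simplified sketch, not the released implementation: diptych in-context
# editing with FLUX.1 Fill, plus naive seed selection via a VLM scorer.
import torch
from PIL import Image
from diffusers import FluxFillPipeline

pipe = FluxFillPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("cuda")


def in_context_edit(source: Image.Image, instruction: str, seed: int = 0) -> Image.Image:
    """Edit `source` per `instruction` by inpainting the right panel of a diptych."""
    w, h = 512, 512                                    # panel size (assumed)
    source = source.convert("RGB").resize((w, h))

    # Diptych canvas: source image on the left, blank right half to be generated.
    canvas = Image.new("RGB", (2 * w, h))
    canvas.paste(source, (0, 0))

    # Mask only the right panel so the model reproduces the left panel in context.
    mask = Image.new("L", (2 * w, h), 0)
    mask.paste(Image.new("L", (w, h), 255), (w, 0))

    # In-context prompt (paraphrased template, not the paper's exact wording).
    prompt = (
        "A diptych with two side-by-side images of the same scene. "
        f"The right image is exactly the same as the left, but {instruction}."
    )

    result = pipe(
        prompt=prompt,
        image=canvas,
        mask_image=mask,
        height=h,
        width=2 * w,
        guidance_scale=30.0,
        num_inference_steps=50,
        generator=torch.Generator("cpu").manual_seed(seed),
    ).images[0]
    return result.crop((w, 0, 2 * w, h))               # keep only the edited panel


def edit_with_seed_selection(source, instruction, seeds=(0, 1, 2, 3), score_with_vlm=None):
    """Naive inference-time scaling: generate several candidates and keep the one
    a VLM rates highest. `score_with_vlm(source, candidate, instruction)` is a
    hypothetical callable (e.g., a GPT-4o or Qwen-VL grading prompt)."""
    candidates = [in_context_edit(source, instruction, seed=s) for s in seeds]
    if score_with_vlm is None:
        return candidates[0]
    scores = [score_with_vlm(source, c, instruction) for c in candidates]
    return candidates[scores.index(max(scores))]
```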
We first implement a training-free in-context editing paradigm based on DiTs such as FLUX, in which the model generates the edited output (the right panel of a diptych) by processing an "in-context prompt" alongside the source image (the left panel). Although persistent failure cases remain, the resulting advantages establish a strong baseline that enables efficient fine-tuning for further precision. On top of this baseline, we add parameter-efficient LoRA adapters with mixture-of-experts (MoE) routing inside the DiT, which dynamically activate task-specific experts during editing. Trained on a small amount of publicly available data (50K samples), this improves editing success rates across diverse scenarios without architectural modifications or large-scale retraining. We also design an inference-time scaling strategy to further improve editing quality; a LoRA-MoE sketch follows below. For more details, please refer to the paper.
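To make the LoRA-MoE idea concrete, here is a minimal PyTorch sketch of a linear layer augmented with a mixture of LoRA experts and a token-level router. The expert count, rank, top-1 routing, and initialization are illustrative assumptions; the exact placement inside the DiT blocks and the routing design used in the paper may differ.

```python
# Minimal LoRA-MoE sketch under assumed hyperparameters; written for clarity,
# not efficiency (per-token expert gathering is memory-heavy).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAMoELinear(nn.Module):
    def __init__(self, base: nn.Linear, num_experts: int = 4, rank: int = 16, top_k: int = 1):
        super().__init__()
        self.base = base                     # frozen pretrained projection
        self.base.requires_grad_(False)
        self.top_k = top_k
        # Each expert is an independent low-rank (A, B) pair; B starts at zero
        # so the adapted layer initially matches the pretrained one.
        self.lora_A = nn.Parameter(torch.randn(num_experts, base.in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, rank, base.out_features))
        # Lightweight router that picks experts per token.
        self.router = nn.Linear(base.in_features, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, in_features)
        out = self.base(x)
        gate = F.softmax(self.router(x), dim=-1)          # (B, T, E)
        weights, idx = gate.topk(self.top_k, dim=-1)      # sparse routing
        weights = weights / weights.sum(dim=-1, keepdim=True)
        for k in range(self.top_k):
            A = self.lora_A[idx[..., k]]                  # (B, T, in, r)
            B = self.lora_B[idx[..., k]]                  # (B, T, r, out)
            delta = torch.einsum("bti,btir,btro->bto", x, A, B)
            out = out + weights[..., k : k + 1] * delta
        return out
```

In practice one would wrap selected attention or MLP projections of the DiT blocks with such a layer and train only the LoRA parameters and the router, leaving the base weights frozen.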
Compared with commercial models such as Gemini and GPT-4o, our method is comparable to, and in some cases superior to, them in character identity preservation and instruction following. Unlike these closed-source systems, our model is fully open-source, with lower cost, faster inference (about 9 seconds per image), and strong performance.
We open-source this project for academic research. The vast majority of images
used in this project are either generated or licensed. If you have any concerns,
please contact us, and we will promptly remove any inappropriate content.
Any models derived from the FLUX base model must adhere to its original licensing terms.
This research aims to advance the field of generative AI. Users are free to
create images using this tool, provided they comply with local laws and exercise
responsible usage. The developers are not liable for any misuse of the tool by users.
@article{zhang2025ICEdit,
  title={In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer},
  author={Zhang, Zechuan and Xie, Ji and Lu, Yu and Yang, Zongxin and Yang, Yi},
  journal={arXiv},
  year={2025}
}
Thanks to
for the page template.