CVPR 2026

See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

Jaehyun Park1,*, Minyoung Ahn2,3,*, Minkyu Kim3, Jonghyun Lee3, Jae-Gil Lee1, Dongmin Park3,†
1KAIST,  2Seoul National University,  3KRAFTON

* Equal contribution  |  † Corresponding author

Overview of ArtiAgent: challenges and approach

Overview of ArtiAgent. (a) Visual artifacts in state-of-the-art diffusion models; current VLMs (GPT-5, Gemini-2.5-Pro) fail to detect them. (b) Our agentic framework ArtiAgent: three coordinated agents synthesize artifact data at scale. (c) VLM-based artifact comprehension: detection, explanation, and localization. (d) Reward-guided artifact-free image generation. (e) Artifact correction via VLM-guided inpainting.

Abstract

Despite recent advances in diffusion models, AI-generated images still often contain visual artifacts that compromise realism. Although more thorough pre-training and larger models may reduce artifacts, there is no guarantee they can be eliminated entirely, making artifact mitigation a crucial area of study.

Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale. We propose ArtiAgent, an agentic framework that efficiently creates pairs of real and artifact-injected images without human intervention. It comprises three agents: a perception agent that recognizes and grounds entities and subentities from real images; a synthesis agent that introduces artifacts via novel patch-wise embedding manipulation within a diffusion transformer; and a curation agent that filters synthesized artifacts and generates local and global explanations for each instance.

Using ArtiAgent, we synthesize 100K images with rich artifact annotations and demonstrate both efficacy and versatility across diverse applications, including fine-tuning open-source VLMs that consistently outperform proprietary systems (GPT-5, Gemini-2.5-Pro) on artifact detection, localization, and explanation.

Method

ArtiAgent method overview

ArtiAgent pipeline. (a) The perception agent detects entities and subentities using Grounded-SAM. (b) The synthesis agent injects artifacts via patch mapping tools and the inversion-injection paradigm. (c) The curation agent filters low-quality results and generates textual explanations.

Three Specialized Agents

  Perception Agent

Decomposes an input image into a hierarchical vocabulary of entities (e.g., dog) and subentities (e.g., nose, leg) using out-of-the-box VLMs, then grounds them with Grounded-SAM for precise spatial localization.
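The hierarchical vocabulary can be sketched as a simple data structure: entities hold their grounded subentities, each with a spatial localization. The schema and field names below are illustrative stand-ins for the VLM naming step and the Grounded-SAM grounding step, not the paper's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class Subentity:
    name: str      # e.g. "nose", "leg"
    box: tuple     # (x0, y0, x1, y1) from the grounding model

@dataclass
class Entity:
    name: str      # e.g. "dog"
    box: tuple
    subentities: list = field(default_factory=list)

def build_vocabulary(detections):
    """Group grounded detections into an entity -> subentity hierarchy.

    `detections` is a list of (entity, subentity_or_None, box) triples,
    standing in for the VLM naming + Grounded-SAM grounding outputs."""
    entities = {}
    for ent_name, sub_name, box in detections:
        ent = entities.setdefault(ent_name, Entity(ent_name, box))
        if sub_name is None:
            ent.box = box                      # entity-level box
        else:
            ent.subentities.append(Subentity(sub_name, box))
    return entities

vocab = build_vocabulary([
    ("dog", None, (10, 10, 200, 180)),
    ("dog", "nose", (90, 60, 110, 75)),
    ("dog", "leg", (40, 120, 60, 180)),
])
```

Downstream agents can then iterate over `vocab["dog"].subentities` to pick injection targets.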

  Synthesis Agent

Uses the perception agent's grounding to inject artifacts via a toolbox of four tools (add, remove, distort, fuse) and a novel inversion-injection module that manipulates patch attention in diffusion transformers to generate realistic structural artifacts.

  Curation Agent

Performs data filtering (LPIPS-based for distortions, VLM-based for structural artifacts) and generates local and global textual explanations for each artifact, producing training-ready supervision.
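The distortion filter can be sketched as a two-sided check: the edit must change the target region enough (the artifact is actually present) while leaving the background essentially untouched. The paper uses LPIPS for this; a per-pixel squared distance stands in for the perceptual metric below, and the thresholds are illustrative, not the paper's values.

```python
import numpy as np

def passes_filter(original, edited, region_mask,
                  min_region_dist=0.01, max_background_dist=0.001):
    """Keep a sample only if the edited region changed noticeably while
    the background stayed consistent with the original image."""
    diff = (original - edited) ** 2          # stand-in for LPIPS distance
    region = diff[region_mask].mean()        # change inside the edit mask
    background = diff[~region_mask].mean()   # leakage outside the mask
    return region >= min_region_dist and background <= max_background_dist
```

A real implementation would compute LPIPS on crops of the masked region rather than raw pixel differences.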

Artifact Injection Toolbox

Artifact injection toolbox visualization

Visualization of each target-reference patch mapping and its resulting artifact.

The toolbox provides four patch-mapping tools:

  • Add — Duplicates a peripheral subentity (e.g., adds an extra finger) by mapping original subentity patches to a nearby candidate region.
  • Remove — Omits a peripheral subentity by replacing its patches with surrounding background context.
  • Distort — Applies transformation kernels (jitter, strip, random permutation) to intermediate subentity patches.
  • Fuse — Blends two entity instances by cross-referencing their overlapping patch regions.
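At their core, all four tools reduce to a target-to-reference patch mapping on the transformer's patch grid. As a minimal sketch of the "add" tool, the mapping below duplicates a subentity by pointing each shifted target patch back at its original reference patch; the grid size and offsets are illustrative, not the paper's configuration.

```python
# Patch indices live on a flattened H_p x W_p grid, so a 2D offset
# (dy, dx) becomes an index shift of dy * grid_w + dx.

def add_mapping(subentity_patches, offset, grid_w):
    """Map each shifted target patch to its reference subentity patch,
    so denoising copies the subentity's content to a nearby region."""
    dy, dx = offset
    shift = dy * grid_w + dx
    return {p + shift: p for p in subentity_patches}

# Duplicate a 2x2 subentity block two patches to the right on an 8-wide grid.
mapping = add_mapping([10, 11, 18, 19], offset=(0, 2), grid_w=8)
```

"Remove" would instead map the subentity's patches to surrounding background patches, and "fuse" would cross-reference patches between two entity instances.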

Inversion-Injection Module

Inversion-injection module

Inversion-injection module: the right arm from the reference patches is added to the target patches below it.

We extend the inversion-restoration paradigm from image editing into an inversion-injection module that manipulates self-attention in diffusion transformers (FLUX.1-dev via FireFlow).

Two complementary mechanisms operate on the target patch region:

  • PE Injection — Replaces the rotary positional embeddings of target patches with those of reference patches, controlling where the model believes denoising occurs. To our knowledge, this mechanism has not been used in prior image-editing methods.
  • Value Injection — Reuses cached value embeddings from the inversion stage to provide realistic semantic content for the target region.

Their combination produces localized, realistic structural artifacts while keeping the background fully consistent with the original image.
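The two injections can be illustrated in a toy single-head attention layer: target patches borrow the reference patches' rotary positions (PE injection) and their cached values (value injection). Everything here, the 1D rotary embedding, the shapes, the patch indices, and the random "cache", is a simplified stand-in for the FLUX.1-dev/FireFlow setup, not the actual implementation.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate feature pairs by position-dependent angles (1D RoPE)."""
    d = x.shape[-1] // 2
    freqs = pos[:, None] / base ** (np.arange(d) / d)
    cos, sin = np.cos(freqs), np.sin(freqs)
    x1, x2 = x[..., :d], x[..., d:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], -1)

def attend(q, k, v, pos):
    """Single-head self-attention with rotary positions applied to q and k."""
    qr, kr = rope(q, pos), rope(k, pos)
    a = qr @ kr.T / np.sqrt(q.shape[-1])
    a = np.exp(a - a.max(-1, keepdims=True))
    a /= a.sum(-1, keepdims=True)
    return a @ v

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = rng.normal(size=(3, n, d))
pos = np.arange(n, dtype=float)

target = np.array([4, 5])       # patches being rewritten
reference = np.array([10, 11])  # patches supplying the artifact content

# PE injection: target patches take the reference patches' positions, so
# the model "believes" denoising is happening at the reference location.
pos_inj = pos.copy()
pos_inj[target] = pos[reference]

# Value injection: target-row values are replaced with values cached at
# the reference patches during inversion (a random stand-in cache here).
v_cached = rng.normal(size=(n, d))
v_inj = v.copy()
v_inj[target] = v_cached[reference]

out = attend(q, k, v_inj, pos_inj)
```

Outside the target region, positions and values are untouched, which is what keeps the background consistent with the original image.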

ArtiBench Dataset

We introduce ArtiBench, a benchmark of 1K images generated by modern diffusion models, including SD3.5, FLUX, Qwen-Image, and Nano-Banana, annotated by 12 human labelers with binary labels, bounding boxes, and textual descriptions per artifact region.

Unlike prior benchmarks, ArtiBench covers all three tasks (detection, localization, explanation) and focuses on recent models whose structural artifacts are subtle and diverse, making it significantly harder and more representative.

  Benchmark     Samples   Det.   Loc.   Exp.
  RichHF        955
  LOKI          229
  SynthScars    1K
  ArtiBench     1K        ✓      ✓      ✓

Experiments

Main Results

Table 3: Artifact understanding performance

Artifact understanding performance across detection, localization, and explanation.

Open-source VLMs fine-tuned with ArtiAgent-generated data consistently outperform GPT-5 and Gemini-2.5-Pro on artifact detection, localization, and explanation across ArtiBench and three existing benchmarks (RichHF, LOKI, SynthScars).

Data Scaling Effect

Data scaling effect

Scaling effect with Qwen2.5-VL-7B; metrics averaged across all benchmarks per task.

Performance grows consistently as training data from ArtiAgent scales up:

  • With only 1K samples, localization and explanation already surpass GPT-5, demonstrating ArtiAgent's sample efficiency.
  • Binary detection continues to improve up to 100K scale, benefiting from greater artifact diversity.
  • Results highlight ArtiAgent's strong scaling potential and rich supervision quality.

Downstream Application 1: Reward-Guided Artifact-Free Generation

ArtiAgent's paired data design (clean vs. artifact-injected) provides ideal supervision for a CLIP-based reward model. Combined with test-time scaling on FLUX-schnell:

  • The artifact-preference reward improves steadily across search rounds.
  • Generated images show progressively clearer structures and fewer artifacts.
  • Demonstrates that artifact understanding can directly improve image generation.
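The test-time scaling loop above can be sketched as a best-of-n search: in each round, sample several candidates (standing in for FLUX-schnell generations), score them with the artifact-preference reward, and carry the best forward. `generate` and `reward` are illustrative placeholders, not the paper's models.

```python
import random

def best_of_n_search(generate, reward, rounds=3, n=4, seed=None):
    """Reward-guided search: keep the highest-reward candidate seen so far.

    `generate(rng)` stands in for sampling one image from the diffusion
    model; `reward(x)` stands in for the CLIP-based artifact-preference
    reward trained on ArtiAgent's clean/artifact-injected pairs."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    history = []                       # best score after each round
    for _ in range(rounds):
        candidates = [generate(rng) for _ in range(n)]
        top = max(candidates, key=reward)
        if reward(top) > best_score:
            best, best_score = top, reward(top)
        history.append(best_score)
    return best, history

# Toy usage: candidates are random scalars and the reward is identity.
best, history = best_of_n_search(lambda rng: rng.random(),
                                 lambda x: x, rounds=3, n=4, seed=0)
```

By construction the per-round best score is non-decreasing, matching the steadily improving reward curve reported above.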
Reward-guided artifact-free generation

Reward-guided generation: the artifact-free preference score increases across rounds.

Downstream Application 2: VLM-Guided Artifact Correction

VLM-guided artifact correction

The ArtiAgent-trained VLM guides FLUX inpainting to fix artifacts.

An artifact-aware VLM (Qwen2.5-VL-7B fine-tuned on ArtiAgent data) powers an iterative inpainting pipeline:

  1. VLM detects and localizes the artifact region in the generated image.
  2. FLUX inpainting synthesizes a corrected version of the artifact area.
  3. VLM re-evaluates the corrected image and repeats until artifact-free.

The pipeline reliably locates and corrects structural artifacts with natural, structurally consistent content.
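The three-step loop can be sketched as follows; `detect` and `inpaint` are placeholders for the fine-tuned VLM and the FLUX inpainting model, and the iteration cap is an assumption, not the paper's setting.

```python
def correct_artifacts(image, detect, inpaint, max_iters=5):
    """Iteratively fix artifacts until the VLM reports none (or budget ends).

    detect(image)  -> artifact region, or None if the image is artifact-free
    inpaint(image, region) -> image with the region re-synthesized"""
    for _ in range(max_iters):
        region = detect(image)           # 1) VLM detects + localizes
        if region is None:               #    no artifact found: done
            return image
        image = inpaint(image, region)   # 2) FLUX inpaints the region
        # 3) loop back: the VLM re-evaluates the corrected image
    return image
```

The cap bounds compute when an artifact resists correction; in practice one would also keep the best-scoring intermediate rather than the last.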

BibTeX

@inproceedings{park2026artiagent,
  title     = {See and Fix the Flaws: Enabling VLMs and Diffusion Models
               to Comprehend Visual Artifacts via Agentic Data Synthesis},
  author    = {Park, Jaehyun and Ahn, Minyoung and Kim, Minkyu and
               Lee, Jonghyun and Lee, Jae-Gil and Park, Dongmin},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer
               Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}