Recent advances in 3D scene reconstruction enable real-time viewing in virtual and augmented reality. To support interactive operations such as moving or editing objects for greater immersion, 3D scene inpainting methods have been proposed to repair or complete the altered geometry. However, current approaches rely on lengthy and computationally intensive optimization, making them impractical for real-time or online applications.
We propose InstaInpaint, a reference-based feed-forward framework that produces a 3D scene inpainting from a 2D inpainting proposal within 0.4 seconds. We develop a self-supervised masked-finetuning strategy that enables training our customized large reconstruction model (LRM) on large-scale datasets. Through extensive experiments, we analyze and identify several key designs that improve generalization, textural consistency, and geometric correctness. InstaInpaint achieves a 1000× speed-up over prior methods while maintaining state-of-the-art performance across two standard benchmarks. Moreover, InstaInpaint generalizes well to flexible downstream applications such as object insertion and multi-region inpainting.
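To make the masked-finetuning idea concrete, the sketch below illustrates one plausible reading of it: erase a random region from a reference view, run the feed-forward LRM on the corrupted inputs, and supervise the re-rendered views against the original clean images, which serve as free ground truth since the mask is synthetic. This is a minimal PyTorch sketch under stated assumptions, not the paper's actual code; `lrm`, its call signature, and `scene.render` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F


def masked_finetune_step(lrm, views, intrinsics, extrinsics, optimizer):
    """One self-supervised masked-finetuning step (illustrative only).

    views:      (B, V, 3, H, W) multi-view images of a scene
    intrinsics: (B, V, 3, 3) camera intrinsics
    extrinsics: (B, V, 4, 4) camera-to-world poses
    """
    B, V, _, H, W = views.shape

    # Sample a random rectangular mask for the reference (first) view.
    mask = torch.zeros(B, 1, H, W, device=views.device)
    mh, mw = H // 3, W // 3  # mask covers roughly 1/9 of the image
    for b in range(B):
        top = torch.randint(0, H - mh, (1,)).item()
        left = torch.randint(0, W - mw, (1,)).item()
        mask[b, :, top:top + mh, left:left + mw] = 1.0

    # Erase the masked region from the reference view only.
    masked_views = views.clone()
    masked_views[:, 0] = views[:, 0] * (1.0 - mask)

    # Feed-forward reconstruction from the corrupted inputs
    # (hypothetical LRM interface).
    scene = lrm(masked_views, intrinsics, extrinsics, mask=mask)

    # Re-render every input viewpoint and supervise with the clean images,
    # so the model learns to fill the hole consistently in 3D.
    renders = scene.render(intrinsics, extrinsics)  # (B, V, 3, H, W)
    loss = F.mse_loss(renders, views)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the supervision comes from the unmasked originals, no inpainting annotations are needed, which is what allows finetuning on large-scale reconstruction datasets.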
[Interactive viewer: Original Scene vs. Edited Scene, with Depth visualization]
@misc{you2025instainpaint,
  title={InstaInpaint: Instant 3D-Scene Inpainting with Masked Large Reconstruction Model},
  author={Junqi You and Chieh Hubert Lin and Weijie Lyu and Zhengbo Zhang and Ming-Hsuan Yang},
  year={2025},
  eprint={2506.10980},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2506.10980},
}