ASIA

Adaptive 3D Segmentation using Few Image Annotations

Simon Fraser University
SIGGRAPH Asia, 2025

TL;DR.  We segment 3D shapes into possibly non-text-describable parts (adaptive), as the user desires (controllable), using only a few annotated in-the-wild images as references (few-shot).



Our method, ASIA, is trained on a few user-annotated images—the “Reference Annotations”—and produces part segmentations for a given 3D shape that adhere to these references. It generates accurate results despite geometric variations (e.g., hair) and structural differences (e.g., microphones and sofa beds). ASIA can also segment multiple objects at once (e.g., the two microphones in the box were treated as a single shape), even when trained only on single instances. The rightmost example shows how our method can adapt annotations from different object categories, cars and horses in this case, to segment their "hybrids", namely, 3D ponycycles.

Abstract

We introduce ASIA (Adaptive 3D Segmentation using few Image Annotations), a novel framework that enables segmentation of possibly non-semantic and non-text-describable “parts” in 3D. Our segmentation is controllable through a few user-annotated in-the-wild images, which are easier to collect than multi-view images, less demanding to annotate than 3D models, and more precise than potentially ambiguous text descriptions. Our method leverages the rich priors of text-to-image diffusion models, such as Stable Diffusion (SD), to transfer segmentations from image space to 3D, even when the annotated and target objects differ significantly in geometry or structure. During training, we optimize a text token for each segment and fine-tune our model with a novel cross-view part correspondence loss. At inference, we segment multi-view renderings of the 3D mesh, fuse the labels in UV-space via voting, refine them with our novel Noise Optimization technique, and finally map the UV-labels back onto the mesh. ASIA provides a practical and generalizable solution for both semantic and non-semantic 3D segmentation tasks, outperforming existing methods by a noticeable margin in both quantitative and qualitative evaluations.

Our Approach

Training Pipeline. Given a few in-the-wild images and their segmentations, ASIA learns a set of text tokens, one for each part, whose SD attention maps localize the corresponding segments. We also fine-tune the SD UNet with LoRA using our novel part-aware correspondence loss, which keeps part features, and hence the predicted segments, consistent across views. The Mask Extractor computes the segmentation masks from intermediate features of multiple SD attention layers.
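As a rough illustration of this training step, the sketch below jointly optimizes one learnable embedding per part and LoRA-adapted UNet weights with a segmentation term plus a cross-view correspondence term. All names here (e.g., `sd_unet`, `part_tokens`, the `return_attn` flag, and the exact loss forms) are simplified assumptions for exposition, not the released code.

```python
import torch
import torch.nn.functional as F

def training_step(sd_unet, prompt_embeds, part_tokens, images, masks, view_pairs,
                  lambda_corr=1.0):
    """Hypothetical single training step for per-part token + LoRA learning.

    part_tokens : (P, D) learnable embeddings, one per annotated part.
    images      : (V, C, H, W) latents of the annotated views.
    masks       : (V, H, W) integer part labels (the reference annotations).
    view_pairs  : list of (i, j) index pairs of views showing the same object.
    """
    # Append the learnable part tokens to the frozen prompt embedding.
    cond = torch.cat([prompt_embeds, part_tokens], dim=0)      # (L + P, D)
    cond = cond.unsqueeze(0).expand(images.shape[0], -1, -1)   # one copy per view

    # One denoising pass; we assume the UNet wrapper returns per-part attention
    # maps and pooled per-part features alongside its usual noise prediction.
    t = torch.randint(0, 1000, (images.shape[0],), device=images.device)
    noisy = images + torch.randn_like(images)  # stand-in for the scheduler's forward process
    attn_maps, part_feats = sd_unet(noisy, t, cond, return_attn=True)

    # Segmentation loss: each part token's attention map should match its mask.
    logits = F.interpolate(attn_maps, size=masks.shape[-2:], mode="bilinear")
    seg_loss = F.cross_entropy(logits, masks)

    # Cross-view part correspondence loss: features of the same part seen in two
    # views should agree, keeping the predicted segments consistent across views.
    corr_loss = 0.0
    for i, j in view_pairs:
        corr_loss = corr_loss + (1 - F.cosine_similarity(part_feats[i], part_feats[j], dim=-1).mean())

    return seg_loss + lambda_corr * corr_loss / max(len(view_pairs), 1)
```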
Inference Pipeline. Given a 3D mesh, we first render multi-view RGB images and add noise to them to prepare the input to our model. We also extract geometric edges of the mesh from the same views as the RGB images and provide them as input through a pre-trained ControlNet. We pass these, along with the trained text tokens, through our model with the trained LoRA layers, extract the segmentations from the Mask Extractor, project the labels to UV-space, and aggregate all the partially labelled UV-maps into a single, complete, globally consistent atlas through voting. This atlas can then be wrapped onto the input mesh to obtain the segmented output. For Noise Optimization, we render the aggregated atlas of labels into the same views as the input RGB images to obtain pseudo-GTs, and optimize our consistency objective $\mathcal{E}_{consis}(\cdot)$ to update the per-view input noise, further enhancing multi-view consistency.
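For concreteness, here is a minimal sketch of the UV-space label fusion by majority voting, assuming each view's predicted labels have already been projected into a shared UV atlas; the function name, array shapes, and the `ignore` convention are illustrative assumptions, not the released interface.

```python
import numpy as np

def fuse_labels_in_uv(per_view_uv_labels, num_parts, ignore=-1):
    """Majority-vote fusion of per-view part labels in UV space (illustrative).

    per_view_uv_labels : list of (H, W) integer UV maps, one per rendered view,
                         with texels not visible from that view set to `ignore`.
    Returns a single (H, W) atlas holding each texel's most frequently observed label.
    """
    h, w = per_view_uv_labels[0].shape
    votes = np.zeros((h, w, num_parts), dtype=np.int32)
    for lbl in per_view_uv_labels:
        rows, cols = np.nonzero(lbl != ignore)
        # One vote per visible texel for the label observed in this view.
        np.add.at(votes, (rows, cols, lbl[rows, cols]), 1)
    atlas = votes.argmax(axis=-1)
    atlas[votes.sum(axis=-1) == 0] = ignore  # texels never observed stay unlabeled
    return atlas
```

In the pipeline described above, this fused atlas is what gets rendered back into the input views as pseudo-GTs for the Noise Optimization step.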

Results


Check out the paper to learn more. 🙂

BibTeX


@article{perla2025asia,
    title={{ASIA}: Adaptive 3D Segmentation using Few Image Annotations},
    author = {Perla, Sai Raj Kishore and Vora, Aditya and Nag, Sauradip and Mahdavi-Amiri, Ali and Zhang, Hao},
    journal = {SIGGRAPH Asia Conference Papers},
    publisher = {ACM New York, NY, USA},
    year = {2025},
    doi = {10.1145/3757377.3763821},
    url = {https://github.com/sairajk/asia},
}