Interactive demo: segmentation results after a single positive click, a second positive click, and a second negative click.
We present iSeg, a new interactive technique for segmenting 3D shapes. Previous works have focused mainly on leveraging pre-trained 2D foundation models for 3D segmentation based on text. However, text may be insufficient for accurately describing fine-grained spatial segmentations. Moreover, achieving a consistent 3D segmentation using a 2D model is challenging since occluded areas of the same semantic region may not be visible together from any 2D view. Thus, we design a segmentation method conditioned on fine user clicks, which operates entirely in 3D. Our system accepts user clicks directly on the shape's surface, indicating the inclusion or exclusion of regions from the desired shape partition. To accommodate various click settings, we propose a novel interactive attention module capable of processing different numbers and types of clicks, enabling the training of a single unified interactive segmentation model. We apply iSeg to a myriad of shapes from different domains, demonstrating its versatility and faithfulness to the user's specifications.
iSeg consists of two parts: an encoder that maps each vertex's coordinates to a deep semantic feature, collectively forming the Mesh Feature Field (MFF), and a decoder that takes the MFF and the user clicks and predicts the corresponding mesh segment. The decoder contains an interactive attention layer that supports a variable number of clicks of either type (positive and negative). We supervise training with 2D segmentation masks produced by the pre-trained Segment Anything model on rendered images of the shape, conditioned on the 2D projections of the 3D clicks. Although iSeg is trained on noisy and inconsistent 2D segmentations, it is view-consistent by construction.
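To make this two-part design concrete, here is a minimal PyTorch sketch. The module names, feature dimension, and MLP depth are our illustrative assumptions rather than the authors' implementation, and the generic cross-attention here merely stands in for the interactive attention layer sketched below.

```python
import torch
import torch.nn as nn

class MeshEncoder(nn.Module):
    """Maps vertex coordinates to the deep semantic Mesh Feature Field (MFF)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, verts):               # verts: (V, 3)
        return self.mlp(verts)              # MFF:   (V, feat_dim)


class MeshDecoder(nn.Module):
    """Predicts per-vertex segment probabilities from the MFF and user clicks."""
    def __init__(self, feat_dim=256, n_heads=4):
        super().__init__()
        # Generic cross-attention between vertices and clicks; a stand-in
        # for the paper's interactive attention layer (sketched below).
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.sign_embed = nn.Embedding(2, feat_dim)   # 0: negative, 1: positive
        self.head = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, mff, click_idx, click_sign):
        # mff: (V, D); click_idx: (C,) clicked vertex ids; click_sign: (C,) in {0, 1}
        clicks = mff[click_idx] + self.sign_embed(click_sign)          # (C, D)
        fused, _ = self.attn(mff.unsqueeze(0),
                             clicks.unsqueeze(0), clicks.unsqueeze(0))
        return self.head(fused.squeeze(0)).squeeze(-1)                 # (V,)


# Toy usage: one positive click on vertex 42 of a 1000-vertex mesh.
verts = torch.rand(1000, 3)
mff = MeshEncoder()(verts)
probs = MeshDecoder()(mff, torch.tensor([42]), torch.tensor([1]))
```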
We propose a novel interactive attention mechanism that accommodates a variable number of clicks of both types, positive and negative. The attention layer learns a representation for the clicks and computes their interaction with the mesh vertices. This key element of our method enables a unified decoder architecture that supports diverse user click settings.
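A hedged sketch of the idea follows: click features are tagged with a learned positive/negative embedding, and every vertex attends over the click set, so a single set of weights handles any number and mix of clicks. The layer names, dimensions, and residual update are our assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class InteractiveAttention(nn.Module):
    """Cross-attention from mesh vertices to a variable-size set of clicks."""
    def __init__(self, dim=256):
        super().__init__()
        self.type_embed = nn.Embedding(2, dim)   # 0: negative, 1: positive
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, vert_feats, click_feats, click_types):
        # vert_feats: (V, D), click_feats: (C, D), click_types: (C,) in {0, 1}
        clicks = click_feats + self.type_embed(click_types)
        attn = (self.to_q(vert_feats) @ self.to_k(clicks).T) * self.scale  # (V, C)
        weights = attn.softmax(dim=-1)
        return vert_feats + weights @ self.to_v(clicks)   # residual update, (V, D)


# The same module handles one click or several of mixed type:
v = torch.rand(1000, 256)
out = InteractiveAttention()(v, v[[10, 20]], torch.tensor([1, 0]))
```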
Our method segments parts in a 3D-consistent manner, even for surface regions occluded from the clicked point (left). Furthermore, we may input two point clicks that are occluded from each other (right). Such supervision never exists during our model's training, since mutually occluded 3D surfaces cannot be seen together in any single 2D view.
The Segment Anything model (SAM) is highly sensitive to the viewing angle. For the same clicked 3D point, it may generate substantially different 2D segmentation masks that are inconsistent in 3D, depending on the view direction. By contrast, iSeg segments the shape directly in 3D and is 3D consistent by construction.
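For reference, this is roughly how a single supervision mask can be obtained from SAM for one rendered view, using the public segment-anything API. The rendered image and the projected click coordinates are stubbed placeholders; this is a sketch, not the authors' exact training pipeline. Because each view can yield a different mask, distilling many such masks into one 3D model is what produces the consistency described above.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM (the checkpoint must be downloaded separately).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Placeholders for a rendered view of the mesh and the pixel location of
# the projected 3D click; a real pipeline would use an actual renderer.
image = np.zeros((512, 512, 3), dtype=np.uint8)
u, v = 256, 200

predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[u, v]]),
    point_labels=np.array([1]),        # 1 = positive click, 0 = negative
    multimask_output=False,
)
target_mask = masks[0]                 # one 2D supervision mask for this view
```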
iSeg optimizes a condition-agnostic feature field that transfers between shapes. The feature vector of a point click on one mesh (left) is used to segment the same shape (middle) as well as another shape from a different domain (right).
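Since the decoding is conditioned on click features rather than vertex indices, transfer amounts to reusing a stored feature vector. A small sketch, reusing the hypothetical MeshEncoder and InteractiveAttention modules from the sketches above:

```python
import torch

encoder, attn = MeshEncoder(), InteractiveAttention()

verts_a, verts_b = torch.rand(1000, 3), torch.rand(800, 3)   # two meshes
mff_a, mff_b = encoder(verts_a), encoder(verts_b)

# Save the click feature from mesh A ...
click_feat = mff_a[42].unsqueeze(0)                          # (1, D)
# ... and condition the segmentation of mesh B on it directly.
fused_b = attn(mff_b, click_feat, torch.tensor([1]))
```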
Our localized and contiguous segmentations enable various shape edits, such as deleting or selecting the segmented region, shrinking it, or extruding it along the surface normal.
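One plausible way to realize such edits, given iSeg's per-vertex probabilities, is to threshold them into a vertex mask and displace the selected vertices. The offsets and blend weights below are arbitrary illustrative values, and the mesh data are toy stand-ins:

```python
import numpy as np

# Toy stand-ins for mesh data and iSeg's output (all assumed available).
rng = np.random.default_rng(0)
verts = rng.random((1000, 3))                 # vertex positions
normals = rng.standard_normal((1000, 3))      # per-vertex normals
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
probs = rng.random(1000)                      # per-vertex segment probabilities

mask = probs > 0.5                            # vertices inside the segment

extruded = verts.copy()
extruded[mask] += 0.02 * normals[mask]        # push the segment along its normals

shrunk = verts.copy()
centroid = verts[mask].mean(axis=0)
shrunk[mask] = 0.7 * verts[mask] + 0.3 * centroid   # pull toward the centroid
```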
iSeg produces fine-grained segmentations from just one or a few clicks (positive and negative) as input. It is highly flexible and can select parts that vary in size, geometry, and semantic meaning.
@article{lang2024iseg,
author = {Lang, Itai and Xu, Fei and Decatur, Dale and Babu, Sudarshan and Hanocka, Rana},
title = {{iSeg: Interactive 3D Segmentation via Interactive Attention}},
journal = {arXiv preprint arXiv:2404.03219},
year = {2024}
}