MeshOn: Intersection-Free Mesh-to-Mesh Composition

1 Columbia University 2 University of Chicago 3 Cornell University
MeshOn teaser showing a base mesh, accessories, and successive intersection-free fits

MeshOn is a multi-step optimization algorithm that fits accessories onto meshes realistically, tightly, and without intersections.

Abstract

We propose MeshOn, a method that finds physically and semantically realistic compositions of two input meshes. Given an accessory, a base mesh with a user-defined target region, and optional text strings describing both meshes, MeshOn uses a multi-step optimization framework to fit the meshes onto each other realistically while preventing intersections. We initialize the shapes' rigid configuration via a structured alignment scheme using Vision-Language Models (VLMs), then optimize it using a combination of attractive geometric losses and a physics-inspired barrier loss that prevents surface intersections. Finally, we obtain a deformation of the accessory, assisted by a diffusion prior. Our method successfully fits accessories of various materials over a breadth of target regions, and is designed to fit directly into existing digital artist workflows. We demonstrate the robustness and accuracy of our pipeline by comparing it with generative approaches and traditional registration algorithms.

Applications

Applications of MeshOn across a wide range of meshes and accessories

MeshOn is capable of fitting a wide variety of accessories on a range of different meshes. Our method handles challenging spatial relationships such as sliding glasses along a head to align with the ears, positioning hats and helmets to conform to head curvature, wrapping bands around articulated limbs, and fitting objects along long, curved surfaces. These examples demonstrate the flexibility of our fitting trajectory and its ability to adapt accessories to widely varying shapes and poses.

We exemplify our algorithm's applicability through several prototypical examples. As shown in Figure 4, our method can find tight, non-intersecting configurations between a diverse range of meshes, even when these contain complex geometries, concavities, and topologies.

Region-controlled composition

(a) Ring placement under four distinct user-selected target regions. (b) Region-sensitivity examples showing how accessory placement changes with the selected region.

As expected, our algorithm's output is most sensitive to the selected target region on the base mesh.

(a) Given four distinct user-selected target regions, MeshOn places a ring precisely on each selected region, demonstrating explicit user control over the fitting process. (b) Because the method adheres strictly to the selected region, asset placement is sensitive to the region definition; for example, glasses sit naturally on the ears only if a small portion of the ear region is included in the selection.

Designed to fit into artist workflows

Comparison between Instant3dit and MeshOn showing preservation of rig and texture information

Digital artists often load shapes from pre-existing libraries of meshes that include additional information like animation rigs, skinning weights, texture maps and more. Generative shape editing methods like Instant3dit discard this information, performing global changes to the shapes and merging them together. MeshOn is designed to fit exactly within this common artistic pipeline, and perfectly preserves all pre-existing information on the input meshes.

Mesh composition is the process of assembling several pre-modeled 3D objects into a unified asset. It is a fundamental part of many 3D content creation workflows, in which an artist selects accessories from an existing library and manually rotates, translates and deforms them to fit realistically on a character's base mesh without intersections.

However, recent advances in 3D content creation have focused mostly on generative tasks; for example, creating shapes from text inputs. These methods produce undoubtedly impressive results, yet they do so at the cost of artist control, and by relying on implicit geometric representations that are not directly usable in graphics pipelines. Even when triangle meshes are generated, they are of significantly lower quality than those already available to artists in pre-existing shape libraries, lacking critical attributes such as textures, animation rigs, and distinct material parameters.

Method

MeshOn method overview showing the full multi-step fitting pipeline

We compose the two meshes through a multi-step optimization pipeline: from an initialization obtained with a Vision-Language Model (VLM), we first obtain a tight fit that may contain intersections, which we then resolve. After fine-tuning the result to obtain the best possible rigid fit, we improve it further by allowing small deformations of the accessory.

The inputs to our method are a base mesh and an object or accessory mesh in arbitrary relative position, together with a highlighted region of the base mesh and an optional text string describing both shapes.

A good composition must be tight, free of intersections, and semantically plausible. These goals often act in opposition to one another; for example, achieving a tighter fit will often produce intersections, while removing intersections will tend to loosen the fit. We address this challenge through a multi-step optimization framework in which the different constraints, objectives, and degrees of freedom of the problem are introduced one at a time.
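This tension can be illustrated with a minimal one-dimensional toy (our own construction, not the paper's code): an attraction term pulls the accessory's gap toward zero, while a log-barrier term, in the spirit of contact-barrier potentials, diverges as the gap closes. Gradient descent then settles on a tight but strictly positive gap. The activation distance `DHAT` and the step size are assumed hyperparameters.

```python
import math

DHAT = 0.1  # barrier activation distance (assumed hyperparameter)

def attraction(d):
    """Attractive objective: wants the gap d to shrink to zero (tight fit)."""
    return d * d

def barrier(d):
    """Log-barrier term: zero beyond DHAT, diverges to +inf as d -> 0+."""
    if d >= DHAT:
        return 0.0
    return -(d - DHAT) ** 2 * math.log(d / DHAT)

def grad(f, d, eps=1e-6):
    """Central finite-difference gradient, enough for a 1-D toy."""
    return (f(d + eps) - f(d - eps)) / (2 * eps)

d = 0.5  # initial gap between accessory and base surface
for _ in range(2000):
    g = grad(attraction, d) + grad(barrier, d)
    d -= 0.01 * g
    d = max(d, 1e-6)  # numerical safeguard

# The optimized gap is much tighter than the start, yet strictly positive:
# the barrier prevents the attraction term from driving the surfaces into
# intersection.
print(0.0 < d < DHAT, d < 0.5)
```

Because the barrier's gradient grows without bound as the gap approaches zero, the equilibrium is always on the intersection-free side, which is the qualitative behavior the pipeline relies on.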

Elastic deformation

Material-aware elastic deformation results across different accessory stiffness settings

The physically-inspired optimization of the previous step outputs a rigidly transformed mesh, which we interpret as the best possible rigid fit of the object onto the base mesh. In reality, however, many accessories fit their wearer more tightly by undergoing small deformations: a hat is stretched by a head's shape, a necklace drapes over a person's chest, a face mask deforms to fit the nose and mouth of a surgeon.

MeshOn is capable of fitting the accessory onto the base mesh in a way that considers material properties. Instead of having to specify numeric material parameters, artists can provide material guidance through text prompts, which are combined with rendered images and interpreted by a VLM to produce specific elastic parameters.

A particularly relevant set of parameters are the Lamé elastic parameters λ and μ, which can be derived from the material's Young's modulus and Poisson's ratio. These quantities are tabulated for well-known materials; we nevertheless make the process more intuitive for an artist by assigning the values from text inputs using an LLM.
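The conversion itself is the standard one: λ = Eν / ((1+ν)(1−2ν)) and μ = E / (2(1+ν)). A minimal sketch is below; the material table is illustrative and ours (rough, order-of-magnitude values), standing in for the LLM-assigned values the paper describes.

```python
# Illustrative (E [Pa], Poisson's ratio) pairs -- not the paper's values.
MATERIALS = {
    "rubber":  (1e7, 0.49),
    "leather": (5e8, 0.40),
    "steel":   (2e11, 0.30),
}

def lame_parameters(E, nu):
    """First and second Lame parameters from Young's modulus E and
    Poisson's ratio nu."""
    lam = E * nu / ((1 + nu) * (1 - 2 * nu))
    mu = E / (2 * (1 + nu))
    return lam, mu

lam, mu = lame_parameters(*MATERIALS["rubber"])
print(f"rubber: lambda = {lam:.3e}, mu = {mu:.3e}")
```

Note that as ν approaches 0.5 (nearly incompressible materials such as rubber), λ grows much faster than μ, which is why soft, volume-preserving accessories behave so differently from stiff ones under the same fitting pressure.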

Comparisons

Comparison of MeshOn against classical registration baselines with blowup insets

Unlike our method, classical registration algorithms such as ICP have no semantic guidance and therefore produce unrealistic configurations, e.g., glasses sitting upside down on the eyes. These algorithms are also not designed to avoid intersections, and consequently produce many of them.

Additionally, non-neural methods such as those based on the Iterative Closest Point algorithm do not account for the semantic fit between the two objects, e.g., fitting the glasses backwards on the face.
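The purely geometric nature of such baselines is visible in their core step: the least-squares rigid alignment (the Kabsch/Procrustes solution) between matched point sets. A minimal sketch under known correspondences is below (our own, not the paper's or any library's code); nothing in it inspects what the points represent, which is why a semantically wrong but geometrically plausible pose can score just as well.

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Rotation R and translation t minimizing ||R @ P + t - Q||_F,
    given 3xN point sets P, Q in correspondence (Kabsch algorithm)."""
    cp, cq = P.mean(axis=1, keepdims=True), Q.mean(axis=1, keepdims=True)
    H = (P - cp) @ (Q - cq).T          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:           # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cq - R @ cp
    return R, t

# Synthetic check: recover a known rotation about z plus a translation.
rng = np.random.default_rng(0)
P = rng.normal(size=(3, 50))
a = 0.7
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a),  np.cos(a), 0.0],
               [0.0,        0.0,       1.0]])
Q = Rz @ P + np.array([[1.0], [2.0], [3.0]])
R, t = best_rigid_transform(P, Q)
print(np.allclose(R, Rz), np.allclose(R @ P + t, Q))
```

Full ICP simply alternates this solve with nearest-neighbor correspondence updates; at no point does any term encode that glasses should face forward.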

By modifying a shape using diffusion guidance, a highlighted region, and an optional text prompt, our work may seem similar to existing text-guided generative mesh-editing works. Unlike these, however, our work preserves both the object and base meshes as well as any information stored in them.

Taking Instant3dit as a representative example: it produces a generative edit that merges the geometry of both objects and discards the animation rig of the base mesh and the parametrization of the accessory. By contrast, our method preserves all of these, enabling common downstream tasks like animation, physical simulation, and texture painting.

Quantitative Evaluation

We quantitatively evaluate our method along two complementary dimensions: semantic alignment quality and geometric validity.

Alignment Baselines

Metric       | Instant3Dit | Best ICP | Best FGR | RANSAC+ICP | RANSAC+FGR | Ours
CLIP (↑)     | 0.311       | 0.341    | 0.333    | 0.331      | 0.321      | 0.356
CLIP-IQA (↑) | 0.559       | 0.586    | 0.609    | 0.608      | 0.610      | 0.604
VQA (↑)      | 65.62       | 73.71    | 73.97    | 73.72      | 73.68      | 74.49

Ablation Study

Metric       | Init   | Step 1 | Step 2 | Step 3 | Step 4 | Ours
CLIP (↑)     | 0.338  | 0.351  | 0.346  | 0.347  | 0.354  | 0.356
CLIP-IQA (↑) | 0.594  | 0.605  | 0.609  | 0.595  | 0.613  | 0.604
VQA (↑)      | 74.197 | 73.967 | 73.776 | 74.049 | 73.985 | 74.49

Quantitative evaluation using CLIP, CLIP-IQA, and VQA scores. Top: comparison against alignment baselines. Bottom: ablation study of our multi-step pipeline. Scores are averaged over per-example best renderings (best of 4); higher is better.

Across these metrics, our full method achieves the strongest or near-strongest performance. Alignment-based methods also frequently produce mesh intersections, an undesirable artifact for most graphics workflows.

Penetration statistics

Metric                 | Best ICP | Best FGR | RANSAC+ICP | RANSAC+FGR | Ours
Intersecting faces (↓) | 628      | 660      | 544        | 434        | 0
Max penetration (↓)    | 0.0197   | 0.0237   | 0.0243     | 0.0237     | 0

Penetration statistics averaged over ten experiments. We report the total number of intersecting faces and the maximum penetration depth. Our method produces intersection-free compositions.
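How such penetration statistics can be computed is easiest to see against an analytic signed distance field. The sketch below (our reconstruction for illustration, not the paper's evaluation code) uses a unit sphere standing in for the base mesh: sample points with negative signed distance are counted as penetrating, and the maximum penetration depth is the most negative distance.

```python
import numpy as np

def signed_distance_to_unit_sphere(points):
    """Analytic SDF of the unit sphere: negative inside, positive outside."""
    return np.linalg.norm(points, axis=1) - 1.0

def penetration_stats(points):
    """Count of penetrating samples and maximum penetration depth,
    analogous to the intersecting-face count and max-depth metrics."""
    sd = signed_distance_to_unit_sphere(points)
    penetrating = sd < 0
    count = int(penetrating.sum())
    max_depth = float((-sd[penetrating]).max()) if count else 0.0
    return count, max_depth

accessory_pts = np.array([[1.2, 0.0, 0.0],   # outside: no penetration
                          [0.9, 0.0, 0.0],   # inside: depth 0.1
                          [0.0, 0.7, 0.0]])  # inside: depth 0.3
count, depth = penetration_stats(accessory_pts)
print(count, round(depth, 3))  # 2 0.3
```

For triangle meshes the same idea applies with a mesh-based signed distance or explicit triangle-triangle intersection tests in place of the analytic SDF.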

BibTeX
