Teaser gallery: Iron Man · Colorful Crochet Candle (“a candle made of colorful crochet”) · Brick Lamp (“a lamp made of brick”) · Astronaut Horse (“a horse wearing an astronaut suit”)

Text2Mesh produces color and geometric details over a variety of source meshes, driven by a target text prompt. Our stylization results coherently blend unique and ostensibly unrelated combinations of text, capturing both global semantics and part-aware attributes.

Code · Paper · Supplementary


In this work, we develop intuitive controls for editing the style of 3D objects. Our framework, Text2Mesh, stylizes a 3D mesh by predicting color and local geometric details which conform to a target text prompt. We consider a disentangled representation of a 3D object using a fixed mesh input (content) coupled with a learned neural network, which we term neural style field network. In order to modify style, we obtain a similarity score between a text prompt (describing style) and a stylized mesh by harnessing the representational power of CLIP. Text2Mesh requires neither a pre-trained generative model nor a specialized 3D mesh dataset. It can handle low-quality meshes (non-manifold, boundaries, etc.) with arbitrary genus, and does not require UV parameterization. We demonstrate the ability of our technique to synthesize a myriad of styles over a wide variety of 3D meshes.



Text2Mesh modifies an input mesh to conform to the target text by predicting color and geometric details. The weights of the neural style field network are optimized by rendering multiple 2D views, applying 2D augmentations to them, and scoring the augmented renders against the target text with a CLIP-based semantic loss.
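The core of this pipeline is the neural style field itself: an MLP that maps each vertex position to an RGB color and a small displacement along the vertex normal. The sketch below illustrates that forward pass with random, untrained weights; the layer widths, output activations, and displacement bound are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, widths=(64, 64)):
    """Tiny ReLU MLP with random (untrained) weights; returns 4 values
    per vertex: 3 color channels + 1 scalar displacement. Widths are
    illustrative, not the paper's configuration."""
    h = x
    d_in = x.shape[1]
    for d_out in widths:
        W = rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_in, d_out))
        h = np.maximum(h @ W, 0.0)
        d_in = d_out
    W = rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_in, 4))
    return h @ W

def style_field(vertices, normals, max_disp=0.1):
    """Map vertex positions to colors and displaced positions."""
    out = mlp_forward(vertices)
    color = 1.0 / (1.0 + np.exp(-out[:, :3]))  # sigmoid -> RGB in (0, 1)
    disp = max_disp * np.tanh(out[:, 3:4])     # bounded signed offset
    new_vertices = vertices + disp * normals   # displace along normals
    return color, new_vertices

# Toy "mesh": 5 vertices with unit normals (a real mesh supplies both).
V = rng.normal(size=(5, 3))
N = V / np.linalg.norm(V, axis=1, keepdims=True)
color, V_styled = style_field(V, N)
```

Bounding the displacement (here via `tanh`) keeps the stylized surface close to the input geometry, which is what lets the content of the source mesh survive stylization.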

View Consistency

We use CLIP’s ability to jointly embed text and images to produce view-consistent and semantically meaningful stylizations over the entire 3D shape.
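One simple way to see how a joint embedding yields view consistency: average the normalized embeddings of several rendered views before scoring against the text embedding, so no single viewpoint can dominate. The snippet below is a minimal sketch of that idea using placeholder vectors; in the actual method the embeddings come from CLIP's image and text encoders.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def view_consistent_score(view_embeddings, text_embedding):
    """Normalize each view embedding, average them, then score the
    average against the text embedding. Averaging before scoring
    rewards a style that reads correctly from every viewpoint."""
    views = view_embeddings / np.linalg.norm(
        view_embeddings, axis=1, keepdims=True
    )
    avg = views.mean(axis=0)
    return cosine_sim(avg, text_embedding)
```

If all views align with the text embedding the score is 1; views that disagree with each other pull the average toward zero and lower the score.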

croissant made of colorful crochet
armadillo made of gold
donkey wearing jeans

General Stylization

For the same input mesh, Text2Mesh is capable of generating a variety of different local geometric displacements to synthesize a wide range of styles.


We show the distinct effect of each of our design choices on the quality of the final stylization through a series of ablations.


Ablation on the priors used in our method (full) for a candle mesh and target ‘Candle made of bark’: w/o our style field network (−net), w/o 2D augmentations (−aug), w/o positional encoding (−FFN), w/o crop augmentations for ψlocal (−crop), w/o the geometry-only component of Lsim (−displ), and learning over a 2D plane in 3D space (−3D).
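The −FFN ablation removes the Fourier-feature positional encoding, which lifts low-dimensional vertex positions into a higher-frequency space so the MLP can express fine surface detail. A minimal sketch of such an encoding, with an assumed frequency count and bandwidth (`sigma`):

```python
import numpy as np

def fourier_features(points, num_freqs=64, sigma=5.0, seed=0):
    """Random Fourier feature encoding:
        gamma(p) = [cos(2*pi*B p), sin(2*pi*B p)],
    with B drawn from N(0, sigma^2). Maps (N, 3) points to
    (N, 2 * num_freqs) features; sigma controls how high-frequency
    the learned details can be."""
    rng = np.random.default_rng(seed)
    B = rng.normal(0.0, sigma, (3, num_freqs))
    proj = 2.0 * np.pi * points @ B
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1)
```

Without this encoding, an MLP on raw coordinates tends toward overly smooth outputs, which matches the blurred geometric detail seen in the −FFN ablation.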

Text Specificity

We are able to synthesize styles of increasing specificity while retaining the details from the previous levels.

Clothes iron
Clothes iron made of crochet
Golden clothes iron made of crochet
Shiny golden clothes iron made of crochet

Luxo lamp
Blue steel luxo lamp
Blue steel luxo lamp with corrugated metal

Beyond Text-Driven Manipulation

We further leverage the joint vision-language embedding space to demonstrate the multi-modal stylization capabilities of our method.

Image and Mesh Targets

Examples of stylizations driven by image targets (three examples) and by a mesh target.


@article{michel2021text2mesh,
    author  = {Michel, Oscar
               and Bar-On, Roi
               and Liu, Richard
               and Benaim, Sagie
               and Hanocka, Rana},
    title   = {Text2Mesh: Text-Driven Neural Stylization for Meshes},
    journal = {arXiv preprint arXiv:2112.03221},
    year    = {2021}
}