Hunyuan3D 2.0

Hunyuan3D 2.0 is an advanced 3D synthesis system for generating high-resolution, textured 3D assets. It consists of two main components: Hunyuan3D-DiT, a shape generation model, and Hunyuan3D-Paint, a texture synthesis model. The system uses a two-stage pipeline: a bare mesh is generated first, and a texture map is then synthesized for it. This decouples the complexities of shape and texture generation and allows both generated and handcrafted meshes to be textured.
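
The decoupled two-stage flow can be sketched as follows. This is a minimal stand-in, not the actual Hunyuan3D 2.0 API: the function names, the `Mesh` type, and the stub geometry are all illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Mesh:
    """Bare triangle mesh; the texture map is attached in stage two."""
    vertices: list
    faces: list
    texture: Optional[dict] = None

def generate_shape(image: dict) -> Mesh:
    """Stage 1 stand-in for Hunyuan3D-DiT: image -> untextured mesh.

    The real model decodes predicted latent tokens into a polygon mesh;
    here we return stub geometry (a single triangle).
    """
    return Mesh(vertices=[(0, 0, 0), (1, 0, 0), (0, 1, 0)],
                faces=[(0, 1, 2)])

def paint_texture(mesh: Mesh, image: dict) -> Mesh:
    """Stage 2 stand-in for Hunyuan3D-Paint: synthesize a texture map.

    Because the stages are decoupled, this step can also be applied to a
    handcrafted mesh supplied directly by the user.
    """
    mesh.texture = {"resolution": 1024, "source": image["name"]}
    return mesh

def generate_asset(image: dict) -> Mesh:
    """Full pipeline: shape first, then texture."""
    return paint_texture(generate_shape(image), image)
```

The key design point mirrored here is that `paint_texture` takes any mesh, which is what lets the same texturing stage serve both generated and handcrafted geometry.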

Key Features of Hunyuan3D 2.0

Shape Generation

  • It is a large-scale flow-based diffusion model.

  • It uses a Hunyuan3D-ShapeVAE autoencoder to capture fine-grained detail on meshes. ShapeVAE employs vector sets and an importance sampling method to extract representative features and capture details such as edges and corners. The encoder takes as input the 3D coordinates and normal vectors of point clouds sampled from the surface of a 3D shape, and the decoder is trained to predict the shape's Signed Distance Function (SDF), which can then be decoded into a triangle mesh.

  • A dual-single stream transformer is built on the latent space of the VAE with a flow-matching objective.

  • It is trained to predict object token sequences from a user-provided image; the predicted tokens are then decoded into a polygon mesh by the VAE decoder.

  • It uses a pre-trained image encoder (DINOv2 Giant) to extract conditional image tokens, processing images at 518 × 518 resolution. The background of the input image is removed, and the object is resized and recentered to mitigate the negative impact of the background.
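
The image-conditioning preprocessing described in the last bullet (remove the background, crop to the object, recenter it on a square canvas at the encoder's resolution) can be sketched like this. The function names and the padding ratio are illustrative assumptions, not from the Hunyuan3D 2.0 code; the foreground mask is assumed to come from an alpha channel.

```python
ENCODER_RES = 518  # DINOv2 Giant conditioning resolution

def foreground_bbox(alpha):
    """Bounding box (top, left, bottom, right) of pixels with alpha > 0."""
    rows = [i for i, row in enumerate(alpha) if any(a > 0 for a in row)]
    cols = [j for j in range(len(alpha[0]))
            if any(row[j] > 0 for row in alpha)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1

def placement(alpha, pad=0.1):
    """Scale and offsets that recenter the object on the square canvas.

    `pad` (an assumed margin ratio) keeps the object away from the border;
    the longer side of the object's bounding box fills the rest.
    """
    top, left, bottom, right = foreground_bbox(alpha)
    h, w = bottom - top, right - left
    scale = ENCODER_RES * (1 - 2 * pad) / max(h, w)
    off_y = (ENCODER_RES - h * scale) / 2
    off_x = (ENCODER_RES - w * scale) / 2
    return scale, off_y, off_x
```

In practice the mask would come from a segmentation or matting model rather than a ready-made alpha channel, but the resize-and-recenter arithmetic is the same.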

Texture Synthesis

  • It uses a mesh-conditioned multi-view generation pipeline.

  • It uses a three-stage framework: pre-processing, multi-view image synthesis, and texture baking.

  • An image delighting module converts the input image to an unlit state so that light-invariant texture maps can be produced. The multi-view generation model is trained on white-light-illuminated images, enabling illumination-invariant texture synthesis.

  • A geometry-aware viewpoint selection strategy is employed, selecting 8 to 12 viewpoints for texture synthesis.

  • A double-stream image conditioning reference-net, a multi-task attention mechanism, and a strategy for geometry and view conditioning are also used in this pipeline.

  • It uses a multi-task attention mechanism with reference and multi-view attention modules to ensure consistency across generated views.

  • It conditions the model on multi-view canonical normal maps and coordinate maps. A learnable camera embedding strengthens the viewpoint cue for the multi-view diffusion model.

  • It uses dense-view inference with a view dropout strategy to enhance 3D perception.

  • A single-image super-resolution model is used to enhance texture quality, and an inpainting approach is applied to fill any uncovered patches on the texture map.

  • It is compatible with both text-to-texture and image-to-texture generation, utilizing text-to-image (T2I) models and conditional generation modules.
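
The geometry-aware viewpoint selection mentioned above (choosing 8 to 12 views for texture synthesis) can be sketched as a greedy coverage heuristic: repeatedly pick the candidate view that reveals the most not-yet-textured faces, within the view budget. This is an assumed stand-in for the paper's strategy, and the per-view visibility sets are taken as given here; the real system scores candidate views against the mesh geometry itself.

```python
def select_viewpoints(view_coverage, max_views=12):
    """Greedily choose views until the budget is spent or nothing new is seen.

    view_coverage: dict mapping a view id to the set of mesh face ids
    visible from that view (assumed precomputed from the geometry).
    Returns the chosen view ids, in selection order, and the covered faces.
    """
    covered, chosen = set(), []
    remaining = dict(view_coverage)
    while remaining and len(chosen) < max_views:
        # Greedy step: the view that uncovers the most new faces.
        best = max(remaining, key=lambda v: len(remaining[v] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no candidate adds anything new
        chosen.append(best)
        covered |= gain
    return chosen, covered
```

Faces left uncovered after the budget is exhausted correspond to the texture-map patches that the inpainting step described above would fill in.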

Performance and Evaluation

  • Hunyuan3D 2.0 outperforms previous state-of-the-art models in geometry detail, condition alignment, and texture quality.

  • Hunyuan3D-ShapeVAE surpasses other methods in shape reconstruction.

  • Hunyuan3D-DiT produces more accurate condition following results compared to other methods.

  • The system produces high-quality textured 3D assets.

  • User studies demonstrate the superiority of Hunyuan3D 2.0 in terms of visual quality and adherence to image conditions.