Qijun Gan, Zijie Zhou and Jianke Zhu are with the College of Computer Science and Technology, Zhejiang University, Zheda Rd 38th, Hangzhou, China. Email: {ganqijun, zjzhou, jkzhu}@zju.edu.cn. Jianke Zhu is the corresponding author.
Abstract
Hand avatars play a pivotal role in a wide array of digital interfaces, enhancing user immersion and facilitating natural interaction within virtual environments. While previous studies have focused on photo-realistic hand rendering, little attention has been paid to reconstructing the hand geometry with fine details, which is essential to rendering quality. In the realms of extended reality and gaming, on-the-fly rendering becomes imperative. To this end, we introduce an expressive hand avatar, named XHand, that is designed to comprehensively generate hand shape, appearance, and deformations in real-time. To obtain fine-grained hand meshes, we make use of three feature embedding modules to predict hand deformation displacements, albedo, and linear blending skinning weights, respectively. To achieve photo-realistic hand rendering on fine-grained meshes, our method employs a mesh-based neural renderer that leverages mesh topological consistency and latent codes from the embedding modules. During training, a part-aware Laplace smoothing strategy is proposed that applies distinct levels of regularization to effectively maintain the necessary details and eliminate undesired artifacts. Experimental evaluations on the InterHand2.6M and DeepHandMesh datasets demonstrate the efficacy of XHand, which is able to recover high-fidelity geometry and texture for hand animations across diverse poses in real-time. To reproduce our results, we will make the full implementation publicly available at https://github.com/agnJason/XHand.
Index Terms:
3D hand reconstruction, animatable avatar, MANO.
I Introduction
Hand avatars are crucial in various digital environments, including virtual reality, digital entertainment, and human-computer interaction[1, 2, 3, 4]. Accurate representation and lifelike motion of hand avatars are essential to deliver an authentic and engaging user experience. Due to the complexity of hand muscles and the personalized nature of hands, it is challenging to obtain a fine-grained hand representation[5, 6, 7, 8], which directly affects the user experience in virtual spaces.
Parametric model-based methods[9, 10, 5] have succeeded in modeling digital humans, offering structured frameworks to efficiently analyze and manipulate the shapes and poses of human bodies and hands. These models have played a crucial role in various applications, enabling computer animation and hand-object interaction[11, 12, 13, 14, 15]. However, since they predominantly rely on mesh-based representations, they are restricted to a fixed topology and the limited resolution of the 3D mesh. Consequently, it is difficult for these models to accurately represent intricate details such as muscles, garments and hair, which hinders them from rendering high-fidelity images[16]. Model-free methods offer effective solutions for representing hand meshes through various techniques. Graph Convolutional Network (GCN)-based and UV-based representations of 3D hand meshes[17, 18] enable the reconstruction of diverse hand poses with detailed deformations. Lightweight auto-encoders[12, 19] further enable real-time hand mesh prediction. Despite these advancements in capturing accurate hand poses, these methods still fall short in preserving intricate geometric details.
Recently, neural implicit representations[20, 21] have emerged as powerful tools for synthesizing novel views of static scenes. Some studies[22, 23, 24, 25, 26, 16] have extended these methods to articulated objects, notably the human body, to facilitate photo-realistic rendering. LiveHand[8] achieves real-time rendering through a neural implicit representation along with a super-resolution renderer. Karunratanakul et al.[6] present a self-shadowing hand renderer. Corona et al.[16] introduce a neural model, LISA, that predicts the color and the signed distance with respect to each hand bone independently. Despite the promising results, it struggles to capture intricate high-frequency details and lacks the capability of real-time rendering. Meanwhile, Chen et al.[27] make use of occupancy and illumination fields to obtain hand geometry, while the generated geometry lacks intricate details and appears as an overly smooth surface. These methods have difficulties in recovering the detailed geometry that usually plays a crucial role in photo-realistic rendering.
In addition to hand modeling methods, several studies have focused on reconstructing animatable human bodies or animals[28, 29, 30, 31, 32, 33, 34, 35]. Building accurate human body models presents significant challenges due to the complex deformations involved, particularly in capturing fine details such as textures and scan-like appearances, especially in smaller areas like hands and faces[5, 23, 36, 37, 25, 38]. To address these challenges, several approaches have been developed with detailed 3D scans. For instance, previous works[22, 39, 40] have focused on establishing correspondences between pose space and standard space through techniques such as linear blend skinning and inverse skinning weights. These advancements collectively contribute to more precise and realistic human body modeling, while their results for hand modeling remain overly smooth.
To address these challenges, we propose XHand, an expressive hand avatar that achieves real-time performance (see Fig.1). Our approach includes feature embedding modules that predict hand deformation displacements, vertex albedo, and linear blending skinning (LBS) weights using a subdivided MANO model[9]. These modules utilize average features of the hand mesh and compute feature offsets for different poses, addressing the difficulty in directly learning dynamic personalized hand color and texture due to significant pose-dependent variations. By distinguishing between average and pose-dependent features, our modules simplify the training task and improve result accuracy. Additionally, we incorporate a part-aware Laplace smoothing term to enhance the efficiency of geometric information extraction from images, applying various levels of regularization.
To achieve photo-realistic hand rendering, we use a mesh-based neural renderer that leverages latent codes from the feature embedding modules, maintaining topological consistency. This method preserves detailed features and minimizes artifacts through various regularization levels. We evaluate our approach using the InterHand2.6M dataset[41] and the DeepHandMesh collection[19]. Experimental results show that XHand outperforms previous methods, providing high-fidelity meshes and real-time rendering of hands in various poses.

Our main contributions are summarized as follows:
- A real-time expressive hand avatar with high-fidelity results on both rendering and geometry, which is trained with an effective part-aware Laplace smoothing strategy;
- An effective feature embedding module that simplifies the training objectives and enhances prediction accuracy by distinguishing invariant average features from pose-dependent features;
- An end-to-end framework to create photo-realistic and fine-grained hand avatars. The promising results indicate that our method outperforms previous approaches.
The remainder of this paper is arranged as follows. Related works are introduced in Section II. The proposed XHand model and the corresponding training process are thoroughly depicted in Section III. The experimental results and discussion are presented in Section IV. Finally, Section V sets out the conclusion of this paper and discusses the limitations.
II Related Work
II-A Parametric Model-based Method
3D animatable human models[10, 9, 5] enable shape deformation and animation by decoding low-dimensional parameters into a high-dimensional space. Loper et al.[10] introduce a linear model to explicitly represent the human body by adjusting shape and pose parameters. The MANO hand model[9] utilizes a rigged hand mesh with fixed topology that can be easily deformed according to the parameters. However, the low resolution of the template mesh hinders its application in scenarios requiring higher precision. To address this limitation, Li et al.[7] integrate muscle groups with shape registration, which results in an optimized mesh with a finer appearance. Furthermore, parametric model-based methods[43, 44, 45, 11, 1, 2, 46, 47, 48] have shown promising results in accurately recovering hand poses from input images; however, they have difficulty in effectively capturing textures and geometric details for the resulting meshes. In this paper, our proposed XHand approach is able to capture the fine details of both appearance and geometry by taking advantage of the Lambertian reflectance model[49].
II-B Model-free Approach
Parametric models have proven valuable in incorporating prior knowledge of pose and shape for hand geometry reconstruction[9], while their representation capability is restricted by the low resolution of the template mesh. To address this issue, Choi et al.[17] introduce a network based on graph convolutional neural networks (GCN) that directly estimates the 3D coordinates of a human mesh from a 2D human pose. Chen et al.[18] present a UV-based representation of the 3D hand mesh to estimate hand vertex positions. Mobrecon[50] predicts the hand mesh in real-time through a 2D encoder and a 3D decoder. Despite the encouraging results, the above methods still cannot capture the geometric details of the hand. Moon et al.[19] propose an encoder-decoder framework that employs a template mesh to learn corrective parameters for pose and appearance. Although it achieves improved geometry and articulated deformation, it has difficulty in rendering photo-realistic hand images. Gan et al.[51] introduce an optimized pipeline that utilizes multi-view images to reconstruct a static hand mesh. Unfortunately, it overlooks the variations due to joint movements. Karunratanakul et al.[6] design a shadow-aware differentiable rendering scheme that optimizes the albedo and normal map to represent a hand avatar; however, its geometry remains overly smooth. In contrast to the above methods, our proposed XHand approach is able to simultaneously synthesize detailed geometry and photo-realistic images for drivable hands.
II-C Neural Hand Representation
There are various alternatives available for neural hand representations, such as HandAvatar[27], HandNeRF[26], LISA[16] and LiveHand[8]. In order to achieve high-fidelity rendering of human hands, Chen et al.[27] propose HandAvatar to generate photo-realistic hand images with arbitrary poses, which takes into account both occupancy and illumination fields. LISA[16] is a neural implicit model with hand textures, which focuses on signed distance functions (SDFs) and volumetric rendering. Mundra et al.[8] propose LiveHand, which makes use of a low-resolution NeRF representation to describe dynamic hands and a CNN-based super-resolution module to facilitate high-quality rendering. Despite their efficiency in rendering hand images, it is hard for those approaches to capture the details of the hand mesh geometry. Luan et al.[52] introduce a frequency decomposition loss to estimate the personalized hand shape from a single image, which effectively addresses the challenge of data scarcity. Chen et al. introduce a spatially varying linear lighting model as a neural renderer to preserve personalized fidelity and sharp details under natural illumination. Zheng et al. facilitate the creation of detailed hand avatars from a single image by learning and utilizing data-driven hand priors. In this work, our presented XHand method focuses on synthesizing hand avatars with fine-grained geometry in real-time.
II-D Generic Animatable Objects
In addition to the aforementioned methods on hand modeling, there have been studies reconstructing animatable whole or partial human bodies or animals[28, 29, 30]. Face models primarily focus on facial expressions, appearance, and texture, rather than handling large-scale deformations[32, 33, 34, 35]. Zheng et al.[32] bridge the gap between explicit mesh and implicit representations with a deformable point-based model that incorporates intrinsic albedo and normal shading. For human body models[5, 23, 53, 36, 37, 25, 38], numerous challenges arise from the intricate deformations, which make it arduous to precisely capture intricate details, such as textures and scan-like appearances, especially in smaller areas like the hands and face. Previous works[22, 39, 40] have explored establishing correspondences between pose space and template space through linear blend skinning and inverse skinning weights. Alldieck et al.[13] employ learning-based implicit representations to model human bodies via SDFs. Chen et al.[23] propose a forward skinning model that finds all canonical correspondences of deformed points. Shen et al.[54] introduce X-Avatar to achieve high fidelity for rigged human bodies, employing part-aware sampling and initialization strategies to learn neural shapes and deformation fields.
III Method

Given multi-view image sequences of a hand, together with the pose and shape parameters of the corresponding parametric hand model MANO[9], our proposed approach aims to simultaneously recover an expressive personalized hand mesh with fine details and render photo-realistic images in real-time. Fig. 2 shows an overview of our method. Given the hand pose parameters $\theta_t$, the fine-grained posed mesh is obtained from the feature embedding modules (Sec. III-A), which are designed to predict the Linear Blending Skinning (LBS) weights, vertex displacements and albedo by combining the average features of the mesh with pose-driven feature offsets. With the refined mesh, the mesh-based neural renderer achieves real-time photo-realistic rendering from the vertex albedo, normals, and latent codes of the feature embedding modules.
III-A Detailed Hand Representation
In this paper, the parametric hand model MANO[9] is employed to initialize the hand geometry. It maps the pose parameters $\theta$ (per-bone rotations) and the shape parameters $\beta$ onto a template mesh $\bar{T}$. The mapping is based on linear blending skinning with the weights $\mathcal{W}$, so that the posed hand mesh can be obtained by

$$M(\beta, \theta) = \mathrm{LBS}\big(\bar{T}(\beta), J(\beta), \theta, \mathcal{W}\big), \tag{1}$$

where $J(\beta)$ denotes the joint locations regressed from the shape parameters.

Geometry Refinement. After increasing the MANO mesh resolution with the subdivision method in[27], a personalized vertex displacement field $D$ is introduced to allow extra deformation for each vertex of the template mesh. The refined posed hand mesh can be computed as

$$\hat{M}(\beta, \theta) = \mathrm{LBS}\big(\bar{T}(\beta) + D, J(\beta), \theta, \mathcal{W}\big). \tag{2}$$
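For concreteness, the following PyTorch-style sketch shows how Eq. 2 can be evaluated once the per-joint rigid transforms for a given pose are available; the function name, tensor shapes and helper are illustrative assumptions rather than the released implementation.

```python
import torch

def lbs_with_displacement(template, disp, lbs_weights, joint_transforms):
    """Pose a displaced template mesh with linear blend skinning (cf. Eq. 2).

    template:         (N, 3) canonical vertex positions
    disp:             (N, 3) per-vertex displacement field D
    lbs_weights:      (N, J) skinning weights W (rows sum to 1)
    joint_transforms: (J, 4, 4) rigid transform of each joint for the given pose
    """
    verts = template + disp                                    # displaced rest pose
    ones = torch.ones_like(verts[:, :1])
    verts_h = torch.cat([verts, ones], dim=-1)                 # homogeneous (N, 4)
    # Blend the per-joint transforms with the skinning weights: (N, 4, 4)
    blended = torch.einsum("nj,jab->nab", lbs_weights, joint_transforms)
    posed = torch.einsum("nab,nb->na", blended, verts_h)[:, :3]
    return posed
```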
The original MANO mesh[9], consisting of 778 vertices and 1538 faces, has limited capacity to accurately represent fine-grained details[27]. To overcome this limitation, we enhance the mesh resolution by employing a uniform subdivision strategy on the MANO template mesh, as shown in Fig. 3. By adding a new vertex at the midpoint of each edge three times, we obtain a refined mesh with 49,281 vertices and 98,432 faces. To associate skinning weights with these additional vertices, we average the weights assigned to the endpoints of the corresponding edges.
Let $\mathcal{S}$ denote the subdivision function for the MANO mesh. The high-resolution template mesh $\bar{T}_s$ and its LBS weights $\mathcal{W}_s$ can be extracted as

$$(\bar{T}_s, \mathcal{W}_s) = \mathcal{S}(\bar{T}, \mathcal{W}). \tag{3}$$
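A minimal NumPy sketch of one midpoint-subdivision step is given below; applying it three times to the MANO template reproduces the refinement described above. The helper name `subdivide_once` and the data layout are assumptions made for illustration.

```python
import numpy as np

def subdivide_once(verts, faces, lbs_weights):
    """One midpoint-subdivision step on a triangle mesh.

    verts:       (V, 3) vertex positions
    faces:       (F, 3) vertex indices
    lbs_weights: (V, J) skinning weights; new vertices average their edge endpoints
    """
    verts, weights = list(verts), list(lbs_weights)
    edge_mid = {}  # (i, j) with i < j -> index of the midpoint vertex

    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in edge_mid:
            edge_mid[key] = len(verts)
            verts.append(0.5 * (np.asarray(verts[i]) + np.asarray(verts[j])))
            weights.append(0.5 * (np.asarray(weights[i]) + np.asarray(weights[j])))
        return edge_mid[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [[a, ab, ca], [ab, b, bc], [ca, bc, c], [ab, bc, ca]]

    return np.array(verts), np.array(new_faces), np.array(weights)
```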
To enhance the fidelity of the hand geometry, the vertex displacements and the LBS weights are made pose-dependent for each individual, which enables an accurate representation of the deformation under different poses. To this end, we propose two feature embedding modules, $E_w$ and $E_d$, to better capture the intricate details of the hand mesh: the LBS weights are derived from the LBS embedding $E_w$, while the displacement embedding $E_d$ generates the vertex displacements. Given the hand pose parameters $\theta_t$ of frame $t$, the mesh features are predicted as

$$\mathcal{W}_t = E_w(\theta_t), \qquad D_t = E_d(\theta_t). \tag{4}$$
Thus, the refined mesh at time $t$ can be formulated as

$$\hat{M}_t = \mathrm{LBS}\big(\bar{T}_s + D_t, J(\beta), \theta_t, \mathcal{W}_t\big). \tag{5}$$
Feature Embedding Module. Generally, it is challenging to learn the distinctive hand features in different poses. To better separate the deformation caused by changes in posture from the inherent characteristics of the hand, we present an efficient feature embedding module. It relies on the average features of the hand mesh and computes offsets of the features in different poses, as illustrated in Fig. 4.
Given a personalized hand mesh and its pose $\theta_t$ at time $t$, our feature embedding module extracts mesh features as

$$F_t = \bar{F} + \Delta F_t, \tag{6}$$

where $\bar{F}$ denotes the average vertex features of the hand mesh and $\Delta F_t$ is the pose-dependent feature offset.

To represent the mesh features of the personalized hand generated with hand pose $\theta_t$, we design the following embedding function

$$\Delta F_t = U \cdot f(\theta_t, z), \tag{7}$$

where $z$ is a per-vertex latent code used to encode different vertices, and $f$ denotes a pose decoder composed of multi-layer perceptrons (MLPs). It projects the pose and latent code onto the implicit space $\mathcal{H}$. To align with the feature space, $U$ is the mapping matrix that converts the implicit space $\mathcal{H}$ into the feature space $\mathcal{F}$, which subjects to

$$U: \mathcal{H} \rightarrow \mathcal{F}, \qquad U \in \mathbb{R}^{\dim(\mathcal{F}) \times \dim(\mathcal{H})}. \tag{8}$$
The personalized mesh features can thus be derived by combining the average vertex features and the pose-dependent offsets. Consequently, the LBS weights are obtained from the average LBS weights $\bar{\mathcal{W}}$, the pose decoder $f_w$, the latent code $z_w$ and the mapping matrix $U_w$ as

$$\mathcal{W}_t = \bar{\mathcal{W}} + U_w \cdot f_w(\theta_t, z_w). \tag{9}$$
Similarly, the vertex displacements can be obtained as

$$D_t = \bar{D} + U_d \cdot f_d(\theta_t, z_d), \tag{10}$$

where $\bar{D}$ denotes the average displacements, and $f_d$, $z_d$ and $U_d$ are the pose decoder, latent code and mapping matrix for the displacements, respectively. The depths of the pose decoders within the LBS embedding module and the albedo embedding module are set to 5, with each layer consisting of 128 neurons. Additionally, the depth of the pose decoder within the displacement embedding module is 8, where the number of neurons is 512.
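A feature embedding module of this form can be sketched in PyTorch as below, assuming a learnable average feature, a per-vertex latent code, an MLP pose decoder and a linear mapping layer; the class name, initialization and default sizes are illustrative rather than the exact released implementation.

```python
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Sketch of one feature embedding module (Eqs. 6-10): a learnable average
    per-vertex feature plus a pose-dependent offset decoded by an MLP and mapped
    back to the feature space by a linear layer."""

    def __init__(self, num_verts, feat_dim, pose_dim, latent_dim=10,
                 hidden=128, depth=5):
        super().__init__()
        # F_bar: average vertex features, initialized from the first-frame fit in practice.
        self.avg_feat = nn.Parameter(torch.zeros(num_verts, feat_dim))
        self.latent = nn.Parameter(torch.randn(num_verts, latent_dim) * 0.01)  # z
        layers, in_dim = [], pose_dim + latent_dim
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        self.decoder = nn.Sequential(*layers)          # pose decoder f
        self.mapping = nn.Linear(hidden, feat_dim)     # mapping matrix U

    def forward(self, pose):
        # pose: (pose_dim,) -> broadcast to every per-vertex latent code
        pose = pose.unsqueeze(0).expand(self.latent.shape[0], -1)
        offset = self.mapping(self.decoder(torch.cat([pose, self.latent], dim=-1)))
        return self.avg_feat + offset                  # F_t = F_bar + dF_t
```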
Remark. The feature embedding modules allow for the interpretable acquisition of hand features corresponding to the pose $\theta_t$. The average mesh features are stored in $\bar{F}$, while the feature offsets are driven by the pose. More importantly, the training objectives are greatly simplified by taking the average feature constraints into account, which leads to faster convergence and improved accuracy.
III-B Mesh Rendering
Inverse Rendering. In order to achieve rapid and differentiable rendering of the detailed mesh $\hat{M}_t$, an inverse renderer is employed to synthesize hand images. Assuming that the skin color follows the Lambertian reflectance model[55], the rendered image can be calculated from the Spherical Harmonics coefficients $\gamma$, the vertex normals $N_t$, and the vertex albedo $\rho_t$ as

$$\hat{I}_c = \Pi_{P_c}\Big(\hat{M}_t,\; \rho_t \odot \sum_{k=1}^{9} \gamma_k\, \mathrm{SH}_k(N_t)\Big), \tag{11}$$

where $P_c$ is the camera parameter of the $c$-th viewpoint and $\Pi$ denotes the differentiable rasterization. $\mathrm{SH}_k$ represents the Spherical Harmonics (SH) basis functions of the third order, and $N_t$ is the vertex normals computed from the vertices of the mesh $\hat{M}_t$. Similar to Eq. 4, the pose-dependent albedo can be obtained from the feature embedding module with the average vertex albedo $\bar{\rho}$, pose decoder $f_a$, latent code $z_a$ and mapping matrix $U_a$ as

$$\rho_t = \bar{\rho} + U_a \cdot f_a(\theta_t, z_a). \tag{12}$$
By analyzing how the variations in brightness relate to the hand shape, inverse rendering with the Lambertian reflectance model can effectively disentangle geometry and appearance.
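A minimal sketch of the third-order spherical-harmonics shading appearing in Eq. 11 is shown below; it assumes unit vertex normals and one set of nine SH coefficients per color channel.

```python
import torch

def sh_shading(albedo, normals, sh_coeffs):
    """Per-vertex Lambertian shading with 3rd-order (9-band) spherical harmonics.

    albedo:    (N, 3) per-vertex albedo
    normals:   (N, 3) unit vertex normals
    sh_coeffs: (9, 3) SH lighting coefficients (one set per color channel)
    """
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    ones = torch.ones_like(x)
    # Real SH basis up to degree 2, evaluated at the normal directions: (N, 9).
    basis = torch.stack([
        0.282095 * ones,
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ], dim=-1)
    shading = basis @ sh_coeffs          # (N, 3) irradiance per channel
    return albedo * shading              # Lambertian: albedo times irradiance
```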
Mesh-based Neural Rendering. NeRF-based methods usually employ volumetric rendering along the camera rays to acquire pixel colors[26, 8], which requires a large amount of training time. Instead, we aim to minimize the sampling time and enhance the rendering quality with a mesh-based neural rendering method that takes advantage of the consistent topology of our refined mesh.
The mesh is explicitly represented by triangular facets, so the intersection points between rays and the mesh are located within the facets. The features that describe the mesh, such as position, color, and normal, are associated with the vertices. Consequently, the attributes of an intersection point can be calculated by interpolating the attributes of the three vertices of its triangular facet. Efficient differentiable rasterization[56] ensures the feasibility of inverse rendering and mesh-based neural rendering.
Given a camera view, our mesh-based neural renderer synthesizes the image with respect to the per-pixel position, normal, feature vector and ray direction, where the position, normal and feature vector are obtained by interpolation within the rasterized facets. The feature vector fed to the neural renderer contains the latent codes $z_d$ and $z_a$ detached from the displacement and albedo embedding modules, together with a learnable per-vertex feature[51] that represents the latent code of vertices during rendering. As in[20], the neural network comprises 8 fully-connected layers with ReLU activations and 256 channels per layer, excluding the output layer. Furthermore, it includes a skip connection that concatenates the input to the fifth layer, as depicted in Fig. 5.
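The renderer can be sketched as the following PyTorch module, assuming the per-pixel attributes have already been interpolated by the rasterizer; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class MeshNeuralRenderer(nn.Module):
    """Sketch of the mesh-based neural renderer: an 8-layer ReLU MLP with 256
    channels and a skip connection that concatenates the input to the fifth
    layer, mapping interpolated per-pixel attributes to RGB."""

    def __init__(self, in_dim, hidden=256, depth=8, skip=4):
        super().__init__()
        self.skip = skip
        dims = [in_dim] + [hidden] * depth
        self.layers = nn.ModuleList()
        for i in range(depth):
            d_in = dims[i] + (in_dim if i == skip else 0)
            self.layers.append(nn.Linear(d_in, hidden))
        self.out = nn.Linear(hidden, 3)

    def forward(self, feats):
        # feats: (P, in_dim) concatenation of position, normal, ray direction
        # and the interpolated latent codes for P shaded pixels.
        h = feats
        for i, layer in enumerate(self.layers):
            if i == self.skip:
                h = torch.cat([h, feats], dim=-1)
            h = torch.relu(layer(h))
        return torch.sigmoid(self.out(h))   # RGB in [0, 1]
```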

III-C Training Process
To obtain a personalized hand representation, the parameters of the three feature embedding modules $E_w$, $E_d$ and $E_a$, as well as the neural renderer, need to be optimized from multi-view image sequences. Our training process consists of three steps: initialization, training the feature embedding modules, and training the mesh-based neural renderer.
Initialization of XHand. When training our proposed XHand model, the average features stored in the feature embedding modules significantly affect training efficiency and results. Random initialization hampers training due to estimation errors in the displacements and albedo, which may lead to the failure of inverse rendering. Therefore, it is crucial to initialize the neural hand representation. To this end, the reconstruction result of the first frame is treated as the initial model.
Inspired by[58, 51], the XHand model is initialized from multi-view images. The vertex displacements and vertex albedo of the hand mesh are jointly optimized through inverse rendering, where the mesh is generated by Eq. 2 and the rendering equation is the same as Eq. 11. The loss function during initialization is formulated as

$$\mathcal{L}_{\text{init}} = \|I - \hat{I}\|_1 + \lambda_{d}\,\|L D\|_2^2 + \lambda_{a}\,\|L \rho\|_2^2, \tag{13}$$

where $L$ is the Laplacian matrix[59] and $\hat{I}$ denotes the rendered image. The Laplacian terms on the displacements $D$ and albedo $\rho$ are employed to regularize the mesh optimization, as the mesh features are supposed to be smooth. Uniform weights of the Laplacian matrix are adopted in training. The optimized displacements and albedo are used to initialize $\bar{D}$ and $\bar{\rho}$, while the initialization of $\bar{\mathcal{W}}$ is directly derived from the MANO model[9].
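A possible implementation of the uniform Laplacian regularizers used during initialization is sketched below, assuming an edge list of the subdivided mesh is available; the helper names are illustrative.

```python
import torch

def uniform_laplacian(num_verts, edges):
    """Sparse uniform (graph) Laplacian L = I - D^{-1} A.

    edges: (E, 2) long tensor of unique undirected vertex index pairs."""
    i = torch.cat([edges[:, 0], edges[:, 1]])
    j = torch.cat([edges[:, 1], edges[:, 0]])
    deg = torch.zeros(num_verts).index_add_(0, i, torch.ones(i.shape[0]))
    off_vals = -1.0 / deg[i]                       # -1/deg for each neighbor entry
    diag = torch.arange(num_verts)
    idx = torch.stack([torch.cat([i, diag]), torch.cat([j, diag])])
    vals = torch.cat([off_vals, torch.ones(num_verts)])
    return torch.sparse_coo_tensor(idx, vals, (num_verts, num_verts)).coalesce()

def laplacian_term(L, feat):
    """Smoothness penalty ||L f||^2 applied to displacements or albedo."""
    return torch.sparse.mm(L, feat).pow(2).sum()
```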
Loss Functions of Feature Embedding. Inverse rendering is utilized to learn the parameters of the three feature embedding modules $E_w$, $E_d$ and $E_a$. The overall objective minimizes the rendering error together with regularization terms:

$$\mathcal{L} = \mathcal{L}_{c} + \mathcal{L}_{\text{reg}}, \tag{14}$$
where $\mathcal{L}_{c}$ represents the rendering loss and $\mathcal{L}_{\text{reg}}$ is the regularization term. Inspired by[60], we combine an $\ell_1$ error with an SSIM term to form the rendering loss

$$\mathcal{L}_{c} = (1-\lambda)\,\|I - \hat{I}\|_1 + \lambda\,\big(1 - \mathrm{SSIM}(I, \hat{I})\big), \tag{15}$$

where $\lambda$ denotes the trade-off coefficient.
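The rendering loss of Eq. 15 can be written compactly as below, assuming an off-the-shelf SSIM implementation such as the pytorch_msssim package; the default trade-off value is only illustrative.

```python
import torch
from pytorch_msssim import ssim  # assumed third-party SSIM implementation

def render_loss(pred, gt, lam=0.2):
    """L1 + SSIM rendering loss (Eq. 15); lam is the trade-off coefficient.

    pred, gt: (B, 3, H, W) images with values in [0, 1]."""
    l1 = (pred - gt).abs().mean()
    ssim_term = 1.0 - ssim(pred, gt, data_range=1.0)
    return (1.0 - lam) * l1 + lam * ssim_term
```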
To enhance the efficiency of extracting geometric information from images, we introduce the part-aware Laplace smoothing term. The Laplacian of a mesh feature $F$ is defined as $\delta_F = L F$. Hierarchical weights $w$ are introduced to balance the regularization via different levels of smoothness. The entry $w_i$ of the weight vector is defined as

$$w_i = \begin{cases} \mu_1, & |\delta_{F,i}| < \tau_1, \\ \mu_2, & \tau_1 \le |\delta_{F,i}| < \tau_2, \\ \mu_3, & |\delta_{F,i}| \ge \tau_2, \end{cases} \tag{16}$$

where $\tau_1$ and $\tau_2$ represent the threshold values for the hierarchical weighting, and $\mu_1$, $\mu_2$ and $\mu_3$ denote the balancing coefficients. The part-aware Laplace smoothing is used to reduce excessive roughness in albedo and displacement without affecting the fine details, and is defined as

$$\mathcal{L}_{\text{lap}}(F) = \|w \odot \delta_F\|_2^2. \tag{17}$$

By employing varying degrees of hierarchical weights to trade off Laplacian smoothing, $\mathcal{L}_{\text{lap}}$ is able to better constrain feature optimization in different scenarios. In our case, minor irregularities are considered acceptable, while excessive changes are undesirable. Therefore, the thresholds can be dynamically controlled through the quantiles of the Laplacian $\delta_F$, where entries greater than the threshold are assigned larger balancing coefficients.
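A minimal sketch of this quantile-controlled weighting is given below; the specific quantile and the two balancing coefficients are placeholders rather than the values used in our experiments.

```python
import torch

def part_aware_laplacian_loss(L, feat, q=0.25, w_small=1.0, w_large=10.0):
    """Sketch of the part-aware Laplace smoothing (Eqs. 16-17): entries of the
    feature Laplacian above a quantile-based threshold receive a larger weight,
    so excessive changes are penalized more than minor irregularities.

    L: sparse (V, V) Laplacian; feat: (V, C) albedo or displacement;
    q, w_small, w_large: illustrative quantile and balancing coefficients."""
    delta = torch.sparse.mm(L, feat)          # per-vertex Laplacian of the feature
    mag = delta.norm(dim=-1)                  # (V,)
    tau = torch.quantile(mag, q)              # threshold from the quantile
    weights = torch.where(mag > tau,
                          torch.full_like(mag, w_large),
                          torch.full_like(mag, w_small))
    return (weights.unsqueeze(-1) * delta).pow(2).sum()
```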
The following regularization terms are introduced to conform the optimized mesh to the hand geometry:

$$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{lap}}(\rho) + \mathcal{L}_{\text{lap}}(D) + \lambda_{m}\,\mathcal{L}_{\text{mask}} + \lambda_{e}\,\mathcal{L}_{\text{edge}} + \lambda_{\delta}\,\mathcal{L}_{\text{disp}}, \tag{18}$$

where $\mathcal{L}_{\text{lap}}(\rho)$ and $\mathcal{L}_{\text{lap}}(D)$ are the part-aware Laplacian smoothing terms that keep the albedo and displacements smooth during training. $\mathcal{L}_{\text{mask}}$, $\mathcal{L}_{\text{edge}}$ and $\mathcal{L}_{\text{disp}}$ are utilized to ensure that the optimized hand mesh remains close to the MANO model, and each term is assigned a constant coefficient $\lambda_{m}$, $\lambda_{e}$ or $\lambda_{\delta}$. $\mathcal{L}_{\text{mask}}$ represents the loss between the mask rendered during inverse rendering and the original MANO mask. $\mathcal{L}_{\text{edge}}$ penalizes the edge length changes with respect to the MANO mesh as $\mathcal{L}_{\text{edge}} = \sum_{(i,j)\in\mathcal{E}} \big|\, \|v_i - v_j\|_2 - e_{ij} \,\big|$, where $\|v_i - v_j\|_2$ is the Euclidean distance between adjacent vertices $v_i$ and $v_j$ on the mesh edges and $e_{ij}$ denotes the corresponding edge length of the subdivided MANO mesh $\bar{T}_s$. $\mathcal{L}_{\text{disp}}$ is employed to constrain the magnitude of the displacements.
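The edge-length term can be sketched as follows, assuming the reference edge lengths of the subdivided MANO template have been precomputed; names are illustrative.

```python
import torch

def edge_length_loss(verts, edges, ref_lengths):
    """Penalize deviation of edge lengths from those of the subdivided MANO mesh.

    verts:       (V, 3) optimized vertex positions
    edges:       (E, 2) vertex index pairs
    ref_lengths: (E,) edge lengths of the subdivided MANO template"""
    cur = (verts[edges[:, 0]] - verts[edges[:, 1]]).norm(dim=-1)
    return (cur - ref_lengths).abs().mean()
```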
Loss Functions of Neural Renderer. Once the latent codes $z_d$ and $z_a$ of the displacement and albedo embedding modules are detached, the neural renderer is optimized by minimizing the residuals between the rendered image $\hat{I}_{\text{nr}}$ and the ground truth, in the same form as Eq. 15:

$$\mathcal{L}_{\text{nr}} = (1-\lambda')\,\|I - \hat{I}_{\text{nr}}\|_1 + \lambda'\,\big(1 - \mathrm{SSIM}(I, \hat{I}_{\text{nr}})\big), \tag{19}$$

where $\lambda'$ denotes the balancing coefficient.
IV Experiments
IV-A Datasets
InterHand2.6M. The InterHand2.6M dataset[41] is a large collection of hand images with MANO annotations. It includes multi-view temporal sequences of both single and interacting hands. The experiments primarily utilize the 5 FPS version of this dataset.
DeepHandMesh. The DeepHandMesh dataset[19] features images captured from five different viewpoints, matching the resolution of those in InterHand2.6M. It also provides corresponding 3D hand scans, facilitating the validation of mesh reconstruction quality against 3D ground truth data.
IV-B Experimental Setup
Implementation Details. In the experiments, our proposed XHand model is mainly trained and evaluated on the 5 FPS version of the InterHand2.6M dataset[41], which consists of large-scale multi-view sequences capturing a wide range of hand poses. Each sequence contains dozens of multi-view images. As in[27, 26], the XHand model is trained on the InterHand2.6M dataset with 20 views across 50 frames for each sequence, and the remaining frames are used for evaluation. To assess the quality of mesh reconstruction, we conduct experiments on the DeepHandMesh dataset[19], which provides 3D hand scans along with images captured from five different views; these images have the same resolution as those in the InterHand2.6M dataset. All experiments are conducted on a PC with an NVIDIA RTX 3090 GPU with 24 GB of memory.

We employ PyTorch and the Adam optimizer. To facilitate differentiable rasterization, we make use of the off-the-shelf renderer nvdiffrast[56]. As in[57], positional encoding is applied to the inputs of the rendering network. In our training process, the feature embedding modules are first trained for 500 epochs using inverse rendering. Then, the feature embedding modules and the neural renderer are jointly trained for 500 epochs, where the average features in the feature embedding modules are updated every 50 epochs. The loss coefficients are chosen empirically. To avoid excessive displacements and color variations, the thresholds of the part-aware smoothing terms are set from the quantiles of the corresponding feature Laplacians (the first quartile for one term and the median for the other), with larger balancing coefficients assigned above the thresholds. The lengths of the latent codes in the three feature embedding modules and of the per-vertex rendering feature are set to 10, 10, 10 and 20, respectively.
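For reference, a standard NeRF-style positional encoding, as applied to the renderer inputs, can be sketched as below; the number of frequency bands is an illustrative choice.

```python
import torch

def positional_encoding(x, num_freqs=6):
    """NeRF-style positional encoding applied to renderer inputs such as
    positions and ray directions.

    x: (..., D) -> (..., D + D * 2 * num_freqs)"""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype, device=x.device)
    scaled = x.unsqueeze(-1) * freqs                     # (..., D, F)
    enc = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)
    return torch.cat([x, enc.flatten(-2)], dim=-1)
```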
Evaluation Metrics. In the experiments, we fit the hand mesh representation to multi-view image sequences of a single subject. For fair comparison, we employ the same evaluation metrics as in[8, 27, 26], measuring the synthesized results with the peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and learned perceptual image patch similarity (LPIPS). To assess the accuracy of the reconstructed hand mesh, we calculate the average point-to-surface Euclidean distance (P2S) in millimeters; the Chamfer distance is considered unsuitable due to scale variations between MANO and the 3D scans.
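The P2S metric can be computed with an off-the-shelf mesh library such as trimesh, as sketched below; the function name and the assumption that predictions and scans share metric units (millimeters) are illustrative.

```python
import numpy as np
import trimesh

def point_to_surface(pred_vertices, scan_mesh_path):
    """Average point-to-surface (P2S) distance: for every predicted vertex,
    measure the distance to the closest point on the ground-truth scan surface."""
    scan = trimesh.load(scan_mesh_path, process=False)
    _, dists, _ = trimesh.proximity.closest_point(scan, np.asarray(pred_vertices))
    return dists.mean()
```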
IV-C Experimental Results
To investigate the efficacy of our proposed XHand, we treat the subdivided MANO model[9] with vertex albedo as our baseline, which has the merit of an efficient explicit representation. Moreover, we compare our model against several rigged hand representation methods, including LISA[16], HandAvatar[27], HandNeRF[26], and LiveHand[8]. For fair comparison, LiveHand is re-trained with the same setting and LISA is reproduced by[8].
Model | LPIPS | PSNR | SSIM | FPS |
MANO[9] with albedo | 0.026 | 28.56 | 0.972 | 306.0
HandAvatar[27] | 0.050 | 33.01 | 0.933 | 0.2 |
LISA[16] | 0.078 | 29.36 | - | 3.7 |
HandNeRF[26] | 0.048 | 33.02 | 0.974 | - |
LiveHand[8] | 0.025 | 33.79 | 0.985 | 45.5 |
Ours | 0.012 | 34.32 | 0.986 | 56.2 |
Model | Rigid fist | Relaxed | Thumb up | Average |
MANO[9] | 6.469 | 5.719 | 5.224 | 5.659 |
DHM[19] | 2.695 | 3.995 | 3.639 | 3.492 |
Ours | 2.593 | 2.189 | 2.162 | 2.276 |
We first perform a quantitative evaluation on rendering quality, as shown in Table I. The evaluation metrics of LISA[16] are adopted from LiveHand[8] and the results of HandNeRF[26] are obtained from their original paper. It can be seen that our proposed XHand approach achieves the best results with a PSNR of 34.3 dB. Our baseline drives a textured MANO model through LBS weights; lacking the ability to handle illumination changes across different scenes and poses, it exhibits some artifacts and reaches a PSNR of only 28.6 dB. NeRF-based methods[16, 27, 26, 8] present competitive PSNR results, but they rely on the MANO mesh without fine-grained geometry during rendering. By taking advantage of the fine-grained meshes estimated by XHand, our method outperforms the previous approaches using volumetric representations in terms of rendering quality. Benefiting from our design, XHand achieves 56 frames per second (FPS) at inference. Specifically, the feature embedding modules require 0.7 milliseconds, inverse rendering requires 15 milliseconds and the neural rendering module needs 0.1 milliseconds.
Table II shows the results on the DeepHandMesh dataset. Our method outperforms the annotated MANO mesh[9] and DHM[19] by 3.3 mm and 1.2 mm on P2S, respectively. This indicates that our proposed feature embedding module captures the underlying hand mesh deformation more accurately than the encoder-decoder scheme in DHM. More experimental results conducted on the DeepHandMesh[19] dataset are visualized in Fig. 7.

For better illustration, Fig. 6 shows more detailed comparisons of rendering and geometry on the InterHand2.6M test split. Due to its limited expressive capability, it is hard for the baseline MANO model[9] to capture muscle details that vary across different poses. Although the hand meshes generated by HandAvatar[27] have more details than MANO, they are still over-smooth compared to ours. In terms of geometry, our method exhibits more prominent skin wrinkles depending on the pose. The NeRF-based methods HandNeRF[26] and LiveHand[8] yield competitive rendering results, while they still rely on the MANO model and cannot obtain fine-grained hand geometry. On the contrary, our approach effectively produces an accurate hand representation by taking advantage of the feature embedding modules and the topologically consistent mesh model, resulting in enhanced rendering and geometry quality. Fig. 8 visualizes the results of different identities animated using reference poses.

The proposed method efficiently drives the personalized hand avatar from arbitrary hand gesture inputs. To demonstrate its performance, in-the-wild data serve as a reference for hand poses, as illustrated in Fig. 9. The pose parameters of the in-the-wild videos are extracted with HaMeR[42]. It is worth noting that we can enhance the vividness of the images by using different spherical harmonic coefficients for relighting.

IV-D Ablation Study

We perform extensive ablation experiments on the test set of the InterHand2.6M dataset to validate the contributions of the various modules and settings within our framework. First, we demonstrate the performance improvements achieved by the proposed feature embedding modules and the part-aware Laplace smoothing strategy, consistent with our design intentions. Second, we showcase the robust performance of our XHand model across different numbers of views, highlighting its effectiveness even with limited viewpoints. Furthermore, we conduct a comparative analysis of various neural rendering networks; based on this evaluation, we choose MLPs to balance inference speed and rendering quality, ensuring efficient and high-fidelity output. The following sections detail these ablation experiments and analyze the results.
Ablation Study on Different Components. In the first row of Fig. 10, it can be seen that our method clearly reproduces skeletal movements and skin changes. Moreover, our design resolves the issue of lighting variations. The proposed part-aware Laplacian regularization effectively reduces surface artifacts without sacrificing details, and the feature embedding modules guide the learning of hand avatars by distinguishing average features from pose-dependent features, which enhances reconstruction accuracy.
Model | LPIPS | PSNR | SSIM |
MANO[9] with albedo | 0.0257 | 28.56 | 0.9715
w/o feature embedding | 0.0139 | 32.81 | 0.9838 |
w/o part-aware Laplacian | 0.0129 | 32.87 | 0.9843
w/o Position Encoder | 0.0114 | 33.95 | 0.9853 |
Ours | 0.0123 | 34.32 | 0.9859 |
Num views | LPIPS | PSNR | SSIM |
1-view | 0.0209 | 29.34 | 0.9712 |
5-view | 0.0135 | 32.72 | 0.9823 |
10-view | 0.0129 | 33.50 | 0.9832 |
20-view | 0.0123 | 34.32 | 0.9859 |
30-view | 0.0091 | 35.23 | 0.9865 |
Table III shows that the level of mesh detail significantly affects image quality. The rendering results are substantially enhanced by the feature embedding modules. The part-aware Laplacian regularization yields more realistic geometric results, indirectly improving the accuracy of the neural renderer. Furthermore, the position encoder in neural rendering leads to better image quality.
Ablation Study on Number of Views. Typically, the performance of each model improves with the number of input images, particularly for NeRF-based methods, and insufficient training data may lead to reconstruction failure. We therefore conduct ablation experiments using different numbers of views as input. As shown in Table IV, we train the model on sequences of 1, 5, 10, 20 and 30 views to demonstrate the impact of the number of views. Despite being trained with a limited number of viewpoints, even a single one, our method effectively captures the hand articulations. Furthermore, we achieve competitive results when more than 10 input views are available.
Method | LPIPS | PSNR | SSIM | FPS |
XHand-MLPs | 0.012 | 34.32 | 0.986 | 56.2 |
XHand-UNet | 0.011 | 34.72 | 0.987 | 46.2 |
XHand-EG3D[61] | 0.013 | 32.3 | 0.981 | 40.4 |
Choices of Neural Rendering. Traditional neural radiance fields[20] typically employ 8-layer MLPs as the renderer. In contrast, our mesh-based network eliminates the need for sampling points along rays and renders directly from vertex features. Benefiting from the topological consistency, our neural renderer can also adopt a UNet[62], which leads to promising performance. To explore this, we conduct ablation experiments on both network architectures, as detailed in Table V. The results show that a 4-layer UNet achieves superior rendering quality, albeit at the expense of inference speed. Compared to the UNet, the MLPs improve inference speed by about 20% with only a marginal loss in accuracy. Therefore, we choose MLPs as our neural renderer. Furthermore, our investigation of a well-designed image generation network, EG3D[61], reveals that it is unsuitable for this neural rendering task.
V Conclusion
We present XHand, a real-time expressive hand avatar with photo-realistic rendering and fine-grained geometry. By taking advantage of the effective feature embedding modules to distinguish average features from pose-dependent features, we obtain finely detailed meshes with respect to hand poses. To ensure high-quality hand synthesis, our method employs a mesh-based neural renderer that takes mesh topological consistency into consideration. During training, we introduce the part-aware Laplace regularization to reduce artifacts while maintaining details through different levels of regularization. Rigorous evaluations conducted on the InterHand2.6M and DeepHandMesh datasets demonstrate the ability to produce high-fidelity geometry and texture for hand animations across a wide range of poses.
Our method relies on the accurate MANO annotations provided by the dataset during training. For future work, we will consider exploring an effective MANO parameter estimator.
References
- [1]B.Doosti, S.Naha, M.Mirbagheri, and D.J. Crandall, “Hope-net: A graph-based model for hand-object pose estimation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 6607–6616.
- [2]Y.Hasson, B.Tekin, F.Bogo, I.Laptev, M.Pollefeys, and C.Schmid, “Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 571–580.
- [3]H.Fan, T.Zhuo, X.Yu, Y.Yang, and M.Kankanhalli, “Understanding atomic hand-object interaction with human intention,” IEEE Trans. Circuit Syst. Video Technol., vol.32, no.1, pp. 275–285, 2021.
- [4]H.Cheng, L.Yang, and Z.Liu, “Survey on 3d hand gesture recognition,” IEEE Trans. Circuit Syst. Video Technol., vol.26, no.9, pp. 1659–1673, 2015.
- [5]G.Pavlakos, V.Choutas, N.Ghorbani, T.Bolkart, A.A.A. Osman, D.Tzionas, and M.J. Black, “Expressive body capture: 3d hands, face, and body from a single image,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 10 975–10 985.
- [6]K.Karunratanakul, S.Prokudin, O.Hilliges, and S.Tang, “Harp: Personalized hand reconstruction from a monocular rgb video,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 12 802–12 813.
- [7]Y.Li, L.Zhang, Z.Qiu, Y.Jiang, N.Li, Y.Ma, Y.Zhang, L.Xu, and J.Yu, “NIMBLE: a non-rigid hand model with bones and muscles,” ACM Trans. on Graph., pp. 120:1–120:16, 2022.
- [8]A.Mundra, J.Wang, M.Habermann, C.Theobalt, M.Elgharib et al., “Livehand: Real-time and photorealistic neural hand rendering,” in Int. Conf. Comput. Vis., 2023, pp. 18 035–18 045.
- [9]J.Romero, D.Tzionas, and M.J. Black, “Embodied hands: Modeling and capturing hands and bodies together,” ACM Trans. on Graph., pp. 245:1–245:17, 2017.
- [10]M.Loper, N.Mahmood, J.Romero, G.Pons-Moll, and M.J. Black, “SMPL: a skinned multi-person linear model,” ACM Trans. on Graph., pp. 248:1–248:16, 2015.
- [11]Z.Cao, I.Radosavovic, A.Kanazawa, and J.Malik, “Reconstructing hand-object interactions in the wild,” in Int. Conf. Comput. Vis., 2021, pp. 12 397–12 406.
- [12]G.M. Lim, P.Jatesiktat, and W.T. Ang, “Mobilehand: Real-time 3d hand shape and pose estimation from color image,” in International Conference on Neural Information Processing, 2020, pp. 450–459.
- [13]T.Alldieck, H.Xu, and C.Sminchisescu, “imghum: Implicit generative models of 3d human shape and articulated pose,” in Int. Conf. Comput. Vis., 2021, pp. 5441–5450.
- [14]J.Ren and J.Zhu, “Pyramid deep fusion network for two-hand reconstruction from rgb-d images,” IEEE Trans. Circuit Syst. Video Technol., 2024.
- [15]S.Guo, E.Rigall, Y.Ju, and J.Dong, “3d hand pose estimation from monocular rgb with feature interaction module,” IEEE Trans. Circuit Syst. Video Technol., vol.32, no.8, pp. 5293–5306, 2022.
- [16]E.Corona, T.Hodan, M.Vo, F.Moreno-Noguer, C.Sweeney, R.Newcombe, and L.Ma, “Lisa: Learning implicit shape and appearance of hands,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 20 501–20 511.
- [17]H.Choi, G.Moon, and K.M. Lee, “Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose,” in Eur. Conf. Comput. Vis., 2020, pp. 769–787.
- [18]P.Chen, Y.Chen, D.Yang, F.Wu, Q.Li, Q.Xia, and Y.Tan, “I2uv-handnet: Image-to-uv prediction network for accurate and high-fidelity 3d hand mesh modeling,” in Int. Conf. Comput. Vis., 2021, pp. 12 909–12 918.
- [19]G.Moon, T.Shiratori, and K.M. Lee, “Deephandmesh: A weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling,” in Eur. Conf. Comput. Vis., 2020, pp. 440–455.
- [20]B.Mildenhall, P.P. Srinivasan, M.Tancik, J.T. Barron, R.Ramamoorthi, and R.Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, pp. 99–106, 2021.
- [21]P.Wang, L.Liu, Y.Liu, C.Theobalt, T.Komura, and W.Wang, “Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,” Adv. Neural Inform. Process. Syst., vol.34, pp. 27 171–27 183, 2021.
- [22]C.-Y. Weng, B.Curless, P.P. Srinivasan, J.T. Barron, and I.Kemelmacher-Shlizerman, “Humannerf: Free-viewpoint rendering of moving people from monocular video,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 16 210–16 220.
- [23]X.Chen, Y.Zheng, M.J. Black, O.Hilliges, and A.Geiger, “SNARF: differentiable forward skinning for animating non-rigid neural implicit shapes,” in Int. Conf. Comput. Vis., 2021, pp. 11 574–11 584.
- [24]L.Liu, M.Habermann, V.Rudnev, K.Sarkar, J.Gu, and C.Theobalt, “Neural actor: Neural free-view synthesis of human actors with pose control,” ACM Trans. on Graph., pp. 1–16, 2021.
- [25]S.Peng, Y.Zhang, Y.Xu, Q.Wang, Q.Shuai, H.Bao, and X.Zhou, “Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 9054–9063.
- [26]Z.Guo, W.Zhou, M.Wang, L.Li, and H.Li, “Handnerf: Neural radiance fields for animatable interacting hands,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 21 078–21 087.
- [27]X.Chen, B.Wang, and H.-Y. Shum, “Hand avatar: Free-pose hand animation and rendering from monocular video,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 8683–8693.
- [28]G.Yang, C.Wang, N.D. Reddy, and D.Ramanan, “Reconstructing animatable categories from videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 16 995–17 005.
- [29]H.Luo, T.Xu, Y.Jiang, C.Zhou, Q.Qiu, Y.Zhang, W.Yang, L.Xu, and J.Yu, “Artemis: Articulated neural pets with appearance and motion synthesis,” ACM Trans. on Graph., pp. 164:1–164:19, 2022.
- [30]S.Wu, R.Li, T.Jakab, C.Rupprecht, and A.Vedaldi, “Magicpony: Learning articulated 3d animals in the wild,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 8792–8802.
- [31]C.Cao, T.Simon, J.K. Kim, G.Schwartz, M.Zollhöfer, S.Saito, S.Lombardi, S.Wei, D.Belko, S.Yu, Y.Sheikh, and J.M. Saragih, “Authentic volumetric avatars from a phone scan,” ACM Trans. on Graph., pp. 163:1–163:19, 2022.
- [32]Y.Zheng, W.Yifan, G.Wetzstein, M.J. Black, and O.Hilliges, “Pointavatar: Deformable point-based head avatars from videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 21 057–21 067.
- [33]Y.Zheng, V.F. Abrevaya, M.C. Bühler, X.Chen, M.J. Black, and O.Hilliges, “I M avatar: Implicit morphable head avatars from videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 13 535–13 545.
- [34]P.Grassal, M.Prinzler, T.Leistner, C.Rother, M.Nießner, and J.Thies, “Neural head avatars from monocular RGB videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 18 632–18 643.
- [35]X.Gao, C.Zhong, J.Xiang, Y.Hong, Y.Guo, and J.Zhang, “Reconstructing personalized semantic facial nerf models from monocular video,” ACM Trans. on Graph., pp. 200:1–200:12, 2022.
- [36]G.Yang, M.Vo, N.Neverova, D.Ramanan, A.Vedaldi, and H.Joo, “Banmo: Building animatable 3d neural models from many casual videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 2853–2863.
- [37]M.Habermann, L.Liu, W.Xu, M.Zollhöfer, G.Pons-Moll, and C.Theobalt, “Real-time deep dynamic characters,” ACM Trans. on Graph., pp. 94:1–94:16, 2021.
- [38]F.Xu, Y.Liu, C.Stoll, J.Tompkin, G.Bharaj, Q.Dai, H.Seidel, J.Kautz, and C.Theobalt, “Video-based characters: Creating new human performances from a multi-view video database,” ACM Trans. on Graph., p.32, 2011.
- [39]S.Peng, S.Zhang, Z.Xu, C.Geng, B.Jiang, H.Bao, and X.Zhou, “Animatable neural implicit surfaces for creating avatars from videos,” CoRR, vol. abs/2203.08133, 2022.
- [40]B.L. Bhatnagar, C.Sminchisescu, C.Theobalt, and G.Pons-Moll, “Loopreg: Self-supervised learning of implicit surface correspondences, pose and shape for 3d human mesh registration,” in Adv. Neural Inform. Process. Syst., 2020, pp. 12 909–12 922.
- [41]G.Moon, S.-I. Yu, H.Wen, T.Shiratori, and K.M. Lee, “Interhand2.6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image,” in Eur. Conf. Comput. Vis., 2020, pp. 548–564.
- [42]G.Pavlakos, D.Shan, I.Radosavovic, A.Kanazawa, D.Fouhey, and J.Malik, “Reconstructing hands in 3d with transformers,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 9826–9836.
- [43]A.Boukhayma, R.deBem, and P.H. Torr, “3d hand shape and pose from images in the wild,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 10 835–10 844.
- [44]Y.Hasson, G.Varol, D.Tzionas, I.Kalevatykh, M.J. Black, I.Laptev, and C.Schmid, “Learning joint reconstruction of hands and manipulated objects,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 11 807–11 816.
- [45]D.Kong, L.Zhang, L.Chen, H.Ma, X.Yan, S.Sun, X.Liu, K.Han, and X.Xie, “Identity-aware hand mesh estimation and personalization from rgb images,” in Eur. Conf. Comput. Vis., 2022, pp. 536–553.
- [46]J.Ren, J.Zhu, and J.Zhang, “End-to-end weakly-supervised single-stage multiple 3d hand mesh reconstruction from a single rgb image,” Computer Vision and Image Understanding, p. 103706, 2023.
- [47]H.Sun, X.Zheng, P.Ren, J.Wang, Q.Qi, and J.Liao, “Smr: Spatial-guided model-based regression for 3d hand pose and mesh reconstruction,” IEEE Trans. Circuit Syst. Video Technol., vol.34, no.1, pp. 299–314, 2023.
- [48]M.Li, J.Wang, and N.Sang, “Latent distribution-based 3d hand pose estimation from monocular rgb images,” IEEE Trans. Circuit Syst. Video Technol., vol.31, no.12, pp. 4883–4894, 2021.
- [49]M.Oren and S.K. Nayar, “Generalization of lambert’s reflectance model,” in Proc. Int. Conf. Comput. Graph. Intera. Tech., 1994, pp. 239–246.
- [50]X.Chen, Y.Liu, Y.Dong, X.Zhang, C.Ma, Y.Xiong, Y.Zhang, and X.Guo, “Mobrecon: Mobile-friendly hand mesh reconstruction from monocular image,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 20 544–20 554.
- [51]Q.Gan, W.Li, J.Ren, and J.Zhu, “Fine-grained multi-view hand reconstruction using inverse rendering,” in AAAI, 2024.
- [52]T.Luan, Y.Zhai, J.Meng, Z.Li, Z.Chen, Y.Xu, and J.Yuan, “High fidelity 3d hand shape reconstruction via scalable graph frequency decomposition,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 16 795–16 804.
- [53]H.Zhu, Y.Liu, J.Fan, Q.Dai, and X.Cao, “Video-based outdoor human reconstruction,” IEEE Trans. Circuit Syst. Video Technol., vol.27, no.4, pp. 760–770, 2016.
- [54]K.Shen, C.Guo, M.Kaufmann, J.J. Zarate, J.Valentin, J.Song, and O.Hilliges, “X-avatar: Expressive human avatars,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 16 911–16 921.
- [55]B.K.P. Horn, “Shape from shading; a method for obtaining the shape of a smooth opaque object from one view,” Ph.D. dissertation, Massachusetts Institute of Technology, USA, 1970.
- [56]S.Laine, J.Hellsten, T.Karras, Y.Seol, J.Lehtinen, and T.Aila, “Modular primitives for high-performance differentiable rendering,” ACM Trans. on Graph., pp. 194:1–194:14, 2020.
- [57]K.Aliev, A.Sevastopolsky, M.Kolos, D.Ulyanov, and V.S. Lempitsky, “Neural point-based graphics,” in Eur. Conf. Comput. Vis., 2020, pp. 696–712.
- [58]L.Lin, S.Peng, Q.Gan, and J.Zhu, “Fasthuman: Reconstructing high-quality clothed human in minutes,” in International Conference on 3D Vision, 2024.
- [59]A.Nealen, T.Igarashi, O.Sorkine, and M.Alexa, “Laplacian mesh optimization,” in Proc. Int. Conf. Comput. Graph. Intera. Tech., 2006, pp. 381–389.
- [60]B.Kerbl, G.Kopanas, T.Leimkühler, and G.Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Trans. on Graph., pp. 1–14, 2023.
- [61]E.R. Chan, C.Z. Lin, M.A. Chan, K.Nagano, B.Pan, S.DeMello, O.Gallo, L.J. Guibas, J.Tremblay, S.Khamis et al., “Efficient geometry-aware 3d generative adversarial networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 16 123–16 133.
- [62]O.Ronneberger, P.Fischer, and T.Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
Qijun Gan is currently a PhD candidate in the College of Computer Science and Technology, Zhejiang University, Hangzhou, China. Before that, he received the bachelor's degree from the University of International Business and Economics, China. His research interests include machine learning and computer vision, with a focus on 3D reconstruction.

Zijie Zhou received the B.S. degree in Communication Engineering from Beijing University of Posts and Telecommunications, Beijing, China, in 2022. He is currently a postgraduate student in the School of Software Technology, Zhejiang University, Hangzhou, China. His research interests include computer vision and deep learning.

Jianke Zhu received the master's degree in Electrical and Electronics Engineering from the University of Macau, and the PhD degree in computer science and engineering from The Chinese University of Hong Kong, Hong Kong, in 2008. He held a post-doctoral position at the BIWI Computer Vision Laboratory, ETH Zurich, Switzerland. He is currently a Professor with the College of Computer Science, Zhejiang University, Hangzhou, China. His research interests include computer vision and robotics.