XHand: Real-time Expressive Hand Avatar (2025)

Qijun Gan, Zijie Zhou and Jianke Zhu are with the College of Computer Science and Technology, Zhejiang University, Zheda Rd 38th, Hangzhou, China. Email: {ganqijun, zjzhou, jkzhu}@zju.edu.cn. Jianke Zhu is the corresponding author.

Abstract

Hand avatars play a pivotal role in a wide array of digital interfaces, enhancing user immersion and facilitating natural interaction within virtual environments. While previous studies have focused on photo-realistic hand rendering, little attention has been paid to reconstructing hand geometry with fine details, which is essential to rendering quality. In the realms of extended reality and gaming, on-the-fly rendering becomes imperative. To this end, we introduce an expressive hand avatar, named XHand, that is designed to comprehensively generate hand shape, appearance, and deformations in real-time. To obtain fine-grained hand meshes, we make use of three feature embedding modules to predict hand deformation displacements, albedo, and linear blending skinning weights, respectively. To achieve photo-realistic rendering on fine-grained meshes, our method employs a mesh-based neural renderer that leverages mesh topological consistency and latent codes from the embedding modules. During training, a part-aware Laplace smoothing strategy is proposed that applies distinct levels of regularization to effectively preserve the necessary details and eliminate undesired artifacts. Experimental evaluations on the InterHand2.6M and DeepHandMesh datasets demonstrate the efficacy of XHand, which recovers high-fidelity geometry and texture for hand animations across diverse poses in real-time. To reproduce our results, we will make the full implementation publicly available at https://github.com/agnJason/XHand.

Index Terms:

3D hand reconstruction, animatable avatar, MANO.

I Introduction

Hand avatars are crucial in various digital environments, including virtual reality, digital entertainment, and human-computer interaction[1, 2, 3, 4]. Accurate representation and lifelike motion of hand avatars are essential to deliver an authentic and engaging user experience. Due to the complexity of hand muscles and their personalized nature, it is challenging to obtain a fine-grained hand representation[5, 6, 7, 8], which directly affects the user experience in virtual spaces.

Parametric model-based methods[9, 10, 5] have succeeded in modeling digital humans, offering structured frameworks to efficiently analyze and manipulate the shapes and poses of human bodies and hands. These models have played a crucial role in various applications, enabling computer animation and hand-object interaction[11, 12, 13, 14, 15]. However, since they predominantly rely on mesh-based representations, they are restricted to a fixed topology and the limited resolution of the 3D mesh. Consequently, it is difficult for these models to accurately represent intricate details, such as muscles, garments and hair, which hinders them from rendering high-fidelity images[16].

Model-free methods offer effective solutions for representing hand meshes through various techniques. Graph Convolutional Network (GCN)-based and UV-based representations of 3D hand meshes[17, 18] enable the reconstruction of diverse hand poses with detailed deformations. Lightweight auto-encoders[12, 19] further enable real-time hand mesh prediction. Despite these advancements in capturing accurate hand poses, these methods still fall short in preserving intricate geometric details.

Recently, neural implicit representations[20, 21] have emerged as powerful tools for synthesizing novel views of static scenes. Some studies[22, 23, 24, 25, 26, 16] have extended these methods to articulated objects, notably the human body, to facilitate photo-realistic rendering. LiveHand[8] achieves real-time rendering through a neural implicit representation along with a super-resolution renderer. Karunratanakul et al.[6] present a self-shadowing hand renderer. Corona et al.[16] introduce LISA, a neural model that predicts the color and the signed distance with respect to each hand bone independently. Despite the promising results, it struggles to capture intricate high-frequency details and lacks the capability of real-time rendering. Meanwhile, Chen et al.[27] make use of occupancy and illumination fields to obtain hand geometry, yet the generated hands lack intricate details and appear overly smooth. These methods have difficulty recovering the detailed geometry that usually plays a crucial role in photo-realistic rendering.

In addition to hand modeling methods, several studies have focused on reconstructing animatable human bodies or animals[28, 29, 30, 31, 32, 33, 34, 35]. Building accurate human body models presents significant challenges due to the complex deformations involved, particularly in capturing fine details such as textures and scan-like appearances, especially in smaller areas like hands and faces[5, 23, 36, 37, 25, 38]. To address these challenges, several approaches have been developed with detailed 3D scans. For instance, previous works[22, 39, 40] have focused on establishing correspondences between pose space and standard space through techniques such as linear blend skinning and inverse skinning weights. These advancements collectively contribute to more precise and realistic human body modeling, while their results for hand modeling remain smooth.

To address these challenges, we propose XHand, an expressive hand avatar that achieves real-time performance (see Fig.1). Our approach includes feature embedding modules that predict hand deformation displacements, vertex albedo, and linear blending skinning (LBS) weights using a subdivided MANO model[9]. These modules utilize average features of the hand mesh and compute feature offsets for different poses, addressing the difficulty in directly learning dynamic personalized hand color and texture due to significant pose-dependent variations. By distinguishing between average and pose-dependent features, our modules simplify the training task and improve result accuracy. Additionally, we incorporate a part-aware Laplace smoothing term to enhance the efficiency of geometric information extraction from images, applying various levels of regularization.

To achieve photo-realistic hand rendering, we use a mesh-based neural renderer that leverages latent codes from the feature embedding modules, maintaining topological consistency. This method preserves detailed features and minimizes artifacts through various regularization levels. We evaluate our approach using the InterHand2.6M dataset[41] and the DeepHandMesh collection[19]. Experimental results show that XHand outperforms previous methods, providing high-fidelity meshes and real-time rendering of hands in various poses.

[Figure 1]

Our main contributions are summarized as follows:

  • A real-time expressive hand avatar with high-fidelity results on both rendering and geometry, which is trained with an effective part-aware Laplace smoothing strategy;

  • An effective feature embedding module that simplifies the training objectives and enhances the prediction accuracy by distinguishing invariant average features from pose-dependent features;

  • An end-to-end framework to create photo-realistic and fine-grained hand avatars. The promising results indicate that our method outperforms previous approaches.

The remainder of this paper is arranged as follows. Related works are introduced in Section II. The proposed XHand model and the corresponding training process are thoroughly described in Section III. The experimental results and discussion are presented in Section IV. Finally, Section V concludes this paper and discusses the limitations.

II Related Work

II-A Parametric Model-based Method

3D animatable human models[10, 9, 5] enable shape deformation and animation by decoding low-dimensional parameters into a high-dimensional space. Loper et al.[10] introduce a linear model to explicitly represent the human body by adjusting shape and pose parameters. The MANO hand model[9] utilizes a rigged hand mesh with a fixed topology that can be easily deformed according to the parameters. However, the low resolution of the template mesh hinders its application in scenarios requiring higher precision. To address this limitation, Li et al.[7] integrate muscle groups with shape registration, which results in an optimized mesh with a finer appearance. Furthermore, parametric model-based methods[43, 44, 45, 11, 1, 2, 46, 47, 48] have shown promising results in accurately recovering hand poses from input images; however, they have difficulty in effectively capturing textures and geometric details for the resulting meshes. In this paper, our proposed XHand approach captures the fine details of both appearance and geometry by taking advantage of the Lambertian reflectance model[49].

II-B Model-free Approach

Parametric models have proven valuable in incorporating prior knowledge of pose and shape into hand geometry reconstruction[9], while their representation capability is restricted by the low resolution of the template mesh. To address this issue, Choi et al.[17] introduce a network based on graph convolutional neural networks (GCN) that directly estimates the 3D coordinates of the human mesh from the 2D human pose. Chen et al.[18] present a UV-based representation of the 3D hand mesh to estimate hand vertex positions. Mobrecon[50] predicts hand meshes in real-time through a 2D encoder and a 3D decoder. Despite the encouraging results, the above methods still cannot capture the geometric details of the hand. Moon et al.[19] propose an encoder-decoder framework that employs a template mesh to learn corrective parameters for pose and appearance. Although it achieves improved geometry and articulated deformation, it has difficulty in rendering photo-realistic hand images. Gan et al.[51] introduce an optimized pipeline that utilizes multi-view images to reconstruct a static hand mesh; unfortunately, it overlooks the variations due to joint movements. Karunratanakul et al.[6] design a shadow-aware differentiable rendering scheme that optimizes the albedo and normal map to represent the hand avatar. However, its geometry remains overly smooth. In contrast to the above methods, our proposed XHand approach simultaneously synthesizes detailed geometry and photo-realistic images for drivable hands.

II-C Neural Hand Representation

There are various alternatives for neural hand representations, such as HandAvatar[27], HandNeRF[26], LISA[16] and LiveHand[8]. In order to achieve high-fidelity rendering of human hands, Chen et al.[27] propose HandAvatar to generate photo-realistic hand images with arbitrary poses, which takes into account both occupancy and illumination fields. LISA[16] is a neural implicit model with hand textures that builds on signed distance functions (SDFs) and volumetric rendering. Mundra et al.[8] propose LiveHand, which makes use of a low-resolution NeRF representation to describe dynamic hands and a CNN-based super-resolution module to facilitate high-quality rendering. Despite their efficiency in rendering hand images, it is hard for those approaches to capture the details of hand mesh geometry. Luan et al.[52] introduce a frequency decomposition loss to estimate the personalized hand shape from a single image, which effectively addresses the challenge of data scarcity. Chen et al. introduce a spatially varying linear lighting model as a neural renderer to preserve personalized fidelity and sharp details under natural illumination. Zheng et al. facilitate the creation of detailed hand avatars from a single image by learning and utilizing data-driven hand priors. In this work, our presented XHand method focuses on synthesizing hand avatars with fine-grained geometry in real-time.

II-D Generic Animatable Objects

In addition to the aforementioned methods for hand modeling, several studies reconstruct animatable whole or partial human bodies or animals[28, 29, 30]. Face models primarily focus on facial expressions, appearance, and texture, rather than handling large-scale deformations[32, 33, 34, 35]. Zheng et al.[32] bridge the gap between explicit meshes and implicit representations with a deformable point-based model that incorporates intrinsic albedo and normal shading. For human body models[5, 23, 53, 36, 37, 25, 38], numerous challenges arise from the intricate deformations, which make it arduous to precisely capture fine details, such as textures and scan-like appearances, especially in smaller areas like the hands and face. Previous works[22, 39, 40] have explored establishing correspondences between pose space and template space through linear blend skinning and inverse skinning weights. Alldieck et al.[13] employ learning-based implicit representations to model human bodies via SDFs. Chen et al.[23] propose a forward skinning model that finds all canonical correspondences of deformed points. Shen et al.[54] introduce XAvatar to achieve high-fidelity rigged human bodies, employing part-aware sampling and initialization strategies to learn neural shapes and deformation fields.

III Method

[Figure 2: overview of the proposed method]

Given multi-view images $\{I_{t,i}\,|\,i=1,\ldots,N,\;t=1,\ldots,T\}$ of $T$ frames captured from $N$ viewpoints, together with the pose $\{\theta_t\,|\,t=1,\ldots,T\}$ and shape $\beta$ of the corresponding parametric hand model such as MANO[9], our approach aims to simultaneously recover expressive personalized hand meshes with fine details and render photo-realistic images in real-time. Fig.2 shows an overview of our method. Given the hand pose parameters $\theta$, the fine-grained posed mesh is obtained from the feature embedding modules (Sec.III-A), which are designed to produce Linear Blending Skinning (LBS) weights, vertex displacements and albedo by combining the average features of the mesh with a pose-driven feature mapping. With the refined mesh, the mesh-based neural renderer achieves real-time photo-realistic rendering with respect to the vertex albedo $\rho$, normals $\mathcal{N}$, and the latent codes $Q$ from the feature embedding modules.

III-A Detailed Hand Representation

In this paper, the parametric hand model MANO[9] is employed to initialize the hand geometry, which effectively maps the pose parameter $\theta \in \mathbb{R}^{J\times 3}$ with $J$ per-bone parts and the shape parameter $\beta \in \mathbb{R}^{10}$ onto a template mesh $\bar{\mathcal{M}}$ with vertices $V$. Such a mapping $\Omega$ is based on linear blend skinning with the weights $W \in \mathbb{R}^{|V|\times J}$. Thus, the posed hand mesh $\mathcal{M}$ can be obtained by

$$\mathcal{M} = \Omega(\bar{\mathcal{M}}, W, \theta, \beta). \quad (1)$$
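To make the skinning mapping $\Omega$ concrete, the following PyTorch sketch shows the linear blend skinning step. In MANO, the per-joint $4\times 4$ transforms are derived from $\theta$ and $\beta$ through the kinematic tree and blend shapes; here they are assumed to be given, and all names are illustrative rather than the authors' released implementation.

```python
import torch

def linear_blend_skinning(verts, weights, joint_transforms):
    """Minimal LBS step of the mapping Omega: blend per-joint rigid transforms
    with the skinning weights and apply them to the (displaced) template
    vertices. In MANO the (J, 4, 4) transforms are derived from theta and beta
    via the kinematic tree; here they are assumed to be given."""
    V = verts.shape[0]
    homo = torch.cat([verts, torch.ones(V, 1, dtype=verts.dtype, device=verts.device)], dim=1)
    blended = torch.einsum('vj,jab->vab', weights, joint_transforms)  # per-vertex (4, 4)
    posed = torch.einsum('vab,vb->va', blended, homo)
    return posed[:, :3]
```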
[Figure 3: uniform subdivision of the MANO template mesh]

Geometry Refinement. After increasing the MANO mesh resolution for fine geometry using the subdivision method in[27], a personalized vertex displacement field $D$ is introduced to allow extra deformation of each vertex in the template mesh. The refined posed hand mesh $\mathcal{M}_{fine}$ can be computed as

$$\mathcal{M}_{fine} = \Omega(\bar{\mathcal{M}}' + D, W', \theta, \beta). \quad (2)$$

The original MANO mesh[9], consisting of 778 vertices and 1,538 faces, has limited capacity to accurately represent fine-grained details[27]. To overcome this limitation, we enhance the mesh resolution by applying a uniform subdivision strategy to the MANO template mesh, as shown in Fig.3. By adding a new vertex at the midpoint of each edge three times, we obtain a refined mesh with 49,281 vertices and 98,432 faces. The skinning weights of each added vertex are set to the average of the weights of the two endpoints of its edge.

Let $\mathcal{S}$ denote the subdivision function for the MANO mesh. The high-resolution template mesh $\bar{\mathcal{M}}'$ and LBS weights $W'$ are obtained as follows

$$\bar{\mathcal{M}}', W' = \mathcal{S}(\bar{\mathcal{M}}, W). \quad (3)$$
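The subdivision function $\mathcal{S}$ can be sketched as repeated midpoint subdivision that also averages the skinning weights of the two endpoints for every new vertex, as described above. The snippet below shows one step under these assumptions; it is illustrative code, not the authors' implementation.

```python
import torch

def subdivide_once(verts, faces, weights):
    """One midpoint-subdivision step of S: each edge gains a vertex whose
    position and LBS weights are the averages of its two endpoints, and every
    face is split into four. Applying it three times to MANO (778 vertices,
    1,538 faces) gives the 49,281-vertex / 98,432-face template."""
    F = faces.shape[0]
    edges = torch.cat([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], dim=0)
    edges, _ = torch.sort(edges, dim=1)
    uniq, inv = torch.unique(edges, dim=0, return_inverse=True)     # unique undirected edges

    mid_verts = 0.5 * (verts[uniq[:, 0]] + verts[uniq[:, 1]])
    mid_weights = 0.5 * (weights[uniq[:, 0]] + weights[uniq[:, 1]]) # average endpoint LBS weights

    new_verts = torch.cat([verts, mid_verts], dim=0)
    new_weights = torch.cat([weights, mid_weights], dim=0)

    # Midpoint indices for each face edge, offset past the original vertices
    m01 = verts.shape[0] + inv[:F]
    m12 = verts.shape[0] + inv[F:2 * F]
    m20 = verts.shape[0] + inv[2 * F:]
    v0, v1, v2 = faces[:, 0], faces[:, 1], faces[:, 2]
    new_faces = torch.cat([
        torch.stack([v0, m01, m20], dim=1),
        torch.stack([v1, m12, m01], dim=1),
        torch.stack([v2, m20, m12], dim=1),
        torch.stack([m01, m12, m20], dim=1),
    ], dim=0)
    return new_verts, new_faces, new_weights
```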

To enhance the fidelity of the hand geometry, the vertex displacements $D$ and the LBS weights $W'$ are pose-dependent for each individual, which enables an accurate representation of the deformation under different poses. To this end, we propose two feature embedding modules, $\Psi_D$ and $\Psi_{lbs}$, to better capture the intricate details of the hand mesh: the LBS weights $W'$ are derived from the LBS embedding $\Psi_{lbs}$, and the displacement embedding $\Psi_D$ generates the vertex displacements $D$. Given the hand pose parameters $\{\theta_t\,|\,t=1,\ldots,T\}$ for $T$ frames, the mesh features are predicted as follows

$$D_t = \Psi_D(\theta_t), \quad W'_t = \Psi_{lbs}(\theta_t). \quad (4)$$

Thus, the refined mesh $\mathcal{M}_{fine}$ at time $t$ can be formulated as

$$\mathcal{M}_{fine} = \Omega(\bar{\mathcal{M}}' + D_t, W'_t, \theta_t, \beta). \quad (5)$$

Feature Embedding Module. Generally, it is challenging to learn distinctive hand features across different poses. To better separate the deformation caused by changes in posture from the inherent characteristics of the hand, we present an efficient feature embedding module. It relies on the average features of the hand mesh and computes feature offsets for different poses, as illustrated in Fig.4.

Given a personalized hand mesh $\mathcal{M}$ and its pose $\theta_t$ at time $t$, our feature embedding module extracts mesh features $f_{\mathcal{M}}$ as follows

$$f_{\mathcal{M}} = \Psi(\theta_t \,|\, \bar{f}_{\mathcal{M}}), \quad (6)$$

where $\bar{f}_{\mathcal{M}}$ denotes the average vertex features of the hand mesh.

[Figure 4: feature embedding module]

To represent the mesh features of the personalized hand generated with hand pose $\theta_t$, we design the following embedding function

$$\Psi(\theta_t \,|\, \bar{f}_{\mathcal{M}}) = \bar{f}_{\mathcal{M}} + \Phi(\theta_t, Q) * \mathcal{K}, \quad (7)$$

where $Q$ is the vertex latent code that encodes different vertices, and $\Phi$ denotes a pose decoder composed of multi-layer perceptrons (MLPs) that projects the pose $\theta_t$ and latent code $Q$ into the implicit space. To align with the feature space, $\mathcal{K}$ is a mapping matrix that converts the implicit space $\mathbb{R}^m$ into the feature space $\mathbb{R}^n$, subject to

$$\sum_{j=1}^{n} \mathcal{K}_{ij} = 1, \quad \text{for } i = 1, 2, \ldots, m. \quad (8)$$

The personalized mesh features $f_{\mathcal{M}}$ are derived by combining the average vertex features $\bar{f}_{\mathcal{M}}$ with the pose-dependent offsets. Consequently, the LBS weights $W'_t$ can be derived from the average LBS weights $\bar{f}_{lbs}$, pose decoder $\Phi_{lbs}$, latent code $Q_{lbs}$ and mapping matrix $\mathcal{K}_{lbs}$ as follows

$$W'_t = \Psi_{lbs}(\theta_t \,|\, \bar{f}_{lbs}) = \bar{f}_{lbs} + \Phi_{lbs}(\theta_t, Q_{lbs}) * \mathcal{K}_{lbs}. \quad (9)$$

Similarly, the vertex displacements $D_t$ can be obtained as follows

$$D_t = \Psi_D(\theta_t \,|\, \bar{f}_D) = \bar{f}_D + \Phi_D(\theta_t, Q_D) * \mathcal{K}_D, \quad (10)$$

where $\bar{f}_D$ denotes the average displacements, and $\Phi_D$, $Q_D$ and $\mathcal{K}_D$ are the pose decoder, latent code and mapping matrix for $\Psi_D$, respectively. The depths of $\Phi_{lbs}$ within the LBS embedding module $\Psi_{lbs}$ and $\Phi_{\rho}$ within the albedo embedding module $\Psi_{\rho}$ are set to 5, with each layer consisting of 128 neurons. Additionally, the depth of $\Phi_D$ within the displacement embedding module $\Psi_D$ is 8, with 512 neurons per layer.

Remark. The feature embedding modules allow for the interpretable acquisition of the hand features $f_{\mathcal{M}}$ corresponding to the pose $\theta_t$: the average mesh features are stored in $\bar{f}_{\mathcal{M}}$, while the feature offsets are driven by the pose $\theta$. More importantly, the training objectives are greatly simplified by taking the average-feature constraints into account, which leads to faster convergence and improved accuracy.
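A minimal PyTorch sketch of one feature embedding module $\Psi$ (Eq. 7) is given below. The layer sizes follow the paper (depth 5 with 128 neurons for the LBS and albedo decoders, depth 8 with 512 neurons for displacements), while the pose dimension, the implicit dimension $m$, and the softmax used to satisfy the row-sum constraint of Eq. (8) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Sketch of one feature embedding module Psi (Eq. 7): stored average
    features f_bar plus a pose-conditioned offset Phi(theta, Q) * K. The pose
    dimension, the implicit dimension m and the softmax parameterization of the
    row-normalized mapping matrix K (Eq. 8) are illustrative assumptions."""

    def __init__(self, num_verts, feat_dim, latent_dim=10, pose_dim=48,
                 hidden=128, depth=5, implicit_dim=32):
        super().__init__()
        # Average vertex features; in practice initialized from the first-frame
        # inverse-rendering result (Sec. III-C) rather than zeros.
        self.f_bar = nn.Parameter(torch.zeros(num_verts, feat_dim))
        self.Q = nn.Parameter(0.01 * torch.randn(num_verts, latent_dim))   # vertex latent codes
        layers, in_dim = [], pose_dim + latent_dim
        for _ in range(depth - 1):                     # pose decoder Phi (MLP)
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, implicit_dim))
        self.pose_decoder = nn.Sequential(*layers)
        self.K = nn.Parameter(0.01 * torch.randn(implicit_dim, feat_dim))  # mapping matrix

    def forward(self, theta):
        pose = theta.reshape(1, -1).expand(self.Q.shape[0], -1)            # broadcast pose per vertex
        offset = self.pose_decoder(torch.cat([pose, self.Q], dim=1))       # (V, m)
        K = torch.softmax(self.K, dim=1)               # rows sum to one, cf. Eq. (8)
        return self.f_bar + offset @ K                 # f = f_bar + Phi(theta, Q) * K

# Three instances are used: Psi_lbs and Psi_rho with depth 5 / 128 neurons, and
# Psi_D with depth 8 / 512 neurons, e.g. FeatureEmbedding(V, 3, depth=8, hidden=512).
```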

III-B Mesh Rendering

Inverse Rendering. In order to achieve rapid and differentiable rendering of the detailed mesh $\mathcal{M}_{fine}$, an inverse renderer is employed to synthesize hand images. Assuming that the skin color follows the Lambertian reflectance model[55], the rendered image $B$ can be calculated from the Spherical Harmonics coefficients $\mathbf{G}$, the vertex normals $\mathcal{N}$, and the vertex albedo $\rho$ using the following equation

$$B(\pi^i) = \rho \cdot SH(\mathbf{G}, \mathcal{N}), \quad (11)$$

where $\pi^i$ is the camera parameter of the $i$-th viewpoint, $SH(\cdot)$ represents the third-order Spherical Harmonics (SH) function, and $\mathcal{N}$ denotes the vertex normals computed from the vertices of $\mathcal{M}_{fine}$. Similar to Eq.4, the pose-dependent albedo $\rho_t$ is obtained from the feature embedding module $\Psi_{\rho}$ with the average vertex albedo $\bar{f}_{\rho}$, pose decoder $\Phi_{\rho}$, latent code $Q_{\rho}$ and mapping matrix $\mathcal{K}_{\rho}$ as follows

$$\rho_t = \Psi_{\rho}(\theta_t) = \bar{f}_{\rho} + \Phi_{\rho}(\theta_t, Q_{\rho}) * \mathcal{K}_{\rho}. \quad (12)$$

By analyzing how the variations in brightness relate to the hand shape, inverse rendering with the Lambertian reflectance model can effectively disentangle geometry and appearance.
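The shading term $SH(\mathbf{G}, \mathcal{N})$ in Eq. (11) can be sketched with the standard 9-coefficient real spherical harmonics basis; the rendered vertex color is then the albedo modulated by this irradiance. The function below is a hedged sketch, not the paper's exact implementation.

```python
import torch

def sh_irradiance(G, normals):
    """Third-order (9-coefficient) real spherical harmonics shading SH(G, N):
    evaluate the standard SH basis at the vertex normals and contract it with
    the lighting coefficients G (9, 3) to obtain per-vertex RGB irradiance.
    A sketch under the Lambertian assumption of Eq. (11)."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    ones = torch.ones_like(x)
    basis = torch.stack([
        0.282095 * ones,                    # l = 0
        0.488603 * y,                       # l = 1
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,                   # l = 2
        1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ], dim=1)                               # (V, 9)
    return basis @ G                        # (V, 3)

# Rendered vertex colors of Eq. (11): B = rho * SH(G, N)
# colors = albedo * sh_irradiance(G, vertex_normals)
```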

Mesh-based Neural Rendering. NeRF-based methods usually employ volumetric rendering along the camera ray $\mathbf{d}$ to acquire pixel colors[26, 8], which typically requires a large amount of training time. Instead, we minimize the sampling time and enhance the rendering quality with a mesh-based neural rendering method that takes advantage of the consistent topology of our refined mesh.

The mesh is explicitly represented by triangular facets, so the intersection points between rays and the mesh lie within the facets. The features describing the mesh, such as position, color, and normal, are associated with their respective vertices. Consequently, the attributes of an intersection point can be calculated by interpolating the attributes of the three vertices of its triangular facet. Efficient differentiable rasterization[56] makes both inverse rendering and mesh-based neural rendering feasible.

Given a camera view $\pi^i$, our mesh-based neural renderer $\mathcal{C}(\pi^i)$ synthesizes the image from the position $\mathbf{x}$, normal $\mathcal{N}$, feature vector $\mathbf{h}$ and ray direction $\mathbf{d}$, where $\mathbf{x}$, $\mathbf{h}$ and $\mathcal{N}$ are obtained through interpolation on $\mathcal{M}_{fine}$. The feature vector $\mathbf{h}$ contains the latent codes $Q_D$ and $Q_{\rho}$ detached from $\Psi_D$ and $\Psi_{\rho}$, as well as a rendering feature vector $Q_{render}$[51], which represents the latent code of vertices during rendering. As in[20], the network $\mathcal{C}$ comprises 8 fully-connected layers with ReLU activations and 256 channels per layer, excluding the output layer. Furthermore, it includes a skip connection that concatenates the input to the fifth layer, as depicted in Fig.5.

[Figure 5: architecture of the mesh-based neural renderer]
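The renderer $\mathcal{C}$ described above can be sketched as follows: NeRF-style positional encoding on $\mathbf{x}$ and $\mathbf{d}$, eight fully-connected layers of 256 channels with ReLU, and a skip connection that re-injects the input at the fifth layer. The per-pixel inputs are assumed to be rasterized and interpolated from $\mathcal{M}_{fine}$ beforehand (e.g. with nvdiffrast); the number of encoding frequencies and all names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

def positional_encoding(p, num_freqs=6):
    """NeRF-style encoding applied to x and d before the renderer; the number
    of frequency bands is an assumption."""
    feats = [p]
    for k in range(num_freqs):
        feats += [torch.sin((2.0 ** k) * p), torch.cos((2.0 ** k) * p)]
    return torch.cat(feats, dim=-1)

class NeuralRenderer(nn.Module):
    """Sketch of the mesh-based neural renderer C: 8 fully-connected layers of
    256 channels with ReLU and a skip connection that concatenates the input
    to the fifth layer. Per-pixel inputs (x, N, h, d) are assumed to be
    rasterized and interpolated from M_fine beforehand."""

    def __init__(self, in_dim, hidden=256, depth=8, out_dim=3, skip_at=4):
        super().__init__()
        self.skip_at = skip_at
        self.layers = nn.ModuleList()
        for i in range(depth):
            d_in = in_dim if i == 0 else hidden
            if i == skip_at:
                d_in = hidden + in_dim          # skip connection re-injects the input
            self.layers.append(nn.Linear(d_in, hidden))
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, x, normal, h, d):
        inp = torch.cat([positional_encoding(x), normal, h, positional_encoding(d)], dim=-1)
        feat = inp
        for i, layer in enumerate(self.layers):
            if i == self.skip_at:
                feat = torch.cat([feat, inp], dim=-1)
            feat = torch.relu(layer(feat))
        return torch.sigmoid(self.out(feat))    # predicted pixel color
```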

III-C Training Process

To obtain a personalized hand representation, the parameters of the three feature embedding modules $\Psi_D$, $\Psi_{lbs}$ and $\Psi_{\rho}$, as well as the neural renderer $\mathcal{C}$, need to be optimized on multi-view image sequences. Our training process consists of three steps: initialization, training the feature embedding modules, and training the mesh-based neural renderer.

Initialization of XHand. The average mesh features $\bar{f}_{\mathcal{M}}$ in the feature embedding modules significantly affect training efficiency and results. Random initialization severely impacts training due to estimation errors in $\Psi_{lbs}$ and $\Psi_D$, which may lead to the failure of inverse rendering. Therefore, it is crucial to initialize the neural hand representation. To this end, the reconstruction result of the first frame ($t=1$) is used as the initial model.

Inspired by[58, 51], the XHand model is initialized from multi-view images. The vertex displacements $D$ and vertex albedo $\rho$ of the hand mesh are jointly optimized through inverse rendering, where the mesh is generated by Eq.2 and the rendering equation is the same as Eq.11. The loss function during initialization is formulated as

$$\mathcal{L}_{init} = \sum_i ||B(\pi^i) - I_i||_1 + \sum L \times D + \sum L \times \rho, \quad (13)$$

where $L$ is the Laplacian matrix[59]. The Laplacian terms $L \times D$ and $L \times \rho$ regularize the mesh optimization, as the mesh features are supposed to be smooth. Uniform weights of the Laplacian matrix are adopted in training. The resulting $D$ and $\rho$ are used to initialize $\Psi_D$ and $\Psi_{\rho}$, while the initialization of $\Psi_{lbs}$ is directly derived from the MANO model[9].
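A sketch of the uniform-weight Laplacian regularizer used in Eq. (13) is given below; it builds a sparse graph Laplacian $L$ from the face list and applies it to the per-vertex displacements and albedo. The exact reduction is an assumption, since the paper only states that uniform weights are adopted.

```python
import torch

def uniform_laplacian(num_verts, faces):
    """Build a sparse uniform-weight graph Laplacian L (V x V) from the face
    list: L[i, i] = 1 and L[i, j] = -1/deg(i) for every neighbor j, so that
    (L @ f)[i] measures how far feature f[i] deviates from the mean of its
    neighbors. Illustrative helper, not the authors' code."""
    e = torch.cat([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], dim=0)
    e = torch.unique(torch.sort(e, dim=1).values, dim=0)           # undirected edges
    i = torch.cat([e[:, 0], e[:, 1]])                               # both directions
    j = torch.cat([e[:, 1], e[:, 0]])
    deg = torch.zeros(num_verts).index_add_(0, i, torch.ones_like(i, dtype=torch.float))
    diag = torch.arange(num_verts)
    idx = torch.stack([torch.cat([i, diag]), torch.cat([j, diag])])
    vals = torch.cat([-1.0 / deg[i], torch.ones(num_verts)])
    return torch.sparse_coo_tensor(idx, vals, (num_verts, num_verts))

# One way to realize the smoothing terms of Eq. (13), assuming an absolute-value
# reduction (the paper only writes a sum over L x D and L x rho):
# L = uniform_laplacian(num_verts, faces)
# smooth_D   = torch.sparse.mm(L, D).abs().sum()
# smooth_rho = torch.sparse.mm(L, rho).abs().sum()
# loss_init  = (B - I).abs().sum() + smooth_D + smooth_rho
```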

Loss Functions of Feature Embedding. Inverse rendering is utilized to learn the parameters of the three feature embedding modules $\Psi_D$, $\Psi_{lbs}$ and $\Psi_{\rho}$. $\mathcal{L}_{inv}$ is introduced to minimize the rendering error as follows

$$\mathcal{L}_{inv} = \mathcal{L}_{rgb} + \mathcal{L}_{reg}, \quad (14)$$

where $\mathcal{L}_{rgb}$ represents the rendering loss and $\mathcal{L}_{reg}$ is the regularization term. Inspired by[60], we combine an $L_1$ error with an SSIM term to form $\mathcal{L}_{rgb}$:

$$\mathcal{L}_{rgb} = \lambda \sum_i ||B(\pi^i) - I_i||_1 + (1-\lambda)\,\mathcal{L}_{SSIM}(B(\pi^i), I_i), \quad (15)$$

where $\lambda$ denotes the trade-off coefficient.

To enhance the efficiency of extracting geometric information from images, we introduce the part-aware Laplace smoothing term $\mathcal{L}_{pLap}$. The Laplace matrix $\mathbf{A}$ of a mesh feature $f$ is defined as $\mathbf{A} = L \times f$. Hierarchical weights $\phi_{pLap}$ are introduced to balance the regularization via different levels of smoothness, where each entry $\varphi_i$ of $\phi_{pLap}$ is defined as follows

$$\varphi_i = \begin{cases} \gamma_1, & 0 < \mathbf{A}_i < p_1 \\ \gamma_2, & p_1 < \mathbf{A}_i < p_2 \\ \ldots \end{cases} \quad (16)$$

where $\{p_1, p_2, \ldots\}$ represent the threshold values for the hierarchical weighting and $\{\gamma_1, \gamma_2, \ldots\}$ denote the balancing coefficients. The part-aware Laplace smoothing $\mathcal{L}_{pLap}$ reduces excessive roughness in albedo and displacement without affecting the fine details, and is defined as follows

$$\mathcal{L}_{pLap}(f) = \sum_i \phi_{pLap}\,\mathbf{A}. \quad (17)$$

By employing different hierarchical weights to balance the Laplacian smoothing, $\mathcal{L}_{pLap}$ better constrains the feature optimization in different scenarios. In our case, minor irregularities are acceptable, while excessive changes are undesirable. Therefore, the thresholds $p$ are dynamically controlled through the quantiles of the Laplace matrix $\mathbf{A}$, and entries greater than $p$ are assigned larger balancing coefficients.
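A possible implementation of $\mathcal{L}_{pLap}$ (Eqs. 16-17) with two weight levels is sketched below, reusing the sparse Laplacian from the earlier sketch: the per-vertex Laplacian magnitude is thresholded at a quantile, small irregularities receive $\gamma_1$ and larger ones $\gamma_2$. The channel reduction and use of the magnitude are assumptions.

```python
import torch

def part_aware_laplace(L, f, quantile=0.25, gamma_small=0.1, gamma_large=1.0):
    """Sketch of L_pLap (Eqs. 16-17) with two weight levels. A = L @ f is the
    per-vertex Laplacian of the feature; entries whose magnitude falls below a
    quantile-based threshold (minor irregularities) are weighted by gamma_small,
    larger ones by gamma_large. The channel reduction is an assumption."""
    A = torch.sparse.mm(L, f)                       # (V, C) Laplacian of feature f
    mag = A.abs().sum(dim=1)                        # per-vertex magnitude
    p = torch.quantile(mag, quantile)               # dynamic, quantile-based threshold
    weights = torch.where(mag < p,
                          torch.full_like(mag, gamma_small),
                          torch.full_like(mag, gamma_large))
    return (weights * mag).sum()

# Settings reported in Sec. IV-B: first quartile with (0.1, 1) for albedo,
# median with (0.1, 20) for displacements
# loss_rho = part_aware_laplace(L, rho, quantile=0.25, gamma_small=0.1, gamma_large=1.0)
# loss_D   = part_aware_laplace(L, D,   quantile=0.50, gamma_small=0.1, gamma_large=20.0)
```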

The following regularization terms are introduced to conform the optimized mesh to the hand geometry

$$\mathcal{L}_{reg} = \mathcal{L}_{pLap}(\rho) + \mathcal{L}_{pLap}(D) + \alpha_1 \mathcal{L}_{mask} + \alpha_2 \mathcal{L}_{e} + \alpha_3 \mathcal{L}_{d}, \quad (18)$$

where $\mathcal{L}_{pLap}(\rho)$ and $\mathcal{L}_{pLap}(D)$ are part-aware Laplacian smoothing terms that keep the albedo and displacements smooth during training. $\mathcal{L}_{mask}$, $\mathcal{L}_{e}$ and $\mathcal{L}_{d}$ ensure that the optimized hand mesh remains close to the MANO model, and are weighted by the constant coefficients $\alpha_1$, $\alpha_2$ and $\alpha_3$. $\mathcal{L}_{mask} = \sum_i ||\hat{M} - M||_1$ is the $L_1$ loss between the mask $\hat{M}$ rendered during inverse rendering and the original MANO mask. $\mathcal{L}_{e} = \sum_{i,j} ||\hat{e}_{ij} - e_{ij}||_2^2$ penalizes changes in edge length, where $\hat{e}_{ij}$ is the Euclidean distance between adjacent vertices $V_i$ and $V_j$ on the mesh edges, and $e_{ij}$ denotes the corresponding edge length of the subdivided MANO mesh $\bar{\mathcal{M}}'$. $\mathcal{L}_{d} = \sum_i ||D_i||_2^2$ constrains the magnitude of the displacements.
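The geometric regularizers $\mathcal{L}_{e}$ and $\mathcal{L}_{d}$ admit a direct implementation; the sketch below assumes the edge list and the rest lengths of the subdivided MANO template are precomputed, and notes the silhouette term $\mathcal{L}_{mask}$ in a comment since it only requires the rasterized masks.

```python
import torch

def edge_length_loss(verts, edges, rest_lengths):
    """L_e: penalize squared changes of edge length relative to the subdivided
    MANO template. `edges` is an (E, 2) index tensor and `rest_lengths` the
    corresponding template edge lengths (both assumed precomputed)."""
    cur = (verts[edges[:, 0]] - verts[edges[:, 1]]).norm(dim=1)
    return ((cur - rest_lengths) ** 2).sum()

def displacement_loss(D):
    """L_d: keep the per-vertex displacements small."""
    return (D ** 2).sum()

# L_mask is the L1 difference between the rasterized silhouette of the refined
# mesh and the MANO mask: mask_loss = (rendered_mask - mano_mask).abs().sum()
```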

Loss Functions of Neural Renderer. Once the latent codes $Q_D$ and $Q_{\rho}$ of $\Psi_D$ and $\Psi_{\rho}$ are detached, $\mathcal{L}_{neu}$ is used to minimize the residuals between the rendered image and the ground truth, analogous to Eq.15:

$$\mathcal{L}_{neu} = \omega \sum_i ||\mathcal{C}(\pi^i) - I_i||_1 + (1-\omega)\,\mathcal{L}_{SSIM}(\mathcal{C}(\pi^i), I_i), \quad (19)$$

where $\omega$ denotes the balancing coefficient.

IV Experiments

IV-A Datasets

InterHand2.6M. The InterHand2.6M dataset[41] is a large collection of images, each with a resolution of $512 \times 334$ pixels, accompanied by MANO annotations. It includes multi-view temporal sequences of both single and interacting hands. The experiments primarily utilize the 5 FPS version of this dataset.

DeepHandMesh. The DeepHandMesh dataset[19] features images captured from five different viewpoints, matching the resolution of those in InterHand2.6M. It also provides corresponding 3D hand scans, facilitating the validation of mesh reconstruction quality against 3D ground truth data.

IV-B Experimental Setup

Implementation Details. In the experiments, our proposed XHand model is mainly trained and evaluated on the 5 FPS version of the InterHand2.6M dataset[41], which consists of large-scale multi-view sequences capturing a wide range of hand poses. Each sequence has dozens of images with a resolution of $512 \times 334$. As in[27, 26], the XHand model is trained on the InterHand2.6M dataset with 20 views across 50 frames for each sequence, and the remaining frames are used for evaluation. To assess the quality of mesh reconstruction, we conduct experiments on the DeepHandMesh dataset[19], which provides 3D hand scans along with images captured from five different views at the same resolution as InterHand2.6M. All experiments are conducted on a PC with an NVIDIA RTX 3090 GPU with 24GB of memory.

[Figure 6: comparisons of rendering and geometry on InterHand2.6M]

We employ PyTorch and the Adam optimizer with a learning rate of $5e^{-4}$. To facilitate differentiable rasterization, we make use of the off-the-shelf renderer nvdiffrast[56]. As in[57], positional encoding is applied to $\mathbf{d}$ and $\mathbf{x}$ before feeding them into the rendering network. In our training process, the feature embedding modules are first trained for 500 epochs using inverse rendering. Then, the feature embedding modules and neural renderer are jointly trained for 500 epochs, where the average features $\bar{f}_{\mathcal{M}}$ in the feature embedding modules are updated every 50 epochs. We empirically found that the best performance is achieved with $\lambda=\omega=0.8$, $\alpha_1=10$, $\alpha_2=1e^{5}$, and $\alpha_3=1e^{4}$. To avoid excessive displacements and color variations, in $\mathcal{L}_{pLap}(\rho)$, $p_1$ is set to the first quartile of $\mathbf{A}_{\rho}$, $\gamma_1$ is set to $0.1$, and $\gamma_2$ is $1$. Similarly, in $\mathcal{L}_{pLap}(D)$, $p_1$ is the median of $\mathbf{A}_D$, $\gamma_1=0.1$, and $\gamma_2=20$. The lengths of the latent codes $Q_{lbs}$, $Q_D$, $Q_{\rho}$ and $Q_{render}$ are set to 10, 10, 10 and 20, respectively.

Evaluation Metrics. In the experiments, we fit the hand mesh representations to multi-view image sequences of a single scene. For fair comparison, we employ the same evaluation metrics as in[8, 27, 26], measuring the synthesized results with peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and learned perceptual image patch similarity (LPIPS). To assess the accuracy of the reconstructed hand mesh, we calculate the average point-to-surface Euclidean distance (P2S) in millimeters, since the Chamfer distance is considered unsuitable due to scale variations between MANO and the 3D scans.

IV-C Experimental Results

To investigate the efficacy of our proposed XHand, we treat the subdivided MANO model[9] with vertex albedo as our baseline, which has the merits of an efficient explicit representation. Moreover, we compare our model against several rigged hand representation methods, including LISA[16], HandAvatar[27], HandNeRF[26], and LiveHand[8]. For fair comparison, LiveHand is re-trained with the same settings and LISA is reproduced by[8].

Table I: Rendering quality and speed on InterHand2.6M.

Model               | LPIPS ↓ | PSNR ↑ | SSIM ↑ | FPS ↑
MANO[9] with albedo | 0.026   | 28.56  | 0.972  | 306.0
HandAvatar[27]      | 0.050   | 33.01  | 0.933  | 0.2
LISA[16]            | 0.078   | 29.36  | -      | 3.7
HandNeRF[26]        | 0.048   | 33.02  | 0.974  | -
LiveHand[8]         | 0.025   | 33.79  | 0.985  | 45.5
Ours                | 0.012   | 34.32  | 0.986  | 56.2

Table II: Point-to-surface error (P2S, mm) on DeepHandMesh.

Model    | Rigid fist | Relaxed | Thumb up | Average
MANO[9]  | 6.469      | 5.719   | 5.224    | 5.659
DHM[19]  | 2.695      | 3.995   | 3.639    | 3.492
Ours     | 2.593      | 2.189   | 2.162    | 2.276

We first perform a quantitative evaluation of rendering quality, as shown in Table I. The metrics for LISA[16] are adopted from LiveHand[8] and the results of HandNeRF[26] are taken from the original paper. Our proposed XHand achieves the best results with a PSNR of 34.3dB. Our baseline drives a textured MANO model through LBS weights; lacking the ability to handle illumination changes across different scenes and poses, it exhibits artifacts and reaches a PSNR of only 28.6dB. The NeRF-based methods[16, 27, 26, 8] present competitive PSNR results, yet they rely on the MANO mesh without fine-grained geometry during rendering. By taking advantage of the fine-grained meshes estimated by XHand, our method outperforms the previous volumetric approaches in rendering quality. Benefiting from our design, XHand runs at 56 frames per second (FPS) at inference: the feature embedding modules require 0.7 milliseconds, inverse rendering requires 15 milliseconds, and the neural rendering module needs 0.1 milliseconds.

Table II shows the results on the DeepHandMesh dataset. Our method outperforms the annotated MANO mesh[9] and DHM[19] by 3.3 mm and 1.2 mm in P2S, respectively. This indicates that our proposed feature embedding module captures the underlying hand mesh deformation more accurately than the encoder-decoder scheme in DHM. More experimental results on the DeepHandMesh[19] dataset are visualized in Fig.7.

[Figure 7: results on the DeepHandMesh dataset]

For better illustration, Fig.6 shows more detailed comparisons of rendering and geometry on the InterHand2.6M test split. Due to its limited expressive capability, the baseline MANO model[9] can hardly capture muscle details that vary across poses. Although the hand meshes generated by HandAvatar[27] contain more details than MANO, they are still overly smooth compared to ours. In terms of geometry, our method exhibits more prominent skin wrinkles under different poses. The NeRF-based methods HandNeRF[26] and LiveHand[8] yield competitive rendering results, yet they still rely on the MANO model and cannot obtain fine-grained hand geometry. In contrast, our approach provides an accurate hand representation by taking advantage of the feature embedding modules and the topologically consistent mesh model, resulting in enhanced rendering and geometry quality. Fig.8 visualizes the results of different identities animated using reference poses.

[Fig. 8: different identities animated with reference poses]

The proposed method efficiently drives personalized hand expressions from arbitrary hand gesture inputs. To demonstrate this, in-the-wild data serve as the reference for hand poses, as illustrated in Fig. 9; the pose parameters of the in-the-wild videos are extracted with HaMeR[42]. It is worth noting that the vividness of the rendered images can be further enhanced by relighting with different spherical harmonic coefficients.
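To make the relighting step concrete, the following is a minimal sketch of second-order spherical-harmonic shading applied to per-vertex attributes. It assumes the SH attenuation constants are folded into the nine lighting coefficients; the function names are illustrative and do not reflect the XHand implementation.

```python
# A minimal sketch of second-order spherical-harmonic (SH) relighting on
# per-vertex attributes. The SH attenuation constants are assumed to be folded
# into the nine coefficients; names are illustrative, not the XHand code.
import torch

def sh_basis(normals: torch.Tensor) -> torch.Tensor:
    """normals: (V, 3) unit vectors -> (V, 9) second-order SH basis values."""
    x, y, z = normals.unbind(-1)
    return torch.stack([
        torch.ones_like(x), y, z, x,
        x * y, y * z, 3.0 * z ** 2 - 1.0, x * z, x ** 2 - y ** 2,
    ], dim=-1)

def relight(albedo: torch.Tensor, normals: torch.Tensor,
            sh_coeffs: torch.Tensor) -> torch.Tensor:
    """albedo: (V, 3), normals: (V, 3), sh_coeffs: (9,) or (9, 3) lighting."""
    shading = sh_basis(normals) @ sh_coeffs        # (V,) or (V, 3) irradiance
    if shading.dim() == 1:
        shading = shading[:, None]
    return albedo * shading.clamp(min=0.0)         # relit per-vertex color
```

Swapping in a different set of `sh_coeffs` changes the illumination without touching geometry or albedo, which is what produces the relit renderings mentioned above.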

[Fig. 9: animation driven by in-the-wild hand poses estimated with HaMeR]

IV-D Ablation Study

[Fig. 10: qualitative ablation of the proposed components]

We perform extensive ablation experiments on the InterHand2.6M test set to validate the contributions of the various modules and settings in our framework. First, we demonstrate the performance gains brought by the proposed feature embedding module and the part-aware Laplace smoothing strategy, in line with the design intention of the fusion modules. Second, we show that XHand remains robust across different numbers of input views and is effective even with limited viewpoints. Finally, we compare several neural rendering networks; based on this evaluation, we choose MLPs to balance inference speed and rendering quality. The following sections detail these ablation experiments and analyze the results.

Ablation Study on Different Components. As shown in the first row of Fig. 10, our method clearly captures skeletal movements and skin changes, and our design resolves the issue of lighting variations. The proposed part-aware Laplacian regularization effectively reduces surface artifacts without sacrificing details. The feature embedding modules guide the learning of hand avatars by distinguishing average features from pose-dependent features, which enhances reconstruction accuracy.

TABLE III: Ablation on different components (InterHand2.6M test set).

| Model                          | LPIPS ↓ | PSNR ↑ | SSIM ↑ |
|--------------------------------|---------|--------|--------|
| MANO[9] with albedo            | 0.0257  | 28.56  | 0.9715 |
| w/o feature embedding          | 0.0139  | 32.81  | 0.9838 |
| w/o $\mathcal{L}_{pLap}$       | 0.0129  | 32.87  | 0.9843 |
| w/o Position Encoder           | 0.0114  | 33.95  | 0.9853 |
| Ours                           | 0.0123  | 34.32  | 0.9859 |

TABLE IV: Ablation on the number of training views.

| Num. views | LPIPS ↓ | PSNR ↑ | SSIM ↑ |
|------------|---------|--------|--------|
| 1-view     | 0.0209  | 29.34  | 0.9712 |
| 5-view     | 0.0135  | 32.72  | 0.9823 |
| 10-view    | 0.0129  | 33.50  | 0.9832 |
| 20-view    | 0.0123  | 34.32  | 0.9859 |
| 30-view    | 0.0091  | 35.23  | 0.9865 |

Table III shows that the level of mesh detail significantly affects image quality. The rendering results are substantially improved by the feature embedding. The part-aware Laplacian regularization yields more realistic geometry, which in turn improves the accuracy of the neural renderer. Furthermore, the Position Encoder in neural rendering leads to better image quality.

Ablation Study on Number of Views. Typically, the performance of each model improves as the number of input views increases, particularly for NeRF-based methods, and insufficient training data may lead to reconstruction failure. We therefore conduct ablation experiments with different numbers of input views. As shown in Table IV, we train the model on sequences of 1, 5, 10, 20, and 30 views. Despite being trained with a limited number of viewpoints, even as few as a single view, our method effectively captures the hand articulations. Furthermore, we achieve competitive results with 10 or more input views.

TABLE V: Comparison of neural rendering architectures.

| Method          | LPIPS ↓ | PSNR ↑ | SSIM ↑ | FPS ↑ |
|-----------------|---------|--------|--------|-------|
| XHand-MLPs      | 0.012   | 34.32  | 0.986  | 56.2  |
| XHand-UNet      | 0.011   | 34.72  | 0.987  | 46.2  |
| XHand-EG3D[61]  | 0.013   | 32.3   | 0.981  | 40.4  |

Choices of Neural Rendering. Traditional neural radiance fields[20] typically employ 8-layer MLPs as the renderer. In contrast, our mesh-based network renders directly from vertex features and thus eliminates the need for per-ray point sampling. Benefiting from topological consistency, our neural renderer can also adopt a UNet[62], which yields promising performance. To explore this, we conduct ablation experiments on both architectures, as detailed in Table V. The results show that a 4-layer UNet achieves superior rendering quality at the expense of inference speed, whereas MLPs run roughly 20% faster with only a marginal loss in accuracy. We therefore employ MLPs as our neural renderer. In addition, our investigation of EG3D[61], a well-designed image generation network, shows that it is less suitable for our mesh-based neural rendering.
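To illustrate the MLP variant, the sketch below maps per-vertex latent features, together with a positional encoding of the vertex position, to per-vertex colors that a differentiable rasterizer would then interpolate into the image. The layer sizes, feature dimensions, and class names are our assumptions, not the released architecture.

```python
# A hedged sketch of an MLP-based vertex renderer: per-vertex latent features
# plus a positional encoding of the vertex position -> per-vertex RGB.
# Layer sizes and names are illustrative assumptions, not the XHand release.
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, num_freqs: int = 4):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs, dtype=torch.float32))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (V, 3) -> (V, 3 + 6 * num_freqs)
        xb = x[..., None] * self.freqs                       # (V, 3, F)
        enc = torch.cat([torch.sin(xb), torch.cos(xb)], dim=-1)  # (V, 3, 2F)
        return torch.cat([x, enc.flatten(-2)], dim=-1)

class VertexMLPRenderer(nn.Module):
    def __init__(self, feat_dim: int = 32, hidden: int = 128, num_freqs: int = 4):
        super().__init__()
        self.pe = PositionalEncoding(num_freqs)
        in_dim = feat_dim + 3 + 6 * num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 3), nn.Sigmoid(),              # per-vertex RGB
        )

    def forward(self, verts: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # verts: (V, 3) positions; feats: (V, feat_dim) latent codes from the
        # embedding modules -> (V, 3) colors handed to the rasterizer.
        return self.mlp(torch.cat([self.pe(verts), feats], dim=-1))
```

Because the mesh topology is fixed, the same per-vertex colors can instead be reshaped onto an image-like grid and refined by a small UNet, which is the trade-off between quality and speed discussed above.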

V Conclusion

We present XHand, a real-time expressive hand avatar with photo-realistic rendering and fine-grained geometry. By taking advantage of effective feature embedding modules that distinguish average features from pose-dependent features, we obtain finely detailed meshes that vary with hand pose. To ensure high-quality hand synthesis, our method employs a mesh-based neural renderer that exploits mesh topological consistency. During training, we introduce a part-aware Laplace regularization that reduces artifacts while preserving details through different levels of regularization. Rigorous evaluations on the InterHand2.6M and DeepHandMesh datasets demonstrate that XHand produces high-fidelity geometry and texture for hand animations across a wide range of poses.

Our method relies on the accurate MANO annotations provided by the datasets during training. In future work, we will explore an effective MANO parameter estimator to relax this requirement.

References

  • [1]B.Doosti, S.Naha, M.Mirbagheri, and D.J. Crandall, “Hope-net: A graph-based model for hand-object pose estimation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 6607–6616.
  • [2]Y.Hasson, B.Tekin, F.Bogo, I.Laptev, M.Pollefeys, and C.Schmid, “Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 571–580.
  • [3]H.Fan, T.Zhuo, X.Yu, Y.Yang, and M.Kankanhalli, “Understanding atomic hand-object interaction with human intention,” IEEE Trans. Circuit Syst. Video Technol., vol.32, no.1, pp. 275–285, 2021.
  • [4]H.Cheng, L.Yang, and Z.Liu, “Survey on 3d hand gesture recognition,” IEEE Trans. Circuit Syst. Video Technol., vol.26, no.9, pp. 1659–1673, 2015.
  • [5]G.Pavlakos, V.Choutas, N.Ghorbani, T.Bolkart, A.A.A. Osman, D.Tzionas, and M.J. Black, “Expressive body capture: 3d hands, face, and body from a single image,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 10 975–10 985.
  • [6]K.Karunratanakul, S.Prokudin, O.Hilliges, and S.Tang, “Harp: Personalized hand reconstruction from a monocular rgb video,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 12 802–12 813.
  • [7]Y.Li, L.Zhang, Z.Qiu, Y.Jiang, N.Li, Y.Ma, Y.Zhang, L.Xu, and J.Yu, “NIMBLE: a non-rigid hand model with bones and muscles,” ACM Trans. on Graph., pp. 120:1–120:16, 2022.
  • [8]A.Mundra, J.Wang, M.Habermann, C.Theobalt, M.Elgharib etal., “Livehand: Real-time and photorealistic neural hand rendering,” in Int. Conf. Comput. Vis., 2023, pp. 18 035–18 045.
  • [9]J.Romero, D.Tzionas, and M.J. Black, “Embodied hands: Modeling and capturing hands and bodies together,” ACM Trans. on Graph., pp. 245:1–245:17, 2017.
  • [10]M.Loper, N.Mahmood, J.Romero, G.Pons-Moll, and M.J. Black, “SMPL: a skinned multi-person linear model,” ACM Trans. on Graph., pp. 248:1–248:16, 2015.
  • [11]Z.Cao, I.Radosavovic, A.Kanazawa, and J.Malik, “Reconstructing hand-object interactions in the wild,” in Int. Conf. Comput. Vis., 2021, pp. 12 397–12 406.
  • [12]G.M. Lim, P.Jatesiktat, and W.T. Ang, “Mobilehand: Real-time 3d hand shape and pose estimation from color image,” in International Conference on Neural Information Processing, 2020, pp. 450–459.
  • [13]T.Alldieck, H.Xu, and C.Sminchisescu, “imghum: Implicit generative models of 3d human shape and articulated pose,” in Int. Conf. Comput. Vis., 2021, pp. 5441–5450.
  • [14]J.Ren and J.Zhu, “Pyramid deep fusion network for two-hand reconstruction from rgb-d images,” IEEE Trans. Circuit Syst. Video Technol., 2024.
  • [15]S.Guo, E.Rigall, Y.Ju, and J.Dong, “3d hand pose estimation from monocular rgb with feature interaction module,” IEEE Trans. Circuit Syst. Video Technol., vol.32, no.8, pp. 5293–5306, 2022.
  • [16]E.Corona, T.Hodan, M.Vo, F.Moreno-Noguer, C.Sweeney, R.Newcombe, and L.Ma, “Lisa: Learning implicit shape and appearance of hands,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 20 501–20 511.
  • [17]H.Choi, G.Moon, and K.M. Lee, “Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose,” in Eur. Conf. Comput. Vis., 2020, pp. 769–787.
  • [18]P.Chen, Y.Chen, D.Yang, F.Wu, Q.Li, Q.Xia, and Y.Tan, “I2uv-handnet: Image-to-uv prediction network for accurate and high-fidelity 3d hand mesh modeling,” in Int. Conf. Comput. Vis., 2021, pp. 12 909–12 918.
  • [19]G.Moon, T.Shiratori, and K.M. Lee, “Deephandmesh: A weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling,” in Eur. Conf. Comput. Vis., 2020, pp. 440–455.
  • [20]B.Mildenhall, P.P. Srinivasan, M.Tancik, J.T. Barron, R.Ramamoorthi, and R.Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, pp. 99–106, 2021.
  • [21]P.Wang, L.Liu, Y.Liu, C.Theobalt, T.Komura, and W.Wang, “Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,” Adv. Neural Inform. Process. Syst., vol.34, pp. 27 171–27 183, 2021.
  • [22]C.-Y. Weng, B.Curless, P.P. Srinivasan, J.T. Barron, and I.Kemelmacher-Shlizerman, “Humannerf: Free-viewpoint rendering of moving people from monocular video,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 16 210–16 220.
  • [23]X.Chen, Y.Zheng, M.J. Black, O.Hilliges, and A.Geiger, “SNARF: differentiable forward skinning for animating non-rigid neural implicit shapes,” in Int. Conf. Comput. Vis., 2021, pp. 11 574–11 584.
  • [24]L.Liu, M.Habermann, V.Rudnev, K.Sarkar, J.Gu, and C.Theobalt, “Neural actor: Neural free-view synthesis of human actors with pose control,” ACM Trans. on Graph., pp. 1–16, 2021.
  • [25]S.Peng, Y.Zhang, Y.Xu, Q.Wang, Q.Shuai, H.Bao, and X.Zhou, “Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 9054–9063.
  • [26]Z.Guo, W.Zhou, M.Wang, L.Li, and H.Li, “Handnerf: Neural radiance fields for animatable interacting hands,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 21 078–21 087.
  • [27]X.Chen, B.Wang, and H.-Y. Shum, “Hand avatar: Free-pose hand animation and rendering from monocular video,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 8683–8693.
  • [28]G.Yang, C.Wang, N.D. Reddy, and D.Ramanan, “Reconstructing animatable categories from videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 16 995–17 005.
  • [29]H.Luo, T.Xu, Y.Jiang, C.Zhou, Q.Qiu, Y.Zhang, W.Yang, L.Xu, and J.Yu, “Artemis: Articulated neural pets with appearance and motion synthesis,” ACM Trans. on Graph., pp. 164:1–164:19, 2022.
  • [30]S.Wu, R.Li, T.Jakab, C.Rupprecht, and A.Vedaldi, “Magicpony: Learning articulated 3d animals in the wild,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 8792–8802.
  • [31]C.Cao, T.Simon, J.K. Kim, G.Schwartz, M.Zollhöfer, S.Saito, S.Lombardi, S.Wei, D.Belko, S.Yu, Y.Sheikh, and J.M. Saragih, “Authentic volumetric avatars from a phone scan,” ACM Trans. on Graph., pp. 163:1–163:19, 2022.
  • [32]Y.Zheng, W.Yifan, G.Wetzstein, M.J. Black, and O.Hilliges, “Pointavatar: Deformable point-based head avatars from videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 21 057–21 067.
  • [33]Y.Zheng, V.F. Abrevaya, M.C. Bühler, X.Chen, M.J. Black, and O.Hilliges, “I M avatar: Implicit morphable head avatars from videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 13 535–13 545.
  • [34]P.Grassal, M.Prinzler, T.Leistner, C.Rother, M.Nießner, and J.Thies, “Neural head avatars from monocular RGB videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 18 632–18 643.
  • [35]X.Gao, C.Zhong, J.Xiang, Y.Hong, Y.Guo, and J.Zhang, “Reconstructing personalized semantic facial nerf models from monocular video,” ACM Trans. on Graph., pp. 200:1–200:12, 2022.
  • [36]G.Yang, M.Vo, N.Neverova, D.Ramanan, A.Vedaldi, and H.Joo, “Banmo: Building animatable 3d neural models from many casual videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 2853–2863.
  • [37]M.Habermann, L.Liu, W.Xu, M.Zollhöfer, G.Pons-Moll, and C.Theobalt, “Real-time deep dynamic characters,” ACM Trans. on Graph., pp. 94:1–94:16, 2021.
  • [38]F.Xu, Y.Liu, C.Stoll, J.Tompkin, G.Bharaj, Q.Dai, H.Seidel, J.Kautz, and C.Theobalt, “Video-based characters: Creating new human performances from a multi-view video database,” ACM Trans. on Graph., p.32, 2011.
  • [39]S.Peng, S.Zhang, Z.Xu, C.Geng, B.Jiang, H.Bao, and X.Zhou, “Animatable neural implicit surfaces for creating avatars from videos,” CoRR, vol. abs/2203.08133, 2022.
  • [40]B.L. Bhatnagar, C.Sminchisescu, C.Theobalt, and G.Pons-Moll, “Loopreg: Self-supervised learning of implicit surface correspondences, pose and shape for 3d human mesh registration,” in Adv. Neural Inform. Process. Syst., 2020, pp. 12 909–12 922.
  • [41]G.Moon, S.-I. Yu, H.Wen, T.Shiratori, and K.M. Lee, “Interhand2.6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image,” in Eur. Conf. Comput. Vis., 2020, pp. 548–564.
  • [42]G.Pavlakos, D.Shan, I.Radosavovic, A.Kanazawa, D.Fouhey, and J.Malik, “Reconstructing hands in 3d with transformers,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 9826–9836.
  • [43]A.Boukhayma, R.deBem, and P.H. Torr, “3d hand shape and pose from images in the wild,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 10 835–10 844.
  • [44]Y.Hasson, G.Varol, D.Tzionas, I.Kalevatykh, M.J. Black, I.Laptev, and C.Schmid, “Learning joint reconstruction of hands and manipulated objects,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 11 807–11 816.
  • [45]D.Kong, L.Zhang, L.Chen, H.Ma, X.Yan, S.Sun, X.Liu, K.Han, and X.Xie, “Identity-aware hand mesh estimation and personalization from rgb images,” in Eur. Conf. Comput. Vis., 2022, pp. 536–553.
  • [46]J.Ren, J.Zhu, and J.Zhang, “End-to-end weakly-supervised single-stage multiple 3d hand mesh reconstruction from a single rgb image,” Computer Vision and Image Understanding, p. 103706, 2023.
  • [47]H.Sun, X.Zheng, P.Ren, J.Wang, Q.Qi, and J.Liao, “Smr: Spatial-guided model-based regression for 3d hand pose and mesh reconstruction,” IEEE Trans. Circuit Syst. Video Technol., vol.34, no.1, pp. 299–314, 2023.
  • [48]M.Li, J.Wang, and N.Sang, “Latent distribution-based 3d hand pose estimation from monocular rgb images,” IEEE Trans. Circuit Syst. Video Technol., vol.31, no.12, pp. 4883–4894, 2021.
  • [49]M.Oren and S.K. Nayar, “Generalization of lambert’s reflectance model,” in Proc. Int. Conf. Comput. Graph. Intera. Tech., 1994, pp. 239–246.
  • [50]X.Chen, Y.Liu, Y.Dong, X.Zhang, C.Ma, Y.Xiong, Y.Zhang, and X.Guo, “Mobrecon: Mobile-friendly hand mesh reconstruction from monocular image,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 20 544–20 554.
  • [51]Q.Gan, W.Li, J.Ren, and J.Zhu, “Fine-grained multi-view hand reconstruction using inverse rendering,” in AAAI, 2024.
  • [52]T.Luan, Y.Zhai, J.Meng, Z.Li, Z.Chen, Y.Xu, and J.Yuan, “High fidelity 3d hand shape reconstruction via scalable graph frequency decomposition,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 16 795–16 804.
  • [53]H.Zhu, Y.Liu, J.Fan, Q.Dai, and X.Cao, “Video-based outdoor human reconstruction,” IEEE Trans. Circuit Syst. Video Technol., vol.27, no.4, pp. 760–770, 2016.
  • [54]K.Shen, C.Guo, M.Kaufmann, J.J. Zarate, J.Valentin, J.Song, and O.Hilliges, “X-avatar: Expressive human avatars,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 16 911–16 921.
  • [55]B.K.P. Horn, “Shape from shading; a method for obtaining the shape of a smooth opaque object from one view,” Ph.D. dissertation, Massachusetts Institute of Technology, USA, 1970.
  • [56]S.Laine, J.Hellsten, T.Karras, Y.Seol, J.Lehtinen, and T.Aila, “Modular primitives for high-performance differentiable rendering,” ACM Trans. on Graph., pp. 194:1–194:14, 2020.
  • [57]K.Aliev, A.Sevastopolsky, M.Kolos, D.Ulyanov, and V.S. Lempitsky, “Neural point-based graphics,” in Eur. Conf. Comput. Vis., 2020, pp. 696–712.
  • [58]L.Lin, S.Peng, Q.Gan, and J.Zhu, “Fasthuman: Reconstructing high-quality clothed human in minutes,” in International Conference on 3D Vision, 2024.
  • [59]A.Nealen, T.Igarashi, O.Sorkine, and M.Alexa, “Laplacian mesh optimization,” in Proc. Int. Conf. Comput. Graph. Intera. Tech., 2006, pp. 381–389.
  • [60]B.Kerbl, G.Kopanas, T.Leimkühler, and G.Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Trans. on Graph., pp. 1–14, 2023.
  • [61]E.R. Chan, C.Z. Lin, M.A. Chan, K.Nagano, B.Pan, S.DeMello, O.Gallo, L.J. Guibas, J.Tremblay, S.Khamis etal., “Efficient geometry-aware 3d generative adversarial networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 16 123–16 133.
  • [62]O.Ronneberger, P.Fischer, and T.Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
Qijun Gan is currently a PhD candidate in the College of Computer Science and Technology, Zhejiang University, Hangzhou, China. Before that, he received his bachelor's degree from the University of International Business and Economics, China. His research interests include machine learning and computer vision, with a focus on 3D reconstruction.
Zijie Zhou received the B.S. degree in Communication Engineering from Beijing University of Posts and Telecommunications, Beijing, China, in 2022. He is currently a postgraduate student in the School of Software Technology, Zhejiang University, Hangzhou, China. His research interests include computer vision and deep learning.
Jianke Zhu received the master's degree in Electrical and Electronics Engineering from the University of Macau, and the PhD degree in computer science and engineering from The Chinese University of Hong Kong, Hong Kong, in 2008. He held a post-doctoral position at the BIWI Computer Vision Laboratory, ETH Zurich, Switzerland. He is currently a Professor with the College of Computer Science, Zhejiang University, Hangzhou, China. His research interests include computer vision and robotics.