TOUCH: Text-gUided Controllable Generation of Free-Form Hand-Object Interactions

University of Science and Technology of China

Overview of our work. We extend HOI generation beyond laboratory “grasp” settings (left) toward broader daily HOI modalities (right), enabling the modeling of more human-like interactions. Our dataset WildO2, built from Internet videos, covers more contacts, more objects, and more actions, and is enriched with descriptive synthetic captions (DSCs) to support fine-grained, semantically controllable HOI generation with our method, TOUCH.

For each sample, the leftmost image is the hand-object interaction frame (Ihoi) from Internet videos. The middle image is our reconstruction of Ihoi, used as ground truth for WildO2. The rightmost image shows the free-form HOI generated by our method, TOUCH, using object meshes and DSCs. The bottom row shows the corresponding DSCs.
More Contacts
More Objects
More Actions

Abstract

Hand-object interaction (HOI) is fundamental for humans to express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or to generic intent instructions, even when expressed through elaborate language. Such overly general conditioning imposes a strong inductive bias toward stable grasps and thus fails to capture the diversity of daily HOI. To address these limitations, we introduce free-form HOI generation, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions such as pushing, poking, and rotating. To support this task, we construct WildO2, an in-the-wild 3D HOI dataset comprising diverse interactions derived from Internet videos. Specifically, it contains 4.4k unique interactions across 92 intents and 610 object categories, each with detailed semantic annotations. Building on this dataset, we propose TOUCH, a three-stage framework centered on a multi-level diffusion model that provides fine-grained semantic control to generate versatile hand poses beyond grasping priors. The generation is conditioned on explicit contact modeling and subsequently refined with contact-consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method's ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities.
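To make the refinement stage concrete, below is a minimal Python/NumPy sketch of the two signals the abstract refers to: a contact-consistency term that pulls predicted contact vertices onto the object surface, and a penetration penalty that pushes hand vertices out of the object interior. The signed-distance interface, the 5 mm margin, and all names here are illustrative assumptions, not the paper's implementation.

import numpy as np

def refinement_losses(hand_verts, contact_mask, object_sdf, eps=0.005):
    """hand_verts: (V, 3) hand mesh vertices (meters).
    contact_mask: (V,) boolean, vertices predicted to be in contact.
    object_sdf:   callable mapping (N, 3) points to signed distances
                  (negative inside the object). All assumed interfaces."""
    d = object_sdf(hand_verts)                      # (V,) signed distances
    # Contact consistency: contact vertices should lie on the surface.
    contact_loss = float(np.mean(np.abs(d[contact_mask]))) if contact_mask.any() else 0.0
    # Physical constraint: penalize vertices sinking deeper than eps.
    penetration_loss = float(np.mean(np.clip(-d - eps, 0.0, None)))
    return contact_loss, penetration_loss

# Toy usage with a unit-sphere SDF standing in for an object mesh.
sphere_sdf = lambda p: np.linalg.norm(p, axis=-1) - 1.0
verts = np.random.randn(778, 3)                     # MANO has 778 vertices
mask = np.abs(sphere_sdf(verts)) < 0.05             # pretend these are contacts
print(refinement_losses(verts, mask, sphere_sdf))

In practice such terms would be minimized jointly over the hand pose parameters during the post-hoc refinement stage.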

Method Overview

Overview of our three-stage framework TOUCH for generating hand-object interactions from multi-level text prompts and object meshes. CIM stands for the Condition Injection Module.
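This page does not spell out the CIM's internals, so the following is only a plausible sketch: a cross-attention block that injects multi-level caption embeddings into the denoiser's pose tokens, with a residual connection and normalization. Layer sizes, the choice of cross-attention, and all identifiers are assumptions for illustration.

import torch
import torch.nn as nn

class ConditionInjectionModule(nn.Module):
    """Hypothetical CIM: cross-attention from pose features to text tokens."""
    def __init__(self, feat_dim=256, text_dim=512, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(text_dim, feat_dim)   # map text tokens to feature width
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, feats, text_tokens):
        """feats: (B, T, feat_dim) denoiser features for hand-pose tokens.
        text_tokens: (B, L, text_dim) caption embeddings (e.g., DSC/SSC levels)."""
        ctx = self.proj(text_tokens)
        out, _ = self.attn(query=feats, key=ctx, value=ctx)
        return self.norm(feats + out)               # residual injection

# Toy usage: inject an 8-token caption into 16 pose tokens.
cim = ConditionInjectionModule()
feats = torch.randn(2, 16, 256)
text = torch.randn(2, 8, 512)
print(cim(feats, text).shape)                       # torch.Size([2, 16, 256])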

Dataset Generation Pipeline & Analysis

The proposed data generation pipeline for WildO2. The process begins with O2HOI (Object-only to Hand-Object Interaction) frame pair extraction from in-the-wild videos, followed by a three-stage pipeline for 3D reconstruction, camera alignment, and hand-object refinement to produce high-fidelity interaction data.
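For the camera-alignment stage, a standard building block is a similarity (Umeyama/Procrustes) fit between corresponding 3D points of the object-only and HOI reconstructions. The sketch below shows this generic alignment step; it is not WildO2's actual code, and the source of the correspondences is assumed.

import numpy as np

def umeyama_align(src, dst):
    """Return scale s, rotation R, translation t with dst ~= s * R @ src + t.
    src, dst: (N, 3) corresponding 3D points."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))              # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / xs.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

# Toy usage: recover a known transform from noiseless correspondences.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1                              # force a proper rotation
dst = 1.5 * src @ R_true.T + np.array([0.1, -0.2, 0.3])
s, R, t = umeyama_align(src, dst)
print(np.allclose(dst, s * src @ R.T + t, atol=1e-6))  # True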

Dataset Analysis. (a) Breakdown of WildO2 reconstruction outcomes. (b) An illustration of the interplay among the most frequent object categories, interaction types, and hand contact regions. Object and action definitions are adapted and refined from Something-Something V2. Contact regions are derived from our dataset analysis. (c) Segmentation of the hand into 17 parts and the contact frequency distribution of each part in the dataset, along with a contact heatmap of the entire hand.
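As an illustration of how per-part contact frequencies like those in (c) could be computed, the sketch below labels each hand vertex with one of the 17 parts and marks it as in contact when it lies within a small distance of the object surface. The 5 mm threshold and the brute-force nearest-point search are simplifying assumptions, not the paper's exact recipe.

import numpy as np

def part_contact_frequencies(hand_verts, part_ids, obj_points,
                             n_parts=17, thresh=0.005):
    """hand_verts: (V, 3) hand mesh vertices (meters).
    part_ids:   (V,) integer part label in [0, n_parts) per vertex.
    obj_points: (M, 3) points sampled on the object surface."""
    # Distance from every hand vertex to its nearest object point.
    d = np.linalg.norm(hand_verts[:, None, :] - obj_points[None, :, :], axis=-1)
    in_contact = d.min(axis=1) < thresh
    # Fraction of each part's vertices that are in contact.
    freq = np.zeros(n_parts)
    for p in range(n_parts):
        sel = part_ids == p
        if sel.any():
            freq[p] = in_contact[sel].mean()
    return freq

# Toy usage with random geometry.
rng = np.random.default_rng(1)
verts = rng.normal(scale=0.05, size=(778, 3))       # MANO-sized hand
parts = rng.integers(0, 17, size=778)
obj = rng.normal(scale=0.05, size=(1000, 3))
print(part_contact_frequencies(verts, parts, obj).round(3))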

More WildO2 Samples

This section presents a small subset of WildO2 to demonstrate the diversity of the dataset and the free-form nature of hand poses. For each sample, the left image shows the hand-object interaction frame (Ihoi) from Internet videos. The right image displays our reconstruction of Ihoi, which serves as the ground truth for WildO2.

Experiment Results

Comparisons of different methods on samples from the WildO2 test set. Each sample takes SSCs and an object mesh as input and outputs an interactive hand pose (see the caption-template sketch after the examples below).

Push [green pen], applying [index pad] to exert gentle force on the [body] of [green pen], causing it to slightly move.
Push [green pen], applying [thumb pad] to exert gentle force on the [body] of [green pen], causing it to slightly move.
Lift up [green pen], applying [thumb, index pad] to grip the [handle] of [green pen] firmly, ensuring a secure hold while lifting one end without letting it drop.
Lift up [green pen], applying [thumb, index, middle pad] to grip the [handle] of [green pen] firmly, ensuring a secure hold while lifting one end without letting it drop.
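The bracketed structure of these captions suggests they are assembled from discrete control fields (action, hand parts, object part, object). The sketch below reconstructs two such templates from the examples above; the template wording and field names are inferred for illustration, not the dataset's actual caption generator.

# Hypothetical templates inferred from the examples above.
TEMPLATES = {
    "Push": ("{action} [{obj}], applying [{parts}] to exert gentle force on "
             "the [{obj_part}] of [{obj}], causing it to slightly move."),
    "Lift up": ("{action} [{obj}], applying [{parts}] to grip the [{obj_part}] "
                "of [{obj}] firmly, ensuring a secure hold while lifting one "
                "end without letting it drop."),
}

def make_caption(action, hand_parts, obj_part, obj):
    return TEMPLATES[action].format(action=action, obj=obj,
                                    parts=", ".join(hand_parts),
                                    obj_part=obj_part)

print(make_caption("Push", ["index pad"], "body", "green pen"))
print(make_caption("Lift up", ["thumb", "index pad"], "handle", "green pen"))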

Visualized cases of generation under different control conditions for the same object.

BibTeX


@misc{han2025touchtextguidedcontrollablegeneration,
  title={TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions},
  author={Guangyi Han and Wei Zhai and Yuhang Yang and Yang Cao and Zheng-Jun Zha},
  year={2025},
  eprint={2510.14874},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.14874},
}