COHESIV: Contrastive Object and Hand Embedding Segmentation In Video


Dandan Shan*
Richard E.L. Higgins*
David F. Fouhey

University of Michigan
NeurIPS 2021

Abstract

In this paper, we learn to segment hands and hand-held objects from motion. Our system takes a single RGB image and a hand location as input and segments the hand and the hand-held object. For learning, we generate responsibility maps that show how well a hand's motion explains the motion of other pixels in video. We use these responsibility maps as pseudo-labels to train a weakly supervised neural network with an attention-based similarity loss and a contrastive loss. Our system outperforms alternative methods, achieving good performance on the 100DOH, EPIC-KITCHENS, and HO3D datasets.
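To make the two training signals concrete, below is a minimal PyTorch sketch of the idea: fit a simple motion model to the hand's optical flow, score every pixel by how well that model explains its flow (the responsibility map), and use the result as a soft pseudo-label for a contrastive pixel-grouping loss. Everything here is an illustrative assumption rather than the paper's exact formulation: the affine motion model, the Gaussian responsibility kernel, the binary cross-entropy form of the loss, and all function names and hyperparameters (sigma, tau) are stand-ins.

import torch
import torch.nn.functional as F


def responsibility_map(flow, hand_mask, sigma=1.0):
    """Score how well an affine fit to the hand's flow explains each pixel.

    flow:      (2, H, W) optical flow for one frame pair
    hand_mask: (H, W) binary mask of the query hand
    returns:   (H, W) values in (0, 1]; high = motion agrees with the hand
    """
    H, W = hand_mask.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # Homogeneous pixel coordinates (N, 3): [x, y, 1].
    coords = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)
    flows = flow.permute(1, 2, 0).reshape(-1, 2)               # (N, 2)

    # Least-squares affine motion model, fit on hand pixels only.
    idx = hand_mask.reshape(-1).bool()
    A = torch.linalg.lstsq(coords[idx], flows[idx]).solution   # (3, 2)

    # Residual between observed flow and the hand-model prediction, everywhere.
    residual = ((coords @ A - flows) ** 2).sum(-1)             # (N,)
    return torch.exp(-residual / (2 * sigma ** 2)).reshape(H, W)


def grouping_loss(embeddings, hand_query, resp, tau=0.1):
    """Contrastive pixel-grouping loss weighted by responsibility.

    embeddings: (C, H, W) per-pixel embeddings from the network
    hand_query: (C,) embedding for the query hand
    resp:       (H, W) responsibility pseudo-labels from above
    """
    C = embeddings.shape[0]
    # Cosine similarity of every pixel embedding to the hand query, as a logit.
    sim = F.cosine_similarity(
        embeddings.reshape(C, -1), hand_query[:, None], dim=0
    ) / tau                                                    # (N,)
    # Soft-label BCE: well-explained pixels are pulled toward the hand query,
    # unexplained pixels are pushed away.
    return F.binary_cross_entropy_with_logits(sim, resp.reshape(-1))

In this sketch, pixels whose flow the hand's affine model explains (e.g., a held object moving rigidly with the hand) receive high responsibility and are attracted to the hand's query embedding, while background pixels with unexplained motion are repelled.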

Paper

Paper / Supplemental / Code / Poster / Slides


Citation


@INPROCEEDINGS{Shan21,
    author = {Shan, Dandan and Higgins, Richard E.L. and Fouhey, David F.},
    title = {{COHESIV}: Contrastive Object and Hand Embedding Segmentation In Video},
    booktitle = {NeurIPS},
    year = {2021}
}



Results

Acknowledgement

This work was supported by the National Science Foundation.

DS and RH thank Mohamed El Banani, Karan Desai, Sarah Jabbour, Nilesh Kulkarni, Shengyi Qian, and Max Smith for their feedback and support.