Unsupervised Learning of Efficient Geometry-Aware Neural Articulated Representations

ECCV 2022

Atsuhiro Noguchi¹, Xiao Sun², Stephen Lin², Tatsuya Harada¹,³

¹The University of Tokyo, ²Microsoft Research Asia, ³RIKEN

ENARF-GAN learns pose-controllable, geometry-aware 3D representations of articulated objects without supervision. Training requires only a collection of unlabeled single-view RGB images and a pose prior distribution, from which the model learns to generate color and density fields of the objects. ENARF-GAN learns disentangled representations of appearance, viewpoint, and object pose, which are visualized in the videos on the left.

Abstract

We propose an unsupervised method for 3D geometry-aware representation learning of articulated objects, in which no image-pose pairs or foreground masks are used for training. Though photorealistic images of articulated objects can be rendered with explicit pose control through existing 3D neural representations, these methods require ground truth 3D pose and foreground masks for training, which are expensive to obtain. We obviate this need by learning the representations with GAN training. The generator is trained to produce realistic images of articulated objects from random poses and latent vectors by adversarial training. To avoid a high computational cost for GAN training, we propose an efficient neural representation for articulated objects based on tri-planes and then present a GAN-based framework for its unsupervised training. Experiments demonstrate the efficiency of our method and show that GAN-based training enables the learning of controllable 3D representations without paired supervision.

Method

To achieve efficient GAN training, we first propose a novel implicit representation for articulated objects called Efficient-NARF (ENARF), which is an extension of NARF. The model pipeline is visualized in the figure below. ENARF follows the efficient tri-plane-based 3D representation proposed in EG3D and extends it to articulated objects. We use tri-planes to represent the hidden features and part probabilities of arbitrary 3D locations in the canonical space. For each part, we sample the feature and the probability at the input location; these are combined and converted to the color and density of that location by a small MLP. Thanks to the explicit tri-plane representation, we can reduce the computational complexity significantly compared to NARF.
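The sampling step described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the plane layout (xy, xz, yz), summing the three plane features, the rigid per-part transform into the canonical space, and the softmax-weighted fusion of part features are all assumptions made for the sake of the example.

```python
import numpy as np

def bilinear_sample(plane, uv):
    """Sample a (C, H, W) plane at N points uv in [-1, 1]^2; returns (N, C)."""
    C, H, W = plane.shape
    # Map [-1, 1] to pixel coordinates; out-of-range points are clamped to the border.
    x = np.clip((uv[:, 0] + 1.0) * 0.5 * (W - 1), 0, W - 1)
    y = np.clip((uv[:, 1] + 1.0) * 0.5 * (H - 1), 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = plane[:, y0, x0] * (1 - wx) + plane[:, y0, x1] * wx  # (C, N)
    bot = plane[:, y1, x0] * (1 - wx) + plane[:, y1, x1] * wx  # (C, N)
    return (top * (1 - wy) + bot * wy).T

def sample_triplane(planes, pts):
    """planes: (3, C, H, W); pts: (N, 3) in [-1, 1]^3. Returns summed features (N, C)."""
    return (bilinear_sample(planes[0], pts[:, [0, 1]])    # xy plane
            + bilinear_sample(planes[1], pts[:, [0, 2]])  # xz plane
            + bilinear_sample(planes[2], pts[:, [1, 2]])) # yz plane

def part_weighted_feature(feat_planes, prob_planes, pts, rotations, translations):
    """Fuse per-part features with per-part probabilities (assumed softmax weighting).

    feat_planes: (3, C, H, W), prob_planes: (3, K, H, W),
    rotations: (K, 3, 3), translations: (K, 3) -- hypothetical rigid part poses.
    """
    feats, logits = [], []
    for k in range(rotations.shape[0]):
        # Assumed mapping of a world point into part k's canonical frame: R_k^T (x - t_k).
        local = (pts - translations[k]) @ rotations[k]
        feats.append(sample_triplane(feat_planes, local))
        logits.append(sample_triplane(prob_planes, local)[:, k])
    feats, logits = np.stack(feats), np.stack(logits)      # (K, N, C), (K, N)
    logits = logits - logits.max(axis=0, keepdims=True)    # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    return (probs[..., None] * feats).sum(axis=0), probs   # (N, C), (K, N)

def tiny_mlp(feat, w1, b1, w2, b2):
    """Small MLP turning fused features into color (N, 3) and density (N,)."""
    h = np.maximum(feat @ w1 + b1, 0.0)                    # ReLU hidden layer
    out = h @ w2 + b2                                      # (N, 4)
    rgb = 1.0 / (1.0 + np.exp(-out[:, :3]))                # sigmoid keeps colors in (0, 1)
    sigma = np.logaddexp(0.0, out[:, 3])                   # softplus keeps density >= 0
    return rgb, sigma
```

The point of the design is that each query costs three 2D lookups per part plus one small MLP evaluation, rather than a large MLP evaluation per part, which is where the efficiency gain over a purely MLP-based field would come from under these assumptions.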

To train ENARF without supervision, we propose a GAN-based training scheme. A generator produces images of articulated objects from randomly sampled latent vectors and poses, and is trained with the GAN objective. The generator consists of two networks: a foreground generator and a background generator G_b. The foreground generator in turn consists of a tri-plane generator G_tri and the ENARF-based generator G_ENARF. G_ENARF renders the foreground RGB image and mask from the tri-planes produced by G_tri, and G_b generates the background RGB image. The final output RGB image is a composite of the foreground and background images.
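The final compositing step can be illustrated with a short NumPy sketch. It assumes standard NeRF-style volume rendering to obtain the foreground image and mask from the color and density fields, and assumes the composite is formed as foreground plus background weighted by one minus the mask; the function names and these exact formulas are illustrative assumptions, not details confirmed by the text above.

```python
import numpy as np

def render_rays(rgb, sigma, deltas):
    """Volume-render samples along rays (assumed NeRF-style quadrature).

    rgb: (R, S, 3) colors, sigma: (R, S) densities, deltas: (R, S) sample spacings.
    Returns the foreground color (R, 3) and the foreground mask (R,).
    """
    alpha = 1.0 - np.exp(-sigma * deltas)                   # per-sample opacity
    # Transmittance: probability that the ray reaches sample i unoccluded.
    ones = np.ones_like(alpha[:, :1])
    trans = np.cumprod(np.concatenate([ones, 1.0 - alpha[:, :-1]], axis=1), axis=1)
    weights = alpha * trans                                  # (R, S)
    fg = (weights[..., None] * rgb).sum(axis=1)              # color already weighted by opacity
    mask = weights.sum(axis=1)                               # accumulated opacity in [0, 1]
    return fg, mask

def composite(fg, mask, bg):
    """Blend foreground over background; bg: (R, 3) background colors."""
    return fg + (1.0 - mask[..., None]) * bg
```

With zero density everywhere the mask is 0 and the output equals the background; with a very dense first sample the mask approaches 1 and the output approaches that sample's color.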

Results