Unsupervised Learning of Efficient Geometry-Aware Neural Articulated Representations

ECCV 2022


Atsuhiro Noguchi1, Xiao Sun2, Stephen Lin2, Tatsuya Harada1, 3

1The University of Tokyo, 2Microsoft Research Asia, 3RIKEN

Paper Code


ENARF-GAN learns pose-controllable and geometry-aware 3D representations for articulated objects without supervision. Training requires only unlabeled single-view RGB images and a pose prior distribution, from which the model learns to generate color and density fields of the objects. ENARF-GAN learns disentangled representations of appearance, viewpoint, and object pose, as visualized in the videos on the left.



Abstract

We propose an unsupervised method for 3D geometry-aware representation learning of articulated objects, in which no image-pose pairs or foreground masks are used for training. Though photorealistic images of articulated objects can be rendered with explicit pose control through existing 3D neural representations, these methods require ground truth 3D pose and foreground masks for training, which are expensive to obtain. We obviate this need by learning the representations through GAN training. The generator is trained adversarially to produce realistic images of articulated objects from random poses and latent vectors. To avoid the high computational cost of GAN training, we propose an efficient neural representation for articulated objects based on tri-planes and then present a GAN-based framework for its unsupervised training. Experiments demonstrate the efficiency of our method and show that GAN-based training enables the learning of controllable 3D representations without paired supervision.


Method


To achieve efficient GAN training, we first propose a novel implicit representation for articulated objects called Efficient-NARF (ENARF), which is an extension of NARF. The model pipeline is visualized in the figure below. ENARF builds on the efficient tri-plane 3D representation proposed in EG3D and extends it to articulated objects. We use tri-planes to represent hidden features and part probabilities at arbitrary 3D locations in the canonical space. For each part, we sample its feature and probability at the input location; these are combined and converted to the color and density of that location with a small MLP, as sketched in the code below. Thanks to the explicit tri-plane representation, the computational cost is significantly reduced compared to NARF.
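To make the sampling step concrete, here is a minimal sketch of tri-plane lookup with per-part probability blending, in the spirit of the description above. All tensor shapes and names (`sample_triplane`, `feat_planes`, `prob_planes`, `inv_bone_tf`, `head_mlp`) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, pts):
    """Bilinearly sample a (3, C, H, W) tri-plane at (N, 3) points in [-1, 1].

    Each point is projected onto the xy/xz/yz planes; the three sampled
    C-dim features are summed into a single (N, C) feature.
    """
    coords = torch.stack([pts[:, [0, 1]],   # xy plane
                          pts[:, [0, 2]],   # xz plane
                          pts[:, [1, 2]]])  # yz plane -> (3, N, 2)
    grid = coords.unsqueeze(1)                                # (3, 1, N, 2)
    feats = F.grid_sample(planes, grid, align_corners=False)  # (3, C, 1, N)
    return feats.squeeze(2).sum(dim=0).t()                    # (N, C)

def enarf_field(feat_planes, prob_planes, inv_bone_tf, pts, head_mlp):
    """Evaluate color and density at world-space query points.

    feat_planes: (P, 3, C, H, W)  per-part feature tri-planes
    prob_planes: (P, 3, 1, H, W)  per-part probability tri-planes
    inv_bone_tf: (P, 4, 4)        inverse bone transforms into canonical space
    pts:         (N, 3)           query points
    head_mlp:    small MLP mapping (N, C) features to (N, 4) RGB + density
    """
    pts_h = F.pad(pts, (0, 1), value=1.0)  # homogeneous coordinates, (N, 4)
    feats, logits = [], []
    for p in range(feat_planes.shape[0]):
        # Warp query points into part p's canonical coordinate frame.
        local = (pts_h @ inv_bone_tf[p].t())[:, :3]
        feats.append(sample_triplane(feat_planes[p], local))   # (N, C)
        logits.append(sample_triplane(prob_planes[p], local))  # (N, 1)
    feats = torch.stack(feats)                  # (P, N, C)
    probs = torch.stack(logits).softmax(dim=0)  # normalize over parts
    fused = (probs * feats).sum(dim=0)          # probability-weighted feature
    return head_mlp(fused)                      # (N, 4): RGB + density
```

Because the per-point computation reduces to a few bilinear lookups plus a small shared MLP, the cost per sample is far lower than evaluating a deep coordinate MLP for every part, which is what makes GAN training tractable.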



To train ENARF without supervision, we propose a GAN-based training scheme. A generator synthesizes images of articulated objects from randomly sampled latent vectors and poses, and is trained with the GAN objective. The generator consists of two networks: a foreground generator and a background generator G_b. The foreground generator further consists of a tri-plane generator G_tri and an ENARF network G_ENARF. G_ENARF renders the foreground RGB image and mask from the tri-planes produced by G_tri, while G_b generates the background RGB image. The final output RGB image is a composite of the foreground and background images.
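The compositional structure of the generator can be sketched as follows. The module interfaces (`g_tri`, `g_enarf`, `g_bg`, `disc`) and the non-saturating generator loss are assumptions made for illustration, not the authors' exact training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ENARFGANGenerator(nn.Module):
    """Composite generator: foreground (tri-planes + ENARF) over background."""

    def __init__(self, g_tri: nn.Module, g_enarf: nn.Module, g_bg: nn.Module):
        super().__init__()
        self.g_tri = g_tri      # latent z -> per-part tri-plane features
        self.g_enarf = g_enarf  # (tri-planes, pose, camera) -> fg RGB + mask
        self.g_bg = g_bg        # (latent z, camera) -> background RGB

    def forward(self, z, pose, camera):
        planes = self.g_tri(z)
        fg_rgb, fg_mask = self.g_enarf(planes, pose, camera)  # mask in [0, 1]
        bg_rgb = self.g_bg(z, camera)
        # Alpha-composite the rendered foreground over the background.
        return fg_mask * fg_rgb + (1.0 - fg_mask) * bg_rgb

def generator_step(gen, disc, z, pose, camera, opt_g):
    """One adversarial update of the generator (non-saturating GAN loss)."""
    fake = gen(z, pose, camera)
    loss = F.softplus(-disc(fake)).mean()
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()
```

Since poses are drawn from a prior distribution rather than annotated per image, the discriminator only ever sees unpaired real images, which is what removes the need for image-pose supervision.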


Results


Images and geometry generated by ENARF-GAN on the SURREAL and AIST++ datasets are shown in the videos below. ENARF-GAN disentangles appearance, viewpoint, and object pose from images without supervision such as per-image keypoint annotations or foreground masks.


We adapted the template for this website from IDR.