Unsupervised Learning of Efficient Geometry-Aware Neural Articulated Representations

ECCV 2022

Atsuhiro Noguchi1, Xiao Sun2, Stephen Lin2, Tatsuya Harada1, 3

1The University of Tokyo, 2Microsoft Research Asia, 3RIKEN

Paper Code

ENARF-GAN learns pose-controllable and geometry-aware 3D representations of articulated objects without supervision. It requires only unlabeled single-view RGB images and a pose prior distribution for training, and learns to generate color and density fields of the objects. ENARF-GAN learns disentangled representations of appearance, viewpoint, and object pose, as visualized in the videos on the left.


We propose an unsupervised method for 3D geometry-aware representation learning of articulated objects, in which no image-pose pairs or foreground masks are used for training. Though photorealistic images of articulated objects can be rendered with explicit pose control through existing 3D neural representations, these methods require ground-truth 3D pose and foreground masks for training, which are expensive to obtain. We obviate this need by learning the representations with GAN training: the generator is trained adversarially to produce realistic images of articulated objects from random poses and latent vectors. To avoid the high computational cost of GAN training, we propose an efficient neural representation for articulated objects based on tri-planes, and then present a GAN-based framework for its unsupervised training. Experiments demonstrate the efficiency of our method and show that GAN-based training enables the learning of controllable 3D representations without paired supervision.


To achieve efficient GAN training, we first propose a novel implicit representation for articulated objects called Efficient-NARF (ENARF), an extension of NARF. The model pipeline is visualized in the figure below. ENARF adopts the efficient tri-plane-based 3D representation proposed in EG3D and extends it to articulated objects. We use tri-planes to represent hidden features and part probabilities at arbitrary 3D locations in the canonical space. For each part, we sample the feature and probability at the input location; these are combined and converted to the color and density of that location with a small MLP. Thanks to the explicit tri-plane representation, the computational complexity is significantly reduced compared to NARF.
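The per-point computation described above can be sketched as follows. This is a minimal illustrative stub, not the authors' implementation: it uses nearest-neighbor plane lookup instead of bilinear sampling, and the function names, plane shapes, and the `mlp` decoder are assumptions for illustration.

```python
import numpy as np

def sample_triplane(planes, p):
    """Nearest-neighbor lookup of a 3D point p in [-1, 1]^3 on three
    axis-aligned feature planes (xy, xz, yz), summed into one feature.
    planes: array of shape (3, R, R, C). (Bilinear sampling is used in
    practice; nearest-neighbor keeps this sketch short.)"""
    _, R, _, C = planes.shape
    coords = [(p[0], p[1]), (p[0], p[2]), (p[1], p[2])]  # plane projections
    feat = np.zeros(C)
    for plane, (u, v) in zip(planes, coords):
        # Map [-1, 1] coordinates to pixel indices on the plane.
        i = int(np.clip((u + 1) / 2 * (R - 1), 0, R - 1))
        j = int(np.clip((v + 1) / 2 * (R - 1), 0, R - 1))
        feat += plane[i, j]
    return feat

def enarf_point(feature_planes, prob_planes, part_transforms, p, mlp):
    """Blend per-part canonical features weighted by part probabilities,
    then decode to (rgb, density) with a small MLP (passed in as a stub)."""
    feats, probs = [], []
    for f_planes, p_planes, T in zip(feature_planes, prob_planes,
                                     part_transforms):
        # Warp the observation-space point into the part's canonical space.
        p_canon = (T @ np.append(p, 1.0))[:3]
        feats.append(sample_triplane(f_planes, p_canon))
        probs.append(sample_triplane(p_planes, p_canon)[0])
    probs = np.exp(probs - np.max(probs))
    probs = probs / probs.sum()                      # softmax over parts
    feat = sum(w * f for w, f in zip(probs, feats))  # probability-weighted blend
    return mlp(feat)                                 # -> (rgb, density)
```

Because each part's feature is an explicit table lookup rather than a deep MLP evaluation, the per-point cost is dominated by the single small decoder MLP, which is where the efficiency gain over NARF comes from.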

To train ENARF without supervision, we propose a GAN-based training framework. The generator produces images of articulated objects from randomly sampled latent vectors and poses, and is trained with a GAN objective. It consists of two networks: a foreground generator and a background generator Gb. The foreground generator further comprises a tri-plane generator Gtri and an ENARF network GENARF. GENARF renders the foreground RGB image and mask from tri-planes produced by Gtri, and Gb generates the background RGB image. The final output RGB image is a composite of the foreground and background images.
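The generator pipeline above can be sketched as a short composition. This is a schematic sketch under stated assumptions, not the authors' API: `G_tri`, `G_enarf`, and `G_b` are placeholder callables standing in for the three sub-networks, and the mask-based blend is standard alpha compositing.

```python
import numpy as np

def generate(z_fg, z_bg, pose, G_tri, G_enarf, G_b):
    """Sketch of the ENARF-GAN generator: a tri-plane generator maps the
    foreground latent to tri-planes, ENARF volume-renders foreground RGB
    and a mask for the given pose, and a background generator produces
    the background image. (Callable names are placeholders.)"""
    planes = G_tri(z_fg)                      # latent -> tri-planes
    fg_rgb, fg_mask = G_enarf(planes, pose)   # render foreground + mask
    bg_rgb = G_b(z_bg)                        # render background
    # Composite foreground over background with the rendered mask.
    return fg_mask[..., None] * fg_rgb + (1.0 - fg_mask[..., None]) * bg_rgb
```

During training, only the composited image reaches the discriminator; the foreground mask is a by-product of volume rendering, which is why no mask annotations are needed.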


Images and geometry generated by ENARF-GAN on the SURREAL and AIST++ datasets are shown in the videos below. ENARF-GAN disentangles appearance, viewpoint, and object pose from images without supervision such as per-image keypoint annotations or foreground masks.

We adapted the template for this website from IDR.