Landmark Discovery for Image Modeling

Yuting Zhang1, Yijie Guo1, Yixin Jin1, Yijun Luo1, Zhiyuan He1, Honglak Lee1,2

1 University of Michigan, Ann Arbor; 2 Google Brain


        Deep neural networks can model images with rich latent representations, but they cannot naturally conceptualize structures of object categories in a human-perceptible way. This paper addresses the problem of learning object structures in an image modeling process without supervision. We propose an autoencoding formulation to discover landmarks as explicit structural representations. The encoding module outputs landmark coordinates, whose validity is ensured by constraints that reflect the necessary properties for landmarks. The decoding module takes the landmarks as a part of the learnable input representations in an end-to-end differentiable framework. Our discovered landmarks are semantically meaningful and more predictive of manually annotated landmarks than those discovered by previous methods. The coordinates of our landmarks are also complementary features to pre-trained deep-neural-network representations in recognizing visual attributes. In addition, the proposed method naturally creates an unsupervised, perceptible interface to manipulate object shapes and decode images with controllable structures.

 

Figure: Overview of the neural network architecture.
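As an illustration of how an encoder can output landmark coordinates in a differentiable way, the sketch below turns per-landmark score maps into coordinates by taking a softmax over each map and then the probability-weighted mean position. This is a minimal sketch under assumptions: the function name `heatmaps_to_landmarks`, the `(K, H, W)` layout, and the `[0, 1]` coordinate range are illustrative choices, not taken from the released code.

```python
import numpy as np

def heatmaps_to_landmarks(score_maps):
    """Map raw score maps of shape (K, H, W) to K (x, y) coordinates in [0, 1].

    Assumed scheme: softmax over each map, then the expected pixel position.
    """
    k, h, w = score_maps.shape
    flat = score_maps.reshape(k, -1)
    flat = flat - flat.max(axis=1, keepdims=True)   # subtract max for numerical stability
    prob = np.exp(flat)
    prob /= prob.sum(axis=1, keepdims=True)
    prob = prob.reshape(k, h, w)
    xs = np.linspace(0.0, 1.0, w)
    ys = np.linspace(0.0, 1.0, h)
    x = (prob.sum(axis=1) * xs).sum(axis=1)         # marginal over rows, then mean x
    y = (prob.sum(axis=2) * ys).sum(axis=1)         # marginal over columns, then mean y
    return np.stack([x, y], axis=1)

# Synthetic example: one landmark with a sharp peak at row 2, column 5.
score_maps = np.zeros((1, 8, 8))
score_maps[0, 2, 5] = 50.0
coords = heatmaps_to_landmarks(score_maps)
```

Because the coordinates are a smooth function of the score maps, gradients from a decoder can flow back into the encoder, which is what an end-to-end differentiable framework requires.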

Paper

Unsupervised Discovery of Object Landmarks as Structural Representations
Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, Honglak Lee
Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
Oral presentation
[paper (main, appendices, supp-videos .tar.gz)] [arXiv] [project (code & results)] [poster] [slides] [oral presentation .mp4]

@inproceedings{2018-cvpr-lmdis-rep,
  author={Yuting Zhang and Yijie Guo and Yixin Jin and Yijun Luo and Zhiyuan He and Honglak Lee},
  booktitle={Conference on Computer Vision and Pattern Recognition ({CVPR})},
  title={Unsupervised Discovery of Object Landmarks as Structural Representations},
  year={2018},
  month={June},
  url={http://www.ytzhang.net/files/publications/2018-cvpr-lmdis-rep.pdf},
  arxiv={1804.04412}
}

Code

We release the code and trained models in Python with TensorFlow. The code is available from our GitHub repository.

Overview of Results

For landmark discovery, our method is compared with that of Thewlis et al. (James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object landmarks by factorized spatial embeddings. In ICCV, 2017).
Videos:    [Face Manipulation]   [Digit Morphing]  

1. Facial landmark discovery & its application to supervised tasks 

Figure: Discovering 10 landmarks on CelebA (Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In ICCV, 2015). Compared to those of Thewlis et al., our landmarks are more stable and consistent across images. The errors made by Thewlis et al.'s method are described in the last row. Our paper provides more results of the 10-landmark and 30-landmark models on CelebA (Figure 25 and Figure 26, respectively, in Appendix E.1).
Figure: Our model can also be trained and tested on unaligned facial images. Please refer to our paper for more results on CelebA non-aligned images (Figure 28 in Appendix E.1) and for results on AFLW (Appendix E.2). The AFLW dataset: Martin Koestinger, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Annotated Facial Landmarks in the Wild: A Large-scale, Real-world Database for Facial Landmark Localization. In First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.
Table: Linear regression is used to map the discovered landmarks to the human-annotated landmarks. In this way, a landmark discovery model can be quickly converted into a detector of human-defined landmarks. Surprisingly, our unsupervised discovery model outperforms recent fully supervised facial landmark detectors. Moreover, our model needs only 1000 labeled facial images, used to train the linear regressor, to achieve the reported error on the MAFL test set (MAFL is a subset of CelebA).
| Category | Method | Test set error (MAFL) | Test set error (AFLW) |
|---|---|---|---|
| Fully supervised | RCPR | - | 11.60 |
| Fully supervised | CFAN | 15.84 | 10.94 |
| Fully supervised | TCDCN | 7.95 | 7.65 |
| Fully supervised | Cascaded CNN | 9.73 | 8.97 |
| Fully supervised | RAR | - | 7.23 |
| Fully supervised | MTCNN | 5.39 | 6.90 |
| Unsupervised discovery | Thewlis et al. (50 discovered landmarks) | 6.67 | 10.53 |
| Unsupervised discovery | Thewlis et al. (dense object frames) | 5.83 | 8.80 |
| Unsupervised discovery | Ours (10 discovered landmarks) | 3.46 | 7.01 |
| Unsupervised discovery | Ours (30 discovered landmarks) | 3.15 | 6.58 |

Citations for the compared methods:
RCPR: X. P. Burgos-Artizzu, P. Perona, and P. Dollár. Robust face landmark estimation under occlusion. In ICCV, 2013.
CFAN: Jie Zhang, Shiguang Shan, Meina Kan, and Xilin Chen. Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. In ECCV, 2014.
TCDCN: Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Learning deep representation for face alignment with auxiliary attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(5):918–930, 2016.
Cascaded CNN: Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In CVPR, 2013.
RAR: Shengtao Xiao, Jiashi Feng, Junliang Xing, Hanjiang Lai, Shuicheng Yan, and Ashraf Kassim. Robust facial landmark detection via recurrent attentive-refinement networks. In ECCV, 2016.
MTCNN: Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In ECCV, 2014.
Thewlis et al. (50 discovered landmarks): James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object landmarks by factorized spatial embeddings. In ICCV, 2017.
Thewlis et al. (dense object frames): James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object frames by dense equivariant image labelling. In NIPS, 2017.
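The evaluation protocol behind this table, a linear regressor from discovered landmarks to annotated landmarks, can be sketched roughly as follows. This is a minimal sketch on synthetic data, not the released evaluation code: the function names, the bias term, and the per-image normalizing distance `norm_dist` (for faces this would typically be something like the inter-ocular distance) are assumptions made for illustration.

```python
import numpy as np

def fit_landmark_regressor(discovered, annotated):
    """Least-squares linear map (with bias) from discovered to annotated landmarks.

    discovered: (N, 2*K) flattened discovered coordinates
    annotated:  (N, 2*M) flattened annotated coordinates
    """
    X = np.hstack([discovered, np.ones((discovered.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, annotated, rcond=None)
    return W

def predict_landmarks(W, discovered):
    """Apply the fitted linear map to new discovered landmarks."""
    X = np.hstack([discovered, np.ones((discovered.shape[0], 1))])
    return X @ W

def mean_error_percent(pred, target, norm_dist):
    """Mean point-to-point error as a percentage of a per-image distance."""
    p = pred.reshape(len(pred), -1, 2)
    t = target.reshape(len(target), -1, 2)
    dist = np.linalg.norm(p - t, axis=2)            # (N, M) Euclidean errors
    return 100.0 * np.mean(dist / norm_dist[:, None])

# Synthetic stand-in: 1000 labeled images, 10 discovered and 5 annotated landmarks.
rng = np.random.default_rng(0)
discovered = rng.random((1000, 20))
true_map = rng.normal(size=(20, 10))                # hypothetical ground-truth relation
annotated = discovered @ true_map + 0.5
norm_dist = np.ones(1000)                           # placeholder normalizer
W = fit_landmark_regressor(discovered, annotated)
error = mean_error_percent(predict_landmarks(W, discovered), annotated, norm_dist)
```

Only the regressor sees labels here; the landmark discovery model itself stays unsupervised, which is why a small labeled set can suffice.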
Figure: Visualization for predicting annotated facial landmarks. More results on CelebA (MAFL test set) are in Figure 27, Appendix E.1 of our paper.
Colorful cross: discovered landmark; red dot: annotated landmark; circle: regressed landmark, whose color represents its distance to the annotated landmark. See the color bar for the distance (i.e., prediction error).
Table: The discovered landmarks are an explicit part of the image representation in our model, that is, a part directly perceptible by humans. The landmarks learned without supervision can be used for shape-related facial attribute prediction on CelebA. The linear classifier built upon our discovered landmarks (30 landmarks in total, i.e., a 60-dim feature vector) outperforms those built upon the pretrained FaceNet features (128-dim or 1792-dim). Detailed results are in Table 3 of the main paper.
| Method | Ours (discovered landmarks) | FaceNet (top-layer) | FaceNet (conv-layer) |
|---|---|---|---|
| Feature dimension | 60 | 128 | 1792 |
| Accuracy / % | 83.2 | 80.0 | 82.4 |
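To make the idea of a linear classifier on landmark coordinates concrete, the sketch below trains a binary classifier on 60-dim feature vectors (30 landmarks, x and y each). It is a minimal sketch under assumptions: logistic regression trained by gradient descent is used as a stand-in because the page does not say which linear classifier was used, and the features and labels are synthetic.

```python
import numpy as np

def train_logistic(features, labels, lr=0.5, steps=300):
    """Fit weights and bias of a logistic-regression classifier by gradient descent."""
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))   # predicted probabilities
        grad = p - labels                               # gradient of the log loss w.r.t. logits
        w -= lr * features.T @ grad / n
        b -= lr * grad.mean()
    return w, b

def predict(features, w, b):
    """Binary decision from the sign of the linear score."""
    return (features @ w + b > 0).astype(int)

# Synthetic stand-in for 60-dim landmark features and one binary attribute.
rng = np.random.default_rng(0)
features = rng.normal(size=(400, 60))
labels = (features @ rng.normal(size=60) > 0).astype(float)
w, b = train_logistic(features, labels)
accuracy = np.mean(predict(features, w, b) == labels)   # training accuracy only
```

In practice one such classifier would be trained per attribute and evaluated on held-out images rather than on the training set.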

2. Image manipulation using discovered landmarks

         The image representation learnt by our model consists of the discovered landmarks and their latent descriptors. The two parts of the representations turn out to be disentangled. Using our jointly trained decoding module, we can manipulate the object shape without changing other appearance factors. In particular, we synthesize flows to adjust the discovered landmarks of an input image.
More details are in Section 4.4 and Appendix A of our paper.
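The landmark-adjustment step of such a manipulation can be sketched as below: a chosen subset of landmarks is moved along a displacement while the others, and the latent descriptors, stay fixed. This is a minimal sketch under assumptions. The helper `shift_landmarks`, the normalized `[0, 1]` coordinates, and the landmark indices are hypothetical, and the decoding module that would render each frame is not reproduced here.

```python
import numpy as np

def shift_landmarks(landmarks, indices, displacement, strength=1.0):
    """Return a copy of `landmarks` (K, 2) with the selected rows moved.

    The selected landmarks move by strength * displacement and are
    clipped to stay inside the normalized image range [0, 1].
    """
    moved = landmarks.copy()
    moved[indices] += strength * np.asarray(displacement, dtype=float)
    return np.clip(moved, 0.0, 1.0)

# Five made-up landmarks as (x, y); indices 3 and 4 play the role of mouth landmarks.
landmarks = np.array([[0.3, 0.4], [0.7, 0.4], [0.5, 0.6], [0.4, 0.8], [0.6, 0.8]])
mouth = [3, 4]
# A short sequence that gradually moves the mouth landmarks downward.
frames = [shift_landmarks(landmarks, mouth, (0.0, 0.05), s)
          for s in np.linspace(0.0, 1.0, 5)]
```

Each entry of `frames` would then be passed to the decoder together with the unchanged descriptors to produce one frame of the manipulated sequence.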
Video: Face manipulation using our 10-landmark model. Flows are on all landmarks.
Video: Face manipulation using our 30-landmark model. Flows are on all landmarks.
Video: Face manipulation using our 30-landmark model. Flows are on the 3 mouth landmarks.
Figure: Human-body manipulation using our 16-landmark model.
         For simple images, like in MNIST, landmarks alone can be enough to describe the object shapes. On MNIST, we train our landmark discovery model without the landmark descriptor pathway. With the help of a single model trained for all ten digits, we can perform geometrically meaningful morphing between different digits.
More details about our experiments on MNIST are in Appendix B of our paper.
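One simple way to picture morphing in landmark space is to interpolate between the landmark sets of two digits and decode every intermediate set. The sketch below is an assumption made for illustration: the page does not state the interpolation scheme, linear blending is used as the simplest choice, and it relies on both landmark sets coming from the same model so that landmark `i` corresponds across digits.

```python
import numpy as np

def morph_landmarks(source, target, num_steps):
    """Linearly blend two (K, 2) landmark sets into `num_steps` intermediate sets."""
    alphas = np.linspace(0.0, 1.0, num_steps)
    return [(1.0 - a) * source + a * target for a in alphas]

# Made-up landmark sets standing in for two different digits.
source = np.array([[0.2, 0.2], [0.8, 0.2], [0.5, 0.9]])
target = np.array([[0.4, 0.1], [0.6, 0.5], [0.5, 0.7]])
steps = morph_landmarks(source, target, 5)
```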
Video: Morphing 8 to other digits.
Video: Morphing 4 to other digits.

3. Landmark discovery on cat heads, cars, shoes, and human bodies

Figure: Discovering 10 landmarks on cat heads. More results are in Appendix E.3 of our paper.
Figure: Discovering 10 landmarks on cars in the profile view. More results are in Appendix E.4 of our paper.
Figure: Discovering 8 landmarks on shoes in the profile view. More results are in Appendix E.5 of our paper.
Figure: Discovering 16 landmarks on human body images from Human3.6M (Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014). Unlike the other datasets, the Human3.6M training data is in video format, so we can compute optical flow and use it as self-supervision for the constraint used to discover landmarks. More details and results are in Appendix C of our paper.
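The idea of using optical flow as self-supervision can be illustrated with a simple consistency term: a landmark detected in frame t, moved by the flow at its location, should land near the same landmark detected in frame t+1. This is a minimal sketch under assumptions, not the formulation in Appendix C: the function name, pixel-coordinate convention, and squared-distance penalty are illustrative, and nearest-pixel flow lookup is used for brevity where a trainable version would need a differentiable (e.g. bilinear) lookup.

```python
import numpy as np

def flow_consistency_loss(lm_t, lm_t1, flow):
    """Mean squared distance between flow-warped and detected landmarks.

    lm_t, lm_t1: (K, 2) landmark (x, y) pixel coordinates in frames t and t+1
    flow:        (H, W, 2) optical flow from t to t+1, flow[y, x] = (dx, dy)
    """
    h, w, _ = flow.shape
    xi = np.clip(np.rint(lm_t[:, 0]).astype(int), 0, w - 1)
    yi = np.clip(np.rint(lm_t[:, 1]).astype(int), 0, h - 1)
    warped = lm_t + flow[yi, xi]                    # where the flow says each landmark moves
    return np.mean(np.sum((warped - lm_t1) ** 2, axis=1))

# Synthetic example: a constant flow of (+2, -1) pixels everywhere.
flow = np.zeros((32, 32, 2))
flow[..., 0] = 2.0
flow[..., 1] = -1.0
lm_t = np.array([[5.0, 6.0], [20.0, 10.0]])
loss_consistent = flow_consistency_loss(lm_t, lm_t + np.array([2.0, -1.0]), flow)
loss_static = flow_consistency_loss(lm_t, lm_t, flow)
```

Landmarks that follow the flow give zero loss, while landmarks that stay put under a non-zero flow are penalized by the squared flow magnitude.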