Landmark Discovery for Image Modeling - Yuting Zhang's Homepage

Yuting Zhang¹, Yijie Guo¹, Yixin Jin¹, Yijun Luo¹, Zhiyuan He¹, Honglak Lee¹,²

¹University of Michigan, Ann Arbor; ²Google Brain

Deep neural networks can model images with rich latent representations, but they cannot naturally conceptualize structures of object categories in a human-perceptible way. This paper addresses the problem of learning object structures in an image modeling process without supervision. We propose an autoencoding formulation to discover landmarks as explicit structural representations. The encoding module outputs landmark coordinates, whose validity is ensured by constraints that reflect the necessary properties for landmarks. The decoding module takes the landmarks as a part of the learnable input representations in an end-to-end differentiable framework. Our discovered landmarks are semantically meaningful and more predictive of manually annotated landmarks than those discovered by previous methods. The coordinates of our landmarks are also complementary features to pre-trained deep-neural-network representations in recognizing visual attributes. In addition, the proposed method naturally creates an unsupervised, perceptible interface to manipulate object shapes and decode images with controllable structures.

Figure: Overview of the neural network architecture.
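As an illustration of how an encoder can output landmark coordinates while staying end-to-end differentiable, the sketch below turns per-landmark score maps into coordinates with a softmax-weighted mean (often called soft-argmax). It is a minimal NumPy sketch under stated assumptions, not the released implementation: the function name `soft_argmax`, the `(K, H, W)` input layout, and the [0, 1] coordinate convention are all hypothetical.

```python
import numpy as np

def soft_argmax(score_maps):
    """Convert raw score maps to landmark coordinates (hypothetical helper).

    score_maps: array of shape (K, H, W), one raw score map per landmark.
    Returns an array of shape (K, 2) holding (x, y) in [0, 1].
    """
    k, h, w = score_maps.shape
    flat = score_maps.reshape(k, -1)
    # Numerically stable softmax over all pixels of each map.
    flat = flat - flat.max(axis=1, keepdims=True)
    prob = np.exp(flat)
    prob /= prob.sum(axis=1, keepdims=True)
    prob = prob.reshape(k, h, w)
    # Assumed convention: pixel centers mapped linearly onto [0, 1].
    xs = np.linspace(0.0, 1.0, w)
    ys = np.linspace(0.0, 1.0, h)
    # Expected x uses the column marginal, expected y the row marginal.
    x = (prob.sum(axis=1) * xs).sum(axis=1)
    y = (prob.sum(axis=2) * ys).sum(axis=1)
    return np.stack([x, y], axis=1)
```

Because the coordinates are a weighted mean rather than a hard argmax, gradients can flow from a decoder back into the score maps.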

Paper

Unsupervised Discovery of Object Landmarks as Structural Representations
Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, Honglak Lee
Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
Oral presentation
[paper (main, appendices, supp-videos .tar.gz)] [arXiv] [project (code & results)] [poster] [slides] [oral presentation .mp4]

@inproceedings{2018-cvpr-lmdis-rep,
  author={Yuting Zhang and Yijie Guo and Yixin Jin and Yijun Luo and Zhiyuan He and Honglak Lee},
  booktitle={Conference on Computer Vision and Pattern Recognition ({CVPR})},
  title={Unsupervised Discovery of Object Landmarks as Structural Representations},
  year={2018},
  month={June},
  url={http://www.ytzhang.net/files/publications/2018-cvpr-lmdis-rep.pdf},
  arxiv={1804.04412}
}

Code

We release our code and models in Python + TensorFlow. The code can be obtained from our GitHub repository:

Code & Models (@GitHub): https://github.com/YutingZhang/lmdis-rep

Overview of Results

For landmark discovery, our method is compared with Thewlis et al. (ICCV 2017): James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object landmarks by factorized spatial embeddings. In ICCV, 2017.

Videos: [Face Manipulation] [Digit Morphing]

1. Facial landmark discovery & its application to supervised tasks

Figure: Discovering 10 landmarks on CelebA (Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In ICCV, 2015). Compared to those of Thewlis et al., our landmarks are more stable and consistent across images. The errors made by Thewlis et al.'s method are described in the last row. Our paper provides more results of the 10-landmark and 30-landmark models on CelebA (Figure 25 for 10 landmarks and Figure 26 for 30 landmarks, in Appendix E.1).

Figure: Our model can also be trained and tested on unaligned facial images. Please refer to our paper for more results on CelebA non-aligned images (Figure 28 in Appendix E.1) and for results on AFLW (Appendix E.2). The AFLW dataset: Martin Koestinger, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Annotated Facial Landmarks in the Wild: A Large-scale, Real-world Database for Facial Landmark Localization. In First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.

Table: Linear regression is used to map the discovered landmarks to the human-annotated landmarks. In this way, a landmark discovery model can be quickly converted into a detector of human-defined landmarks. Surprisingly, our unsupervised discovery model outperforms recent fully supervised facial landmark detectors. Moreover, our model needs only 1000 labeled facial images, used to train the linear regressor, to achieve the reported error on the MAFL test set (MAFL is a subset of CelebA).

Category	Method	MAFL test set error	AFLW test set error
Fully supervised	RCPR	-	11.60
Fully supervised	CFAN	15.84	10.94
Fully supervised	TCDCN	7.95	7.65
Fully supervised	Cascaded CNN	9.73	8.97
Fully supervised	RAR	-	7.23
Fully supervised	MTCNN	5.39	6.90
Unsupervised discovery	Thewlis et al. (50 discovered landmarks)	6.67	10.53
Unsupervised discovery	Thewlis et al. (dense object frames)	5.83	8.80
Unsupervised discovery	Ours (10 discovered landmarks)	3.46	7.01
Unsupervised discovery	Ours (30 discovered landmarks)	3.15	6.58

Citations for the methods in the table:
RCPR: X. P. Burgos-Artizzu, P. Perona, and P. Dollár. Robust face landmark estimation under occlusion. In ICCV, 2013.
CFAN: Jie Zhang, Shiguang Shan, Meina Kan, and Xilin Chen. Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. In ECCV, 2014.
TCDCN: Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Learning deep representation for face alignment with auxiliary attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(5):918–930, 2016.
Cascaded CNN: Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In CVPR, 2013.
RAR: Shengtao Xiao, Jiashi Feng, Junliang Xing, Hanjiang Lai, Shuicheng Yan, and Ashraf Kassim. Robust facial landmark detection via recurrent attentive-refinement networks. In ECCV, 2016.
MTCNN: Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In ECCV, 2014.
Thewlis et al. (50 discovered landmarks): James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object landmarks by factorized spatial embeddings. In ICCV, 2017.
Thewlis et al. (dense object frames): James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object frames by dense equivariant image labelling. In NIPS, 2017.
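The linear-regression step described in the table caption can be sketched as an ordinary least-squares fit from flattened discovered-landmark coordinates to flattened annotated-landmark coordinates. This is a minimal NumPy sketch for illustration only; the function names, array shapes, and the use of a bias term are assumptions and are not taken from the released code.

```python
import numpy as np

def _with_bias(landmarks):
    """Flatten (N, K, 2) landmarks to (N, 2K) and append a constant column."""
    n = landmarks.shape[0]
    return np.hstack([landmarks.reshape(n, -1), np.ones((n, 1))])

def fit_landmark_regressor(discovered, annotated):
    """Least-squares map from discovered to annotated landmarks (hypothetical).

    discovered: (N, K, 2) discovered landmark coordinates.
    annotated:  (N, M, 2) human-annotated landmark coordinates.
    Returns a weight matrix of shape (2K + 1, 2M).
    """
    x = _with_bias(discovered)
    y = annotated.reshape(annotated.shape[0], -1)
    weights, *_ = np.linalg.lstsq(x, y, rcond=None)
    return weights

def predict_landmarks(weights, discovered):
    """Apply the fitted map; returns (N, M, 2) regressed landmarks."""
    pred = _with_bias(discovered) @ weights
    return pred.reshape(discovered.shape[0], -1, 2)
```

In this setup, only the small weight matrix is fit on labeled images; the discovery model itself stays unsupervised.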

Figure: Visualization for predicting annotated facial landmarks. More results on CelebA (MAFL test set) are in Figure 27, Appendix E.1 of our paper.

Colorful cross: discovered landmark; red dot: annotated landmark; circle: regressed landmark, whose color represents its distance to the annotated landmark. See the color bar for the distance (i.e., prediction error).

Table: The discovered landmarks are an explicit (i.e., directly perceptible by humans) part of the image representation in our model. The landmarks learned without supervision can be used for shape-related facial attribute prediction on CelebA. A linear classifier built upon our discovered landmarks (30 in total, i.e., a 60-dim feature vector) outperforms one built upon the pretrained FaceNet features (128-dim or 1792-dim). Detailed results are in Table 3 of the main paper.

Method	Ours (discovered landmarks)	FaceNet (top-layer)	FaceNet (conv-layer)
Feature dimension	60	128	1792
Accuracy / %	83.2	80.0	82.4
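To make the attribute-prediction setup concrete, the sketch below trains a linear classifier on flattened landmark coordinates (for 30 landmarks, a 60-dim vector) for one binary attribute. The page does not say which linear classifier was used, so logistic regression trained by gradient descent is used here purely as a stand-in; the function names and hyperparameters are hypothetical.

```python
import numpy as np

def train_linear_classifier(features, labels, lr=0.5, steps=500):
    """Fit logistic regression by full-batch gradient descent (stand-in).

    features: (N, D) array, e.g. D = 60 for 30 (x, y) landmarks.
    labels:   (N,) array of 0/1 attribute labels.
    Returns (weights, bias).
    """
    n, d = features.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        # Clip logits so the exponential cannot overflow.
        logits = np.clip(features @ w + b, -30.0, 30.0)
        prob = 1.0 / (1.0 + np.exp(-logits))
        grad = prob - labels  # gradient of the logistic loss w.r.t. logits
        w -= lr * features.T @ grad / n
        b -= lr * grad.mean()
    return w, b

def predict_attribute(w, b, features):
    """Return 0/1 predictions from the sign of the linear score."""
    return (features @ w + b > 0).astype(int)
```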

2. Image manipulation using discovered landmarks

The image representation learnt by our model consists of the discovered landmarks and their latent descriptors. The two parts of the representations turn out to be disentangled. Using our jointly trained decoding module, we can manipulate the object shape without changing other appearance factors. In particular, we synthesize flows to adjust the discovered landmarks of an input image.

More details are in Section 4.4 and Appendix A of our paper.
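As a rough illustration of adjusting a subset of landmarks with a synthesized flow, the sketch below moves selected landmarks along a displacement vector over a sequence of frames and leaves the others fixed. It covers only the landmark side; decoding each frame together with the unchanged latent descriptors would require the trained decoding module, which is not shown. The function name, the constant per-landmark displacement, and the linear schedule are assumptions for illustration.

```python
import numpy as np

def landmark_flow_frames(landmarks, displacement, indices, num_frames):
    """Generate landmark sets that drift along a flow (hypothetical helper).

    landmarks:    (K, 2) landmark coordinates of the input image.
    displacement: (2,) total displacement applied to the selected landmarks.
    indices:      list of landmark indices to move (e.g. mouth landmarks).
    num_frames:   number of frames, from no change to full displacement.
    Returns an array of shape (num_frames, K, 2).
    """
    landmarks = np.asarray(landmarks, dtype=float)
    target = landmarks.copy()
    target[indices] += displacement
    alphas = np.linspace(0.0, 1.0, num_frames)
    # Linear schedule: frame 0 is the input, the last frame is the target.
    return np.stack([(1.0 - a) * landmarks + a * target for a in alphas])
```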

Click here to browse more videos of Face Manipulation.

Video: Face manipulation using our 10-landmark model. Flows are on all landmarks.

Video: Face manipulation using our 30-landmark model. Flows are on all landmarks.

Video: Face manipulation using our 30-landmark model. Flows are on the 3 mouth landmarks.

Figure: Human-body manipulation using our 16-landmark model.

For simple images, such as those in MNIST, landmarks alone can be enough to describe the object shapes. On MNIST, we train our landmark discovery model without the landmark descriptor pathway. With a single model trained for all ten digits, we can perform geometrically meaningful morphing between different digits. More details about our experiments on MNIST are in Appendix B of our paper.

Click here to browse more videos for Morphing MNIST Digits.

Video: Morphing 8 to other digits.

Video: Morphing 4 to other digits.

3. Landmark discovery on cat heads, cars, shoes, and human bodies

Figure: Discovering 10 landmarks on cat heads.

More results are in Appendix E.3 of our paper.

Figure: Discovering 10 landmarks on cars in the profile view.

More results are in Appendix E.4 of our paper.

Figure: Discovering 8 landmarks on shoes in the profile view.

More results are in Appendix E.5 of our paper.

Figure: Discovering 16 landmarks on human body images from Human3.6M (Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014). Unlike the other datasets, the Human3.6M training data is in video format, so we can compute optical flow and use it as self-supervision for the constraint used to discover landmarks.

More details and results are in Appendix C of our paper.
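One way to picture optical flow acting as self-supervision is a consistency penalty: landmarks detected in one frame, moved by the flow, should land on the landmarks detected in the next frame. The sketch below is a simplified illustration under that assumption and is not the loss from the paper or the released code; the function name, pixel-coordinate convention, and nearest-pixel flow lookup are hypothetical.

```python
import numpy as np

def flow_consistency_loss(landmarks_t, landmarks_t1, flow):
    """Mean squared mismatch between flow-warped and detected landmarks.

    landmarks_t:  (K, 2) landmarks in frame t, as (x, y) pixel coordinates.
    landmarks_t1: (K, 2) landmarks in frame t+1, same convention.
    flow:         (H, W, 2) optical flow from t to t+1, as (dx, dy) per pixel.
    """
    h, w, _ = flow.shape
    # Nearest-pixel lookup of the flow at each landmark (assumed for brevity).
    cols = np.clip(np.rint(landmarks_t[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.rint(landmarks_t[:, 1]).astype(int), 0, h - 1)
    warped = landmarks_t + flow[rows, cols]
    return float(np.mean(np.sum((warped - landmarks_t1) ** 2, axis=1)))
```

A differentiable version would replace the nearest-pixel lookup with bilinear sampling so that gradients reach the landmark coordinates.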