Single-Image Depth Inference Using Generative Adversarial Networks
Abstract
1. Introduction
2. Related Works
3. Our Proposed Method
3.1. Problem Formulation
3.2. Network Architecture
3.2.1. Bilinear Upsampling Layer
3.2.2. Residual Transposed Convolution Block
3.2.3. Encoder-Decoder Skip Connections
3.2.4. Loss Function
4. Experiments
4.1. Dataset
4.2. Data Augmentation
4.3. Implementation Details
4.4. Evaluation
- Karsch et al. [37] proposed a non-parametric approach where they first find candidate images that are most similar to the input image from a prior database containing the images and their corresponding depth maps. Next, they warp the candidate images and depth maps using SIFT and optical flow features to align them with the input image. Lastly, they use the warped candidates to formulate an optimization problem that minimizes three terms: a data term that measures the distance of the predicted depth map to the warped candidates, a spatial smoothness term that encourages the depth values to have small intensity gradients, and a database prior term that incorporates the assumptions of the database.
- Ladicky et al. [36] discretized the depth values and rephrased depth estimation as a pixel-wise classification problem (a toy sketch of this discretization is given after this list). They trained a multi-class boosted classifier [61] for each pixel from extracted features such as textons [62], SIFT [63], local quantized ternary patterns [64], and self-similarity features [65].
- M. Liu et al. [41] first clustered the pixels of the image into a set of superpixels. They then used a discrete-continuous conditional random field over the superpixels to predict the depth values, where the continuous variables correspond to the depth values of the superpixels and the discrete variables encode additional information on the neighborhood structure of the superpixels.
- F. Liu et al. [44] predicted the depth maps of superpixels using a convolutional neural network. They then used a conditional random field to impose smoothness and consistency among neighboring superpixels.
- Wang et al. [13] used two convolutional neural networks to model both depth and semantic labels on a global and local scale. To combine the predictions, they used a hierarchical conditional random field.
- Eigen et al. [30] used an AlexNet based architecture to produce a coarse depth map of the scene at a global level followed by another convolutional neural network that makes local refinements to the predicted depth map.
- Eigen and Fergus [31] proposed a multi-scale and multi-task convolutional neural network that jointly predicts the depth maps, surface normals, and semantic labels. The idea is that the knowledge learned by each of the tasks can be shared, which can further improve the performance of the model as compared to learning each of the tasks independently.
- Roy and Todorovic [42] proposed a multi-stage and multi-component framework to predict depth maps using neural regression forests, which combine random forests with convolutional neural networks. They build an ensemble of networks in a tree-like fashion, where every node is tied to a small local network; the predicted depth values are then averaged over the trees.
- Chakrabarti et al. [43] used Gaussian mixture models (GMMs) to model the distribution of depth derivatives at different orientations and scales across small overlapping patches. The mixture weights of the GMMs are inferred using convolutional neural networks.
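To make the classification-based formulation of [36] concrete, the following toy sketch (not code from [36]; the bin range, bin count, and log spacing are our own assumptions for illustration) shows how continuous depth can be discretized into per-pixel class labels and mapped back to depth values:

```python
import numpy as np

def depth_to_labels(depth, d_min=0.7, d_max=10.0, n_bins=40):
    """Discretize a depth map (in meters) into log-spaced bin indices in [0, n_bins)."""
    edges = np.geomspace(d_min, d_max, n_bins + 1)
    return np.clip(np.digitize(depth, edges) - 1, 0, n_bins - 1)

def labels_to_depth(labels, d_min=0.7, d_max=10.0, n_bins=40):
    """Map bin indices back to representative depths (geometric bin centers)."""
    edges = np.geomspace(d_min, d_max, n_bins + 1)
    centers = np.sqrt(edges[:-1] * edges[1:])
    return centers[labels]

# Example: round-trip a synthetic depth map through the discretization.
depth = np.random.uniform(0.7, 10.0, size=(4, 4))
recovered = labels_to_depth(depth_to_labels(depth))
```

A per-pixel classifier trained on such labels then turns depth estimation into an ordinary multi-class prediction problem, at the cost of quantization error controlled by the number of bins.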
- (rel) Mean Relative Error: $\frac{1}{N}\sum_{i} \frac{|d_i - d_i^*|}{d_i^*}$
- (rmse) Root Mean Squared Error: $\sqrt{\frac{1}{N}\sum_{i} (d_i - d_i^*)^2}$
- (log10) Log10 Error: $\frac{1}{N}\sum_{i} |\log_{10} d_i - \log_{10} d_i^*|$
- ($\delta$) Thresholded Accuracy: % of $d_i$ s.t. $\max\!\left(\frac{d_i}{d_i^*}, \frac{d_i^*}{d_i}\right) = \delta < \text{threshold}$, where $\text{threshold} \in \{1.25, 1.25^2, 1.25^3\}$, and $d_i$, $d_i^*$ denote the predicted and ground-truth depth at pixel $i$ (a computation sketch follows this list).
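These metrics can be computed directly from the predicted and ground-truth depth maps. The following is a minimal NumPy sketch; the function name, the `eps` guard, and the convention that invalid ground-truth pixels are masked out by the caller are our own assumptions, not details taken from the paper:

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-8):
    """Compute rel, rmse, log10, and thresholded accuracies for a predicted
    depth map against ground truth. Invalid/zero ground-truth pixels are
    assumed to have been masked out by the caller."""
    pred = pred.astype(np.float64).ravel()
    gt = gt.astype(np.float64).ravel()

    rel = np.mean(np.abs(pred - gt) / (gt + eps))                        # mean relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))                            # root mean squared error
    log10 = np.mean(np.abs(np.log10(pred + eps) - np.log10(gt + eps)))   # log10 error

    delta = np.maximum(pred / (gt + eps), gt / (pred + eps))             # per-pixel ratio
    acc = {f"delta<1.25^{k}": float(np.mean(delta < 1.25 ** k)) for k in (1, 2, 3)}

    return {"rel": float(rel), "rmse": float(rmse), "log10": float(log10), **acc}
```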
4.5. Failure Cases
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| RGB | Red, Green, and Blue |
| 3D | 3 Dimensional |
| 2D | 2 Dimensional |
| SIFT | Scale-Invariant Feature Transform |
| CNN | Convolutional Neural Network |
| GAN | Generative Adversarial Network |
| rel | Mean Relative Error |
| rmse | Root Mean Squared Error |
| log10 | Log10 Error |
References
- Lenz, I.; Lee, H.; Saxena, A. Deep learning for detecting robotic grasps. Int. J. Robot. Res. 2015, 34, 705–724. [Google Scholar] [CrossRef] [Green Version]
- Michels, J.; Saxena, A.; Ng, A.Y. High speed obstacle avoidance using monocular vision and reinforcement learning. In Proceedings of the ACM 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; pp. 593–600. [Google Scholar]
- Hadsell, R.; Sermanet, P.; Ben, J.; Erkan, A.; Scoffier, M.; Kavukcuoglu, K.; Muller, U.; LeCun, Y. Learning long-range vision for autonomous off-road driving. J. Field Robot. 2009, 26, 120–144. [Google Scholar] [CrossRef] [Green Version]
- Huang, P.C.; Lin, J.R.; Li, G.L.; Tai, K.H.; Chen, M.J. Improved depth-assisted error concealment algorithm for 3D video transmission. IEEE Trans. Multimed. 2017, 19, 2625–2632. [Google Scholar] [CrossRef]
- Yang, C.; Cheung, G.; Stankovic, V.; Chan, K.; Ono, N. Sleep apnea detection via depth video and audio feature learning. IEEE Trans. Multimed. 2017, 19, 822–835. [Google Scholar] [CrossRef]
- Yang, C.; Cheung, G.; Stankovic, V. Estimating heart rate and rhythm via 3D motion tracking in depth video. IEEE Trans. Multimed. 2017, 19, 1625–1636. [Google Scholar] [CrossRef]
- Shotton, J.; Sharp, T.; Kipman, A.; Fitzgibbon, A.; Finocchio, M.; Blake, A.; Cook, M.; Moore, R. Real-time human pose recognition in parts from single depth images. Commun. ACM 2013, 56, 116–124. [Google Scholar] [CrossRef]
- Shen, W.; Deng, K.; Bai, X.; Leyvand, T.; Guo, B.; Tu, Z. Exemplar-based human action pose correction. IEEE Trans. Cybern. 2014, 44, 1053–1066. [Google Scholar] [CrossRef] [PubMed]
- Lopes, O.; Reyes, M.; Escalera, S.; Gonzàlez, J. Spherical blurred shape model for 3-D object and pose recognition: Quantitative analysis and HCI applications in smart environments. IEEE Trans. Cybern. 2014, 44, 2379–2390. [Google Scholar] [CrossRef]
- Hu, M.C.; Chen, C.W.; Cheng, W.H.; Chang, C.H.; Lai, J.H.; Wu, J.L. Real-time human movement retrieval and assessment with kinect sensor. IEEE Trans. Cybern. 2015, 45, 742–753. [Google Scholar] [CrossRef] [PubMed]
- Devanne, M.; Wannous, H.; Berretti, S.; Pala, P.; Daoudi, M.; Del Bimbo, A. 3-d human action recognition by shape analysis of motion trajectories on riemannian manifold. IEEE Trans. Cybern. 2015, 45, 1340–1352. [Google Scholar] [CrossRef] [PubMed]
- Liu, A.A.; Su, Y.T.; Jia, P.P.; Gao, Z.; Hao, T.; Yang, Z.X. Multiple/single-view human action recognition via part-induced multitask structural learning. IEEE Trans. Cybern. 2015, 45, 1194–1208. [Google Scholar] [CrossRef]
- Wang, P.; Shen, X.; Lin, Z.; Cohen, S.; Price, B.; Yuille, A.L. Towards unified depth and semantic prediction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2800–2809. [Google Scholar]
- Yang, J.; Gan, Z.; Li, K.; Hou, C. Graph-based segmentation for RGB-D data using 3-D geometry enhanced superpixels. IEEE Trans. Cybern. 2015, 45, 927–940. [Google Scholar] [CrossRef]
- Husain, F.; Dellen, B.; Torras, C. Consistent depth video segmentation using adaptive surface models. IEEE Trans. Cybern. 2015, 45, 266–278. [Google Scholar] [CrossRef]
- Scharstein, D.; Szeliski, R.; Zabih, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In Proceedings of the IEEE Workshop on Stereo and Multi-Baseline Vision, (SMBV 2001); IEEE Computer Society: Washington, DC, USA, 2001; pp. 131–140. [Google Scholar]
- Liu, L.K.; Chan, S.H.; Nguyen, T.Q. Depth reconstruction from sparse samples: Representation, algorithm, and sampling. IEEE Trans. Image Process. 2015, 24, 1983–1996. [Google Scholar] [CrossRef]
- Memisevic, R.; Conrad, C. Stereopsis via deep learning. In Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Granada, Spain, 16 December 2011; Volume 1, p. 2. [Google Scholar]
- Sinz, F.H.; Candela, J.Q.; Bakır, G.H.; Rasmussen, C.E.; Franz, M.O. Learning depth from stereo. In Joint Pattern Recognition Symposium; Springer: Berlin, Germany, 2004; pp. 245–252. [Google Scholar]
- Pan, Y.; Liu, R.; Guan, B.; Du, Q.; Xiong, Z. Accurate depth extraction method for multiple light-coding-based depth cameras. IEEE Trans. Multimed. 2017, 19, 685–701. [Google Scholar] [CrossRef]
- Ge, K.; Hu, H.; Feng, J.; Zhou, J. Depth estimation using a sliding camera. IEEE Trans. Image Process. 2016, 25, 726–739. [Google Scholar] [CrossRef]
- Li, J.; Lu, M.; Li, Z.N. Continuous depth map reconstruction from light fields. IEEE Trans. Image Process. 2015, 24, 3257–3265. [Google Scholar]
- Zhang, Y.; Xiong, Z.; Yang, Z.; Wu, F. Real-time scalable depth sensing with hybrid structured light illumination. IEEE Trans. Image Process. 2014, 23, 97–109. [Google Scholar] [CrossRef]
- Yang, L.; Liu, J.; Tang, X. Depth from water reflection. IEEE Trans. Image Process. 2015, 24, 1235–1243. [Google Scholar] [CrossRef]
- Howard, I.P. Perceiving in Depth, Basic Mechanisms; Oxford University Press: Oxford, UK, 2012; Volume 1. [Google Scholar]
- Reichelt, S.; Häussler, R.; Fütterer, G.; Leister, N. Depth cues in human visual perception and their realization in 3D displays. Proc. SPIE 2010, 7690, 76900B. [Google Scholar]
- Loomis, J.M.; Da Silva, J.A.; Philbeck, J.W.; Fukusima, S.S. Visual perception of location and distance. Curr. Dir. Psychol. Sci. 1996, 5, 72–77. [Google Scholar] [CrossRef]
- Knill, D.C. Reaching for visual cues to depth: The brain combines depth cues differently for motor control and perception. J. Vis. 2005, 5, 2. [Google Scholar] [CrossRef]
- Phan, R.; Androutsos, D. Robust semi-automatic depth map generation in unconstrained images and video sequences for 2D to stereoscopic 3D conversion. IEEE Trans. Multimed. 2014, 16, 122–136. [Google Scholar] [CrossRef]
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the Neural Information Processing Systems 2014 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 2366–2374. [Google Scholar]
- Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 2650–2658. [Google Scholar]
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Tran, L.; Yin, X.; Liu, X. Disentangled representation learning gan for pose-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; Volume 4, p. 7. [Google Scholar]
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Hoiem, D.; Efros, A.A.; Hebert, M. Geometric context from a single image. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV 2005), Beijing, China, 17–20 October 2005; Volume 1, pp. 654–661. [Google Scholar]
- Ladicky, L.; Shi, J.; Pollefeys, M. Pulling things out of perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 89–96. [Google Scholar]
- Karsch, K.; Liu, C.; Kang, S.B. Depth extraction from video using non-parametric sampling. In European Conference on Computer Vision; Springer: Berlin, Germany, 2012; pp. 775–788. [Google Scholar]
- Konrad, J.; Wang, M.; Ishwar, P. 2d-to-3d image conversion by learning depth from examples. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Providence, RI, USA, 16–21 June 2012; pp. 16–22. [Google Scholar]
- Saxena, A.; Chung, S.H.; Ng, A.Y. Learning depth from single monocular images. Available online: http://59.80.44.45/papers.nips.cc/paper/2921-learning-depth-from-single-monocular-images.pdf (accessed on 1 April 2019).
- Liu, B.; Gould, S.; Koller, D. Single image depth estimation from predicted semantic labels. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 13–18 June 2010; pp. 1253–1260. [Google Scholar]
- Liu, M.; Salzmann, M.; He, X. Discrete-continuous depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 716–723. [Google Scholar]
- Roy, A.; Todorovic, S. Monocular depth estimation using neural regression forest. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 5506–5514. [Google Scholar]
- Chakrabarti, A.; Shao, J.; Shakhnarovich, G. Depth from a single image by harmonizing overcomplete local network predictions. In Proceedings of the Neural Information Processing Systems Conference (NIPS 2016), Barcelona, Spain, 5–10 December 2016; pp. 2658–2666. [Google Scholar]
- Liu, F.; Shen, C.; Lin, G.; Reid, I. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 2024–2039. [Google Scholar] [CrossRef]
- Bousmalis, K.; Trigeorgis, G.; Silberman, N.; Krishnan, D.; Erhan, D. Domain separation networks. In Proceedings of the Neural Information Processing Systems Conference (NIPS 2016), Barcelona, Spain, 5–10 December 2016; pp. 343–351. [Google Scholar]
- Hoffman, J.; Wang, D.; Yu, F.; Darrell, T. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv, 2016; arXiv:1612.02649. [Google Scholar]
- Wulfmeier, M.; Bewley, A.; Posner, I. Addressing appearance change in outdoor robotics with adversarial domain adaptation. arXiv, 2017; arXiv:1703.01461. [Google Scholar]
- Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv, 2015; arXiv:1511.06434. [Google Scholar]
- Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Huang, X.; Wang, X.; Metaxas, D. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv, 2016; arXiv:1612.03242. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv, 2014; arXiv:1409.1556. [Google Scholar]
- Odena, A.; Dumoulin, V.; Olah, C. Deconvolution and checkerboard artifacts. Distill 2016, 1, e3. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Sun, J.; Ren, S. Identity mappings in deep residual networks. In European Conference on Computer Vision; Springer: Berlin, Germany, 2016; pp. 630–645. [Google Scholar]
- Srivastava, R.K.; Greff, K.; Schmidhuber, J. Training very deep networks. In Proceedings of the Neural Information Processing Systems 2015, Montreal, QC, Canada, 7–12 December 2015; pp. 2377–2385. [Google Scholar]
- Mao, X.; Shen, C.; Yang, Y.B. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Proceedings of the Neural Information Processing Systems 2016 (NIPS 2016), Barcelona, Spain, 5–10 December 2016; pp. 2802–2810. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin, Germany, 2015; pp. 234–241. [Google Scholar]
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision; Springer: Berlin, Germany, 2012; pp. 746–760. [Google Scholar]
- Levin, A.; Lischinski, D.; Weiss, Y. Colorization using optimization. In ACM Transactions on Graphics; ACM: New York, NY, USA, 2004; Volume 23, pp. 689–694. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
- Zhang, J.; Mitliagkas, I.; Ré, C. YellowFin and the art of momentum tuning. arXiv, 2017; arXiv:1706.03471. [Google Scholar]
- Torralba, A.; Murphy, K.; Freeman, W. Sharing features: Efficient boosting procedures for multiclass object detection. In Proceedings of the 2004 IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 27 June–2 July 2004; p. II. [Google Scholar]
- Malik, J.; Belongie, S.; Leung, T.; Shi, J. Contour and texture analysis for image segmentation. Int. J. Comput. Vis. 2001, 43, 7–27. [Google Scholar] [CrossRef]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Ul Hussain, S.; Triggs, B. Visual recognition using local quantized patterns. In European Conference on Computer Vision; Springer: Berlin, Germany, 2012; pp. 716–729. [Google Scholar]
- Shechtman, E.; Irani, M. Matching local self-similarities across images and videos. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8. [Google Scholar]
Error metrics (rel, log10, rmse): lower is better. Accuracy metrics (δ thresholds): higher is better.

| Method | rel | log10 | rmse | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|
| Karsch et al. [37] | 0.349 | 0.134 | 1.214 | 0.447 | 0.745 | 0.897 |
| Ladicky et al. [36] | - | - | - | 0.542 | 0.829 | 0.941 |
| M. Liu et al. [41] | 0.335 | 0.127 | 1.060 | - | - | - |
| F. Liu et al. [44] | 0.230 | 0.095 | 0.824 | 0.614 | 0.883 | 0.975 |
| Wang et al. [13] | 0.220 | 0.094 | 0.745 | 0.605 | 0.890 | 0.970 |
| Eigen et al. [30] | 0.215 | - | 0.907 | 0.611 | 0.887 | 0.971 |
| Roy and Todorovic [42] | 0.187 | 0.078 | 0.744 | - | - | - |
| Eigen and Fergus [31] | 0.158 | - | 0.641 | 0.769 | 0.950 | 0.988 |
| Chakrabarti et al. [43] | 0.149 | - | 0.620 | 0.806 | 0.958 | 0.987 |
| Ours | 0.176 | 0.074 | 0.597 | 0.867 | 0.965 | 0.988 |