A Multi-Level Approach to Waste Object Segmentation
Abstract
1. Introduction
2. Related Work
2.1. Waste Object Segmentation and Related Tasks
2.2. Semantic Segmentation
3. Our Approach
3.1. Layered Deep Models
3.2. Coherent Segmentation with CRF
3.3. Generating Object Region Proposals
3.4. Model Inference
3.5. Model Learning
4. Experimental Evaluation
- MJU-Waste Dataset. In this work, we created a new benchmark for waste object segmentation. The dataset is available from https://github.com/realwecan/mju-waste/. To the best of our knowledge, MJU-Waste is the largest public benchmark available for waste object segmentation, with 1485 images for training, 248 for validation and 742 for testing. For each color image, we provide a co-registered depth image captured with an RGBD camera. We manually labeled every image. More details about our dataset are presented in Section 4.1.
- TACO Dataset. The Trash Annotations in COntext (TACO) dataset [5] is another public benchmark for waste object segmentation. Images are collected mainly from outdoor environments such as woods, roads and beaches. The dataset is available from http://tacodataset.org/. Individual images in this dataset are licensed either under CC BY 4.0 or under the ODbL (c) OpenLitterMap & Contributors license; see http://tacodataset.org/ for details. The current version of the dataset contains 1500 images, and a split with 1200 images for training, 150 for validation and 150 for testing is provided by the authors. In all experiments that follow, we use this split.
- Intersection over Union (IoU) for the c-th class is the intersection of the prediction and ground-truth regions of the c-th class over the union of them. Writing $P_c$ and $G_c$ for the predicted and ground-truth regions of class $c$, it is defined as $\mathrm{IoU}_c = \frac{|P_c \cap G_c|}{|P_c \cup G_c|}$.
- mean IoU (mIoU) is the average IoU of all $C$ classes: $\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{IoU}_c$.
- Pixel Precision (Prec) for the c-th class is the percentage of correctly classified pixels among all predictions of the c-th class: $\mathrm{Prec}_c = \frac{|P_c \cap G_c|}{|P_c|}$.
- Mean pixel precision (Mean) is the average class-wise pixel precision: $\mathrm{Mean} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{Prec}_c$. A minimal code sketch computing these metrics from a confusion matrix follows this list.
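To make the metric definitions above concrete, the following minimal sketch computes per-class IoU, mIoU, pixel precision and mean precision from a confusion matrix. It is an illustrative implementation only; the function name and the use of NumPy are our own choices, not part of the released code.

```python
import numpy as np

def segmentation_metrics(conf):
    """Compute per-class IoU, mIoU, per-class pixel precision and mean precision.

    `conf` is a C x C confusion matrix where conf[i, j] counts pixels whose
    ground-truth class is i and whose predicted class is j.
    """
    tp = np.diag(conf).astype(np.float64)      # correctly classified pixels per class
    fp = conf.sum(axis=0) - tp                 # pixels predicted as c but labeled otherwise
    fn = conf.sum(axis=1) - tp                 # pixels labeled c but predicted otherwise
    iou = tp / np.maximum(tp + fp + fn, 1.0)   # IoU_c = |P_c ∩ G_c| / |P_c ∪ G_c|
    prec = tp / np.maximum(tp + fp, 1.0)       # Prec_c = |P_c ∩ G_c| / |P_c|
    return iou, iou.mean(), prec, prec.mean()  # (IoU, mIoU, Prec, Mean)
```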
4.1. The MJU-Waste Dataset
4.2. Implementation Details
- Segmentation networks. Following [21,74], we train both the scene-level and the object-level segmentation networks with the polynomial ("poly") learning rate policy, in which the initial learning rate is decayed by a power factor as training progresses (see the training sketch after this list). We train for 50 epochs on both datasets with a batch size of 4 images. In all experiments, we use ImageNet-pretrained backbones [85] and a standard SGD optimizer with momentum and weight decay. To avoid overfitting, standard data augmentation techniques including random mirroring, random resizing (with a maximum resize factor of 2), cropping and random Gaussian blur [21] are used. The base (and cropped) image sizes are set separately for the two networks during training.
- Object region proposals. To maintain a concise set of object region proposals, we empirically set a minimum and a maximum number of pixels for an object region proposal. For MJU-Waste, these thresholds are set to 900 and 40,000 pixels, respectively. For TACO, they are set to 25,000 and 250,000 pixels due to the larger image sizes. Object region proposals that are either too small or too large are simply discarded (see the filtering sketch after this list).
- CRF parameters. We initialize the CRF parameters with the default values in [19] and follow a simple grid search strategy to find the optimal values of the parameters in each term. For reference, the CRF parameters used in our experiments are listed in Table 2. We note that our model is somewhat robust to the exact values of these parameters, and for each dataset we use the same parameters for all segmentation models.
- Training details and code. In this work, we use a publicly available implementation to train both the scene-level and the object-level segmentation networks. The CNN training code is available from: https://github.com/Tramac/awesome-semantic-segmentation-pytorch/. We use the default training settings unless otherwise specified earlier in this section. For CRF inference we use another public implementation; the CRF inference code is available from: https://github.com/lucasb-eyer/pydensecrf/. The complete set of CRF parameters is summarized in Table 2. A hedged sketch of the CRF refinement step using this library appears after this list.
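As a reference for the poly learning rate policy described above, here is a minimal PyTorch sketch of the optimizer and schedule. The numeric values (initial learning rate, power factor, momentum and weight decay) are common defaults used for illustration, not necessarily the exact settings of our experiments.

```python
import torch

# Illustrative hyperparameters (assumed, not the paper's exact values).
base_lr, power, momentum, weight_decay = 1e-3, 0.9, 0.9, 1e-4
epochs, batch_size, num_train_images = 50, 4, 1485            # MJU-Waste training split
max_iters = epochs * (num_train_images // batch_size)

model = torch.nn.Conv2d(3, 2, 1)  # stand-in for a segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=momentum, weight_decay=weight_decay)

# Poly policy: lr(it) = base_lr * (1 - it / max_iters) ** power, stepped once per iteration.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / max_iters) ** power)
```

In a training loop, `scheduler.step()` would be called after each iteration so that the learning rate decays smoothly towards zero over the 50 epochs.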
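The proposal filtering rule above amounts to a simple area test. A sketch, assuming proposals are represented as binary masks, could look like this:

```python
import numpy as np

def filter_proposals(masks, min_area=900, max_area=40_000):
    """Keep only object region proposals whose pixel count lies within
    [min_area, max_area]; the defaults are the MJU-Waste thresholds,
    while TACO would use 25,000 and 250,000 instead."""
    return [m for m in masks if min_area <= int(np.count_nonzero(m)) <= max_area]
```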
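For completeness, the snippet below shows how a dense CRF refinement step can be run with the pydensecrf library referenced above. It includes only the standard spatial-smoothness and RGB appearance kernels with illustrative parameter values; the object-level unary term and the depth-based kernel of our full model, as well as the tuned parameters of Table 2, are not reproduced here.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(probs, rgb, iters=5):
    """Mean-field dense CRF refinement of per-pixel class probabilities.

    probs: (C, H, W) softmax output of a segmentation network.
    rgb:   (H, W, 3) uint8 color image.
    """
    C, H, W = probs.shape
    d = dcrf.DenseCRF2D(W, H, C)
    d.setUnaryEnergy(unary_from_softmax(probs))            # unary term: -log p(x_i)
    d.addPairwiseGaussian(sxy=3, compat=3)                 # spatial smoothness kernel
    d.addPairwiseBilateral(sxy=20, srgb=10,                # appearance (color-sensitive) kernel
                           rgbim=np.ascontiguousarray(rgb), compat=10)
    q = d.inference(iters)                                 # mean-field inference
    return np.argmax(np.array(q).reshape((C, H, W)), axis=0)
```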
4.3. Results on the MJU-Waste Dataset
- FCN-8s [17]. FCN is a seminal work in CNN-based semantic segmentation. In particular, FCN transforms fully connected layers into convolutional layers, enabling a classification network to output a probabilistic heatmap of object layouts. In our experiments, we use the network architecture proposed in [17], which adopts a VGG-16 [86] backbone. In terms of skip connections, we choose the FCN-8s variant as it retains more precise location information by fusing features from the earlier pool3 and pool4 layers.
- PSPNet [21]. PSPNet proposes the pyramid pooling module for multi-scale context aggregation. Specifically, we choose the ResNet-101 [14] backbone variant for a good tradeoff between model complexity and performance. The pyramid pooling module concatenates the features from the last ResNet block with the same features after average pooling at 1×1, 2×2, 3×3 and 6×6 grid sizes and upsampling, in order to harvest multi-scale contexts. A minimal sketch of such a module is given after this list.
- CCNet [22]. CCNet presents an attention-based context aggregation method for semantic segmentation. We also choose the ResNet-101 backbone for this method, so the overall architecture is similar to PSPNet except that we use the Recurrent Criss-Cross Attention (RCCA) module for context modeling. Specifically, given the backbone features, the RCCA module computes a self-attention map that aggregates context information along the horizontal and vertical directions. The resultant features are then concatenated with the input features for downstream segmentation.
- DeepLabv3 [23]. DeepLabv3 proposes the Atrous Spatial Pyramid Pooling (ASPP) module for capturing long-range contexts. Specifically, ASPP applies parallel dilated convolutions with varying atrous rates to encode features from different receptive field sizes. The atrous rates used in our experiments are 12, 24 and 36; a minimal ASPP sketch is given after this list. In addition, we experimented with both ResNet-50 and ResNet-101 backbones on the MJU-Waste dataset to explore the performance impact of different backbone architectures.
- Baseline. DeepLabv3 baseline with a ResNet-50 backbone.
- Object only. The above baseline with additional object-level reasoning. This variant is implemented by retaining only the two unary terms of Equation (2); all pixel-level pairwise terms are turned off. It tests whether object-level reasoning alone contributes to the baseline performance.
- Object and appearance. The baseline with object-level reasoning plus the appearance and spatial smoothing pairwise terms; the depth pairwise terms are turned off. This tests whether the additional pixel affinity information (without depth) is useful and, by comparison with the full model, isolates the contribution of the depth pairwise terms.
- Appearance and depth. The baseline with all pixel-level pairwise terms but without the object-level unary term. This tests whether an object-level fine segmentation network is necessary, as well as the contribution of the pixel-level pairwise terms alone.
- Full model. Our full model with all components proposed in Section 3.
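To illustrate the pyramid pooling idea used by PSPNet above, here is a minimal PyTorch module in that spirit. It is a sketch with assumed channel sizes, not the implementation used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Average-pool features to several grid sizes, project with 1x1 convs,
    upsample and concatenate with the input to harvest multi-scale context."""
    def __init__(self, in_ch=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
                          nn.BatchNorm2d(out_ch),
                          nn.ReLU(inplace=True))
            for b in bins)

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode='bilinear',
                                align_corners=False) for stage in self.stages]
        return torch.cat([x] + pooled, dim=1)   # (N, 2 * in_ch, H, W)
```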
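Similarly, a minimal sketch of an ASPP head with the atrous rates 12, 24 and 36 mentioned above might look as follows. The image-level pooling branch of the original DeepLabv3 design is omitted for brevity, and the channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel dilated convolutions with different atrous rates, concatenated
    and projected, to encode context from different receptive field sizes."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(12, 24, 36)):
        super().__init__()
        branches = [nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)]   # 1x1 branch
        branches += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r,
                               dilation=r, bias=False) for r in rates]     # dilated branches
        self.branches = nn.ModuleList(branches)
        self.project = nn.Conv2d(out_ch * len(branches), out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```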
4.4. Results on the TACO Dataset
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Eriksen, C.W.; James, J.D.S. Visual attention within and around the field of focal attention: A zoom lens model. Percept. Psychophys. 1986, 40, 225–240.
- Shulman, G.L.; Wilson, J. Spatial frequency and selective attention to local and global information. Perception 1987, 16, 89–101.
- Pashler, H.E. The Psychology of Attention; MIT Press: Cambridge, MA, USA, 1999.
- Palmer, S.E. Vision Science: Photons to Phenomenology; MIT Press: Cambridge, MA, USA, 1999.
- Proença, P.F.; Simões, P. TACO: Trash Annotations in Context for Litter Detection. arXiv 2020, arXiv:2003.06975.
- Alexe, B.; Deselaers, T.; Ferrari, V. Measuring the objectness of image windows. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2189–2202.
- Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171.
- Zitnick, C.L.; Dollár, P. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision; Springer: Zurich, Switzerland, 2014; pp. 391–405.
- Cheng, M.M.; Zhang, Z.; Lin, W.Y.; Torr, P. BING: Binarized normed gradients for objectness estimation at 300fps. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3286–3293.
- Wang, Y.; Liu, J.; Li, Y.; Yan, J.; Lu, H. Objectness-aware semantic segmentation. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 307–311.
- Xia, F.; Wang, P.; Chen, L.C.; Yuille, A.L. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In European Conference on Computer Vision; Springer: Amsterdam, The Netherlands, 2016; pp. 648–663.
- Alexe, B.; Deselaers, T.; Ferrari, V. ClassCut for unsupervised class segmentation. In European Conference on Computer Vision; Springer: Crete, Greece, 2010; pp. 380–393.
- Vezhnevets, A.; Ferrari, V.; Buhmann, J.M. Weakly supervised semantic segmentation with a multi-image model. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 643–650.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Zhang, Z.; Lin, H.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. ResNeSt: Split-Attention Networks. arXiv 2020, arXiv:2004.08955.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems; NeurIPS Foundation: Montreal, QC, Canada, 2015; pp. 91–99.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
- Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009.
- Krähenbühl, P.; Koltun, V. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems; NeurIPS Foundation: Granada, Spain, 2011; pp. 109–117.
- Han, J.; Shao, L.; Xu, D.; Shotton, J. Enhanced computer vision with Microsoft Kinect sensor: A review. IEEE Trans. Cybern. 2013, 43, 1318–1334.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
- Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the 2019 International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 603–612.
- Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
- Yang, M.; Thung, G. Classification of Trash for Recyclability Status; CS229 Project Report; Stanford University: Stanford, CA, USA, 2016.
- Bircanoğlu, C.; Atay, M.; Beşer, F.; Genç, Ö.; Kızrak, M.A. RecycleNet: Intelligent waste sorting using deep neural networks. In Proceedings of the 2018 Innovations in Intelligent Systems and Applications (INISTA), Thessaloniki, Greece, 3–5 July 2018; pp. 1–7.
- Aral, R.A.; Keskin, Ş.R.; Kaya, M.; Hacıömeroğlu, M. Classification of TrashNet dataset based on deep learning models. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 2058–2062.
- Awe, O.; Mengistu, R.; Sreedhar, V. Smart Trash Net: Waste Localization and Classification; CS229 Project Report; Stanford University: Stanford, CA, USA, 2017.
- Chu, Y.; Huang, C.; Xie, X.; Tan, B.; Kamal, S.; Xiong, X. Multilayer hybrid deep-learning method for waste classification and recycling. Comput. Intell. Neurosci. 2018, 2018, 5060857.
- Vo, A.H.; Son, L.H.; Vo, M.T.; Le, T. A Novel Framework for Trash Classification Using Deep Transfer Learning. IEEE Access 2019, 7, 178631–178639.
- Ramalingam, B.; Lakshmanan, A.K.; Ilyas, M.; Le, A.V.; Elara, M.R. Cascaded machine-learning technique for debris classification in floor-cleaning robot application. Appl. Sci. 2018, 8, 2649.
- Yin, J.; Apuroop, K.G.S.; Tamilselvam, Y.K.; Mohan, R.E.; Ramalingam, B.; Le, A.V. Table Cleaning Task by Human Support Robot Using Deep Learning Technique. Sensors 2020, 20, 1698.
- Rad, M.S.; von Kaenel, A.; Droux, A.; Tieche, F.; Ouerhani, N.; Ekenel, H.K.; Thiran, J.P. A computer vision system to localize and classify wastes on the streets. In International Conference on Computer Vision Systems; Springer: Shenzhen, China, 2017; pp. 195–204.
- Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv 2013, arXiv:1312.6229.
- Wang, Y.; Zhang, X. Autonomous garbage detection for intelligent urban management. MATEC Web Conf. 2018, 232, 01056.
- Bai, J.; Lian, S.; Liu, Z.; Wang, K.; Liu, D. Deep learning based robot for automatically picking up garbage on the grass. IEEE Trans. Consum. Electron. 2018, 64, 382–389.
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
- Deepa, T.; Roka, S. Estimation of garbage coverage area in water terrain. In Proceedings of the 2017 International Conference On Smart Technologies For Smart Nation (SmartTechCon), Bengaluru, India, 17–19 August 2017; pp. 347–352.
- Mittal, G.; Yagnik, K.B.; Garg, M.; Krishnan, N.C. SpotGarbage: Smartphone app to detect garbage using deep learning. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, Heidelberg, Germany, 12–16 September 2016; pp. 940–945.
- Zeng, D.; Zhang, S.; Chen, F.; Wang, Y. Multi-scale CNN based garbage detection of airborne hyperspectral data. IEEE Access 2019, 7, 104514–104527.
- Qiu, Y.; Chen, J.; Guo, J.; Zhang, J.; Liu, S.; Chen, S. Three dimensional object segmentation based on spatial adaptive projection for solid waste. In CCF Chinese Conference on Computer Vision; Springer: Tianjin, China, 2017; pp. 453–464.
- Wang, C.; Liu, S.; Zhang, J.; Feng, Y.; Chen, S. RGB-D Based Object Segmentation in Severe Color Degraded Environment. In CCF Chinese Conference on Computer Vision; Springer: Tianjin, China, 2017; pp. 465–476.
- Grard, M.; Brégier, R.; Sella, F.; Dellandréa, E.; Chen, L. Object segmentation in depth maps with one user click and a synthetically trained fully convolutional network. In Human Friendly Robotics; Springer: Reggio Emilia, Italy, 2019; pp. 207–221.
- He, X.; Zemel, R.S.; Carreira-Perpiñán, M.Á. Multiscale conditional random fields for image labeling. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 27 June–2 July 2004; Volume 2, p. II.
- Shotton, J.; Winn, J.; Rother, C.; Criminisi, A. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In European Conference on Computer Vision; Springer: Graz, Austria, 2006; pp. 1–15.
- Ladickỳ, L.; Russell, C.; Kohli, P.; Torr, P.H. Associative hierarchical CRFs for object class image segmentation. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 739–746.
- Gould, S.; Fulton, R.; Koller, D. Decomposing a scene into geometric and semantically consistent regions. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 27 September–4 October 2009; pp. 1–8.
- Kumar, M.P.; Koller, D. Efficiently selecting regions for scene understanding. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3217–3224.
- Munoz, D.; Bagnell, J.A.; Hebert, M. Stacked hierarchical labeling. In European Conference on Computer Vision; Springer: Crete, Greece, 2010; pp. 57–70.
- Tighe, J.; Lazebnik, S. SuperParsing: Scalable nonparametric image parsing with superpixels. In European Conference on Computer Vision; Springer: Crete, Greece, 2010; pp. 352–365.
- Liu, C.; Yuen, J.; Torralba, A. Nonparametric scene parsing via label transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 2368–2382.
- Lempitsky, V.; Vedaldi, A.; Zisserman, A. Pylon model for semantic segmentation. In Advances in Neural Information Processing Systems; NeurIPS Foundation: Granada, Spain, 2011; pp. 1485–1493.
- Liu, W.; Rabinovich, A.; Berg, A.C. ParseNet: Looking wider to see better. arXiv 2015, arXiv:1506.04579.
- Mostajabi, M.; Yadollahpour, P.; Shakhnarovich, G. Feedforward semantic segmentation with zoom-out features. In Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3376–3385.
- Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
- Lin, G.; Shen, C.; Van Den Hengel, A.; Reid, I. Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the 2016 Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 3194–3203.
- Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In Proceedings of the 2018 Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7151–7160.
- Ding, H.; Jiang, X.; Shuai, B.; Qun Liu, A.; Wang, G. Context contrasted feature and gated multi-scale aggregation for scene segmentation. In Proceedings of the 2018 Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2393–2402.
- Lin, D.; Ji, Y.; Lischinski, D.; Cohen-Or, D.; Huang, H. Multi-scale context intertwining for semantic segmentation. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 603–619.
- Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; Torr, P.H. Conditional random fields as recurrent neural networks. In Proceedings of the 2015 International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1529–1537.
- Schwing, A.G.; Urtasun, R. Fully connected deep structured networks. arXiv 2015, arXiv:1503.02351.
- Liu, Z.; Li, X.; Luo, P.; Loy, C.C.; Tang, X. Semantic image segmentation via deep parsing network. In Proceedings of the 2015 International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1377–1385.
- Jampani, V.; Kiefel, M.; Gehler, P.V. Learning sparse high dimensional filters: Image filtering, dense CRFs and bilateral neural networks. In Proceedings of the 2016 Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4452–4461.
- Chandra, S.; Kokkinos, I. Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. In European Conference on Computer Vision; Springer: Amsterdam, The Netherlands, 2016; pp. 402–418.
- Pohlen, T.; Hermans, A.; Mathias, M.; Leibe, B. Full-resolution residual networks for semantic segmentation in street scenes. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4151–4160.
- Gadde, R.; Jampani, V.; Kiefel, M.; Kappler, D.; Gehler, P.V. Superpixel convolutional networks using bilateral inceptions. In European Conference on Computer Vision; Springer: Amsterdam, The Netherlands, 2016; pp. 597–613.
- Li, X.; Jie, Z.; Wang, W.; Liu, C.; Yang, J.; Shen, X.; Lin, Z.; Chen, Q.; Yan, S.; Feng, J. FoveaNet: Perspective-aware urban scene parsing. In Proceedings of the 2017 International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 784–792.
- Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the 2015 International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1520–1528.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Munich, Germany, 2015; pp. 234–241.
- Yuan, Y.; Wang, J. OCNet: Object context network for scene parsing. arXiv 2018, arXiv:1809.00916.
- Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Change Loy, C.; Lin, D.; Jia, J. PSANet: Point-wise spatial attention network for scene parsing. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 267–283.
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154.
- Lin, G.; Milan, A.; Shen, C.; Reid, I. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1925–1934.
- Takikawa, T.; Acuna, D.; Jampani, V.; Fidler, S. Gated-SCNN: Gated shape CNNs for semantic segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 5229–5238.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 801–818.
- Gupta, S.; Girshick, R.; Arbeláez, P.; Malik, J. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision; Springer: Zurich, Switzerland, 2014; pp. 345–360.
- Li, Z.; Gan, Y.; Liang, X.; Yu, Y.; Cheng, H.; Lin, L. LSTM-CF: Unifying context modeling and fusion with LSTMs for RGB-D scene labeling. In European Conference on Computer Vision; Springer: Amsterdam, The Netherlands, 2016; pp. 541–557.
- Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2650–2658.
- Hu, X.; Yang, K.; Fei, L.; Wang, K. ACNet: Attention based network to exploit complementary features for RGBD semantic segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1440–1444.
- Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In Asian Conference on Computer Vision; Springer: Taipei, Taiwan, 2016; pp. 213–228.
- Qi, X.; Liao, R.; Jia, J.; Fidler, S.; Urtasun, R. 3D graph neural networks for RGBD semantic segmentation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5199–5208.
- Lai, K.; Bo, L.; Ren, X.; Fox, D. A large-scale hierarchical multi-view RGB-D object dataset. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 1817–1824.
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; NeurIPS Foundation: Lake Tahoe, NV, USA, 2012; pp. 1097–1105.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
Dataset | Modalities | Images | Data Split (train/val/test) | Image Size | Objects Per Image | Classes
---|---|---|---|---|---|---
MJU-Waste | RGBD | 2475 | 1485/248/742 | | single | 
TACO | RGB only | 1500 | 1200/150/150 | | ≈ | 
Table 2. CRF parameters used in our experiments.

Dataset | CRF Parameter Values (Per Term)
---|---
MJU-Waste | 1, 3, 1, 1, 20, 20, 1, 10, 20
TACO | 1, 3, 1, -, 100, 20, 10, -, -
Dataset: MJU-Waste (Test) | Backbone | IoU | mIoU | Prec | Mean
---|---|---|---|---|---
Baseline Approaches | | | | |
FCN-8s [17] | VGG-16 | 75.28 | 87.35 | 85.95 | 92.83
PSPNet [21] | ResNet-101 | 78.62 | 89.06 | 86.42 | 93.11
CCNet [22] | ResNet-101 | 83.44 | 91.54 | 92.92 | 96.35
DeepLabv3 [23] | ResNet-50 | 79.92 | 89.73 | 86.30 | 93.06
DeepLabv3 [23] | ResNet-101 | 84.11 | 91.88 | 89.69 | 94.77
Proposed Multi-Level (ML) Model | | | | |
FCN-8s-ML | VGG-16 | 82.29 (+7.01) | 90.95 (+3.60) | 91.75 (+5.80) | 95.76 (+2.93)
PSPNet-ML | ResNet-101 | 81.81 (+3.19) | 90.70 (+1.64) | 89.65 (+3.23) | 94.73 (+1.62)
CCNet-ML | ResNet-101 | 86.63 (+3.19) | 93.17 (+1.63) | 96.05 (+3.13) | 97.92 (+1.57)
DeepLabv3-ML | ResNet-50 | 84.35 (+4.43) | 92.00 (+2.27) | 91.73 (+5.43) | 95.78 (+2.72)
DeepLabv3-ML | ResNet-101 | 87.84 (+3.73) | 93.79 (+1.91) | 94.43 (+4.74) | 97.14 (+2.37)
Dataset: MJU-Waste (val) | Object? | Appearance? | Depth? | IoU | mIoU | Prec | Mean
---|---|---|---|---|---|---|---
Baseline | ✗ | ✗ | ✗ | 80.86 | 90.24 | 87.49 | 93.67
+ components | ✓ | ✗ | ✗ | 81.43 | 90.53 | 88.06 | 93.96
 | ✓ | ✓ | ✗ | 85.44 | 92.58 | 91.79 | 95.83
 | ✗ | ✓ | ✓ | 83.45 | 91.57 | 91.84 | 95.83
Full model | ✓ | ✓ | ✓ | 86.07 | 92.90 | 92.77 | 96.32
MJU-Waste (val) | Scene-Level | Object-Level | Pixel-Level | Total |
---|---|---|---|---|
inference time (ms) | 52 | 352 | 398 | 802 |
Dataset: TACO (Test) | Backbone | IoU | mIoU | Prec | Mean
---|---|---|---|---|---
Baseline Approaches | | | | |
FCN-8s [17] | VGG-16 | 70.43 | 84.31 | 85.50 | 92.21
DeepLabv3 [23] | ResNet-101 | 83.02 | 90.99 | 88.37 | 94.00
Proposed Multi-Level (ML) Model | | | | |
FCN-8s-ML | VGG-16 | 74.21 (+3.78) | 86.35 (+2.04) | 90.36 (+4.86) | 94.65 (+2.44)
DeepLabv3-ML | ResNet-101 | 86.58 (+3.56) | 92.90 (+1.91) | 92.52 (+4.15) | 96.07 (+2.07)
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Wang, T.; Cai, Y.; Liang, L.; Ye, D. A Multi-Level Approach to Waste Object Segmentation. Sensors 2020, 20, 3816. https://doi.org/10.3390/s20143816