AIFormer: Adaptive Interaction Transformer for 3D Point Cloud Understanding
Abstract
1. Introduction
- We propose the Adaptive Interaction Transformer Block, which extracts local relational features within the local region of each reference point and models long-range dependencies between reference points. The resulting local and global features are then fused through Adaptive Interaction, enabling AIFormer to capture both local and global point semantics effectively (a minimal fusion sketch follows this list).
- We propose enhancement approaches for local and global features that facilitate information communication between points. The former extracts high-frequency geometric features from the local region of a reference point using a geometric relation function, while the latter captures richer positional and semantic information through contextual relative semantic encoding used as position encoding.
- We propose a hierarchical transformer architecture built on the AIFormer Block, called AIFormer, whose effectiveness is demonstrated through extensive experiments and analysis; it achieves state-of-the-art or comparable performance on several 3D point cloud segmentation tasks.
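To make the first contribution concrete, the following is a minimal PyTorch sketch of the adaptive interaction idea: a per-point, per-channel gate decides how much of the local and global branch to keep. The class name, gating formulation, and dimensions are our own illustrative assumptions, not the exact AIFormer module.

```python
import torch
import torch.nn as nn

class AdaptiveInteractionSketch(nn.Module):
    """Hypothetical fusion of local and global point features.

    A hedged sketch of the general idea, not the paper's exact
    Adaptive Interaction Module: each point predicts a gate that
    decides how much of the local vs. global branch to keep.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.Sigmoid(),  # per-channel weight in [0, 1]
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # local_feat, global_feat: (N, dim) features for N points
        g = self.gate(torch.cat([local_feat, global_feat], dim=-1))
        fused = g * local_feat + (1.0 - g) * global_feat
        return self.proj(fused)

# usage: fuse 1024 points with 96-dim features
fuse = AdaptiveInteractionSketch(dim=96)
out = fuse(torch.randn(1024, 96), torch.randn(1024, 96))
print(out.shape)  # torch.Size([1024, 96])
```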
2. Related Work
2.1. 3D Point Cloud Analysis
2.2. Transformer Architectures
2.3. Point Cloud Transformers
3. Method
3.1. Review of Point-Based Transformers
3.2. Adaptive Interaction Transformer Block
3.2.1. Local Relation Aggregation Module
3.2.2. Global Context Aggregation Module
3.2.3. Adaptive Interaction Module
3.3. Relevant Components
3.3.1. Initial Point Embedding
3.3.2. Downsample and Upsample
4. Experiments
4.1. Data and Metrics
- The Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset is one of the most notable indoor datasets. It focuses on indoor environments, covering six large indoor areas scanned across three different buildings. The data were primarily captured as RGB-D imagery and converted into point clouds. S3DIS offers detailed semantic labels over thirteen common object categories (e.g., walls, floors, chairs, and tables). Typically, Areas 1–4 and 6 are used as the training set, while Area 5 serves as the test set. We follow this convention so that our results can be fairly compared with existing methods.
- ScanNetv2 is a richly annotated dataset of 3D indoor scans. It covers a wide variety of indoor environments, from educational and office spaces to residential rooms and public spaces. ScanNetv2 includes semantic annotations for over 2.5 million segments in more than 1500 scans across hundreds of different spaces, containing more than 17.5 million annotated points. Each point is assigned a semantic label from 20 categories (e.g., shower curtain, refrigerator, and picture). The dataset is divided into three parts: 1201 scenes for training, 312 scenes for validation, and 100 scenes for online testing. Owing to its high-quality point cloud annotations and the variety of challenging indoor scenes, ScanNetv2 is widely used for tasks such as 3D semantic segmentation, 3D object recognition, and other forms of scene understanding.
- SemanticKITTI is a dataset specifically focused on large-scale outdoor scenes. It extends the KITTI Vision Benchmark Suite with dense semantic annotations of the LiDAR point clouds collected from a vehicle moving through urban and rural areas. Differing in scale and classes from the indoor datasets, it provides nineteen common outdoor classes, covering not only traffic participants but also functional ground classes such as parking lots and sidewalks. Typically, sequences 0–7 and 9–10 are used for training, sequence 8 for validation, and sequences 11–21 for online testing. The dataset is primarily used to evaluate semantic segmentation and other tasks in autonomous driving scenes.
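The experiments below report mean intersection over union (mIoU) on all three benchmarks, plus mean class accuracy (mAcc) and overall accuracy (OA) on S3DIS. For reference, here is a minimal NumPy sketch of these standard metrics computed from a confusion matrix; the function name and signature are ours.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """Compute OA, mAcc, and mIoU from flat per-point class indices."""
    # conf[i, j] = number of points with ground truth i predicted as j
    conf = np.bincount(
        gt * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(float)
    gt_count = conf.sum(axis=1)    # points per ground-truth class
    pred_count = conf.sum(axis=0)  # points per predicted class
    oa = tp.sum() / conf.sum()
    valid = gt_count > 0           # ignore classes absent from the ground truth
    macc = (tp[valid] / gt_count[valid]).mean()
    iou = tp / np.maximum(gt_count + pred_count - tp, 1)
    return oa, macc, iou[valid].mean()

# toy check: 4 points, 3 classes
oa, macc, miou = segmentation_metrics(
    np.array([0, 1, 1, 2]), np.array([0, 1, 2, 2]), num_classes=3
)
```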
4.2. Experimental Settings
4.3. Experimental Results and Analysis
4.3.1. Evaluation on S3DIS
4.3.2. Evaluation on ScanNetv2
4.3.3. Evaluation on SemanticKITTI
4.4. Ablation Study
4.4.1. Low-Level Geometric Relation
4.4.2. Geometric Relation Function
4.4.3. Contextual Relative Semantic Encoding
4.4.4. Efficacy of Adaptive Interaction Processing
4.4.5. Efficacy of Initial Point Embedding
4.4.6. Module Design
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Milioto, A.; Vizzo, I.; Behley, J.; Stachniss, C. RangeNet++: Fast and Accurate LiDAR Semantic Segmentation. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), The Venetian Macao, Macau, China, 4–8 November 2019; pp. 4213–4220. [Google Scholar]
- Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. PointCNN: Convolution On X-Transformed Points. In Proceedings of the 31st Advances in Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 2–8 December 2018. [Google Scholar]
- Jiang, L.; Zhao, H.; Liu, S.; Shen, X.; Fu, C.-W.; Jia, J. Hierarchical Point-Edge Interaction Network for Point Cloud Semantic Segmentation. In Proceedings of the 17th IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10432–10440. [Google Scholar]
- Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic Graph CNN for Learning on Point Clouds. ACM Trans. Graph. 2019, 38, 1–12. [Google Scholar] [CrossRef]
- Tang, H.; Liu, Z.; Zhao, S.; Lin, Y.; Lin, J.; Wang, H.; Han, S. Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual, 23–28 August 2020; pp. 685–702. [Google Scholar]
- Choy, C.; Gwak, J.; Savarese, S. 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3070–3079. [Google Scholar]
- Graham, B.; Engelcke, M.; van der Maaten, L. 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 9224–9232. [Google Scholar]
- Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3D ShapeNets: A Deep Representation for Volumetric Shapes. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1912–1920. [Google Scholar]
- Qi, C.R.; Su, H.; Kaichun, M.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
- Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Learning Semantic Segmentation of Large-Scale Point Clouds with Random Sampling. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8338–8354. [Google Scholar] [CrossRef] [PubMed]
- Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. KPConv: Flexible and Deformable Convolution for Point Clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6410–6419. [Google Scholar]
- Wu, W.; Qi, Z.; Fuxin, L. PointConv: Deep Convolutional Networks on 3D Point Clouds. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9613–9622. [Google Scholar]
- Yan, X.; Zheng, C.; Li, Z.; Wang, S.; Cui, S. PointASNL: Robust Point Clouds Processing Using Nonlocal Neural Networks with Adaptive Sampling. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5588–5597. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021. [Google Scholar]
- Guo, M.; Cai, J.; Liu, Z.; Mu, T.; Martin, R.R.; Hu, S. PCT: Point Cloud Transformer. Comp. Vis. Media 2021, 7, 187–199. [Google Scholar] [CrossRef]
- Zhao, H.; Jiang, L.; Jia, J.; Torr, P.; Koltun, V. Point Transformer. In Proceedings of the 19th IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 16239–16248. [Google Scholar]
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 30th Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5105–5114. [Google Scholar]
- Zhang, C.; Wan, H.; Shen, X.; Wu, Z. PVT: Point-Voxel Transformer for Point Cloud Learning. arXiv 2021, arXiv:2108.06076. [Google Scholar] [CrossRef]
- Park, C.; Jeong, Y.; Cho, M.; Park, J. Fast Point Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16928–16937. [Google Scholar]
- Lai, X.; Liu, J.; Jiang, L.; Wang, L.; Zhao, H.; Liu, S.; Qi, X.; Jia, J. Stratified Transformer for 3D Point Cloud Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8490–8499. [Google Scholar]
- Duan, L.; Zhao, S.; Xue, N.; Gong, M.; Xia, G.-S.; Tao, D. ConDaFormer: Disassembled Transformer with Local Structure Enhancement for 3D Point Cloud Understanding. In Proceedings of the 36th Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Li, X.-L.; Guo, M.-H.; Mu, T.-J.; Martin, R.R.; Hu, S.-M. Long Range Pooling for 3D Large-Scale Scene Understanding. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 20–22 June 2023; pp. 10300–10311. [Google Scholar]
- Qiu, H.; Yu, B.; Tao, D. Collect-and-Distribute Transformer for 3D Point Cloud Analysis. arXiv 2023, arXiv:2306.01257. [Google Scholar]
- He, Y.; Yu, H.; Yang, Z.; Liu, X.; Sun, W.; Mian, A. Full Point Encoding for Local Feature Aggregation in 3-D Point Clouds. IEEE Trans. Neural Netw. Learn. Syst. 2024, early access. [Google Scholar] [CrossRef]
- Li, H.; Zheng, T.; Chi, Z.; Yang, Z.; Wang, W.; Wu, B.; Lin, B.; Cai, D. APPT: Asymmetric Parallel Point Transformer for 3D Point Cloud Understanding. arXiv 2023, arXiv:2303.17815. [Google Scholar]
- Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep Learning for 3D Point Clouds: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4338–4364. [Google Scholar] [CrossRef]
- Zhu, X.; Zhou, H.; Wang, T.; Hong, F.; Ma, Y.; Li, W.; Li, H.; Lin, D. Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 9934–9943. [Google Scholar]
- Yan, X.; Gao, J.; Li, J.; Zhang, R.; Li, Z.; Huang, R.; Cui, S. Sparse Single Sweep LiDAR Point Cloud Segmentation via Learning Contextual Shape Priors from Scene Completion. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), Virtually, 2–9 February 2021; pp. 3101–3109. [Google Scholar]
- Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; Jia, J. LargeKernel3D: Scaling up Kernels in 3D Sparse CNNs. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 20–22 June 2023; pp. 13488–13498. [Google Scholar]
- Xu, C.; Wu, B.; Wang, Z.; Zhan, W.; Vajda, P.; Keutzer, K.; Tomizuka, M. SqueezeSegV3: Spatially-Adaptive Convolution for Efficient Point-Cloud Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual, 23–28 August 2020; pp. 1–19. [Google Scholar]
- Li, M.; Wang, G.; Zhu, M.; Li, C.; Liu, H.; Pan, X.; Long, Q. DFAMNet: Dual Fusion Attention Multi-Modal Network for Semantic Segmentation on LiDAR Point Clouds. Appl. Intell. 2024, 54, 3169–3180. [Google Scholar] [CrossRef]
- Puy, G.; Boulch, A.; Marlet, R. Using a Waffle Iron for Automotive Point Cloud Semantic Segmentation. In Proceedings of the 20th IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 3356–3366. [Google Scholar]
- Zhang, Y.; Zhou, Z.; David, P.; Yue, X.; Xi, Z.; Gong, B.; Foroosh, H. PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9598–9607. [Google Scholar]
- Lin, Y.; Yan, Z.; Huang, H.; Du, D.; Liu, L.; Cui, S.; Han, X. FPConv: Learning Local Flattening for Point Convolution. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4292–4301. [Google Scholar]
- Hu, W.; Zhao, H.; Jiang, L.; Jia, J.; Wong, T.-T. Bidirectional Projection Network for Cross Dimension Scene Understanding. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14368–14377. [Google Scholar]
- Peng, B.; Wu, X.; Jiang, L.; Chen, Y.; Zhao, H.; Tian, Z.; Jia, J. OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation. arXiv 2024, arXiv:2403.14418. [Google Scholar]
- Fan, Y.-C.; Liao, K.-Y.; Xiao, Y.-S.; Lu, M.-H.; Yan, W.-Z. 3D Point Cloud Semantic Segmentation System Based on Lightweight FPConv. IEEE Access 2023, 11, 31767–31777. [Google Scholar] [CrossRef]
- Gong, J.; Xu, J.; Tan, X.; Song, H.; Qu, Y.; Xie, Y.; Ma, L. Omni-Supervised Point Cloud Segmentation via Gradual Receptive Field Component Reasoning. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11668–11677. [Google Scholar]
- Qian, G.; Li, Y.; Peng, H.; Mai, J.; Hammoud, H.; Elhoseiny, M.; Ghanem, B. PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies. In Proceedings of the 35th Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; pp. 23192–23204. [Google Scholar]
- Kang, X.; Wang, C.; Chen, X. Region-Enhanced Feature Learning for Scene Semantic Segmentation. IEEE Trans. Multimed. 2023, early access. [Google Scholar] [CrossRef]
- Wei, M.; Wei, Z.; Zhou, H.; Hu, F.; Si, H.; Chen, Z.; Zhu, Z.; Qiu, J.; Yan, X.; Guo, Y.; et al. AGConv: Adaptive Graph Convolution on 3D Point Clouds. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9374–9392. [Google Scholar] [CrossRef] [PubMed]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 30th Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 1–41. [Google Scholar] [CrossRef]
- Han, Q.; Fan, Z.; Dai, Q.; Sun, L.; Cheng, M.-M.; Liu, J.; Wang, J. On the Connection between Local Attention and Dynamic Depth-Wise Convolution. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022. [Google Scholar]
- Chen, Q.; Wu, Q.; Wang, J.; Hu, Q.; Hu, T.; Ding, E.; Cheng, J.; Wang, J. MixFormer: Mixing Features across Windows and Dimensions. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5239–5249. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 19th IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Gu, J.; Kwon, H.; Wang, D.; Ye, W.; Li, M.; Chen, Y.-H.; Lai, L.; Chandra, V.; Pan, D.Z. Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12084–12093. [Google Scholar]
- Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the 19th IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 548–558. [Google Scholar]
- Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. MaxViT: Multi-Axis Vision Transformer. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 459–479. [Google Scholar]
- Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134. [Google Scholar]
- Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the Design of Spatial Attention in Vision Transformers. In Proceedings of the 34th Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021; pp. 9355–9366. [Google Scholar]
- Li, W.; Wang, X.; Xia, X.; Wu, J.; Li, J.; Xiao, X.; Zheng, M.; Wen, S. SepViT: Separable Vision Transformer. arXiv 2022, arXiv:2203.15380. [Google Scholar]
- Fan, Q.; Huang, H.; Zhou, X.; He, R. Lightweight Vision Transformer with Bidirectional Interaction. In Proceedings of the 36th Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Lahoud, J.; Cao, J.; Khan, F.S.; Cholakkal, H.; Anwer, R.M.; Khan, S.; Yang, M.-H. 3D Vision with Transformers: A Survey. arXiv 2024, arXiv:2208.04309. [Google Scholar]
- Lu, D.; Xie, Q.; Wei, M.; Xu, L.; Li, J. Transformers in 3D Point Clouds: A Survey. arXiv 2022, arXiv:2205.07417. [Google Scholar]
- Wu, X.; Lao, Y.; Jiang, L.; Liu, X.; Zhao, H. Point Transformer V2: Grouped Vector Attention and Partition-Based Pooling. In Proceedings of the 35th Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; pp. 33330–33342. [Google Scholar]
- Liu, Z.; Yang, X.; Tang, H.; Yang, S.; Han, S. FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 20–22 June 2023; pp. 1200–1211. [Google Scholar]
- Fan, L.; Pang, Z.; Zhang, T.; Wang, Y.-X.; Zhao, H.; Wang, F.; Wang, N.; Zhang, Z. Embracing Single Stride 3D Object Detector with Sparse Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8458–8468. [Google Scholar]
- Xiang, P.; Wen, X.; Liu, Y.-S.; Zhang, H.; Fang, Y.; Han, Z. Retro-FPN: Retrospective Feature Pyramid Network for Point Cloud Semantic Segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 20–22 June 2023; pp. 17780–17792. [Google Scholar]
- Yang, Y.; Guo, Y.; Xiong, J.; Liu, Y.; Pan, H.; Wang, P.; Tong, X.; Guo, B. Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding. arXiv 2023, arXiv:2304.06906. [Google Scholar]
- Hui, L.; Yang, H.; Cheng, M.; Xie, J.; Yang, J. Pyramid Point Cloud Transformer for Large-Scale Place Recognition. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6078–6087. [Google Scholar]
- Yang, J.; Zhang, Q.; Ni, B.; Li, L.; Liu, J.; Zhou, M.; Tian, Q. Modeling Point Clouds with Self-Attention and Gumbel Subset Sampling. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3318–3327. [Google Scholar]
- Ai, D.; Xu, C.; Zhang, X.; Ai, Y.; Bai, Y.; Liu, Y. ASSA-Net: Semantic Segmentation Network for Point Clouds Based on Adaptive Sampling and Self-Attention. In Proceedings of the 2023 5th International Conference on Natural Language Processing (ICNLP), Guangzhou, China, 24–26 March 2023; pp. 60–64. [Google Scholar]
- Zhang, C.; Wan, H.; Shen, X.; Wu, Z. PatchFormer: An Efficient Point Transformer with Patch Attention. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11789–11798. [Google Scholar]
- Yang, X.; Jin, M.; He, W.; Chen, Q. PointCAT: Cross-Attention Transformer for Point Cloud. arXiv 2023, arXiv:2304.03012. [Google Scholar]
- Huang, Z.; Zhao, Z.; Li, B.; Han, J. LCPFormer: Towards Effective 3D Point Cloud Analysis via Local Context Propagation in Transformers. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4985–4996. [Google Scholar] [CrossRef]
- Armeni, I.; Sener, O.; Zamir, A.R.; Jiang, H.; Brilakis, I.; Fischer, M.; Savarese, S. 3D Semantic Parsing of Large-Scale Indoor Spaces. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1534–1543. [Google Scholar]
- Rozenberszki, D.; Litany, O.; Dai, A. Language-Grounded Indoor 3D Semantic Segmentation in the Wild. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 125–141. [Google Scholar]
- Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In Proceedings of the 17th IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9296–9306. [Google Scholar]
- Tang, L.; Zhan, Y.; Chen, Z.; Yu, B.; Tao, D. Contrastive Boundary Learning for Point Cloud Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8479–8489. [Google Scholar]
- Wu, X.; Jiang, L.; Wang, P.-S.; Liu, Z.; Liu, X.; Qiao, Y.; Ouyang, W.; He, T.; Zhao, H. Point Transformer V3: Simpler, Faster, Stronger. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 4840–4851. [Google Scholar]
- Tao, A.; Duan, Y.; Wei, Y.; Lu, J.; Zhou, J. SegGroup: Seg-Level Supervision for 3D Instance and Semantic Segmentation. IEEE Trans. Image Process. 2022, 31, 4952–4965. [Google Scholar] [CrossRef] [PubMed]
- Kong, L.; Liu, Y.; Chen, R.; Ma, Y.; Zhu, X.; Li, Y.; Hou, Y.; Qiao, Y.; Liu, Z. Rethinking Range View Representation for LiDAR Segmentation. In Proceedings of the 20th IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 228–240. [Google Scholar]
Dataset | Drop Points | Rotate | Flip | Scale | Jitter | Distort | Chromatic | Grid Size |
---|---|---|---|---|---|---|---|---|
ScanNetv2 | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | 0.02 m |
S3DIS | ✔ | ✔ | ✔ | ✔ | ✔ |  |  | 0.02 m |
SemanticKITTI | ✔ | ✔ | ✔ | ✔ |  |  |  | 0.04 m |
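For intuition, a pipeline matching the table above might be composed as follows. This is a sketch: the rotation, flip, scale, and jitter ranges are illustrative assumptions, not the settings used by AIFormer; only the grid sizes come from the table.

```python
import numpy as np

def augment_cloud(xyz: np.ndarray, grid_size: float = 0.02) -> np.ndarray:
    """Hypothetical augmentation pipeline; xyz is (N, 3) float coordinates."""
    rng = np.random.default_rng()
    # random rotation about the z-axis
    theta = rng.uniform(0, 2 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    xyz = xyz @ np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]]).T
    # random flip along x
    if rng.random() < 0.5:
        xyz[:, 0] = -xyz[:, 0]
    # random anisotropic scale and per-point jitter (illustrative ranges)
    xyz = xyz * rng.uniform(0.9, 1.1, size=3)
    xyz = xyz + rng.normal(0.0, 0.005, size=xyz.shape)
    # grid (voxel) subsampling: keep one point per occupied cell
    keys = np.floor(xyz / grid_size).astype(np.int64)
    _, keep = np.unique(keys, axis=0, return_index=True)
    return xyz[keep]
```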
Dataset | Epochs | Learning Rate | Weight Decay | Scheduler | Optimizer | Batch Size |
---|---|---|---|---|---|---|
ScanNetv2 | 1200 | 0.001 | 0.02 | Cosine | AdamW | 16 |
S3DIS | 1200 | 0.006 | 0.01 | OneCycleLR | AdamW | 12 |
SemanticKITTI | 80 | 0.002 | 0.005 | Cosine | AdamW | 8 |
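In PyTorch terms, the S3DIS row above corresponds roughly to the following setup; the model and steps_per_epoch are placeholders, not the actual AIFormer training code.

```python
import torch

model = torch.nn.Linear(6, 13)  # placeholder for the actual network
optimizer = torch.optim.AdamW(model.parameters(), lr=0.006, weight_decay=0.01)
# OneCycleLR as in the S3DIS row: 1200 epochs, stepped once per batch
steps_per_epoch = 100  # placeholder; len(train_loader) in practice
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.006, epochs=1200, steps_per_epoch=steps_per_epoch
)
# training loop calls optimizer.step() then scheduler.step() every batch
```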
Method | mIoU (%) | mAcc (%) | OA (%) | Ceiling | Floor | Wall | Beam | Column | Window | Door | Table | Chair | Sofa | Bookcase | Board | Clutter |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PointNet [9] | 41.1 | 49.0 | - | 88.8 | 97.3 | 69.8 | 0.1 | 3.9 | 46.3 | 10.8 | 52.6 | 58.9 | 40.3 | 5.9 | 26.4 | 33.2 |
PointNet++ [17] | 57.3 | 63.5 | - | 91.3 | 96.9 | 78.7 | 0.0 | 16.0 | 54.9 | 31.9 | 83.5 | 74.6 | 67.2 | 49.3 | 54.2 | 45.9 |
PointCNN [2] | 57.3 | 63.9 | 85.9 | 92.3 | 98.2 | 79.4 | 0.0 | 17.6 | 22.8 | 62.1 | 80.6 | 74.4 | 66.7 | 31.7 | 62.1 | 56.7 |
PointWeb [3] | 60.3 | 66.6 | 87.0 | 92.0 | 98.5 | 79.4 | 0.0 | 21.1 | 59.7 | 34.8 | 76.3 | 88.3 | 46.9 | 69.3 | 64.9 | 52.5 |
MinkowskiNet [6] | 65.4 | 71.7 | - | 91.8 | 98.7 | 86.2 | 0.0 | 34.1 | 48.9 | 62.4 | 89.9 | 81.6 | 74.9 | 47.2 | 74.4 | 58.6 |
PointConv [12] | 58.3 | 64.7 | 85.4 | 92.8 | 96.3 | 77.0 | 0.0 | 18.2 | 47.7 | 54.3 | 87.9 | 72.8 | 61.6 | 65.9 | 33.9 | 49.3 |
KPConv [11] | 65.4 | 70.9 | - | 92.6 | 97.3 | 81.4 | 0.0 | 16.5 | 54.5 | 69.5 | 90.1 | 80.2 | 74.6 | 66.4 | 63.7 | 58.1 |
PointWeb [3] | 61.9 | 68.3 | 87.2 | 91.5 | 98.2 | 81.4 | 0.0 | 23.3 | 65.3 | 40.0 | 75.5 | 87.7 | 58.5 | 67.8 | 65.6 | 49.7 |
PointASNL [13] | 62.6 | 68.5 | 87.7 | 94.3 | 98.4 | 79.1 | 0.0 | 26.7 | 55.2 | 66.2 | 83.3 | 86.8 | 47.6 | 68.3 | 56.4 | 52.1 |
PCT [15] | 61.3 | 67.7 | - | 92.5 | 98.4 | 80.6 | 0.0 | 19.3 | 61.6 | 48.0 | 76.5 | 85.2 | 46.2 | 67.7 | 67.9 | 52.2 |
PTv1 [16] | 70.4 | 76.5 | - | 94.0 | 98.5 | 86.3 | 0.0 | 38.0 | 63.4 | 74.3 | 89.1 | 82.4 | 74.3 | 80.2 | 76.0 | 59.3 |
PointNeXt [39] | 70.5 | 76.8 | 90.6 | 94.2 | 98.5 | 84.4 | 0.0 | 37.7 | 59.3 | 74.0 | 83.1 | 91.6 | 77.4 | 77.2 | 78.8 | 60.6 |
CBL [70] | 69.4 | 75.2 | 90.6 | 93.9 | 98.4 | 84.2 | 0.0 | 37.0 | 57.7 | 71.9 | 91.7 | 81.8 | 77.8 | 75.6 | 69.1 | 62.9 |
FastPointTrans. [19] | 70.3 | 77.9 | - | 94.2 | 98.0 | 86.0 | 0.2 | 53.8 | 61.2 | 77.3 | 81.3 | 89.4 | 60.1 | 72.8 | 80.4 | 58.9 |
Ours | 70.7 | 75.9 | 90.5 | 91.7 | 98.3 | 84.1 | 0.0 | 28.0 | 63.5 | 75.5 | 81.7 | 92.1 | 80.3 | 80.1 | 82.5 | 61.2 |
Method | mIoU (%) | Bathtub | Bed | Bookshelf | Cabinet | Chair | Counter | Curtain | Desk | Door | Floor | Other Furniture | Picture | Refrigerator | Shower Curtain | Sink | Sofa | Table | Toilet | Wall | Window |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PointNet++ [17] | 33.9 | 58.4 | 47.8 | 45.8 | 25.6 | 36.0 | 25.0 | 24.7 | 27.8 | 26.1 | 67.6 | 18.3 | 11.7 | 21.2 | 14.5 | 36.4 | 34.6 | 23.2 | 54.8 | 52.3 | 25.2 |
PointConv [12] | 66.6 | 70.3 | 78.1 | 75.1 | 65.5 | 83.0 | 47.1 | 76.9 | 47.4 | 53.7 | 95.1 | 47.5 | 27.9 | 63.5 | 69.8 | 67.5 | 75.1 | 55.3 | 81.6 | 80.6 | 70.3 |
KPConv [11] | 68.4 | 84.7 | 75.8 | 78.4 | 64.7 | 81.4 | 47.3 | 77.2 | 60.5 | 59.4 | 93.5 | 45.0 | 18.1 | 58.7 | 80.5 | 69.0 | 78.5 | 61.4 | 88.2 | 81.9 | 63.2 |
PointWeb [3] | 61.8 | 72.9 | 66.8 | 64.7 | 59.7 | 76.6 | 41.4 | 68.0 | 52.0 | 52.5 | 94.6 | 43.2 | 21.5 | 49.3 | 59.9 | 63.8 | 61.7 | 57.0 | 89.7 | 80.6 | 60.5 |
MinkowskiNet [6] | 73.6 | 85.9 | 81.8 | 83.2 | 70.9 | 84.0 | 52.1 | 85.3 | 66.0 | 64.3 | 95.1 | 54.4 | 28.6 | 73.1 | 89.3 | 67.5 | 77.2 | 68.3 | 87.4 | 85.2 | 72.7 |
FPConv [34] | 63.9 | 78.5 | 76.0 | 71.3 | 60.3 | 79.8 | 39.2 | 53.4 | 60.3 | 52.4 | 94.8 | 45.7 | 25.0 | 53.8 | 72.3 | 59.8 | 69.6 | 61.4 | 87.2 | 79.9 | 56.7 |
PointASNL [13] | 66.6 | 78.1 | 75.9 | 69.9 | 64.4 | 82.2 | 47.5 | 77.9 | 56.4 | 50.4 | 95.3 | 42.8 | 20.3 | 58.6 | 75.4 | 66.1 | 75.3 | 58.8 | 90.2 | 81.3 | 64.2 |
BPNet [35] | 74.9 | 90.9 | 81.8 | 81.1 | 75.2 | 83.9 | 48.5 | 84.2 | 67.3 | 64.4 | 95.7 | 52.8 | 30.5 | 77.3 | 85.9 | 78.8 | 81.8 | 69.3 | 91.6 | 85.6 | 72.3 |
RFCR [38] | 70.2 | 88.9 | 74.5 | 81.3 | 67.2 | 81.8 | 49.3 | 81.5 | 62.3 | 61.0 | 94.7 | 47.0 | 24.9 | 59.4 | 84.8 | 70.5 | 77.9 | 64.6 | 89.2 | 82.3 | 61.1 |
CBL [70] | 70.5 | 76.9 | 77.5 | 80.9 | 68.7 | 82.0 | 43.9 | 81.2 | 66.1 | 59.1 | 94.5 | 51.5 | 17.1 | 63.3 | 85.6 | 72.0 | 79.6 | 66.8 | 88.9 | 84.7 | 68.9 |
SegGroup [72] | 62.7 | 81.8 | 74.7 | 70.1 | 60.2 | 76.4 | 38.5 | 62.9 | 49.0 | 50.8 | 93.1 | 40.9 | 20.1 | 56.4 | 72.5 | 61.8 | 69.2 | 53.9 | 87.3 | 79.4 | 54.8 |
ST [20] | 74.7 | 90.1 | 80.3 | 84.5 | 75.7 | 84.6 | 51.2 | 82.5 | 69.6 | 64.5 | 95.6 | 57.6 | 26.2 | 74.4 | 86.1 | 74.2 | 77.0 | 70.5 | 89.9 | 86.0 | 73.4 |
LargeKernel [29] | 74.0 | 91.0 | 82.0 | 80.6 | 74.0 | 85.2 | 54.5 | 82.6 | 59.4 | 64.3 | 95.5 | 54.1 | 26.3 | 72.3 | 85.8 | 77.5 | 76.7 | 67.8 | 93.3 | 84.8 | 69.4 |
REFL-Net [40] | 72.9 | 73.7 | 82.3 | 76.6 | 72.6 | 85.2 | 46.8 | 86.5 | 68.4 | 63.4 | 95.3 | 56.5 | 29.7 | 77.3 | 77.4 | 77.7 | 74.9 | 66.6 | 90.7 | 85.0 | 70.8 |
Retro-FPN [59] | 74.4 | 84.2 | 80.0 | 76.7 | 74.0 | 83.6 | 54.1 | 91.4 | 67.2 | 62.6 | 95.8 | 55.2 | 27.2 | 77.7 | 88.6 | 69.6 | 80.1 | 67.4 | 94.1 | 85.8 | 71.7 |
PTv3 [71] | 79.4 | 94.1 | 81.3 | 85.1 | 78.2 | 89.0 | 59.7 | 91.6 | 69.6 | 71.3 | 97.9 | 63.5 | 38.4 | 79.3 | 90.7 | 82.1 | 79.0 | 69.6 | 96.7 | 90.3 | 80.5 |
Ours | 74.9 | 73.8 | 80.9 | 84.1 | 77.0 | 83.0 | 53.8 | 90.9 | 67.8 | 67.9 | 96.0 | 54.5 | 33.2 | 76.7 | 79.4 | 74.2 | 78.6 | 69.6 | 91.9 | 88.0 | 77.0 |
Method | mIoU (%) | Car | Bicycle | Motorcycle | Truck | Other-Vehicle | Person | Bicyclist | Motorcyclist | Road | Parking | Sidewalk | Other-Ground | Building | Fence | Vegetation | Trunk | Terrain | Pole | Traffic-Sign |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PointNet [9] | 14.6 | 46.3 | 1.3 | 0.3 | 0.1 | 0.8 | 0.2 | 0.2 | 0.0 | 61.6 | 15.8 | 35.7 | 1.4 | 41.4 | 12.9 | 31.0 | 4.6 | 17.6 | 2.4 | 3.7 |
PointNet++ [17] | 20.1 | 53.7 | 1.9 | 0.2 | 0.9 | 0.2 | 0.9 | 1.0 | 0.0 | 72.0 | 18.7 | 41.8 | 5.6 | 62.3 | 16.9 | 46.5 | 13.8 | 30.0 | 6.0 | 8.9 |
KPConv [11] | 58.8 | 96.0 | 32.0 | 42.5 | 33.4 | 44.3 | 61.5 | 61.6 | 11.8 | 88.8 | 61.3 | 72.7 | 31.6 | 95.0 | 64.2 | 84.8 | 69.2 | 69.1 | 56.4 | 47.4 |
RangeNet++ [1] | 52.2 | 91.4 | 25.7 | 34.4 | 25.7 | 23.0 | 38.3 | 38.8 | 4.8 | 91.8 | 65.0 | 75.2 | 27.8 | 87.4 | 58.6 | 80.5 | 55.1 | 64.6 | 47.9 | 55.9 |
RandLANet [10] | 50.3 | 94.0 | 19.8 | 21.4 | 42.7 | 38.7 | 47.5 | 48.8 | 4.6 | 90.4 | 56.9 | 67.9 | 15.5 | 81.1 | 49.7 | 78.3 | 60.3 | 59.0 | 44.2 | 38.1 |
SqueezeSegV3 [30] | 55.9 | 92.5 | 38.7 | 36.5 | 29.6 | 33.0 | 45.6 | 46.2 | 20.1 | 91.7 | 63.4 | 74.8 | 26.4 | 89.0 | 59.4 | 82.0 | 58.7 | 65.4 | 49.6 | 58.9 |
SPVNAS [5] | 66.4 | 97.3 | 51.5 | 50.8 | 59.8 | 58.8 | 65.7 | 65.2 | 43.7 | 90.2 | 67.6 | 75.2 | 16.9 | 91.3 | 65.9 | 86.1 | 73.4 | 71.0 | 64.2 | 66.9 |
PointASNL [13] | 46.8 | 87.9 | 57.6 | 25.1 | 39.0 | 29.2 | 34.2 | 57.6 | 0.0 | 87.4 | 24.3 | 74.3 | 1.8 | 83.1 | 43.9 | 84.1 | 52.2 | 70.6 | 57.8 | 36.9 |
PolarNet [33] | 54.3 | 93.8 | 40.3 | 30.1 | 22.9 | 28.5 | 43.2 | 40.2 | 5.6 | 90.8 | 61.7 | 74.4 | 21.7 | 90.0 | 61.3 | 84.0 | 65.5 | 67.8 | 51.8 | 57.5 |
Cylinder3D [27] | 67.8 | 97.1 | 67.6 | 64.0 | 50.8 | 58.6 | 73.9 | 67.9 | 36.0 | 91.4 | 65.1 | 75.5 | 32.3 | 91.0 | 66.5 | 85.4 | 71.8 | 68.5 | 62.6 | 65.6 |
JS3C-Net [28] | 66.0 | 95.8 | 59.3 | 52.9 | 54.3 | 46.0 | 69.5 | 65.4 | 39.9 | 88.9 | 61.9 | 72.1 | 31.9 | 92.5 | 70.8 | 84.5 | 69.8 | 67.9 | 60.7 | 68.7 |
WaffleIron [32] | 67.3 | 96.5 | 62.3 | 64.1 | 55.2 | 48.7 | 70.4 | 77.8 | 29.6 | 90.5 | 69.5 | 75.9 | 24.6 | 91.8 | 68.1 | 85.4 | 70.8 | 69.6 | 62.0 | 65.2 |
RangeViT [73] | 64.0 | 95.4 | 55.8 | 43.5 | 29.8 | 42.1 | 63.9 | 58.2 | 38.1 | 93.1 | 70.2 | 80.0 | 32.5 | 92.0 | 69.0 | 85.3 | 70.6 | 71.2 | 60.8 | 64.7 |
DFAMNet [31] | 69.0 | 96.7 | 54.5 | 80.8 | 95.5 | 68.5 | 79.7 | 91.7 | 0.2 | 94.3 | 50.6 | 81.8 | 4.1 | 91.8 | 65.8 | 89.4 | 73.4 | 76.8 | 63.1 | 51.9 |
Ours | 68.0 | 96.6 | 65.4 | 59.8 | 56.2 | 53.6 | 74.6 | 69.3 | 37.3 | 89.6 | 67.7 | 74.2 | 23.4 | 91.8 | 68.9 | 85.9 | 73.9 | 71.4 | 65.5 | 67.4 |
ID | Low-Level Relation | Channels | mIoU (%) |
---|---|---|---|
(1) |  | 1 | 74.3 |
(2) |  | 6 | 74.5 |
(3) |  | 4 | 75.0 |
(4) |  | 10 | 75.5 |
MLP Layer | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
Layer Params (M) | 2.06 | 4.11 | 6.17 | 8.22 | 10.27 |
mIoU (%) | 74.92 | 75.21 | 75.52 | 75.51 | 75.31 |
Function | Method | ScanNetv2 | S3DIS |
---|---|---|---|
without normalization |  | 73.2 | 68.4 |
 |  | 75.3 | 70.3 |
 |  | 75.5 | 70.7 |
 |  | 74.2 | 68.6 |
 |  | 74.7 | 69.9 |
 | Adaptive Interaction | 75.5 | 70.7 |
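To illustrate what a normalized low-level geometric relation might look like, the sketch below concatenates the relative offset, neighbor coordinates, center coordinates, and distance into a 10-channel relation with radius-normalized offsets. The channel composition and normalization are our own assumptions, chosen to be consistent with the 10-channel row and the "without normalization" row in the ablations above, not the paper's exact definition.

```python
import torch

def geometric_relation(xyz: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Hypothetical relation features; xyz: (N, 3) coords, idx: (N, K) neighbor indices."""
    k = idx.shape[1]
    center = xyz.unsqueeze(1).expand(-1, k, -1)  # (N, K, 3)
    neighbor = xyz[idx]                          # (N, K, 3)
    offset = neighbor - center
    dist = offset.norm(dim=-1, keepdim=True)     # (N, K, 1)
    # normalize offsets by each neighborhood's maximum distance
    offset = offset / dist.max(dim=1, keepdim=True).values.clamp(min=1e-8)
    # 3 + 3 + 3 + 1 = 10 channels per neighbor
    return torch.cat([offset, neighbor, center, dist], dim=-1)

rel = geometric_relation(torch.rand(128, 3), torch.randint(0, 128, (128, 16)))
print(rel.shape)  # torch.Size([128, 16, 10])
```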
ID | PointEmb. | DataAug. | LRA | GCA | AI | mIoU (%) |
---|---|---|---|---|---|---|
(1) | ✔ |  |  |  |  | 67.8 |
(2) |  | ✔ |  |  |  | 71.8 |
(3) | ✔ | ✔ |  |  |  | 72.1 |
(4) | ✔ | ✔ | ✔ |  |  | 73.8 |
(5) | ✔ | ✔ | ✔ | ✔ |  | 74.1 |
(6) | ✔ |  | ✔ | ✔ | ✔ | 73.9 |
(7) |  | ✔ | ✔ | ✔ | ✔ | 74.5 |
(8) | ✔ | ✔ | ✔ | ✔ | ✔ | 75.5 |