Layout Aware Semantic Element Extraction for Sustainable Science & Technology Decision Support
Abstract
1. Introduction
- (1) We implement the Vi-SEE model with a state-of-the-art (SOTA) instance segmentation algorithm for object detection.
- (2) We define a new scientific knowledge graph with 11 different semantic elements (i.e., SEKG).
- (3) We define a new LA-SEE framework for textual and visual semantic element extraction and knowledge organization, combining our previously built LAME framework [13] for metadata extraction, Vi-SEE for visual object detection, and the SEKG structure for knowledge organization.
- (4) We propose two user scenarios based on the proposed SEKG to confirm promising applications.
2. Related Work
2.1. Metadata Extraction from Articles
2.2. Vision-Based Document Analysis
2.3. Scientific Knowledge Extraction
2.4. Document Modeling
3. LA-SEE Framework
- (1) The LAME [13] model extracts five metadata types (title, author, affiliation, keywords, and abstract) from the first page of a PDF.
- (2) Vi-SEE performs object detection on the remaining pages to extract the other semantic elements (paragraphs, figures, tables, captions, and references), followed by post-processing to obtain the texts of those elements. Figures and tables are saved as image files, whereas the other metadata and semantic elements are converted to JavaScript Object Notation (JSON) format.
- (3) The extracted semantic elements go through knowledge organizing/mapping under our SEKG structure defined in Section 3.3. The metadata from LAME and the document objects from Vi-SEE are collectively referred to as semantic elements.
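As a rough illustration of step (2)'s output, the sketch below serializes one extracted element to JSON. The schema and field names (doc_id, type, page, bbox, text, image_path) are assumptions chosen for clarity, not the paper's actual format.

```python
import json

def to_json_record(doc_id, element_type, page, bbox, text=None, image_path=None):
    """Serialize one extracted semantic element (illustrative schema only).

    Figures and tables carry a path to the saved image file; all other
    elements (paragraphs, captions, references, ...) carry extracted text.
    """
    record = {
        "doc_id": doc_id,
        "type": element_type,  # e.g., "paragraph", "figure", "caption"
        "page": page,
        "bbox": bbox,          # [x_left, y_top, x_right, y_bottom]
    }
    if element_type in ("figure", "table"):
        record["image_path"] = image_path
    else:
        record["text"] = text
    return json.dumps(record, ensure_ascii=False)

print(to_json_record("doc001", "paragraph", 3, [72, 100, 540, 180],
                     text="Sample paragraph."))
```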
3.1. LAME
3.2. Vi-SEE
3.2.1. ISTR Selection
3.2.2. Post-Processing Identified Semantic Elements
- Text extraction: BBox areas are converted into texts using PDFMiner [47] parsing results for the texts, lists, and titles detected by the ISTR-based model. PDFMiner returns the parsed texts with position information for the PDF document, and we extract the texts lying within the left-top and right-bottom positions of the detected areas. The semantic elements extracted this way are references, paragraphs, and section titles.
- Figure/table extraction: For detected areas such as figures and tables, we take screenshots encompassing the BBoxes and save them as image files.
- Caption extraction: Rather than using the BBox coordinates of the detected semantic elements (text, figures, and tables) directly, we find candidate areas for captions based on a distance measure and convert those areas into texts using PDFMiner's parsed results. The closest text BBox is resolved as the caption: we compute the distance between the midpoint of the detected figure (or table) BBox and the midpoint of each text BBox. FTmid refers to the midpoint of a figure or table BBox, and Tmid refers to the midpoint of a text BBox. Given a BBox with left-top corner (x1, y1) and right-bottom corner (x2, y2), the midpoint of a figure (or table) BBox can be expressed as FTmid = ((x1 + x2)/2, (y1 + y2)/2), and the caption is the text BBox minimizing the Euclidean distance between FTmid and Tmid.
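The distance-based caption matching described above can be sketched as follows. Function names are illustrative; the paper's actual implementation may add further heuristics (e.g., restricting candidates to the same page).

```python
import math

def midpoint(bbox):
    """Midpoint of a bounding box given as (x1, y1, x2, y2)
    (left-top and right-bottom corners)."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def nearest_caption(figure_bbox, text_bboxes):
    """Return the index of the text BBox whose midpoint is closest
    (Euclidean distance) to the figure/table BBox midpoint."""
    fx, fy = midpoint(figure_bbox)
    best_idx, best_dist = None, float("inf")
    for i, tb in enumerate(text_bboxes):
        tx, ty = midpoint(tb)
        d = math.hypot(fx - tx, fy - ty)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx

# The caption typically sits directly below the figure, so the nearest
# text block wins:
fig = (100, 100, 400, 300)
texts = [(100, 310, 400, 330),   # caption just below the figure
         (100, 500, 400, 700)]   # a distant body paragraph
print(nearest_caption(fig, texts))  # → 0
```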
3.3. Organizing Knowledge with SEKG for Multiple Documents
4. Experiments
4.1. Datasets
4.1.1. Data for Metadata Extraction
4.1.2. Dataset for Vi-SEE
4.2. Proposed LA-SEE Performance
- (1) Mask R-CNN model pre-trained with PubLayNet data,
- (2) DETR model pre-trained with ImageNet data, and
- (3) ISTR model pre-trained with ImageNet data.
4.3. Constructed Semantic Element Statistics
5. Decision Support Applications in Science and Technology Domain
5.1. Scientific Knowledge Guide
5.2. Question Answering over Tables
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Conflicts of Interest
References
1. Knowledge Graph. Available online: https://en.wikipedia.org/wiki/Knowledge_graph (accessed on 10 December 2021).
2. Augenstein, I.; Das, M.; Riedel, S.; Vikraman, L.; McCallum, A. SemEval 2017 Task 10: ScienceIE—Extracting Keyphrases and Relations from Scientific Publications. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, BC, Canada, 3–4 August 2017; pp. 546–555.
3. Hou, Y.; Jochim, C.; Gleize, M.; Bonin, F.; Ganguly, D. Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores for Scientific Leaderboards Construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 5203–5213.
4. Jain, S.; van Zuylen, M.; Hajishirzi, H.; Beltagy, I. SciREX: A Challenge Dataset for Document-Level Information Extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7506–7516.
5. Gábor, K.; Buscaldi, D.; Schumann, A.K.; QasemiZadeh, B.; Zargayouna, H.; Charnois, T. SemEval-2018 Task 7: Semantic Relation Extraction and Classification in Scientific Papers. In Proceedings of the 12th International Workshop on Semantic Evaluation, New Orleans, LA, USA, 5–6 June 2018; pp. 679–688.
6. Xu, J.; Kim, S.; Song, M.; Jeong, M.; Kim, D.; Kang, J.; Rousseau, J.F.; Li, X.; Xu, W.; Torvik, V.I.; et al. Building a PubMed knowledge graph. Sci. Data 2020, 7, 205.
7. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
8. Mondal, I.; Hou, Y.; Jochim, C. End-to-End NLP Knowledge Graph Construction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1885–1895.
9. Liu, S.K.; Xu, R.L.; Geng, B.Y.; Sun, Q.; Duan, L.; Liu, Y.M. Metaknowledge Extraction Based on Multi-Modal Documents. IEEE Access 2021, 9, 50050–50060.
10. Li, M.; Xu, Y.; Cui, L.; Huang, S.; Wei, F.; Li, Z.; Zhou, M. DocBank: A Benchmark Dataset for Document Layout Analysis. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 13–18 September 2020; pp. 949–960.
11. Xu, Y.; Xu, Y.; Lv, T.; Cui, L.; Wei, F.; Wang, G.; Lu, Y.; Florencio, D.; Zhang, C.; Che, W.; et al. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Virtual Event, 1–6 August 2021; Volume 1, pp. 2579–2591.
12. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186.
13. Choi, J.; Kong, H.; Yoon, H.; Oh, H.-S.; Jung, Y. LAME: Layout Aware Metadata Extraction Approach for Research Articles. arXiv 2021, arXiv:2112.12353.
14. Han, H.; Giles, C.L.; Manavoglu, E.; Zha, H.; Zhang, Z.; Fox, E.A. Automatic document metadata extraction using support vector machines. In Proceedings of the 2003 Joint Conference on Digital Libraries, Houston, TX, USA, 27–31 May 2003; pp. 37–48.
15. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014.
16. Kim, S.; Ji, S.; Jeong, H.; Yoon, H.; Choi, S. Metadata Extraction based on Deep Learning from Academic Paper in PDF. J. KIISE 2019, 46, 644–652.
17. Luong, M.-T.; Nguyen, T.D.; Kan, M.-Y. Logical structure recovery in scholarly articles with rich document features. Int. J. Digit. Libr. Syst. 2010, 1, 1–23.
18. Adhikari, A.; Ram, A.; Tang, R.; Lin, J. DocBERT: BERT for Document Classification. arXiv 2019, arXiv:1904.08398.
19. Yu, S.; Su, J.; Luo, D. Improving BERT-based text classification with auxiliary sentence and domain knowledge. IEEE Access 2019, 7, 176600–176612.
20. Gu, X.; Yoo, K.M.; Ha, J.-W. DialogBERT: Discourse-Aware Response Generation via Learning to Recover and Rank Utterances. arXiv 2021, arXiv:2012.01775.
21. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. arXiv 2020, arXiv:1909.11942.
22. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A Pretrained Language Model for Scientific Text. arXiv 2019, arXiv:1903.10676.
23. Garncarek, Ł.; Powalski, R.; Stanisławek, T.; Topolski, B.; Halama, P.; Graliński, F. LAMBERT: Layout-Aware Language Modeling for Information Extraction. In Proceedings of the International Conference on Document Analysis and Recognition, Lausanne, Switzerland, 5–10 September 2021; pp. 532–547.
24. Constantin, A.; Pettifer, S.; Voronkov, A. PDFX: Fully-automated PDF-to-XML conversion of scientific literature. In Proceedings of the 2013 ACM Symposium on Document Engineering, Florence, Italy, 10–13 September 2013; pp. 177–180.
25. Ahmed, M.W.; Afzal, M.T. FLAG-PDFe: Features oriented metadata extraction framework for scientific publications. IEEE Access 2020, 8, 99458–99469.
26. Zhong, X.; Tang, J.; Yepes, A.J. PubLayNet: Largest dataset ever for document layout analysis. In Proceedings of the 2019 International Conference on Document Analysis and Recognition, Sydney, Australia, 20–25 September 2019; pp. 1015–1022.
27. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
28. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
30. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
31. Detectron2. Available online: https://github.com/facebookresearch/detectron2 (accessed on 7 December 2021).
32. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 14454–14463.
33. Wang, J.; Song, L.; Li, Z.; Sun, H.; Sun, J.; Zheng, N. End-to-end object detection with fully convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 15849–15858.
34. Ren, M.; Zemel, R.S. End-to-end instance segmentation with recurrent attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6656–6664.
35. Shen, Y.; Ji, R.; Wang, Y.; Wu, Y.; Cao, L. Cyclic guidance for weakly supervised joint detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 697–707.
36. Hu, J.; Cao, L.; Lu, Y.; Zhang, S.; Wang, Y.; Li, K.; Huang, F.; Shao, L.; Ji, R. ISTR: End-to-End Instance Segmentation with Transformers. arXiv 2021, arXiv:2105.00637.
37. Kaplan, F.; Oliveira, S.A.; Clematide, S.; Ehrmann, M.; Barman, R. Combining visual and textual features for semantic segmentation of historical newspapers. J. Data Min. Digit. Humanit. 2021.
38. Xu, Y.; Lv, T.; Cui, L.; Wang, G.; Lu, Y.; Florencio, D.; Zhang, C.; Wei, F. LayoutXLM: Multi-Modal Pre-Training for Multilingual Visually-Rich Document Understanding. arXiv 2021, arXiv:2104.08836.
39. Teufel, S.; Siddharthan, A.; Tidhar, D. Automatic classification of citation function. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Sydney, Australia, 22–23 July 2006; pp. 103–110.
40. Tsai, C.T.; Kundu, G.; Roth, D. Concept-based analysis of scientific literature. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, San Francisco, CA, USA, 27 October–1 November 2013; pp. 1733–1738.
41. Kim, S.N.; Medelyan, O.; Kan, M.Y.; Baldwin, T. Automatic keyphrase extraction from scientific articles. Lang. Resour. Eval. 2013, 47, 723–742.
42. Hasan, K.S.; Ng, V. Automatic keyphrase extraction: A survey of the state of the art. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, 22–27 June 2014; Volume 1, pp. 1262–1273.
43. Ronzano, F.; Saggion, H. Knowledge extraction and modeling from scientific publications. In Proceedings of the International Workshop on Semantic, Analytics, Visualization, Montreal, QC, Canada, 11 April 2016; pp. 11–25.
44. Yang, C.; Zhang, J.; Wang, H.; Li, B.; Han, J. Neural concept map generation for effective document classification with interpretable structured summarization. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Xi’an, China, 25–30 July 2020; pp. 1629–1632.
45. Zheng, B.; Wen, H.; Liang, Y.; Duan, N.; Che, W.; Jiang, D.; Zhou, M.; Liu, T. Document Modeling with Graph Attention Networks for Multi-grained Machine Reading Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6708–6718.
46. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
47. PDFMiner: Python PDF Parser and Analyzer. Available online: http://www.unixuser.org/~euske/python/pdfminer/ (accessed on 20 November 2021).
48. Jurgens, D.; Kumar, S.; Hoover, R.; McFarland, D.; Jurafsky, D. Measuring the evolution of a scientific field through citation frames. Trans. Assoc. Comput. Linguist. 2018, 6, 391–406.
49. Li, M.; Cui, L.; Huang, S.; Wei, F.; Zhou, M.; Li, Z. TableBank: Table benchmark for image-based table detection and recognition. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 1918–1925.
50. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
51. KoELECTRA: Pretrained ELECTRA Model for Korean. Available online: https://github.com/monologg/KoELECTRA (accessed on 17 December 2021).
52. DETR GitHub. Available online: https://github.com/facebookresearch/detr (accessed on 10 December 2021).
53. ISTR GitHub. Available online: https://github.com/hujiecpp/ISTR (accessed on 10 December 2021).
54. COCO Dataset Detection Eval. Available online: https://cocodataset.org/#detection-eval (accessed on 21 December 2021).
55. Chen, W.; Chang, M.; Schlinger, E.; Wang, W.Y.; Cohen, W.W. Open Question Answering over Tables and Text. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4–8 May 2021.
56. Geva, M.; Gupta, A.; Berant, J. Injecting Numerical Reasoning Skills into Language Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 946–958.
Metadata Field | Label (i.e., Layout) | Count |
---|---|---|
Out of boundary | O | 637,856 |
Title (in Korean) | title_ko | 46,056 |
Title (in English) | title_en | 64,414 |
Affiliation (in Korean) | aff_ko | 39,233 |
Affiliation (in English) | aff_en | 63,434 |
Abstract (in Korean) | abstract_ko | 31,885 |
Abstract (in English) | abstract_en | 55,318 |
Keywords (in Korean) | keywords_ko | 21,685 |
Semantic Elements | Training Set | Test Set | Total |
---|---|---|---|
Section title | 32,284 | 5,724 | 38,008 |
Paragraph | 111,253 | 19,990 | 131,243 |
Reference | 57,813 | 10,197 | 68,010 |
Table | 4,433 | 782 | 5,215 |
Figure | 12,937 | 2,303 | 15,240 |
Total num. of pages | 17,024 | 3,055 | 20,079 |
System Settings | Specification |
---|---|
CPU | Intel(R) Core(TM) i9-10900X CPU @ 3.70GHz |
GPU | Tesla V100-PCIE-32GB × 2 |
RAM | 256 GB |
CUDA Version | CUDA 10.1 |
Element | LAME [13] (F1-Score) | KoELECTRA [51] (F1-Score) |
---|---|---|
Title | 0.92 | 0.87 |
Abstract | 0.90 | 0.90 |
Keywords | 0.94 | 0.91 |
Author | 0.90 | 0.73 |
Affiliation | 0.92 | 0.57 |
Average | 0.93 | 0.87 |
Element | Vi-SEE (mAP) | Mask R-CNN Trained with PubLayNet and Fine-Tuned with Our Data (mAP) |
---|---|---|
Section title | 0.6841 | 0.6445 |
Paragraph | 0.8456 | 0.8018 |
Table | 0.9323 | 0.8989 |
Figure | 0.8975 | 0.7144 |
Reference | 0.8959 | 0.8398 |
Average | 0.8493 | 0.7798 |
Model | AP | AP50 | AP75 | APm | APl |
---|---|---|---|---|---|
A | 77.99% | 93.46% | 86.98% | 41.61% | 80.23% |
B | 81.50% | 97.10% | 90.10% | 59.30% | 82.10% |
C | 85.11% | 98.16% | 93.33% | 65.44% | 85.60% |
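The AP columns above follow the COCO detection-evaluation protocol, where a predicted box counts as a true positive when its intersection-over-union (IoU) with a ground-truth box exceeds the threshold (0.50 for AP50, 0.75 for AP75). As a reference, a minimal IoU computation looks like this; it is a sketch for illustration, not the evaluation code used in the paper:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (may be empty)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction shifted by half a box width overlaps one third of the union:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # → 0.3333...
```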
Semantic Element Types | Extracted Elements (Count) |
---|---|
Title | 49,094 |
Abstract | 60,526 |
Keywords | 56,634 |
Author | 52,216 |
Affiliation | 50,951 |
Section title | 1,019,749 |
Paragraph | 2,700,498 |
Table | 138,613 |
Figure | 388,416 |
Caption | 527,029 |
Reference | 1,711,959 |
Total PDF documents | 49,649 |
Total extracted semantic elements | 6,782,685 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kim, H.; Choi, J.; Park, S.; Jung, Y. Layout Aware Semantic Element Extraction for Sustainable Science & Technology Decision Support. Sustainability 2022, 14, 2802. https://doi.org/10.3390/su14052802