Vec2Dynamics: A Temporal Word Embedding Approach to Exploring the Dynamics of Scientific Keywords—Machine Learning as a Case Study
Abstract
1. Introduction
- Numerically, we compute the average stability of the k-nearest neighbors of keywords over all time windows and detect the stable/unstable periods in the history of Machine Learning.
- Visually, we illustrate the advantages of temporal word embeddings in showing the evolution of scientific keywords by drawing Venn diagrams of k-NN keywords over each pair of subsequent time periods. All visualisations show that our approach is able to illustrate the dynamics of the Machine Learning literature.
- To detect the dynamics of keywords, word vectors are learned across time. Then, based on the similarity measure between the embedding vectors of keywords, the k-nearest neighbors (k-NN) of each keyword are defined over successive timespans.
- The change in the stability of the k-NN over time reflects the dynamics of keywords and, accordingly, the dynamics of the research area.
- Vec2Dynamics is evaluated in the area of Machine Learning with the NIPS publications between the years 1987 and 2016.
- Both numerical and visual methods are adopted to perform a descriptive analysis and evaluate the effectiveness of Vec2Dynamics in tracking the dynamics of scientific keywords.
- A Machine Learning timeline has been adopted as the standard for the descriptive analysis, and a generally good consistency between the obtained results and this timeline has been found.
- Venn diagrams have been used in this paper for qualitative analyses to highlight the semantic shifts of scientific keywords by showing the evolution of their semantic neighborhoods over time.
2. Related Work
2.1. Computational History of Science: Approaches and Disciplines
2.2. Computational History of Computer Science
2.2.1. Bibliometrics and Hybrid Approaches
2.2.2. Content-Based Approaches
2.3. Limitations
3. Vec2Dynamics
3.1. Vec2Dynamics Architecture
- (i) Data preprocessing. At this stage, we preprocess and clean up the textual content of research papers, taking into account the specificity of scientific language. For instance, we consider the frequent use of bigrams in scientific language, such as “information system” and “artificial intelligence”, and we construct a bag of keywords where keywords are either unigrams or bigrams. A unigram, or 1-gram, represents a one-word sequence, such as “science” or “data”, while a bigram, or 2-gram, represents a two-word sequence, such as “machine learning” or “data science”. Data preprocessing then consists of two steps: (a) the removal of stop words, i.e., words that appear very frequently, like “and”, “the”, and “a”; and (b) the construction of the bag of keywords, where keywords are either unigrams or bigrams. More details on the data preprocessing stage are given in our previous work [15].
- (ii) Word embedding. At this stage, we adopt the skip-gram neural network architecture of the word2vec embedding model [32] to learn word vectors over time. This stage is repeated for each corpus, i.e., the corpus of all research papers in each time window. More details are given in Section 3.2.
- (iii) Similarity computation. After generating the vector representations of keywords, we apply cosine similarity between embedding vectors to find the k-nearest neighbors of each keyword. Recall that the cosine similarity between two keywords $w_i$ and $w_j$ refers to the cosine measure between their embedding vectors $\vec{w_i}$ and $\vec{w_j}$, as follows: $\cos(\vec{w_i}, \vec{w_j}) = \frac{\vec{w_i} \cdot \vec{w_j}}{\lVert \vec{w_i} \rVert \, \lVert \vec{w_j} \rVert}$. As with the previous stage, this stage of similarity computation is also repeated at each time window.
- (iv) Stability computation. At this stage, we study the stability of the k-NN of each keyword of interest over time in order to track the dynamics of the scientific literature. To do so, we define a stability measure (Equation (4)) that is computed between the sets of k-NN keywords over two subsequent time windows. Based on the obtained stability values, we define four types of keywords/topics: recurrent, non-recurrent, persistent, and emerging keywords. More details are given in Section 3.3.
3.2. Dynamic Word Embedding
3.2.1. Notation
3.2.2. Skip-Gram Model
3.3. k-NN Stability
3.3.1. Notation
3.3.2. Interpretation
Algorithm 1: Finding the types of keywords based on their dynamism.
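Algorithm 1 itself is not reproduced in this excerpt. The sketch below is one plausible reading of how the four keyword types named in Section 3.1 could be assigned from a keyword's per-window stability trace; the threshold and the decision rules are hypothetical, and `None` marks windows in which the keyword does not occur (the “n/a” cells of the results table).

```python
# Hypothetical classification of a keyword from its stability trace.
# stabilities: list of per-window stability values, None = keyword absent.

def keyword_type(stabilities, threshold=0.0):
    present = [s for s in stabilities if s is not None]
    if not present:
        return "non-recurrent"   # never occurs in the corpus
    half = len(stabilities) // 2
    if stabilities[-1] is not None and all(s is None for s in stabilities[:half]):
        return "emerging"        # absent early on, present in recent windows
    if all(s is not None for s in stabilities) and min(present) >= threshold:
        return "persistent"      # present throughout with stable neighbors
    return "recurrent"           # present intermittently or with unstable k-NN

keyword_type([None, None, None, 0.2, 0.4])  # absent early, present late
keyword_type([0.3, 0.1, 0.2, 0.5, 0.4])     # present in every window
```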
4. Experiments
4.1. NIPS Dataset
4.2. Results and Discussion
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| ACL | Association for Computational Linguistics |
| CLR | Computational Literature Review |
| CS | Computer Science |
| CSO | Computer Science Ontology |
| DCA | Document-Citation Analysis |
| FoS | Field of Study |
| IS | Information Science |
| k-NN | k-Nearest Neighbors |
| LDA | Latent Dirichlet Allocation |
| LSTM | Long Short-Term Memory |
| MAG | Microsoft Academic Graph |
| ML | Machine Learning |
| NIPS | Neural Information Processing Systems |
| SE | Software Engineering |
| TAM | Technology Acceptance Model |
References
- Xia, F.; Wang, W.; Bekele, T.M.; Liu, H. Big Scholarly Data: A Survey. IEEE Trans. Big Data 2017, 3, 18–35.
- Yu, Z.; Menzies, T. FAST2: An intelligent assistant for finding relevant papers. Expert Syst. Appl. 2019, 120, 57–71.
- An, Y.; Han, M.; Park, Y. Identifying dynamic knowledge flow patterns of business method patents with a hidden Markov model. Scientometrics 2017, 113, 783–802.
- Anderson, A.; McFarland, D.; Jurafsky, D. Towards a Computational History of the ACL: 1980–2008. In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, Jeju Island, Korea, 10 July 2012; pp. 13–21.
- Effendy, S.; Yap, R.H. Analysing Trends in Computer Science Research: A Preliminary Study Using The Microsoft Academic Graph. In Proceedings of the 26th International Conference on World Wide Web Companion, Perth, Australia, 3–7 April 2017; pp. 1245–1250.
- Hall, D.; Jurafsky, D.; Manning, C.D. Studying the History of Ideas Using Topic Models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, Honolulu, HI, USA, 25–27 October 2008; pp. 363–371.
- Hoonlor, A.; Szymanski, B.K.; Zaki, M.J. Trends in Computer Science Research. Commun. ACM 2013, 56, 74–83.
- Hou, J.; Yang, X.; Chen, C. Emerging trends and new developments in information science: A document co-citation analysis (2009–2016). Scientometrics 2018, 115, 869–892.
- Mortenson, M.J.; Vidgen, R. A Computational Literature Review of the Technology Acceptance Model. Int. J. Inf. Manag. 2016, 36, 1248–1259.
- Rossetto, D.E.; Bernardes, R.C.; Borini, F.M.; Gattaz, C.C. Structure and evolution of innovation research in the last 60 years: Review and future trends in the field of business through the citations and co-citations analysis. Scientometrics 2018, 115, 1329–1363.
- Santa Soriano, A.; Lorenzo Álvarez, C.; Torres Valdés, R.M. Bibliometric analysis to identify an emerging research area: Public Relations Intelligence. Scientometrics 2018, 115, 1591–1614.
- Zhang, C.; Guan, J. How to identify metaknowledge trends and features in a certain research field? Evidences from innovation and entrepreneurial ecosystem. Scientometrics 2017, 113, 1177–1197.
- Taskin, Z.; Al, U. A content-based citation analysis study based on text categorization. Scientometrics 2018, 114, 335–357.
- Ruas, T.; Grosky, W.; Aizawa, A. Multi-sense embeddings through a word sense disambiguation process. Expert Syst. Appl. 2019, 136, 288–303.
- Dridi, A.; Gaber, M.M.; Azad, R.M.A.; Bhogal, J. Leap2Trend: A Temporal Word Embedding Approach for Instant Detection of Emerging Scientific Trends. IEEE Access 2019, 7, 176414–176428.
- Weismayer, C.; Pezenka, I. Identifying emerging research fields: A longitudinal latent semantic keyword analysis. Scientometrics 2017, 113, 1757–1785.
- Picasso, A.; Merello, S.; Ma, Y.; Oneto, L.; Cambria, E. Technical analysis and sentiment embeddings for market trend prediction. Expert Syst. Appl. 2019, 135, 60–70.
- Boyack, K.W.; Smith, C.; Klavans, R. Toward predicting research proposal success. Scientometrics 2018, 114, 449–461.
- Liu, Y.; Huang, Z.; Yan, Y.; Chen, Y. Science Navigation Map: An Interactive Data Mining Tool for Literature Analysis. In Proceedings of the 24th International Conference on World Wide Web, WWW’15 Companion, Florence, Italy, 18–22 May 2015; pp. 591–596.
- Qiu, Q.; Xie, Z.; Wu, L.; Li, W. Geoscience keyphrase extraction algorithm using enhanced word embedding. Expert Syst. Appl. 2019, 125, 157–169.
- Alam, M.M.; Ismail, M.A. RTRS: A recommender system for academic researchers. Scientometrics 2017, 113, 1325–1348.
- Dey, R.; Roy, A.; Chakraborty, T.; Ghosh, S. Sleeping beauties in Computer Science: Characterization and early identification. Scientometrics 2017, 113, 1645–1663.
- Effendy, S.; Jahja, I.; Yap, R.H. Relatedness Measures Between Conferences in Computer Science: A Preliminary Study Based on DBLP. In Proceedings of the 23rd International Conference on World Wide Web, Seoul, Korea, 7–11 April 2014; pp. 1215–1220.
- Effendy, S.; Yap, R.H.C. The Problem of Categorizing Conferences in Computer Science. In Research and Advanced Technology for Digital Libraries; Fuhr, N., Kovács, L., Risse, T., Nejdl, W., Eds.; Springer: Cham, Switzerland, 2016; pp. 447–450.
- Kim, S.; Hansen, D.; Helps, R. Computing research in the academy: Insights from theses and dissertations. Scientometrics 2018, 114, 135–158.
- Glass, R.; Vessey, I.; Ramesh, V. Research in software engineering: An analysis of the literature. Inf. Softw. Technol. 2002, 44, 491–506.
- Schlagenhaufer, C.; Amberg, M. A descriptive literature review and classification framework for gamification in information systems. In Proceedings of the Twenty-Third European Conference on Information Systems (ECIS), Münster, Germany, 26–29 May 2015; pp. 1–15.
- Martin, P.Y.; Turner, B.A. Grounded Theory and Organizational Research. J. Appl. Behav. Sci. 1986, 22, 141–157.
- Salatino, A.A.; Osborne, F.; Motta, E. How are topics born? Understanding the research dynamics preceding the emergence of new areas. PeerJ Comput. Sci. 2017, 3, e119.
- He, J.; Chen, C. Predictive Effects of Novelty Measured by Temporal Embeddings on the Growth of Scientific Literature. Front. Res. Metrics Anal. 2018, 3, 9.
- Dridi, A.; Gaber, M.M.; Azad, R.M.A.; Bhogal, J. DeepHist: Towards a Deep Learning-based Computational History of Trends in the NIPS. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8.
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed Representations of Words and Phrases and their Compositionality. Adv. Neural Inf. Process. Syst. 2013, 26, 3111–3119.
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Volume 14, pp. 1532–1543.
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146.
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781.
- Mikolov, T.; Yih, W.t.; Zweig, G. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, GA, USA, 9–14 June 2013; pp. 746–751.
- Dridi, A.; Gaber, M.M.; Azad, R.M.A.; Bhogal, J. k-NN Embedding Stability for word2vec Hyper-Parametrisation in Scientific Text. In International Conference on Discovery Science; Springer: Cham, Switzerland, 2018; pp. 328–343.
- Osborne, F.; Motta, E. Mining Semantic Relations between Research Areas. In The Semantic Web—ISWC 2012; Cudré-Mauroux, P., Heflin, J., Sirin, E., Tudorache, T., Euzenat, J., Hauswirth, M., Parreira, J.X., Hendler, J., Schreiber, G., Bernstein, A., et al., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 410–426.
- Orkphol, K.; Yang, W. Word Sense Disambiguation Using Cosine Similarity Collaborates with Word2vec and WordNet. Future Internet 2019, 11, 114.
- Wikipedia. Timeline of Machine Learning. 2022. Available online: https://en.wikipedia.org/wiki/Timeline_of_machine_learning (accessed on 1 December 2021).
- Ho, T.K. Random Decision Forests. In Proceedings of the Third International Conference on Document Analysis and Recognition, ICDAR’95, Montreal, QC, Canada, 14–16 August 1995; Volume 1, pp. 278–282.
- Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Campbell, M.; Hoane, A.J., Jr.; Hsu, F.H. Deep Blue. Artif. Intell. 2002, 134, 57–83.
- Le, Q.V.; Ranzato, M.; Monga, R.; Devin, M.; Chen, K.; Corrado, G.S.; Dean, J.; Ng, A.Y. Building High-level Features Using Large Scale Unsupervised Learning. In Proceedings of the 29th International Conference on International Conference on Machine Learning, ICML’12, Edinburgh, UK, 26 June–1 July 2012; Omnipress: Madison, WI, USA, 2012; pp. 507–514.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS’12, Lake Tahoe, NV, USA, 3–6 December 2012; Curran Associates Inc.: Red Hook, NY, USA, 2012; Volume 1, pp. 1097–1105.
- Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR’14, Columbus, OH, USA, 23–28 June 2014; IEEE Computer Society: Washington, DC, USA, 2014; pp. 1701–1708.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Collobert, R.; Bengio, S.; Mariéthoz, J. Torch: A modular machine learning software library. Technical Report IDIAP-RR 02-46; IDIAP: Martigny, Switzerland, 2002.
- Mani, I.; Maybury, M.T. Advances in Automatic Text Summarization; MIT Press: Cambridge, MA, USA, 1999.
- Karypis, G.; Kumar, V. Chameleon: Hierarchical clustering using dynamic modeling. Computer 1999, 32, 68–75.
| Time Window | Papers | Words | Vocabulary |
|---|---|---|---|
| From 1987 to 1989 | 288 | 16,273 | 9147 |
| From 1990 to 1992 | 417 | 465,169 | 169,728 |
| From 1993 to 1995 | 453 | 914,871 | 166,954 |
| From 1996 to 1998 | 456 | 1,387,070 | 173,341 |
| From 1999 to 2001 | 499 | 1,943,821 | 197,845 |
| From 2002 to 2004 | 615 | 2,716,271 | 264,241 |
| From 2005 to 2007 | 631 | 3,595,398 | 292,681 |
| From 2008 to 2010 | 807 | 4,847,535 | 379,086 |
| From 2011 to 2013 | 1037 | 6,501,435 | 480,440 |
| From 2014 to 2016 | 1386 | 8,732,443 | 610,383 |
[Table: k-NN stability of selected keywords over subsequent three-year time windows (93–95 through 14–16). Keywords tracked: machine_learning, neural_network, supervised_learning, unsupervised_learning, reinforcement_learning, time_series, artificial_intelligence, gaussian_process, semi_supervised, active_learning, decision_trees, dimensionality_reduction, dynamic_programming, gradient_descent, hidden_markov, mutual_information, nearest_neighbor, pattern_recognition, monte_carlo, and graphical_model. Cells marked “n/a” indicate windows in which the keyword does not occur; most numeric cell values were lost in extraction. Only two Average-row values survive: −0.531 and −0.41 (column alignment uncertain).]
Share and Cite
Dridi, A.; Gaber, M.M.; Azad, R.M.A.; Bhogal, J. Vec2Dynamics: A Temporal Word Embedding Approach to Exploring the Dynamics of Scientific Keywords—Machine Learning as a Case Study. Big Data Cogn. Comput. 2022, 6, 21. https://doi.org/10.3390/bdcc6010021