HeteEdgeWalk: A Heterogeneous Edge Memory Random Walk for Heterogeneous Information Network Embedding
Abstract
:1. Introduction
- We propose HeteEdgeWalk, a HIN-embedding technique based on network schema. It is an approach that does not involve meta-paths and optimizes the representation more fairly and comprehensively by capturing composite interactions;
- We analyze the disadvantages of sampling based on node types and suggest sampling based on edge types as a more versatile and finer-grained method for exploring HINs;
- Four large real-world datasets are utilized to evaluate the experimental. The results show that HeteEdgeWalk achieves optimal or sub-optimal performance compared to the baselines. In the study of parameter sensitivity, our approach exhibits robustness in various scenarios.
2. Related Work
3. Materials and Methods
3.1. Preliminaries
3.2. Dynamically Adjusted Bidirectional Edge Sampling Walk
Algorithm 1: Bi-EdgeSamplingWalk | |
Input: the graph , the start node per walk , the walk length , the decay parameter α. Output: the random walk sequence . | |
1: Initialize ; | |
2: ; | |
3: ; | |
4: ; | |
5: for do | |
6: ; | |
7: ; | |
8: update and ; | |
9: ; | |
10: end for | |
11: return |
3.3. Node Embedding Learning with SkipGram
Algorithm 2: ) | |
Input: the graph , the numbers of walks per node , the walk length , the node-embedding dimension , the context window size , the decay parameter α. Output: the node representations | |
1: Initialize ; | |
2: ; | |
3: for do; | |
4: W = Bi-EdgeSamplingWalk | |
5: ; | |
6: end for | |
7: end for | |
8: return |
3.4. Experimental Setup
3.4.1. Dataset Description
- DBLP [13] is an academic network based on four types of nodes. It includes 5915 authors (A), 5237 papers (P), 4479 topics (T), and 18 venues (V). In addition, each author corresponds to the below four domain areas: “data mining”, “information retrieval”, “database”, and “machine learning”;
- ACM [26] is the other academic network containing three types of nodes. The nodes of the network include 7167 authors (A), 4039 papers (P), and 60 subjects (S). Each paper falls under one of the three research categories of “databases”, “wireless communications”, or “data mining”;
- Foursquare [27] is a graph with four different types of nodes based on the user’s check-in history in New York City. It contains 1250 points of interest (P), 2449 users, 25,904 check-ins (C), and 168 timestamps (T). Additionally, each point of interest (P) is given a label in accordance with its class, such as “bar”;
- Movies [13] is an augmented movie graph with four different types of nodes. There are 10,350 actors (A), 6517 movies (M), 2582 directors (D), and 1335 composers (C) in it. The original data contains only three heterogeneous edge types (M–A, M–D, and M–C), while the augmented data adds two homogeneous edges (M–M and A–A). In addition, each movie corresponds to the following movie themes: “action”, “horror”, “adventure”, “science fiction”, and “crime”.
3.4.2. Baselines
- DeepWalk [7] is a classical homogeneous graph-embedding algorithm. It first performs random walks for each node and transforms the obtained random walks into node embeddings using the SkipGram model;
- Metapath2Vec [9] is a HIN-embedding algorithm based on meta-path-guided random walks. It first guides random walks through meta-paths defined in advance and then learns node embedding through the SkipGram model. For DBLP, the meta-paths as in [13] are P–A–P and A–P–V–P–A. The selection of P–A–P and P–S–P as meta-paths for ACM is similar to the selection in [26]. Similar to [13], the meta-paths for Foursquare are U–C–P–C–U and P–C–T–C–P, and the meta-paths for Movies are A–M–D–M–A and A–M–C–M–A, respectively;
- JUST [16] is a HIN-embedding algorithm that proposes whether meta-paths are necessary or not. The relational information of each node is treated as a domain, and whether to jump to the next node domain is controlled by the Jump and Stay operations. Jump is equivalent to selecting heterogeneous nodes, and similarly, Stay selects homogeneous nodes. It prioritizes heterogeneous Jump and reduces the heterogeneous Jump probability to achieve balanced sampling of heterogeneous and homogeneous edges. Additionally, it stores the last few selected node domain types to restrict the same node types of accesses. Finally, the node embedding is learned through the SkipGram model;
- Bidirectional Random Walks are the sampling part of MARU [24]. After this, it is simply called Bidirectional. It involves performing two independent random walks for each node and stitching together a random walk with the node as the center. This is used to overcome the problem that in traditional random walks, low-order nodes are mostly found at the beginning of the walk. It causes the SkipGram model to ignore information at the beginning of the walk, which leads to the loss of node information. Ref. [24] designed a unique heterogeneous SkipGram model and set the window size to the walk length to obtain complete context window information. However, in this paper, only the classical SkipGram algorithm is used for evaluation;
- SchemaWalk [13] is a schema-aware random walk algorithm for HIN embedding. The edge-sampling mechanism is used to control the selection of edge types by exponential descent. Nodes are then randomly drawn from their corresponding edge types to achieve a more balanced sampling effect. Finally, the sampled random walks are put into the SkipGram model to obtain the corresponding node embeddings and learn the node representations.
3.4.3. Implementation
4. Results
4.1. Node Classification Task
4.2. Node Clustering Task
4.3. Qualitative Analysis
4.4. Parameter Sensitivity Study
4.4.1. Impact of the Decay Parameter α
4.4.2. Node-Embedding Dimension
4.4.3. Context Window
4.5. Visualization
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Huang, Z.; Silva, A.; Singh, A. A broader picture of random-walk based graph embedding. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021. [Google Scholar]
- Tang, J.; Qu, M.; Mei, Q. PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015. [Google Scholar]
- Liben-Nowell, D.; Kleinberg, J. The link prediction problem for social networks. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, New Orleans, LA, USA, 3–8 November 2003. [Google Scholar]
- Nasiri, E.; Berahmand, K.; Li, Y. A new link prediction in multiplex networks using topologically biased random walks. Chaos Solitons Fractals 2021, 151, 111230. [Google Scholar] [CrossRef]
- Opsahl, T.; Panzarasa, P. Clustering in weighted networks. Soc. Netw. 2009, 31, 155–163. [Google Scholar] [CrossRef]
- Zhan, L.; Jia, T. CoarSAS2hvec: Heterogeneous Information Network Embedding with Balanced Network Sampling. Entropy 2022, 24, 276. [Google Scholar] [CrossRef] [PubMed]
- Perozzi, B.; Al-Rfou, R.; Skiena, S. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014. [Google Scholar]
- Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; Mei, Q. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–25 May 2015. [Google Scholar]
- Dong, Y.; Chawla, N.V.; Swami, A. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017. [Google Scholar]
- Shi, C.; Hu, B.; Zhao, W.X.; Philip, S.Y. Heterogeneous information network embedding for recommendation. IEEE Trans. Knowl. Data Eng. 2018, 31, 357–370. [Google Scholar] [CrossRef] [Green Version]
- Fu, T.-y.; Lee, W.-C.; Lei, Z. HIN2Vec: Explore Meta-paths in Heterogeneous Information Networks for Representation Learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017. [Google Scholar]
- He, Y.; Song, Y.; Li, J.; Ji, C.; Peng, J.; Peng, H. HeteSpaceyWalk: A Heterogeneous Spacey Random Walk for Heterogeneous Information Network Embedding. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019. [Google Scholar]
- Samy, A.E.; Giaretta, L.; Kefato, Z.T.; Girdzijauskas, Š. SchemaWalk: Schema Aware Random Walks for Heterogeneous Graph Embedding. In Proceedings of the Companion Proceedings of the Web Conference 2022, Lyon, France, 25–29 April 2022. [Google Scholar]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar] [CrossRef]
- Zhang, D.; Yin, J.; Zhu, X.; Zhang, C. Metagraph2vec: Complex semantic path augmented heterogeneous network embedding. In Proceedings of the Advances in Knowledge Discovery and Data Mining: 22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, 3–6 June 2018; Part II 22. pp. 196–208. [Google Scholar]
- Hussein, R.; Yang, D.; Cudré-Mauroux, P. Are Meta-Paths Necessary? Revisiting Heterogeneous Graph Embeddings. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino, Italy, 22–26 October 2018. [Google Scholar]
- Chang, S.; Han, W.; Tang, J.; Qi, G.-J.; Aggarwal, C.C.; Huang, T.S. Heterogeneous Network Embedding via Deep Architectures. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015. [Google Scholar]
- Gui, H.; Liu, J.; Tao, F.; Jiang, M.; Norick, B.; Han, J. Large-Scale Embedding Learning in Heterogeneous Event Data. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016. [Google Scholar]
- Carletti, T.; Fanelli, D.; Lambiotte, R. Random walks and community detection in hypergraphs. J. Phys. Complex. 2021, 2, 015011. [Google Scholar] [CrossRef]
- Hu, B.; Fang, Y.; Shi, C. Adversarial Learning on Heterogeneous Information Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019. [Google Scholar]
- Cinaglia, P.; Cannataro, M. A Method Based on Temporal Embedding for the Pairwise Alignment of Dynamic Networks. Entropy 2023, 25, 665. [Google Scholar] [CrossRef] [PubMed]
- Zheng, Y.; Hu, R.; Fung, S.-f.; Yu, C.; Long, G.; Guo, T.; Pan, S. Clustering social audiences in business information networks. Pattern Recognit. 2020, 100, 107126. [Google Scholar] [CrossRef]
- Athanasios, A.; Charalampos, V.; Vasileios, T.; Ashraf, G.M. Protein-Protein Interaction (PPI) Network: Recent Advances in Drug Discovery. Curr. Drug Metab. 2017, 18, 5–10. [Google Scholar] [CrossRef] [PubMed]
- Jiang, J.-Y.; Li, Z.; Ju, C.J.-T.; Wang, W. MARU: Meta-context Aware Random Walks for Heterogeneous Network Representation Learning. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual Event, 19–23 October 2020. [Google Scholar]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26, 3111–3119. [Google Scholar]
- Zhao, J.; Wang, X.; Shi, C.; Liu, Z.; Ye, Y. Network Schema Preserving Heterogeneous Information Network Embedding. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Yokohama, Japan, 11–17 July 2020. [Google Scholar]
- Yang, D.; Zhang, D.; Qu, B. Participatory Cultural Mapping Based on Collective Behavior Data in Location-Based Social Networks. ACM Trans. Intell. Syst. Technol. 2016, 7, 30. [Google Scholar] [CrossRef]
- Rehurek, R.; Sojka, P. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, 22 May 2010; pp. 45–50. [Google Scholar]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 1–9 January 2007. [Google Scholar]
- Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
- Hamilton, W.L.; Ying, Z.; Leskovec, J. Inductive Representation Learning on Large Graphs. In Proceedings of the NIPS 2017, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Wang, Y.; Sun, Z.; He, Q.; Li, J.; Ni, M.; Yang, M. Self-supervised graph representation learning integrates multiple molecular networks and decodes gene-disease relationships. Patterns 2023, 4, 100651. [Google Scholar] [CrossRef] [PubMed]
- Li, Y.; Su, Z. A Comment on “Cross-Platform Identification of Anonymous Identical Users in Multiple Social Media Networks”. IEEE Trans. Knowl. Data Eng. 2018, 30, 1409–1410. [Google Scholar] [CrossRef] [Green Version]
- Tajeuna, E.G.; Bouguessa, M.; Wang, S. Modeling and Predicting Community Structure Changes in Time-Evolving Social Networks. IEEE Trans. Knowl. Data Eng. 2019, 31, 1166–1180. [Google Scholar] [CrossRef]
Dataset | Node | Edge |
---|---|---|
DBLP | author: 5915 | P-T: 26,532 |
paper: 5237 | P-A: 13,589 | |
topic: 4479 | P-P: 6984 | |
venue: 18 | P-V: 4258 | |
ACM | author: 7167 | A-P: 13,407 P-S: 4019 |
paper: 4039 | ||
subject: 60 | ||
Foursquare | check-in: 25,904 | U-C: 25,904 |
user: 2449 | C-P: 25,904 | |
point of interest: 1250 | C-T: 25,904 | |
timestamp: 168 | U-U: 5696 | |
Movies | actor: 10,350 movie: 6517 director: 2582 composer: 1335 | A-A: 43,395 M-A: 23,223 M-M: 6194 M-D: 5630 M-C: 4001 |
Method | DBLP | ACM | Foursquare | Movies | ||||
---|---|---|---|---|---|---|---|---|
t | p | t | p | t | p | t | p | |
DeepWalk | −9.474 | 0.000 ** | −6.787 | 0.000 ** | −7.454 | 0.000 ** | −8.399 | 0.000 ** |
Metapath2vec | −25.272 | 0.000 ** | −8.457 | 0.000 ** | 5.952 | 0.001 ** | −13.641 | 0.000 ** |
JUST | −3.614 | 0.009 ** | −26.185 | 0.000 ** | −2.798 | 0.027 * | −23.956 | 0.000 ** |
Bidirectional | −6.554 | 0.000 ** | −0.535 | 0.609 | −3.792 | 0.007 ** | −17.626 | 0.000 ** |
SchemaWalk | −8.670 | 0.000 ** | −3.904 | 0.006 ** | −4.198 | 0.004 ** | −17.608 | 0.000 ** |
Method | DBLP | ACM | Foursquare |
---|---|---|---|
DeepWalk | 0.121 | 0.364 | 0.281 |
Metapath2Vec | 0.551 | 0.371 | 0.288 |
JUST | 0.136 | 0.341 | 0.287 |
Bidirectional | 0.248 | 0.375 | 0.275 |
SchemaWalk | 0.593 | 0.322 | 0.271 |
HeteEdgeWalk | 0.599 | 0.381 | 0.294 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Liu, Z.; Zhang, S.; Zhang, J.; Jiang, M.; Liu, Y. HeteEdgeWalk: A Heterogeneous Edge Memory Random Walk for Heterogeneous Information Network Embedding. Entropy 2023, 25, 998. https://doi.org/10.3390/e25070998
Liu Z, Zhang S, Zhang J, Jiang M, Liu Y. HeteEdgeWalk: A Heterogeneous Edge Memory Random Walk for Heterogeneous Information Network Embedding. Entropy. 2023; 25(7):998. https://doi.org/10.3390/e25070998
Chicago/Turabian StyleLiu, Zhenpeng, Shengcong Zhang, Jialiang Zhang, Mingxiao Jiang, and Yi Liu. 2023. "HeteEdgeWalk: A Heterogeneous Edge Memory Random Walk for Heterogeneous Information Network Embedding" Entropy 25, no. 7: 998. https://doi.org/10.3390/e25070998
APA StyleLiu, Z., Zhang, S., Zhang, J., Jiang, M., & Liu, Y. (2023). HeteEdgeWalk: A Heterogeneous Edge Memory Random Walk for Heterogeneous Information Network Embedding. Entropy, 25(7), 998. https://doi.org/10.3390/e25070998