Generalized Sketches for Streaming Sets
Abstract
:1. Introduction
- User cardinality. A user’s cardinality is defined to be the number of distinct interests that the user links to, i.e., the cardinality of the user’s interest set. Monitoring user cardinalities in computer networks is fundamental for applications such as network anomaly detection [2]. One can use a variety of sketch methods, such as LPC [3], LogLog [4], HyperLogLog [5], MinCount [6], RoughEstimator [7], and their variants [2,8,9,10,11,12,13] to generate a compact data summary (or sketch) for each user’s interest set, and then infer user cardinalities from the generated sketches.
- Common-interest count. Two users’ common-interest count is defined as the number of interests that they both link to, i.e., the cardinality of the intersection of two users’ interest sets. It is popularly used for applications such as friend recommendation [14] and network anomaly detection [15]. One can use sketch and sampling methods in [13,14,15] to estimate users’ common-interest counts.
- User Jaccard similarity. The Jaccard similarity is a popular similarity measure. Two users’ similarity can be evaluated as the Jaccard similarity of their interest sets, which is defined as their common-interest count divided by the number of distinct interests at least one of them links to. One can use sketch methods such as MinHash [16], OPH [17], and their variants [18,19,20,21,22] to estimate users’ Jaccard similarities. Some of these sketch methods, such as Minhash, can be used for locality-sensitive hashing (LSH) indexing [16,23,24,25,26], which is capable of searching a query user’s similar users with sub-linear time.
- For sets given in a stream fashion, our method, SimCar, builds only one type of sketches (i.e., OH sketch) to effectively mine a variety of their statistics, including cardinality, intersection cardinality, and Jaccard similarity, as well as fast search similar sets. It outperforms the straightforward method of combining existing sketch-based cardinality estimation methods and Jaccard similarity estimation methods.
- We develop maximum likelihood estimation (MLE) methods to estimate set cardinalities and intersection cardinalities, which are more accurate than the original OH-sketch-based cardinality estimation method [6].
- Compared with state-of-the-art methods, experimental results on real-world datasets demonstrate that our method, SimCar, is up to 1000 times faster for online data streams processing, and reduces the memory usage by 13.5–23.8% to achieve the same performance accuracy.
2. Related Work
3. Problem Formulation
The stream of interest | |
The sets of users and interests, | |
The random rank value of interest | |
k | The number of registers used for a MinHash, OPH, or OH sketch |
The number of registers used for an HLL sketch | |
The set of interests of user | |
The set of common interests of users u and v in U, i.e., | |
The set of interests affiliated with user u but not user v, i.e., | |
The set of interests w in with | |
The set of interests w in with | |
The set of interests w in with | |
The cardinality of user , i.e., | |
The common-interest count of users u and v, i.e., | |
The number of elements in set i.e., | |
The number of interests w in with | |
The number of interests w in with | |
The number of interests w in with | |
The Jaccard similarity of sets and | |
OH sketch of , | |
DOH sketch of , | |
, | the element of vectors and , respectively |
Random permutation from W to W | |
Random variable selected from uniformly | |
b, c | Two parameters of LSH tables |
PDF of random variable x given condition y | |
The set of elements in that are smaller than 1 | |
The number of elements in that equal 1 | |
Parameters of Poisson approximations | |
The set of elements in and whose relations are Case j defined in Section 5.3, | |
The cardinality of set , | |
, | Gradient and Hessian matrix of likelihood function at |
4. Preliminaries
4.1. Giroire’s Algorithm
4.2. OPH
Algorithm 1: Online update procedure of OPH [17]. |
4.3. Optimal Densification
5. Our Framework SimCar
Algorithm 3: The CORE algorithm [59]. |
5.1. Jaccard Similarity Estimation and Nearest Neighbor Search
- Pre-processing phase. We first construct b hash tables for all users in U. Let , where c is an integer parameter used to determine the performance of LSH. Let , , denote the vector consisting of the to elements of DOH sketch , i.e., . Then, we insert each user u and its densification sketch into hash tables as
- Retrieving phase. Given a specific user u, we search for its similar users. Instead of scanning all users in set U, we only probe b different buckets from b hash tables, respectively, and report the users in any of these buckets as the potential candidates. Last, nearest neighbors of user u are detected by enumerating all users in set , and computing and sorting their Jaccard similarity estimations.
5.2. User Cardinality Estimation
5.3. Common-Interest Count Estimation
5.4. Space and Time Complexities
6. Evaluation
6.1. Datasets
# Users | # Interests | Size | |
---|---|---|---|
Movie | 6040 | 9746 | 1,000,209 |
Reuters | 21,557 | 60,234 | 1,464,182 |
Tropes | 152,093 | 64,415 | 3,232,134 |
Wikipedia | 2780 | 276,739 | 7,846,807 |
Flickr | 499,610 | 395,979 | 8,545,307 |
TREC | 556,077 | 1,729,302 | 151,632,178 |
6.2. Baselines
- MinHash and HLL. For common-interest count estimation, the baseline is the method [15], which estimates common-interest counts by generating both MinHash [16] and HyperLogLog [5] (short for HLL in the following paper) sketches for each user’s interest set. Clearly, one can use generated HLL sketches to estimate user cardinalities, and use generated Minhash sketches to estimate Jaccard similarities and build LSH tables for a nearest neighbor search.
- OPH+HLL. One can also solve all four tasks by combining OPH [17] and HLL, where OPH [17] is much faster than MinHash and exhibits comparable accuracy for applications such as Jaccard similarity estimation and nearest neighbor search [20,21,22]. In detail, HLL sketches were used to estimate user cardinalities, and optimal densified OPH sketches were used to estimate Jaccard similarity and build LSH tables for a rapid nearest neighbor search. To estimate the common-interest count of any two users u and v, we first easily obtained (1) estimations , , and of , , and from HLL sketches and our OH sketches; (2) estimation of Jaccard similarity from optimal densified OPH sketches and DOH sketches. As shown in [15], the common-interest count can be estimated by each of the following schemes: (scheme 1) ; (scheme 2) ; (scheme 3) . Similar to MinHash and HLL method [15], we initialize the parameter for our common-interest count estimation method in Section 5.3 and then run only a single Newton–Raphson iteration to obtain a more accurate estimation of . In our experiments, common-interest count estimations given by OPH and HLL are computed by the above schemes 1, 2, and 3 in a direct manner.
6.3. Metric
6.4. Runtime
6.5. Accuracy
Jaccard Similarity Estimation | Top 10 Nearest Neighbor Search | ||||||||
---|---|---|---|---|---|---|---|---|---|
NRMSE | Recall | the Number of Retrieved Users | |||||||
SimCar | MinHash | OPH | SimCar | MinHash | OPH | SimCar | MinHash | OPH | |
Reuters | 0.177 | 0.175 | 0.176 | 0.732 | 0.735 | 0.732 | 151 | 156 | 151 |
Movie | 0.072 | 0.072 | 0.073 | 0.800 | 0.800 | 0.800 | 599 | 597 | 598 |
Wikipedia | 0.085 | 0.085 | 0.085 | 0.799 | 0.800 | 0.800 | 2395 | 2398 | 2395 |
Tropes | 0.168 | 0.171 | 0.170 | 0.555 | 0.561 | 0.556 | 220 | 222 | 220 |
Flickr | 0.172 | 0.175 | 0.174 | 0.765 | 0.764 | 0.764 | 299 | 301 | 299 |
Actor | 0.240 | 0.235 | 0.239 | 0.740 | 0.734 | 0.739 | 174 | 175 | 175 |
TREC | 0.076 | 0.077 | 0.077 | 0.894 | 0.896 | 0.894 | 219 | 227 | 220 |
7. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
References
- Cormode, G.; Muthukrishnan, S. An Improved Data Stream Summary: The Count-min Sketch and Its Applications. J. Algorithms 2005, 55, 58–75. [Google Scholar] [CrossRef] [Green Version]
- Estan, C.; Varghese, G.; Fisk, M. Bitmap algorithms for counting active flows on high speed links. In Proceedings of the SIGCOMM, Karlsruhe, Germany, 25–29 August 2003; pp. 182–209. [Google Scholar]
- Whang, K.; Vander-zanden, B.T.; Taylor, H.M. A linear-time probabilistic counting algorithm for database applications. IEEE Trans. Database Syst. 1990, 15, 208–229. [Google Scholar] [CrossRef]
- Durand, M.; Flajolet, P. Loglog Counting of Large Cardinalities; Springer: Berlin/Heidelberg, Germany, 2003; pp. 605–617. [Google Scholar]
- Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Proceedings of the AOFA, Nice, France, 17–22 June 2007. [Google Scholar]
- Giroire, F. Order statistics and estimating cardinalities of massive data sets. Discret. Appl. Math. 2009, 157, 406–427. [Google Scholar] [CrossRef] [Green Version]
- Kane, D.M.; Nelson, J.; Woodruff, D.P. An Optimal Algorithm for the Distinct Elements Problem. In Proceedings of the PODS, Indianapolis, IN, USA, 6–11 June 2010; pp. 41–52. [Google Scholar]
- Zhao, Q.; Kumar, A.; Xu, J. Joint data streaming and sampling techniques for detection of super sources and destinations. In Proceedings of the ACM SIGCOMM IMC 2005, Berkeley, CA, USA, 19–21 October 2005; pp. 77–90. [Google Scholar]
- Yoon, M.; Li, T.; Chen, S.; Peir, J.K. Fit a spread estimator in small memory. In Proceedings of the IEEE INFOCOM 2009, Rio de Janeiro, Brazil, 19–25 April 2009; pp. 504–512. [Google Scholar]
- Wang, P.; Guan, X.; Qin, T.; Huang, Q. A Data Streaming Method for Monitoring Host Connection Degrees of High-Speed Links. IEEE Trans. Inf. Forensics Secur. 2011, 6, 1086–1098. [Google Scholar] [CrossRef]
- Xiao, Q.; Chen, S.; Chen, M.; Ling, Y. Hyper-Compact Virtual Estimators for Big Network Data Based on Register Sharing. In Proceedings of the SIGMETRICS, Portland, OR, USA, 15–19 June 2015; pp. 417–428. [Google Scholar]
- Chen, A.; Cao, J.; Shepp, L.; Nguyen, T. Distinct counting with a self-learning bitmap. J. Am. Stat. Assoc. 2011, 106, 879–890. [Google Scholar] [CrossRef] [Green Version]
- Ting, D. Towards Optimal Cardinality Estimation of Unions and Intersections with Sketches. In Proceedings of the SIGKDD, San Francisco, CA, USA, 13–17 August 2016. [Google Scholar]
- Zhao, P.; Aggarwal, C.C.; He, G. Link prediction in graph streams. In Proceedings of the 32nd IEEE International Conference on Data Engineering, (ICDE 2016), Helsinki, Finland, 16–20 May 2016; pp. 553–564. [Google Scholar]
- Cohen, R.; Katzir, L.; Yehezkel, A. A Minimal Variance Estimator for the Cardinality of Big Data Set Intersection. In Proceedings of the SIGKDD, Halifax, NS, Canada, 13–17 August 2017. [Google Scholar]
- Broder, A.Z.; Charikar, M.; Frieze, A.M.; Mitzenmacher, M. Min-Wise Independent Permutations. J. Comput. Syst. Sci. 2000, 60, 630–659. [Google Scholar] [CrossRef] [Green Version]
- Li, P.; Owen, A.B.; Zhang, C. One Permutation Hashing. In Proceedings of the NIPS, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 3122–3130. [Google Scholar]
- Li, P.; König, A.C. b-Bit minwise hashing. In Proceedings of the WWW, Raleigh, NC, USA, 26–30 April 2010; pp. 671–680. [Google Scholar]
- Mitzenmacher, M.; Pagh, R.; Pham, N. Efficient estimation for high similarities using odd sketches. In Proceedings of the WWW, Doha, Qatar, 7–11 April 2014; pp. 109–118. [Google Scholar]
- Shrivastava, A.; Li, P. Improved Densification of One Permutation Hashing. In Proceedings of the UAI, Quebec City, QC, Canada, 23–27 July 2014; pp. 732–741. [Google Scholar]
- Shrivastava, A.; Li, P. Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search. In Proceedings of the ICML, Beijing, China, 21–26 June 2014; pp. 557–565. [Google Scholar]
- Shrivastava, A. Optimal Densification for Fast and Accurate Minwise Hashing. In Proceedings of the ICML, Sydney, Australia, 6–11 August 2017; pp. 3154–3163. [Google Scholar]
- Indyk, P.; Motwani, R. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the STOC, Dallas, TX, USA, 23–26 May 1998; pp. 604–613. [Google Scholar]
- Gionis, A.; Indyk, P.; Motwani, R. Similarity Search in High Dimensions via Hashing. In Proceedings of the PVLDB, Edinburgh, UK, 7–10 September 1999; pp. 518–529. [Google Scholar]
- Charikar, M. Similarity estimation techniques from rounding algorithms. In Proceedings of the STOC, Montreal, QC, Canada, 19–21 May 2002; pp. 380–388. [Google Scholar]
- Datar, M.; Immorlica, N.; Indyk, P.; Mirrokni, V.S. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the SOCG, Brooklyn, NY, USA, 8–11 June 2004; pp. 253–262. [Google Scholar]
- Wang, P.; Qi, Y.; Zhang, Y.; Zhai, Q.; Wang, C.; Lui, J.C.S.; Guan, X. A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets. In Proceedings of the KDD, Anchorage, AK, USA, 4–8 August 2019; pp. 25–33. [Google Scholar]
- Li, X.; Li, P. C-OPH: Improving the Accuracy of One Permutation Hashing (OPH) with Circulant Permutations. arXiv 2021, arXiv:2111.09544. [Google Scholar]
- Fernandez, R.C.; Min, J.; Nava, D.; Madden, S. Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1190–1201. [Google Scholar]
- Manasse, M.; McSherry, F.; Talwar, K. Consistent Weighted Sampling; Technical Report; John Cappelen: Copenhagen, Denmark, 2010. [Google Scholar]
- Haeupler, B.; Manasse, M.S.; Talwar, K. ling Made Fast, Small, and Easy. CoRR 2014. abs/1410.4266. [Google Scholar]
- Ioffe, S. Improved Consistent Sampling, Weighted Minhash and L1 Sketching. In Proceedings of the ICDM, Sydney, Australia, 13–17 December 2010; pp. 246–255. [Google Scholar]
- Li, P. 0-Bit Consistent Weighted Sampling. In Proceedings of the KDD, Sydney, Australia, 10–13 August 2015; pp. 665–674. [Google Scholar]
- Wu, W.; Li, B.; Chen, L.; Zhang, C. Canonical Consistent Weighted Sampling for Real-Value Weighted Min-Hash. In Proceedings of the ICDM, Barcelona, Spain, 12–15 December 2016; pp. 1287–1292. [Google Scholar]
- Shrivastava, A. Simple and Efficient Weighted Minwise Hashing. In Proceedings of the NIPS, Barcelona, Spain, 5–10 December 2016; pp. 1498–1506. [Google Scholar]
- Wu, W.; Li, B.; Chen, L.; Zhang, C. Consistent Weighted Sampling Made More Practical. In Proceedings of the WWW, Seville, Spain, 16–18 November 2017; pp. 1035–1043. [Google Scholar]
- Ertl, O. BagMinHash-Minwise Hashing Algorithm for Weighted Sets. CoRR 2018. abs/1802.03914. [Google Scholar]
- Li, P.; Li, X.; Samorodnitsky, G.; Zhao, W. Consistent Sampling Through Extremal Process. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 1317–1327. [Google Scholar]
- Moulton, R.; Jiang, Y. Maximally Consistent Sampling and the Jaccard Index of Probability Distributions. arXiv 2018, arXiv:1809.04052. [Google Scholar]
- Qi, Y.; Wang, P.; Zhang, Y.; Zhao, J.; Tian, G.; Guan, X. Fast Generating A Large Number of Gumbel-Max Variables. In Proceedings of the WWW, Taipei, Taiwan, 20–24 April 2020; pp. 796–807. [Google Scholar]
- Panigrahy, R. Entropy based nearest neighbor search in high dimensions. In Proceedings of the SODA, Miami, FL, USA, 22–26 January 2006; pp. 1186–1195. [Google Scholar]
- Lv, Q.; Josephson, W.; Wang, Z.; Charikar, M.; Li, K. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In Proceedings of the VLDB, Vienna, Austria, 23–27 September 2007; pp. 950–961. [Google Scholar]
- Huang, Q.; Feng, J.; Zhang, Y.; Fang, Q.; Ng, W. Query-aware locality-sensitive hashing for approximate nearest neighbor search. PVLDB 2015, 9, 1–12. [Google Scholar] [CrossRef] [Green Version]
- Gan, J.; Feng, J.; Fang, Q.; Ng, W. Locality-sensitive hashing scheme based on dynamic collision counting. In Proceedings of the SIGMOD, Scottsdale, AZ, USA, 20 May 2012; pp. 541–552. [Google Scholar]
- Liu, Y.; Cui, J.; Huang, Z.; Li, H.; Shen, H.T. SK-LSH: An efficient index structure for approximate nearest neighbor search. PVLDB 2014, 7, 745–756. [Google Scholar] [CrossRef] [Green Version]
- Tao, Y.; Yi, K.; Sheng, C.; Kalnis, P. Quality and efficiency in high dimensional nearest neighbor search. In Proceedings of the SIGMOD, Providence, RI, USA, 29 June–2 July 2009; pp. 563–576. [Google Scholar]
- Satuluri, V.; Parthasarathy, S. Bayesian locality sensitive hashing for fast similarity search. PVLDB 2012, 5, 430–441. [Google Scholar] [CrossRef] [Green Version]
- Gao, J.; Visvesvaraya Jagadish, H.; Lu, W.; Chin Ooi, B. DSH: Data Sensitive Hashing for high-dimensional k-NN search. In Proceedings of the SIGMOD, Snowbird, UT, USA, 22–27 June 2014. [Google Scholar]
- Wang, Y.; Shrivastava, A.; Ryu, J. Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search. In Proceedings of the SIGMOD, Houston, TX, USA, 10–15 June 2018. [Google Scholar]
- Ahle, T.D.; Pagh, R.; Razenshteyn, I.; Silvestri, F. On the complexity of inner product similarity join. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, San Francisco, CA, USA, 26 June–1 July 2016; pp. 151–164. [Google Scholar]
- Neyshabur, B.; Srebro, N. On Symmetric and Asymmetric LSHs for Inner Product Search. In Proceedings of the International Conference on Machine Learning, Lille, France, 6 July–11 July 2015; pp. 1926–1934. [Google Scholar]
- Shrivastava, A.; Li, P. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canafa, 8–13 December 2014; pp. 2321–2329. [Google Scholar]
- Ting, D. Streamed Approximate Counting of Distinct Elements: Beating Optimal Batch Methods. In Proceedings of the SIGKDD, New York, NY, USA, 24–27 August 2014; pp. 442–451. [Google Scholar]
- Bachrach, Y.; Finkelstein, Y.; Gilad-Bachrach, R.; Katzir, L.; Koenigstein, N.; Nice, N.; Paquet, U. Speeding up the xbox recommender system using a euclidean transformation for inner-product spaces. In Proceedings of the 8th ACM Conference on Recommender Systems, Silicon Valley, CA, USA, 6 October 2014; pp. 257–264. [Google Scholar]
- Ballard, G.; Kolda, T.G.; Pinar, A.; Seshadhri, C. Diamond sampling for approximate maximum all-pairs dot-product (MAD) search. In Proceedings of the ICDM, Atlantic City, NJ, USA, 14–17 November 2015; pp. 11–20. [Google Scholar]
- Flajolet, P.; Martin, G.N. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 1985, 31, 182–209. [Google Scholar] [CrossRef] [Green Version]
- Xiao, Q.; Zhou, Y.; Chen, S. Better with fewer bits: Improving the performance of cardinality estimation of large data streams. In Proceedings of the INFOCOM, Atlanta, GA, USA, 1–4 May 2017; pp. 1–9. [Google Scholar]
- Cohen, E.; Kaplan, H. Summarizing Data Using Bottom-k Sketches. In Proceedings of the PODC, Portland, OR, USA, 12–15 August 2007; pp. 225–234. [Google Scholar]
- Lumbroso, J. An optimal cardinality estimation algorithm based on order statistics and its full analysis. In Proceedings of the AofA, Vienna, Austria, 28 June–2 July 2010; pp. 489–504. [Google Scholar]
- Chen, W.; Liu, Y.; Guan, Y. Cardinality change-based early detection of large-scale cyber-attacks. In Proceedings of the INFOCOM, Turin, Italy, 14–19 April 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 1788–1796. [Google Scholar]
- Flajolet, P. On Adaptive Sampling. Computing 1990, 43, 391–400. [Google Scholar] [CrossRef]
- Gibbons, P.B. Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. In Proceedings of the PVLDB, Roma, Italy, 11–14 September 2001; pp. 541–550. [Google Scholar]
- Mao, Y.; Gan, D.; Mwakapesa, D.S.; Nanehkaran, Y.A.; Tao, T.; Huang, X. A MapReduce-based K-means clustering algorithm. J. Supercomput. 2022, 78, 5181–5202. [Google Scholar] [CrossRef]
- Corizzo, R.; Pio, G.; Ceci, M.; Malerba, D. DENCAST: Distributed density-based clustering for multi-target regression. J. Big Data 2019, 6, 1–27. [Google Scholar] [CrossRef]
- Corizzo, R.; Dauphin, Y.; Bellinger, C.; Zdravevski, E.; Japkowicz, N. Explainable image analysis for decision support in medical healthcare. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 4667–4674. [Google Scholar]
- Cao, M.; Jia, W.; Lv, Z.; Zheng, L.; Liu, X. Superpixel-Based Feature Tracking for Structure from Motion. Appl. Sci. 2019, 9, 2961. [Google Scholar] [CrossRef] [Green Version]
- Ding, K.; Yang, Z.; Wang, Y.; Liu, Y. An improved perceptual hash algorithm based on u-net for the authentication of high-resolution remote sensing image. Appl. Sci. 2019, 9, 2972. [Google Scholar] [CrossRef] [Green Version]
- Jacquet, P.; Szpankowski, W. Analytical Depoissonization and its Applications. Theor. Comput. Sci. 1998, 201, 1–62. [Google Scholar] [CrossRef] [Green Version]
- Mitzenmacher, M.; Upfal, E. Probability and Computing—Randomized Algorithms and Probabilistic Analysis; Cambridge University Press: Cambridge, UK, 2005. [Google Scholar]
- Bickel, P.J.; Doksum, K.A. Mathematical Statistics: Basic Ideas and Selected Topics, 2nd ed.; Prentice-Hall: Hoboken, NJ, USA, 2001; Volume 1. [Google Scholar]
- Ypma, T.J. Historical Development of the Newton-Raphson Method. SIAM Rev. 1995, 37, 531–551. [Google Scholar] [CrossRef] [Green Version]
- Mislove, A.; Marcon, M.; Gummadi, K.P.; Druschel, P.; Bhattacharjee, B. Measurement and analysis of online social networks. In Proceedings of the SIGCOMM, Kyoto, Japan, 27–31 August 2007; pp. 29–42. [Google Scholar]
- GroupLens Research. MovieLens Data Sets. 2006. Available online: http://www.grouplens.org/node/73 (accessed on 1 March 2022).
- Wikimedia Foundation. Wikimedia Downloads. 2010. Available online: http://dumps.wikimedia.org/ (accessed on 13 June 2021).
- Lewis, D.D.; Yang, Y.; Rose, T.G.; Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. J. Mach. Learn. Res. 2004, 5, 361–397. [Google Scholar]
- Kunegis, J. KONECT: The Koblenz Network Collection. In Proceedings of the 22nd International Conference on World Wide Web, Rio de Janeiro, Brazil, 13–17 May 2013; pp. 1343–1350. [Google Scholar]
- National Institute of Standards and Technology. Text REtrieval Conference (TREC) English Documents. 2010; Volume 4–5. Available online: http://trec.nist.gov/data/docs_eng.html (accessed on 11 June 2021).
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Guo, W.; Ye, K.; Qi, Y.; Jia, P.; Wang, P. Generalized Sketches for Streaming Sets. Appl. Sci. 2022, 12, 7362. https://doi.org/10.3390/app12157362
Guo W, Ye K, Qi Y, Jia P, Wang P. Generalized Sketches for Streaming Sets. Applied Sciences. 2022; 12(15):7362. https://doi.org/10.3390/app12157362
Chicago/Turabian StyleGuo, Wenhua, Kaixuan Ye, Yiyan Qi, Peng Jia, and Pinghui Wang. 2022. "Generalized Sketches for Streaming Sets" Applied Sciences 12, no. 15: 7362. https://doi.org/10.3390/app12157362
APA StyleGuo, W., Ye, K., Qi, Y., Jia, P., & Wang, P. (2022). Generalized Sketches for Streaming Sets. Applied Sciences, 12(15), 7362. https://doi.org/10.3390/app12157362