EverAnalyzer: A Self-Adjustable Big Data Management Platform Exploiting the Hadoop Ecosystem
Abstract
1. Introduction
2. Literature Review
2.1. Big Data Collection
2.2. Big Data Storage
2.3. Big Data Processing
2.4. Big Data Analysis
3. Proposed Big Data Management Platform
3.1. Platform Architecture
- Verification System: Users can register and log in to the platform.
- Data Collection System: Users are able to provide the configurations required to collect the desired data.
- Data Analysis System: Users can configure the analytic jobs they want to run and then execute them to acquire their results.
- Visualization System: Users can receive visual results of their analytic jobs/processes.
- Database System: Used only by the platform, not by the users; it stores the data the platform needs for each process and for each user.
- File System: Used only by the platform, not by the users; it distributes all of the users’ collected, pre-processed, processed, and analyzed data across multiple machines.
3.1.1. Verification System
3.1.2. Data Collection System
3.1.3. Data Analysis System
3.1.4. Visualization System
3.1.5. Database System & File System
3.2. Platform Users
4. Case Example
4.1. Working Environment
4.2. Use Case Description
4.3. Platform Evaluation
4.3.1. Functional Evaluation
4.3.2. Performance Evaluation
5. Discussion
5.1. Overall Findings on Big Data Processing
5.2. Overall Findings on Big Data Analysis
5.3. Overall Findings of EverAnalyzer
6. Conclusions
6.1. Future Research Directions
6.2. Research Limitations
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
ID | Objective |
---|---|
#1 | Users can register and log in to the platform. |
#2 | Users can collect their desired batch or streaming data. |
#3 | Users can save their collected data and the analyzed data for future use or access. |
#4 | Users can pre-process their data. |
#5 | The platform recommends the most suitable processing framework to the users, choosing between Spark and MapReduce. |
#6 | The platform recommends the most suitable analysis library to the users, choosing between MLlib and Mahout. |
#7 | The users’ analytics results are displayed graphically, in a user-friendly way. |
EverAnalyzer Layer | Subsystem | Objective |
---|---|---|
User Interaction Layer | Verification System | #1 |
User Interaction Layer | Data Collection System | #2 |
User Interaction Layer | Data Analysis System | #4, #5, #6 |
User Interaction Layer | Visualization System | #7 |
Data Management Layer | Database System | #1, #3 |
Data Management Layer | File System | #3 |
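The mapping in the table above can be checked mechanically. The following is a minimal sketch, not part of EverAnalyzer: the names `SUBSYSTEM_OBJECTIVES`, `uncovered_objectives`, and `subsystems_for` are hypothetical and only encode the table as a dictionary, so that one can confirm that each objective (#1 to #7) is served by at least one subsystem.

```python
# Hypothetical encoding of the layer/subsystem/objective table above.
# These names are illustrative and are not taken from EverAnalyzer itself.
SUBSYSTEM_OBJECTIVES = {
    ("User Interaction Layer", "Verification System"): {1},
    ("User Interaction Layer", "Data Collection System"): {2},
    ("User Interaction Layer", "Data Analysis System"): {4, 5, 6},
    ("User Interaction Layer", "Visualization System"): {7},
    ("Data Management Layer", "Database System"): {1, 3},
    ("Data Management Layer", "File System"): {3},
}

ALL_OBJECTIVES = set(range(1, 8))  # objectives #1 to #7


def uncovered_objectives(mapping):
    """Return the objectives that no subsystem addresses."""
    covered = set().union(*mapping.values())
    return ALL_OBJECTIVES - covered


def subsystems_for(objective, mapping=SUBSYSTEM_OBJECTIVES):
    """Return the sorted names of the subsystems that address one objective."""
    return sorted(name for (_, name), objs in mapping.items() if objective in objs)


if __name__ == "__main__":
    print(uncovered_objectives(SUBSYSTEM_OBJECTIVES))
    print(subsystems_for(3))
```

Keeping the mapping as plain data makes it easy to re-run the coverage check if a subsystem or an objective is added later.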
Anaemia | Cancer | Cholera | Coronavirus | Influenza | Monkeypox |
---|---|---|---|---|---|
Obesity | Pneumonia | Smallpox | Syphilis | Tetanus | Yellow fever |
Zika virus | Trachoma | Diabetes | Diarrhoea | Ebola virus | Epilepsy |
Hepatitis | HIV-AIDS | Depression | Disability | Cardiovascular | Chagas |
Dementia | Dracunculiasis | Echinococcosis | Foodborne | Hypertension | Infertility |
Disease/Condition (Twitter Keyword) | Byte Size (Before Pre-Processing) | Byte Size (After Pre-Processing) | Number of Tweets (Before Pre-Processing) | Number of Tweets (After Pre-Processing) |
---|---|---|---|---|
Anaemia | 689,352 | 86,873 | 500 | 285 |
Cancer | 643,186 | 90,810 | 500 | 294 |
Cholera | 667,851 | 75,305 | 500 | 246 |
Coronavirus | 682,377 | 67,983 | 500 | 225 |
Influenza | 635,255 | 80,703 | 500 | 268 |
Monkeypox | 52,702 | 48,319 | 500 | 173 |
Obesity | 714,737 | 87,469 | 500 | 285 |
Pneumonia | 625,063 | 81,848 | 500 | 264 |
Smallpox | 659,805 | 92,679 | 500 | 299 |
Syphilis | 158,457 | 86,303 | 500 | 268 |
Tetanus | 415,683 | 75,669 | 500 | 246 |
Yellow fever | 82,814 | 60,156 | 500 | 203 |
Zika virus | 673,913 | 92,458 | 500 | 279 |
Trachoma | 294,028 | 63,508 | 500 | 205 |
Diabetes | 659,150 | 49,383 | 500 | 165 |
Diarrhoea | 679,323 | 96,373 | 500 | 314 |
Ebola virus | 653,989 | 75,345 | 500 | 241 |
Epilepsy | 572,757 | 53,907 | 500 | 173 |
Hepatitis | 612,424 | 84,098 | 500 | 275 |
HIV-AIDS | 167,690 | 77,874 | 500 | 247 |
Depression | 721,110 | 70,687 | 500 | 212 |
Disability | 716,667 | 69,673 | 500 | 218 |
Cardiovascular | 700,503 | 87,562 | 500 | 294 |
Chagas | 624,708 | 74,773 | 500 | 251 |
Dementia | 603,855 | 56,667 | 500 | 185 |
Dracunculiasis | 119,143 | 78,416 | 500 | 255 |
Echinococcosis | 617,787 | 67,816 | 500 | 224 |
Foodborne | 163,814 | 84,306 | 500 | 264 |
Hypertension | 669,548 | 68,421 | 500 | 224 |
Infertility | 322,207 | 49,324 | 500 | 155 |
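The effect of pre-processing in the table above can be summarized as the fraction of bytes and tweets that remain afterwards. The sketch below is only an illustration of that arithmetic on four rows copied from the table; the names `SAMPLE_ROWS`, `retained_fraction`, and `summarize` are hypothetical.

```python
# Four rows copied from the table above:
# keyword -> (bytes before, bytes after, tweets before, tweets after)
SAMPLE_ROWS = {
    "Anaemia": (689_352, 86_873, 500, 285),
    "Cancer": (643_186, 90_810, 500, 294),
    "Diarrhoea": (679_323, 96_373, 500, 314),
    "Infertility": (322_207, 49_324, 500, 155),
}


def retained_fraction(before, after):
    """Fraction of the original quantity that survives pre-processing."""
    if before <= 0:
        raise ValueError("'before' must be positive")
    return after / before


def summarize(rows):
    """Map each keyword to (byte fraction kept, tweet fraction kept)."""
    return {
        keyword: (retained_fraction(b0, b1), retained_fraction(t0, t1))
        for keyword, (b0, b1, t0, t1) in rows.items()
    }


if __name__ == "__main__":
    for keyword, (bytes_kept, tweets_kept) in summarize(SAMPLE_ROWS).items():
        print(f"{keyword}: {bytes_kept:.1%} of bytes, {tweets_kept:.1%} of tweets kept")
```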
Disease/Condition | Consecutive Correct Answers | Suggestion Result | EverAnalyzer Recommendation | Expected Recommendation |
---|---|---|---|---|
Anaemia | 1 | Correct | MapReduce | MapReduce |
Cancer | 0 | Wrong | MapReduce | Spark |
Cholera | 1 | Correct | Spark | Spark |
Coronavirus | 2 | Correct | Spark | Spark |
Influenza | 3 | Correct | Spark | Spark |
Monkeypox | 4 | Correct | Spark | Spark |
Obesity | 0 | Wrong | MapReduce | Spark |
Pneumonia | 1 | Correct | Spark | Spark |
Smallpox | 0 | Wrong | MapReduce | Spark |
Syphilis | 1 | Correct | Spark | Spark |
Tetanus | 2 | Correct | Spark | Spark |
Yellow fever | 3 | Correct | Spark | Spark |
Zika virus | 0 | Wrong | MapReduce | Spark |
Trachoma | 1 | Correct | Spark | Spark |
Diabetes | 2 | Correct | Spark | Spark |
Diarrhoea | 0 | Wrong | MapReduce | Spark |
Ebola virus | 1 | Correct | Spark | Spark |
Epilepsy | 2 | Correct | Spark | Spark |
Hepatitis | 3 | Correct | Spark | Spark |
HIV-AIDS | 4 | Correct | Spark | Spark |
Depression | 5 | Correct | Spark | Spark |
Disability | 6 | Correct | Spark | Spark |
Cardiovascular | 0 | Wrong | MapReduce | Spark |
Chagas | 1 | Correct | Spark | Spark |
Dementia | 2 | Correct | Spark | Spark |
Dracunculiasis | 3 | Correct | Spark | Spark |
Echinococcosis | 4 | Correct | Spark | Spark |
Foodborne | 5 | Correct | Spark | Spark |
Hypertension | 6 | Correct | Spark | Spark |
Infertility | 7 | Correct | Spark | Spark |
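The "Consecutive Correct Answers" column in the table above behaves like a running streak that increases on every correct suggestion and resets to zero on a wrong one. The sketch below reproduces that column from the "Suggestion Result" column under this assumption; the function names and the `RESULTS` string (C for Correct, W for Wrong, in table order) are hypothetical helpers, not the platform's code.

```python
# Suggestion results in table order: C = Correct, W = Wrong (30 rows).
RESULTS = "CWCCCCWCWCCCWCCWCCCCCCWCCCCCCC"


def consecutive_correct(results):
    """Running streak of correct suggestions; a wrong one resets it to 0."""
    streak = 0
    streaks = []
    for outcome in results:
        streak = streak + 1 if outcome == "C" else 0
        streaks.append(streak)
    return streaks


def accuracy(results):
    """Share of suggestions that were correct."""
    if not results:
        raise ValueError("no results given")
    return results.count("C") / len(results)


if __name__ == "__main__":
    print(consecutive_correct(RESULTS))
    print(f"{accuracy(RESULTS):.0%} of suggestions correct")
```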
Disease/Condition | Spark Execution Time (Milliseconds) | MapReduce Execution Time (Milliseconds) |
---|---|---|
Anaemia | 11,185 | 1387 |
Cancer | 649 | 1369 |
Cholera | 469 | 1456 |
Coronavirus | 474 | 1439 |
Influenza | 552 | 1414 |
Monkeypox | 531 | 1358 |
Obesity | 489 | 1320 |
Pneumonia | 424 | 1327 |
Smallpox | 431 | 1416 |
Syphilis | 427 | 1438 |
Tetanus | 388 | 1388 |
Yellow fever | 392 | 1401 |
Zika virus | 389 | 1300 |
Trachoma | 354 | 1472 |
Diabetes | 480 | 1426 |
Diarrhoea | 374 | 1343 |
Ebola virus | 496 | 1434 |
Epilepsy | 434 | 1420 |
Hepatitis | 325 | 1418 |
HIV-AIDS | 485 | 1426 |
Depression | 345 | 1510 |
Disability | 279 | 1482 |
Cardiovascular | 291 | 1346 |
Chagas | 351 | 1469 |
Dementia | 365 | 1330 |
Dracunculiasis | 292 | 1430 |
Echinococcosis | 376 | 1344 |
Foodborne | 384 | 1441 |
Hypertension | 382 | 1522 |
Infertility | 352 | 1317 |
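The "Expected Recommendation" values reported earlier are consistent with choosing whichever framework finished faster in the table above (for example, MapReduce for Anaemia and Spark for Cancer). The sketch below assumes that rule; it is an illustration on three rows copied from the table, and `faster_framework` and `speedup` are hypothetical names.

```python
# Three rows copied from the table above:
# keyword -> (Spark time in ms, MapReduce time in ms)
TIMINGS_MS = {
    "Anaemia": (11_185, 1_387),
    "Cancer": (649, 1_369),
    "Cholera": (469, 1_456),
}


def faster_framework(spark_ms, mapreduce_ms):
    """Name the framework with the lower execution time (assumed rule)."""
    if spark_ms == mapreduce_ms:
        return "tie"  # assumption: the source does not say how ties are handled
    return "Spark" if spark_ms < mapreduce_ms else "MapReduce"


def speedup(spark_ms, mapreduce_ms):
    """How many times faster the quicker framework was than the slower one."""
    slow, fast = max(spark_ms, mapreduce_ms), min(spark_ms, mapreduce_ms)
    return slow / fast


if __name__ == "__main__":
    for keyword, (spark_ms, mr_ms) in TIMINGS_MS.items():
        winner = faster_framework(spark_ms, mr_ms)
        print(f"{keyword}: {winner} ({speedup(spark_ms, mr_ms):.1f}x faster)")
```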
MapReduce | Spark |
---|---|
Inefficient for applications that repeatedly reuse the same set of data. | Keeps data in memory and reuses it for faster computation. |
Comparatively faster in batch processing. | Comparatively slower in batch processing of huge datasets, because memory size is limited. |
Data are stored on disk for processing. | Data are stored in main memory. |
Difficult to process and modify data in real time because of its high latency. | Processes and modifies data in real time thanks to its low latency. |
Used to process historical datasets. | Used for streaming/batch processing and ML. |
Uses replication for fault tolerance. | Uses Resilient Distributed Datasets (RDDs) for fault tolerance. |
Merges and partitions shuffle files. | Does not merge and partition shuffle files. |
Primarily disk-based computation. | Primarily RAM-based computation. |
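To make the phase-by-phase model behind the MapReduce column more concrete, the sketch below runs the three classic phases (map, shuffle, reduce) of a word count in a single Python process. It is only an illustration of the programming model: it uses neither the Hadoop nor the Spark API, it keeps everything in memory, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases, as a MapReduce job would."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "Big Data platform"]))
```

In a real MapReduce job the intermediate pairs are written to disk between phases, which is the cost the table refers to; an in-memory engine avoids much of it when the same data are reused.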
Category | Algorithm | Mahout | MLlib |
---|---|---|---|
Dimension Reduction | Principal Component Analysis (PCA) | Yes | Yes |
Dimension Reduction | Singular Value Decomposition (SVD) | Yes | Yes |
Regression | Linear Regression | No | Yes |
Regression | Logistic Regression | No | Yes |
Clustering | Hierarchical Clustering | No | Yes |
Clustering | Distributed-based Clustering | No | Yes |
Clustering | Centroid-based Clustering (K-means) | Yes | Yes |
Classification | Support Vector Machines (SVM) | No | Yes |
Classification | Artificial Neural Networks (ANN) | No | Yes |
Classification | Decision Tree | No | Yes |
Classification | Naive Bayes | Yes | Yes |
Classification | Ensemble Methods (Boosting, Random Forest) | Yes | Yes |
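The support matrix above can be turned into a simple lookup that narrows down which library is even eligible before any recommendation is made. The sketch below is a hypothetical helper built only from the table's Yes/No values; it is not EverAnalyzer's recommendation logic, and the names `SUPPORT` and `libraries_supporting` are illustrative.

```python
# Yes/No values copied from the table above: algorithm -> supporting libraries.
SUPPORT = {
    "PCA": {"Mahout", "MLlib"},
    "SVD": {"Mahout", "MLlib"},
    "Linear Regression": {"MLlib"},
    "Logistic Regression": {"MLlib"},
    "Hierarchical Clustering": {"MLlib"},
    "Distributed-based Clustering": {"MLlib"},
    "K-means": {"Mahout", "MLlib"},
    "SVM": {"MLlib"},
    "ANN": {"MLlib"},
    "Decision Tree": {"MLlib"},
    "Naive Bayes": {"Mahout", "MLlib"},
    "Ensemble Methods": {"Mahout", "MLlib"},
}


def libraries_supporting(*algorithms):
    """Return the sorted libraries that support every requested algorithm."""
    unknown = [name for name in algorithms if name not in SUPPORT]
    if unknown:
        raise KeyError(f"not listed in the table: {unknown}")
    common = {"Mahout", "MLlib"}
    for name in algorithms:
        common &= SUPPORT[name]
    return sorted(common)


if __name__ == "__main__":
    print(libraries_supporting("K-means"))
    print(libraries_supporting("K-means", "Decision Tree"))
```

When only one library supports the requested algorithm the choice is forced; a comparison between the two is only meaningful for the rows where both columns say "Yes".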
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Karamolegkos, P.; Mavrogiorgou, A.; Kiourtis, A.; Kyriazis, D. EverAnalyzer: A Self-Adjustable Big Data Management Platform Exploiting the Hadoop Ecosystem. Information 2023, 14, 93. https://doi.org/10.3390/info14020093