Decision-Tree-Based Horizontal Fragmentation Method for Data Warehouses
Abstract
:1. Introduction
2. State of the Art of Horizontal Fragmentation Methods
3. Materials and Methods
3.1. FTree Method
Algorithm 1. Generate data sets. | |
1 | Data: PUM of the fact table F and dim (a set of queries Q = {q1, q2, …, qn}, |
2 | the frequency fi of every query qi, a set of predicates Pr = {p1, p2, …, pr}, the |
3 | selectivity selj of every predicate pj), W |
4 | Result: data sets D = {data_set2, data_set3, …, data_setw−1} |
5 | c = 2 |
6 | for each stepi ∈ PT | 1 ≤ i ≤ r − 1 do |
7 | getMPM(PUM, MPM); |
8 | choose two fragments with the greatest merging profit; |
9 | fuse the fragments; |
10 | if i > r-(W-c) |
11 | generate data_setc |
12 | c = c + 1; |
13 | end; |
14 | end; |
3.2. FTree Web Application
4. Results and Discussion
4.1. Description of the Second Scenario: Tourist Data Warehouse
4.2. Application of FTree in the Tourist Data Warehouse
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Ozsu, M.T.; Valduriez, P. Principles of Distributed Database Systems, 4th ed; Springer Nature Switzerland AG: Cham, Switzerland, 2020; p. 768. [Google Scholar]
- Daniel, C.; Salamanca, E.; Nordlinger, B. Hospital Databases: AP-HP Clinical Data Warehouse. In Healthcare and Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2020; pp. 57–67. [Google Scholar]
- Melton, J.E.; Go, S.; Zilliac, G.G.; Zhang, B.Z. Greenhouse Gas Emission Estimations for 2016–2020 using the Sherlock Air Traffic Data Warehouse; Report NASA/TM-202220007609; NASA: Washington, DC, USA, 2022.
- Janzen, T.J.; Ristino, L. USDA and Agriculture Data: Improving Productivity while Protecting Privacy; SSRN: Washington, DC, USA, 2018. [Google Scholar]
- Han, J.; Kamber, M.; Pei, J. Data Mining Concepts and Techniques, 3rd ed; Morgan Kaufmann Publishers: Burlington, MA, USA, 2012; p. 703. [Google Scholar]
- Furtado, P. Experimental Evidence on Partitioning in Parallel Data Warehouses. In Proceedings of the 7th ACM International Workshop on Data Warehousing and OLAP, Washington, DC, USA, 12–13 November 2004; pp. 23–30. [Google Scholar]
- Kimball, R.; Ross, M.; Thornthwaite, W.; Mundy, J.; Becker, B. The Data Warehouse Lifecycle Toolkit, 2nd ed.; Wiley Publishing, Inc.: Indianapolis, Indiana, 2008; p. 636. [Google Scholar]
- Noaman, A.Y.; Barker, K. A Horizontal Fragmentation Algorithm for the Fact Relation in a Distributed Data Warehouse. In Proceedings of the Eighth International Conference on Information and Knowledge Management, CIKM ’99, Kansas City, MI, USA, 2–6 November 1999; pp. 154–161. [Google Scholar]
- Ramdane, Y.; Kabachi, N.; Boussaid, O.; Bentayeb, F. SDWP: A New Data Placement Strategy for Distributed Big Data Warehouses in Hadoop. In Big Data Analytics and Knowledge Discovery; Ordonez, C., Song, I.Y., Anderst-Kotsis, G., Tjoa, A.M., Khalil, I., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 189–205. [Google Scholar]
- Curino, C.; Jones, E.; Zhang, Y.; Madden, S. Schism: A workload-driven approach to database replication and partitioning. Proc. VLDB Endow. 2010, 3, 48–57. [Google Scholar] [CrossRef]
- Mahboubi, H.; Darmont, J. Data mining-based fragmentation of XML data warehouses. In Proceedings of the ACM 11th international workshop on Data warehousing and OLAP-DOLAP ’08, Napa Valley, CA, USA, 30 October 2008; ACM Press: New York, NY, USA, 2008; p. 9. Available online: http://portal.acm.org/citation.cfm?doid=1458432.1458435 (accessed on 12 September 2022).
- Barr, M.; Boukhalfa, K.; Bouibede, K. Bi-Objective Optimization Method for Horizontal Fragmentation Problem in Relational Data Warehouses as a Linear Programming Problem. Appl. Artif. Intell. 2018, 32, 907–923. [Google Scholar] [CrossRef]
- Liu, J.Y.C.; Wang, C.W.; Chan, C.Y. An Efficient Partitioning for Object-Relational Data Warehouses. Appl. Mech. Mater. 2013, 284–287, 3320–3324. [Google Scholar] [CrossRef]
- Kechar, M.; Nait-Bahloul, S. Performance optimisation of the decision-support queries by the horizontal fragmentation of the data warehouse. Int. J. Bus. Inf. Syst. 2017, 26, 506. [Google Scholar] [CrossRef]
- Kechar, M.; Nait-Bahloul, S. Bringing Together Physical Design and Fast Querying of Large Data Warehouses: A New Data Partitioning Strategy. In Proceedings of the 4th International Conference on Big Data and Internet of Things, Rabat Morocco, 23–24 October 2019; Association for Computing Machinery: New York, NY, USA, 2019. [Google Scholar]
- Ramdane, Y.; Boussaid, O.; Kabachi, N.; Bentayeb, F. Partitioning and Bucketing Techniques to Speed up Query Processing in Spark-SQL. In Proceedings of the 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS), Singapore, 11–13 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 142–151. Available online: https://ieeexplore.ieee.org/document/8644891/ (accessed on 12 September 2022).
- Parchas, P.; Naamad, Y.; Bouwel, P.V.; Faloutsos, C.; Petropoulos, M. Fast and effective distribution-key recommendation for amazon redshift. Proc. VLDB Endow. 2020, 13, 2411–2423. [Google Scholar] [CrossRef]
- Barkhordari, M.; Niamanesh, M. Chabok: A Map-Reduce based method to solve data warehouse problems. J. Big. Data. 2018, 5, 1–25. [Google Scholar] [CrossRef]
- Bellatreche, L.; Boukhalfa, K.; Richard, P. Data Partitioning in Data Warehouses: Hardness Study, Heuristics and ORACLE Validation. In Data Warehousing and Knowledge Discovery; Song, I.Y., Eder, J., Nguyen, T.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2008; pp. 87–96. Available online: http://link.springer.com/10.1007/978-3-540-85836-2_9 (accessed on 12 September 2022).
- Barr, M.; Bellatreche, L. A New Approach Based on Ants for Solving the Problem of Horizontal Fragmentation in Relational Data Warehouses. In Proceedings of the 2010 International Conference on Machine and Web Intelligence, Algiers, Algeria, 3–5 October 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 411–415. Available online: http://ieeexplore.ieee.org/document/5648104/ (accessed on 12 September 2022).
- Ramdane, Y.; Kabachi, N.; Boussaid, O.; Bentayeb, F. SkipSJoin: A New Physical Design for Distributed Big Data Warehouses in Hadoop. In Conceptual Modeling; Laender, A.H.F., Pernici, B., Lim, E.P., de Oliveira, J.P.M., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 255–263. [Google Scholar]
- Ettaoufik, A.; Ouzzif, M. Web Service for Incremental and Automatic Data Warehouses Fragmentation. Int. J. Adv. Comput. Sci. Appl. 2017, 8, 1–10. [Google Scholar]
- Soussi, N. Big-Parallel-ETL: New ETL for Multidimensional NoSQL Graph Oriented Data. J. Phys. Conf. Ser. 2021, 1743, 012037. [Google Scholar] [CrossRef]
- Munerman, V.; Munerman, D.; Samoilova, T. The Heuristic Algorithm for Symmetric Horizontal Data Distribution. In Proceedings of the 2021 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus). St. Petersburg, Moscow, Russia, 26–29 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 2161–2165. [Google Scholar]
- Ramdane, Y.; Kabachi, N.; Boussaid, O.; Bentayeb, F. A Data Mining Approach to Guide the Physical Design of Distributed Big Data Warehouses. In Advances in Knowledge Discovery and Management: Volume 9; Jaziri, R., Martin, A., Rousset, M.C., Boudjeloud-Assala, L., Guillet, F., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 107–125. [Google Scholar] [CrossRef]
- Ramdane, Y.; Boussaid, O.; Boukraà, D.; Kabachi, N.; Bentayeb, F. Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance. Parallel. Comput. 2022, 111, 102918. [Google Scholar] [CrossRef]
- O’neil, P.; O’neil, B.; Chen, X. The Star Schema Benchmark (SSB); UMass: Boston, MA, USA, 2009. [Google Scholar]
- Rodríguez-Mazahua, N.; Rodríguez-Mazahua, L.; López-Chau, A.; Alor-Hernández, G.; Peláez-Camarena, S.G. Comparative Analysis of Decision Tree Algorithms for Data Warehouse Fragmentation. In New Perspectives on Enterprise Decision-Making Applying Artificial Intelligence Techniques; Zapata-Cortes, J.A., Alor-Hernández, G., Sánchez-Ramírez, C., García-Alcaraz, J.L., Eds.; Springer International Publishing: Cham, Switzerland, 2021; Volume 966, pp. 337–363. [Google Scholar] [CrossRef]
- Son, J.H.; Kim, M.H. An adaptable vertical partitioning method in distributed systems. J. Syst. Softw. 2004, 73, 551–561. [Google Scholar] [CrossRef]
- Rodríguez, L.; Alor-Hernández, G.; Abud-Figueroa, M.A.; Peláez-Camarena, S.G. Horizontal Partitioning of Multimedia Databases Using Hierarchical Agglomerative Clustering. In Proceedings of the Mexican International Conference on Artificial Intelligence, MICAI 2014: Nature-Inspired Computation and Machine Learning, Tuxtla, Mexico, 16–22 November 2014; Springer: Cham, Switzerland, 2014; pp. 296–309. [Google Scholar]
- Anitha, S.; Vanitha, M. Classification of VASA Dataset Using J48, Random Forest, and Naive Bayes. In Ntelligent Data Engineering and Analytics Smart Innovation, Systems, and Technologies; Satapathy, S.C., Ed.; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar] [CrossRef]
- Razdan, S.; Gupta, H.; Seth, A. Performance Analysis of Network Intrusion Systems using J48 and Naive Bayes Algorithm. In Proceedings of the 6th International Conference for Convergence in Technology (I2CT), Maharashtra, India, 2–4 April 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–7. [Google Scholar]
- Tan, P.N.; Steinbach, M.; Karpatne, A.; Kumar, V. Introduction to Data Mining, 2nd ed.; Pearson: New York, NY, USA, 2019; p. 839. [Google Scholar]
- Kimball, R.; Ross, M. The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence, 2nd ed.; John Wiley & Sons, Inc.: Indianapolis, Indiana, 2016; p. 744. [Google Scholar]
Work | Classification | Validation |
---|---|---|
[9] | Data mining (clustering) | TPC-DS benchmark employing Scala language applied in a cluster of similar nodes, a Hadoop-YARN platform, a Spark engine, and Hive. |
[12] | Metaheuristic | APB-1 benchmark. |
[13] | Minterm predicates | TPC-H benchmark. |
[14] | Cost | APB-1 benchmark. |
[15] | Cost | SSB (star schema benchmark). |
[16] | Data mining (association) | TPC-DS benchmark. |
[17] | Graph | Real1 and Real2 join graphs randomly extracted from real-life guests of Redshift with different volumes and frequencies as well as the TPC-DS benchmark. |
[18] | Other (Map–Reduce) | TPC-DS benchmark. |
[19] | Affinity | APB-1 benchmark. |
[20] | Cost | APB-1 benchmark. |
[21] | Data mining (clustering) | TPC-DS benchmark. |
[22] | Cost | APB-1 benchmark. |
[23] | Other (Map–Reduce) | JSON file which contains Neo4j vast data combined with DW metadata. |
[24] | Metaheuristic | Experiments in a workstation to identify dependencies such as the execution time of the method compared to the number of data warehouses and the distribution quality. |
[25] | Data mining (clustering) | Experiments on a computer cluster utilizing the TPC-DS benchmark. |
[26] | Partitioning (hash) | Experiments with the TPC-DS benchmark through a cluster of homogeneous nodes, a Spark engine, a Hadoop-YARN platform, a Hive system, a Ray system, and a Redis database. |
Query | Frequency |
---|---|
q1: SELECT d_month, sum(lo_quantity) from date, lineorder WHERE d_datekey=lo_orderdate AND d_year=1992 GROUP BY d_month; | 3 |
q2: SELECT d_month, sum(lo_quantity) FROM date, lineorder WHERE d_datekey=lo_orderdate AND d_year=1992 AND d_sellingseason=‘Summer’ GROUP BY d_month; | 5 |
q3: SELECT d_month, avg(lo_ordtotalprice) from date, lineorder WHERE d_datekey=lo_orderdate AND d_year=1993 GROUP BY d_month; | 4 |
q4: SELECT d_month, avg(lo_ordtotalprice) FROM date, lineorder WHERE d_datekey=lo_orderdate AND d_year=1993 AND d_sellingseason=‘Christmas’ GROUP BY d_month; | 3 |
q5: SELECT d_year, sum(lo_quantity) FROM date, lineorder WHERE d_datekey=lo_orderdate AND d_month=‘January’ GROUP BY d_year; | 3 |
q6: SELECT d_month, sum(lo_revenue) FROM lineorder, date WHERE d_datekey=lo_orderdate and d_year=1995 GROUP BY d_month; | 2 |
q7: SELECT d_year, sum(lo_quantity) FROM date, lineorder WHERE d_datekey=lo_orderdate AND d_sellingseason=‘Winter’ GROUP BY d_year; | 5 |
q8: SELECT sum(lo_revenue), d_year, p_brand1 FROM lineorder, date, part, supplier WHERE lo_orderdate=d_datekey AND lo_partkey=p_partkey AND lo_suppkey=s_suppkey AND p_category=‘MFGR#12′ AND s_region=‘AMERICA’ GROUP BY d_year, p_brand1 ORDER BY d_year, p_brand1; | 2 |
q9: SELECT sum(lo_revenue), d_year, p_brand1 FROM lineorder, date, part, supplier WHERE lo_orderdate=d_datekey AND lo_partkey=p_partkey AND lo_suppkey=s_suppkey AND p_brand1=‘MFGR#2221′ AND s_region=‘EUROPE’ GROUP BY d_year, p_brand1 ORDER BY d_year, p_brand1; | 2 |
Pr | Description | S |
---|---|---|
p1 | d_year = 1992 | sel1 = 907,987 |
p2 | d_sellingseason = ’Summer’ | sel2 = 2,076,040 |
p3 | d_year = 1993 | sel3 = 908,288 |
p4 | d_sellingseason = ’Christmas’ | sel4 = 912,008 |
p5 | Dumont = ’January’ | sel5 = 541,483 |
p6 | d_year = 1995 | sel6 = 909,991 |
p7 | d_sellingseason = ‘Winter ‘ | sel7 = 1,577,160 |
p8 | p_category = ‘MFGR#12′ | sel8 = 236,816 |
p9 | s_region = ‘AMERICA’’ | sel9 = 1,134,371 |
p10 | p_brand1 = ‘MFGR#2221′, | sel10 = 5848 |
p11 | s_region = ‘EUROPE’ | sel11 = 1,140,311 |
Dimension | Importance |
---|---|
date | 38,076,239 |
supplier | 4,549,634 |
part | 485,328 |
Q/Pr | p1 | p2 | p3 | p4 | p5 | p6 | p7 | Fi |
---|---|---|---|---|---|---|---|---|
q1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
q2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 5 |
q3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 |
q4 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 3 |
q5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 |
q6 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 |
q7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 5 |
selj | 907,987 | 2,076,040 | 908,288 | 912,008 | 541,483 | 913,927 | 1,577,160 |
Step | Predicates | ||||||
---|---|---|---|---|---|---|---|
1 | p1 | p2 | p3 | p4 | p5 | p6 | p7 |
2 | p1, p2 | * | p3 | p4 | p5 | p6 | p7 |
3 | p1, p2 | * | p3, p4 | * | p5 | p6 | p7 |
4 | p1, p2 | * | p3, p4 | * | p5, p6 | * | p7 |
5 | p1, p2 | * | p3, p4 | * | p5, p6, p7 | * | * |
6 | p1, p2, p3, p4 | * | * | * | p5, p6, p7 | * | * |
Scheme | Precision | Recall | F-Measure | ROC Area |
---|---|---|---|---|
data_set2 | 0.286 | 0.429 | 0.343 | 0.208 |
data_set3 | - | 0.429 | - | 0.244 |
Name | Type | Description |
---|---|---|
postgres_log | Table | It stores all the queries found in the log file. |
current_database | Trigger | Before inserting a query into the postgres_log table, it verifies that only the relevant queries were saved. |
copy_log() | Function | It copies the contents of the current day’s transaction log to the postgres_log table. |
queries | Table | It stores the identifier, the description, and the frequency of the queries. |
delete_queries | Trigger | After removing a query on the queries table, it reduces the value of the identifier by 1. |
insert_queries | Trigger | Before inserting a query into the queries table, if the query is already stored in this table, then it is not inserted and its frequency is increased by 1. |
ai_queries | Trigger | After inserting a query into the queries table, it updates its identifier if it is bigger than the number of tuples. |
statistic_collector(text) | Function | It analyzes the queries stored in postgres_log to insert into the table queries only those executed in the table to be fragmented. Its parameter is the fact table. |
attributes | View | It stores the names of the dimensions as well as the attributes of each dimension and the order in which they appear in the dimension table. |
attribute | Table | It saves the data from the attributes view and adds a global order attribute. |
fact_attributes | View | It stores the name of the fact table as well as the names and order of each of its attributes. |
predicates | Table | It saves the identifier, description, and selectivity of the predicates, as well as their dimensions and order in that dimension. |
insert_predicates | Trigger | Before inserting predicates into the table, it analyzes that only those from queries involving the fact table with the dimension table are taken. |
insert_predicates2 | Trigger | Before a predicate is inserted into the table, it validates that only different predicates are recorded. |
ai_predicates | Trigger | After inserting a predicate into the predicates table, it updates its identifier if it is greater than the maximum number of rows. |
delete_predicates | Trigger | After removing a predicate in the predicates table, it reduces the value of the identifier by 1. |
get_predicates() | Function | It obtains the predicates of the queries stored in queries and stores them in the predicates table. |
query_predicate | Table | It saves the relationship between queries and predicates. |
delete_query_predicate | Trigger | After deleting from the query_predicate table, it sets to 1 the sequences for queries and predicates. |
predicate_usage_table | Table | It stores the PUM that Algorithm 1 takes as input. |
data_set | Table | It contains the data_set obtained by Algorithm 1. |
query_pred() | Function | It analyzes which predicates appear in which queries and keeps this relation in query_predicate. |
get_dataset() | Function | It creates the data sets that are used to build the decision tree. |
importance_dimension | View | It stores the importance of the dimensions. |
ds_attributes | View | The attributes of the data set table and their order are in this view. |
primary_keys | View | It stores the name and the attributes of primary keys for all the DW tables. |
foreign_keys | View | The name of the foreign keys for each dimension and the attribute in the fact table involved in this constraint are in this view. |
Name | Type | Attributes | Tuples |
---|---|---|---|
Hotel_activity | Facts | 8 | 35,424 |
Internationals_visitors | Facts | 5 | 576 |
Location | Dimension | 4 | 123 |
Time | Dimension | 4 | 144 |
Tourism | Dimension | 3 | 4 |
Tourist | Dimension | 2 | 2 |
Q/Pr | p1 | p2 | p3 | p4 | p5 | p6 | p7 | Fi |
---|---|---|---|---|---|---|---|---|
q1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 20 |
q2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 15 |
q3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 15 |
q4 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 35 |
q5 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 35 |
q6 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 30 |
q7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 10 |
q8 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 10 |
q9 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 10 |
q10 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 10 |
selj | 2952 | 8856 | 2952 | 2952 | 2952 | 2952 | 2952 |
Pr | Pr Text |
---|---|
p1 | month = ‘feb’ |
p2 | quarter = ‘T3′ |
p3 | year = 2018 |
p4 | year = 2014 |
p5 | year = 2017 |
p6 | year = 2015 |
p7 | year = 2016 |
Scheme | J48 | Naïve Bayes | Multi-Layer Perceptron | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
P | R | F-M | ROC | P | R | F-M | ROC | P | R | F-M | ROC | |
data_set2 | 0.925 | 0.900 | 0.903 | 0.881 | 0.925 | 0.900 | 0.903 | 0.952 | 0.880 | 0.800 | 0.808 | 0.857 |
data_set3 | - | 0.600 | - | 0.439 | - | 0.800 | - | 0.774 | 0.750 | 0.600 | 0.650 | 0.694 |
Query | Frequency |
---|---|
q1: SELECT month, sum(booked_rooms) FROM time, hotel_activity WHERE time.id_time=hotel_activity.id_time AND time.year=2007 GROUP BY month; | 3 |
q2: SELECT month, sum(booked_rooms) FROM time, hotel_activity WHERE time.id_time=hotel_activity.id_time AND time.year=2008 GROUP BY month; | 4 |
q3: SELECT month, sum(booked_rooms) FROM time, hotel_activity WHERE time.id_time=hotel_activity.id_time AND time.quarter=‘T1′ GROUP BY month; | 2 |
q4: SELECT month, sum(tourist_night) FROM time, hotel_activity WHERE time.id_time=hotel_activity.id_time AND time.year=2009 GROUP BY month; | 5 |
q5: SELECT quarter, sum(tourist_night) FROM time, hotel_activity WHERE time.id_time=hotel_activity.id_time AND time.year=2010 GROUP BY quarter; | 2 |
q6: SELECT year, sum(tourist_night) FROM time, hotel_activity WHERE time.id_time=hotel_activity.id_time AND time.quarter=‘T2′ group by year; | 3 |
q7: SELECT month, sum(tourist_arrivals) FROM time, hotel_activity WHERE time.id_time=hotel_activity.id_time AND time.year=2011 group by month; | 4 |
q8: SELECT year, sum(tourist_arrivals) FROM time, hotel_activity WHERE time.id_time=hotel_activity.id_time AND time.month=‘dic’ group by year; | 7 |
q9: SELECT quarter, sum(tourist_arrivals) FROM time, hotel_activity WHERE time.id_time=hotel_activity.id_time AND time.year=2012 group by quarter; | 4 |
q10: SELECT year, avg(stays) FROM time, hotel_activity WHERE time.id_time=hotel_activity.id_time AND time.month=‘jul’ group by year; | 10 |
q11: SELECT quarter, avg(stays) FROM time, hotel_activity WHERE time.id_time=hotel_activity.id_time AND time.year=2013 group by quarter; | 3 |
q12: SELECT month, avg(stays) FROM time, hotel_activity WHERE time.id_time=hotel_activity.id_time AND time.year=2013 AND time.quarter=‘T4′ group by month; | 4 |
Scheme | Naïve Bayes | Multi-Layer Perceptron | ||||||
---|---|---|---|---|---|---|---|---|
Precision | Recall | F-Measure | ROC Area | Precision | Recall | F-Measure | ROC Area | |
data_set2 | 0.800 | 0.667 | 0.625 | 0.708 | 0.688 | 0.667 | 0.657 | 0.694 |
data_set3 | 0.857 | 0.750 | 0.742 | 0.750 | 0.563 | 0.583 | 0.570 | 0.733 |
Query | HFS of data_set2 | HFS of data_set3 | ||||
---|---|---|---|---|---|---|
ITC | RTC | Cost | ITC | RTC | Cost | |
q1 | 38,376 | 0 | 38,376 | 38,376 | 0 | 38,376 |
q2 | 51,168 | 0 | 51,168 | 51,168 | 0 | 51,168 |
q3 | 22,632 | 23,616 | 46,248 | 11,808 | 23,616 | 35,424 |
q4 | 73,800 | 24,600 | 98,400 | 22,140 | 147,600 | 169,740 |
q5 | 25,584 | 0 | 25,584 | 25,584 | 0 | 25,584 |
q6 | 33,948 | 53,136 | 87,084 | 17,712 | 53,136 | 70,848 |
q7 | 51,168 | 0 | 51,168 | 51,168 | 0 | 51,168 |
q8 | 89,544 | 0 | 89,544 | 89,544 | 0 | 89,544 |
q9 | 59,040 | 15,744 | 74,784 | 17,712 | 94,464 | 112,176 |
q10 | 127,920 | 0 | 127,920 | 127,920 | 0 | 127,920 |
q11 | 44,280 | 8856 | 53,136 | 13,284 | 53,136 | 66,420 |
q12 | 43,296 | 173,184 | 216,480 | 1968 | 330,624 | 332,592 |
Q/P | p1 | p2 | p3 | p4 | p5 | p6 | p7 | Fi |
---|---|---|---|---|---|---|---|---|
q1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 20 |
q2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 15 |
q3 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 20 |
q4 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 15 |
q5 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 15 |
q6 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 10 |
q7 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 15 |
selj | 30 | 50 | 80 | 65 | 20 | 35 | 40 |
Query | FTree | Kechar & Nait-Bahloul [14] | ||||
---|---|---|---|---|---|---|
ITC | RTC | Cost | ITC | RTC | Cost | |
q1 | 4200 | 0 | 4200 | 3500 | 0 | 3500 |
q2 | 975 | 0 | 975 | 2475 | 0 | 2475 |
q3 | 0 | 0 | 0 | 2900 | 0 | 2900 |
q4 | 450 | 0 | 450 | 1950 | 0 | 1950 |
q5 | 0 | 0 | 0 | 2175 | 0 | 2175 |
q6 | 1500 | 0 | 1500 | 1500 | 0 | 1500 |
q7 | 450 | 0 | 450 | 1950 | 0 | 1950 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Rodríguez-Mazahua, N.; Rodríguez-Mazahua, L.; López-Chau, A.; Alor-Hernández, G.; Machorro-Cano, I. Decision-Tree-Based Horizontal Fragmentation Method for Data Warehouses. Appl. Sci. 2022, 12, 10942. https://doi.org/10.3390/app122110942
Rodríguez-Mazahua N, Rodríguez-Mazahua L, López-Chau A, Alor-Hernández G, Machorro-Cano I. Decision-Tree-Based Horizontal Fragmentation Method for Data Warehouses. Applied Sciences. 2022; 12(21):10942. https://doi.org/10.3390/app122110942
Chicago/Turabian StyleRodríguez-Mazahua, Nidia, Lisbeth Rodríguez-Mazahua, Asdrúbal López-Chau, Giner Alor-Hernández, and Isaac Machorro-Cano. 2022. "Decision-Tree-Based Horizontal Fragmentation Method for Data Warehouses" Applied Sciences 12, no. 21: 10942. https://doi.org/10.3390/app122110942
APA StyleRodríguez-Mazahua, N., Rodríguez-Mazahua, L., López-Chau, A., Alor-Hernández, G., & Machorro-Cano, I. (2022). Decision-Tree-Based Horizontal Fragmentation Method for Data Warehouses. Applied Sciences, 12(21), 10942. https://doi.org/10.3390/app122110942