A Generic Pipeline for Machine Learning Users in Energy and Buildings Domain
Abstract
1. Introduction
2. The Essential Steps and Potential Improvements in ML Algorithms Implementation
2.1. Problem Identification and Formulation
2.2. Data Collection, Analysis, and Preprocessing
2.3. ML Algorithm Selection
2.3.1. Applications of ANN Algorithm
Modeling Solar Radiation and Solar Steam Generators
Modeling and Forecasting Energy Loads and Consumption
Simulation and Control for Energy Consumption
Fault Detection and Diagnosis
Energy Assessment
2.3.2. Applications of SVM Algorithm
Modeling and Predicting Energy Loads and Consumption
Making Simulation and Prediction Tools
Classification of Buildings by Energy Consumption
2.3.3. Applications of GPR or GMM Algorithm
Energy Saving Verification and Retrofit Studies
2.3.4. Applications of Clustering Algorithms (K-Means and K-Shape)
Energy Assessment and Forecasting
2.4. Model Training, Validation, and Tuning
2.5. Model Evaluation
2.6. Model Verification
3. Discussion
- (1)
- In 2006, Karatasou, Santamouris, and Geros designed an hourly building load prediction tool based on a feed-forward artificial neural network (FFANN). Comparing the paper's steps with the proposed pipeline shows that the authors did not mention any preprocessing step other than statistical analysis. They stated that the data did not contain any noise, removed the missing values, and normalized the data. Thus, they did not take full advantage of statistical analysis to study the nature of the data, and some erroneous prediction peaks appear because the effect of outliers was ignored during preprocessing. Because the data set is large and the ML task is prediction, the selected algorithm was ANN, and cross-validation was applied to create a robust model. In addition, the ANN was evaluated on two different data sets to ensure robustness, but this is not sufficient: the evaluation would be stronger if performed on the same model structures with different data sets but the same input features, to increase reliability and robustness [21].
- (2)
- Dombaycı et al., in 2010, developed an hourly heating energy prediction model based on ANN to estimate energy use at the design stage. The authors did not mention any preprocessing step except normalization; because the used data were calculated rather than measured, the probability of containing noise, missing values, and outliers is very small (although calculated data do not have the same worth as actual data). ANN was used because the data set is large and the ML task is prediction. The data were split into train and test sets, but this was not sufficient: the trained model could be more robust if cross-validation were used in the training and evaluation steps [13].
- (3)
- Mena et al., in 2014, developed and assessed a short-term predictive ANN model of electricity demand. The authors manually reduced the feature set because the data contained a high number of features. Although they mentioned outliers and noise in the data, they did not apply any analysis to address these two problems. In addition, the missing values were kept as is, and the authors relied on a manual method of splitting the data to skip missing values, which leaves the splitting blocks imbalanced. Thus, the efforts made in the training and evaluation steps to create a robust model were wasted because the preprocessing steps were not performed well, and the model's results show a relatively high mean error [57].
- (4)
- In 2015, Li et al. improved the short-term hourly electricity consumption prediction of a building. The authors noted a large number of features, so they used an automatic feature reduction method (PCA). However, they did not mention anything about missing values or outliers in the data. Because the data set is large and the ML task is prediction, the selected algorithm was ANN. The automatic tuning yields high prediction accuracy, but it should be integrated with cross-validation to ensure the robustness and reliability of the model [16].
- (5)
- In 2017, Yang et al. proposed an energy clustering and prediction method based on the k-shape and SVM algorithms for time series data. The authors mentioned the noise in the data but did not mention how it was handled, and no technique was reported for dealing with outliers and missing values. The data size is rather large for SVM algorithms (the authors did not take the data size into consideration when selecting the algorithm), although they extracted features to decrease complexity and effort during model training. Given the large data size, it is recommended to use parallel SVM to reduce training time, or to replace SVM directly with ANN [36].
- (6)
- Heo and Zavala, in 2012, used a GPR model for energy savings and uncertainty measurement and verification problems. The authors did not use any feature extraction, although they mentioned a high degree of complexity in the data due to noise and nonlinear relationships. Moreover, they did not mention any technique to detect and handle outliers or missing values. The data size is relatively large, and since the authors did not report the training time, it may be excessive. Thus, it would be better to use the Gaussian model only to remove noise and complete the prediction with an ANN, or to use ANN directly for the whole problem [33].
- (7)
- In 2014, Gao and Malkawi proposed a benchmarking technique for building energy based on the k-means concept. The authors used feature selection due to the high number of features. The data contain outliers, but the authors did not mention a technique to handle them. In addition, the imputation technique for missing values was not clearly described, which greatly affects the k-means solution (k-means has a high probability of falling into a local minimum) [37].
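Several of the papers above are faulted for relying on a single train/test split instead of cross-validation. The fold-splitting logic behind the recommended k-fold technique can be sketched in plain Python; `train_and_score` is a hypothetical callback (not from any reviewed paper) that fits a model on the training fold and returns its validation score:

```python
import statistics

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, folds[i]

def cross_validate(data, targets, train_and_score, k=5):
    """Average the validation score over all k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(data), k):
        train = ([data[i] for i in train_idx], [targets[i] for i in train_idx])
        val = ([data[i] for i in val_idx], [targets[i] for i in val_idx])
        scores.append(train_and_score(train, val))
    return statistics.mean(scores)
```

For the hourly time series data common in this domain, contiguous folds such as these are generally preferable to shuffled samples.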
4. Implement the Pipeline on CBECS Data
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhao, H. Artificial Intelligence Models for Large Scale Buildings Energy Consumption Analysis. Ph.D. Thesis, Ecole Centrale Paris, Gif-sur-Yvette, France, 2014.
- Tabrizchi, H.; Javidi, M.M.; Amirzadeh, V. Estimates of residential building energy consumption using a multi-verse optimizer-based support vector machine with k-fold cross-validation. Evol. Syst. 2019.
- Cai, H.; Shen, S.; Lin, Q.; Li, X.; Xiao, H. Predicting the energy consumption of residential buildings for regional electricity supply-side and demand-side management. IEEE Access 2019, 7, 30386–30397.
- Seyedzadeh, S.; Rahimian, F.P.; Oliver, S.; Rodriguez, S.; Glesk, I. Machine learning modelling for predicting non-domestic buildings energy performance: A model to support deep energy retrofit decision-making. Appl. Energy 2020, 279, 115908.
- Somu, N.; Raman, G.R.M.; Ramamritham, K. A deep learning framework for building energy consumption forecast. Renew. Sustain. Energy Rev. 2021, 137, 110591.
- Fayaz, M.; Kim, D. A Prediction Methodology of Energy Consumption Based on Deep Extreme Learning Machine and Comparative Analysis in Residential Buildings. Electronics 2018, 7, 222.
- Liu, Z.; Wu, D.; Liu, Y.; Han, Z.; Lun, L.; Gao, J.; Cao, G. Accuracy analyses and model comparison of machine learning adopted in building energy consumption prediction. Energy Explor. Exploit. 2019, 37, 1426–1451.
- Wang, L.; El-Gohary, N.M. Machine-Learning-Based Model for Supporting Energy Performance Benchmarking for Office Buildings; Springer: Cham, Switzerland, 2018.
- Seyedzadeh, S.; Rahimian, F.P.; Glesk, I.; Roper, M. Machine learning for estimation of building energy consumption and performance: A review. Vis. Eng. 2018, 6, 5.
- Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. 2014. Available online: https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf (accessed on 30 August 2021).
- Chao, W.-L. Machine Learning Tutorial. 2011. Available online: https://www.semanticscholar.org/paper/Machine-Learning-Tutorial-Chao/e74d94c407b599947f9e6262540b402c568674f6 (accessed on 30 August 2021).
- Kirsch, J.H.A.D. IBM Machine Learning for Dummies. 2018. Available online: https://www.ibm.com/downloads/cas/GB8ZMQZ3 (accessed on 30 August 2021).
- Dombaycı, Ö.A. The prediction of heating energy consumption in a model house by using artificial neural networks in Denizli–Turkey. Adv. Eng. Softw. 2010, 41, 141–147.
- Antanasijević, D.; Pocajt, V.; Ristić, M.; Perić-Grujić, A. Modeling of energy consumption and related GHG (greenhouse gas) intensity and emissions in Europe using general regression neural networks. Energy 2015, 84, 816–824.
- Platon, R.; Dehkordi, V.R.; Martel, J. Hourly prediction of a building’s electricity consumption using case-based reasoning, artificial neural networks and principal component analysis. Energy Build. 2015, 92, 10–18.
- Li, K.; Hu, C.; Liu, G.; Xue, W. Building’s electricity consumption prediction using optimized artificial neural networks and principal component analysis. Energy Build. 2015, 108, 106–113.
- Yalcintas, M.; Aytun Ozturk, U. An energy benchmarking model based on artificial neural network method utilizing US Commercial Buildings Energy Consumption Survey (CBECS) database. Int. J. Energy Res. 2007, 31, 412–421.
- Edwards, R.E.; New, J.; Parker, L.E. Predicting future hourly residential electrical consumption: A machine learning case study. Energy Build. 2012, 49, 591–603.
- Kialashaki, A.; Reisel, J.R. Modeling of the energy demand of the residential sector in the United States using regression models and artificial neural networks. Appl. Energy 2013, 108, 271–280.
- Olofsson, T.; Andersson, S. Long-term energy demand predictions based on short-term measured data. Energy Build. 2001, 33, 85–91.
- Karatasou, S.; Santamouris, M.; Geros, V. Modeling and predicting building’s energy use with artificial neural networks: Methods and results. Energy Build. 2006, 38, 949–958.
- Du, Z.; Fan, B.; Jin, X.; Chi, J. Fault detection and diagnosis for buildings and HVAC systems using combined neural networks and subtractive clustering analysis. Build. Environ. 2014, 73, 1–11.
- Huang, H.; Chen, L.; Hu, E. A neural network-based multi-zone modelling approach for predictive control system design in commercial buildings. Energy Build. 2015, 97, 86–97.
- Pérez-Ortiz, J.A.; Gers, F.A.; Eck, D.; Schmidhuber, J. Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets. Neural Netw. 2003, 16, 241–250.
- González, P.A.; Zamarreño, J.M. Prediction of hourly energy consumption in buildings based on a feedback artificial neural network. Energy Build. 2005, 37, 595–601.
- Aydinalp, M.; Ismet Ugursal, V.; Fung, A.S. Modeling of the space and domestic hot-water heating energy-consumption in the residential sector using neural networks. Appl. Energy 2004, 79, 159–178.
- Hou, Z.; Lian, Z. An application of support vector machines in cooling load prediction. In Proceedings of the 2009 International Workshop on Intelligent Systems and Applications, Wuhan, China, 23–24 May 2009.
- Li, Q.; Meng, Q.; Cai, J.; Yoshino, H.; Mochida, A. Applying support vector machine to predict hourly cooling load in the building. Appl. Energy 2009, 86, 2249–2256.
- Li, X.; Lu, J.-H.; Ding, L.; Xu, G.; Li, J. Building Cooling Load Forecasting Model Based on LS-SVM. In Proceedings of the 2009 Asia-Pacific Conference on Information Processing, Shenzhen, China, 18–19 July 2009; pp. 55–58.
- Jain, R.K.; Smith, K.M.; Culligan, P.J.; Taylor, J.E. Forecasting energy consumption of multi-family residential buildings using support vector regression: Investigating the impact of temporal and spatial monitoring granularity on performance accuracy. Appl. Energy 2014, 123, 168–178.
- Zhao, H.-X.; Magoulès, F. A review on the prediction of building energy consumption. Renew. Sustain. Energy Rev. 2012, 16, 3586–3592.
- Zhao, H.X.; Magoulès, F. Parallel Support Vector Machines Applied to the Prediction of Multiple Buildings Energy Consumption. Algorithms Comput. Technol. 2009, 4, 231–249.
- Heo, Y.; Zavala, V.M. Gaussian process modeling for measurement and verification of building energy savings. Energy Build. 2012, 53, 7–18.
- Burkhart, M.C.; Heo, Y.; Zavala, V.M. Measurement and verification of building systems under uncertain data: A Gaussian process modeling approach. Energy Build. 2014, 75, 189–198.
- Heo, Y.; Choudhary, R.; Augenbroe, G.A. Calibration of building energy models for retrofit analysis under uncertainty. Energy Build. 2012, 47, 550–560.
- Yang, J.; Ning, C.; Deb, C.; Zhang, F.; Cheong, D.; Lee, S.E.; Tham, K.W. k-Shape clustering algorithm for building energy usage patterns analysis and forecasting model accuracy improvement. Energy Build. 2017, 146, 27–37.
- Gao, X.; Malkawi, A. A new methodology for building energy performance benchmarking: An approach based on intelligent clustering algorithm. Energy Build. 2014, 84, 607–616.
- Lara, R.A.; Pernigotto, G.; Cappelletti, F.; Romagnoni, P.; Gasparella, A. Energy audit of schools by means of cluster analysis. Energy Build. 2015, 95, 160–171.
- Santamouris, M.; Mihalakakou, G.; Patargias, P.; Gaitani, N.; Sfakianaki, K.; Papaglastra, M.; Zerefos, S. Using intelligent clustering techniques to classify the energy performance of school buildings. Energy Build. 2007, 39, 45–51.
- Gaitani, N.; Lehmann, C.; Santamouris, M.; Mihalakakou, G.; Patargias, P. Using principal component and cluster analysis in the heating evaluation of the school building sector. Appl. Energy 2010, 87, 2079–2086.
- Kalogirou, S.A. Applications of artificial neural networks in energy systems a review. Energy Convers. Manag. 1998, 40, 1073–1087.
- Ascione, F.; Bianco, N.; De Stasio, C.; Mauro, G.M.; Vanoli, G.P. Artificial neural networks to predict energy performance and retrofit scenarios for any member of a building category: A novel approach. Energy 2017, 118, 999–1017.
- Beccali, M.; Ciulla, G.; Brano, V.L.; Galatioto, A.; Bonomolo, M. Artificial neural network decision support tool for assessment of the energy performance and the refurbishment actions for the non-residential building stock in Southern Italy. Energy 2017, 137, 1201–1218.
- Paudel, S.; Elmtiri, M.; Kling, W.L.; Le Corre, O.; Lacarrière, B. Pseudo dynamic transitional modeling of building heating energy demand using artificial neural network. Energy Build. 2014, 70, 81–93.
- Deb, C.; Eang, L.S.; Yang, J.; Santamouris, M. Forecasting diurnal cooling energy load for institutional buildings using Artificial Neural Networks. Energy Build. 2016, 121, 284–297.
- Benedetti, M.; Cesarotti, V.; Introna, V.; Serranti, J. Energy consumption control automation using Artificial Neural Networks and adaptive algorithms: Proposal of a new methodology and case study. Appl. Energy 2016, 165, 60–71.
- Ahn, J.; Cho, S.; Chung, D.H. Analysis of energy and control efficiencies of fuzzy logic and artificial neural network technologies in the heating energy supply system responding to the changes of user demands. Appl. Energy 2017, 190, 222–231.
- Kalogirou, S.; Lalot, S.; Florides, G.; Desmet, B. Development of a neural network-based fault diagnostic system for solar thermal applications. Sol. Energy 2008, 82, 164–172.
- Hong, S.-M.; Paterson, G.; Mumovic, D.; Steadman, P. Improved benchmarking comparability for energy consumption in schools. Build. Res. Inf. 2013, 42, 47–61.
- Buratti, C.; Barbanera, M.; Palladino, D. An original tool for checking energy performance and certification of buildings by means of Artificial Neural Networks. Appl. Energy 2014, 120, 125–132.
- Lai, F.; Magoulès, F.; Lherminier, F. Vapnik’s learning theory applied to energy consumption forecasts in residential buildings. Int. J. Comput. Math. 2008, 85, 1563–1588.
- Li, X.; Bowers, C.P.; Schnier, T. Classification of Energy Consumption in Buildings with Outlier Detection. IEEE Trans. Ind. Electron. 2010, 57, 3639–3644.
- Oladipupo, T. Types of Machine Learning Algorithms. New Adv. Mach. Learn. 2010, 3, 19–48.
- Wong, S.L.; Wan, K.K.W.; Lam, T.N.T. Artificial neural networks for energy analysis of office buildings with daylighting. Appl. Energy 2010, 87, 551–557.
- Smola, A.; Vishwanathan, S.V.N. Introduction to Machine Learning; Cambridge University: Cambridge, UK, 2008.
- Deisenroth, M.P.; Faisal, A.A.; Ong, C.S. Mathematics for Machine Learning; Cambridge University Press: Cambridge, UK, 2020.
- Mena, R.; Rodríguez, F.; Castilla, M.; Arahal, M.R. A prediction model based on neural networks for the energy consumption of a bioclimatic building. Energy Build. 2014, 82, 142–155.
The following table summarizes, for each reviewed paper, the preprocessing questions and actions, the model selection questions and actions, and the model creation steps:

Too Many or Too Few Features? | Extracted or Selected Features | Noisy Data? | Time Series Data? | Kalman or Gaussian Filters | Outliers’ Values | Missing Values | Benchmarking or Prediction? | Time Series Data? | Very Big Data? | Complex System? | Selected Algorithm | Append Cluster Labels or Normalize Data? | Training, Validation, and Tuning | Evaluation | Ref. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Not asked | ----- | × | √ | ----- | ----- | removed | prediction | √ | √ | Not asked | ANN | normalized | Applied cross-validation on two data sets | Used two data sets and tried different sampling steps | [21] |
Not asked | ----- | × | √ | ----- | ----- | ----- | prediction | √ | √ | Not asked | ANN | normalized | Split data into train and test sets | Used test data | [13] |
Too many features | Selected features using correlation between features | √ | √ | ----- | Kept as is | Kept as is | prediction | √ | √ | √ | ANN | normalized | Split data into train, validation, and test sets with different sampling steps | Used test data with different sampling steps | [57] |
Too many features | Selected features using PCA | × | √ | ----- | ----- | ----- | prediction | √ | √ | Not asked | ANN | normalized | Split data into train and test sets with automatic tuning (PSO) | Used test data | [16] |
Too many features | Extracted features using k-shape clustering | √ | √ | ----- | Filtered | Imputed | Benchmarking and prediction | √ | Not asked | Not asked | k-shape and SVM | normalized and appended cluster labels | Split data into train and test sets and applied cross-validation | Used test data with different sampling steps | [36] |
Not asked | ----- | √ | √ | ----- | ----- | ----- | prediction | √ | √ | √ | GPR | normalized | Split data into train and test sets with different sampling steps | Used test data with different sampling steps | [33] |
Too many features | Selected features using p-value | × | √ | ----- | ----- | replaced | Benchmarking | × | √ | Not asked | k-means | normalized | Applied similarity measure on one data package | Compared results with EnergyStar software | [37] |
Model Target | Data Source | Data Size | Model Features | Selected Algorithm? | Best Evaluation Results | Comments and Expected Improvements |
---|---|---|---|---|---|---|
Predict Hourly Energy Consumption [21] | “Two different data sets provided from two different buildings: The first set is the benchmark PROBEN 1, and comes from the first energy prediction contest, the Great Building Energy Predictor Shootout I, organized by ASHRAE (data set A) & The second data set derives from an office building located in Athens, Greece (data set B)” | data set A: a total of 4208 time steps, data set B: a total of 8280 time steps | data set A: “temperature, solar radiation, humidity ratio and wind speed” data set B: “ambient temperature, humidity, daily, weekly and yearly cycles the hour of day, day of week and day of year” | ANN | data set A: RMS is 15.25, MAPE is 1.50, CV is 2.44 and MBE is 0.37 data set B: RMS is 1.13, MAPE is 2.64, CV is 2.95 and MBE is −0.03 | There are some erroneous prediction peaks because the effect of outliers was ignored in the preprocessing step, and the evaluation would be better if carried out on the same model structures with different data sets but with the same input features to increase reliability and robustness. |
Predict Hourly Heating Energy [13] | “A model house designed in Denizli which is located in Central Aegean Region of Turkey” | A total of 35,070 time steps | “Month, day of the month, hour of the day, and energy consumption values at certain hours” | ANN | RMSE is 1.2125, R2 is 0.9880 and MAPE is 0.2081 | The authors did not mention any preprocessing step except normalization because the used data were calculated, which do not have the same worth as actual data; the trained model could be more robust if cross-validation were used in training and different time steps were used during evaluation. |
Predict Hourly Energy Consumption [57] | “CIESOL bioclimatic building, located in the southeast of Spain” | A total of 700,000 time steps | “The type and hour of the day, weather variables (outdoor temperature, outdoor humidity, solar radiation, wind velocity and wind direction) and the state of the actuators from the solar cooling installation” | ANN | Mean error is 11.48% | Although the authors mentioned the outliers and noise in the data, they did not apply any type of analysis to solve these two problems in the data. In addition, the missing values in the data are kept as is and the authors depended on a manual method in splitting data to skip missing values, which means the splitting blocks are imbalanced. Therefore, the efforts made in the training and evaluation steps to create a robust model were useless because the preprocessing steps are not well performed, so the results from the model have a relatively high mean error. |
Predict Hourly electricity consumption [16] | “The Great Building Energy Predictor Shootout I, organized by ASHRAE in 1990s (data set A) Data from a library building located in Hangzhou, East China (data set B)” | data set A: a total of 4208 time steps, data set B: a total of 2472 time steps | Data A: “outdoor dry bulb temperature, solar radiation, humidity ratio and wind speed” Data B: “daily temperature and occupancy” | ANN | data set A: CV is 0.0254 and MAPE is 0.0162 data set B: CV is 0.0758 and MAPE is 0.058 | The authors did not mention anything about the missing values in the data. The automatic tuning gives high prediction results, but it needs to be integrated with a cross-validation technique to ensure the robustness and reliability of the model. |
Benchmark and predict (hourly and weekly) Energy consumption [36] | “10 institutional buildings in Singapore” | a total of 122 days for each building | “Hourly and weekly energy consumption” | k-shape and SVM | Respective MAPE values are 15.36, 9.46, 1.033, 1.23, 2.37, 3.66, 0.57, 54.11, 3.63, 4.46 for the ten buildings | The authors mentioned the noise in the data but did not mention how it was handled. In addition, no technique was mentioned for handling outliers and missing values. The data size is rather large for SVM algorithms (the authors did not take the data size into consideration when selecting the algorithm), although they extracted features to decrease complexity and effort. Thus, it is recommended to use parallel SVM to reduce training time. |
Predict Daily Energy Performance [33] | “Real weather data in the Chicago area” | a total of 8736 time steps | “Weather and occupancy levels, and the most commonly used is outdoor dry-bulb air temperature” | GPR | SSE is from 2.7e5 to 3.6e6 and total energy savings prediction error is from 31 to 41.23 | The data size is relatively large, and the authors did not report the time consumed in training, which may be excessive. Therefore, it would be better to use the Gaussian model only to remove noise and complete the prediction with an ANN, or to use ANN directly for all problems. |
Benchmark annual Energy Performance [37] | “commercial building (CBECS database)” | 5215 samples | “Area, percent heated, percent cooled, wall materials, roof materials, window materials, window percent, shape, number of floors, construction year, weekly operation hours, occupants, variable air volume, heating unit, cooling unit, economizer, refrigerators, number of servers, office equipment, heating and cooling degree day” | k-means | Ratio of actual energy index to cluster centroid ranges from 0.96 to 2.1 for each cluster | The data contain outliers, but the authors did not mention any technique to handle them. In addition, the imputation of missing values is not clearly described. The evaluation step is carried out by comparison with EnergyStar, without describing any approach to overcome the local-minimum solutions of the k-means algorithm. |
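The local-minimum issue noted for k-means in [37] is commonly mitigated by running Lloyd's algorithm from several random initializations and keeping the run with the lowest inertia (total within-cluster squared distance). A minimal one-dimensional sketch for illustration, not the method of [37]:

```python
import random

def kmeans(points, k, n_iter=50):
    """One run of Lloyd's algorithm on 1-D points from a random initialization."""
    centroids = random.sample(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster goes empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    inertia = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, inertia

def kmeans_restarts(points, k, n_init=10):
    """Run k-means n_init times and keep the lowest-inertia result."""
    return min((kmeans(points, k) for _ in range(n_init)), key=lambda r: r[1])
```

On well-separated energy-use indices, the best of ten restarts reliably recovers the true group centroids even when a single run starts badly.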
Selected Features | Values and Ranges Format | Analysis before Changes | Notes and Changes | Analysis after Changes |
---|---|---|---|---|
Square footage (SQFT) | 1001–1,500,000 | Mean = 124,473.50 Median = 20,750.00 Std = 258,613.18 Outliers = 12.31% Missing = 0.0% | No changes | Mean = 124,473.50 Median = 20,750.00 Std = 258,613.18 Outliers = 12.31% Missing = 0.0% |
Number of floors (NFLOOR) | 1–14 994 = 15 to 25 995 = More than 25 | Mean = 30.16 Median = 2.00 Std = 163.61 Outliers = 9.73% Missing = 0.0% | Change (994 = 15 to 25) to (20 = 15 to 25), the mean value of this range, and change (995 = More than 25) to (30 = More than 25) [17] | Mean = 3.01 Median = 2.00 Std = 4.31 Outliers = 9.73% Missing = 0.0% |
Year of construction (YRCON) | 995 = Before 1946 1946–2012 | Mean = 1861.10 Median = 1981.00 Std = 325.77 Outliers = 12.37% Missing = 0.0% | Change (995 = Before 1946) to (1932 = Before 1946) | Mean = 1976.97 Median = 1981.00 Std = 23.34 Outliers = 0.00% Missing = 0.0% |
Total hours open per week (WKHRS) | 0–168 | Mean = 78.02 Median = 60.00 Std = 51.37 Outliers = 0.00% Missing = 0.0% | No changes | Mean = 78.02 Median = 60.00 Std = 51.37 Outliers = 0.00% Missing = 0.0% |
Number of employees (NWKER) | 0–6500 | Mean = 178.78 Median = 15.00 Std = 565.94 Outliers = 15.97% Missing = 0.0% | No changes | Mean = 178.78 Median = 15.00 Std = 565.94 Outliers = 15.97% Missing = 0.0% |
Percent heated (HEATP) | 0–100 Missing = Not applicable | Mean = 88.52 Median = 100.00 Std = 24.24 Outliers = 19.94% Missing = 7.75% | Fill missing values with 0 (not applicable means zero percent) | Mean = 81.49 Median = 100.00 Std = 33.38 Outliers = 15.73% Missing = 0.0% |
Percent cooled (COOLP) | 1–100 Missing = Not applicable | Mean = 79.81 Median = 100.00 Std = 30.13 Outliers = 8.29% Missing = 10.18% | Fill missing values with 0 (not applicable means zero percent) | Mean = 71.68 Median = 95.00 Std = 37.39 Outliers = 0.00% Missing = 0.0% |
Number of computers (PCTERMN) | 0–4195 Missing = Not applicable | Mean = 168.93 Median = 10.00 Std = 530.48 Outliers = 16.21% Missing = 3.56% | Fill missing values with 0 (not applicable means zero) | Mean = 162.92 Median = 9.00 Std = 521.90 Outliers = 16.1% Missing = 0.0% |
Percent lit when open (LTOHRP) | 0–100 Missing = Not applicable | Mean = 82.12 Median = 95.00 Std = 25.01 Outliers = 8.04% Missing = 4.4% | Fill missing values with 0 (not applicable means zero) | Mean = 78.50 Median = 90.00 Std = 29.70 Outliers = 8.66% Missing = 0.0% |
Annual electricity consumption (thous Btu) (ELBTU) | Output Feature | Mean = 9,283,680.98 Median = 822,346.50 Std = 32,174,631.57 Outliers = 14.67% Missing = 2.47% | No changes | Mean = 9,283,680.98 Median = 822,346.50 Std = 32,174,631.57 Outliers = 14.67% Missing = 2.47% |
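The `Outliers = …%` entries above can be computed with a standard quartile-based rule; the 1.5 × IQR (Tukey whisker) criterion below is an assumption for illustration, since the table does not state which outlier definition was used:

```python
import statistics

def outlier_percentage(values):
    """Percentage of values outside the Tukey whiskers (1.5 x IQR rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles of the sample
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return 100.0 * sum(1 for v in values if v < lo or v > hi) / len(values)
```

Running this on each raw feature column before and after the changes makes the effect of each repair (e.g. recoding NFLOOR's 994/995 codes) directly measurable.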
Selected Features | Analysis before Changes | Notes and Changes | Analysis after Changes |
---|---|---|---|
Total hours open per week | Mean = 78.02 Median = 60.00 Std = 51.37 Outliers = 0.00% | No changes | Mean = 78.02 Median = 60.00 Std = 51.37 Outliers = 0.00% |
Building age | Mean = 35.03 Median = 31.00 Std = 23.34 Outliers = 0.00% | No changes | Mean = 35.03 Median = 31.00 Std = 23.34 Outliers = 0.00% |
Building area per employee | Mean = 596,141.43 Median = 1176.48 Std = 2,362,836.96 Outliers = 13.33% | Convert to log scale | Mean = 7.62 Median = 7.07 Std = 2.37 Outliers = 6.92% |
Building area per PC | Mean = 1,530,342.86 Median = 2000.00 Std = 3,595,546.34 Outliers = 18.51% | Convert to log scale | Mean = 8.72 Median = 7.60 Std = 3.35 Outliers = 15.28% |
Number of floors | Mean = 3.01 Median = 2.00 Std = 4.31 Outliers = 9.73% | Convert to log scale | Mean = 0.65 Median = 0.69 Std = 0.82 Outliers = 2.80% |
Percent heated | Mean = 81.49 Median = 100.00 Std = 33.38 Outliers = 15.73% | No changes | Mean = 81.49 Median = 100.00 Std = 33.38 Outliers = 15.73% |
Percent cooled | Mean = 71.68 Median = 95.00 Std = 37.39 Outliers = 0.00% | No changes | Mean = 71.68 Median = 95.00 Std = 37.39 Outliers = 0.00% |
Percent lit when open | Mean = 78.50 Median = 90.00 Std = 29.70 Outliers = 8.66% | No changes | Mean = 78.50 Median = 90.00 Std = 29.70 Outliers = 8.66% |
Electricity use (thous Btu) per area | Mean = 64.29 Median = 40.96 Std = 82.96 Outliers = 7.16% | Convert to log scale | Mean = 3.58 Median = 3.71 Std = 1.24 Outliers = 2.83% |
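The "Convert to log scale" changes compress heavily right-skewed features such as building area per employee, which is why the standard deviation collapses from millions to single digits while the outlier share drops. A small sketch with synthetic data (the values below are illustrative only, not CBECS records):

```python
import math
import statistics

def log_transform(values):
    """Natural-log transform for strictly positive, right-skewed features."""
    return [math.log(v) for v in values]

# Synthetic stand-in for a skewed feature spanning 2e3 to 2e6.
area_per_pc = [2000.0 * (10 ** (i % 4)) for i in range(100)]
before_ratio = statistics.stdev(area_per_pc) / statistics.median(area_per_pc)
after = log_transform(area_per_pc)
after_ratio = statistics.stdev(after) / statistics.median(after)
# The spread relative to the median shrinks sharply after the transform,
# mirroring the before/after columns in the table.
```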
ANN Model | Learning Rate | Dense Layers Number | Nodes Number | Adjusted R² |
---|---|---|---|---|
Model 1 | 4.54 × 10⁻³ | 1 | 143 | 0.63 |
Model 2 | 2.31 × 10⁻³ | 1 | 512 | 0.64 |
Model 3 | 2.18 × 10⁻⁵ | 5 | 434 | 0.45 |
Model 4 | 4.98 × 10⁻³ | 7 | 119 | 0.76 |
Model 5 | 3.28 × 10⁻³ | 21 | 124 | 0.85 |
Model 6 | 9.98 × 10⁻⁵ | 25 | 227 | 0.9 |
Model 7 | 9.41 × 10⁻⁵ | 25 | 263 | 0.91 |
Model 8 | 40.47 × 10⁻⁵ | 30 | 180 | 0.897 |
Model 9 | 29.79 × 10⁻⁵ | 30 | 74 | 0.894 |
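The nine models above differ only in learning rate, depth, and width, which suggests a simple search loop over those hyper-parameters. The sketch below draws candidates from ranges matching the table; the `score_fn` callback (e.g. adjusted R² on a validation split) is a hypothetical placeholder, and this is not the paper's actual tuning procedure:

```python
import math
import random

def sample_config():
    """Draw one candidate configuration: learning rate log-uniform in
    ~1e-5..5e-3, 1-30 dense layers, 64-512 nodes (ranges assumed from
    the table above)."""
    return {
        "learning_rate": 10 ** random.uniform(-5, math.log10(5e-3)),
        "dense_layers": random.randint(1, 30),
        "nodes": random.randint(64, 512),
    }

def random_search(score_fn, n_trials=20):
    """Return the best-scoring configuration over n_trials random draws."""
    best_cfg, best_score = None, -math.inf
    for _ in range(n_trials):
        cfg = sample_config()
        score = score_fn(cfg)  # e.g. adjusted R^2 on held-out data
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Sampling the learning rate on a log scale matters here because the table's best models sit two orders of magnitude below its worst ones.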
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abbass, M.A.B.; Hamdy, M. A Generic Pipeline for Machine Learning Users in Energy and Buildings Domain. Energies 2021, 14, 5410. https://doi.org/10.3390/en14175410