Causality-Driven Feature Selection for Calibrating Low-Cost Airborne Particulate Sensors Using Machine Learning
Abstract
1. Introduction
Machine Learning and the Need for Causality
2. Materials and Methods
2.1. Proposed Feature Selection Mechanism
- Step 1: For each $X_i$ in $X$, the set of candidate input features, the causation criteria set by CCM for $X_i \to Y$ are evaluated, where $Y$ is the target variable. For the current study, the causal-ccm package [45] was used for this purpose. The implementation details and steps of the CCM algorithm are described in Appendix A. In evaluating the causal relationship from $X_i$ to $Y$, it is essential to select a sufficiently long time series for both variables, so that the convergence criterion is met and the cross-map skill does not deteriorate significantly over time.
- Step 2: For each causality assessment, the causal-ccm package evaluates a p-value, representing the statistical significance of the result. All $X_i$ for which the p-value exceeds the chosen significance level [46], and which are therefore not registered as a sufficiently rigorous causal connection, are eliminated from the set of input features to the ML model.
- Step 3: Next, the remaining features are ranked according to the strength of the causal relationship $X_i \to Y$, from the most causally related to $Y$ to the least.
- Step 4: An appropriate threshold value is established for the strength of causality, and the features exceeding this threshold are selected. Machine learning models are then constructed and trained for all possible subsets of the selected features as input variables. After training, each model is tested on an independent validation dataset to assess how well it performs on data that the algorithm has not previously seen, i.e., its generalizability.
- Step 5: The model that demonstrates the best predictive performance is selected as the final calibration model. Its performance metrics are compared with those of the full model to assess any improvement in generalizability. If no improvement is observed, Step 4 is repeated using a lower threshold.
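The selection logic in Steps 2 to 5 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `select_features`, the `evaluate` callback (which stands in for training a model on a feature subset and returning its validation MSE), the significance level `alpha=0.05`, and the threshold values are all assumptions introduced here. The causal strengths and p-values are assumed to have been computed beforehand with CCM (Step 1).

```python
from itertools import combinations

def select_features(scores, evaluate, full_mse, alpha=0.05, thresholds=(0.8, 0.6, 0.4)):
    """Steps 2-5 of the selection procedure (illustrative sketch).

    scores     -- {feature: (causal_strength, p_value)}, assumed to come from CCM
    evaluate   -- hypothetical callback: tuple of features -> validation MSE
    full_mse   -- validation MSE of the model trained on all features
    alpha      -- assumed significance level (not taken from the source)
    thresholds -- assumed causal-strength cut-offs, tried from high to low
    """
    # Step 2: discard features whose causal link is not statistically significant.
    strength = {f: s for f, (s, p) in scores.items() if p <= alpha}
    # Step 3: rank the survivors from strongest to weakest causal relationship.
    ranked = sorted(strength, key=strength.get, reverse=True)
    for threshold in thresholds:
        # Step 4: keep features above the threshold and score every non-empty subset.
        kept = [f for f in ranked if strength[f] > threshold]
        subsets = [c for r in range(1, len(kept) + 1) for c in combinations(kept, r)]
        if not subsets:
            continue
        results = {subset: evaluate(subset) for subset in subsets}
        best = min(results, key=results.get)
        # Step 5: accept the reduced model only if it beats the full model;
        # otherwise fall through to the next, lower threshold.
        if results[best] < full_mse:
            return best, results[best], threshold
    return None, full_mse, None  # no threshold produced an improvement

# Toy data: feature names, scores, and the MSE function are made up for illustration.
toy_scores = {"a": (0.90, 0.01), "b": (0.70, 0.02), "c": (0.95, 0.20), "d": (0.50, 0.03)}

def toy_mse(subset):
    return 1.0 - 0.4 * ("a" in subset) - 0.3 * ("b" in subset) + 0.05 * len(subset)

best, mse, used = select_features(toy_scores, toy_mse, full_mse=0.5, thresholds=(0.8, 0.6))
print(best, round(mse, 3), used)
```

Because every subset of the retained features is evaluated, the cost grows as $2^n$ in the number of retained features, so the threshold in Step 4 also serves to keep the search tractable.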
2.2. Experimental Test Cases
2.2.1. Experimental Setup and Datasets Used
2.2.2.
2.2.3.
3. Results
3.1. PM1
3.2. PM2.5
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
PM | Particulate matter |
IoT | Internet of Things |
LCS | Low-cost air quality sensor systems |
ML | Machine learning |
CCM | Convergent cross mapping |
OPC | Optical particle counter |
MSE | Mean squared error |
GC | Granger Causality |
Appendix A. The CCM Algorithm
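- Step 1: Define the reconstructed shadow manifold $M_Y$: Given the time series $\{Y(t)\}$, $t = 1, \ldots, L$, an embedding dimension $E$, and a time lag $\tau$, form the lagged-coordinate vectors, following [38]:

  $$\underline{y}(t) = \langle Y(t),\, Y(t-\tau),\, \ldots,\, Y(t-(E-1)\tau) \rangle, \qquad t = 1+(E-1)\tau, \ldots, L. \tag{A1}$$

  Then, the reconstructed shadow manifold is defined by (A2):

  $$M_Y = \{\underline{y}(t)\}. \tag{A2}$$

- Step 2: At $t$, locate $\underline{y}(t)$ in $M_Y$.
- Step 3: Identify nearest neighbors: Find the $E+1$ nearest neighbor vectors from the selected vector $\underline{y}(t)$ ($E+1$ is the minimum number of points needed for an embedding/simplex with $E$ dimensions [38]). Let the time indices of the $E+1$ nearest neighbors of $\underline{y}(t)$ be denoted by $t_1, \ldots, t_{E+1}$.
- Step 4: Define the model that predicts $X$ given $M_Y$: Construct a model that predicts $X$ based on states of $Y$, given by

  $$\hat{X}(t) \mid M_Y = \sum_{i=1}^{E+1} w_i\, X(t_i), \qquad w_i = \frac{u_i}{\sum_{j=1}^{E+1} u_j}, \qquad u_i = \exp\!\left(-\frac{d\big(\underline{y}(t), \underline{y}(t_i)\big)}{d\big(\underline{y}(t), \underline{y}(t_1)\big)}\right),$$

  where $d(\cdot,\cdot)$ is the Euclidean distance. Here, the division by $d\big(\underline{y}(t), \underline{y}(t_1)\big)$ serves to scale distances relative to the nearest neighbor. In this approach, the more distant neighbors are assigned lower weights, with the weights decreasing exponentially as the distance increases.
- Step 5: Assess dynamical coupling between $X$ and $Y$: If $X$ and $Y$ are dynamically coupled, nearby clusters of points in $M_Y$ should correspond to nearby clusters in $M_X$. As $L$ increases, the density of neighbor points in both manifolds should increase, and $\hat{X}(t) \mid M_Y$ should converge to $X$. Therefore, the convergence of nearest neighbors can be examined to assess the correspondence between states on $M_X$ and $M_Y$.
- Step 6: Evaluate correlation for causality testing: Plot the correlation coefficients between $X$ and $\hat{X}(t) \mid M_Y$. If a significant correlation is observed, this indicates that sufficient information from $X$ is embedded in $Y$. In this case, we can conclude that $X$ causally influences $Y$.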
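The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the causal-ccm implementation used in the study: the function names (`shadow_manifold`, `cross_map_skill`, `pearson`), the default values `E=2` and `tau=1`, the zero-distance guard, and the toy coupled logistic maps are all introduced here for illustration. Indices are 0-based, and each point is excluded from its own neighbor search.

```python
import math

def shadow_manifold(series, E, tau):
    """Step 1: lagged vectors <Y(t), Y(t-tau), ..., Y(t-(E-1)tau)> keyed by t."""
    start = (E - 1) * tau
    return {t: [series[t - k * tau] for k in range(E)] for t in range(start, len(series))}

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def cross_map_skill(X, Y, E=2, tau=1, L=None):
    """Steps 2-6: estimate X from the shadow manifold of Y and correlate with X."""
    L = len(Y) if L is None else L
    M_Y = shadow_manifold(Y[:L], E, tau)
    observed, estimated = [], []
    for t, point in M_Y.items():
        # Step 3: the E + 1 nearest neighbours of y(t) on M_Y (excluding y(t) itself).
        neighbours = sorted(
            (math.dist(point, other), s) for s, other in M_Y.items() if s != t
        )[:E + 1]
        # Step 4: exponential weights scaled by the nearest-neighbour distance.
        d1 = max(neighbours[0][0], 1e-12)  # guard against a zero distance
        u = [math.exp(-d / d1) for d, _ in neighbours]
        total = sum(u)
        estimated.append(sum(w / total * X[s] for w, (_, s) in zip(u, neighbours)))
        observed.append(X[t])
    # Step 6: cross-map skill is the correlation between X and its estimate.
    return pearson(observed, estimated)

# Toy coupled logistic maps (parameters chosen for illustration only).
X, Y = [0.4], [0.2]
for _ in range(399):
    x, y = X[-1], Y[-1]
    X.append(x * (3.8 - 3.8 * x - 0.02 * y))
    Y.append(y * (3.5 - 3.5 * y - 0.1 * x))

# Step 5: compare the skill at a short and a long library length L.
for L in (50, 400):
    print(L, round(cross_map_skill(X, Y, L=L), 3))
```

Comparing the printed skill at increasing values of `L` is the convergence check described in Step 5; a real analysis would evaluate many library lengths and test the resulting correlation for significance.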
Appendix B. Granger Causality
Appendix B.1. PM1
Lag Length | p-Value |
---|---|
1 | 0.5231 |
2 | 0.0528 |
3 | 0.0616 |
4 | 0.0687 |
5 | 0.1191 |
6 | 0.1780 |
7 | 0.2919 |
8 | 0.3875 |
9 | 0.4419 |
10 | 0.4333 |
Lag Length | p-Value |
---|---|
1 | 0.5696 |
2 | 0.0943 |
3 | 0.1172 |
4 | 0.1448 |
5 | 0.2356 |
6 | 0.3174 |
7 | 0.4809 |
8 | 0.5799 |
9 | 0.6351 |
10 | 0.6192 |
Lag Length | p-Value |
---|---|
1 | 0.8333 |
2 | 0.5366 |
3 | 0.6190 |
4 | 0.7315 |
5 | 0.5543 |
6 | 0.6514 |
7 | 0.7924 |
8 | 0.7104 |
9 | 0.7401 |
10 | 0.7885 |
Appendix B.2. PM2.5
Lag Length | p-Value |
---|---|
1 | 0.3366 |
2 | 0.4009 |
3 | 0.7111 |
4 | 0.8371 |
5 | 0.9256 |
6 | 0.9664 |
7 | 0.9822 |
8 | 0.9843 |
9 | 0.9883 |
10 | 0.9756 |
Lag Length | p-Value |
---|---|
1 | 0.5113 |
2 | 0.6379 |
3 | 0.8674 |
4 | 0.9466 |
5 | 0.9821 |
6 | 0.9934 |
7 | 0.9970 |
8 | 0.9977 |
9 | 0.9984 |
10 | 0.9958 |
Lag Length | p-Value |
---|---|
1 | 0.4672 |
2 | 0.7967 |
3 | 0.9837 |
4 | 0.9656 |
5 | 0.9607 |
6 | 0.9762 |
7 | 0.9968 |
8 | 0.9990 |
9 | 0.9944 |
10 | 0.9990 |
References
1. Intergovernmental Panel on Climate Change. Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change; Cambridge University Press: Cambridge, UK; New York, NY, USA, 2021.
2. Orru, H.; Ebi, K.; Forsberg, B. The interplay of climate change and air pollution on health. Curr. Environ. Health Rep. 2017, 4, 504–513.
3. Arshad, K.; Hussain, N.; Ashraf, M.H.; Saleem, M.Z. Air pollution and climate change as grand challenges to sustainability. Sci. Total Environ. 2024, 928, 172370.
4. Shaddick, G.; Thomas, M.L.; Mudu, P.; Ruggeri, G.; Gumy, S. Half the world’s population are exposed to increasing air pollution. NPJ Clim. Atmos. Sci. 2020, 3, 23.
5. Li, Y.; Xu, L.; Shan, Z.; Teng, W.; Han, C. Association between air pollution and type 2 diabetes: An updated review of the literature. Ther. Adv. Endocrinol. Metab. 2019, 10, 2042018819897046.
6. Nolte, C. Air quality. In Impacts, Risks, and Adaptation in the United States: Fourth National Climate Assessment, Volume II; U.S. Global Change Research Program: Washington, DC, USA, 2018; Chapter 13; p. 516.
7. Malings, C.; Archer, J.-M.; Barreto, Á.; Bi, J. Low-Cost Air Quality Sensor Systems (LCS) for Policy-Relevant Air Quality Analysis; GAW Report No. 293; World Meteorological Organization: Geneva, Switzerland, 2024.
8. Okafor, N.U.; Alghorani, Y.; Delaney, D.T. Improving data quality of low-cost IoT sensors in environmental monitoring networks using data fusion and machine learning approach. ICT Express 2020, 6, 220–228.
9. DeSouza, P.; Kahn, R.; Stockman, T.; Obermann, W.; Crawford, B.; Wang, A.; Crooks, J.; Li, J.; Kinney, P. Calibrating networks of low-cost air quality sensors. Atmos. Meas. Tech. 2022, 15, 6309–6328.
10. Wijeratne, L.O.; Kiv, D.R.; Aker, A.R.; Talebi, S.; Lary, D.J. Using machine learning for the calibration of airborne particulate sensors. Sensors 2019, 20, 99.
11. Zhang, Y.; Wijeratne, L.O.; Talebi, S.; Lary, D.J. Machine learning for light sensor calibration. Sensors 2021, 21, 6259.
12. Wang, A.; Machida, Y.; de Souza, P.; Mora, S.; Duhl, T.; Hudda, N.; Durant, J.L.; Duarte, F.; Ratti, C. Leveraging machine learning algorithms to advance low-cost air sensor calibration in stationary and mobile settings. Atmos. Environ. 2023, 301, 119692.
13. Kelly, B.; Xiu, D. Financial machine learning. Found. Trends Financ. 2023, 13, 205–363.
14. Mariani, M.M.; Borghi, M. Artificial intelligence in service industries: Customers’ assessment of service production and resilient service operations. Int. J. Prod. Res. 2024, 62, 5400–5416.
15. Rajkomar, A.; Dean, J.; Kohane, I. Machine learning in medicine. N. Engl. J. Med. 2019, 380, 1347–1358.
16. Kang, Z.; Catal, C.; Tekinerdogan, B. Machine learning applications in production lines: A systematic literature review. Comput. Ind. Eng. 2020, 149, 106773.
17. Lary, D.J.; Zewdie, G.K.; Liu, X.; Wu, D.; Levetin, E.; Allee, R.J.; Malakar, N.; Walker, A.; Mussa, H.; Mannino, A.; et al. Machine learning applications for earth observation. In Earth Observation Open Science and Innovation; Springer: Cham, Switzerland, 2018; pp. 165–218.
18. Malakar, N.K.; Lary, D.J.; Moore, A.; Gencaga, D.; Roscoe, B.; Albayrak, A.; Wei, J. Estimation and bias correction of aerosol abundance using data-driven machine learning and remote sensing. In Proceedings of the 2012 Conference on Intelligent Data Understanding, Boulder, CO, USA, 24–26 October 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 24–30.
19. Albayrak, A.; Wei, J.; Petrenko, M.; Lary, D.; Leptoukh, G. MODIS aerosol optical depth bias adjustment using machine learning algorithms. In Proceedings of the AGU Fall Meeting Abstracts, San Francisco, CA, USA, 4–8 December 2011; Volume 2011, p. A53C-0371.
20. Shin, S.; Baek, K.; So, H. Rapid monitoring of indoor air quality for efficient HVAC systems using fully convolutional network deep learning model. Build. Environ. 2023, 234, 110191.
21. Ravindiran, G.; Hayder, G.; Kanagarathinam, K.; Alagumalai, A.; Sonne, C. Air quality prediction by machine learning models: A predictive study on the Indian coastal city of Visakhapatnam. Chemosphere 2023, 338, 139518.
22. Wang, S.; McGibbon, J.; Zhang, Y. Predicting high-resolution air quality using machine learning: Integration of large eddy simulation and urban morphology data. Environ. Pollut. 2024, 344, 123371.
23. SK, A.; Ravindiran, G. Integrating machine learning techniques for Air Quality Index forecasting and insights from pollutant-meteorological dynamics in sustainable urban environments. Earth Sci. Inform. 2024, 17, 3733–3748.
24. Rudner, T.G.J.; Toner, H. Key Concepts in AI Safety: Interpretability in Machine Learning; Center for Security and Emerging Technology (CSET): Washington, DC, USA, 2021.
25. Doshi-Velez, F.; Kim, B. Towards A Rigorous Science of Interpretable Machine Learning. arXiv 2017, arXiv:1702.08608.
26. Lipton, Z.C. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 2018, 16, 31–57.
27. Lundberg, S. A unified approach to interpreting model predictions. arXiv 2017, arXiv:1705.07874.
28. Li, K.; DeCost, B.; Choudhary, K.; Greenwood, M.; Hattrick-Simpers, J. A critical examination of robustness and generalizability of machine learning prediction of materials properties. NPJ Comput. Mater. 2023, 9, 55.
29. Schölkopf, B. Causality for machine learning. In Probabilistic and Causal Inference: The Works of Judea Pearl; Association for Computing Machinery: New York, NY, USA, 2022; pp. 765–804.
30. Cloudera Fast Forward Labs. Causality for Machine Learning: Applied Research Report. 2020. Available online: https://ff13.fastforwardlabs.com/ (accessed on 8 November 2024).
31. Beery, S.; Van Horn, G.; Perona, P. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 456–473.
32. Ye, W.; Zheng, G.; Cao, X.; Ma, Y.; Hu, X.; Zhang, A. Spurious correlations in machine learning: A survey. arXiv 2024, arXiv:2402.12715.
33. Ilyas, A.; Santurkar, S.; Tsipras, D.; Engstrom, L.; Tran, B.; Madry, A. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2019; Volume 32.
34. Haavelmo, T. The probability approach in econometrics. Econometrica 1944, 12, S1–S115.
35. Bühlmann, P. Invariance, causality and robustness. Stat. Sci. 2020, 35, 404–426.
36. Peters, J.; Bühlmann, P.; Meinshausen, N. Causal inference by using invariant prediction: Identification and confidence intervals. J. R. Stat. Soc. Ser. B Stat. Methodol. 2016, 78, 947–1012.
37. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144.
38. Sugihara, G.; May, R.; Ye, H.; Hsieh, C.H.; Deyle, E.; Fogarty, M.; Munch, S. Detecting causality in complex ecosystems. Science 2012, 338, 496–500.
39. Tsonis, A.A.; Deyle, E.R.; May, R.M.; Sugihara, G.; Swanson, K.; Verbeten, J.D.; Wang, G. Dynamical evidence for causality between galactic cosmic rays and interannual variation in global temperature. Proc. Natl. Acad. Sci. USA 2015, 112, 3253–3256.
40. Takens, F. Detecting strange attractors in turbulence. In Dynamical Systems and Turbulence, Warwick 1980: Proceedings of a Symposium Held at the University of Warwick 1979/80; Springer: Berlin/Heidelberg, Germany, 2006; pp. 366–381.
41. Sun, Y.N.; Qin, W.; Hu, J.H.; Xu, H.W.; Sun, P.Z. A causal model-inspired automatic feature-selection method for developing data-driven soft sensors in complex industrial processes. Engineering 2023, 22, 82–93.
42. Chen, Z.; Cai, J.; Gao, B.; Xu, B.; Dai, S.; He, B.; Xie, X. Detecting the causality influence of individual meteorological factors on local PM2.5 concentration in the Jing-Jin-Ji region. Sci. Rep. 2017, 7, 40735.
43. Rybarczyk, Y.; Zalakeviciute, R.; Ortiz-Prado, E. Causal effect of air pollution and meteorology on the COVID-19 pandemic: A convergent cross mapping approach. Heliyon 2024, 10, e25134.
44. Ye, H.; Deyle, E.R.; Gilarranz, L.J.; Sugihara, G. Distinguishing time-delayed causal interactions using convergent cross mapping. Sci. Rep. 2015, 5, 14750.
45. Javier, P.J.E. causal-ccm: A Python Implementation of Convergent Cross Mapping, version 0.3.3; GitHub: San Francisco, CA, USA, 2021.
46. Edwards, A.W. R.A. Fisher, Statistical methods for research workers (1925). In Landmark Writings in Western Mathematics 1640–1940; Elsevier: Amsterdam, The Netherlands, 2005; pp. 856–870.
47. Alphasense. Alphasense User Manual OPC-N3 Optical Particle Counter; Alphasense: Great Notley, UK, 2018.
48. Broich, A.V.; Gerharz, L.E.; Klemm, O. Personal monitoring of exposure to particulate matter with a high temporal resolution. Environ. Sci. Pollut. Res. 2012, 19, 2959–2972.
49. GRIMM Aerosol Technik. GRIMM Software for Optical Particle Counter, Portable Aerosol Spectrometer 1.108/1.109; GRIMM Aerosol Technik: Ainring, Germany, 2009.
50. Sugihara, G.; May, R.M. Nonlinear forecasting as a way of distinguishing chaos from measurement error in time series. Nature 1990, 344, 734–741.
51. Marcílio, W.E.; Eler, D.M. From explanations to feature selection: Assessing SHAP values as feature selection mechanism. In Proceedings of the 2020 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Porto de Galinhas, Brazil, 7–10 November 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 340–347.
52. Kirešová, S.; Guzan, M. Determining the correlation between particulate matter PM10 and meteorological factors. Eng 2022, 3, 343–363.
53. Yang, H.; Peng, Q.; Zhou, J.; Song, G.; Gong, X. The unidirectional causality influence of factors on PM2.5 in Shenyang city of China. Sci. Rep. 2020, 10, 8403.
54. Fu, H.; Zhang, Y.; Liao, C.; Mao, L.; Wang, Z.; Hong, N. Investigating PM2.5 responses to other air pollutants and meteorological factors across multiple temporal scales. Sci. Rep. 2020, 10, 15639.
55. Vaishali; Verma, G.; Das, R.M. Influence of temperature and relative humidity on PM2.5 concentration over Delhi. MAPAN 2023, 38, 759–769.
56. Hernandez, G.; Berry, T.A.; Wallis, S.L.; Poyner, D. Temperature and humidity effects on particulate matter concentrations in a sub-tropical climate during winter. Int. Proc. Chem. Biol. Environ. Eng. 2017, 102, 41–49.
57. Kim, M.; Jeong, S.G.; Park, J.; Kim, S.; Lee, J.H. Investigating the impact of relative humidity and air tightness on PM sedimentation and concentration reduction. Build. Environ. 2023, 241, 110270.
58. Zhang, M.; Chen, S.; Zhang, X.; Guo, S.; Wang, Y.; Zhao, F.; Chen, J.; Qi, P.; Lu, F.; Chen, M.; et al. Characters of particulate matter and their relationship with meteorological factors during winter Nanyang 2021–2022. Atmosphere 2023, 14, 137.
59. Zhang, S.; Xing, J.; Sarwar, G.; Ge, Y.; He, H.; Duan, F.; Zhao, Y.; He, K.; Zhu, L.; Chu, B. Parameterization of heterogeneous reaction of SO2 to sulfate on dust with coexistence of NH3 and NO2 under different humidity conditions. Atmos. Environ. 2019, 208, 133–140.
60. Raysoni, A.U.; Pinakana, S.D.; Mendez, E.; Wladyka, D.; Sepielak, K.; Temby, O. A Review of Literature on the Usage of Low-Cost Sensors to Measure Particulate Matter. Earth 2023, 4, 168–186.
61. Granger, C.W. Investigating causal relations by econometric models and cross-spectral methods. Econom. J. Econom. Soc. 1969, 37, 424–438.
62. Javier, P.J.E. Chapter 6: Convergent Cross Mapping. In Time Series Analysis Handbook; GitHub: San Francisco, CA, USA, 2021.
63. Clarke, H.D.; Granato, J. Time Series Analysis in Political Science. In Encyclopedia of Social Measurement; Kempf-Leonard, K., Ed.; Elsevier: Amsterdam, The Netherlands, 2005; pp. 829–837.
64. Seabold, S.; Perktold, J. Statsmodels: Econometric and statistical modeling with Python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28–30 June 2010.
Feature Selection Approach | Features Used as Predictors | Number of Predictors | MSE | $R^2$ |
---|---|---|---|---|
No feature selection | All 42 outputs from the LCS | 42 | 0.213 | 0.987 |
SHAP value-based | Reject Count Ratio, PM1 from OPCN3, Reject Count Glitch, OPCN3 Interior Temperature, Ambient Temperature, OPCN3 Interior Humidity | 6 | 0.150 | 0.991 |
Causality-based | Bin 0, Reject Count Ratio, Ambient Pressure, Ambient Temperature, Ambient Humidity | 5 | 0.121 | 0.993 |
Feature Selection Approach | Features Used as Predictors | Number of Predictors | MSE | $R^2$ |
---|---|---|---|---|
No feature selection | All 42 outputs from the LCS | 42 | 0.41 | 0.977 |
SHAP value-based | Bin 0, Reject Count Ratio, Reject Count Glitch, Bin 3, PM1 from OPCN3, PM2.5 from OPCN3, OPCN3 Interior Temperature, OPCN3 Interior Humidity, Bin 1 | 9 | 0.286 | 0.984 |
Causality-based | Bin 0, PM1 from OPCN3, PM2.5 from OPCN3, Reject Count Ratio, Ambient Temperature, Ambient Pressure, Ambient Humidity | 7 | 0.274 | 0.985 |
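To put the MSE columns of the two tables above on a common scale, the relative reduction in MSE with respect to the model without feature selection can be computed directly from the tabulated values. The short sketch below does only this arithmetic; the helper name `relative_mse_reduction` and the dictionary labels are introduced here for illustration.

```python
def relative_mse_reduction(full_mse, reduced_mse):
    """Percentage reduction in MSE relative to the model using all features."""
    return 100.0 * (full_mse - reduced_mse) / full_mse

# MSE values copied from the two results tables above.
tables = {
    "first table": {"full": 0.213, "SHAP value-based": 0.150, "Causality-based": 0.121},
    "second table": {"full": 0.41, "SHAP value-based": 0.286, "Causality-based": 0.274},
}

for name, row in tables.items():
    for approach in ("SHAP value-based", "Causality-based"):
        reduction = relative_mse_reduction(row["full"], row[approach])
        print(f"{name}, {approach}: {reduction:.1f}% lower MSE than the full model")
```

For the causality-based selection this works out to a reduction of roughly 43% in the first table and roughly 33% in the second, compared with about 30% for the SHAP value-based selection in both.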
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Sooriyaarachchi, V.; Lary, D.J.; Wijeratne, L.O.H.; Waczak, J. Causality-Driven Feature Selection for Calibrating Low-Cost Airborne Particulate Sensors Using Machine Learning. Sensors 2024, 24, 7304. https://doi.org/10.3390/s24227304