A Study on Data Profiling: Focusing on Attribute Value Quality Index
Abstract
1. Introduction
2. Materials and Methods
2.1. Structured Data Quality Factors
2.2. Unstructured Data Quality Factors
2.3. Data Quality Diagnosis
2.3.1. Calculation of Data Quality Errors
2.3.2. Attribute Extraction Using Geometric Mean
2.3.3. Data Quality Diagnostic Comparison
2.4. Feature Scaling
2.5. Research Model
2.5.1. Research Model for Data Quality Index Calculation
2.5.2. Data Analysis Methods for Model Development
2.5.3. Data Value Quality Index Calculation Model
3. Results
3.1. Data Collection and Analysis Method
3.2. Performance Evaluation Method
3.3. Results of Data Attribute Derivation Experiment
4. Discussion
Author Contributions
Funding
Conflicts of Interest
References
Target Data Type | | Contents |
---|---|---|
Metadata | | DB form constructed from data that carry various information about the contents |
Text | Direct input method | DB form constructed by direct input of text |
Text | OCR conversion method | DB form constructed by OCR conversion of characters |
Text | Chinese character data | DB form constructed by inputting data written only in Chinese characters, such as old documents and old books |
Image | | DB form constructed through scanning or camera shooting |
Sound | | DB form built by editing recordings or tapes already held |
Video | | DB form built by editing footage or data already held (reel tape, beta tape, video tape) |
3D | | DB form constructed from 3D data produced by image-based modeling and rendering or by 3D scanning, which turns digitally filmed images into 3D data |
GIS | | DB form constructed by scanning an already produced map and inputting its attribute information |
Aerial photograph | | DB form constructed by recording shooting information and spatial information for stored film, photo data, and aerial photographs |
Weather | | DB form constructed by converting past satellite raw data and earth observation satellite binary data into a standard format |
Cartographic satellite pictures | | DB form constructed from numerical orthophotographic image data by adding attribute information to satellite photographs |
Quality Diagnosis Method | Method Explanation |
---|---|
Value diagnostic profiling | ○ Analyzes errors in the data values themselves, such as the validity and accuracy of the data values - Diagnosis centered on the accuracy of data values through column analysis, date analysis, pattern analysis, and code analysis |
Unstructured survey | ○ Diagnoses errors in unstructured data, such as documents, images, or videos, through manual confirmation by a person (actual measurement) - The person views the information directly or checks the document manually, without separate tools |
Data Type | Weight Applying Criteria | Weight |
---|---|---|
All data types | Missing value (NA) > 0 | 0.1 |
Integer or numeric | Near-zero variance (0) | 0.1 |
Integer or numeric | Standard deviation (SD) ≥ 100 | 0.1 |
Integer or numeric | Outlier Bonferroni p < 0.05 | 0.1 |
Factor | Space > 0 | 0.1 |
Date | (Last date − first date) > (current date − first date) | 0.1 |
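
The criteria in the table above can be sketched as simple column checks. This is a minimal illustration, not the authors' implementation: the function name `fired_criteria`, the `kind` labels, the use of the sample standard deviation, and the `zero_tol` threshold for "near-zero variance" are all assumptions. The Bonferroni outlier test is omitted, and the sketch does not say how the individual weights are combined.

```python
from datetime import date
from statistics import stdev

WEIGHT = 0.1  # every criterion in the table carries the same weight


def fired_criteria(values, kind, today=None, zero_tol=1e-12):
    """Return {criterion: weight} for the table criteria a column triggers.

    values: list of column values, with None marking a missing value (NA).
    kind:   'numeric', 'factor', or 'date' (hypothetical labels).
    """
    present = [v for v in values if v is not None]
    fired = {}

    # All data types: at least one missing value.
    if len(present) < len(values):
        fired["missing value (NA) > 0"] = WEIGHT

    if kind == "numeric" and len(present) > 1:
        sd = stdev(present)  # sample standard deviation (assumption)
        if sd <= zero_tol:   # assumed reading of "near-zero variance"
            fired["near-zero variance"] = WEIGHT
        if sd >= 100:
            fired["SD >= 100"] = WEIGHT
    elif kind == "factor":
        # At least one value contains a space character.
        if any(" " in str(v) for v in present):
            fired["space > 0"] = WEIGHT
    elif kind == "date" and present:
        today = today or date.today()
        first, last = min(present), max(present)
        # (last date - first date) > (current date - first date)
        if (last - first) > (today - first):
            fired["date range exceeds current date"] = WEIGHT

    return fired
```

For example, `fired_criteria([1, 1, 1, None], "numeric")` reports both the missing-value and the near-zero-variance criteria.
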
Data Type | Criteria for Applying Attribute Correction Value | Correction Value |
---|---|---|
All data types | Missing values (NA) account for more than 1% of the values | 0.1 |
Integer or numeric | Outlier Bonferroni p ≤ 0.00001 | 0.1 |
Division | Data Quality Error Calculation | Attribute Extraction Using Geometric Means |
---|---|---|
Quality diagnosis method | In principle, data profiling is performed for all attributes; in some cases, target attributes are selected according to the subjective judgment of the person performing the profiling. | Data profiling is performed on the attributes derived from the attribute extraction model. |
Advantages | Because data profiling covers all attributes, the data value characteristics of each attribute can be explored. | Profiling can be limited to attributes that are likely to contain errors, selected according to their attribute weights. |
Disadvantages | Because it is performed for all attributes, it is inefficient and slow when the data volume is large. The diagnosis result may also differ depending on the subjective judgment of the person performing it. | The attribute extraction model can select attributes with a high probability of error according to attribute weight, but it cannot determine the degree of data value quality for each attribute. |
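Data Type | Measurement Item (k) | Attribute Quality Index Applying Criteria (β) | Weight (α) |
---|---|---|---|
Numeric, date, categorical data | Missing value | Missing values = 0 | 0.0 |
Numeric, date, categorical data | Missing value | 0 < Missing values ≤ 5% | 1.2 |
Numeric, date, categorical data | Missing value | 5% < Missing values ≤ 15% | 1.5 |
Numeric, date, categorical data | Missing value | Missing values > 15% | 2.0 |
Numeric, numeric categorical | Outlier | \|Z-score\| ≤ 2 | 0.0 |
Numeric, numeric categorical | Outlier | 2 < \|Z-score\| ≤ 3 | 1.2 |
Numeric, numeric categorical | Outlier | 3 < \|Z-score\| ≤ 4 | 1.5 |
Numeric, numeric categorical | Outlier | \|Z-score\| > 4 | 2.0 |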
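
The banding in the table above can be written as two small lookup functions. This is a sketch under stated assumptions: the names `missing_weight` and `outlier_weight` are hypothetical, the missing-value criterion is read as a ratio of missing values to all values in the attribute, and the outlier criterion is read as a bound on the absolute Z-score.

```python
def missing_weight(missing_ratio):
    """Weight (alpha) for an attribute, given its share of missing values (0.0 to 1.0)."""
    if missing_ratio <= 0:
        return 0.0
    if missing_ratio <= 0.05:
        return 1.2
    if missing_ratio <= 0.15:
        return 1.5
    return 2.0


def outlier_weight(z_score):
    """Weight (alpha) for a single value, given its Z-score."""
    z = abs(z_score)  # the bands apply to the magnitude of the Z-score
    if z <= 2:
        return 0.0
    if z <= 3:
        return 1.2
    if z <= 4:
        return 1.5
    return 2.0
```

With these definitions, an attribute with 3% missing values gets weight 1.2, and a value with a Z-score of -4.5 gets weight 2.0.
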
Attribute | Description | Attribute | Description |
---|---|---|---|
date_time_utc | String | _heatindexm | String |
_conds | String | _hum | Numeric |
_dewptm | Numeric | _precipm | String |
_fog | Numeric | _pressurem | Numeric |
_hail | Numeric | _rain | Numeric |
_snow | Numeric | _wdird | Numeric |
_tempm | Numeric | _wdire | String |
_thunder | Numeric | _wgustm | String |
_tornado | Numeric | _windchillm | String |
_vism | Numeric | _wspdm | Numeric |
Division | Precondition |
---|---|
Target data type | Attributes of numeric, date, and categorical data types (15 target attributes) |
Attribute extraction using geometric mean | All data attributes extracted by the attribute extraction model |
Data value quality index calculation model | All data attributes whose attribute value quality index (AVQI) is greater than 0 |
List of Extracted Data Attributes | Experimental Result Value |
---|---|
_dewptm, _fog, _hail, _hum, _pressurem, _rain, _snow, _tempm, _thunder, _tornado, _vism, _wdird, _wspdm | 0.297 |
List of Extracted Data Attributes |
---|
_dewptm, _fog, _hail, _hum, _pressurem, _rain, _snow, _tempm, _thunder, _tornado, _vism, _wdird, _wspdm |
Division | Attribute Quality Index Applying Criteria (β) | Weight (α) | _dewptm | _fog | _hail | _hum | _pressurem | _rain | _snow | _tempm | _thunder | _tornado | _vism | _wdird | _wspdm | _date | _time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Missing value | Missing values = 0 | 0 | |||||||||||||||
0 < Missing values ≤ 5% | 1.2 | 621 | 757 | 232 | 673 | 4428 | 2358 | ||||||||||
5% < Missing values ≤ 15% | 1.5 | 14,755 | |||||||||||||||
Missing values > 15% | 2 | ||||||||||||||||
Outlier | \|Z-score\| ≤ 2 | 0 | |||||||||||||||
2 < \|Z-score\| ≤ 3 | 1.2 | 826 | 780 | 3080 | 9 | 294 | |||||||||||
3 < \|Z-score\| ≤ 4 | 1.5 | 26 | 7038 | 1 | 1 | 34 | |||||||||||
\|Z-score\| > 4 | 2 | 5 | 13 | 2 | 1 | 2652 | 1 | 4 | 952 | 2 | 1 | 3 | 133 | ||||
AVQI | 0.208 | 0.5 | 1 | 0.201 | 0.203 | 1 | 1 | 0.201 | 1 | 1 | 0.2 | 0.5 | 0.241 | 0 | 0 |
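
The formula behind the AVQI row is not reproduced alongside this table, so the following is only an inference from the numbers, not the authors' stated definition. Several of the reported values are reproduced if each flagged count is weighted by α − 1 and the sum is divided by the total flagged count. The function name `avqi` is hypothetical, and the assignment of counts to attributes is an assumption, because the flattened table no longer shows which column each count belongs to.

```python
def avqi(flagged):
    """Count-weighted average of (alpha - 1) over flagged values.

    flagged: list of (alpha, count) pairs for one attribute, one pair per
             table row with a non-zero weight. This formula is an inferred
             reading of the table, not a definition taken from the article.
    """
    total = sum(count for _, count in flagged)
    if total == 0:
        return 0.0  # no flagged values, e.g. the _date and _time columns
    return sum((alpha - 1.0) * count for alpha, count in flagged) / total


# Assumed column assignments (not explicit in the flattened table):
hail = [(2.0, 13)]                                       # only |Z| > 4 outliers
fog = [(1.5, 7038)]                                      # only 3 < |Z| <= 4 outliers
dewptm = [(1.2, 621), (1.2, 826), (1.5, 26), (2.0, 5)]   # first count in each row

print(round(avqi(hail), 3))    # 1.0
print(round(avqi(fog), 3))     # 0.5
print(round(avqi(dewptm), 3))  # 0.208
```

Under these assumptions the three results match the AVQI values reported for `_hail`, `_fog`, and `_dewptm`.
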
Division | Data Quality Efficiency Measurement Value (%) |
---|---|
Value (accuracy) error rate | |
Attribute extraction using geometric mean | |
Data value quality index calculation model |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Jang, W.-J.; Lee, S.-T.; Kim, J.-B.; Gim, G.-Y. A Study on Data Profiling: Focusing on Attribute Value Quality Index. Appl. Sci. 2019, 9, 5054. https://doi.org/10.3390/app9235054