Research Phase 3
Data Quality Assessment
Increasing amounts of data and advances in data analytics, especially in machine learning, have the potential to make production technology more efficient. To improve human and autonomous decision-making, knowledge and information have to be extracted from data. The success of such data projects depends heavily on the quality of the data. However, evaluating data is time-consuming and requires a high degree of expert knowledge.
To estimate the success and utility of data projects precisely during use-case selection, we developed a tool for quantifying and evaluating the quality of production data. The tool enables users with little to no data expertise to assess the quality of their datasets rapidly.
The assessment system focuses on the two data modalities most commonly encountered in production: time series data and cross-sectional data. Datasets are evaluated according to 41 criteria divided into four categories: dataset, datapoint, feature, and modeling criteria. The assessment is based on the quantification of the individual features, which are then combined into an overall assessment score with customizable weighting. In addition to the assessment of the dataset, the user receives brief explanations of all criteria, various visualizations of data characteristics, and recommendations for further improving the quality of the dataset.
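The combination of individual quantifications into one weighted score can be illustrated with a minimal sketch. It assumes that each criterion has already been quantified as a score between 0 and 1 and that the overall score is a weighted arithmetic mean. The criterion names, the function name overall_score, and the aggregation rule are hypothetical and chosen for illustration only; the actual tool may quantify and combine its criteria differently.

```python
from typing import Dict, Optional


def overall_score(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-criterion scores into one weighted mean.

    Assumption: every score lies in [0, 1], where 1 is best.
    Criteria without an explicit weight default to a weight of 1.0.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    weights = weights or {}

    weighted_sum = 0.0
    total_weight = 0.0
    for name, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name!r} must lie in [0, 1]")
        weight = weights.get(name, 1.0)
        if weight < 0.0:
            raise ValueError(f"weight for {name!r} must be non-negative")
        weighted_sum += weight * score
        total_weight += weight

    if total_weight == 0.0:
        raise ValueError("at least one weight must be positive")
    return weighted_sum / total_weight


# Hypothetical criterion scores, not taken from the actual tool.
example_scores = {
    "completeness": 0.9,
    "uniqueness": 0.8,
    "feature_variance": 0.5,
}

# Equal weighting versus emphasizing one criterion.
equal = overall_score(example_scores)
emphasized = overall_score(example_scores, {"completeness": 2.0})
print(f"equal weights: {equal:.3f}, completeness doubled: {emphasized:.3f}")
```

A weighted mean keeps the overall score on the same 0 to 1 scale as the individual scores, so changing a weight shifts the emphasis between criteria without changing how the result is read.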
Our data quality assessment system empowers users without knowledge of data analysis and statistics to comprehensively assess the quality of their data and creates transparency throughout the lifecycle of datasets. Because it is applicable to other data analysis and machine learning projects, it could also be useful outside the use cases considered here. It also provides experts with an extensible and customizable tool for identifying profitable machine learning use-cases.