Transactions on Machine Learning and Data Mining (ISSN: 1865-6781)


Volume 11 - Number 2 - October 2018 - Pages 63-77


Design of a Data Quality Simulation Application for Predictive Classification: Minimal Setup Viewpoint

Markus Vattulainen

University of Tampere, Finland


Abstract

Data quality simulations, in which data quality problems are added to data in a controlled way, are uncommon in industrial and scientific research reports on predictive classification. As a consequence, it is uncertain how robust the reported classification results are and which specific data quality dimension of the data production process should be improved. The best simulation applications have an extensive set of features but are limited by the setup effort and the expert-level conceptual understanding required to run simulations. This paper addresses a design question: what are the components of a data quality simulation application that requires minimal or no up-front setup effort? As a contribution, a component listing is presented, and the feasibility of the design is demonstrated by implementing it in the R statistical language. A demonstration of the system with six business performance measurement system data sets suggests that the controlled addition of eight common data quality problems (noise, missing values, low variance, outliers, class inconsistency, class imbalance, irrelevant features and low data volume) can be set up with a single line of R code. This enables measuring the decrease in classification accuracy for each added data quality problem, separately and in combination, and thereby supports wider use of data quality simulations.

Keywords: Classification, Data quality, Simulation, System design
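
The general idea summarized in the abstract, adding data quality problems in a controlled way and measuring the resulting drop in classification accuracy, can be illustrated with a minimal sketch. This is not the paper's R implementation. It is written in Python with the standard library only, and every name in it (`make_data`, `add_noise`, `add_missing`, `flip_labels`, `impute_mean`, `accuracy`, `simulate`) is hypothetical. The synthetic data, the nearest-centroid classifier, the corruption rates, and the choice to corrupt only the training split are all assumptions made for illustration, and only three of the eight problems are covered.

```python
import random
from statistics import mean

def make_data(n=200, seed=0):
    """Two Gaussian classes in two dimensions (synthetic, illustration only)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2                      # alternate labels so classes are balanced
        center = 2.0 * label
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows

def add_noise(rows, sd, rng):
    """Noise: add zero-mean Gaussian noise to every feature value."""
    return [([x + rng.gauss(0.0, sd) for x in feats], y) for feats, y in rows]

def add_missing(rows, rate, rng):
    """Missing values: replace each feature value by None with probability rate."""
    return [([None if rng.random() < rate else x for x in feats], y)
            for feats, y in rows]

def flip_labels(rows, rate, rng):
    """Class inconsistency: flip the binary label of a fixed fraction of rows."""
    k = int(round(rate * len(rows)))
    idx = set(rng.sample(range(len(rows)), k))
    return [(feats, 1 - y if i in idx else y)
            for i, (feats, y) in enumerate(rows)]

def impute_mean(rows):
    """Fill None values with the column mean of the observed values."""
    n_feat = len(rows[0][0])
    col_means = []
    for j in range(n_feat):
        observed = [f[j] for f, _ in rows if f[j] is not None]
        col_means.append(mean(observed) if observed else 0.0)
    return [([col_means[j] if f[j] is None else f[j] for j in range(n_feat)], y)
            for f, y in rows]

def accuracy(train, test):
    """Accuracy of a nearest-centroid classifier fitted on train, scored on test."""
    centroids = {}
    for label in {y for _, y in train}:
        members = [f for f, y in train if y == label]
        centroids[label] = [mean(col) for col in zip(*members)]

    def predict(f):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centroids[c])))

    return mean([1.0 if predict(f) == y else 0.0 for f, y in test])

def simulate(seed=0):
    """Return baseline accuracy and the accuracy drop per injected problem."""
    rng = random.Random(seed)
    data = make_data(seed=seed)
    train, test = data[:140], data[140:]   # assumption: only train is corrupted
    problems = {
        "noise": lambda r: add_noise(r, 2.0, rng),
        "missing values": lambda r: impute_mean(add_missing(r, 0.3, rng)),
        "class inconsistency": lambda r: flip_labels(r, 0.2, rng),
    }
    baseline = accuracy(train, test)
    drops = {name: baseline - accuracy(inject(train), test)
             for name, inject in problems.items()}
    combined = train
    for inject in problems.values():       # apply all problems in sequence
        combined = inject(combined)
    drops["combined"] = baseline - accuracy(combined, test)
    return baseline, drops

if __name__ == "__main__":
    baseline, drops = simulate()
    print(f"baseline accuracy: {baseline:.3f}")
    for name, drop in drops.items():
        print(f"{name}: accuracy change {-drop:+.3f}")
```

Keeping each problem as a separate function that maps a data set to a corrupted copy makes it straightforward to measure problems one at a time or chained together, which mirrors the "separately and in combination" measurement described in the abstract.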
