Transactions on Machine Learning and Data Mining (ISSN: 1865-6781)
Volume 11 - Number 2 - October 2018 - Pages 63-77
Design of a Data Quality Simulation Application for Predictive Classification: Minimal Setup Viewpoint
Markus Vattulainen
University of Tampere, Finland
Abstract
Data quality simulations, that is, the controlled addition of data quality problems to data, are not common in
industrial or scientific research reports on predictive classification. As a consequence, there is uncertainty regarding
the robustness of the classification results achieved and which specific data quality dimension of the data production process should
be improved. The best simulation applications have an extensive set of features but are limited by the setup effort and the expert-level
conceptual understanding required to run simulations. The current paper addresses a design question: what are the
components of a data quality simulation application that requires no or minimal up-front setup effort? As a contribution,
a component listing is presented, and the feasibility of the design is demonstrated by implementing it in the R statistical language.
A demonstration of the system with six business performance measurement system data sets suggests that the controlled addition of eight
common data quality problems (noise, missing values, low variance, outliers, class inconsistency, class imbalance, irrelevant
features and low data volume) can be set up with a single line of R code. This enables measurement of the decrease in classification accuracy
for each added data quality problem, separately and in combination, and supports wider use of data quality simulations.
Keywords: Classification, Data quality, Simulation, System design
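
To make the idea in the abstract concrete, the following is a minimal sketch in base R of adding two of the eight data quality problems (noise and missing values) in a controlled way and measuring the resulting decrease in classification accuracy, separately and in combination. It is not the paper's application: the helper names add_noise, add_missing and nearest_centroid_accuracy are hypothetical, the built-in iris data stands in for the business performance measurement data sets, a simple nearest-centroid classifier with mean imputation is swapped in for whatever classifiers the paper uses, and corrupting only the training split while evaluating on a clean holdout is an assumption made for brevity.

```r
# Illustrative sketch only; all helper names below are hypothetical.
set.seed(1)

# Add Gaussian noise to a random fraction of the rows in each numeric column.
add_noise <- function(df, frac = 0.2, sd_mult = 1) {
  num <- vapply(df, is.numeric, logical(1))
  for (col in names(df)[num]) {
    idx <- sample(nrow(df), size = floor(frac * nrow(df)))
    df[[col]][idx] <- df[[col]][idx] +
      rnorm(length(idx), mean = 0, sd = sd_mult * sd(df[[col]]))
  }
  df
}

# Set a random fraction of the rows in each numeric column to NA.
add_missing <- function(df, frac = 0.2) {
  num <- vapply(df, is.numeric, logical(1))
  for (col in names(df)[num]) {
    idx <- sample(nrow(df), size = floor(frac * nrow(df)))
    df[[col]][idx] <- NA
  }
  df
}

# Nearest-centroid classifier (an assumed stand-in classifier).
# Missing values are mean-imputed using the training column means.
nearest_centroid_accuracy <- function(train, test, class_col) {
  feats <- setdiff(names(train), class_col)
  mu <- colMeans(train[feats], na.rm = TRUE)
  for (f in feats) {
    train[[f]][is.na(train[[f]])] <- mu[[f]]
    test[[f]][is.na(test[[f]])] <- mu[[f]]
  }
  centroids <- aggregate(train[feats],
                         by = list(class = train[[class_col]]), FUN = mean)
  x <- as.matrix(test[feats])
  cmat <- as.matrix(centroids[feats])
  # Squared Euclidean distance from every test row to every class centroid.
  d <- sapply(seq_len(nrow(cmat)),
              function(k) rowSums(sweep(x, 2, cmat[k, ])^2))
  pred <- centroids$class[max.col(-d, ties.method = "first")]
  mean(as.character(pred) == as.character(test[[class_col]]))
}

# Split the stand-in data, then corrupt only the training part.
data(iris)
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

scenarios <- list(
  baseline = train,
  noise = add_noise(train),
  missing = add_missing(train),
  noise_and_missing = add_missing(add_noise(train))
)

accuracy <- vapply(scenarios, nearest_centroid_accuracy, numeric(1),
                   test = test, class_col = "Species")
decrease <- accuracy[["baseline"]] - accuracy
print(round(cbind(accuracy, decrease), 3))
```

Each entry of decrease is the drop in holdout accuracy relative to the uncorrupted baseline, so a near-zero value suggests the classifier is robust to that particular problem at the chosen severity (here 20% of rows per feature, an arbitrary setting).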