Transactions on Machine Learning and Data Mining (ISSN: 1865-6781)


Volume 9 - Number 2 - October 2016 - Pages 49-61


Preprocessing Optimization for Predictive Classification: Baseline Results from Six Industry Cases

Markus Vattulainen

School of Information Sciences, University of Tampere, Finland


Abstract

Data preprocessing is often the most time-consuming phase of data analysis, and automating it requires a computationally costly search over preprocessing combinations. Efforts to build and evaluate efficient preprocessing automation systems have been hampered by the lack of baseline results from industry on the extent to which the infeasible exhaustive search can be sped up. The research question addressed is: how good are heuristic search methods compared to exhaustive search given a 10% time constraint? Baseline results from five of six real business performance measurement system cases show that a simple hill-climbing heuristic with one or three restarts reached a median of 98% of the classification accuracy of the global optimum found by exhaustive search. The outcome is attributed to the characteristics of the search space, which in all of the cases included several points near the optimum. For the worst case, heuristic hyperparameter optimization with hybridization increased the comparative ratio from 82% to 89%. The results suggest that faster heuristic methods can find near-optimal preprocessing combinations and thus support efficient automation of preprocessing for predictive classification.
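
The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not give the search space, the classifier, or the heuristic settings, so everything here is an assumption for illustration: the phase and option names in SPACE are hypothetical, the score is a synthetic stand-in for classification accuracy, and the 10% constraint is approximated as a cap on the number of evaluated combinations rather than on wall-clock time.

```python
import itertools
import random

# Hypothetical search space: phase and option names are illustrative only.
SPACE = {
    "imputation": ["none", "mean", "knn"],
    "scaling": ["none", "zscore", "minmax"],
    "outliers": ["none", "clip", "remove"],
    "selection": ["none", "variance", "correlation"],
    "balancing": ["none", "oversample", "undersample"],
}
PHASES = list(SPACE)


def make_toy_score(seed=0):
    """Build a synthetic stand-in for classification accuracy (not real data)."""
    rng = random.Random(seed)
    weights = {(p, o): rng.uniform(0.0, 0.04) for p in PHASES for o in SPACE[p]}
    table = {}
    for combo in itertools.product(*SPACE.values()):
        base = 0.70 + sum(weights[(p, o)] for p, o in zip(PHASES, combo))
        table[combo] = base + rng.uniform(0.0, 0.02)  # small interaction noise
    return table.__getitem__


def neighbors(combo):
    """Yield all combinations that differ from combo in exactly one phase."""
    for i, phase in enumerate(PHASES):
        for option in SPACE[phase]:
            if option != combo[i]:
                yield combo[:i] + (option,) + combo[i + 1:]


def hill_climb(score, budget, restarts=3, seed=0):
    """First-improvement hill climbing with random restarts under an evaluation budget."""
    rng = random.Random(seed)
    cache = {}

    def evaluate(combo):
        # Returns None once the budget of distinct evaluations is used up.
        if combo not in cache:
            if len(cache) >= budget:
                return None
            cache[combo] = score(combo)
        return cache[combo]

    for _ in range(restarts + 1):  # one initial run plus the restarts
        current = tuple(rng.choice(SPACE[p]) for p in PHASES)
        current_score = evaluate(current)
        if current_score is None:
            break
        improved = True
        while improved:
            improved = False
            for candidate in neighbors(current):
                candidate_score = evaluate(candidate)
                if candidate_score is None:
                    break  # budget exhausted
                if candidate_score > current_score:
                    current, current_score = candidate, candidate_score
                    improved = True
                    break

    best = max(cache, key=cache.get)
    return best, cache[best], len(cache)


score = make_toy_score(seed=0)
all_combos = list(itertools.product(*SPACE.values()))

# Exhaustive search evaluates every combination and yields the global optimum.
global_best = max(score(c) for c in all_combos)

# The heuristic is limited to 10% of the exhaustive evaluation count.
budget = len(all_combos) // 10
best_combo, best_score, n_evals = hill_climb(score, budget, restarts=3, seed=0)

# Comparative ratio: heuristic result relative to the global optimum.
ratio = best_score / global_best
print(f"evaluated {n_evals} of {len(all_combos)} combinations, ratio = {ratio:.3f}")
```

The ratio printed here depends entirely on the synthetic score and says nothing about the six industry cases. With a real pipeline, score would be replaced by a cross-validated accuracy estimate for each preprocessing combination.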


Keywords: Preprocessing, Classification, Optimization, Metaheuristics, Business Performance Measurement System


