Transactions on Machine Learning and Data Mining (ISSN: 1865-6781)


Volume 10 - Number 2 - October 2017 - Pages 67-77


Performance Analysis of Spark's Machine Learning Library

Seyedfaraz Yasrobi, Jakayla Alston, Babak Yadranjiaghdam, Nassehzadeh Tabrizi

East Carolina University, Greenville, USA


Abstract

This paper examines the performance of Apache Spark's machine learning library, focusing on the resources, such as the number of machines and cores, needed to run popular machine learning algorithms efficiently. To this end, we measured the training time of classification algorithms such as logistic regression, support vector machines, decision trees, random forests, and gradient-boosted trees under different configurations on a sample dataset. Our results show that adding resources does not necessarily decrease training time; an excessive number of resources may even increase it by up to 30 percent. Furthermore, this study confirms that tree ensembles can increase training time compared with that of a single decision tree.
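
The comparison described in the abstract, training time measured under several resource configurations and expressed relative to a baseline, can be sketched as follows. This is only an illustrative outline, not the authors' benchmarking code: the `Config` tuple, `time_training`, and `relative_change` are hypothetical names introduced here, and the `train` callable stands in for whatever call fits a model on a cluster configured with the given resources.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical representation of a configuration: (machines, cores per machine).
Config = Tuple[int, int]


def time_training(train: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs of train()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        train()  # assumed to fit one model under one fixed configuration
        best = min(best, time.perf_counter() - start)
    return best


def relative_change(times: Dict[Config, float], baseline: Config) -> Dict[Config, float]:
    """Percent change in training time versus the baseline configuration.

    Positive values mean training was slower than the baseline,
    negative values mean it was faster.
    """
    base = times[baseline]
    return {cfg: 100.0 * (t - base) / base for cfg, t in times.items()}
```

Taking the minimum over repeated runs is one common way to reduce timing noise; a mean or median would be an equally reasonable choice, and the abstract does not state which the study used.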

Keywords:


