Transactions on Machine Learning and Data Mining (ISSN: 1865-6781)
Volume 10 - Number 2 - October 2017 - Pages 67-77
Performance Analysis of Spark's Machine Learning Library
Seyedfaraz Yasrobi, Jakayla Alston, Babak Yadranjiaghdam, Nassehzadeh Tabrizi
East Carolina University,
Greenville, USA
Abstract
This paper examines the performance of Apache Spark's machine
learning library with respect to the resources, such as the number of
machines and cores, required to run popular machine learning algorithms
most efficiently.
To this end, we measured the training time of classification
algorithms such as logistic regression, support vector machines, decision
trees, random forests, and gradient boosted trees under different configurations
on a sample dataset. Our results show that an excess of resources does
not necessarily decrease the training time of machine learning
algorithms; rather, it may increase the training time by up to 30 percent.
Furthermore, this study confirms that tree ensembles can take longer
to train than typical decision trees.
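The comparison described in the abstract, training time under different resource configurations, can be illustrated with a minimal timing sketch. It is not the authors' harness: the helper names `time_training`, `relative_change`, and `compare_configs` are hypothetical, and the sketch assumes that `train_fn` wraps whatever training call is being measured (in a Spark setting, presumably an estimator's `fit` on an already materialized dataset).

```python
import time
from statistics import mean
from typing import Callable, Dict


def time_training(train_fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the mean wall-clock seconds over `repeats` calls to train_fn.

    train_fn is a hypothetical zero-argument callable that runs one
    complete training job, e.g. lambda: estimator.fit(training_data).
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        train_fn()
        durations.append(time.perf_counter() - start)
    return mean(durations)


def relative_change(baseline: float, candidate: float) -> float:
    """Fractional change of candidate relative to baseline.

    A value of 0.30 means the candidate took 30 percent longer.
    """
    return (candidate - baseline) / baseline


def compare_configs(timings: Dict[str, float], baseline_key: str) -> Dict[str, float]:
    """Express every configuration's training time relative to one baseline."""
    base = timings[baseline_key]
    return {name: relative_change(base, t) for name, t in timings.items()}
```

One way to use it is to call `time_training` once per cluster configuration, collect the means in a dictionary keyed by a configuration label, and pass that dictionary to `compare_configs`. A positive value for a larger configuration then indicates that the extra resources slowed training down instead of speeding it up.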
Keywords: