

Title:
Automatic Selection of MapReduce Machine Learning Algorithms: A Model Building Approach
Author:
Franklin, Bryan M., author.
ISBN:
9780355979749
Physical Description:
1 electronic resource (265 pages)
General Note:
Source: Dissertation Abstracts International, Volume: 79-10(E), Section: B.
Includes supplementary digital materials.
Advisors: Laura E. Brown. Committee members: Timothy Havens; Benjamin Ong; Thomas Oommen.
Abstract:
As the amount of information available for data mining grows, the time needed to train models on those large volumes of data grows with it. Techniques such as sub-sampling and parallel algorithms have been employed to cope with this growth. Some studies have shown that sub-sampling can adversely affect the quality of the models produced, and the degree to which it affects different types of learning algorithms varies. Parallel algorithms perform well when enough computing resources (e.g., cores, memory) are available; however, on a cluster of limited size, growth in data will still cause an unacceptable growth in model training time. Beyond mitigating data size, choosing which algorithms are well suited to a particular dataset can itself be a challenge. While some studies have examined criteria for selecting a learning algorithm based on the properties of the dataset, the additional complexity of parallel learners and possible run time limitations have not been considered. This study explores run time and model quality results of various techniques for dealing with large datasets, including using different numbers of compute cores, sub-sampling the datasets, and exploiting the iterative, anytime nature of the training algorithms. The techniques were studied using MapReduce implementations of four supervised learning algorithms for binary classification with probabilistic models: logistic regression, tree induction, bagged trees, and boosted stumps. Evaluation of these techniques was done using a modified form of learning curves that includes a temporal component. Finally, the data collected were used to train a set of models to predict which type of parallel learner best suits a particular dataset, given run time limitations and the number of compute cores to be used. The predictions of those models were then compared to the actual results of running the algorithms on the datasets they were attempting to predict.
Local Note:
School code: 0129
Available:
| Shelf Number | Item Barcode | Shelf Location |
|---|---|---|
| XX(679943.1) | 679943-1001 | Proquest E-Thesis Collection |


