

Title:
Automatic Selection of MapReduce Machine Learning Algorithms: A Model Building Approach
Author:
Franklin, Bryan M., author.
ISBN:
9780355979749
Physical Description:
1 electronic resource (265 pages)
General Note:
Source: Dissertation Abstracts International, Volume: 79-10(E), Section: B.
Includes supplementary digital materials.
Advisors: Laura E. Brown. Committee members: Timothy Havens; Benjamin Ong; Thomas Oommen.
Abstract:
As the amount of information available for data mining grows, the time needed to train models on those large volumes of data grows with it. Techniques such as sub-sampling and parallel algorithms have been employed to cope with this growth. Some studies have shown that sub-sampling can adversely affect the quality of the models produced, and the degree to which it affects different types of learning algorithms varies. Parallel algorithms perform well when enough computing resources (e.g., cores, memory) are available; however, on a cluster of limited size, growth in data will still cause an unacceptable growth in model training time. Beyond mitigating data size, choosing which algorithms are well suited to a particular dataset can itself be a challenge. While some studies have examined criteria for selecting a learning algorithm based on the properties of the dataset, the additional complexity of parallel learners and possible run time limitations have not been considered. This study explores run time and model quality results of various techniques for dealing with large datasets, including using different numbers of compute cores, sub-sampling the datasets, and exploiting the iterative, anytime nature of the training algorithms. The techniques were studied using MapReduce implementations of four supervised learning algorithms for binary classification with probabilistic models: logistic regression, tree induction, bagged trees, and boosted stumps. Evaluation of these techniques was done using a modified form of learning curves that includes a temporal component. Finally, the data collected were used to train a set of models to predict which type of parallel learner best suits a particular dataset, given run time limitations and the number of compute cores to be used. The predictions of those models were then compared to the actual results of running the algorithms on the datasets they were attempting to predict.
Local Note:
School code: 0129
Available:
| Shelf Number | Item Barcode | Shelf Location |
|---|---|---|
| XX(679943.1) | 679943-1001 | Proquest E-Thesis Collection |


