
Title:
Approximate Clustering Algorithms for High Dimensional Streaming and Distributed Data
Author:
Carraher, Lee A., author.
ISBN:
9780438096431
Physical Description:
1 electronic resource (145 pages)
General Note:
Source: Dissertation Abstracts International, Volume: 79-11(E), Section: B.
Committee members: Fred Beyette; Raj Bhatnagar; Anca Ralescu; Dan Ralescu; Philip Wilsey.
Abstract:
Clustering data has gained popularity in recent years due to an expanding opportunity to discover knowledge and gather insights from multiple widely available and diverse data sources. Data clustering offers an intuitive solution to a wide variety of unsupervised classification problems. Clustering problems often arise in areas where no ground truth is known, or where the arrival rate of a data source exceeds the labeling capacity of a human expert. Due to the continuous, fast-moving nature of many common data streams, such as those from IoT (Internet of Things) devices, social network interactions, e-commerce click-streams, scientific monitoring devices, and network traffic, noise-robust and distributed clustering algorithms are necessary.
Often, current clustering methods suffer from one or more drawbacks when applied to these demanding problems. For this reason, we propose a new method for data clustering that is noise resilient and distributed, with predictable overall complexity. The principal claim of this research is that while many clustering algorithms rigorously optimize a loss function, their convergence often results in finding a local minimum that is indistinguishable from a less computationally rigorous optimization on an approximation of the data. We propose that by removing the rigorous optimization requirement, we can achieve better scalability and parallelism with comparable performance. In this work we design a clustering algorithm along these lines that uses dimensionality reduction and hashing to reduce the problem size while still attaining clustering performance comparable to other clustering algorithms. Our proposed method is more robust to noise, with a lower runtime requirement and greater opportunity for shared and distributed memory parallelism.
This work presents a set of methods for clustering high dimensional data for a variety of data sources and processing environments. The proposed RPHash algorithms share a commonality in that they all utilize locality sensitive hash (LSH) functions and random projection (RP) to identify density modes in data sources. They differ, however, in the operation space and the cluster region identification method. The algorithms presented are the RPHash algorithm, the streamingRPHash algorithm, and the Tree Walk RPHash (TWRP) algorithm.
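The shared mechanism described above, random projection followed by locality-sensitive hashing, with the densest hash buckets taken as candidate density modes, can be sketched roughly as follows. This is an illustrative sketch only: the function names, the sign-bit (hyperplane) hash, and all parameters here are invented for illustration and are not the dissertation's own LSH constructions or parameters.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)

def random_projection(X, target_dim):
    """Project the rows of X down to target_dim dimensions with a
    Gaussian random matrix (Johnson-Lindenstrauss style)."""
    d = X.shape[1]
    R = rng.standard_normal((d, target_dim)) / np.sqrt(target_dim)
    return X @ R

def lsh_buckets(Xp):
    """Assign each projected point to a coarse bucket keyed by its
    sign pattern -- a simple hyperplane LSH stand-in for the more
    structured hash families used in the dissertation."""
    bits = (Xp > 0).astype(int)
    return Counter(tuple(row) for row in bits)

# Two well-separated Gaussian blobs in 100 dimensions.
X = np.vstack([
    rng.standard_normal((200, 100)) + 10.0,
    rng.standard_normal((200, 100)) - 10.0,
])

counts = lsh_buckets(random_projection(X, target_dim=4))
modes = counts.most_common(3)  # densest buckets ~ candidate density modes
```

The key property exploited is that random projection approximately preserves pairwise separation, so points from the same dense region tend to share a hash bucket; counting bucket occupancy then surfaces density modes without any iterative loss optimization.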
We analyze the results of our developed algorithms in both streaming and at-rest data input environments. Experiments on real and synthetic data demonstrate the advantages of our proposed clustering methods for streaming and at-rest data against common clustering algorithms. Furthermore, our theoretical analysis shows that our runtime and memory complexities are effectively linear and sub-linear, respectively, in terms of input. Our principal claim, that approximate clustering results are not substantially different from those of exact clustering methods with guaranteed convergence to a local minimum, is confirmed by these results. In addition, we demonstrate the potential gains in processing speed and parallelism.
Local Note:
School code: 0045
Available:

| Shelf Number | Item Barcode | Shelf Location |
|---|---|---|
| XX(696333.1) | 696333-1001 | Proquest E-Thesis Collection |


