Clustering of Large-Scale Protein Datasets

Title:

Clustering of Large-Scale Protein Datasets

Author:

Abnousi, Armen, author. (orcid)0000-0003-1822-0928

ISBN:

9780438103597

Added Author Entry:

Abnousi, Armen, author.

Physical Description:

1 electronic resource (118 pages)

General Note:

Source: Dissertation Abstracts International, Volume: 79-11(E), Section: B.

Advisors: Shira L. Broschat. Committee members: Kelly Brayton; Ananth Kalyanaraman; Yinghui Wu.

Abstract:

Identifying similar proteins and grouping them accordingly is the operation generally known as protein clustering. This operation is essential to the prediction of protein function and structure. In this dissertation, we present a novel approach for protein clustering based on amino acid sequences of proteins. Our work consists of two main components: (1) detection of conserved regions within protein sequences and (2) grouping of these conserved regions based on their estimated similarity.

For the detection of conserved regions, we have developed the Non-Alignment Domain Detection Algorithm (NADDA), which applies random subspace ensemble methods to protein profiles and extracts features based on repeated short subsequences in the proteins. On some of our example data sets, NADDA predicts conserved indices with up to 76% accuracy when compared to the domain annotations in Pfam.
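The idea of building a per-position profile from repeated short subsequences can be illustrated with a minimal sketch. This is not the NADDA implementation: the function name `kmer_profile`, the choice of k = 3, and the toy sequences are assumptions made for illustration, and the random subspace ensemble classifier is omitted entirely. The sketch only shows how often the k-mer starting at each index recurs across a set of sequences, which is the kind of signal that tends to be high inside conserved regions.

```python
from collections import Counter


def kmer_profile(sequences, k=3):
    """Return, for every sequence, a list giving for each start index the
    number of sequences in the set that contain the k-mer starting there.

    Hypothetical helper for illustration; not part of NADDA.
    """
    # Count each k-mer at most once per sequence (a "document frequency").
    doc_freq = Counter()
    for seq in sequences:
        doc_freq.update({seq[i:i + k] for i in range(len(seq) - k + 1)})

    # Map every start index to the document frequency of its k-mer.
    return [
        [doc_freq[seq[i:i + k]] for i in range(len(seq) - k + 1)]
        for seq in sequences
    ]


# Toy sequences sharing the short motif "KVLA" (assumed example data).
seqs = ["MKVLAAGG", "TTKVLAAC", "GGGKVLAP"]
profiles = kmer_profile(seqs, k=3)
print(profiles[0])
```

Indices whose k-mers recur in many sequences stand out in the profile; a classifier could then be trained on such features to label indices as conserved or not.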

For the clustering of conserved regions, we use a min-wise independent hashing method (shingling). We show that our method generates results comparable to existing known clusters. In particular, we show that the clusters generated by our algorithm capture the subfamilies of the Pfam domain families for which the sequences in a cluster have a similar domain architecture. In addition, we show that for a randomly selected example data set, the clusters generated by our algorithm give a 75% average weighted F1 score, our accuracy metric, when compared to the clusters generated by pClust, a semi-exhaustive pairwise alignment algorithm. Both of our methods are alignment-free and based on independent operations on small subsequences from the input data set. This has allowed us to use the MapReduce framework extensively to parallelize our algorithms. A MapReduce implementation of both methods is publicly available.
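Min-wise hashing over shingles can be sketched as follows. This is a generic MinHash illustration, not the dissertation's code: the shingle length, the number of hash functions, the use of seeded BLAKE2b as the hash family, and all function names are assumptions. Each sequence is reduced to a set of k-length shingles, each set is summarized by the minimum hash value under several seeded hash functions, and the fraction of matching minima estimates the Jaccard similarity of the two shingle sets.

```python
import hashlib


def shingles(seq, k=3):
    """Set of all k-length substrings (shingles) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(shingle, seed):
    """Seeded 64-bit hash of a shingle (assumed hash family: BLAKE2b)."""
    data = f"{seed}:{shingle}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """Minimum hash value per seed; requires a non-empty shingle set."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Toy conserved regions (assumed example data).
sig_a = minhash_signature(shingles("MKVLAAGGKVLA"))
sig_b = minhash_signature(shingles("TTKVLAAGGKVL"))
sig_c = minhash_signature(shingles("CCCC"))
similar = estimated_jaccard(sig_a, sig_b)
unrelated = estimated_jaccard(sig_a, sig_c)
print(similar, unrelated)
```

Because each signature is computed from one sequence alone, the signature step maps naturally onto independent map tasks, and sequences whose signatures agree can then be grouped in a reduce step. That mapping is a general property of MinHash and is offered here as an illustration, not as a description of the published implementation.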

Notes:

School code: 0251

Subject Heading:

Bioinformatics.

Added Corporate Entry:

Washington State University. Computer Science.

Electronic Access:

http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:10641432

Available:

Call Number	Inventory Number	Shelf Location
XX(688045.1)	688045-1001	ProQuest E-Thesis Collection