Fault Tolerance through Invariant Checks in Applications Using Linear Algebraic Methods

Title:

Fault Tolerance through Invariant Checks in Applications Using Linear Algebraic Methods

Author:

Loh, Felix Da Yuan, author.

ISBN:

9780438082496

Physical Description:

1 electronic resource (150 pages)

General Note:

Source: Dissertation Abstracts International, Volume: 79-11(E), Section: B.

Advisors: Parameswaran Ramanathan. Committee members: Yu Hen Hu; Mikko Lipasti; Dan Negrut; Kewal Saluja.

Abstract:

Graphics processing units (GPUs) have become a popular platform for scientific computing applications, many of which are based on linear algebra. As the minimum feature size of transistors decreases, GPUs are becoming more vulnerable to transient faults caused by events such as alpha particle strikes, power fluctuations and electronic noise. In addition, the likelihood of a fault increases as more GPU computing nodes are used in supercomputers to meet the increasingly demanding computational requirements of scientific applications. Consequently, there are concerns that GPU-based supercomputer systems will suffer from very high fault rates. In order to ensure reliability, it is necessary to use fault tolerance (FT) techniques.

This thesis presents low-overhead FT techniques for several commonly used linear algebraic applications that run on GPUs, focusing mainly on applications that operate on sparse matrices. These FT techniques exploit the invariant properties of the algorithms used in these applications, together with the parallel execution model of GPUs, to allow for low-overhead error detection.

This thesis introduces and studies efficient error checking schemes for three popular matrix factorization techniques: Householder QR factorization, left-looking Cholesky factorization, and right-looking LU factorization. It also explores lightweight invariant checking methods for the preconditioned conjugate gradient (PCG) and biconjugate gradient stabilized (BiCGSTAB) iterative solvers and introduces an efficient checking method for the Lanczos eigensolver, as well as fault injection mechanisms for NVIDIA GPUs that allow for the simulation of transient, non-instantaneous faults.

This thesis carefully evaluates these FT methods on a contemporary NVIDIA GPU platform, and the results show that the aforementioned error checking strategies have high error coverage and are significantly more efficient than prior FT techniques on a GPU system.

Local Note:

School code: 0262

Subject Term:

Computer engineering.

Electrical engineering.

Computer science.

Added Corporate Author:

The University of Wisconsin - Madison. Electrical Engineering.

Electronic Access:

http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:10825822

Available:*

Shelf Number	Item Barcode	Shelf Location	Status
XX(694428.1)	694428-1001	Proquest E-Thesis Collection	Searching...
