
Fault Tolerance through Invariant Checks in Applications Using Linear Algebraic Methods
Title:
Fault Tolerance through Invariant Checks in Applications Using Linear Algebraic Methods
Author:
Loh, Felix Da Yuan, author.
ISBN:
9780438082496
Physical Description:
1 electronic resource (150 pages)
General Note:
Source: Dissertation Abstracts International, Volume: 79-11(E), Section: B.
Advisors: Parameswaran Ramanathan. Committee members: Yu Hen Hu; Mikko Lipasti; Dan Negrut; Kewal Saluja.
Abstract:
Graphics processing units (GPUs) have become a popular platform for scientific computing applications, many of which are based on linear algebra. As the minimum feature size of transistors decreases, GPUs are becoming more vulnerable to transient faults caused by events such as alpha particle strikes, power fluctuations and electronic noise. In addition, the likelihood of a fault increases as more GPU computing nodes are used in supercomputers to meet the increasingly demanding computational requirements of scientific applications. Consequently, there are concerns that GPU-based supercomputer systems will suffer from very high fault rates. In order to ensure reliability, it is necessary to use fault tolerance (FT) techniques.
This thesis presents low-overhead FT techniques for several commonly used linear algebraic applications that run on GPUs, focusing mainly on applications that operate with sparse matrices. These FT techniques exploit both the invariant properties of the algorithms used in these applications and the parallel execution model of GPUs to allow for low-overhead error detection.
This thesis introduces and studies efficient error checking schemes for three popular matrix factorization techniques: Householder QR factorization, left-looking Cholesky factorization, and right-looking LU factorization. It also explores lightweight invariant checking methods for the preconditioned conjugate gradient (PCG) and biconjugate gradient stabilized (BiCGSTAB) iterative solvers and introduces an efficient checking method for the Lanczos eigensolver, as well as fault injection mechanisms for NVIDIA GPUs that allow for the simulation of transient, non-instantaneous faults.
This thesis carefully evaluates these FT methods on a contemporary NVIDIA GPU platform, and the results show that the aforementioned error checking strategies have high error coverage and are significantly more efficient than prior FT techniques on a GPU system.
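As a rough illustration of what an invariant check in an iterative solver can look like, the sketch below runs plain (unpreconditioned) conjugate gradient and periodically compares the recursively updated residual against the residual recomputed from its definition, r = b - Ax. This is a generic textbook-style check written for this record, not the scheme developed in the thesis. The function name, the tolerances, and the `fault_at` parameter (which simulates a corrupted solution entry) are all hypothetical.

```python
# Illustrative sketch only: conjugate gradient with a residual-invariant check.
# Assumption: this is NOT the thesis's method; all names and tolerances are hypothetical.

def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return dot(u, u) ** 0.5

def cg_with_residual_check(A, b, tol=1e-10, max_iter=50,
                           check_every=1, check_tol=1e-8, fault_at=None):
    """Solve A x = b for symmetric positive definite A.

    Returns (x, detected), where detected is the iteration index at which
    the invariant check failed, or None if no violation was seen.
    """
    x = [0.0] * len(b)
    r = list(b)                 # r = b - A x with x = 0
    p = list(r)
    rs_old = dot(r, r)
    for k in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]

        if fault_at == k:
            x[0] += 1.0         # simulated transient fault in the solution vector

        # Invariant: the recursively updated r should equal b - A x
        # up to floating-point rounding.
        if (k + 1) % check_every == 0:
            true_r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
            drift = norm([t - ri for t, ri in zip(true_r, r)])
            if drift > check_tol * max(1.0, norm(b)):
                return x, k     # violation detected at iteration k

        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, None
```

Recomputing b - Ax costs an extra matrix-vector product, so a practical scheme would presumably check less often (a larger `check_every`) or use a cheaper invariant. The full recomputation is used here only because it is the simplest check to state.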
Local Note:
School code: 0262
Available:*
| Shelf Number | Item Barcode | Shelf Location | Status |
|---|---|---|---|
| XX(694428.1) | 694428-1001 | Proquest E-Thesis Collection | Searching... |


