- Paperback
Other customers were also interested in
- Jean-Marc Pierson: Large-Scale Distributed Systems and Energy Efficiency (130,99 €)
- Dan C. Marinescu: Internet Workflow Management (191,99 €)
- Amol B. Bakshi: Architecture-Independent Programming for Wireless Sensor Networks (139,99 €)
- High Performance Computing (143,99 €)
- Thomas C. Jepsen: Distributed Storage Networks (136,99 €)
- Jiannong Cao: Mobile Agents (107,99 €)
- Samee U. Khan: Scalable Computing and Communications (191,99 €)
Break into the powerful world of parallel GPU programming with this down-to-earth, practical guide
Designed for professionals across multiple industrial sectors, Professional CUDA C Programming presents the fundamentals of CUDA, a parallel computing platform and programming model designed to ease the development of GPU programs, in an easy-to-follow format, and teaches readers how to think in parallel and implement parallel algorithms on GPUs. Each chapter covers a specific topic and includes workable examples that demonstrate the development process, allowing readers to explore both the "hard" and "soft" aspects of GPU programming.
Computing architectures are experiencing a fundamental shift toward scalable parallel computing, motivated by application requirements in industry and science. This book demonstrates the challenges of utilizing compute resources efficiently at peak performance, presents modern techniques for tackling those challenges, and keeps the material accessible to professionals who are not necessarily parallel programming experts. The CUDA programming model and tools empower developers to write high-performance applications on a scalable, parallel computing platform: the GPU. However, CUDA itself can be difficult to learn without extensive programming experience. Recognized CUDA authorities John Cheng, Max Grossman, and Ty McKercher guide readers through essential GPU programming skills and best practices in Professional CUDA C Programming, including:
CUDA Programming Model
GPU Execution Model
GPU Memory Model
Streams, Events, and Concurrency
Multi-GPU Programming
CUDA Domain-Specific Libraries
Profiling and Performance Tuning
The book makes complex CUDA concepts easy to understand for anyone with knowledge of basic software development, with exercises designed to be both readable and high-performance. For the professional seeking entrance to parallel computing and the high-performance computing community, Professional CUDA C Programming is an invaluable resource, with the most current information available on the market.
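To give a sense of the programming style the book teaches, here is a minimal sketch in the spirit of its "Hello World from GPU" chapter; the kernel name and launch configuration are illustrative assumptions, not code reproduced from the book:
```cuda
#include <cstdio>

// Illustrative kernel: each GPU thread prints its own index.
// (Kernel name and launch configuration are assumptions for this sketch.)
__global__ void helloFromGPU(void)
{
    printf("Hello World from GPU thread %d!\n", threadIdx.x);
}

int main(void)
{
    printf("Hello World from CPU!\n");

    // Launch the kernel with 1 block of 10 threads.
    helloFromGPU<<<1, 10>>>();

    // Block until the kernel finishes so device-side printf output is flushed.
    cudaDeviceSynchronize();
    return 0;
}
```
Such a file would be compiled with NVIDIA's nvcc compiler, e.g. nvcc hello.cu -o hello.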
Note: This item can only be shipped to a German delivery address.
Product details
- Publisher: Wiley & Sons
- 1st edition
- Pages: 528
- Publication date: September 15, 2014
- Language: English
- Dimensions: 236mm x 187mm x 30mm
- Weight: 875g
- ISBN-13: 9781118739327
- ISBN-10: 1118739329
- Item no.: 40449051
John Cheng, PhD, is a Research Scientist at BGP International in Houston. He has developed seismic imaging products with GPU technology and many high-performance parallel production applications on heterogeneous computing platforms. Max Grossman is an expert in GPU computing with experience applying CUDA to problems in medical imaging, machine learning, geophysics, and more. Ty McKercher has been helping customers adopt GPU acceleration technologies at NVIDIA, where he has worked since 2008.
Foreword xvii
Preface xix
Introduction xxi
Chapter 1: Heterogeneous Parallel Computing with CUDA 1
Parallel Computing 2
Sequential and Parallel Programming 3
Parallelism 4
Computer Architecture 6
Heterogeneous Computing 8
Heterogeneous Architecture 9
Paradigm of Heterogeneous Computing 12
CUDA: A Platform for Heterogeneous Computing 14
Hello World from GPU 17
Is CUDA C Programming Difficult? 20
Summary 21
Chapter 2: CUDA Programming Model 23
Introducing the CUDA Programming Model 23
CUDA Programming Structure 25
Managing Memory 26
Organizing Threads 30
Launching a CUDA Kernel 36
Writing Your Kernel 37
Verifying Your Kernel 39
Handling Errors 40
Compiling and Executing 40
Timing Your Kernel 43
Timing with CPU Timer 44
Timing with nvprof 47
Organizing Parallel Threads 49
Indexing Matrices with Blocks and Threads 49
Summing Matrices with a 2D Grid and 2D Blocks 53
Summing Matrices with a 1D Grid and 1D Blocks 57
Summing Matrices with a 2D Grid and 1D Blocks 58
Managing Devices 60
Using the Runtime API to Query GPU Information 61
Determining the Best GPU 63
Using nvidia-smi to Query GPU Information 63
Setting Devices at Runtime 64
Summary 65
Chapter 3: CUDA Execution Model 67
Introducing the CUDA Execution Model 67
GPU Architecture Overview 68
The Fermi Architecture 71
The Kepler Architecture 73
Profile-Driven Optimization 78
Understanding the Nature of Warp Execution 80
Warps and Thread Blocks 80
Warp Divergence 82
Resource Partitioning 87
Latency Hiding 90
Occupancy 93
Synchronization 97
Scalability 98
Exposing Parallelism 98
Checking Active Warps with nvprof 100
Checking Memory Operations with nvprof 100
Exposing More Parallelism 101
Avoiding Branch Divergence 104
The Parallel Reduction Problem 104
Divergence in Parallel Reduction 106
Improving Divergence in Parallel Reduction 110
Reducing with Interleaved Pairs 112
Unrolling Loops 114
Reducing with Unrolling 115
Reducing with Unrolled Warps 117
Reducing with Complete Unrolling 119
Reducing with Template Functions 120
Dynamic Parallelism 122
Nested Execution 123
Nested Hello World on the GPU 124
Nested Reduction 128
Summary 132
Chapter 4: Global Memory 135
Introducing the CUDA Memory Model 136
Benefits of a Memory Hierarchy 136
CUDA Memory Model 137
Memory Management 145
Memory Allocation and Deallocation 146
Memory Transfer 146
Pinned Memory 148
Zero-Copy Memory 150
Unified Virtual Addressing 156
Unified Memory 157
Memory Access Patterns 158
Aligned and Coalesced Access 158
Global Memory Reads 160
Global Memory Writes 169
Array of Structures versus Structure of Arrays 171
Performance Tuning 176
What Bandwidth Can a Kernel Achieve? 179
Memory Bandwidth 179
Matrix Transpose Problem 180
Matrix Addition with Unified Memory 195
Summary 199
Chapter 5: Shared Memory and Constant Memory 203
Introducing CUDA Shared Memory 204
Shared Memory 204
Shared Memory Allocation 206
Shared Memory Banks and Access Mode 206
Configuring the Amount of Shared Memory 212
Synchronization 214
Checking the Data Layout of Shared Memory 216
Square Shared Memory 217
Rectangular Shared Memory 225
Reducing Global Memory Access 232
Parallel Reduction with Shared Memory 232
Parallel Reduction with Unrolling 236
Parallel Reduction with Dynamic Shared Memory 238
Effective Bandwidth 239
Coalescing Global Memory Accesses 239
Baseline Transpose Kernel 240
Matrix Transpose with Shared Memory 241
Matrix Transpose with Padded Shared Memory 245
Matrix Transpose with Unrolling 246
Exposing More Parallelism 249
Constant Memory 250
Implementing a 1D Stencil with Constant Memory 250
Comparing with the Read-Only Cache 253
The Warp Shuffle Instruction 255
Variants of the Warp Shuffle Instruction 256
Sharing Data within a Warp 258
Parallel Reduction Using the Warp Shuffle Instruction 262
Summary 264
Chapter 6: Streams and Concurrency 267
Introducing Streams and Events 268
CUDA Streams 269
Stream Scheduling 271
Stream Priorities 273
CUDA Events 273
Stream Synchronization 275
Concurrent Kernel Execution 279
Concurrent Kernels in Non-NULL Streams 279
False Dependencies on Fermi GPUs 281
Dispatching Operations with OpenMP 283
Adjusting Stream Behavior Using Environment Variables 284
Concurrency-Limiting GPU Resources 286
Blocking Behavior of the Default Stream 287
Creating Inter-Stream Dependencies 288
Overlapping Kernel Execution and Data Transfer 289
Overlap Using Depth-First Scheduling 289
Overlap Using Breadth-First Scheduling 293
Overlapping GPU and CPU Execution 294
Stream Callbacks 295
Summary 297
Chapter 7: Tuning Instruction-Level Primitives 299
Introducing CUDA Instructions 300
Floating-Point Instructions 301
Intrinsic and Standard Functions 303
Atomic Instructions 304
Optimizing Instructions for Your Application 306
Single-Precision vs. Double-Precision 306
Standard vs. Intrinsic Functions 309
Understanding Atomic Instructions 315
Bringing It All Together 322
Summary 324
Chapter 8: GPU-Accelerated CUDA Libraries and OpenACC 327
Introducing the CUDA Libraries 328
Supported Domains for CUDA Libraries 329
A Common Library Workflow 330
The cuSPARSE Library 332
cuSPARSE Data Storage Formats 333
Formatting Conversion with cuSPARSE 337
Demonstrating cuSPARSE 338
Important Topics in cuSPARSE Development 340
cuSPARSE Summary 341
The cuBLAS Library 341
Managing cuBLAS Data 342
Demonstrating cuBLAS 343
Important Topics in cuBLAS Development 345
cuBLAS Summary 346
The cuFFT Library 346
Using the cuFFT API 347
Demonstrating cuFFT 348
cuFFT Summary 349
The cuRAND Library 349
Choosing Pseudo- or Quasi-Random Numbers 349
Overview of the cuRAND Library 350
Demonstrating cuRAND 354
Important Topics in cuRAND Development 357
CUDA Library Features Introduced in CUDA 6 358
Drop-In CUDA Libraries 358
Multi-GPU Libraries 359
A Survey of CUDA Library Performance 361
cuSPARSE versus MKL 361
cuBLAS versus MKL BLAS 362
cuFFT versus FFTW versus MKL 363
CUDA Library Performance Summary 364
Using OpenACC 365
Using OpenACC Compute Directives 367
Using OpenACC Data Directives 375
The OpenACC Runtime API 380
Combining OpenACC and the CUDA Libraries 382
Summary of OpenACC 384
Summary 384
Chapter 9: Multi-GPU Programming 387
Moving to Multiple GPUs 388
Executing on Multiple GPUs 389
Peer-to-Peer Communication 391
Synchronizing across Multi-GPUs 392
Subdividing Computation across Multiple GPUs 393
Allocating Memory on Multiple Devices 393
Distributing Work from a Single Host Thread 394
Compiling and Executing 395
Peer-to-Peer Communication on Multiple GPUs 396
Enabling Peer-to-Peer Access 396
Peer-to-Peer Memory Copy 396
Peer-to-Peer Memory Access with Unified Virtual Addressing 398
Finite Difference on Multi-GPU 400
Stencil Calculation for 2D Wave Equation 400
Typical Patterns for Multi-GPU Programs 401
2D Stencil Computation with Multiple GPUs 403
Overlapping Computation and Communication 405
Compiling and Executing 406
Scaling Applications across GPU Clusters 409
CPU-to-CPU Data Transfer 410
GPU-to-GPU Data Transfer Using Traditional MPI 413
GPU-to-GPU Data Transfer with CUDA-aware MPI 416
Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI 417
Adjusting Message Chunk Size 418
GPU-to-GPU Data Transfer with GPUDirect RDMA 419
Summary 422
Chapter 10: Implementation Considerations 425
The CUDA C Development Process 426
APOD Development Cycle 426
Optimization Opportunities 429
CUDA Code Compilation 432
CUDA Error Handling 437
Profile-Driven Optimization 438
Finding Optimization Opportunities Using nvprof 439
Guiding Optimization Using nvvp 443
NVIDIA Tools Extension 446
CUDA Debugging 448
Kernel Debugging 448
Memory Debugging 456
Debugging Summary 462
A Case Study in Porting C Programs to CUDA C 462
Assessing crypt 463
Parallelizing crypt 464
Optimizing crypt 465
Deploying crypt 472
Summary of Porting crypt 475
Summary 476
Appendix: Suggested Readings 477
Index 481
Preface xix
Introduction xxi
Chapter 1: Heterogeneous Parallel Computing with CUDA 1
Parallel Computing 2
Sequential and Parallel Programming 3
Parallelism 4
Computer Architecture 6
Heterogeneous Computing 8
Heterogeneous Architecture 9
Paradigm of Heterogeneous Computing 12
CUDA: A Platform for Heterogeneous Computing 14
Hello World from GPU 17
Is CUDA C Programming Difficult? 20
Summary 21
Chapter 2: CUDA Programming Model 23
Introducing the CUDA Programming Model 23
CUDA Programming Structure 25
Managing Memory 26
Organizing Threads 30
Launching a CUDA Kernel 36
Writing Your Kernel 37
Verifying Your Kernel 39
Handling Errors 40
Compiling and Executing 40
Timing Your Kernel 43
Timing with CPU Timer 44
Timing with nvprof 47
Organizing Parallel Threads 49
Indexing Matrices with Blocks and Threads 49
Summing Matrices with a 2D Grid and 2D Blocks 53
Summing Matrices with a 1D Grid and 1D Blocks 57
Summing Matrices with a 2D Grid and 1D Blocks 58
Managing Devices 60
Using the Runtime API to Query GPU Information 61
Determining the Best GPU 63
Using nvidia-smi to Query GPU Information 63
Setting Devices at Runtime 64
Summary 65
Chapter 3: CUDA Execution Model 67
Introducing the CUDA Execution Model 67
GPU Architecture Overview 68
The Fermi Architecture 71
The Kepler Architecture 73
Profile-Driven Optimization 78
Understanding the Nature of Warp Execution 80
Warps and Thread Blocks 80
Warp Divergence 82
Resource Partitioning 87
Latency Hiding 90
Occupancy 93
Synchronization 97
Scalability 98
Exposing Parallelism 98
Checking Active Warps with nvprof 100
Checking Memory Operations with nvprof 100
Exposing More Parallelism 101
Avoiding Branch Divergence 104
The Parallel Reduction Problem 104
Divergence in Parallel Reduction 106
Improving Divergence in Parallel Reduction 110
Reducing with Interleaved Pairs 112
Unrolling Loops 114
Reducing with Unrolling 115
Reducing with Unrolled Warps 117
Reducing with Complete Unrolling 119
Reducing with Template Functions 120
Dynamic Parallelism 122
Nested Execution 123
Nested Hello World on the GPU 124
Nested Reduction 128
Summary 132
Chapter 4: Global Memory 135
Introducing the CUDA Memory Model 136
Benefits of a Memory Hierarchy 136
CUDA Memory Model 137
Memory Management 145
Memory Allocation and Deallocation 146
Memory Transfer 146
Pinned Memory 148
Zero-Copy Memory 150
Unified Virtual Addressing 156
Unified Memory 157
Memory Access Patterns 158
Aligned and Coalesced Access 158
Global Memory Reads 160
Global Memory Writes 169
Array of Structures versus Structure of Arrays 171
Performance Tuning 176
What Bandwidth Can a Kernel Achieve? 179
Memory Bandwidth 179
Matrix Transpose Problem 180
Matrix Addition with Unified Memory 195
Summary 199
Chapter 5: Shared Memory and Constant Memory 203
Introducing CUDA Shared Memory 204
Shared Memory 204
Shared Memory Allocation 206
Shared Memory Banks and Access Mode 206
Configuring the Amount of Shared Memory 212
Synchronization 214
Checking the Data Layout of Shared Memory 216
Square Shared Memory 217
Rectangular Shared Memory 225
Reducing Global Memory Access 232
Parallel Reduction with Shared Memory 232
Parallel Reduction with Unrolling 236
Parallel Reduction with Dynamic Shared Memory 238
Effective Bandwidth 239
Coalescing Global Memory Accesses 239
Baseline Transpose Kernel 240
Matrix Transpose with Shared Memory 241
Matrix Transpose with Padded Shared Memory 245
Matrix Transpose with Unrolling 246
Exposing More Parallelism 249
Constant Memory 250
Implementing a 1D Stencil with Constant Memory 250
Comparing with the Read-Only Cache 253
The Warp Shuffle Instruction 255
Variants of the Warp Shuffle Instruction 256
Sharing Data within a Warp 258
Parallel Reduction Using the Warp Shuffle Instruction 262
Summary 264
Chapter 6: Streams and Concurrency 267
Introducing Streams and Events 268
CUDA Streams 269
Stream Scheduling 271
Stream Priorities 273
CUDA Events 273
Stream Synchronization 275
Concurrent Kernel Execution 279
Concurrent Kernels in Non-NULL Streams 279
False Dependencies on Fermi GPUs 281
Dispatching Operations with OpenMP 283
Adjusting Stream Behavior Using Environment Variables 284
Concurrency-Limiting GPU Resources 286
Blocking Behavior of the Default Stream 287
Creating Inter-Stream Dependencies 288
Overlapping Kernel Execution and Data Transfer 289
Overlap Using Depth-First Scheduling 289
Overlap Using Breadth-First Scheduling 293
Overlapping GPU and CPU Execution 294
Stream Callbacks 295
Summary 297
Chapter 7: Tuning Instruction-Level Primitives 299
Introducing CUDA Instructions 300
Floating-Point Instructions 301
Intrinsic and Standard Functions 303
Atomic Instructions 304
Optimizing Instructions for Your Application 306
Single-Precision vs. Double-Precision 306
Standard vs. Intrinsic Functions 309
Understanding Atomic Instructions 315
Bringing It All Together 322
Summary 324
Chapter 8: GPU-Accelerated CUDA Libraries and OpenACC 327
Introducing the CUDA Libraries 328
Supported Domains for CUDA Libraries 329
A Common Library Workflow 330
The CUSPARSE Library 332
cuSPARSE Data Storage Formats 333
Formatting Conversion with cuSPARSE 337
Demonstrating cuSPARSE 338
Important Topics in cuSPARSE Development 340
cuSPARSE Summary 341
The cuBLAS Library 341
Managing cuBLAS Data 342
Demonstrating cuBLAS 343
Important Topics in cuBLAS Development 345
cuBLAS Summary 346
The cuFFT Library 346
Using the cuFFT API 347
Demonstrating cuFFT 348
cuFFT Summary 349
The cuRAND Library 349
Choosing Pseudo- or Quasi- Random Numbers 349
Overview of the cuRAND Library 350
Demonstrating cuRAND 354
Important Topics in cuRAND Development 357
CUDA Library Features Introduced in CUDA 6 358
Drop-In CUDA Libraries 358
Multi-GPU Libraries 359
A Survey of CUDA Library Performance 361
cuSPARSE versus MKL 361
cuBLAS versus MKL BLAS 362
cuFFT versus FFTW versus MKL 363
CUDA Library Performance Summary 364
Using OpenACC 365
Using OpenACC Compute Directives 367
Using OpenACC Data Directives 375
The OpenACC Runtime API 380
Combining OpenACC and the CUDA Libraries 382
Summary of OpenACC 384
Summary 384
Chapter 9: Multi-GPU Programming 387
Moving to Multiple GPUs 388
Executing on Multiple GPUs 389
Peer-to-Peer Communication 391
Synchronizing across Multi-GPUs 392
Subdividing Computation across Multiple GPUs 393
Allocating Memory on Multiple Devices 393
Distributing Work from a Single Host Thread 394
Compiling and Executing 395
Peer-to-Peer Communication on Multiple GPUs 396
Enabling Peer-to-Peer Access 396
Peer-to-Peer Memory Copy 396
Peer-to-Peer Memory Access with Unified Virtual Addressing 398
Finite Difference on Multi-GPU 400
Stencil Calculation for 2D Wave Equation 400
Typical Patterns for Multi-GPU Programs 401
2D Stencil Computation with Multiple GPUs 403
Overlapping Computation and Communication 405
Compiling and Executing 406
Scaling Applications across GPU Clusters 409
CPU-to-CPU Data Transfer 410
GPU-to-GPU Data Transfer Using Traditional MPI 413
GPU-to-GPU Data Transfer with CUDA-aware MPI 416
Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI 417
Adjusting Message Chunk Size 418
GPU to GPU Data Transfer with GPUDirect RDMA 419
Summary 422
Chapter 10: Implementation Considerations 425
The CUDA C Development Process 426
APOD Development Cycle 426
Optimization Opportunities 429
CUDA Code Compilation 432
CUDA Error Handling 437
Profile-Driven Optimization 438
Finding Optimization Opportunities Using nvprof 439
Guiding Optimization Using nvvp 443
NVIDIA Tools Extension 446
CUDA Debugging 448
Kernel Debugging 448
Memory Debugging 456
Debugging Summary 462
A Case Study in Porting C Programs to CUDA C 462
Assessing crypt 463
Parallelizing crypt 464
Optimizing crypt 465
Deploying Crypt 472
Summary of Porting crypt 475
Summary 476
Appendix: Suggested Readings 477
Index 481
Foreword xvii
Preface xix
Introduction xxi
Chapter 1: Heterogeneous Parallel Computing with CUDA 1
Parallel Computing 2
Sequential and Parallel Programming 3
Parallelism 4
Computer Architecture 6
Heterogeneous Computing 8
Heterogeneous Architecture 9
Paradigm of Heterogeneous Computing 12
CUDA: A Platform for Heterogeneous Computing 14
Hello World from GPU 17
Is CUDA C Programming Difficult? 20
Summary 21
Chapter 2: CUDA Programming Model 23
Introducing the CUDA Programming Model 23
CUDA Programming Structure 25
Managing Memory 26
Organizing Threads 30
Launching a CUDA Kernel 36
Writing Your Kernel 37
Verifying Your Kernel 39
Handling Errors 40
Compiling and Executing 40
Timing Your Kernel 43
Timing with CPU Timer 44
Timing with nvprof 47
Organizing Parallel Threads 49
Indexing Matrices with Blocks and Threads 49
Summing Matrices with a 2D Grid and 2D Blocks 53
Summing Matrices with a 1D Grid and 1D Blocks 57
Summing Matrices with a 2D Grid and 1D Blocks 58
Managing Devices 60
Using the Runtime API to Query GPU Information 61
Determining the Best GPU 63
Using nvidia-smi to Query GPU Information 63
Setting Devices at Runtime 64
Summary 65
Chapter 3: CUDA Execution Model 67
Introducing the CUDA Execution Model 67
GPU Architecture Overview 68
The Fermi Architecture 71
The Kepler Architecture 73
Profile-Driven Optimization 78
Understanding the Nature of Warp Execution 80
Warps and Thread Blocks 80
Warp Divergence 82
Resource Partitioning 87
Latency Hiding 90
Occupancy 93
Synchronization 97
Scalability 98
Exposing Parallelism 98
Checking Active Warps with nvprof 100
Checking Memory Operations with nvprof 100
Exposing More Parallelism 101
Avoiding Branch Divergence 104
The Parallel Reduction Problem 104
Divergence in Parallel Reduction 106
Improving Divergence in Parallel Reduction 110
Reducing with Interleaved Pairs 112
Unrolling Loops 114
Reducing with Unrolling 115
Reducing with Unrolled Warps 117
Reducing with Complete Unrolling 119
Reducing with Template Functions 120
Dynamic Parallelism 122
Nested Execution 123
Nested Hello World on the GPU 124
Nested Reduction 128
Summary 132
Chapter 4: Global Memory 135
Introducing the CUDA Memory Model 136
Benefits of a Memory Hierarchy 136
CUDA Memory Model 137
Memory Management 145
Memory Allocation and Deallocation 146
Memory Transfer 146
Pinned Memory 148
Zero-Copy Memory 150
Unified Virtual Addressing 156
Unified Memory 157
Memory Access Patterns 158
Aligned and Coalesced Access 158
Global Memory Reads 160
Global Memory Writes 169
Array of Structures versus Structure of Arrays 171
Performance Tuning 176
What Bandwidth Can a Kernel Achieve? 179
Memory Bandwidth 179
Matrix Transpose Problem 180
Matrix Addition with Unified Memory 195
Summary 199
Chapter 5: Shared Memory and Constant Memory 203
Introducing CUDA Shared Memory 204
Shared Memory 204
Shared Memory Allocation 206
Shared Memory Banks and Access Mode 206
Configuring the Amount of Shared Memory 212
Synchronization 214
Checking the Data Layout of Shared Memory 216
Square Shared Memory 217
Rectangular Shared Memory 225
Reducing Global Memory Access 232
Parallel Reduction with Shared Memory 232
Parallel Reduction with Unrolling 236
Parallel Reduction with Dynamic Shared Memory 238
Effective Bandwidth 239
Coalescing Global Memory Accesses 239
Baseline Transpose Kernel 240
Matrix Transpose with Shared Memory 241
Matrix Transpose with Padded Shared Memory 245
Matrix Transpose with Unrolling 246
Exposing More Parallelism 249
Constant Memory 250
Implementing a 1D Stencil with Constant Memory 250
Comparing with the Read-Only Cache 253
The Warp Shuffle Instruction 255
Variants of the Warp Shuffle Instruction 256
Sharing Data within a Warp 258
Parallel Reduction Using the Warp Shuffle Instruction 262
Summary 264
Chapter 6: Streams and Concurrency 267
Introducing Streams and Events 268
CUDA Streams 269
Stream Scheduling 271
Stream Priorities 273
CUDA Events 273
Stream Synchronization 275
Concurrent Kernel Execution 279
Concurrent Kernels in Non-NULL Streams 279
False Dependencies on Fermi GPUs 281
Dispatching Operations with OpenMP 283
Adjusting Stream Behavior Using Environment Variables 284
Concurrency-Limiting GPU Resources 286
Blocking Behavior of the Default Stream 287
Creating Inter-Stream Dependencies 288
Overlapping Kernel Execution and Data Transfer 289
Overlap Using Depth-First Scheduling 289
Overlap Using Breadth-First Scheduling 293
Overlapping GPU and CPU Execution 294
Stream Callbacks 295
Summary 297
Chapter 7: Tuning Instruction-Level Primitives 299
Introducing CUDA Instructions 300
Floating-Point Instructions 301
Intrinsic and Standard Functions 303
Atomic Instructions 304
Optimizing Instructions for Your Application 306
Single-Precision vs. Double-Precision 306
Standard vs. Intrinsic Functions 309
Understanding Atomic Instructions 315
Bringing It All Together 322
Summary 324
Chapter 8: GPU-Accelerated CUDA Libraries and OpenACC 327
Introducing the CUDA Libraries 328
Supported Domains for CUDA Libraries 329
A Common Library Workflow 330
The CUSPARSE Library 332
cuSPARSE Data Storage Formats 333
Formatting Conversion with cuSPARSE 337
Demonstrating cuSPARSE 338
Important Topics in cuSPARSE Development 340
cuSPARSE Summary 341
The cuBLAS Library 341
Managing cuBLAS Data 342
Demonstrating cuBLAS 343
Important Topics in cuBLAS Development 345
cuBLAS Summary 346
The cuFFT Library 346
Using the cuFFT API 347
Demonstrating cuFFT 348
cuFFT Summary 349
The cuRAND Library 349
Choosing Pseudo- or Quasi- Random Numbers 349
Overview of the cuRAND Library 350
Demonstrating cuRAND 354
Important Topics in cuRAND Development 357
CUDA Library Features Introduced in CUDA 6 358
Drop-In CUDA Libraries 358
Multi-GPU Libraries 359
A Survey of CUDA Library Performance 361
cuSPARSE versus MKL 361
cuBLAS versus MKL BLAS 362
cuFFT versus FFTW versus MKL 363
CUDA Library Performance Summary 364
Using OpenACC 365
Using OpenACC Compute Directives 367
Using OpenACC Data Directives 375
The OpenACC Runtime API 380
Combining OpenACC and the CUDA Libraries 382
Summary of OpenACC 384
Summary 384
Chapter 9: Multi-GPU Programming 387
Moving to Multiple GPUs 388
Executing on Multiple GPUs 389
Peer-to-Peer Communication 391
Synchronizing across Multi-GPUs 392
Subdividing Computation across Multiple GPUs 393
Allocating Memory on Multiple Devices 393
Distributing Work from a Single Host Thread 394
Compiling and Executing 395
Peer-to-Peer Communication on Multiple GPUs 396
Enabling Peer-to-Peer Access 396
Peer-to-Peer Memory Copy 396
Peer-to-Peer Memory Access with Unified Virtual Addressing 398
Finite Difference on Multi-GPU 400
Stencil Calculation for 2D Wave Equation 400
Typical Patterns for Multi-GPU Programs 401
2D Stencil Computation with Multiple GPUs 403
Overlapping Computation and Communication 405
Compiling and Executing 406
Scaling Applications across GPU Clusters 409
CPU-to-CPU Data Transfer 410
GPU-to-GPU Data Transfer Using Traditional MPI 413
GPU-to-GPU Data Transfer with CUDA-aware MPI 416
Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI 417
Adjusting Message Chunk Size 418
GPU to GPU Data Transfer with GPUDirect RDMA 419
Summary 422
Chapter 10: Implementation Considerations 425
The CUDA C Development Process 426
APOD Development Cycle 426
Optimization Opportunities 429
CUDA Code Compilation 432
CUDA Error Handling 437
Profile-Driven Optimization 438
Finding Optimization Opportunities Using nvprof 439
Guiding Optimization Using nvvp 443
NVIDIA Tools Extension 446
CUDA Debugging 448
Kernel Debugging 448
Memory Debugging 456
Debugging Summary 462
A Case Study in Porting C Programs to CUDA C 462
Assessing crypt 463
Parallelizing crypt 464
Optimizing crypt 465
Deploying Crypt 472
Summary of Porting crypt 475
Summary 476
Appendix: Suggested Readings 477
Index 481
Preface xix
Introduction xxi
Chapter 1: Heterogeneous Parallel Computing with CUDA 1
Parallel Computing 2
Sequential and Parallel Programming 3
Parallelism 4
Computer Architecture 6
Heterogeneous Computing 8
Heterogeneous Architecture 9
Paradigm of Heterogeneous Computing 12
CUDA: A Platform for Heterogeneous Computing 14
Hello World from GPU 17
Is CUDA C Programming Difficult? 20
Summary 21
Chapter 2: CUDA Programming Model 23
Introducing the CUDA Programming Model 23
CUDA Programming Structure 25
Managing Memory 26
Organizing Threads 30
Launching a CUDA Kernel 36
Writing Your Kernel 37
Verifying Your Kernel 39
Handling Errors 40
Compiling and Executing 40
Timing Your Kernel 43
Timing with CPU Timer 44
Timing with nvprof 47
Organizing Parallel Threads 49
Indexing Matrices with Blocks and Threads 49
Summing Matrices with a 2D Grid and 2D Blocks 53
Summing Matrices with a 1D Grid and 1D Blocks 57
Summing Matrices with a 2D Grid and 1D Blocks 58
Managing Devices 60
Using the Runtime API to Query GPU Information 61
Determining the Best GPU 63
Using nvidia-smi to Query GPU Information 63
Setting Devices at Runtime 64
Summary 65
Chapter 3: CUDA Execution Model 67
Introducing the CUDA Execution Model 67
GPU Architecture Overview 68
The Fermi Architecture 71
The Kepler Architecture 73
Profile-Driven Optimization 78
Understanding the Nature of Warp Execution 80
Warps and Thread Blocks 80
Warp Divergence 82
Resource Partitioning 87
Latency Hiding 90
Occupancy 93
Synchronization 97
Scalability 98
Exposing Parallelism 98
Checking Active Warps with nvprof 100
Checking Memory Operations with nvprof 100
Exposing More Parallelism 101
Avoiding Branch Divergence 104
The Parallel Reduction Problem 104
Divergence in Parallel Reduction 106
Improving Divergence in Parallel Reduction 110
Reducing with Interleaved Pairs 112
Unrolling Loops 114
Reducing with Unrolling 115
Reducing with Unrolled Warps 117
Reducing with Complete Unrolling 119
Reducing with Template Functions 120
Dynamic Parallelism 122
Nested Execution 123
Nested Hello World on the GPU 124
Nested Reduction 128
Summary 132
Chapter 4: Global Memory 135
Introducing the CUDA Memory Model 136
Benefits of a Memory Hierarchy 136
CUDA Memory Model 137
Memory Management 145
Memory Allocation and Deallocation 146
Memory Transfer 146
Pinned Memory 148
Zero-Copy Memory 150
Unified Virtual Addressing 156
Unified Memory 157
Memory Access Patterns 158
Aligned and Coalesced Access 158
Global Memory Reads 160
Global Memory Writes 169
Array of Structures versus Structure of Arrays 171
Performance Tuning 176
What Bandwidth Can a Kernel Achieve? 179
Memory Bandwidth 179
Matrix Transpose Problem 180
Matrix Addition with Unified Memory 195
Summary 199
Chapter 5: Shared Memory and Constant Memory 203
Introducing CUDA Shared Memory 204
Shared Memory 204
Shared Memory Allocation 206
Shared Memory Banks and Access Mode 206
Configuring the Amount of Shared Memory 212
Synchronization 214
Checking the Data Layout of Shared Memory 216
Square Shared Memory 217
Rectangular Shared Memory 225
Reducing Global Memory Access 232
Parallel Reduction with Shared Memory 232
Parallel Reduction with Unrolling 236
Parallel Reduction with Dynamic Shared Memory 238
Effective Bandwidth 239
Coalescing Global Memory Accesses 239
Baseline Transpose Kernel 240
Matrix Transpose with Shared Memory 241
Matrix Transpose with Padded Shared Memory 245
Matrix Transpose with Unrolling 246
Exposing More Parallelism 249
Constant Memory 250
Implementing a 1D Stencil with Constant Memory 250
Comparing with the Read-Only Cache 253
The Warp Shuffle Instruction 255
Variants of the Warp Shuffle Instruction 256
Sharing Data within a Warp 258
Parallel Reduction Using the Warp Shuffle Instruction 262
Summary 264
Chapter 6: Streams and Concurrency 267
Introducing Streams and Events 268
CUDA Streams 269
Stream Scheduling 271
Stream Priorities 273
CUDA Events 273
Stream Synchronization 275
Concurrent Kernel Execution 279
Concurrent Kernels in Non-NULL Streams 279
False Dependencies on Fermi GPUs 281
Dispatching Operations with OpenMP 283
Adjusting Stream Behavior Using Environment Variables 284
Concurrency-Limiting GPU Resources 286
Blocking Behavior of the Default Stream 287
Creating Inter-Stream Dependencies 288
Overlapping Kernel Execution and Data Transfer 289
Overlap Using Depth-First Scheduling 289
Overlap Using Breadth-First Scheduling 293
Overlapping GPU and CPU Execution 294
Stream Callbacks 295
Summary 297
Chapter 7: Tuning Instruction-Level Primitives 299
Introducing CUDA Instructions 300
Floating-Point Instructions 301
Intrinsic and Standard Functions 303
Atomic Instructions 304
Optimizing Instructions for Your Application 306
Single-Precision vs. Double-Precision 306
Standard vs. Intrinsic Functions 309
Understanding Atomic Instructions 315
Bringing It All Together 322
Summary 324
Chapter 8: GPU-Accelerated CUDA Libraries and OpenACC 327
Introducing the CUDA Libraries 328
Supported Domains for CUDA Libraries 329
A Common Library Workflow 330
The CUSPARSE Library 332
cuSPARSE Data Storage Formats 333
Formatting Conversion with cuSPARSE 337
Demonstrating cuSPARSE 338
Important Topics in cuSPARSE Development 340
cuSPARSE Summary 341
The cuBLAS Library 341
Managing cuBLAS Data 342
Demonstrating cuBLAS 343
Important Topics in cuBLAS Development 345
cuBLAS Summary 346
The cuFFT Library 346
Using the cuFFT API 347
Demonstrating cuFFT 348
cuFFT Summary 349
The cuRAND Library 349
Choosing Pseudo- or Quasi- Random Numbers 349
Overview of the cuRAND Library 350
Demonstrating cuRAND 354
Important Topics in cuRAND Development 357
CUDA Library Features Introduced in CUDA 6 358
Drop-In CUDA Libraries 358
Multi-GPU Libraries 359
A Survey of CUDA Library Performance 361
cuSPARSE versus MKL 361
cuBLAS versus MKL BLAS 362
cuFFT versus FFTW versus MKL 363
CUDA Library Performance Summary 364
Using OpenACC 365
Using OpenACC Compute Directives 367
Using OpenACC Data Directives 375
The OpenACC Runtime API 380
Combining OpenACC and the CUDA Libraries 382
Summary of OpenACC 384
Summary 384
Chapter 9: Multi-GPU Programming 387
Moving to Multiple GPUs 388
Executing on Multiple GPUs 389
Peer-to-Peer Communication 391
Synchronizing across Multi-GPUs 392
Subdividing Computation across Multiple GPUs 393
Allocating Memory on Multiple Devices 393
Distributing Work from a Single Host Thread 394
Compiling and Executing 395
Peer-to-Peer Communication on Multiple GPUs 396
Enabling Peer-to-Peer Access 396
Peer-to-Peer Memory Copy 396
Peer-to-Peer Memory Access with Unified Virtual Addressing 398
Finite Difference on Multi-GPU 400
Stencil Calculation for 2D Wave Equation 400
Typical Patterns for Multi-GPU Programs 401
2D Stencil Computation with Multiple GPUs 403
Overlapping Computation and Communication 405
Compiling and Executing 406
Scaling Applications across GPU Clusters 409
CPU-to-CPU Data Transfer 410
GPU-to-GPU Data Transfer Using Traditional MPI 413
GPU-to-GPU Data Transfer with CUDA-aware MPI 416
Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI 417
Adjusting Message Chunk Size 418
GPU to GPU Data Transfer with GPUDirect RDMA 419
Summary 422
Chapter 10: Implementation Considerations 425
The CUDA C Development Process 426
APOD Development Cycle 426
Optimization Opportunities 429
CUDA Code Compilation 432
CUDA Error Handling 437
Profile-Driven Optimization 438
Finding Optimization Opportunities Using nvprof 439
Guiding Optimization Using nvvp 443
NVIDIA Tools Extension 446
CUDA Debugging 448
Kernel Debugging 448
Memory Debugging 456
Debugging Summary 462
A Case Study in Porting C Programs to CUDA C 462
Assessing crypt 463
Parallelizing crypt 464
Optimizing crypt 465
Deploying Crypt 472
Summary of Porting crypt 475
Summary 476
Appendix: Suggested Readings 477
Index 481
Foreword xvii
Preface xix
Introduction xxi
Chapter 1: Heterogeneous Parallel Computing with CUDA 1
Parallel Computing 2
Sequential and Parallel Programming 3
Parallelism 4
Computer Architecture 6
Heterogeneous Computing 8
Heterogeneous Architecture 9
Paradigm of Heterogeneous Computing 12
CUDA: A Platform for Heterogeneous Computing 14
Hello World from GPU 17
Is CUDA C Programming Difficult? 20
Summary 21
Chapter 2: CUDA Programming Model 23
Introducing the CUDA Programming Model 23
CUDA Programming Structure 25
Managing Memory 26
Organizing Threads 30
Launching a CUDA Kernel 36
Writing Your Kernel 37
Verifying Your Kernel 39
Handling Errors 40
Compiling and Executing 40
Timing Your Kernel 43
Timing with CPU Timer 44
Timing with nvprof 47
Organizing Parallel Threads 49
Indexing Matrices with Blocks and Threads 49
Summing Matrices with a 2D Grid and 2D Blocks 53
Summing Matrices with a 1D Grid and 1D Blocks 57
Summing Matrices with a 2D Grid and 1D Blocks 58
Managing Devices 60
Using the Runtime API to Query GPU Information 61
Determining the Best GPU 63
Using nvidia-smi to Query GPU Information 63
Setting Devices at Runtime 64
Summary 65
Chapter 3: CUDA Execution Model 67
Introducing the CUDA Execution Model 67
GPU Architecture Overview 68
The Fermi Architecture 71
The Kepler Architecture 73
Profile-Driven Optimization 78
Understanding the Nature of Warp Execution 80
Warps and Thread Blocks 80
Warp Divergence 82
Resource Partitioning 87
Latency Hiding 90
Occupancy 93
Synchronization 97
Scalability 98
Exposing Parallelism 98
Checking Active Warps with nvprof 100
Checking Memory Operations with nvprof 100
Exposing More Parallelism 101
Avoiding Branch Divergence 104
The Parallel Reduction Problem 104
Divergence in Parallel Reduction 106
Improving Divergence in Parallel Reduction 110
Reducing with Interleaved Pairs 112
Unrolling Loops 114
Reducing with Unrolling 115
Reducing with Unrolled Warps 117
Reducing with Complete Unrolling 119
Reducing with Template Functions 120
Dynamic Parallelism 122
Nested Execution 123
Nested Hello World on the GPU 124
Nested Reduction 128
Summary 132
Chapter 4: Global Memory 135
Introducing the CUDA Memory Model 136
Benefits of a Memory Hierarchy 136
CUDA Memory Model 137
Memory Management 145
Memory Allocation and Deallocation 146
Memory Transfer 146
Pinned Memory 148
Zero-Copy Memory 150
Unified Virtual Addressing 156
Unified Memory 157
Memory Access Patterns 158
Aligned and Coalesced Access 158
Global Memory Reads 160
Global Memory Writes 169
Array of Structures versus Structure of Arrays 171
Performance Tuning 176
What Bandwidth Can a Kernel Achieve? 179
Memory Bandwidth 179
Matrix Transpose Problem 180
Matrix Addition with Unified Memory 195
Summary 199
Chapter 5: Shared Memory and Constant Memory 203
Introducing CUDA Shared Memory 204
Shared Memory 204
Shared Memory Allocation 206
Shared Memory Banks and Access Mode 206
Configuring the Amount of Shared Memory 212
Synchronization 214
Checking the Data Layout of Shared Memory 216
Square Shared Memory 217
Rectangular Shared Memory 225
Reducing Global Memory Access 232
Parallel Reduction with Shared Memory 232
Parallel Reduction with Unrolling 236
Parallel Reduction with Dynamic Shared Memory 238
Effective Bandwidth 239
Coalescing Global Memory Accesses 239
Baseline Transpose Kernel 240
Matrix Transpose with Shared Memory 241
Matrix Transpose with Padded Shared Memory 245
Matrix Transpose with Unrolling 246
Exposing More Parallelism 249
Constant Memory 250
Implementing a 1D Stencil with Constant Memory 250
Comparing with the Read-Only Cache 253
The Warp Shuffle Instruction 255
Variants of the Warp Shuffle Instruction 256
Sharing Data within a Warp 258
Parallel Reduction Using the Warp Shuffle Instruction 262
Summary 264
Chapter 6: Streams and Concurrency 267
Introducing Streams and Events 268
CUDA Streams 269
Stream Scheduling 271
Stream Priorities 273
CUDA Events 273
Stream Synchronization 275
Concurrent Kernel Execution 279
Concurrent Kernels in Non-NULL Streams 279
False Dependencies on Fermi GPUs 281
Dispatching Operations with OpenMP 283
Adjusting Stream Behavior Using Environment Variables 284
Concurrency-Limiting GPU Resources 286
Blocking Behavior of the Default Stream 287
Creating Inter-Stream Dependencies 288
Overlapping Kernel Execution and Data Transfer 289
Overlap Using Depth-First Scheduling 289
Overlap Using Breadth-First Scheduling 293
Overlapping GPU and CPU Execution 294
Stream Callbacks 295
Summary 297
Chapter 7: Tuning Instruction-Level Primitives 299
Introducing CUDA Instructions 300
Floating-Point Instructions 301
Intrinsic and Standard Functions 303
Atomic Instructions 304
Optimizing Instructions for Your Application 306
Single-Precision vs. Double-Precision 306
Standard vs. Intrinsic Functions 309
Understanding Atomic Instructions 315
Bringing It All Together 322
Summary 324
Chapter 8: GPU-Accelerated CUDA Libraries and OpenACC 327
Introducing the CUDA Libraries 328
Supported Domains for CUDA Libraries 329
A Common Library Workflow 330
The CUSPARSE Library 332
cuSPARSE Data Storage Formats 333
Formatting Conversion with cuSPARSE 337
Demonstrating cuSPARSE 338
Important Topics in cuSPARSE Development 340
cuSPARSE Summary 341
The cuBLAS Library 341
Managing cuBLAS Data 342
Demonstrating cuBLAS 343
Important Topics in cuBLAS Development 345
cuBLAS Summary 346
The cuFFT Library 346
Using the cuFFT API 347
Demonstrating cuFFT 348
cuFFT Summary 349
The cuRAND Library 349
Choosing Pseudo- or Quasi- Random Numbers 349
Overview of the cuRAND Library 350
Demonstrating cuRAND 354
Important Topics in cuRAND Development 357
CUDA Library Features Introduced in CUDA 6 358
Drop-In CUDA Libraries 358
Multi-GPU Libraries 359
A Survey of CUDA Library Performance 361
cuSPARSE versus MKL 361
cuBLAS versus MKL BLAS 362
cuFFT versus FFTW versus MKL 363
CUDA Library Performance Summary 364
Using OpenACC 365
Using OpenACC Compute Directives 367
Using OpenACC Data Directives 375
The OpenACC Runtime API 380
Combining OpenACC and the CUDA Libraries 382
Summary of OpenACC 384
Summary 384
Chapter 9: Multi-GPU Programming 387
Moving to Multiple GPUs 388
Executing on Multiple GPUs 389
Peer-to-Peer Communication 391
Synchronizing across Multi-GPUs 392
Subdividing Computation across Multiple GPUs 393
Allocating Memory on Multiple Devices 393
Distributing Work from a Single Host Thread 394
Compiling and Executing 395
Peer-to-Peer Communication on Multiple GPUs 396
Enabling Peer-to-Peer Access 396
Peer-to-Peer Memory Copy 396
Peer-to-Peer Memory Access with Unified Virtual Addressing 398
Finite Difference on Multi-GPU 400
Stencil Calculation for 2D Wave Equation 400
Typical Patterns for Multi-GPU Programs 401
2D Stencil Computation with Multiple GPUs 403
Overlapping Computation and Communication 405
Compiling and Executing 406
Scaling Applications across GPU Clusters 409
CPU-to-CPU Data Transfer 410
GPU-to-GPU Data Transfer Using Traditional MPI 413
GPU-to-GPU Data Transfer with CUDA-aware MPI 416
Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI 417
Adjusting Message Chunk Size 418
GPU to GPU Data Transfer with GPUDirect RDMA 419
Summary 422
Chapter 10: Implementation Considerations 425
The CUDA C Development Process 426
APOD Development Cycle 426
Optimization Opportunities 429
CUDA Code Compilation 432
CUDA Error Handling 437
Profile-Driven Optimization 438
Finding Optimization Opportunities Using nvprof 439
Guiding Optimization Using nvvp 443
NVIDIA Tools Extension 446
CUDA Debugging 448
Kernel Debugging 448
Memory Debugging 456
Debugging Summary 462
A Case Study in Porting C Programs to CUDA C 462
Assessing crypt 463
Parallelizing crypt 464
Optimizing crypt 465
Deploying Crypt 472
Summary of Porting crypt 475
Summary 476
Appendix: Suggested Readings 477
Index 481
Preface xix
Introduction xxi
Chapter 1: Heterogeneous Parallel Computing with CUDA 1
Parallel Computing 2
Sequential and Parallel Programming 3
Parallelism 4
Computer Architecture 6
Heterogeneous Computing 8
Heterogeneous Architecture 9
Paradigm of Heterogeneous Computing 12
CUDA: A Platform for Heterogeneous Computing 14
Hello World from GPU 17
Is CUDA C Programming Difficult? 20
Summary 21
Chapter 2: CUDA Programming Model 23
Introducing the CUDA Programming Model 23
CUDA Programming Structure 25
Managing Memory 26
Organizing Threads 30
Launching a CUDA Kernel 36
Writing Your Kernel 37
Verifying Your Kernel 39
Handling Errors 40
Compiling and Executing 40
Timing Your Kernel 43
Timing with CPU Timer 44
Timing with nvprof 47
Organizing Parallel Threads 49
Indexing Matrices with Blocks and Threads 49
Summing Matrices with a 2D Grid and 2D Blocks 53
Summing Matrices with a 1D Grid and 1D Blocks 57
Summing Matrices with a 2D Grid and 1D Blocks 58
Managing Devices 60
Using the Runtime API to Query GPU Information 61
Determining the Best GPU 63
Using nvidia-smi to Query GPU Information 63
Setting Devices at Runtime 64
Summary 65
Chapter 3: CUDA Execution Model 67
Introducing the CUDA Execution Model 67
GPU Architecture Overview 68
The Fermi Architecture 71
The Kepler Architecture 73
Profile-Driven Optimization 78
Understanding the Nature of Warp Execution 80
Warps and Thread Blocks 80
Warp Divergence 82
Resource Partitioning 87
Latency Hiding 90
Occupancy 93
Synchronization 97
Scalability 98
Exposing Parallelism 98
Checking Active Warps with nvprof 100
Checking Memory Operations with nvprof 100
Exposing More Parallelism 101
Avoiding Branch Divergence 104
The Parallel Reduction Problem 104
Divergence in Parallel Reduction 106
Improving Divergence in Parallel Reduction 110
Reducing with Interleaved Pairs 112
Unrolling Loops 114
Reducing with Unrolling 115
Reducing with Unrolled Warps 117
Reducing with Complete Unrolling 119
Reducing with Template Functions 120
Dynamic Parallelism 122
Nested Execution 123
Nested Hello World on the GPU 124
Nested Reduction 128
Summary 132
Chapter 4: Global Memory 135
Introducing the CUDA Memory Model 136
Benefits of a Memory Hierarchy 136
CUDA Memory Model 137
Memory Management 145
Memory Allocation and Deallocation 146
Memory Transfer 146
Pinned Memory 148
Zero-Copy Memory 150
Unified Virtual Addressing 156
Unified Memory 157
Memory Access Patterns 158
Aligned and Coalesced Access 158
Global Memory Reads 160
Global Memory Writes 169
Array of Structures versus Structure of Arrays 171
Performance Tuning 176
What Bandwidth Can a Kernel Achieve? 179
Memory Bandwidth 179
Matrix Transpose Problem 180
Matrix Addition with Unified Memory 195
Summary 199
Chapter 5: Shared Memory and Constant Memory 203
Introducing CUDA Shared Memory 204
Shared Memory 204
Shared Memory Allocation 206
Shared Memory Banks and Access Mode 206
Configuring the Amount of Shared Memory 212
Synchronization 214
Checking the Data Layout of Shared Memory 216
Square Shared Memory 217
Rectangular Shared Memory 225
Reducing Global Memory Access 232
Parallel Reduction with Shared Memory 232
Parallel Reduction with Unrolling 236
Parallel Reduction with Dynamic Shared Memory 238
Effective Bandwidth 239
Coalescing Global Memory Accesses 239
Baseline Transpose Kernel 240
Matrix Transpose with Shared Memory 241
Matrix Transpose with Padded Shared Memory 245
Matrix Transpose with Unrolling 246
Exposing More Parallelism 249
Constant Memory 250
Implementing a 1D Stencil with Constant Memory 250
Comparing with the Read-Only Cache 253
The Warp Shuffle Instruction 255
Variants of the Warp Shuffle Instruction 256
Sharing Data within a Warp 258
Parallel Reduction Using the Warp Shuffle Instruction 262
Summary 264
Chapter 6: Streams and Concurrency 267
Introducing Streams and Events 268
CUDA Streams 269
Stream Scheduling 271
Stream Priorities 273
CUDA Events 273
Stream Synchronization 275
Concurrent Kernel Execution 279
Concurrent Kernels in Non-NULL Streams 279
False Dependencies on Fermi GPUs 281
Dispatching Operations with OpenMP 283
Adjusting Stream Behavior Using Environment Variables 284
Concurrency-Limiting GPU Resources 286
Blocking Behavior of the Default Stream 287
Creating Inter-Stream Dependencies 288
Overlapping Kernel Execution and Data Transfer 289
Overlap Using Depth-First Scheduling 289
Overlap Using Breadth-First Scheduling 293
Overlapping GPU and CPU Execution 294
Stream Callbacks 295
Summary 297
Chapter 7: Tuning Instruction-Level Primitives 299
Introducing CUDA Instructions 300
Floating-Point Instructions 301
Intrinsic and Standard Functions 303
Atomic Instructions 304
Optimizing Instructions for Your Application 306
Single-Precision vs. Double-Precision 306
Standard vs. Intrinsic Functions 309
Understanding Atomic Instructions 315
Bringing It All Together 322
Summary 324
Chapter 8: GPU-Accelerated CUDA Libraries and OpenACC 327
Introducing the CUDA Libraries 328
Supported Domains for CUDA Libraries 329
A Common Library Workflow 330
The CUSPARSE Library 332
cuSPARSE Data Storage Formats 333
Formatting Conversion with cuSPARSE 337
Demonstrating cuSPARSE 338
Important Topics in cuSPARSE Development 340
cuSPARSE Summary 341
The cuBLAS Library 341
Managing cuBLAS Data 342
Demonstrating cuBLAS 343
Important Topics in cuBLAS Development 345
cuBLAS Summary 346
The cuFFT Library 346
Using the cuFFT API 347
Demonstrating cuFFT 348
cuFFT Summary 349
The cuRAND Library 349
Choosing Pseudo- or Quasi- Random Numbers 349
Overview of the cuRAND Library 350
Demonstrating cuRAND 354
Important Topics in cuRAND Development 357
CUDA Library Features Introduced in CUDA 6 358
Drop-In CUDA Libraries 358
Multi-GPU Libraries 359
A Survey of CUDA Library Performance 361
cuSPARSE versus MKL 361
cuBLAS versus MKL BLAS 362
cuFFT versus FFTW versus MKL 363
CUDA Library Performance Summary 364
Using OpenACC 365
Using OpenACC Compute Directives 367
Using OpenACC Data Directives 375
The OpenACC Runtime API 380
Combining OpenACC and the CUDA Libraries 382
Summary of OpenACC 384
Summary 384
Chapter 9: Multi-GPU Programming 387
Moving to Multiple GPUs 388
Executing on Multiple GPUs 389
Peer-to-Peer Communication 391
Synchronizing across Multi-GPUs 392
Subdividing Computation across Multiple GPUs 393
Allocating Memory on Multiple Devices 393
Distributing Work from a Single Host Thread 394
Compiling and Executing 395
Peer-to-Peer Communication on Multiple GPUs 396
Enabling Peer-to-Peer Access 396
Peer-to-Peer Memory Copy 396
Peer-to-Peer Memory Access with Unified Virtual Addressing 398
Finite Difference on Multi-GPU 400
Stencil Calculation for 2D Wave Equation 400
Typical Patterns for Multi-GPU Programs 401
2D Stencil Computation with Multiple GPUs 403
Overlapping Computation and Communication 405
Compiling and Executing 406
Scaling Applications across GPU Clusters 409
CPU-to-CPU Data Transfer 410
GPU-to-GPU Data Transfer Using Traditional MPI 413
GPU-to-GPU Data Transfer with CUDA-aware MPI 416
Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI 417
Adjusting Message Chunk Size 418
GPU to GPU Data Transfer with GPUDirect RDMA 419
Summary 422
Chapter 10: Implementation Considerations 425
The CUDA C Development Process 426
APOD Development Cycle 426
Optimization Opportunities 429
CUDA Code Compilation 432
CUDA Error Handling 437
Profile-Driven Optimization 438
Finding Optimization Opportunities Using nvprof 439
Guiding Optimization Using nvvp 443
NVIDIA Tools Extension 446
CUDA Debugging 448
Kernel Debugging 448
Memory Debugging 456
Debugging Summary 462
A Case Study in Porting C Programs to CUDA C 462
Assessing crypt 463
Parallelizing crypt 464
Optimizing crypt 465
Deploying Crypt 472
Summary of Porting crypt 475
Summary 476
Appendix: Suggested Readings 477
Index 481
Foreword xvii
Preface xix
Introduction xxi
Chapter 1: Heterogeneous Parallel Computing with CUDA 1
Parallel Computing 2
Sequential and Parallel Programming 3
Parallelism 4
Computer Architecture 6
Heterogeneous Computing 8
Heterogeneous Architecture 9
Paradigm of Heterogeneous Computing 12
CUDA: A Platform for Heterogeneous Computing 14
Hello World from GPU 17
Is CUDA C Programming Difficult? 20
Summary 21
Chapter 2: CUDA Programming Model 23
Introducing the CUDA Programming Model 23
CUDA Programming Structure 25
Managing Memory 26
Organizing Threads 30
Launching a CUDA Kernel 36
Writing Your Kernel 37
Verifying Your Kernel 39
Handling Errors 40
Compiling and Executing 40
Timing Your Kernel 43
Timing with CPU Timer 44
Timing with nvprof 47
Organizing Parallel Threads 49
Indexing Matrices with Blocks and Threads 49
Summing Matrices with a 2D Grid and 2D Blocks 53
Summing Matrices with a 1D Grid and 1D Blocks 57
Summing Matrices with a 2D Grid and 1D Blocks 58
Managing Devices 60
Using the Runtime API to Query GPU Information 61
Determining the Best GPU 63
Using nvidia-smi to Query GPU Information 63
Setting Devices at Runtime 64
Summary 65
Chapter 3: CUDA Execution Model 67
Introducing the CUDA Execution Model 67
GPU Architecture Overview 68
The Fermi Architecture 71
The Kepler Architecture 73
Profile-Driven Optimization 78
Understanding the Nature of Warp Execution 80
Warps and Thread Blocks 80
Warp Divergence 82
Resource Partitioning 87
Latency Hiding 90
Occupancy 93
Synchronization 97
Scalability 98
Exposing Parallelism 98
Checking Active Warps with nvprof 100
Checking Memory Operations with nvprof 100
Exposing More Parallelism 101
Avoiding Branch Divergence 104
The Parallel Reduction Problem 104
Divergence in Parallel Reduction 106
Improving Divergence in Parallel Reduction 110
Reducing with Interleaved Pairs 112
Unrolling Loops 114
Reducing with Unrolling 115
Reducing with Unrolled Warps 117
Reducing with Complete Unrolling 119
Reducing with Template Functions 120
Dynamic Parallelism 122
Nested Execution 123
Nested Hello World on the GPU 124
Nested Reduction 128
Summary 132
Chapter 4: Global Memory 135
Introducing the CUDA Memory Model 136
Benefits of a Memory Hierarchy 136
CUDA Memory Model 137
Memory Management 145
Memory Allocation and Deallocation 146
Memory Transfer 146
Pinned Memory 148
Zero-Copy Memory 150
Unified Virtual Addressing 156
Unified Memory 157
Memory Access Patterns 158
Aligned and Coalesced Access 158
Global Memory Reads 160
Global Memory Writes 169
Array of Structures versus Structure of Arrays 171
Performance Tuning 176
What Bandwidth Can a Kernel Achieve? 179
Memory Bandwidth 179
Matrix Transpose Problem 180
Matrix Addition with Unified Memory 195
Summary 199
Chapter 5: Shared Memory and Constant Memory 203
Introducing CUDA Shared Memory 204
Shared Memory 204
Shared Memory Allocation 206
Shared Memory Banks and Access Mode 206
Configuring the Amount of Shared Memory 212
Synchronization 214
Checking the Data Layout of Shared Memory 216
Square Shared Memory 217
Rectangular Shared Memory 225
Reducing Global Memory Access 232
Parallel Reduction with Shared Memory 232
Parallel Reduction with Unrolling 236
Parallel Reduction with Dynamic Shared Memory 238
Effective Bandwidth 239
Coalescing Global Memory Accesses 239
Baseline Transpose Kernel 240
Matrix Transpose with Shared Memory 241
Matrix Transpose with Padded Shared Memory 245
Matrix Transpose with Unrolling 246
Exposing More Parallelism 249
Constant Memory 250
Implementing a 1D Stencil with Constant Memory 250
Comparing with the Read-Only Cache 253
The Warp Shuffle Instruction 255
Variants of the Warp Shuffle Instruction 256
Sharing Data within a Warp 258
Parallel Reduction Using the Warp Shuffle Instruction 262
Summary 264
Chapter 6: Streams and Concurrency 267
Introducing Streams and Events 268
CUDA Streams 269
Stream Scheduling 271
Stream Priorities 273
CUDA Events 273
Stream Synchronization 275
Concurrent Kernel Execution 279
Concurrent Kernels in Non-NULL Streams 279
False Dependencies on Fermi GPUs 281
Dispatching Operations with OpenMP 283
Adjusting Stream Behavior Using Environment Variables 284
Concurrency-Limiting GPU Resources 286
Blocking Behavior of the Default Stream 287
Creating Inter-Stream Dependencies 288
Overlapping Kernel Execution and Data Transfer 289
Overlap Using Depth-First Scheduling 289
Overlap Using Breadth-First Scheduling 293
Overlapping GPU and CPU Execution 294
Stream Callbacks 295
Summary 297
Chapter 7: Tuning Instruction-Level Primitives 299
Introducing CUDA Instructions 300
Floating-Point Instructions 301
Intrinsic and Standard Functions 303
Atomic Instructions 304
Optimizing Instructions for Your Application 306
Single-Precision vs. Double-Precision 306
Standard vs. Intrinsic Functions 309
Understanding Atomic Instructions 315
Bringing It All Together 322
Summary 324
Chapter 8: GPU-Accelerated CUDA Libraries and OpenACC 327
Introducing the CUDA Libraries 328
Supported Domains for CUDA Libraries 329
A Common Library Workflow 330
The cuSPARSE Library 332
cuSPARSE Data Storage Formats 333
Formatting Conversion with cuSPARSE 337
Demonstrating cuSPARSE 338
Important Topics in cuSPARSE Development 340
cuSPARSE Summary 341
The cuBLAS Library 341
Managing cuBLAS Data 342
Demonstrating cuBLAS 343
Important Topics in cuBLAS Development 345
cuBLAS Summary 346
The cuFFT Library 346
Using the cuFFT API 347
Demonstrating cuFFT 348
cuFFT Summary 349
The cuRAND Library 349
Choosing Pseudo- or Quasi-Random Numbers 349
Overview of the cuRAND Library 350
Demonstrating cuRAND 354
Important Topics in cuRAND Development 357
CUDA Library Features Introduced in CUDA 6 358
Drop-In CUDA Libraries 358
Multi-GPU Libraries 359
A Survey of CUDA Library Performance 361
cuSPARSE versus MKL 361
cuBLAS versus MKL BLAS 362
cuFFT versus FFTW versus MKL 363
CUDA Library Performance Summary 364
Using OpenACC 365
Using OpenACC Compute Directives 367
Using OpenACC Data Directives 375
The OpenACC Runtime API 380
Combining OpenACC and the CUDA Libraries 382
Summary of OpenACC 384
Summary 384
Chapter 9: Multi-GPU Programming 387
Moving to Multiple GPUs 388
Executing on Multiple GPUs 389
Peer-to-Peer Communication 391
Synchronizing across Multiple GPUs 392
Subdividing Computation across Multiple GPUs 393
Allocating Memory on Multiple Devices 393
Distributing Work from a Single Host Thread 394
Compiling and Executing 395
Peer-to-Peer Communication on Multiple GPUs 396
Enabling Peer-to-Peer Access 396
Peer-to-Peer Memory Copy 396
Peer-to-Peer Memory Access with Unified Virtual Addressing 398
Finite Difference on Multi-GPU 400
Stencil Calculation for 2D Wave Equation 400
Typical Patterns for Multi-GPU Programs 401
2D Stencil Computation with Multiple GPUs 403
Overlapping Computation and Communication 405
Compiling and Executing 406
Scaling Applications across GPU Clusters 409
CPU-to-CPU Data Transfer 410
GPU-to-GPU Data Transfer Using Traditional MPI 413
GPU-to-GPU Data Transfer with CUDA-Aware MPI 416
Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI 417
Adjusting Message Chunk Size 418
GPU-to-GPU Data Transfer with GPUDirect RDMA 419
Summary 422
Chapter 10: Implementation Considerations 425
The CUDA C Development Process 426
APOD Development Cycle 426
Optimization Opportunities 429
CUDA Code Compilation 432
CUDA Error Handling 437
Profile-Driven Optimization 438
Finding Optimization Opportunities Using nvprof 439
Guiding Optimization Using nvvp 443
NVIDIA Tools Extension 446
CUDA Debugging 448
Kernel Debugging 448
Memory Debugging 456
Debugging Summary 462
A Case Study in Porting C Programs to CUDA C 462
Assessing crypt 463
Parallelizing crypt 464
Optimizing crypt 465
Deploying crypt 472
Summary of Porting crypt 475
Summary 476
Appendix: Suggested Readings 477
Index 481