# Practical programming in CUDA

### Massimiliano Piscozzi

Università degli Studi di Milano

June 2008

# Outline

Introduction

Optimization strategies

Parallel operations

References

# Parallelism?

#### 1. Functional parallelism

"Different parts of the data are processed concurrently by separate functional sections on different computational units"

- No explicit association between kernels and multiprocessors
- No simultaneous execution of kernels
- Big-kernel approach = waste of shared resources & need for a block synchronization mechanism

#### 2. Data parallelism

"Data are processed in parallel by distributing elements across different processing units, all of which perform *more or less* the same algorithmic function"

- Streams and kernels approach
  - Stream = (large) set of homogeneous data
  - Kernel = function transforming one or more input streams into one or more output streams
- Kernel launch as synchronization point
- More is better than less...
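The streams-and-kernels view above can be sketched as two small kernels chained through an intermediate stream. This is only an illustrative sketch: the names `scaleKernel`, `offsetKernel` and `scaleThenOffset` are hypothetical, error checking is omitted, and it relies on the fact that kernels launched on the same (default) CUDA stream run in launch order, so the second kernel only starts once the first has produced its whole output stream.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel 1: maps the input stream to an output stream, out[i] = a * in[i].
__global__ void scaleKernel(const float* in, float* out, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

// Kernel 2: consumes the stream produced by kernel 1, out[i] = in[i] + b.
__global__ void offsetKernel(const float* in, float* out, float b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + b;
}

// Hypothetical host helper: computes a * x + b with two kernel launches.
std::vector<float> scaleThenOffset(const std::vector<float>& x, float a, float b)
{
    const int n = static_cast<int>(x.size());
    std::vector<float> result(n);
    if (n == 0) return result;

    const size_t bytes = n * sizeof(float);
    float* dIn = nullptr;
    float* dTmp = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dTmp, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, x.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    // The boundary between the two launches acts as a global synchronization
    // point: every element of dTmp is written before offsetKernel reads it.
    scaleKernel<<<blocks, threads>>>(dIn, dTmp, a, n);
    offsetKernel<<<blocks, threads>>>(dTmp, dOut, b, n);

    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dTmp);
    cudaFree(dOut);
    return result;
}
```

Splitting the work into two kernels avoids any need for synchronization between blocks inside a single large kernel, at the cost of one extra pass through device memory.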

- The importance of optimization
  - 1. High-level programming language
  - 2. Low-level access to hardware

#### Parallelism in CUDA

- 1. Block scheduling mechanism (implicit)
- 2. Thread warps, SIMD groups (explicit)
- 3. Shared memory access (explicit)

#### Essential ingredients for writing an efficient CUDA kernel...

- 1. No-branching code
- 2. Data read/write optimization
  - Prefetching memory
  - Memory coalescing
- 3. Bank conflict resolution
- ...and a good parallel algorithm!
  - PRAM CRCW (Parallel Random Access Machine, Concurrent Read Concurrent Write) model
  - No message-passing model

#### No-branching code = from SPMD to SIMD

- Branching inside a warp = serialization
  - No divergence if the branch granularity is a whole multiple of the warp size
- Use a multiple of 32 threads per block
  - Prefer data padding to special cases
- Low control-flow overhead
  - Small loops unrolled
- Think in parallel!
  - Do not rely on any ordering between warps (use \_\_syncthreads())
  - \_\_syncthreads() in a divergent branch = deadlock
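A minimal sketch of the first two points, under stated assumptions: the block size `kBlock`, the kernel `warpUniformBranch` and the host helper `runWarpUniform` are hypothetical names, the input is assumed non-empty, and error checking is omitted. The data is padded up to a multiple of the block size so the kernel needs no bounds check, and the branch condition depends only on the warp index, so all 32 threads of a warp take the same path.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 128;  // threads per block, a multiple of the warp size (32)

// Round n up to a multiple of the block size (data padding instead of a special case).
inline int paddedSize(int n) { return (n + kBlock - 1) / kBlock * kBlock; }

// Branch granularity = one warp: the condition is constant within each warp,
// so no warp has to execute both paths.
// A condition such as (threadIdx.x % 2 == 0) would instead split every warp.
__global__ void warpUniformBranch(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;
    if (warp % 2 == 0) data[i] *= 2.0f;  // taken by whole warps 0, 2, ...
    else               data[i] += 1.0f;  // taken by whole warps 1, 3, ...
}

// Hypothetical host helper: pads the input, runs the kernel, drops the padding.
std::vector<float> runWarpUniform(const std::vector<float>& in)
{
    const int n = static_cast<int>(in.size());
    const int padded = paddedSize(n);
    std::vector<float> host(in);
    host.resize(padded, 0.0f);           // padding elements are computed but ignored

    float* dev = nullptr;
    const size_t bytes = padded * sizeof(float);
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    warpUniformBranch<<<padded / kBlock, kBlock>>>(dev);
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);

    host.resize(n);
    return host;
}
```

The padded elements cost a little extra memory and arithmetic, but every warp in the grid then runs the same instruction sequence without a tail special case.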

## Prefetching memory

- Maximize arithmetic intensity (many calculations per memory access)
- Latency hiding
  - Help the (young) compiler do a better job
    - Memory instruction followed by independent ALU instructions (if possible)
  - Sometimes it's better to recompute than to cache
  - Exploit block scheduling mechanism = use more than 16 blocks
  - Use prefetching strategy (manual caching)





# Prefetching strategy

- 1. Use threads to cooperatively move data from *device memory* to *shared memory*
- 2. Barrier synchronization
- 3. Use threads to process data
- 4. ...
- 5. Barrier synchronization
- 6. Use threads to cooperatively write results to device memory
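The six steps above can be sketched as a single kernel. This is an illustrative sketch, not the author's code: the tile size `kTile`, the kernel `reverseTiles` and the helper `runReverseTiles` are hypothetical, the input length is assumed to be a non-zero multiple of `kTile`, and error checking is omitted. Each block doubles its tile and reverses it in shared memory, which forces a thread to write a slot that another thread will later read and so makes both barriers necessary.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kTile = 256;  // elements per block = threads per block

__global__ void reverseTiles(const float* in, float* out)
{
    __shared__ float tile[kTile];    // prefetched input
    __shared__ float result[kTile];  // processed output

    const int t = threadIdx.x;
    const int i = blockIdx.x * kTile + t;

    tile[t] = in[i];                         // 1. cooperative load, device -> shared
    __syncthreads();                         // 2. barrier: the tile is complete
    result[kTile - 1 - t] = 2.0f * tile[t];  // 3. process (writes another thread's slot)
    __syncthreads();                         // 5. barrier: all results are in place
    out[i] = result[t];                      // 6. cooperative write, shared -> device
}

// Hypothetical host helper; assumes in.size() is a non-zero multiple of kTile.
std::vector<float> runReverseTiles(const std::vector<float>& in)
{
    const int n = static_cast<int>(in.size());
    const size_t bytes = n * sizeof(float);
    std::vector<float> out(n);

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);
    reverseTiles<<<n / kTile, kTile>>>(dIn, dOut);
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Both barriers sit at the top level of the kernel, outside any branch, so every thread of the block reaches them.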





### Memory coalescing

- If per-thread memory accesses form a contiguous range of addresses, they are coalesced into a single access
  - Coalesced
    - fewer memory accesses
    - bigger data transfers (32-bit, 64-bit, 128-bit instructions)
  - Non-coalesced
    - serialization of memory operations
- Memory alignment
  - Explicit alignment of custom types (\_\_align\_\_ keyword)
  - Prefer Structure of Arrays (SoA) to Array of Structures (AoS)
- Use the prefetching strategy to do coalesced reads/writes...
- ...and use thread cooperation to permute data in shared memory
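A small sketch of the AoS versus SoA point, assuming a hypothetical `Particle` type and hypothetical kernel and helper names; error checking is omitted. Both kernels compute the same result. In the AoS version neighbouring threads read words that are 16 bytes apart, while in the SoA version they read adjacent 4-byte words, which is the contiguous pattern that coalescing needs.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Array-of-Structures element, explicitly aligned to 16 bytes.
struct __align__(16) Particle { float x, y, z, w; };

// AoS: thread i touches p[i].x, so consecutive threads are 16 bytes apart.
__global__ void shiftXAoS(const Particle* p, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p[i].x + 1.0f;
}

// SoA: thread i touches x[i], so consecutive threads read consecutive words.
__global__ void shiftXSoA(const float* x, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i] + 1.0f;
}

// Hypothetical host helper: runs one of the two layouts on the same x values.
std::vector<float> runShiftX(const std::vector<float>& xs, bool useSoA)
{
    const int n = static_cast<int>(xs.size());
    std::vector<float> out(n);
    if (n == 0) return out;
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;

    float* dOut = nullptr;
    cudaMalloc(&dOut, n * sizeof(float));
    if (useSoA) {
        float* dX = nullptr;
        cudaMalloc(&dX, n * sizeof(float));
        cudaMemcpy(dX, xs.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        shiftXSoA<<<blocks, threads>>>(dX, dOut, n);
        cudaFree(dX);  // implicitly waits for the kernel to finish
    } else {
        std::vector<Particle> ps(n);
        for (int i = 0; i < n; ++i) { ps[i].x = xs[i]; ps[i].y = ps[i].z = ps[i].w = 0.0f; }
        Particle* dP = nullptr;
        cudaMalloc(&dP, n * sizeof(Particle));
        cudaMemcpy(dP, ps.data(), n * sizeof(Particle), cudaMemcpyHostToDevice);
        shiftXAoS<<<blocks, threads>>>(dP, dOut, n);
        cudaFree(dP);
    }
    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return out;
}
```

How large the difference is in practice depends on the hardware generation, so it is worth measuring both layouts on the target GPU.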

## Shared memory & Bank conflicts

#### Parallel shared memory access

- Many threads access memory; memory is divided into banks
- At every cycle, each bank can service one address
- Successive 32-bit words = successive banks (16 banks)

#### Bank conflicts

- Conflicting accesses are serialized
- Conflict = same bank! (not same address)
- Conflicts can occur only inside a SIMD group
- No conflict if...
  - ...all threads access different banks
  - ...all threads access the identical address (broadcast, global data)

#### Bank conflict resolution

- Explicit stride based on tid (the thread index)
- Use more shared memory (e.g. pad arrays)
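One common way to trade extra shared memory for conflict-free access is to pad a 2D tile by one column. The sketch below is an assumption-laden illustration, not taken from the slides: it uses a tiled matrix transpose, the names `transposeTile` and `runTranspose` are hypothetical, the tile width `kDim = 16` matches the 16 banks mentioned above, and error checking is omitted. Without the padding, the column read `tile[threadIdx.x][threadIdx.y]` has a stride of 16 words between consecutive threads, so all of them hit the same bank; with a row length of 17 words the stride becomes 17, which maps consecutive threads to consecutive banks.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kDim = 16;  // tile width, chosen equal to the assumed number of banks

// Transposes a height x width matrix (row-major) into a width x height matrix.
__global__ void transposeTile(const float* in, float* out, int width, int height)
{
    __shared__ float tile[kDim][kDim + 1];  // +1 column of padding per row

    int x = blockIdx.x * kDim + threadIdx.x;
    int y = blockIdx.y * kDim + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // row-wise read from device memory

    __syncthreads();  // outside any branch: reached by every thread of the block

    x = blockIdx.y * kDim + threadIdx.x;  // coordinates in the transposed matrix
    y = blockIdx.x * kDim + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // column read from shared memory
}

// Hypothetical host helper; assumes width > 0 and height > 0.
std::vector<float> runTranspose(const std::vector<float>& in, int width, int height)
{
    const size_t bytes = in.size() * sizeof(float);
    std::vector<float> out(in.size());

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(kDim, kDim);
    dim3 blocks((width + kDim - 1) / kDim, (height + kDim - 1) / kDim);
    transposeTile<<<blocks, threads>>>(dIn, dOut, width, height);

    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

On GPUs with a different number of banks the tile width and padding would need to be chosen accordingly; the result is the same either way, only the access cost changes.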

## Bank conflicts

[Figure: four shared-memory access patterns, captioned "No conflicts", "No conflicts", "2-way conflicts" and "8-way conflicts"]

### Data-Parallel building blocks

- Data-parallel operations
  - Stream: $S = [a_0, a_1, \dots, a_{n-1}]$
  - 1. Map
    - Local function, $f$
    - $map(S, f) = [f(a_0), f(a_1), \dots, f(a_{n-1})]$
    - Excellent data-parallelism, no thread communication
  - 2. Reduce
    - Binary associative operator, $\otimes$
    - $reduce(S, \otimes) = a_0 \otimes a_1 \otimes \dots \otimes a_{n-1}$
    - Pyramidal construction
    - $O(\log_2 n)$ steps, $O(n)$ work
  - 3. Scan
    - Binary associative operator, $\otimes$, with identity element $I$
    - Inclusive $scan(S, \otimes) = [a_0, (a_0 \otimes a_1), \dots, (a_0 \otimes a_1 \otimes \dots \otimes a_{n-1})]$
    - Exclusive $scan(S, \otimes) = [I, a_0, \dots, (a_0 \otimes a_1 \otimes \dots \otimes a_{n-2})]$
    - Common algorithmic pattern: the computation seems inherently sequential, but can be efficiently implemented in parallel
    - $O(\log_2 n)$ steps, $O(n)$ work
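The pyramidal construction of reduce can be sketched for the sum operator as follows. This is a minimal sketch under assumptions: `blockSum`, `reduceSum` and the block size `kReduceBlock` (a power of two) are hypothetical names, the per-block partial sums are simply added on the host, and error checking is omitted. Each pass of the loop halves the number of active threads, giving $\log_2$ of the block size steps per block.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kReduceBlock = 256;  // threads per block, must be a power of two

// Each block reduces kReduceBlock elements to one partial sum.
__global__ void blockSum(const float* in, float* partial, int n)
{
    __shared__ float s[kReduceBlock];
    const int t = threadIdx.x;
    const int i = blockIdx.x * kReduceBlock + t;

    s[t] = (i < n) ? in[i] : 0.0f;  // pad with 0, the identity of +
    __syncthreads();

    // Pyramid: 256 -> 128 -> 64 -> ... -> 1 active threads.
    for (int stride = kReduceBlock / 2; stride > 0; stride /= 2) {
        if (t < stride) s[t] += s[t + stride];
        __syncthreads();  // kept outside the branch so all threads reach it
    }
    if (t == 0) partial[blockIdx.x] = s[0];
}

// Hypothetical host helper: sums the per-block partial results on the CPU.
float reduceSum(const std::vector<float>& in)
{
    const int n = static_cast<int>(in.size());
    if (n == 0) return 0.0f;
    const int blocks = (n + kReduceBlock - 1) / kReduceBlock;

    float* dIn = nullptr;
    float* dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kReduceBlock>>>(dIn, dPartial, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

For large inputs the final host loop could be replaced by a second launch of the same kernel on the partial sums; the host loop is used here only to keep the sketch short.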

## The importance of scan

M. Harris, S. Sengupta, J. Owens: Parallel Prefix Sum (Scan) with CUDA. GPU Gems 3, Hubert Nguyen, ed., Addison-Wesley, August 2007

#### Applications:

- Sorting
- Stream compaction
- Building data structures (trees and summed-area tables)
- CUDPP: CUDA Data Parallel Primitives Library
  - http://www.gpgpu.org/developer/cudpp/
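To make the scan primitive behind these applications concrete, here is a minimal single-block inclusive scan for integer addition. It is a sketch under assumptions, not the CUDPP implementation: it uses the simple doubling-offset formulation, which takes $\log_2 n$ steps but performs $O(n \log n)$ additions rather than the $O(n)$ of a work-efficient scan, it handles at most `kScanSize` elements, the names `inclusiveScanBlock` and `inclusiveScan` are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kScanSize = 256;  // maximum number of elements, one thread each

__global__ void inclusiveScanBlock(const int* in, int* out, int n)
{
    __shared__ int buf[2][kScanSize];  // double buffer: read one, write the other
    const int t = threadIdx.x;
    int cur = 0;

    buf[cur][t] = (t < n) ? in[t] : 0;  // pad with 0, the identity of +
    __syncthreads();

    // After the pass with a given offset, element t holds the sum of the
    // last 2 * offset inputs ending at t (or of all inputs up to t).
    for (int offset = 1; offset < kScanSize; offset *= 2) {
        const int nxt = 1 - cur;
        buf[nxt][t] = (t >= offset) ? buf[cur][t] + buf[cur][t - offset]
                                    : buf[cur][t];
        cur = nxt;
        __syncthreads();
    }
    if (t < n) out[t] = buf[cur][t];
}

// Hypothetical host helper; assumes 0 < in.size() <= kScanSize.
std::vector<int> inclusiveScan(const std::vector<int>& in)
{
    const int n = static_cast<int>(in.size());
    std::vector<int> out(n);

    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    inclusiveScanBlock<<<1, kScanSize>>>(dIn, dOut, n);
    cudaMemcpy(out.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

For example, scanning `[3, 1, 7, 0, 4, 1, 6, 3]` yields `[3, 4, 11, 11, 15, 16, 22, 25]`. In stream compaction, an exclusive scan over 0/1 keep-flags gives each surviving element its output position.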

# Example: radix-sort



## Example: pointers



### Some references

- Pre-CUDA, but useful
  - A. Lefohn: Glift: Generic Data Structures for Graphics Hardware, PhD thesis, Computer Science, University of California, Davis, September 2006.
  - M. Kass, A. Lefohn, J. D. Owens: Interactive Depth of Field Using Simulated Diffusion, Pixar Animation Studios, January 2006.
- CUDA references
  - NVIDIA website & CUDA forum
  - Google: CUDA, G80 keywords

# Not only CUDA

- CELL parallel architecture
  - IBM website
    - http://www.research.ibm.com/cell/
  - Cell Broadband Engine
    - http://cell.scei.co.jp/e\_download.html
  - Multicore Programming Primer (MIT & Playstation3)
    - http://cag.csail.mit.edu/ps3/


