# CUDA architecture

#### Massimiliano Piscozzi

Università degli Studi di Milano

June 2008



-

・ロン ・回と ・ヨン・

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              |       |                  |                   |
|              |       |                  |                   |

## Outline

Introduction

GPGPU

**GPU** Architecture

Programming model

Massimiliano Piscozzi

< ≣ ▶

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              |       |                  |                   |
|              |       |                  |                   |

# Outline

#### Introduction

GPGPU

**GPU** Architecture

Programming model

<□▶ < 큔▶ < 클▶ < 클▶ 글 ∽) Q (~ Massimiliano Piscozzi

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
| 000          |       |                  |                   |
|              |       |                  |                   |

## What's CUDA?

#### CUDA = Compute Unified Device Architecture

- GPGPU technology
  - 1. Hardware technology
  - 2. Software technology

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
| 000          |       |                  |                   |
|              |       |                  |                   |

#### CUDA: the hardware side

- GeForce / Quadro (8-series) graphics card
  - Desktops, notebooks



- ▶ Tesla (C/D/S-870) high performance computing (HPC) solution
  - Workstations, servers, clusters



< □ > < A > >

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
| 000          |       |                  |                   |
|              |       |                  |                   |

#### CUDA: the software side



- CUDA software stack
  - 1. Hardware layer
  - 2. Application Programming Interface (API)
    - C language extension
  - 3. Higher level mathematical/programming libraries
    - ► CUBLAS, CUFTT, CUDPP, ...

4 3 b.

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              |       |                  |                   |
|              |       |                  |                   |

## Outline

Introduction

GPGPU

GPU Architecture

Programming model

Massimiliano Piscozzi

・ロン ・回と ・ヨン・

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              | 00000 |                  |                   |
|              |       |                  |                   |

#### CPU vs GPU Performance



- 1. Why GPUs are so fast?
- 2. Why can't we replace CPUs with GPUs?
  - When can we use the GPU instead of the CPU?

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              | 00000 |                  |                   |
|              |       |                  |                   |

## Real-time rendering

Graphics hardware enables real-time rendering





- Main goal
  - Transform a collection of 3D primitives (triangles, lines, points) ...
  - ... into an array of pixels

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              | 00000 |                  |                   |
|              |       |                  |                   |

## Logical graphics pipeline



- Very close to the graphics libraries (OpenGL, DirectX) pipeline
  - Vertex transformation  $\approx$  perspective projection
  - Fragment operations  $\approx$  shaders evaluation

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              | 00000 |                  |                   |
|              |       |                  |                   |

## Semi-fixed graphics pipeline



- Two programmable units
  - 1. Vertex processor
  - 2. Fragment processor
- The raster unit is fixed

-

・ロト ・ ア・・ ヨト ・

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              | 00000 |                  |                   |
|              |       |                  |                   |

#### CPU vs GPU architecture

► GPUs are specialized for highly parallel, compute-intensive computation

- 1. Same computation = lower requirement for flow control
- 2. Arithmetic intensity (memory access latency hiding) vs big data caches



SIMD (Single Instruction Multiple Data) architecture

- ₹ 🖬 🕨

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              |       |                  |                   |
|              |       |                  |                   |

# Outline

Introduction

GPGPU

**GPU** Architecture

Programming model

Massimiliano Piscozzi

・ロン ・回と ・ヨン・

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              |       | •0000000         |                   |
|              |       |                  |                   |

#### Computer organization



The graphics card can be used as a coprocessor

(ロ) (部) (注) (注)

| Introduction | <b>GPGPU</b> | GPU Architecture | Programming model |
|--------------|--------------|------------------|-------------------|
| 000          | 00000        | 0●000000         |                   |
|              |              |                  |                   |

### CPU-GPU cooperation

- CPU-GPU communication via PCI Express bus
- CPU and GPU each have their own memory spaces



- CPU is the host, GPU is the device
  - 1. CPU sends data to the GPU
  - 2. GPU processes data
  - 3. CPU copies data back from the GPU

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              |       | 0000000          |                   |
|              |       |                  |                   |

#### GeForce 6-series architecture



- Specialized units (SIMD architecture)
  - 1. Vertex processors
  - 2. Fragment processors
- GPU memory interface up to 35 GB/s



• = • •

< □ > < A > >

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              |       | 0000000          |                   |
|              |       |                  |                   |

## Legacy GPGPU approach



General purpose applications must be mapped on the graphical pipeline

- 1. GPGPU algorithms = multi-pass rendering
- 2. Algorithms written using shading languages (GLSL, CG, ...)

(日) (同) (三) (三)

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              |       | 0000000          |                   |
|              |       |                  |                   |

#### The unified architecture

Specialized units vs general purpose processors



#### Unified design

- Better workload balancing
- (More) independent of the logical pipeline

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              |       | 00000000         |                   |
|              |       |                  |                   |

#### GeForce 8-series architecture



- General purpose multiprocessors (SIMD architecture)
  - No Vertex / Fragment specialization
- GPU memory interface up to 90 GB/s

A B A B A
A
B
A
A
B
A
A
B
A
A
B
A
A
B
A
A
B
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              |       | 00000000         |                   |
|              |       |                  |                   |

## Multiprocessors

- Up to 16 multiprocessors per card
- Each multiprocessor can execute a warp of threads
  - Lightweight threads
  - ► Warp = 32 threads
- SIMD thread execution

| Multipro                   | cessor N              |      |             |                     |
|----------------------------|-----------------------|------|-------------|---------------------|
| Multiproc                  | essor 2               |      |             |                     |
| Multiproce:<br>Processor 1 | ssor 1<br>Processor 2 | •••• | Processor M | Instruction<br>Unit |
|                            |                       |      |             |                     |

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              |       | 0000000          |                   |
|              |       |                  |                   |

## Memory architecture

- One set of local 32-bit registers per-processor
- A shared memory, shared by all the processors
- A read-only constant cache, to speeds-up reads from the constant memory space
- A read-only texture cache,to speeds-up reads from the constant memory space



| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              |       |                  |                   |
|              |       |                  |                   |

# Outline

Introduction

GPGPU

**GPU** Architecture

Programming model

| Introduction | <b>GPGPU</b> | GPU Architecture | Programming model |
|--------------|--------------|------------------|-------------------|
| 000          | 00000        |                  | ●00000            |
|              |              |                  |                   |

# Blocks and grids

- Block = one-, two- or threedimensional array of threads
- Grid = one-, two- or threedimensional array of blocks

| Device      |                                   |                  |                  |               |  |  |  |
|-------------|-----------------------------------|------------------|------------------|---------------|--|--|--|
|             | Grid 1                            |                  |                  |               |  |  |  |
|             | Biock Biock B<br>(0, 0) (1, 0) (2 |                  |                  |               |  |  |  |
|             | Block<br>(0, 1)                   | Block<br>(1, 1)  |                  | lock<br>2, 1) |  |  |  |
|             |                                   |                  |                  |               |  |  |  |
| Block (1, 1 | 1)                                |                  |                  |               |  |  |  |
|             | hread<br>1,0) Thread<br>(2,0)     | Thread<br>(3,0)  | Thread<br>(4, 0) |               |  |  |  |
|             | hread Thread (2, 1)               | Thread<br>(3, 1) | Thread<br>(4, 1) |               |  |  |  |
|             | hread Thread (2, 2)               | Thread<br>(3, 2) | Thread<br>(4, 2) |               |  |  |  |

< A

| Introduction | <b>GPGPU</b> | GPU Architecture | Programming model |
|--------------|--------------|------------------|-------------------|
| 000          | 00000        |                  | ○●○○○○            |
|              |              |                  |                   |

### Threads batching

- A block is processed by only one multiprocessor
  - Each block is split into warps (consecutive IDs)
- Several blocks can be processed by the same multiprocessor concurrently
  - Registers and shared memory
- No-synchronization mechanism between blocks



-

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              |       |                  | 000000            |
|              |       |                  |                   |

## Programming model



(日) (四) (王) (王)

| Introduction | GPGPU | GPU Architecture | Programming model |
|--------------|-------|------------------|-------------------|
|              |       |                  | 000000            |
|              |       |                  |                   |

## Gather and scatter



Gather



Scatter

メロト メロト メヨト メ

| Introduction | <b>GPGPU</b> | GPU Architecture | Programming model |
|--------------|--------------|------------------|-------------------|
| 000          | 00000        |                  | 0000●0            |
|              |              |                  |                   |

## PTX Code



- 1. CUDA Applications written in (extended) C language
- 2. NVCC: NVIDIA CUDA compiler based on Open64
- 3. PTX = Parallel Thread eXecution

< □ > < A > >

< ∃ > ...

| Introduction | <b>GPGPU</b> | GPU Architecture | Programming model |
|--------------|--------------|------------------|-------------------|
| 000          | 00000        |                  | 00000●            |
|              |              |                  |                   |

## Extended C language

- Explicit GPU memory allocation (only from CPU!)
  - cudaMalloc(...)
  - cudaFree()
- Memory copy between host and device
  - cudaMemcpy(...)
  - cudaMemcpy2D(...)
- Function execution on GPU
  - \_\_global\_\_ void myKernelFunction(...);
- Explicit shared memory allocation
  - \_\_shared\_\_\_ int mySharedVariable;
- Kernel launch (CPU  $\rightarrow$  GPU)
  - myKernelFunc < < < gridSize, blockSize, sharedMem > > >

< ロ > < 同 > < 三 > <