

Digital Engineering • Universität Potsdam

## Parallel Programming and Heterogeneous Computing

E2 - Summary

Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel and Andreas Polze

Operating Systems and Middleware Group



#### **Course Topics**

- A. The Parallelization Problem
  - Power wall, memory wall, Moore's law
  - Terminology and metrics
- B. Shared Memory Parallelism
  - Theory of concurrency, hardware today and in the past
  - Programming models, optimization, profiling
- C. Heterogeneous Computing
  - On-Chip Accelerators (e.g. SIMD, special purpose accelerators, etc.)
  - External Accelerators (e.g. GPUs, FPGAs, etc.)
- D. Shared Nothing Parallelism
  - Theory of concurrency, hardware today and in the past
  - Programming models, optimization, profiling

ParProg20 E2 Summary




## A: Why Parallel?, Terminology, Hardware, Metrics, Workloads, Foster's Methodology

## Moore's Law vs. Walls: Speed, Power, Memory, ILP





## [Pfister1998] Three Ways of Doing Things Faster

- Work Harder (execution capacity)
- Work Smarter (optimization)
- Get Help (parallelization)

**Workload**: collection of operations that are executed to produce a desired result (~ Program, Application)

**Execution Unit**: facility that is capable of executing the operations of a workload

ParProg20 A1 Terminology

Lukas Wenzel

#### An Important Distinction



#### Concurrency

Capability of a machine to have multiple tasks in progress at any point in time.

- Can be realized without parallel hardware

#### Parallelism

Capability of a machine to perform multiple tasks simultaneously.

- Requires parallel hardware

Any parallel program is a concurrent program, but some concurrent programs cannot be executed correctly in parallel.

#### Distribution

Form of parallelism where tasks are performed by multiple communicating machines.

#### **Concurrency** $\supset$ **Parallelism** $\supset$ **Distribution**

Sometimes the set difference Concurrency \ Parallelism is itself called "Concurrency".



## Hardware Taxonomy [Flynn1966]

[Figure: Flynn's taxonomy, classifying hardware by single vs. multiple instruction streams and single vs. multiple data streams (SISD, SIMD, MISD, MIMD)]

ParProg 2020 A2 Parallel Hardware

Lukas Wenzel

#### MIMD Hardware Taxonomy







#### **SM-MIMD** Hardware



#### Recap Optimization Goals

- Decrease Latency: process a single workload faster (= speedup)
- Increase Throughput: process more workloads in the same time
  - Both are **Performance** metrics
- **Scalability**: make best use of additional resources
  - **Scale Up**: Utilize additional resources on a machine
  - **Scale Out**: Utilize resources on additional machines
- Cost/Energy Efficiency:
  - minimize cost/energy requirements for given performance objectives
  - alternatively: maximize performance for given cost/energy budget
- **Utilization**: minimize idle time (=waste) of available resources
- Precision-Tradeoffs: trade performance for precision of results




#### Anatomy of a Workload

The longest task puts a lower bound on the shortest execution time.

[Figure: workload composed of the tasks T1 to T8]

Modeling discrete tasks is impractical $\rightarrow$ simplified **continuous model**.



Replace absolute times by **parallelizable fraction** P:

$$T_{\text{par}} = T_1 \cdot P$$
  

$$T_{\text{seq}} = T_1 \cdot (1 - P)$$
  

$$T(N) = T_1 \cdot \left(\frac{P}{N} + (1 - P)\right)$$



ParProg 2020 A3 Performance Metrics

Lukas Wenzel

#### [Amdahl1967] Amdahl's Law



Amdahl's Law derives the speedup  $s_{Amdahl}(N)$  for a parallelization degree N

$$\mathbf{s}_{\text{Amdahl}}(\mathbf{N}) = \frac{T_1}{T(\mathbf{N})} = \frac{T_1}{T_1 \cdot \left(\frac{P}{N} + (1-P)\right)} = \frac{\mathbf{1}}{\frac{P}{N} + (\mathbf{1}-P)}$$

Even for **arbitrarily large** N, the speedup converges to a **fixed limit** 

$$\lim_{N\to\infty} s_{Amdahl}(N) = \frac{1}{1-P}$$
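As a quick numerical check of the formula and its limit, the following minimal Python sketch evaluates Amdahl's law for a few values of N. The function names and the sample value P = 0.9 are illustrative assumptions, not part of the lecture material.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup for parallelizable fraction p on n execution units."""
    return 1.0 / (p / n + (1.0 - p))


def amdahl_limit(p: float) -> float:
    """Upper bound on the speedup for n -> infinity (requires p < 1)."""
    return 1.0 / (1.0 - p)


if __name__ == "__main__":
    p = 0.9  # assumed parallelizable fraction
    for n in (1, 2, 10, 100, 1000):
        print(f"N={n:5d}  speedup={amdahl_speedup(p, n):6.2f}")
    print(f"limit for N -> infinity: {amdahl_limit(p):.2f}")
```

The printed speedups approach, but never exceed, the limit of 1 / (1 - P).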


For getting reasonable speedup out of 1000 processors, the sequential part must be substantially below 0.1%

## [Amdahl1967] Amdahl's Law

Regardless of processor count, **90% parallelizable** code allows a **speedup of at most factor 10**.

- Parallelism requires highly parallelizable workloads to achieve a speedup
- What is the sense in large parallel machines?

Amdahl's law assumes a simple speedup scenario!

- isolated execution of a single workload
- fixed workload size


By Daniels220 at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=6678551

#### [Gustafson1988] Gustafson-Barsis' Law



Consider a **scaled speedup scenario**, allowing a variable workload size w.

- Amdahl ~ What is the shortest execution time for a given workload?
- Gustafson-Barsis ~ What is the largest workload for a given execution time?

Determine the scaled speedup $s_{Gustafson}(N)$ through the increase in workload size w(N) over the fixed execution time T

$$\mathbf{s}_{\text{Gustafson}}(\mathbf{N}) = \mathbf{P} \cdot \mathbf{N} + (\mathbf{1} - \mathbf{P})$$
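To make the contrast between the two scenarios concrete, the sketch below evaluates both laws side by side. The function names and the sample value P = 0.9 are assumptions chosen for illustration only.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Fixed workload size: speedup is bounded by 1 / (1 - p)."""
    return 1.0 / (p / n + (1.0 - p))


def gustafson_speedup(p: float, n: int) -> float:
    """Scaled workload size: speedup grows linearly with n."""
    return p * n + (1.0 - p)


if __name__ == "__main__":
    p = 0.9  # assumed parallelizable fraction
    for n in (1, 10, 100, 1000):
        print(f"N={n:5d}  Amdahl={amdahl_speedup(p, n):7.2f}  "
              f"Gustafson={gustafson_speedup(p, n):8.2f}")
```

Both laws agree for N = 1 and diverge as N grows, because Gustafson-Barsis lets the parallel part of the workload grow with the machine.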




Parallel fraction **P** is a hypothetical parameter and not easily deduced from a given workload.

- The Karp-Flatt metric [Karp1990] determines the sequential fraction $Q = 1 - P$ empirically:
  1. Measure the baseline execution time $T_1$ by executing the workload on a single execution unit
  2. Measure the parallelized execution time $T(N)$ by executing the workload on N execution units
  3. Determine the speedup $s(N) = \frac{T_1}{T(N)}$
  4. Calculate the Karp-Flatt metric

$$Q(N) = \frac{\frac{1}{s(N)} - \frac{1}{N}}{1 - \frac{1}{N}}$$
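The four measurement steps can be sketched in a few lines of Python. The timings used in the example (100 s on one unit, 20 s on eight units) are made-up values for illustration, and `karp_flatt` is a hypothetical helper name.

```python
def karp_flatt(speedup: float, n: int) -> float:
    """Empirical sequential fraction Q from a measured speedup on n units."""
    if n < 2:
        raise ValueError("the metric is only defined for n >= 2")
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)


if __name__ == "__main__":
    t_1 = 100.0   # step 1: assumed baseline time on one execution unit
    t_n = 20.0    # step 2: assumed time on n execution units
    n = 8
    s = t_1 / t_n                      # step 3: speedup
    print(f"Q({n}) = {karp_flatt(s, n):.4f}")   # step 4
```

If Q grows with N in a series of measurements, this hints at parallelization overhead rather than an inherently sequential part.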



#### Workloads

"Task-level parallelism"

- Different tasks being performed at the same time
- Might originate from the same or different programs

"Data-level parallelism"

- Parallel execution of the same task on disjoint data sets

#### ParProg20 A4 Foster's Methodology

Sven Köhler

## Designing Parallel Algorithms [Foster]



- A) Search for concurrency and scalability
  - Partitioning: Decompose computation and data into the *smallest possible* tasks
  - Communication: Define necessary coordination of task execution
- B) Search for locality and other performance-related issues
  - Agglomeration: Consider performance and implementation costs
  - Mapping: Maximize execution unit utilization, minimize communication

Might require backtracking or parallel investigation of steps.


## Surface-To-Volume Effect [Foster, Breshears]



Visualize the data to be processed (in parallel) as a sliced 3D cube: the total surface area increases while the total volume remains constant. [nicerweb.com]

|                                                                         | 1 box, side length 1 | 1 box, side length 5 | 125 boxes, side length 1 |
|-------------------------------------------------------------------------|----------------------|----------------------|--------------------------|
| Total surface area (height × width × number of sides × number of boxes) | 6                    | 150                  | 750                      |
| Total volume (height × width × length × number of boxes)                | 1                    | 125                  | 125                      |
| Surface-to-volume ratio (surface area / volume)                         | 6                    | 1.2                  | 6                        |
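The table values can be reproduced with a short calculation. In this sketch, surface area stands in for communication and volume for computation; the helper name `surface_to_volume` is an assumption for illustration.

```python
def surface_to_volume(side: int, boxes: int):
    """Return (total surface, total volume, ratio) for equal cubes."""
    surface = side * side * 6 * boxes   # height x width x sides x boxes
    volume = side ** 3 * boxes          # height x width x length x boxes
    return surface, volume, surface / volume


if __name__ == "__main__":
    for side, boxes in ((1, 1), (5, 1), (1, 125)):
        print(side, boxes, surface_to_volume(side, boxes))
```

Slicing the cube of side length 5 into 125 unit cubes keeps the volume at 125 but raises the ratio from 1.2 back to 6, which is why overly fine partitioning increases the relative communication cost.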




#### B1: Shared Memory Systems (Concurrency & Synchronization)



#### **Critical Section**

- Mutual Exclusion demand: Only one task at a time is allowed into its critical section, among all tasks that have critical sections for the same resource.
- Progress demand: If no other task is in the critical section, the decision for entering should not be postponed indefinitely. Only tasks that wait for entering the critical section are allowed to participate in decisions.
- Bounded Waiting demand: It must not be possible for a task requiring access to a critical section to be delayed indefinitely by other tasks entering the section (starvation problem).



ParProg20 B1 Concurrency & Synchronization

Sven Köhler

## Cooperating Sequential Processes [Dijkstra1965] Solution: Dekker got it!



- Combination of approach #4 and a variable `turn`, which realizes mutual blocking avoidance through prioritization
- Idea: Spin for section entry only if it is your turn
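The idea can be sketched for two tasks in Python. This is a minimal illustration, not the lecture's reference code: the names `wants`, `turn`, `dekker_lock`, `dekker_unlock`, and `run_dekker` are assumptions, and the sketch relies on CPython with its global interpreter lock behaving sequentially consistently. On real hardware with weaker memory ordering, additional memory fences would be required.

```python
import threading
import time

wants = [False, False]   # wants[i]: task i intends to enter its critical section
turn = 0                 # task that has priority if both want to enter
counter = 0              # shared resource protected by the critical section


def dekker_lock(me: int) -> None:
    other = 1 - me
    wants[me] = True
    while wants[other]:
        if turn != me:
            wants[me] = False          # back off, it is not our turn
            while turn != me:
                time.sleep(0)          # yield so the other thread can run
            wants[me] = True
        else:
            time.sleep(0)              # our turn: spin until the other backs off


def dekker_unlock(me: int) -> None:
    global turn
    turn = 1 - me                      # hand priority to the other task
    wants[me] = False


def worker(me: int, rounds: int) -> None:
    global counter
    for _ in range(rounds):
        dekker_lock(me)
        tmp = counter                  # critical section: read-modify-write
        counter = tmp + 1
        dekker_unlock(me)


def run_dekker(rounds: int) -> int:
    global turn, counter
    wants[0] = wants[1] = False
    turn, counter = 0, 0
    threads = [threading.Thread(target=worker, args=(i, rounds)) for i in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_dekker(1000))
```

The `turn` variable only matters when both tasks want to enter at the same time; it decides who backs off, which avoids the mutual blocking of the flag-only approach.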




- **Test-and-set** processor instruction, wrapped by the operating system or compiler
  - Write to a memory location and return its old value as atomic step
  - One of several atomic read-modify-write instructions; compare-and-swap (CAS) is a related, more general variant
- Idea: Spin in writing 1 to a memory cell, until the old value was 0
  - Between writing and test, no other operation can modify the value
- Busy waiting for acquiring a (spin) lock
- Efficient especially for short waiting periods
- For long periods try to *deactivate* your processor between loops.

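```
#include <stdatomic.h>

#define LOCKED 1

/* Atomically write LOCKED and report whether the lock was already held. */
int TestAndSet(atomic_int *lockPtr) {
    int oldValue = atomic_exchange(lockPtr, LOCKED);
    return oldValue == LOCKED;
}

/* Spin until the old value was 0, i.e. until the caller acquired the lock. */
void Lock(atomic_int *lock) {
    while (TestAndSet(lock))
        ;  /* busy waiting */
}
```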




#### Coroutines



```
var q := new queue

coroutine produce
    loop
        while q is not full
            create some new items
            add the items to q
        yield to consume

coroutine consume
    loop
        while q is not empty
            remove some items from q
            use the items
        yield to produce
```

```
def generator():
   for i in range(5):
        yield i * 2
```

```
for item in generator():
    print(item)
```




#### **Other High-Level Primitives**

- Today: Multitude of high-level synchronization primitives
- Spinlock
  - Perform busy waiting, lowest overhead for short locks
- Reader / Writer Lock
  - Special case of mutual exclusion through semaphores
  - Multiple "Reader" tasks can enter the critical section at the same time, but a "Writer" task should gain exclusive access
  - Different optimizations possible: minimum reader delay, minimum writer delay, throughput, ...


## Coffman Conditions [Coffman1970]

- 1970. E.G. Coffman and A. Shoshani. Sequencing tasks in multiprocess systems to avoid deadlocks.
  - All conditions must be fulfilled to allow a deadlock to happen
  - Mutual exclusion condition: Individual resources are available or held by no more than one task at a time
  - Hold and wait condition: A task already holding resources may attempt to hold new resources
  - No preemption condition: Once a task holds a resource, it must voluntarily release it on its own
  - Circular wait condition: Possible for a task to wait for a resource held by the next task in the chain
- Avoiding circular wait turned out to be the easiest solution for deadlock avoidance
- Avoiding mutual exclusion leads to non-blocking synchronization
  - These algorithms no longer have a critical section
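One common way to rule out the circular wait condition is to acquire all locks in a single global order. The sketch below illustrates this with Python's `threading.Lock`; the class `OrderedLocks`, the `transfer` function, and the account example are hypothetical and only serve as an illustration.

```python
import threading


class OrderedLocks:
    """Context manager acquiring several locks in one global order (by id)."""

    def __init__(self, *locks):
        unique = {id(lock): lock for lock in locks}      # drop duplicates
        self._locks = [unique[key] for key in sorted(unique)]

    def __enter__(self):
        for lock in self._locks:
            lock.acquire()
        return self

    def __exit__(self, *exc):
        for lock in reversed(self._locks):
            lock.release()
        return False


def transfer(accounts, locks, src, dst, amount):
    # Both directions lock in the same order, so no cycle of waiters can form.
    with OrderedLocks(locks[src], locks[dst]):
        accounts[src] -= amount
        accounts[dst] += amount


def demo(rounds=1000):
    accounts = {"a": 100, "b": 100}
    locks = {name: threading.Lock() for name in accounts}

    def run(src, dst):
        for _ in range(rounds):
            transfer(accounts, locks, src, dst, 1)

    threads = [threading.Thread(target=run, args=("a", "b")),
               threading.Thread(target=run, args=("b", "a"))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return accounts


if __name__ == "__main__":
    print(demo())
```

Without the ordering, one thread could hold the lock of "a" while waiting for "b" and the other the reverse, which satisfies all four conditions at once.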






## B2: Programming Models



#### POSIX Threads (Pthreads)

Functions of the `pthread` API (each suffix is appended to the prefix `pthread`):

`_create` `_self` `_cancel` `_exit` `_join` `_kill` `_attr_setstacksize` `_attr_setstackaddr` `_mutex_lock` `_mutex_trylock` `_mutex_unlock` `_cond_signal` `_cond_timedwait` `_cond_wait` `_rwlock_rdlock` `_rwlock_unlock` `_rwlock_wrlock` `_barrier_wait` `_key_create` `_setspecific` [...]

ParProg20 B2 Programming Models

Sven Köhler



#### C++11

- The C++11 specification added support for concurrency constructs
- Allows asynchronous tasks with std::async or std::thread
- Relies on a *Callable* instance (functions, member functions, lambdas, ...)

```
// Variant 1: asynchronous task with std::async
#include <future>
#include <iostream>
#include <string>

void write_message(std::string const& message) {
  std::cout << message;
}

int main() {
  auto f = std::async(write_message,
    "hello world from std::async\n");
  write_message("hello world from main\n");
  f.wait();  // block until the asynchronous task has finished
}
```

```
// Variant 2: explicit thread with std::thread
#include <thread>
#include <iostream>
#include <string>

void write_message(std::string const& message) {
  std::cout << message;
}

int main() {
  std::thread t(write_message,
    "hello world from std::thread\n");
  write_message("hello world from main\n");
  t.join();  // wait for the thread before main returns
}
```



#### C++11: Futures & Promises

- Launch policy for the *async call* can be specified
  - Deferred or immediate launch of the activity
- As for all asynchronous task types, a **future** is returned
  - Object representing the (future) result of an asynchronous operation; reading the result blocks until it is available
  - Original concept by Baker and Hewitt [1977]
- A promise object can store a value that is later acquired via a future object
  - Separate concept since futures are only readable
  - Can provide a dummy barrier implementation
- Future == Handle, Promise == Value
- Promise and future as concept also available in Java 5, Smalltalk, Scheme, CORBA, ...
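The handle/value split can also be illustrated in Python, where `concurrent.futures.Future` combines both roles in one object. The sketch below is an assumption-laden illustration: creating a bare `Future` by hand is normally reserved for executors and tests, and the names `square` and `demo` are hypothetical.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


def square(x: int) -> int:
    return x * x


def demo():
    # Future as handle: submit() launches the task and returns immediately.
    with ThreadPoolExecutor(max_workers=1) as pool:
        handle = pool.submit(square, 7)
        result = handle.result()          # blocks until the value is available

    # Promise-like usage: another thread stores the value, the reader
    # only sees the read side through result().
    promise = Future()
    producer = threading.Thread(target=promise.set_result, args=("ready",))
    producer.start()
    value = promise.result()              # blocks until set_result() was called
    producer.join()
    return result, value


if __name__ == "__main__":
    print(demo())
```

In C++11 the two sides are separate types (`std::promise` for writing, `std::future` for reading), whereas the Python object exposes both `set_result()` and `result()`.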


## Explicit vs Implicit Threading

Thread generation, synchronization, and data access can be:

- **Explicit**, as part of some sequential code (OS API, C++/Java/Python Threads)
- **Implicit**, based on a framework (OpenMP, OpenCL, Intel TBB, ...)








#### OpenMP

**Specification** for C/C++ and Fortran language extension

- Portable shared memory thread programming
- High-level abstraction of task- and loop parallelism
- Derived from compiler-directed parallelization of serial language code (HPF), with support for incremental change of legacy code
- Multiple implementations exist

Programming model: Fork-Join-Parallelism

Master thread spawns group of threads for limited code region






## **OpenMP Loop Parallelization Scheduling**

- schedule (static, [chunk]):
  - Contiguous ranges of iterations (chunks) are assigned to the threads
  - Low overhead, round robin assignment to free threads
  - Static scheduling for predictable and similar work per iteration
  - Increasing chunk size reduces overhead, improves cache hit rate
  - Decreasing chunk size allows finer balancing of work load
  - Default is one chunk per thread
- schedule (guided, [chunk])
  - Dynamic schedule, shrinking ranges per step
  - Starts with large block, until minimum chunk size is reached
  - Good for computations with increasing iteration length (e.g. prime sieves)
- schedule (dynamic, [chunk])
  - Idling threads grab an iteration (or chunk) as available (work-stealing)
  - Higher overhead, but good for unbalanced/unpredictable iteration work load
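To visualize how `schedule(static, [chunk])` distributes iterations, the following Python sketch computes the iteration-to-thread mapping. It is a simulation under assumptions, not OpenMP itself: the function name `static_schedule` is hypothetical, and the exact split in the default case (no chunk size) is implementation-defined; here any remainder goes to the first threads.

```python
def static_schedule(num_iterations, num_threads, chunk=None):
    """Return, per thread, the list of loop iterations it would execute."""
    assignment = [[] for _ in range(num_threads)]
    if chunk is None:
        # Default: one contiguous chunk per thread, sizes differ by at most one.
        base, extra = divmod(num_iterations, num_threads)
        start = 0
        for thread in range(num_threads):
            size = base + (1 if thread < extra else 0)
            assignment[thread] = list(range(start, start + size))
            start += size
    else:
        # Fixed-size chunks are handed out round robin.
        for index, start in enumerate(range(0, num_iterations, chunk)):
            end = min(start + chunk, num_iterations)
            assignment[index % num_threads].extend(range(start, end))
    return assignment


if __name__ == "__main__":
    print(static_schedule(10, 3))      # default: one chunk per thread
    print(static_schedule(10, 2, 2))   # chunk size 2, round robin
```

Smaller chunks interleave the threads more finely across the iteration space, which balances uneven work better but reduces locality per thread.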


#### Work Stealing

Robert D. Blumofe, Charles E. Leiserson:

Scheduling Multithreaded Computations by Work Stealing (FOCS 1994)

Problem of scheduling scalable multithreading problems on SMP

**Work sharing**: When processors create new work, the scheduler migrates threads for balanced utilization

**Work stealing**: An underutilized core takes work from another processor, which leads to fewer thread migrations

- Goes back to work stealing research in Multilisp (1984)
- Supported in OpenMP implementations, TPL, TBB, Java, Cilk, ...

Randomized work stealing: Lock-free ready deque per processor

- Tasks are inserted at the bottom, local work is taken from the bottom
- If no ready task is available, the core steals the top-most one from another randomly chosen core; the stolen task is added at the bottom
- Ready tasks are executed, or wait for a processor becoming free

Large body of research about other work stealing variations






#### B3: Hardware

## Shared-Memory Hardware Exploiting Instruction Level Parallelism

- ILP arises naturally within a workload
  - Programmers think in terms of a single instruction sequence
- TLP is explicitly encoded within a workload
  - Programmers designate parallel operations using multiple instruction sequences





ParProg20 B2 Shared-Memory Hardware

Lukas Wenzel

#### Why consider ILP in a parallel programming lecture?

Knowledge of common ILP mechanisms and assumptions enables performance optimization on single-thread granularity!

## Shared-Memory Hardware Thread Level Parallelism

#### **Single-Core Multithreading**

- Threads are the smallest units of parallelism under programmers' explicit control
- There are different execution schemes for multiple threads on a single core:










## Shared-Memory Hardware Memory Consistency Models



#### **Overview**



## Shared-Memory Hardware Coherent Cache Hierarchy



#### **MSI Coherence Protocol**

- MSI is a simple coherence protocol, based on a state machine
- Seen from a particular cache, each cache line is in one of three states:
  - **I**nvalid: The cache line is not present in the cache, this cache may service neither Load nor Store operations
  - **S**hared: The cache line is present in this and possibly other caches, this cache may service Load operations
  - **M**odified: The cache line is only present in this cache, this cache may service Load and Store operations
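The state machine can be sketched for a single cache line tracked across several caches. The slide names only the three states; the transitions below follow the commonly described MSI behavior and are an assumption beyond the source, as are the class and method names.

```python
class MSILine:
    """State of one cache line in each of num_caches caches."""

    def __init__(self, num_caches: int):
        self.state = ["I"] * num_caches

    def load(self, cache: int) -> None:
        if self.state[cache] == "I":
            # A Modified copy elsewhere is written back and downgraded.
            for other, state in enumerate(self.state):
                if state == "M":
                    self.state[other] = "S"
            self.state[cache] = "S"
        # In S or M the load is serviced locally without a state change.

    def store(self, cache: int) -> None:
        # Gaining write permission invalidates all other copies.
        for other in range(len(self.state)):
            if other != cache:
                self.state[other] = "I"
        self.state[cache] = "M"


if __name__ == "__main__":
    line = MSILine(3)
    for op, cache in (("load", 0), ("load", 1), ("store", 2), ("load", 0)):
        getattr(line, op)(cache)
        print(f"{op}({cache}) -> {line.state}")
```

At any time at most one cache holds the line in state M, and M never coexists with S copies, which is the invariant the protocol maintains.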




#### B4: NUMA

## Non-Uniform Memory Access Concept



- Part of the main memory is directly attached to a socket (local memory)
- Memory attached to a different socket can be accessed indirectly via the other socket's memory controller and interconnect (remote memory)
- Socket + local memory form a NUMA node




## Non-Uniform Memory Access Placement Decisions

Tradeoff:

computational load balancing  $\diamond$  data locality

Thread Placement: Realized in the OS through an Affinity Mask

- Pinning (= only a single bit set)
- Affinity mask can be adjusted at runtime
- Computation follows data

Data Placement: Realized in the OS on page granularity (4k, 64k, ... 64GB)

- Static: Placement policies apply at allocation time
  - First-touch
  - Allocate on fixed node(s)
  - Interleaving
- Dynamic: Pages can migrate at runtime
- Data follows computation





ParProg 20 B4 Non-Uniform Memory Access



## Non-Uniform Memory Access Topology Examples: SGI UV-300H



| N  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 1  | 10 | 16 | 19 | 16 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
| 2  | 16 | 10 | 16 | 19 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
| 3  | 19 | 16 | 10 | 16 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
| 4  | 16 | 19 | 16 | 10 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
| 5  | 50 | 50 | 50 | 50 | 10 | 16 | 19 | 16 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
| 6  | 50 | 50 | 50 | 50 | 16 | 10 | 16 | 19 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
| 7  | 50 | 50 | 50 | 50 | 19 | 16 | 10 | 16 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
| 8  | 50 | 50 | 50 | 50 | 16 | 19 | 16 | 10 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 |
| 9  | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 10 | 16 | 19 | 16 | 50 | 50 | 50 | 50 |
| 10 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 16 | 10 | 16 | 19 | 50 | 50 | 50 | 50 |
| 11 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 19 | 16 | 10 | 16 | 50 | 50 | 50 | 50 |
| 12 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 16 | 19 | 16 | 10 | 50 | 50 | 50 | 50 |
| 13 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 10 | 16 | 19 | 16 |
| 14 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 16 | 10 | 16 | 19 |
| 15 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 19 | 16 | 10 | 16 |
| 16 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 50 | 16 | 19 | 16 | 10 |

ParProg 2019 Non-Uniform Memory Access

How would you roll out a matrix multiplication workload on this system?

What tools / control mechanisms can you use?




Felix Eberhardt



#### C1: SIMD

Scalar vs. SIMD



How many instructions are needed to add four numbers from memory?

|                | Additions | Loads | Stores |
|----------------|-----------|-------|--------|
| scalar         | 4         | 8     | 4      |
| 4 element SIMD | 1         | 2     | 1      |

ParProg20 C1 Integrated Accelerators

Sven Köhler

#### Vector Data Realignment and Permutation (1)

Sometimes memory is not correctly ordered for a certain task.

Example: Squared absolute of 2D points  $(r^2 = p_x^2 + p_y^2)$ 



#### Conditional Programming (1)

There are **no branches** for element computation in AltiVec.

Instead compute both variants and then use **bit-wise select**.
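The select idiom can be emulated on plain Python lists to show the principle. This is a sketch under assumptions: elements are treated as unsigned 32-bit integers, and the helper names only mimic the style of vector intrinsics rather than being bindings to real ones.

```python
def vec_cmpgt(a, b, width=32):
    """Per-element comparison yielding an all-ones or all-zeros bit mask."""
    ones = (1 << width) - 1
    return [ones if x > y else 0 for x, y in zip(a, b)]


def vec_select(a, b, mask):
    """Bit-wise select: take bits of b where the mask bit is 1, else bits of a."""
    return [(x & ~m) | (y & m) for x, y, m in zip(a, b, mask)]


def vec_max(a, b):
    """Branch-free maximum: both inputs are available, the mask picks one."""
    return vec_select(b, a, vec_cmpgt(a, b))


if __name__ == "__main__":
    print(vec_max([1, 7, 3, 9], [4, 2, 8, 5]))
```

Every element goes through the same instruction sequence; the data-dependent decision is encoded in the mask instead of in control flow.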






#### Architecture-Dependent Element Count in Vector Registers

- ppc64: AltiVec/VMX
- amd64: AVX-512 register scheme as extension of the AVX (YMM0-YMM15) and SSE (XMM0-XMM15) registers

| Register | Bits                              | Intrinsic data types           |
|----------|-----------------------------------|--------------------------------|
| XMM*n*   | 127-0                             | `__m128`, `__m128d`, `__m128i` |
| YMM*n*   | 255-0 (lower half aliases XMM*n*) | `__m256`, `__m256d`, `__m256i` |
| ZMM*n*   | 511-0 (lower half aliases YMM*n*) | `__m512`                       |

Element counts per register:

- 128 bit: 4 floats, 2 doubles, integers (8-128bit)
- 256 bit: 8 floats, 4 doubles, integers (8-128bit)



#### What loops can be vectorized

- Countable loops
- Static counts (length does not change)
- Single entry and single exit (read: no data-dependent break)
- All function calls can be in-lined, or are math intrinsics (sin, floor, ...)
- Straight-line code (no switch-statements), mask-able if/continue

```
for (int i=0; i<length; i++) {
    float s = b[i]*b[i] - 4*a[i]*c[i];
    if ( s >= 0 ) {
        s = sqrt(s) ;
        x2[i] = (-b[i]+s)/(2.*a[i]);
        x1[i] = (-b[i]-s)/(2.*a[i]);
    } else {
        x2[i] = 0.;
        x1[i] = 0.;
    }
}
```




#### C2: GPUs



#### Why GPUs?



[https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/, https://www.top500.org/statistics/list/]



## GPU Hardware: Discrete vs. Integrated GPUs







#### Hardware: NVIDIA GA100 Full GPU with 128 SMs



ParProg20 C2 GPUs

Max Plauth


## CUDA Programming Model: Kernels

- "a routine compiled for high throughput accelerators" (Wikipedia)
- An instance of a kernel function is executed once per thread
- Indices determine what portion of work is performed by a kernel instance
- Think of kernels as the body of an inner loop







## CUDA Programming Model: Memory Hierarchy

- Register File
  - Private to each thread
  - Fastest memory, several variables
- Shared Memory
  - Shared per block
  - Fast memory, several kilobytes
  - Managed manually
- Global Memory
  - Shared per process
  - Slowest memory, several gigabytes



[Figure: grid of thread blocks, Block (0, 0) to Block (1, 2)]



## Best Practices for Performance Tuning

| Area             | Practices                              |
|------------------|----------------------------------------|
| Algorithm Design | Asynchronous, Recompute, Simple        |
| Memory Transfer  | Chaining, Overlap Transfer & Compute   |
| Control Flow     | Avoid Divergent Branching              |
| Memory Types     | Local Memory as Cache, rare resource   |
| Memory Access    | Coalescing, Bank Conflicts             |
| Sizing           | Work-Group Size, Work / Work-Item      |
| Instructions     | Shifting, Fused Multiply, Vector Types |
| Precision        | Native Math Functions, Build Options   |



#### C3: FPGA

## Introduction Mapping Workloads to Hardware



#### **Example:**

```
Given arrays a, b, and f, calculate r[i] = a[i] * f[i] + b[i] * (1 - f[i])
```





ParProg 2020 C3 FPGA Accelerators

Lukas Wenzel

**General Purpose Hardware** 

**Custom Hardware** 

#### FPGA Characteristics Hardware Structure





## FPGA Characteristics Performance

#### Maximum clock frequency is design specific!

- Combinatorial paths begin and end at flip-flops
- Clock period must be longer than the maximum path delay

Maximum delay:

$$\max\{t_{\delta}\} = 7\,\text{ns}$$

Clock frequency:

$$f \leq \frac{1}{\max\{t_{\delta}\}} \approx 143\,\text{MHz}$$




## FPGA Design Basic Patterns



#### Any program can be transformed into an equivalent hardware design:

- Variables and operations are realized in the datapath
- Control flow is realized through a finite state machine (FSM) controlling the datapath



## FPGA Design Dataflow Model



- Dataflow is a computational model based on streams of data units, that are processed by traversing a network of operators
  - Enables a flexible kind of task parallelism, where operations are not orchestrated by control flow but availability of data operands



Workloads with an efficient dataflow representation usually yield an efficient hardware implementation!


## FPGA Development Workflow



High-level design methods extend the frontend of traditional workflows. They usually produce HDL descriptions as intermediate artifacts.



#### **FPGA Accelerators**



- DRAM modules to complement the limited BRAM capacity on the FPGA
- Flash Storage
- Network Interfaces
- Video and Peripheral Ports
- Auxiliary Accelerators like Crypto Units or A/V Codecs





[Figure: block diagram; recoverable labels: SNAP Core, ctrl]

#### **FPGA Accelerators**

#### Channels consist of: Payload • Valid handshake • Ready handshake



- **Advanced Extensible Interface Stream** (AXI Stream) ~ sequential access
- **Advanced Extensible Interface** (AXI) ~ random access

[Figure: timing diagram over time with the signals clock, valid, ready, and payload; a transfer occurs when valid and ready are both asserted]





## D1: Shared Nothing Basics

## Parallel Random Access Machine (PRAM)



Natural extension of the Random Access Machine (RAM) model:



- Arbitrary amount of memory
- Constant memory access latency
- Arbitrary number of processors
- Lockstep execution

|                                                                   | Exclusive read                              | Concurrent read (multiple processors can read the same address) |
|-------------------------------------------------------------------|---------------------------------------------|-----------------------------------------------------------------|
| Exclusive write                                                   | Exclusive Read, Exclusive Write **EREW**    | Concurrent Read, Exclusive Write **CREW**                       |
| Concurrent write (multiple processors can write the same address) | Exclusive Read, Concurrent Write **ERCW**   | Concurrent Read, Concurrent Write **CRCW**                      |

Arbitration Policies:

- Common
- Arbitrary
- Priority
- Aggregate (Sum, Max, Avg, ...)

ParProg 2020 D1 Shared-Nothing Basics

Lukas Wenzel

## [Valiant1990] Bulk Synchronous Parallel Model (BSP)

Algorithms are divided into three repeating phases, forming multiple supersteps:

1. Local Computation
2. Global Communication
3. Synchronization

**Superstep duration varies** at runtime depending on computational and communication load.

#### Performance estimates using the following parameters:

- **Computation time:** $t_W = \max\{w_i\}$
- **Communication time:** $t_c = g \cdot m \cdot h$
  - $g$ ~ message bandwidth
  - $m = \max\{|msg_k|\}$ ~ message size
  - $h = \max\{\#in_i, \#out_i\}$ ~ communication pattern
- **Synchronization overhead:** $t_s = l$

[Figure: superstep timeline with local work $w_0$ and message costs $g \cdot |msg_{01}|$ and $g \cdot |msg_{02}|$]
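The BSP parameters can be combined into a rough superstep estimate. The sketch below assumes that the superstep duration is the sum of computation time, communication time, and synchronization overhead, which is the usual BSP cost formulation; the function name and all input values are made up for illustration.

```python
def bsp_superstep_time(work, msg_sizes, fan_in, fan_out, g, l):
    """Estimate one superstep: max work + g * m * h + l."""
    t_w = max(work)                           # slowest processor dominates
    m = max(msg_sizes) if msg_sizes else 0    # largest message
    h = max(max(fan_in), max(fan_out))        # busiest processor (in or out)
    t_c = g * m * h
    t_s = l
    return t_w + t_c + t_s


if __name__ == "__main__":
    total = bsp_superstep_time(
        work=[5, 8, 6],       # w_i per processor (assumed values)
        msg_sizes=[2, 4],     # |msg_k| per message
        fan_in=[1, 1, 0],     # #in_i per processor
        fan_out=[2, 0, 0],    # #out_i per processor
        g=3,
        l=10,
    )
    print(total)
```

Because every term is a maximum over all processors, a single overloaded processor or an oversized message stretches the whole superstep.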

## [Culler1993] LogP Model

LogP enables a fine-grained analysis of communication patterns.

**Parameters:**

- *l* - **latency** (time in cycles between transmission and reception of a message)
- *o* - **overhead** (time in cycles for a send / receive operation)
- *g* - **gap** (time in cycles between messages from / to a single processor)
- *P* - **#processors**

*Example: Request-Response sequence* between two processors

- $P = 2$; $l = 3$; $g = 4$; $o = 2$; $t_{resp} = 3$
- $t_{total} = 2 \cdot l + 4 \cdot o + t_{resp} = 2 \cdot 3 + 4 \cdot 2 + 3 = 17$
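The request-response estimate can be written as a short function: each of the two messages costs a send overhead, the latency, and a receive overhead, and the responder spends $t_{resp}$ in between. The function name is hypothetical. The gap $g$ does not appear in the formula, presumably because each processor sends only a single message.

```c
/*
 * LogP time for one request-response exchange between two processors.
 * l: latency, o: send/receive overhead, t_resp: time to compute the response.
 */
long logp_request_response(long l, long o, long t_resp)
{
    long one_way = o + l + o;   /* send overhead + latency + receive overhead */
    return one_way + t_resp + one_way;
}
```

Calling `logp_request_response(3, 2, 3)` reproduces the 17 cycles of the example.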

## **Network Topologies**

#### **Topologies are characterized by multiple metrics:**

- Diameter ~ Latency
   Maximum distance between any two nodes
- Connectivity ~ Resilience
   Minimum number of removed edges to cause partition
- Bisection Bandwidth ~ Throughput
   Transfer capacity across balanced network cuts
- Cost ~ Network complexity
   Total number of edges
- Degree ~ Node complexity
   Maximum number of edges per node
- Link Bandwidth
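The diameter metric can be checked numerically for small networks. The following sketch runs a breadth-first search from every node of an adjacency matrix and returns the longest shortest path. `graph_diameter`, `make_ring` and the `MAX_NODES` limit are hypothetical helpers introduced for illustration.

```c
#include <string.h>

#define MAX_NODES 64

/*
 * adj[i][j] != 0 means there is an edge between nodes i and j.
 * Returns the diameter in hops, or -1 if the graph is disconnected
 * or n is out of range.
 */
int graph_diameter(int n, unsigned char adj[MAX_NODES][MAX_NODES])
{
    if (n < 1 || n > MAX_NODES) return -1;
    int diameter = 0;
    for (int src = 0; src < n; src++) {
        int dist[MAX_NODES];
        int queue[MAX_NODES];
        int head = 0, tail = 0;
        for (int i = 0; i < n; i++) dist[i] = -1;
        dist[src] = 0;
        queue[tail++] = src;
        while (head < tail) {               /* breadth-first search from src */
            int u = queue[head++];
            for (int v = 0; v < n; v++) {
                if (adj[u][v] && dist[v] < 0) {
                    dist[v] = dist[u] + 1;
                    queue[tail++] = v;
                }
            }
        }
        for (int i = 0; i < n; i++) {
            if (dist[i] < 0) return -1;     /* unreachable node */
            if (dist[i] > diameter) diameter = dist[i];
        }
    }
    return diameter;
}

/* Fill adj with a ring of n nodes (3 <= n <= MAX_NODES). */
void make_ring(int n, unsigned char adj[MAX_NODES][MAX_NODES])
{
    memset(adj, 0, sizeof(unsigned char) * MAX_NODES * MAX_NODES);
    for (int i = 0; i < n; i++) {
        int next = (i + 1) % n;
        adj[i][next] = 1;
        adj[next][i] = 1;
    }
}
```

For a ring of 6 nodes the function returns 3, the distance between two opposite nodes.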








## Network Topologies

| Metric       | Fully Connected   | Ring                                     | Star                         |
|--------------|-------------------|------------------------------------------|------------------------------|
| Diameter     | 1                 | $\left\lfloor \frac{n}{2} \right\rfloor$ | 2                            |
| Connectivity | $n-1$             | 2                                        | 1 (single node)              |
| Cost         | $\frac{n^2-n}{2}$ | $n$                                      | $n$                          |
| Degree       | $n-1$             | 2                                        | 1 (leaf) or $n$ (center) (!) |
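The closed forms in the table translate directly into code. This is a minimal sketch: the struct and function names are hypothetical, and the star is assumed to consist of $n$ leaf nodes attached to one central node, which is the reading that matches a cost of $n$ and a center degree of $n$.

```c
/* Metrics of simple topologies with n nodes; names are illustrative. */
typedef struct {
    long diameter;
    long connectivity;
    long cost;      /* total number of edges */
    long degree;    /* maximum number of edges per node */
} topo_metrics;

topo_metrics fully_connected(long n)
{
    topo_metrics m = { 1, n - 1, (n * n - n) / 2, n - 1 };
    return m;
}

topo_metrics ring(long n)
{
    topo_metrics m = { n / 2, 2, n, 2 };   /* integer division = floor(n/2) */
    return m;
}

/* Assumption: n leaf nodes plus one central node. */
topo_metrics star(long n)
{
    topo_metrics m = { 2, 1, n, n };
    return m;
}
```

For $n = 8$ a fully connected network needs 28 links, while a ring needs 8 at the price of a diameter of 4.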










## Network Topologies

| Metric       | d-Mesh                                                  | d-Torus                                                                           |
|--------------|---------------------------------------------------------|-----------------------------------------------------------------------------------|
| Diameter     | $d \cdot (k-1) = d \cdot (\sqrt[d]{n}-1)$               | $\lceil d \cdot (k-1)/2 \rceil = \lceil d \cdot (\sqrt[d]{n}-1)/2 \rceil$         |
| Connectivity | $d$                                                     | $2 \cdot d$                                                                       |
| Cost         | $d \cdot k^{d-1} \cdot (k-1) = d \cdot (n-n^{(d-1)/d})$ | $d \cdot k^{d} = d \cdot n$                                                       |
| Degree       | $2 \cdot d$                                             | $2 \cdot d$                                                                       |

Example: $d = 2$, $k = 3$, $n = k^{d} = 9$
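The mesh and torus formulas can be evaluated for concrete values of $k$ (nodes per dimension) and $d$ (dimensions). The function names below are hypothetical. As an assumption of this sketch, the torus diameter is computed per dimension as $d \cdot \lfloor k/2 \rfloor$, since each dimension forms a ring of $k$ nodes; this equals the table's expression whenever $k$ is odd, as in the example.

```c
/* Integer power k^e for small non-negative e. */
static long ipow(long k, int e)
{
    long r = 1;
    while (e-- > 0) r *= k;
    return r;
}

/* Number of nodes n = k^d (same for mesh and torus). */
long mesh_nodes(long k, int d)     { return ipow(k, d); }

/* d-Mesh: d * (k - 1) hops between opposite corners. */
long mesh_diameter(long k, int d)  { return d * (k - 1); }

/* d-Mesh: d * k^(d-1) * (k - 1) edges. */
long mesh_cost(long k, int d)      { return d * ipow(k, d - 1) * (k - 1); }

/* d-Torus: d * k^d edges (wrap-around links added to the mesh). */
long torus_cost(long k, int d)     { return d * ipow(k, d); }

/* d-Torus: each dimension is a ring of k nodes with diameter floor(k/2). */
long torus_diameter(long k, int d) { return d * (k / 2); }
```

For $d = 2$ and $k = 3$ this gives 9 nodes, a mesh diameter of 4 with 12 edges, and a torus diameter of 2 with 18 edges.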






# Network Topologies

### Fat Tree of Depth *l*

= Binary *l*-level switch hierarchy, where uplink bandwidth equals sum of downlink bandwidths
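The bandwidth rule of the fat tree can be sketched as follows: if every switch has two downlinks and its uplink carries their sum, link bandwidth doubles with every level towards the root. The function names are hypothetical, and the sketch assumes that all leaf links have the same bandwidth and that a hierarchy of depth $l$ connects $2^l$ leaves.

```c
/*
 * Uplink bandwidth of a switch that sits `height` levels above the leaves
 * (height 1 = switch directly connecting two leaves).
 * Assumption: all leaf links have bandwidth leaf_bw.
 */
long fat_tree_uplink(long leaf_bw, int height)
{
    long bw = leaf_bw;
    for (int i = 0; i < height; i++)
        bw *= 2;                /* uplink = sum of two equal downlinks */
    return bw;
}

/* Assumption: a binary hierarchy of the given depth connects 2^depth leaves. */
long fat_tree_leaves(int depth)
{
    return 1L << depth;
}
```

With a leaf bandwidth of 10 units, a switch three levels above the leaves would need an 80-unit uplink under these assumptions.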





# D2: MPI

# Single Program Multiple Data (SPMD)





- Sequential program and data distribution
- Sequential node program with message passing
- Identical copies with different process identifications

### ParProg20 D2 MPI

Sven Köhler



# MPI Communication Terminology




- **Communicator**: handle for a group of processes (`MPI_COMM_WORLD` = all)
- **Size**: number of processes in a communicator



# Circular Left Shift Example

Usage: `shifts <number of positions>`

#### Description

- Position 0 of an array with 100 entries is initialized to 1. The array is distributed among all processes in a blockwise fashion.
- A number of circular left shift operations is executed.
- The number is specified via a command line parameter.
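The blockwise distribution and the ring neighbours used by the shift can be expressed as small index computations. These helpers are hypothetical and not part of the MPI API; they assume that the number of processes `np` divides the array length `n` evenly, and that the left neighbour of rank 0 is the last rank.

```c
/* Illustrative helpers (names are assumptions, not MPI functions). */

/* Rank that owns global index i when n elements are split into np equal blocks. */
int block_owner(int i, int n, int np)  { return i / (n / np); }

/* Position of global index i inside its owner's local block. */
int block_local(int i, int n, int np)  { return i % (n / np); }

/* Ring neighbours of rank myid among np processes. */
int left_neighbour(int myid, int np)   { return (myid - 1 + np) % np; }
int right_neighbour(int myid, int np)  { return (myid + 1) % np; }
```

With 100 entries on 4 processes, global index 26 lives at local position 1 of rank 1, and rank 0 hands its first element to rank 3 during a left shift.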



```c
for (i = 0; i < shifts; i++) {
    if (myid == 0) {
        /* Process 0: send the first element left, shift locally, then receive */
        MPI_Send(&values[0], 1, MPI_INT, lnbr, 10, MPI_COMM_WORLD);
        for (j = 1; j < 100 / np; j++) {
            values[j - 1] = values[j];
        }
        MPI_Recv(&values[100 / np - 1], 1, MPI_INT, rnbr, 10,
                 MPI_COMM_WORLD, &status);
    } else {
        /* Other processes: save the first element, shift locally,
           receive from the right, then send the saved element left */
        int buf = values[0];
        for (j = 1; j < 100 / np; j++) {
            values[j - 1] = values[j];
        }
        MPI_Recv(&values[100 / np - 1], 1, MPI_INT, rnbr, 10,
                 MPI_COMM_WORLD, &status);
        MPI_Send(&buf, 1, MPI_INT, lnbr, 10, MPI_COMM_WORLD);
    }
}
```

## Send and Receive Protocols






# **MPI** Collective Operations







# D3: Actors



### Actors



ParProg20 D3 Actors

Sven Köhler

## "Everything is an actor"



# Erlang Cluster Terminology




An Erlang cluster consists of multiple interconnected nodes, each running several light-weight processes (actors).

Message passing implemented by shared memory (same node), TCP (ERTS), ...

#### Armstrong, Joe. "Concurrency oriented programming in Erlang." Invited talk, FFG (2003).

# Concurrency in Erlang

- Each concurrent activity is called a *process* and is started from a function
- Local state consists of the call stack and local variables
- Processes interact only through asynchronous message passing
- Processes are reachable via an unforgeable name (pid)
- Design philosophy is to spawn a worker process for each new event
  - `spawn([node, ]module, function, argumentlist)`
  - Spawn always succeeds; the created process may terminate with a runtime error later (*abnormally*)
  - A supervisor process can be notified of failures
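The idea of private state combined with asynchronous message passing can be modelled in a few lines. The sketch below is written in C, like the MPI example, rather than in Erlang, and it is not how the Erlang runtime is implemented: it is single-threaded, the mailbox is bounded, and all names are hypothetical.

```c
#include <stddef.h>

#define MAILBOX_CAP 16

/* Illustrative model of an actor: private state plus a FIFO mailbox. */
typedef struct {
    int mailbox[MAILBOX_CAP];  /* pending messages */
    size_t head, count;        /* ring buffer bookkeeping */
    long state;                /* local state, touched only by the actor itself */
} actor;

/* Asynchronous send: enqueue the message and return immediately.
   Returns -1 if the (bounded) mailbox is full. */
int actor_send(actor *a, int msg)
{
    if (a->count == MAILBOX_CAP) return -1;
    a->mailbox[(a->head + a->count) % MAILBOX_CAP] = msg;
    a->count++;
    return 0;
}

/* Let the actor process all pending messages in arrival order.
   The example behaviour adds each message to the local state. */
void actor_run(actor *a)
{
    while (a->count > 0) {
        int msg = a->mailbox[a->head];
        a->head = (a->head + 1) % MAILBOX_CAP;
        a->count--;
        a->state += msg;
    }
}
```

Sending does not change the receiver's state; the state only changes once the actor itself processes its mailbox, which is what makes the interaction asynchronous.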






# Enjoy whatever helps you learn. Good luck with the exam!






