# Timeloop

# Accelergy

Angshuman Parashar Yannan Nellie Wu

Po-An Tsai

Vivienne Sze

Joel S. Emer

**NVIDIA** 

MIT

**NVIDIA** 

MIT

**NVIDIA, MIT** 

### **ISPASS Tutorial**

Part 2: Hands-on session

August 2020







### Resources

- Tutorial Website: <a href="https://accelergy.mit.edu/tutorial.html">https://accelergy.mit.edu/tutorial.html</a>
- Tutorial Docker: <a href="https://github.com/Accelergy-Project/timeloop-accelergy-tutorial">https://github.com/Accelergy-Project/timeloop-accelergy-tutorial</a>
  - Various exercises and example designs <u>and</u> environment setup for the tools





## **EXPLOITING REUSE**





## TIMELOOP / ACCELERGY

Tools for Evaluation and Architectural Design-Space Exploration of DNN Accelerators





Infrastructure Download Instructions: <a href="http://accelergy.mit.edu/isca20\_tutorial.html">http://accelergy.mit.edu/isca20\_tutorial.html</a>



## **EXERCISE 0: PROBLEM**

### Conv1D





## **EXERCISE 0: ARCHITECTURE**

### 1-Level Temporal



### **EXERCISE 0: MAPPING**

### 1-Level Temporal



### Write:

### mapping:

- target: Buffer type: temporal factors: R=3 P=16 permutation: RP

## **EXERCISE 0**

Follow the instructions in the README.

## **EXERCISE 0**

### Run Timeloop model:

>> timeloop-model arch.yaml problem.yaml map.yaml

### Output:

```
timeloop-model.map.txt

Buffer [ Weights:3 Inputs:18 Outputs:16 ]

| for P in [0:16)
| for R in [0:3)
```

```
timeloop-model.stats.txt
. . . . . .
Summary Stats
Utilization: 1.00
Cycles: 48
Energy: 0.00 uJ
Area: 0.00 mm<sup>2</sup>
MACCs = 48
pJ/MACC
    MACC
                              = 0.60
    Buffer
                              = 1.54
    Total
                              = 2.14
```



## **EXERCISE 1: ARCHITECTURE**

### 2-Level Temporal





## **EXERCISE 1: MAPPING**

### Weight Stationary





```
for p1 in [0:1)
for r1 in [0:3)
```

Buffer

for r0 in [0:1)
 for p0 in [0:16)
 Output[p] += Weight[r] \* Input[p+r];

### **Expected outputs**

| Metric              | Weights | Inputs | Outputs |
|---------------------|---------|--------|---------|
| Buffer occupancy    | 1       | Р      | Р       |
| MainMemory accesses | R       | W      | Р       |
| Buffer accesses     | PR      | PR     | 2PR     |

### Write:

### mapping:

target: MainMemory type: temporal factors: R=3 P=1

permutation: RP # inner to outer

- target: Buffer
 type: temporal
 factors: R=1 P=16

permutation: PR # inner to outer



## **EXERCISE 1: MAPPING**

### **Output Stationary**





for r1 in [0:1) for p1 in [0:16)

Buffer

for p0 in [0:1)
 for r0 in [0:3)
 Output[p] += Weight[r] \* Input[p+r];

### **Expected outputs**

| Metric              | Weights | Inputs | Outputs |
|---------------------|---------|--------|---------|
| Buffer occupancy    | R       | R      | 1       |
| MainMemory accesses | R       | W      | Р       |
| Buffer accesses     | PR      | PR     | 2PR     |

### Write:

### mapping:

- target: MainMemory
type: temporal
factors: R=1 P=16
permutation: PR

- target: Buffer type: temporal factors: R=3 P=1 permutation: RP



## **EXERCISE 1**

Follow the directions in the README.



## **EXERCISE 2: PROBLEM**

Conv1D + Output Channels



## **EXERCISE 2: MAPPINGS**

### Untiled vs. K-tiled



- target: MainMemory type: temporal factors: R=1 P=16 K=32

- target: Buffer type: temporal factors: R=3 P=1 K=1 permutation: RPK

- target: MainMemory type: temporal factors: R=1 P=16 K=16 permutation: RPK

- target: Buffer type: temporal factors: R=3 P=1 K=2



## **EXERCISE 2**

Follow the directions in the README.



## **EXERCISE 2: O.S. DATAFLOW VARIANTS**



Inputs W = P+R-1



Alg. min. MainMemory accesses

| Weights | Inputs | Outputs |
|---------|--------|---------|
| KR      | W      | KP      |

# $\bigvee (O_{kp} += W_{kr} I_{p+r-1})$

$$\bigvee_{k_1=1}^{K_1} \bigvee_{p=1}^{P} \bigvee_{k_0=1}^{K_0} \bigvee_{r=1}^{R} (O_{kp} += W_{kr} I_{p+r-1})$$
where  $K = K_1 \times K_0$  and  $k = k_1 K_0 + k_0$ 

$$\bigvee_{p_1=1}^{P_1} \bigvee_{k=1}^{K} \bigvee_{p_0=1}^{P_0} \bigvee_{r=1}^{R} (O_{kp} += W_{kr} I_{p+r-1})$$
where  $P = P_1 \times P_0$  and  $p = p_1 P_0 + p_0$ 

### Buffer occupancy

| Weights          | Inputs              | Outputs |
|------------------|---------------------|---------|
| R                | R                   | 1       |
| R                | R                   | 1       |
| R                | W                   | 1       |
| KR               | R                   | 1       |
| K <sub>b</sub> R | R                   | 1       |
| R                | R+P <sub>b</sub> -1 | 1       |

### MainMemory accesses

| Weights     | Inputs               | Outputs      |
|-------------|----------------------|--------------|
| KR          | KW                   | KP <b></b> ✓ |
| KPR         | W                    | KP 🗾         |
| KR          | W                    | KP ✓         |
| KR          | W                    | KP ⊮         |
| KR          | (K/K <sub>b</sub> )W | KP 🖊         |
| $K(P/P_b)R$ | W                    | KP 🗼         |

## **EXERCISE 3: ARCHITECTURE**

### 3-Level Temporal





## **EXERCISE 3B: BYPASSING LEVELS**

3-Level Temporal with Level Bypassing





## **EXERCISE 3B: BYPASSING**

### **Bypassing**

- Avoids energy cost of reading and writing buffers
- May result in additional accesses to outer buffers
- Does not change energy cost of moving data over network wires

For brevity in expressing mappings, Timeloop's evaluator assumes each datatype is stored at each level.

We will see later that Timeloop's mapper makes no such assumption

### Follow the directions in the README.

### Challenge

Experiment with bypass strategies to find out if there's any benefit in bypassing for this problem.



## **EXERCISE 4: SPATIAL INSTANCES**

3-Level with multiple PEs



## **EXERCISE 4: MAPPING**

### Spatial levels need loops too



By convention, a block of spatial\_for loops representing a spatial famout from storage level *Outer* to storage level *Inner* are described as a spatial mapping directive targeted at level *Outer*.



## **EXERCISE 4: SPATIAL INSTANCES**

3-Level with multiple PEs





## **EXERCISE 4**

Follow the directions in the README.



### **EXERCISE 4: SPATIAL INSTANCES**

Specifying complete mappings manually is beginning to get tedious. Space of choices and consequences is getting larger. Moving to realistic problem shapes and hardware topologies, we get a combinatorial explosion.

Fortunately, Timeloop's mapper was built exactly for this.





## INVOKING THE MAPPER



To understand how the mapper works, let's go back to a simpler hardware architecture.





Arch: 3-Level, Problem: 1D + Output Channels



### Recall:

### mapping:

- target: MainMemory
 type: temporal
 factors: R=1 P=16 K=4
 permutation: RPK

 target: GlobalBuffer type: temporal factors: R=3 P=1 K=2 permutation: RPK

- target: RegisterFile
 type: temporal
 factors: R=1 P=1 K=4
 permutation: RPK

# Mapper constructs a mapping template:

### mapping:

- target: MainMemory
 type: temporal
 factors: R=\_ P=\_ K=\_
 permutation: \_ \_ \_

- target: GlobalBuffer
 type: temporal
 factors: R=\_ P=\_ K=\_
 permutation:

- target: RegisterFile
type: temporal
factors: R=\_ P=\_ K=\_
permutation: \_ \_ \_

Arch: 3-Level, Problem: 1D + Output Channels



**Mapspace:** An enumeration of ways to fill in these red blanks:

- Factors
- Permutations
- Dataspace Bypass\*
- \* = not shown in example

Mapper constructs a mapping template:

mapping:

- target: MainMemory
 type: temporal
 factors: R=\_ P=\_ K=\_
 permutation: \_\_\_\_

target: GlobalBuffer
type: temporal
factors: R=\_ P=\_ K=\_
permutation:

- target: RegisterFile
 type: temporal
 factors: R=\_ P=\_ K=\_
 permutation: \_\_\_\_\_\_

Arch: 3-Level, Problem: 1D + Output Channels



**Mapspace:** An enumeration of ways to fill in these \_ red blanks:

- Factors
- Permutations
- Dataspace Bypass

Mapspaces can be constrained by the user.

- Architecture constraints
- Mapspace constraints

# Mapper constructs a mapping template:

mapping:

- target: MainMemory
   type: temporal
   factors: R=\_ P=\_ K=\_
   permutation: \_ \_ \_
- target: GlobalBuffer
   type: temporal
   factors: R=\_ P=\_ K=\_
   permutation: \_\_\_\_\_
  - target: RegisterFile type: temporal factors: R=\_ P=\_1 K=\_1 permutation: R\_\_\_\_

Arch: 3-Level, Problem: 1D + Output Channels



Mapspace: An enumeration of ways to fill in these \_ red blanks:

- Factors
- Permutations
- Dataspace Bypass

Mapspaces can be constrained by the user.

- Architecture constraints
- Mapspace constraints

Mapper runs a search heuristic over the constrained mapspace

# Mapper constructs a mapping template:

mapping:

- target: MainMemory
   type: temporal
   factors: R=\_ P=\_ K=\_
   permutation: \_ \_ \_
- target: GlobalBuffer
   type: temporal
   factors: R=\_ P=\_ K=\_
   permutation: \_ \_ \_
  - target: RegisterFile
    type: temporal
    factors: R=\_ P=1 K=1
    permutation: R \_\_\_



## **EXERCISE 5: MAPSPACE CONSTRAINTS**

We provide 3 alternative sets of constraints:

- 1mapping: Constrain mapspace to the point that only 1 legal mapping remains in it!
- freebypass: Factors and permutations are forced, but bypass options are left unspecified.
  - Each of 3 dataspaces may either be kept or bypassed at each of the 2 inner levels (RegisterFile and GlobalBuffer) => (2^2)^3 = 64 choices!
  - Does Timeloop find a better bypassing strategy?
- null: Fully unconstrained.
  - How large is the mapspace?
  - Does Timeloop find a better mapping?



### HARDWARE X/Y DIMENSIONS



## HARDWARE X/Y DIMENSIONS

What if you wanted this mapping instead?

GlobalBufer RegFile RegFile RegFile RegFile map K=3 RegFile RegFile RegFile RegFile RegFile RegFile RegFile RegFile map K=4 factors: K=4 K=3 R=1 S=1 P=1 Q=1 N=1 permutation: K KRSP O N split: 1

Use a simulation hack: a "dummy" buffer





## PARTITIONED BUFFERS

To model:



Represent it as:



This is also a temporary workaround. Partitioned buffers will be supported natively in future.

## **EXERCISE 6: PROBLEM**

### Convolutional Network Layer

```
for r = [0:R):
  for s = [0:S):
                                                                                              data-spaces:
                                                                    problem:
    for p = [0:P):
                                                                                                 - name: Weights
                                                                      shape:
      for q = [0:0):
                                                                                                   projection:
                                                                        name: CNNLayer
        for c = [0:C):
                                                                         dimensions:
                                                                                                   - [ [C] ],
          for k = [0:K):
                                                                                                   - [ [K] ],
            for n = [0:N):
                                                                                                   - [ [R] ],
              Output[n][k][q][p] +=
                                                                                                   - [ [S] ]
                  Weight[c][k][r][s] *
                                                                         - S
                                                                                                 - name: Inputs
                  Input[n][c]
                                                                                                   projection:
                        [q*Hstride+s*Hdilation]
                                                                                                   - [ [N] ]
                                                                         - Q
                        [p*Wstride+r*Wdilation];
                                                                                                   - [ [S, Hdilation], [Q, Hstride] ]
                                                                     coefficients:
                                                                                                   - [ [R, Wdilation], [P, Wstride] ]
                                                                         - name: Wstride
      Weights
                        Inputs
                                                                                                 - name: Outputs
                                                                           default: 1
                                      Outputs
                                                                                                   projection:
                                                                         - name: Hstride
                                                                           default: 1
                                                                                                   - [ [N] ]
                H=
                                                                                                   - [ [K] ]
                                                                         name: Wdilation
               O+S-1
                                                                                                   - [ [Q] ]
                                                                           default: 1
                                                                         - name: Hdilation
                                                                                                   - [ [P] ]
                                                                                                   read-write: True
                                                                           default: 1
```



# **EXERCISE 6: ARCHITECTURE**

Eyeriss-256



I'liT

## **EXERCISE 6: CNN LAYER ON EYERISS-256**

Mapper is multi-threaded.

- Mapspace is split between each mapper thread.
- Default number of threads = number of logical CPUs on host machine.

For long mapper runs, you can use the interactive ncurses-based status tracker by setting mapper.live-status = True

- Tracks various statistics for each mapper thread:
  - Best energy-efficiency/performance seen so far
  - Number of legal/illegal/total mappings examined so far
  - Number of consecutive illegal mappings
  - Number of consecutive legal sub-optimal mappings



# **TUNING THE MAPPER'S SEARCH**

### Search heuristics (as of today)

- I inear
- Random
- Hybrid

Optimization criteria: prioritized list of statistics emitted by the model, e.g.,

- [cycles, energy]
- [ last-level-accesses ]

#### **Termination conditions**

- Mapspace exhausted
- #Valid mappings encountered >= "search-size"
- #Consecutive invalid mappings encountered >= "timeout"
- #Consecutive sub-optimal valid mappings encountered >= "victory-condition"
- Ctrl+C



# **EXERCISE 6**

Follow the directions in the README.

Complete the exercise and enjoy!



## **TIMELOOP**



Timeloop aims to serve as a vehicle for quality research on flexible DNN accelerator architectures. The infrastructure is released at <a href="https://github.com/NVlabs/timeloop">https://github.com/NVlabs/timeloop</a> under a BSD license.

Please join us in making Timeloop better and more useful for research opportunities across the community.





### Resources

#### Tutorial Related

- Tutorial Website: <a href="http://accelergy.mit.edu/isca20\_tutorial.html">http://accelergy.mit.edu/isca20\_tutorial.html</a>
- Tutorial Docker: <a href="https://github.com/Accelergy-Project/timeloop-accelergy-tutorial">https://github.com/Accelergy-Project/timeloop-accelergy-tutorial</a>
  - Various exercises and example designs <u>and</u> environment setup for the tools

#### Other

- Infrastructure Docker: <a href="https://github.com/Accelergy-Project/accelergy-timeloop-infrastructure">https://github.com/Accelergy-Project/accelergy-timeloop-infrastructure</a>
  - Pure environment setup for the tools <u>without</u> exercises and example designs
- Open Source Tools
  - Accelergy: <a href="http://accelergy.mit.edu/">http://accelergy.mit.edu/</a>
  - Timeloop: <a href="https://github.com/NVlabs/timeloop">https://github.com/NVlabs/timeloop</a>

### – Papers:

- A. Parashar, et al. "Timeloop: A systematic approach to DNN accelerator evaluation," ISPASS, 2019.
- Y. N. Wu, V. Sze, J. S. Emer, "An Architecture-Level Energy and Area Estimator for Processing-In-Memory Accelerator Designs," ISPASS, 2020.
- Y. N. Wu, J. S. Emer, V. Sze, "Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs," ICCAD, 2019.





## MAPPING CHOICES

### Energy-efficiency of peak-perf mappings of a single problem



480,000 mappings shown

Spread: 19x in energy efficiency

Only 1 is optimal, 9 others within 1%

A model needs a mapper to evaluate a DNN workload on an architecture

6,582 mappings have min. DRAM accesses but vary 11x in energy efficiency

A mapper needs a good cost model to find an optimal mapping



### Infrastructure Download Instructions:

## WHY TIMELOOP/ACCELERGY?

### Microarchitectural model (Timeloop/Accelergy)

- Expressive: generic, template based hardware model
- Fast: faster than native execution on host CPUs
- Accurate: validated vs. design-specific models

### Technology model (Accelergy)

- Allows user-defined complex architectural components
- Plugins for various technology models, e.g., Cacti, Aladdin, proprietary databases

### Built-in Mapper (Timeloop)

 Addresses the hard problem of optimizing data reuse, which is required for faithful evaluation of a workload on an architecture



