# **Recap: Computations of DL EECE695D: Efficient ML Systems**

Spring 2025



- Last Class
  - Motivations
  - What to expect
  - Administrivia
- Today
  - A brief recap of ML/DL
    - Inference cost?
    - Training cost?

### Agenda

Linear Models

### Model

- Linear Model = Extremely simple neural net
  - e.g., linear regression, perceptrons, ...
  - for an easy discussion, suppose that  $y \in \mathbb{R}^1$  (the value to be predicted)



Output prediction

$$\sum_{i=0}^{M} w_i x_i = \mathbf{W} \cdot \mathbf{X}$$

### Inference

- **Question**. How many computations do we need for an inference?

  - Which unit will you use?

### $\hat{y} = \mathbf{w}^{\mathsf{T}} \mathbf{x}, \qquad \mathbf{x}, \mathbf{w} \in \mathbb{R}^d$

### **Recap: FLOPs**

- One option is to use FLOPs (Floating Point Operations)
  - 1 addition of floats = 1 FLOP
  - 1 multiplication of floats = 1 FLOP

| float | 8 bits   |        |
|-------|----------|--------|
| 0     | 01111110 | .01010 |
| Sign  | Exponent |        |

value = (sign) 
$$\times$$
 2



### 101010101010101011

Fraction (a.k.a. Mantissa)

### (exponent) 1.(fraction) X

https://docs.nvidia.com/cuda/floating-point/index.html



- $\hat{y} = \mathbf{w}^{\mathsf{T}} \mathbf{x}, \qquad \mathbf{x}, \mathbf{w} \in \mathbb{R}^d$
- Answer. We need 2*d* FLOPs
  - We are performing an elementary operation d times:

Inference (again)

 $s \leftarrow s + (w_i \times x_i)$ 

## MAC / MAD

Multiply and add (or accumulate) abstracted into a single operation

Rounding only done once; better precision

 $a \leftarrow \operatorname{rn}(a + \operatorname{rn}(b \times c))$ 

 $a \leftarrow a + (b \times c)$ 

 $a \leftarrow \operatorname{rn}(a + (b \times c))$ 

### Example

- To see the importance of rounding, consider the following example:

### 







 $p1 = rn (a_1 \times b_1)$  $p^2 = rn (a_2 \times b_2)$  $p3 = rn (a_3 \times b_3)$  $p4 = rn (a_4 \times b_4)$  $s_{left} = rn(p_1 + p_2)$  $s_{right} = rn(p_3 + p_4)$ 

### • Sometimes, you would see "TOPs"

 Usually an umbrella term for (a trillion) INT8 / INT4 operations (here, "op" could mean a single fused multiply-adds)

|                      | Jetson AGX Orin series                                           |                         |                                                                | Jetson Orin NX series                                               |                        | Jetson Orin Nano series                                             |                                            |                                           |             |
|----------------------|------------------------------------------------------------------|-------------------------|----------------------------------------------------------------|---------------------------------------------------------------------|------------------------|---------------------------------------------------------------------|--------------------------------------------|-------------------------------------------|-------------|
|                      | Jetson AGX<br>Orin<br>Developer Kit                              | Jetson AGX<br>Orin 64GB | Jetson AGX<br>Orin Industrial                                  | Jetson AGX<br>Orin 32GB                                             | Jetson Orin<br>NX 16GB | Jetson Orin<br>NX 8GB                                               | Jetson Orin<br>Nano Super<br>Developer Kit | Jetson Orin<br>Nano 8GB                   | Jets<br>Nan |
| Al<br>Performance    | 275 1                                                            | FOPS                    | 248 TOPs                                                       | 200 TOPS                                                            | 157 TOPS               | 117 TOPS                                                            | 67 TOPS                                    | 67 TOPS                                   | 34          |
| GPU                  | 2048-core NVIDIA Ampere architecture GPU with 64<br>Tensor Cores |                         | 1792-core<br>NVIDIA<br>Ampere c GPU<br>with 56 Tensor<br>Cores | 1024-core NVIDIA Ampere<br>architecture GPU with 32 Tensor<br>Cores |                        | 1024-core NVIDIA Ampere<br>architecture GPU with 32 Tensor<br>Cores |                                            | 512<br>N\<br>Arr<br>archi<br>GPU<br>Tenso |             |
| GPU Max<br>Frequency | 1.3 (                                                            | GHz                     | 1.2 GHz                                                        | 930 MHz                                                             | 1173MHz                | 1173MHz                                                             | 1020MHz                                    | 1020MHz                                   | 102         |

### TOPs





### **34 TOPS**



• **Question**. How many computations do we need for training?



## Training

## Training

- This is ill-posed, as we require more setup:
  - $D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}$ We have a dataset
  - We use the squared loss  $\ell(y, \hat{y}) = (y \hat{y})^2$
  - We solve the empirical risk minimization

$$\min_{\mathbf{w} \in \mathbb{R}^d} \sum_{i=1}^{N} (y_i - \mathbf{w}^{\mathsf{T}} \mathbf{x}_i)^2$$

• **Question**. Can we answer the question now?

$$\Leftrightarrow \min_{\mathbf{w} \in \mathbb{R}^d} \|\mathbf{y} - \mathbf{w}^\mathsf{T} \mathbf{X}\|_2^2$$

- Not really!
  - It depends on the optimization method to solve:

 $\min_{\mathbf{w} \in \mathbb{R}^d} \| \mathbf{x} \|_{\mathbf{w}}$ 

- Exact solution
- Gradient descent

## Training

$$\|\mathbf{y} - \mathbf{w}^{\mathsf{T}}\mathbf{X}\|_2^2$$

- Exact solution can be found as

  - Here, dagger means Moore–Penrose pseudoinverse

- Typically, computing the matrix inverse is quite costly
  - $O(n^3)$  FLOPs, where constants depend on "how" (tradeoff of numerical stability & computation)

Exact solution

### $\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^\dagger \mathbf{X}^\top \mathbf{y}$

### Gradient descent

• We'll use gradient descent (does this make our life easier?)

• For linear regression, this is:

$$\mathbf{w} \leftarrow (\mathbf{I} - 2\eta)$$

- - Q. What if we use SGD?
  - **Q.** Any sacrifice in memory?

 $\mathbf{w} \leftarrow \mathbf{w} - \eta \cdot \nabla L(\mathbf{w})$ 

### $\mathbf{X}^{\mathsf{T}}\mathbf{X}\mathbf{W} + 2\eta \cdot \mathbf{X}^{\mathsf{T}}\mathbf{W}$

Luckily, blue terms can be pre-computed, and thus negligible.

### Per-iteration compute

After pre-computing, we are doing simply

$$\mathbf{w} \leftarrow \mathbf{A}\mathbf{w} + \mathbf{b},$$

- VV adds. *d* FLOPs

• Brainteaser. Can you find a better way, if  $d \gg N$ ?

$$\mathbf{A} \in \mathbb{R}^{d \times d}, \mathbf{b} \in \mathbb{R}^{d}$$

• **MV mults.** Equivalent to computing dot products d times =  $2d^2$  FLOPs

### Takeaways

- Many things affect the computational cost
  - Dataset size
  - Model size
  - Optimization algorithm
  - Number of iterations
- Even worse, the optimal way to compute can vary
  - dvs N
- Luckily, measuring the inference cost is relatively simple

Neural nets

### Neural nets

- A graph of layers
  - Each layer performs an elementary operation
  - Example. MLP



### Example: Inceptionv3



- Dropout
- Fully connected
- Softmax

V

https://cloud.google.com/tpu/docs/inception-v3-advanced?hl=en

## Linear layer

• The simplest building block

### $\mathbf{Y} = \sigma(\mathbf{W}\mathbf{X} + \mathbf{b}\mathbf{1}^{\mathsf{T}})$

- Input  $\mathbf{X} \in \mathbb{R}^{d_i \times n}$ 
  - $d_i$ : Input dimension
  - *n*: batch size
- Weight  $\mathbf{W} \in \mathbb{R}^{d_o \times d_i}$ 
  - $d_o$ : output dimension
- Activation function  $\sigma(\cdot)$

### Input Neurons



### Output Neurons



# Linear layer $\mathbf{Y} = \boldsymbol{\sigma}(\mathbf{W}\mathbf{X} + \mathbf{b}\mathbf{1}^{\mathsf{T}})$

- Sequentially perform three operations
  - MM multiply
  - MM addition
  - Activation

• Question. What is the heaviest?

## Matrix multiplications

- Of course, the matmul
  - How many FLOPs for multiplying two matrices?





n



n

• B =

C

## Matrix multiplications

- comptuations
  - e.g., convolution, self-attention, gradient computation, ...
  - A good reference: <u>this NVIDIA doc</u>

Question. Other than FLOPs, what should we care about?

• More broadly, GEMM (generalized matmuls) is at the center of deep learning

### Memory issues

- Consider multiplying very large matrices
  - Suppose that we multiply two 4096x4096 matrices.
  - <u>Question</u>. If they are in FP32, how much memory do you need?
    - Can your L1 cache hold it?
    - Can your L2 cache hold it?

Need to care about data movement!

| Central Processing Unit |  |  |
|-------------------------|--|--|
| Control Unit            |  |  |
|                         |  |  |
| Arithmetic/Logic Unit   |  |  |
|                         |  |  |
|                         |  |  |
| Memory Unit             |  |  |
|                         |  |  |

## Memory issues

- DL HWs tend to have spacious memory & high memory bandwidths
- <u>Example</u>. NVIDIA H100 has ~60MB L2 cache & 80GB GPU memory, connected with several TB/s bandwidths



## Matmuls on GPUs

- On GPUs, matmuls are grouped into tiles.



• See also: link

| ( | ; |
|---|---|
|   |   |

- To fully utilize tensor cores, pay attention to dimensions
  - Cannot get full benefits if dimensions are not regular.

| Tensor Cores can be used for | cuBLAS version < 11.0<br>cuDNN version < 7.6.3 | cuBLAS version ≥ 11.0<br>cuDNN version ≥ 7.6.3                             |
|------------------------------|------------------------------------------------|----------------------------------------------------------------------------|
| INT8                         | Multiples of 16                                | Always but most efficient with multiples<br>16; on A100, multiples of 128. |
| FP16                         | Multiples of 8                                 | Always but most efficient with multiples<br>8; on A100, multiples of 64.   |

### Side note: Tensor Cores

https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html





https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html



## Memory vs. Compute

Suppose that we have an MLP without biases

Input 
$$\rightarrow$$
 Layer 2

An overly simplified sketch for computing matmuls:

4. Repeat 1–3.





## Memory vs. Compute

• Given limited on-chip memory, we can do a stupid thing:

Memory Compute



- A nightmare in terms of runtime
  - More money (cloud GPU)
  - More electricity (idle energy)



## Memory vs. Compute



With more spacious on-chip memory, we can be smarter (double buffering)

### **Bottleneck?**

- Either can be the bottleneck
  - Compute-bound
  - Memory-bound
- Depends on the hardware, model architecture, inference vs. training
  - Nice blog: <u>link</u>



- which HW we use
  - Typically, GPUs spend much less time for compute



Fig. 2: Run time decomposition on two representative state-of-the-art network architectures, ShuffeNet v1 [15]  $(1 \times, g = 3)$  and MobileNet v2 [14]  $(1 \times)$ .



Even with the same model, the runtime distribution differs a lot depending on

Ma et al., "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design," ECCV 2018





### Model architecture

- Consider the example of convolution
  - Parameter sharing, thus reduces the memory burden



### Model architecture

• To map a  $d \times d \times c$  input to an output of the same size:

### (Option 1) Fully-connected Layer

**Params.** Requires  $c^2 d^4$  parameters.

**Compute.** Requires  $2c^2d^4$  FLOPs

### **Option 2) Convolutional layer** $(3 \times 3)$

**Params.** Requires  $9c^2$  parameters

**Compute.** Requires  $18c^2d^2$  FLOPs

• Saves  $d^4$  in parameters, and  $d^2$  in computations



# Neural nets: Training

- Much more complicated
- Recap: Backprop
  - Re-uses the activations computed during the forward
    - Less computation
    - More memory

## Training



https://pytorch.org/blog/computational-graphs-constructed-in-pytorch/

RuntimeError: CUDA out of memory. Tried to allocate 200.00 MiB (GPU 0; 15.78 GiB total capacity; 14.56 GiB already allocated; 38.44 MiB free; 14.80 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max\_split\_size\_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH\_CUDA\_ALLOC\_CONF





• Consider a two-layer MLP



### Example

 $\hat{Y} = W_2 \sigma(W_1 X)$ 



- Holding intermediate results in memory helps computing backward.
  - Sacrifice in memory
  - Advantages in computation



### Example



- <u>Note</u>. We need <u>2x more computation for BW</u> than FW
  - useful rule of thumb



### Example





### FLOP backward-forward ratios

### Tradeoff

- Backprop trades memory for computation
- Gradient Checkpointing. Trades less memory for less computational benefit.
  - Discard "some activations," and re-materialize whenever needed
    - Visual explanation here: <u>https://github.com/cybertronai/gradient-</u> <u>checkpointing?tab=readme-ov-file</u>

















Vanilla Backprop

### Memory-Poor

• If we can hold only three circles



### With Gradient Checkpoints

• If we can hold slightly more circles



- Discussion so far focuses compute vs. memory, in terms of runtime
  - Making Al faster
- This is not necessarily aligned with other notions of "efficiency"
  - Making Al smaller (#params, #bits)
  - Making Al greener (energy usage)

- Also, duration  $\neq$  latency  $\neq$  throughput
  - See excellent treatise on "<u>the efficiency misnomer</u>"

### Remarks



Figure 1: Comparison of standard Transformers, Universal Transformers and Switch Transformers in terms of three common cost metrics: number of parameters, FLOPs, and throughput. Relative ranking between models is reversed between the two cost indicators. Experiments and the computation of cost metrics were done with Mesh Tensorflow (Shazeer et al., 2018), using 64 TPU-V3.



## Wrapping up

- Today. Recaps on basic ideas
- Next. Sparsity

- Ask yourself:
  - How does forward-mode autodiff compare with backward-mode, in terms of compute & memory?
  - How does depthwise convolution compare with vanilla?
  - How does Adam compare with SGD, in terms of memory?

How does GeLU compare with ReLU, in terms of memory during backprop?

