

## Kernel Invocation (Host-side Code)

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

You need to extend the code to handle rectangular matrices in MP-2!
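The launch above takes `dimGrid` and `dimBlock` as given. The following is a minimal sketch of one way to configure them for a square `Width` x `Width` output, not the course's reference code. It assumes a 16x16 block (`TILE_WIDTH` is an assumed value), and the names `MarkCoverage`, `ceilDiv`, and `visits_d` are hypothetical. `MarkCoverage` stands in for `MatrixMulKernel` so the sketch can check that the rounded-up grid covers every output element exactly once.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed block edge length; not given on the slide

// Hypothetical stand-in for MatrixMulKernel: every in-range thread
// increments the counter of the one output element it owns.
__global__ void MarkCoverage(int* visits, int Width) {
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        visits[Row * Width + Col] += 1;
    }
}

// Blocks needed to cover n elements with `block` threads per block (rounds up).
int ceilDiv(int n, int block) { return (n + block - 1) / block; }

int main() {
    const int Width = 100;  // deliberately not a multiple of TILE_WIDTH
    const size_t bytes = sizeof(int) * Width * Width;

    int* visits_d = nullptr;
    if (cudaMalloc(&visits_d, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(visits_d, 0, bytes);

    // One thread per output element; round the grid up so the edges are covered.
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
    dim3 dimGrid(ceilDiv(Width, TILE_WIDTH), ceilDiv(Width, TILE_WIDTH), 1);
    MarkCoverage<<<dimGrid, dimBlock>>>(visits_d, Width);

    // Catch both launch-configuration errors and errors raised while running.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(visits_d);
        return 1;
    }

    std::vector<int> visits(Width * Width);
    cudaMemcpy(visits.data(), visits_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(visits_d);

    // Count elements that were skipped or visited more than once.
    int bad = 0;
    for (int v : visits) {
        if (v != 1) ++bad;
    }
    printf("grid %u x %u, elements not visited exactly once: %d\n",
           dimGrid.x, dimGrid.y, bad);
    return bad == 0 ? 0 : 1;
}
```

With `Width = 100` and 16x16 blocks the grid is 7x7, so 49 blocks launch 12,544 threads and the boundary test lets 10,000 of them do work. Rounding up, and then guarding with the `if`, is what lets `Width` be any size rather than a multiple of the block edge.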

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2018 ECE408/CS483/ University of Illinois at Urbana-Champaign




### A Simple Matrix Multiplication Kernel (Simplified Dimension and Syntax!)

    // The 2D indexing d_M[Row][k] is the slide's simplified syntax.
    __global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width)
    {
      // Calculate the row index of the d_P element and d_M
      int Row = blockIdx.y*blockDim.y+threadIdx.y;
      // Calculate the column index of d_P and d_N
      int Col = blockIdx.x*blockDim.x+threadIdx.x;

      if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k) {
          Pvalue += d_M[Row][k] * d_N[k][Col];
        }
        d_P[Row][Col] = Pvalue;
      }
    }
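The two-index notation in this kernel is shorthand. With plain `float*` arguments, element (Row, k) of a row-major `Width` x `Width` matrix lives at offset `Row * Width + k`. Below is a compilable sketch of the same kernel with that linearized indexing, plus a small host driver that compares the result with a CPU loop. The driver, the helper `matMulHost`, the 16x16 block size, and the test values are assumptions added for illustration and are not from the slides.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Same computation as the slide, with row-major linear indexing.
__global__ void MatrixMulKernel(const float* d_M, const float* d_N,
                                float* d_P, int Width) {
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k) {
            Pvalue += d_M[Row * Width + k] * d_N[k * Width + Col];
        }
        d_P[Row * Width + Col] = Pvalue;
    }
}

// Hypothetical CPU reference used only to check the GPU result.
void matMulHost(const float* M, const float* N, float* P, int Width) {
    for (int r = 0; r < Width; ++r) {
        for (int c = 0; c < Width; ++c) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                sum += M[r * Width + k] * N[k * Width + c];
            }
            P[r * Width + c] = sum;
        }
    }
}

int main() {
    const int Width = 50;   // assumed size, not a multiple of the block edge
    const int block = 16;   // assumed block edge length
    const size_t bytes = sizeof(float) * Width * Width;

    std::vector<float> M(Width * Width), N(Width * Width);
    std::vector<float> P(Width * Width), ref(Width * Width);
    for (int i = 0; i < Width * Width; ++i) {
        M[i] = (i % 7) * 0.5f;    // small arbitrary test values
        N[i] = (i % 5) - 2.0f;
    }

    float *d_M = nullptr, *d_N = nullptr, *d_P = nullptr;
    if (cudaMalloc(&d_M, bytes) != cudaSuccess ||
        cudaMalloc(&d_N, bytes) != cudaSuccess ||
        cudaMalloc(&d_P, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d_M, M.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_N, N.data(), bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(block, block, 1);
    dim3 dimGrid((Width + block - 1) / block, (Width + block - 1) / block, 1);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(P.data(), d_P, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_M); cudaFree(d_N); cudaFree(d_P);

    // Compare against the CPU reference with a small tolerance.
    matMulHost(M.data(), N.data(), ref.data(), Width);
    int mismatches = 0;
    for (int i = 0; i < Width * Width; ++i) {
        if (std::fabs(P[i] - ref[i]) > 1e-3f) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

Each thread still reads a full row of `d_M` and a full column of `d_N` from global memory, which is the access pattern the later tiling slides aim to reduce.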








































[Figure: a row of threads, Thread 0 through Thread N-1]









#### Memory Bandwidth Consumption

Using 16x16 tiling, we reduce the global memory accesses by a factor of 16
- Each operand is now used by 16 floating-point operations
- The 150 GB/s bandwidth can now support (150/4)\*16 = 600 GFLOPS!

Using 32x32 tiling, we reduce the global memory accesses by a factor of 32
- Each operand is now used by 32 floating-point operations
- The 150 GB/s bandwidth can now support (150/4)\*32 = 1,200 GFLOPS!
- The memory bandwidth is no longer a limiting factor for performance!
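The arithmetic behind these figures can be written out as follows. This is a sketch under two assumptions: operands are 4-byte single-precision floats, and the untiled baseline performs one floating-point operation per operand fetched from global memory. The baseline row is implied by the slide's numbers rather than stated on it.

$$
\begin{aligned}
\text{untiled:} \quad & \frac{150\ \text{GB/s}}{4\ \text{B/operand}} \times 1\ \frac{\text{FLOP}}{\text{operand}} = 37.5\ \text{GFLOPS} \\
16 \times 16\ \text{tiles:} \quad & \frac{150\ \text{GB/s}}{4\ \text{B/operand}} \times 16\ \frac{\text{FLOP}}{\text{operand}} = 600\ \text{GFLOPS} \\
32 \times 32\ \text{tiles:} \quad & \frac{150\ \text{GB/s}}{4\ \text{B/operand}} \times 32\ \frac{\text{FLOP}}{\text{operand}} = 1{,}200\ \text{GFLOPS}
\end{aligned}
$$

In each row the bandwidth term is fixed at 37.5 billion operands per second; only the number of operations performed per fetched operand changes with the tile width.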




