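1. Preface
On May 14, 2020, while the pandemic raged worldwide and countless researchers raced to develop a COVID-19 vaccine, NVIDIA founder and CEO Jensen Huang was hawking goods on a livestream from his own kitchen. Sorry, make that delivering the NVIDIA GTC 2020 keynote, in which he enthusiastically introduced the freshly baked NVIDIA A100 GPU, built on the new Ampere architecture and billed as the most lavish barbecue in history.
The NVIDIA A100 Tensor Core GPU is based on the new Ampere architecture. Its core is the GA100, manufactured on TSMC's 7nm process, with 54.2 billion transistors on an 826 mm^2 die. Its predecessor, the GV100, had 21.1 billion transistors on an 815 mm^2 die. In just three years, thanks to the new process, the level of integration has more than doubled! NVIDIA's launch material and white paper contain some eye-catching numbers, and today we work out where those numbers come from. To do that, we need to dig into the GPU architecture.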
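The "more than doubled" claim can be checked from the die figures quoted above. The following is a minimal sketch; the `Die` struct and the `density_mtr_per_mm2` helper are hypothetical names introduced only for this illustration, and the inputs are the transistor counts and die areas given in the text.

```cpp
#include <cassert>

// Die-level figures quoted in the text (hypothetical struct for illustration).
struct Die {
    const char *name;
    double transistors_billion;  // transistor count, in billions
    double area_mm2;             // die area, in mm^2
};

// Transistor density in millions of transistors per mm^2.
double density_mtr_per_mm2(const Die &d) {
    assert(d.area_mm2 > 0.0);
    return d.transistors_billion * 1000.0 / d.area_mm2;
}

// Figures from the text: GA100 (Ampere) and GV100 (Volta).
const Die GA100 = {"GA100", 54.2, 826.0};
const Die GV100 = {"GV100", 21.1, 815.0};
```

With these inputs the density works out to roughly 65.6 million transistors per mm^2 for GA100 versus roughly 25.9 for GV100, a ratio of about 2.5x on an almost identical die area.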
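2. Evolution of GPU Architecture
A graphics processing unit (GPU) accelerates real-time rendering of computer graphics. It is the chip at the heart of what is commonly called a graphics card, and it is often used for gaming. Since NVIDIA introduced the GeForce 256 in 1999, which it describes as the first GPU, twenty-one years have passed.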




3. The Ampere Architecture in Detail
The Ampere white paper [1] gives the overall architecture diagram of GA100, shown below:

- GPC: Graphics Processing Cluster
- TPC: Texture Processing Cluster
- SM: Streaming Multiprocessor
- HBM2: High Bandwidth Memory, second generation
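The GPC, TPC, and SM levels nest, so the SM count of a chip is a simple product. The sketch below assumes the full-GA100 configuration as I understand it from the Ampere white paper (8 GPCs, 8 TPCs per GPC, 2 SMs per TPC) and 64 FP32 CUDA cores per SM; the `GpuLayout` struct and both helper functions are hypothetical names used only for illustration.

```cpp
#include <cassert>

// One level of nesting per field (hypothetical struct for illustration).
struct GpuLayout {
    int gpcs;          // Graphics Processing Clusters per GPU
    int tpcs_per_gpc;  // Texture Processing Clusters per GPC
    int sms_per_tpc;   // Streaming Multiprocessors per TPC
};

// Total SM count is the product of the three levels.
int total_sms(const GpuLayout &g) {
    assert(g.gpcs > 0 && g.tpcs_per_gpc > 0 && g.sms_per_tpc > 0);
    return g.gpcs * g.tpcs_per_gpc * g.sms_per_tpc;
}

// Total FP32 CUDA cores for a given number of enabled SMs.
int total_fp32_cores(int sms, int cores_per_sm) {
    assert(sms > 0 && cores_per_sm > 0);
    return sms * cores_per_sm;
}
```

For the assumed full configuration, `total_sms({8, 8, 2})` gives 128 SMs and 8192 FP32 cores. The shipping A100 enables 108 of those SMs, so `total_fp32_cores(108, 64)` gives 6912.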




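4. Comparing Peak Compute Capability Across GPU Models
We can look up the peak compute capability of each GPU model in data sheets and white papers, but that only exists on paper. A management and control system needs a tool that obtains these values and records them in a device database, so that a scheduler can later allocate compute capacity according to demand and available inventory. This section provides such a tool, which computes a GPU's peak compute capability automatically. It is written against the CUDA Runtime API and has no special requirement on the CUDA version. Running it on an A100 produces the following output: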










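The FP32 peak that the tool reports follows one formula: CUDA cores per SM, times SM count, times SM clock in GHz, times 2 (one fused multiply-add counts as two floating-point operations). Below is a minimal sketch of that arithmetic without any CUDA dependency. The function names are hypothetical, and the A100 inputs (108 SMs, 64 cores per SM, 1410 MHz boost clock) are taken from NVIDIA's published specifications rather than from a run of the tool.

```cpp
#include <cassert>

// CUDA reports the SM clock in kHz; dividing by 1e6 converts it to GHz.
double khz_to_ghz(int clock_khz) {
    assert(clock_khz > 0);
    return clock_khz / 1e6;
}

// Peak FP32 throughput in GFLOPS:
//   cores_per_sm * sms * clock_ghz * 2 (an FMA counts as 2 FLOPs).
double peak_fp32_gflops(int sms, int cores_per_sm, double clock_ghz) {
    assert(sms > 0 && cores_per_sm > 0 && clock_ghz > 0.0);
    return static_cast<double>(cores_per_sm) * sms * clock_ghz * 2.0;
}
```

With the assumed A100 inputs, `peak_fp32_gflops(108, 64, khz_to_ghz(1410000))` comes to about 19492 GFLOPS, which matches the 19.5 TFLOPS FP32 figure NVIDIA quotes for the A100.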
5. Code for This Article
calc_peak_gflops.cpp
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>
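// Abort with a message if a CUDA runtime call fails.
// The message is printed through "%s" so it is never treated as a format string.
#define CHECK_CUDA(x, str)                \
    do {                                  \
        if ((x) != cudaSuccess) {         \
            fprintf(stderr, "%s\n", str); \
            exit(EXIT_FAILURE);           \
        }                                 \
    } while (0)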
// Map a compute capability (major.minor) to the number of FP32 CUDA cores per SM.
int cc2cores(int major, int minor)
{
    typedef struct
    {
        int SM;     // 0xMm (hex), M = major version, m = minor version
        int Cores;  // FP32 CUDA cores per SM
    } sSMtoCores;
    sSMtoCores nGpuArchCoresPerSM[] =
    {
        {0x30, 192},  // Kepler
        {0x32, 192},
        {0x35, 192},
        {0x37, 192},
        {0x50, 128},  // Maxwell
        {0x52, 128},
        {0x53, 128},
        {0x60, 64},   // Pascal
        {0x61, 128},
        {0x62, 128},
        {0x70, 64},   // Volta
        {0x72, 64},
        {0x75, 64},   // Turing
        {0x80, 64},   // Ampere
        {-1, -1}      // end-of-table sentinel
    };
    int index = 0;
    while (nGpuArchCoresPerSM[index].SM != -1)
    {
        if (nGpuArchCoresPerSM[index].SM == ((major << 4) + minor))
        {
            return nGpuArchCoresPerSM[index].Cores;
        }
        index++;
    }
    // Unknown compute capability: fall back to the last known entry.
    printf(
        "MapSMtoCores for SM %d.%d is undefined."
        " Default to use %d Cores/SM\n",
        major, minor, nGpuArchCoresPerSM[index - 1].Cores);
    return nGpuArchCoresPerSM[index - 1].Cores;
}
bool has_fp16(int major, int minor)
{
    int cc = major * 10 + minor;
    return ((cc == 60) || (cc == 62) || (cc == 70) || (cc == 75) || (cc == 80));
}
bool has_int8(int major, int minor)
{
    int cc = major * 10 + minor;
    return ((cc == 61) || (cc == 70) || (cc == 75) || (cc == 80));
}
bool has_tensor_core_v1(int major, int minor)
{
    int cc = major * 10 + minor;
    return ((cc == 70) || (cc == 72));
}
bool has_tensor_core_v2(int major, int minor)
{
    int cc = major * 10 + minor;
    return (cc == 75);
}
bool has_tensor_core_v3(int major, int minor)
{
    int cc = major * 10 + minor;
    return (cc == 80);
}
int main(int argc, char **argv)
{
    cudaDeviceProp prop;
    int dc;
    CHECK_CUDA(cudaGetDeviceCount(&dc), "cudaGetDeviceCount error!");
    printf("GPU count = %d\n", dc);
    for (int i = 0; i < dc; i++)
    {
        printf("=================GPU #%d=================\n", i);
        CHECK_CUDA(cudaGetDeviceProperties(&prop, i), "cudaGetDeviceProperties error");
        // Total CUDA cores = cores per SM * number of SMs.
        const int cuda_cores = cc2cores(prop.major, prop.minor) * prop.multiProcessorCount;
        // clockRate is reported in kHz, so dividing by 1e6 yields GHz.
        const double sm_ghz = prop.clockRate / 1e6;
        // FP32 peak: each core retires one FMA (2 FLOPs) per clock.
        const double fp32 = cuda_cores * sm_ghz * 2;
        printf("GPU Name = %s\n", prop.name);
        printf("Compute Capability = %d.%d\n", prop.major, prop.minor);
        printf("GPU SMs = %d\n", prop.multiProcessorCount);
        printf("GPU CUDA cores = %d\n", cuda_cores);
        printf("GPU SM clock rate = %.3f GHz\n", sm_ghz);
        printf("GPU Mem clock rate = %.3f GHz\n", prop.memoryClockRate / 1e6);
        printf("FP32 Peak Performance = %.3f GFLOPS\n", fp32);
        // The remaining peaks are fixed multiples of the FP32 peak.
        // INT8 rates are integer operations, so they are reported in GOPS.
        if (has_fp16(prop.major, prop.minor))
        {
            printf("FP16 Peak Performance = %.3f GFLOPS\n", fp32 * 2);
        }
        if (has_int8(prop.major, prop.minor))
        {
            printf("INT8 Peak Performance = %.3f GOPS\n", fp32 * 4);
        }
        if (has_tensor_core_v1(prop.major, prop.minor))
        {
            printf("Tensor Core FP16 Peak Performance = %.3f GFLOPS\n", fp32 * 8);
        }
        if (has_tensor_core_v2(prop.major, prop.minor))
        {
            printf("Tensor Core FP16 Peak Performance = %.3f GFLOPS\n", fp32 * 8);
            printf("Tensor Core INT8 Peak Performance = %.3f GOPS\n", fp32 * 16);
        }
        if (has_tensor_core_v3(prop.major, prop.minor))
        {
            printf("Tensor Core TF32 Peak Performance = %.3f GFLOPS\n", fp32 * 8);
            printf("Tensor Core FP16 Peak Performance = %.3f GFLOPS\n", fp32 * 16);
            printf("Tensor Core INT8 Peak Performance = %.3f GOPS\n", fp32 * 32);
        }
    }
    return 0;
}
Compile: nvcc -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart -o calc_peak_gflops calc_peak_gflops.cpp
If the shell reports that the nvcc command is not found, install CUDA first and add the directory containing nvcc to the PATH environment variable (on Linux the default is /usr/local/cuda/bin): export PATH=/usr/local/cuda/bin:$PATH
Run: ./calc_peak_gflops
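The Tensor Core figures printed by the tool are fixed multiples of the FP32 peak, chosen per compute capability. The sketch below restates those multipliers as a lookup so they can be checked without a GPU. `TensorMode`, `tensor_multiplier`, and `tensor_peak` are hypothetical names for this illustration, and the multipliers are copied from the listing above, not taken from an NVIDIA API.

```cpp
#include <cassert>

enum class TensorMode { TF32, FP16, INT8 };

// Multiplier over the FP32 peak, mirroring the has_tensor_core_v* branches
// of the listing. Returns 0 when the mode is not covered for that
// compute capability.
int tensor_multiplier(int major, int minor, TensorMode mode) {
    const int cc = major * 10 + minor;
    if (cc == 70 || cc == 72) {  // first-generation Tensor Cores (Volta)
        return mode == TensorMode::FP16 ? 8 : 0;
    }
    if (cc == 75) {  // second generation (Turing)
        if (mode == TensorMode::FP16) return 8;
        if (mode == TensorMode::INT8) return 16;
        return 0;
    }
    if (cc == 80) {  // third generation (Ampere)
        if (mode == TensorMode::TF32) return 8;
        if (mode == TensorMode::FP16) return 16;
        return 32;  // INT8
    }
    return 0;
}

// Tensor Core peak (GFLOPS, or GOPS for INT8) derived from the FP32 peak.
double tensor_peak(double fp32_gflops, int major, int minor, TensorMode mode) {
    assert(fp32_gflops >= 0.0);
    return fp32_gflops * tensor_multiplier(major, minor, mode);
}
```

Assuming an A100 FP32 peak of 19491.84 GFLOPS (108 SMs, 64 cores per SM, 1.41 GHz), these multipliers give roughly 156 TFLOPS for TF32, 312 TFLOPS for FP16, and 624 TOPS for INT8, in line with the dense Tensor Core figures NVIDIA publishes for the A100.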
6. Afterword
Knowing a GPU's peak compute capability deepens your understanding of the hardware you have on hand, keeps you from being swayed by overhyped articles, and helps you get work done faster, better, and more economically.
References
[1] https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf
[2] GPU Performance Background User Guide, docs.nvidia.com
[3] https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf
[4] https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
[5] https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
[6] NVIDIA GPU架构的变迁史 (A History of the Evolution of NVIDIA GPU Architectures)