GPT Model Inference Acceleration Practice
GPT Model Inference Acceleration Solutions

Agenda
- LLM inference challenges
- Overall LLM inference solutions
- GPT model basics
- GPT model inference acceleration in practice

LLM Inference Challenges
GPT-3 175B needs 5x A800-80G GPUs for inference.
- How to reduce the memory requirement?
- How to accelerate computing?
- How to optimize communication?
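A rough back-of-the-envelope estimate (my own arithmetic, not from the slides) shows why a single 80 GB GPU is not enough: in FP16 the weights alone already take hundreds of GiB, before counting activations and the K/V cache.

```python
# Rough FP16 weight-memory estimate for GPT-3 175B (illustrative only;
# ignores activations, K/V cache and framework overhead).
params = 175e9          # total parameters
bytes_per_param = 2     # FP16 = 2 bytes per parameter
weight_gib = params * bytes_per_param / 1024**3
print(f"weights alone: {weight_gib:.0f} GiB")           # ~326 GiB
print(f"A800-80G GPUs needed: {weight_gib / 80:.1f}")   # ~4.1 for weights alone, hence 5 in practice
```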
Overall LLM Inference Solutions

Model compression for inference
- Quantization
- Pruning
- Distillation
- Smaller models mean a smaller memory footprint

Compute acceleration
- Reduced-precision computing
- Reduced complexity: fewer floating-point operations (FLOPs)
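To make the quantization idea concrete, here is a minimal per-tensor symmetric INT8 sketch in NumPy. The function names and rounding scheme are illustrative only, not FasterTransformer's actual implementation.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Per-tensor symmetric INT8 quantization: w ~= scale * w_int8."""
    scale = np.abs(w).max() / 127.0
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_int8, scale

def dequantize(w_int8: np.ndarray, scale: float) -> np.ndarray:
    return w_int8.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)        # toy weight tensor
w_int8, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(w_int8, scale)).max())
```

Storing INT8 weights cuts the memory footprint to a quarter of FP32 (half of FP16), which is exactly the "smaller models, smaller memory footprint" point above.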
MGMN inference
When the LLM is too large to deploy on a single GPU, and we cannot get acceptable model accuracy after model compression, the other option is multi-GPU inference (MGMN):
- Tensor parallelism
- Pipeline parallelism
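A minimal NumPy sketch of the tensor-parallel idea: split a weight matrix column-wise across "GPUs", let each compute a partial result, then gather. A real implementation shards across devices and communicates with NCCL; this only simulates the math on one host.

```python
import numpy as np

hidden, num_gpus = 8, 2
x = np.random.randn(1, hidden)          # activation, replicated on every GPU
w = np.random.randn(hidden, hidden)     # full weight matrix

# Column parallelism: each "GPU" holds a slice of W's columns.
shards = np.split(w, num_gpus, axis=1)
partials = [x @ shard for shard in shards]      # local matmul on each GPU
y_tp = np.concatenate(partials, axis=1)         # all-gather of the partial outputs

assert np.allclose(y_tp, x @ w)                 # matches the single-GPU matmul
```

Pipeline parallelism instead places whole groups of decoder layers on different GPUs and passes activations between them stage by stage.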
GPT Model Basics
GPT = Generative Pre-trained Transformer. The original Transformer has an encoder and a decoder; GPT keeps only the decoder stack, and ChatGPT (GPT-3) = Decoder x 96. The overall pipeline is: embedding layer -> decoder layer x N -> decoding.

Model configuration of GPT-3 175B
- Number of layers (l): 96
- Sequence length (S): 2048
- Hidden layer size (h): 12288
- Vocabulary size (V): 51200
- Total parameters: 175B
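As a quick sanity check (my own arithmetic, not from the slides), the listed configuration does add up to roughly 175B parameters if we count 12*h^2 weights per decoder layer (4*h^2 for the attention projections, 8*h^2 for the FFN) plus the embeddings, ignoring biases and layer-norm parameters.

```python
l, S, h, V = 96, 2048, 12288, 51200

per_layer = 4 * h * h + 8 * h * h        # attention QKV + output, FFN up + down
decoder   = l * per_layer                # ~173.9B
embedding = V * h + S * h                # token + position embeddings
total = decoder + embedding
print(f"total ~= {total / 1e9:.1f}B parameters")   # ~174.6B, i.e. "175B"
```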
Embedding layer
The embedding layer produces a text embedding and a position embedding for each input token.

Text embedding: each token of the input (e.g. "This place is ...") is a one-hot vector of length vocab_size; multiplying it by the embedding matrix W_emb (vocab_size x hidden_size) selects that token's row and yields a text embedding of length hidden_size (12288 for GPT-3).
Position embedding: a token at position id = i is assigned a vector [Sin(x_0), ..., Sin(x_{N-1})] with N = hidden_size, which is added to the text embedding (a sketch of the full embedding step follows below).
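A minimal NumPy sketch of the embedding step described above, at toy sizes: the one-hot times W_emb product is just a row lookup, and the position part is written here in the sinusoidal form shown on the slide (GPT-3 itself learns its position embeddings, so treat the sin/cos formula as illustrative).

```python
import numpy as np

vocab_size, hidden_size = 1000, 64            # toy sizes; GPT-3 uses 51200 x 12288
w_emb = np.random.randn(vocab_size, hidden_size).astype(np.float32)

def text_embedding(token_id: int) -> np.ndarray:
    # one_hot(token_id) @ w_emb is equivalent to selecting row `token_id`
    return w_emb[token_id]

def position_embedding(pos: int, n: int = hidden_size) -> np.ndarray:
    # sinusoidal encoding: sin on even dims, cos on odd dims
    i = np.arange(n)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / n)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle)).astype(np.float32)

hidden = text_embedding(token_id=42) + position_embedding(pos=0)
print(hidden.shape)   # (64,)
```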
Decoder layer
Each of the N decoder layers consists of attention, layer normalization, a feed-forward network (FFN), and another layer normalization. The attention step computes attention for the current token with every previous token (causal attention). The FFN expands the hidden state from hidden_size to 4h and projects it back to hidden_size.
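A compact NumPy sketch of one decoder block to make that structure concrete: single-head causal self-attention (each token only attends to itself and previous tokens) followed by an FFN that goes hidden_size -> 4h -> hidden_size. GPT-3 actually uses 96 heads, GELU, and learned weights; this only shows the data flow.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def decoder_block(x, wq, wk, wv, wo, w1, w2):
    """x: (seq_len, h). Single-head causal attention + 4h FFN, with residuals."""
    h = x.shape[-1]
    xn = layer_norm(x)
    q, k, v = xn @ wq, xn @ wk, xn @ wv
    scores = q @ k.T / np.sqrt(h)
    scores += np.triu(np.full(scores.shape, -np.inf), k=1)   # causal mask: no attention to future tokens
    x = x + softmax(scores) @ v @ wo                          # attention + residual
    x = x + np.maximum(layer_norm(x) @ w1, 0.0) @ w2          # FFN h -> 4h -> h (ReLU here for brevity)
    return x

h, seq = 16, 5
rng = np.random.default_rng(0)
wq, wk, wv, wo = (rng.standard_normal((h, h)) * 0.1 for _ in range(4))
w1, w2 = rng.standard_normal((h, 4 * h)) * 0.1, rng.standard_normal((4 * h, h)) * 0.1
out = decoder_block(rng.standard_normal((seq, h)), wq, wk, wv, wo, w1, w2)
print(out.shape)   # (5, 16)
```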
Decoding
After the last decoder layer, the hidden state of the current token is projected back onto the vocabulary by a weight matrix W to obtain the logits of the next token, which is then selected by one of:
- Greedy search
- Sampling
- Beam search
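A small sketch (illustrative, not FasterTransformer code) contrasting greedy search and sampling over the final logits; beam search instead keeps the top-k partial sequences rather than a single one.

```python
import numpy as np

def greedy(logits: np.ndarray) -> int:
    return int(np.argmax(logits))

def sample(logits: np.ndarray, temperature: float = 1.0) -> int:
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(np.random.choice(len(p), p=p))

logits = np.array([1.0, 3.0, 0.5, 2.0])   # fake next-token logits over a 4-word vocabulary
print(greedy(logits))        # always token 1
print(sample(logits))        # token 1 most often, but other tokens remain possible
```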
GPT Model Inference Acceleration

FasterTransformer overview
FasterTransformer is highly optimized for Transformer models:
a) Highly optimized kernels
b) Shared buffer
c) Flexible to add optimizations
d) Supported data types: FP32, FP16, BF16, INT8
e) Supports MGMN inference
There are 4 flows for FasterTransformer FP16 inference.

GPT optimization in FasterTransformer
Decoder layer
- Attention optimization: K/V cache (a minimal sketch follows below)
- Normalization optimization
- Activation memory optimization
- INT8 quantization
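To illustrate the K/V cache idea referenced above: during generation, the keys and values of earlier tokens are stored and reused, so each step only computes K and V for the newest token instead of recomputing them for the whole sequence. The names and shapes here are illustrative, not FasterTransformer's actual buffers, and the query projection is omitted for brevity.

```python
import numpy as np

h = 16
rng = np.random.default_rng(0)
wk, wv = rng.standard_normal((h, h)), rng.standard_normal((h, h))
k_cache, v_cache = [], []            # grow by one entry per generated token

def attend_with_cache(x_t):
    """One generation step: compute K/V only for the newest token, reuse the rest."""
    k_cache.append(x_t @ wk)
    v_cache.append(x_t @ wv)
    k, v = np.stack(k_cache), np.stack(v_cache)     # (t, h): all positions so far
    scores = k @ x_t / np.sqrt(h)                   # query = newest token's hidden state
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return p @ v                                    # (h,) attention output

for step in range(3):
    out = attend_with_cache(rng.standard_normal(h))
    print(step, out.shape, len(k_cache))            # cache length grows: 1, 2, 3
```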
Decoding
- Beam search
- Streaming decoding in FT
MGMN inference
- TP/PP
- NCCL allreduce optimization

Context phase and generation phase
In a GPT model we receive a context as input and then generate the reply step by step, so the workflow splits into two phases: the context phase, in which the whole input sequence passes through the decoder at once, and the generation phase, in which the reply tokens are produced one at a time.
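A schematic of this two-phase workflow, assuming a hypothetical model object with `prefill` and `decode_step` methods (these names are mine, not FasterTransformer's API): the context phase processes the whole prompt in one pass and fills the K/V cache, then the generation phase produces one token per step until an end-of-sequence token or a length limit.

```python
def generate(model, prompt_ids, eos_id, max_new_tokens=64):
    """Two-phase GPT inference: context (prefill) phase, then generation phase.

    `model.prefill` / `model.decode_step` are hypothetical methods standing in
    for the real engine; both are assumed to update an internal K/V cache.
    """
    # Context phase: the whole input sequence goes through the decoder at once.
    logits = model.prefill(prompt_ids)
    output_ids = list(prompt_ids)

    # Generation phase: produce the reply one token at a time.
    for _ in range(max_new_tokens):
        next_id = int(logits.argmax())            # greedy search for simplicity
        if next_id == eos_id:
            break
        output_ids.append(next_id)
        logits = model.decode_step(next_id)       # reuses the cached K/V
    return output_ids
```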