Inference Acceleration Solutions for GPT Models

Agenda
- LLM inference challenges
- Overall LLM inference solutions
- GPT model basics
- GPT model inference acceleration in practice

LLM Inference Challenges
- GPT3-175B needs 5 * A800-80G GPUs for inference.
- How to reduce the memory requirement?
- How to accelerate the computation?
- How to optimize the communication?

Overall LLM Inference Solutions
Model compression for inference:
- Quantization, pruning, distillation.
- Smaller models mean a smaller memory footprint.
Compute acceleration:
- Reduced-precision computing.
- Reduced complexity, i.e. fewer floating-point operations (FLOPs).
Multi-GPU multi-node (MGMN) inference:
- When the model is too large to deploy on a single GPU and model compression cannot reach acceptable accuracy, the remaining option is multi-GPU inference (MGMN):
- Tensor parallelism
- Pipeline parallelism

GPT Model Basics
- GPT = Generative Pre-trained Transformer.
- The Transformer consists of an encoder and a decoder; GPT uses only the decoder stack. ChatGPT (GPT-3) = Decoder x 96.
- A GPT model is made up of an embedding layer, N decoder layers, and a decoding step.

Model configuration of GPT-3 175B:
- Number of layers (l): 96
- Sequence length (S): 2048
- Hidden size (h): 12288
- Vocabulary size (V): 51200
- Total parameters: 175B

Embedding layer:
- Text embedding: each token (e.g. "This place is ...") is a one-hot vector of size vocab_size; multiplying it by the embedding matrix W_emb (vocab_size x hidden_size) gives a text embedding of length hidden_size (hidden_size = 12288 for GPT-3).
- Position embedding: for a token at position id i, a position embedding of length hidden_size (shown in the slides as sinusoidal components sin(x_0) ... sin(x_{N-1}) with N = hidden_size) is added to the text embedding.

Decoder layer (x N):
- Attention -> Layer Normalization -> FFN -> Layer Normalization.
- Attention is computed between the current token and every previous token.
- The FFN expands from hidden_size to 4h and projects back to hidden_size.

Decoding:
- The final hidden state is projected from hidden_size back to the vocabulary with a weight matrix W, and the next token is selected by greedy search, sampling, or beam search.

GPT Model Inference Acceleration

FasterTransformer overview
- Highly optimized for Transformer models:
  a) Highly optimized kernels
  b) Shared buffers
  c) Flexible to add optimizations
  d) Supported data types: FP32, FP16, BF16, INT8
  e) Supports MGMN inference
- 4 flows for FasterTransformer FP16 inference (figure).

GPT optimizations in FT (overview)
- Decoder layer: attention optimization, K/V cache, normalization optimization, activation memory optimization, INT8 quantization.
- Decoding: beam search, streaming decoding in FT.
- MGMN inference: TP/PP, NCCL allreduce optimization.

Context phase and generation phase
- A GPT model receives a context as input and then generates the reply step by step. The workflow is split into two phases: the context phase and the generation phase.
- The decoder consumes the whole input sequence in the context phase, then emits one output token per step in the generation phase (N-1 further steps for an output length of N).

Decoder attention, context phase
- Like an encoder, it has to handle multiple tokens at once.
- Using a CUDA kernel to compute the batched GEMM is inefficient when the sequence-length dimension is large.
- Use unfused multi-head attention so the GEMMs can leverage tensor cores.
- Save the resulting Key and Value into the cache to avoid recomputation.

Decoder attention, generation phase
- Generate tokens step by step.
- Use the "fuseQKV masked attention" kernel.
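To make the two phases concrete, below is a minimal single-head NumPy sketch of a context phase that fills a K/V cache and a generation phase that appends to it one token per step. The function and variable names (context_phase, generation_step, cache) are invented for this illustration and are not FasterTransformer APIs; batching, multiple heads, and the fused kernels are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, causal=False):
    # q: [tq, d], k/v: [tk, d]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        tq, tk = scores.shape
        mask = np.triu(np.ones((tq, tk), dtype=bool), k=tk - tq + 1)
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ v

def context_phase(prompt, wq, wk, wv, cache):
    # Handle all prompt tokens at once (encoder-like batch GEMMs) and
    # save their Key/Value into the cache so they are never recomputed.
    q, k, v = prompt @ wq, prompt @ wk, prompt @ wv
    cache["k"], cache["v"] = k, v
    return attention(q, k, v, causal=True)

def generation_step(token, wq, wk, wv, cache):
    # Per-step decoding: compute Q/K/V only for the current token,
    # append K/V to the cache, and attend over the whole cache.
    q, k, v = token @ wq, token @ wk, token @ wv
    cache["k"] = np.vstack([cache["k"], k])
    cache["v"] = np.vstack([cache["v"], v])
    return attention(q, cache["k"], cache["v"])

# Usage: a 4-token prompt followed by two generated tokens.
d = 8
rng = np.random.default_rng(0)
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
cache = {}
context_phase(rng.standard_normal((4, d)), wq, wk, wv, cache)
for _ in range(2):
    generation_step(rng.standard_normal((1, d)), wq, wk, wv, cache)
print(cache["k"].shape)  # (6, 8): 4 prompt tokens + 2 generated tokens
```

FasterTransformer's version of this idea additionally pre-allocates the cache buffer and writes K/V in place, which is the optimization described next.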
Decoder K/V cache
- In the decoder, multi-head attention computes the relationship between the current token and all tokens generated in previous steps.
- Original: the Key/Value of previous tokens are recomputed and concatenated at every step.
- Optimization: use the K/V cache to avoid both the recomputation and the concatenation:
  - Prepare a large K/V cache buffer up front.
  - Compute K and V only for the current token.
  - Put the current token's K/V into the cache in place.

Decoder normalization optimization
- Original: the layer-normalization kernel runs two blockReduce passes (mean, then variance), each followed by a sync.
- Optimization: rewrite the variance from the math as E[x^2] - E[x]^2 so both statistics can be reduced in a single pass, and use warpReduce instead of blockReduce, cutting the number of syncs.

Decoder activation buffer optimization
- Original: allocate a buffer for every decoder layer's activations.
- Optimization: in FT, only one layer's activation buffer is allocated and reused by all layers.

Decoder quantization
- Quantization is used for model-size reduction and inference acceleration.
- There are two common ways to quantize a model:
  - Post-training quantization (PTQ): less cost, lower accuracy.
  - Quantization-aware training (QAT): higher cost, higher accuracy.

Weight-only INT8 in FT
- Weights are stored in INT8 while activations stay in FP16.
- In the GEMM, the INT8 weights are loaded, cast to FP16, and computed on FP16 tensor cores.

W8A8 for GPT in FT
- Both weights and activations are quantized, and the GEMM runs on INT8 tensor cores.

Decoding, beam search (figure-only slides).

Decoding, streaming decoding in FT
- Original: when the batch size is large and output sequence lengths vary a lot, shorter outputs have to wait for the longest one before they are returned.
- Optimization: FT supports streaming decoding, returning each output as soon as it is finished, for a better user experience.

MGMN TP/PP
- Tensor parallelism and pipeline parallelism are both supported.
- Recommendation: use TP intra-node and PP inter-node, trading off communication volume against bandwidth.

MGMN allreduce optimization
- Original: NCCL allreduce across GPU0-GPU7.
- Optimization: use an optimized CUDA kernel for the allreduce.
- The NCCL allreduce usually takes up about 20% of the end-to-end pipeline.
- When the batch size is small, the intra-node bandwidth cannot be fully used and the NCCL allreduce becomes latency-bound.
- In that case the gain on the allreduce communication is about 50%, and the end-to-end gain is about 10%.

Performance
GPT-3 175B inference on 8 x Ampere-80G GPUs with NVLink:

| bs | input-seq-len | output-seq-len | Megatron latency (ms) | FT latency (ms) | Speedup |
|----|---------------|----------------|-----------------------|-----------------|---------|
| 1  | 128 | 8  | 660.38   | 488.86  | 1.35 |
| 2  | 128 | 8  | 687.34   | 509.47  | 1.35 |
| 4  | 128 | 8  | 1004.88  | 629.64  | 1.60 |
| 8  | 128 | 8  | 1705.07  | 749.86  | 2.27 |
| 12 | 128 | 8  | 2365.02  | 886.24  | 2.67 |
| 16 | 128 | 8  | 3111.57  | 1037.47 | 3.00 |
| 20 | 128 | 8  | 3723.73  | 1135.72 | 3.28 |
| 32 | 128 | 8  | 5778.72  | 1547.44 | 3.73 |
| 1  | 512 | 32 | 2384.78  | 1719.96 | 1.39 |
| 2  | 512 | 32 | 2503.24  | 1830.56 | 1.37 |
| 4  | 512 | 32 | 3658.65  | 2092.56 | 1.75 |
| 8  | 512 | 32 | 6238.79  | 2629.97 | 2.37 |
| 16 | 512 | 32 | 11409.53 | 3706.23 | 3.08 |
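To close the loop with the "5 * A800-80G" requirement quoted at the start of the deck, here is a quick sanity check that re-derives the 175B parameter count and the FP16 weight footprint from the GPT-3 configuration listed above. The ~12*h^2-per-layer approximation (QKV plus output projection and the 4h FFN, ignoring biases and LayerNorm) is an assumption of this sketch, not a formula from the slides.

```python
import math

# Back-of-the-envelope check of the GPT-3 175B numbers used in this deck.
layers, hidden, vocab, seq_len = 96, 12288, 51200, 2048

per_layer = 12 * hidden * hidden            # attention + FFN weight matrices (approximation)
embeddings = (vocab + seq_len) * hidden     # token + position embeddings
params = layers * per_layer + embeddings

fp16_bytes = params * 2                     # 2 bytes per weight in FP16
print(f"parameters ≈ {params / 1e9:.1f} B")            # ≈ 174.6 B, i.e. "175B"
print(f"FP16 weights ≈ {fp16_bytes / 1e9:.0f} GB")      # ≈ 349 GB

# Weights alone already need ceil(349 / 80) = 5 GPUs with 80 GB each, matching
# the "5 * A800-80G" figure; the benchmark above runs on 8 GPUs with tensor
# parallelism, leaving headroom for the K/V cache and activations.
print(math.ceil(fp16_bytes / 80e9), "x 80 GB GPUs just for the weights")
```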