Inference Acceleration Solutions for GPT Models

Agenda
- LLM inference challenges
- Overall LLM inference solutions
- GPT model basics
- GPT model inference acceleration in practice

LLM Inference Challenges
- GPT3-175B needs 5 * A800-80G GPUs for inference.
- How to reduce the memory requirement?
- How to accelerate the computation?
- How to optimize the communication?

Overall LLM Inference Solutions
Model compression for inference:
- Quantization, pruning, distillation.
- Smaller models mean a smaller memory footprint.
Compute acceleration:
- Reduced-precision computing.
- Reduced complexity, i.e. fewer floating-point operations (FLOPs).
Multi-GPU multi-node (MGMN) inference:
- When the model is too large to deploy on a single GPU and model compression cannot reach acceptable accuracy, the remaining option is multi-GPU inference (MGMN):
- Tensor parallelism
- Pipeline parallelism

GPT Model Basics
- GPT = Generative Pre-trained Transformer.
- The Transformer consists of an encoder and a decoder; GPT uses only the decoder stack. ChatGPT (GPT-3) = Decoder x 96.
- A GPT model is made up of an embedding layer, N decoder layers, and a decoding step.

Model configuration of GPT-3 175B:
- Number of layers (l): 96
- Sequence length (S): 2048
- Hidden size (h): 12288
- Vocabulary size (V): 51200
- Total parameters: 175B

Embedding layer:
- Text embedding: each token (e.g. "This place is ...") is a one-hot vector of size vocab_size; multiplying it by the embedding matrix W_emb (vocab_size x hidden_size) gives a text embedding of length hidden_size (hidden_size = 12288 for GPT-3).
- Position embedding: for a token at position id i, a position embedding of length hidden_size (shown in the slides as sinusoidal components sin(x_0) ... sin(x_{N-1}) with N = hidden_size) is added to the text embedding.

Decoder layer (x N):
- Attention -> Layer Normalization -> FFN -> Layer Normalization.
- Attention is computed between the current token and every previous token.
- The FFN expands from hidden_size to 4h and projects back to hidden_size.

Decoding:
- The final hidden state is projected from hidden_size back to the vocabulary with a weight matrix W, and the next token is selected by greedy search, sampling, or beam search.

GPT Model Inference Acceleration

FasterTransformer overview
- Highly optimized for Transformer models:
  a) Highly optimized kernels
  b) Shared buffers
  c) Flexible to add optimizations
  d) Supported data types: FP32, FP16, BF16, INT8
  e) Supports MGMN inference
- 4 flows for FasterTransformer FP16 inference (figure).

GPT optimizations in FT (overview)
- Decoder layer: attention optimization, K/V cache, normalization optimization, activation memory optimization, INT8 quantization.
- Decoding: beam search, streaming decoding in FT.
- MGMN inference: TP/PP, NCCL allreduce optimization.

Context phase and generation phase
- A GPT model receives a context as input and then generates the reply step by step. The workflow is split into two phases: the context phase and the generation phase.
- The decoder consumes the whole input sequence in the context phase, then emits one output token per step in the generation phase (N-1 further steps for an output length of N).

Decoder attention, context phase
- Like an encoder, it has to handle multiple tokens at once.
- Using a CUDA kernel to compute the batched GEMM is inefficient when the sequence-length dimension is large.
- Use unfused multi-head attention so the GEMMs can leverage tensor cores.
- Save the resulting Key and Value into the cache to avoid recomputation.

Decoder attention, generation phase
- Generate tokens step by step.
- Use the "fuseQKV masked attention" kernel.
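To make the two phases concrete, below is a minimal single-head NumPy sketch of a context phase that fills a K/V cache and a generation phase that appends to it one token per step. The function and variable names (context_phase, generation_step, cache) are invented for this illustration and are not FasterTransformer APIs; batching, multiple heads, and the fused kernels are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, causal=False):
    # q: [tq, d], k/v: [tk, d]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        tq, tk = scores.shape
        mask = np.triu(np.ones((tq, tk), dtype=bool), k=tk - tq + 1)
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ v

def context_phase(prompt, wq, wk, wv, cache):
    # Handle all prompt tokens at once (encoder-like batch GEMMs) and
    # save their Key/Value into the cache so they are never recomputed.
    q, k, v = prompt @ wq, prompt @ wk, prompt @ wv
    cache["k"], cache["v"] = k, v
    return attention(q, k, v, causal=True)

def generation_step(token, wq, wk, wv, cache):
    # Per-step decoding: compute Q/K/V only for the current token,
    # append K/V to the cache, and attend over the whole cache.
    q, k, v = token @ wq, token @ wk, token @ wv
    cache["k"] = np.vstack([cache["k"], k])
    cache["v"] = np.vstack([cache["v"], v])
    return attention(q, cache["k"], cache["v"])

# Usage: a 4-token prompt followed by two generated tokens.
d = 8
rng = np.random.default_rng(0)
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
cache = {}
context_phase(rng.standard_normal((4, d)), wq, wk, wv, cache)
for _ in range(2):
    generation_step(rng.standard_normal((1, d)), wq, wk, wv, cache)
print(cache["k"].shape)  # (6, 8): 4 prompt tokens + 2 generated tokens
```

FasterTransformer's version of this idea additionally pre-allocates the cache buffer and writes K/V in place, which is the optimization described next.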
Decoder K/V cache
- In the decoder, multi-head attention computes the relationship between the current token and all tokens generated in previous steps.
- Original: the Key/Value of previous tokens are recomputed and concatenated at every step.
- Optimization: use the K/V cache to avoid both the recomputation and the concatenation:
  - Prepare a large K/V cache buffer up front.
  - Compute K and V only for the current token.
  - Put the current token's K/V into the cache in place.

Decoder normalization optimization
- Original: the layer-normalization kernel runs two blockReduce passes (mean, then variance), each followed by a sync.
- Optimization: rewrite the variance from the math as E[x^2] - E[x]^2 so both statistics can be reduced in a single pass, and use warpReduce instead of blockReduce, cutting the number of syncs.

Decoder activation buffer optimization
- Original: allocate a buffer for every decoder layer's activations.
- Optimization: in FT, only one layer's activation buffer is allocated and reused by all layers.

Decoder quantization
- Quantization is used for model-size reduction and inference acceleration.
- There are two common ways to quantize a model:
  - Post-training quantization (PTQ): less cost, lower accuracy.
  - Quantization-aware training (QAT): higher cost, higher accuracy.

Weight-only INT8 in FT
- Weights are stored in INT8 while activations stay in FP16.
- In the GEMM, the INT8 weights are loaded, cast to FP16, and computed on FP16 tensor cores.

W8A8 for GPT in FT
- Both weights and activations are quantized, and the GEMM runs on INT8 tensor cores.

Decoding, beam search (figure-only slides).

Decoding, streaming decoding in FT
- Original: when the batch size is large and output sequence lengths vary a lot, shorter outputs have to wait for the longest one before they are returned.
- Optimization: FT supports streaming decoding, returning each output as soon as it is finished, for a better user experience.

MGMN TP/PP
- Tensor parallelism and pipeline parallelism are both supported.
- Recommendation: use TP intra-node and PP inter-node, trading off communication volume against bandwidth.

MGMN allreduce optimization
- Original: NCCL allreduce across GPU0-GPU7.
- Optimization: use an optimized CUDA kernel for the allreduce.
- The NCCL allreduce usually takes up about 20% of the end-to-end pipeline.
- When the batch size is small, the intra-node bandwidth cannot be fully used and the NCCL allreduce becomes latency-bound.
- In that case the gain on the allreduce communication is about 50%, and the end-to-end gain is about 10%.

Performance
GPT-3 175B inference on 8 x Ampere-80G GPUs with NVLink:

| bs | input-seq-len | output-seq-len | Megatron latency (ms) | FT latency (ms) | Speedup |
|----|---------------|----------------|-----------------------|-----------------|---------|
| 1  | 128 | 8  | 660.38   | 488.86  | 1.35 |
| 2  | 128 | 8  | 687.34   | 509.47  | 1.35 |
| 4  | 128 | 8  | 1004.88  | 629.64  | 1.60 |
| 8  | 128 | 8  | 1705.07  | 749.86  | 2.27 |
| 12 | 128 | 8  | 2365.02  | 886.24  | 2.67 |
| 16 | 128 | 8  | 3111.57  | 1037.47 | 3.00 |
| 20 | 128 | 8  | 3723.73  | 1135.72 | 3.28 |
| 32 | 128 | 8  | 5778.72  | 1547.44 | 3.73 |
| 1  | 512 | 32 | 2384.78  | 1719.96 | 1.39 |
| 2  | 512 | 32 | 2503.24  | 1830.56 | 1.37 |
| 4  | 512 | 32 | 3658.65  | 2092.56 | 1.75 |
| 8  | 512 | 32 | 6238.79  | 2629.97 | 2.37 |
| 16 | 512 | 32 | 11409.53 | 3706.23 | 3.08 |
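To close the loop with the "5 * A800-80G" requirement quoted at the start of the deck, here is a quick sanity check that re-derives the 175B parameter count and the FP16 weight footprint from the GPT-3 configuration listed above. The ~12*h^2-per-layer approximation (QKV plus output projection and the 4h FFN, ignoring biases and LayerNorm) is an assumption of this sketch, not a formula from the slides.

```python
import math

# Back-of-the-envelope check of the GPT-3 175B numbers used in this deck.
layers, hidden, vocab, seq_len = 96, 12288, 51200, 2048

per_layer = 12 * hidden * hidden            # attention + FFN weight matrices (approximation)
embeddings = (vocab + seq_len) * hidden     # token + position embeddings
params = layers * per_layer + embeddings

fp16_bytes = params * 2                     # 2 bytes per weight in FP16
print(f"parameters ≈ {params / 1e9:.1f} B")            # ≈ 174.6 B, i.e. "175B"
print(f"FP16 weights ≈ {fp16_bytes / 1e9:.0f} GB")      # ≈ 349 GB

# Weights alone already need ceil(349 / 80) = 5 GPUs with 80 GB each, matching
# the "5 * A800-80G" figure; the benchmark above runs on 8 GPUs with tensor
# parallelism, leaving headroom for the K/V cache and activations.
print(math.ceil(fp16_bytes / 80e9), "x 80 GB GPUs just for the weights")
```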