GPT Model Inference Acceleration Practice
GPT Model Inference Acceleration Solutions

Agenda
- LLM inference challenges
- Overall LLM inference solutions
- GPT model basics
- GPT model inference acceleration in practice

LLM Inference Challenges
GPT-3 175B needs 5x A800-80G GPUs for inference.
- How to reduce the memory requirement?
- How to accelerate computing?
- How to optimize communication?
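A rough back-of-the-envelope estimate (my own arithmetic, not from the slides) shows why a single 80 GB GPU is not enough: in FP16 the weights alone already take hundreds of GiB, before counting activations and the K/V cache.

```python
# Rough FP16 weight-memory estimate for GPT-3 175B (illustrative only;
# ignores activations, K/V cache and framework overhead).
params = 175e9          # total parameters
bytes_per_param = 2     # FP16 = 2 bytes per parameter
weight_gib = params * bytes_per_param / 1024**3
print(f"weights alone: {weight_gib:.0f} GiB")           # ~326 GiB
print(f"A800-80G GPUs needed: {weight_gib / 80:.1f}")   # ~4.1 for weights alone, hence 5 in practice
```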
Overall LLM Inference Solutions

Model compression for inference
- Quantization
- Pruning
- Distillation
- Smaller models mean a smaller memory footprint

Compute acceleration
- Reduced-precision computing
- Reduced complexity: fewer floating-point operations (FLOPs)
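To make the quantization idea concrete, here is a minimal per-tensor symmetric INT8 sketch in NumPy. The function names and rounding scheme are illustrative only, not FasterTransformer's actual implementation.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Per-tensor symmetric INT8 quantization: w ~= scale * w_int8."""
    scale = np.abs(w).max() / 127.0
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_int8, scale

def dequantize(w_int8: np.ndarray, scale: float) -> np.ndarray:
    return w_int8.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)        # toy weight tensor
w_int8, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(w_int8, scale)).max())
```

Storing INT8 weights cuts the memory footprint to a quarter of FP32 (half of FP16), which is exactly the "smaller models, smaller memory footprint" point above.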
MGMN inference
When the LLM is too large to deploy on a single GPU, and we cannot get acceptable model accuracy after model compression, the other option is multi-GPU inference (MGMN):
- Tensor parallelism
- Pipeline parallelism
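A minimal NumPy sketch of the tensor-parallel idea: split a weight matrix column-wise across "GPUs", let each compute a partial result, then gather. A real implementation shards across devices and communicates with NCCL; this only simulates the math on one host.

```python
import numpy as np

hidden, num_gpus = 8, 2
x = np.random.randn(1, hidden)          # activation, replicated on every GPU
w = np.random.randn(hidden, hidden)     # full weight matrix

# Column parallelism: each "GPU" holds a slice of W's columns.
shards = np.split(w, num_gpus, axis=1)
partials = [x @ shard for shard in shards]      # local matmul on each GPU
y_tp = np.concatenate(partials, axis=1)         # all-gather of the partial outputs

assert np.allclose(y_tp, x @ w)                 # matches the single-GPU matmul
```

Pipeline parallelism instead places whole groups of decoder layers on different GPUs and passes activations between them stage by stage.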
GPT Model Basics
GPT = Generative Pre-trained Transformer. The original Transformer has an encoder and a decoder; GPT keeps only the decoder stack, and ChatGPT (GPT-3) = Decoder x 96. The overall pipeline is: embedding layer -> decoder layer x N -> decoding.

Model configuration of GPT-3 175B
- Number of layers (l): 96
- Sequence length (S): 2048
- Hidden layer size (h): 12288
- Vocabulary size (V): 51200
- Total parameters: 175B
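As a quick sanity check (my own arithmetic, not from the slides), the listed configuration does add up to roughly 175B parameters if we count 12*h^2 weights per decoder layer (4*h^2 for the attention projections, 8*h^2 for the FFN) plus the embeddings, ignoring biases and layer-norm parameters.

```python
l, S, h, V = 96, 2048, 12288, 51200

per_layer = 4 * h * h + 8 * h * h        # attention QKV + output, FFN up + down
decoder   = l * per_layer                # ~173.9B
embedding = V * h + S * h                # token + position embeddings
total = decoder + embedding
print(f"total ~= {total / 1e9:.1f}B parameters")   # ~174.6B, i.e. "175B"
```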
Embedding layer
The embedding layer produces a text embedding and a position embedding for each input token.

Text embedding: each token of the input (e.g. "This place is ...") is a one-hot vector of length vocab_size; multiplying it by the embedding matrix W_emb (vocab_size x hidden_size) selects that token's row and yields a text embedding of length hidden_size (12288 for GPT-3).
Position embedding: a token at position id = i is assigned a vector [Sin(x_0), ..., Sin(x_{N-1})] with N = hidden_size, which is added to the text embedding (a sketch of the full embedding step follows below).
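A minimal NumPy sketch of the embedding step described above, at toy sizes: the one-hot times W_emb product is just a row lookup, and the position part is written here in the sinusoidal form shown on the slide (GPT-3 itself learns its position embeddings, so treat the sin/cos formula as illustrative).

```python
import numpy as np

vocab_size, hidden_size = 1000, 64            # toy sizes; GPT-3 uses 51200 x 12288
w_emb = np.random.randn(vocab_size, hidden_size).astype(np.float32)

def text_embedding(token_id: int) -> np.ndarray:
    # one_hot(token_id) @ w_emb is equivalent to selecting row `token_id`
    return w_emb[token_id]

def position_embedding(pos: int, n: int = hidden_size) -> np.ndarray:
    # sinusoidal encoding: sin on even dims, cos on odd dims
    i = np.arange(n)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / n)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle)).astype(np.float32)

hidden = text_embedding(token_id=42) + position_embedding(pos=0)
print(hidden.shape)   # (64,)
```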
Decoder layer
Each of the N decoder layers consists of attention, layer normalization, a feed-forward network (FFN), and another layer normalization. The attention step computes attention for the current token with every previous token (causal attention). The FFN expands the hidden state from hidden_size to 4h and projects it back to hidden_size.
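A compact NumPy sketch of one decoder block to make that structure concrete: single-head causal self-attention (each token only attends to itself and previous tokens) followed by an FFN that goes hidden_size -> 4h -> hidden_size. GPT-3 actually uses 96 heads, GELU, and learned weights; this only shows the data flow.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def decoder_block(x, wq, wk, wv, wo, w1, w2):
    """x: (seq_len, h). Single-head causal attention + 4h FFN, with residuals."""
    h = x.shape[-1]
    xn = layer_norm(x)
    q, k, v = xn @ wq, xn @ wk, xn @ wv
    scores = q @ k.T / np.sqrt(h)
    scores += np.triu(np.full(scores.shape, -np.inf), k=1)   # causal mask: no attention to future tokens
    x = x + softmax(scores) @ v @ wo                          # attention + residual
    x = x + np.maximum(layer_norm(x) @ w1, 0.0) @ w2          # FFN h -> 4h -> h (ReLU here for brevity)
    return x

h, seq = 16, 5
rng = np.random.default_rng(0)
wq, wk, wv, wo = (rng.standard_normal((h, h)) * 0.1 for _ in range(4))
w1, w2 = rng.standard_normal((h, 4 * h)) * 0.1, rng.standard_normal((4 * h, h)) * 0.1
out = decoder_block(rng.standard_normal((seq, h)), wq, wk, wv, wo, w1, w2)
print(out.shape)   # (5, 16)
```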
Decoding
After the last decoder layer, the hidden state of the current token is projected back onto the vocabulary by a weight matrix W to obtain the logits of the next token, which is then selected by one of:
- Greedy search
- Sampling
- Beam search
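A small sketch (illustrative, not FasterTransformer code) contrasting greedy search and sampling over the final logits; beam search instead keeps the top-k partial sequences rather than a single one.

```python
import numpy as np

def greedy(logits: np.ndarray) -> int:
    return int(np.argmax(logits))

def sample(logits: np.ndarray, temperature: float = 1.0) -> int:
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(np.random.choice(len(p), p=p))

logits = np.array([1.0, 3.0, 0.5, 2.0])   # fake next-token logits over a 4-word vocabulary
print(greedy(logits))        # always token 1
print(sample(logits))        # token 1 most often, but other tokens remain possible
```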
GPT Model Inference Acceleration

FasterTransformer overview
FasterTransformer is highly optimized for Transformer models:
a) Highly optimized kernels
b) Shared buffer
c) Flexible to add optimizations
d) Supported data types: FP32, FP16, BF16, INT8
e) Supports MGMN inference
There are 4 flows for FasterTransformer FP16 inference.

GPT optimization in FasterTransformer
Decoder layer
- Attention optimization: K/V cache (a minimal sketch follows below)
- Normalization optimization
- Activation memory optimization
- INT8 quantization
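To illustrate the K/V cache idea referenced above: during generation, the keys and values of earlier tokens are stored and reused, so each step only computes K and V for the newest token instead of recomputing them for the whole sequence. The names and shapes here are illustrative, not FasterTransformer's actual buffers, and the query projection is omitted for brevity.

```python
import numpy as np

h = 16
rng = np.random.default_rng(0)
wk, wv = rng.standard_normal((h, h)), rng.standard_normal((h, h))
k_cache, v_cache = [], []            # grow by one entry per generated token

def attend_with_cache(x_t):
    """One generation step: compute K/V only for the newest token, reuse the rest."""
    k_cache.append(x_t @ wk)
    v_cache.append(x_t @ wv)
    k, v = np.stack(k_cache), np.stack(v_cache)     # (t, h): all positions so far
    scores = k @ x_t / np.sqrt(h)                   # query = newest token's hidden state
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return p @ v                                    # (h,) attention output

for step in range(3):
    out = attend_with_cache(rng.standard_normal(h))
    print(step, out.shape, len(k_cache))            # cache length grows: 1, 2, 3
```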
Decoding
- Beam search
- Streaming decoding in FT
MGMN inference
- TP/PP
- NCCL allreduce optimization

Context phase and generation phase
In a GPT model we receive a context as input and then generate the reply step by step, so the workflow splits into two phases: the context phase, in which the whole input sequence passes through the decoder at once, and the generation phase, in which the reply tokens are produced one at a time.
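A schematic of this two-phase workflow, assuming a hypothetical model object with `prefill` and `decode_step` methods (these names are mine, not FasterTransformer's API): the context phase processes the whole prompt in one pass and fills the K/V cache, then the generation phase produces one token per step until an end-of-sequence token or a length limit.

```python
def generate(model, prompt_ids, eos_id, max_new_tokens=64):
    """Two-phase GPT inference: context (prefill) phase, then generation phase.

    `model.prefill` / `model.decode_step` are hypothetical methods standing in
    for the real engine; both are assumed to update an internal K/V cache.
    """
    # Context phase: the whole input sequence goes through the decoder at once.
    logits = model.prefill(prompt_ids)
    output_ids = list(prompt_ids)

    # Generation phase: produce the reply one token at a time.
    for _ in range(max_new_tokens):
        next_id = int(logits.argmax())            # greedy search for simplicity
        if next_id == eos_id:
            break
        output_ids.append(next_id)
        logits = model.decode_step(next_id)       # reuses the cached K/V
    return output_ids
```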