A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou*, Junyi Li*, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen

Abstract
Ever since the Turing Test was proposed in the 1950s, humans have explored how to master language intelligence by machine. Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a significant challenge to develop capable artificial intelligence (AI) algorithms for comprehending and grasping a language. As a major approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP) tasks. Since researchers have found that model scaling can lead to an improved model capacity, they have further investigated the scaling effect by increasing the parameter scale to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement, but also exhibit some special abilities (e.g., in-context learning) that are not present in small-scale language models (e.g., BERT). To discriminate between language models at different parameter scales, the research community has coined the term large language models (LLM) for the PLMs of significant size (e.g., containing tens or hundreds of billions of parameters). Recently, the research on LLMs has been largely advanced by both academia and industry, and a remarkable milestone is the launch of ChatGPT (a powerful AI chatbot developed based on LLMs), which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community, and it would revolutionize the way we develop and use AI algorithms. Considering this rapid technical progress, in this survey we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular, we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Furthermore, we also summarize the available resources for developing LLMs and discuss the remaining issues for future directions. This survey provides an up-to-date review of the literature on LLMs, which can be a useful resource for both researchers and engineers.

Index Terms
Large Language Models; Emergent Abilities; Adaptation Tuning; Utilization; Alignment; Capacity Evaluation

Version: v11 (major update on June 29, 2023). GitHub link: https://
K. Zhou and J. Li contribute equally to this work.
The authors are mainly with the Gaoling School of Artificial Intelligence and the School of Information, Renmin University of China, Beijing, China; Jian-Yun Nie is with DIRO, Université de Montréal, Canada. Contact e-mail:

1 INTRODUCTION

"The limits of my language mean the limits of my world."
Ludwig Wittgenstein

Language is a prominent ability of human beings to express and communicate, which develops in early childhood and evolves over a lifetime [1, 2]. Machines, however, cannot naturally grasp the abilities of understanding and communicating in the form of human language, unless equipped with powerful artificial intelligence (AI) algorithms. It has been a longstanding research challenge to achieve this goal, i.e., to enable machines to read, write, and communicate like humans [3].

Technically, language modeling (LM) is one of the major approaches to advancing the language intelligence of machines. In general, LM aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens. The research of LM has received extensive attention in the literature, and it can be divided into four major development stages:

Statistical language models (SLM). SLMs [4-7] were developed based on statistical learning methods that rose in the 1990s. The basic idea is to build the word prediction model based on the Markov assumption, e.g., predicting the next word based on the most recent context. SLMs with a fixed context length n are also called n-gram language models, e.g., bigram and trigram language models. SLMs have been widely applied to enhance task performance in information retrieval (IR) [8, 9] and natural language processing (NLP) [10-12]. However, they often suffer from the curse of dimensionality: it is difficult to accurately estimate high-order language models, since an exponential number of transition probabilities need to be estimated. Thus, specially designed smoothing strategies such as back-off estimation [13] and Good-Turing estimation [14] have been introduced to alleviate the data sparsity problem.
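To make the ideas above concrete, the following equations give the standard formulation of the generative likelihood of a word sequence and its n-gram approximation; the notation (w_t for the t-th word, c(.) for corpus counts, n for the n-gram order) is ours and is not taken from any specific cited work.

\begin{align*}
P(w_1, \dots, w_T) &= \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}) && \text{(chain-rule factorization)}\\
P(w_t \mid w_1, \dots, w_{t-1}) &\approx P(w_t \mid w_{t-n+1}, \dots, w_{t-1}) && \text{(Markov assumption of an } n\text{-gram model)}\\
\hat{P}(w_t \mid w_{t-1}) &= \frac{c(w_{t-1}, w_t)}{c(w_{t-1})} && \text{(maximum-likelihood bigram estimate)}
\end{align*}

The raw count-based estimate assigns zero probability to any word pair unseen in the training corpus, which is exactly the sparsity issue that smoothing strategies such as back-off and Good-Turing estimation are designed to alleviate.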
Neural language models (NLM). NLMs [15-17] characterize the probability of word sequences by neural networks, e.g., recurrent neural networks (RNNs). As a remarkable contribution, the work in [15] introduced the concept of distributed representation of words and built the word prediction function conditioned on the aggregated context features (i.e., the distributed word vectors). By extending the idea of learning effective features for words or sentences, a general neural network approach was developed to build a unified solution for various NLP tasks [18]. Further, word2vec [19, 20] was proposed to build a simplified shallow neural network for learning distributed word representations, which were demonstrated to be very effective across a variety of NLP tasks. These studies have initiated the use of language models for representation learning (beyond word sequence modeling), having an important impact on the field of NLP.
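To illustrate the core idea described above (predicting the next word from the distributed vectors of its context words), below is a minimal sketch in PyTorch. It is only a toy fixed-window model in the spirit of this line of work, not a reimplementation of any cited architecture; the vocabulary size, window size, and dimensions are arbitrary illustrative values.

import torch
import torch.nn as nn

class TinyNeuralLM(nn.Module):
    # A minimal fixed-window neural language model: each context word is mapped
    # to a distributed (dense) vector, the context vectors are concatenated into
    # an aggregated feature, and a linear layer scores every vocabulary word as
    # the next word.
    def __init__(self, vocab_size=10000, context_size=3, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) integer word indices
        vectors = self.embed(context_ids)           # (batch, context_size, embed_dim)
        features = vectors.flatten(start_dim=1)     # aggregated context features
        hidden = torch.tanh(self.hidden(features))
        return self.out(hidden)                     # unnormalized next-word scores

model = TinyNeuralLM()
example_contexts = torch.randint(0, 10000, (2, 3))  # two toy contexts of three words
logits = model(example_contexts)
next_word_probs = torch.softmax(logits, dim=-1)

Training such a model with a cross-entropy loss on next-word prediction yields both a language model and reusable word representations, which is the representation-learning effect noted above.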
Fig. 1: The trends of the cumulative numbers of arXiv papers that contain the keyphrases "language model" (since June 2018) and "large language model" (since October 2019), shown in panel (a) (query = "Language Model") and panel (b) (query = "Large Language Model"), respectively. The statistics are calculated using exact match by querying the keyphrases in title or abstract by months. We set different x-axis ranges for the two keyphrases, because "language models" have been explored at an earlier time. We label the points corresponding to important landmarks in the research progress of LLMs. A sharp increase occurs after the release of ChatGPT: the average number of published arXiv papers that contain "large language model" in title or abstract goes from 0.40 per day to 8.58 per day (Figure 1(b)).

Pre-trained language models (PLM). As an early attempt, ELMo [21] was proposed to capture context-aware word representations by first pre-training a bidirectional LSTM (biLSTM) network (instead of learning fixed word representations) and then fine-tuning the biLSTM network according to specific downstream tasks. Further, based on the highly parallelizable Transformer architecture [22] with self-attention mechanisms, BERT [23] was proposed by pre-training bidirectional language models with specially designed pre-training tasks on large-scale unlabeled corpora. These pre-trained context-aware word representations are very effective as general-purpose semantic features, which have largely raised the performance bar of NLP tasks. This study has inspired a large number of follow-up studies, which established the "pre-training and fine-tuning" learning paradigm. Following this paradigm, a great number of studies on PLMs have been developed, introducing either different architectures [24, 25] (e.g., GPT-2 [26] and BART [24]) or improved pre-training strategies [27-29]. In this paradigm, the PLM often needs to be fine-tuned for adaptation to different downstream tasks.
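As a concrete illustration of the "pre-training and fine-tuning" paradigm described above, the following sketch adapts a pre-trained BERT checkpoint to a toy sentiment classification task. It assumes the Hugging Face transformers library and PyTorch are available; the checkpoint name, label set, hyperparameters, and toy examples are illustrative choices of ours rather than details taken from the surveyed works.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained PLM and attach a freshly initialized task-specific head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A toy labeled downstream dataset (1 = positive, 0 = negative).
texts = ["the movie was great", "the plot made no sense"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                          # a few fine-tuning steps for illustration
    outputs = model(**batch, labels=labels)  # the model computes the classification loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

The key point of the paradigm is that the expensive pre-training stage is performed once on unlabeled corpora, while each downstream task only requires a comparatively cheap fine-tuning stage like the loop above.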
Large language models (LLM). Researchers find that scaling a PLM (e.g., scaling the model size or data size) often leads to an improved model capacity on downstream tasks (i.e., following the scaling law [30]). A number of studies have explored the performance limit by training ever larger PLMs (e.g., the 175B-parameter GPT-3 and the 540B-parameter PaLM). Although scaling is mainly conducted in model size (with similar architectures and pre-training tasks), these large-sized PLMs display different behaviors from smaller PLMs (e.g., the 330M-parameter BERT and the 1.5B-parameter GPT-2) and show surprising abilities (called emergent abilities [31]) in solving a series of complex tasks. For example, GPT-3 can solve few-shot tasks through in-context learning, whereas GPT-2 cannot do well. Thus, the research community coins the term "large language models (LLM)" for these large-sized PLMs [32-35], which attract increasing research attention (see Figure 1). Note that an LLM is not necessarily more capable than a small PLM, and emergent abilities may not occur in some LLMs. A remarkable application of LLMs is ChatGPT, which adapts the LLMs from the GPT series for dialogue and presents an amazing conversation ability with humans. We can observe a sharp increase of the arXiv papers that are related to LLMs after the release of ChatGPT in Figure 1.

In the existing literature, PLMs have been widely discussed and surveyed [36-39], while LLMs are seldom reviewed in a systematic way. To motivate our survey, we first highlight three major differences between LLMs and PLMs. First, LLMs display some surprising emergent abilities that may not be observed in previous smaller PLMs. These abilities are key to the performance of language models on complex tasks, making AI algorithms unprecedentedly powerful and effective. Second, LLMs would revolutionize the way that humans develop and use AI algorithms. Unlike small PLMs, the major approach to accessing LLMs is through the prompting interface (e.g., the GPT-4 API): humans have to understand how LLMs work and format their tasks in a way that LLMs can follow (see the sketch after this paragraph). Third, the development of LLMs no longer draws a clear distinction between research and engineering. The training of LLMs requires extensive practical experience in large-scale data processing and distributed parallel training. To develop capable LLMs, researchers have to solve complicated engineering issues, working with engineers or being engineers.
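To show what the prompting interface mentioned above looks like in practice, below is a minimal sketch of few-shot in-context learning: task demonstrations are written directly into the input text and the LLM is asked to continue the pattern, with no parameter update. The task, the demonstrations, and the call_llm function are hypothetical placeholders of ours; no specific provider API is assumed.

def build_few_shot_prompt(demonstrations, query):
    # Each demonstration is an (input, output) pair shown to the model as context.
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in demonstrations:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # the LLM is expected to continue from here
    return "\n".join(lines)

demos = [("A touching and beautifully acted film.", "positive"),
         ("Two hours I will never get back.", "negative")]
prompt = build_few_shot_prompt(demos, "The pacing was slow but the ending was worth it.")
print(prompt)
# answer = call_llm(prompt)  # hypothetical call to whichever LLM API is being used

In contrast to the fine-tuning loop shown earlier, nothing here modifies the model parameters; the demonstrations placed in the prompt are the only task-specific signal, which is what the term in-context learning refers to.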
Nowadays, LLMs are exerting a significant impact on the AI community, and the advent of ChatGPT and GPT-4 leads to the rethinking of the possibilities of artificial general intelligence (AGI). OpenAI has published a technical article entitled "Planning for AGI and beyond", which discusses the short-term and long-term plans to approach AGI [40], and a more recent paper has argued that GPT-4 might be considered as an early version of an AGI system [41]. The research areas of AI are being revolutionized by the rapid progress of LLMs. In the field of NLP, LLMs can serve as a general-purpose language task solver (to some extent), and the research paradigm has been shifting towards the use of LLMs. In the field of IR, traditional search engines are challenged by the new way of information seeking through AI chatbots (i.e., ChatGPT), and New Bing presents an initial attempt that enhances the search results based on LLMs. In the field of CV, researchers try to develop ChatGPT-like vision-language models that can better serve multimodal dialogues [42-45], and GPT-4 [46] has supported multimodal input by integrating the visual information. This new wave of technology would potentially lead to a prosperous ecosystem of real-world applications based on LLMs. For instance, Microsoft 365 is being empowered by LLMs (i.e., Copilot) to automate office work, and OpenAI supports the use of plugins in ChatGPT for implementing special functions.

Despite the progress and impact, the underlying principles of LLMs are still not well explored. Firstly, it is mysterious why emergent abilities occur in LLMs instead of smaller PLMs. As a more general issue, there lacks a deep, detailed investigation of the key factors that contribute to the superior abilities of LLMs. It is important to study when and how LLMs obtain such abilities [47]. Although there are some meaningful discussions about this problem [31, 47], more principled investigations are needed to uncover the "secrets" of LLMs. Secondly, it is difficult for the research community to train capable LLMs. Due to the huge demand for computation resources, it is very costly to carry out repetitive, ablating studies for investigating the effect of various strategies for training LLMs. Indeed, LLMs are mainly trained by industry, where many important training details (e.g., data collection and cleaning) are not revealed to the public. Thirdly, it is challenging to align LLMs with human values or preferences. Despite their capacities, LLMs are also likely to produce toxic, fictitious, or harmful contents. It requires effective and efficient control approaches to eliminate the potential risks of using LLMs [46].

Faced with both opportunities and challenges, the research and development of LLMs deserves more attention. In order to provide a basic understanding of LLMs, this survey conducts a literature review of the recent advances in LLMs from four major aspects, including pre-training (how to pre-train a capable LLM), adaptation (how to effectively adapt pre-trained LLMs for better use), utilization (how to use LLMs for solving various downstream tasks) and capability evaluation (how to evaluate the abilities of LLMs and existing empirical findings). We thoroughly comb the literature and summarize the key findings, techniques, and methods of LLMs. For this survey, we also create a GitHub project website by collecting the supporting resources for LLMs, at the link https://
We are also aware of several related review articles on PLMs or LLMs [32, 36, 38, 39, 43, 48-54]. These papers either discuss PLMs or some specific (or general) aspects of LLMs. Compared with them, we focus on the techniques and methods to develop and use LLMs and provide a relatively comprehensive reference to important aspects of LLMs.

The remainder of this survey is organized as follows: Section 2 introduces the background for LLMs and the evolution of GPT-series models, followed by a summarization of the available resources for developing LLMs in Section 3. Sections 4, 5, 6, and 7 review and summarize the recent progress from the four aspects of pre-training, adaptation, utilization, and capacity evaluation, respectively. Then, Section 8 discusses the practical guide for prompt design, and Section 9 reviews the applications of LLMs in several representative domains. Finally, we conclude the survey in Section 10 by summarizing the major findings and discussing the remaining issues for future work.

2 OVERVIEW

In this section, we present an overview of the background of LLMs and then summarize the technical evolution of the GPT-series models.

2.1 Background for LLMs

Typically, large language models (LLMs) refer to Transformer language models that contain hundreds of billions (or more) of parameters, which are trained on massive text data [32], such as GPT-3 [55], PaLM [56], Galactica [35], and LLaMA [57]. LLMs exhibit strong capacities to understand natural l