BloombergGPT: A Large Language Model for Finance

Shijie Wu¹*, Ozan İrsoy¹*, Steven Lu¹, Vadim Dabravolski¹, Mark Dredze¹,², Sebastian Gehrmann¹, Prabhanjan Kambadur¹, David Rosenberg¹, Gideon Mann¹

¹ Bloomberg, New York, NY USA
² Computer Science, Johns Hopkins University, Baltimore, MD USA
* Co-first authors.

arXiv:2303.17564v1 [cs.LG] 30 Mar 2023

Abstract

The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in the literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general-purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. As a next step, we plan to release training logs (Chronicles) detailing our experience in training BloombergGPT.

Contents

1 Introduction
  1.1 BloombergGPT
  1.2 Broader Contributions
2 Dataset
  2.1 Financial Datasets (363B tokens, 51.27% of training)
    2.1.1 Web (298B tokens, 42.01% of training)
    2.1.2 News (38B tokens, 5.31% of training)
    2.1.3 Filings (14B tokens, 2.04% of training)
    2.1.4 Press (9B tokens, 1.21% of training)
    2.1.5 Bloomberg (5B tokens, 0.70% of training)
  2.2 Public Datasets (345B tokens, 48.73% of training)
    2.2.1 The Pile (184B tokens, 25.9% of training)
    2.2.2 C4 (138B tokens, 19.48% of training)
    2.2.3 Wikipedia (24B tokens, 3.35% of training)
  2.3 Tokenization
3 Model
  3.1 Architecture
  3.2 Model Scaling
  3.3 Training Configuration
  3.4 Large-scale Optimization
4 Training Run
5 Evaluation
  5.1 Few-shot Methodology
  5.2 Heldout Loss
  5.3 Financial Tasks
    5.3.1 External Financial Tasks
    5.3.2 Internal Task: Sentiment Analysis
    5.3.3 Exploratory Task: NER
  5.4 BIG-bench Hard
  5.5 Knowledge Assessments
  5.6 Reading Comprehension
  5.7 Linguistic Tasks
  5.8 Summary
6 Qualitative Samples
7 Related Work
8 Ethics, Limitations, and Implications
  8.1 Ethical Use
  8.2 Openness
9 Conclusion
A Architecture
  A.0 Notation
  A.1 Full Architecture
  A.2 Self-Attention with ALiBi (SA)
  A.3 LayerNorm (LN)
  A.4 FeedForward Network (FFN)
  A.5 List of All Trainable Parameters
B Details on external financial tasks

1. Introduction

The release of GPT-3 in 2020 (Brown et al., 2020) demonstrated the powerful benefits of training very large auto-regressive language models (LLMs).
GPT-3 had 175 billion parameters, a hundredfold increase over the previous GPT-2 model, and did remarkably well across a wide range of now popular LLM tasks, including reading comprehension, open-ended question answering, and code generation. This performance has been replicated across several other models (Chowdhery et al., 2022; Scao et al., 2022; Zhang et al., 2022a). Furthermore, evidence suggests that large models exhibit emergent behaviors; growth allows them to acquire abilities not present in smaller models (Wei et al., 2022a). A notable example of emergent behavior is the ability to perform tasks via few-shot prompting, where a model can learn a task from just a few examples. This ability improves well above random as we increase the size of language models. Broadly speaking, few-shot prompting dramatically expands the range of tasks supported by models and lowers the barrier to entry for users seeking automation for new language tasks.
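To make the few-shot setup concrete, here is a minimal sketch in Python of how a prompt for a financial sentiment task might be assembled from a handful of labeled demonstrations. The headlines, labels, and prompt template below are illustrative assumptions, not taken from the paper (the paper's actual few-shot methodology is described in Section 5.1); the resulting string would be fed to any autoregressive LLM, which is then expected to continue it with the label for the final headline.

# Minimal sketch of few-shot prompting: the task is specified purely by
# demonstrations placed in the context window; no parameters are updated.
# The headlines and labels below are illustrative, not from the paper.

few_shot_examples = [
    ("Shares of the company surged after it raised full-year guidance.", "positive"),
    ("The bank reported a quarterly loss and suspended its dividend.", "negative"),
    ("The firm will hold its annual shareholder meeting in June.", "neutral"),
]

query = "The retailer cut its profit outlook amid weakening consumer demand."

def build_prompt(examples, query):
    """Concatenate labeled demonstrations followed by the unlabeled query."""
    lines = ["Classify the sentiment of each headline as positive, negative, or neutral.", ""]
    for text, label in examples:
        lines.append(f"Headline: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Headline: {query}")
    lines.append("Sentiment:")  # the model should continue with the label
    return "\n".join(lines)

print(build_prompt(few_shot_examples, query))

Because the demonstrations live entirely in the context window, supporting a new task in this way requires no retraining, which is what lowers the barrier to entry described above.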
After GPT-3, models grew in size to 280 billion (Gopher, Rae et al., 2021), 540 billion (PaLM, Chowdhery et al., 2022), and 1 trillion parameters (Megatron, Korthikanti et al., 2022). Work also explored other important aspects of achieving a high-performing LLM, such as different training objectives (Tay et al., 2022b), multilingual models (Scao et al., 2022), more efficient and smaller models (Black et al., 2022), and finding data- and parameter-efficient training sizes (Hoffmann et al., 2022).

These efforts have almost exclusively focused on general LLMs, trained on datasets that cover a broad range of topics and domains. While these have included some datasets for specialized domains (e.g., code (Chen et al., 2021a) or biomedical articles (Gao et al., 2021)), the focus has been on building LLMs with broad capabilities. Recent efforts training models using only domain-specific data have yielded models that, while much smaller, beat general-purpose LLMs on tasks within those domains, such as science (Taylor et al., 2022) and medicine (Bolton et al., 2023; Luo et al., 2022; Lehman et al., 2023). These findings motivate further development of models focused on specific domains.

Financial Technology (FinTech) is a large and growing area with NLP technologies having an increasingly important role (Xing et al., 2018; Fisher et al., 2016; Dredze et al., 2016). Financial NLP tasks (Shah et al., 2022) include sentiment analysis (Araci, 2019), named entity recognition (Salinas Alvarado et al., 2015), news classification (Sinha and Khandait, 2020), and question answering (Chen et al., 2021b, 2022). While the range of tasks is similar to those found in general NLP benchmarks, the complexity and terminology of the financial domain warrant a domain-specific system. For all of the reasons generative LLMs are attractive in general (few-shot learning, text generation, conversational systems, etc.), it would be valuable to have an LLM focused on the financial domain. While there are masked language models tuned for the financial domain (Araci, 2019), no LLM has been tuned for or evaluated on tasks for this domain.

1.1 BloombergGPT

We train BloombergGPT, a 50 billion parameter language model that supports a wide range of tasks within the financial industry. Rather than building a general-purpose LLM, or a small LLM exclusively on domain-specific data, we take a mixed approach. General models cover many domains, are able to perform at a high level across a wide variety of tasks, and obviate the need for specialization during training time. However, results from existing domain-specific models show that general models cannot replace them. At Bloomberg, we support a very large and diverse set of tasks, well served by a general model, but the vast majority of our applications are within the financial domain, better served by a specific model. For that reason, we set out to build a model that achieves best-in-class results on financial benchmarks, while also maintaining competitive performance on general-purpose LLM benchmarks.

We achieve this goal by constructing the largest domain-specific dataset yet, drawing on existing data creation, collection, and curation resources at Bloomberg. As Bloomberg is primarily a financial data company, our data analysts have collected and curated financial language documents over the span of forty years. We have extensive archives of financial data that cover a range of topics, with careful tracking of data sources and usage rights. We add this data to public datasets to create a large training corpus with over 700 billion tokens. Using a portion of this training corpus, we train a BLOOM-style, 50 billion parameter model designed based on guidelines from Hoffmann et al. (2022) and Le Scao et al. (2022).
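To give a sense of how guidelines like those of Hoffmann et al. (2022) inform such a sizing decision, the rough sketch below applies two commonly cited rules of thumb from that line of work: training compute of approximately 6 * N * D FLOPs for N parameters and D tokens, and a compute-optimal budget of roughly 20 tokens per parameter. These constants and the calculation are illustrative assumptions, not the paper's own scaling analysis, which is covered in Section 3.2 (Model Scaling).

# Back-of-the-envelope Chinchilla-style sizing, assuming the commonly cited
# approximations associated with Hoffmann et al. (2022): training compute
# C ~ 6 * N * D FLOPs, and a compute-optimal budget of roughly 20 tokens per
# parameter. These are rules of thumb for illustration, not the paper's analysis.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6.0 * n_params * n_tokens

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Rule-of-thumb compute-optimal token budget for an n_params-sized model."""
    return 20.0 * n_params

n_params = 50e9         # 50 billion parameters
corpus_tokens = 700e9   # training corpus of "over 700 billion tokens"

print(f"tokens per parameter available in the corpus: {corpus_tokens / n_params:.1f}")
print(f"rule-of-thumb optimal token budget for 50B params: {chinchilla_optimal_tokens(n_params):.2e}")
print(f"approx. FLOPs for one pass over the corpus: {training_flops(n_params, corpus_tokens):.2e}")

Under these rough approximations, a 50 billion parameter model paired with a corpus of roughly 700 billion tokens lands near the compute-optimal ratio of tokens to parameters; the parameter count and token budget actually used for BloombergGPT are set out in Section 3.2 (Model Scaling).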