书签分享收藏举报版权申诉 / 15

立即下载

当前位置：首页 > 技术资料 > 技术总结 > 2022年语音识别文献翻译 .pdf

2022年语音识别文献翻译 .pdf

上传人：Q****o

文档编号：28028408

上传时间：2022-07-26

格式：PDF

页数：15

大小：213.88KB

( 4.5 )

《2022年语音识别文献翻译 .pdf》由会员分享，可在线阅读，更多相关《2022年语音识别文献翻译 .pdf（15页珍藏版）》请在淘文阁 - 分享文档赚钱的网站上搜索。

1、青岛大学毕业论文 (设计)科技文献翻译院系：自动化工程学院电子工程系专业：通信工程班级：2006 级 1 班姓名：李洪超指导教师：庄晓东2010 年 5 月 26 日名师资料总结 - - -精品资料欢迎下载 - - - - - - - - - - - - - - - - - - 名师精心整理 - - - - - - - 第 1 页，共 15 页 - - - - - - - - - 1 Speech Recognition Victor Zue, Ron Cole, & Wayne Ward MIT Laboratory for Computer Science, Cambridge, Mass

2、achusetts, USA Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA Carnegie Mellon University, Pittsburgh, Pennsylvania, USA 1 Defining the Problem Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The

3、 recognized words can be the final results, as for applications such as commands & control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, a subject covered in section. Speech recognition systems can b

4、e characterized by many parameters, some of the more important of which are shown in Figure. An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech c

5、ontains disfluencies, and is much more difficult to recognize than speech read from script. Some systems require speaker enrollment-a user must provide samples of his or her speech before using them, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some o

6、f the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words. The simpl

7、est language model can be specified as a finite-state network, where the permissible words following each word are given explicitly. More general language models approximating natural language are specified in terms of a context-sensitive grammar. One popular measure of the difficulty of the task, c

8、ombining the vocabulary size and the 名师资料总结 - - -精品资料欢迎下载 - - - - - - - - - - - - - - - - - - 名师精心整理 - - - - - - - 第 2 页，共 15 页 - - - - - - - - - 2 language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been app

9、lied (see section for a discussion of language modeling in general and perplexity in particular). Finally, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and the placement of the micropho

10、ne. Parameters Range Speaking Mode Isolated words to continuous speech Speaking Style Read speech to spontaneous speech Enrollment Speaker-dependent to Speaker-independent Vocabulary Small(20,000 words) Language Model Finite-state to context-sensitive Perplexity Small(100) SNR High (30 dB) to law (1

11、0dB) Transducer Voice-cancelling microphone to telephone Table: Typical parameters used to characterize the capability of speech recognition systems Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizati

12、ons of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences of the phoneme ，At word boundaries, contextual variations can be quite dramatic-making gas shortag

13、e sound like gash shortage in American English, and devo andare sound like devandare in Italian. Second, acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variabilities can result from changes in

14、the speakers physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities. Figure shows the major components of a typical speech recognition system. The digiti

15、zed speech signal is first transformed into a set of useful measurements or features at a fixed rate, 名师资料总结 - - -精品资料欢迎下载 - - - - - - - - - - - - - - - - - - 名师精心整理 - - - - - - - 第 3 页，共 15 页 - - - - - - - - - 3 typically once every 10-20 msec (see sectionsand 11.3 for signal representation and dig

16、ital signal processing, respectively). These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters. Figu

17、re: Components of a typical speech recognition system. Speech recognition systems attempt to model the sources of variability described above in several ways. At the level of signal representation, researchers have developed representations that emphasize perceptually important speaker-independent f

18、eatures of the signal, and de-emphasize speaker-dependent characteristics. At the acoustic phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of data. Speaker adaptation algorithms have also been developed that adapt speaker-independent aco

19、ustic models to those of the current speaker during system use, (see section). Effects of linguistic context at the acoustic phonetic level are typically handled by training separate models for phonemes in different contexts; this is called context dependent acoustic modeling. Word level variability

20、 can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Common alternate pronunciations of words, as well as effects of dialect and accent are handled by allowing search algorithms to find alternate paths of phonemes through these networks. S

21、tatistical language models, based on estimates of the frequency of occurrence of word sequences, are often used to guide the search through the most probable sequence of words. 名师资料总结 - - -精品资料欢迎下载 - - - - - - - - - - - - - - - - - - 名师精心整理 - - - - - - - 第 4 页，共 15 页 - - - - - - - - - 4 The dominant

22、 recognition paradigm in the past fifteen years is known as hidden Markov models (HMM). An HMM is a doubly stochastic model, in which the generation of the underlying phoneme string and the frame-by-frame, surface acoustic realizations are both represented probabilistically as Markov processes, as d

23、iscussed in sections,and 11.2. Neural networks have also been used to estimate the frame based scores; these scores are then integrated into HMM-based system architectures, in what has come to be known as hybrid systems, as described in section 11.5. An interesting feature of frame-based HMM systems

24、 is that speech segments are identified during the search process, rather than explicitly. An alternate approach is to first identify speech segments, then classify the segments and use the segment scores to recognize words. This approach has produced competitive recognition performance in several t

25、asks. 2 State of the Art Comments about the state-of-the-art need to be made in the context of specific applications which reflect the constraints on the task. Moreover, different technologies are sometimes appropriate for different tasks. For example, when the vocabulary is small, the entire word c

26、an be modeled as a single unit. Such an approach is not practical for large vocabularies, where word models must be built up from subword units. Performance of speech recognition systems is typically described in terms of word error rate E, defined as: where N is the total number of words in the tes

27、t set, and S, I, and D are the total number of substitutions, insertions, and deletions, respectively. The past decade has witnessed significant progress in speech recognition technology. Word error rates continue to drop by a factor of 2 every two years. Substantial progress has been made in the ba

28、sic technology, leading to the lowering of barriers to speaker independence, continuous speech, and large vocabularies. There are several factors that have contributed to this rapid progress. First, there is the coming of age of the HMM. HMM is powerful in that, with the availability of training dat

29、a, the parameters of the model can be trained automatically to give optimal performance. 名师资料总结 - - -精品资料欢迎下载 - - - - - - - - - - - - - - - - - - 名师精心整理 - - - - - - - 第 5 页，共 15 页 - - - - - - - - - 5 Second, much effort has gone into the development of large speech corpora for system development, tr

30、aining, and testing. Some of these corpora are designed for acoustic phonetic research, while others are highly task specific. Nowadays, it is not uncommon to have tens of thousands of sentences available for system training and testing. These corpora permit researchers to quantify the acoustic cues

31、 important for phonetic contrasts and to determine parameters of the recognizers in a statistically meaningful way. While many of these corpora (e.g., TIMIT, RM, ATIS, and WSJ; see section 12.3) were originally collected under the sponsorship of the U.S. Defense Advanced Research Projects Agency (AR

32、PA) to spur human language technology development among its contractors, they have nevertheless gained world-wide acceptance (e.g., in Canada, France, Germany, Japan, and the U.K.) as standards on which to evaluate speech recognition. Third, progress has been brought about by the establishment of st

33、andards for performance evaluation. Only a decade ago, researchers trained and tested their systems using locally collected data, and had not been very careful in delineating training and testing sets. As a result, it was very difficult to compare performance across systems, and a systems performanc

34、e typically degraded when it was presented with previously unseen data. The recent availability of a large body of data in the public domain, coupled with the specification of evaluation standards, has resulted in uniform documentation of test results, thus contributing to greater reliability in mon

35、itoring progress (corpus development activities and evaluation methodologies are summarized in chapters 12 and 13 respectively). Finally, advances in computer technology have also indirectly influenced our progress. The availability of fast computers with inexpensive mass storage capabilities has en

36、abled researchers to run many large scale experiments in a short amount of time. This means that the elapsed time between an idea and its implementation and evaluation is greatly reduced. In fact, speech recognition systems with reasonable performance can now run in real time using high-end workstat

37、ions without additional hardware-a feat unimaginable only a few years ago. One of the most popular, and potentially most useful tasks with low perplexity (PP=11) is the recognition of digits. For American English, speaker-independent recognition of digit strings spoken continuously and restricted to

38、 telephone bandwidth can achieve an error rate of 0.3% when the string length is known. One of the best known moderate-perplexity tasks is the 1,000-word so-called Resource 名师资料总结 - - -精品资料欢迎下载 - - - - - - - - - - - - - - - - - - 名师精心整理 - - - - - - - 第 6 页，共 15 页 - - - - - - - - - 6 Management (RM)

39、task, in which inquiries can be made concerning various naval vessels in the Pacific ocean. The best speaker-independent performance on the RM task is less than 4%, using a word-pair language model that constrains the possible words following a given word (PP=60). More recently, researchers have beg

40、un to address the issue of recognizing spontaneously generated speech. For example, in the Air Travel Information Service (ATIS) domain, word error rates of less than 3% has been reported for a vocabulary of nearly 2,000 words and a bigram language model with a perplexity of around 15. High perplexi

41、ty tasks with a vocabulary of thousands of words are intended primarily for the dictation application. After working on isolated-word, speaker-dependent systems for many years, the community has since 1992 moved towards very-large-vocabulary (20,000 words and more), high-perplexity (PP200), speaker-

42、independent, continuous speech recognition. The best system in 1994 achieved an error rate of 7.2% on read sentences drawn from North America business news. With the steady improvements in speech recognition performance, systems are now being deployed within telephone and cellular networks in many c

43、ountries. Within the next few years, speech recognition will be pervasive in telephone networks around the world. There are tremendous forces driving the development of the technology; in many countries, touch tone penetration is low, and voice is the only option for controlling automated services.

44、In voice dialing, for example, users can dial 10-20 telephone numbers by voice (e.g., call home) after having enrolled their voices by saying the words associated with telephone numbers. AT&T, on the other hand, has installed a call routing system using speaker-independent word-spotting technology t

45、hat can detect a few key phrases (e.g., person to person, calling card) in sentences such as: I want to charge it to my calling card. At present, several very large vocabulary dictation systems are available for document generation. These systems generally require speakers to pause between words. Th

46、eir performance can be further enhanced if one can apply constraints of the specific domain such as dictating medical reports. Even though much progress is being made, machines are a long way from recognizing conversational speech. Word recognition rates on telephone conversations in the Switchboard

47、 corpus are around 50%. It will be many years before unlimited vocabulary, speaker-independent continuous dictation capability is realized. 名师资料总结 - - -精品资料欢迎下载 - - - - - - - - - - - - - - - - - - 名师精心整理 - - - - - - - 第 7 页，共 15 页 - - - - - - - - - 7 3 Future Directions In 1992, the U.S. National Sc

48、ience Foundation sponsored a workshop to identify the key research challenges in the area of human language technology, and the infrastructure needed to support the work. The key research challenges are summarized in. Research in the following areas for speech recognition were identified: Robustness

49、: In a robust system, performance degrades gracefully (rather than catastrophically) as conditions become more different from those under which it was trained. Differences in channel characteristics and acoustic environment should receive particular attention. Portability: Portability refers to the

50、goal of rapidly designing, developing and deploying systems for new applications. At present, systems tend to suffer significant degradation when moved to a new task. In order to return to peak performance, they must be trained on examples specific to the new task, which is time consuming and expens

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

4.3 金币

版权申诉 word格式文档无特别注明外均可编辑修改；预览文档经过压缩，下载后原文更清晰！ 立即下载

配套讲稿：: 如PPT文件的首页显示word图标，表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
特殊限制：: 部分文档作品中含有的国旗、国徽等图片，仅作为作品整体效果示例展示，禁止商用。设计者仅对作品中独创性部分享有著作权。
关键词：: 2022年语音识别文献翻译 2022 语音识别文献翻译

淘文阁 - 分享文档赚钱的网站所有资源均是用户自行上传分享，仅供网友学习交流，未经上传用户书面授权，请勿作他用。

限制150内

关于本文

本文标题：2022年语音识别文献翻译 .pdf
链接地址：https://www.taowenge.com/p-28028408.html