On the Name Characteristics of Digital Resources and Their Applications in Resource Organization: Doctoral Dissertation





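Doctoral Dissertation

Title: On the Name Characteristics of Digital Resources and Their Applications in Resource Organization
Research area: Search Engines and Web Information Mining
Advisor: d

Dissertation submitted to Peking University in partial fulfillment of the requirement for the degree of Doctor of Philosophy in Science

Abstract (摘要)

In this dissertation, a digital resource is a non-web-page Internet information resource that is relatively independent and complete in meaning, usually composed of one or more files organized in some directory structure, and commonly found on FTP servers and on the nodes of P2P systems. Such resources are widely distributed across the Internet and matter a great deal to Internet users. At the same time, they are not only enormous in number but also published, propagated, and shared quite freely, so they appear "chaotic" and "disordered". Collecting these resources broadly and reorganizing them is a fundamental requirement of many applications. In this work the resource name is the most basic evidence: people rely on names to understand the resources they obtain, and they also name resources in order to identify them.

The dissertation first surveys how the various kinds of digital resources are named and studies the general regularities of user naming behavior that the names reveal. It then studies how to segment resource names into semantic fragments, examines the role of name information in automatic resource classification, and analyzes the factors that affect classification performance. Noting that the Internet holds many resource collections that are already organized quite well as directory trees, the dissertation studies the efficiency of integrating resources on the basis of directory-tree information and, for this integration task, designs a scalable scheme for incremental resource storage and organization. As an application of the research above, a repository system supporting the storage and organization of massive digital resources was implemented; it provides data and a system platform for research in related fields.

The main contributions are:

(1) It surveys the disorder of digital resource naming and analyzes the general regularities of user naming behavior. Examining, overall and per category, name length, character composition, fragment frequency distribution, the mutual information between file extensions and resource categories, and the kinds of semantics and their ordering, it analyzes the apparent disorder of resource names and the regularities beneath it. For example, character-type entropy shows that resource names are a channel through which users express all kinds of resource-related information; names of entertainment resources have higher character-type entropy than names of work and study resources, which indicates that users engage more strongly with entertainment content and tend to modify names to reflect their own opinions and evaluations. The occurrence of symbols shows that users tend to condense several meanings into a short name by means of explicit or implicit separators. These findings are the basis for the later work on name segmentation and resource classification.

(2) Based on transformation-based error-driven learning and the assumption that names split at character-type transitions, it proposes a lexicon-free method for segmenting resource names by semantic information. The method also applies to other text settings in which several scripts and symbols are mixed and several semantic types are expressed in condensed form. Its advantages are that it makes full use of context features in learning and does not require large-scale training data: with 800 training samples, for example, semantic fragments are segmented with 81% precision and 83% recall. The resulting segmentation helps extract information useful for describing resources from their chaotic original names.

(3) It proposes a method for automatic resource classification that uses features derived from the names of a resource and of its members, and studies how feature distribution, probability estimation, and sample size affect classification performance. It finds that the large number of low-frequency features (for example, features that occur in only one resource) contribute to classification accuracy by helping to estimate reasonable probabilities for unobserved features; it follows that when low-frequency features dominate and Simple Good-Turing smoothing is used, feature selection is unnecessary. With all features in use, overall classification accuracy reaches 80%. The method was also used to build a semi-automatic resource classification tool; when resource granularity is specified manually, classification takes 45%-50% of the baseline time.

(4) For resource collections whose original quality is good, it proposes a directory-merge model that integrates resources by exploiting knowledge of their original organization, describes a two-phase working mode of rough classification followed by fine checking, and evaluates the efficiency of the model. The rough classification phase loses some precision, but it completes the task in 1/(2a) of the baseline time (a is the number of resources processed in a batch, a1). The fine checking phase builds on the first phase, guarantees no loss of precision, and completes the task in about 1/2 of the baseline time.

(5) By collecting continuously from the Internet and applying the directory-merge model, it builds, efficiently and at low cost, a massive digital resource repository with a capacity of 7.5 TB. By mapping the classification scheme onto file directories and designing the storage and organization functions modularly at both the server level and the disk level, the system copes well with incremental storage, organization, and service demands. Drawing on the idea of ontology, the system also gathers related descriptive information from the Internet to extend the resources in popular categories.

Keywords: digital resources, naming analysis, organization, automatic classification, directory merging

Abstract

In this dissertation, the term "digital resource" refers to non-web-page data that is: 1) usually composed of one or more files of various data types, existing within some directory structure; 2) representative of a single independent topic; 3) widely shared and distributed through FTP sites or P2P file systems; 4) organized by Internet users at will rather than in well-defined styles. Internet users care about digital resources more and more. At the same time, digital resources are characterized by massive volume, disorder, and confusion. Widely collecting and organizing digital resources is a fundamental demand of many applications. In this work, the resource names are the most basic evidence. On the one hand, they provide clues to the meaning of resources. On the other hand, they are used to identify the resources.

This dissertation first studies the disordered naming of digital resources and tries to find the general naming behavior of Internet users. Second, it studies how to segment resource names by semantic meaning. Third, it studies how to use resource names in automatic resource classification and analyzes the factors that affect performance. Noting that there are many well-organized digital resources on the Web, we propose a method to reorganize the resources in different file directories into a coherent classification framework, and we evaluate the efficiency of the integration process. As a practical application of all the research above, we designed and implemented a scalable digital resource library that can support a massive volume of digital resources and can provide data and services for many academic institutions.

The contributions of this dissertation are as follows:

1) Study the disordered naming of digital resources and find the general naming behavior of Internet users. By examining name length, character types, the fragment frequency distribution, the pointwise mutual information between file extensions and resource categories, and the semantic information, we obtain an overall picture of the disorder and chaos of resource names. For example, the information entropy of character types shows that resource names act as a medium of expression to which Internet users are apt to add information about the digital resource, such as short descriptions and personal viewpoints. The occurrence of symbols shows that Internet users often use explicit or implicit separators within names to mark transitions between different semantic meanings. These studies are the basis of the later research in this dissertation.

2) Propose a segmentation approach that detects the semantic snippets in digital resource names without any lexicon. The approach is based on the idea of transformation-based error-driven learning and on the assumption that name strings split at positions of character-type transition. It can also be applied to similar problems in which texts are composed of various symbols and letters and express several types of semantic information in condensed form. The method takes full advantage of context and does not require large-scale training data. Training on 800 samples, we obtain 81% precision and 83% recall over all semantic fragments.

3) Propose a method that uses the names of resources and their members for automatic resource classification. We study performance factors such as feature distribution, the smoothing method used for probability estimation, and the number of samples. We found that a large quantity of low-frequency features, especially those that appear in only one resource, contribute to classification accuracy by helping to obtain reasonable probability estimates for unobserved features. Given this, the feature selection procedures usual in text categorization are not necessary in this setting. When all features acquired from the name strings are used, the overall accuracy of the proposed classification method reaches 80%. As an application of this method, we implemented a semi-automatic classification tool that classifies resources at only 45% to 55% of the time cost of the benchmark method.

4) Propose a tree-merge model that maps resources originally organized in file system directories and dispersed across the Internet onto a coherent classification architecture. The model performs well when the original organization is of good, usable quality. The model defines two phases: the first is a rough classification that commits quickly with a small loss of precision, and the second is a refinement phase that corrects misclassifications to the required quality. In the first phase, the time cost is only 1/(2a) of the baseline (a is the average number of resources classified in one judgement, a 1). In the refinement phase, the time is only half that of the baseline.

5) Con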
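The character-type entropy used in contribution 1 and the character-type transition assumption behind contribution 2 can be illustrated with a minimal Python sketch. It is not the dissertation's method: the four coarse character types, the function names, and the sample names are all assumptions made here for illustration, and the transformation-based error-driven learner that refines such splits is not shown.

```python
import math
from collections import Counter
from itertools import groupby


def char_type(ch):
    """Map a character to a coarse type (hypothetical categories)."""
    if ch.isdigit():
        return "digit"
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
        return "han"
    if ch.isalpha():
        return "letter"
    return "symbol"  # punctuation, whitespace, and everything else


def char_type_entropy(name):
    """Shannon entropy (in bits) of the character-type distribution of a name."""
    if not name:
        return 0.0
    counts = Counter(char_type(ch) for ch in name)
    total = len(name)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def split_at_type_transitions(name):
    """Cut a name wherever the character type changes; symbol runs are
    treated as explicit separators and dropped (an assumption)."""
    fragments = []
    for kind, group in groupby(name, key=char_type):
        if kind != "symbol":
            fragments.append("".join(group))
    return fragments


if __name__ == "__main__":
    for sample in ["Matrix.1999.DVDRip", "黑客帝国1999"]:  # made-up names
        print(sample, split_at_type_transitions(sample),
              round(char_type_entropy(sample), 3))
```

A split of this kind is only a starting point. Names such as "DVDRip2" or "mp3" show where a pure type-transition rule over-segments, which is the sort of error a learned correction step would have to repair.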