Research on Prediction of Biological Function of Small Molecules in Metabolic Pathway Using Data Mining: doctoral dissertation (137 pages)
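Chinese Library Classification: Q-31
Institution code: 10280
Security classification: Public
Student ID: 09820004

Doctoral Dissertation
SHANGHAI UNIVERSITY DOCTORAL DISSERTATION

Title: Research on Prediction of Biological Function of Small Molecules in Metabolic Pathway Using Data Mining
Author: Peng Chunrong
Discipline: Materials Science
Supervisor: Prof. Lu Wencong
Date of completion: May 2012

Shanghai University

This dissertation has been reviewed by all members of the defense committee and is confirmed to meet the quality requirements of Shanghai University for doctoral dissertations.

Chair of the defense committee: Name: Affiliation: Title:
Committee members: Name: Affiliation: Title:
Name: Affiliation: Title:
Name: Affiliation: Title:
Name: Affiliation: Title:
Supervisor: Name: Affiliation: Title:
Date of defense:

Declaration of Originality

I declare that the dissertation submitted is the result of research I carried out under the guidance of my supervisor. Except where specifically marked and acknowledged in the text, the dissertation contains no research results published or written by others. Any contribution to this research made by colleagues who took part in the same work has been clearly stated and acknowledged in the dissertation.

Signature: Date:

Authorization for Use of the Dissertation

I fully understand the regulations of Shanghai University on the retention and use of dissertations, namely: the university has the right to retain the dissertation and to submit copies of it, and to allow the dissertation to be consulted and borrowed; the university may publish all or part of its content. (Confidential dissertations are subject to these regulations after declassification.)

Signature: Supervisor's signature: Date:

Shanghai University Doctoral Dissertation in Engineering

Research on Prediction of Biological Function of Small Molecules in Metabolic Pathway Using Data Mining

Name: Peng Chunrong
Supervisor: Prof. Lu Wencong
Discipline: Materials Science
School of Materials Science and Engineering, Shanghai University
May 2012

A Dissertation Submitted to Shanghai University for the Doctor's Degree in Engineering

Research on Prediction of Biological Function of Small Molecules in Metabolic Pathway Using Data Mining

Ph.D. Candidate: Peng Chunrong
Supervisor: Prof. Lu Wencong
Major: Materials Science
School of Materials Science and Engineering
Shanghai University
May 2012

Abstract (translated from the Chinese)

Small molecules are compounds of relatively low molecular weight that take part in many biological processes, including metabolic reactions. An estimated 100,000 or more kinds of small molecules are associated with biological processes, yet the biological functions of fewer than 1% of them have been established so far. Research on identifying and predicting the biological functions of small molecules therefore helps to clarify the biological and chemical nature of some problems in life processes. By collecting and curating experimental results on the biological functions of small molecules, and by using data mining to summarize the regularities implicit in the known data, the biological functions of unknown small molecules can be predicted.

The first problem in applying data mining to this task is how to represent small molecules by parameters, which is critical for building mathematical models. After a comparison of existing commercial and open-source programs for computing molecular descriptors, ChemAxon's Calculator Plugins and related programs were selected and extended in Java, producing an easy-to-use, customizable program for batch computation of the molecular descriptors of small molecules. The program greatly improves the convenience and efficiency of descriptor computation and provides an efficient tool for research on identifying and predicting the biological functions of small molecules.

Correctly and effectively mapping biologically important small molecules to their corresponding metabolic pathways supports deeper metabolic analysis and a better understanding of the metabolic mechanisms of small molecules. Molecular descriptors were computed in batch with ChemAxon's JChem for Excel, features were selected with the mRMR (minimum Redundancy Maximum Relevance) and FFS (Feature Forward Search) algorithms, and AdaBoost with the C4.5 decision tree as base classifier was used to predict the type of metabolic pathway in which a small molecule may participate. The resulting model reached prediction accuracies of 83.88% in the 10-fold cross-validation test and 85.23% in the independent test set test, a marked improvement over representing small molecules by functional group composition. In addition, molecular descriptors were computed with HyperChem, features were selected with the CFS (Correlation-based Feature Subset) algorithm, and Bagging with the nearest neighbor algorithm as base classifier was used to predict the sub-pathway of lipid metabolism in which a small molecule may participate. This model reached prediction accuracies of 89.85% in jackknife cross-validation and 91.46% on the independent test set.

In metabolic pathways, small molecules take part in the whole metabolic process by interacting with enzymes. Studying these interactions makes it possible to predict, from known "small molecule-enzyme interaction pairs", whether an unknown small molecule and enzyme can interact, which offers new lines of inquiry into metabolic and catalytic mechanisms. Small molecules were represented by the output of the program developed in this work and enzymes by an improved pseudo amino acid composition, and the interactions between small molecules and enzymes in metabolic pathways were studied. Features were selected with a combination of the mRMR, IFS (Incremental Feature Selection) and FFS algorithms, and the model was built with the nearest neighbor algorithm. Its prediction accuracies were 85.19% in the 10-fold cross-validation test and 85.32% in the independent test set test; for positive samples they were 86.02% and 86.74%, respectively, a considerable improvement in positive-sample accuracy over previous work.

Protein-RNA interactions were studied with a voting method, and the results help in understanding how proteins control gene expression. Thirty-four classification algorithms were selected from Weka and four voting systems were built. The results show that voting outperforms any single classification algorithm, and that algorithm selection and weighting of the algorithms can further optimize the predictions. The weighted majority voting system with algorithm selection gave the best predictions: in the independent test set test, the average ACC (overall prediction accuracy) and average MCC (Matthews Correlation Coefficient) reached 82.04% and 64.70%, respectively.

Keywords: data mining, small molecules, molecular descriptors, metabolic pathway, ChemAxon, voting method

Abstract

Small molecules are compounds of relatively low molecular weight. More than one hundred thousand small molecules can participate in biological processes, including metabolic reactions, but the biological function is known for less than 1% of them so far. Research on the recognition and prediction of the biological functions of small molecules is therefore conducive to understanding the biological and chemical nature of some questions in life processes. The biological functions of unknown small molecules can be predicted by collecting experimental results and using data mining to summarize the regularities implied in the known data.

To recognize and predict the biological functions of small molecules by data mining, the first problem is how to encode small molecules, which plays a crucial role in mathematical modeling. After a comparison of existing commercial and open-source programs for computing molecular descriptors, Calculator Plugins from ChemAxon was selected, and a program for calculating molecular descriptors was developed on top of it in Java. The program is easy to use and can be customized for batch calculation. It has greatly improved the convenience and efficiency of the calculation and provides an efficient tool for the research described above.

Mapping small molecules to their corresponding metabolic pathways correctly and efficiently will contribute to the analysis of metabolic pathways and to an in-depth understanding of metabolic mechanisms. JChem for Excel from ChemAxon was chosen for batch computation of the descriptors of small molecules, the mRMR (minimum Redundancy Maximum Relevance) and FFS (Feature Forward Search) algorithms were used for feature selection, and the AdaBoost algorithm with the C4.5 decision tree as base classifier was used to predict the metabolic pathway in which a small molecule may be involved. The prediction accuracies of the 10-fold cross-validation test and the independent set test are 83.88% and 85.23%, respectively. These results are a significant improvement over those obtained when small molecules are encoded by functional group composition. The sub-pathway of lipid metabolism in which a small molecule may be involved was also predicted. HyperChem was chosen for computing the descriptors of small molecules, the CFS (Correlation-based Feature Subset) algorithm was used for feature selection, and the Bagging algorithm with the nearest neighbor algorithm as base classifier was used for modeling. The prediction accuracies of the jackknife cross-validation and the independent set test are 89.85% and 91.46%, respectively.

Small molecules participate in the whole metabolic process of a metabolic pathway through interaction with enzymes. Predicting unknown molecule-enzyme interactions from known ones can provide new ideas for exploring various metabolic or catalytic mechanisms. The output of the program developed above was used to encode small molecules, an improved pseudo amino acid composition was used to encode enzymes, and three algorithms, mRMR, IFS (Incremental Feature Selection) and FFS, were used for feature selection. The prediction model for molecule-enzyme interaction in metabolic pathways was built with the nearest neighbor algorithm. The prediction accuracies of the 10-fold cross-validation test and the independent set test for molecule-enzyme interaction are 85.19% and 85.32%, respectively, and the prediction accuracies for positive samples in the two tests are 86.02% and 86.74%, respectively. The prediction accuracy for positive samples is considerably higher than in previous work.

Protein-RNA interaction was studied with a voting algorithm, which is conducive to understanding how proteins control gene expression. Thirty-four classifiers were chosen from Weka, and four voting systems were built. The voting systems perform better than any single classifier, and algorithm selection and weighting can optimize the prediction accuracy. The weighted voting system with algorithm selection achieved the best prediction results, and the average ACC (overall prediction accuracy) value and average MCC (Matthews Correlation Coefficient) value reached 82.04% and 64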
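
The weighted voting scheme and the two reported metrics can be illustrated with a small Python sketch. This is not the dissertation's implementation, which used classifiers from Weka. The function names (weighted_majority_vote, acc, mcc), the toy labels and the weights below are all assumptions made for illustration, and the weights stand in for whatever per-classifier weighting the thesis actually used.

```python
from math import sqrt

def weighted_majority_vote(predictions, weights):
    """Combine binary (0/1) predictions from several classifiers.

    predictions: one list of labels per classifier, all of equal length.
    weights: one non-negative weight per classifier (hypothetical values here).
    A sample is labeled 1 when the summed weight of the classifiers voting 1
    exceeds the summed weight of those voting 0; ties go to 0.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        pos = sum(w for p, w in zip(predictions, weights) if p[i] == 1)
        neg = sum(w for p, w in zip(predictions, weights) if p[i] == 0)
        voted.append(1 if pos > neg else 0)
    return voted

def acc(y_true, y_pred):
    """Overall prediction accuracy: fraction of correctly labeled samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0.0 if undefined)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy data: three imperfect classifiers whose errors do not overlap.
y_true = [1, 1, 1, 0, 0, 0]
votes = [
    [1, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0],
]
weights = [0.40, 0.35, 0.25]  # illustrative weights, not taken from the thesis

voted = weighted_majority_vote(votes, weights)
print(acc(y_true, votes[0]), mcc(y_true, votes[0]))
print(acc(y_true, voted), mcc(y_true, voted))
```

In this toy case each classifier alone labels four of six samples correctly, while the combined vote recovers every label, which is the kind of gain over single classifiers that the abstract describes. Algorithm selection, the other ingredient named in the abstract, would amount to dropping some rows of votes before voting.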