Biophysics Study PPT Lesson Plan.pptx
Gene Prediction
[Figure: ideal case vs. real world]

What is a gene?
Wilhelm Johannsen's definition of a gene:
The word "gene" was first used by Wilhelm Johannsen in 1909, based on the concept developed by Gregor Mendel in 1866.
“The special conditions, foundations and determiners which are present in the gametes in unique, separate and thereby independent ways by which many characteristics of the organism are specified.” Johannsen, W. (1909) Biol. Philos. 4: 303-329.

What is a gene?
A gene is the basic physical and functional unit of heredity. Genes, which are made up of DNA, act as instructions to make molecules called proteins.
Old concept: A gene is a locus (or region) of DNA that encodes a functional protein or RNA product, and is the molecular unit of heredity.
New definition:

Gene Prediction
Gene prediction: to identify all genes in a genome
[Figure: a long stretch of unannotated genomic sequence (atgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgc...) with one region highlighted as "Gene"]
Gene prediction is the basis for functional studies

Finding all genes in a genome could be hard
Finding all the genes is hard
- Mammalian genomes are large: 8000 km of text in 10 pt type
- Only about 1% of the genome codes for proteins
- Non-coding RNAs are more difficult to predict

The structure of prokaryotic genes
Promoter structure of prokaryotic genes
The structure of eukaryotic genes

Open Reading Frames (ORFs)
Protein-coding gene prediction detects potential coding regions by looking for ORFs
Signals defining ORFs in eukaryotic genes:
- Start codon: ATG
- Stop codons: TAG, TGA, TAA
- Splicing donor sites: usually GT
- Splicing acceptor sites: usually AG
UTRs are usually defined according to expression evidence

Types of exons

Six Frames in a DNA Sequence
DNA replication occurs in the 5'-to-3' direction
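The ORF signals listed above (ATG start codon; TAG, TGA, TAA stop codons; three frames on each strand) can be illustrated with a minimal six-frame scan. This is only a sketch under simplifying assumptions: it treats the sequence as intron-free, so it ignores the GT/AG splice signals, and the names `find_orfs`, `reverse_complement`, and `min_codons` are hypothetical, not taken from any gene finder mentioned in these slides.

```python
# Minimal six-frame ORF scan (illustrative sketch; no introns or splicing).
START_CODON = "ATG"
STOP_CODONS = {"TAG", "TGA", "TAA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement, i.e. the other strand read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, frame, start, end, orf) for each ATG...stop stretch.

    start and end are 0-based, end-exclusive offsets on the strand being
    read. min_codons counts the start and stop codons (hypothetical cutoff).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG not yet closed by a stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGGG"):
        print(orf)
```

Real eukaryotic gene finders cannot rely on such a scan alone, because exons are interrupted by introns; the sketch only shows what "six frames" means operationally.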
Codon usage selection in translation
Codon usage in mouse genome
Uneven usage of codons may characterize a real gene!

Eukaryotic ORF prediction
Signals defining ORFs in eukaryotic genes:
- Start codon: ATG
- Stop codons: TAG, TGA, TAA
- Splicing donor sites: usually GT
- Splicing acceptor sites: usually AG
- Coding frame
- Codon usage

Gene syntax rules
The common gene syntax rules for forward-strand genes:

Conceptual gene finding framework

Methods for Eukaryotic Gene Prediction
1. Ab initio method:
- Uses only genomic sequences as input
- GENSCAN (Burge 1997; Burge and Karlin 1997)
- Fgenesh (Solovyev and Salamov 1997)
- Capable of predicting novel genes
2. Transcript-alignment-based method:
- Uses cDNA, mRNA or protein similarity as major clues
- ENSEMBL (Birney et al. 2004)
- High accuracy
- Can only find genes with transcription evidence
3. Hybrid method:
- Integrates EST, cDNA, mRNA or protein alignments into the ab initio method
- Fgenesh+ (Solovyev and Salamov 1997)
- AUGUSTUS+ (Stanke, Schoffmann et al. 2006)
4. Comparative-genomics-based method:
- Assumes coding regions are more conserved [Figure: Genome 1 compared with Genome 2]
- Capable of predicting novel genes and non-protein-coding genes
- Can use transcript data to improve prediction accuracy
- TWINSCAN and N-SCAN (do not use transcript similarity)
- TWINSCAN-EST and N-SCAN-EST (use transcript similarity)
Problems:
- Performance depends on the evolutionary distance between the compared sequences
- Exon/intron boundaries may not be conserved

About the ab initio gene prediction methods
Difficult to handle the following cases:
- Nested/overlapped genes
- Polycistronic genes
- Alternative splicing
- Frame-shift errors
- Split start codons
- Non-ATG triplet as the start codon
- Extremely short exons
- Extremely long introns
- Non-canonical introns
- UTR introns

The Hidden Markov Model is a commonly used model for gene prediction

Hidden Markov Model (HMM)
- Markov Property
- Markov Chain
- Markov Model
- Hidden Markov Model

Markov Property
The Markov property is simply that, given the present state, future states are independent of the past.
A stochastic process is a collection of random variables; it has the Markov property only when this condition holds.

Markov Chain
A Markov chain is a system that we can use to predict the future given the present.
In a Markov chain, the present state depends on only two things:
- Previous state
- Probability of moving from the previous state to the present state

Markov Chain
To estimate the status of students
Suppose graduate students have two types of moods:
- Happy
- Depressed about research
Each type of student has its own Markov chain.
Finally, there are three locations where we can find the students:
- Lab
- Canteen
- Dorm

Markov Chain
Markov chain of happy students [Figure: transitions among Lab, Canteen, Dorm]

Markov Chain
Markov chain of depressed students [Figure: transitions among Lab, Canteen, Dorm]

Markov Chain Probability
The probability of observing a given sequence is equal to the probability of its first state multiplied by the product of all observed transition probabilities.
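The product rule stated just above can be sketched in a few lines. The numeric values of the slide diagrams were not captured in this text, so every number in `initial` and `transition` below is a made-up placeholder, and the function name `sequence_probability` is hypothetical; only the rule itself comes from the slides.

```python
# Probability of a location sequence under a first-order Markov chain.
# ALL numbers below are hypothetical placeholders, not the slide's values.
initial = {"Lab": 0.5, "Canteen": 0.3, "Dorm": 0.2}
transition = {
    "Lab":     {"Lab": 0.6, "Canteen": 0.3, "Dorm": 0.1},
    "Canteen": {"Lab": 0.5, "Canteen": 0.1, "Dorm": 0.4},
    "Dorm":    {"Lab": 0.3, "Canteen": 0.2, "Dorm": 0.5},
}


def sequence_probability(states, initial, transition):
    """P(s1, ..., sn) = P(s1) * product over t of P(s_t | s_{t-1})."""
    prob = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= transition[prev][cur]
    return prob


if __name__ == "__main__":
    # P(Canteen) * P(Dorm | Canteen) * P(Lab | Dorm)
    print(sequence_probability(["Canteen", "Dorm", "Lab"], initial, transition))
    # P(Canteen) * P(Lab | Canteen)
    print(sequence_probability(["Canteen", "Lab"], initial, transition))
```

With one such table per mood, the same function could be used to compare how likely an observed sequence is for a happy versus a depressed student.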
P(Canteen - Dorm - Lab) = P(Canteen) P(Dorm|Canteen) P(Lab|Dorm)
P(Canteen - Lab) = P(Canteen) P(Lab|Canteen)

Markov Model
A Markov model is a stochastic model used to model a randomly changing system, where it is assumed that future states depend only on the present state.
[Figure: state diagrams over Lab, Canteen, Dorm]

Hidden Markov Model
Now we have general information about the relationship between student mood and location
- Mood is hidden
If we simply observe the locations of a student, can we tell what mood he is in?
- Observations are the locations of the students
- Parameters of the model are the probabilities of a student being in a particular location

Hidden Markov Model (HMM)
Observations: LLLC LLLCD DCLLCLLDDDDLLCLLCD DL LDDDDC CDDDDDDDDLCLLLCCLLCLLLCCL (L = Lab, C = Canteen, D = Dorm)
Hidden state: HHHHHHHHHHHHDDDDDDDDDHHHHHH (H = Happy, D = Depressed)

Using HMM to estimate student mood
[Figure: HMM with location probabilities Lab 0.75, Canteen 0.2, Dorm 0.05 for one mood and Lab 0.4, Canteen 0.2, Dorm 0.4 for the other]

Hidden Markov Model (HMM)

Application of HMM in gene prediction
What do we want?
- To find coding and non-coding regions in an unlabeled string of DNA sequence
Why are HMMs a good fit for gene prediction?
- DNA sequences are ordered, which is necessary for HMMs
- There is enough training data for what is a gene and what is not a gene
HMMs need to be trained to be truly effective

HMMs for gene prediction

Cautions about HMMs
Need to be mindful of overfitting
HMMs can be slow (they need proper decoding)
- DNA sequences can be very long, so processing them can be very time-consuming
States are supposed to be independent of each other, and this is NOT always true!
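The decoding step mentioned in the cautions can be sketched with the Viterbi algorithm on the student-mood example. The location probabilities (0.75/0.2/0.05 and 0.4/0.2/0.4) come from the slide, but assigning the first set to happy students is an assumption, and the start and mood-transition probabilities are invented placeholders because the slide text does not give them. The function name `viterbi` and the variable names are likewise only illustrative.

```python
import math

# Viterbi decoding for the two-mood student HMM (illustrative sketch).
states = ("Happy", "Depressed")

# Hypothetical: start and transition probabilities are NOT from the slides.
start = {"Happy": 0.5, "Depressed": 0.5}
trans = {
    "Happy":     {"Happy": 0.9, "Depressed": 0.1},
    "Depressed": {"Happy": 0.2, "Depressed": 0.8},
}
# Emission values appear on the slide; which mood owns which set is assumed.
emit = {
    "Happy":     {"L": 0.75, "C": 0.20, "D": 0.05},
    "Depressed": {"L": 0.40, "C": 0.20, "D": 0.40},
}


def viterbi(obs):
    """Return the most probable mood path for a string of L/C/D observations."""
    # score[s] is the best log-probability of any path that ends in state s.
    score = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + math.log(trans[p][s]))
            score[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][o])
            ptr[s] = best
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(s[0] for s in reversed(path))


if __name__ == "__main__":
    print(viterbi("LLLLDDDD"))
```

Working in log space avoids numerical underflow on long sequences, which matters for genome-scale input; the running time grows linearly with sequence length and quadratically with the number of states.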
On overfitting: a good training set is needed, and more training data does not always mean a better model.

Features for protein coding genes
Protein-coding genes have specific evolutionary constraints
- Gaps between homologous genes are multiples of three (preserve amino acid translation)
- Mutations are mostly at synonymous positions
- Conservation boundaries are sharp (pinpoint individual splicing signals)

Dmel TGTTCATAAATAAA-TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT-GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT-GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT-GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT-GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT-GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG-CGGCCGTGA-GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG-GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG-GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG-GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC-TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC-GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC-TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT
[Alignment annotations: conserved columns marked with *; splice site labeled "Splice"]

Measure of prediction accuracy
Exon Level
[Figure: REALITY vs. PREDICTION tracks showing a WRONG EXON, a CORRECT EXON, and a MISSING EXON]
Sn = Sensitivity = number of correct exons / number of actual exons
Sp = Specificity = number of correct exons / number of predicted exons

Measure of prediction accuracy
Nucleotide Level
[Figure: REALITY vs. PREDICTION tracks and a c/nc table with cells TP, FP, FN, TN]
Sensitivity: Sn = TP / (TP + FN)
Specificity: Sp = TP / (TP + FP)
c: coding; nc: non-coding; TP: true positive; FP: false positive; FN: false negative; TN: true negative

Gene prediction software
Example of gene finders

Accuracy of Gene Prediction

Gene prediction in prokaryotes
Gene prediction is easier in microbial genomes
Why?
- Smaller genomes
- Simpler gene structures
- More sequenced genomes! (for comparative approaches)
Methods?
- Previously, mostly HMM-based
- Now: similarity-based methods, because so many genomes are available

Summary
Nothing is perfect
Each gene identification approach has its own features and limitations
Genome annotation is an ongoing process, and its accuracy is being improved as methods improve and evidence data accumulate
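As a worked illustration of the nucleotide-level accuracy measures defined earlier (Sn = TP / (TP + FN), Sp = TP / (TP + FP)), the sketch below counts TP, FP, and FN from two equal-length label strings. The encoding ('c' for a coding base, 'n' for a non-coding base) and the function name `nucleotide_accuracy` are assumptions made for this example. Note that the quantity these slides call specificity is what many other fields call precision.

```python
# Nucleotide-level sensitivity and specificity (illustrative sketch).
# Hypothetical encoding: 'c' = coding base, 'n' = non-coding base.


def nucleotide_accuracy(reality, prediction):
    """Return (Sn, Sp) with Sn = TP/(TP+FN) and Sp = TP/(TP+FP)."""
    if len(reality) != len(prediction):
        raise ValueError("reality and prediction must have the same length")
    pairs = list(zip(reality, prediction))
    tp = sum(r == "c" and p == "c" for r, p in pairs)  # coding, predicted coding
    fp = sum(r == "n" and p == "c" for r, p in pairs)  # non-coding, predicted coding
    fn = sum(r == "c" and p == "n" for r, p in pairs)  # coding, predicted non-coding
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    return sn, sp


if __name__ == "__main__":
    sn, sp = nucleotide_accuracy("ccccnnnncc", "cccnnncccc")
    print(f"Sn = {sn:.3f}, Sp = {sp:.3f}")
```

In this toy input there are 5 true positives, 1 false negative, and 2 false positives, so Sn = 5/6 and Sp = 5/7.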