WO2005057208A1

WO2005057208A1 - Methods of identifying peptides and proteins

Info

Publication number: WO2005057208A1
Application number: PCT/US2004/040225
Authority: WO
Inventors: Srdjan Askovic; John Peltier; Amit Phansalkar; Joseph Patrick Eccleston
Original assignee: Prolexys Pharmaceuticals, Inc.
Priority date: 2003-12-03
Filing date: 2004-12-02
Publication date: 2005-06-23

Abstract

The present invention provides methods for identifying biological molecules by mass spectrometry, particularly peptides and proteins. In one aspect, the present invention includes methods of identifying proteins by analyzing fragmentation spectra of peptides derived from proteins. In another aspect, the present invention provides methods of quantifying the confidence of a putative peptide sequence assignment generated by a de novo peptide-sequencing algorithm, or a peptide identification algorithm employing one or more protein sequence databases. The present invention includes methods for identifying the amino acid sequence of peptides wherein peptide sequence annotators are used to independently verify assigned peptide sequence identities. The present invention also includes methods wherein a parallel confidence assessment algorithm of the present invention calculates the sum of selected peptide sequence annotators multiplied by weighting factors, each selected to achieve an accurate assessment of the confidence of each putative peptide sequence assignment.

Description

METHODS OF IDENTIFYING PEPTIDES AND PROTEINS

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority under 35 U.S.C. 119(e) to U.S. Provisional Patent Application No. 60/527,040, filed December 3, 2003, which is hereby incorporated by reference in its entirety to the extent not inconsistent with the disclosure herein.

BACKGROUND OF THE INVENTION

Interest in the field of proteomics has expanded tremendously in the last few years due to its potential to revolutionize biological and medical research, particularly in the development of new drugs and therapies. Traditionally, the term proteome is used to describe the entire set of proteins encoded by a genome. In a broader sense, however, the study of the proteome, called proteomics, involves characterization of gene and cellular function by determining the activities, interactions, localization and modifications of individual proteins and protein complexes present in a cell or tissue.

A proteome is highly dynamic because the types of proteins expressed by a cell and their abundances, modifications and subcellular locations vary substantially with the physiological condition of a cell or tissue. Characterization of changes in protein content and activity in response to disease, therefore, may assist in identifying new targets useful for drug development and novel biomarkers for the diagnosis and early detection of disease. Furthermore, proteomics research is highly complementary to other functional approaches to understanding cellular processes, such as microarray-based expression profiles, systematic genetics, and small molecule based arrays. Integration of information from these diverse perspectives via bioinformatic analysis promises to greatly facilitate our emerging understanding of systems-level cellular behavior.

Proteomics is a complex field involving a very large number of proteins and protein complexes corresponding to a genome. For example, the human proteome is expected to consist of between about 400,000 and about 1,000,000 proteins, which may interact to form a huge number of protein-protein complexes important in regulating cellular behavior. This complexity is further compounded by the large dynamic range associated with protein expression, typically exceeding six orders of magnitude, and by important post-translational modifications that affect protein activity and function ["From Genomics to Proteomics," Tyers, M. and Mann, Matthias, Nature, Vol. 422, pg 193-197 (2003)]. As a result of the extraordinarily large number of variables necessary for accurately characterizing cellular function in terms of protein behavior, a number of high throughput methods of identifying proteins have emerged over the last several years. These techniques include 2-D gel electrophoresis protein identification methods, genetic readout experiments, such as the yeast two-hybrid assay, micro-array and chip experiments, and mass spectrometry methods.

Mass spectrometry has played a long-standing role in identifying proteins in complex mixtures, probing protein-protein interactions and characterizing post-translational modifications. Mass spectrometric analysis provides sensitive, fast and selective detection and requires extremely small quantities of protein samples. In addition, mass spectrometric analysis is well suited for automated, high throughput operation, particularly when combined with multidimensional separation techniques, such as high performance liquid chromatography (HPLC) or capillary electrophoresis. The application of mass spectrometric methods to protein identification has been the subject of numerous scientific publications including "Mass Spectrometry and the Age of the Proteome," Yates, J.R., J. Mass Spectrometry, Vol. 33, 1-19 (1998); "Mass Spectrometry-based Proteomics," Aebersold, R. and Mann, Matthias, Nature, Vol. 422, 198-207 (2003); "Proteomics to Study Genes and Genomes," Pandey, A. and Mann, M., Nature, Vol. 405, 837-846 (2000); "Mass Spectrometry in Proteomics," Aebersold, R. and Goodlett, D.R., Chem. Rev., Vol. 101, 269-295 (2001); "An automated Multidimensional Protein Identification Technology for Shotgun Proteomics," Wolters, D.A., Washburn, M.P. and Yates, J.R., III, Anal. Chem., Vol. 73, 5683-5690 (2001); and "Analysis of Proteins and Proteomes by Mass Spectrometry," Mann, M., Hendrickson, R.C. and Pandey, A., Annu. Rev. Biochem., Vol. 70, 437-473 (2001), which are all hereby incorporated by reference in their entireties to the extent not inconsistent with the present description.

Traditionally, protein sequences are determined by stepwise enzymatic degradation of purified proteins into peptide fragments, for example by trypsin digestion, and subsequent mass analysis of peptide fragments by mass spectrometry. The recent availability of complete or partially complete gene and genome sequence databases has revolutionized the use of mass spectrometry to identify proteins. Gene and genome sequence information allows peptide mass data to be directly correlated to complementary protein sequence information, permitting rapid and often conclusive identification of many proteins. For example, protein identification has been carried out by peptide mapping mass spectrometric methods, such as peptide-mass mapping or peptide-mass fingerprinting. In these methods, a purified protein sample is subjected to proteolytic digestion and the resulting peptides are analyzed in a mass spectrometer, commonly an electrospray ionization (ESI) mass spectrometer or matrix assisted laser desorption/ionization (MALDI) mass spectrometer, thereby generating a list of experimentally determined peptide molecular masses. Protein sequences are identified by matching the list of experimentally determined peptide masses with calculated lists of all possible peptide masses in each entry of a comprehensive protein sequence database. Protein identification via peptide mapping may often be improved by incorporation of auxiliary sequence database search constraints including the estimated molecular weight of the parent protein and the cleavage specificity of the protease used for digestion.
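By way of illustration only, the peptide-mass matching described above may be sketched as follows. The sketch assumes a simplified trypsin rule (cleavage after K or R unless followed by P, with no missed cleavages), standard monoisotopic residue masses, and a hypothetical absolute mass tolerance `tol`; the function names are illustrative and are not taken from any particular search tool.

```python
# Monoisotopic residue masses in daltons (standard values, rounded to 5 places).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added for the free N- and C-termini


def tryptic_digest(protein):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_mass_matches(observed_masses, protein, tol=0.02):
    """Count observed masses that match any theoretical tryptic peptide mass."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(
        any(abs(m - t) <= tol for t in theoretical) for m in observed_masses
    )
```

In such a sketch, each database entry would be scored by `count_mass_matches`, and the entry with the most matches would be the candidate protein; real tools additionally account for missed cleavages and modifications.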

Tandem mass spectrometry (MS/MS) analysis methods have recently replaced traditional peptide-mapping or peptide-mass fingerprinting methods as the preferred protein identification technique. In protein identification by MS/MS analysis, a protein-containing sample is first subjected to proteolytic digestion, usually by a protease having high digestion specificity, such as trypsin. The resulting complex mixture of peptides is fractionated and delivered to a mass spectrometer for peptide identification. In the mass spectrometer, peptides are ionized, thereby forming precursor ions, which are selectively transmitted by a first mass analyzer on the basis of molecular mass, charge state or both. Transmitted precursor ions are broken down into fragment ions (or daughter ions) of the precursor ion. Fragment ions are subsequently mass analyzed by a second mass analyzer and detected, thereby generating a fragmentation mass spectrum comprising a series of peaks corresponding to the mass-to-charge ratios of all charge carrying fragments generated upon dissociation. As a result of this analysis, each peptide in the complex mixture may be characterized in terms of: (1) a precursor ion molecular mass and (2) a fragmentation spectrum. Acquired fragmentation spectra are analyzed using peptide sequence database search tools (e.g. spectrum matching tools), which compare acquired fragmentation spectra to theoretical peptide fragmentation spectra generated on the basis of peptide molecular mass. Alternatively, a de novo peptide-sequencing algorithm may be used to directly interpret a fragmentation mass spectrum and provide putative peptide sequence assignments. The output of these search tools is a list of putative peptide sequence assignments for each peptide analyzed. In addition, each putative peptide sequence assignment in the list is characterized by a confidence score that is intended to provide a measure of the accuracy of the assignment. The scored list of putative peptide assignments is provided as input to a protein identification algorithm, which reconstructs the sequences of proteins originally present in the sample using a protein sequence database derived from protein, gene and genome sequence information.
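As a minimal sketch of how a theoretical fragmentation spectrum of the kind referred to above might be generated for a candidate sequence, the following computes singly charged b-type and y-type fragment m/z values. It assumes unmodified residues, standard monoisotopic masses, and the common convention that a b ion is the N-terminal residue sum plus one proton and a y ion is the C-terminal residue sum plus water plus one proton; the name `by_ions` is hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values, rounded to 5 places).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def by_ions(peptide):
    """Return (b_ions, y_ions): singly charged m/z lists, shortest fragment first."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []

    running = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1), built from the N-terminus
        running += m
        b_ions.append(running + PROTON)

    running = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1), built from the C-terminus
        running += m
        y_ions.append(running + WATER + PROTON)

    return b_ions, y_ions
```

For a peptide of n residues, complementary fragments b(i) and y(n-i) from this sketch sum to the neutral peptide mass plus two protons, which gives a quick consistency check on the predicted values.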

Protein identification by MS/MS analysis methods provides clear benefits over conventional peptide mapping techniques. First, MS/MS analysis characterizes each peptide with respect to a fragmentation spectrum, in addition to peptide molecular mass. The peptide fragmentation process depends on the amino acid sequence of a peptide and in many cases the fragmentation products of a given peptide can be accurately predicted. As each fragment mass analyzed and detected represents a part of the peptide, peptide fragmentation spectra often provide unique signatures of peptide identities. For example, different peptides having the same or similar molecular masses are often easily distinguished on the basis of the peak patterns in their respective fragmentation spectra. Second, the additional information contained in fragmentation spectra allows specific proteins to be identified in the presence of other proteins. Therefore, MS/MS analysis methods are capable of identification and characterization of proteins in complex mixtures and are well suited for high throughput analysis.

Several algorithms have recently emerged for relating MS/MS data to sequence information using either spectrum matching or de novo peptide sequencing methods. First, the "peptide sequence tag" approach uses short, unambiguous amino acid sequences, which are derived from fragmentation spectra, in combination with peptide molecular mass measurements, to provide a specific probe to determine the identity of some proteins in a sample. Second, the "cross-correlation method" uses peptide sequences extracted from a protein sequence database to generate theoretical peptide fragmentation spectra. Theoretical spectra are directly compared to experimentally acquired fragmentation spectra to determine peptide identities useful for reconstructing the amino acid sequence of proteins in a sample [Eng, J.K., McCormack, A.L., and Yates III, J.R., J. Am. Soc. Mass Spectrom., Vol. 5, 976 (1994)]. Third, "probability based matching" evaluates correlations between observed peaks in an experimental fragmentation spectrum and the molecular masses of predicted fragments calculated from peptide and protein sequences [MacCoss, M.J., Wu, C.C., Yates III, J.R., Anal. Chem., Vol. 74, 5593-5599 (2002); Perkins, D.N. et al., Electrophoresis, Vol. 20, 3551-3567 (1999)]. These correlations are used to derive a statistical assessment of the match between the experimental spectrum and peptide sequences contained in a protein sequence database. Peptide assignments having high confidence assessments are then used to reconstruct the amino acid sequence of proteins in a sample.

A number of recent advances in MS/MS analytical methods have made protein identification using these techniques especially promising. First, a suite of high performance, complementary MS/MS instrumentation has emerged allowing for peptides to be analyzed under a wide range of sample preparation, ionization and fragmentation conditions. For example, time-of-flight/time-of-flight (TOF-TOF) instrumentation provides for MS/MS analysis of primarily singly charged precursor ions generated by matrix assisted laser desorption/ionization methods. Alternatively, triple quadrupole mass spectrometers, linear ion trap mass spectrometers, 3D ion trap mass spectrometers and quadrupole-time-of-flight mass spectrometers provide MS/MS analysis of multiply charged precursor ions generated by electrospray ionization methods. Second, use of multidimensional, on-line or off-line chromatographic separation in combination with MS/MS has been demonstrated to greatly improve the extent of separation of peptides in mixtures achievable prior to MS/MS analysis. For example, the use of two dimensional (strong cation exchange/reverse phase) or three dimensional (strong cation exchange/avidin/reverse phase) chromatography provides greater peak capacity than a single dimension of peptide chromatography, which results in less complex fragmentation spectra that are easier to interpret. Finally, stable-isotope dilution methods enhance the extent of quantitative information that may be extracted from peptide fragmentation spectra. In these methods, stable isotope tags are introduced into proteins via metabolic labeling, enzymatic reactions or via chemical reactions using isotope-coded affinity tags.

Despite these improvements, the full benefits of high throughput MS/MS techniques for protein identification remain unrealized due to difficulties in verifying peptide sequence assignments made by conventional peptide or protein sequence database search tools. A plurality of putative peptide sequence assignments is typically matched to every peptide fragmentation spectrum, despite the fact that only one sequence assignment is generally correct. Submission of incorrect peptide sequence assignments to protein identification algorithms often results in a large number of false positive protein identifications. In addition, submission of incorrect peptide sequence assignments to protein identification algorithms may result in an inaccurate assessment of the sequence coverage of correctly identified proteins, which can obscure identification of the correct variant of a modified protein.

Peptide sequence verification primarily involves distinguishing correct peptide assignments from false identifications in peptide or protein sequence database search results. A range of approaches to the problem of peptide sequence verification has been examined over the last several years. First, manual verification by researchers having expertise in fragmentation spectrum interpretation may assist in reducing or eliminating false peptide sequence assignments. Such manual verification approaches, however, are not feasible for the analysis of high throughput data sets, which may comprise thousands of individual peptide fragmentation spectra. In addition, manual verification regularly entails a significant amount of subjective fragmentation spectrum interpretation, which often generates different protein identifications from different experts. Second, sequence assignment filtering methods have also been applied to peptide sequence assignment verification. In these methods, filtering criteria based on observed and predicted peak positions in fragmentation spectra are applied to either the list of putative peptide sequence assignments or to the list of protein sequence assignments to reduce the rate of false peptide and/or protein identifications. Although these methods reduce the sheer amount of MS/MS data input into protein identification algorithms, no single filtering criterion is capable of providing a truly correct list of peptides or proteins. To overcome limitations associated with single filters, attempts have been made to combine a plurality of filters in a serial fashion. Use of a plurality of filtering criteria, however, often results in a propagation of errors associated with each filter criterion and may actually increase the incidence of peptide and protein misidentification. In addition, the order in which different filter criteria are applied substantially affects which assignments are deemed correct and which assignments are rejected. Further, serial filtering data reduction methods fail to utilize important correlations between filtering criteria and other experimental parameters not derived from fragmentation spectra, which may be especially important for sequence assignment verification.

It will be appreciated from the foregoing that a clear need exists for methods of verifying peptide sequence assignments derived from MS/MS data. Specifically, computer assisted peptide sequence assignment verification methods capable of high throughput data analysis and automation are needed. In addition, peptide verification methods are needed which decrease the number of false positive protein identifications generated by protein sequence identification methods using MS/MS data.

SUMMARY OF THE INVENTION

This invention provides methods for identifying biological molecules by mass spectrometry, particularly peptides and proteins. The present invention provides methods of identifying peptides from fragmentation spectra. In addition, the present invention includes methods of identifying proteins by analyzing peptides derived from proteins. It is an object of the present invention to provide methods of correlating MS/MS data to amino acid sequences in protein sequence and/or peptide sequence databases and/or amino acid sequences derived from de novo peptide-sequencing algorithms. It is further an object of the present invention to provide methods of independently verifying peptide sequence assignments generated by conventional protein and peptide database search algorithms, by de novo peptide-sequencing algorithms, or by any other method, algorithm, or computer program that generates putative peptide sequence assignments from peptide mass data and/or peptide fragmentation mass spectra. It is yet another object of the present invention to provide methods for identifying peptides and proteins, which decrease the number of false positive identifications and missed identifications relative to conventional peptide and protein identification methods. It is yet another object of the present invention to provide methods of detecting and characterizing modifications of proteins and peptides, such as co-translational modifications, post-translational modifications and modifications involving the introduction of identity tags and/or labels. In one aspect, the present invention provides methods for identifying the amino acid sequence of peptides wherein peptide sequence annotators are used to independently verify assigned peptide sequence identities.
In an exemplary method, the molecular mass of a peptide analyte is determined and a peptide fragmentation mass spectrum is generated comprising a series of peaks corresponding to fragments of the peptide analyte. A plurality of putative peptide sequence assignments is generated using any peptide sequence assignment algorithm or method that uses peptide fragmentation mass spectra and/or peptide mass data as input. In one aspect of the present invention, putative peptide sequence assignments are generated using a spectrum matching algorithm in combination with one or more protein sequence databases comprising a plurality of protein amino acid sequences and/or peptide sequence databases comprising a plurality of peptide sequences and peptide fragment sequences. Exemplary peptide sequence databases useful in the methods of the present invention may be derived from protein amino acid sequence data and/or genomic data. In another aspect of the present invention, peptide sequence assignments are generated using one or more de novo peptide-sequencing algorithms. In an exemplary embodiment, the putative peptide sequence assignments generated by the spectrum matching algorithm or de novo peptide-sequencing algorithm are characterized by molecular masses within a selected range of the molecular mass of said peptide analyte. Alternatively, putative peptide sequence assignments may be determined on the basis of both measured peptide analyte mass and observed peak positions in a fragmentation spectrum using information from a protein sequence database and/or information from a de novo peptide-sequencing algorithm.

A peptide sequence annotator index comprising a plurality of peptide sequence annotators is compiled for each of the putative peptide sequence assignments. In an exemplary embodiment, at least a portion of the peptide sequence annotators are determined by comparing the fragmentation mass spectrum of the peptide analyte to the entries of one or more peptide or protein sequence databases comprising masses of peptides, masses of fragments of peptides or both and/or comparing the fragmentation spectrum to one or more protein sequence databases comprising protein amino acid sequences. In addition, annotators can be determined on the basis of other characteristics, physical properties and/or chemical properties corresponding to each putative peptide sequence assignment, such as predicted retention times, elution times and mobilities on specific chromatographic media, molecular mass and expected fragmentation products. Further, annotators can be determined on the basis of mass spectrometric and/or chromatographic instrumentation used to analyze the peptide analyte, experimental conditions in the mass spectrometer, the composition and/or purity of the sample containing the peptide analyte, and statistical parameters characterizing the closeness of the measured peptide fragmentation spectrum to a theoretical fragmentation spectrum corresponding to a given putative peptide sequence assignment.

At least a portion of the peptide sequence annotators in each peptide sequence annotator index is input into a parallel confidence assessment algorithm. Selection of which peptide sequence annotators are input into the parallel confidence assessment algorithm of the present invention may be based on a wide range of experimental parameters including, but not limited to, the instrumentation used for MS/MS analysis, the composition of the sample containing the peptide analyte, sample purity, the presence of known background proteins, MS/MS data quality, signal to noise ratio in the peptide fragmentation mass spectrum, and precursor ion charge state. Operation of the parallel confidence assessment algorithm generates a quantitative confidence assessment for each putative peptide sequence assignment generated by the protein sequence database. The identity of the peptide analyte is determined by selecting the putative peptide sequence assignment having the highest confidence assessment.

In an exemplary embodiment, a parallel confidence assessment algorithm of the present invention calculates the sum of selected peptide sequence annotators multiplied by weighting factors, each selected to achieve an accurate assessment of the confidence of each putative peptide sequence assignment. In another embodiment, the parallel confidence assessment algorithm comprises a series of peptide sequence assessment rules determined by an artificial neural network algorithm, preferably peptide sequence assessment rules derived from the operation of an artificial neural network algorithm on one or more MS/MS data sets generated by analyzing a plurality of peptides having known identities or one or more MS/MS data sets wherein sequence assignments are manually verified. In another embodiment, the parallel confidence assessment algorithm evaluates one or more correlations between different peptide sequence annotators in each peptide sequence annotator index to achieve an accurate assessment of the confidence of each putative peptide sequence assignment.
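The weighted-sum embodiment described above, together with selection of the highest-scoring assignment, may be sketched as follows. This is an illustrative sketch only: the annotator names, the weighting factors, and the dictionary layout are hypothetical, and in practice the weights would be chosen for the instrument and sample at hand.

```python
def confidence(annotators, weights):
    """Weighted sum of the selected annotators.

    annotators: {annotator_name: value} for one putative assignment.
    weights:    {annotator_name: weighting factor}; annotators without a
                weight are treated as not selected and are ignored.
    """
    return sum(
        weights[name] * value
        for name, value in annotators.items()
        if name in weights
    )


def best_assignment(candidates, weights):
    """Return the candidate sequence with the highest confidence.

    candidates: {sequence: annotator index (dict) for that sequence}.
    """
    return max(candidates, key=lambda seq: confidence(candidates[seq], weights))


# Hypothetical example: two candidate sequences for one spectrum.
example_weights = {"matched_y": 0.5, "mass_error": -2.0}
example_candidates = {
    "PEPTIDE": {"matched_y": 6, "mass_error": 0.1, "unused": 99},
    "PEPTIDQ": {"matched_y": 2, "mass_error": 0.5, "unused": 99},
}
winner = best_assignment(example_candidates, example_weights)
```

Because every selected annotator contributes to a single score at once, no annotator acts as a hard gate on the others, which is the contrast with serial filtering drawn in the background discussion.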

Peptide sequence annotators useable in the present invention comprise any information useful by itself or in combination with other annotators for assessing the accuracy of an assigned putative peptide sequence assignment. When analyzed in combination, peptide sequence annotators of the present invention are capable of distinguishing correct putative peptide sequence assignments from incorrect putative peptide sequence assignments and are capable of determining a confidence assessment for each putative peptide sequence assignment. Sequence annotators may be determined empirically and/or predicted from known information relating to a putative peptide sequence assignment. Sequence annotators may be derived from predicted chemical and/or physical properties of putative peptide sequence assignments, for example, on the basis of the amino acid sequence, the presence of modifications in the amino acids comprising the peptide, size, affinity, structure, molecular mass or any combination of these properties. Alternatively, peptide sequence annotators of the present invention may be derived from experimental conditions or measurements. For example, exemplary sequence annotators may be derived from ionization conditions or fragmentation conditions in a mass spectrometer. Alternatively, peptide sequence annotators of the present invention may be derived from the combination of predicted chemical or physical properties of putative peptide sequence assignments and experimental conditions or measurements. For example, exemplary annotators may be derived from a comparison of the measured molecular mass or observed fragmentation spectrum of a peptide analyte and the molecular mass or predicted fragments corresponding to a putative peptide sequence assignment.

In one embodiment, exemplary sequence annotators are derived from correlations between predicted fragmentation patterns of putative peptide sequence assignments and fragment masses extracted from a peptide fragmentation mass spectrum. In an embodiment of the present invention, the observed pattern of peaks in a fragmentation mass spectrum is analyzed to provide a series of intensities corresponding to a plurality of fragments having different molecular masses. Annotators of the present invention may be determined by comparing the masses and/or relative intensities of fragments observed in a peptide fragmentation mass spectrum to peptide fragments predicted for a given putative peptide sequence assignment using one or more peptide and/or protein sequence databases comprising amino acid sequences of proteins, masses of peptides, masses of expected fragments of peptides or any combination of these. Alternatively, annotators of the present invention may be determined by comparing the masses and/or relative intensities of fragments observed in a peptide fragmentation mass spectrum to peptide fragments predicted on the basis of known peptide fragmentation kinetics and dynamics. In this context, "fragments predicted for each putative peptide sequence assignment" refers to the fragments that are expected to be generated upon analysis of a peptide having the same sequence as the putative peptide sequence via a selected MS/MS analysis method or instrument. The presence or absence of "matching fragments" which are present in a peptide fragmentation mass spectrum and predicted for a selected putative peptide sequence assignment is an indicator as to the accuracy of a given putative peptide sequence assignment.
Annotators of the present invention can be derived by analysis of the molecular masses corresponding to all fragments observed in a peptide fragmentation spectrum or from a wide variety of specific fragment types, such as a-type fragments, b-type fragments, c-type fragments, x-type fragments, y-type fragments, z-type fragments, internal fragments, immonium ions and satellite ions. Exemplary annotators of the present invention comprise cumulative relative intensities and/or numbers of all matching fragments, all matching a-type fragments, all matching b-type fragments, all matching c-type fragments, all matching x-type fragments, all matching y-type fragments, all matching z-type fragments, all matching internal fragments and all matching immonium ions.
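One possible way to compute the matching-fragment annotators described above (number of matching fragments and their cumulative relative intensity, per fragment type) is sketched below. The sketch assumes the predicted fragment m/z values are already available for the candidate sequence, uses a hypothetical m/z tolerance `tol`, and takes relative intensity to mean the matched intensity divided by the total intensity of the spectrum; the output key names are illustrative.

```python
def fragment_annotators(peaks, predicted, tol=0.5):
    """Count predicted fragments found in a spectrum, per fragment type.

    peaks:     list of (mz, intensity) tuples from the observed spectrum.
    predicted: {fragment_type: [predicted m/z, ...]}, e.g. {"b": [...], "y": [...]}.
    Returns {"n_<type>": count, "rel_int_<type>": summed relative intensity}.
    """
    total = sum(intensity for _, intensity in peaks) or 1.0  # avoid divide-by-zero
    result = {}
    for frag_type, predicted_mzs in predicted.items():
        count, matched_intensity = 0, 0.0
        for p in predicted_mzs:
            # All observed peaks within tolerance of this predicted fragment.
            hits = [i for mz, i in peaks if abs(mz - p) <= tol]
            if hits:
                count += 1
                matched_intensity += max(hits)  # credit the most intense hit
        result[f"n_{frag_type}"] = count
        result[f"rel_int_{frag_type}"] = matched_intensity / total
    return result
```

The resulting dictionary can serve directly as part of a peptide sequence annotator index, with one pair of entries for each fragment type that is predicted for the instrument in use.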

In another embodiment, annotators may be derived from correlations between the molecular masses and/or relative intensities of two or more observed or predicted peptide fragments. An exemplary annotator of the present invention is determined by comparing fragment masses observed in a fragmentation mass spectrum and de novo sequence tags predicted for a given putative peptide sequence assignment. In this context, "de novo sequence tags" refers to one or more correlations between the masses of peptide fragments predicted on the basis of the amino acid sequence of a given putative peptide sequence assignment. Such correlations typically comprise one or more mass differences between a plurality of expected peptide fragments. The presence or absence of fragment masses extracted from a fragmentation spectrum which are characterized by the same correlations as a de novo sequence tag is an indicator as to the accuracy of a given putative peptide sequence assignment.
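The sequence-tag comparison described above may be illustrated with the following sketch, which treats a de novo sequence tag as the set of mass differences between adjacent predicted fragments in one ion series and reports the fraction of those differences that also occur between some pair of observed peaks. The function name, the tolerance `tol`, and the choice to report a fraction are assumptions made for illustration.

```python
def sequence_tag_annotator(observed_mz, predicted_ladder, tol=0.02):
    """Fraction of predicted adjacent-fragment mass gaps seen in the spectrum.

    observed_mz:      list of observed fragment m/z values.
    predicted_ladder: predicted m/z values of one ion series (e.g. y1, y2, ...)
                      for the candidate sequence.
    """
    ladder = sorted(predicted_ladder)
    gaps = [hi - lo for lo, hi in zip(ladder, ladder[1:])]
    if not gaps:
        return 0.0

    obs = sorted(observed_mz)
    supported = 0
    for gap in gaps:
        # Is there any pair of observed peaks separated by this gap?
        found = any(
            abs((hi - lo) - gap) <= tol
            for i, lo in enumerate(obs)
            for hi in obs[i + 1:]
        )
        if found:
            supported += 1
    return supported / len(gaps)
```

A value near 1 indicates that the residue-to-residue spacing expected for the candidate is largely present in the spectrum, while a value near 0 indicates that it is largely absent. This simple version does not require the supporting peak pairs to form a contiguous ladder, which a stricter implementation might enforce.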

In another embodiment, exemplary sequence annotators are derived from peptide fractionation and/or separation properties, such as elution time, retention time and mobility on specific chromatographic media under specific conditions. Exemplary separation properties include mobilities and/or retention times characterized using liquid phase and/or gas phase chromatographic methods. In an embodiment of the present invention, a peptide-containing sample is subjected to fractionation prior to MS/MS analysis and peptide analytes in the sample are characterized in terms of retention time, elution time or mobility. Exemplary fractionation techniques useful in the present invention include, but are not limited to, chromatographic and electrophoresis methods, such as capillary electrophoresis. Annotators of the present invention may be determined by comparing the fractionation and/or separation properties experimentally determined for a peptide analyte to the predicted fractionation and/or separation properties of a peptide corresponding to a putative peptide sequence assignment. An exemplary annotator of the present invention is determined by comparing the observed retention time of a peptide analyte on specific chromatographic media to a retention time predicted for a peptide having the same sequence as a putative peptide sequence assignment on the same chromatographic media. Retention times and other fractionation/separation properties can be predicted for a putative peptide sequence on the basis of amino acid sequence, peptide structure, size, shape, affinity or any combination of these properties. The closeness of the observed retention time and the predicted retention time is an indication of the accuracy of a given putative peptide assignment.

In another embodiment, exemplary sequence annotators are derived by comparing the measured molecular mass of a peptide analyte and the molecular mass corresponding to a putative peptide sequence assignment. An exemplary peptide sequence annotator is determined by subtracting the experimentally determined molecular mass of a peptide analyte from the molecular mass corresponding to a selected putative peptide assignment. The closeness of these molecular masses is an indication of the accuracy of a given putative peptide assignment.
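The closeness-based annotators described above (molecular mass and retention time) may be sketched as follows. The signed mass difference follows the subtraction stated in the text; the parts-per-million form and the absolute retention-time deviation are common conventions assumed here for illustration, and the predicted retention time is taken as an input rather than computed, since no prediction model is specified.

```python
def mass_difference(measured_mass, candidate_mass):
    """Candidate mass minus measured analyte mass, in daltons (sign preserved)."""
    return candidate_mass - measured_mass


def mass_error_ppm(measured_mass, candidate_mass):
    """Same difference expressed in parts per million of the candidate mass."""
    return 1e6 * (candidate_mass - measured_mass) / candidate_mass


def retention_time_deviation(observed_rt, predicted_rt):
    """Absolute gap between observed and predicted retention time (same units)."""
    return abs(observed_rt - predicted_rt)
```

Smaller magnitudes from any of these functions indicate closer agreement between the analyte and the candidate, so a weighting factor applied to them in a summed confidence score would ordinarily be negative.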

The peptide identification methods of the present invention provide several advantages over conventional methods of peptide identification by mass spectrometry. First, the methods of the present invention are highly versatile and are applicable to a wide variety of mass spectrometric analysis methods and instrumentation. Second, the present methods are amenable to computer assisted automation and, thus, are well suited to high throughput analysis of a large number of different peptide analytes. Third, peptide sequence verification provided by the present invention uses an objective validation criterion based on an observed peptide fragmentation spectrum and predicted chemical and/or physical properties of peptides. Therefore, the present methods are not susceptible to operator-introduced subjective bias. Fourth, the present methods reduce the number of false peptide sequence assignments and missed peptide sequence assignments generated for a given peptide analyte.

In another aspect, the present invention provides methods of quantifying the confidence of a putative peptide sequence assignment generated by a de novo peptide-sequencing algorithm, or a peptide identification algorithm employing one or more protein sequence databases comprising protein amino acid sequences and/or peptide sequence databases comprising protein masses, peptide masses, peptide fragment masses or both, or any other peptide sequence assignment method, algorithm or computer software that uses peptide fragmentation mass spectra and/or peptide mass data as input. In this aspect, a peptide sequence annotator index is generated for a putative peptide sequence assignment using the methods of the present invention. At least a portion of the peptide sequence annotators in the index are input into a parallel peptide sequence assessment algorithm of the present invention and operation of the parallel peptide sequence assessment algorithm yields a determination of the confidence assessment of the putative peptide sequence assignment. In this context, "confidence assessment of a peptide sequence assignment" refers to the probability that the sequence assignment is correct. Therefore, the present invention provides methods of determining the probability that a peptide sequence assignment is correct or incorrect, preferably for a chosen statistical significance level. In another embodiment, the peptide sequence confidence assessment methods of the present invention are capable of ranking a plurality of putative peptide sequence assignments, such as the sequence assignments determined by a conventional peptide identification algorithm, in order of ascending or descending probability that the putative peptide sequence assignment is correct.

In another aspect, the present invention comprises methods of identifying the amino acid sequence of a protein employing peptide sequence verification by analyzing peptide MS/MS data. In an exemplary method, a protein analyte is decomposed into a plurality of peptide analytes, preferably by selective proteolytic digestion. The peptides are fractionated and sequentially delivered to a mass spectrometer. Peptide molecular masses and peptide fragmentation spectra are determined for some or all of the peptides and input into a peptide identification algorithm, for example a spectrum matching algorithm using one or more protein sequence databases comprising protein amino acid sequences and/or one or more peptide or protein sequence databases comprising peptide masses, peptide fragment masses or both, or a de novo peptide sequence assignment algorithm. A series of putative peptide sequence assignments are generated for each fragmentation spectrum. A confidence assessment of each putative peptide sequence assignment in each series of putative peptide sequence assignments is made using the methods of the present invention. At least a portion of the peptide sequence assignments corresponding to each fragmentation spectrum are input into a protein identification algorithm utilizing one or more protein amino acid sequence databases. Operation of the protein identification algorithm results in a determination of one or more putative protein sequences associated with the protein analyte, particularly the amino acid sequence of the protein. In a preferred embodiment, operation of the protein identification algorithm results in the determination of a single protein sequence associated with the protein analyte.

In an exemplary method of identifying the amino acid sequence of a protein analyte, only those peptide sequence assignments having a confidence assessment greater than a selected threshold value are input into the protein identification algorithm. In an alternative embodiment, a confidence assessment is assigned to every putative peptide sequence assignment for each peptide analyte. Confidence assessments are input into the protein identification algorithm along with each putative sequence assignment, thereby resulting in more accurate protein sequence identifications by the protein identification algorithm. In another embodiment, putative peptide sequence assignments corresponding to each peptide analyte are ranked in order of descending confidence assessment and are input into the protein identification algorithm in the form of an ordered list.
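By way of illustration only, the threshold selection, ranking and protein lookup described above may be sketched in Python as follows. The representation of each assignment as a (sequence, confidence) pair, the function names and the use of exact substring matching against an in-memory dictionary are assumptions made for this sketch; an actual protein identification algorithm may use any suitable scoring scheme.

```python
def select_assignments(assignments, threshold):
    """Keep (sequence, confidence) pairs whose confidence exceeds the
    threshold, ordered by descending confidence."""
    kept = [pair for pair in assignments if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def identify_proteins(assignments, protein_db):
    """Score each database protein by summing the confidence of every
    putative peptide found in its sequence (naive substring match)."""
    scores = {}
    for sequence, confidence in assignments:
        for name, protein_sequence in protein_db.items():
            if sequence in protein_sequence:
                scores[name] = scores.get(name, 0.0) + confidence
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical example data, not taken from the disclosure.
candidates = [("PEPTIDEK", 0.91), ("PEPTLDEK", 0.40), ("PEPSIDEK", 0.75)]
database = {"protA": "MKPEPTIDEKAAR", "protB": "MSSLLK"}
selected = select_assignments(candidates, threshold=0.5)
ranked_proteins = identify_proteins(selected, database)
```

Passing the confidence values through to the protein-level score corresponds to the alternative embodiment in which confidence assessments accompany each putative sequence assignment.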

Methods of identifying protein amino acid sequences employing peptide sequence verification provide several advantages over conventional protein identification methods. First, use of peptide sequence verification decreases the number of putative peptide assignments submitted to the protein sequence analysis algorithm and thus, reduces the computational resources required to generate protein identifications. Second, peptide sequence verification also reduces the rate of false protein identifications and provides more accurate sequence assessments of identified proteins.

Protein sequence identification methods of the present invention may be used to identify proteins in substantially purified samples or in complex mixtures. Proteins may be identified using the present methods in the presence of one or more different proteins or other biological molecules, such as oligonucleotides, polysaccharides and carbohydrates. Methods of the present invention are capable of identifying a plurality of proteins present in a protein-containing sample.

The protein identification methods of the present invention are capable of detecting and characterizing post-translational modification of proteins. These methods are based on the fact that peptides comprising modified amino acids exhibit different fragmentation processes than peptides comprising unsubstituted amino acids and, therefore, generate different fragments. To distinguish between modified and unmodified peptides, masses of fragments observed in peptide fragmentation spectra are compared to one or more protein sequence databases and/or protein or peptide sequence databases comprising the masses of expected fragments of unmodified and modified proteins and/or peptides. Alternatively, masses of fragments observed in a peptide fragmentation spectrum may be analyzed using a de novo peptide-sequencing algorithm to generate a plurality of putative peptide sequences, including peptide sequences comprising one or more modified amino acids. In the present invention, the composition of modified proteins is reconstructed by inputting putative peptide sequence assignments, including putative peptide sequence assignments comprising modified amino acids, into a protein identification algorithm. Post-translational modifications detectable and characterizable by the methods of the present invention include, but are not limited to, phosphorylation, lipidation, prenylation, sulfation, hydroxylation, acetylation, addition of cofactors, formation of disulfide bonds and proteolysis.

The methods of the present invention are broadly applicable to the analysis of any polymeric material, particularly biopolymers such as oligonucleotides, polysaccharides and carbohydrates. Application of the present methods to identifying polymers involves cleaving the polymer into its constituent parts (or cleavage products) and identifying the composition of these parts by MS/MS analysis. Cleavage of biopolymers may be performed by any cleaving means known in the art including, but not limited to, enzymatic degradation, chemical degradation, photolytic degradation and photochemical degradation. Preferred means of cleaving biological molecules provide cleavage at specific bonds. The methods of identifying polymers of the present invention include the step of generating a sequence database, which characterizes the sequence of analyte biomolecules with respect to their expected cleavage products. In addition, the methods of identifying biopolymers of the present invention include the step of generating expected fragment databases for analyte biopolymers and comprising the masses of polymers, cleavage products of polymers, fragments of cleavage products or any combination of these.

In another embodiment, the present invention provides a method for identifying a peptide analyte, said method comprising the steps of: (1) measuring the molecular mass of said peptide analyte; (2) generating a fragmentation mass spectrum of said peptide analyte comprising a series of peaks corresponding to fragments of said peptide analyte; (3) determining a plurality of putative peptide sequence assignments for said peptide analyte using a peptide sequence database comprising protein amino acid sequences, a spectrum matching algorithm, a de novo peptide sequence assignment algorithm or any combination of these; (4) compiling a peptide sequence annotator index for each of said putative peptide sequence assignments comprising a plurality of peptide sequence annotators; (5) combining at least a portion of said peptide sequence annotators in a parallel confidence assessment algorithm, thereby generating a confidence assessment for each putative peptide sequence assignment, wherein said parallel confidence assessment algorithm comprises a series of peptide sequence assessment rules derived from an artificial neural network algorithm; and (6) identifying said peptide by determining the putative peptide sequence assignment having the highest confidence assessment.

In another embodiment, the present invention provides a method for identifying a protein analyte comprising the steps of: (1) digesting said protein analyte, thereby generating a plurality of peptide analytes; (2) measuring the molecular mass of said peptide analytes; (3) generating a fragmentation mass spectrum for each of said peptide analytes comprising a series of peaks corresponding to fragments of said peptide analytes; (4) determining a plurality of putative peptide sequence assignments for each of said peptide analytes using a peptide sequence database comprising protein amino acid sequences, a spectrum matching algorithm, a de novo peptide sequence assignment algorithm or any combination of these; (5) compiling a peptide sequence annotator index for each of said putative peptide sequence assignments comprising a plurality of peptide sequence annotators; (6) combining at least a portion of said peptide sequence annotators in a parallel confidence assessment algorithm, thereby generating a confidence assessment for each putative peptide sequence assignment, wherein said parallel confidence assessment algorithm comprises peptide sequence assessment rules derived from an artificial neural network algorithm; and (7) inputting said putative peptide sequence assignments and confidence assessments into a protein identification algorithm, wherein said protein identification algorithm compares said putative peptide sequence assignments to a protein sequence database comprising protein amino acid sequences, thereby determining the identity of said protein analyte.

The invention is further illustrated by the following description, examples, drawings and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a schematic illustrating an exemplary method of identifying the amino acid sequence of protein analytes in a protein-containing sample.

Figure 2 is a schematic diagram illustrating exemplary methods of peptide sequence verification and data analysis for the analysis of peptide and protein analytes.

Figures 3A-D show MS/MS spectra corresponding to four different peptide analytes. Figure 3A corresponds to Sequence Identity No. 1, Figure 3B corresponds to Sequence Identity No. 2, Figure 3C corresponds to Sequence Identity No. 3 and Figure 3D corresponds to Sequence Identity No. 4.

DETAILED DESCRIPTION OF THE INVENTION

Referring to the drawings, like numerals indicate like elements and the same number appearing in more than one drawing refers to the same element. In addition, hereinafter, the following definitions apply:

The terms "peptide" and "polypeptide" are used synonymously in the present disclosure, and refer to a class of compounds composed of amino acid residues chemically bonded together by amide bonds (or peptide bonds). Peptides and polypeptides also include polymeric compounds composed of amino acid residues including one or more modified amino acid residues. Modifications can be naturally occurring or non-naturally occurring, such as modifications generated by chemical synthesis. Modifications to amino acids in peptides or polypeptides include, but are not limited to, phosphorylation, lipidation, prenylation, sulfonation, hydroxylation, acetylation, methionine oxidation, alkylation, acylation, carbamylation, iodination and the addition of cofactors. Peptides and polypeptides are polymeric compounds comprising at least two amino acid residues or modified amino acid residues. Peptides and polypeptides of the present invention may be generated by degradation of proteins, for example by proteolytic digestion. Peptides and polypeptides may be generated by substantially complete digestion or by partial digestion of proteins. Identifying a peptide or polypeptide refers to determination of its composition, particularly its amino acid sequence, and characterization of any modifications of one or more amino acids comprising the peptide or polypeptide.

"Protein" refers to a class of compounds comprising one or more polypeptide chains and/or modified polypeptide chains. Proteins may be modified by naturally occurring processes such as post-translational modifications or co-translational modifications. Exemplary post-translational modifications or co-translational modifications include, but are not limited to, phosphorylation, lipidation, prenylation, sulfonation, hydroxylation, acetylation, methionine oxidation, the addition of cofactors, proteolysis, and assembly of proteins into macromolecular complexes. Modification of proteins may also include non-naturally occurring derivatives, analogues and functional mimetics generated by chemical synthesis. Exemplary derivatives include chemical modifications such as alkylation, acylation, carbamylation, iodination or any modification that derivatizes the protein. In the present invention, proteins may be modified by labeling methods, such as metabolic labeling, enzymatic labeling or by chemical reactions. Proteins may be modified by the introduction of stable isotope tags, for example as is typically done in a stable isotope dilution experiment. Proteins of the present invention may be derived from sources which include, but are not limited to, cells, cell or tissue lysates, cell culture medium after cell growth, whole organisms or organism lysates or any excreted fluid or solid from a cell or organism.

"Fragment" refers to a portion of a polymer analyte, such as a peptide. Fragments may be derived from bond cleavage in a parent polymer, such as a parent peptide. Fragments may also be generated from multiple cleavage events or steps. A fragment may be a truncated form of a parent peptide, truncated at the carboxy terminus, the amino terminus or both. A fragment may refer to products generated upon the cleavage of a peptide bond, a C-C bond, a C-N bond, a C-O bond or a combination of these processes. Fragments may refer to products formed by processes whereby one or more side chains of amino acids are removed, or a modification is removed, or any combination of these processes. Fragments useful in the present invention include fragments formed under metastable conditions or resulting from the introduction of energy to the precursor by a variety of methods including, but not limited to, collision induced dissociation (CID), surface induced dissociation (SID), laser induced dissociation (LID), electron capture dissociation or any combination of these methods or any equivalents known in the art of tandem mass spectrometry. Fragments useful in the present invention also include, but are not limited to, x-type fragments, y-type fragments, z-type fragments, a-type fragments, b-type fragments, c-type fragments, internal ions (or internal cleavage ions), immonium ions or satellite ions. The types of fragments derived from a parent polymer analyte, such as a peptide analyte, often depend on the sequence of the parent, method of fragmentation, charge state of the parent precursor ion, amount of energy introduced to the parent precursor ion and method of delivering energy into the parent precursor ion. In the present invention, comparison of the molecular masses of fragments derived from a fragmentation spectrum to the molecular masses of expected fragments corresponding to a putative peptide sequence assignment is used to determine peptide sequence annotators which are useful in verifying peptide sequence assignments.
Properties of fragments, such as molecular mass, may be characterized by analysis of a fragmentation mass spectrum. "Fragmentation mass spectrum" refers to one or more peaks corresponding to the mass-to-charge ratios of fragments generated upon dissociation of a parent precursor ion in a mass spectrometer. Exemplary fragmentation mass spectra are mass spectra obtained upon the dissociation of a parent precursor ion. Dissociation may occur under metastable conditions or result from the introduction of energy to the precursor by a variety of methods including, but not limited to, collision induced dissociation (CID), surface induced dissociation (SID), laser induced dissociation (LID), electron capture dissociation or any combination of these methods or any equivalents known in the art of tandem mass spectrometry. Exemplary methods of the present invention use peptide sequence annotators derived from observed and predicted collision induced dissociation (CID) fragmentation mass spectra.

"Putative peptide sequence assignment" is a sequence of amino acid residues and/or modified amino acid residues, which is associated with a fragmentation spectrum. In one aspect of the present invention, putative peptide sequence assignments are generated by measuring the molecular mass of a peptide analyte and determining all possible peptide sequences in proteins of a given proteome which have molecular masses with a certain range of the measured molecular mass of a peptide analyte. In another aspect of the present invention, putative peptide sequence assignments may also be determined by analyzing the fragmentation mass spectrum of a peptide analyte. Putative peptide sequence assignments useful in the methods of the present invention may be determined using any peptide sequence assignment method, algorithm or computer software that uses peptide fragmentation mass spectra and/or peptide mass data as input. In one embodiment of the present invention, putative peptide sequence assignments are determined using one or more protein sequence databases comprising amino acid sequences of proteins, such as a protein sequence database derived from gene and genome databases. In another embodiment of the present invention, putative peptide sequence assignments are determined using one or more peptide sequence databases comprising peptide sequences, peptide fragment sequences, or a combination of peptide sequences and peptide fragment sequences. In yet another embodiment of the present invention, putative peptide sequence assignments are determined using one or more de novo peptide-sequencing algorithms. The present invention includes methods of verifying putative peptide sequence assignments.

"De novo peptide-sequencing algorithm" refers to methods, algorithms and computer software that determine the sequence of a peptide without prior knowledge of the peptide sequence. De novo sequencing of peptides in the present invention can be performed by any method known in the art including, but not limited to, Edman degradation and by interpretation of MS/MS spectra corresponding to peptide analytes. Exemplary de novo peptide-sequencing algorithms are described in "De Novo Peptide Sequencing by Nanoelectrospray Tandem Mass Spectrometry Using Triple Quadrupole and Quadrupole/Time-of-Flight Instruments," Shevchenko, A. et al, Mass Spectrometry of Proteins and Peptides, ISBN 1-59259-045-4 (2000); "Rapid 'de Novo' Peptide Sequencing by a Combination of Nanoelectrospray,

Isotopic Labeling and a Quadrupole/Time-of flight Mass Spectrometer," Shevchenko, A. et al., Rapid Communications in Mass Spectrometry, 1997. 11 (9): p. 1015-1024; and "De novo" sequencing of peptides recovered from in-gel digested proteins by nanoelectrospray tandem mass spectrometry," Shevchenko, A. et al., Mol. Biotechnol. 2002. 20(1): p. 107-18, which are hereby incorporated by reference in their entireties to the extent not inconsistent with the present disclosure. In one aspect of the present invention, de novo peptide-sequencing algorithms use peptide mass data and peptide fragmentation spectra as input and generate one or more putative peptide assignments. De novo peptide sequencing may be achieved by measuring distances between peaks in a fragmentation spectrum, and comparing these distances to masses of either amino acid residues, modified amino acid residues, or fragments of amino acid residues or modified amino acid residues. In many cases, operation of a de novo peptide-sequencing algorithm generates a plurality of possible peptide sequence assignments corresponding to a single peptide fragmentation spectrum. The methods of the present invention are ideally suited to verify and/or assess the confidence of putative peptide assignments generated by de novo peptide-sequencing algorithms. The present invention also includes peptide sequence assignment algorithms that employ a combination of one or more de novo peptide-sequencing algorithms and one or more protein and/or peptide data base search algorithms. Alternatively, the present invention includes methods wherein all or some peptide sequences generated by de novo peptide sequencing are assigned to one or more peptide sequence entries in a protein sequence and/or peptide sequence database.
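By way of illustration only, the comparison of distances between peaks to amino acid residue masses may be sketched in Python as follows. The sketch assumes that the supplied peaks already form a single ion series of one charge state and considers only unmodified residues; the function name and the tolerance value are hypothetical. Because leucine and isoleucine have the same mass, the sketch reports both where the spacing matches.

```python
# Standard monoisotopic residue masses in daltons (unmodified amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tolerance=0.02):
    """Translate the spacing between consecutive peaks into residues.

    Returns one entry per gap: the matching residue letters joined by '/',
    or '?' when no single residue mass lies within the tolerance.
    """
    ordered = sorted(peaks)
    residues = []
    for low, high in zip(ordered, ordered[1:]):
        gap = high - low
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(gap - mass) <= tolerance]
        residues.append("/".join(matches) if matches else "?")
    return residues
```

A gap reported as '?' in this sketch may correspond to a modified residue, a missing peak or two residues, which is one reason a plurality of putative assignments can arise from a single spectrum.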

"Spectrum matching" refers to a process in which peaks that correspond to the mass-to-charge ratios of fragments and/or precursor ions are matched to predicted fragments or precursor ions derived from peptide and/or protein databases comprising protein masses, peptide masses, peptide fragment masses or any combinations of these. Alternatively, the peaks that correspond to the mass-to- charge ratios of fragments and/or precursor ions are matched to stored spectra or stored representations of spectra generated from actual peptide or protein samples. The matching process may be performed entirely manually, or more preferably some or all of the steps may be performed automatically using a computer-based spectrum matching algorithm. Spectrum matching in the present invention may be performed by any method, algorithm or computer software known in the art.

In the following description, numerous specific details of the devices, device components and methods of the present invention are set forth in order to provide a thorough explanation of the precise nature of the invention. It will be apparent, however, to those of skill in the art that the invention can be practiced without these specific details.

This invention provides methods of identifying peptides and proteins using MS/MS data. In particular, the present invention provides methods of verifying peptide sequence assignments derived from protein and peptide sequence databases or derived from another peptide sequence assignment algorithm, such as a de novo peptide sequence algorithm. Further, the present invention provides methods for detecting and characterizing modifications of peptides and proteins.

Figure 1 is a schematic illustrating an exemplary method of identifying the amino acid sequences of protein analytes in a protein-containing sample. As illustrated in Figure 1, a protein sample containing one or more protein analytes is subjected to digestion resulting in a mixture of peptides. Preferred digestion methods of the present invention include proteolytic digestion exhibiting highly specific sequence site cleavage. Exemplary methods of digestion usable in the present invention include the use of proteases or combinations of proteases, such as trypsin, thrombin and chymotrypsin. Alternatively, peptides may be generated from proteins by the addition of chemical reagents, such as cyanogen bromide, acids and/or bases.
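By way of illustration only, sequence-specific digestion may be simulated in Python as follows for the case of trypsin. The sketch assumes the commonly used rule that trypsin cleaves on the carboxy-terminal side of lysine (K) or arginine (R) except when the next residue is proline (P), and assumes complete digestion with no missed cleavages; the function name is hypothetical and the rule is not recited in the present disclosure.

```python
def tryptic_digest(protein):
    """Split a protein sequence after each K or R that is not followed by P.

    Assumes complete digestion (no missed cleavages).
    """
    peptides = []
    start = 0
    for index, residue in enumerate(protein):
        is_last = index == len(protein) - 1
        if residue in "KR" and not is_last and protein[index + 1] != "P":
            peptides.append(protein[start:index + 1])
            start = index + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides
```

The peptides returned by such a simulation can serve as the expected cleavage products against which observed peptide masses are compared.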

Referring again to Figure 1, the peptide mixture is fractionated, thereby generating a plurality of discrete peptide fractions. In an exemplary embodiment, discrete peptide fractions correspond to spatially separated aliquots, which comprise substantially purified peptide analytes. In a preferred embodiment, discrete peptide fractions correspond to spatially separated aliquots which each comprise substantially a single peptide analyte. Fractionation may be achieved by any method known in the art of peptide separation including, but not limited to, chromatographic methods and electrophoresis methods. Exemplary chromatographic separation methods useable in the present invention include single and multidimensional chromatography, such as strong cation exchange/reverse phase high performance liquid chromatography or strong cation exchange/avidin/reverse phase high performance liquid chromatography. Exemplary chromatographic methods useful in the present invention also include separation on the basis of hydrophobicity, for example using C18 columns. Preferred methods and instrumentation of fractionating peptides include online or offline methods and instrumentation, which are capable of interfacing with a mass spectrometer.

Referring again to Figure 1, discrete peptide fractions are separately delivered to a tandem mass spectrometer for analysis, wherein molecular masses of peptide analytes in each discrete peptide fraction are experimentally determined and at least one fragmentation mass spectrum is acquired corresponding to each peptide fraction. In an exemplary method, peptide analytes in each discrete fraction are ionized, for example by electrospray ionization or matrix assisted laser desorption/ionization methods, and are subjected to conditions for dissociation, thereby generating one or more charge carrying fragments. Charge carrying fragments are subsequently mass analyzed, for example, by time-of-flight analysis, quadrupole mass filtering or ion trap methods, and detected, thereby generating fragmentation mass spectra. Exemplary fragmentation mass spectra comprise a series of peaks of varying abundance corresponding to the mass-to-charge ratios of fragments generated upon collision induced dissociation of a peptide precursor ion. Any method known in the art of mass spectrometry of determining the mass of peptide analytes and acquiring fragmentation mass spectra is useable in the methods of the present invention. The result of MS/MS analysis is that each discrete fraction is characterized in terms of at least one molecular mass and at least one fragmentation mass spectrum. In an exemplary embodiment, each fragmentation mass spectrum is analyzed to provide lists of the mass-to-charge ratios, molecular masses and relative intensities of each charge carrying fragment.

Figure 2 is a schematic diagram illustrating exemplary methods of peptide sequence verification and data analysis for the analysis of peptide and protein analytes. As shown in Figure 2, the peptide analyte mass and peak lists corresponding to mass-to-charge ratios, molecular masses or both of fragments observed in the fragmentation mass spectrum are input into a peptide sequence assignment algorithm. An exemplary peptide sequence assignment algorithm usable in the methods of the present invention is a peptide database search algorithm which compares the peptide analyte mass and peak lists corresponding to a peptide analyte to entries in a protein sequence database and/or a peptide sequence database. An exemplary protein sequence database comprises a plurality of protein amino acid sequences and an exemplary peptide sequence database comprises a plurality of peptide masses and a plurality of masses of fragments of peptides. Operation of the peptide database search algorithm generates a plurality of putative peptide sequence assignments corresponding to the peptide analyte. Exemplary peptide database search algorithms include conventional peptide database search software tools, such as MASCOT and SEQUEST computer software packages. Alternatively, the present invention may be practiced using a peptide sequence assignment algorithm comprising one or more de novo peptide-sequencing algorithms.
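By way of illustration only, the comparison of an observed peak list to the fragment masses stored for a database entry may be sketched in Python as follows. The function names, the fixed tolerance and the simple counting score are assumptions made for this sketch; they do not represent the scoring used by MASCOT, SEQUEST or any other software package.

```python
def count_matched_peaks(observed_mz, predicted_mz, tolerance=0.5):
    """Count predicted m/z values with an observed peak within the tolerance."""
    matched = 0
    for predicted in predicted_mz:
        if any(abs(predicted - observed) <= tolerance for observed in observed_mz):
            matched += 1
    return matched


def matched_fraction(observed_mz, predicted_mz, tolerance=0.5):
    """Fraction of predicted fragments that are observed (0.0 if none predicted)."""
    if not predicted_mz:
        return 0.0
    return count_matched_peaks(observed_mz, predicted_mz, tolerance) / len(predicted_mz)
```

A quantity such as the matched fraction is one example of a value that could be recorded as a peptide sequence annotator for a putative assignment.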

Referring again to Figure 2, each putative peptide sequence assignment is analyzed to generate a peptide sequence annotator index comprising a plurality of peptide sequence annotators. In an exemplary embodiment, peptide sequence annotator indices for all putative peptide sequence assignments are organized in a relational database. Derivation of peptide sequence annotator indices for putative peptide sequence assignments may be achieved by any means known in the art including the use of one or more peptide sequence annotator algorithms, preferably automated, computer-assisted peptide sequence annotator algorithms. In exemplary embodiments, peptide sequence annotator algorithms compare the masses of fragments observed in fragmentation mass spectra to the entries of protein sequence and peptide sequence databases. Alternatively, the methods of the present invention may further comprise steps of inputting additional data into the peptide sequence annotator algorithm, such as the type of instrumentation used for MS/MS analysis, experimental conditions in the mass spectrometer and observed physical and chemical properties characterizing the discrete peptide fractions, such as retention times, elution times or mobilities for specific chromatographic media (for both liquid and gas phase chromatography) under specific conditions. Preferred peptide sequence annotator algorithms of the present invention are capable of evaluating this additional data and deriving additional peptide sequence annotators. Peptide sequence annotators useful in the present invention may be derived from results generated by conventional peptide database search software tools, such as MASCOT and SEQUEST computer software packages or any other peptide sequence assignment algorithm, such as a de novo sequencing algorithm.

Referring again to Figure 2, at least a portion of the peptide sequence annotators in each peptide sequence annotator index are input into a parallel confidence assessment algorithm, thereby generating a confidence assessment for each putative peptide sequence assignment. Although evaluation of a single annotator is not able to accurately assess the confidence of a putative peptide sequence assignment or series of putative peptide sequence assignments, evaluation of a plurality of selected annotators provides an assessment of the probability that a given putative peptide sequence assignment is correct. Any parallel confidence assessment algorithm capable of accurately characterizing the confidence of a putative peptide sequence assignment is useable in the present invention. An exemplary parallel confidence assessment algorithm comprises a plurality of peptide sequence assignment rules derived from the operation of an artificial neural network algorithm on one or more MS/MS data sets corresponding to known peptide and/or protein analytes. Alternatively, exemplary parallel confidence assessment algorithms may assign peptide sequence annotators different weighting factors, such as weighting factors determined by operation of an artificial neural network algorithm on one or more MS/MS data sets corresponding to known peptide and/or protein analytes. In exemplary embodiments, summation of at least a portion of the peptide sequence annotators multiplied by their respective weighting factors provides a quantitative assessment of the confidence of a given putative peptide sequence assignment. The present invention includes parallel confidence assessment algorithms employing non-linear peptide sequence annotator weighting and fully automated parallel confidence assessment algorithms. Exemplary methods of the invention also use machine learning algorithms, statistical tools or a combination of these.
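By way of illustration only, the summation of selected peptide sequence annotators multiplied by their respective weighting factors may be sketched in Python as follows. The annotator names, the weight values and the logistic function used to map the sum onto the interval from 0 to 1 are assumptions made for this sketch; in practice the weighting factors would be determined by training on MS/MS data sets corresponding to known analytes.

```python
import math


def weighted_sum(annotators, weights, bias=0.0):
    """Sum of annotator values multiplied by their weighting factors.

    Annotators without a weighting factor are left out, reflecting that only
    a selected portion of the annotator index need be used.
    """
    return bias + sum(weights[name] * value
                      for name, value in annotators.items()
                      if name in weights)


def confidence(annotators, weights, bias=0.0):
    """Map the weighted sum to a value between 0 and 1 (assumed logistic form)."""
    return 1.0 / (1.0 + math.exp(-weighted_sum(annotators, weights, bias)))


# Hypothetical annotator index and weighting factors.
index = {"matched_fraction": 0.8, "mass_error": 0.05, "unused": 3.0}
factors = {"matched_fraction": 4.0, "mass_error": -10.0}
score = confidence(index, factors)
```

Replacing the logistic mapping with another function, or with rules extracted from a trained network, would correspond to the non-linear weighting embodiments mentioned above.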

Referring again to Figure 2, at least a portion of the annotated putative peptide sequence assignments corresponding to peptide analytes in each discrete peptide fraction are input into a protein identification algorithm. Exemplary protein identification algorithms compare the putative peptide sequence assignments to entries in one or more protein sequence databases comprising protein amino acid sequences. In an exemplary embodiment, only putative peptide sequence assignments having a confidence assessment greater than a selected threshold value are input into the protein identification algorithm. Alternatively, putative peptide sequence assignments may be input into the protein identification algorithm in an ordered list ranked in order of decreasing or increasing confidence assessment. Alternatively, the present methods include embodiments wherein putative peptide sequence assignments and associated confidence assessments are input into the protein identification algorithm together, preferably in the form of a relational database. Operation of the protein identification algorithm results in a list of protein sequences corresponding to proteins in the protein-containing sample.

Peptide sequence annotators useable in the present invention include annotators derived from the predicted fragmentation patterns of peptides. The types and abundance of charge-carrying fragments observed in the MS/MS analysis of peptides depend on a number of factors including primary amino acid sequence, the presence of modified amino acids in the peptide sequence, the amount of energy imparted to the peptide precursor ion and the charge state of the peptide precursor ion. Accordingly, fragments which are expected to be generated from a selected peptide sequence may be accurately predicted in many instances on the basis of its primary sequence, the type of MS/MS instrumentation employed for analysis and the properties of the precursor ion mass-selected for collision-induced dissociation.
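Returning to the protein identification step described above, the thresholding and ranking of putative assignments can be sketched as below. This is an illustrative sketch only: the protein names, protein sequences, confidence values and threshold are hypothetical, and exact substring matching stands in for whatever comparison a real protein identification algorithm performs.

```python
def select_assignments(scored, threshold):
    """Keep assignments above the threshold, ranked by decreasing confidence."""
    kept = [(seq, conf) for seq, conf in scored if conf > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

def identify_proteins(scored, protein_db, threshold):
    """Map each retained peptide to the database proteins that contain it."""
    hits = {}
    for seq, conf in select_assignments(scored, threshold):
        for name, protein_seq in protein_db.items():
            if seq in protein_seq:
                hits.setdefault(name, []).append((seq, conf))
    return hits

# Hypothetical database entries and confidence assessments.
protein_db = {"protein_A": "MKDSTLIMQLLRGS", "protein_B": "MNLLSVAYKQ"}
scored = [("NLLSVAYK", 0.20), ("DSTLIMQLLR", 0.95)]
print(identify_proteins(scored, protein_db, threshold=0.5))
```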

Peptide ions subjected to a variety of dissociation conditions may fragment at any bond along the peptide backbone, thereby generating a ladder of sequence ions. For example, peptide ions frequently fragment at amide bonds (or peptide bonds), thereby generating a sequence of daughter ions, such as y-type ions and b-type ions. Specifically, if charge is retained on the fragment ion corresponding to the N-terminal portion of the peptide after cleavage of the amide bond, b-type ions are formed. If charge is retained on the fragment ion corresponding to the C-terminal portion of the peptide after cleavage of the amide bond, however, y-type ions are formed. Alternatively, peptide ions subjected to dissociation conditions may fragment at C-C bonds in a peptide, thereby generating a sequence of daughter ions, such as a-type ions and x-type ions. Specifically, if charge is retained on the fragment ion corresponding to the N-terminal portion of the peptide after cleavage of the C-C bond, a-type ions are formed. If charge is retained on the fragment ion corresponding to the C-terminal portion of the peptide after cleavage of the C-C bond, however, x-type ions are formed. Alternatively, peptide ions subjected to dissociation conditions may fragment at C-N bonds adjacent to the amide bond in a peptide, thereby generating a sequence of daughter ions, such as c-type ions and z-type ions. Specifically, if charge is retained on the fragment ion corresponding to the N-terminal portion of the peptide after cleavage of the C-N bonds adjacent to the amide bond, c-type ions are formed. If charge is retained on the fragment ion corresponding to the C-terminal portion of the peptide after cleavage of C-N bonds adjacent to the amide bond, however, z-type ions are formed. Double backbone cleavage of a peptide may also generate charge-carrying fragments, commonly referred to as internal fragments. Usually, these are formed by a combination of b-type and y-type cleavage processes to produce an amino-acylium ion. Alternatively, double cleavage by a combination of a-type and y-type cleavage processes produces an amino-immonium ion. An internal fragment with a single side chain formed by a combination of a-type and y-type cleavage processes is referred to as an immonium ion.
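The b-type and y-type ladders described above can be predicted from a primary sequence with a short calculation. The sketch below assumes standard monoisotopic residue masses, singly protonated fragments and unmodified residues; neutral losses, multiply charged fragments, other ion series and internal fragments are not covered. The function and constant names are illustrative.

```python
# Standard monoisotopic residue masses in Daltons (unmodified residues).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O retained by C-terminal (y-type) fragments
PROTON = 1.007276   # charge carrier for a singly charged ion

def by_ions(sequence):
    """Return predicted m/z values of singly charged b- and y-type ions."""
    residues = [MONO[aa] for aa in sequence]
    b_ions, y_ions = {}, {}
    for n in range(1, len(sequence)):
        # b(n): first n residues, charge on the N-terminal fragment
        b_ions[f"b({n})"] = sum(residues[:n]) + PROTON
        # y(n): last n residues plus water, charge on the C-terminal fragment
        y_ions[f"y({n})"] = sum(residues[-n:]) + WATER + PROTON
    return b_ions, y_ions

b, y = by_ions("NLLSVAYK")
for label in ("b(3)", "y(2)", "y(3)"):
    print(label, round({**b, **y}[label], 2))
```

For NLLSVAYK the predicted b(3), y(2) and y(3) values fall within 0.8 Daltons of the matched peaks listed at m/z 341, 310 and 381 in Table 5.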

In one aspect, exemplary annotators of the present invention are determined by comparing or matching the masses of fragments observed in peptide CID fragmentation mass spectra to the predicted fragments for a given putative peptide sequence assignment. In the context of this disclosure, the term "matched peak" refers to a peak in a peptide fragmentation mass spectrum which corresponds to a molecular mass that is within a selected range of one of the molecular masses of fragments predicted for a putative peptide sequence assignment. An exemplary annotator of the present invention comprises the number of all matched peaks in a peptide fragmentation mass spectrum. Alternatively, annotators of the present invention may comprise the number of matched peaks in a peptide fragmentation mass spectrum which correspond to one or more specific fragment ion types. Exemplary annotators of the present invention include the number of matched peaks corresponding to a-type fragments, b-type fragments, c-type fragments, x-type fragments, y-type fragments, z-type fragments, internal fragments, immonium ions or any combinations of these. In another aspect, exemplary annotators of the present invention are determined by calculating the relative intensities of matched peaks in a peptide fragmentation mass spectrum. In the context of this disclosure, the term "relative intensity" refers to the integrated areas of one or more selected peaks in a peptide fragmentation mass spectrum divided by the sum of integrated areas of all peaks in the fragmentation mass spectrum and may be expressed by the equation:

$$\text{relative intensity} = \frac{I_x}{I_{total}} \qquad \text{(I)}$$

wherein $I_x$ is the integrated intensity of one or more selected peaks and $I_{total}$ is the sum of integrated areas of all peaks in the fragmentation mass spectrum. An exemplary annotator of the present invention comprises the relative intensity of all matched peaks in a peptide fragmentation mass spectrum. Alternatively, annotators may be determined by calculating the relative intensity of matched peaks corresponding to one or more specific fragment ion types. Exemplary annotators include the relative intensities of matched peaks in a peptide fragmentation mass spectrum that correspond to a-type fragments, b-type fragments, c-type fragments, x-type fragments, y-type fragments, z-type fragments, internal fragments, immonium ions or any combination of these, or fragments derived from these.
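Peak matching and equation (I) can be illustrated with the peak list reported for SEQ ID NO:4 (NLLSVAYK) in Table 5 and the 0.8 Dalton fragment tolerance of Example 1. In this sketch the listed intensities are treated as the integrated areas, and the predicted y-ion m/z values are assumptions computed from standard monoisotopic residue masses for singly charged y(1) to y(7); the function names are illustrative.

```python
def match_peaks(peaks, predicted_mz, tolerance):
    """Return peaks whose m/z lies within `tolerance` of any predicted fragment."""
    return [(mz, inten) for mz, inten in peaks
            if any(abs(mz - p) <= tolerance for p in predicted_mz)]

def relative_intensity(selected, peaks):
    """Equation (I): summed intensity of selected peaks over that of all peaks."""
    total = sum(inten for _, inten in peaks)
    return sum(inten for _, inten in selected) / total if total else 0.0

# Peak list (m/z, intensity) for SEQ ID NO:4 as given in Table 5.
PEAKS = [
    (341, 6199000), (381, 16650000), (467.1, 29890000), (630.2, 57070000),
    (680.3, 41130000), (793.2, 9942000), (890.4, 39740000), (310, 5875000),
    (393, 15830000), (535.2, 27540000), (567.2, 47240000), (698.2, 16410000),
    (761.1, 5837000), (889.6, 35260000), (367.1, 14090000), (726.2, 10840000),
    (743.3, 16320000), (744.2, 15430000), (872.3, 17850000),
]
# Assumed singly charged y(1)..y(7) m/z for NLLSVAYK (monoisotopic, rounded).
PREDICTED_Y = [147.11, 310.18, 381.21, 480.28, 567.31, 680.40, 793.48]

matched_y = match_peaks(PEAKS, PREDICTED_Y, tolerance=0.8)
print(len(matched_y), round(100 * relative_intensity(matched_y, PEAKS), 2))
```

Under these assumptions the sketch finds 5 matched y-type peaks with a relative intensity of 28.16 percent, in agreement with the entries for SEQ ID NO:4 in Tables 7 and 8.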

Annotators may also be derived from the identities of one or more fragments, which are matched to one or more peaks in a peptide fragmentation mass spectrum. For example, the presence of one or more peaks in a peptide fragmentation mass spectrum that are positively matched to an internal fragment having a known identity, such as an immonium ion having a known side chain, may serve as the basis of an exemplary annotator. Alternatively, exemplary annotators may be derived from correlations between observed peaks in a fragmentation mass spectrum and one or more de novo peptide sequence tags. In this embodiment, correlations between the relative positions of peaks in a fragmentation mass spectrum and fragment masses predicted on the basis of de novo sequence tags are an indicator of a specific peptide sequence identity or specific protein sequence identity.

Peptide sequence annotators useable in the present invention include annotators derived from predicted retention times, elution times and mobilities of peptides or peptide ions through a gas or fluid under influence of an electric field, or on specific chromatographic media, such as a high performance liquid chromatography column. Peptide retention times, for example, in many cases can be accurately predicted on the basis of primary amino acid sequence, affinity, size, molecular mass, structure or any combination of these properties. In one embodiment, an annotator of the present invention is calculated by subtracting the retention time predicted for a putative peptide sequence assignment on specific chromatographic media from the retention time measured for the peptide analyte on the same chromatographic media. An exemplary annotator is calculated from predicted and measured retention times for chromatographic separation on the basis of hydrophobicity, for example using C18 columns.

Peptide retention times are known to vary with certain experimental conditions, which are often difficult to accurately characterize, such as the age of a chromatography column, temperature, and variations in buffers used during fractionation. Therefore, the predicted peptide retention times used to derive peptide sequence annotators should reflect, as accurately as possible, the experimental conditions employed during peptide analyte analysis. In an exemplary embodiment, the putative peptide sequence assignments themselves are used to generate a regression line used for predicting peptide retention times under relevant experimental conditions. Preferably, the putative peptide sequence assignments used to derive the regression line have confidence scores, such as MASCOT peptide assignment scores, larger than a selected threshold value to ensure that the regression line accurately reflects actual peptide retention times. Another exemplary annotator is based on the determination of whether or not a particular putative peptide sequence assignment was used to determine the regression line used for predicting peptide retention times. If the retention time of the putative peptide sequence assignment was used to determine the regression line, then the putative peptide sequence assignment may be allowed a much lower tolerance than if it was not used.
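The regression-based retention time annotator described above can be sketched as follows. The sketch assumes that each putative assignment carries a single sequence-derived predictor (here called `hydrophobicity`, a hypothetical index) and that retention time is approximately linear in it; the field names, score threshold and data are illustrative and are not taken from the disclosure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept (needs two or more distinct xs)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rt_annotators(assignments, score_threshold):
    """Fit the line on high-scoring assignments, then annotate all of them."""
    trusted = [a for a in assignments if a["score"] > score_threshold]
    slope, intercept = fit_line([a["hydrophobicity"] for a in trusted],
                                [a["observed_rt"] for a in trusted])
    results = []
    for a in assignments:
        predicted = slope * a["hydrophobicity"] + intercept
        results.append({
            "sequence": a["sequence"],
            # measured minus predicted retention time
            "rt_difference": a["observed_rt"] - predicted,
            # second annotator: was this assignment part of the fit?
            "used_in_fit": a["score"] > score_threshold,
        })
    return results

# Hypothetical assignments; sequences are placeholders.
assignments = [
    {"sequence": "PEPTIDEA", "score": 60, "hydrophobicity": 1.0, "observed_rt": 3.0},
    {"sequence": "PEPTIDEB", "score": 55, "hydrophobicity": 2.0, "observed_rt": 5.0},
    {"sequence": "PEPTIDEC", "score": 70, "hydrophobicity": 3.0, "observed_rt": 7.0},
    {"sequence": "PEPTIDED", "score": 12, "hydrophobicity": 4.0, "observed_rt": 12.0},
]
for row in rt_annotators(assignments, score_threshold=40):
    print(row)
```

Because assignments used in the fit pull the line toward themselves, their residuals are expected to be smaller, which is consistent with allowing them a tighter tolerance.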

In another embodiment, peptide sequence annotators are determined on the basis of the gas phase ion mobility, such as electrophoretic mobility. In an exemplary embodiment, precursor ions generated from peptide analytes are analyzed by a differential mobility analyzer. Flight times through a given mobility media, such as selected pressure of a gas or mixture of gases, are determined for each peptide analyte and analyzed to determine gas phase ion mobilities. An exemplary peptide sequence annotator is determined by subtracting the measured gas ion mobility from gas ion mobilities predicted for each putative peptide sequence assignment. An advantage of using peptide sequence annotators based on ion mobility is that peptides, in some cases, may be differentiated on the basis of structure, size, shape, charge state or any combination of these properties using ion mobility measurements.

A number of useful peptide sequence annotators may be derived from other empirical observations made during protein and peptide analysis. First, exemplary annotators may be derived on the basis of whether or not the same peptide sequence was detected from a precursor ion having a different charge state. Putative peptide sequence assignments are more likely to be accurate if the same or similar fragments are generated from the same precursor ion having two charge states. Second, exemplary annotators may be derived on the basis of whether or not the same peptide sequence was determined by analysis using more than one type of MS/MS instrumentation. Since different instruments employ different ionization, mass analysis and dissociation conditions, identification of the same peptide by more than one MS/MS instrument will reduce the probability that the assignment is a random occurrence. Third, exemplary annotators may be derived on the basis of the total number of times a peptide is identified. Fourth, exemplary annotators may be derived on the basis of whether or not proline is present in the sequence. If proline is present, an annotator may be determined from the relative intensity of the most dominant peak. This exemplary annotator is based on the fact that a peptide containing proline is likely to cleave at the bond on the N-terminal side of proline, and that the peak resulting from that one bond break is often the most prominent peak in the observed fragmentation mass spectrum. If the proline-related peak has a large relative intensity, therefore, the confidence assessment of the putative peptide sequence assignment may be larger than if the proline-related peak has a small intensity or is not detected at all. Fifth, exemplary annotators may be derived on the basis of whether or not the putative peptide is a part of a known background protein. If a peptide is known to be derived from a background protein, it is less likely to be a component of a protein analyte.
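Several of the empirical annotators listed above reduce to simple counts and flags. The following sketch shows one way they might be computed; the tuple layout, annotator names, instrument labels and background list are hypothetical, and the proline peak intensity annotator is not covered.

```python
def empirical_annotators(sequence, identifications, background_peptides):
    """Derive count- and flag-type annotators for one putative sequence.

    identifications: iterable of (sequence, charge, instrument) tuples, one per
    spectrum that was assigned a sequence (hypothetical layout).
    """
    own = [(z, inst) for seq, z, inst in identifications if seq == sequence]
    return {
        "times_identified": len(own),
        "charge_states": len({z for z, _ in own}),
        "instruments": len({inst for _, inst in own}),
        "contains_proline": "P" in sequence,
        "background": sequence in background_peptides,
    }

# Hypothetical identification log and background peptide list.
log = [
    ("DSTLIMQLLR", 2, "ion_trap"),
    ("DSTLIMQLLR", 3, "ion_trap"),
    ("DSTLIMQLLR", 2, "tof_tof"),
    ("NLLSVAYK", 2, "ion_trap"),
]
print(empirical_annotators("DSTLIMQLLR", log, background_peptides={"NLLSVAYK"}))
```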

Annotators may also be derived from statistical analysis of correlations between masses and relative intensities in a fragmentation mass spectrum and entries in one or more protein sequence or peptide sequence databases. An exemplary annotator comprises an error distribution annotator, which determines if the distribution of disparities between matched fragments and predicted fragments is random or if it follows a pattern. The hypothesis testing is performed by using the likelihood ratio method. The null hypothesis in this case is that the error distribution of the peptide under consideration is random and the exemplary annotator calculates the probability that this is the case.
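The disclosure does not specify the statistical models behind the error distribution annotator, so the following is only one possible reading. It assumes that "random" errors are uniformly distributed across the matching window of plus or minus the tolerance, and that "patterned" errors cluster around a systematic offset and can be described by a Gaussian fitted by maximum likelihood. The sketch returns a log-likelihood ratio, not the probability reported in Table 9; converting the ratio to a probability would require further assumptions.

```python
import math

def error_log_likelihood_ratio(errors, tolerance):
    """Log-likelihood of a fitted Gaussian minus that of a uniform window.

    errors: observed minus predicted fragment masses for the matched peaks.
    Positive values favour a systematic pattern; negative values favour the
    null hypothesis of errors scattered at random across the window.
    """
    n = len(errors)
    mean = sum(errors) / n
    var = sum((e - mean) ** 2 for e in errors) / n
    var = max(var, 1e-12)  # guard against zero variance
    # Gaussian log-likelihood evaluated at its maximum-likelihood parameters
    ll_gauss = -0.5 * n * (math.log(2 * math.pi * var) + 1.0)
    # Uniform density 1 / (2 * tolerance) for every error in the window
    ll_uniform = -n * math.log(2 * tolerance)
    return ll_gauss - ll_uniform

# Hypothetical mass errors in Daltons for a 0.8 Dalton window.
clustered = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13]
scattered = [-0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7]
print(error_log_likelihood_ratio(clustered, 0.8))
print(error_log_likelihood_ratio(scattered, 0.8))
```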

Exemplary parallel confidence assessment algorithms of the present invention are derived from the operation of an artificial neural network algorithm on one or more MS/MS data sets resulting from the analysis of known peptides and/or proteins or one or more MS/MS data sets wherein putative protein sequences are manually verified. There are numerous commercially available artificial neural network packages that may be used in the methods of the present invention including, but not limited to, STATISTICA by StatSoft (http://www.statsoftinc.com/), Neural Network Toolbox for MATLAB by The MathWorks (http://www.mathworks.com/products/neuralnet/), and NeuroSolutions by NeuroDimension (http://www.neurosolutions.com/). The design, programming and operation of conventional neural network algorithms are described in several references including "Introduction To Artificial Neural Systems," Zurada, J.M. (1992), Boston: PWS Publishing Company; "Neural Networks: A Comprehensive Foundation," Haykin, S. (1994), NY: Macmillan; and "Neural Network Design and the Complexity of Learning," Judd, J.S. (1990), Cambridge, MA: The MIT Press, which are hereby incorporated by reference in their entireties to the extent that they are not inconsistent with the disclosure in the present application.

In an exemplary embodiment, a training set comprising thousands of manually verified peptide sequence assignments is used in conjunction with an artificial neural network algorithm to determine a set of intricate relationships between selected peptide sequence annotators in a peptide sequence annotator index. An exemplary artificial neural network (ANN) is a system composed of a large number of simple processing elements linked together by connections, which is modeled after the neuronal structure of a brain. Exemplary ANNs typically consist of one input layer, one or more processing layers and one output layer. Artificial neural networks are used in situations where there is a relationship between the proposed (known) input and desired (unknown) output but the nature of the relationship is not precisely known. ANNs are particularly useful in situations in which the relationship between the input and output is not linear. An artificial neural network is first designed to best fit the problem it will be used for. Then, before it can be used, it needs to be trained, and that training can be supervised or non-supervised. In one embodiment of the methods of the present invention, supervised training is used, which consists of feeding an artificial neural network with large amounts of input data, together with the desired output. For example, a large number of putative peptide assignments with full sets of corresponding peptide sequence annotators are provided to the network as input. In this embodiment, manual determinations of whether or not a given peptide sequence assignment is correct are provided as the desired output. In an exemplary method, the ANN determines the best possible relationship between the input and output data by a trial-and-error method. After training is complete, such an artificial neural network can be provided with the input of new data sets, and it can then calculate the (previously unknown) output. Neural networks used in the present invention may be trained by any method known in the art including use of back-propagation algorithms. The present invention includes use of artificial neural networks that employ non-supervised training.
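Supervised training by back-propagation, as described above, can be sketched with a very small network written in plain Python. The architecture (one hidden layer of three sigmoid units), the learning rate, the epoch count and the feature scaling are all assumptions made for illustration. The four training examples reuse the "rel All" and retention time difference values of Example 1 together with the manual verdicts reported there; a real training set would comprise thousands of verified assignments.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, params):
    """Return hidden activations and the output probability for one input."""
    w1, b1, w2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return hidden, sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)

def mean_loss(data, params):
    """Mean cross-entropy between predictions and manual verdicts."""
    total = 0.0
    for x, y in data:
        p = min(max(forward(x, params)[1], 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train(data, n_hidden=3, epochs=2000, lr=0.5, seed=0):
    """Full-batch gradient descent with back-propagated gradients."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    n = len(data)
    for _ in range(epochs):
        gw1 = [[0.0] * n_in for _ in range(n_hidden)]
        gb1, gw2, gb2 = [0.0] * n_hidden, [0.0] * n_hidden, 0.0
        for x, y in data:
            hidden, p = forward(x, (w1, b1, w2, b2))
            d_out = p - y  # cross-entropy gradient at the output pre-activation
            gb2 += d_out
            for j in range(n_hidden):
                gw2[j] += d_out * hidden[j]
                d_hid = d_out * w2[j] * hidden[j] * (1 - hidden[j])
                gb1[j] += d_hid
                for i in range(n_in):
                    gw1[j][i] += d_hid * x[i]
        for j in range(n_hidden):
            w2[j] -= lr * gw2[j] / n
            b1[j] -= lr * gb1[j] / n
            for i in range(n_in):
                w1[j][i] -= lr * gw1[j][i] / n
        b2 -= lr * gb2 / n
    return w1, b1, w2, b2

# (rel All in percent, retention time difference, manual verdict) from Example 1.
RAW = [(98.04, 2.191674, 1), (93.35, 0.096942, 1),
       (76.62, 4.925377, 0), (38.36, 0.6132463, 0)]
# Hypothetical scaling to bring both features into a similar range.
TRAINING_SET = [([rel / 100.0, diff / 5.0], label) for rel, diff, label in RAW]

params = train(TRAINING_SET)
for x, label in TRAINING_SET:
    print(label, round(forward(x, params)[1], 3))
```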

In one embodiment, the parallel confidence assessment algorithm is a series of rules that applies weights to different annotators. In an exemplary embodiment of this aspect of the invention, the parallel confidence assessment algorithm is a parallel, multivariate statistical algorithm. For example, selected peptide sequence annotators and corresponding weighting factors may be combined in a parallel, multivariate statistical algorithm with linear weighting to provide an assessment of the confidence of a given putative peptide sequence assignment provided by the equation:

$$C(A_1, A_2, \ldots) = \omega_0 + \sum_i \omega_i A_i \qquad \text{(II)}$$

wherein $C$ is the confidence assessment of a given putative peptide sequence assignment, $A_i$ are peptide sequence annotators and $\omega_i$ and $\omega_0$ are weighting factors. The present invention also includes non-linear parallel, multivariate statistical algorithms employing a wide range of nonlinear weighting schemes including exponential factor weighting, logarithmic factor weighting, polynomial factor weighting or any combinations of these. In addition, the present invention includes methods using parallel, multivariate statistical algorithms employing a combination of linear and non-linear weighting. In the present invention, the process of assigning weights to one or more peptide sequence annotators may be performed using artificial neural networks or other decision-making algorithms. Correlations between annotators are also especially important in deriving a confidence assessment. Accordingly, exemplary parallel confidence assessment algorithms of the present invention are capable of evaluating interdependencies and correlations between different annotators. In many cases, however, interdependencies of annotators, such as the numbers or relative intensities of matched peaks correlating to specific fragment ion types, depend strongly on the instrumentation used for MS/MS analysis. Therefore, the MS/MS instrumentation used will often factor into an analysis of peptide sequence annotator interdependencies. However, correlations between annotators based on the number of times a peptide has been identified and annotators based on peptide retention time typically indicate a higher confidence assessment. Also, if proline peaks are identified, and if the immonium ions of some residues are identified, then one would expect to see the same amino acids in a de novo sequence tag.

A variety of mass spectrometry systems can be used in the methods of the present invention. Mass analyzers providing high mass accuracy, high sensitivity and high resolution are preferred for some applications. Exemplary MS/MS systems usable in the present invention include TOF-TOF mass spectrometers, triple quadrupole mass spectrometers, linear ion traps, 3D ion traps, quadrupole-time-of-flight mass spectrometers and Fourier transform ion cyclotron resonance mass spectrometers. Ion formation via electrospray ionization or MALDI methods is useable in the methods of the present invention. The present methods are applicable to low energy CID conditions, high energy CID conditions, electron capture dissociation, laser induced dissociation, or any combination of these methods or any other equivalent methods known in the art of mass spectrometry.

It is to be appreciated that the methods or algorithms of the present invention may be performed using general-purpose computers or processing systems capable of running application software. Exemplary computers useable in the present methods include microcomputers, such as an IBM personal computer or suitable equivalent thereof, and work station computers. Preferably, algorithms of the present invention are embedded in a computer readable medium, such as a computer compact disc or floppy disc. Further, the computer readable medium may be in the form of a hard disk or memory chip, such as random access memory or read only memory.

As appreciated by one skilled in the art, computer software code embodying the methods and algorithms of the present invention may be written using any suitable programming language. Exemplary languages include, but are not limited to, C or any versions of C, Perl, Java, Pascal, or any equivalents of these. While it is preferred for some applications of the present invention that a computer be used to accomplish all the steps of the present methods, it is contemplated that a computer may be used to perform only a certain step or selected series of steps in the present methods.

All references cited in this application are hereby incorporated in their entireties by reference herein to the extent that they are not inconsistent with the disclosure in this application. It will be apparent to one of ordinary skill in the art that methods, devices, device elements, materials, procedures and techniques other than those specifically described herein can be applied to the practice of the invention as broadly disclosed herein without resort to undue experimentation. All art-known functional equivalents of methods, devices, device elements, materials, procedures and techniques specifically described herein are intended to be encompassed by this invention.

Example 1: Exemplary methods of verifying putative peptide assignments

The methods of the present invention were used to determine the amino acid sequences of several peptides by analyzing peptide fragmentation mass spectra. Specifically, the present methods were used to verify peptide sequence assignments generated by conventional protein and peptide sequence database search tools. The results of these studies indicate that the peptide identification methods of the present invention are useful for confirming or rejecting sequence identities generated by these search tools. Peptide samples in this study were generated by proteolytic digestion of a sample containing a plurality of parent proteins. The peptide-containing sample resulting from digestion was fractionated prior to MS/MS analysis using multidimensional chromatography employing strong cation exchange HPLC and separation on the basis of hydrophobicity using C18 columns. Peptide retention times for fractionated peptide-containing aliquots were measured. Peptide fragmentation mass spectra were acquired using a three-dimensional quadrupole ion trap-based instrument employing an electrospray ionization source.

Peak lists extracted from MS/MS spectra were submitted to the MASCOT protein/peptide sequence database search tool. Operation of MASCOT generated the list of scored putative peptide assignments shown in Table 1. In addition, the MASCOT search tool generated peak lists and lists of matched ions corresponding to each sequence identity, which are summarized in Tables 2-5. The MS/MS spectra acquired for the four different peptides are shown in Figures 3A-D. Figure 3A corresponds to Sequence Identity No. 1, Figure 3B corresponds to Sequence Identity No. 2, Figure 3C corresponds to Sequence Identity No. 3 and Figure 3D corresponds to Sequence Identity No. 4.

The mass agreement criterion employed for matching peaks in the fragmentation mass spectra to peaks in the database was 0.8 Daltons. The mass agreement criterion employed for matching the observed mass of the precursor ion to the putative peptide sequence assignment was 1.5 Daltons.

Peptide sequence annotator indices were compiled for each putative peptide sequence assignment. Individual peptide sequence annotators used included: (1) the difference between observed peptide retention times on the C18 column and predicted retention times for each putative peptide sequence assignment; (2) the number of matched y-type fragments; (3) the number of matched b-type fragments; (4) the number of matched y neutral loss fragments; (5) the number of matched b neutral loss fragments; (6) the relative intensity of all matched fragments; (7) the relative intensity of matched y-type fragments; (8) the relative intensity of matched b-type fragments; (9) the relative intensity of matched y neutral loss fragments; (10) the relative intensity of matched b neutral loss fragments; and (11) the error distribution. Table 6 summarizes peptide sequence annotators based on measured and predicted peptide retention times. Table 7 summarizes peptide sequence annotators based on the number of matched fragments. The two numbers provided in each entry in Table 7 correspond to the number of matched fragments and the number of total fragments, respectively. Table 8 summarizes peptide sequence annotators based on the relative intensity of matched fragments. Table 9 summarizes peptide sequence annotators based on calculated error distributions. Manual evaluation of putative peptide sequence assignment Identity Nos. 1 (DSTLIMQLLR) and 2 (LAEQAERYDDMAACMK) confirmed that these assignments are correct. In contrast, manual evaluation of putative peptide sequence assignment Identity Nos. 3 (LTQSMAIIR) and 4 (NLLSVAYK) resulted in rejection of these assignments as erroneous.

Table 1 : MASCOT Putative Peptide Assignments:

Table 2: Peak lists and matched peak lists for SEQ ID NO:1

Peak list                 Matched ions
m/z      Intensity        Ion          m/z      Intensity
267.7    239,000.00       b(4)         416.9    1,229,000.00
288      551,800.00       b(5)         530.1    554,500.00
416.9    1,229,000.00     b(6)         677.2    382,300.00
493.2    1,405,000.00     b(7)         805      1.70E+05
576.5    365,300.00       b(8)         918.2    768,000.00
676.3    3,560,000.00     b(9)         1031.3   1,260,000.00
789.3    4,770,000.00     Bnl
902.4    1,058,000.00     b0(3)        285.9    545,300.00
1031.3   1,260,000.00     b0(4)        399      620,700.00
1099.8   99,470.00        b0(8)        900.2    581,400.00
241.9    191,100.00       b0(9)        1013.5   716,200.00
285.9    545,300.00       Y
399      620,700.00       y(2)         288      551,800.00
529.3    662,300.00       y(4)         529.3    662,300.00
585.3    340,800.00       y(5)         676.3    3,560,000.00
677.2    382,300.00       y(6)         789.3    4,770,000.00
805      170,400.00       y(7)         902.4    1,058,000.00
918.2    768,000.00       Ynl
1013.5   716,200.00       y*(8)++      493.2    1,405,000.00
530.1    554,500.00       y0(8)++      493.2    1.41E+06
882.5    235,400.00
883.3    237,400.00
900.2    581,400.00
1059.5   101,800.00

Table 3: Peak lists and matched peak lists for SEQ ID NO:2

Peak list                 Matched ions
m/z      Intensity        Ion          m/z      Intensity
184.9    1,254,000.00
372.3    1,290,000.00
454.1    3,201,000.00     B
525.2    7,164,000.00     b(10)        1191.8   698,700.00
596.2    9,290,000.00     b(10)++      596.2    9,290,000.00
747      16,810,000.00    b(11)++      669.8    2,725,000.00
875.6    22,060,000.00    b(13)++      741.2    2,636,000.00
911.1    9,194,000.00     b(15)++      894.3    1,624,000.00
1077.4   2,435,000.00     b(2)         184.9    1,254,000.00
1103.5   956,600.00       b(4)         442.3    703,700.00
1185.8   733,900.00       b(9)         1076.5   916,100.00
1320.3   816,000.00       b(9)++       539      1,490,000.00
1393.7   1,023,000.00     Bnl
236.1    640,000.00       b*(7)++      390.8    812,200.00
329      950,600.00       b0(10)       1173.7   508,700.00
437.3    1,820,000.00     b0(11)       1320.3   816,000.00
578.2    4,049,000.00     b0(14)++     811.1    12,020,000.00
597.2    5,570,000.00     b0(6)        624      2,939,000.00
711.5    15,860,000.00    b0(7)++      390.8    812,200.00
811.1    12,020,000.00    Y
911.7    1,819,000.00     y(11)++      711.5    15,860,000.00
1079.4   934,500.00       y(12)++      747      16,810,000.00
1160.9   806,800.00       y(13)++      811.1    12,020,000.00
1191.8   698,700.00       y(14)++      875.6    22,060,000.00
1345.2   695,900.00       y(15)++      911.1    9,194,000.00
1481.6   551,500.00       y(3)         454.1    3,201,000.00
268.7    562,700.00       y(4)         525.2    7,164,000.00
315.1    940,900.00       y(5)         596.2    9,290,000.00
397.9    1,025,000.00     y(6)         743.3    3,904,000.00
576.7    2,239,000.00     y(6)++       372.3    1,290,000.00
624      2,939,000.00     y(7)         858      2,572,000.00
779.1    4,727,000.00     y(8)         973      1,213,000.00
802.2    4,960,000.00     Ynl
894.3    1,624,000.00     y*(13)++     802.2    4,960,000.00
1076.5   916,100.00       y*(14)++     866.6    3,455,000.00
1096.2   685,900.00       y*(3)        437.3    1,820,000.00
1189.4   632,400.00       y*(6)++      363.3    743,600.00
1306.2   189,900.00       y0(13)++     802.2    4,960,000.00
246.6    539,600.00       y0(14)++     866.6    3,455,000.00
292.2    778,700.00       y0(9)        1118.3   490,200.00
390.8    812,200.00
539      1,490,000.00
669.8    2,725,000.00
743.3    3,904,000.00
866.6    3,455,000.00
973      1,213,000.00
989.5    706,800.00
1173.7   508,700.00
1234.5   619,600.00
182.9    504,500.00
363.3    743,600.00
442.3    703,700.00
562.9    1,434,000.00
639.6    2,688,000.00
741.2    2,636,000.00
858      2,572,000.00
974.4    1,077,000.00
1008.2   499,700.00
1118.3   490,200.00
1218.5   604,000.00
1246.3   571,500.00

Table 4: Peak lists and matched peak lists for SEQ ID NO:3

Peak list                 Matched ions
m/z      Intensity        Ion          m/z      Intensity
175.1    350,200.00
288.2    645,500.00       B
408.8    1,672,000.00     b(2)         215      293,000.00
516      2,269,000.00     b(4)++       215      293,000.00
619.5    249,500.00       b(5)         577.1    112,700.00
706.3    1,421,000.00     b(6)++       324.9    124,800.00
834.3    898,100.00       b(7)         761.2    117,500.00
945.5    129,400.00       b(8)         874.3    1.55E+05
1023.7   47,180.00        Bnl
215      293,000.00       b*(8)        857.3    2.29E+05
324.9    124,800.00       b0(3)        324.9    124,800.00
417.4    748,700.00       b0(5)        559.2    222,000.00
507.1    415,400.00       b0(6)++      315      79,740.00
623.2    138,600.00       b0(8)        856.3    2.75E+05
707.2    564,600.00       Y
835.4    283,500.00       y(1)         175.1    350,200.00
882.5    75,600.00        y(2)         288.2    645,500.00
186.8    113,300.00       y(3)         401.1    397,300.00
315      79,740.00        y(4)         472.3    263,500.00
468.5    531,100.00       y(4)++       237.1    72,310.00
493      257,500.00       y(5)         619.5    249,500.00
647.2    113,000.00       y(6)         706.3    1,421,000.00
684.3    163,500.00       y(7)         834.3    8.98E+05
856.3    275,000.00       y(7)++       417.4    748,700.00
946.5    56,340.00        y(8)++       468.5    531,100.00
230.8    73,000.00        Ynl
293.9    74,760.00        y*(5)++      301      69,580.00
401.1    397,300.00       y*(7)++      408.8    1,672,000.00
530      243,000.00       y*(8)++      459.6    394,200.00
577.1    112,700.00       y0(7)++      408.8    1,672,000.00
726.2    136,800.00       y0(8)        917.3    3.85E+04
857.3    228,800.00       y0(8)++      459.6    394,200.00
917.3    38,530.00
237.1    72,310.00
301      69,580.00
459.6    394,200.00
559.2    222,000.00
649.5    89,640.00
756.1    121,600.00
818.4    158,200.00
936.5    18,700.00
176.2    41,330.00
374.4    60,820.00
472.3    263,500.00
517      200,000.00
664.9    77,760.00
761.2    117,500.00
874.3    154,600.00
377.1    174,600.00

Table 5: Peak lists and matched peak lists for SEQ ID NO:4

Peak list                 Matched ions
m/z      Intensity        Ion          m/z      Intensity
341      6,199,000.00     b(3)         341      6,199,000.00
381      16,650,000.00    b(7)         761.1    5,837,000.00
467.1    29,890,000.00    Bnl
630.2    57,070,000.00    b*(7)        744.2    15,430,000.00
680.3    41,130,000.00    b0(7)        743.3    16,320,000.00
793.2    9,942,000.00     Y
890.4    39,740,000.00    y(2)         310      5,875,000.00
310      5,875,000.00     y(3)         381      16,650,000.00
393      15,830,000.00    y(5)         567.2    47,240,000.00
535.2    27,540,000.00    y(6)         680.3    41,130,000.00
567.2    47,240,000.00    y(7)         793.2    9,942,000.00
698.2    16,410,000.00    Ynl
761.1    5,837,000.00
889.6    35,260,000.00
367.1    14,090,000.00
726.2    10,840,000.00
743.3    16,320,000.00
744.2    15,430,000.00
872.3    17,850,000.00

Table 6: Summary of Peptide Sequence Annotators Based on Measured and Predicted Retention Times.

SEQ ID NO  RT^a     Predicted RT^a  Difference
1          35.65    33.45833        2.191674
2          18.59    18.68694        0.096942
3          19.25    24.17538        4.925377
4          25.1     25.71325        0.6132463

^a RT is an abbreviation for retention time.

Table 7: Summary of Peptide Sequence Annotators Based on Number of Matching Fragments.

SEQ ID NO  No. of y-type^b  No. of b-type^c  No. of y nls^d  No. of b nls^e
1          5/18             6/18             2/22            4/24
2          12/30            9/30             7/48            6/50
3          10/16            6/16             6/22            5/26
4          5/7              2/7              0/10            2/11

^b "No. of y-type" is an abbreviation for the number of matched y-type fragments.
^c "No. of b-type" is an abbreviation for the number of matched b-type fragments.
^d "No. of y nls" is an abbreviation for the number of matched y neutral loss fragments.
^e "No. of b nls" is an abbreviation for the number of matched b neutral loss fragments.

Table 8: Summary of Peptide Sequence Annotators Based on the Relative Intensities of Matching Fragments.

SEQ ID NO    rel All^f (%)    rel Y^g (%)    rel Ynl^h (%)    rel B^i (%)    rel Bnl^j (%)
1            98.04            51.35          13.61            21.14          11.93
2            93.35            59.64          11.34            12.17          10.21
3            76.62            36.08          27.43            7.09           6.02
4            38.36            28.16          0                2.8            7.4

^f "rel All" is an abbreviation for the relative intensity of all matched fragments.
^g "rel Y" is an abbreviation for the relative intensity of matched y-type fragments.
^h "rel Ynl" is an abbreviation for the relative intensity of matched y neutral loss fragments.
^i "rel B" is an abbreviation for the relative intensity of matched b-type fragments.
^j "rel Bnl" is an abbreviation for the relative intensity of matched b neutral loss fragments.

Table 9: Summary of Peptide Sequence Annotators Based on Error Distributions.

SEQ ID NO    ED(p)^k
1            0.21
2            0.03
3            0.63
4            0.72

^k "ED(p)" is an abbreviation for error distribution.
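The relative intensity annotators of Table 8 can be checked against the peak list of Table 5. The following Python sketch sums the intensities of the matched peaks in each ion category and divides by the summed intensity of the full peak list, which reproduces the SEQ ID NO:4 row of Table 8 to the precision shown. Normalizing by the total peak-list intensity is inferred from the tabulated numbers and is an assumption, as are all names in the code.

```python
# Peak list for SEQ ID NO:4 copied from Table 5: m/z -> intensity.
peak_list = {
    341: 6_199_000, 381: 16_650_000, 467.1: 29_890_000,
    630.2: 57_070_000, 680.3: 41_130_000, 793.2: 9_942_000,
    890.4: 39_740_000, 310: 5_875_000, 393: 15_830_000,
    535.2: 27_540_000, 567.2: 47_240_000, 698.2: 16_410_000,
    761.1: 5_837_000, 889.6: 35_260_000, 367.1: 14_090_000,
    726.2: 10_840_000, 743.3: 16_320_000, 744.2: 15_430_000,
    872.3: 17_850_000,
}

# m/z values of the matched peaks in each ion category (Table 5).
matched = {
    "Y": [310, 381, 567.2, 680.3, 793.2],
    "Ynl": [],
    "B": [341, 761.1],
    "Bnl": [744.2, 743.3],
}

total_intensity = sum(peak_list.values())


def relative_intensity(mz_values):
    """Percent of total peak-list intensity carried by the given peaks."""
    return 100.0 * sum(peak_list[mz] for mz in mz_values) / total_intensity


rel = {name: relative_intensity(mzs) for name, mzs in matched.items()}
# "rel All" is taken here as the sum over the four categories.
rel["All"] = sum(rel.values())

for name in ("All", "Y", "Ynl", "B", "Bnl"):
    print(f"rel {name}: {rel[name]:.2f} %")
```

The same check holds for the other rows of Table 8, where the four category values sum to "rel All" within rounding.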

Claims

We claim:

1. A method for identifying a peptide analyte, said method comprising the steps of: measuring the molecular mass of said peptide analyte; generating a fragmentation mass spectrum of said peptide analyte comprising a series of peaks corresponding to fragments of said peptide analyte; determining a plurality of putative peptide sequence assignments for said peptide analyte; compiling a peptide sequence annotator index for each of said putative peptide sequence assignments comprising a plurality of peptide sequence annotators; combining at least a portion of said peptide sequence annotators in a parallel confidence assessment algorithm, thereby generating a confidence assessment for each putative peptide sequence assignment, wherein said parallel confidence assessment algorithm comprises peptide sequence assessment rules derived from an artificial neural network algorithm; and identifying said peptide analyte by determining the putative peptide sequence assignment having the highest confidence assessment.

2. The method of claim 1 wherein said putative peptide sequence assignments are determined using a peptide sequence database comprising peptide amino acid sequences and peptide fragment amino acid sequences.

3. The method of claim 1 wherein said putative peptide sequence assignments are determined using a de novo peptide-sequencing algorithm.

4. The method of claim 1 wherein said putative peptide sequence assignments are determined using a protein sequence database comprising protein amino acid sequences.

5. The method of claims 2 or 4 wherein said putative peptide sequence assignments are determined using a spectrum matching algorithm.

6. The method of claim 1 wherein at least a portion of said peptide sequence annotators are determined by comparing said fragmentation mass spectrum of said peptide analyte to one or more peptide sequence databases comprising the masses of peptides, fragments of peptides or both.

7. The method of claim 1 wherein said compiling step further comprises the steps of: calculating a theoretical fragmentation mass spectrum for each of said putative peptide sequence assignments; and determining at least a portion of said peptide sequence annotators by comparing said fragmentation mass spectrum of said peptide analyte to said theoretical fragmentation mass spectrum of each putative peptide assignment.

8. The method of claim 1 wherein said compiling step further comprises the step of organizing said annotators in a relational database.

9. The method of claim 1 wherein said artificial neural network algorithm is determined or trained by analyzing peptides having known sequences.

10. The method of claim 1 wherein said confidence assessment algorithm uses a plurality of weighting factors, wherein selected peptide sequence annotators are assigned a weighting factor.

11. The method of claim 1 wherein one of said peptide sequence annotators is determined by the steps of: measuring the retention time of said peptide analyte for a selected chromatographic media, thereby generating a measured retention time; calculating predicted retention times of each of said putative peptide sequence assignments for said chromatographic media, thereby generating a plurality of predicted retention times; and comparing said measured retention time to said predicted retention times.

12. The method of claim 1 further comprising the step of determining the masses of said fragments.

13. The method of claim 12 wherein one of said peptide sequence annotators is determined by the step of determining a set of matching peaks of said fragmentation mass spectrum which correspond to fragments having masses equal to the masses of fragments predicted for each peptide sequence assignment using said peptide sequence database.

14. The method of claim 13 wherein said peptide sequence annotator is the number of matching peaks of said fragmentation pattern.

15. The method of claim 14 wherein said fragments predicted for each peptide sequence assignment are selected from the group consisting of: a-type fragments; b-type fragments; c-type fragments; x-type fragments; y-type fragments; z-type fragments; immonium ions; and internal fragments.

16. The method of claim 13 wherein said peptide sequence annotator is the cumulative relative intensity of said matched peaks of said fragmentation pattern.

17. The method of claim 16 wherein said fragments predicted for each peptide sequence assignment are selected from the group consisting of: a-type fragments; b-type fragments; c-type fragments; x-type fragments; y-type fragments; z-type fragments; immonium ions; and internal fragments.

18. The method of claim 1 wherein one of said peptide sequence annotators is determined by the step of subtracting said molecular mass of said peptide analyte from the mass corresponding to each of said putative peptide sequence assignments.

19. The method of claim 1 further comprising the step of determining the charge state of a precursor ion of said peptide analyte.

20. A method for assessing the confidence of a peptide sequence assignment comprising the steps of: measuring the molecular mass of a peptide analyte; generating a fragmentation mass spectrum of said peptide analyte comprising a series of peaks corresponding to fragments of said peptide analyte; compiling a peptide sequence annotator index for said putative peptide sequence assignment comprising a plurality of peptide sequence annotators; and combining at least a portion of said peptide sequence annotators in a parallel confidence assessment algorithm, thereby generating a confidence assessment for said putative peptide sequence assignment, wherein said parallel confidence assessment algorithm comprises peptide sequence assessment rules derived from an artificial neural network algorithm.

21. A method for identifying a protein analyte comprising the steps of: digesting said protein analyte, thereby generating a plurality of peptide analytes; measuring the molecular mass of said peptide analytes; generating a fragmentation mass spectrum for each of said peptide analytes comprising a series of peaks corresponding to fragments of said peptide analytes; determining a plurality of putative peptide sequence assignments for each of said peptide analytes; compiling a peptide sequence annotator index for each of said putative peptide sequence assignments comprising a plurality of peptide sequence annotators; combining at least a portion of said peptide sequence annotators in a parallel confidence assessment algorithm, thereby generating a confidence assessment for each putative peptide sequence assignment, wherein said parallel confidence assessment algorithm comprises peptide sequence assessment rules derived from an artificial neural network algorithm; and inputting said putative peptide sequence assignments and confidence assessments into a protein identification algorithm, wherein said protein identification algorithm compares said putative protein sequences to a protein sequence database comprising protein amino acid sequences, thereby determining the identity of said protein analyte.

22. The method of claim 21 wherein said putative peptide sequence assignments for each of said peptide analytes are determined using a peptide sequence database comprising peptide amino acid sequences and peptide fragment amino acid sequences.

23. The method of claim 21 wherein said putative peptide sequence assignments for each of said peptide analytes are determined using a de novo peptide-sequencing algorithm.

24. The method of claim 21 wherein putative peptide sequence assignments for each of said peptide analytes are determined using a protein sequence database comprising protein amino acid sequences.

25. The method of claim 21 wherein putative peptide sequence assignments for each of said peptide analytes are determined using a spectrum matching algorithm.

26. A method for identifying a post translational modification of a protein analyte comprising the steps of: digesting said protein analyte, thereby generating a plurality of peptide analytes; measuring the molecular mass of said peptide analytes; generating a fragmentation mass spectrum for each of said peptide analytes comprising a series of peaks corresponding to fragments of said peptide analytes; determining a plurality of putative peptide sequence assignments for each of said peptide analytes; compiling a peptide sequence annotator index for each of said putative peptide sequence assignments comprising a plurality of peptide sequence annotators; combining at least a portion of said peptide sequence annotators in a parallel confidence assessment algorithm, thereby generating a confidence assessment for each putative peptide sequence assignment, wherein said parallel confidence assessment algorithm comprises peptide sequence assessment rules derived from an artificial neural network algorithm; inputting said putative peptide sequence assignments and confidence assessments into a protein identification algorithm, wherein said protein identification algorithm compares said putative protein sequences to a protein sequence database comprising protein amino acid sequences and modified amino acid sequences, thereby determining said post translational modification of said protein analyte.

27. The method of claim 26 wherein said putative peptide sequence assignments for each of said peptide analytes are determined using a peptide sequence database comprising peptide amino acid sequences and peptide fragment amino acid sequences.

28. The method of claim 26 wherein said putative peptide sequence assignments for each of said peptide analytes are determined using a de novo peptide-sequencing algorithm.

29. The method of claim 26 wherein putative peptide sequence assignments for each of said peptide analytes are determined using a protein sequence database comprising protein amino acid sequences.