Tamanna Sultana
Bioinformatics Analysis Core (BAC)
Genomics & Proteomics Core Laboratories (GPCL)
University of Pittsburgh
Proteomics
Information science:
the sciences concerned with gathering, manipulating, storing,
retrieving, and classifying recorded information
wordnetweb.princeton.edu/perl/webwn
Sample preparation → 1D or 2D gel → excise spot → trypsin digest → peptides → MS
[Figure: mass spectrum, 4700 Reflector Spec #1 (base peak m/z 1479.9); x-axis: Mass (m/z), 699.0 to 3000.0; y-axis: % Intensity]
Protein → Peptides → Mass spectrum (MS) → Peak list (820.7, 842.5, 1012.6, 1296.6, 1555.7, ...)
Which proteins to analyze?
Gel electrophoresis (2D gel image, axes pI and MW; source: GPCL)
Fractionation (image source: www.prometicbiosciences.com)
Sample analysis by Bottom up MS
Digestion (cleavage) of proteins by an enzyme
Fractionation of peptides
Off line
On line
Analysis by MS and tandem MS (MS/MS)
Ion source → Mass analyzer → Detector
[Figure: mass spectrum; x-axis: Mass (m/z), 800 to 2700; y-axis: % Intensity; labeled peaks include m/z 904.4686, 1296.6848, 1552.6698, 1570.6766, 1829.9774, 2093.0872, 2465.1987]
Fragment ion spectrum
Tandem MS
[Figure: 4700 MS/MS spectrum, precursor m/z 1570.7, Spec #1 (base peak m/z 175.1); x-axis: Mass (m/z), 69.0 to 1658.0; y-axis: % Intensity]
Tandem mass spectrum
http://qbab.aber.ac.uk
Tandem mass spectra (MS/MS) can be used for peptide sequencing
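To make the sequencing idea concrete, here is a minimal sketch that computes the singly charged b- and y-ion ladders for a peptide. The function name and the example peptide "GAR" are hypothetical; the residue masses are standard monoisotopic values rounded to five decimals. Consecutive ladder peaks differ by one residue mass, which is what lets a sequence be read from an MS/MS spectrum.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O
PROTON = 1.00728   # mass of the charge-carrying proton


def fragment_ions(peptide):
    """Return (b_ions, y_ions): singly charged m/z values for b1..b(n-1) and y1..y(n-1)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # b ions grow from the N-terminus
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # y ions grow from the C-terminus
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


# Hypothetical tryptic peptide used only for illustration.
b_ions, y_ions = fragment_ions("GAR")
```

For a peptide ending in arginine, the y1 ion computed this way is about 175.119, which is the kind of low-mass peak commonly seen in MS/MS spectra of tryptic peptides.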
Database Searching
Peptide Mass Fingerprinting
Sequence tag approach
De novo sequencing
inspect raw data
http://qbab.aber.ac.uk
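As an illustration of peptide mass fingerprinting, the sketch below counts how many peaks of an observed peak list fall within a mass tolerance of each protein's theoretical peptide masses and ranks the proteins by hit count. This is a simplification: real search engines use probabilistic scoring rather than raw counts. The protein names, theoretical masses, and the 0.3 Da tolerance are hypothetical; only the peak list values are taken from the earlier slide.

```python
def match_peaks(observed, theoretical, tolerance=0.3):
    """Return the observed peaks lying within `tolerance` Da of any theoretical mass."""
    return [
        peak for peak in observed
        if any(abs(peak - mass) <= tolerance for mass in theoretical)
    ]


def rank_proteins(observed, database, tolerance=0.3):
    """Rank proteins by the number of observed peaks their peptides explain."""
    scores = {
        name: len(match_peaks(observed, masses, tolerance))
        for name, masses in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Peak list from the workflow slide; the database entries are made up.
peak_list = [820.7, 842.5, 1012.6, 1296.6, 1555.7]
toy_database = {
    "protein_A": [842.51, 1296.68, 1555.70],
    "protein_B": [825.10, 1012.61],
}
ranking = rank_proteins(peak_list, toy_database)
```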
Mascot Search Results
Search title : SampleSetID: 362, AnalysisID: 567, MaldiWellID: 15790, SpectrumID: 17225, Path=\Mani\102004\New Analysis 1
Database : NCBInr 20040606 (1846720 sequences; 611532004 residues)
Timestamp : 20 Oct 2004 at 14:52:50 GMT
Top Score : 681 for gi|180570, creatine kinase [Homo sapiens]
Score is -10*Log(P), where P is the probability that the observed match is a random event. Protein scores greater than 75 are significant (p<0.05).
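The score definition quoted in the report can be sketched as a pair of conversions. The function names are hypothetical, and the logarithm is assumed to be base 10, as is conventional for this kind of score.

```python
import math


def mascot_score(p):
    """Score = -10 * log10(P), where P is the probability of a random match."""
    return -10.0 * math.log10(p)


def probability_from_score(score):
    """Invert the score definition to recover P."""
    return 10.0 ** (-score / 10.0)


# The top score of 681 reported above corresponds to a vanishingly small P.
p_top_hit = probability_from_score(681)
```

Every 10 points of score corresponds to one order of magnitude in P, so a score of 681 implies a P on the order of 1e-68.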
Top hits from the Mascot search: there are multiple accession numbers for the same protein
2D gel analysis
Peptide ID consensus
Surya Viswanathan, Mustafa Ünlü, Jonathan S Minden. Nature Protocols 1, 1351-1358 (2006)
Labeling strategy for 3 samples
[Figure: three gels (gel1, gel2, gel3), each loaded with one Cy3-labeled and one Cy5-labeled sample]
DiGE analysis of protein isoform expression
in STAT3 constitutively activated versus
STAT3 loss variant multiple myeloma cell
line U266
Sample prep
Labeling: 266#1 Cy3, 266#2 Cy5, and the reciprocal
1st dimension: Protean IEF Cell, Bio-Rad
Gel strip: 3-10 NL, 17 cm
Sample volume load: 300 µL
Running conditions: 250 V for 15 min, ramp to 10,000 V in 3 h, reach 60,000 Vh, hold at 500 V
2nd dimension: Protean II xi Cell, Bio-Rad
Gel: Jule, 8-16%
Running buffer: TGS (Bio-Rad), 2X in the top chamber and 1X in the bottom chamber
Running conditions: 16 mA for 45 min, followed by 30 mA for 5 h
# gels generated: 2
Gel processing
Gel fixing
Buffer: 40% methanol, 5% acetic acid
Time: overnight
Gel staining
None
Gel imaging
DiGE scanner: custom made with Prometrix CCD camera
Image generated per gel: 2 (total 4 images)
Image label:
BricknerGel_A_30sec_Cy3-266#1
BricknerGel_A_30sec_Cy5-266#2
BricknerGel_B_30sec_Cy5-266#1
BricknerGel_B_30sec_Cy3-266#2
None
Image analysis using Delta 2D software
From Decodon
Gel import
1. BricknerGel_A_30sec_Cy3-266#1
2. BricknerGel_A_30sec_Cy5-266#2
3. BricknerGel_B_30sec_Cy5-266#1
4. BricknerGel_B_30sec_Cy3-266#2
Gel warp
Between gel A and B using sample 266#1 (images 1 & 3)
Possible knockdown
Overexpression in 266#2 (labeled 266#2): blue spots
Underexpression in 266#2 (labeled 266#1): orange spots
Overlap: black
Quantitation table-1 (over-expression of 266#2)
Ave. % Volume of 266#1 STDEV. 266#1 Ave. % Volume of 266#2 STDEV. 266#2 Statistics label
[Figure: scatter plot of average % volume, 266#2 (y-axis, 0 to 0.7) against average % volume, 266#1 (x-axis, 0 to 2)]
Preliminary Conclusions
Proteomics data validation: why all must provide data. Lennart Martens and Henning Hermjakob.
Mol. BioSyst., 2007, 3, 518-522.
Why do different search engines generate different peptide lists from the same dataset?
Mascot
Probability-based MOWSE scoring
Sequest
Uses the cross-correlation (Xcorr) between experimental and theoretical spectra
Reports deltaCn
X!Tandem
Considers only b/y-type ions
Builds a database of the proteins identified in a first pass, then performs a more extensive search against only those proteins
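The Sequest entries above can be illustrated with a heavily simplified sketch of a cross-correlation score and deltaCn. It assumes both spectra are already binned into equal-length intensity lists; Sequest's spectrum preprocessing is not reproduced here, and the function names and toy spectra are hypothetical. Xcorr is taken as the zero-lag dot product minus the mean dot product over nearby lags, and deltaCn as the relative gap between the best and second-best Xcorr.

```python
def dot_at_lag(experimental, theoretical, lag):
    """Dot product of two binned spectra with the experimental one shifted by `lag` bins."""
    n = len(experimental)
    total = 0.0
    for i in range(n):
        j = i + lag
        if 0 <= j < n:
            total += experimental[j] * theoretical[i]
    return total


def xcorr(experimental, theoretical, max_lag=75):
    """Zero-lag correlation minus the average correlation over lags -max_lag..max_lag (excluding 0)."""
    zero_lag = dot_at_lag(experimental, theoretical, 0)
    lags = [lag for lag in range(-max_lag, max_lag + 1) if lag != 0]
    background = sum(dot_at_lag(experimental, theoretical, lag) for lag in lags) / len(lags)
    return zero_lag - background


def delta_cn(best_xcorr, second_xcorr):
    """Normalized difference between the top two Xcorr values."""
    return (best_xcorr - second_xcorr) / best_xcorr


# Toy binned spectra: one candidate matches the observed peaks, the other does not.
observed = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
matching_candidate = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
other_candidate = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
score_match = xcorr(observed, matching_candidate, max_lag=3)
score_other = xcorr(observed, other_candidate, max_lag=3)
```

Subtracting the off-lag background penalizes candidates that correlate with the spectrum only by chance alignment, which is one reason scores from this approach are not directly comparable with probability-based scores such as Mascot's.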
Protein databases
NCBI
NCBI Entrez Protein is the protein database of the National Center for Biotechnology Information; it contains redundant protein sequences with poor annotation.
RefSeq is NCBI's Reference Sequence database, a comprehensive, integrated, non-redundant, well-annotated set of sequences.
UniProt/Swiss-Prot
The UniProt Knowledgebase (UniProtKB) consists of two sections:
manually annotated and reviewed UniProtKB/Swiss-Prot and
automatically annotated UniProtKB/TrEMBL.
UniProtKB/Swiss-Prot is well curated, well annotated, non-redundant, and considerably smaller than NCBI, and is therefore widely used.
IPI
IPI, the International Protein Index, is used for species-specific searches and is maintained by the European Bioinformatics Institute (EBI).
The decision as to which database to use depends solely on the aim of the project and the type of experiment.
If the goal is the highest sensitivity, NCBI is more desirable as a first step. However, searching a large database is time-consuming and requires manual validation as a second step and/or further distillation of the protein list against more specific databases. For identifying sequence variants, NCBI is the better starting point.
UniProtKB/Swiss-Prot, on the other hand, is a better option for investigators seeking faster and more reliable search results.
If the species is known, the IPI database is a good candidate; it contains protein sequences with cross-references to all of its source data, e.g. Ensembl, UniProt, and RefSeq.
#1 problem
Proliferation of new search algorithms, with a
variety of settings; which one(s)?
Importance of database search algorithms in peptide
identification
Each search engine identifies about the same number of spectra, but the overlap is surprisingly small. Different search engines match different spectra.
[Figure: Venn diagram of spectra matched by SEQUEST, X!Tandem, and Mascot; region labels as extracted: 9%, 22%, 4%, 34%, 19%, 7%, 5%]
Scaffold uses Nesvizhskii's algorithm* to convert SEQUEST and Mascot scores to peptide probabilities.
Scaffold uses an algorithm by Nesvizhskii to combine peptide probabilities.
*Nesvizhskii, A. I. et al., Anal. Chem. 2003, 75, 4646-4658
Scaffold View
Click on Scaffold and import the file MSX50.sfd
We will be able to import this file into Protein Center after exporting it in ProtXML file format (MSX50.xml)
Consensus Study Conducted by BAC
Mascot only (MO)
Sequest only (SO)
X!Tandem only (XO)
Union of M & S (MSU)
MXU
SXU
MSXU
Intersection of M & S (MSI)
SXI
MXI
MSXI
[Figure: Venn diagram of the Mascot (M), Sequest (S), and X!Tandem (X) result sets]
Evaluate the performance of each method using a standard protein dataset
Performance measures are the sensitivity and specificity of each method
Sultana T, Jordan R, Lyons-Weiler J. 2009. Optimization of
the use of consensus methods for the detection and
putative identification of peptides via mass spectrometry
using protein standard mixtures. J Proteomics Bioinform
2: 263-273.
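The consensus methods listed above map naturally onto set operations. The sketch below is one possible reading, assuming "only" means using a single engine's result set on its own, and that sensitivity and specificity are computed against a known standard mixture within a fixed universe of candidate identifications. All function names and toy identifiers are hypothetical.

```python
def consensus_sets(mascot, sequest, xtandem):
    """Build the single-engine, union (U), and intersection (I) consensus sets."""
    m, s, x = set(mascot), set(sequest), set(xtandem)
    return {
        "MO": m, "SO": s, "XO": x,
        "MSU": m | s, "MXU": m | x, "SXU": s | x, "MSXU": m | s | x,
        "MSI": m & s, "MXI": m & x, "SXI": s & x, "MSXI": m & s & x,
    }


def sensitivity_specificity(identified, truth, universe):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    identified, truth, universe = set(identified), set(truth), set(universe)
    tp = len(identified & truth)
    fn = len(truth - identified)
    fp = len(identified - truth)
    tn = len(universe - truth - identified)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Toy example: letters stand in for peptide or protein identifiers.
sets = consensus_sets({"a", "b"}, {"b", "c"}, {"c", "d"})
sens, spec = sensitivity_specificity(
    identified={"a", "b", "e"},
    truth={"a", "b", "c"},
    universe={"a", "b", "c", "d", "e", "f"},
)
```

Unions can only add identifications, so they tend to raise sensitivity at the cost of specificity; intersections do the opposite.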
Scaffold confidence filter settings
Sample preparation:
Stably overexpressed MICA: immunoprecipitated with MICA antibody, and the complex was then pulled down
~5 µg/10 µL of protein was reduced with TCEP, alkylated with iodoacetamide, and digested with trypsin.
LCQ-Deca-XL (LC-ESI-MS) was used for MS and MS/MS data generation
SCH2473A9MA3.RAW
Database Search Parameters
Database
IPI human v. 3.57
Search algorithms
Mascot, Sequest & X!Tandem
Modifications (variable)
Carbamidomethyl (+57 @C) and oxidation (+16 @M)
Missed cleavages: 2 maximum
Error tolerance: 2 Da on both parent and fragment ions
Peak list conversion
Raw files were converted into Mascot generic format (MGF) peak lists using extract_msn, provided with the Xcalibur software of the LCQ instrument
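To show what the "Missed cleavages: 2 maximum" setting means for the search space, here is a minimal in-silico tryptic digestion sketch. It assumes the common rule that trypsin cleaves C-terminal to K or R except when the next residue is P; the function name and the example sequence are hypothetical.

```python
import re


def tryptic_peptides(sequence, max_missed_cleavages=2):
    """List tryptic peptides, allowing up to `max_missed_cleavages` skipped cleavage sites."""
    # Cleavage sites: positions just after K or R, unless followed by P.
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    bounds = [0] + [s for s in sites if s < len(sequence)] + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        # Joining k+1 adjacent fragments corresponds to k missed cleavages.
        last = min(i + 2 + max_missed_cleavages, len(bounds))
        for j in range(i + 1, last):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides


# Hypothetical sequence; the R before P is not a cleavage site.
fully_cleaved = tryptic_peptides("AKRPGKTR", max_missed_cleavages=0)
with_missed = tryptic_peptides("AKRPGKTR", max_missed_cleavages=2)
```

Each additional allowed missed cleavage enlarges the list of candidate peptides, which increases both search time and the chance of random matches.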
Accurate list for sample1 (MSXU-95_2_50)
Biological sample name | Protein name | Protein accession numbers | Protein molecular weight (Da) | Protein identification probability | # unique peptides | # unique spectra | # total spectra | % total spectra | % sequence coverage
MascotA8 | Isoform 2 of Protein spire homolog 1 | IPI00645268 | 83,940.80 | 100.00% | 34 | 36 | 40 | 3.76% | 31.70%
MascotA8 | similar to hCG1820764 | IPI00741841 | 11,265.40 | 99.90% | 5 | 5 | 8 | 0.75% | 50.50%
MascotA8 | Ras-related protein Rab-5A | IPI00023510 | 23,640.80 | 99.40% | 4 | 5 | 9 | 0.85% | 23.70%
MascotA8 | Isoform 1 of Tropomyosin alpha-4 chain | IPI00010779 | 28,504.40 | 98.70% | 4 | 4 | 4 | 0.38% | 33.50%
Sensitive list for sample1 (MSU-80_1_50)
Biological sample name | Protein name | Protein accession numbers | Protein molecular weight (Da) | Protein identification probability | # unique peptides | # unique spectra | # total spectra | % total spectra | % sequence coverage
MascotA8 | Putative uncharacterized protein (Fragment) | IPI00816622 | 9,137.30 | 100.00% | 10 | 10 | 13 | 1.22% | 69.90%
MascotA8 | tudor domain containing 10 isoform a | IPI00432733, IPI00514618 | 40,923.70 | 100.00% | 7 | 7 | 9 | 0.85% | 15.30%
MascotA8 | Isoform 2 of Tropomyosin alpha-4 chain | IPI00216975 | 32,705.70 | 100.00% | 18 | 19 | 20 | 1.88% | 35.90%
MascotA8 | Isoform 2 of Protein spire homolog 1 | IPI00645268 | 83,940.80 | 100.00% | 34 | 36 | 40 | 3.76% | 31.70%
MascotA8 | similar to hCG1820764 | IPI00741841 | 11,265.40 | 99.90% | 5 | 5 | 8 | 0.75% | 50.50%
[Figure: Venn diagram for sample 1 across SEQUEST, X!Tandem, and Mascot; region labels as extracted: 29%, 0%, 3%, 0%, 19%, 42%, 7%]
Accurate list for sample 2 (MSXU-95_2_50)
Biological sample name | Protein name | Protein accession numbers | Protein molecular weight (Da) | Protein identification probability | # unique peptides | # unique spectra | # total spectra | % total spectra | % sequence coverage
MascotA9 | Isoform 2 of Mucin-19 | IPI00829833, IPI00896516 | 564,046.60 | 100.00% | 11 | 11 | 11 | 0.89% | 3.27%
MascotA9 | Protein | IPI00916368 | 145,322.20 | 100.00% | 5 | 5 | 6 | 0.49% | 8.37%
MascotA9 | Isoform 3 of HEAT repeat-containing protein 5B | IPI00333696, IPI00479069 | 214,978.60 | 100.00% | 7 | 7 | 7 | 0.57% | 4.89%
MascotA9 | Isoform 1 of AT-hook-containing transcription factor 1 | IPI00170594, IPI00217957, IPI00876984, IPI00878213 | 253,444.20 | 100.00% | 8 | 8 | 8 | 0.65% | 5.23%
MascotA9 | annexin A2 isoform 1 | IPI00418169 | 40,395.30 | 100.00% | 6 | 6 | 7 | 0.57% | 24.60%
MascotA9 | Zinc finger protein 282 | IPI00003798 | 74,277.40 | 99.90% | 5 | 5 | 7 | 0.57% | 12.10%
MascotA9 | Interferon-induced protein with tetratricopeptide repeats 3 | IPI00024254 | 55,968.00 | 99.90% | 3 | 3 | 3 | 0.24% | 11.20%
Isoform 1 of Transcription
Specific list for sample 2 (MO-99_3_50)
Biological sample name | Protein name | Protein accession numbers | Protein molecular weight (Da) | Protein identification probability | # unique peptides | # unique spectra | # total spectra | % total spectra | % sequence coverage
MascotA9 | annexin A2 isoform 1 | IPI00418169 | 40,395.30 | 100.00% | 6 | 6 | 7 | 0.57% | 24.60%
MascotA9 | Isoform 2 of Mucin-19 | IPI00829833, IPI00896516 | 564,046.60 | 100.00% | 11 | 11 | 11 | 0.89% | 3.27%
MascotA9 | Protein | IPI00916368 | 145,322.20 | 100.00% | 5 | 5 | 6 | 0.49% | 8.37%
MascotA9 | Isoform 3 of HEAT repeat-containing protein 5B | IPI00333696, IPI00479069 | 214,978.60 | 100.00% | 7 | 7 | 7 | 0.57% | 4.89%
MascotA9 | Isoform 1 of AT-hook-containing transcription factor 1 | IPI00170594, IPI00217957, IPI00876984, IPI00878213 | 253,444.20 | 100.00% | 8 | 8 | 8 | 0.65% | 5.23%
MascotA9 | Zinc finger protein 282 | IPI00003798 | 74,277.40 | 99.90% | 5 | 5 | 7 | 0.57% | 12.10%
MascotA9 | Interferon-induced protein with tetratricopeptide repeats 3 | IPI00024254 | 55,968.00 | 99.90% | 3 | 3 | 3 | 0.24% | 11.20%
MascotA9 | Isoform 1 of Transcription factor TFIIIB component B'' homolog | IPI00760877, IPI00893272 | 293,875.60 | 99.90% | 6 | 6 | 7 | 0.57% | 4.84%
MascotA9 | Metallothionein-2 | IPI00022498 | 6,023.60 | 99.90% | 5 | 5 | 7 | 0.57% | 90.20%
Venn diagram for sample 2 using sensitive list
[Figure: Venn diagram across SEQUEST, X!Tandem, and Mascot; region labels as extracted: 35%, 0%, 13%, 3%, 5%, 42%, 2%]
Size of the consensus set (# of proteins identified) for each consensus method
Consensus method | Sample 1 | Sample 2
MSXU-95_2_50 (Accurate) | 10 | 27
MSU-80_1_50 (Sensitive) | 23 | 64
MO-99_3_50 (Specific) | 10 | 18
Sample 1 consensus list for each consensus method
Accurate protein list | Specific protein list | Sensitive protein list
Calicin | Calicin | Calicin
Putative uncharacterized protein (Fragment) | Putative uncharacterized protein (Fragment) | Putative uncharacterized protein (Fragment)
tudor domain containing 10 isoform a | tudor domain containing 10 isoform a | tudor domain containing 10 isoform a
Protein | Protein | Protein
Isoform 2 of Tropomyosin alpha-4 chain | Isoform 2 of Tropomyosin alpha-4 chain | Isoform 2 of Tropomyosin alpha-4 chain
Isoform 2 of Protein spire homolog 1 | Isoform 2 of Protein spire homolog 1 | Isoform 2 of Protein spire homolog 1
similar to hCG1820764 | similar to hCG1820764 | similar to hCG1820764
Protein FAM26E | Protein FAM26E | Protein FAM26E
Ras-related protein Rab-5A | Ras-related protein Rab-5A | Ras-related protein Rab-5A
Isoform 1 of Tropomyosin alpha-4 chain | Isoform 1 of Tropomyosin alpha-4 chain | Isoform 1 of Tropomyosin alpha-4 chain
 | | Isoform 2 of Membrane-associated guanylate kinase, WW and PDZ domain-containing protein 2
 | | Corrupt Accession: PI:IPI00890727.1|SWISS-PROT:Q8NB90-2
 | | Isoform 1 of Uncharacterized protein KIAA1107
 | | Isoform 2 of Protein FAM184A
 | | Isoform 2 of Metallothionein-1G
 | | GTPase IMAP family member 8
 | | Isoform 1 of WSC domain-containing protein 2
 | | similar to GOLGA8A protein
Sample 2 consensus list for each consensus method
Accurate protein list | Specific protein list | Sensitive protein list
annexin A2 isoform 1 | annexin A2 isoform 1 | annexin A2 isoform 1
Isoform 2 of Mucin-19 | Isoform 2 of Mucin-19 | Isoform 2 of Mucin-19
Protein IPI00916368 | Protein IPI00916368 | Protein IPI00916368
Isoform 3 of HEAT repeat-containing protein 5B | Isoform 3 of HEAT repeat-containing protein 5B | Isoform 3 of HEAT repeat-containing protein 5B
Isoform 1 of AT-hook-containing transcription factor 1 | Isoform 1 of AT-hook-containing transcription factor 1 | Isoform 1 of AT-hook-containing transcription factor 1
Zinc finger protein 282 | Zinc finger protein 282 | Zinc finger protein 282
Interferon-induced protein with tetratricopeptide repeats 3 | Interferon-induced protein with tetratricopeptide repeats 3 | Interferon-induced protein with tetratricopeptide repeats 3
Isoform 1 of Transcription factor TFIIIB component B'' homolog | Isoform 1 of Transcription factor TFIIIB component B'' homolog | Isoform 1 of Transcription factor TFIIIB component B'' homolog
Metallothionein-2 | Metallothionein-2 | Metallothionein-2
Vimentin | Vimentin | Vimentin
Isoform 1 of Disabled homolog 2-interacting protein | Isoform 1 of Disabled homolog 2-interacting protein | Isoform 1 of Disabled homolog 2-interacting protein
Similar to Signal peptidase complex subunit 2 | Similar to Signal peptidase complex subunit 2 | Similar to Signal peptidase complex subunit 2
Vimentin | Isoform 2 of Protein Dok-7 | Vimentin
Annexin A5 | Isoform GTBP-N of DNA mismatch repair protein Msh6 | Annexin A5
Isoform 2 of Protein Dok-7 | Isoform 1 of HEAT repeat-containing protein 5A | Isoform 2 of Protein Dok-7
Isoform GTBP-N of DNA mismatch repair protein Msh6 | Isoform 1 of Cullin-4A | Isoform GTBP-N of DNA mismatch repair protein Msh6
Methods Text
Sample Preparation
Protein extracts of the osteosarcoma cell lines SCH2473A8MA3 (sample 1) and SCH2473A9MA3 (sample 2) were processed in the Genomics and Proteomics Core Laboratories (GPCL) at the University of Pittsburgh prior to tandem mass spectrometry. In brief, immunoprecipitated samples provided by the PI's lab were reduced with tris-2-carboxyethyl-phosphine (TCEP), alkylated with iodoacetamide (IAC), and digested with trypsin (Promega). The ESI-MS and information-dependent acquisition (IDA) MS/MS spectra were acquired at GPCL with an LCQ-Deca-XL coupled to a nano-LC system (Thermo Scientific, Waltham, MA). The IDA was set so that MS/MS was performed on the three most intense peaks per cycle.
Database Search
Experimental raw spectra files were searched against the human database, IPI Human v3.57, to identify peptides using the following three search algorithms: Sequest, Mascot, and X!Tandem. The search parameters were: precursor ion tolerance: 2 Da; fragment ion tolerance: 2 Da; variable modifications: carbamidomethyl on cysteine and oxidation on methionine; maximum missed cleavages: 2. For the Mascot and X!Tandem searches, the raw files were converted to MGF peak lists.
Merging the Data
Files containing the database search results from Mascot, Sequest, and X!Tandem were imported into Scaffold. The software then merged, re-scored, and re-ranked the peptide lists identified by the three search algorithms. Scaffold uses the PeptideProphet and ProteinProphet algorithms, which employ Bayesian statistics to combine the probability of identifying a spectrum with the probability that the search methods agree with one another.
Protein List Generation
Accurate list: union of Mascot, Sequest, and X!Tandem (MSXU) at 95% minimum protein probability, 2 minimum unique peptides, and 50% minimum peptide probability
Sensitive list: union of Mascot and Sequest (MSU) at 80% minimum protein probability, 1 minimum unique peptide, and 50% minimum peptide probability
Specific list: Mascot only (MO) at 99% minimum protein probability, 3 minimum unique peptides, and 50% minimum peptide probability
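The list labels used throughout (for example MSXU-95_2_50) encode the consensus method and the three thresholds. The sketch below parses such a label and applies the two protein-level thresholds to a list of hits. It is an illustration only, not Scaffold's implementation: the class and function names are hypothetical, the peptide probability threshold is parsed but not applied because it acts at the peptide level, and the second example hit is made up.

```python
from dataclasses import dataclass


@dataclass
class ProteinHit:
    name: str
    protein_probability: float  # in percent
    unique_peptides: int


def parse_filter_label(label):
    """Split e.g. 'MSXU-95_2_50' into (method, min protein %, min unique peptides, min peptide %)."""
    method, thresholds = label.split("-")
    protein_prob, min_peptides, peptide_prob = thresholds.split("_")
    return method, float(protein_prob), int(min_peptides), float(peptide_prob)


def filter_hits(hits, label):
    """Keep hits meeting the protein probability and unique peptide thresholds of `label`."""
    _, min_protein_prob, min_peptides, _ = parse_filter_label(label)
    return [
        h for h in hits
        if h.protein_probability >= min_protein_prob
        and h.unique_peptides >= min_peptides
    ]


hits = [
    ProteinHit("Ras-related protein Rab-5A", 99.4, 4),      # values from the sample 1 table
    ProteinHit("hypothetical low-confidence hit", 85.0, 1),  # made-up example
]
accurate = filter_hits(hits, "MSXU-95_2_50")
sensitive = filter_hits(hits, "MSU-80_1_50")
```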
Pathway express
Pathway express analysis
Convert the protein accession numbers to GenBank accession IDs using DAVID
The converted list is the input file for Pathway express
Impacted Pathways (sample 1)
Rank | Database Name | Pathway Name | Impact Factor | #Genes in Pathway | #Input Genes in Pathway | #Pathway Genes on Chip | %Input Genes in Pathway | %Pathway Genes in Input | p-value
68 | KEGG | Ribosome | 0.669 | 101 | 40 | 74 | 15.385 | 39.604 | -3.77E-13
5 | KEGG | Parkinson's disease | 25.022 | 137 | 15 | 101 | 5.769 | 10.949 | 5.26E-11
6 | KEGG | Alzheimer's disease | 22.184 | 178 | 16 | 145 | 6.154 | 8.989 | 1.11E-09
7 | KEGG | Cardiac muscle contraction | 20.419 | 87 | 11 | 69 | 4.231 | 12.644 | 9.58E-09
8 | KEGG | Huntington's disease | 16.844 | 189 | 14 | 154 | 5.385 | 7.407 | 1.48E-07
9 | KEGG | Focal adhesion | 16.757 | 203 | 15 | 189 | 5.769 | 7.389 | 3.15E-07
10 | KEGG | $hsa05131$ | 13.673 | 54 | 7 | 46 | 2.692 | 12.963 | 6.95E-06
11 | KEGG | Pathogenic Escherichia coli infection | 13.673 | 54 | 7 | 46 | 2.692 | 12.963 | 6.95E-06
3 | KEGG | Antigen processing and presentation | 51.553 | 89 | 8 | 68 | 3.077 | 8.989 | 1.10E-05
12 | KEGG | ECM-receptor interaction | 12.773 | 84 | 8 | 76 | 3.077 | 9.524 | 2.51E-05
15 | KEGG | Allograft rejection | 10.079 | 38 | 5 | 31 | 1.923 | 13.158 | 1.12E-04
14 | KEGG | Graft-versus-host disease | 10.315 | 42 | 5 | 32 | 1.923 | 11.905 | 1.32E-04
17 | KEGG | Type I diabetes mellitus | 9.624 | 44 | 5 | 34 | 1.923 | 11.364 | 1.77E-04
18 | KEGG | Autoimmune thyroid disease | 8.285 | 53 | 5 | 45 | 1.923 | 9.434 | 6.76E-04
19 | KEGG | Prostate cancer | 7.728 | 90 | 6 | 81 | 2.308 | 6.667 | 0.001729
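The two percentage columns appear to follow directly from the count columns. The sketch below reproduces them under two assumptions that are not stated in the slides: that %Input Genes in Pathway is relative to the total number of input genes, and that this total is 260, a value inferred from the rows (for example 40 / 15.385%). The function name is hypothetical.

```python
def pathway_percentages(input_genes_in_pathway, genes_in_pathway, total_input_genes):
    """Return (%input genes in pathway, %pathway genes in input), rounded to 3 decimals."""
    pct_input = 100.0 * input_genes_in_pathway / total_input_genes
    pct_pathway = 100.0 * input_genes_in_pathway / genes_in_pathway
    return round(pct_input, 3), round(pct_pathway, 3)


# Ribosome row: 40 input genes in a 101-gene pathway; 260 total input genes is an inferred value.
ribosome = pathway_percentages(40, 101, 260)
parkinsons = pathway_percentages(15, 137, 260)
```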
Impacted Pathways (Sample 1)
[Figure: impacted pathway diagram; legend: underexpressed, overexpressed]
Protein Center v 3.0.4
Go to
http://hsls.proteincenter.proxeon.com/ProXweb/
Log in
Click on the file MSX50.xml
Play with peptide, protein and cluster view
Learn how to select information of interest, export
files or create a report
Learn how to compare two datasets (A8.xml and
A9.xml)
Bioinformatics and statistical analysis
Single protein
Look up using Accession Key 6807647
Dataset: Human Red blood cell (hRBC) dataset
Datasets
Proxeon
Tutorials
Data set comparison and click on hRBC_proteome