Bioinf Proteomics

Bioinformatics for Proteomics studies
Tamanna Sultana
Bioinformatics Analysis Core (BAC)
Genomics & Proteomics Core Laboratories (GPCL)
University of Pittsburgh
Proteomics
The term proteomics refers to, but is not limited to, the
analysis of proteins in terms of separation,
identification, quantification, expression and
function.
Bioinformatics for proteomics
Informatics is a field of study that focuses on the use of

technology for improving access to and utilization of
information. ...
library.ahima.org/xpedio/groups/public/documents/ahima
/bok1_025042.hcsp
Information science:
the sciences concerned with gathering, manipulating, storing,
retrieving, and classifying recorded information
wordnetweb.princeton.edu/perl/webwn
its broad meaning is the science of processing data. Within

health and social care, it is used to refer to the processing of
data on patients and clients, normally but by no means
exclusively through IT systems.
www.smarthealthcare.com/glossary
Mass Spectrometry (MS) based
proteomics
Bottom up MS workflow: close up
[Figure: bottom-up MS workflow. Sample -> sample preparation (1D or 2D gel) -> excise spot -> trypsin digest -> peptides -> further sample prep through LC separation -> MS -> mass spectrum (spectral data) -> peak list. Example spectrum: 4700 Reflector Spec #1, base peak m/z 1479.9; axes: Mass (m/z) vs. % Intensity.]
Which proteins to analyze?
Which database/search engine (algorithm) to use?
The algorithm compares the experimental peak list (e.g. 820.7, 842.5, 1012.6, 1296.6, 1555.7, ...) with peak lists from an in silico digest of a protein database (e.g. non-redundant NCBI, Swiss-Prot, IPI, etc.) and reports the protein identification.
How do I know this is correct?
Sample preparation
Purification
www.qiagen.com
Gel electrophoresis (separation by pI and MW; image: GPCL)
Fractionation
www.prometicbiosciences.com
Sample analysis by Bottom up MS
Digestion (cleavage) of proteins by an enzyme
MGLSDGEWQQ VLNVWGKVEA DIAGHGQEVL

IRLFTGHPET LEKFDKFKHL KTEAEMKASE
DLKKHGTVVL TALGGILKKK GHHEAELKPL AQSHATKHKI
PIKYLEFISD AIIHVLHSKH PGDFGADAQG AMTKALELFR
NDIAAKYKEL GFQG
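A minimal sketch of an in silico tryptic digest of the sequence above, assuming the commonly used rule that trypsin cleaves after K or R except when the next residue is P. The function name and the `missed_cleavages` parameter are illustrative choices, not part of any particular tool.

```python
def trypsin_digest(sequence, missed_cleavages=0):
    """Cleave after K or R unless followed by P; optionally allow missed cleavages."""
    sites = [0]
    for i, residue in enumerate(sequence):
        is_last = i + 1 == len(sequence)
        if residue in "KR" and not is_last and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))

    peptides = []
    for start in range(len(sites) - 1):
        for extra in range(missed_cleavages + 1):
            end = start + 1 + extra
            if end >= len(sites):
                break
            peptides.append(sequence[sites[start]:sites[end]])
    return peptides


# Sequence shown on the slide, with the spacing removed.
SLIDE_SEQUENCE = (
    "MGLSDGEWQQVLNVWGKVEADIAGHGQEVL"
    "IRLFTGHPETLEKFDKFKHLKTEAEMKASE"
    "DLKKHGTVVLTALGGILKKKGHHEAELKPLAQSHATKHKI"
    "PIKYLEFISDAIIHVLHSKHPGDFGADAQGAMTKALELFR"
    "NDIAAKYKELGFQG"
)

peptides = trypsin_digest(SLIDE_SEQUENCE)
for peptide in peptides:
    print(peptide)
```

Calling `trypsin_digest(SLIDE_SEQUENCE, missed_cleavages=2)` would additionally list peptides spanning up to two uncleaved sites, matching the "2 maximum" missed-cleavage setting used later in the search parameters.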
Fractionation of peptides
Off line
On line
Analysis by MS and tandem MS (MS/MS)
Basic components of any MS: ion source, mass analyzer, detector

Tandem MS
4700 Proteomics Analyzer, Applied Biosystems

MS
MS, followed by precursor ion selection
[Figure: MS spectrum, 4700 Reflector Spec #1 (base peak m/z 1570.7); axes: Mass (m/z) vs. % Intensity]
Fragment ion spectrum
Tandem MS
[Figure: MS/MS spectrum of precursor m/z 1570.7, 4700 MS/MS Spec #1 (base peak m/z 175.1); axes: Mass (m/z) vs. % Intensity]
Tandem mass spectrum
http://qbab.aber.ac.uk
Tandem mass spectra (MS/MS) can be used for peptide sequencing
Database Searching
Peptide Mass Fingerprinting
Sequence tag approach
De novo sequencing
inspect raw data
http://qbab.aber.ac.uk
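To illustrate why MS/MS spectra carry sequence information, the sketch below computes singly charged b- and y-ion m/z values for a peptide from standard monoisotopic residue masses. The function name and the example peptide are hypothetical. As a plausibility check, the y1 ion of a C-terminal arginine comes out near m/z 175.12, close to the 175.1 base peak in the tandem spectrum above, which would be consistent with a tryptic peptide ending in R.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[0] is b1; y_ions[-1] is y1 (the C-terminal residue alone).
    """
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(RESIDUE_MASS[a] for a in peptide[:i]) + PROTON)
        y_ions.append(sum(RESIDUE_MASS[a] for a in peptide[i:]) + WATER + PROTON)
    return b_ions, y_ions


# Hypothetical tryptic peptide ending in arginine.
b_ions, y_ions = fragment_ions("SAMPLER")
print("b ions:", [round(m, 3) for m in b_ions])
print("y ions:", [round(m, 3) for m in y_ions])
```

The mass differences between consecutive ions in a series correspond to single residues, which is the basis of the sequence tag and de novo approaches listed above.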
Mascot Search Results
Search title : SampleSetID: 362, AnalysisID: 567, MaldiWellID:
15790, SpectrumID: 17225, Path=\Mani\102004\New Analysis 1
Database : NCBInr 20040606 (1846720 sequences; 611532004
residues)
Timestamp : 20 Oct 2004 at 14:52:50 GMT
Top Score : 681 for gi|180570, creatine kinase [Homo sapiens]
Probability Based Mowse Score
Score is -10*Log(P), where P is the probability that the observed match is a random
event. Protein scores greater than 75 are significant (p<0.05).
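The relationship Score = -10*Log(P) can be inverted to see what a given score means as a probability. A minimal sketch with illustrative function names follows; it only restates the formula on the slide and says nothing about how Mascot derives P.

```python
import math


def score_from_probability(p):
    """Score = -10 * log10(P), as stated in the Mascot report."""
    return -10.0 * math.log10(p)


def probability_from_score(score):
    """Invert the formula: P = 10 ** (-score / 10)."""
    return 10.0 ** (-score / 10.0)


print(score_from_probability(0.01))   # a 1-in-100 random match
print(probability_from_score(681))    # the top score in this report
```

Note that the significance threshold quoted in the report (75 here) depends on the search and the database, so it should be read from each report rather than assumed.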
Top hits from the Mascot search: there are multiple accession
numbers for the same protein
Accession Mass Score Description

1. gi|180570 42591 681 creatine kinase [Homo sapiens]
2. gi|21536286 42617 681 brain creatine kinase; creatine kinase-B [Homo sapiens]
3. gi|33304149 42730 681 creatine kinase, brain [synthetic construct]
4. gi|125292 42674 568 CREATINE KINASE, B CHAIN (B-CK) [Canis familiaris]
5. gi|180572 42658 538 creatine kinase-B
6. gi|125295 42636 514 CREATINE KINASE, B CHAIN (B-CK)
9. gi|31542401 42685 471 creatine kinase, brain [Rattus norvegicus]
10. gi|203474 42699 471 creatine kinase
11. gi|40807002 44540 469 Unknown (protein for IMAGE:5598839) [Rattus norvegicus]
12. gi|47477783 44782 469 Ckb protein [Rattus norvegicus]
13. gi|13096153 42551 441 Chain A, Crystal Structure Of Bovine Retinal Creatine Kinase
14. gi|12852054 42700 427 unnamed protein product [Mus musculus]
15. gi|10946574 42686 427 creatine kinase, brain [Mus musculus]
16. gi|47213348 42953 237 unnamed protein product [Tetraodon nigroviridis]
17. gi|627264 40353 236 creatine kinase (EC 2.7.3.2) isozyme IV - African clawed frog
18. gi|27503418 42214 235 Ckb-prov protein [Xenopus laevis]
19. gi|45384340 42844 209 B-creatine kinase [Gallus gallus]
20. gi|6573489 42713 201 Chain A, Crystal Structure Of Chicken Brain-Type Creatine Kinase
Creatine kinase B is the highest scoring protein
Match to: gi|21536286 ; Score: 681

Creatine kinase - B [Homo sapiens]
Nominal mass (Mr): 42591; Calculated pI value: 5.34
Observed mass & pI: 43 kDa, 6.2-6.27
Sequence Coverage: 46%
1 MPFSNSHNAL KLRFPAEDEF PDLSAHNNHM AKVLTPELYA ELRAKSTPSG
51 FTLDDVIQTG VDNPGHPYIM TVGCVAGDEE SYEVFKDLFD PIIEDRHGGY
101 KPSDEHKTDL NPDNLQGGDD LDPNYVLSSR VRTGRSIRGF CLPPHCSRGE
151 RRAIEKLAVE ALSSLDGDLA GRYYALKSMT EAEQQQLIDD HFLFDKPVSP
201 LLSASGMARD WPDARGIWHN DNKTFLVWVN EEDHLRVISM QKGGNMKEVF
251 TRFCTGLTQI ETLFKSKDYE FMWNPHLGYI LTCPSNLGTG LRAGVHIKLP
301 NLGKHEKFSE VLKRLRLQKR GTGGVDTAAV GGVFDVSNAD RLGFSEVELV
351 QMVVDGVKLL IEMEQRLEQG QAIDDLMPAQ K
Problems frequently faced in proteomics
How to choose which proteins to analyze?
Depends on the goal and the availability of support
Which methods to choose to quantify the proteins?
Again, it depends on the goal
How do I know I have all the correct proteins?
Complicated
Depends on the sample, methods, instruments and
software used
BAC's fee-for-service methods for supporting
proteomic studies
Differential protein expression analysis
Traditional 2D
DiGE
Consensus peptide identification
Protein database searches with Mascot, Sequest,
X!Tandem, Phenyx etc.
Consensus searches among Mascot, Sequest and
X!Tandem
Pathway analysis of identified proteins
Intelligent Systems and Bioinformatics Laboratory
Pathway Express
Integration of BAC with Proteomics Lab
1. PI and Proteomics lab decide the project path
2. BAC suggests the study design
3. Samples are submitted to the lab according to the study design
4. BAC performs the data analysis
   2D gel analysis: list of differentially expressed spots sent to the Core lab
   For samples not involving 2D gel electrophoresis: peptide ID consensus; protein ID generated
5. Further project-specific bioinformatics analysis/help
Difference gel electrophoresis (DiGE) image analysis
Surya Viswanathan, Mustafa Ünlü, Jonathan S Minden. Nature Protocols 1, 1351-1358 (2006)
Labeling strategy for 3 samples
[Figure: three samples (W, -, +) run pairwise on three gels (gel1, gel2, gel3); on each gel one sample is labeled with Cy3 and the other with Cy5]
DiGE analysis of protein isoform expression
in STAT3 constitutively activated versus
STAT3 loss variant multiple myeloma cell
line U266
Date: May 22, 2008

PI: ___ ___
BAC Analyst: Tamanna Sultana
Project Location: www.genetics.pitt.edu/mygpcl/
Example of DiGE analysis and Reporting to PI

Specific aims/objectives
To assess proteomic differences between

U266 with constitutively activated STAT3 and
the U266 STAT3 loss variant using 2D-DIGE,
with particular emphasis on mapping the
CENPM 58 AA isoform
Sample Details
Samples
U266 with constitutively activated STAT3: 266#1
U266 STAT3 loss variant: 266#2
Sample condition
Protein extracted and cleaned by PI lab
Sample amount
100 µg of each sample was used
Sample buffer
Lysis buffer, 20 µL
Sample processing (Traditional 2D/DiGE)
Sample prep
Labeling: 266#1 Cy3, 266#2 Cy5 and reciprocal
1st dimension: PROTEAN IEF cell, BioRad
Gel-strip: 3-10NL, 17cm
Sample volume load: 300 µL
Running conditions: 250 V for 15 min, ramp to 10,000 V in 3 h, run to 60,000 V·h,
hold at 500 V
2nd dimension: PROTEAN II xi cell, BioRad
Gel: Jule, 8-16%
Running buffer: TGS (BioRad), 2X on top chamber and 1X on bottom chamber
Running conditions: 16 mA for 45 min. followed by 30 mA for 5 hrs.
# gels generated: 2
Gel processing
Gel fixing
Buffer: 40% methanol, 5% acetic acid
Time: overnight
Gel staining
None
Gel imaging
DiGE scanner: custom made with Prometrix CCD camera
Image generated per gel: 2 (total 4 images)
Image label:
BricknerGel_A_30sec_Cy3-266#1
BricknerGel_A_30sec_Cy5-266#2
BricknerGel_B_30sec_Cy5-266#1
BricknerGel_B_30sec_Cy3-266#2
Image Storage location:

http://www.genetics.pitt.edu/mygpcl/050208
Scanned gels and left over sample storage location
Gels: in Rm. 9035, BST3 at 4 °C; samples: -80 °C
Previous analysis summary
None
Image analysis using Delta 2D software
From Decodon
Gel import
1. BricknerGel_A_30sec_Cy3-266#1
2. BricknerGel_A_30sec_Cy5-266#2
3. BricknerGel_B_30sec_Cy5-266#1
4. BricknerGel_B_30sec_Cy3-266#2
Gel warp
Between gel A and B using sample 266#1 (images 1 & 3)
Image fusion of all four images
Spot detection in the fused image
Spot transfer from the fused image to all images,
and spot labeling
Dual-view image of 266#1 and 266#2
[Figure: dual-view overlay with a possible knockdown spot marked]
Over-expression in 266#2: blue spots (labeled 266#2)
Under-expression in 266#2: orange spots (labeled 266#1)
Overlap: black
Quantitation table-1 (over-expression of 266#2)
Ave. % volume of 266#1 | STDEV 266#1 | Ave. % volume of 266#2 | STDEV 266#2 | Statistic | Spot label
0.02286 95.541 0.157 21.485 6.86425 ID758
0.04919 5.16631 0.222 23.826 4.51013 ID1007
0.01342 46.4842 0.046 11.841 3.39664 ID638
0.08416 7.3891 0.278 10.941 3.30421 ID1034
0.018 42.3414 0.055 44.115 3.06272 ID856
0.03559 23.0147 0.101 13.473 2.8341 ID1001
0.04949 16.6076 0.136 2.09 2.75143 ID989
0.01316 66.0837 0.036 13.76 2.70679 ID766
0.00253 99.3149 0.006 98.832 2.52439 ID441
0.00738 99.0185 0.018 15.034 2.48139 ID777
0.02265 36.2054 0.054 3.8892 2.40566 ID768
0.00958 20.6202 0.022 6.2787 2.27036 ID637
0.01124 54.9229 0.025 28.956 2.26648 ID747
0.1959 27.1355 0.423 36.584 2.15748 ID1093
0.00491 20.2911 0.01 31.267 2.10956 ID671
0.27372 19.1721 0.568 25.056 2.07484 ID561
0.0122 22.5978 0.025 76.485 2.03149 ID436
0.01617 33.9946 0.032 43.049 2.0087 ID843

Quantitation table-2 (under-expression of 266#2)
Ave. % volume of 266#1 | STDEV 266#1 | Ave. % volume of 266#2 | STDEV 266#2 | Statistic | Spot label
0.02564 47.8103 0.009 21.208 -2.94189 ID1052
0.21899 41.5971 0.096 47.237 -2.27408 ID1005
0.02429 74.5 0.006 41.081 -3.96436 ID780
0.01715 2.38384 0.007 10.352 -2.38462 ID879
0.78354 5.48853 0.18 28.436 -4.35728 ID1035
0.2324 5.45632 0.089 14.433 -2.61328 ID965
1.36616 5.92218 0.628 16.394 -2.17684 ID674
0.07541 56.8537 0.034 27.446 -2.20996 ID789
0.33283 1.68301 0.128 2.5139 -2.60336 ID966
0.07982 11.8707 0.028 39.784 -2.85716 ID972

0.13488 20.6176 0.055 52.409 -2.47384 ID894
0.30791 7.37787 0.152 12.419 -2.03194 ID413
0.03932 15.9807 0.015 58.697 -2.61746 ID1102
0.2053 27.0974 0.077 53.213 -2.68374 ID678
1.81212 13.1749 0.524 33.088 -3.45585 ID953
0.07955 38.023 0.038 35.563 -2.11097 ID736
0.01965 66.2334 0.004 60.707 -5.13589 ID171

0.35734 32.9212 0.138 47.716 -2.59257 ID646
0.5569 18.2044 0.233 14.208 -2.39388 ID682
0.5001 6.47287 0.249 30.814 -2.0096 ID681
0.02863 54.201 0.01 55.458 -2.98691 ID1096
1.17678 41.0268 0.451 66.544 -2.609 ID388
0.00893 0.69254 0.004 50.129 -2.02044 ID919
0.06059 60.0037 0.02 80.683 -2.99036 ID1044
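The values in the statistics column of both tables look consistent with a signed fold change between the two average % volumes (for example, 0.157 / 0.02286 is about 6.87 for spot ID758). The sketch below shows that calculation and a 2-fold cut-off. Both the interpretation of the column and the threshold are assumptions inferred from the tables, not a documented Delta 2D formula, and the function names are illustrative.

```python
def signed_fold_change(volume_reference, volume_test):
    """Ratio test/reference if the test is higher, else -(reference/test)."""
    if volume_reference <= 0 or volume_test <= 0:
        raise ValueError("spot volumes must be positive")
    if volume_test >= volume_reference:
        return volume_test / volume_reference
    return -(volume_reference / volume_test)


def classify_spot(fold_change, threshold=2.0):
    """Label a spot as over-expressed, under-expressed or unchanged in the test sample."""
    if fold_change >= threshold:
        return "over"
    if fold_change <= -threshold:
        return "under"
    return "unchanged"


# Average % volumes for spot ID758 (266#1 as reference, 266#2 as test).
fold_change = signed_fold_change(0.02286, 0.157)
print(round(fold_change, 2), classify_spot(fold_change))
```

Small differences from the tabulated values are expected because the % volumes shown on the slide are rounded.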

Graphical representation
[Figure: scatter plot of average % volume, 266#2 (y-axis, 0 to 0.7) versus average % volume, 266#1 (x-axis, 0 to 2)]
Preliminary Conclusions
266#1 versus 266#2

Number of over-expressed spots: 18
Number of under-expressed spots: 24
Report storage location
Folder: 052708_Delta2D_image analysis
Contents: PowerPoint presentation, pick list, snapshot
images with labeling and Delta 2D report
Mass spectrometry (MS) based peptide
identification
Gel bands of interest are excised for in-gel
digestion
Alternatively, a protein sample of interest can be
digested in solution
The digest is then subjected to MS or MS/MS for
peptide identification
You can choose to run LC before MS
Some instruments, like FT-MS, allow MS and
MS/MS of undigested protein samples
List of useful proteomics websites
http://www.fixingproteomics.org/
http://www.ionsource.com/
www.ebi.ac.uk
www.proteomecommons.org
www.peptideatlas.org
http://www.biochem.mpg.de/mann
http://ncrr.pnl.gov/software/
http://tools.proteomecenter.org
http://www.pil.sdu.dk/
www.proteomesoftware.com
www.hprd.org
www.expasy.org
Click on each link to get familiar
Protein database search engines (Algorithms)
Commercial
Mascot (Matrix Science)
Sequest (Thermo Scientific)
Phenyx (GeneBio)
Spectrum Mill (Agilent Technologies)
Paragon etc. (Applied Biosystems)
Free or open source
Mascot (free public server, for small datasets)
http://www.matrixscience.com/search_form_select.html
X!Tandem
http://www.thegpm.org/tandem/
Requires command-line use
OMSSA etc.
http://pubchem.ncbi.nlm.nih.gov/omssa/
Source of ambiguity in proteomics data
analysis
Quality of the MS/MS spectra

Redundant databases
Search strategy used
Nature of the model used by database search
engines (scoring algorithms)
The most common source of ambiguity and incorrect
assignment
Overview of processing mass spectrometry data
Proteomics data validation: why all must provide data. Lennart Martens and Henning Hermjakob.
Mol. BioSyst., 2007, 3, 518-522.
Why do different search engines generate different
peptide lists from the same dataset?
Mascot
Probability-based MOWSE scoring
Sequest
Cross-correlation (Xcorr) between experimental and
theoretical spectra is used
Reports deltaCn
X!Tandem
Considers only b/y-type ions
Creates a database of the proteins identified and performs
an extensive search on only those proteins
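As a rough illustration of the Sequest-style scores named above, the sketch below computes a simplified cross-correlation between two binned spectra (the dot product at zero offset minus the mean dot product over nearby offsets) and a deltaCn-like relative difference between the top two scores. This is a teaching sketch under stated assumptions: real Sequest preprocesses and normalizes spectra first, and the function names, the binning and the toy spectra are all hypothetical.

```python
def xcorr(observed, theoretical, max_offset=75):
    """Simplified cross-correlation score for two equally binned spectra."""
    n = len(observed)

    def dot(offset):
        total = 0.0
        for i in range(n):
            j = i + offset
            if 0 <= j < n:
                total += observed[i] * theoretical[j]
        return total

    offsets = [t for t in range(-max_offset, max_offset + 1) if t != 0]
    background = sum(dot(t) for t in offsets) / len(offsets)
    return dot(0) - background


def delta_cn(best_xcorr, second_xcorr):
    """Relative difference between the best and second-best Xcorr."""
    return (best_xcorr - second_xcorr) / best_xcorr


# Toy binned spectra (1 = peak present in that m/z bin), invented for illustration.
observed = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
candidate_good = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
candidate_poor = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]

best = xcorr(observed, candidate_good, max_offset=3)
second = xcorr(observed, candidate_poor, max_offset=3)
print(best, second, delta_cn(best, second))
```

Because each engine builds its score from a different model of the spectrum, the same spectrum can pass one engine's threshold and fail another's, which is one reason the resulting peptide lists differ.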
Protein databases
NCBI
NCBI's Entrez Protein database, from the National Center for Biotechnology Information,
contains redundant protein sequences with poor annotation.
RefSeq is NCBI's Reference Sequence database, a comprehensive, integrated, non-redundant,
well-annotated set of sequences.
UniProt/Swiss-Prot
The UniProt Knowledgebase (UniProtKB) consists of two sections:
manually annotated and reviewed UniProtKB/Swiss-Prot and
automatically annotated UniProtKB/TrEMBL.
UniProtKB/Swiss-Prot is well-curated, well annotated, non redundant and
considerably smaller than NCBI, therefore widely used.
IPI
IPI, the International Protein Index, is used for species-specific searches and is
maintained by the European Bioinformatics Institute (EBI).
The decision as to which database to use depends solely on the aim of the project and the type of
experiment concerned.
If the goal is the highest sensitivity, NCBI is more desirable as a first step.
However, searching against a large database is time consuming and requires manual
validation as a second step and/or further distillation of the protein list based on other
specific databases. For identifying sequence variants, NCBI is nonetheless a better starting point.
UniProtKB/Swiss-Prot, on the other hand, is a better option for investigators seeking fast and
reliable search results.
If species information is known, the IPI database is a good candidate, containing protein sequences
with cross-references to all its source data, e.g. Ensembl, UniProt, RefSeq.
#1 problem
Proliferation of new search algorithms, with a
variety of settings; which one(s) to use?
Importance of database search algorithms in peptide
identification
Each search engine identifies about the same number of spectra, but the overlap is surprisingly small. Different search engines match different spectra.
[Figure: Venn diagram of spectra identified by SEQUEST, X!Tandem and Mascot; region labels: 9%, 22%, 4%, 34%, 19%, 7%, 5%]
Courtesy: Proteome Software Inc.

Sequest, Mascot and X!Tandem scores
SEQUEST: XCorr>2.5, DeltaCn>0.1
Mascot: Ion Score-Identity Score>0
X!Tandem: E-Value<0.01
How do we compare these?
That's where Scaffold comes in
Courtesy: Proteome Software Inc.
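The three cut-offs above are on different scales, which the following sketch makes concrete by applying each engine's rule to a score record. The function name, the dictionary keys and the example values are hypothetical; only the thresholds come from the slide.

```python
def passes_threshold(engine, scores):
    """Apply the per-engine cut-offs listed on the slide to one peptide match."""
    if engine == "sequest":
        return scores["xcorr"] > 2.5 and scores["delta_cn"] > 0.1
    if engine == "mascot":
        return scores["ion_score"] - scores["identity_score"] > 0
    if engine == "xtandem":
        return scores["e_value"] < 0.01
    raise ValueError(f"unknown search engine: {engine}")


# Hypothetical scores for the same spectrum from three engines.
print(passes_threshold("sequest", {"xcorr": 3.1, "delta_cn": 0.25}))
print(passes_threshold("mascot", {"ion_score": 38.0, "identity_score": 41.0}))
print(passes_threshold("xtandem", {"e_value": 0.002}))
```

A pass/fail rule per engine cannot say how much more confident one match is than another, which is the motivation for converting all scores to probabilities.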

Scaffold workflow
1. Get SEQUEST, Mascot and X!Tandem IDs
2. Calculate SEQUEST, Mascot and X!Tandem probabilities (Peptide Prophet*)
3. For each spectrum, calculate the combined peptide probability (Scaffold Merger)
4. Calculate protein probabilities (Protein Prophet*)
Scaffold uses Nesvizhskii's algorithm to convert SEQUEST and Mascot scores to peptide probabilities.
Scaffold uses another algorithm by Nesvizhskii to combine peptide probabilities.
*Nesvizhskii, A. I. et al, Anal. Chem. 2003, 75, 4646-4658
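One simple way to see how peptide probabilities can be combined into a protein probability is the "at least one peptide is correct" rule, P = 1 - product of (1 - p_i). The sketch below implements only that simplified rule, under the assumption that peptide identifications are independent; the published ProteinProphet model and Scaffold's implementation include further adjustments (for example for peptides shared between proteins) that are not reproduced here. The function name is illustrative.

```python
def protein_probability(peptide_probabilities):
    """Probability that at least one peptide assigned to the protein is correct."""
    all_wrong = 1.0
    for p in peptide_probabilities:
        all_wrong *= (1.0 - p)
    return 1.0 - all_wrong


# Two moderately confident peptides together support the protein more strongly.
print(protein_probability([0.5, 0.5]))
print(protein_probability([0.9, 0.8, 0.6]))
```

This also shows why a minimum number of unique peptides is a useful filter: a single peptide can never raise the protein probability above its own.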
Scaffold View
Click on Scaffold and import the file MSX50.sfd
We will be able to import this file into Protein Center after exporting it in ProtXML file format (MSX50.xml)
Consensus Study Conducted by BAC
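Consensus methods compared (M = Mascot, S = Sequest, X = X!Tandem):
Mascot only (MO)
Sequest only (SO)
X!Tandem only (XO)
Union of M & S (MSU)
Union of M & X (MXU)
Union of S & X (SXU)
Union of M, S & X (MSXU)
Intersection of M & S (MSI)
Intersection of S & X (SXI)
Intersection of M & X (MXI)
Intersection of M, S & X (MSXI)
Evaluate the performance of each method using a standard protein dataset
Performance measures are the sensitivity and specificity of each method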
Sultana T, Jordan R, Lyons-Weiler J. 2009. Optimization of
the use of consensus methods for the detection and
putative identification of peptides via mass spectrometry
using protein standard mixtures. J Proteomics Bioinform
2: 263-273.
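The union and intersection methods, and the two performance measures, map directly onto set operations. The sketch below uses small invented peptide sets and an invented "known standard" to show the calculation; none of the identifiers or numbers come from the published study, and the function name is illustrative.

```python
def sensitivity_specificity(identified, true_set, universe):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = len(identified & true_set)
    fn = len(true_set - identified)
    fp = len(identified - true_set)
    tn = len(universe - identified - true_set)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical identifications from each engine.
mascot = {"pepA", "pepB", "pepC"}
sequest = {"pepB", "pepC", "pepD"}
xtandem = {"pepC", "pepE"}

consensus = {
    "MSU": mascot | sequest,
    "MSI": mascot & sequest,
    "MSXU": mascot | sequest | xtandem,
    "MSXI": mascot & sequest & xtandem,
}

# Hypothetical ground truth from a protein standard, and all candidate peptides.
true_peptides = {"pepA", "pepB", "pepC"}
all_candidates = {"pepA", "pepB", "pepC", "pepD", "pepE", "pepF"}

for name, identified in consensus.items():
    sens, spec = sensitivity_specificity(identified, true_peptides, all_candidates)
    print(name, round(sens, 2), round(spec, 2))
```

In this toy example the unions recover more true peptides at the cost of more false ones, while the intersections do the opposite, which is the trade-off the study set out to quantify.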
Scaffold confidence filter settings
Minimum Protein: the minimum probability that a protein's identification is correct
(20, 50, 80, 90, 95, 99, 99.9)
Minimum # Peptides: filters results by the number of unique peptides on which the identification is based
(1, 2, 3, 4, 5)
Minimum Peptide: requires a minimum probability from at least one spectrum
(0, 20, 50, 80, 90, 95)
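Labels such as MSXU-95_2_50, used later in this deck, encode a consensus method plus the three filter settings above. The sketch below parses such a label and applies the thresholds to a list of protein records. The record layout, the function names and the example proteins are hypothetical, and counting only peptides that meet the peptide threshold is one plausible reading of the filters, not a description of Scaffold's internals.

```python
def parse_label(label):
    """Split e.g. 'MSXU-95_2_50' into (method, min protein %, min peptides, min peptide %)."""
    method, thresholds = label.split("-")
    protein, peptides, peptide = thresholds.split("_")
    return method, float(protein), int(peptides), float(peptide)


def filter_proteins(proteins, min_protein, min_peptides, min_peptide):
    """Keep proteins that satisfy all three confidence thresholds."""
    kept = []
    for protein in proteins:
        good_peptides = [p for p in protein["peptide_probs"] if p >= min_peptide]
        if protein["protein_prob"] >= min_protein and len(good_peptides) >= min_peptides:
            kept.append(protein["name"])
    return kept


# Hypothetical protein records (probabilities in percent).
proteins = [
    {"name": "P1", "protein_prob": 99.0, "peptide_probs": [95.0, 80.0]},
    {"name": "P2", "protein_prob": 96.0, "peptide_probs": [60.0]},
    {"name": "P3", "protein_prob": 85.0, "peptide_probs": [90.0, 90.0, 90.0]},
]

for label in ("MSXU-95_2_50", "MSU-80_1_50", "MO-99_3_50"):
    method, min_protein, min_peptides, min_peptide = parse_label(label)
    print(label, filter_proteins(proteins, min_protein, min_peptides, min_peptide))
```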
GPCL-BAC recommendations
(Sultana et al., 2009 publication)
Most Accurate peptide identification
Union of Sequest, Mascot & X!Tandem (MSXU)
Scaffold filter: 95% protein probability, 2 minimum unique peptides & 50%
peptide probability.
Most Sensitive peptide identification
Union of Sequest & Mascot (MSU) or union of Sequest & X!Tandem
(MSXU) or Sequest only (SO)
Most Specific peptide identification
Union of Mascot & X!Tandem (MXU) or Mascot only (MO)
Sultana T, Jordan R, Lyons-Weiler J. 2009. Optimization of the use of consensus methods

for the detection and putative identification of peptides via mass spectrometry using
protein standard mixtures. J Proteomics Bioinform 2: 263-273.
Sensitivity and specificity trade off

Biology of MICA Protein in Human
Sarcoma Cells
Final Research Report

March 5, 2010
PI: ____ _____
GPCL-Bioinformatics Analysis Core Analysts:
Tamanna Sultana and James Lyons-Weiler
Project Location: http://mygpcl/
Example of consensus peptide ID data analysis and reporting

Specific Aims (obtained from PI)
ID proteins in sample
Identification of other closely interacting proteins with
MICA in human sarcoma cells
Study Details
# Samples: protein extract of osteo-sarcoma cell lines
SCH2473A8MA3_sample 1
SCH2473A9MA3_sample 2
Sample preparation:
Stably over-expressed MICA: immunoprecipitated with MICA
antibody, and the complex was then pulled down
~5 µg/10 µL protein was reduced with TCEP, alkylated with
iodoacetamide and digested with trypsin.
LCQ-Deca-XL (LC-ESI-MS) was used for MS and MS/MS data
generation
# Data sets generated

SCH2473A8MA3-SpectraFile.RAW
SCH2473A9MA3-SpectraFile.RAW
Previous Analysis Summary
Sequest search results provided by Manny
Schreiber from Proteomics Lab
GPCL-BAC Profile Data Analysis
Start with the LC-MS Raw data file

Convert to peak list (MGF file format)
Run Mascot and X!Tandem search
Using Scaffold, combine the search results of
Sequest, Mascot and X!Tandem
Provide protein lists
(Most) Accurate (list 1)
(Most) Sensitive (list 2)
(Most) Specific (list 3)
SCH2473A8MA3.RAW
SCH2473A9MA3.RAW
Database Search Parameters
Database
IPI human v. 3.57
Search algorithms
Mascot, Sequest & X!Tandem
Modifications (variable)
Carbamidomethyl (+57 @C) and oxidation (+16 @M)
Missed cleavages: 2 maximum
Error tolerance: 2 Da on both parent and fragment ions
Peak list conversion
Raw files were converted into Mascot generic format (MGF)
peak lists using extract_msn, provided with the Xcalibur software of
the LCQ instrument
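For readers who have not seen an MGF peak list, the sketch below parses a small example in Python. The BEGIN IONS / END IONS blocks, KEY=value headers and "m/z intensity" peak lines follow the general MGF layout; the spectrum itself, the title and the function name are invented for illustration, and production work would normally rely on an established parser rather than this minimal one.

```python
def parse_mgf(text):
    """Parse MGF text into a list of {'params': {...}, 'peaks': [(mz, intensity), ...]}."""
    spectra = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                parts = line.split()
                current["peaks"].append((float(parts[0]), float(parts[1])))
    return spectra


# Hypothetical single-spectrum MGF content.
EXAMPLE_MGF = """BEGIN IONS
TITLE=example_scan_1
PEPMASS=785.4
CHARGE=2+
175.1 320.0
684.4 150.5
1056.5 410.2
END IONS
"""

for spectrum in parse_mgf(EXAMPLE_MGF):
    print(spectrum["params"]["PEPMASS"], len(spectrum["peaks"]))
```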
Accurate list for sample1 (MSXU-95_2_50)
Biological sample name | Protein name | Protein accession numbers | Protein molecular weight (Da) | Protein identification probability | # unique peptides | # unique spectra | # total spectra | % total spectra | % sequence coverage
MascotA8 | Calicin | IPI00299881 | 66,564.70 | 100.00% | 16 | 16 | 17 | 1.60% | 25.20%
MascotA8 | Putative uncharacterized protein (Fragment) | IPI00816622 | 9,137.30 | 100.00% | 10 | 10 | 13 | 1.22% | 69.90%
MascotA8 | tudor domain containing 10 isoform a | IPI00432733,IPI00514618 | 40,923.70 | 100.00% | 7 | 7 | 9 | 0.85% | 15.30%
MascotA8 | Protein | IPI00513900 | 58,318.90 | 100.00% | 20 | 22 | 23 | 2.16% | 20.00%
MascotA8 | Isoform 2 of Tropomyosin alpha-4 chain | IPI00216975 | 32,705.70 | 100.00% | 18 | 19 | 20 | 1.88% | 35.90%
MascotA8 | Isoform 2 of Protein spire homolog 1 | IPI00645268 | 83,940.80 | 100.00% | 34 | 36 | 40 | 3.76% | 31.70%
MascotA8 | similar to hCG1820764 | IPI00741841 | 11,265.40 | 99.90% | 5 | 5 | 8 | 0.75% | 50.50%
MascotA8 | Protein FAM26E | IPI00166835 | 35,152.50 | 99.80% | 5 | 6 | 16 | 1.50% | 9.39%
MascotA8 | Ras-related protein Rab-5A | IPI00023510 | 23,640.80 | 99.40% | 4 | 5 | 9 | 0.85% | 23.70%
Isoform 1 of Tropomyosin alpha-4 ...
Sensitive list for sample1 (MSU-80_1_50)
Biological sample name | Protein name | Protein accession numbers | Protein molecular weight (Da) | Protein identification probability | # unique peptides | # unique spectra | # total spectra | % total spectra | % sequence coverage
MascotA8 | Putative uncharacterized protein (Fragment) | IPI00816622 | 9,137.30 | 100.00% | 10 | 10 | 13 | 1.22% | 69.90%
MascotA8 | tudor domain containing 10 isoform a | IPI00432733,IPI00514618 | 40,923.70 | 100.00% | 7 | 7 | 9 | 0.85% | 15.30%
MascotA8 | Isoform 2 of Tropomyosin alpha-4 chain | IPI00216975 | 32,705.70 | 100.00% | 18 | 19 | 20 | 1.88% | 35.90%
MascotA8 | Isoform 2 of Protein spire homolog 1 | IPI00645268 | 83,940.80 | 100.00% | 34 | 36 | 40 | 3.76% | 31.70%
MascotA8 | similar to hCG1820764 | IPI00741841 | 11,265.40 | 99.90% | 5 | 5 | 8 | 0.75% | 50.50%

Specific list for sample1 (MO-99_3_50)
Biological sample name | Protein name | Protein accession numbers | Protein molecular weight (Da) | Protein identification probability | # unique peptides | # unique spectra | # total spectra | % total spectra | % sequence coverage
MascotA8 | Putative uncharacterized protein (Fragment) | IPI00816622 | 9,137.30 | 100.00% | 10 | 10 | 13 | 1.22% | 69.90%
MascotA8 | tudor domain containing 10 isoform a | IPI00432733,IPI00514618 | 40,923.70 | 100.00% | 7 | 7 | 9 | 0.85% | 15.30%
Isoform 2 of Tropomyosin alpha-4 ...
MascotA8 | Isoform 2 of Protein spire homolog 1 | IPI00645268 | 83,940.80 | 100.00% | 34 | 36 | 40 | 3.76% | 31.70%
MascotA8 | similar to hCG1820764 | IPI00741841 | 11,265.40 | 99.90% | 5 | 5 | 8 | 0.75% | 50.50%
MascotA8 | Ras-related protein Rab-5A | IPI00023510 | 23,640.80 | 99.40% | 4 | 5 | 9 | 0.85% | 23.70%
Isoform 1 of Tropomyosin alpha-4 ...
Venn diagram for sample1 using sensitive list
[Figure: Venn diagram of search-engine overlap for sample 1; labels shown: SEQUEST, 29%, 0%, 3%, 0%, 7%]
Accurate list for sample 2 (MSXU-95_2_50)
Biological sample name | Protein name | Protein accession numbers | Protein molecular weight (Da) | Protein identification probability | # unique peptides | # unique spectra | # total spectra | % total spectra | % sequence coverage
MascotA9 | annexin A2 isoform 1 | IPI00418169 | 40,395.30 | 100.00% | 6 | 6 | 7 | 0.57% | 24.60%
MascotA9 | Isoform 2 of Mucin-19 | IPI00829833,IPI00896516 | 564,046.60 | 100.00% | 11 | 11 | 11 | 0.89% | 3.27%
MascotA9 | Protein | IPI00916368 | 145,322.20 | 100.00% | 5 | 5 | 6 | 0.49% | 8.37%
MascotA9 | Isoform 3 of HEAT repeat-containing protein 5B | IPI00333696,IPI00479069 | 214,978.60 | 100.00% | 7 | 7 | 7 | 0.57% | 4.89%
MascotA9 | Isoform 1 of AT-hook-containing transcription factor 1 | IPI00170594,IPI00217957,IPI00876984,IPI00878213 | 253,444.20 | 100.00% | 8 | 8 | 8 | 0.65% | 5.23%
MascotA9 | Zinc finger protein 282 | IPI00003798 | 74,277.40 | 99.90% | 5 | 5 | 7 | 0.57% | 12.10%
MascotA9 | Interferon-induced protein with tetratricopeptide repeats 3 | IPI00024254 | 55,968.00 | 99.90% | 3 | 3 | 3 | 0.24% | 11.20%
MascotA9 | Isoform 1 of Transcription factor TFIIIB component B'' homolog | IPI00760877,IPI00893272 | 293,875.60 | 99.90% | 6 | 6 | 7 | 0.57% | 4.84%
Sensitive list for sample 2 (MSU-80_1_50)
Biological sample name | Protein name | Protein accession numbers | Protein molecular weight (Da) | Protein identification probability | # unique peptides | # unique spectra | # total spectra | % total spectra | % sequence coverage
MascotA9 | annexin A2 isoform 1 | IPI00418169 | 40,395.30 | 100.00% | 6 | 6 | 7 | 0.57% | 24.60%
MascotA9 | Isoform 2 of Mucin-19 | IPI00829833,IPI00896516 | 564,046.60 | 100.00% | 11 | 11 | 11 | 0.89% | 3.27%
MascotA9 | Isoform 3 of HEAT repeat-containing protein 5B | IPI00333696,IPI00479069 | 214,978.60 | 100.00% | 7 | 7 | 7 | 0.57% | 4.89%
MascotA9 | Isoform 1 of AT-hook-containing transcription factor 1 | IPI00170594,IPI00217957,IPI00876984,IPI00878213 | 253,444.20 | 100.00% | 8 | 8 | 8 | 0.65% | 5.23%
MascotA9 | Zinc finger protein 282 | IPI00003798 | 74,277.40 | 99.90% | 5 | 5 | 7 | 0.57% | 12.10%
MascotA9 | Interferon-induced protein with tetratricopeptide repeats 3 | IPI00024254 | 55,968.00 | 99.90% | 3 | 3 | 3 | 0.24% | 11.20%
Isoform 1 of Transcription ...
Specific list for sample 2 (MO-99_3_50)
Biological sample name | Protein name | Protein accession numbers | Protein molecular weight (Da) | Protein identification probability | # unique peptides | # unique spectra | # total spectra | % total spectra | % sequence coverage
MascotA9 | annexin A2 isoform 1 | IPI00418169 | 40,395.30 | 100.00% | 6 | 6 | 7 | 0.57% | 24.60%
MascotA9 | Isoform 2 of Mucin-19 | IPI00829833,IPI00896516 | 564,046.60 | 100.00% | 11 | 11 | 11 | 0.89% | 3.27%
MascotA9 | Protein | IPI00916368 | 145,322.20 | 100.00% | 5 | 5 | 6 | 0.49% | 8.37%
MascotA9 | Isoform 3 of HEAT repeat-containing protein 5B | IPI00333696,IPI00479069 | 214,978.60 | 100.00% | 7 | 7 | 7 | 0.57% | 4.89%
MascotA9 | Isoform 1 of AT-hook-containing transcription factor 1 | IPI00170594,IPI00217957,IPI00876984,IPI00878213 | 253,444.20 | 100.00% | 8 | 8 | 8 | 0.65% | 5.23%
MascotA9 | Zinc finger protein 282 | IPI00003798 | 74,277.40 | 99.90% | 5 | 5 | 7 | 0.57% | 12.10%
MascotA9 | Interferon-induced protein with tetratricopeptide repeats 3 | IPI00024254 | 55,968.00 | 99.90% | 3 | 3 | 3 | 0.24% | 11.20%
MascotA9 | Isoform 1 of Transcription factor TFIIIB component B'' homolog | IPI00760877,IPI00893272 | 293,875.60 | 99.90% | 6 | 6 | 7 | 0.57% | 4.84%
MascotA9 | Metallothionein-2 | IPI00022498 | 6,023.60 | 99.90% | 5 | 5 | 7 | 0.57% | 90.20%
Venn diagram for sample 2 using sensitive list
[Figure: Venn diagram of search-engine overlap for sample 2; labels shown: SEQUEST, 35%, 0%, 13%, 3%, 2%]
Size of the consensus set (# of proteins
identified) for each consensus method
Consensus method | Sample 1 | Sample 2
MSXU-95_2_50 (Accurate) | 10 | 27
MSU-80_1_50 (Sensitive) | 23 | 64
MO-99_3_50 (Specific) | 10 | 18
Sample 1 consensus list for each consensus
method
Proteins present in all three lists (Accurate, Specific and Sensitive):
Calicin
Putative uncharacterized protein (Fragment)
tudor domain containing 10 isoform a
Protein
Isoform 2 of Tropomyosin alpha-4 chain
Isoform 2 of Protein spire homolog 1
similar to hCG1820764
Protein FAM26E
Ras-related protein Rab-5A
Isoform 1 of Tropomyosin alpha-4 chain
Additional proteins shown in the Sensitive protein list:
Isoform 2 of Membrane-associated guanylate kinase, WW and PDZ domain-containing protein 2
Corrupt Accession: PI:IPI00890727.1|SWISS-PROT:Q8NB90-2
Isoform 1 of Uncharacterized protein KIAA1107
Isoform 2 of Protein FAM184A
Isoform 2 of Metallothionein-1G
GTPase IMAP family member 8
Isoform 1 of WSC domain-containing protein 2
similar to GOLGA8A protein
Sample 2 consensus list for each consensus
method
Accurate protein list | Specific protein list | Sensitive protein list
annexin A2 isoform 1 | annexin A2 isoform 1 | annexin A2 isoform 1
Isoform 2 of Mucin-19 | Isoform 2 of Mucin-19 | Isoform 2 of Mucin-19
Protein IPI00916368 | Protein IPI00916368 | Protein IPI00916368
Isoform 3 of HEAT repeat-containing protein 5B | Isoform 3 of HEAT repeat-containing protein 5B | Isoform 3 of HEAT repeat-containing protein 5B
Isoform 1 of AT-hook-containing transcription factor 1 | Isoform 1 of AT-hook-containing transcription factor 1 | Isoform 1 of AT-hook-containing transcription factor 1
Zinc finger protein 282 | Zinc finger protein 282 | Zinc finger protein 282
Interferon-induced protein with tetratricopeptide repeats 3 | Interferon-induced protein with tetratricopeptide repeats 3 | Interferon-induced protein with tetratricopeptide repeats 3
Isoform 1 of Transcription factor TFIIIB component B'' homolog | Isoform 1 of Transcription factor TFIIIB component B'' homolog | Isoform 1 of Transcription factor TFIIIB component B'' homolog
Metallothionein-2 | Metallothionein-2 | Metallothionein-2
Vimentin | Vimentin | Vimentin
Isoform 1 of Disabled homolog 2-interacting protein | Isoform 1 of Disabled homolog 2-interacting protein | Isoform 1 of Disabled homolog 2-interacting protein
Similar to Signal peptidase complex subunit 2 | Similar to Signal peptidase complex subunit 2 | Similar to Signal peptidase complex subunit 2
Vimentin | Isoform 2 of Protein Dok-7 | Vimentin
Annexin A5 | Isoform GTBP-N of DNA mismatch repair protein Msh6 | Annexin A5
Isoform 2 of Protein Dok-7 | Isoform 1 of HEAT repeat-containing protein 5A | Isoform 2 of Protein Dok-7
Isoform GTBP-N of DNA mismatch repair protein Msh6 | Isoform 1 of Cullin-4A | Isoform GTBP-N of DNA mismatch repair protein Msh6
Methods Text
Sample Preparation
Protein extracts of osteo-sarcoma cell lines, SCH2473A8MA3_sample 1 and
SCH2473A9MA3_sample 2, were processed in the Genomics and Proteomics Core
Laboratories (GPCL) at the University of Pittsburgh prior to performing tandem mass
spectrometry. In brief, immunoprecipitated samples provided by the PI's lab were reduced with
tris-2-carboxyethyl-phosphine (TCEP), alkylated with iodoacetamide (IAC), and digested with
trypsin (Promega). The ESI-MS and information-dependent acquisition (IDA) MS/MS spectra were
acquired at GPCL with an LCQ-Deca-XL coupled with a nano-LC system (Thermo Scientific,
Waltham, MA). The IDA was set so that MS/MS was done on the three most intense peaks per
cycle.
Database Search
Experimental raw spectra files were searched against the human database, IPI Human
v3.57, to identify peptides using the following three search algorithms: Sequest, Mascot, and
X!Tandem. The search parameters for candidate peptides were: precursor ion
tolerance: 2 Da; fragment ion tolerance: 2 Da; variable modifications: carbamidomethyl on
cysteine and oxidation on methionine; maximum missed cleavages: 2. For the Mascot and
X!Tandem searches, the raw files were converted to MGF peak lists.
Merging the Data
Files containing database search results derived from Mascot, Sequest, and X!Tandem were
imported into Scaffold. The software then merged the peptide lists identified by all three
search algorithms, and re-scored and re-ranked them. Scaffold uses PeptideProphet and
ProteinProphet, which employ Bayesian statistics to combine the probability of identifying
spectra with the probability that all search methods agree with each other.
Protein List Generation
Accurate list: union of Mascot, Sequest and X!Tandem (MSXU) at 95% minimum protein
probability, 2 minimum unique peptides and 50% minimum peptide probability
Sensitive list: union of Mascot and Sequest (MSU) at 80% minimum protein probability, 1
minimum unique peptide and 50% minimum peptide probability
Specific list: Mascot only (MO) at 99% minimum protein probability, 3 minimum unique peptides
and 50% minimum peptide probability
Pathway express
Pathway Express analysis
Convert the protein accession numbers to GenBank
accession IDs using DAVID
That is your input file for Pathway Express
Impacted Pathways (sample 1)
Rank | Database Name | Pathway Name | Impact Factor | #Genes in Pathway | #Input Genes in Pathway | #Pathway Genes on Chip | %Input Genes in Pathway | %Pathway Genes in Input | p-value
68 | KEGG | Ribosome | 0.669 | 101 | 40 | 74 | 15.385 | 39.604 | -3.77E-13
5 | KEGG | Parkinson's disease | 25.022 | 137 | 15 | 101 | 5.769 | 10.949 | 5.26E-11
6 | KEGG | Alzheimer's disease | 22.184 | 178 | 16 | 145 | 6.154 | 8.989 | 1.11E-09
7 | KEGG | Cardiac muscle contraction | 20.419 | 87 | 11 | 69 | 4.231 | 12.644 | 9.58E-09
8 | KEGG | Huntington's disease | 16.844 | 189 | 14 | 154 | 5.385 | 7.407 | 1.48E-07
9 | KEGG | Focal adhesion | 16.757 | 203 | 15 | 189 | 5.769 | 7.389 | 3.15E-07
10 | KEGG | $hsa05131$ | 13.673 | 54 | 7 | 46 | 2.692 | 12.963 | 6.95E-06
11 | KEGG | Pathogenic Escherichia coli infection | 13.673 | 54 | 7 | 46 | 2.692 | 12.963 | 6.95E-06
3 | KEGG | Antigen processing and presentation | 51.553 | 89 | 8 | 68 | 3.077 | 8.989 | 1.10E-05
12 | KEGG | ECM-receptor interaction | 12.773 | 84 | 8 | 76 | 3.077 | 9.524 | 2.51E-05
15 | KEGG | Allograft rejection | 10.079 | 38 | 5 | 31 | 1.923 | 13.158 | 1.12E-04
14 | KEGG | Graft-versus-host disease | 10.315 | 42 | 5 | 32 | 1.923 | 11.905 | 1.32E-04
17 | KEGG | Type I diabetes mellitus | 9.624 | 44 | 5 | 34 | 1.923 | 11.364 | 1.77E-04
18 | KEGG | Autoimmune thyroid disease | 8.285 | 53 | 5 | 45 | 1.923 | 9.434 | 6.76E-04
19 | KEGG | Prostate cancer | 7.728 | 90 | 6 | 81 | 2.308 | 6.667 | 0.001729
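The two percentage columns in the table can be reproduced from the gene counts, which helps when reading the report. The sketch below is a reader's reconstruction, not Pathway-Express code. The total of 260 input genes is an assumption inferred from the table (for example, 40 input genes correspond to 15.385%); it is not stated in the source. The rows shown are copied from the table.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to 3 decimals."""
    return round(100.0 * part / whole, 3)


# Assumption: total number of input genes, inferred from the table values.
TOTAL_INPUT_GENES = 260

# (pathway name, #genes in pathway, #input genes in pathway) from the table.
rows = [
    ("Ribosome", 101, 40),
    ("Parkinson's disease", 137, 15),
    ("Prostate cancer", 90, 6),
]

report = {}
for name, genes_in_pathway, input_genes in rows:
    report[name] = {
        # Share of the input list that falls in this pathway.
        "pct_input_genes_in_pathway": percent(input_genes, TOTAL_INPUT_GENES),
        # Share of the pathway covered by the input list.
        "pct_pathway_genes_in_input": percent(input_genes, genes_in_pathway),
    }
```

Note that the rows are ordered by p-value, while the Rank column appears to follow descending Impact Factor, which is why rank 3 sits in the middle of the table.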
Impacted Pathways (Sample 1)
[Figure legend: under-expressed, over-expressed]
Protein Center v 3.0.4
A proteomics bioinformatics tool available through HSLS
Import and view a protein dataset, export desired info
Go to
http://hsls.proteincenter.proxeon.com/ProXweb/
Log in
Click on the file MSX50.xml
Explore the peptide, protein, and cluster views
Learn how to select information of interest, export
files or create a report
Learn how to compare two datasets (A8.xml and
A9.xml)
Bioinformatics and statistical analysis
Single protein
Look up using Accession Key 6807647
Dataset: human red blood cell (hRBC) dataset
Go to Datasets > Proxeon > Tutorials > Data set comparison and click on hRBC_proteome
For further help, email Protein Center support at proteincenter-support@proxeon.com
Acknowledgement
Genomics and Proteomics Core Laboratories
James Lyons-Weiler, Director of BAC
Rick Jordan, programmer of BAC

