
Resource

Personal Omics Profiling Reveals Dynamic Molecular and Medical Phenotypes


Rui Chen,1,11 George I. Mias,1,11 Jennifer Li-Pook-Than,1,11 Lihua Jiang,1,11 Hugo Y.K. Lam,1,12 Rong Chen,2,12 Elana Miriami,1 Konrad J. Karczewski,1 Manoj Hariharan,1 Frederick E. Dewey,3 Yong Cheng,1 Michael J. Clark,1 Hogune Im,1 Lukas Habegger,6,7 Suganthi Balasubramanian,6,7 Maeve O'Huallachain,1 Joel T. Dudley,2 Sara Hillenmeyer,1 Rajini Haraksingh,1 Donald Sharon,1 Ghia Euskirchen,1 Phil Lacroute,1 Keith Bettinger,1 Alan P. Boyle,1 Maya Kasowski,1 Fabian Grubert,1 Scott Seki,2 Marco Garcia,2 Michelle Whirl-Carrillo,1 Mercedes Gallardo,9,10 Maria A. Blasco,9 Peter L. Greenberg,4 Phyllis Snyder,1 Teri E. Klein,1 Russ B. Altman,1,5 Atul J. Butte,2 Euan A. Ashley,3 Mark Gerstein,6,7,8 Kari C. Nadeau,2 Hua Tang,1 and Michael Snyder1,*
1Department of Genetics, Stanford University School of Medicine
2Division of Systems Medicine and Division of Immunology and Allergy, Department of Pediatrics
3Center for Inherited Cardiovascular Disease, Division of Cardiovascular Medicine
4Division of Hematology, Department of Medicine
5Department of Bioengineering
Stanford University, Stanford, CA 94305, USA
6Program in Computational Biology and Bioinformatics
7Department of Molecular Biophysics and Biochemistry
8Department of Computer Science
Yale University, New Haven, CT 06520, USA
9Telomeres and Telomerase Group, Molecular Oncology Program, Spanish National Cancer Centre (CNIO), Madrid E-28029, Spain
10Life Length, Madrid E-28003, Spain
11These authors contributed equally to this work
12Present address: Personalis, Palo Alto, CA 94301, USA
*Correspondence: mpsnyder@stanford.edu
DOI 10.1016/j.cell.2012.02.009

Cell 148, 1293–1307, March 16, 2012 ©2012 Elsevier Inc.

SUMMARY

Personalized medicine is expected to benefit from combining genomic information with regular monitoring of physiological states by multiple high-throughput methods. Here, we present an integrative personal omics profile (iPOP), an analysis that combines genomic, transcriptomic, proteomic, metabolomic, and autoantibody profiles from a single individual over a 14 month period. Our iPOP analysis revealed various medical risks, including type 2 diabetes. It also uncovered extensive, dynamic changes in diverse molecular components and biological pathways across healthy and diseased conditions. Extremely high-coverage genomic and transcriptomic data, which provide the basis of our iPOP, revealed extensive heteroallelic changes during healthy and diseased states and an unexpected RNA editing mechanism. This study demonstrates that longitudinal iPOP can be used to interpret healthy and diseased states by connecting genomic information with additional dynamic omics activity.

INTRODUCTION

Personalized medicine aims to assess medical risks, monitor, diagnose and treat patients according to their specific genetic composition and molecular phenotype. The advent of genome sequencing and the analysis of physiological states has proven to be powerful (Cancer Genome Atlas Research Network, 2011). However, its implementation for the analysis of otherwise healthy individuals for estimation of disease risk and medical interpretation is less clear. Much of the genome is difficult to interpret and many complex diseases, such as diabetes, neurological disorders and cancer, likely involve a large number of different genes and biological pathways (Ashley et al., 2010; Grayson et al., 2011; Li et al., 2011), as well as environmental contributors that can be difficult to assess. As such, the combination of genomic information along with a detailed molecular analysis of samples will be important for predicting, diagnosing and treating diseases as well as for understanding the onset, progression, and prevalence of disease states (Snyder et al., 2009). Presently, healthy and diseased states are typically followed using a limited number of assays that analyze a small number of markers of distinct types. With the advancement of many new technologies, it is now possible to analyze upward of 10^5 molecular constituents. For example, DNA microarrays have allowed the subcategorization of lymphomas and gliomas

(Mischel et al., 2003), and RNA sequencing (RNA-Seq) has identified breast cancer transcript isoforms (Li et al., 2011; van der Werf et al., 2007; Wu et al., 2010; Lapuk et al., 2010). Although transcriptome and RNA splicing profiling are powerful and convenient, they provide a partial portrait of an organism's physiological state. Transcriptomic data, when combined with genomic, proteomic, and metabolomic data, are expected to provide a much deeper understanding of normal and diseased states (Snyder et al., 2010). To date, comprehensive integrative omics profiles have been limited and have not been applied to the analysis of generally healthy individuals. To obtain a better understanding of: (1) how to generate an integrative personal omics profile (iPOP) and examine as many biological components as possible, (2) how these components change during healthy and diseased states, and (3) how this information can be combined with genomic information to estimate disease risk and gain new insights into diseased states, we performed extensive omics profiling of blood components from a generally healthy individual over a 14 month period (24 months total when including time points with other molecular analyses). We determined the whole-genome sequence (WGS) of the subject, and together with transcriptomic, proteomic, metabolomic, and autoantibody profiles, used this information to generate an iPOP. We analyzed the iPOP of the individual over the course of healthy states and two viral infections (Figure 1A). Our results indicate that disease risk can be estimated by a whole-genome sequence and that, by regularly monitoring health states with iPOP, disease onset may also be observed. The wealth of information provided by detailed longitudinal iPOP revealed unexpected molecular complexity, which exhibited dynamic changes during healthy and diseased states, and provided insight into multiple biological processes. 
Detailed omics profiling coupled with genome sequencing can provide molecular and physiological information of medical significance. This approach can be generalized for personalized health monitoring and medicine.

RESULTS

Overview of Personal Omics Profiling

Our overall iPOP strategy was to: (1) determine the genome sequence at high accuracy and evaluate disease risks, (2) monitor omics components over time and integrate the relevant omics information to assess the variation of physiological states, and (3) examine in detail the expression of personal variants at the level of RNA and protein to study molecular complexity and dynamic changes in diseased states. We performed iPOP on blood components (peripheral blood mononuclear cells [PBMCs], plasma and sera that are highly accessible) from a 54-year-old male volunteer over the course of 14 months (IRB-8629). The samples used for iPOP were taken over an interval of 401 days (days 0–400). In addition, a complete medical exam plus laboratory and additional tests were performed before the study officially launched (day -123) and blood glucose was sampled multiple times after the comprehensive omics profiling (days 401–602) (Figure 1A). Extensive sampling was performed during two viral infections that occurred during this period: a human rhinovirus (HRV) infection beginning on

day 0 and a respiratory syncytial virus (RSV) infection starting on day 289. A total of 20 time points were extensively analyzed and a summary of the time course is indicated in Figure 1A. The different types of analyses performed are summarized in Figures 1B and 1C. These analyses, performed on PBMCs and/or serum components, included WGS, complete transcriptome analysis (providing information about the abundance of alternative spliced isoforms, heteroallelic expression, and RNA edits, as well as expression of miRNAs at selected time points), proteomic and metabolomic analyses, and autoantibody profiles. An integrative analysis of these data highlights dynamic omics changes and provides rich information about healthy and diseased phenotypes.

Whole-Genome Sequencing

We first generated a high quality genome sequence of this individual using a variety of different technologies. Genomic DNA was subjected to deep WGS using technologies from Complete Genomics (CG, 35 nt paired end) and Illumina (100 nt paired end) at 150- and 120-fold total coverage, respectively, exome sequencing using three different technologies to 80- to 100-fold average coverage (see Extended Experimental Procedures available online) and analysis using genotyping arrays and RNA sequencing. The vast majority of genomic sequences (91%) mapped to the hg19 (GRCh37) reference genome. However, because of the depth of our sequencing, we were able to identify sequences not present in the reference sequence. Assembly of the unmapped Illumina sequencing reads (60,434,531, 9% of the total) resulted in 1,425 (of 29,751) contigs (spanning 26 Mb) overlapping with RefSeq gene sequences that were not annotated in the hg19 reference genome. The remaining sequences appeared unique, including 2,919 exons expressed in the RNA-Seq data (e.g., Figure S1A). 
These results confirm that a large number of undocumented genetic regions exist in individual human genome sequences and can be identified by very deep sequencing and de novo assembly (Li et al., 2010). Our analysis detected many single nucleotide variants (SNVs), small insertions and deletions (indels) and structural variants (SVs; large insertions, deletions, and inversions relative to hg19) (summarized in Table 1 and Experimental Procedures). A total of 134,341 (4.1%) high-confidence SNVs are not present in dbSNP, indicating that they are very rare or private to the subject. Only 302 high-confidence indels reside within RefSeq protein coding exons and exhibit enrichments in multiples of three nucleotides (p < 0.0001). In addition to indels, 2,566 high-confidence SVs were identified (Experimental Procedures and Table S1) and 8,646 mobile element insertions were identified (Stewart et al., 2011). Analysis of the subject's mother's genome by comprehensive genome sequencing (as above) and imputation allowed a maternal/paternal chromosomal phasing of 92.5% of the subject's SNVs and indels (see Extended Experimental Procedures for details). Of 1,162 compound heterozygous mutations in genes, 139 contain predicted compound heterozygous deleterious and/or nonsense mutations. Phasing enabled the assembly of a personal genome sequence of very high confidence (c.f., Rozowsky et al., 2011).

Figure 1. Summary of Study


(A) Time course summary. The subject was monitored for a total of 726 days, during which there were two infections (red bar, HRV; green bar, RSV). The black bar indicates the period when the subject: (1) increased exercise, (2) ingested 81 mg of acetylsalicylic acid and ibuprofen tablets each day (the latter only during the first 6 weeks of this period), and (3) substantially reduced sugar intake. Blue numbers indicate fasted time points. (B) iPOP experimental design indicating the tissues and analyses involved in this study. (C) Circos (Krzywinski et al., 2009) plot summarizing iPOP. From outer to inner rings: chromosome ideogram; genomic data (pale blue ring), structural variants > 50 bp (deletions [blue tiles], duplications [red tiles]), indels (green triangles); transcriptomic data (yellow ring), expression ratio of HRV infection to healthy states; proteomic data (light purple ring), ratio of protein levels during HRV infection to healthy states; transcriptomic data (yellow ring), differential heteroallelic expression ratio of alternative allele to reference allele for missense and synonymous variants (purple dots) and candidate RNA missense and synonymous edits (red triangles, purple dots, orange triangles and green dots, respectively). See also Figure S1.

WGS-Based Disease Risk Evaluation

We identified variants likely to be associated with increased susceptibility to disease (Dewey et al., 2011). The list of high-confidence SNVs and indels was analyzed for rare alleles (<5% of the major allele frequency in Europeans) and for changes in genes with known Mendelian disease phenotypes (data summarized in Table 2), revealing that 51 and 4 of the rare coding SNV and indels, respectively, in genes present in OMIM are predicted

to lead to loss-of-function (Table S2A). This list of genes was further examined for medical relevance (Table S2A; example alleles are summarized in Figure 2A), and 11 were validated by Sanger sequencing. High interest genes include: (1) a mutation (E366K) in the SERPINA1 gene previously known in the subject, (2) a damaging mutation in TERT, associated with acquired aplastic anemia (Yamaguchi et al., 2005), and (3) variants associated with hypertriglyceridemia and diabetes, such as GCKR

Table 1. Summary and Breakdown of DNA Variants

Type                            Total Variants   Total High Confidence   Heterozygous High Confidence   Homozygous High Confidence
Total SNVs                      3,739,701        3,301,521               1,971,629                      1,329,892
Total gene-associated SNVs      1,312,780        1,183,847               717,485                        466,362
Total coding/UTR                49,017           44,542                  27,383                         17,159
Missense                        10,592           9,683                   5,944                          3,739
Nonsense                        83               73                      49                             24
Synonymous                      11,459           10,864                  6,747                          4,117
5′ UTR                          4,085            2,978                   1,802                          1,176
3′ UTR                          22,798           20,944                  12,841                         8,103
Intron                          1,263,763        1,139,305               690,102                        449,203
Ts/Tv                           -                2.14                    -                              -
dbSNP                           3,493,748        3,167,180               -                              -
Candidate private SNV           245,953          134,341                 -                              -
Indels (-107 to +36 bp)         1,022,901        216,776                 -                              -
  Coding                        3,263            302                     -                              -
Structural variants (>50 bp)    44,781           2,566                   -                              -
  In 1000G project (a)          4,434            1,967                   -                              -

High confidence values are from variants identified across multiple platforms (Illumina and CG) and/or Exome and RNA-Seq data. Annotations were based on variant call formatted (vcf) files for heterozygous calls: 0/1, reference (ref)/alternative (alt); 1/2, alt/alt and homozygous calls; 1/1, alt/alt; 1/, (alt/alt-incomplete call). Polyphen-2 was used to identify the location of the SNVs. (a) 1000G (1000 Genomes Project Consortium, 2010).
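The genotype notation in the note above and the Ts/Tv row of Table 1 can be illustrated with a short sketch. The function names and toy inputs are hypothetical and are not taken from the study's pipeline; the sketch assumes diploid genotype strings separated by "/" or "|" and SNVs given as (ref, alt) base pairs.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def zygosity(gt):
    """Classify a diploid VCF genotype string such as '0/1', '1/1' or '1/2'."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles or "" in alleles:
        return "incomplete"  # e.g. a call with a missing second allele
    first, second = alleles
    if first == second:
        return "homozygous_ref" if first == "0" else "homozygous_alt"
    return "heterozygous"  # covers ref/alt (0/1) and two different alt alleles (1/2)


def ts_tv_ratio(snvs):
    """Ratio of transitions to transversions for a list of (ref, alt) SNVs."""
    ts = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = len(snvs) - ts
    return ts / tv if tv else float("inf")
```

With real data the (ref, alt) pairs would come from the REF and ALT columns of the VCF records; here they are supplied by hand.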

(homozygous) (Vaxillaire et al., 2008), KCNJ11 (homozygous) (Hani et al., 1998), and TCF7 (heterozygous) (Erlich et al., 2009). Genetic disease risks were also assessed by the RiskOGram algorithm, which integrates information from multiple alleles associated with disease risk (Ashley et al., 2010) (Figure 2B). This analysis revealed a modestly elevated risk for coronary artery disease and significantly elevated risk levels of basal cell carcinoma (Figure 2B), hypertriglyceridemia, and type 2 diabetes (T2D) (Figures 2B and 2C). In addition to coding region variants we also analyzed genomic variants that may affect regulatory elements (transcription factors [TF]), which had not been attempted previously (Data S1). A total of 14,922 (of 234,980) SNVs lie in the motifs of 36 TFs known to be associated with the binding data (see Experimental Procedures), indicating that these are likely having a direct effect on TF binding. Comparison of SNPs that alter binding patterns of NFκB and Pol II sites (Kasowski et al., 2010) also revealed a number of other interesting regulatory variants, some of which are associated with human disease (e.g., EDIL) (Sun et al., 2010) (Figure S1B).

Medical Phenotypes Monitoring

Based on the above analysis of medically relevant variants and the RiskOGram, we monitored markers associated with high-risk disease phenotypes and performed additional medically relevant assays. Monitoring of glucose levels and HbA1c revealed the onset of T2D as diagnosed by the subject's physician (day 369, Figures 2A and 2C). The subject lacked many known factors associated with diabetes (nonsmoker; BMI = 23.9 and 21.7 on day 0 and day 511, respectively) and glucose levels were normal for the first

part of the study. However, glucose levels became elevated shortly after the RSV infection (after day 301) and remained so for several months (Figure 2D). High levels of glucose were further confirmed using glycated HbA1c measurements at two time points (days 329, 369) during this period (6.4% and 6.7%, respectively). After a dramatic change in diet, exercise and ingestion of low doses of acetylsalicylic acid, a gradual decrease in glucose (to ~93 mg/dl at day 602) and HbA1c levels to 4.7% was observed. Insulin resistance was not evident at day 322. The patient was negative for anti-GAD and anti-islet antibodies, and insulin levels correlated well with the fasted and nonfasted states (Figure S2C), consistent with T2D. These results indicate that a genome sequence can be used to estimate disease risk in a healthy individual, and by monitoring traits associated with that disease, disease markers can be detected and the phenotype treated. The subject carried a TERT mutation previously associated with aplastic anemia (Yamaguchi et al., 2005). However, measurements of telomere length suggested little or no decrease in telomere length and a modest increase in numbers of cells with short telomeres relative to age-matched controls (Figures S2A and S2B). Importantly, the patient and his 83-year-old mother share the same mutation but neither exhibits symptoms of aplastic anemia, indicating that this mutation does not always result in disease and is likely context specific in its effects. Consistent with the elevated hypertriglyceridemia risk, triglycerides were found to be high (321 mg/dl) at the beginning of the study. These levels were reduced (81–116 mg/dl) after regularly taking simvastatin (20 mg/day). We also examined the variants for their potential effects on drug response (see Extended Experimental Procedures). Among the alleles of interest (Figure 2A and Table S2B), two genotypes affecting the LPIN1 and SLC22A1 genes were associated with

Table 2. Summary of Disease-Related Rare Variants

Category                                  Count
Total high confidence rare SNVs           289,989
Coding                                    2,546
Missense                                  1,320
Synonymous                                1,214
Nonsense                                  11
Nonstop                                   1
Damaging or possibly damaging             233
Putative loss-of-function SNVs (a)        51
Total high confidence rare indels         51,248
Coding indels                             61
Frameshift indels                         27
miRNA indels                              3
miRNA target sequence indels              5
Putative loss-of-function indels (a)      4

(a) In curated Mendelian disease genes.

favorable (glucose lowering) responses to two diabetic drugs, rosiglitazone and metformin, respectively. We followed the levels of 51 cytokines along with the C-reactive protein (CRP) using ELISA assays, which revealed strong induction of proinflammatory cytokines and CRP during each infection (Figures 2E and 2F). We also observed a spike of many cytokines at day 12 after the RSV infection (day 301 overall). These data define the physiological states and serve as a valuable reference for the omic profiles integrated into a longitudinal map of healthy and diseased states described in the next sections. We also profiled autoantibodies during the HRV infection. Plasma and serum samples from the first four time points (days -123, 0, 4 and 21), along with plasma samples from 34 healthy controls were used to probe a protein microarray containing 9,483 unique human proteins spotted in duplicate. A total of 884 antigens with increased reactivity (Data S2) in the candidate plasma relative to healthy controls were found (p < 0.01, Benjamini-Hochberg p < 0.01). Among the potentially interesting results was high reactivity with DOK6, an insulin receptor binding protein (NCBI gene database). These results demonstrate that autoantibodies can be monitored and that information relevant to disease conditions can be found.

Dynamic Omics Analysis: Integrative Omics Profiling of Molecular Responses

We profiled the levels of transcripts, proteins, and metabolites across the HRV and RSV infections and healthy states using a variety of approaches. RNA-Seq of 20 time points generated over 2.67 billion uniquely mapped 101 b paired-end reads (123 million reads average per time point) and allowed for an analysis of the molecular complexity of the transcriptome in normal cells (PBMCs) at an unprecedented level. 
The relative levels of 6,280 proteins were also measured at 14 time points through differential labeling of samples using isobaric tandem mass tags (TMT), followed by liquid chromatography and mass spectrometry (LC-MS/MS) (Cox and Mann, 2010; Theodoridis

et al., 2011). A total of 3,731 PBMC proteins could be consistently monitored across most of the 14 time points (see Figure S3A and Data S3). In addition, 6,862 and 4,228 metabolite peaks were identified for the HRV and RSV infection, and a total of 1,020 metabolites were tracked for both infections (see Figure S4 and Data S4, [3]). Finally, as described below, we also analyzed miRNAs during the HRV infection. This wealth of omics information allowed us to examine detailed dynamic trends related directly to the physiological states of the individual and revealed enormous changes in biological processes that occurred during healthy and diseased states. For each profile (transcriptome, proteome, metabolome), we systematically searched for two types of nonrandom patterns: (1) correlated patterns over time and (2) single unusual events (i.e., spikes that may occur at any given time point, defined as statistically significantly high or low signal instances compared to what would be expected by chance). To perform this analysis, we developed a general scheme for integrated analysis of data (see Figure S5 and Extended Experimental Procedures for further details). We used a Fourier spectral analysis approach that both normalizes the various omics data on an equal basis for identifying the common trends and features, and also accounts for data set variability, uneven sampling, and data gaps, in order to detect real-time changes in any kind of omics activity at the differential time points (see Supplemental Information). Autocorrelations were calculated to assess nonrandomness of the time-series (p < 0.05 one-tailed based on simulated bootstrap nonparametric distribution by sampling with replacement of the original data, n > 100,000), with significant signals classified as autocorrelated (I). 
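The bootstrap test for nonrandomness described above can be sketched in a few lines. This is a simplified illustration, not the authors' spectral method: it assumes evenly spaced samples and uses the lag-1 autocorrelation as the test statistic, whereas the study's approach also handles uneven sampling and data gaps. The function names, the default of 10,000 resamples, and the seed are assumptions made for the example.

```python
import random


def lag1_autocorr(x):
    """Lag-1 autocorrelation of a sequence (0.0 for a constant sequence)."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    denom = sum(v * v for v in d)
    if denom == 0:
        return 0.0
    return sum(d[i] * d[i + 1] for i in range(n - 1)) / denom


def bootstrap_autocorr_pvalue(x, n_boot=10000, seed=0):
    """One-tailed p value for the observed lag-1 autocorrelation.

    The null distribution is built by resampling the original values with
    replacement, which destroys any temporal ordering.
    """
    rng = random.Random(seed)
    observed = lag1_autocorr(x)
    hits = 0
    for _ in range(n_boot):
        resampled = rng.choices(x, k=len(x))
        if lag1_autocorr(resampled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (hits + 1) / (n_boot + 1)
```

A series would be placed in class (I) under this sketch when the returned p value falls below 0.05; a steadily rising series, for instance, yields a high autocorrelation and a small p value.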
The remaining data were searched for spike events, which were classified as spike maxima (II) or spike minima (III) (p < 0.05 one-tailed based on differences from simulated, n > 100,000 random distribution of the time series). After classification, the data were agglomerated into hierarchical clusters (using correlation distance and average linkage) of common patterns and biological relevance was assessed through GO (Ashburner et al., 2000) analysis (Cytoscape [Smoot et al., 2011], BiNGO [Maere et al., 2005] p < 0.05, Benjamini-Hochberg [Benjamini and Hochberg, 1995] adjusted p < 0.05) and pathway analysis (Reactome [Croft et al., 2011] functional interaction [FI], networks including KEGG [Kanehisa and Goto, 2000; Smoot et al., 2011], p < 0.05, FDR < 0.05). The unified framework approach was implemented on all the different data sets both individually and in combination, and our results revealed a number of differential changes that occurred both during infectious states and the varying glucose states. We first analyzed the different individual transcriptome, proteome (serum and PBMC) and metabolome data sets; the proteome and metabolome results are presented in the Supplemental Information (Figures S3, S4, S6 and Data S3–S6). A total of 19,714 distinct transcript isoforms (Wang et al., 2008) corresponding to 12,659 genes (Figure S1C) were tracked for the entire time course, and their dynamic expression response was classified into either autocorrelated (I) or spike sets, the latter further subdivided as displaying maxima (II) or minima (III) (Figure 3). The clustering and enrichment analysis displayed a number of interesting pathways in each class. In the autocorrelated group (Figure 3B, [I]; see also Figure S6A and Data S6, [1 and 2]), we


Figure 2. Medical Findings


(A) High interest disease- and drug-related variants in the subject's genome. (B) RiskGraph of the top 20 diseases with the highest posttest probabilities. For each disease, the arrow represents the pretest probability according to the subject's age, gender, and ethnicity. The line represents the posttest probability after incorporating the subject's genome sequence. Listed to the right are the numbers of independent disease-associated SNVs used to calculate the subject's posttest probability. (C) RiskOGram of type 2 diabetes. The RiskOGram illustrates how the subject's posttest probability of T2D was calculated using 28 independent SNVs. The middle graph displays the posttest probability. The left side shows the associated genes, SNVs, and the subject's genotypes. The right side shows the likelihood ratio (LR), number of studies, cohort sizes, and the posttest probability. (D) Blood glucose trend. Measurements were taken from samples analyzed at either nonfasted or fasted states; the nonfasted states (all but days 186, 322, 329, and 369 and after day 400) were at a fixed time after a constant meal. Data are presented as a moving average with a window of 15 days. Red


found two main trends: an upward trend (2,023 genes), following the onset of the RSV infection, and a similar coincidental downward trend (2,207 genes). The upward autocorrelated trend revealed a number of pathways as enriched (p < 0.002, FDR < 0.05), including protein metabolism and influenza life cycle. Additionally, the downward autocorrelation cluster showed a multitude of enriched pathways (p < 0.008, FDR < 0.05), such as TCR signaling in naive CD4+ T cells, lysosome, B cell signaling, androgen regulation, and of particular interest, insulin signaling/response pathways. These different pathways, which are activated as a response to an immune infection, often share common genes and additionally we observe many genes hitherto unknown to be involved in these pathways but displaying the same trend. Furthermore, we observed that the downward trend, which began with the onset of the RSV infection and appeared to accelerate after day 307, coincided with the beginning of the observed elevated glucose levels in the subject. In the dynamic spike class we again saw patterns that were concordant with phenotypes (Figure 3B, [II] and [III]; see also Figure S6A and Data S6, [3–14]). A set of expression spikes displaying maxima (547 genes) that are common to the onset of both the RSV and HRV infections are associated with phagosome, immune processes and phagocytosis (p < 1 × 10^-4, FDR < 6 × 10^-3). Furthermore, a cluster that exhibits an elevated spike at the onset of the RSV infection involves the major histocompatibility genes (p < 7 × 10^-4, Benjamini-Hochberg adjusted p < 0.03). A large number of genes with a coexpression pattern common to both infections in the time course have yet to be implicated in known pathways and provide possible connections related to immune response. 
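Several of the enrichment results above are reported with Benjamini-Hochberg adjusted p values. As a reminder of what that adjustment does, the following is a minimal sketch of the standard step-up procedure (Benjamini and Hochberg, 1995); the function name and the toy p values are illustrative and are not data from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity of the result.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A term is then called enriched at a given false discovery rate when its adjusted value falls below that threshold, for example 0.05.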
Finally, our spike class displaying minima showed a distinct cluster (1,535 genes) singular to day 307 (day 14 of the RSV infection), associated with TCR signaling again, TGF receptors, and T cell and insulin signaling pathways (p < 0.02, FDR < 0.03). Overall, the transcriptome analysis captures the body's dynamic response to infection, as also evidenced by our cytokine measurements, and can also monitor health changes over long periods of time, with various trends. To further leverage the transcriptome and genome data, we performed an integrated analysis of transcriptome, proteomic and metabolomics data for each time point, observing how this corresponded to the varying physiological states monitored as described in the above sections. Because of the availability of many time points through the course of infection, we examined in detail the onset of the RSV infection, as well as extended our complete dynamics omics profile during the times that our subject began exhibiting high glucose levels. Figure 4 shows an integrated interpretation of omics data (see also Figure S6B and Data S7), where all trends are combined for each omics data set and the common patterns emerge providing complementary information. In addition to the common patterns

observed in our transcriptome analysis, new patterns emerged, some unique to protein data, some to metabolite, and some common to all. In particular we found the following interesting results: for autocorrelated clusters we found the same trends as observed in the transcriptome, additionally augmented with concordant protein expressions. Pathways such as the phagosome, lysosome, protein processing in endoplasmic reticulum, and insulin pathways emerged as significantly enriched (p < 0.002, FDR < 0.0075), and showed a downward trend postinfection that further accelerated ~3 weeks after the initial onset of the RSV infection (this cluster comprised 1,452 transcriptomic and 69 proteomic components, corresponding to 1,444 genes). The elevated spike class showed a maxima cluster on day 18 post RSV infection (one time point after the cytokine maximum), with enrichment in pathways such as the spliceosome, glucose regulation of insulin secretion, and various pathways related to a stress response (p < 1 × 10^-4, FDR < 0.02); this cluster included 1,956 transcriptomic, 571 proteomic and 23 metabolomic components, corresponding to 2,344 genes. Even though current proteomic information is more limited than the full transcriptome because it follows fewer components, as evidenced in Figure 4 (II), several pathways, including the glucose regulation of insulin secretion pathway, clearly emerge from the proteomic information and would not have been observed by only monitoring the transcriptome. Additionally, in this cluster we find significant GO enrichment in splicing and metabolic processes (p < 6 × 10^-47, Benjamini-Hochberg adjusted p < 10^-45). Furthermore, inspection of metabolites reveals 23 that show the same exact trend (i.e., spikes at day 18 post RSV infection); at least one, lauric acid, has been implicated in fatty acid metabolism and insulin regulatory pathways (Kusunoki et al., 2007). 
Finally, we observe minima spikes as well, with yet another interesting group on day 18, which showed downregulation in several pathways (p < 0.003, FDR < 0.05), such as the formation of platelet plug. This cluster displayed a high degree of synergy between the various omics data, comprising 3,237 transcriptomic and 761 proteomic components, corresponding to 3,400 genes, and 83 metabolomic components. In summary, our integrated approach revealed a clear systemic response to the RSV infection following its onset and postinfection response, including a pronounced response evident at day 18 post RSV infection. A variety of infection/stress response related pathways were affected along with those associated with the high glucose levels in the later time points, including insulin response pathways.

Dynamic Omics Analysis: Extensive Heteroallelic Variation and RNA Editing

The considerable amount of transcriptome and proteome data allowed us to analyze and follow changes in allele-specific

and green arrows and bars indicate the times of the HRV and RSV infections, respectively. Black arrows and bars indicate the period with lifestyle changes. (E) C-reactive protein trend line. Error bars represent standard deviation of three assays. (F) Serum cytokine profiles. Red box and day number, HRV infection; green box and day number, RSV infection; question mark, elevated cytokine levels indicating an unknown event at day 301. Red is increased cytokine levels. See also Figure S2.
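The relationship between pretest probability, per-SNV likelihood ratios, and posttest probability shown in panels (B) and (C) can be sketched with the odds form of Bayes' rule. This is a hedged illustration that assumes the SNVs contribute independently, as the caption states; the function name and the numbers in the usage line are made up for the example and are not the subject's values.

```python
def posttest_probability(pretest_prob, likelihood_ratios):
    """Update a pretest probability with a series of independent likelihood ratios.

    Converts the probability to odds, multiplies by each LR in turn,
    and converts the final odds back to a probability.
    """
    odds = pretest_prob / (1.0 - pretest_prob)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)


# Toy usage: a 20% pretest probability and two risk alleles with LR 2.0 and 1.5.
example = posttest_probability(0.20, [2.0, 1.5])
```

Likelihood ratios below 1 lower the estimate in the same way, so protective and risk alleles can be combined in a single pass.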



Figure 3. Transcriptome Time Course Analysis


(A) Summary of approach for identification of differentially expressed components. The various omics sets were processed through a common framework involving spectral analysis, clustering, and pathway enrichment analysis. (B) Pattern classification. The different emergent patterns from the analysis of the transcriptome for the entire time course are displayed for the autocorrelation (I), spike maxima (II), and spike minima (III) classes. For different clusters, examples of gene connections in selected pathways based on Reactome (Croft et al., 2011) FI (Cytoscape plugin [Smoot et al., 2011]) are shown as networks. Example GO (Ashburner et al., 2000) enrichment analysis results from Cytoscape (Smoot et al., 2011) BiNGO (Maere et al., 2005) plugin and pathway enrichment results (Reactome FI [Croft et al., 2011]) are included. See also Figures S5 and S6.
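The clustering step of this framework, agglomerative clustering with correlation distance and average linkage, can be illustrated with a small dependency-free sketch. It is a naive implementation for a handful of toy profiles, not the code used in the study; the function names and the choice to stop at a fixed number of clusters are assumptions. For real data sets, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` and `metric="correlation"` would be the more practical choice.

```python
import math


def correlation_distance(a, b):
    """1 - Pearson correlation between two equal-length profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    da = [v - ma for v in a]
    db = [v - mb for v in b]
    norm = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if norm == 0:
        return 1.0  # assumption: treat flat profiles as uncorrelated
    return 1.0 - sum(x * y for x, y in zip(da, db)) / norm


def average_linkage_clusters(profiles, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    n = len(profiles)
    clusters = [[i] for i in range(n)]
    dist = {(i, j): correlation_distance(profiles[i], profiles[j])
            for i in range(n) for j in range(i + 1, n)}

    def linkage(c1, c2):
        # Average linkage: mean pairwise distance between members.
        total = sum(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        _, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)
```

Because correlation distance ignores absolute scale, two profiles that rise together are grouped even when their magnitudes differ, which suits comparisons across transcript, protein, and metabolite measurements.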


expression (ASE), splicing, and editing at the RNA and protein levels during healthy and diseased states. Of the 49,017 genomic variants associated with coding or UTR regions (Table 1), 12,785 (26%) were expressed in PBMCs (≥40 read coverage; Table S3). A total of 8,509 of the variants are heterozygous (1,113 missense) and the remainder (4,686; 684 missense) are homozygous. Eight of the 83 nonsense mutations were expressed, indicating that not all nonsense mutations result in transcript loss. The numerous heterozygous variants allowed an analysis of the dynamics of differential ASE (shrunk ratios, Experimental Procedures; Figures 5A and S7B) in PBMCs during healthy and diseased states. We found 497 and 1,047 genes that exhibited differential ASE during HRV and RSV infection, respectively (posterior probability ≥ 0.75, beta-binomial model; ≥ 40 reads, ≥ 7 time points); many of these are immune response genes, e.g., PADI4 and PLOD1 (Figure 5B). Among the differential ASE sites, 100 and 218 were specific to HRV and RSV infected states, respectively (Figures 5C and 5D). Differential ASE genes in the HRV compared to healthy phase were enriched for those encoding SNARE vesicular transport proteins (DAVID analysis; Benjamini p < 0.05). Summing over all computed ASE alternative to total ratios revealed that nonreference heteroallelic variants were expressed at 98% of reference variants. The expression of over 50 heterozygous variants, including some of the rare/private SNVs (which form 0.72% of the genomic total), and differentially expressed variants (SVIL and TRIM5), was confirmed by Sanger cDNA sequencing and/or digital PCR (Hindson et al., 2011) of cDNA (Figures 5B and S7). Overall, these results demonstrate that differential ASE is pervasive in humans and is particularly distinct between healthy and infected states, with many of these changes residing in immune response genes.
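To make the notion of a shrunk alternative/total ratio concrete, the following is a minimal sketch, assuming a simple Beta prior on the alternative-allele fraction combined with binomial read counts. The posterior mean then pulls low-coverage sites toward the prior. The prior parameters (`prior_alt`, `prior_ref`), the function name, and the example read counts are hypothetical choices for illustration and are not taken from the study; only the ≥40-read coverage filter comes from the text above. The beta-binomial model actually used is described in the Extended Experimental Procedures and may differ.

```python
def shrunk_alt_ratio(alt_reads, total_reads,
                     prior_alt=10.0, prior_ref=10.0, min_reads=40):
    """Posterior-mean alternative/total ratio under a Beta(prior_alt, prior_ref) prior.

    Returns None when coverage is below min_reads (the text requires >= 40 reads).
    The prior values are illustrative assumptions, not the study's parameters.
    """
    if total_reads < min_reads:
        return None
    return (alt_reads + prior_alt) / (total_reads + prior_alt + prior_ref)


# Hypothetical read counts for three heterozygous sites: (alt reads, total reads).
example_sites = {"site_a": (30, 40), "site_b": (50, 100), "site_c": (3, 4)}
shrunk = {name: shrunk_alt_ratio(alt, total)
          for name, (alt, total) in example_sites.items()}
```

In this sketch a raw ratio of 30/40 = 0.75 is pulled toward 0.5, a balanced site stays at 0.5, and the low-coverage site is excluded rather than reported.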
The depth of our RNA-Seq data enabled us to re-evaluate the extent of RNA editing (Figure 6 and Data S8 and S11A), typically an adenosine to inosine (A-to-I) conversion (Li et al., 2009b) or infrequently cytidine to uridine (C-to-U), in normal human cells. We found 2,376 high-confidence coding-associated RNA edits, including 795 A-to-I (A-to-G) and 277 C-to-U deamination-like edits (Figure 6A). A total of 587 edits in 175 genes were predicted to cause amino acid substitutions (Polyphen-2 [Adzhubei et al., 2010]); the remainder were nonsense (11), synonymous (435), or located in 5′/3′ UTRs (103/1,240). Ten edited bases causing amino acid substitutions were validated by Sanger cDNA sequencing and/or digital droplet PCR, as well as by identification of their peptide counterparts by mass spectrometry (Figure 6B). Interestingly, we identified A-to-G edits (Figure 6B), e.g., IGFBP7, BLCAP, and AZIN1 in PBMCs that were known to occur in other tissues (Gommans et al., 2008; Levanon et al., 2005), indicating that the same RNA can be edited in other cell types. BLCAP exhibited two edited changes (Figure 6C) with edited/total ratios of 0.12–0.2 and 0.18–0.31, respectively, comparable to the 0.21 ratio previously observed in the brain (Galeano et al., 2010). Furthermore, we found and validated two missense-causing edits, U-to-C in SCFD2 and G-to-A in FBXO25 (Figure 6D), indicating an amination-like RNA-editing mechanism, previously not observed in human cells. Our results reveal that a large number of edits occur and exhibit dynamic and differential changes in

populations of PBMCs (Figure 6B). The total number of edited RNAs, while extensive, is significantly lower than that reported in human lymphoblastoid lines and very different in its distribution (Li et al., 2011). We believe that in addition to tissue-specific variation, the observed differences are also likely due to overcalling of false-positive SNVs, a problem we corrected with deep exome sequencing, removal of repeat regions and pseudogenes, and strings of close-proximity variants (Data S11A).

Finally, to determine whether the nonreference allele and edited RNAs serve as templates for protein synthesis, we generated proteome databases for 4,586 missense SNVs and all 30,385 edits and used them to search our mass spectra from the untargeted protein profiling experiments as well as in a targeted approach to directly search for 500 edited proteins (see Extended Experimental Procedures). Peptides for 48 SNVs and 51 edits were identified (FDR < 0.01 and requiring one unique peptide per protein; Data S9 and S11B). A total of 17/17 selected SNVs (100%) were validated by Sanger sequencing. Seven peptides derived from the SNV and six peptides derived from edited transcripts were unique to a single protein in the IPI database (Kersey et al., 2004) and classified as high confidence. These results indicate that a large fraction of personal variants are expressed as transcripts and a number of these are also translated as proteins.

miRNA Variant Analysis

In addition to the omics profiling above, we identified 619–681 known miRNAs from PBMCs per time point (>10 reads, days 4, 21, 116, 185, and 186), 106 of which showed dynamic changes (e.g., Figures S2D and S2E). Examination of miRNA editing revealed 50 edited miRNAs (C-to-U or A-to-I) with stringent criteria (edited reads > 5% of total reads or > 399 modified reads) indicating that at least ~4% of expressed miRNAs are potentially edited.
Eighteen miRNAs contain edits located within the functionally critical seed sequences, potentially affecting their mRNA targets. Interestingly, expression of SNV-containing miRNAs was generally higher compared to SNV-free miRNA (Figures 6E and 6F). In addition to edits, analysis of the SNVs located in miRNAs revealed that most (25 of 31) SNV-containing miRNAs were not expressed. These miRNAs were among those discovered in cancer cell lines (Jima et al., 2010) and may not normally be highly expressed in PBMCs from healthy individuals.

DISCUSSION

To our knowledge, our study is the first to perform extensive personal iPOP of an individual through healthy and diseased states. It revealed extensive complex and dynamic changes in the omics profiles, especially in the transcriptomes, between healthy states and viral infections, and between nondiabetic and diabetic states. iPOP provides a multidimensional view of medical states, including healthy states, response to viral infection, recovery, and T2D onset. Our study indicates that disease risk can be assessed from a genome sequence and illustrates how traits associated with disease can be monitored to identify varying physiological stages. We show that large numbers of molecular components are present in blood samples and can

Figure 5. Heteroallelic Expression Study of PBMCs


(A) Frequency of allele-specific expression (ASE) based on shrunk alternative/total ratios of RNA-Seq data. A total of 143 positions fall outside the three standard deviations (σ) range (see Figure S7B; <0.33, >0.66), suggesting that certain heterozygous alleles (DNA level) are preferentially expressed in PBMCs. Standard deviations (σ) are denoted with dotted lines and the average ratio overlapping across all time points is 0.49. (B) Digital droplet PCR validation of two heteroallelic expressed genes PADI4 and PLOD (relative to alternative allele). (C) Heat map of the HRV infection time course (seven time points) showing differential ASE during HRV infection day 0 (red arrow) relative to average shrunk ratios of healthy states (days 116–255). (D) Heat map of the RSV infection time course (13 time points) showing differential ASE specific to RSV infection day 289 (red arrow) relative to average shrunk ratios of healthy states (days 311–400); onset of high glucose on day 307 is also shown (red arrow). Heat map ratios are relative to the alternative allele (alternative/total, posterior probability >0.75). Example of enriched KEGG pathway gene cluster (Huang et al., 2009; Benjamini p < 0.05) shown below Figure 5C. See also Figure S7 and Data S11A.

be measured (>3 billion measurements taken over 20 time points). For the transcriptome many of these arise from differential splicing, ASE, and editing events. By observing dynamic molecular changes that correspond to physiological states, this proof-of-principle study offers a pilot implementation of personalized medicine. The information obtained may greatly help in the design and application of personalized health monitoring, diagnosis, prognosis, and treatment.
Figure 4. Integrated Omics Analysis

We speculate that differential expression of ASE/edits may be important in monitoring and assessing diseased states. In this respect the genes/proteins in which one isoform is abundant in one condition (e.g., diseased or healthy state) whereas another is abundant in another (e.g., diseased state) may provide unique physiological advantages to the individual in distinct environmental conditions. Because multiple genes in our study that exhibit ASE and editing changes are

For days 186–400, the different emergent patterns from an integrated analysis of the transcriptome, proteome, and metabolome data are displayed for the autocorrelation (I), spike maxima (II), and spike minima (III) classes. For different clusters, examples of gene connections in selected pathways based on Reactome (Croft et al., 2011) and FI Cytoscape (Smoot et al., 2011) plugin are shown as networks, with constituents marked as assessed from proteome data, transcriptome data or both. Example GO (Ashburner et al., 2000) enrichment analysis results from Cytoscape (Smoot et al., 2011) BiNGO (Maere et al., 2005) plugin and pathway enrichment results (Reactome FI [Croft et al., 2011]) are included. See also Figures S4–S6.


Figure 6. RNA Editing and miRNA Expression of PBMCs


(A) Distribution of candidate RNA editing types in missense (red) and synonymous and UTRs (blue), based on seven or more time points (total 20 time points). (B) Selected summary of known and novel RNA edits expressed in PBMCs. RNA edits were validated by digital PCR (green) and proteomic mass spectrometry (yellow). (C) Detail of two missense-causing edit sites in BLCAP. Selected data from RNA-Seq at day 4 and day 255 (top left), Sanger sequencing of day 255 cDNA (bottom left), and digital PCR (right panel) are shown. (D) Digital droplet PCR analysis of novel edit sites in SCFD2 (left) and FBXO25 (right) genes show no variants in DNA, whereas in RNA, editing is evident (top left quadrant). (E and F) Expression of SNV-containing and SNV-free miRNA, respectively, for days 4, 21, 116, 185, and 186. Red lines, mean; error bars, standard error of the mean. Genome browsers, chromatograms, and digital PCR data were analyzed with software from DNAnexus, Inc., Chromas 2.33, and QuantaLife, respectively. See also Figure S7 and Data S8 and S11A and S11B.

involved in immune function, we speculate that these components are particularly valuable for mediating immune responses to environmental conditions such as exposure to

pathogens. Likewise miRNA SNVs and edits, which also undergo differential expression, may confer unique biological responses.

Although we analyzed a single individual, insights were gained by integrating the multiple omics profiles associated with distinct physiological states. Through examination of molecular patterns, clear signatures of dynamic biological processes were evident, including immune responses during infection and insulin signaling response alterations after the RSV infection. Indeed, careful monitoring of omics changes across multiple time points for the same individual revealed detailed responses, which might not have been evident had the analyses been performed on groups due to interindividual variability. Hence, we expect that our longitudinal personalized profiling approach provides valuable information on an individual basis.

We focused on a generally healthy subject who exhibited no apparent disease symptoms. This is a critical aspect of personalized medicine, which is to perform iPOP and evaluate the importance and changes of all the profiles in ordinary individuals. These results have important implications and suggest new paradigm shifts: first, genome sequencing can be used to direct the monitoring of specific diseases (in this study, aplastic anemia and diabetes) and second, by following large numbers of molecules a more comprehensive view of disease states can be analyzed to follow physiological states.

Our study revealed that many distinct molecular events and pathways are activated both through viral infection and the onset of diabetes. Indeed, the monitoring of large numbers of different components revealed a steady decrease of insulin-related responses that are associated with diabetes-insulin response pathways occurring from the early healthy state to a high glucose state. Although many of the activated and repressed pathways could be detected through transcript profiling, some were detected only with the proteomics data and some with the combined set of data.
In addition, a large number of connections with diabetes and insulin signaling using metabolites, miRNAs, and autoantibodies were observed. One particularly interesting response detected with the proteomics data was the onset of the elevated glucose response that was tightly associated with the RSV infection and a particular subclinical response at day 12/18 postinfection. It is tempting to speculate that the RSV infection and/or the associated event at day 12/18 triggered the onset of high glucose/T2D. Although viral infections have been associated with T1D (van der Werf et al., 2007), we are unaware of viral infection associated with T2D. Inflammation and activated innate immunity have been associated with T2D (Pickup, 2004), and we speculate that perhaps RSV triggered aberrant glucose metabolism through activation of a viral inflammation response in conjunction with a predisposition toward T2D. Although this cannot be proven with the analyses from a single individual, this study nonetheless serves as proof-of-principle that iPOP can be performed and provide valuable information. Because diabetes is a complex disease there may be many ways to acquire a high glucose phenotype; longitudinal iPOP analysis of a large number of individuals may be extremely valuable to dissecting the disease and its various subtypes, as well as providing information into the molecular mechanism of its onset.

Finally, we believe that the wealth of data generated from this study will serve as a valuable resource to the community in the developing field of personalized medicine. A large database with the complete time-dynamic profiles for more individuals

that acquire infections and other types of diseases will be extremely valuable in the early diagnostics, monitoring and treatment of diseased states.
EXPERIMENTAL PROCEDURES

The subject and mother in this study were recruited under the IRB protocol IRB-8629 at Stanford University. Full methods and associated references can be found in the Extended Experimental Procedures section.

WGS was performed at Complete Genomics and Illumina. High-confidence SNVs were mostly correct as evidenced by: (1) Illumina Omni1-Quad genotyping arrays (99.3% sensitivity), (2) a Ti/Tv ratio of 2.14 as expected (1000 Genomes Project Consortium, 2010), (3) Illumina capture and DNA sequencing (92.7% accuracy), and (4) Sanger sequencing of 36 randomly selected SNVs (36/36 validated, Table S1). In contrast, the low-confidence SNVs had a Ti/Tv of only 1.46 and an accuracy of 63.8% (19 of 33 confirmed by Sanger sequencing, Table S1A). Similarly, the majority of the 216,776 high-confidence indels are likely to be correct as (1) Sanger sequencing validated 14 of 15 (93%) tested indels and (2) exome-sequencing validated most indels (4,706, 82%); meanwhile the 806,125 low-confidence indels had a low validation rate (5,225, 0.65%).

SVs were called using: (1) paired-end mapping (Chen et al., 2009), (2) read depth (Abyzov et al., 2011), (3) split reads (Ye et al., 2009), and (4) junction mapping (Lam et al., 2010) to the breakpoint junction database from the 1000 G (Mills et al., 2011). A total of 2,566 were found by two different methods or platforms (CG or Illumina) and were called high confidence; >90% of these were in the database of genome variants.

Strand-specific RNA-Seq libraries were prepared as described previously (Parkhomchuk et al., 2009) and sequenced on 13 lanes of Illumina's HiSeq 2000 instrument. The TopHat package (Trapnell et al., 2009) was used to align the reads to the hg19 reference genome, followed by Cufflinks for transcript assembly and RNA expression analysis (Trapnell et al., 2010). The Samtools package (Li et al., 2009a) was used to identify variants including single nucleotide variants (SNV) and Indels.
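The Ti/Tv check used above as a quality indicator for SNV calls can be illustrated with a short sketch. Transitions are purine-to-purine (A↔G) or pyrimidine-to-pyrimidine (C↔T) changes, and all other substitutions are transversions. The function names and the example variant list below are hypothetical and serve only to show the arithmetic; they are not the pipeline used in this study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions; other substitutions are transversions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs):
    """Transition/transversion ratio for an iterable of (ref, alt) base pairs."""
    snvs = [(ref.upper(), alt.upper()) for ref, alt in snvs if ref.upper() != alt.upper()]
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    return transitions / transversions if transversions else float("inf")


# Hypothetical call set: four transitions and two transversions.
example_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("T", "G")]
example_ratio = ti_tv_ratio(example_snvs)
```

A call set dominated by random errors drifts toward a lower ratio, which is consistent with the low-confidence SNVs showing a Ti/Tv of 1.46 against 2.14 for the high-confidence set.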
Small RNAs were prepared from PBMCs for the first five time points; sequencing was performed according to Illumina's Small RNA v1.5 Sample Preparation Guide. The Luminex 51-plex Human Cytokines assay was performed at the Stanford Human Immune Monitoring Center.

For mass spectrometry, proteins were prepared from PBMC cell lysates, labeled at lysines using the TMT isobaric tags by Pierce, digested with trypsin, and analyzed using reverse phase LC coupled to a Thermo Scientific (LTQ)-Orbitrap Velos instrument. In order to profile serum, 14 major glycoproteins were first removed using the Agilent Human 14 Multiple Affinity Removal System (MARS) column in order to analyze the less abundant constituents. Metabolites were extracted by four times serum volume of equal mixture of methanol, acetonitrile, and acetone and separated using our Agilent 1260 liquid chromatography. Hydrophobic molecules were profiled using reversed phase UPLC followed by APCI-MS and hydrophilic molecules were analyzed using HILIC UPLC followed by ESI-MS in either the positive or negative mode.

For the integrated analysis, per omics set, for each time-series curve the Lomb-Scargle transformation (Hocke and Kämpfer, 2009; Lomb, 1976; Scargle, 1982, 1989) for unevenly sampled gapped time-series data was implemented (Ahdesmäki et al., 2007; Glynn et al., 2006; Van Dongen et al., 1999; Yang et al., 2011; Zhao et al., 2008). This allowed us to obtain a periodogram, which was used to calculate autocorrelations and then reconstruct the time series with even sampling, allowing standard time-series analysis and performing data clustering, while taking the time intervals into account (see Extended Experimental Procedures). Autoantibodyome profiling was performed using the Invitrogen ProtoArray Protein Microarray v5.0 according to the manufacturer's instructions.

ACCESSION NUMBERS

The SRA accession number for the WGS sequence reported in this paper is SRP008054.4.
The GEO accession number for the RNA-Seq and miRNASeq data sequence reported in this paper is GSE33029. See Extended Experimental Procedures for data dissemination details.
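As an illustration of the Lomb-Scargle step named in the Experimental Procedures, the following is a minimal sketch of the classical (unnormalized) Lomb-Scargle periodogram for unevenly sampled data, following the textbook form of Lomb (1976) and Scargle (1982). It covers only the periodogram itself, not the subsequent autocorrelation or evenly sampled reconstruction, and it is not the implementation used in this study. The sampling times, the test signal, and the frequency grid are all assumptions chosen for the example.

```python
import math
import random


def lomb_scargle(times, values, freqs):
    """Classical Lomb-Scargle power at each frequency (cycles per time unit)."""
    mean = sum(values) / len(values)
    y = [v - mean for v in values]
    powers = []
    for f in freqs:
        w = 2.0 * math.pi * f
        # The offset tau makes the sine and cosine terms orthogonal
        # on the actual (uneven) sampling times.
        s2 = sum(math.sin(2.0 * w * t) for t in times)
        c2 = sum(math.cos(2.0 * w * t) for t in times)
        tau = math.atan2(s2, c2) / (2.0 * w)
        cos_terms = [math.cos(w * (t - tau)) for t in times]
        sin_terms = [math.sin(w * (t - tau)) for t in times]
        yc = sum(a * b for a, b in zip(y, cos_terms))
        ys = sum(a * b for a, b in zip(y, sin_terms))
        cc = sum(c * c for c in cos_terms)
        ss = sum(s * s for s in sin_terms)
        powers.append(0.5 * (yc * yc / cc + ys * ys / ss))
    return powers


# Synthetic example: 40 irregular sampling times and a 0.1 cycles/unit sinusoid.
rng = random.Random(0)
times = sorted(rng.uniform(0.0, 100.0) for _ in range(40))
values = [math.sin(2.0 * math.pi * 0.1 * t) for t in times]
freqs = [0.01 * k for k in range(1, 51)]
powers = lomb_scargle(times, values, freqs)
best_freq = freqs[powers.index(max(powers))]
```

Because the method fits sinusoids directly at the observed time points, no interpolation onto a regular grid is needed before the spectral estimate, which is the property that makes it suitable for gapped time courses such as the one analyzed here.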


SUPPLEMENTAL INFORMATION

Supplemental Information includes Extended Experimental Procedures, seven figures, four tables, and eleven data files and can be found with this article online at doi:10.1016/j.cell.2012.02.009.

ACKNOWLEDGMENTS

M.S. is funded by grants from Stanford University and the NIH. M.G. is funded by grants from the NIH. G.I.M. is funded by NIH training grant. K.J.K., J.T.D., and S.H. are supported by the NIH/NLM training grant T15-LM007033. T.E.K. and R.B.A are funded by NIH/NIGMS R24-GM61374. M.A.B.'s laboratory is funded by the Spanish Ministry of Science and Innovation Projects SAF2008-05384 and CSD2007-00017, European Union FP7 Projects 2007-A-201630 (GENICA) and 2007-A-200950 (TELOMARKER), European Research Council Advanced Grant GA232854, the Körber Foundation, the Fundación Marcelino Botín, and Fundación Lilly (España). F.E.D. was supported by NIH/NHLBI training grant T32 HL094274. E.A.A. was supported by NIH/NHLBI KO8 HL083914, NIH New Investigator DP2 Award OD004613, and a grant from the Breetwor Family Foundation. We dedicate this manuscript to Dr. Tara A. Gianoulis, an enthusiastic advocate for genomic science. R.B.A., E.A.A., A.B., and M.S. serve as founders and consultants for Personalis. R.B.A. is a consultant to 23andMe. M.S. is a member of the scientific advisory board of GenapSys and a consultant for Illumina. M.A.B. acts as consultant and holds stock in Life Length.

Received: October 11, 2011
Revised: January 27, 2012
Accepted: February 4, 2012
Published: March 15, 2012

REFERENCES

1000 Genomes Project Consortium. (2010). A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073.

Abyzov, A., Urban, A.E., Snyder, M., and Gerstein, M. (2011). CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 21, 974–984.

Adzhubei, I.A., Schmidt, S., Peshkin, L., Ramensky, V.E., Gerasimova, A., Bork, P., Kondrashov, A.S., and Sunyaev, S.R. (2010). A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249.

Ahdesmäki, M., Lähdesmäki, H., Gracey, A., Shmulevich, L., and Yli-Harja, O. (2007). Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data. BMC Bioinformatics 8, 233.

Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29.

Ashley, E.A., Butte, A.J., Wheeler, M.T., Chen, R., Klein, T.E., Dewey, F.E., Dudley, J.T., Ormond, K.E., Pavlovic, A., Morgan, A.A., et al. (2010). Clinical assessment incorporating a personal genome. Lancet 375, 1525–1535.

Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Roy. Statist. Soc. Ser. B 57, 289–300.

Cancer Genome Atlas Research Network. (2011). Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615.

Chen, K., Wallis, J.W., McLellan, M.D., Larson, D.E., Kalicki, J.M., Pohl, C.S., McGrath, S.D., Wendl, M.C., Zhang, Q., Locke, D.P., et al. (2009). BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat. Methods 6, 677–681.

Cox, J., and Mann, M. (2010). Quantitative, high-resolution proteomics for data-driven systems biology. Annu. Rev. Biochem. 80, 273–299.

Croft, D., O'Kelly, G., Wu, G., Haw, R., Gillespie, M., Matthews, L., Caudy, M., Garapati, P., Gopinath, G., Jassal, B., et al. (2011). Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 39 (Database issue), D691–D697.

Dewey, F.E., Chen, R., Cordero, S.P., Ormond, K.E., Caleshu, C., Karczewski, K.J., Whirl-Carrillo, M., Wheeler, M.T., Dudley, J.T., Byrnes, J.K., et al. (2011). Phased whole-genome genetic risk in a family quartet using a major allele reference sequence. PLoS Genet. 7, e1002280.

Erlich, H.A., Valdes, A.M., Julier, C., Mirel, D., and Noble, J.A.; Type I Diabetes Genetics Consortium. (2009). Evidence for association of the TCF7 locus with type I diabetes. Genes Immun. 10 (Suppl 1), S54–S59.

Galeano, F., Leroy, A., Rossetti, C., Gromova, I., Gautier, P., Keegan, L.P., Massimi, L., Di Rocco, C., O'Connell, M.A., and Gallo, A. (2010). Human BLCAP transcript: new editing events in normal and cancerous tissues. Int. J. Cancer 127, 127–137.

Glynn, E.F., Chen, J., and Mushegian, A.R. (2006). Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms. Bioinformatics 22, 310–316.

Gommans, W.M., Tatalias, N.E., Sie, C.P., Dupuis, D., Vendetti, N., Smith, L., Kaushal, R., and Maas, S. (2008). Screening of human SNP database identifies recoding sites of A-to-I RNA editing. RNA 14, 2074–2085.

Grayson, B.L., Wang, L., and Aune, T.M. (2011). Peripheral blood gene expression profiles in metabolic syndrome, coronary artery disease and type 2 diabetes. Genes Immun. 12, 341–351.

Hani, E.H., Boutin, P., Durand, E., Inoue, H., Permutt, M.A., Velho, G., and Froguel, P. (1998). Missense mutations in the pancreatic islet beta cell inwardly rectifying K+ channel gene (KIR6.2/BIR): a meta-analysis suggests a role in the polygenic basis of Type II diabetes mellitus in Caucasians. Diabetologia 41, 1511–1515.

Hindson, B.J., Ness, K.D., Masquelier, D.A., Belgrader, P., Heredia, N.J., Makarewicz, A.J., Bright, I.J., Lucero, M.Y., Hiddessen, A.L., Legler, T.C., et al. (2011). High-throughput droplet digital PCR system for absolute quantitation of DNA copy number. Anal. Chem. 83, 8604–8610.

Hocke, K., and Kämpfer, N. (2009). Gap filling and noise reduction of unevenly sampled data by means of the Lomb-Scargle periodogram. Atmos. Chem. Phys. 9, 4197–4206.

Huang, W., Sherman, B.T., and Lempicki, R.A. (2009). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13.

Jima, D.D., Zhang, J., Jacobs, C., Richards, K.L., Dunphy, C.H., Choi, W.W., Au, W.Y., Srivastava, G., Czader, M.B., Rizzieri, D.A., et al; Hematologic Malignancies Research Consortium. (2010). Deep sequencing of the small RNA transcriptome of normal and malignant human B cells identifies hundreds of novel microRNAs. Blood 116, e118–e127.

Kanehisa, M., and Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30.

Kasowski, M., Grubert, F., Heffelfinger, C., Hariharan, M., Asabere, A., Waszak, S.M., Habegger, L., Rozowsky, J., Shi, M., Urban, A.E., et al. (2010). Variation in transcription factor binding among humans. Science 328, 232–235.

Kersey, P.J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E., and Apweiler, R. (2004). The International Protein Index: an integrated database for proteomics experiments. Proteomics 4, 1985–1988.

Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., Jones, S.J., and Marra, M.A. (2009). Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645.

Kusunoki, M., Tsutsumi, K., Nakayama, M., Kurokawa, T., Nakamura, T., Ogawa, H., Fukuzawa, Y., Morishita, M., Koide, T., and Miyata, T. (2007). Relationship between serum concentrations of saturated fatty acids and unsaturated fatty acids and the homeostasis model insulin resistance index in Japanese patients with type 2 diabetes mellitus. J. Med. Invest. 54, 243–247.

Lam, H.Y., Mu, X.J., Stütz, A.M., Tanzer, A., Cayting, P.D., Snyder, M., Kim, P.M., Korbel, J.O., and Gerstein, M.B. (2010). Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library. Nat. Biotechnol. 28, 47–55.

Lapuk, A., Marr, H., Jakkula, L., Pedro, H., Bhattacharya, S., Purdom, E., Hu, Z., Simpson, K., Pachter, L., Durinck, S., et al. (2010). Exon-level microarray analyses identify alternative splicing programs in breast cancer. Mol. Cancer Res. 8, 961–974.

Levanon, E.Y., Hallegger, M., Kinar, Y., Shemesh, R., Djinovic-Carugo, K., Rechavi, G., Jantsch, M.F., and Eisenberg, E. (2005). Evolutionarily conserved human targets of adenosine to inosine RNA editing. Nucleic Acids Res. 33, 1162–1168.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R.; 1000 Genome Project Data Processing Subgroup. (2009a). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079.

Li, J.B., Levanon, E.Y., Yoon, J.K., Aach, J., Xie, B., Leproust, E., Zhang, K., Gao, Y., and Church, G.M. (2009b). Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing. Science 324, 1210–1213.

Li, M., Wang, I.X., Li, Y., Bruzel, A., Richards, A.L., Toung, J.M., and Cheung, V.G. (2011). Widespread RNA and DNA sequence differences in the human transcriptome. Science 333, 53–58.

Li, R., Li, Y., Zheng, H., Luo, R., Zhu, H., Li, Q., Qian, W., Ren, Y., Tian, G., Li, J., et al. (2010). Building the sequence map of the human pan-genome. Nat. Biotechnol. 28, 57–63.

Lomb, N. (1976). Least-squares frequency analysis of unequally spaced data. Astrophys. Space Sci. 39, 447–462.

Maere, S., Heymans, K., and Kuiper, M. (2005). BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21, 3448–3449.

Mills, R.E., Walter, K., Stewart, C., Handsaker, R.E., Chen, K., Alkan, C., Abyzov, A., Yoon, S.C., Ye, K., Cheetham, R.K., et al; 1000 Genomes Project. (2011). Mapping copy number variation by population-scale genome sequencing. Nature 470, 59–65.

Mischel, P.S., Shai, R., Shi, T., Horvath, S., Lu, K.V., Choe, G., Seligson, D., Kremen, T.J., Palotie, A., Liau, L.M., et al. (2003). Identification of molecular subtypes of glioblastoma by gene expression profiling. Oncogene 22, 2361–2373.

Parkhomchuk, D., Borodina, T., Amstislavskiy, V., Banaru, M., Hallen, L., Krobitsch, S., Lehrach, H., and Soldatov, A. (2009). Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic Acids Res. 37, e123.

Pickup, J.C. (2004). Inflammation and activated innate immunity in the pathogenesis of type 2 diabetes. Diabetes Care 27, 813–823.

Rozowsky, J., Abyzov, A., Wang, J., Alves, P., Raha, D., Harmanci, A., Leng, J., Bjornson, R., Kong, Y., Kitabayashi, N., et al. (2011). AlleleSeq: analysis of allele-specific expression and binding in a network framework. Mol. Syst. Biol. 7, 522.

Scargle, J.D. (1982). Studies in astronomical time series analysis. II-Statistical aspects of spectral analysis of unevenly spaced data. Astrophys. J. 263, 835–853.

Scargle, J.D. (1989). Studies in astronomical time series analysis. III-Fourier transforms, autocorrelation functions, and cross-correlation functions of unevenly spaced data. Astrophys. J. 343, 874–887.

Smoot, M.E., Ono, K., Ruscheinski, J., Wang, P.L., and Ideker, T. (2011). Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 27, 431–432.

Snyder, M., Weissman, S., and Gerstein, M. (2009). Personal phenotypes to go with personal genomes. Mol. Syst. Biol. 5, 273.

Snyder, M., Du, J., and Gerstein, M. (2010). Personal genome sequencing: current approaches and challenges. Genes Dev. 24, 423–431.

Stewart, C., Kural, D., Stromberg, M.P., Walker, J.A., Konkel, M.K., Stütz, A.M., Urban, A.E., Grubert, F., Lam, H.Y., Lee, W.P., et al; 1000 Genomes Project. (2011). A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet. 7, e1002236.

Sun, J.C., Liang, X.T., Pan, K., Wang, H., Zhao, J.J., Li, J.J., Ma, H.Q., Chen, Y.B., and Xia, J.C. (2010). High expression level of EDIL3 in HCC predicts poor prognosis of HCC patients. World J. Gastroenterol. 16, 4611–4615.

Theodoridis, G., Gika, H.G., and Wilson, I.D. (2011). Mass spectrometry-based holistic analytical approaches for metabolite profiling in systems biology studies. Mass. Spectrom. Rev. 30, 884–906.

Trapnell, C., Pachter, L., and Salzberg, S.L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111.

Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., Wold, B.J., and Pachter, L. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515.

van der Werf, N., Kroese, F.G., Rozing, J., and Hillebrands, J.L. (2007). Viral infections as potential triggers of type 1 diabetes. Diabetes Metab. Res. Rev. 23, 169–183.

Van Dongen, H.P., Olofsen, E., VanHartevelt, J.H., and Kruyt, E.W. (1999). A procedure of multiple period searching in unequally spaced time-series with the Lomb-Scargle method. Biol. Rhythm Res. 30, 149–177.

Vaxillaire, M., Cavalcanti-Proenca, C., Dechaume, A., Tichet, J., Marre, M., Balkau, B., and Froguel, P.; DESIR Study Group. (2008). The common P446L polymorphism in GCKR inversely modulates fasting glucose and triglyceride levels and reduces type 2 diabetes risk in the DESIR prospective general French population. Diabetes 57, 2253–2257.

Wang, E.T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., Kingsmore, S.F., Schroth, G.P., and Burge, C.B. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476.

Wu, J.Q., Habegger, L., Noisa, P., Szekely, A., Qiu, C., Hutchison, S., Raha, D., Egholm, M., Lin, H., Weissman, S., et al. (2010). Dynamic transcriptomes during neural differentiation of human embryonic stem cells revealed by short, long, and paired-end sequencing. Proc. Natl. Acad. Sci. USA 107, 5254–5259.

Yamaguchi, H., Calado, R.T., Ly, H., Kajigaya, S., Baerlocher, G.M., Chanock, S.J., Lansdorp, P.M., and Young, N.S. (2005). Mutations in TERT, the gene for telomerase reverse transcriptase, in aplastic anemia. N. Engl. J. Med. 352, 1413–1424.

Yang, R., Zhang, C., and Su, Z. (2011). LSPR: an integrated periodicity detection algorithm for unevenly sampled temporal microarray data. Bioinformatics 27, 1023–1025.

Ye, K., Schulz, M.H., Long, Q., Apweiler, R., and Ning, Z. (2009). Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865–2871.

Zhao, W., Agyepong, K., Serpedin, E., and Dougherty, E.R. (2008). Detecting periodic genes from irregularly sampled gene expressions: a comparison study. EURASIP J. Bioinform. Syst. Biol. 2008, 769293.

Cell 148, 1293–1307, March 16, 2012 ©2012 Elsevier Inc.

Supplemental Information
EXTENDED EXPERIMENTAL PROCEDURES

Sample Collection
The subject and mother in this study were recruited under the IRB protocol IRB-8629 at Stanford University. Whole-blood samples were collected at each time point, and peripheral blood mononuclear cells (PBMCs) were isolated by density gradient centrifugation at 400 × g for 25 min using the Lymphocyte Separation Media (MP Biomedicals). Serum and plasma were also collected at each time point. Genomic DNA and RNA were isolated from the PBMCs using the AllPrep DNA/RNA/Protein Mini Kit (QIAGEN). Protein was also prepared from lysed PBMCs for mass spectrometry with the Lysis Buffer (4% SDS, 100 mM Tris-HCl pH 7.6, 100 mM DTT).

Human Rhinovirus and Respiratory Syncytial Virus Detection
Human rhinovirus (HRV) and respiratory syncytial virus (RSV) were detected in upper respiratory swab samples from the subject at the Stanford Hospital and Clinics with standard assays (the Respiratory Viral Panel Test). Briefly, viral RNA was extracted from the swab samples and amplified by reverse transcription-polymerase chain reaction, and the presence of a panel of respiratory viruses was detected using the Luminex xTag technology. For HRV infection, samples from days 0, 4, and 21 were examined; for RSV infection, samples from days 289, 290, 292, and 294 were assayed.

Whole-Genome Sequencing
Whole-genome sequencing was performed at both Complete Genomics Inc. (Mountain View, CA) and Illumina, Inc. (San Diego, CA). Ten micrograms of genomic DNA was used for each platform. Paired-end 35 b reads were used for Complete Genomics (CG) sequencing, and data were processed and variants (SNVs, indels, SVs, and CNVs) were called against the NCBI reference genome build 37 with the CG assembly software v1.10.1.32. For Illumina, 101 b paired-end sequencing data were obtained using Illumina's HiSeq 2000 Sequencer. Illumina data were processed with the HugeSeq pipeline we developed during this project (Lam et al., 2012).
This pipeline maps reads using Burrows-Wheeler Aligner (BWA) (Li and Durbin, 2009) and calls SNVs, indels, and SVs using the algorithms. SVs detected by two or more methods were called high confidence.

Whole-Exome Sequencing
Whole-exome sequencing was performed using three available platforms: the Agilent SureSelect All Exon 50Mb, the Nimblegen SeqCap EZ Exome Library v2.0, and the Illumina TruSeq Exome Enrichment Kit (Clark et al., 2011). Three micrograms of genomic DNA was used for each enrichment platform. The enriched sequencing libraries were prepared according to the manufacturers' protocols with slight modifications as stated below, and each was subjected to Illumina sequencing on one lane of the HiSeq 2000 sequencer.

For the Agilent platform, genomic DNA was sheared with the Covaris S2 system; the DNA fragments were end-repaired, extended with an A base on the 3′ end, ligated with paired-end adaptors, and amplified (4 cycles). Exome-containing adaptor-ligated libraries were hybridized for 24 hr with biotinylated oligo RNA baits and enriched with Streptavidin-conjugated magnetic beads. The final libraries were further amplified for 11 cycles by polymerase chain reaction (PCR).

For the Nimblegen SeqCap EZ Exome Library, the Illumina sequencing library was made following Nimblegen's protocol with the following modifications: in Chapter 4 Steps 1-4 of the protocol, two PCR reactions were set up for each sample with 15 µl of each unenriched sample library as template, and 2 µg of amplified sample library was used for each sample in the hybridization step described in Chapter 5 Step 2. These modifications ensure that sufficient material is obtained from PCR for the hybridization, and doubling the amount of amplified sample library makes the most use of the enrichment probes.
Briefly, DNA fragmented with the Covaris S2 system was concentrated by ethanol precipitation and end-repaired with the Epicenter End-It DNA End-Repair Kit; a deoxyadenosine was added at the 3′ end of the fragments with the Klenow 3′→5′ exo− enzyme (New England Biolabs), and the fragments were ligated with Illumina's Paired-End Adaptor Oligo Mix (Part# 1001782). The ligated libraries were size selected for an average insert size of 250 bp (2 mm gel slice) by agarose gel excision and extraction, amplified for 8 cycles by Pre-Capture LM-PCR, and hybridized for 72 hr with biotinylated oligo DNA baits for exome-containing libraries. The hybridized libraries were enriched with Streptavidin-conjugated magnetic beads, washed, and amplified by PCR (18 cycles), and the quality of the libraries was checked by qPCR as described in the protocol.

For the Illumina TruSeq Exome Enrichment Kit, pre-enrichment DNA libraries were constructed following Illumina's TruSeq DNA Sample Preparation Guide. A 300-400 bp band was gel selected for each library, and exome enrichment was performed according to Illumina's TruSeq Exome Enrichment Guide. Two 20 hr biotinylated bait-based hybridizations were performed, each followed by Streptavidin Magnetic Beads binding, a washing step, and an elution step. A 10-cycle PCR enrichment was performed after the second elution, and after a quality check the enriched libraries were subjected to Illumina sequencing on one lane of the HiSeq 2000.

Sanger DNA Sequencing
Sanger DNA PCR and sequencing primers were designed manually and with the Optimus Primer software (http://op.pgx.ca/), and were synthesized at Integrated DNA Technologies (Coralville, IA). DNA sequencing was performed at ELIM BIOPHARM (Hayward, CA). Sequencing results were visualized with the CodonCode Aligner software (http://www.codoncode.com/aligner/).

Whole-Transcriptome Sequencing: mRNA-Seq
Strand-specific RNA-Seq libraries were prepared as described previously (Parkhomchuk et al., 2009). Briefly, 9 µg of total RNA isolated from PBMCs was used, and mRNA was enriched with the Dynal Oligo (dT) beads (Invitrogen). The isolated mRNA was fragmented using the RNA Fragmentation Reagents (Ambion), and cDNA containing dUTP in the second strand was synthesized. The cDNA molecules were end-repaired with the Epicenter End-It™ DNA End-Repair Kit; a deoxyadenosine was added at the 3′ end of the fragments with the Klenow 3′→5′ exo− enzyme (New England Biolabs), and the fragments were ligated with Illumina's Paired-End Adaptor Oligo Mix (Part# 1001782). The ligated libraries were size selected for an average insert size of 250 bp (2 mm gel slice) by agarose gel excision and extraction, and the dUTP-containing second strands were digested with Uracil-DNA Glycosylase (New England Biolabs). The treated libraries were then amplified by polymerase chain reaction under the following conditions: 98°C for 30 s; 15 cycles of (98°C for 10 s, 65°C for 30 s, 72°C for 30 s); 72°C for 5 min. Each prepared library was sequenced on 1-3 HiSeq 2000 lanes to obtain an average of 123 million uniquely mapped reads (20 time points). The TopHat package (Trapnell et al., 2009) was used to align the reads to the hg19 reference genome, followed by Cufflinks (Trapnell et al., 2010) for transcript assembly and RNA expression analysis. The number of redundant reads was low (7.78%). The Samtools package (Li et al., 2009) was used to identify variants, including single nucleotide variants (SNVs) and indels.

Small RNA Sequencing: microRNA-Seq
MicroRNAs were isolated from 10 million PBMCs at five time points (days 4, 21, 116, 185, and 186 from HRV infection) with the mirVana miRNA Isolation Kit (Ambion). microRNA-Seq libraries were prepared from 1 µg of isolated miRNA according to Illumina's Small RNA v1.5 Sample Preparation Guide.
Each library was sequenced with 36 b single-end reads on 1 lane of Illumina's GAIIx sequencer. The human pre-miRNA and miRNA sequences were extracted from miRBase release 17 [hg19]. The SOAP program (Li et al., 2008) was used to map sequence reads to the hairpin sequences with a maximum of 2 bp mismatches. The miRanda algorithm (John et al., 2004) and TargetScan version 5 (Lewis et al., 2005) were used for target prediction. For miR-7, 323 targets were predicted with the TargetScan program, and 240 of the 323 were expressed. Sixty-five expressed mRNAs fit the profile of miRNA expression across the time points tested. At least 108 additional mRNA targets associated with diabetes were predicted with miRanda. The DAVID program (Dennis et al., 2003; Huang et al., 2008) was used for pathway enrichment analysis. To examine the significance of gene term enrichment, the program uses a modified Fisher's exact test (EASE score). The enrichment p-values are globally corrected for multiple hypothesis testing using Benjamini (Huang et al., 2008). Cluster 3 (Eisen et al., 1998) was used to hierarchically cluster the categories of mRNA targets. The Java TreeView program (Saldanha, 2004) was then used to visualize these clusters.

PBMC and Serum Shotgun Proteome Profiling

Protein Extraction and Labeling Using TMT
The PBMC cell pellets were lysed in 10x volume of buffer containing 4% SDS and 100 mM dithiothreitol in 100 mM Tris-HCl pH 8.0. Lysates were incubated at 95°C for 5 min and briefly sonicated. Detergent was removed from the lysates with the FASP protocol using YM-30 microcon filter units (Cat No. MRCF0R030, Millipore). In brief, 200 µl of 8 M urea in 0.1 M Tris/HCl, pH 8.5 was added and samples were centrifuged at 14,000 × g at 20°C for 15 min. This step was repeated 3 times. Then 50 µl of 0.05 M iodoacetamide in 8 M urea was added to the filters and the samples were incubated in darkness for an hour. Each sample was washed 3 times with 100 µl of 200 mM ThAB.
Protein concentration was measured using the Bradford method. Finally, trypsin (Promega, Madison, WI) was added at a protein-to-enzyme ratio of 50:1. Samples were incubated overnight at 37°C. Peptides were collected by centrifugation and labeled using TMT 6plex reagent. Immediately before use, the TMT label reagents were equilibrated to room temperature. For the 0.8 mg vials, 41 µl of anhydrous acetonitrile was added to each tube, and 41 µl of the TMT Label Reagent was then added to each 25-100 µg sample. The reaction was incubated for 1 hr at room temperature. To quench the reaction, 8 µl of 5% hydroxylamine was added to the sample and incubated for 15 min. Samples were combined in equal amounts and dried by speed vac. For the serum proteome, the 14 most abundant proteins were depleted using an Agilent Mars human 14 column (4.6 mm × 50 mm). The unbound fraction from the column was collected for further proteome analysis. The protein sample was then processed as described above.

Peptide Separation
A highly reproducible online Waters 2D liquid chromatography system (Waters NanoAquity 2D nLC) was used for peptide separation. The protein sample was first resuspended in 100 mM ammonium formate at pH 10 and then loaded onto the LC system. Peptides were separated by reverse phase chromatography at high pH in the first dimension, followed by an orthogonal separation at low pH in the second dimension. An online dilution of the effluent was performed after the first dimension to ensure that no peptides were lost prior to the second dimension. In the first dimension the mobile phases were buffer A (20 mM ammonium formate at pH 10) and buffer B (acetonitrile). Peptides were separated on an Xbridge 300 µm × 5 cm C18 5.0 µm column (Waters) using a 14-step discontinuous gradient at 2 µl/min. The acetonitrile concentration for each step was adjusted to ensure nearly equivalent peptide load and MS intensity for each second-dimension run.
Since peptide fractions eluted from the first-dimension column were at high pH and differing acetonitrile concentrations, they were not compatible with the second-dimension separation. To maximize peptide recovery, the fractions were diluted online using 0.1% formic acid in water at 20 µl/min and then trapped on a Symmetry 180 µm × 2 cm C18 5.0 µm trap column (Waters). In the second dimension, peptides were loaded onto an in-house packed 75 µm ID/15 µm tip ID × 20 cm C18-AQ 3.0 µm resin

column with buffer A (0.1% formic acid in water). Peptides were separated with a linear gradient from 5% to 30% buffer B (0.1% formic acid in acetonitrile) at a flow rate of 300 nl/min in 180 min. Each sample separation was repeated 3 times.

Proteomics MS Analysis
The LC system was directly coupled in-line with a linear trap quadrupole (LTQ)-Orbitrap Velos instrument (Thermo Fisher Scientific) via a Thermo nanoelectrospray source. The source was operated at 2.2-2.4 kV to optimize the nanospray, with the ion transfer tube at 200°C. The mass spectrometer was run in data-dependent mode. One survey scan acquired in the Orbitrap mass analyzer with resolution 60,000 at m/z 400 was followed by MS/MS of the 10 most intense peaks with charge state ≥ 2 and above an intensity threshold of 5000. MS/MS fragmentation was done in the higher-energy collisional dissociation (HCD) cell with normalized collision energy of 40% and activation time of 0.1 s. The MS/MS scan was acquired in the Orbitrap at a resolution of 7,500. For all sequencing events, dynamic exclusion was enabled to minimize repeated sequencing. Peaks selected for fragmentation more than once within 30 s were excluded from selection (10 ppm window) for 60 s.

Proteomics Data Processing and Analysis
The raw data acquired were processed with Proteome Discoverer (Thermo). The IPI human database, v. 3.75 (Kersey et al., 2004) was used. A mass tolerance of 10 ppm was used for precursor ions and 0.02 Dalton for fragment ions in the database search. The search included cysteine carbamidomethylation as a fixed modification. N-terminal and lysine TMT 6plex modification and methionine oxidation were used as variable modifications. Up to two missed cleavages were allowed for trypsin digestion. Only unique peptides with a minimum length of 6 amino acids were considered for protein identification. The false discovery rate (FDR) was set at less than 1%, and we required two unique peptides per protein for identification.
For peptide quantitation, only unique peptides with a reporter ion mass tolerance of less than 10 ppm were used. The median value of the different peptide ratios was used for protein quantitation. Downstream analysis of the proteomics data is described below.

Serum Metabolome Profiling

Serum Metabolite Extraction
One hundred microliters of serum was used for the metabolomics study. Metabolites were extracted by adding four volumes of an equal-volume mixture of methanol, acetonitrile, and acetone that had been pre-chilled at −20°C. To maximize metabolite extraction, samples were vortexed at 4°C for 15 min at 2 min intervals. Proteins were precipitated by incubating the sample at −20°C for 2 hr. Samples were then centrifuged at 10,000 rpm at 4°C for 10 min. The supernatant was collected and dried for metabolomics analysis. For each time point, three of the 100 µl samples were analyzed in triplicate.

Metabolomics LC-MS Analysis
An Agilent 1260 Liquid Chromatography system was directly coupled in-line with an Agilent 6538 accurate mass Q-TOF MS with electrospray ionization (ESI) operated in positive and negative modes. The LC mobile phases consisted of 0.2% acetic acid in water (buffer A) and 0.2% acetic acid in methanol (buffer B). The extract was resuspended in 50% methanol and sonicated for 5 min. The sample was loaded onto an Agilent SB-aq 1.8 µm, 2.1 × 50 mm analytical column with an SB-C8 3.5 µm, 2.1 × 30 mm guard column in front. Columns were heated to 60°C with a flow rate of 0.6 ml/min. A linear gradient from 2% to 98% buffer B in 13 min was used for metabolite separation. To assure the mass accuracy of the recorded ions, continuous internal calibration ions were infused in-line through the dual ESI source using an isocratic pump at a flow rate of 0.05 ml/min. Internal calibrants at m/z 121.0509 and 922.0098 were used in positive ion mode, and m/z 119.0362 and 980.0164 were used in negative ion mode. The Q-TOF was operated at a source condition of 3,750 V with drying gas at 9 L/min and nebulizer gas at 45 psi at 300°C.
The instrument was run at extended mass range to 1700 m/z. The fragmentor voltage was set at 125 V and the skimmer at 47 V. The data were acquired at a scan rate of 1.5 spectra/sec for MS. MS/MS was run in targeted mode at a scan rate of 3 spec/sec with 10 spec/sec for MS. A collision energy of 20 V, a fixed isolation window of 4 m/z, and a retention time window of 0.25 min were used for the targeted MS/MS. Each sample was first run in MS mode in both positive and negative modes, and the differentially expressed metabolites were selected for the MS/MS experiment.

Metabolomics Data Processing and Analysis
MassHunter Workstation software (Agilent Technologies), including Qualitative Analysis (version 3.01) and Mass Profiler Professional (MPP version B.02), was used to process both MS and MS/MS data. The Molecular Feature Extractor (MFE) in the Qualitative Analysis software was used to search for features that have a common elution profile; it groups ions into one or more compounds containing m/z values that are related (peaks in the same isotope cluster, different adducts, or charge states of the same entity). The results were exported as files in Compound Exchange Format (CEF files) for further analysis in MPP. MPP was used to align data from different samples, filter data for statistical analysis, and run the database search. For the chromatography alignment, only ions with intensity above 5,000 and a retention time window within 0.2 min were selected. Ions that were not present in all the files were filtered out. For samples from the same time point, the median value was used for that time point. Further statistical analysis was done to find the differentially expressed compounds as described below. The METLIN human metabolites database was used for the database search. Mass tolerance was set at 10 ppm.

Serum C-Reactive Protein and Plasma Insulin Enzyme-Linked ImmunoSorbent Assays
Serum C-Reactive Protein (CRP) levels were quantitated with the hsCRP ELISA Kit from Abnova following the manufacturer's instructions.
Plasma insulin levels were measured with the Human Insulin (Animal Serum Free) ELISA Kit (Millipore) according to the manufacturer's protocol.

Serum Cytokine Profiling
Serum cytokine profiling was performed with the Luminex 51-plex Human Cytokines bead-based assay at the Stanford Human Immune Monitoring Center with the Luminex 200 Instrument. The analytes are listed in Figure 2F plus IL-6 (which was not detected in all the samples for 2 repeated runs). One hundred microliters of serum was used for each time point.

Blood Glucose, Glycated HbA1c, and Triglyceride Measurement
Blood glucose, glycated HbA1c, and triglyceride levels were measured at the Laboratory of the Stanford Hospital and Clinics, if not otherwise stated, along with other standard lipid and chemistry profiles not covered in this manuscript. Glucose levels were measured with the ACCU-CHEK system for days 363–602 (except days 369, 476, 532, 546, and 602). The moving average shown in Figure 2D uses a window of 15 days (7 days before and after each time point). Duplicate measurements were taken for 13 time points using ACCU-CHEK as well as for days 322 and 369, with a variance typically less than 3% and never more than 5%.

Autoantibodyome Profiling
Autoantibodyome profiling was completed for 4 time points (days −123, 0, 4, and 21) using the Invitrogen ProtoArray Protein Microarray v5.0 (which contains 9,483 unique human proteins spotted in duplicate), according to the manufacturer's instructions and as described previously (Hudson et al., 2007). Thirty-four healthy plasma samples were used as controls. Plasma samples were diluted 1:100 in 5 ml Washing Buffer (1X PBS, 0.1% Tween 20, 1X Roti-Block) for the autoantibodyome profiling. The probed protein microarray chips were dried and scanned with the Genepix 4200AL Microarray Scanner (Molecular Devices, Sunnyvale, CA). The arrays were scanned to obtain signal location, intensity quantification, and identification information (.gpr format) using GenePix Pro 6.1 (Molecular Devices).
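The 15-day glucose moving average described above (7 days before and after each time point) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `centered_moving_average` and the sample readings are hypothetical, and the sketch assumes that every reading within the window is weighted equally even though the sampling days are irregular.

```python
def centered_moving_average(days, values, half_window=7):
    """Average all readings taken within +/- half_window days of each time point.

    Assumption: readings are equally weighted; irregular spacing is handled
    only by selecting on the day difference, not by interpolation.
    """
    smoothed = []
    for d in days:
        window = [v for t, v in zip(days, values) if abs(t - d) <= half_window]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical readings (study day, glucose value); not data from the study.
days = [363, 365, 370, 371, 380]
glucose = [110.0, 120.0, 100.0, 130.0, 90.0]
smoothed = centered_moving_average(days, glucose)
```

Each time point always contributes to its own window, so an isolated reading (day 380 in the example) is returned unchanged.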
For each array, inter-array normalization was performed via the ProCAT algorithm (Zhu et al., 2006) (sliding window of length 15). The arrays were then quantile normalized (Bolstad et al., 2003), and probe intensities were compared between the subject and the healthy control group using a two-tailed Mann-Whitney non-parametric test, p < 0.01, in Mathematica 8.0, with Benjamini-Hochberg correction (Benjamini and Hochberg, 1995) for multiple hypothesis tests, adjusted p < 0.01 (Data S2). Biological replicates were compared for reproducibility and showed a high degree of correlation across slides, with R² > 0.894; the Coefficient of Variation (CV) across slides had a median value of 0.0656, and 96.6% of spots had CV < 1 (Data S11C, I.1-2). For protein spots duplicated on the arrays we found R² = 0.99 and median CV 0.04, with 96.5% of signals having CV < 1 (Data S11C, I.3-4).

Telomere Length Assay
Telomere length in the PBMCs of the volunteer subject was measured and calculated with both Southern blotting and high-throughput Q-FISH. Southern blotting of telomeres was performed using the Telo TAGGG Telomere Length Assay Kit (Roche) following the manufacturer's instructions. Telomere length at two time points was investigated (days 255 and 292) to reveal potential telomere length differences between healthy and infected states. X-ray images were digitized with the Typhoon scanner (GE Healthcare) and analyzed with the ImageJ software (Abramoff et al., 2004). High-throughput Q-FISH (HT Q-FISH) was performed using mononuclear cells isolated from peripheral blood with a ficoll separating solution (Lymphoprep™). The cells were then plated on a clear-bottom black-walled 96-well plate, and HT Q-FISH was performed as previously described (Canela et al., 2007). Telomere length values were analyzed using individual telomere spots corresponding to the specific binding of a Cy3-labeled telomeric probe (subject: 5.41 kb compared to age group median: 5.95 kb).
Fluorescence intensities were converted into kilobases as previously described (Canela et al., 2007; McIlrath et al., 2001). Each median telomere length value was calculated and plotted. Linear regression analysis was used to assess the correlation between age and median telomere length or percentage of nuclei with telomeres < 3 kb in lymphocytes of the donors. Median telomere length values and percentage of telomeres < 3 kb (short telomeres) of donors in the indicated age groups were calculated. The number of samples in each group is indicated (n). The minimum, 25th percentile, median, 75th percentile, and maximum values from each age group were calculated and used to create four equal groups, each representing a fourth of the sampled population. GraphPad Prism was used for data calculation. See Figures S2A and S2B.

Genome Phasing
Single nucleotide variants (obtained from a minimum of 2 platforms) and indels (from 3 platforms) of the individual's DNA were phased as summarized in Data S11D [see also Figure S7E, Table S4, and Data S10]. This variant list was augmented with maternal sequence and genotype data, as well as with the phased CEU haplotypes from the 1000 Genomes Project. For variants that are observed in both the subject and the 1000 Genomes haplotypes, phasing was achieved using the program BEAGLE (Browning and Browning, 2007). The maternal genotype is provided to BEAGLE only if the call is high confidence (also from a minimum of two platforms); otherwise the data are considered missing. Novel variants not observed in the 1000 Genomes haplotypes are phased based on a Mendelian inheritance pipeline and the maternal genotype alone. The two data sets are then merged, followed by correction with any experimental data (including data from Complete Genomics on haplotyping and paired-end sequencing, if available).
The inferred maternally and paternally derived haploid genomes were then analyzed with the program Polyphen-2 (Adzhubei et al., 2010) to identify the biological impact of the phased variants. A secondary pipeline was developed to identify compound heterozygous variants; it tags the genes that accumulate variants (SNPs and indels) found on different alleles. This study focused on compound missense and
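The Benjamini-Hochberg adjustment applied to the autoantibody comparison above can be sketched in a few lines. This is a generic illustration of the step-up procedure (Benjamini and Hochberg, 1995), not the Mathematica code used in the study; the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    The adjusted value for the p-value of rank r (1 = smallest) out of m tests
    is min over ranks k >= r of p_(k) * m / k, capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum enforces
    # monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-probe p-values (e.g., from a Mann-Whitney test).
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [a < 0.05 for a in adj]
```

Probes would then be retained when the adjusted value falls below the chosen threshold (adjusted p < 0.01 in the text).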

nonsense mutations, which may potentially be damaging. The identified genes are further categorized into three compound heterozygous types: Type 1, genes with at least one heterozygous variant on each allele; Type 2, genes with both homozygous and accumulated heterozygous variants on each allele; and Type 3, homozygous variants with additional heterozygous variant(s) on only allele 1 (Type 3A) or on only allele 2 (Type 3B) (see also Table S4).

Variants Identified in RNA: Heteroallelic Expression and RNA Editing
Variants in RNA-Seq data were identified using Samtools (Li et al., 2009), as described above, and compared against the hg19 reference genome. The RITE-2-seq (RNA Identifier Tool for Expression and Edits) pipeline was developed to identify RNA variants as summarized in Data S11A. A minimum of 40 unique reads (as well as 10 unique reads) were obtained at a variant position and compared to the high- and low-confidence single nucleotide genomic calls (as described above). Variants that matched DNA were subsequently characterized as heterozygous or homozygous (Table S3), and heterozygous calls were analyzed for differential allele-specific expression (ASE). Variants that were not in the genome were deemed candidate RNA edits and were further filtered to remove false positives due to misalignments (multigene families and pseudogenes), as well as close-proximity variants (errors likely due to alignment to an uncharacterized isoform; mapping errors accumulated within a window of 10 bp were removed). These candidates were also re-compared to both low- and high-confidence exome data (described above) to remove any remaining DNA-based variants (high-confidence candidate RNA edits are summarized in Data S8).
Polyphen-2 and ANNOVAR (Wang et al., 2010), as well as in-house developed callers, were used to localize the variants to genic regions, and those identified as missense calls were further used in this omics study to validate the corresponding variant transcripts at the protein level (further described below). RNA realignment with the corrected personalized genome and corrected transcriptome (see Data Dissemination Section below for availability) will particularly help reduce reference mapping bias. To evaluate differential allele-specific expression (ASE) at each site, we used a two-component beta-binomial mixture model (similar to that used in Skelly et al., 2011). Under this model, the number of observed non-reference allele reads, X_mt, given the total read depth, N_mt, is assumed to have a binomial distribution, Binom(N_mt, p_mt). With probability 1 − π, p_mt is drawn from a beta distribution, Beta(α, α), and with probability π it is drawn from a second beta distribution, Beta(δ, δ), such that δ < α. The parameters α, δ, and π are estimated by maximizing the likelihood function. For the first infection cycle, the estimates are α = 78, δ = 4, and π = 0.11; the second infection cycle is more overdispersed, with α = 45, δ = 2.4, and π = 0.17. The posterior probability that the observation X_mt is derived from the second component is interpreted as the strength of allele-specific expression. The distribution of this posterior probability is shown in Figure S7B; although most sites reveal no differential ASE, a few sites and time points show convincing evidence that the ratio is not (50%, 50%). We also estimated a shrunk ASE ratio (alternate allele count / total count ratio; alt/tot) for each data point by a weighted average under the two components (minimal over-dispersion). For Figures 5C and 5D, alt/tot ratios from infection states (day 0 and day 289) were compared to uninfected states (days 116–255 and days 311–400, respectively).
All heatmaps and histogram analyses were performed using the rescaled shrunken ASE ratios, with a minimum coverage of 40 reads (RNA-Seq) across a minimum of 5 and 13 time points for HRV and RSV infections, respectively. Heatmaps examining differential ASE were generated using the R program (version 2.13.1), where missing data points were imputed using row means (for multiple points) and the k-nearest-neighbor method (for single points). Single and average linkage hierarchical clustering with the Pearson correlation distance metric was performed for the heatmaps. These figures contain all variant positions, including missense, synonymous, and UTR locations. All heatmaps are based on the ratio of the alternative allele or edited nucleotide to total expression (alt/tot). Genes with differentially expressed alleles were further investigated for functional clustering using DAVID [Database for Annotation, Visualization, and Integrated Discovery (Huang et al., 2008)], and those with KEGG pathways and GO patterns with Benjamini p < 0.05 were of particular interest during this time course study. RNA editing expression was analyzed using RNA-Seq data from the 20 time points (minimum of 7 time points per infection course), with the binomial test (log transformed modification) performed on reads with a minimum coverage of 40 (RNA-Seq), selecting p < 0.001 as the cutoff for candidates with RNA edited expression. The DNAnexus, Inc. genomic browser was used to view the location of the variants relative to the gene [from the NCBI RefSeq database (Pruitt et al., 2007)]. Chromas version 2.33 (Technelysium Pty Ltd.) was used to view Sanger sequencing traces of cDNA generated from RNA at the corresponding time points, which were used for validation of heteroallelic expression and RNA edits (Figures 5–6 and S7). Candidates for differential ASE and editing were also validated via digital droplet PCR quantification using the QuantaLife Droplet Reader (Bio-Rad Laboratories, Inc.).
Here, cDNA at the respective time point was prepared, followed by emulsion droplet preparation consisting of FAM and VIC variant-specific probes, gene-specific primers, and an emulsion PCR pool (Hindson et al., 2011).

Variants Identified in Proteins
For variants (SNVs and edits) identified in the genome, we created a workflow to identify such changes at the protein level (Data S11B). After analyzing the data with Polyphen 2 (Adzhubei et al., 2010), the identified missense information and protein locations were obtained. A protein database was created using the available information based on the IPI (Kersey et al., 2004) human database v. 3.83. For each variant, a corresponding protein sequence was constructed and added to a modified database. Additionally, a database of masses for modified peptides was calculated after in silico digestion to obtain a mass list for targeted identification experiments. The spectra obtained from both the targeted and untargeted experiments were searched independently against both the modified and unmodified (original sequence) databases containing the proteins of interest. Variant peptide candidates were identified in Proteome
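The posterior probability from the two-component beta-binomial mixture described above can be sketched with the standard library alone. This is an illustrative reading of the model, not the authors' implementation: the function names are hypothetical, the symbols `alpha`, `delta`, and `pi` stand for the two symmetric beta shape parameters and the mixing weight, and the shrunk ratio is assumed here to be the posterior-weighted average of each component's posterior mean of the allelic ratio.

```python
from math import lgamma, log, exp


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_betabinom(x, n, a, b):
    """Log pmf of a beta-binomial: Binom(n, p) with p ~ Beta(a, b)."""
    log_choose = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
    return log_choose + log_beta(x + a, n - x + b) - log_beta(a, b)


def ase_posterior(x, n, alpha, delta, pi):
    """Posterior probability that (x, n) came from the overdispersed component."""
    log_w_null = log(1.0 - pi) + log_betabinom(x, n, alpha, alpha)
    log_w_ase = log(pi) + log_betabinom(x, n, delta, delta)
    d = log_w_null - log_w_ase
    # Numerically stable logistic function of -d.
    if d > 0:
        e = exp(-d)
        return e / (1.0 + e)
    return 1.0 / (1.0 + exp(d))


def shrunk_ratio(x, n, alpha, delta, pi):
    """Assumed form: posterior-weighted average of the two component means."""
    post = ase_posterior(x, n, alpha, delta, pi)
    mean_null = (x + alpha) / (n + 2.0 * alpha)
    mean_ase = (x + delta) / (n + 2.0 * delta)
    return (1.0 - post) * mean_null + post * mean_ase


# Parameter values quoted in the text for the first infection cycle.
ALPHA, DELTA, PI = 78.0, 4.0, 0.11
balanced = ase_posterior(50, 100, ALPHA, DELTA, PI)   # 50 alt reads of 100
skewed = ase_posterior(90, 100, ALPHA, DELTA, PI)     # 90 alt reads of 100
```

A balanced site receives a low posterior and its shrunk ratio stays at 0.5, whereas a strongly skewed site receives a posterior near 1 and a ratio pulled only mildly toward 0.5.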
Discoverer (Thermo Scientific) using the built-in SEQUEST (Eng et al., 1994) algorithm by searching against the constructed databases augmented with a reversed database search (Elias et al., 2004; Gygi et al., 1999; Peng et al., 2003) [with a False Discovery Rate (FDR) < 0.01 and requiring 1 unique peptide per protein for identification]. The identified peptides were then filtered and selected if they matched only the modified database and additionally aligned successfully to the original database using the local Smith-Waterman algorithm (Smith and Waterman, 1981), to verify exact matching with a single mismatch of the input modified amino acid. Peptide variants corresponding to SNVs without an entry in dbSNP were classified as private. Furthermore, heterozygous peptide candidates were identified for the proteins that matched the modified database if they also aligned to the original database and showed a single mismatch to the modified database. All candidate peptides were then separated into high- and low-confidence variant candidate lists if they exactly matched unique proteins, with a single mismatch of the input modified amino acid, after searching the IPI database sequences via BLAST (Altschul et al., 1997). Results are summarized in Data S9.

Von Willebrand Factor Cleaving Protease Activity Assay
Protease activity was assayed by scanning densitometry of dimers of the 176 kd fragment of Von Willebrand Factor (VWF) generated by the addition of VWF substrate treated with guanidine hydrochloride to subject plasma. The percent protease activity was calculated as the percentage of VWF cleaving activity compared to that measured in plasma from pooled normal controls (defined as 1 U ml⁻¹) (Tsai and Lian, 1998).

General Omics Analysis Framework and Result Summaries
Analysis Framework
During the investigation multiple omics data were collected at each time point. Each data set was analyzed as outlined in Figure S5.
Namely, data sets were: (1) preprocessed using methods appropriate to the omics type, toward a similar goal of ultimately integrating the different omics platforms; (2) spectrally analyzed and classified into significant categories; and (3) assessed for biological significance through enrichment analysis. Most statistical analysis was performed using Mathematica 8.0 (Wolfram Research, 2010) (except as indicated below). In this section we provide further details pertaining to the analysis framework of each data set, as well as a summary of the methodology at each step.

(1) Data Preprocessing. After completion of the various experiments the initial raw data were first preprocessed for each omics type as outlined below to obtain a vector-normalized set of time points for each constituent:

(a) Transcriptome: Illumina reads (.fastq files) were mapped to hg19 (Genome Reference Consortium GRCh37) using TopHat (Trapnell et al., 2009), followed by Cufflinks (Trapnell et al., 2010) for transcript assembly and estimation of expression levels using RefSeq (Pruitt et al., 2009; Pruitt et al., 2007) annotation. Data across the different time points were matched by accession, and Quality Control (QC) filtering was performed, requiring that at minimum one data point per accession displayed expression levels > 5 FPKM (Trapnell et al., 2010). The filtered data sets were then quantile normalized across all data points. Log-2 ratios of expression with respect to (w.r.t.) healthy time points (day 255) were then vector normalized (Euclidean metric) to one for each accession-number set. Concurrently, a bootstrap distribution of n > 100,000 timed sample sets was obtained (non-parametric sampling with replacement for each time point) for statistical comparison (see part (2) below).

(b) Proteome: Spectra were obtained from three TMT (Tandem Mass Tag) labeled samples (with three technical replicates each) for relative quantitation analysis.
As described above, protein identification was carried out using Proteome Discoverer, with FDR < 0.01 and requiring two unique peptides per protein for identification. For relative quantitation each time point was compared to a healthy time point, day 255, and all ratios were normalized by Proteome Discoverer so that the average ratio per sample is one. After protein identification, the three sets were matched using a replicated common ratio present in all three (namely, in this investigation, for PBMC proteins using the 131/126 ratio for the intensities of tags with masses 126 amu and 131 amu, corresponding to days 255 and 301, respectively, showing high reproducibility, with correlations R² > 0.72, Data S11C, II). Additionally, QC assessment required a coefficient of variation (CV) < 0.13 for the replicated ratio (corresponding to excluding outliers > 3 standard deviations from the median CV); that the reference (day 255) mass tag always be present in all three samples; and that a minimum of 2/3 points be present for all proteins identified. The log-2 relative ratios were again vector normalized to one (Euclidean metric), and again a non-parametric bootstrap distribution (n > 100,000 samples) was constructed by sampling each time point with replacement.

(c) Metabolome: Spectra from profiling at each time point were obtained with 3 technical replicates each and aligned for mass and retention times using MassHunter (Agilent Technologies) as described above. The aligned spectra information was filtered for a minimum of 2/3 time points being present for each mass identified in the mass spectrometry sets, for which the median of the replicates was calculated, retaining data displaying a CV < 0.4. The log-2 distribution of each time-point set was standardized (baselining) to the median and average median deviation of its own distribution. Additionally, a non-parametric bootstrap distribution of 100,000 samples was constructed by sampling each time-point set with replacement.
For both simulated and original data, the difference, $\sigma_{\Delta} = \sigma_t - \sigma_{\mathrm{healthy}}$, was computed, comparing the median deviation, $\sigma_t$, of each mass at time point t from its distribution median to the median deviation of each mass, $\sigma_{\mathrm{healthy}}$, at the healthy time point (day 255) from its own distribution median. Finally, the set of differences was vector normalized (Euclidean metric) for each mass.

(2) Common Framework Data Classification. After all data had been vector normalized, they were analyzed to determine trends that dynamically emerge for each transcript, protein or metabolite. As the data sampling was uneven in time, a spectral analysis approach was adopted. For each time-series curve a periodogram was constructed by oversampling the frequency space using a Lomb-Scargle (Lomb, 1976; Scargle, 1982, 1989) [Fourier] transformation, which has been successfully applied in astronomy
(Gregory, 2005; Van Dongen et al., 1999; Van Dongen et al., 2001; Yang et al., 2011) for unevenly sampled time series data and implemented in various forms for biological problems (Abramoff et al., 2004; Ahdesmaki et al., 2007; Parkhomchuk et al., 2009; Schimmel, 2001; Van Dongen et al., 2001; Yang et al., 2011; Zhao et al., 2008). Briefly, the Lomb-Scargle method is equivalent to performing a linear least-squares fit of harmonic functions for a given time series. Namely, for a time series $X_{t_j}$, $j \in \{1, 2, \ldots, N\}$, sampled at N arbitrary points, the periodogram can be written as (Van Dongen et al., 1999)

$$P_X(\omega) = \frac{1}{2}\left\{ \frac{\left[\sum_j X_{t_j} \cos\omega(t_j - \tau)\right]^2}{\sum_j \cos^2\omega(t_j - \tau)} + \frac{\left[\sum_j X_{t_j} \sin\omega(t_j - \tau)\right]^2}{\sum_j \sin^2\omega(t_j - \tau)} \right\}, \qquad (1.1)$$

where $\tau$ is given by

$$\tan(2\omega\tau) = \frac{\sum_j \sin 2\omega t_j}{\sum_j \cos 2\omega t_j}. \qquad (1.2)$$
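Equations 1.1 and 1.2 translate almost directly into code. The following Python sketch is an illustration only and is not the Mathematica implementation used in this study; the function name `lomb_scargle`, the frequency grid and the synthetic test signal are assumptions made for the example.

```python
import numpy as np

def lomb_scargle(t, x, omegas):
    """Evaluate the periodogram of Equation 1.1 at the angular frequencies `omegas` (> 0)."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    power = np.empty(len(omegas))
    for i, w in enumerate(omegas):
        # Equation 1.2: offset tau that makes the sine and cosine terms orthogonal.
        tau = np.arctan2(np.sum(np.sin(2 * w * t)), np.sum(np.cos(2 * w * t))) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        # Equation 1.1: normalized squared projections onto cosine and sine.
        power[i] = 0.5 * (np.dot(x, c) ** 2 / np.dot(c, c)
                          + np.dot(x, s) ** 2 / np.dot(s, s))
    return power

# Synthetic example (assumed data): 20 unevenly spaced sampling days over 400 days,
# carrying a single oscillation with a 100-day period.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 400.0, size=20))
true_omega = 2 * np.pi / 100.0
x = np.sin(true_omega * t)

# Oversample the frequency space, as described in the text.
omegas = np.linspace(0.005, 0.3, 2000)
power = lomb_scargle(t, x, omegas)
peak_omega = omegas[np.argmax(power)]
print("true omega:", true_omega, "periodogram peak:", peak_omega)
```

Because the offset from Equation 1.2 makes the two harmonic terms orthogonal, the periodogram value at a given frequency equals half the squared norm of the least-squares harmonic fit at that frequency, which is why a noise-free sinusoid peaks at its own frequency even with irregular sampling.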

After obtaining the periodogram, the original time series was reconstructed using an inverse Fourier transform and evenly resampling frequencies/times (as discussed by Scargle and Hocke et al. (Hocke, 1998; Hocke and Kampfer, 2009; Scargle, 1982, 1989)). This allowed us to reconstruct the series so that standard time-series analysis methods could be applied, and to fill in gaps in a robust fashion, given that the spectral approach considers the entire time-series data as a whole, in contrast to other local linear or spline interpolation methods. The data were then classified into three groups. (I) After reconstructing each time-series curve, $Y_{t_j}$, we considered the autocorrelation, $r_k = \sum_{j=1}^{N-k} (Y_{t_j} - \mu_Y)(Y_{t_{j+k}} - \mu_Y) \big/ \sum_{j=1}^{N} (Y_{t_j} - \mu_Y)^2$, at lag k = 1, as a check of non-randomness. A class of autocorrelated signals was selected (p < 0.05 cutoff, one-tailed), based on obtaining a distribution of the autocorrelations from the bootstrap distributions constructed for each data set. As an example, for transcriptome data for the duration of the time course this corresponds to $r_1$ > 0.25 (Data S11C, III), in good agreement with theoretical values (Anderson, 1942) for the length of data, N = 20. After removal of the autocorrelated signals from the set, the remaining signals were checked for aberrant spikes, that is, significantly high or low signal instances compared to what would be expected in a random distribution. Signals that displayed aberrant high signals (p < 0.05, one-tailed, by comparison to analysis of a randomly simulated distribution of normalized time signals of corresponding length N for each time series) were classified as (II) spike maxima, while signals that displayed aberrant low signals (p < 0.05, one-tailed) were classified as (III) spike minima. Thus three classes of significant trends were selected for each of the input omics data sets.

(3) Clustering.
The classified data sets from (2) above were clustered using the hierarchical agglomerative algorithm in Mathematica 8.0, with correlation distance and average linkage. Once the clustering was determined, the number of clusters per agglomerated data set was ascertained by inspection of the fusion coefficients of their respective dendrograms. To assess the biological significance of each of the obtained clusters, gene-based pathway and ontology enrichment and network analysis were performed using Cytoscape (Cline et al., 2007; Shannon et al., 2003; Smoot et al., 2011). Namely, the Reactome (Croft et al., 2011; Joshi-Tope et al., 2005; Matthews et al., 2007; Matthews et al., 2009; Vastrik et al., 2007) Functional Interaction (FI) plugin was used to assess membership of genes in Reactome and KEGG (Kanehisa and Goto, 2000) pathways and to calculate enrichment (p < 0.05, FDR < 0.05). Furthermore, Gene Ontology (GO) (Ashburner et al., 2000) analysis was performed using the BiNGO (Maere et al., 2005) plugin for Cytoscape, for significantly enriched membership (p < 0.05 and Benjamini-Hochberg (Benjamini and Hochberg, 1995) adjusted p < 0.05) in each of the Cellular Component (CC), Molecular Function (MF) and Biological Process (BP) categories.
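The vector normalization of step (1) and the autocorrelation check (I) of step (2) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Mathematica code used in this study: the Lomb-Scargle reconstruction is skipped, the input is synthetic noise standing in for log-2 ratios, the function names are hypothetical, and the bootstrap is assumed to draw each time point with replacement from the values observed at that time point across constituents.

```python
import numpy as np

def vector_normalize(series):
    """Scale a time series to unit Euclidean norm."""
    series = np.asarray(series, dtype=float)
    norm = np.linalg.norm(series)
    return series / norm if norm > 0 else series

def lag_autocorrelation(y, k=1):
    """Autocorrelation r_k of a series about its mean (lag k >= 1)."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    return np.sum(d[:-k] * d[k:]) / np.sum(d ** 2)

def bootstrap_cutoff(data, n_boot=5000, alpha=0.05, seed=0):
    """One-tailed cutoff on r_1 from a bootstrap null distribution.

    Assumption: each simulated series takes, for every time point, a value
    drawn with replacement from that time point's column in `data`.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n_rows, n_times = data.shape
    idx = rng.integers(0, n_rows, size=(n_boot, n_times))
    sims = data[idx, np.arange(n_times)]          # shape (n_boot, n_times)
    sims = sims / np.linalg.norm(sims, axis=1, keepdims=True)
    null_r1 = np.array([lag_autocorrelation(s) for s in sims])
    return np.quantile(null_r1, 1.0 - alpha)

# Synthetic stand-in for 500 constituents measured at N = 20 time points.
rng = np.random.default_rng(1)
noise = rng.normal(size=(500, 20))
data = noise / np.linalg.norm(noise, axis=1, keepdims=True)
cutoff = bootstrap_cutoff(data)

# A smooth trend (one full oscillation across the time course) for comparison.
smooth = vector_normalize(np.sin(2 * np.pi * np.arange(20) / 20))
is_autocorrelated = lag_autocorrelation(smooth) > cutoff
print("r1 cutoff:", cutoff, "smooth series in class (I):", is_autocorrelated)
```

Drawing the cutoff from simulated series of the same length as the data, rather than from an asymptotic formula, mirrors the data-specific bootstrap comparison described above and keeps the threshold tied to the actual number of time points.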

Results Summaries and File Guide
In this section we provide results summaries for the dynamical analysis following the analysis framework outlined above. In particular, the results from the main text and relevant tables are also included in the supplemental tables, following the naming conventions of the associated figures as outlined below. Based on the criteria indicated in part (A) above, all data below are grouped into classes and assessed for biological significance through enrichment analysis:

Transcriptome: Entire Time Course
Expression levels for 19,714 distinct isoforms from RNA-Seq data analysis were consistently tracked from day 0 to day 400 of the study, covering the onset of both HRV and RSV infections (see Figure 1C for isoform distributions). Of these isoforms, 4,922 were grouped in the autocorrelation class, while 3,718 were categorized as spike maxima and 7,891 as spike minima. Clustering and significant results from the enrichment analysis are shown in Figure S6A and associated Data S6.

Proteome PBMC: RSV Infection
Relative expression levels for 3,731 PBMC proteins were consistently tracked from day 186 to day 400 of the study, covering the onset of RSV infection after day 289. Of the tracked proteins, 257 were grouped in the autocorrelation class, while others displayed significant aberration from the median response, namely 1,240 showing spike maxima and 1,194 showing spike minima. Clustering and significant results from the enrichment analysis are shown in Figure S3A and associated Data S3.
Proteome Serum: HRV Infection
Relative expression levels for 664 serum proteins were consistently tracked from day 0 to day 116 of the study, covering the onset of HRV infection (for this part of the analysis day 116 was used for TMT ratios, corresponding to ratios w.r.t. the 130 amu tag in the spectra). Ninety-four were grouped in the autocorrelation class, 57 categorized as spike maxima and 40 as spike minima. Clustering and significant results from the enrichment analysis are shown in Figure S3B and associated Data S5.

Metabolome: HRV and RSV Infections
For the HRV infection (days 0–185), 6,862 distinct serum metabolite m/z intensities were tracked. Of these, 385 were grouped in the autocorrelation class, 506 categorized as spike maxima and 748 as spike minima. For the RSV infection (days 255–400), 4,228 distinct serum metabolite m/z intensities were tracked; 475 were grouped in the autocorrelation class, 577 categorized as spike maxima and 884 as spike minima. Given the modest number of identified metabolites based solely on mass (~20%), enrichment analysis did not yield significant pathways, and further pathway associations will be discussed elsewhere. Clustering and overlap results are found in Figure S4 and associated Data S4.

Integrated Proteome, Transcriptome and Metabolome for RSV Infection
The different omics data set classes were clustered together for the transcriptome, PBMC proteome and serum metabolome for days 186 to 400 of the study, covering the onset of RSV infection and high glucose levels in the latter stages of the investigation. Additional clustering and overlap results are found in Figure S6B and associated Data S7.

Data Dissemination
All omics data are being deposited in public databases. Transcriptome data (FASTQ files) and Protein Array data (GPR files) are being submitted to the GEO database. The GEO accession number for the RNA-Seq and miRNA-Seq data reported in this paper is GSE33029.
Whole-genome sequences [Complete Genomics, Illumina (whole genome and exome) for both the subject and mother] are being submitted to SRA. The SRA accession number for the WGS sequence reported in this paper is SRP008054.4. Proteome and Metabolome Mass Spectra data are being submitted to TRANCHE (http://www.proteomecommons.org). TRANCHE Hash Tags for data retrieval:
Serum Metabolome: KhnRUmK/eiEDV+X7Pw1W3dQ8rXpUn0ru2bkRljND0kePwkIRm8khmZ4iG1qi1ZKQOcDRG4UKIoaIuWcH bTVEekK/Uv0AAAAAAAt6HQ==
Serum Proteome: rpG8r/vpebuy09wtxMFnqMt9cvt3qebxzBbnkeYaaQI/ABgwJkMqR8nfVtO4OU8AlfMTzckwkGQYQSNVjaUlR TeWM4IAAAAAAAA4yQ==
PBMC Proteome: (1) lC16sAF1o/eC9MPAmvTQi5rg3R1SpR2DO0wn6Mb13G+qZmyaoi9bEmSYOmQuCLehfk1uSJrYsSCNU6 vxx5oLpT3bzMIAAAAAAAAn6A==
PBMC Proteome: (2) SAC4BrLreg/h3S3lTSaJs9YziPZ4yrC7bfXhOaAuzAALv7U8teGuzr+WlfzevSFZKWBCWo b5onkUAJ0xItKEQPLkeQUAAAAAAABqkQ==
SUPPLEMENTAL REFERENCES
Abramoff, M.D., Magalhaes, P.J., and Ram, S.J. (2004). Image Processing with ImageJ. Biophotonics International 11, 36–42.
Adzhubei, I.A., Schmidt, S., Peshkin, L., Ramensky, V.E., Gerasimova, A., Bork, P., Kondrashov, A.S., and Sunyaev, S.R. (2010). A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249.
Ahdesmaki, M., Lahdesmaki, H., Gracey, A., Shmulevich, L., and Yli-Harja, O. (2007). Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data. BMC Bioinformatics 8, 233.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.
Anderson, R.L. (1942). Distribution of the serial correlation coefficient. Ann. Math. Stat. 13, 1–13.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al.; The Gene Ontology Consortium. (2000). Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc., B 57, 289–300.
Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193.
Browning, S.R., and Browning, B.L. (2007). Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097.
Canela, A., Vera, E., Klatt, P., and Blasco, M.A. (2007). High-throughput telomere length quantification by FISH and its application to human population studies. Proc. Natl. Acad. Sci. USA 104, 5300–5305.
Clark, M.J., Chen, R., Lam, H.M., Karczewski, K.J., Euskirchen, G., and Snyder, M. (2011). Exome DNA Sequencing: A Comparison of Enrichment Technologies. Nat. Biotechnol.
Cline, M.S., Smoot, M., Cerami, E., Kuchinsky, A., Landys, N., Workman, C., Christmas, R., Avila-Campilo, I., Creech, M., Gross, B., et al. (2007). Integration of biological networks and gene expression data using Cytoscape. Nat. Protoc. 2, 2366–2382.
Croft, D., O'Kelly, G., Wu, G., Haw, R., Gillespie, M., Matthews, L., Caudy, M., Garapati, P., Gopinath, G., Jassal, B., et al. (2011). Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 39 (Database issue), D691–D697.

Dennis, G., Jr., Sherman, B.T., Hosack, D.A., Yang, J., Gao, W., Lane, H.C., and Lempicki, R.A. (2003). DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 4, 3.
Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95, 14863–14868.
Elias, J.E., Gibbons, F.D., King, O.D., Roth, F.P., and Gygi, S.P. (2004). Intensity-based protein identification by machine learning from a library of tandem mass spectra. Nat. Biotechnol. 22, 214–219.
Eng, J.K., McCormack, A.L., and Yates, J.R., III. (1994). An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989.
Gregory, P.C. (2005). Bayesian logical data analysis for the physical sciences: a comparative approach with Mathematica support (Cambridge Univ Pr).
Gygi, S.P., Rochon, Y., Franza, B.R., and Aebersold, R. (1999). Correlation between protein and mRNA abundance in yeast. Mol. Cell. Biol. 19, 1720–1730.
Hindson, B.J., Ness, K.D., Masquelier, D.A., Belgrader, P., Heredia, N.J., Makarewicz, A.J., Bright, I.J., Lucero, M.Y., Hiddessen, A.L., Legler, T.C., et al. (2011). High-throughput droplet digital PCR system for absolute quantitation of DNA copy number. Anal. Chem. 83, 8604–8610.
Hocke, K. (1998). Phase estimation with the Lomb-Scargle periodogram method (European Geophysical Society).
Hocke, K., and Kampfer, N. (2009). Gap filling and noise reduction of unevenly sampled data by means of the Lomb-Scargle periodogram. Atmos. Chem. Phys. 9, 4197–4206.
Huang, D.W., Sherman, B.T., and Lempicki, R.A. (2008). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57.
Huang, W., Sherman, B.T., and Lempicki, R.A. (2009). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13.
Hudson, M.E., Pozdnyakova, I., Haines, K., Mor, G., and Snyder, M. (2007). Identification of differentially expressed proteins in ovarian cancer using high-density protein microarrays. Proc. Natl. Acad. Sci. USA 104, 17494–17499.
John, B., Enright, A.J., Aravin, A., Tuschl, T., Sander, C., and Marks, D.S. (2004). Human MicroRNA targets. PLoS Biol. 2, e363.
Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., Jassal, B., Gopinath, G.R., Wu, G.R., Matthews, L., et al. (2005). Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 33 (Database issue), D428–D432.
Kanehisa, M., and Goto, S. (2000). KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30.
Kersey, P.J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E., and Apweiler, R. (2004). The International Protein Index: an integrated database for proteomics experiments. Proteomics 4, 1985–1988.
Kokame, K., Matsumoto, M., Soejima, K., Yagi, H., Ishizashi, H., Funato, M., Tamai, H., Konno, M., Kamide, K., Kawano, Y., et al. (2002). Mutations and common polymorphisms in ADAMTS13 gene responsible for von Willebrand factor-cleaving protease activity. Proc. Natl. Acad. Sci. USA 99, 11902–11907.
Lam, Y.K.H., Pan, C., Clark, M.J., Lacroute, P., Chen, R., Haraksingh, R., O'Huallachain, M., Gerstein, M.B., Kidd, J.M., Bustamante, C.D., and Snyder, M. (2012). Detecting and annotating genetic variations using the HugeSeq pipeline. Nat. Biotech. 30, 226–229.
Levy, G.G., Nichols, W.C., Lian, E.C., Foroud, T., McClintick, J.N., McGee, B.M., Yang, A.Y., Siemieniak, D.R., Stark, K.R., Gruppo, R., et al. (2001). Mutations in a member of the ADAMTS gene family cause thrombotic thrombocytopenic purpura. Nature 413, 488–494.
Lewis, B.P., Burge, C.B., and Bartel, D.P. (2005). Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120, 15–20.
Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R.; 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079.
Li, R., Li, Y., Kristiansen, K., and Wang, J. (2008). SOAP: short oligonucleotide alignment program. Bioinformatics 24, 713–714.
Lomb, N. (1976). Least-squares frequency analysis of unequally spaced data. Astrophys. Space Sci. 39, 447–462.
Maere, S., Heymans, K., and Kuiper, M. (2005). BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21, 3448–3449.
Matthews, L., D'Eustachio, P., Gillespie, M., Croft, D., de Bono, B., Gopinath, G., Jassal, B., Lewis, S., Schmidt, E., Vastrik, I., et al. (2007). An Introduction to the Reactome Knowledgebase of Human Biological Pathways and Processes. Bioinformatics Primer (NCI/Nature Pathway Interaction Database).
Matthews, L., Gopinath, G., Gillespie, M., Caudy, M., Croft, D., de Bono, B., Garapati, P., Hemish, J., Hermjakob, H., Jassal, B., et al. (2009). Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res. 37 (Database issue), D619–D622.
McIlrath, J., Bouffler, S.D., Samper, E., Cuthbert, A., Wojcik, A., Szumiel, I., Bryant, P.E., Riches, A.C., Thompson, A., Blasco, M.A., et al. (2001). Telomere length abnormalities in mammalian radiosensitive cells. Cancer Res. 61, 912–915.
Parkhomchuk, D., Borodina, T., Amstislavskiy, V., Banaru, M., Hallen, L., Krobitsch, S., Lehrach, H., and Soldatov, A. (2009). Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic Acids Res. 37, e123.
Peng, J., Elias, J.E., Thoreen, C.C., Licklider, L.J., and Gygi, S.P. (2003). Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome Res. 2, 43–50.
Pruitt, K.D., Tatusova, T., Klimke, W., and Maglott, D.R. (2009). NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res. 37 (Database issue), D32–D36.
Pruitt, K.D., Tatusova, T., and Maglott, D.R. (2007). NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 35 (Database issue), D61–D65.
Saldanha, A.J. (2004). Java Treeview - extensible visualization of microarray data. Bioinformatics 20, 3246–3248.
Scargle, J.D. (1982). Studies in astronomical time series analysis. II-Statistical aspects of spectral analysis of unevenly spaced data. Astrophys. J. 263, 835–853.

Scargle, J.D. (1989). Studies in astronomical time series analysis. III-Fourier transforms, autocorrelation functions, and cross-correlation functions of unevenly spaced data. Astrophys. J. 343, 874–887.
Schimmel, M. (2001). Emphasizing difficulties in the detection of rhythms with Lomb-Scargle periodograms. Biol. Rhythm Res. 32, 341–345.
Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., and Ideker, T. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504.
Skelly, D.A., Johansson, M., Madeoy, J., Wakefield, J., and Akey, J.M. (2011). A powerful and flexible statistical framework for testing hypotheses of allele-specific gene expression from RNA-seq data. Genome Res. 21, 1728–1737.
Smith, T.F., and Waterman, M.S. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197.
Smoot, M.E., Ono, K., Ruscheinski, J., Wang, P.L., and Ideker, T. (2011). Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 27, 431–432.
Trapnell, C., Pachter, L., and Salzberg, S.L. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111.
Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M.J., Salzberg, S.L., Wold, B.J., and Pachter, L. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515.
Tsai, H.M., and Lian, E.C. (1998). Antibodies to von Willebrand factor-cleaving protease in acute thrombotic thrombocytopenic purpura. N. Engl. J. Med. 339, 1585–1594.
Upshaw, J.D., Jr. (1978). Congenital deficiency of a factor in normal plasma that reverses microangiopathic hemolysis and thrombocytopenia. N. Engl. J. Med. 298, 1350–1352.
Van Dongen, H.P., Olofsen, E., VanHartevelt, J.H., and Kruyt, E.W. (1999). A procedure of multiple period searching in unequally spaced time-series with the Lomb-Scargle method. Biol. Rhythm Res. 30, 149–177.
Van Dongen, H.P., Ruf, T., Olofsen, E., VanHartevelt, J.H., and Kruyt, E.W. (2001). Analysis of problematic time series with the Lomb-Scargle Method, a reply to emphasizing difficulties in the detection of rhythms with Lomb-Scargle periodograms. Biol. Rhythm Res. 32, 347–354.
Vastrik, I., D'Eustachio, P., Schmidt, E., Gopinath, G., Croft, D., de Bono, B., Gillespie, M., Jassal, B., Lewis, S., Matthews, L., et al. (2007). Reactome: a knowledge base of biologic pathways and processes. Genome Biol. 8, R39.
Wang, K., Li, M., and Hakonarson, H. (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164.
Wolfram Research, Inc. (2010). Mathematica, Version 8.0 (Champaign, Illinois: Wolfram Research, Inc.).
Yang, R., Zhang, C., and Su, Z. (2011). LSPR: an integrated periodicity detection algorithm for unevenly sampled temporal microarray data. Bioinformatics 27, 1023–1025.
Zhao, W., Agyepong, K., Serpedin, E., and Dougherty, E.R. (2008). Detecting periodic genes from irregularly sampled gene expressions: a comparison study. EURASIP J. Bioinform. Syst. Biol. 2008, 769293.
Zhu, X., Gerstein, M., and Snyder, M. (2006). ProCAT: a data analysis approach for protein microarrays. Genome Biol. 7, R110.

Figure S1. Supplemental iPOP Results, Related to Figure 1


(A) Representative genomic region not present in the reference genome (hg19): An assembled genomic contig containing the gene PECAM1 is shown. This region was discovered by contig assembly of the unmapped reads from WGS. Top track, WGS reads mapped to the contig; bottom track, RNA-Seq coverage of reads mapped uniquely to this contig. (B) Genomic variants affecting transcription factor binding sites: (B1) In EDIL3, the ancestral allele G, homozygous in 6 of the 8 samples and the subject, disrupts the motif, whereas the allele A promotes binding of NFκB. (B2) In BMF the subject is also homozygous for a T allele disrupting the NFκB motif at rs539846, which lies in the first exon of the gene. (C) Isoforms per gene: At every time point the number of isoforms detected for every Official Gene Symbol was computed.

Figure S2. Supplemental Medically Relevant Results, Related to Figure 2


(A) TERT A202T mutation and telomere length assay for PBMCs: Left top panel shows the Sanger sequencing result displaying the heterozygous A202T mutation (T/C). Left bottom table shows the telomere length assay sample and result summary. Lanes 1 and 2, PBMC DNA from the subject at day 255 and day 292, respectively; Lanes 3-7, healthy controls that are free of this mutation. S, Subject (the volunteer subject); C, Control. Right panel shows the telomere length assay Southern blot result. (B) Percentage of chromosomes with short telomeres (<3 kb) as determined by High-Throughput Q-FISH. The green, light green, peach and red colors represent the first, second, third and fourth quartile in each age group, respectively. The black dot represents the test subject. (C) Insulin ELISA: Plasma insulin concentration at each time point was determined by ELISA. Day numbers are shown relative to the first day of the HRV infection. Error bars: standard deviation of 3 assays. (D) Personal SNPs that lead to compensatory changes in the hairpin of miR-4273: For the pre-miRNA, hsa-mir-4273, the SNPs presented are from the dbSNP database. (E) Correlation of the levels of miR-7 with plasma insulin, where miR-95 and miR-125a are shown as controls.

Figure S3. Protein Classification and Clustering, Related to Figure 4


Dynamic protein data were grouped into (I) autocorrelated, (II) spike maxima and (III) spike minima classes and clustered hierarchically, shown here for: (A) PBMC proteins, following the dynamics of the RSV infection and high glucose onset. For each labeled cluster, enrichment analyses may be found in Data S3. (B) Serum proteins, following the dynamics of the HRV infection. For each labeled cluster, enrichment analyses may be found in Data S5. (C) The overlaps between identified serum proteins (HRV infection time course) and PBMC proteins (HRV and RSV infection time courses) were determined.

Figure S4. Clustering for Metabolites, Related to Figure 4


Metabolite data, following separately the dynamics of the HRV infection and of the RSV infection and high glucose onset, were grouped into (I) autocorrelated, (II) spike maxima and (III) spike minima classes and clustered hierarchically. For each labeled cluster, associated metabolites may be found in Data S4.

[Figure S5 flowchart: Dynamic Data Analysis Framework. Panel (1), Data Preprocessing: raw RNA-Seq (paired-end .fastq), protein (Velos Orbitrap LTQ spectra) and metabolite (Q-TOF) data sets pass through QC, normalization and vector normalization relative to the healthy physiological state, each with its own bootstrap distribution. Panel (2), Common Framework Data Classification: Lomb-Scargle based spectral analysis, signal reconstitution, and classification into (I) autocorrelated data (lag 1, p < .05), (II) spike maxima and (III) spike minima (p < .05), all based on data-specific bootstraps; remaining signals are low-priority data. Panel (3), Clustering and Enrichment Analysis: cluster selection based on fusion coefficient analysis (Mathematica), BiNGO GO enrichment analysis (MF, BP, CC), Reactome FI pathway analysis per cluster, and Cytoscape data visualization and integration, using gene-based annotation (RefSeq + Uniprot) and metabolite KEGG annotation.]

Figure S5. Integrated Omics Analysis Framework, Related to Figures 3 and 4


Each omics data type is preprocessed according to its platform and then analyzed through a common framework with a view toward data integration.
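
The preprocessing step that brings the data types onto a common scale, a log ratio with respect to the healthy state followed by vector normalization, might look like the minimal sketch below. This is an illustration under stated assumptions, not the authors' code: it assumes the input is already log-transformed, that the healthy reference is the mean over designated healthy time points, and that "vector normalization" means scaling to unit Euclidean norm. The function name and the `healthy_indices` parameter are hypothetical.

```python
import math


def normalize_to_healthy(log_values, healthy_indices):
    """Difference from the healthy reference, scaled to unit Euclidean norm."""
    # Assumption: the reference is the mean of the healthy time points.
    healthy = sum(log_values[i] for i in healthy_indices) / len(healthy_indices)
    deltas = [v - healthy for v in log_values]
    norm = math.sqrt(sum(d * d for d in deltas))
    if norm == 0:
        return deltas  # series never departs from the healthy reference
    return [d / norm for d in deltas]
```

After this step every series has the same overall magnitude, so transcripts, proteins, and metabolites measured in different units can be compared by the shape of their time course.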

Figure S6. Supplementary Clustering Details, Related to Figures 3 and 4


Dynamic data were grouped into (I) autocorrelated, (II) spike maxima, and (III) spike minima classes and clustered hierarchically, shown here for: (A) Transcriptome data for the duration of the project. For each labeled cluster, enrichment analyses may be found in Data S6. (B) Integrated omics data (transcriptome, proteome, and metabolome) following the dynamics of the RSV infection and high glucose onset. For each labeled cluster, enrichment analyses and associated metabolites may be found in Data S7.

Figure S7. Heteroallelic Expression and Editing in PBMCs, Related to Figure 5


(A) Sanger cDNA sequencing of selected heteroallelic expressed genes confirms heterozygosity but not the ratio of the alternate allele (left), while digital PCR is used to validate differential allele-specific expression (ASE) alt/tot ratios across the day 0 and day 186 time points (right). (B) Distribution of the posterior probability of ASE based on the two-component beta-binomial distribution model. The posterior probability that the observation, Xmt, is derived from the second component is interpreted as the strength of the ASE. (C) Heatmap of the RSV infection time course (min. 10 time points, 684 sites, posterior probability > 0.75 at one or more time points) showing differential ASE with distinct patterning during onset of high glucose, day 307 (red arrow). (D) Heatmap of the RSV infection time course (min. 10 time points, 258 sites, posterior probability > 0.75 at day 307) showing differential ASE focused on day 289 (onset of RSV infection, red arrow); the onset of T2D at day 307 is also shown (red arrow). (E) Two adjacent allele-specific phased variants in the ENDOD1 3′ UTR show concordance in alt/tot expression using digital PCR of cDNA.
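
To make the posterior probability in (B) concrete, the sketch below computes, for one site, the probability that an alternate-allele read count came from the second component of a two-component beta-binomial mixture. The legend does not give the fitted parameters, so every value here is an assumption chosen for illustration: a first component concentrated near balanced expression (`balanced=(50, 50)`), a dispersed second component (`imbalanced=(1, 1)`), and a mixing weight of 0.1. The function names are hypothetical as well.

```python
import math


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(x, n, a, b):
    """P(X = x) for a beta-binomial with n trials and shape parameters a, b."""
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return math.exp(log_choose + log_beta(x + a, n - x + b) - log_beta(a, b))


def ase_posterior(alt_reads, total_reads, balanced=(50.0, 50.0),
                  imbalanced=(1.0, 1.0), weight_imbalanced=0.1):
    """Posterior probability that the count comes from the second component.

    All default parameters are illustrative assumptions, not fitted values.
    """
    p_first = (1.0 - weight_imbalanced) * beta_binomial_pmf(
        alt_reads, total_reads, *balanced
    )
    p_second = weight_imbalanced * beta_binomial_pmf(
        alt_reads, total_reads, *imbalanced
    )
    return p_second / (p_first + p_second)
```

Under these assumed parameters, a site with 95 alternate reads out of 100 receives a posterior well above the 0.75 cutoff used in (C) and (D), while a site with 50 out of 100 does not.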
