
STAT:4720

Large Data Analysis Capstone

Learning from Data

Lecture 1
Jan. 24, 2017

Kate Cowles
374 SH, 335-0727
kate-cowles@uiowa.edu
Learning from data

• statistical learning and machine learning: different but related strategies for learning from data
• supervised learning: using one or more inputs to predict or estimate an output
  - have data in which values of the output are known
  - goal is to develop a way of predicting output values for future data where only the inputs are known
Examples of supervised learning problems: identifying digits in handwritten ZIP codes

• https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/introduction.pdf
• input: binary-coded image
• output: classification as one of 10 digits

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 0 0 0 1 1 1 0 0 0 0 0
1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0
1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0
1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0
1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0
1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0, 6

• task is to classify such images into one of a set of defined categories
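A minimal sketch of how one such record could be split into inputs and output. It assumes the record is stored as text exactly as shown above (a 16 x 16 grid of 0/1 pixels followed by a comma and the digit label); the helper name `parse_record` is hypothetical and not part of any course material.

```python
# The record shown above: 16 rows of 16 binary pixels, then ", 6" (the label).
RECORD = """\
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0
0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
1 1 1 1 1 0 0 0 1 1 1 0 0 0 0 0
1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0
1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0
1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0
1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0
1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0, 6"""


def parse_record(text):
    """Hypothetical helper: return (x, y) for one text record.

    x is a flat list of 0/1 pixel values (the inputs);
    y is the digit label (the output).
    """
    pixels_part, label_part = text.rsplit(",", 1)
    x = [int(token) for token in pixels_part.split()]
    y = int(label_part)
    return x, y


x, y = parse_record(RECORD)
print(len(x), "inputs; output label:", y)
```

Note that the label is stored as a number but is qualitative: it names a category, so this is a classification problem.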


Examples of supervised learning problems: predicting findings from prostate removal based on measurements available before surgery

• prostate cancer: affects 1 in 7 U.S. men in their lifetimes
• vast majority of prostate cancers are not aggressive
  - will not spread beyond prostate
  - man will die with it, not of it
• treatments for prostate cancer have serious side effects
  - radical prostatectomy: surgical removal of prostate
  - radiation
• prostate-specific antigen (PSA) test: blood test that measures level of a protein produced by prostate cells
  - prostate cancer is one of many possible causes of elevated PSA
• prostate core needle biopsy: 12 samples of prostate tissue obtained with needle and graded by pathologist
• Can information from non-surgical tests be used to predict whether a man's prostate cancer is aggressive?
Supervised learning: setup

Observations in a dataset with

• one or more outcome variables Y
• one or more (usually many) variables X that may be useful in predicting or explaining Y

                     X                       Y
Statistics           Independent variables   Dependent variables
                     Predictors              Responses
                     Covariates
Machine learning     Inputs                  Outputs
Pattern recognition  Features
Supervised learning: more on data

• type of outcome variable
  - Y is quantitative: regression problem
  - Y is qualitative: classification problem
• training data: observations with both X and Y available
Supervised learning: goals

Use training data to do one or more of the following:

• accurately predict outcome variables in new test data (predictions)
• determine which input variables are related to the outcome variable and in what ways (statistical inference)
• determine how good our predictions and inferences are
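The three goals can be sketched with a deliberately simple stand-in model. The example below assumes one quantitative input and one quantitative output and uses a least-squares line; that choice of model, the function names, and all numbers are made up for illustration and are not taken from the course data.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single input variable."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mean_squared_error(actual, predicted):
    """Average squared difference between known and predicted outputs."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))


# Made-up numbers, for illustration only.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
test_x, test_y = [5.0, 6.0], [10.0, 13.0]

# Inference: the fitted slope describes how the output changes with the input.
intercept, slope = fit_line(train_x, train_y)

# Prediction: apply the fitted line to test inputs.
predictions = [intercept + slope * x for x in test_x]

# Assessment: compare predictions with the known test outputs.
test_mse = mean_squared_error(test_y, predictions)
print(f"slope={slope:.2f}, test MSE={test_mse:.2f}")
```

The error is computed on test data that played no part in fitting, which is what makes it a fair measure of how good the predictions are.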
Statistical learning and machine learning

                   Machine learning                            Statistical learning
Parent discipline  Artificial intelligence (computer science)  Statistics
Emphases           Large scale; prediction accuracy            Models (relationships); uncertainty

But the differences between the two approaches are lessening.


Exploratory data analysis

• important prior to applying any supervised learning method
• graphical and numeric
• lab example
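A rough sketch of the numeric and graphical sides of exploratory analysis, using only the Python standard library. The function names and the input values are made up for illustration; the lab itself may use different tools.

```python
from statistics import mean, median, quantiles


def numeric_summary(values):
    """Numeric EDA: basic location and spread summaries for one variable."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "n": len(values),
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "mean": mean(values),
        "q3": q3,
        "max": max(values),
    }


def histogram_counts(values, bins=4):
    """Graphical EDA: counts per equal-width bin, the basis of a histogram."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0   # avoid zero width if all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # top edge goes in last bin
        counts[idx] += 1
    return counts


# Made-up values, for illustration only.
values = [2, 4, 4, 5, 7, 9, 10, 12]
summary = numeric_summary(values)
counts = histogram_counts(values)

print(summary)
for i, c in enumerate(counts):
    print(f"bin {i}: " + "*" * c)   # crude text histogram
```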
Prostate cancer dataset

• Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E. and Yang, N. (1989) Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate II. Radical prostatectomy treated patients, Journal of Urology 16: 1076-1083.
• included in R package ElemStatLearn
• variables available before surgery
  - age: in years
  - gleason: the Gleason score based on biopsy; integer values from 2 (not likely to spread quickly) to 10 (likely to be aggressive)
  - lpsa: log of PSA test value
• variables available only after prostate is removed
  - lweight: log prostate weight
  - lbph: log of the amount of benign prostatic hyperplasia
  - pgg45: percent of Gleason grade 4 or 5
  - lcavol: log cancer volume
  - svi: seminal vesicle invasion (if yes, cancer has spread outside of prostate)
  - lcp: log of capsular penetration (how far the tumor has extended through the capsule surrounding the prostate)
• variable added for analysis purposes
  - train: logical; training or test data
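One way the variable grouping and the train flag might be used is sketched below. The column names come from the list above, but the three rows are invented placeholders (not values from the dataset), the helper names are hypothetical, and treating svi as the outcome is an assumption made only for this illustration.

```python
# Column groups as described above.
PRE_SURGERY = ["age", "gleason", "lpsa"]
POST_SURGERY = ["lweight", "lbph", "pgg45", "lcavol", "svi", "lcp"]


def split_by_train_flag(rows):
    """Separate rows into training and test sets using the logical train column."""
    train = [r for r in rows if r["train"]]
    test = [r for r in rows if not r["train"]]
    return train, test


def select_columns(rows, names):
    """Extract the named columns from each row, in order."""
    return [[r[name] for name in names] for r in rows]


# Invented placeholder rows, NOT real data; only some columns are included.
rows = [
    {"age": 60, "gleason": 6, "lpsa": 1.0, "svi": 0, "train": True},
    {"age": 65, "gleason": 7, "lpsa": 2.5, "svi": 1, "train": True},
    {"age": 70, "gleason": 7, "lpsa": 2.0, "svi": 0, "train": False},
]

train_rows, test_rows = split_by_train_flag(rows)

# Inputs X: variables known before surgery.
X_train = select_columns(train_rows, PRE_SURGERY)
X_test = select_columns(test_rows, PRE_SURGERY)

# Output Y: one post-surgery finding (svi chosen here as an assumption).
y_train = [r["svi"] for r in train_rows]
y_test = [r["svi"] for r in test_rows]

print(len(X_train), "training rows;", len(X_test), "test rows")
```

Keeping the post-surgery variables out of X mirrors the clinical question: predictions must rely only on what is known before the prostate is removed.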
