Xuhu Wan*
March 4, 2018
Introduction to Statistical Analysis Xuhu Wan
Contents
1 Introduction
1.1 Big Data
1.2 Importance of Big Data
1.3 Big Data Changing Financial Trading
1.4 Learning Statistics to Become a Data Scientist
1.5 Why Python
9 Sampling Distribution
Part I
What Data Scientists Do
1 Introduction
1.1 Big Data
Big data is a term that describes the large volume of data – both structured and unstructured –
that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important.
It’s what organizations do with the data that matters. Big data can be analyzed for insights that
lead to better decisions and strategic business moves.
While the term “big data” is relatively new, the act of gathering and storing large amounts of
information for eventual analysis is ages old. In the early 2000s, industry analyst Doug Laney
articulated the now-mainstream definition of big data as the three Vs:
• Volume. Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. In the past, storing it would’ve been a problem – but new technologies (such as Hadoop) have eased the burden.
• Velocity. Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time.
• Variety. Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions.
Two additional dimensions reveal the value of big data:
• Variability. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage. Even more so with unstructured data.
• Complexity. Today’s data comes from multiple sources, which makes it difficult to link, match, cleanse and transform data across systems. However, it’s necessary to connect and correlate relationships, hierarchies and multiple data linkages or your data can quickly spiral out of control.
One retail example of big data in action is generating coupons at the point of sale based on the customer’s buying habits.
With large amounts of information streaming in from countless sources, banks are faced with finding new and innovative ways to manage big data. While it’s important to understand customers and boost their satisfaction, it’s equally important to minimize risk and fraud while maintaining regulatory compliance. Big data brings big insights, but it also requires financial institutions to stay one step ahead of the game with advanced analytics.
When government agencies are able to harness and apply analytics to their big data, they gain
significant ground when it comes to managing utilities, running agencies, dealing with traffic
congestion or preventing crime. But while there are many advantages to big data, governments
must also address issues of transparency and privacy.
Customer relationship building is critical to the retail industry – and the best way to manage
that is to manage big data. Retailers need to know the best way to market to customers, the most
effective way to handle transactions, and the most strategic way to bring back lapsed business.
Big data remains at the heart of all those things.
• model fitting.
With regard to inference:
• Parameter Estimation
• Hypothesis testing
• Bayesian Analysis
• Identifying the best estimator
• Linear Regression
• Non-linear Regression
• Categorical Data Analysis / Classification
• Time Series / Longitudinal Analysis
• Machine Learning
Before we start, I’d like to tell you why I use Python for data analytics. I will try to convince you that Python is really the best tool for most of the tasks involved.
Ideally, I would like to learn only one language that is suited for all kinds of work: number crunching, application building, web development, interfacing with APIs, etc. This language would be easy to learn, its code would be compact and clear, and it would run on any platform. It would enable me to work interactively, letting the code evolve as I write it, and it would be at least free as in speech. Most importantly, I care much more about my own time than the CPU time of my PC, so number-crunching performance is less important to me than my own productivity.
Matching different languages against positions on employment websites supports this choice.
Python, like most open source software, has one specific characteristic: it can be challenging for a beginner to find their way around thousands of libraries and tools. This guide will help you get everything you need into your quant toolbox, hopefully without any problems. Fortunately, there are several distributions containing most of the required packages, making installation a breeze. The best distribution, in my opinion, is Anaconda from Continuum Analytics.
The Anaconda distribution includes:
We first need to load the data into the notebook and store it in a special format, "DataFrame", which is an important data type provided by the Python module "pandas". To use pandas, we need to import it:
import pandas as pd
"pd" is a short name which can be self-defined. Next we will import the data, "facebook.csv".
csv is an abbreviation of "Comma-Separated Values", which means that different values saved in
"facebook.csv" is separated by comma.
fb = pd.read_csv('data/facebook.csv', index_col=0, parse_dates=True)
We read the data from "facebook.csv" using the pandas function "read_csv" (the older "DataFrame.from_csv" has been removed from recent versions of pandas). All method calls in Python are followed by "( )". The data is stored with a special data type/class, "DataFrame", and we give this DataFrame the name "fb". To check the type of "fb", we can
print ( type ( fb ) )
<class 'pandas.core.frame.DataFrame'>
A DataFrame:
• has its own specialized methods, e.g. head(), tail(), describe()
• and its own attributes, e.g. index, columns, shape, which are not followed by "( )"
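The method/attribute distinction can be seen on a toy DataFrame (the values here are made up for illustration):

```python
import pandas as pd

# A toy DataFrame: methods like head() need "()", attributes like shape do not.
df = pd.DataFrame({"Open": [1.0, 2.0, 3.0], "Close": [1.5, 2.5, 3.5]})

print(df.head(2))        # method call: returns the first two rows
print(df.shape)          # attribute access: (3, 2), no parentheses
print(list(df.columns))  # attribute: ['Open', 'Close']
```

Calling an attribute with "( )", e.g. df.shape(), raises a TypeError, which is a common beginner mistake.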
Using head(), we can check the columns, the index and, if it is time series data, the starting date. Now we know that our Facebook data starts from Dec 31 and has 6 columns/variables: "Open", "High", "Low", "Close", "Adj Close", "Volume".
Python uses 0-based indexing: "Open" is column 0 and "High" is column 1. The "column" before column 0 is not a variable; it is the index of the DataFrame "fb".
We can also use tail() to check the bottom of the DataFrame:
fb.tail()
Volume
Date
2018-01-30 14270800
2018-01-31 11964400
2018-02-01 12980600
2018-02-02 17961600
2018-02-05 28869000
Now we know that the end date is Feb 05, 2018. Each row is the information of one trading day. If we want to know how many trading days there are in total, we can
fb.shape
(780, 6)
"shape" is an attribute of DataFrame. It is a tuple: the first element states how many rows there are and the second how many columns/variables. We can also print the columns’ names:
fb.columns
• Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display
• Enables automatic and explicit data alignment
• Allows intuitive getting and setting of subsets of the data set
In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of a pandas DataFrame. The primary focus will be on DataFrame and Series (a column of a DataFrame), as they have received the most development attention in this area.
We have two major approaches to selecting data: selection by label and selection by position.
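The two approaches can be contrasted on a toy price table (the dates and prices here are illustrative, not the Facebook data):

```python
import pandas as pd

# Toy price table indexed by date strings (hypothetical values).
df = pd.DataFrame(
    {"Open": [20.40, 20.13, 20.13], "Close": [20.05, 20.13, 19.79]},
    index=["2014-12-31", "2015-01-02", "2015-01-05"],
)

# Selection by label: .loc uses index labels and column names.
by_label = df.loc["2015-01-02", "Close"]

# Selection by position: .iloc uses 0-based integer positions.
by_position = df.iloc[1, 1]

print(by_label, by_position)  # both pick the same cell
```

Both expressions select the same cell; .loc asks "which row/column name?", while .iloc asks "which row/column number?".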
Date
2014-12-31 20.049999
2015-01-02 20.129999
2015-01-05 19.790001
2015-01-06 19.190001
2015-01-07 19.139999
2015-01-08 19.860001
2015-01-09 19.940001
2015-01-12 19.690001
2015-01-13 19.660000
2015-01-14 19.740000
2015-01-15 19.600000
2015-01-16 19.959999
2015-01-20 20.020000
2015-01-21 20.299999
Name: Close, dtype: float64
<class 'pandas.core.series.Series'>
20.299999
Open High
Date
2014-12-31 20.400000 20.510000
2015-01-02 20.129999 20.280001
2015-01-05 20.129999 20.190001
2015-01-06 19.820000 19.840000
2015-01-07 19.330000 19.500000
fb2015.plot()
plt.show()
If we want to get a single column or multiple columns, we can do the following:
a_column = fb['Close']
multiple_columns = fb[['Open', 'Close']]
shift() is a method of DataFrame or Series. shift(-1) gives the value one day later; shift(1) gives the value one day earlier.
In the computation above, we can see that the difference of two columns is computed pairwise, number by number, between the two columns. This is a very nice property of DataFrame: element-wise operation.
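A tiny example (on a made-up Series, not the Facebook data) shows how shift() and element-wise subtraction combine:

```python
import pandas as pd

s = pd.Series([10.0, 11.0, 13.0])

# shift(-1) aligns each row with the NEXT row's value;
# shift(1) aligns each row with the PREVIOUS row's value.
tomorrow = s.shift(-1)   # [11.0, 13.0, NaN]
yesterday = s.shift(1)   # [NaN, 10.0, 11.0]

# Element-wise subtraction pairs the values row by row.
diff = tomorrow - s      # [1.0, 2.0, NaN]
print(diff.tolist())
```

The last entry is NaN because there is no "tomorrow" for the final row; pandas propagates the missing value through the subtraction.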
fb.head()
PriceDiff
Date
2014-12-31 0.080000
2015-01-02 -0.339998
2015-01-05 -0.600000
2015-01-06 -0.050002
2015-01-07 0.720002
r_t = (S_{t+1} − S_t) / S_t
fb['Return'] = (fb['Close'].shift(-1) - fb['Close']) / fb['Close']
fb.head()
PriceDiff Return
Date
2014-12-31 0.080000 0.003990
2015-01-02 -0.339998 -0.016890
2015-01-05 -0.600000 -0.030318
2015-01-06 -0.050002 -0.002606
2015-01-07 0.720002 0.037618
fb.head()
Then we can save it to a local data file on your laptop:
df.to_csv("data/microsoft.csv")
We will go long (buy) when "MA10" is above "MA50" and do nothing otherwise. Each time we will hold 1 share of Microsoft. We need to define a variable (column) with the name "Shares". We will use a list comprehension to do this job; list comprehension is an extremely useful technique for data analysis. We define a new list:
newlist = [1 if ms.loc[ei, 'MA10'] > ms.loc[ei, 'MA50'] else 0 for ei in ms.index]
This list comprehension computes the shares by iterating through all index values of 'ms'. Then we can define a column of 'ms' using this list. List comprehensions are very useful when preprocessing data.
For example,
alist = [1, 2, 3, 4, 5]
More than that, we can also transform the values in alist following a more complicated rule:
• if x > 3, 'Pass'
• else, 'Fail'
pflist = ['Pass' if x > 3 else 'Fail' for x in alist]
pflist
Since we use the close price to compute the signal, this is equivalent to saying that we evaluate our signals when the market is about to close and then decide whether to buy or sell. If ms["Shares"] is 1, we will buy one share of stock, and the profit is the close price of tomorrow minus the close price of today. Otherwise we will short sell one share of stock, and the profit is the close price of today minus the close price of tomorrow. We need to define a new variable, "Profit".
ms['Close1'] = ms['Close'].shift(-1)
ms['Profit'] = [ms.loc[ei, 'Close1'] - ms.loc[ei, 'Close'] if ms.loc[ei, 'Shares'] == 1 else 0 for ei in ms.index]
It is not clear whether we make or lose money, so we need to compute the cumulative wealth.
ms['Profit'].cumsum().plot()
It seems that we make some money, but we could go bankrupt before we get rich. Later we will learn that this strategy is not good, because its maximum drawdown risk is high.
You can only determine your best parameters on the training set. After the final choice of parameters is made, use them on the test set, which mimics the real trading process: we train our models on historical data and apply the best model tomorrow.
Test.cumsum().plot()
If you use only the training set to adjust parameters, you may choose a different set of numbers.
fast = 140
slow = 160
ms['MAF'] = ms['Close'].rolling(fast).mean()
ms['MAS'] = ms['Close'].rolling(slow).mean()
ms['Shares'] = [1 if ms.loc[ei, 'MAF'] > ms.loc[ei, 'MAS'] else -1 for ei in ms.index]
ms['Profit'] = [ms.loc[ei, 'Close1'] - ms.loc[ei, 'Close'] if ms.loc[ei, 'Shares'] == 1 else 0 for ei in ms.index]
Train = ms['Profit'].iloc[:-200]
Test = ms['Profit'].iloc[-200:]
Train.cumsum().plot()
Figure 14: Performance in train data with parameters tuned in train data.
Figure 15: Performance in test data with parameters tuned in train data
Hence the best parameters (models) on the training data may not be the best parameters for the test data.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn
Y = Close_{t+1} − Close_t
ms['Y'] = ms['Close'].shift(-1) - ms['Close']
We will use the following predictors (independent variables):
• X1, X2: change of close price, today and yesterday
• X3, X4: difference between high and low, today and yesterday
• X5, X6: difference between high and close, today and yesterday
• X7, X8: change of volume, today and yesterday
ms['X1'] = ms['Close'] - ms['Close'].shift(1)
ms['X2'] = ms['Close'].shift(1) - ms['Close'].shift(2)
ms['X3'] = ms['High'] - ms['Low']
ms['X4'] = ms['High'].shift(1) - ms['Low'].shift(1)
ms['X5'] = ms['High'] - ms['Close']
ms['X6'] = ms['High'].shift(1) - ms['Close'].shift(1)
ms['X7'] = ms['Volume'] - ms['Volume'].shift(1)
ms['X8'] = ms['Volume'].shift(1) - ms['Volume'].shift(2)
predictors = ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8']
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
ms.head()
Y X1 X2 X3 X4 X5 \
Date
2015-01-05 -0.680000 -0.429996 0.309997 0.480000 0.879997 0.399998
2015-01-06 0.579998 -0.680000 -0.429996 1.209999 0.480000 1.099998
2015-01-07 1.360000 0.579998 -0.680000 0.969997 1.209999 0.229999
2015-01-08 -0.400001 1.360000 0.579998 1.029999 0.969997 0.160000
2015-01-09 -0.590001 -0.400001 1.360000 0.919998 1.029999 0.630001
X6 X7 X8 Y_predict
Date
2015-01-05 0.660000 11760000.0 6361400.0 0.183903
2015-01-06 0.399998 -3226000.0 11760000.0 0.233266
2015-01-07 1.099998 -7333800.0 -3226000.0 0.022806
2015-01-08 0.229999 531100.0 -7333800.0 -0.009937
2015-01-09 0.160000 -5702400.0 531100.0 0.032701
Y = β0 + β1 X1 + β2 X2 + … + β7 X7 + β8 X8 + e
0.13597169031644135
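The fitting code that produces this mean squared error is not shown in the extracted text; a minimal sketch of the same procedure on synthetic data (the variables X, y and the coefficients here are made up, not the Microsoft data) is:

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the predictors X1..X8 and the target Y.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 8)),
                 columns=[f"X{i}" for i in range(1, 9)])
y = 0.5 * X["X1"] - 0.2 * X["X3"] + rng.normal(scale=0.1, size=200)

lm = linear_model.LinearRegression()
lm.fit(X, y)                      # estimate beta_0 ... beta_8
y_predict = lm.predict(X)         # in-sample predictions
mse = mean_squared_error(y, y_predict)
print(mse)
```

On the real data, X would be ms[predictors] and y would be ms['Y']; the printed number is the in-sample mean squared error.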
Can we generate a profit using our prediction model?
• if Y_predict > 0, we buy today and sell it tomorrow
• otherwise, remain unchanged.
ms['Profit'] = [ms.loc[t, 'Y'] if ms.loc[t, 'Y_predict'] > 0 else 0 for t in ms.index]
The total profit is
ms['Profit'].sum()
66.28998699999993
Bingo, we make money using our model, and we make more money than with the trend-following strategy.
plt.plot(ms['Profit'])
If we plot the profit, we in fact lose money on some days, but we win on more days than we lose.
plt.plot(ms['Profit'].cumsum())
(0.08531529858429848, 0.6727636361993468)
Hence our average daily profit is about 8 cents. If the transaction cost is less than 8 cents, will we make money for sure? Or not?
(777, 17)
We use 777 days to build a linear regression model and then apply this model to the same 777 days. This is not right. We should separate the data into train and test sets, as described in the last section: build the model on the training data and validate the model and the strategy on the test data. I will leave this as your first assignment.
Suppose we use all historical data to build a regression model, and the average daily profit in the historical data is, for example, $1 per day. We do not know whether the future performance of the model will be similar. In other words, we cannot evaluate the performance of the model correctly.
Instead, we separate all historical data into train and test sets. For example, suppose we have 100 days in total. Then we use the first 80 days as training data for model building. After getting a model (which describes the relationship between your target Y and predictors X1, X2, …) from the training data, we use this model to make predictions on both the train and test data. We can either evaluate the accuracy or fit of the model (we will cover the details in Part III) or evaluate the return if you make decisions based on your model, on both train and test.
We say that the performance of the model is consistent if the return or fit of the model is similar on both the train and test data. Otherwise, the model is said to be overfitting: its performance differs significantly between train and test.
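The consistency check described above can be sketched on synthetic data (the variables and the 80/20 split sizes here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic "100 days" of predictors and a target with noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(scale=0.5, size=100)

# First 80 days for training, last 20 for testing (a chronological split).
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

model = LinearRegression().fit(X_train, y_train)
mse_train = mean_squared_error(y_train, model.predict(X_train))
mse_test = mean_squared_error(y_test, model.predict(X_test))

# If the two errors are of similar size, the model's performance is
# consistent; a much larger test error would indicate overfitting.
print(mse_train, mse_test)
```

Note the split is chronological rather than random, because with time series we must not train on the future and test on the past.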
That is why we need to study statistics in a more systematic way, in order to know:
• Random Variable and Distribution
• Association of Two Variables
• Hypothesis Testing and Significance Level
• Evaluation of Linear Regression Models
The last two lines are to hide warning messages when we run the code.
Background
In this chapter, we have built a regression model for the change of the close price of Microsoft:
ms = pd.read_csv('microsoft.csv', index_col=0, parse_dates=True)
ms['Y'] = ms['Close'].shift(-1) - ms['Close']
ms['X1'] = ms['Close'] - ms['Close'].shift(1)
ms['X2'] = ms['Close'].shift(1) - ms['Close'].shift(2)
ms['X3'] = ms['High'] - ms['Low']
ms['X4'] = ms['High'].shift(1) - ms['Low'].shift(1)
ms['X5'] = ms['High'] - ms['Close']
ms['X6'] = ms['High'].shift(1) - ms['Close'].shift(1)
ms['X7'] = ms['Volume'] - ms['Volume'].shift(1)
ms['X8'] = ms['Volume'].shift(1) - ms['Volume'].shift(2)
ms = ms.dropna(axis=0)
Please notice that, to read the data successfully, you need to put "microsoft.csv" and the assignment notebook in the same folder.
Problem 1.
Divide the data into training data (80%) and test data (20%).
Problem 2.
Build a linear regression model for Y using X1 , . . . , X8 with the training data. Make predictions on the training and test data.
Problem 3.
Compute the average daily profit in the training data and the test data using the following signal-based strategy.
• if Y_predict > 0, buy today and sell it tomorrow
• else, do nothing.
Is the performance of your prediction model consistent? (Consistency means that it has similar
performance in train and test.)
Problem 4.
At the end of section 4.1, we claimed that, if the average daily profit is higher than the transaction cost, the strategy can be implemented. (Now we know that consistency of the model is also a necessary condition.)
But on some days we do not trade stocks, and those days are also counted in the average daily profit. To evaluate implementability more precisely, we need to compute the average daily profit over only those days on which we trade. Could you compute this adjusted average profit for the training and test periods?
Appendix
Launching Jupyter notebook
The Jupyter Notebook App can be launched by clicking on the Jupyter Notebook icon installed by Anaconda in the start menu (Windows) or by typing in a terminal (cmd on Windows):
jupyter notebook
This will launch a new browser window (or a new tab) showing the Notebook Dashboard, a sort of control panel that allows you (among other things) to select which notebook to open.
When started, the Jupyter Notebook App can access only files within its start-up folder (including any sub-folder). If you store the notebook documents in a subfolder of your user folder, no configuration is necessary. Otherwise, you need to choose a folder which will contain all the notebooks and set this as the Jupyter Notebook App start-up folder.
Basics of Python
Python, unlike C++, Java, Scala, etc., is a high-level programming language: it does not make you care about managing the hardware.
Python is considered the easiest coding language to learn. Compared to R, which is another high-level language for data analytics, it is also considered to be more friendly.
For beginners, it is crucial to know the different classes/types of objects (data), and to understand that your data can be loaded or packed in different types.
Core Python consists of all the data types, with their methods and attributes, that are available without importing any advanced modules. It has the following data types:
• integer
• float
• string
• bool
• list
• tuple
• dictionary
• set
For this book, we will use integer, float, string, bool, list and tuple.
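One value of each of these types can be created directly (the sample values here are arbitrary):

```python
# One value of each core type used in this book.
a = 10                  # integer
b = 10.1                # float
c = "hello"             # string
d = True                # bool
e = [1, 2, 3]           # list (mutable, square brackets)
f = (1, 2, 3)           # tuple (immutable, parentheses)

# type() reports the class of any object.
print(type(a).__name__, type(b).__name__, type(c).__name__,
      type(d).__name__, type(e).__name__, type(f).__name__)
```

Each type comes with its own methods and attributes, which is why knowing the type of your data is the first step in working with it.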
Integer, float
a = 10
b = 10.1
b.real
10.1
a and b have different data types/classes; hence they have different methods and attributes. bit_length() is a built-in method for integers, and .real is a built-in attribute for floats.
String
A string is content enclosed in double quotes or single quotes:
c = "I am handsome"
d = "123"
--------------------------------------------------------------------------
TypeError: must be str, not int
You get an error message. We can use str(3) to change the integer 3 into the string "3". Then we have
d + str(3)
and the output is '1233'. Hence the addition of strings concatenates two strings. Strings are very useful in natural language processing. In this book, strings are mainly used for columns’ (variables’) names.
Lists use square brackets. Lists have many built-in methods; the most often used one is "append".
• A list is an ordered set of elements enclosed in square brackets; here we have a list of seven elements.
• Python is 0-indexed, so the first element of any non-empty list, e.g. aList, is always aList[0].
• The last element of this seven-element list is aList[6], because lists are always zero-based.
aList[-1]
'Katy'
• A negative index accesses elements from the end of the list counting backwards. The last
element of any non-empty list, i.e. aList is always aList[-1].
aList[2:5]
• You can get a subset of a list, called a “slice”, by specifying two indices. The return value is a new list containing all the elements of the list, in order, starting with the first slice index (in this case aList[2]), up to but not including the second slice index (in this case aList[5]).
• Slicing works if one or both of the slice indices is negative. If it helps, you can think of it this way: reading the list from left to right, the first slice index specifies the first element you want, and the second slice index specifies the first element you don’t want. The return value is everything in between.
• Lists are zero-based, so aList[0:3] returns the first three elements of the list, starting at aList[0], up to but not including aList[3].
aList[0:3]
A tuple is similar to a list, in the sense that it is a collection of elements of all kinds of types/classes:
aTuple = (10, 12.06, "Tiger", False)
but tuples use parentheses, whereas lists use square brackets. A tuple cannot be changed after it is defined.
aTuple[0] = 20
---------------------------------------------------------------------------
<ipython-input-2-4253fec5e5d3> in <module>()
----> 1 aTuple[0]=20
aList[0] = 10000
Part II
Variables, Samples and Statistical Inferences
Two kinds of random samples are often used:
• When a population element can be selected more than one time, we are sampling with replacement.
• When a population element can be selected only one time, we are sampling without replacement.
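The two kinds of sampling correspond directly to the replace argument of pandas’ sample() (the population below is a made-up Series, and random_state is fixed only for reproducibility):

```python
import pandas as pd

population = pd.Series(range(10))

# With replacement: an element may appear more than once,
# so the sample can even be larger than the population.
with_repl = population.sample(15, replace=True, random_state=0)

# Without replacement: each element appears at most once, so
# the sample size cannot exceed the population size.
without_repl = population.sample(5, replace=False, random_state=0)

print(len(with_repl), len(without_repl))
print(without_repl.is_unique)  # no element repeats
```

Asking for more elements than the population holds with replace=False raises an error, which is a quick way to remember the difference.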
For example, we consider a collection of scores of the first assignment as a population.
data = pd.DataFrame()
data['Population'] = [47, 85, 41, 3, 15, 46, 35, 43, 92,
                      45, 59, 35, 20, 81, 30, 33, 6, 12,
                      38, 10, 11, 48, 4, 99, 62, 72, 15,
                      8, 31, 37, 21, 72, 90, 51, 97, 66,
                      5, 22, 73, 59, 57, 93, 53, 31, 20,
                      82, 20, 39, 82, 22, 28, 56, 94, 73,
                      95, 59, 53, 11, 71, 85, 20, 57, 88]
10 59
54 95
28 31
5 46
45 82
20 11
16 6
61 57
8 92
62 88
Name: Population, dtype: int64
For example, say you want to know the mean income of the subscribers to a particular magazine, a parameter of a population. You draw a random sample of 100 subscribers and determine that their mean income is $27,500 (a statistic). You conclude that the population mean income µ is likely to be close to $27,500 as well. This is an example of statistical inference.
Different symbols are used to denote statistics and parameters, as the following table shows. The values of the parameters do not change from sample to sample. The values of statistics, however, do:
a_sample = data['Population'].sample(10, replace=True)
print("Sample mean is", a_sample.mean())
print("Sample variance is", a_sample.var(ddof=1))
print("Sample standard deviation is", a_sample.std(ddof=1))
print("Sample size is", a_sample.shape[0])
Notice that we use different values of the parameter "ddof" when we calculate the standard deviation and variance for a population. First, we have the following formulas for parameters and statistics, where x1, x2, …, xN are all the items from the population, and x1, x2, …, xn are a random collection from the population which makes up a sample; n is the sample size and N is the population size. We have

x̄ = (x1 + x2 + … + xn) / n,    µ = (x1 + x2 + … + xN) / N

s² = ((x1 − x̄)² + (x2 − x̄)² + … + (xn − x̄)²) / (n − 1),    σ² = ((x1 − µ)² + (x2 − µ)² + … + (xN − µ)²) / N

These formulas look very lengthy. Statisticians use Σ to denote summation, as follows:

x̄ = (∑_{i=1}^{n} xᵢ) / n,    µ = (∑_{i=1}^{N} xᵢ) / N

s² = (∑_{i=1}^{n} (xᵢ − x̄)²) / (n − 1),    σ² = (∑_{i=1}^{N} (xᵢ − µ)²) / N
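These formulas correspond directly to pandas’ ddof argument; a quick numerical check (on a small made-up Series):

```python
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

# Sample variance: denominator n - 1 (ddof=1, pandas' default).
s2_manual = ((x - x.mean()) ** 2).sum() / (n - 1)

# Population variance: denominator N (ddof=0).
sigma2_manual = ((x - x.mean()) ** 2).sum() / n

print(s2_manual, x.var(ddof=1))      # agree
print(sigma2_manual, x.var(ddof=0))  # agree
```

So "ddof" (delta degrees of freedom) simply controls what is subtracted from the sample size in the denominator.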
An often-asked question from beginners is why the sample variance is divided by n − 1 instead of n. We first need to explain what estimators and unbiased estimators are.
An estimator is a statistic of a sample intended to approximate a population parameter.
An unbiased estimator is an estimator whose average over all samples is identical to the population parameter. Otherwise the estimator is called biased.
For example, if we take samples with replacement from the population, compute the sample variance with denominator n − 1 for each sample, and then take the average, we can check whether the sample variance is an unbiased estimator of the population variance.
sample_variance_collection = []
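The sampling cell itself does not appear in the extracted text; a minimal reconstruction (using a shortened population for illustration, and a loop in place of re-running the cell 200 times by hand) might look like:

```python
import pandas as pd

# A small stand-in population (the full 63-score list works the same way).
data = pd.DataFrame()
data['Population'] = [47, 85, 41, 3, 15, 46, 35, 43, 92, 45]

sample_variance_collection = []
for _ in range(200):
    # Draw a sample of size 10 with replacement ...
    a_sample = data['Population'].sample(10, replace=True)
    # ... and record its sample variance (denominator n - 1).
    sample_variance_collection.append(a_sample.var(ddof=1))

print(len(sample_variance_collection))
```

Each pass through the loop plays the role of one manual re-run of the cell.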
We run the cell above 200 times, and then "sample_variance_collection" contains 200 sample variances computed from 200 samples.
print("In total, we get", len(sample_variance_collection), "sample variances")
# len computes the size of a list
sample_variance_collection1 = pd.DataFrame(sample_variance_collection)
print("Average of sample variance is",
      sample_variance_collection1[0].mean())
print("Population variance is", data['Population'].var(ddof=0))
We can see that the average of the sample variances is very close to the population variance. We can expect that, as we continue to take samples, the average will approach the true population variance. Readers can check that the average of sample variances computed with ddof=0 will not approach the population variance.
Hence the sample variance with denominator n − 1 is an unbiased estimator.
More intuitively, when we take a sample, the sample mean is the center of the sample data, which may be quite different from the population mean. Hence the deviation of the sample items from the sample mean is smaller than their deviation from the population mean, so we divide by n − 1, a smaller number than n, to better approximate the population variance.
rb.head()
Volume Turnover
NdateTime
2018-02-06 21:00:00.500 762.0 3007738.0
2018-02-06 21:00:01.000 2450.0 9668594.0
2018-02-06 21:00:01.500 1106.0 4363998.0
2018-02-06 21:00:02.000 540.0 2130974.0
2018-02-06 21:00:02.500 424.0 1673534.0
"rb" is an abbreviation of "reinforcing steel bar, rebar", which is traded in Shanghai Futures
Exchange. The details of Rb contracts can be found here but in Chinese. It is traded from 21:00-
23:00, 9:00-10:15,10:30-11:30,13:30-3:00 on workdays.
Every 500 milliseconds, the exchange will release trading information once, including
Pricet − Pricet−1
Returnt =
Pricet−1
rb.head()
414020
The sample size is huge, but it is still a sample. What is the population in this example? It is the infinite collection of returns (rates) from which our sample possibly comes. We are not sure what the mean, variance or proportions of all possible values in the population are. But with this large sample, we can make inferences (meaningful guesses) about them. To do that, we need to explore our sample to gain more knowledge about this set of data.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
'numpy' is another important module, which can generate arrays of random variables with a given distribution, perform column-wise scientific computation, etc.
6.1 Variables, Cases and DataFrame
The figure above shows a small collection of sea shells gathered on a beach. All the shells in the collection are similar: small disk-shaped shells with a hole in the center. But the shells also differ from one another in overall size and weight, in color, in smoothness, in the size of the hole, etc.
Any data set is something like the shell collection. It consists of cases (observations): the objects in the collected sample.
Each case has one or more attributes or qualities, called variables. This word “variable” emphasizes that it is differences or variation that is often of primary interest. Usually, there are many possible variables. The researcher chooses those that are of interest, often drawing on detailed knowledge of the system that is under study.
The researcher measures or observes the value of each variable for each case. The result is a table, also known as a data frame: a sort of spreadsheet. Within the data frame, each row refers to one case, each column to one variable.
The values 4.3, 12.0 and 3.8 are called observed values of the variable "diameter". Variables come
in two main types:
• Quantitative: a numerical measurement, for instance the diameter or weight of a shell.
• Categorical: a description that can be put simply into words or categories, for instance male
versus female or red versus green versus yellow, and so on.
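The idea can be made concrete with a toy data frame; the shells and their values below are invented for illustration:

```python
import pandas as pd

# Each row is a case (one shell); each column is a variable.
shells = pd.DataFrame({
    'diameter': [4.3, 12.0, 3.8],             # quantitative variable
    'color':    ['white', 'brown', 'white'],  # categorical variable
})
print(shells.shape)  # (3, 2): 3 cases, 2 variables
```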
Volume Turnover
NdateTime
2018-02-06 21:00:00 762.0 3007738.0
2018-02-06 21:00:30 56.0 220954.0
2018-02-06 21:01:00 290.0 1144022.0
2018-02-06 21:01:30 468.0 1849118.0
2018-02-06 21:02:00 38.0 149992.0
Next we compute VAP, the volume-adjusted price:

VAP = (AskPrice × BidVolume + BidPrice × AskVolume) / (AskVolume + BidVolume)

rb['VAP'] = (rb['BidPrice']*rb['AskVolume'] + rb['AskPrice']*rb['BidVolume']) / (rb['AskVolume'] + rb['BidVolume'])
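As a sanity check, here is the formula applied to a single made-up quote. VAP weights each side's price by the opposite side's volume, so the result leans toward the side with more resting volume:

```python
import pandas as pd

# One hypothetical quote: bid 100 with 30 lots, ask 102 with 10 lots.
rb = pd.DataFrame({
    'BidPrice': [100.0], 'AskPrice': [102.0],
    'BidVolume': [30.0], 'AskVolume': [10.0],
})
rb['VAP'] = (rb['BidPrice'] * rb['AskVolume'] + rb['AskPrice'] * rb['BidVolume']) \
            / (rb['AskVolume'] + rb['BidVolume'])
print(rb['VAP'].iloc[0])  # (100*10 + 102*30) / 40 = 101.5, closer to the ask
```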
rb.mode()
rb.mean()
AskPrice 39369.937942
BidPrice 39359.812383
AskVolume 391.124116
BidVolume 383.348102
Price 39364.801559
Volume 65.310723
Turnover 257289.964497
VAP 39364.851415
Return 0.000003
dtype: float64
rb.median()
AskPrice 3.930000e+04
BidPrice 3.929000e+04
AskVolume 3.130000e+02
BidVolume 3.130000e+02
Price 3.929000e+04
Volume 8.000000e+00
Turnover 3.155200e+04
VAP 3.929283e+04
Return 5.855534e-07
dtype: float64
The mean can be affected by extreme values (extremely large or small ones). However, the mode
and the median are not.
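A quick illustration of this robustness, using a made-up series with one outlier:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 4])
s_out = pd.Series([1, 2, 2, 3, 1000])  # same data with one extreme value

# The outlier drags the mean, but the median and mode stay put.
print(s.mean(), s_out.mean())        # 2.4 vs 201.6
print(s.median(), s_out.median())    # 2.0 vs 2.0
print(s.mode()[0], s_out.mode()[0])  # 2 vs 2
```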
rb.std()
AskPrice 4.800383e+02
BidPrice 4.800329e+02
AskVolume 3.520719e+02
BidVolume 3.450304e+02
Price 4.799575e+02
Volume 6.468859e+02
Turnover 2.568355e+06
VAP 4.800255e+02
Return 3.469254e-04
dtype: float64
Variation is also affected by extreme values. In order to evaluate the variation of the data
correctly, we need to use the interquartile range. First, let us compute a quantile.
rb.quantile(0.5)
AskPrice 3.930000e+04
BidPrice 3.929000e+04
AskVolume 3.130000e+02
BidVolume 3.130000e+02
Price 3.929000e+04
Volume 8.000000e+00
Turnover 3.155200e+04
VAP 3.929283e+04
Return 5.855534e-07
Name: 0.5, dtype: float64
The interquartile range (IR) is the difference between the 75% quantile and the 25% quantile, and it
is used to describe the variation of the data. Unlike the standard deviation or variance, extreme
values have no impact on the IR.
IR = rb.quantile(0.75) - rb.quantile(0.25)
print(IR)
AskPrice 360.000000
BidPrice 360.000000
AskVolume 360.000000
BidVolume 349.000000
Price 350.000000
Volume 34.000000
Turnover 133352.000000
VAP 352.470772
Return 0.000229
dtype: float64
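A small sketch of that robustness on made-up numbers: adding one extreme value leaves the interquartile range unchanged, while the standard deviation explodes.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
s_out = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0])  # one outlier added

# The 25% and 75% quantiles depend only on the middle of the data,
# so the IR is identical for both series.
ir = s.quantile(0.75) - s.quantile(0.25)
ir_out = s_out.quantile(0.75) - s_out.quantile(0.25)
print(ir, ir_out)            # 2.0 2.0
print(s.std(), s_out.std())  # the second std is blown up by the outlier
```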
The boxplot is an important visualization method for numerical data.
• In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting
groups of numerical data through their quartiles.
• Box plots may also have lines extending vertically from the boxes (whiskers), indicating
variability outside the upper and lower quartiles; hence they are also called box-and-whisker
plots or box-and-whisker diagrams.
rb['Return'].plot(kind='box')
plt.show()
9 Sampling Distribution
Part III
Prediction with Multiple Linear Regression
List of Figures
1 Maps of energy Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Growth of unstructured data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 High frequency data of stock price . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4 Matching Index of "Data Science" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5 Matching index of "deep learning" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
6 Daily close price of Facebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
7 Slow and fast signal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
8 Stock price and moving average processes. . . . . . . . . . . . . . . . . . . . . . . . . 20
9 Daily profit of signal-based strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
10 Accumulated profit or wealth process of signal-based strategy . . . . . . . . . . . . . 23
11 Improved profit with parameter-tuning . . . . . . . . . . . . . . . . . . . . . . . . . . 24
12 Performance of strategy in training data. . . . . . . . . . . . . . . . . . . . . . . . . . . 25
13 Performance of strategy in testing set. . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
14 Performance in train data with parameters tuned in train data. . . . . . . . . . . . . . 26
15 Performance in test data with parameters tuned in train data . . . . . . . . . . . . . . 27
16 Average daily profit with regression model . . . . . . . . . . . . . . . . . . . . . . . . 31
17 Accumulated profit with regression model. . . . . . . . . . . . . . . . . . . . . . . . . 32
18 A snapshot of jupyter notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
19 Population and sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
20 Parameter and statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
21 Reinforcing steel bar, rebar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
22 shells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
23 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
24 Demonstration of interquantile range for Return. . . . . . . . . . . . . . . . . . . . . . 55
25 Pie chart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
26 Bar chart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
27 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
28 Boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
29 Polygon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
30 Histogram with the output of frequency and bars . . . . . . . . . . . . . . . . . . . . 59