Q1]
a) Expected Counts
To find the expected count for each cell of the table, each entry is calculated using the formula:
$e_i = \frac{\text{rowTotal} \times \text{colTotal}}{\text{Total}}$

          25-34    35-44    45+
Female    23.625   32.625   7.125
Male      39.375   54.375   11.875

b) $\chi^2$ Statistic
To calculate the $\chi^2$ statistic, the following formula is used:
$\chi^2 = \sum_i \frac{(e_i - o_i)^2}{e_i}$
Hence we can compute the $\chi^2$ statistic:
$\chi^2 =$
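The 95% confidence interval for the proportion is
$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$, where $\hat{p} = \frac{\text{Total of Subject}}{\text{Total}}$.
$\left(0.375 - 1.96\sqrt{\frac{0.375(1 - 0.375)}{200}},\; 0.375 + 1.96\sqrt{\frac{0.375(1 - 0.375)}{200}}\right)$
$p \in (0.3079, 0.4421)$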
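The interval above can be checked with a few lines of Python. This is a minimal sketch: the function name `proportion_ci` is an illustrative choice, and the inputs 0.375 and 200 are the values used in the working above.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Values taken from the working above: p_hat = 0.375, n = 200.
lower, upper = proportion_ci(0.375, 200)
print(round(lower, 4), round(upper, 4))
```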
Q2]
a) Removing Stop-words, Removing Punctuations, Case-folding and Stemming
Document Number   Before                                 After
i                 Go dog, go!                            dog
ii                Stop cat, stop                         stop cat stop
iii               The dog stops the cat and the bird.    dog stop cat bird
b) Document-Term Index
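       dog   stop   cat   bird
i      1     0      0     0
ii     0     2      1     0
iii    1     1      1     1

Taking the inner product of each document vector with the query vector [0 1 1 0] for "stop cat":
i.   [1 0 0 0] · [0 1 1 0] = 0
ii.  [0 2 1 0] · [0 1 1 0] = 3
iii. [1 1 1 1] · [0 1 1 0] = 2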
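Now, to find the cosine similarity score, we need the length (norm) of the query vector for "stop cat":
$\lVert q \rVert = \sqrt{0^2 + 1^2 + 1^2 + 0^2}$
which gives the value $\sqrt{2}$.
With the same process as above, we need the length of each document vector:
i.   $\lVert d_{i} \rVert = \sqrt{1}$
ii.  $\lVert d_{ii} \rVert = \sqrt{5}$
iii. $\lVert d_{iii} \rVert = \sqrt{4}$
Dividing each inner product by the product of the two lengths gives the cosine similarity scores:
i.   $0 / (\sqrt{1}\sqrt{2}) = 0$
ii.  $3 / (\sqrt{5}\sqrt{2}) = 0.9487$
iii. $2 / (\sqrt{4}\sqrt{2}) = 0.7071$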
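The cosine scores can be reproduced with a short Python sketch. The vectors are the document-term rows and the "stop cat" query from the working above; the names `DOCS`, `QUERY` and `cosine` are illustrative, not part of the original answer.

```python
import math

# Term order: dog, stop, cat, bird (from the document-term index).
DOCS = {"i": [1, 0, 0, 0], "ii": [0, 2, 1, 0], "iii": [1, 1, 1, 1]}
QUERY = [0, 1, 1, 0]  # "stop cat"

def cosine(a, b):
    """Cosine similarity: inner product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)

scores = {name: cosine(vec, QUERY) for name, vec in DOCS.items()}
for name, score in scores.items():
    print(name, round(score, 4))
```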
d) TF-IDF Calculation
To find the TF-IDF value we need to use the formula:
$w_{d,t} = \log(f_{d,t} + 1)\,\log\left(\frac{N}{f_t}\right)$
Given that there are 3 documents, $N$ is 3, the term-frequency vector of document 3 is [1 1 1 1] and $f_t$ is [2 2 2 1].

Category   f_t   Working Out           Answer
Dog        2     $\log(2)\log(3/2)$    0.2810
Stop       2     $\log(2)\log(3/2)$    0.2810
Cat        2     $\log(2)\log(3/2)$    0.2810
Bird       1     $\log(2)\log(3/1)$    0.7615
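As a check on the table, the weights can be recomputed in Python. This sketch assumes natural logarithms, which is what reproduces 0.2810 and 0.7615; the function name `tf_idf` is illustrative.

```python
import math

def tf_idf(term_freq, n_docs, doc_freq):
    """w = log(f_dt + 1) * log(N / f_t), using natural logarithms."""
    return math.log(term_freq + 1) * math.log(n_docs / doc_freq)

N_DOCS = 3
DOC3_FREQ = {"dog": 1, "stop": 1, "cat": 1, "bird": 1}  # document iii
DOC_FREQ = {"dog": 2, "stop": 2, "cat": 2, "bird": 1}   # f_t

weights = {t: tf_idf(DOC3_FREQ[t], N_DOCS, DOC_FREQ[t]) for t in DOC3_FREQ}
for term, weight in weights.items():
    print(term, round(weight, 4))
```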
Q3]
a) Adjacency Matrix
     A   B   C   D
A    0   1   0   0
B    1   0   1   1
C    0   1   0   0
D    0   1   0   0
b) Graph Diameter
Path      Length
A -> B    1
A -> C    2
A -> D    2
B -> C    1
B -> D    1
C -> D    2
The diameter of a graph is the longest shortest path. Hence the diameter is 2.
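The "longest shortest path" definition can be sketched with a breadth-first search from every vertex. The edge list below is read off the adjacency matrix in part a); the names `GRAPH`, `shortest_path_lengths` and `diameter` are illustrative, and the sketch assumes a connected, unweighted graph.

```python
from collections import deque

# Neighbours read from the adjacency matrix in part a).
GRAPH = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B"], "D": ["B"]}

def shortest_path_lengths(graph, source):
    """Breadth-first search: number of edges from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def diameter(graph):
    """Largest shortest-path length over all pairs (graph assumed connected)."""
    return max(max(shortest_path_lengths(graph, v).values()) for v in graph)

print(diameter(GRAPH))
```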
c) Betweenness Centrality
Path        Central Node
A <-> B     0
A <-> C     B
A <-> D     B
B <-> C     0
B <-> D     0
C <-> D     B
Since B is the modal central node, lying on every shortest path that passes through an intermediate vertex, the vertex with the highest betweenness centrality is B.
d) Graph comparison.
From the graph, the degree distribution is as follows:
Connections   Frequency
[Figure: Degree Distribution (y-axis: Frequency)]
        PC2     PC3     PC4     PC5     PC6     PC7
S.D.    2.522   2.374   2.224   2.155   1.724   1.438
P.V.    0.156   0.138   0.121   0.114   0.073   0.051
C.P.    0.503   0.641   *       0.876   0.949   1.000
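Since the cumulative proportion (C.P.) is the running sum of the proportions of variance (P.V.), the value of the * at PC1 is 0.503 - 0.156 = 0.347 and the value of the * at PC4 is 0.641 + 0.121 = 0.762.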
b) Binary Metric
Words          Tweet 1 / Tweet 2
Remembering    1
Lou            1
Reed           1
Lifes          1
Work           1
Rock           1
Musician       1
Proved         0
Career         0
Mean           0
Striving       0
Publicity      0
To compute the binary metric, we take the count of unique words in each tweet over the total number of unique words in all tweets. Hence the binary metric is 10/13. If stemming were applied to the tweets, it would affect the result. For example, if the word "musician" were stemmed to "music", the number of common words would increase.
Q5]
The tweet counts are given below:

          Day 1   Day 2   Day 3   Day 4   Day 5   Day 6   Day 7
Week 1    36      49      57      74      74      54      61
Week 2    58      89      115     89      117     109     93
Week 3    98      145     140     140     156     115     124

Trends
          Day 1   Day 2   Day 3   Day 4   Day 5   Day 6   Day 7
Week 1                            7.56    7.79    8.14    8.59
Week 2    8.71    9.03    9.47    *       10.06   10.43   10.59
Week 3    10.93   11.17   11.21   11.42
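To compute the missing trend at Week 2 Day 4, we take the average of the square roots of the seven counts in the window centred on Week 2 Day 4, which is exactly Week 2.
Using this formula:
$\text{Trend} = \frac{\sum \sqrt{T_{d,w}}}{n}$
Hence, by using this, the missing value at Week 2 Day 4 can be found:
$\text{Trend} = \frac{\sqrt{58} + \sqrt{89} + \sqrt{115} + \sqrt{89} + \sqrt{117} + \sqrt{109} + \sqrt{93}}{7} \approx 9.73$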
Periodic
Day 1    Day 2   Day 3   Day 4   Day 5   Day 6   Day 7
-1.125   0.578   0.877   0.323   0.724   *       -0.925
To compute the missing periodic value at Day 6, we use the fact that the sum of all periodic values must equal zero (0).
Given that, the formula is:
$P_{\text{missing}} = -\sum_{p} T_p$
Hence, by using this, we can find the missing periodic value at Day 6:
$= -(-1.125 + 0.578 + 0.877 + 0.323 + 0.724 - 0.925) = -0.452$
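Both missing values can be recomputed from the tabulated data. The sketch below assumes the trend is the plain mean of the square-rooted counts over a seven-day window centred on the target day, and that the periodic values sum to zero; the names `sqrt_trend` and `missing_periodic` are illustrative.

```python
import math

# Week 2 counts, Day 1 to Day 7 (the window centred on Week 2 Day 4).
WEEK2 = [58, 89, 115, 89, 117, 109, 93]

def sqrt_trend(window):
    """Mean of the square-rooted counts in the window."""
    return sum(math.sqrt(count) for count in window) / len(window)

# Known periodic values for Days 1-5 and Day 7.
PERIODIC_KNOWN = [-1.125, 0.578, 0.877, 0.323, 0.724, -0.925]

def missing_periodic(known):
    """The periodic component sums to zero, so the missing value is minus the rest."""
    return -sum(known)

trend_w2_d4 = sqrt_trend(WEEK2)
periodic_d6 = missing_periodic(PERIODIC_KNOWN)
print(round(trend_w2_d4, 2), round(periodic_d6, 3))
```

Applying `sqrt_trend` to the seven Week 1 counts gives about 7.56, which agrees with the tabulated trend at Week 1 Day 4.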
a) Explanation
A square root transformation is advisable for count data because count data is most likely to be Poisson distributed. The problem with the Poisson distribution is that its variance equals its mean. This means that if you take a sample with a high mean and another sample with a low mean, their variances will be different. Hypothesis tests that compare means assume the samples have the same variance, so a variance that changes with the mean distorts the test statistic and makes the test unreliable.
Hence, if we take the square root of the count data, it stabilizes the variance.
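The stabilizing effect can be illustrated numerically. The sketch below computes the variance of a Poisson count and of its square root directly from the probability mass function (no simulation); the means 50 and 200 and the helper names are illustrative choices, not values from the question.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability P(X = k), computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def variance_of(transform, lam):
    """Variance of transform(X) for X ~ Poisson(lam), summing over the pmf."""
    upper = int(lam + 20 * math.sqrt(lam) + 50)  # far enough into the tail
    mean = sum(transform(k) * poisson_pmf(k, lam) for k in range(upper))
    mean_sq = sum(transform(k) ** 2 * poisson_pmf(k, lam) for k in range(upper))
    return mean_sq - mean ** 2

for lam in (50, 200):
    raw = variance_of(float, lam)       # grows with the mean
    rooted = variance_of(math.sqrt, lam)  # stays close to 1/4
    print(lam, round(raw, 3), round(rooted, 3))
```

The raw variance tracks the mean, while the variance of the square-rooted counts stays near one quarter for both means.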
b) Sum of Squares Interaction Calculation
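Given the values:

          Company   Competitor
Before    54.91     49.87
After     60.20     50.15

The interaction contrast combines the four cell values with alternating signs:

          Competitor (C)   Company (I)
Before    +                -
After     -                +

To figure out the Sum of Squares Interaction, this formula is needed: $SS_I = \frac{\left(\sum \pm x_{I,C}\right)^2}{4n}$.
With $n = 4$, $SS_I = \frac{(49.87 - 50.15 - 54.91 + 60.20)^2}{16} = 1.5688$.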
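The interaction sum of squares can be recomputed from the four cell values. This sketch assumes n = 4 replicates per cell, which is inferred from the 12 error degrees of freedom, 4(n - 1), quoted in the answer; the function and key names are illustrative.

```python
CELLS = {
    ("company", "before"): 54.91,
    ("company", "after"): 60.20,
    ("competitor", "before"): 49.87,
    ("competitor", "after"): 50.15,
}

def interaction_ss(cells, n):
    """Squared interaction contrast of the four cell values, divided by 4n."""
    contrast = (
        cells[("company", "after")]
        - cells[("company", "before")]
        - cells[("competitor", "after")]
        + cells[("competitor", "before")]
    )
    return contrast ** 2 / (4 * n)

# n = 4 is an assumption inferred from 4(n - 1) = 12 degrees of freedom.
ss_i = interaction_ss(CELLS, n=4)
print(round(ss_i, 4))
```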
SS E =1.579
Where
SE =
SS E
4 (n1)
=0.313125
SS
SS
SE =0.132
also the
= 4 nI
SS
Therefore the F statistic is 2.3797, with 1 and 12 degrees of freedom.
Q7]
a) Sum of Squares Between Cluster Calculation
Given that the center point for the clusters is $\mu = [0\ 0]$ and $k = 2$:

          x2        Cluster
[1, ]     1.5777    2
[2, ]     2.9893    2
[3, ]     0.5736    2
[4, ]     2.1324    2
[5, ]     1.0887    2
[6, ]     -1.2161   1
[7, ]     -3.8206   1
[8, ]     -1.6509   1
[9, ]     -0.6621   1
[10, ]    -1.0120   1

k      1     2    3       4       5       6
SSB    0.0   *    315.2   318.3   319.6   320.9
If the cluster centers are not given, they are calculated as the mean of the data points in each cluster:
$CC_c = \frac{1}{n_c}\sum_{i \in c} x_i$
Hence, by using the formula above, the cluster centers are as shown below:

Cluster   x1       x2
1         5.289    -1.672
2         -5.289   1.672

The Sum of Squares Between Clusters is given by the cluster distance matrix applied to the squared cluster center values:
$SSB = \sum_{c} n_c \lVert CC_c - \mu \rVert^2$, where $\mu = [0\ 0]$ is the overall center point.
When figuring out the cluster distance matrix, we need to look at the number of data points in each cluster. In this case Cluster 1 and Cluster 2 both have 5 data points. Hence the cluster distance matrix would be:
$\begin{bmatrix} 5 & 0 \\ 0 & 5 \end{bmatrix}$
$SSB = 5(5.289^2 + 1.672^2) + 5(5.289^2 + 1.672^2) = 307.7$

k      1     2       3       4       5       6
SSB    0.0   307.7   315.2   318.3   319.6   320.9
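The between-cluster sum of squares can be verified from the cluster centers and sizes. This is a minimal sketch using the centers, sizes and overall center [0 0] from the working above; the names `CENTERS`, `SIZES` and `between_ss` are illustrative.

```python
# Cluster centers (x1, x2) and cluster sizes from the working above.
CENTERS = {1: (5.289, -1.672), 2: (-5.289, 1.672)}
SIZES = {1: 5, 2: 5}
OVERALL_CENTER = (0.0, 0.0)

def between_ss(centers, sizes, overall):
    """Sum over clusters of size times squared distance from the overall center."""
    total = 0.0
    for cluster, center in centers.items():
        squared_distance = sum((c - m) ** 2 for c, m in zip(center, overall))
        total += sizes[cluster] * squared_distance
    return total

ssb = between_ss(CENTERS, SIZES, OVERALL_CENTER)
print(round(ssb, 1))
```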
      [ ,1]   [ ,2]   [ ,3]   [ ,4]   [ ,5]
A     0       1/3     1/3     1       0
B     1/3     0       1/3     0       1/2
C     1/3     1/3     0       0       1/2
D     1/3     0       0       0       0
E     0       1/3     1/3     0       0
b) Graph Explanation
The graph has no arrows on its edges, which indicates that it is an undirected graph, and no vertex is cut off from the rest: there is a path from every vertex to every other vertex. A random walk can therefore move around the network indefinitely without being confined to any one vertex, so the graph is ergodic.
c) State Distribution and Random Walks of 2 Steps
The probability transition matrix from part a) is denoted T. Therefore T is:

      [ ,1]   [ ,2]   [ ,3]   [ ,4]   [ ,5]
A     0       1/3     1/3     1       0
B     1/3     0       1/3     0       1/2
C     1/3     1/3     0       0       1/2
D     1/3     0       0       0       0
E     0       1/3     1/3     0       0

And since we begin at vertex A, the initial state distribution would be:
$\pi_0 = [\,1\ 0\ 0\ 0\ 0\,]$
The first step of the random walk follows as:
$\pi_1 = \pi_0 T$
Hence, by using this formula, the first step of the random walk is obtained by matrix multiplication, with the matrix written so that each row holds the probabilities of leaving one vertex (A to E):
$\pi_1 = [\,1\ 0\ 0\ 0\ 0\,] \begin{bmatrix} 0 & 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 0 & 1/3 & 0 & 1/3 \\ 1/3 & 1/3 & 0 & 0 & 1/3 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1/2 & 1/2 & 0 & 0 \end{bmatrix}$
Therefore the state distribution after the first step is:
$\pi_1 = [\,0\ \ 1/3\ \ 1/3\ \ 1/3\ \ 0\,]$
Next, the second step of the random walk is found with the same process as above, starting from $\pi_1 = [\,0\ \ 1/3\ \ 1/3\ \ 1/3\ \ 0\,]$:
$\pi_2 = \pi_1 T$
Hence, by using this formula, the second step of the random walk is obtained by matrix multiplication, with each row of the matrix holding the probabilities of leaving one vertex (A to E):
$\pi_2 = [\,0\ \ 1/3\ \ 1/3\ \ 1/3\ \ 0\,] \begin{bmatrix} 0 & 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 0 & 1/3 & 0 & 1/3 \\ 1/3 & 1/3 & 0 & 0 & 1/3 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1/2 & 1/2 & 0 & 0 \end{bmatrix}$
Therefore the state distribution after the second step is:
$\pi_2 = [\,5/9\ \ 1/9\ \ 1/9\ \ 0\ \ 2/9\,]$
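The two steps of the walk can be reproduced exactly with Python's `fractions` module. This sketch writes the transition probabilities with one row per current vertex (A to E), as in the multiplication above; `T_ROWS` and `step` are illustrative names.

```python
from fractions import Fraction as F

# One row per current vertex A..E; entries are probabilities of moving to A..E.
T_ROWS = [
    [F(0), F(1, 3), F(1, 3), F(1, 3), F(0)],     # from A
    [F(1, 3), F(0), F(1, 3), F(0), F(1, 3)],     # from B
    [F(1, 3), F(1, 3), F(0), F(0), F(1, 3)],     # from C
    [F(1), F(0), F(0), F(0), F(0)],              # from D
    [F(0), F(1, 2), F(1, 2), F(0), F(0)],        # from E
]

def step(dist, rows):
    """One random-walk step: row vector times the row-form transition matrix."""
    size = len(rows)
    return [sum(dist[i] * rows[i][j] for i in range(size)) for j in range(size)]

pi_0 = [F(1), F(0), F(0), F(0), F(0)]  # start at vertex A
pi_1 = step(pi_0, T_ROWS)
pi_2 = step(pi_1, T_ROWS)
print([str(p) for p in pi_1])
print([str(p) for p in pi_2])
```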
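$T = \begin{bmatrix} 0 & 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 0 & 1/3 & 0 & 1/3 \\ 1/3 & 1/3 & 0 & 0 & 1/3 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1/2 & 1/2 & 0 & 0 \end{bmatrix}$
Now, to set up the first equation of this question, matrix multiplication is used to show it as:
$\pi = [\,\pi_1\ \pi_2\ \pi_3\ \pi_4\ \pi_5\,] \begin{bmatrix} 0 & 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 0 & 1/3 & 0 & 1/3 \\ 1/3 & 1/3 & 0 & 0 & 1/3 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1/2 & 1/2 & 0 & 0 \end{bmatrix}$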
         13-17   18-24   25-34   35-44   45-54   55-65   65+   Total
F        5       9       8       10      6       2       4     44
M        8       20      22      21      6       7       12    96
Total    13      29      30      31      12      9       16    140
One problem with using the $\chi^2$ (chi-squared) test with this data is that some of the expected values are less than 5. The test is only reliable if the expected values are all bigger than about 5.
b) Reduced Table
         13-24   25-34   35-44   45+   Total
F        14      8       10      12    44
M        28      22      21      25    96
Total    42      30      31      37    140
c) Expected Counts
To find the expected count for each cell of the table, each entry is calculated using the formula:
$e_i = \frac{\text{rowTotal} \times \text{colTotal}}{\text{Total}}$

          13-24   25-34     35-44     45+
Female    13.2    9.4286    9.7429    11.6286
Male      28.8    20.5714   21.2571   25.3714

d) $\chi^2$ Statistic
To calculate the $\chi^2$ statistic, the following formula is used:
$\chi^2 = \sum_i \frac{(e_i - o_i)^2}{e_i}$
Hence we can compute the $\chi^2$ statistic:
$\chi^2 =$
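The expected counts and the statistic can be computed from the reduced table in part b). The sketch below is an independent calculation rather than the value from the original working, which is not shown; the names `OBSERVED`, `expected_counts` and `chi_squared` are illustrative.

```python
# Observed counts from the reduced table: age groups 13-24, 25-34, 35-44, 45+.
OBSERVED = {"F": [14, 8, 10, 12], "M": [28, 22, 21, 25]}

def expected_counts(observed):
    """Expected count per cell: row total times column total over grand total."""
    rows = list(observed)
    col_totals = [sum(observed[r][j] for r in rows) for j in range(len(observed[rows[0]]))]
    grand_total = sum(col_totals)
    return {
        r: [sum(observed[r]) * col / grand_total for col in col_totals]
        for r in rows
    }

def chi_squared(observed, expected):
    """Sum of (expected - observed)^2 / expected over every cell."""
    return sum(
        (e - o) ** 2 / e
        for r in observed
        for o, e in zip(observed[r], expected[r])
    )

expected = expected_counts(OBSERVED)
chi2 = chi_squared(OBSERVED, expected)
print([round(e, 4) for e in expected["F"]])
print(round(chi2, 4))
```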
     B   C   D
A    1   1   1
B    0   0   1
C    0   0   0
D    1   0   0
b) Degree of Distribution
From the graph, the degree distribution is as follows:
Connections   Frequency
[Figure: Degree Distribution (y-axis: Frequency)]
c) Closeness Centrality
To work out the closeness centrality of each vertex, we take the smallest total number of steps needed to reach every other vertex in the network.
d) Central Decision
The most central vertex of the network can be read from the table in part c): it is the vertex with the lowest total.
In this case, the answer is "A".
Q3]
a) Missing Values
        PC1     PC2     PC3     PC4     PC5     PC6     PC7
S.D.    3.923   3.069   2.867   2.579   2.038   1.887   1.125
P.V.    0.316   0.194   0.169   0.137   0.085   0.073   0.026
C.P.    *       0.510   0.679   *       0.901   0.974   1.000
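Since the cumulative proportion (C.P.) is the running sum of the proportions of variance (P.V.), the value of the * at PC1 is its own P.V., 0.316, and the value of the * at PC4 is 0.679 + 0.137 = 0.816.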
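The missing cumulative proportions are running sums of the P.V. row, which can be checked with `itertools.accumulate`. The P.V. values are taken from the table above; the variable names are illustrative.

```python
from itertools import accumulate

# Proportion of variance (P.V.) for PC1..PC7, from the table above.
PROPORTION_OF_VARIANCE = [0.316, 0.194, 0.169, 0.137, 0.085, 0.073, 0.026]

# Cumulative proportion (C.P.) is the running sum of P.V.
cumulative = [round(total, 3) for total in accumulate(PROPORTION_OF_VARIANCE)]
print(cumulative)
```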
b) Binary Metric
Words            Tweet 1 / Tweet 2
assault          1
assistance       1
disadvantaged    1
university       1
students         1
begins           1
believe          0
more             0
doing            0
better           0
To compute the binary metric, we take the count of unique words in each tweet over the total number of unique words in all tweets. Hence the binary metric is 8/10.
Q4]
a) Sum of Squares Within Cluster Calculation
Given that the center point for the clusters is $\mu = [0\ 0]$ and $k = 2$:

          x1        x2       Cluster
[1, ]     -1.7016   -3.522   1
[2, ]     -1.9107   -2.111   1
[3, ]     -0.7456   -4.526   1
[4, ]     -0.5877   -2.968   1
[5, ]     -3.4993   2.989    1
[6, ]     -2.5633   3.684    1
[7, ]     -4.6964   1.079    1
[8, ]     -2.9502   3.249    1
[9, ]     8.9087    1.238    2
[10, ]    9         0.888    2

k      2    3        4       5       6
SSW    *    11.381   5.511   3.076   1.875
If the cluster centers are not given, they are calculated as the mean of the data points in each cluster:
$CC_c = \frac{1}{n_c}\sum_{i \in c} x_i$
Hence, by using the formula above, the cluster centers are as shown below:

Cluster   x1       x2
1         -2.332   -0.2657
2         9.327    1.0629

The Sum of Squares Between Clusters is given by the cluster distance matrix applied to the squared cluster center values:
$SSB = \sum_{c} n_c \lVert CC_c - \mu \rVert^2$, where $\mu = [0\ 0]$ is the overall center point.
When figuring out the cluster distance matrix, we need to look at the number of data points in each cluster. In this case Cluster 1 has 8 data points and Cluster 2 has 2 data points. Hence the cluster distance matrix would be:
$\begin{bmatrix} 8 & 0 \\ 0 & 2 \end{bmatrix}$
Subtracting SSB from the total sum of squares (the SSW for a single cluster, 314.077) gives the SSW for two clusters, 93.75.

k      1         2       3        4       5       6
SSW    314.077   93.75   11.381   5.511   3.076   1.875
      [ ,1]   [ ,2]   [ ,3]   [ ,4]   [ ,5]
A     0       1       1/2     1/2     1/2
B     1       0       1/2     0       0
C     0       0       0       0       0
D     0       0       0       0       1/2
E     0       0       0       1/2     0
b) Graph Explanation
Since the graph is directed, it is not guaranteed to be ergodic, so reachability between the vertices has to be checked. On close observation of the graph: vertex "A" is unable to travel to "C", "D" and "E"; vertex "B" is unable to travel to "C", "D" and "E"; vertex "C" is unable to travel to "D" and "E"; vertex "D" is unable to travel to "C"; and vertex "E" is unable to travel to "C". Therefore the graph is deemed non-ergodic, as none of the vertices "A", "B", "D" or "E" is able to travel to vertex "C".
c) Random Surfer Probability Transition Matrix
$T = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 0 & 0 & 1/2 \\ 1/2 & 0 & 0 & 1/2 & 0 \end{bmatrix}$
To build the random surfer probability transition matrix, the jump matrix $J$ is needed, and $\alpha = 0.8$ is needed as well.
$J = \begin{bmatrix} 1/5 & 1/5 & 1/5 & 1/5 & 1/5 \\ 1/5 & 1/5 & 1/5 & 1/5 & 1/5 \\ 1/5 & 1/5 & 1/5 & 1/5 & 1/5 \\ 1/5 & 1/5 & 1/5 & 1/5 & 1/5 \\ 1/5 & 1/5 & 1/5 & 1/5 & 1/5 \end{bmatrix}$
To form the random surfer probability transition matrix, the following formula is applied:
$T' = \alpha T + (1 - \alpha) J$
Therefore the random surfer probability matrix will be:
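$T' = \begin{bmatrix} 0.04 & 0.84 & 0.04 & 0.04 & 0.04 \\ 0.84 & 0.04 & 0.04 & 0.04 & 0.04 \\ 0.44 & 0.44 & 0.04 & 0.04 & 0.04 \\ 0.44 & 0.04 & 0.04 & 0.04 & 0.44 \\ 0.44 & 0.04 & 0.04 & 0.44 & 0.04 \end{bmatrix}$
Given the value of $T'$ above, and $\pi$ given by $[\,0.49\ 0.34\ 0.10\ 0.04\ 0.03\,]$, the product is:
$\pi T' = [\,0.49\ 0.34\ 0.10\ 0.04\ 0.03\,] \begin{bmatrix} 0.04 & 0.84 & 0.04 & 0.04 & 0.04 \\ 0.84 & 0.04 & 0.04 & 0.04 & 0.04 \\ 0.44 & 0.44 & 0.04 & 0.04 & 0.04 \\ 0.44 & 0.04 & 0.04 & 0.04 & 0.44 \\ 0.44 & 0.04 & 0.04 & 0.44 & 0.04 \end{bmatrix}$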
The question proposes $\pi = [\,0.49\ 0.34\ 0.10\ 0.04\ 0.03\,]$ as the stationary distribution of the random surfer transition matrix. Carrying out the multiplication gives $[\,0.38\ 0.472\ 0.04\ 0.052\ 0.056\,]$. A stationary distribution must be left unchanged by the transition matrix, but every value differs between the two vectors. Hence the proposed vector is not the stationary distribution of the random surfer probability transition matrix.
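The check can be sketched in Python. The transition probabilities are written with one row per current vertex (A to E), which is an assumption about layout that reproduces the calculated answer above; `random_surfer` and `step` are illustrative names.

```python
ALPHA = 0.8

# One row per current vertex A..E; entries are probabilities of moving to A..E.
T_ROWS = [
    [0.0, 1.0, 0.0, 0.0, 0.0],   # from A
    [1.0, 0.0, 0.0, 0.0, 0.0],   # from B
    [0.5, 0.5, 0.0, 0.0, 0.0],   # from C
    [0.5, 0.0, 0.0, 0.0, 0.5],   # from D
    [0.5, 0.0, 0.0, 0.5, 0.0],   # from E
]

def random_surfer(rows, alpha):
    """alpha * T + (1 - alpha) * J, where J has every entry equal to 1/n."""
    size = len(rows)
    return [[alpha * p + (1 - alpha) / size for p in row] for row in rows]

def step(dist, rows):
    """Row vector times the row-form transition matrix."""
    size = len(rows)
    return [sum(dist[i] * rows[i][j] for i in range(size)) for j in range(size)]

surfer = random_surfer(T_ROWS, ALPHA)
proposed = [0.49, 0.34, 0.10, 0.04, 0.03]
after_one_step = step(proposed, surfer)

# A stationary distribution would be unchanged by one step.
is_stationary = all(abs(a - b) < 1e-9 for a, b in zip(proposed, after_one_step))
print([round(p, 3) for p in after_one_step], is_stationary)
```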
Q6]
a) Computing Trends
Given that the counts are aggregated in 4 periods per day, it is safe to assume that the trend is computed over windows of that many points.
We are given:

Trend
            Day 1   Day 2   Day 3
Period 1            7.65    8.42
Period 2            7.87    8.68
Period 3    6.78    8.08
Period 4    *       8.17
To calculate the missing trend with window sizes, the following formula is used, letting $n$ = window:
$T_{P,D:n} + T_{P,D:1} + \left(\sum T_{P,D:[n-1:2]}\right)$
Hence, by using this, the missing value at Day 1 Period 4 can be found:
35 + 69 + 63 + 47 + 55 =
Periodic
Period 1   Period 2   Period 3   Period 4
-0.610     0.235      0.541      *
To compute the missing periodic value at Period 4, we use the fact that the sum of all periodic values must equal zero (0).
Given that, the formula is:
$P_{\text{missing}} = -\sum_{p} T_p$
Hence, by using this, we can find the missing periodic value at Period 4:
$= -(-0.610 + 0.235 + 0.541) = -0.166$
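          Company   Competitor
Before    29.27     24.86
After     33.98     23.92

The interaction contrast combines the four cell values with alternating signs:

          Competitor (C)   Company (I)
Before    +                -
After     -                +

To figure out the Sum of Squares Interaction, this formula is needed: $SS_I = \frac{\left(\sum \pm x_{I,C}\right)^2}{4n}$.
$SS_I = 2.667$.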
SS E =1.579
SS
SS E
SE =
4 (n1)
=0.222249
SS
SE =0.151667
also the
= 4 nI
SS
Therefore the F statistic is 1.46538, with 1 and 8 degrees of freedom.
Q8]
a) Word Distribution
Given from the tweets shown:
Positive
My teeth shine #funfun
#funfun love my fun teeth
#funfun is fun fun
Negative
No shine #funfun
No love fun fun
Where is my teeth shine #funfun
Now to tabulate the words (the number of tweets in each class that contain the word):

            #funfun   fun   is   love   my   no   shine   teeth   where
Positive    3         2     1    1      2    0    1       2       0
Negative    2         1     1    1      1    2    2       1       1
b) Word Sentiment
The word probabilities for the tweet "fun teeth shine" are shown as:

            ~#funfun   fun   ~is   ~love   ~my   ~no   shine   teeth   ~where
Positive    0/3        2/3   2/3   2/3     1/3   3/3   1/3     2/3     3/3
Negative    1/3        1/3   2/3   2/3     2/3   1/3   2/3     1/3     2/3

NOTE: WE ARE ONLY ACCOUNTING FOR "FUN", "TEETH" AND "SHINE" TO BE PRESENT, WHEREAS THE REST (MARKED "~") ARE ABSENT PROBABILITY VALUES.
Now to apply the Rule of Succession:

            ~#funfun   fun   ~is   ~love   ~my   ~no   shine   teeth   ~where
Positive    1/5        2/3   2/3   2/3     1/3   4/5   1/3     2/3     4/5
Negative    1/3        1/3   2/3   2/3     2/3   1/3   2/3     1/3     2/3

NOTE: THE RULE OF SUCCESSION ONLY APPLIES TO VALUES OF "1" AND "0", HENCE THE CHANGE OF VALUES AT "~#funfun", "~no" AND "~where" OF POSITIVE.
To determine the probability ratio, the following formula is applied:
$\text{RATIO} = \frac{\text{POSITIVE}}{\text{NEGATIVE}}$

         ~#funfun   fun   ~is   ~love   ~my   ~no   shine   teeth   ~where
Ratio    0.6        2.0   1.0   1.0     0.5   2.4   0.5     2.0     1.2
Now to find the log probability ratio of the values, which is done by:
$\log \text{RATIO}$
Hence the values are:

             ~#funfun   fun      ~is   ~love   ~my       ~no      shine     teeth    ~where
Log ratio    -0.5108    0.6931   0     0       -0.6931   0.8755   -0.6931   0.6931   0.1823
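The whole chain (presence or absence probabilities, rule of succession, ratio, log ratio) can be sketched in Python. The word counts are tallied from the six tweets above, and the rule of succession is applied only to probabilities of exactly 0 or 1, following the convention stated in the note; the function names and the summed score are illustrative additions, not part of the original working.

```python
import math
from fractions import Fraction

N_TWEETS = 3  # tweets per class

# Number of tweets in each class containing the word (tallied from the tweets).
POSITIVE = {"#funfun": 3, "fun": 2, "is": 1, "love": 1, "my": 2,
            "no": 0, "shine": 1, "teeth": 2, "where": 0}
NEGATIVE = {"#funfun": 2, "fun": 1, "is": 1, "love": 1, "my": 1,
            "no": 2, "shine": 2, "teeth": 1, "where": 1}

def word_probability(count, present):
    """P(word present) or P(word absent), smoothed only when it is exactly 0 or 1."""
    hits = count if present else N_TWEETS - count
    prob = Fraction(hits, N_TWEETS)
    if prob in (0, 1):
        prob = Fraction(hits + 1, N_TWEETS + 2)  # rule of succession
    return prob

def log_ratios(tweet_words):
    """Log of positive over negative probability for every vocabulary word."""
    result = {}
    for word in POSITIVE:
        present = word in tweet_words
        ratio = word_probability(POSITIVE[word], present) / word_probability(NEGATIVE[word], present)
        result[word] = math.log(ratio)
    return result

ratios = log_ratios({"fun", "teeth", "shine"})
total = sum(ratios.values())
for word, value in ratios.items():
    print(word, round(value, 4))
print("sum", round(total, 4))
```

Under equal class priors, a positive sum of log ratios favours the positive class for this tweet.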