Student ID:
Supervisor:
NB. If it is suspected that your assignment contains the work of others falsely represented as your own, it
will be referred to the College's Disciplinary Committee. Should the Committee be satisfied that plagiarism
has occurred, this is likely to lead to your failing the module and possibly to your being suspended or
expelled from college.
Complete the sections above and attach this sheet to the front of one of the copies of your assignment.
Alastair Macnair
x13129325
alastair.macnair@ncirl.ie
Table of Contents
Executive Summary
1 Introduction
1.1 Background
1.2 Aims
1.3 Solution Overview
1.4 Structure
Introduction
2.4 Conclusion
3.2 Datasets
3.3.1 Introduction
3.5 Mashup
Introduction
4.2.1 Introduction
4.3 Altitude Effects
5 Conclusions
5.1.1 Introduction
5.1.3 Conclusion
7 References
8 Appendix
8.2 Geography
8.3.1 Football Data
Executive Summary
This study investigates the relationship between football match goal
outcome and weather effects within Germany's top two football leagues,
Bundesliga 1 and 2. Twenty-one seasons of historic football results for both
Bundesliga tiers, from 1993 to 2013, are linked to weather observation data by
matching each stadium where matches were played with the nearest of the 80+
weather stations around Germany. Five weather variables are used in the study:
average daily temperature, rainfall, cloud cover, wind speed and humidity.
These are investigated against four Goal Outcome measures: Total Goals, Goal
Difference, Under/Over 2.5 and Home/Away/Draw results. The open source R Studio
software is used to determine whether any relationship between them exists.
The study finds that temperature has a small but measurable effect on total goals
scored, with goal averages falling very slightly as temperature decreases,
predominantly during the winter season. Humidity and cloud cover have no
measurable effect, with goals scored being consistent across the entire range.
Wind and rain also seem to have no obvious trend except at the extreme ranges,
where they interfere with all goal outcome variables. However, this
interference effect has no obvious pattern, is generally unpredictable and
is based on a very low sample size, given the inherently rare nature of such events,
so the results should be treated cautiously. Overall there is no clear
relationship or correlation between weather and goal outcome, and the study raises
the question of how well the influence of such factors on sport is really understood.
1 Introduction
1.1 Background
The sports industry within the European Union (EU) is estimated to contribute €294
billion to the economy, based on a recent study by the European Commission
(2012), and supports 2% of all EU jobs, or around 4.46 million employees. Germany
provides the highest number of these jobs at 1.15 million, or almost 27% of all the
sports related jobs within the EU. The Bundesliga, Germany's top football league,
is one of the 'big five' within Europe, alongside the top leagues of England, Spain,
Italy and France. In 2013/14 the combined revenue of these top five leagues grew by 5% to
€9.8 billion, representing almost half of the entire European football market, which
was valued at €19.9 billion in the same period (Deloitte, 2014). The German
Bundesliga is characterised by strong cost control, with the lowest wage to revenue
ratio at 51%, which resulted in it being one of only two leagues in Europe to
generate an operating profit (€264m) for the sixth successive year in the 2013/14
season. In this context football should be seen not solely as a hobby or weekly
television event but as an increasingly important part of Europe's economy and a major
provider of employment within the European Union.
In 2012 Germany passed new laws to promote liberalisation of its gambling and
betting market. Marketing company MECN (2012) estimates that the German
sports betting market will grow to around 1.5 billion in 2015 alone. This is part of
a global sports market which is estimated to be valued at around 733 billion (BBC,
2014), with over 70% of that total being derived from football matches. Betting on
football matches became popular within the UK around the early 1920s with the
creation of the Football Pools (2014), one of the oldest gaming companies in the
world, which allowed fans to predict matches and win money if they proved to be correct.
Given the value of this market there is a critical need for gaming companies to
ensure they fully understand the products they are providing, and all of the
variables that affect the odds of the betting instruments they provide to customers,
as mistakes could be costly.
Finally, trainers, teams and players are always seeking to gain competitive
advantage to ensure continuing or future success. Statistical information
and data analytics are becoming increasingly important within the football and sports
sector as these groups seek to become both smarter and more efficient (Lewis,
2014). The majority of this analysis currently focuses on the players and there
has been very little analysis of the influence of external factors such as weather.
There are claims that weather, in particular temperature, could be a
factor in the outcome of football matches (British Weather Services, 2014).
1.2 Aims
Goal outcome within this study is defined by four commonly used betting
instruments (PKR, 2014) as a means to assess and compare the data, identify any
relationships, and if viable provide a basis to make predictions. (See Appendix 8.1
for definitions.)
Secondary objectives will consider the differences, if any, between the first and
second leagues and also if a particular geographical location or time of year affects
the goal outcome.
Is there any relationship between weather effects and goal outcome for football
matches played within the Bundesliga 1 & 2?
From this the null hypothesis (H0) and the alternative hypothesis to be tested (H1) are
established:
H0: There is no relationship between goal outcome in football matches and daily
corresponding average values of temperature, wind speed, precipitation, humidity
and cloud cover.
H1: There is a relationship between goal outcome in football matches and daily
corresponding average values of temperature, wind speed, precipitation, humidity
and cloud cover.
H1 is non-directional and therefore a two-tailed test will be applied
where appropriate, with a significance level of 5%.
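As an illustration of what a two-tailed test at the 5% level involves, the sketch below tests a Pearson correlation between temperature and total goals using the Fisher z approximation. It is written in Python for illustration only (the study itself uses R Studio), the sample values are invented, and the choice of test is an assumption rather than the method used in the study.

```python
import math
from statistics import NormalDist, mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def two_tailed_p(r, n):
    """Approximate two-tailed p-value for H0: no correlation.

    Uses the Fisher z transformation; requires n > 3 and |r| < 1.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Invented example values: daily mean temperature (deg C) and total goals.
temperature = [-2.0, 1.5, 4.0, 8.5, 12.0, 15.5, 19.0, 23.0]
total_goals = [2, 3, 1, 4, 2, 3, 3, 2]

r = pearson_r(temperature, total_goals)
p = two_tailed_p(r, len(temperature))
reject_h0 = p < 0.05  # 5% significance level, both tails considered
```

Because the test is two-tailed, a strong negative correlation leads to rejection of H0 just as a strong positive one does.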
1.3 Solution Overview
The project will primarily use the open source R Studio (2014) statistical
programming package to import, clean, merge, feature engineer and finally
analyse the data, and to provide all graphical and statistical outputs required
throughout the study. This will be supported by programs such as Microsoft Excel,
Notepad and PeaZip (for extracting files) to ensure that all parts of the data can be
accessed, checked and manipulated as required at all stages of the project. The
R Studio platform will be used to handle the 200+ raw data files, remove unwanted
information and ultimately merge them into a single coherent data frame where each
match played is linked with its corresponding weather observation data across the
21 years of play. The stadium information will need to be manually extracted and
created as a fourth data set to link the weather stations and match results.
The study will contribute to the overall body of knowledge of statistical analysis in
sport and also develop the understanding of the effects of weather on sport and
football in particular. The study will also help progress understanding on the issues
of handling and analysing large data sets.
1.4 Structure
The study is structured as follows:
Section 1: A background to the study, why the study is important and what
benefits it will potentially realise through its undertaking
Section 2: A literature review to assess the existing body of knowledge in the
areas of weather effects on sports performance, sports statistics, stadium design
issues and data mining and analysis techniques
Section 3: The system architecture and datasets utilised for the study and a
process flow description of how the various data sets were combined, with
key areas of interest detailed
Section 4: Testing, evaluation and primary analysis of the data sets
Section 5: Study conclusion and recommendations
Section 6: Areas where the study could benefit from further future development
Section 7: Supporting references
Section 8: Appendices. All supporting information such as tables and graphs and
preliminary or supporting documents such as management progress reports
Introduction
The primary aim of this literature review is to examine all relevant and recent research
relating to the effects of weather on sport, in particular football, and its
impact on achieved or measured performance levels. To support this objective,
research has also been conducted into climatic and meteorological weather
effects and types of measurement index. Additionally, statistical and data analysis
approaches in sports, together with general analytical and relevant data mining
techniques, have been investigated.
A range of sources was used, including Google Scholar, Google Books, IEEE,
ACE, Directory of Open Access Journals and Wiley, as well as books, articles and
websites. The primary search terms used were (but not limited to):
Football, Soccer, Bundesliga, Weather, Sport, Rain, Wind, Temperature, Humidity,
Okta, Germany, Stadium, Environment, Performance, Statistics, data mining and
data analysis. Specific date parameters were not incorporated into the search,
although a number of key relevant papers that were found were from as early as
1970. Papers that did focus on weather effects in sports were both quantitative and
qualitative. Overall there is a reasonably good level of research in data analytics
on sports and weather effects on sport, with recent publications in 2014 indicating
a strong ongoing interest in this area.
2.2
fans alike (Anderson and Sally, 2013). Tradition exerts such a strong social
influence that knowledge can remain static, being rarely challenged or questioned,
with the suggestion that football as a game is full of unexamined clichés and myths
which have never been tested against real life data (Kuper and Szymanski, 2012).
to draw conclusions on the effect of cold on score outcome suggesting that it was
in fact highly correlated. Hamilton (2014) developed this for the same data by using
a logistic regression to better explore whether any relationship actually exists between
the match outcome and weather, and found that there was almost no change in the
Under/Over 2.5 goals probability due to temperature, but noted that the
study only used 16 games on a single day.
Analytics company Kickdex (2014) again adopted a simpler approach, using the
Goal Difference metric against totals of whether the favorite did or did not win in
rainy conditions for any given Goal Difference score. Gelade and Dobson (2007)
used linear regression modelling and factor analysis in their investigation of
worldwide football team performance. O'Donoghue and Holmes (2014) also
suggest linear modelling and correlation techniques as being especially useful for
making predictive models, but note that these are for making average
generalisations only and not for predicting individual match results.
They also note the use of non-parametric tests in sports analysis, as most
sports data does not meet the assumptions that apply to normally
distributed data. Peters and O'Donoghue (2013) used logistic regression to analyse the
performance of the Qatari football team and the influence of temperature on home
advantage. Finally, data mining methods such as Naïve Bayes allow the
probability of binary goal outcome results such as Under/Over 2.5 to be
determined based on variables such as weather (Witten, Frank and Hall, 2011).
Overall there is a range of techniques and models applicable to sports analysis,
with an emphasis on simplicity, whether for drawing generalisations about overall
trends or for making predictions of future match outcomes.
2.3
Physical features such as altitude can affect weather, influencing temperature and
wind speed through changes in atmospheric pressure and terrain. The lapse rate,
which is the rate at which temperature decreases with an increase in altitude, is
approximately a 0.65°C drop in temperature per 100m of height gain. However,
the actual lapse rate is specific to each location and the conditions at the time
(Stone and Carlson, 1979). The use of equivalent indices such as heat index, wind
chill and apparent temperature has been noted as being of particular use for
sports because the combined interaction of weather effects is considered (Perry,
2004). One index of particular interest is the apparent temperature index
(Steadman, 1984), which was first outlined in 1979 and then revised to its currently
used format in 1984. A simplified version that uses temperature (T), wind speed
(V) and humidity (e) is most commonly used today and is often described as a
'feels like' temperature, combining the wind chill and heat index to provide an
adjusted temperature that takes into account wind speed and humidity.
Steadman's (1984) approach uses a linear model which is applicable to most
outdoor conditions because it contains wind speed, but it should be noted that there
can be no accurate formula for what a person actually feels.
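The two relationships described above can be sketched in a few lines. The Python below is for illustration only (the study uses R Studio). It assumes the widely published simplified, non-radiation form of the apparent temperature formula, AT = T + 0.33e - 0.70V - 4.00, with vapour pressure e (hPa) derived from relative humidity; the source does not state the exact coefficients it applied, so they should be treated as an assumption, as should the function names.

```python
import math

LAPSE_RATE_C_PER_100M = 0.65  # approximate figure quoted in the text

def lapse_rate_adjusted(temp_c, height_gain_m):
    """Estimate temperature after a gain in height using the average lapse rate."""
    return temp_c - LAPSE_RATE_C_PER_100M * height_gain_m / 100.0

def vapour_pressure_hpa(temp_c, rel_humidity_pct):
    """Water vapour pressure (hPa) from temperature and relative humidity."""
    return rel_humidity_pct / 100.0 * 6.105 * math.exp(
        17.27 * temp_c / (237.7 + temp_c)
    )

def apparent_temperature(temp_c, rel_humidity_pct, wind_ms):
    """Simplified apparent ('feels like') temperature in deg C.

    Humidity raises the result, wind speed (m/s) lowers it.
    """
    e = vapour_pressure_hpa(temp_c, rel_humidity_pct)
    return temp_c + 0.33 * e - 0.70 * wind_ms - 4.00

# Example: a mild, humid, breezy day.
feels_like = apparent_temperature(9.1, 77.4, 3.5)
```

In this form the wind term subtracts 0.7°C per m/s, so in cool conditions even a moderate breeze outweighs the humidity term.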
also note that low altitudes, ranging between 500-2000m, would impair
aerobic performance due to reduced air pressure, although this can
vary and is specific to geographical location and individual player conditioning.
Acclimatisation has also been shown by Gelade and Dobson (2007) to have a
potentially significant effect in international football, where teams travel to a
different climatic region to which they are unaccustomed.
Precipitation can affect overall playing conditions, making the ball drag, lose spin
and skid on the playing surface. This can affect each team's ability to control the
ball and make it more difficult for the goalkeeper to make saves (Kickdex, 2014).
The analytics company suggests that rain does have an effect on playing
conditions and goal outcome by altering the prospects of the favorite (the team with
the shortest pre-match odds), which is more likely to prevail if it doesn't
rain. Their data indicates that the favorite's chance of winning is
reduced to 50% when there is rain, compared to 67% when there is no rain,
and that the total number of goals per match increases significantly
if rain occurs. However, the Kickdex (2014) study only considers 147 matches
played within London and only provides results. Rain has also been shown to
affect ball characteristics, and studies that investigate how a ball moves
in flight (Carre, Asai, Akatsuka and Haake, 2002) show that
precipitation affects the trajectory such that it is more likely to be off target.
Thornes (1977) notes that lower temperatures can be a hazard as blood to the
hands and feet is lost much more quickly in colder conditions, affecting sports like football
where players and goalkeepers rely on their extremities to maintain performance. Riley
and Williams (2003) also suggest that colder weather reduces limb temperatures,
which would detrimentally affect motor performance as well as strength and power.
In fact muscle power was found to be reduced by 5% for every 1°C drop in muscle
temperature below normal. The effects of very extreme hot or cold temperatures
on human physiology are known to directly affect both performance and health
The effect of temperature on the ball, which is made of viscoelastic materials, is also
a factor: with temperatures approaching zero degrees Celsius a goalkeeper has
7% more time to react to a penalty than at higher temperatures, when the ball moves
quicker (Wiart, Kelley, James and Allen, 2012). The flight of the ball is also affected,
with colder conditions causing the ball to drop and move slower overall with less
power than at warmer temperatures. However, as Riley and Williams (2003) point
out, in colder weather the goalkeeper is most susceptible to reduced limb
temperature and dexterity unless they keep highly active. A study in baseball (Kraft
and Skeeter, 1995) found that temperature appeared to play a significant role in
how far a batter could hit a ball compared to factors such as wind speed, which
are specific to a particular stadium and local terrain and turbulence effects.
Advanced Football Analytics (2014) also found that temperature significantly
affected the success rate of field goals in American football, with lower
temperatures reducing the distance from which players can successfully score.
Finally, stadiums have been shown to affect weather factors (Szucs, Allard and
Moreau, 2009). Wind in particular can be mitigated significantly where stadiums
are enclosed or have roofs, although both temperature and humidity remain
unaffected. However, wind channeling effects and turbulence are highly localised and
specific to each stadium and the surrounding area. Kraft and Skeeter (1995) also
noted that such effects were hard to predict.
2.4 Conclusion
This review examined the area of weather effects on sport and the
statistical methods used in sports to gain an understanding of how such variables
affect sport performance and goal outcome. The review found multiple studies
suggesting that weather affects performance in sports like football, although
these studies often use
a qualitative approach to justify this. However, there is some conflicting opinion on
which weather effects influence performance and on the extent to which actual sports
performance, such as goal outcome, is affected. In that regard there is a lack of
quantitative knowledge in this field. Some of the sources used are commercial in
nature and so caution must be taken over the data and results they present, which
may not be entirely unbiased or subject to checking.
The overall system architecture is shown below in figure 1. The data sources that
form the basis of this study are freely available for download via a PC with internet
access. A stand-alone PC with the open source R Studio (2014) software is used
at all stages to gather, clean, combine, analyse and then make predictions. R
Studio requires online access to Google's Mapping API through a number of
additional packages to facilitate geolocation functions, calculation of altitude and
the creation of distance matrices to determine the nearest weather station to each
stadium. The results of the study and any viable predictions are then provided to
the customer as the end product.
3.2 Datasets
The study uses and combines four datasets to achieve a single data table for
feature engineering and analysis. One of the data sets, stadiums, was not available
in any single readily usable format and required the stadium names to be
manually obtained and entered for each unique team using secondary sources of
information. All the other data sets could be downloaded from their respective
sources, as outlined below, in either txt or csv file format. In total these four data
sets, when combined, provide all 12,926 match results with the best available
weather observation information specific to each location where the match was
played and on the correct date.
of the stadia, as well as potential errors in the recorded observation data for some
stations, only a subset of the stations covering Germany is used, with
some crossover. An example of the weather station data lists is
provided in Appendix 8.3.4. In total 32 stations out of the 84 viable possibilities
were used in the study.
German football teams can have multiple variations of their names in
current use. The names in the match data files and the stadium database lack unique
identifiers and are sufficiently different to prevent string matching using tools like
Levenshtein distance, requiring manual selection of stadium names for each team.
Stadium names also vary as many are named after sponsors, which can change each
year. Commonly used, older names may prevail and are not always consistent with
current Google Maps information. As such, geocoding for stadiums needs to be
checked manually to ensure the right stadium has been matched.
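To show why edit-distance matching struggles with these names, the following Python sketch (for illustration only; the study uses R Studio) implements the standard Levenshtein distance and applies it to a pair of name variants. The example strings are hypothetical and are not taken from the study's data files.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]

# Hypothetical variants of one club's name in two different sources.
short_name = "Bayern Munich"
long_name = "FC Bayern München"
distance = levenshtein(short_name, long_name)

# Relative to the length of the names the distance is large, so a
# threshold loose enough to accept this pair would also accept
# unrelated clubs with similar spellings.
relative_distance = distance / max(len(short_name), len(long_name))
```

Prefixes such as "FC", translated city names and sponsor names all add edits that are unrelated to whether two strings refer to the same club or ground, which is consistent with the manual selection described above.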
3.3 Data Processing
3.3.1 Introduction
The study's objective is to analyse football match outcome against weather
effects. To achieve this the four datasets outlined above must be merged into a
coherent and valid single data frame for analysis. Each of the four sets needs to
be treated individually to remove unwanted information and ensure consistent
formatting, and then combined to match viable weather stations and their
observations with stadium locations. As the data flow diagram in figure 2 shows, this is
not just a simple join of two data sets but an iterative process in which both weather
stations and their observation data need to be verified and checked.
In calculating this there are two key difficulties. Firstly weather stations are located
at a variety of altitudes and some are in mountainous regions which represent a
significant height and therefore temperature differential making them unusable.
Secondly, observation data for some variables is missing to such an extent that it
also renders that station unusable. The size and number of these files makes it
impossible to manually check observation files.
Removing stations requires the distance matrix to be recalculated and the next
nearest station selected, with altitude and observation checks undertaken again.
This process continues until no more errors are found. The overall data flow
diagram is shown below in figure 2. The colours shown indicate generally each
data set's role in the overall process, with blue being the weather station, green the
match data, grey the weather observation data and red the stadium data. The
merged data set is denoted orange at the point where all four sets are combined.
A detailed description of the key processes is outlined below.
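The iterative selection described above can be sketched as a loop over stations ordered by distance, skipping any that fail the altitude or completeness checks. The Python below is an illustrative sketch only (the study performs this step in R Studio with a distance matrix). The field names, the 200m altitude tolerance and the coordinates are hypothetical, and great-circle (haversine) distance is an assumed stand-in for whatever distance function the study used.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def nearest_valid_station(stadium, stations, max_alt_diff_m=200.0):
    """Return the STAID of the nearest station that passes both checks.

    Stations are tried in order of distance; one that is too far above or
    below the stadium, or whose observations are incomplete, is skipped
    and the next nearest is tried. Returns None if none qualify.
    """
    by_distance = sorted(
        stations,
        key=lambda s: haversine_km(stadium["lat"], stadium["lon"],
                                   s["lat"], s["lon"]),
    )
    for station in by_distance:
        if abs(station["alt"] - stadium["alt"]) > max_alt_diff_m:
            continue  # altitude differential too large
        if not station["complete"]:
            continue  # too much missing observation data
        return station["staid"]
    return None

# Hypothetical stadium and stations (coordinates in degrees, altitude in m).
stadium = {"lat": 48.0, "lon": 11.0, "alt": 500.0}
stations = [
    {"staid": "A", "lat": 48.05, "lon": 11.0, "alt": 2900.0, "complete": True},
    {"staid": "B", "lat": 48.20, "lon": 11.0, "alt": 520.0, "complete": False},
    {"staid": "C", "lat": 48.50, "lon": 11.0, "alt": 450.0, "complete": True},
]
chosen = nearest_valid_station(stadium, stations)
```

In this example the nearest station is rejected on altitude and the second on completeness, so the third is chosen.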
[Figure 2: Data flow diagram. Labelled elements: Weather Station Data set, Match Data set, Weather Observation Data, Final combined Data Set]
Finally this list is merged with the original stadium list which adds the weather
station ID (STAID) as a new column against each stadium name. STAID is critical
as a unique identifier to merge the weather observation data later on as the
weather observation data uses this.
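The role of STAID as the join key can be illustrated with a small sketch. The Python below is for illustration only (the study uses R Studio data frames). The stadium name, match and observation values are hypothetical, and the column names FTHG, FTAG, RR, HU, TG and CC follow those used elsewhere in the study.

```python
# Stadium list after the nearest station has been attached (hypothetical).
stadium_station = {"Stadium A": 51}

# Match results keyed by date and stadium (hypothetical values).
matches = [
    {"date": "22/09/2001", "stadium": "Stadium A", "FTHG": 2, "FTAG": 1},
    {"date": "22/09/2001", "stadium": "Stadium B", "FTHG": 0, "FTAG": 0},
]

# Weather observations keyed by (STAID, date).
observations = {
    (51, "22/09/2001"): {"RR": 0.0, "HU": 84, "TG": 10.6, "CC": 5},
}

def attach_weather(matches, stadium_station, observations):
    """Join each match to its weather observation via STAID and date.

    Behaves like an inner join: a match with no station or no
    observation for its date is dropped, which is why the row count
    needs to be checked after merging.
    """
    merged = []
    for match in matches:
        staid = stadium_station.get(match["stadium"])
        weather = observations.get((staid, match["date"]))
        if weather is None:
            continue
        merged.append({**match, "STAID": staid, **weather})
    return merged

merged = attach_weather(matches, stadium_station, observations)
```

Here the second match is dropped because its stadium has no station assigned, which mirrors how a missing observation file silently removes rows during the join.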
The process has a manual element in that the observation weather files for each
of the five variables need to be selected (based on the results of the distance
calculation) and moved into a working folder for merging and checking later on.
Following this check, files may need to be removed or added, and
the list of weather stations needs to be manually corrected to remove stations
with missing observation data before the process is repeated.
3.4
The weather observation data contains a variety of missing values. These are
identified during the checking process. Some missing values are one-offs with
no discernible pattern and are classified as Missing Completely at Random
(MCAR), as they are no more or less likely to be missing than any other value.
However, the error files also highlighted data which is almost certainly Missing Not
At Random (MNAR), such as October 2001 and February 2014, months that
had missing values for a large number of the weather stations. Generally,
missing data fell into three categories: 1) MCAR, typically one or two
consecutive days only; 2) MNAR, typically 10-20 days for specific recurring
months and years (figure 4); 3) large scale missing data for multiple months and
often years for an entire weather variable.
There are a variety of ways to deal with missing data. The approach taken in this
study is to preserve as much of the data as possible, as the percentage of
weather observation data affected is small. Of the 241,753 total lines just 1,323
lines contain errors, equating to 0.5%. If these were evenly distributed when the files
are joined there would be 70 lines out of 12,926 with errors. However, as
match data is based on weekly occurring events and errors are clustered across
10-20 consecutive days, the real error rate would be much lower. Deleting these
values was considered, but imputation (Yuan, 2010) was chosen to
provide unbiased replacement values for those occurrences.
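The arithmetic behind these figures can be reproduced directly. The short Python sketch below (for illustration only) uses the line counts quoted above; the text rounds the resulting rate to 0.5% and the expected count down to 70.

```python
total_observation_lines = 241_753
lines_with_errors = 1_323
total_matches = 12_926

# Share of weather observation lines containing at least one error.
error_rate = lines_with_errors / total_observation_lines

# Expected number of affected matches if errors were spread evenly.
expected_match_errors = error_rate * total_matches

print(f"error rate: {error_rate:.2%}")
print(f"expected affected matches: {expected_match_errors:.1f}")
```

This is an upper-bound style estimate: because errors cluster into runs of consecutive days while matches occur roughly weekly, fewer matches would be affected in practice.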
STAID  DATE        RR    HU  TG    CC  FG
51     19/09/2001  12.8  86  10.7  6   -999.9
51     20/09/2001  11.9  97  12.6  8   -999.9
51     21/09/2001  0.2   86  13.1  6   -999.9
51     22/09/2001  0     84  10.6  5   -999.9
51     23/09/2001  0     84  10.9  5   -999.9
51     24/09/2001  0     88  13    6   -999.9
51     25/09/2001  0     93  10.8  7   -999.9
51     26/09/2001  0     83  13.3  4   -999.9
51     27/09/2001  0     82  15.8  5   -999.9
51     28/09/2001  0     79  15.3  4   -999.9
51     29/09/2001  6.4   92  13.8  7   -999.9
51     30/09/2001  0.7   89  16.3  5   -999.9
Figure 4: Example of missing data for wind speed for 12 days in Sept 2001
The Amelia package (CRAN, 2013), which is accessed through R Studio,
allows missing values to be imputed in place of NA values. Filling in the missing
values using an algorithm that approximates a best fit based on values on either
side avoids bias, and the package has features that make it highly applicable to
time series data.
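The general idea of filling a gap from the values on either side can be shown with a much simpler technique than the one Amelia implements. The Python sketch below (for illustration only) uses plain linear interpolation over runs of the -999.9 missing-value code; it is a deliberately simplified stand-in and does not reproduce Amelia's algorithm or its results.

```python
MISSING = -999.9  # missing-value code as it appears in the merged data

def interpolate_missing(values, missing=MISSING):
    """Replace runs of the missing code by linear interpolation.

    Gaps at the start or end of the series, which have a neighbour on
    one side only, are filled with that neighbour's value. A series
    that is entirely missing is returned unchanged.
    """
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if out[i] != missing:
            i += 1
            continue
        start = i
        while i < n and out[i] == missing:
            i += 1
        left = out[start - 1] if start > 0 else None
        right = out[i] if i < n else None
        for k in range(start, i):
            if left is not None and right is not None:
                frac = (k - start + 1) / (i - start + 1)
                out[k] = left + frac * (right - left)
            elif left is not None:
                out[k] = left
            elif right is not None:
                out[k] = right
    return out

# Hypothetical wind speed series (m/s) with a two-day gap.
wind = [2.0, MISSING, MISSING, 5.0, 4.5]
filled = interpolate_missing(wind)
```

A multiple-imputation approach additionally reflects the uncertainty of each replacement value, which simple interpolation does not.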
3.5 Mashup
3.6 Feature Engineering
Time Variables
Month: Extracted from the date field. Creates 12 standard calendar months;
January, February, March etc. (Categorical)
Year: Extracted from the date field. Creates a four digit numeric value for the year.
(Numeric)
Season: Groups together 3 consecutive months to generate spring, summer,
autumn and winter categories. Summer is June, July & August. (Categorical)
Weather Variables
BScale: Creates a simplified categorical wind variable based on the Beaufort scale
of measurement with values from 0 to 9. (Numeric)
HUscale: Creates a simplified scale from 1 to 10 for humidity range in 10%
increments. (Numeric)
Rain: Rain is grouped into 6 categories; No Rain, Light rain, Moderate, Heavy,
Very Heavy and Violent. (Categorical)
Atemp: Apparent Temperature is derived from temperature, humidity and
wind speed and provides a 'real feel' equivalent (Steadman, 1984). (Numeric)
Goal Outcome
TotalGoals: The total goals scored per match. (Numeric)
GDiff: The difference between home team goals and away team goals. (Numeric)
OverUnder2.5: Calculates based on total goals whether the result is Under 2.5
goals or Over 2.5 goals (Categorical)
H_A_Win: Creates a match result value of HomeWin, AwayWin, or Draw
(Categorical)
Geographical
Area: Germany comprises 16 states (Appendix 8.2.3), which can be obtained
using the output=more parameter of the geocode function within R Studio.
Region: Combines the Areas above into larger general geographical weather
regions based on typical climate conditions experienced around the country
(Encyclopaedia Britannica, 2014 and Appendix 8.2.3).
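Several of the derived variables listed above follow simple rules and can be sketched as below. The Python is for illustration only (the study derives these in R Studio) and rests on stated assumptions: dates are in DD/MM/YYYY format, seasons other than summer follow the usual three-month meteorological grouping, wind speed is in m/s and mapped with the standard Beaufort thresholds, and HUscale rounds humidity up to the next 10% band. The function names are hypothetical, and the Rain categories are omitted because the source does not give their thresholds.

```python
import math
from datetime import datetime

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")
SEASON_BY_MONTH = {12: "Winter", 1: "Winter", 2: "Winter",
                   3: "Spring", 4: "Spring", 5: "Spring",
                   6: "Summer", 7: "Summer", 8: "Summer",
                   9: "Autumn", 10: "Autumn", 11: "Autumn"}
# Lower bound in m/s of Beaufort forces 1 to 12.
BEAUFORT_LOWER_BOUNDS_MS = (0.3, 1.6, 3.4, 5.5, 8.0, 10.8,
                            13.9, 17.2, 20.8, 24.5, 28.5, 32.7)

def time_features(date_text):
    """Month, Year and Season from a DD/MM/YYYY date string."""
    d = datetime.strptime(date_text, "%d/%m/%Y")
    return {"Month": MONTHS[d.month - 1], "Year": d.year,
            "Season": SEASON_BY_MONTH[d.month]}

def beaufort_scale(wind_ms):
    """Beaufort force: the number of lower bounds the wind speed reaches."""
    return sum(1 for lower in BEAUFORT_LOWER_BOUNDS_MS if wind_ms >= lower)

def humidity_scale(humidity_pct):
    """Humidity band from 1 to 10 in 10% increments."""
    return min(10, max(1, math.ceil(humidity_pct / 10)))

def goal_outcome(fthg, ftag):
    """The four goal outcome measures from home and away goals."""
    total, diff = fthg + ftag, fthg - ftag
    if diff > 0:
        result = "HomeWin"
    elif diff < 0:
        result = "AwayWin"
    else:
        result = "Draw"
    return {"TotalGoals": total, "GDiff": diff,
            "OverUnder2.5": "Over" if total > 2.5 else "Under",
            "H_A_Win": result}

row = {**time_features("22/09/2001"), **goal_outcome(2, 1),
       "BScale": beaufort_scale(3.1), "HUscale": humidity_scale(84)}
```

Because total goals is always a whole number, the 2.5 threshold can never produce a tie, which is what makes Under/Over 2.5 a clean binary outcome.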
Introduction
The processing stage as outlined in figure 2 and section 3.3 was subjected to
rigorous testing, error checking and evaluation at every stage to ensure that the
data sets were correct and accurate.
4) Spot check results using the single line code version and Google Maps.
If the imputation process had altered the data significantly then removing these
values would have been the next best alternative course of action.
Row count: Missing rows may indicate a missing weather observation file, which
would remove all associated values during the join process. Use nrow.
NAs: Values missed by the imputation process. Filter for NA and -9999 values and
use summary analysis to check.
Validity: Several random rows are checked against the raw data sets to ensure
the right dates have been joined with every weather variable. Manual check.
The file is saved as a csv file to preserve all processes and make retrieval easier for
the next stages.
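The first two checks lend themselves to automation. The Python sketch below is for illustration only (the study uses nrow and summary functions in R Studio); the function name, the row structure and the set of missing-value codes are assumptions.

```python
import math

def check_merged(rows, expected_rows, weather_cols,
                 missing_codes=(-9999, -999.9)):
    """Return a list of problems found in the merged data.

    Checks that no rows were lost in the join and that no weather
    value is absent, NaN or still set to a missing-value code.
    """
    problems = []
    if len(rows) != expected_rows:
        problems.append(
            f"row count {len(rows)} differs from expected {expected_rows}"
        )
    for i, row in enumerate(rows):
        for col in weather_cols:
            value = row.get(col)
            is_nan = isinstance(value, float) and math.isnan(value)
            if value is None or is_nan or value in missing_codes:
                problems.append(f"row {i}: {col} is missing")
    return problems

# Hypothetical merged rows: the second still carries a missing wind value.
rows = [
    {"TG": 10.6, "FG": 3.2},
    {"TG": 12.1, "FG": -999.9},
]
problems = check_merged(rows, expected_rows=3, weather_cols=["TG", "FG"])
```

The validity check, comparing random rows against the raw files, remains manual in the study and is not covered by this sketch.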
4.2 Data analysis
4.2.1 Introduction
The study's primary objective is to determine whether any relationship exists between
weather effects and goal outcome, and R Studio provides a platform to investigate
the data set using both inbuilt and additional packages. Not all of the data set
contains information that will be analysed, and categorical data cannot be analysed
using traditional statistical methods, although it can be tabulated, graphed and
explored using R Studio and data mining techniques.
Variable      mean      sd  median  trimmed    mad     min     max   range    se
FTHG          1.63    1.33    1.00     1.50   1.48    0.00    9.00    9.00  0.01
FTAG          1.17    1.13    1.00     1.02   1.48    0.00    9.00    9.00  0.01
alt         154.18  147.09  105.00   129.16  94.89    4.00  553.00  549.00  1.29
RR            1.94    4.04    0.10     0.94   0.15    0.00   49.90   49.90  0.04
HU           77.40   12.08   79.00    78.33  11.86   24.00  100.00   76.00  0.11
TG            9.10    6.38    9.20     9.11   6.52  -15.80   28.70   44.50  0.06
CC            5.43    2.18    6.00     5.67   1.48    0.00    8.00    8.00  0.02
FG            3.48    1.78    3.10     3.28   1.63    0.00   21.60   21.60  0.02
TotalGoals    2.80    1.71    3.00     2.72   1.48    0.00   13.00   13.00  0.02
GDiff         0.46    1.78    0.00     0.46   1.48   -8.00    9.00   17.00  0.02
BScale        2.51    0.90    2.00     2.47   1.48    0.00    9.00    9.00  0.01
HUscale       8.19    1.24    8.00     8.28   1.48    3.00   10.00    7.00  0.01
Atemp         5.76    7.67    5.40     5.65   8.15  -20.30   30.10   50.40  0.07
A preliminary approach (table 1) reveals some useful information for the data set
as a whole. As expected, more home goals are scored than away goals, although
both have a maximum of nine, which is very high compared to the mean and
indicates that such high scoring events are very rare. Mean goal difference is
positive, indicating that home team wins are again more prevalent. For the
weather data, rainfall is typically low with occasional heavy downpours, based on
the mean and range. Humidity is often quite high, as is the cloud cover. Apparent
temperature has a lower mean than temperature alone, suggesting that wind speed
plays a greater role in reducing the feels-like temperature than humidity does in
increasing it.
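For readers who want to reproduce the summary columns, the sketch below shows one plausible set of definitions. It is a Python illustration, not the study's R code, and it assumes that "trimmed" is a 10% trimmed mean, that "mad" is the median absolute deviation scaled by 1.4826 (which would explain the value 1.48 for the integer variables), and that "se" is sd divided by the square root of the sample size. The sample values are made up.

```python
import math
import statistics


def describe(values, trim=0.1):
    """Summary statistics under the assumed definitions stated above."""
    xs = sorted(values)
    n = len(xs)
    k = int(n * trim)            # observations cut from each tail
    core = xs[k:n - k]           # sample used for the trimmed mean
    med = statistics.median(xs)
    # Scaled median absolute deviation (1.4826 is an assumed constant)
    mad = 1.4826 * statistics.median([abs(x - med) for x in xs])
    sd = statistics.stdev(xs)    # sample standard deviation
    return {
        "mean": statistics.mean(xs),
        "sd": sd,
        "median": med,
        "trimmed": statistics.mean(core),
        "mad": mad,
        "min": xs[0],
        "max": xs[-1],
        "range": xs[-1] - xs[0],
        "se": sd / math.sqrt(n),
    }


# Made-up goal counts, for illustration only
sample_goals = [0, 1, 1, 2, 2, 3, 3, 4, 5, 9]
for name, value in describe(sample_goals).items():
    print(f"{name:8s} {value:.2f}")
```

With these definitions a single extreme value (the 9 in the sample) pulls the mean above both the median and the trimmed mean, which is the same pattern the table shows for rainfall.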
A range of graphs and charts covering goal structure and distribution and the
primary weather variables is considered.
Mean goals follow a relatively smooth curve, although the top 7 teams do show a
stepped increase in average goals scored. Mean goals scored are 2.8.
Total goals scored in figure 8 show a stepped divide: teams have either scored
800 goals or more, or fewer than 450 goals, with only one team between these
limits. Some teams have survived in the leagues for longer than others, and in
Bundesliga 2 (B2) there is much more volatility in team movement, with 66 teams
having played in B2 over the 21 years compared to 37 in B1.
Figure 9 over the page shows the distribution of total goals for each stadium. The
Allianz Arena in Munich has the highest goal density because it has two teams
that have consistently participated in the B1 and/or B2 league for the entire
21 year period. Generally the overall density of goals scored is highest in the west
of the country, where most teams and stadia are located.
[Figure annotations. Bundesliga 1: mean = 2.87, sd = 1.71. Bundesliga 2: mean = 2.72, sd = 1.72.]
Figure 10 shows that goal distributions are positively (right) skewed, and that the
distribution structure, mean and standard deviation are virtually identical between
the two leagues, with B1 having a slightly higher overall scoring average than B2.
Rainfall is very heavily positively skewed, as in figure 13, with the majority
of all instances, 6500 days (over 50%) in the data set, having no recorded rain at
all. Heavier periods of rain are increasingly intermittent and rare, with most rainfall
being either light or moderate.
With a median of 6 oktas, Germany can be considered a fairly overcast country, with
cloud cover being a prominent feature year round (as in figure 14). In fact, clear
days with little or no cloud (categories 0, 1 and 2) combined account for just 12.5%
of the total time period.
Wind speed is typically low, with almost all occurrences being below 6 m/s. This
roughly equates to Beaufort scale 3 and below and accounts for 87% of all
matches in the data set, with only 13% of all matches seeing higher wind speeds.
Figure 17: Total Goals (Mean) graph for Apparent Temperature and Rainfall
Rainfall indicates a possible increase in goals scored for very heavy rain; however,
this category comprises only 49 matches. The violent category has only 5
matches. While a sample size of 30 is typically considered statistically sufficient,
in this case the unknown nature of the occurrence (see Limitations of the Data,
3.2.5) makes it less reliable and the results should be treated with suitable caution.
There is no clear indication that rain affects total goals scored in relation to a
particular trend or pattern, although it may act as an interference factor.
Figure 19: Total Goals graph for Wind Speed and Humidity
The wind scale histogram demonstrated that the majority of all occurrences (87%)
were within the first four categories, which are relatively low wind speeds. At
Beaufort Scale 6 (BS6) there are just 28 matches, followed by 3, 6 and 2 matches
for BS7, 8 and 9 respectively. While the average scores in figure 19 indicate that
wind is having an interference effect by increasing goals scored, the sample
is too small to draw any reliable conclusions.
Likewise, humidity scale 3 consists of just 7 matches (2 for B1) and, while there
are 67 for scale 4, the sample size is also probably too low. Analysis of the 7
matches shows that these were all played in mild conditions with no other
significant weather effects being present at the time. There is no clear trend or
pattern to suggest that humidity has an effect on total goal outcome.
Figure 20: Total Goals scored for Cloud Cover and Month
Cloud cover in figure 20 shows no clear trend or pattern and, although B1 teams
seem to perform very slightly better in clearer conditions, the same cannot be said
for B2 teams. The overall effect is minimal. Looking at the monthly totals, a
typical football season in Germany runs from August through to May. However,
there are some years which started early in July (86 matches) and finished later in
June (128 matches). May and June, representing 10% of the data, together seem
to show much higher averages at the end of the season, which could be from rising
temperatures at this time but could also be due to factors other than weather.
In Bundesliga 1 average goals scored remain high all year, but there is a small but
noticeable drop of 7.5%, from 2.95 to 2.73 average goals scored, as the season
moves from autumn into the winter period. In Bundesliga 2 this is less noticeable,
at a 3.25% drop, although the winter period does have the lowest average goals
scored. A Kruskal-Wallis test, applied because the data is not normally
distributed, shows that this is statistically significant, although again it is of
limited practical use. Playing in the east of the country (408 B1 and 1079 B2
matches for this region) seems to affect B1 teams, with average goals lower than in
any other region. The South East sees the biggest differential between B1 and B2
teams despite a comparable 850 B1 matches and 1005 B2 matches being played there.
Kruskal-Wallis tests indicate no statistical significance for humidity and cloud cover,
indicating that those groups are very similar. All other variables show a difference
between groups.
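As a rough illustration of the test used here, the sketch below computes the Kruskal-Wallis H statistic with the usual tie correction. It is written in Python for illustration and is not the study's R code; the function name and the three small groups are hypothetical. H is compared with a chi-square distribution with (number of groups - 1) degrees of freedom, for example a 5% critical value of 5.991 for three groups.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = sorted(x for group in groups for x in group)
    n_total = len(pooled)

    # Give each distinct value the average of the rank positions it occupies.
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    rank_term = sum(
        sum(rank_of[x] for x in group) ** 2 / len(group) for group in groups
    )
    h = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

    # Tie correction; goal counts are integers, so ties are very common.
    # (If every value is identical the correction is zero and H is undefined.)
    ties = Counter(pooled)
    correction = 1 - sum(t ** 3 - t for t in ties.values()) / (n_total ** 3 - n_total)
    return h / correction


# Hypothetical, fully separated groups
h_value = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h_value, 3))
```

Because the test works on ranks rather than raw values, it does not require the goal data to be normally distributed, which is the reason given in the text for choosing it.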
Apparent temperature in figure 22 (left) appears to show the OU ratio flatten out
at temperatures of zero degrees and below, compared to temperatures above
freezing where the over metric has a clear advantage. It is also slightly flatter at
higher temperatures in the 20-31 category. Rainfall does not appear to show any
change in OU trends for changing rainfall intensity.
In figure 24 (left), at Beaufort scale 4 (1382 matches) and above, there is no
distinct difference between over and under results. In fact, the under 2.5 goals
count is slightly higher for BS4 and flattened for BS5. Wind is potentially having
an interference effect by reducing the number of goals being scored slightly.
Humidity (right), however, shows no trend or pattern, with all ranges being similar.
Winter shows some levelling out effect compared to the other three seasons due
to lower mean goals scored. Region seems to be having some effect, with the East
going against the overall average and the South East being generally flat compared
to the Coastal and West regions.
Two graphs of further interest are Apparent Temperature and Wind speed. These
are shown above in figure 26, normalised to show the proportional difference
between groups. For apparent temperature there is less of a difference between
the ranges than the previous plots suggest, although the probability of
an over result increases slightly to 55% for the two warmest categories. At Beaufort
scale 6, 7 and 8, over results are achieved 71%, 67% and 67% of the time
respectively. However, with just 28, 3 and 6 matches played in the three ranges
respectively, the sample size is very low. The switch to all results being
under for the two matches played at BS9 highlights the potential volatility of this,
although there is some evidence that high wind speeds are potentially causing
interference leading to higher goals scored.
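The normalisation described above amounts to dividing each group's over and under counts by the group total. The Python sketch below illustrates this; the over/under counts are not stated in the text and have been back-calculated from the quoted percentages and match totals, so they should be read as an assumption. The threshold of 30 is the rule of thumb mentioned earlier in this chapter.

```python
MIN_SAMPLE = 30  # rule-of-thumb sample size mentioned earlier in the text

# (over, under) counts per Beaufort scale, back-calculated (assumed) from
# 71% of 28, 67% of 3, 67% of 6 and 0% of 2 matches
counts = {"BS6": (20, 8), "BS7": (2, 1), "BS8": (4, 2), "BS9": (0, 2)}


def normalise(over, under):
    """Convert raw over/under counts into proportions that sum to 1."""
    total = over + under
    return over / total, under / total


for scale, (over, under) in counts.items():
    p_over, p_under = normalise(over, under)
    note = "small sample" if over + under < MIN_SAMPLE else ""
    print(f"{scale}: over {p_over:.0%}, under {p_under:.0%}, n={over + under} {note}")
```

Normalising makes groups of very different sizes comparable on one chart, but it also hides the group size, which is why the match counts need to be read alongside the proportions.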
[Figure annotation: 99.6% of all matches played]
Wind speed in figure 28 shows the same interference at BS6 and above with the
number of draw results increasing slightly as wind speed increases at lower
speeds. There is no clear trend for humidity.
Wind in figure 32 (left) shows a slight decrease in home wins as wind speed
increases from BS2 to BS5, after which there is significant interference. Home wins
appear to be benefitting but the small sample size precludes drawing any
conclusions. Humidity indicates a stable pattern of HAD results excluding
categories three and four, which have low sample sizes.
Figure 33 for cloud cover and calendar month shows no variation in HAD results.
Home, Away or Draw results do not seem to be affected by weather effects overall.
There is some evidence that falling temperature reduces home wins and increases
draws slightly, and that increasing wind speed also reduces home wins. However,
the effect is very small. As with previous results, the low sample size for the extreme
ranges of wind and rainfall prevents drawing any conclusions.
4.3 Altitude Effects
Altitude, while not a direct weather effect, does affect player performance through
a reduction in atmospheric pressure and is therefore an external environmental
effect which could be considered an indirect weather effect.
Mean goals scored trend very slightly downward with altitude and drop
to just 2.38 where the altitude is 500m or over. The UO metric shows a
greater skew towards under results at 59%. There have been 323 games played
at 500m or more at two stadia, which is a reasonable sample size. Pressure tends
to be constant, so there is less risk that the condition was not present when the
matches were played. The issue with these results is that individual team results
are driving them. Unterhaching and Augsburg are the residents at these stadia,
and their goal averages of 2.25 and 2.58 respectively (figure 7) mark them as
very low scoring teams overall; it is this that is being represented
rather than the effects of altitude.
4.4
In looking for any linear relationship and correlation, the dependent variable Total
Goals is used against the five numeric independent variables of temperature (C),
wind speed (m/s), humidity (%), cloud cover (oktas) and rainfall (mm). The
addition of the B1 and B2 variable shows how the two leagues vary from each
other. The output results are shown in figure 36 below.
Overall there appears to be no real correlation between any of the weather effects
and total goals scored with temperature having the best match at just 0.05
indicating no real correlation. This would appear to be supported by the Goal
outcome analysis which only showed evidence of small changes overall and at
extreme ranges and no real obvious trends or patterns.
Figure 37: Scatter plot of total goals against Daily average temperature
The lack of relationship is highlighted by the scatter plot in Figure 37 above, which
shows an almost flat linear regression line between temperature and Total Goals,
with a correlation value of just 0.0527. Multiple linear models were also
created and tested using all the available weather values, for both Total
Goals and Goal Difference. The results were conclusive in that no relationship
exists between them whatsoever.
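To make the correlation and regression line concrete, the sketch below computes a Pearson correlation coefficient and a least-squares line from first principles. It is a Python illustration with hypothetical function names and toy data, not the study's R code. Squaring the reported correlation of 0.0527 gives about 0.0028, so temperature would account for roughly 0.3% of the variance in total goals.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ols_line(xs, ys):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy data only: temperatures (C) and total goals for six imaginary matches
temps = [-2.0, 3.5, 8.0, 12.5, 18.0, 24.0]
goals = [2, 3, 1, 4, 3, 3]

print(f"r = {pearson_r(temps, goals):.3f}")
print("slope, intercept =", ols_line(temps, goals))
print(f"share of variance at r = 0.0527: {0.0527 ** 2:.4f}")
```

A slope close to zero is what produces the almost flat regression line described for Figure 37.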
4.5
Although the results indicate there is no relationship between goal outcome and
weather, data mining techniques could allow for knowledge discovery and
potentially even predictive modelling. An approach based on a Naïve Bayes
algorithm was selected as it uses data from previous events to try to predict future
outcomes. The event being predicted is the OverUnder2.5 goals variable using the
feature engineered weather factors: Rain, HUscale, CC, ATempScale and
BScale. These are all factorised for use within the algorithm. The advantage of
Naïve Bayes is that it assumes all of the features in the data set are of
equal importance, which makes it good at detecting potentially weak effects, as
could be likely in this instance.
The data set is split randomly into an 80% training set and a 20% test set to allow
the algorithm to learn any rules and then apply them in predicting goal outcome on
the test data set. Of the 2586 rows in the test data set, 1366 were
correctly predicted, equating to a 52.8% success rate. As the Over Under metric is
essentially an even spread, this indicates that the Naïve Bayes predictive model
using weather variables is potentially no better than making a guess, where the
probability of getting a right result is essentially 50/50. An output of the Naïve
Bayes probabilities is provided in Appendix 8.5.
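The procedure described above can be sketched as a small categorical Naïve Bayes classifier with an 80/20 random split. This is a Python illustration under stated assumptions, not the study's R implementation: the add-one (Laplace) smoothing, the function names and the toy rows are all hypothetical choices made for the example.

```python
import math
import random
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)              # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
            values[i].add(value)
    return class_counts, feature_counts, values


def predict(model, row):
    """Return the class with the highest log posterior for one row."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)
        for i, value in enumerate(row):
            # Add-one smoothing (an assumption) avoids zero probabilities
            seen = feature_counts[(i, label)][value]
            score += math.log((seen + 1) / (n_label + len(values[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def split(rows, labels, test_share=0.2, seed=1):
    """Random split into training and test sets."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_share))
    pick = lambda ids: ([rows[i] for i in ids], [labels[i] for i in ids])
    return pick(idx[:cut]), pick(idx[cut:])


# Toy rows of (rain category, Beaufort scale); labels are Over/Under 2.5 goals
rows = [("None", 2), ("Light", 3), ("None", 1), ("Heavy", 6), ("Light", 2)] * 4
labels = ["Under", "Over", "Under", "Over", "Over"] * 4

(train_x, train_y), (test_x, test_y) = split(rows, labels)
model = train(train_x, train_y)
hits = sum(predict(model, r) == y for r, y in zip(test_x, test_y))
print(f"toy accuracy: {hits}/{len(test_y)}")
print(f"reported accuracy: {1366 / 2586:.1%}")
```

The last line reproduces the accuracy quoted in the text from the reported counts.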
4.6 Analysis Conclusion
The study has shown that mean goals scored are in fact reduced slightly as
temperature drops, which is reflected seasonally, with much colder apparent
temperatures of below -10C seeing a slight further drop. While this is statistically
significant, the overall change in mean goals is not large enough from a
practical perspective to have any meaningful impact. Humidity, cloud cover,
rainfall and wind speed appear to have no measurable effect whatsoever on goal
outcome, although wind and to some extent rainfall show interference at the
extreme limits for just 0.4% of the total matches played. These extreme events are
too rare, and the sample of matches too small, to infer any specific pattern or
trend, although very high wind speeds seem to increase goals scored and favour
home wins. Realistically, the most we can infer is that there is potentially an
interference effect.
Predictive modelling using the dataset or data mining techniques is not possible
due to the lack of relationship between goal outcome and weather effects. No
pattern or trend could be extracted from the data using a Naïve Bayes algorithm.
5 Conclusions
5.1.1 Introduction
The aim of the study was to establish if there is any relationship between weather
effects including temperature, humidity, wind speed, rainfall and cloud cover
against football match goal outcome within the German Bundesliga 1 & 2 leagues
over the last 21 years. The sports industry has been shown to be an economically
important part of western economies as outlined in chapter 1 and teams and
betting product providers are seeking to gain competitive advantage wherever
possible. The study seeks to redress the lack of knowledge with respect to
understanding how or if weather affects goal outcome in football matches. Existing
knowledge in this area cannot answer with certainty to what extent weather
variables affect match outcome or which weather factors play a role. The study
seeks to answer one primary question which is: -
Is there any relationship between weather effects and goal outcome for football
matches played within the Bundesliga 1 & 2?
Does regional weather affect games played in that area and goal outcome?
Region does not appear to affect goal outcome and, although average goals showed
a greater differential between B1 & B2 leagues, all other goal outcomes showed no
change due to region. The UO2.5 spread was found to be uneven, with
the probability of an under score increasing to 64%.
Can the goal outcome of future matches be better predicted using selected
betting instruments and weather factors?
There is nothing to suggest that the four goal outcomes can be used to predict match
results based on weather factors. High winds and high rainfall showed interference
effects, such that wind speeds of BS6, 7 & 8 indicated a 0.7 probability of an over
result and probably a home win result, but the sample size is too low (less than 0.4%
of matches) to say with any certainty that this is correct. Analysis using the Naïve
Bayes algorithm to predict the Over Under result was unable to make
predictions beyond a 52% accuracy level, which is no better than a random guess.
Do some seasons, months or time periods see goal outcome affected more
due to weather effects?
Seasonally there was a small but measurable drop in average goals scored moving
from autumn into winter. This was matched by a flattening of the Over Under 2.5
goals scored to an almost even spread. However, the effect was slight and
has limited practical application and impact.
Can goal outcome be better predicted using combined weather indices such
as apparent temperature?
The use of apparent temperature has the effect of extending the temperature range
compared to if just air temperature alone was used. However, there is nothing to
suggest that apparent temperature offers any significant benefit over just
temperature and no predictive advantage can be gained from using it based on its
use in the study.
Is there any relationship between weather effects and goal outcome for
football matches played within the Bundesliga 1 & 2?
Aside from several small effects as outlined above and some possible interference
effects for extreme events there is no relationship between goal outcome and
weather effects. The null hypothesis (see section 1.2.2) is therefore accepted.
extreme limits which are rare, at less than 0.4% of matches played, and due to the
small sample sizes not reliable enough to draw patterns or trends from. However,
the use of internal factors such as betting odds in conjunction with rainfall (Kickdex,
2014) could yield additional information not considered within this study. Other
studies have made generalist statements suggesting that most sports are
affected by humidity, temperature and wind (Pezzoli, Cristofori, Moncalero,
Giacometto and Boscolo, 2013), but again for the climatic region of Germany these
factors almost certainly play no role in affecting goal outcome and overall
performance. Thornes' (1977) suggestion that colder conditions are a hazard to
goalkeepers in maintaining performance is also not borne out by the lower goals
scored in colder conditions, although the effect on players' extremities overall could
be a factor in lower than average goals scored during colder months.
Overall the existing body of knowledge has been perhaps too general in its
approach and has lacked analytical methods to quantify such claims potentially
overplaying the effects weather is really having on sports like football. This could
be avoided by being specific for the sport type being analysed and limited to a
specific geographic region of study to ensure accuracy. Almost certainly a larger
sample size is required to better understand the effects of high wind and rain.
5.1.3 Conclusion
Despite ongoing interest and opinion on the effects of weather on football matches
there does not appear to be any relationship between weather effects and football
match goal outcome. Temperature effects are minimal and while they do reduce
average goals scored in colder conditions the effect is not large enough to be
practically significant and could be attributed to external factors. All other variables
also do not appear to have any measurable practical impact other than several slight
and often localised effects which are too small or rare in occurrence to draw any
meaningful conclusions, except to say they potentially have an interference effect.
The weather, it seems, does not kill goals, and football as a sport within the
Bundesliga 1 & 2 Leagues seems to be unaffected by weather effects overall.
The inclusion of a stadium factor could help investigate the effects stadiums have
as they range from almost totally open fields to being fully enclosed venues with
roofs. This adds further complexity however in data preparation as stadiums have
changed over time due to refurbishment and teams moving.
As part of a broader and long term study the installation of dedicated weather
stations at every major stadium in Europe would allow for the recording of
continuous weather data at the specific point at which matches are played. As
many stadia are also used for athletics events this would also potentially provide
additional study in these areas of sport as well and help broaden the research base
being considered. By measuring some parameters such as wind speed outside the
stadium as well as at pitch level the effect of stadia could be better understood on
how it mitigates weather factors such as wind.
7 References
Anderson, C., and Sally, D. (2013) The Numbers Game: Why everything you know
about football is wrong. Penguin Books.
BBC (2014) Football Betting The global industry worth Billion. [Online]. BBC.
Available at: http://www.bbc.com/sport/0/football/24354124 [Accessed 29th May
2014].
British Weather Services (2013) Cold Kills Goals The stats [Online]. British
Weather Services. Available from: http://www.britishweatherservices.co.uk/coldkills-goals-the-stats/ [Accessed 15 November 2014].
Advanced Football Analytics (2014) Temperature and Field Goals, Advanced
Football Analytics. Available from:
http://www.advancedfootballanalytics.com/index.php/home/research/weather/165
-temperature-and-field-goals [Accessed 18th December 2014].
Carr, M. J., Asai, T., Akatsuka, T., and Haake, S. J. (2002) The curve kick of a
football II: flight through the air. Sports Engineering, 5(4): 193-200.
Cran (2014) Amelia II: A Program for missing data [Online]. Cran r-project.
Available from: http://cran.r-project.org/web/packages/Amelia/index.html
[Accessed 15th November 2014].
Deloitte (2014) Premium Blend: A review of football finance. [Online]. Deloitte.
Available from: http://www.deloitte.com/assets/DcomItaly/Local%20Assets/Documents/Pubblicazioni/uk-sbg-annual-review-of-footballfinance-2014.pdf [Accessed 15th November 2014].
Domingos, P. (2012) A few useful things to know about machine learning.
Communications of the ACM, 55(10): 78-87.
Hamilton, H. (2014) Does the cold really kill Goals? Howard Hamilton Blog, 1st
May. Available from: http://www.soccermetrics.net/leaguecompetitions/temperature-vs-goals-study-premier-league [Accessed 24th May
2014].
from:
http://blog.kickdex.com/post/68368668405/does-rain-level-the-
from:
http://www.mecn.net/German_Betting_and_Gambling_Market-
O'Donoghue, P., and Holmes, L. (2014) Data Analysis in Sport. New York:
Routledge
Paddy Power (2014) Stephen Hawking Exclusive: The maths that show us how
England can triumph in the world cup [Online]. Paddy Power. Available from:
http://blog.paddypower.com/2014/06/18/stephen-hawking-exclusive-the-mathsthat-show-us-how-england-can-win-the-world-cup/ [Accessed 10th December
2014].
Perry, A. (2004). Sports tourism and climate variability. Advances in Tourism Cli.
Pezzoli, A., Cristofori, E., Moncalero, M., Giacometto, F., and Boscolo, A. (2013)
Climatological Analysis, Weather Forecast and Sport Performance: Which are the
Connections?, Journal Climatol Weather Forecasting 1: e105
PKR. (2014) Under Over Betting [Online]. PKR. Available from:
Studio (2014) studio. [Online]. Rstudio. Available from:
Witten, I. H., Frank, E., and Hall, M.E (2011) Data Mining: Practical machine
learning tools and techniques. 3rd ed. Morgan Kaufmann.
World Stadiums (2014) World Stadiums [Online]. World Stadiums. Available from:
http://www.worldstadiums.com/ [Accessed 15th September 2014]
Yuan, Y. C (2010) Multiple imputation for missing data: Concepts and new
development (Version 9.0). SAS Institute: Rockville.
8 Appendix
8.1 Glossary of Terms
Goal Difference
Goal Outcome: For the purpose of the study, goal outcome refers to one of four
measurements: Under/Over 2.5 Goals, Goal Difference, Home/Away Win or Total Goals.
Home/Away Win: Either the home team wins, the away team wins or it is a draw.
Naïve Bayes
PCA
UEFA
Under/Over 2.5
R Studio
Total Goals: The total goals scored in a match is the home team goals plus
the away team goals.
8.2 Geography
Germany Map showing primary cities, towns and major topography like mountains
and plains.
Source: http://www.worldatlas.com/webimage/countrys/europe/lgcolor/decolor.gif
[Map labels: Coastal & NW, East, West, South West]
Germany has 16 distinct states. 3 (Berlin, Bremen and Hamburg) are city states.
These were simplified by combining into four regions as shown above.
[Map labels: Coastal and NW Region, East Region, South East, Bavaria, West]
Source: http://www.itcwebdesigns.com/tour_germany/map_german_states.gif
R Studio Mapping showing all of the 73 stadiums for Bundesliga 1 & 2 and all of
the possible 84 weather stations that contain every weather variable prior to
matching and selecting the nearest. Some weather stations are at map edges or
out at sea and will not be any use. Overall there is a good match between the two
locations although this assumes that all weather stations are viable at this stage.
The Dusseldorf area shown within the red dashed box is shown over the page as
a point of interest.
A zoomed map showing the Dusseldorf Area. This area has the highest density of
stadiums and teams linked to a single weather station. 6716 of all games played
(52%) are linked to this weather station ID (479). One of the project risks was
potentially that stations like this were unusable which could jeopardise the entire
study.
Map showing the final 32 weather stations with useable observation data for all
five weather variables and all 73 stadiums for both Bundesliga 1&2.
8.3 Data Sets
Div  Date        HomeTeam        AwayTeam        FTHG  FTAG  FTR  HTHG  HTAG  HTR
D1   09/08/2002  Dortmund        Hertha             2     2  D       2     1  H
D1   10/08/2002  Cottbus         Leverkusen         1     1  D       0     0  D
D1   10/08/2002  M'gladbach      Bayern Munich      0     0  D       0     0  D
D1   10/08/2002  Nurnberg        Bochum             1     3  A       0     2  A
D1   10/08/2002  Schalke 04      Wolfsburg          1     0  H       0     0  D
D1   10/08/2002  Stuttgart       Kaiserslautern     1     1  D       1     0  H
D1   11/08/2002  Bielefeld       Werder Bremen      3     0  H       1     0  H
D1   11/08/2002  Hamburg         Hannover           2     1  H       0     1  A
D1   14/08/2002  Munich 1860     Hansa Rostock      0     2  A       0     1  A
D1   17/08/2002  Bayern Munich   Bielefeld          6     2  H       3     0  H
D1   17/08/2002  Bochum          Cottbus            5     0  H       3     0  H
D1   17/08/2002  Hannover        Munich 1860        1     3  A       1     2  A
D1   17/08/2002  Hansa Rostock   Nurnberg           2     0  H       1     0  H
D1   17/08/2002  Hertha          Stuttgart          1     1  D       0     1  A
D1   17/08/2002  Kaiserslautern  Schalke 04         1     3  A       1     0  H
D1   17/08/2002  Leverkusen      Dortmund           1     1  D       1     0  H
D1   18/08/2002  Werder Bremen   Hamburg            2     1  H       1     1  D
D1   18/08/2002  Wolfsburg       M'gladbach         1     0  H       0     0  D
D1   24/08/2002  Bielefeld       Wolfsburg          1     0  H       1     0  H
D1   24/08/2002  Cottbus         Hansa Rostock      0     4  A       0     2  A
D1   24/08/2002  Dortmund        Stuttgart          3     1  H       1     0  H
D1   24/08/2002  Hamburg         Bayern Munich      0     3  A       0     1  A
Football-Data (2014) provides a single file in csv format for each season of play
and for each division separately. Data is available from 1993 onwards. There are
typically 306 matches for each season based on 18 teams participating in each
League. There are 52 columns of data per file for most files containing the date,
final time results, half time results, where the game was played and a variety of
betting information. For earlier years not all of this information was recorded so the
files are inconsistent in their structure and layout. Twenty one seasons of matches
(306 games per season) across two leagues equates to 12,926 games played in
total. Note that this is slightly higher than expected as some years featured 20
teams in a season resulting in more games being played. This data is spread
across 42 csv files. The first six columns represent consistent elements across all
42 files from which all four goal outcome measurement metrics can be calculated.
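The sketch below illustrates how the four goal outcome metrics could be derived from the full time home and away goals (FTHG and FTAG), together with the arithmetic behind 306 matches per season. It is a Python illustration rather than the study's R code; the function names and the "HomeWin" label are assumptions, and the double round-robin format (every team hosting every other team once) is assumed to explain the match count.

```python
def goal_outcomes(fthg, ftag):
    """Derive the four goal outcome metrics from full time home/away goals."""
    total = fthg + ftag
    if fthg > ftag:
        result = "HomeWin"   # assumed label; only "AwayWin" and "Draw" are shown
    elif fthg < ftag:
        result = "AwayWin"
    else:
        result = "Draw"
    return {
        "TotalGoals": total,
        "GDiff": fthg - ftag,
        "OverUnder2.5": "Over" if total > 2.5 else "Under",
        "H_A_Win": result,
    }


def matches_per_season(teams):
    """Each team hosts every other team once (assumed double round-robin)."""
    return teams * (teams - 1)


# Bayern Munich 6-2 Bielefeld, taken from the sample rows above
print(goal_outcomes(6, 2))
print(matches_per_season(18), matches_per_season(20))
```

Seasons with 20 teams give 380 matches instead of 306, which is consistent with the total being slightly above 21 x 306 x 2.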
Temperature
Historic weather data example for Germany for Daily Mean Temperature
The above data sample is taken from station number 494 (Augsburg) for mean
daily temperature. The text files are comma delimited and provide (from left to right)
station number, source identifier, date (yyyy/mm/dd), temperature and quality
code. Each location file contains around 25,591 lines of data. The temperature is
provided in units of 0.1 degrees Celsius in its current raw format and requires a
decimal point to read correctly. For example, the first line of data above, for the
28th March, records a daily mean temperature of 6.5 degrees Celsius with no known
errors or missing data. Below freezing levels are identified with a minus symbol
(none shown in the example above).
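A minimal Python sketch of reading one observation line in the layout described above is given below. It is illustrative only: the sample lines, the source identifier value and the year are hypothetical, the date is assumed to be written as eight digits with or without separators, and -9999 is assumed to be the missing-value code as mentioned in the data checks.

```python
from datetime import date

MISSING_CODE = -9999  # assumed missing-value marker


def parse_observation(line, scale=0.1):
    """Split one comma delimited line and convert the value to real units."""
    staid, souid, raw_date, raw_value, quality = [f.strip() for f in line.split(",")]
    digits = raw_date.replace("/", "").replace("-", "")
    value = int(raw_value)
    return {
        "STAID": int(staid),
        "SOUID": int(souid),
        "date": date(int(digits[:4]), int(digits[4:6]), int(digits[6:8])),
        # Values are stored in tenths, so 65 becomes 6.5 degrees Celsius
        "value": None if value == MISSING_CODE else round(value * scale, 1),
        "quality": int(quality),
    }


# Hypothetical lines in the described field order
print(parse_observation("494, 100123, 19930328, 65, 0"))
print(parse_observation("494, 100123, 19930329, -9999, 9"))
```

The same function would apply to rainfall and wind speed, which are also stored in tenths of their unit, while humidity and cloud cover would use a scale of 1.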
Humidity
Historic weather data example for Germany for Daily Humidity levels
The above ECA&D (2014) data sample is taken from station number 494
(Augsburg) for daily humidity. The text files are comma delimited and provide (from
left to right) station number, source identifier, date (yyyy/mm/dd), humidity in
percent and quality code.
Rainfall
Historic weather data example for Germany for Daily precipitation levels.
The above ECA&D (2014) data sample is taken from station number 494
(Augsburg) for daily precipitation. The text files are comma delimited and provide
(from left to right) station number, source identifier, date (yyyy/mm/dd),
precipitation in units of 0.1 mm and quality code.
Wind Speed
Historic weather data example for Germany for Daily mean wind speed.
The above ECA&D (2014) data sample is taken from station number 494
(Augsburg) for daily average wind speed. The text files are comma delimited and
provide (from left to right) station number, source identifier, date (yyyy/mm/dd),
average wind speed in units of 0.1 m/s and quality code. The actual wind
speed is the above figure divided by 10. For example, the first value for April shown
above would be 1.5 m/s.
Cloud Cover
Historic weather data for Germany for Cloud Cover.
The above ECA&D (2014) data sample is taken from station number 494
(Augsburg) for daily average cloud cover. The text files are comma delimited and
provide (from left to right) station number, source identifier, date (yyyy/mm/dd),
cloud cover in oktas and quality code. The okta scale provides a measure of
cloud cover from 0 to 8 according to the portion of sky covered. Zero
represents a totally clear sky while 8 is totally overcast.
Stadium Guide.com
Contains 34 listings of present day stadia currently in use.
Source: http://www.stadiumguide.com/present/germany/
World Stadiums.com
Lists 523 stadiums organised by state location. Includes all types of stadia
including football.
Source: http://www.worldstadiums.com/europe/countries/germany.shtml
Stadium Database.com
Contains listings for all current Bundesliga 1 & 2 clubs and a large number of other
stadia
Source: http://stadiumdb.com/stadiums/ger
Google Maps.ie
Large online mapping tool with all current stadiums listed.
Source: https://www.google.ie/maps
There is one station list for each weather variable as not every station records
every weather variable. Each list is a comma delimited text file which can be
downloaded from the ECA&D (2014) website directly using R Studio. Lists contain
all European stations, which are identified through two-letter ISO 3166 country
codes. The Station ID (STAID) is the unique identifier that allows the observation
data to be matched to each weather station. Latitude, longitude and
altitude are also provided. Altitude is critical to ensure each weather station and
stadium are at comparable heights.
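One way to match each stadium to its nearest weather station from the latitude and longitude in these lists is sketched below in Python. The study does not state which distance formula was used, so the haversine great-circle distance, the optional altitude filter, the function names and the coordinates are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(stadium, stations, max_alt_diff=None):
    """Return (station, distance) for the closest station, optionally
    ignoring stations whose altitude differs by more than max_alt_diff metres."""
    candidates = [
        s for s in stations
        if max_alt_diff is None or abs(s["alt"] - stadium["alt"]) <= max_alt_diff
    ]
    best = min(
        candidates,
        key=lambda s: haversine_km(stadium["lat"], stadium["lon"], s["lat"], s["lon"]),
    )
    return best, haversine_km(stadium["lat"], stadium["lon"], best["lat"], best["lon"])


# Hypothetical coordinates, for illustration only
stadium = {"name": "Example Stadium", "lat": 49.5, "lon": 8.5, "alt": 96}
stations = [
    {"STAID": 1, "lat": 49.45, "lon": 8.55, "alt": 100},
    {"STAID": 2, "lat": 50.90, "lon": 11.58, "alt": 149},
    {"STAID": 3, "lat": 49.52, "lon": 8.49, "alt": 600},
]

station, dist = nearest_station(stadium, stations, max_alt_diff=200)
print(station["STAID"], round(dist, 2))
```

Applying an altitude limit before choosing the closest station is one way to honour the requirement that station and stadium sit at comparable heights.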
8.4
The finished data set used for analysis has 12,926 rows each representing a single
match and 31 columns or variables. 14 columns were feature engineered.
> str(master)
'data.frame': 12926 obs. of 31 variables:
$ STAID: int 4012 472 3990 2763 2758 4005 3990 479 477 491 ...
$ HomeTeam: Factor w/ 73 levels "Aachen","Aalen".
$ Div: Factor w/ 2 levels "D1","D2": 2 2 2 2 2 2 2 2 2 2 ...
$ AwayTeam: Factor w/ 74 levels "Aachen","Aalen",..
$ FTHG: int 1 4 1 0 0 0 0 3 3 2 ...
$ FTAG: int 1 0 1 0 1 1 0 1 0 1 ...
$ Stadium_Name: Factor w/ 72 levels "Allianz Arena",.
$ lat: num 49.5 54.1 50.9 48.8 52.7 ...
$ lon: num 8.5 12.1 11.58 9.19 7.3 ...
$ alt: int 96 22 149 481 21 55 318 38 57 276 ...
$ dist: num 5.22 10.71 38.59 7.7 69.38 ...
$ RR: num 24 7.9 2.5 8.9 0.3 2.2 2.5 0.3 3.3 4.9 ...
$ HU: int 94 95 75 87 82 84 75 83 81 94 ...
$ TG: num 18 16.1 17.7 17.8 16.4 16.4 17.7 17.4 16.4 16.6 ...
$ CC: int 8 6 6 8 7 5 6 5 5 8 ...
$ FG: num 2.3 5.6 6 2.9 4.9 4.5 6 4.9 4.1 3.9 ...
The following data variables were feature engineered from the raw data set above:
$ month: Factor w/ 12 levels "April","August",..
$ year: int 1993 1993 1993 1993 1993 1993 1993 1993 1993 1993 ...
$ TotalGoals: int 2 4 2 0 1 1 0 4 3 3 ...
$ OverUnder2.5: Factor w/ 2 levels "Over","Under": 2 1 2 2 2 2 2 1 1 1 ...
$ H_A_Win: Factor w/ 3 levels "AwayWin","Draw",..: 2 3 2 2 1 1 2 3 3 3 ...
$ GDiff: int 0 4 0 0 -1 -1 0 2 3 1 ...
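The derivation of the goal-based variables from the full time home (FTHG) and away (FTAG) goals can be sketched as below. This is an illustrative Python version, not the R code used to build the data set; the function name is hypothetical and the level name "HomeWin" is assumed, as the third factor level is truncated in the listing above. The input values are the first ten rows shown in the listing.

```python
def derive_features(fthg, ftag):
    """Derive the goal-based variables for one match from home and away goals."""
    total = fthg + ftag
    diff = fthg - ftag
    if diff > 0:
        result = "HomeWin"  # assumed level name (truncated in the str() output)
    elif diff < 0:
        result = "AwayWin"
    else:
        result = "Draw"
    return {
        "TotalGoals": total,
        "OverUnder2.5": "Over" if total > 2.5 else "Under",
        "H_A_Win": result,
        "GDiff": diff,
    }


# First ten FTHG and FTAG values from the listing above.
fthg = [1, 4, 1, 0, 0, 0, 0, 3, 3, 2]
ftag = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]
rows = [derive_features(h, a) for h, a in zip(fthg, ftag)]
```

The derived TotalGoals and GDiff values can be compared directly against the corresponding columns in the listing as a simple check.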
8.5
The majority of all the probabilities are essentially even except for those specific
cases at extreme ends and for altitude >500m.
8.6
8.7
List of Tables
8.8
Project Proposal
Table of Contents
1. Objectives
2. Background
3. Literature Review
4. Research Question
5.
6. Special Resources
7. Project Plan
8. Consultation
9. Declaration
1. Objectives
The study will seek to assess whether certain weather factors such as temperature, cloud cover,
precipitation, wind and humidity have any determinable effect on the goal outcome of football
matches within the Bundesliga 1 & 2 football leagues held within Germany when considered
across twenty seasons of historic play. The theory is that weather conditions, in particular lower
temperatures, may have a detrimental impact on goals scored, although warmer temperatures
will also be considered. By linking daily historic weather data for specific weather stations with
stadiums and the dates and results of matches played, it will be determined whether
weather plays any role in goal outcome when considered over a significant time period.
Secondary objectives will consider whether any difference exists between the first and second league
in relation to weather effects on goal outcome, and also whether any particular stadium affects
goal outcome due to its geographical location or size for the teams that play there. The results
will also be compared against a particular betting instrument, the over/under
goals scored bet (PKR, 2014), to see if any meaningful predictions can be made regarding total
match goals scored. Any possible lean towards an uneven spread (more goals under than over
due to weather factors) would be of particular interest to football teams, coaches, trainers and
in particular those companies that provide such betting instrument products.
2. Background
According to a study by the European Commission (2012) on the contribution of sport
to the economy, its value was placed at 294 billion euros. Additionally, the betting market for
sports is estimated to be around 733 billion globally (BBC, 2014), with 70% of that income
coming from football matches. Betting on football matches became popularised in the early
1920s with the creation of the Football Pools (Football Pools, 2014) in the UK, the oldest gaming company in
the world, which allowed fans to predict matches and win money if those predictions proved
to be correct. With some individual bets now reaching figures of over 200,000 (BBC, 2014),
it is important for gaming companies to understand the level of risk they are being
exposed to, as mistakes could be costly.
Additionally, trainers and teams are always looking to gain competitive advantage to ensure
success, and the use of statistical information and data analytics is becoming increasingly
important within football as more and more managers and teams use data analysis to become
smarter and more efficient (Lewis, 2014). While nearly all analysis focusses on the players,
there has been much less analysis of external factors. There is some evidence and a number
of studies to indicate that weather factors, predominantly temperature, may be a factor in the
outcome of European football matches (Hamilton, 2014). While extreme hot or
cold temperatures are known to directly affect both performance and
health (Hong, 2014), the overall contribution that weather makes where extreme temperature
is not a factor, and in particular to the goal outcome of football matches, is still not clearly
understood or established.
Germany has a moderate and temperate climate (see Appendix 2(e) for a typical weather year)
with temperatures ranging on average from just below zero degrees Celsius in winter to around
the mid-twenties during the summer (ECA, 2014). The use of a moderate temperate climate
seeks to reduce as much as possible any effects of very extreme temperatures. However, there
are colder areas such as Munich which can see temperatures drop to around -10°C, which can
affect performance (Hong, 2014), although at around 0°C there should be very little drop in
performance for persons engaged in moderate exercise even if wearing t-shirts. Germany is
large enough to have distinct regions of specific weather patterns (Encyclopaedia Britannica,
2014) with variable frequency of temperature, humidity and precipitation experienced in
different regions and throughout the year. The Bundesliga stadiums are distributed around
Germany widely enough to see if regional weather plays a role in match outcome (Appendix
C & E).
The study considers two primary data sets which were identified as being
suitable for analysis and for meeting the study's objectives. The first is the Bundesliga football league
results, for which reliable historic data exists for its entire history since 1963. Within this, a
selection of data will be considered for the period 1993 until 2013, which represents twenty-one
seasons (years) of play. Usable football score data has been identified from an online provider
(Football-Data, 2014) which provides one csv file (refer to Appendix A) for each season played,
detailing every game played within the season, the date played, half time and full time scores
and where it was played, as well as a range of other match information. There are around 306
games played per season per league, so the football data set will comprise around 306 (games)
x 21 (seasons) x 2 (leagues) = 12,852 football matches being analysed in total. Each of the csv
files is relatively small at around 100 KB in size.
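The estimate above can be reproduced with a few lines of Python. The sketch assumes each league runs as a double round robin of 18 teams, in which every team hosts every other team once; the constant and function names are illustrative only.

```python
def matches_per_season(teams):
    """Double round robin: every team hosts every other team once."""
    return teams * (teams - 1)


TEAMS_PER_LEAGUE = 18
SEASONS = 21  # 1993 to 2013 inclusive
LEAGUES = 2   # Bundesliga 1 and Bundesliga 2

per_season = matches_per_season(TEAMS_PER_LEAGUE)  # 18 x 17 = 306
total_matches = per_season * SEASONS * LEAGUES     # 306 x 21 x 2 = 12,852
```

This remains an estimate, as it assumes 18 teams in both leagues in every season.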
Compared to this will be daily weather data for Germany obtained from the European Climate
Assessment & Database (ECA, 2014). This site provides data on numerous weather stations
positioned around Europe which can be matched geographically, using name, latitude and
longitude co-ordinates, to each stadium being considered to within a few miles. The data is
available as a number of individual text files, one for each unique weather station and weather
variable (refer to Appendix B). The blended data was selected for use, which combines weather
data from different sources, although checking shows no difference for the weather stations
being used. The files contain comma delimited text in their raw format and the uncompressed
size of the files for each weather variable ranges from 200MB to 4GB, containing
approximately 400 to 5,000 individual weather stations in each zipped folder for that particular
weather variable. Each file provides data for approximately 67 years, equating to 25,591 lines
of raw data for each weather station and single weather variable. There are 18 stadiums in the
Bundesliga 1 and the same in the Bundesliga 2. However, over the 21 year period 38
teams have played in the top tier, with a similar number expected in the second, but there
will be instances of cross over where teams share stadiums, and the closeness of stadia may
allow for a common weather station to be used. This will be subject to a more detailed
assessment during the initial stages of the project. This means that there could be approximately
30 distinct weather files for each weather parameter. The total approximate size of the raw data
will therefore be 25,591 x 30 = 767,730 lines of data for each weather variable. It is intended
to consider five variables: temperature, precipitation, cloud cover, wind and humidity, which
will equate to approximately 3.8 million lines of raw data prior to selecting the relevant lines that equate
to the 12,852 actual football matches that were played.
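The raw data volume quoted above can be checked with the following Python sketch. All figures are taken from the text; the variable names are illustrative, and the final ratio assumes one weather line per variable per match.

```python
LINES_PER_STATION_FILE = 25_591  # lines per station and weather variable
DISTINCT_STATIONS = 30           # approximate number of stations needed
WEATHER_VARIABLES = ["temperature", "precipitation", "cloud cover", "wind", "humidity"]
MATCHES = 12_852

lines_per_variable = LINES_PER_STATION_FILE * DISTINCT_STATIONS  # 767,730
total_raw_lines = lines_per_variable * len(WEATHER_VARIABLES)    # 3,838,650

# Lines actually needed: one per match for each weather variable.
lines_needed = MATCHES * len(WEATHER_VARIABLES)                  # 64,260
share_needed = lines_needed / total_raw_lines
```

On these assumptions fewer than 2% of the raw weather lines correspond to match days, which illustrates how much of the raw data is discarded once the match dates are selected.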
There are some important limitations to the data being considered, in particular the weather
data. The data being used provides daily averages which may not equate to the conditions
experienced during the time the match was played. For example rain may have fallen before,
after or during the match. The study is seeking to determine if any relationship exists between
the historic weather data and goal outcome and so some caution is advised as links could be
established where none really exist. However, the primary objective is to consider any overall
trend over the course of a playing season i.e. changes in seasons and over months rather than
specific matches.
3. Literature Review
The literature review will at this stage examine primarily factors relating to the statistical
analysis of sports and the effects of weather on sports, but should also extend to consider
statistical analysis in general, prediction analysis, climate and weather, sports performance and
stadium design. Literature has been sourced from Google Scholar, CiteSeerX,
Google Books and the Directory of Open Access Journals along with articles, websites and
other sources.
The effect of temperature on ball properties is also a possible environmental factor, as with
temperatures approaching zero degrees Celsius a goalkeeper has 7% more time to react to a
penalty than at higher temperatures when the ball moves quicker (Wiart, Kelley, James and
Allen, 2011). The flight of the ball is also affected, with colder conditions causing the ball to
drop and move slower overall with less power than at warmer temperatures. However, as Riley
and Williams (2003) point out, in colder weather the goalkeeper is most susceptible to reduced
limb temperature and dexterity unless they keep highly active.
4. Research Question
The problem being considered is that there is a lack of information regarding the effects that
weather factors like temperature, precipitation, humidity and wind may have on goals scored
in football matches. The primary research question being considered is: -
Does the weather affect the goal outcome in football matches within the Bundesliga 1 & 2?
From this the null hypothesis H0 and the hypothesis that will be tested, H1, are established: -
H0: There is no relationship between goal outcome in football matches and daily corresponding
average values of temperature, wind, precipitation and humidity.
H1: There is a relationship between goal outcome in football matches and daily corresponding
average values of temperature, wind, precipitation and humidity.
The alternative hypothesis H1 is non-directional and therefore a two-tailed test will be applied where
appropriate with a significance level of 5%.
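One way to apply a two-tailed test of H0 against H1 is sketched below in Python. It uses a permutation test on the Pearson correlation, chosen here only because it needs no statistical library; it is not necessarily the test used in the study. The function names, the number of permutations and the seed are assumptions, and the ten sample values (TG and TotalGoals from the finished data set listing) are far too few to draw any conclusion.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def two_tailed_permutation_p(x, y, n_perm=2000, seed=0):
    """Two-tailed p-value: share of shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


# First ten rows of the finished data set, for illustration only.
mean_temp = [18, 16.1, 17.7, 17.8, 16.4, 16.4, 17.7, 17.4, 16.4, 16.6]
total_goals = [2, 4, 2, 0, 1, 1, 0, 4, 3, 3]

ALPHA = 0.05  # 5% significance level
p_value = two_tailed_permutation_p(mean_temp, total_goals)
reject_h0 = p_value < ALPHA
```

Taking the absolute value of the correlation is what makes the test two-tailed: a strong negative and a strong positive relationship both count as evidence against H0.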
Within the context of the broader research question there are further questions that will be
considered:
(i) Is the Bundesliga 2 League's goal outcome affected more by weather factors than
the Bundesliga 1 League's?
(ii) Can goal outcome at any particular stadium be attributed to any possible regional
weather effects?
(iii) Does a single weather variable affect match outcome or are multiple factors
required?
(iv) Do smaller stadiums have a greater effect on goal outcome due to greater exposure
to the weather?
These are components of the primary research question and will be investigated. Appropriate
hypothesis testing will need to be established for these questions. Further questions will be
developed for the project.
Predictions
Additionally there is the possibility of the results analysis being used to undertake match
outcome prediction for goals scored using next day weather forecasting. It is expected that
rather than being able to predict actual total goals for a match with any accuracy it is more
likely that prediction of average goals scored due to general weather conditions experienced
over a time period would be possible. The use of a betting tool such as the Under/Over (x)
goals instrument will be used based on the average number of goals per game and league across
the period being considered. For example if the average goals scored was 2.7 then Under/Over
2.5 goals would be used as the instrument to see if the results can be used to reliably determine
significant push or pull above or below this level, which could potentially indicate that
predictions can be made. As the predictions are dependent on weather, the time period will
typically be in the 1 to 3 day period in line with weather forecasting but could increase to 10
days.
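The choice of instrument described above can be sketched in Python as follows. The rule of taking the nearest half-goal line to the average is an assumption inferred from the 2.7 to 2.5 example in the text, and the function names are hypothetical. The sample totals are the first ten TotalGoals values from the finished data set and are illustrative only.

```python
import math


def over_under_line(average_goals):
    """Nearest half-goal line to the average, e.g. 2.7 -> 2.5."""
    return math.floor(average_goals) + 0.5


def over_share(total_goals, line):
    """Proportion of matches finishing with more goals than the line."""
    return sum(1 for goals in total_goals if goals > line) / len(total_goals)


line = over_under_line(2.7)

sample_totals = [2, 4, 2, 0, 1, 1, 0, 4, 3, 3]
share_over = over_share(sample_totals, line)
share_under = 1 - share_over
```

A share that stays close to one half under all weather conditions would suggest no usable push or pull, whereas a consistent shift under particular conditions would be the effect of interest.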
The research will be limited to only stadium locations within Germany, the weather data
identified and goals scored for a match. No other in match data or statistics will be used such
as corners or passes. Individual players will not be considered nor will any other variables other
than those indicated and referenced.
Elicitation techniques are the systems and tools used to bring forth the requirements and help
develop and find understanding. For this part of the process the tools used are Brainstorming
and Document Analysis as outlined in the BABOK Guide (IIBA, 2009).
The brainstorming process was utilised primarily at this stage to help stimulate ideas on the
project. This did not take the format of a scheduled session but instead was an ongoing process
where ideas were jotted down in a note book as and when they came to mind. No critiquing or
analysis of the ideas was undertaken deliberately as this is contrary to the brainstorming process
which is to develop new ideas.
Before determining the functional and non-functional project requirements it is useful to first
restate the problem being considered, which was explored in the previous section: there is
a lack of information regarding the effects that
weather factors like temperature, precipitation, humidity, cloud cover and wind may have on
goals scored in football matches. From this we can then look to determine the project
requirements.
Project Scope
The project is a Big Data Analysis study which will use a relational database most likely SQL
in conjunction with R Studio to undertake analysis of a large data set to find trends, patterns,
links and predictions supported by graphing and tables to present results.
General Description
The database will be created and designed to facilitate the querying and manipulation of a large
amount of data to allow for the effects of weather such as temperature, humidity, precipitation
and wind on total goals scored in football matches to be analysed to determine if a relationship
exists. The aim is that the analysis will provide insight into the possible effects of weather on
sports like football.
The database must be designed in such a way that all the entities and their relationships are
robust and well understood and that the data has been normalised prior to database creation.
The ability to handle very large queries and joins will be required, as joining tables with thousands of
rows has a multiplying effect within SQL databases which can place significant demands on the
processing ability of computers. If the database cannot function properly then either the number
of data points will have to be restricted or the amount of analysis limited which will not provide
a sufficient amount of information for a robust analysis which could damage the study as a
whole. The core function of the project is to compare the two primary datasets which must be
central to any design approach implemented.
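The comparison of the two primary datasets can be illustrated with a small join. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for the project database; the table and column names are hypothetical, and the match dates are invented for the example. The station IDs, goals and temperatures are the first two rows of the finished data set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE matches (
    match_id   INTEGER PRIMARY KEY,
    match_date TEXT    NOT NULL,
    staid      INTEGER NOT NULL,  -- weather station assigned to the stadium
    fthg       INTEGER NOT NULL,
    ftag       INTEGER NOT NULL
);
CREATE TABLE temperature (
    staid    INTEGER NOT NULL,
    obs_date TEXT    NOT NULL,
    tg       REAL    NOT NULL,    -- daily mean temperature in degrees Celsius
    PRIMARY KEY (staid, obs_date)
);
""")

conn.executemany(
    "INSERT INTO matches (match_date, staid, fthg, ftag) VALUES (?, ?, ?, ?)",
    [("1993-08-07", 4012, 1, 1), ("1993-08-07", 472, 4, 0)],  # dates are hypothetical
)
conn.executemany(
    "INSERT INTO temperature VALUES (?, ?, ?)",
    [(4012, "1993-08-07", 18.0), (472, "1993-08-07", 16.1), (472, "1993-08-08", 15.0)],
)

# Join each match to the observation for the same station and date.
rows = conn.execute("""
    SELECT m.staid, m.match_date, m.fthg + m.ftag AS total_goals, t.tg
    FROM matches AS m
    JOIN temperature AS t
      ON t.staid = m.staid AND t.obs_date = m.match_date
    ORDER BY m.match_id
""").fetchall()
conn.close()
```

Joining on both station and date, backed by the composite primary key, keeps the result at one row per match and avoids the multiplying effect that an unconstrained join between large tables would produce.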
System Interfaces
The database will be a self-contained system; however, it may interface with a PC or a server
that will be located on Amazon Web Services or Windows Azure (to be decided subject to
further research). It will also potentially need to receive input data from other programs such
as Microsoft Excel, R or Python and be required to export back to Excel and R Studio for
ongoing graphing and analysis.
7. The project must be stored electronically on three different media sources at all times and
at least be updated once a week.
8. The project must be completed by the specified date.
At this stage there may be additional programs that may be useful but have not yet been
identified as being a requirement. This will be a part of the project plan to determine what
technologies should be used.
7. Project Plan
The project plan is provided in Appendix D and shows the general expected timeline for project
delivery in the second semester. The first half of the project is planned for research, preparing
all the data, building databases and becoming familiar with them as well as the initial parts of
the thesis. The second part focuses on the analysis, findings and writing the analysis which are
key parts of the project process. The plan has been updated based on confirmation of the
submission date in early January and additional deadlines for management reports and the
presentation.
8. Consultation
The project proposal was discussed with NCI Lecturer Padraig De Burca. The discussion took
place 26th May 2014 and took the form of an informal discussion after scheduled classes.
Padraig provided valuable feedback relating to the potential for use of SQL to build a database
of all the normalised match and weather variables which can then be queried in multiple ways
with the results being output to other programs like Excel to generate graphs. The significant
benefit of using SQL would be firstly the speed with which stadiums, teams, results and even
certain weather conditions can be isolated for comparative analysis, but it would also limit the
amount of preparation the weather files need, as there would be no need to eliminate all the
dates where games were not played. The data file would simply be clipped at the start date to eliminate the
largest unneeded section prior to 1993. This would create potentially redundant data within
the database and may affect the time taken to undertake joins, but could be quicker than trying to
eliminate certain dates in the raw weather files as there are potentially 70-100 individual
weather files.
As a result of the consultation several possible new ways to view the data were considered.
Firstly, it opens the possibility of considering the past few days of weather prior to any match,
which had not been thought of, and secondly it allows the comparison of
sequential matches played by the same team in different locations to see if any
general ongoing weather, such as sustained cold, has a compounding effect. Padraig also noted
that SQL has some graphing capabilities, whose potential use will be investigated.
9. Declaration
By submitting this proposal through the NCI Moodle system, I declare that unless otherwise specified,
all content in this proposal is my own work and has not been copied from other sources.
Football-Data (2014) provides a full season of play for either Bundesliga 1 or 2 as a csv file
available for download. Each csv file contains the results for one entire season of play. There
are 306 matches in total for each season, which equates to 18 teams. There are 52 columns of
data per file for most files, containing the date, full time results, half time results, where the
game was played and a variety of betting information. For earlier years not all this information
was recorded. Twenty-one seasons of historic football data for both leagues equates to 306 (games per
season) x 21 (seasons) x 2 (leagues) = 12,852 lines of data for the football matches, which in
its raw form exists in 42 corresponding csv files. Total goals is not a parameter, but any program
or database such as SQL could calculate this from the home and away goals scored columns.
The European Climate Assessment & Database Project (2014) provides data for weather
stations across Europe. The above data sample is taken from station number 494 (Augsburg)
for mean daily temperature. The text files are comma delimited and provide (from left to right)
station number, source identifier, date (yyyy/mm/dd), temperature and quality code. This file
contains 67 years, 3 months and 29 days of data, which equates to around 25,591 lines of data
for each of the locations. The year the station began monitoring varies but typically covers a
significant time period in all cases. The temperature is provided in units of 0.1 degrees Celsius in its
current html format and requires a decimal point to read correctly. For example, the first line of
data above for the 28th March records a daily mean temperature of 6.5 degrees Celsius with no
known errors or missing data. Below freezing levels are identified with a minus symbol (none
shown in the example above).
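Reading one line of such a file and restoring the decimal point can be sketched in Python as below. The field order follows the description above; the function name, the dictionary keys and the sample line (including its year and source identifier) are hypothetical. The same scaling applies to the precipitation (0.1mm) and wind speed (0.1m/s) files.

```python
from datetime import datetime


def parse_observation(line, scale=0.1):
    """Split one comma delimited observation line and rescale the value."""
    station, source, date_text, value, quality = [part.strip() for part in line.split(",")]
    return {
        "STAID": int(station),
        "source": int(source),
        # Accept the date with or without "/" separators.
        "date": datetime.strptime(date_text.replace("/", ""), "%Y%m%d").date(),
        "value": round(int(value) * scale, 1),  # e.g. 65 -> 6.5 degrees Celsius
        "quality": int(quality),
    }


# Hypothetical line: only the station number, 28th March and the value 65 follow the text.
sample_line = "494, 100001, 1947/03/28, 65, 0"
obs = parse_observation(sample_line)

# A below-freezing reading carries a minus symbol in the raw value.
cold = parse_observation("494, 100001, 1947/01/15, -35, 0")
```

Rounding to one decimal place after scaling avoids floating point artefacts when the converted values are later compared or grouped.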
Data Set 2(b) Historic weather data example for Germany for Daily Humidity levels at
station 494 (Augsburg, Germany)
The above ECA (2014) data sample is taken from station number 494 (Augsburg) for daily
humidity. The text files are comma delimited and provide (from left to right) station number,
source identifier, date (yyyy/mm/dd), humidity in percent and quality code. This file also
contains 67 years, 3 months and 29 days of data which equates to around 25,591 lines of data
for each of the locations. The year the station began monitoring varies but typically covers a
significant time period in all cases.
Data Set 2(c) Historic weather data example for Germany for Daily precipitation levels at
station 494 (Augsburg, Germany)
The above ECA (2014) data sample is taken from station number 494 (Augsburg) for daily
precipitation. The text files are comma delimited and provide (from left to right) station
number, source identifier, date (yyyy/mm/dd), precipitation in 0.1mm and quality code. This
file also contains 67 years, 3 months and 29 days of data which equates to around 25,591 lines
of data for each of the locations. The year each station began monitoring varies but typically
covers a significant time period in all cases.
Data Set 2(d) Historic weather data example for Germany for Daily mean wind speed at
station 494 (Augsburg, Germany)
The above ECA (2014) data sample is taken from station number 494 (Augsburg) for daily
average wind speed. The text files are comma delimited and provide (from left to right) station
number, source identifier, date (yyyy/mm/dd), average wind speed in 0.1m/s and quality code.
This file also contains 67 years, 3 months and 29 days of data which equates to around 25,591
lines of data for each of the locations. In this data set all records prior to 1960 are Null. The
actual wind speed is the above figure divided by 10. For example the first value for April shown
above would be 1.5m/s.
Data Set 2(e) Historic weather data for Germany for Cloud Cover
The cloud cover data files (not shown) are based on the oktas scale which provides a measure
of cloud cover from 0 to 8 subject to the overall portion of sky covered. Zero represents a
totally clear sky while 8 would be totally overcast.
Example Weather Year 2(e) -Typical Weather Year for Mean Daily Temperature for weather station 494 Augsburg
Thesis writing
Supporting Processes
Key Landmarks
Notes
1/ Dates shown are week commencing for the Monday of each week.
Note: The regions are a base point for further study as it is accepted that these region
locations do not necessarily equate to accepted regional weather.
Image Source: 24point0. http://www.24point0.com/ppt-shop/media/catalog/product/r/e/regionsmap-of-germany-ppt-slides.jpg
4/ Research Question
A few extra sub research questions added and in the predictions section the limitations of
predictions are based on forecasting which is realistically limited to a few days.
6/ Special Resources
This area has been updated to better reflect the actual technology being used and for which
specific purpose based on time spent investigating each technology and undertaking small
scale tests.
7/ Project Plan
Updated to reflect known dates and revised to better break down sub components.
Appendix B
Cloud cover information added (without example picture) to note inclusion of this weather
data set in the project.
Appendix D
Project plan updated to reflect additional information such as key dates as outlined in section
seven.
Overall changes are considered minor with changes not exceeding 2-3% of the originally
submitted proposal.
References
Anderson, C., and Sally, D. (2013) The Numbers Game: Why everything you know about
football is wrong. Penguin Books.
BBC (2014) Football Betting The global industry worth Billion. [Online]. BBC. Available
at: http://www.bbc.com/sport/0/football/24354124 [Accessed 29th May 2014]
Encyclopaedia Britannica (2014) Germany - Climate [Online]. Encyclopaedia Britannica.
Available from:
http://www.britannica.com/EBchecked/topic/231186/Germany/57996/Climate [Accessed
28th May 2014].
European Commission (2012) Study on the Contribution of Sport to Economic Growth and
Employment in the EU. [Online]. European Commission. Available from:
http://ec.europa.eu/sport/library/studies/study-contribution-spors-economic-growth-finalrpt.pdf [Accessed 1st June 2014].
Football-Data (2014) Data-Files: Germany [Online]. Football-Data. Available from:
http://football-data.co.uk/germanym.php [Accessed 21st May 2014]
Football Pools (2014) The Pioneers of Football Pools [Online]. Football Pools. Available
from: http://www.footballpools.com/cust?action=GoHelp&help_page=about_us [Accessed
1st June 2014]
Hamilton, H. (2014) Does the cold really kill Goals? Howard Hamilton Blog, 1st May.
Available from: http://www.soccermetrics.net/league-competitions/temperature-vs-goalsstudy-premier-league [Accessed 24th May 2014]
Hong, Y (eds.) (2014) Routledge Handbook of ergonomics in sport and exercise. New York:
Routledge.
IIBA (2009) A Guide to the business analysis body of knowledge (BABOK Guide.)
International Institute of Business Analysis: Toronto, Canada.
Kasirun, Z.M. (2005) A survey on the requirements elicitation practices among courseware
developers, Malaysian Journal of Computer Science, Vol. 18 No. 1, June 2005, pp. 70-77.
Lewis, T. (2014) How computer analysts took over at Britain's top football clubs [Online].
The Guardian, 9th March, Available from:
http://www.theguardian.com/football/2014/mar/09/premier-league-football-clubs-computeranalysts-managers-data-winning [Accessed 28th May 2014].
McGarry, T., O'Donoghue, P., and Sampaio, J. (eds) (2013) Routledge Handbook of Sports
Performance Analysis. New York: Routledge
Pezzoli, A., Cristoforu, E., Moncalero, M., Giacometto, F., and Boscolo, A. (2013)
Climatological Analysis, Weather Forecast and Sport Performance: Which are the
Connections? Journal Climatol Weather Forecasting 1: e105
PKR. (2014) Under / Over Betting [Online]. PKR. Available from:
http://bet.pkr.com/en/get-started/bet-types/under-over/ [Accessed 28th May 2014].
Riley, T., Williams, A.M. (eds.) (2003) Science and Soccer. 2nd Edition. London: Routledge.
Wiart, N., Kelley, J., James, D., and Allen, T. (2011) Proceedings of the Institution of
Mechanical Engineers, Part P: Journal of Sports Engineering and Technology 2011 225: 189
8.9
HDSDAJAN 2014
Requirements
Specification (RS)
The effects of weather on goal outcome
for football matches played within the
German Bundesliga
Alastair Macnair
10/12/2014
Requirements Specification
Document Control
Revision History
Date: 12/10/2005; Version: 1; Scope of Activity: Create; Prepared: AM; Reviewed: X; Approved: X
Distribution List
Name: Ioana Ghergulescu; Title: Lecturer; Version: 1
Related Documents
Title: Title of Use Case Model; Title of Use Case Description
Comments:
Table of Contents
Requirements Specification (RS)
Document Control
Revision History
Distribution List
Related Documents
1 Introduction
1.1 Purpose
1.2 Project Scope
1.3
2 Requirements Specification
2.1 Functional requirements
2.1.1
2.1.2
2.1.3
2.1.4
2.1.5
2.1.6
2.1.7
2.1.8
2.2 Non-Functional Requirements
2.2.1 Recover requirement
2.2.2 Reliability requirement
2.2.3 Extendibility requirement
2.2.4
3 Interface requirements
3.1
4 System Architecture
5 System Evolution
6
1 Introduction
1.1 Purpose
The purpose of this document is to set out the requirements for the development
of an analytical study between weather data variables recorded across Germany
and the goal outcome for matches played within the Bundesliga 1 & 2 Leagues.
The analysis of weather and match outcome built on 21 years of historical data will
enable all relevant users to gain a better insight into understanding the effects of
weather through the varied weather variables being considered and future match
outcome through predictive analysis.
The intended primary customers will include predominantly football teams,
coaches, and trainers and in particular those companies that provide betting
instrument products for football matches. Secondary customers however could
easily include those involved in any competitive or non-competitive sport where
weather is a factor.
strategy and winning performance levels through the individual objectives outlined
below.
The data to be used as the basis of the analysis is comprised of ECA&D weather
data files and Football data UK files which are both freely available for non-commercial use.
The primary project objectives are listed below:
Objective #1 Determine if there is any link between goals scored and weather effects
within the Bundesliga 1&2 football Leagues.
Objective #2 Determine if there is any difference between the Bundesliga 1 &
Bundesliga 2 due to the effects of weather, location or smaller stadiums.
Objective #3 Determine if just single or multiple weather parameters predominantly
affect goal outcome.
Objective #4 Investigate if stadium location and regional local weather affects games
played there and match outcome.
Objective #5 Compare the outcomes of matches to under/over goal difference betting
instruments to determine if the spread of match results could have been
better predicted using the results of the analysis.
Objective #6 Attempt to use the data to predict goal outcome for a number of future
matches using weather predictions and selected betting instruments.
Objective #7 Determine if goal difference between teams is greater in periods of colder
or warmer weather and, if sustained, whether this affects a team's
performance over time.
Objective #8 Use analysis software including but not limited to Excel, Python, R and
SQL to gain knowledge in their use for analysing large data sets.
Successful Outcome Criteria: The project will be considered successful if the
primary research question(s) can be answered. In the context of this study,
determining that weather has no determinable effect, or answering only one or two of the
project objectives, will be considered equally successful.
Data use: Both data providers offer the data for free use. However the ECA&D in
their data policy document note that the data cannot be used for commercial
purposes. While the study is not commercial in nature any future usage of the study
must take this into account where and if applicable.
Time: The project must be completed within the specified time frame from the 13th
October 2014 until the deadline of 5th January 2015.
Budget: There is no required budget as all the required software and tools are
available without cost penalty within the College.
ECA&D
Firefox
MySQL
PeaZip
Python
R Studio
Txt
2 Requirements Specification
This section provides an overall description of the project and detailed descriptions
of the functional requirements that represent the key steps and processes that are
essential to ensure successful project completion.
2.1.2.1.2 Inputs
2.1.2.1.3 Processing
1/ Check data usage policy and obtain any permissions to use data sets.
2/ Access the weather website and download the five zip files plus station lists.
3/ Access the football website and use Firefox to download all 42 files from their
individual links simultaneously.
4/ Check files (sample) are viable and contain data as expected.
2.1.2.1.4 Outputs
Weather files: There will be around 5000 individual weather files in comma
delimited txt format. There will be five station lists, one for each weather variable,
also in comma delimited txt format. Each station list provides a full description of
every station for that weather variable, with a country code and location, allowing
files to be identified.
Football files: There will be 42 individual csv files, 21 for Bundesliga 1 and 21
for Bundesliga 2, one for each season of play.
7. Two Random CSV files are opened to visually confirm that the data
is not compromised and is all present.
Termination
The project folder shows all the required files as being present.
Post condition
The data sets are ready to be used.
2.1.3.1.2 Inputs
The five zip files from use case 001.
The station list comma delimited txt file for each of the ZIP files.
Football data files.
2.1.3.1.3 Processing
The football data files are filtered to provide a list of the teams that played in each
season. From this a master list is created that shows every team that played
across all seasons, and each stadium is matched to it with its location in decimal
degrees.
The station lists are copied into Microsoft Excel. Each parameter list is filtered using
the two-letter country code to leave just Germany's entries. The smallest list is used,
as all five weather variables must be available from the same station for each
location. The latitude and longitude data are split into their three constituent parts
(degrees, minutes and seconds) and the decimal equivalent is calculated. A
geolocation program using R and the list of weather stations in Excel is used to
determine the closest weather station for each of the stadiums. The weather station
code is matched to each stadium. This eventually provides a list of the actual
weather stations that will be used. These files are extracted and the rest discarded.
2.1.3.1.4 Outputs
An Excel file that contains a list of every team that has played across all 21 seasons
of play for both leagues and their respective stadiums. A series of comma delimited
txt files that are the actual weather files needed for analysis with the football
data.
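The coordinate conversion described in the processing step can be sketched as follows. This is a minimal Python illustration rather than the Excel calculation used in the project; the function name dms_to_decimal and the signed 'DD:MM:SS' input format are assumptions made for the example.

```python
def dms_to_decimal(dms: str) -> float:
    """Convert a signed 'DD:MM:SS' string (assumed format) to decimal degrees."""
    text = dms.strip()
    # Take the sign from the text itself so that values such as '-00:30:00'
    # keep their sign even though the degrees part is zero.
    sign = -1.0 if text.startswith("-") else 1.0
    degrees, minutes, seconds = (float(part) for part in text.lstrip("+-").split(":"))
    return sign * (degrees + minutes / 60.0 + seconds / 3600.0)


if __name__ == "__main__":
    # Illustrative value only, not taken from the station lists.
    print(dms_to_decimal("+52:30:00"))
```

Handling the sign separately from the degrees is the one design point worth noting, since a naive conversion loses the sign for coordinates within one degree of the equator or the prime meridian.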
2.1.3.2 Use Case 002 Filter weather files
Scope
The scope of this use case is to filter the weather files and identify just those
particular weather stations that are relevant to the study and match actual
football stadia as closely as possible.
Description
This use case describes the process of identifying only those weather files
that match the actual stadia used over the 21 years of Bundesliga 1&2 match
history.
Use Case Diagram
Flow Description
Precondition
The data files from use case 001. A PC with Microsoft Excel. R Studio with
Google maps and Map distance packages installed.
Activation
This use case starts when the data analyst completes use case 001.
Main flow
1. The participating teams for each season of play are entered into an
Excel file to provide an overall summary list of every unique team that
has played across all 21 seasons of play for both Bundesliga 1 & 2.
2. The decimal degrees location of each stadium is established and
added to the master list.
3. The weather station files are filtered on DE for Germany for each of
the five weather parameters. The smallest list is used to ensure that all
five variables exist for each unique weather station being used.
4. The stadium location is matched to the nearest weather station using
a distance calculation in R Studio. The results are entered onto the master
list to provide a complete description of each stadium and the weather
station that it relates to.
5. The relevant individual files are extracted from each of the five zip files.
All other files are removed and discarded.
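The matching in step 4 of the main flow could be sketched as below. This is an illustrative Python version that swaps in the haversine great-circle formula; it is not the R Studio and Google Maps packages used in the project. The function names, station identifiers and coordinates are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two decimal-degree points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_station(stadium, stations):
    """Return the id of the station closest to the stadium.

    stadium  -- (latitude, longitude) in decimal degrees
    stations -- dict mapping a station id to (latitude, longitude)
    """
    return min(stations, key=lambda sid: haversine_km(*stadium, *stations[sid]))


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    stations = {"A": (48.16, 11.54), "B": (52.47, 13.40)}
    print(nearest_station((48.22, 11.62), stations))
```

Only stations that appear in the smallest of the five filtered lists should be passed in, so that the chosen station has all five weather variables.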
Termination
All the non-relevant weather files are discarded and the project folder
contains only the files needed.
Post condition
The correct weather files to be used are located in the project directory.
2.1.4.1.2 Inputs
The weather files and football data files from use case 002.
2.1.4.1.3 Processing
Football data: The 21 season files for each league are loaded into R and bound
into a single file per league. All unwanted columns and data are removed. The
date is reformatted to the standard ISO format.
Weather Data: The weather files are checked for null values.
All unwanted columns and headers are removed and the numeric element
corrected by a factor of ten. Each of the five elements is joined on date to provide
a single weather file for each weather station location containing all five
parameters. All weather information prior to 1993 is removed.
2.1.4.1.4 Outputs
Two CSV files, one for each of the Bundesliga 1 & 2 leagues.
A single CSV file for each weather station location containing all five weather
variables.
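The transformation steps above can be sketched in Python as follows. The project itself used R for this work, so this is only an illustration under stated assumptions: the football dates are assumed to arrive as dd/mm/yy text, the weather values are assumed to be stored as integers in tenths of a unit with -9999 marking a missing value, and all function and parameter names are hypothetical.

```python
from datetime import datetime

MISSING = -9999  # assumed missing-value code in the raw weather files


def to_iso(date_text, fmt="%d/%m/%y"):
    """Reformat a date string (assumed dd/mm/yy) to ISO yyyy-mm-dd."""
    return datetime.strptime(date_text, fmt).date().isoformat()


def scale(raw, factor=10.0):
    """Divide a raw value by the correction factor, keeping missing values as None."""
    return None if raw == MISSING else raw / factor


def join_on_date(parameters, start="1993-01-01"):
    """Join several weather series on date.

    parameters -- dict mapping a parameter name to {iso_date: raw_value}
    Only dates present in every series and not earlier than start are kept.
    """
    common = set.intersection(*(set(series) for series in parameters.values()))
    return {
        day: {name: scale(series[day]) for name, series in parameters.items()}
        for day in sorted(common)
        if day >= start  # ISO dates sort correctly as plain strings
    }


if __name__ == "__main__":
    # Illustrative values only.
    raw = {
        "temperature": {"1992-12-31": 15, "1993-08-14": 215},
        "rainfall": {"1993-08-14": 12, "1993-08-15": 0},
    }
    print(to_iso("14/08/93"), join_on_date(raw))
```

Joining on the intersection of dates means a date missing from any one parameter is dropped from the combined file, which is why a missing or truncated input file shows up as missing rows later on.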
Alternate flow
A1 : Detection of NULL values in weather files
1. Where there are isolated null values, the dates are checked against
actual match dates, as matches are typically only played at weekends.
If the dates do not coincide, continue with the main flow. If the dates do
coincide, interpolate a value based on the values either side.
2. If there is a significant number of NULL values, denoting missing
values across a number of weeks or months, then the weather station
is discarded and the next nearest one is selected in its place.
3. The use case continues at position (1) or (4) of the main flow,
depending on whether the new weather file has already been cleaned and
treated.
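The alternate flow above could be sketched as below. This is a minimal Python illustration, assuming missing values have already been converted to None and that "a value based on values either side" means the simple average of the two neighbours; the function names are hypothetical.

```python
def fill_isolated_gaps(values):
    """Fill a single missing value with the mean of its two neighbours.

    Runs of two or more consecutive missing values are left untouched so
    that they can be reviewed and the station replaced if necessary.
    """
    filled = list(values)
    for i in range(1, len(values) - 1):
        before, current, after = values[i - 1], values[i], values[i + 1]
        if current is None and before is not None and after is not None:
            filled[i] = (before + after) / 2
    return filled


def longest_gap(values):
    """Length of the longest run of consecutive missing values."""
    longest = current = 0
    for value in values:
        current = current + 1 if value is None else 0
        longest = max(longest, current)
    return longest


if __name__ == "__main__":
    series = [10.0, None, 14.0, None, None, 9.0]  # illustrative values
    print(fill_isolated_gaps(series), longest_gap(series))
```

A threshold on longest_gap is one possible way to decide when a station should be discarded; the threshold itself is a judgement the source leaves open.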
Termination
Python will have completed all error handling and R will have bound or joined
all files.
Post condition
The project directory will contain the finished files for the weather and football.
2.1.5.1.2 Inputs
The cleaned files from use case 003. A relational database system like MySQL or
sqldf (an SQL add on for R) is used to manage the data. This will allow the data to
be manipulated and visualised during the analysis stage.
2.1.5.1.3 Processing
An entity relationship diagram is created with all key entities, attributes and
relationships. Primary and foreign keys are created or identified. The data is loaded
into the database.
2.1.5.1.4 Outputs
A relational database, fully normalized with clear relationships and no null values
or errors.
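A relational structure of this kind might be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module in place of MySQL, and the table names, column names and three-table layout are hypothetical rather than the project's actual entity relationship diagram.

```python
import sqlite3

# Hypothetical schema: one table per entity, linked by foreign keys.
SCHEMA = """
CREATE TABLE station (
    station_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE stadium (
    stadium_id INTEGER PRIMARY KEY,
    team       TEXT NOT NULL,
    station_id INTEGER NOT NULL REFERENCES station(station_id)
);
CREATE TABLE matches (
    match_id   INTEGER PRIMARY KEY,
    match_date TEXT NOT NULL,
    stadium_id INTEGER NOT NULL REFERENCES stadium(stadium_id),
    home_goals INTEGER NOT NULL,
    away_goals INTEGER NOT NULL
);
"""


def build_database():
    """Create an in-memory database with foreign key checks switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    conn = build_database()
    conn.execute("INSERT INTO station VALUES (1, 'Example station')")
    conn.execute("INSERT INTO stadium VALUES (1, 'Example FC', 1)")
    conn.execute("INSERT INTO matches VALUES (1, '1993-08-14', 1, 2, 1)")
    print(conn.execute("SELECT home_goals + away_goals FROM matches").fetchone())
```

Declaring the columns NOT NULL and enforcing the foreign keys is what lets the database itself reject null values and orphaned records at load time.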
2.1.5.2 Use Case 004 Database Management
Scope
The scope of this use case is to create the relational database structure that
will allow manipulation and visualisation of the data.
Description
This use case describes the process of the relational database creation and
its management.
Use Case Diagram
This use case can commence any time after use case 003 is completed and
begins when a Python script is run to load the files into the database.
Main flow
1.
2.
3.
4.
Termination
The database is fully created, all data is loaded and there are no errors
or missing attributes or relationships. The Python script returns a task
completed message.
Post condition
The data is now entered into a relational database and is ready for use.
2.1.6.1.2 Inputs
The relational database with all data loaded in.
2.1.6.1.3 Processing
Access the database, manipulate the data, and use MySQL and R Studio to run
scripts and statistical analysis on the generated queries based on the primary project
objectives. Create the required graphs, tables, mapping and other outputs to
visualise the results for compilation in the report and presentation. Interpret the
results and document them.
2.1.6.1.4 Outputs
A variety of graphs, tables, maps and charts to be included in the presentation and
report.
2.1.6.2 Use Case 005 Data Analysis
Scope
The scope of this use case is to undertake statistical and data mining
activities to determine how weather is related to match goal outcome.
Description
This use case describes the process of statistical analysis and data mining
activities on the data set followed by interpretation of the results and the
creation of outputs to describe and explain the data.
Use Case Diagram
This use case starts when the relational database (use case 004) is
completed.
Main flow
1. Undertake the data mining and statistical analysis based on the
project's primary objectives.
2. Generate queries and scripts to support research and project
objectives.
3. Output tables, graphs, charts and maps to present the outcome of
the analysis.
4. Interpret the results and document what they mean.
5. Create the appropriate part of the report using the gathered
information.
Termination
The use case is terminated when the primary research objectives have been
answered and the relevant report section and presentation is completed.
Post condition
A report & presentation draft structure that provides discussion and
explanation of the results supported by graphs, tables, charts and maps.
2.1.7.1.2 Inputs
The interpreted results from use case 005.
2.1.7.1.3 Processing
A description of the processing steps. Describe the main steps involved in
processing
2.1.7.1.4 Outputs
A predictive model that allows for customers and
Flow Description
Precondition
The primary data analysis is completed and all results have been interpreted
and documented. The results show trends and patterns that allow for
predictive modelling to be undertaken.
Activation
This use case starts when the analysis in use case 005 is complete and the
predictive analysis is undertaken using R.
Main flow
1. Predictive modelling process commences.
2. 80% of the data is selected at random and designated as training
data. The remainder will be the actual test data.
3. The programs and models learn from the training data and this is
then applied to the test data to see if the model is able to correctly
determine match outcome.
4. Depending on time frames and availability the model may also be
applied to actual future football matches using match fixtures and
predicted weather forecasting.
5. The results are documented and interpreted.
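The random 80/20 split in step 2 of the main flow can be sketched as below. The project planned to carry out the modelling in R, so this Python version is only an illustration; the function name, the fixed seed and the use of plain lists are assumptions.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows and split them into training and test sets.

    A fixed seed (an arbitrary choice here) makes the split repeatable.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    matches = list(range(100))  # stand-in for 100 match records
    train, test = train_test_split(matches)
    print(len(train), len(test))
```

Because match results form a time series, a purely random split lets the model see matches from the same period as the test set; splitting by season is an alternative worth considering when the model is applied to future fixtures.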
Alternate flow
A1 : No clear relationships
1. Where no clear relationship exists between the data sets and
predictive modelling is not applicable then any general trends or
patterns will be discussed if applicable. (Returns to main process step
5.)
Termination
The use case ends either when the best predictive model is produced based
on the data analysis or it is determined that no model can be created.
Post condition
A report and interpretation of the results providing either predictive modelling
or general trends and patterns where possible.
2.1.8.1.2 Inputs
The primary data analysis results from use case 005 and predictive modelling
results from use case 006.
2.1.8.1.3 Processing
A clear and detailed report is created to allow the customer to review the project
and understand the data sets and all relationships and trends that exist. In addition,
a short presentation will be created alongside it to present the key findings to
the customer.
2.1.8.1.4 Outputs
A detailed and clear printed and electronic report of 10,000 to 12,000 words in line
with the customer's requirements and a PowerPoint presentation file.
2.1.8.2 Use Case 007 Report Production and Output
Scope
The scope of this use case is to create a final report and presentation to be
presented to the customer.
Description
This use case describes the process of creating the final report and
presentation material.
Use Case Diagram
3 Interface requirements
3.1 Application Programming Interfaces (API)
The analysis process will use Google Maps to provide mapping visualisation tools
within R Studio via dedicated add-ons. Google Maps is used to plot weather station
and stadium locations and also to determine the closest distance between them
where it is not clear or where there is a choice of weather stations in close
proximity. Some analysis results may be presented using Google Maps
visualisation tools.
4 System Architecture
The overall system architecture is shown as a high level diagram in Figure 09. The
system, with its steps and processes, is shown within the dotted line, which
represents the use cases outlined above.
Overall System Architecture
UML Class Diagram for Database Structure
The database will have three classes as shown below with attributes. This is a draft
class diagram which also forms the basis of an Entity Relationship diagram.
5 System Evolution
As outlined in section 2.2.3, weather data and football results are time series data
which are being added to on an ongoing basis. Both the ECA&D and the football
results provider add to their data sets continually, providing the option of
additional data to be included in any future study.
The system could also extend the range of countries, as the ECA&D hold data for
a wide range of European countries, although outside of countries like Germany
the data is less detailed and less reliable, with a higher incidence of null values.
The inclusion of very hot or wet countries like Spain and Italy could reveal
patterns or trends across Europe as a whole.
Highlight Report
Release:
Date:
Authors:
Alastair Macnair
x13129325
The effects of weather on goal outcome for football matches played within the German Bundesliga
Highlight Report
1
1.1
Report History
Document Location
1.2
Revision History
Revision date
Author
Version
Summary of Changes
31/10/2014
Alastair Macnair
01
Initial Issue
1.3
Changes
marked
Approvals
Title
Ioana Ghergulescu
Project Supervisor
31/10/2014 01
1.4
Distribution
Title
Date of Issue
Status
Table of Contents
1
Page
1.2
1.3
Approvals _______________________________________________________________ 2
1.4
Distribution ______________________________________________________________ 2
Highlight Report from 15th September 2014 to 31st October 2014. _______________________ 4
A Highlight report provides the Project Board and Client/Customer with a summary of the status of a
project at agreed stages and is used to monitor progress. The Project manager/Data Analyst uses the
Highlight report to advise the Project Board/Client of any potential problems or areas where they
could help.
Some areas have been started slightly ahead of schedule, as the project table summary indicates.
The project plan Gantt chart was redone from scratch to make the most of Excel's ability to
provide a simple project plan. A table was built using all currently identified tasks, although
some were omitted to keep the overall chart clear. The revised table can be adjusted and
updated much more easily, and the Gantt chart (Appendix A) reflects any changes automatically. The
use of colour allows specific sub-groups of tasks to be identified more readily. The table uses a
traffic light system to identify sections completed, in progress and yet to be started. The project
plan table is shown below:
4.1
Task | Start Date | Duration | End Date | Status
PROJECT PROPOSAL
Project Proposal Submitted | 28-Sep-14 | | 28-Sep-14 | Completed
REQUIREMENTS SPECIFICATION
Requirements Specification | 01-Oct-14 | 12 | 12-Oct-14 | Completed
Requirements Specification Submitted | 12-Oct-14 | 1 | 12-Oct-14 | Completed
MANAGEMENT REPORT
Management Report 01 | 31-Oct-14 | 1 | 31-Oct-14 | Completed
Management Report 02 | 30-Nov-14 | 1 | 30-Nov-14 | To be started
Management Report 03 | 20-Dec-14 | 1 | 20-Dec-14 | To be started
KEY RESEARCH AREAS
Germany's Weather | 01-Nov-14 | 12 | 12-Nov-14 | In Progress
Statistical Tools | 15-Sep-14 | 60 | 13-Nov-14 | In Progress
Sports Performance Factors | 01-Nov-14 | 12 | 12-Nov-14 | To be started
Stadium Design | 01-Nov-14 | 10 | 10-Nov-14 | To be started
Technology and Tools | 15-Sep-14 | 55 | 08-Nov-14 | In Progress
DATA EXTRACTION
Download Weather Files | 29-Sep-14 | 1 | 29-Sep-14 | Completed
Download Football Files | 29-Sep-14 | 1 | 29-Sep-14 | Completed
FILTER WEATHER FILES
Collate Bundesliga Stadium list | 14-Oct-14 | 20 | 02-Nov-14 | In Progress
Extract required weather stations | 02-Nov-14 | 1 | 02-Nov-14 | To be started
DATA TRANSFORMATION
Write R script to clean and transform weather files | 01-Nov-14 | 5 | 05-Nov-14 | In Progress
Write R script to clean and transform football files | 30-Oct-14 | 2 | 31-Oct-14 | Completed
Bind and clean all football files | 30-Oct-14 | 2 | 31-Oct-14 | Completed
JOIN all weather parameters for each station | 05-Nov-14 | 1 | 05-Nov-14 | To be started
DATABASE MANAGEMENT
Design Relational Database | 01-Nov-14 | 2 | 02-Nov-14 | In Progress
Insert Data into SQL database | 06-Nov-14 | 1 | 06-Nov-14 | To be started
DATA ANALYSIS
Analyse the database | 06-Nov-14 | 15 | 20-Nov-14 | To be started
Graphing and Visualisation | 15-Nov-14 | 6 | 20-Nov-14 | To be started
PREDICTION
Predictive Modelling | 20-Nov-14 | | 21-Nov-14 | To be started
REPORT WRITING
Introduction | 31-Oct-14 | 4 | 03-Nov-14 | In Progress
Literature Review | 04-Nov-14 | 8 | 11-Nov-14 | In Progress
Data Set Description | 12-Nov-14 | 2 | 13-Nov-14 | In Progress
Discussion | 20-Nov-14 | 14 | 03-Dec-14 | To be started
Conclusion | 03-Dec-14 | 3 | 05-Dec-14 | To be started
Checking & Review | 08-Dec-14 | 5 | 12-Dec-14 | To be started
Print 3 Copies and Bind | 15-Dec-14 | 3 | 17-Dec-14 | To be started
Submit Dissertation | 06-Jan-15 | 1 | 06-Jan-15 | To be started
PROJECT PRESENTATION
Prepare & Practice Presentation | 06-Jan-15 | 14 | 19-Jan-15 | To be started
Make Presentation | 19-Jan-15 | 5 | 23-Jan-15 | To be started
The full RISK log is provided in Appendix B. Summary of the RAID log: -
Risks
Four risks have been identified that could affect the project in the next period. One has
been resolved relatively easily.
Assumptions
One long range assumption has been identified relating to the holiday period and available resources
over this period to complete the project.
Issues
Five issues were encountered in the reported period. Four were dealt with and the fifth is due to be
resolved within the next few days.
Dependencies
There is only one current dependency which is ensuring the cleaned and transformed data is
completed on time to allow for database creation and analysis which is a major and key part of the
project.
The next work period will be when the vast majority of all the key tasks will be undertaken including
the main analysis and report writing.
[Figure: project plan Gantt chart covering the tasks listed in the project plan table]
28/09/2014
15/09/2014
20/09/2014
28/09/2014
Risk Description
Available time until project deadline is
limited
Project Data is lost due to PC or USB key
loss/damage
Severity
Date Raised
Impact
ID
Likelihood
RISKS
Mitigation Plan
Ensure constant review of project plan and keep to project plan deadlines.
(Severity: 9)
A dedicated Google Drive space has been set up in addition to storage on a PC
and USB key, creating three distinct storage places. Back up latest files at
least once a week or after significant work development. (Severity: 15)
Ensure ongoing research and adherence to project plan as well as timely
undertaking of literature review to ensure all knowledge areas are complete.
(Severity: 3)
Pull back project timeline to allow for printing before Christmas. Identify
businesses that provide this service and determine opening hours as early as
possible. (Severity: 15)
Owner
Status
Date Closed
Open
Closed
Open
Open
15/10/2014
ISSUES
ID
Date Raised
14/10/2014
14/10/2014
25/10/2014
14/10/2014
25/10/2014
Issue Description
Bundesliga total team range over 21
years significantly higher than
anticipated, adding to work and
creating lots of 'one-off' teams over
the period
Impact Description
Adds to time required to collate
stadia list and identify stadia and
potentially affects the analysis
where single teams and stadia are
present over the
Impact
Medium
Owner
Status
Closed
Closed
High
Low
High
Open
Closed
Closed
ASSUMPTIONS
ID
Date Raised
30/10/2014
Assumption Description
Action to Validate
Impact if Assumption Incorrect
Identify businesses required to bind
report and clarify opening hours
More time in project plan currently
well before christmas period
not being utilised.
Status
Open
DEPENDENCIES
ID
Date Raised
20/10/2014
Dependency Description
Analysis and database creation is
dependent upon stadium list
being compiled
Location
Deliverables
Internal
Delivery Date
Importance
Status
02/11/2014
High
Open
8.10.2
Highlight Report
Release:
Date:
Authors:
Alastair Macnair
x13129325
The effects of weather on goal outcome for football matches played within the German Bundesliga
Highlight Report
1
1.1
Report History
Document Location
1.2
Revision History
Revision date
Author
Version
Summary of Changes
31/10/2014
Alastair Macnair
01
Initial Issue
7/12/2014
Alastair Macnair
02
Progress Revision
1.3
Changes
marked
Approvals
Title
Ioana Ghergulescu
Project Supervisor
31/10/2014 01
Ioana Ghergulescu
Project Supervisor
31/10/2014 01
1.4
Distribution
Title
Date of Issue
Status
Table of Contents
1
Page
1.2
1.3
Approvals _______________________________________________________________ 2
1.4
Distribution ______________________________________________________________ 2
File, Electronic and Hardcopy protection, Backup and recovery Plan ________________ 13
13.2
A Highlight report provides the Project Board and Client/Customer with a summary of the status of a
project at agreed stages and is used to monitor progress. The Project manager/Data Analyst uses the
Highlight report to advise the Project Board/Client of any potential problems or areas where they
could help.
Stadiums and weather station (observation data) fully paired and calculated
Complete data set compiled and checked and 100% ready
R scripts written for Feature Engineering enhancements to Data set
Feature engineering elements added to data set. (Goal outcome measures, seasons etc.)
R scripts written for descriptive statistics and graphing
Report Section 01 Introduction, completed
Report Section 02 Literature Review, 20% Complete
Report Section 03 Data Sets, 25% complete
Report Section 04 Analysis, 5% complete
Research ongoing in statistical analysis and sports performance
Research ongoing in Data Mining techniques and predictive modelling
This period has seen progress in a number of areas. The Gantt chart has been split to provide an
overview and a separate, more detailed plan for the next work period to help better understand the
various sub-tasks required. Weather forecast data will be gathered prior to each match from 7th
December until 20th December.
4.1
Task | Start Date | Duration | End Date | Status
PROJECT PROPOSAL
Project Proposal Submitted | 28-Sep-14 | | 28-Sep-14 | Completed
REQUIREMENTS SPECIFICATION
Requirements Specification | 01-Oct-14 | 12 | 12-Oct-14 | Completed
Requirements Specification Submitted | 12-Oct-14 | 1 | 12-Oct-14 | Completed
MANAGEMENT REPORT
Management Report 01 | 31-Oct-14 | 1 | 31-Oct-14 | Completed
Management Report 02 | 07-Dec-14 | 1 | 07-Dec-14 | Completed
Management Report 03 | 20-Dec-14 | 1 | 20-Dec-14 | To be started
KEY RESEARCH AREAS
Germany's Weather | 01-Nov-14 | 12 | 12-Nov-14 | Completed
Statistical Tools | 15-Sep-14 | 60 | 13-Nov-14 | Completed
Sports Performance Factors | 08-Dec-14 | 7 | 14-Dec-14 | In Progress
Stadium Design | 08-Dec-14 | 7 | 14-Dec-14 | In Progress
Technology and Tools | 15-Sep-14 | 55 | 08-Nov-14 | Completed
DATA EXTRACTION
Download Weather Files | 29-Sep-14 | 1 | 29-Sep-14 | Completed
Download Football Files | 29-Sep-14 | 1 | 29-Sep-14 | Completed
FILTER WEATHER FILES
Collate Bundesliga Stadium list | 14-Oct-14 | 20 | 02-Nov-14 | Completed
Extract required weather stations | 02-Nov-14 | 1 | 02-Nov-14 | Completed
DATA TRANSFORMATION
Write R script to clean and transform weather files | 01-Nov-14 | 5 | 05-Nov-14 | Completed
Write R script to clean and transform football files | 30-Oct-14 | 2 | 31-Oct-14 | Completed
Bind and clean all football files | 30-Oct-14 | 2 | 31-Oct-14 | Completed
JOIN all weather parameters for each station | 05-Nov-14 | 1 | 05-Nov-14 | Completed
DATABASE MANAGEMENT
Design Relational Database | 01-Nov-14 | 2 | 02-Nov-14 | Omitted
Insert Data into SQL database | 06-Nov-14 | 1 | 06-Nov-14 | Omitted
DATA ANALYSIS
Analyse the database | 01-Dec-14 | 20 | 20-Dec-14 | In Progress
Graphing and Visualisation | 01-Dec-14 | 20 | 20-Dec-14 | In Progress
PREDICTION
Predictive Modelling | 10-Dec-14 | 12 | 21-Dec-14 | To be started
REPORT WRITING
Introduction | 31-Oct-14 | 4 | 03-Nov-14 | Completed
Literature Review | 08-Dec-14 | 8 | 15-Dec-14 | In Progress
Data Set Description | 08-Dec-14 | 5 | 12-Dec-14 | In Progress
Analysis and Evaluation | 12-Dec-14 | 8 | 19-Dec-14 | In Progress
Conclusion | 17-Dec-14 | 4 | 20-Dec-14 | To be started
Checking & Review | 28-Dec-14 | 2 | 29-Dec-14 | To be started
Print 3 Copies and Bind | 02-Jan-15 | 2 | 03-Jan-15 | To be started
Submit Dissertation | 06-Jan-15 | 1 | 06-Jan-15 | To be started
PROJECT PRESENTATION
Prepare & Practice Presentation | 12-Jan-15 | 7 | 18-Jan-15 | To be started
Make Presentation | 19-Jan-15 | 5 | 23-Jan-15 | To be started
All four data sets (Observations, Stations, Stadiums, and Matches) have been cleaned, corrected
and joined to provide a complete data set of all matches, stadiums and weather observations.
Problems encountered last period such as name changes were corrected. A small percentage of
missing observation data was overcome using Multiple Imputation.
Feature Engineering enhancements have been added to the data set.
Descriptive statistics and graphing have been undertaken on the data.
Project Report has seen some sections completed and most others started.
Project Management Plan revised and updated 7th December 2014 (See Appendix A)
Contingency Plans created.
Although an R script works out the distances between stadia and weather stations, there is still a
manual element in adding or removing weather station observation files from the project folder.
This needed to be checked, as even one missing file was found to affect the final joins, creating a
list with missing values and making errors hard to detect.
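A check of this kind could be automated along the following lines. This is a hedged Python sketch rather than part of the project's R scripts; the function name, the parameter codes and the "PARAMETER_STATION.txt" file naming pattern are hypothetical and would need to match the real file names.

```python
def missing_station_files(station_ids, parameters, present_files):
    """List the expected weather files that are absent from the project folder.

    station_ids   -- ids of the weather stations paired with stadiums
    parameters    -- codes of the weather parameters (hypothetical codes)
    present_files -- file names actually found in the project folder
    """
    expected = {f"{param}_{sid}.txt" for param in parameters for sid in station_ids}
    return sorted(expected - set(present_files))


if __name__ == "__main__":
    # Illustrative names only.
    found = ["TG_0001.txt", "RR_0001.txt", "TG_0002.txt"]
    print(missing_station_files(["0001", "0002"], ["TG", "RR"], found))
```

Running such a check before the joins would flag an absent file directly, instead of leaving it to be inferred from missing values in the joined list.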
Personal and work/college factors continue to have a detrimental impact on the project plan.
However, the majority of all other NCI commitments are now completed, which allows the
project to regain priority.
The full RISK log is provided in Appendix B. Summary of the RAID log: -
Risks
Time continues to be the biggest risk to meeting the required deadline. The previously unused
time over Christmas has been utilised and the project supervisor has confirmed that binding is not a
critical requirement. Work and research in data mining has developed knowledge in this area, and
creating contingency plans alongside existing data backup methods has limited the associated risk.
Assumptions
Project supervisor has confirmed that binding is not an essential requirement.
Issues
All issues now resolved
Dependencies
Predictions are dependent on collecting weather forecast data (not historical data). This data will need
to be recorded from a suitable forecast provider prior to every match. Failure to obtain reliable
forecast data will prevent real match data from being used.
As other NCI deadlines and commitments built up over November, project work was affected,
although it continued at a slower than ideal pace. These other commitments are now
essentially completed, allowing more time for project work.
The updated plan now uses the time over Christmas that was previously left unused, to
ensure project deliverables can still be achieved despite the delays in this period.
The MySQL database that was to be used has been omitted after review, as everything can be
achieved in R Studio, where calculations, tabulation and graphing will be quicker and easier.
Contingency Planning
Sections on contingency planning have been added for data loss and for external threats and
circumstances. See Appendix C.
28/09/2014
15/09/2014
20/09/2014
28/09/2014
Risk Description
Available time until project deadline is
limited
Project Data is lost due to PC or USB key
loss/damage
Severity
Date Raised
Impact
ID
Likelihood
RISKS
Mitigation Plan
Ensure constant review of project plan and keep to project plan deadlines.
(Severity: 9)
A dedicated Google Drive space has been set up in addition to storage on a PC
and USB key, creating three distinct storage places. Back up latest files at
least once a week or after significant work development. (Severity: 15)
Ensure ongoing research and adherence to project plan as well as timely
undertaking of literature review to ensure all knowledge areas are complete.
(Severity: 3)
Pull back project timeline to allow for printing before Christmas. Identify
businesses that provide this service and determine opening hours as early as
possible. (Severity: 15)
Owner
Status
Date Closed
Open
Closed
15/10/2014
Closed
01/12/2014
Closed
01/12/2014
ISSUES
ID
Date Raised
14/10/2014
14/10/2014
25/10/2014
14/10/2014
25/10/2014
Issue Description
Bundesliga total team range over 21
years significantly higher than
anticipated, adding to work and
creating lots of 'one-off' teams over
the period
Impact Description
Adds to time required to collate
stadia list and identify stadia and
potentially affects the analysis
where single teams and stadia are
present over the
Impact
Medium
High
Low
High
Owner
Status
Closed
Closed
Closed
Closed
Closed
ASSUMPTIONS
ID
Date Raised
30/10/2014
Assumption Description
Action to Validate
Impact if Assumption Incorrect
Identify businesses required to bind
report and clarify opening hours
More time in project plan currently
well before Christmas period
not being utilised.
Status
Closed
DEPENDENCIES

Date Raised: 20/10/2014
Dependency Description: Analysis and database creation is dependent upon stadium list being compiled
Location: Internal
Deliverables: Bound files for weather station locations and stadium lists.
Delivery Date: 02/11/2014
Importance: High
Status: Closed

Date Raised: 01/12/2014
Dependency Description: Real match predictions need weather forecasts to be captured and recorded now for later analysis
Location: External
Deliverables: Weather forecast details for all weather parameters to be taken on day of match prior to being played
Delivery Date: 20/12/2014
Importance: High
Status: Open
8.10.3
Highlight Report
Release:
Date:
Authors:
Alastair Macnair
x13129325
The effects of weather on goal outcome for football matches played within the German Bundesliga
Highlight Report
1 Report History
1.1 Document Location
1.2 Revision History
Revision date | Author | Version | Summary of Changes | Changes marked
31/10/2014 | Alastair Macnair | 01 | Initial Issue |
7/12/2014 | Alastair Macnair | 02 | Progress Revision |
30/12/2014 | Alastair Macnair | 03 | Progress Revision |
1.3 Approvals
Ioana Ghergulescu | Project Supervisor | 31/10/2014 | 01
Ioana Ghergulescu | Project Supervisor | 30/12/2014 | 03
1.4 Distribution
Title | Date of Issue | Status
Table of Contents
1.3 Approvals
1.4 Distribution
Detailed Project Plan for upcoming period
File, Electronic and Hardcopy protection, Backup and recovery Plan
A Highlight report provides the Project Board and Client/Customer with a summary of the status of a
project at agreed stages and is used to monitor progress. The Project manager/Data Analyst uses the
Highlight report to advise the Project Board/Client of any potential problems or areas where they
could help.
This period has seen the most work and the completion of the various tasks. The Gantt chart has been
updated to reflect the few outstanding tasks left to complete the project. Gathering weather forecast
data has been omitted, as the number of real matches taking place (15) was considered too small a
sample for meaningful predictive modelling. Instead, the data set will be split into training and
test sets.
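The training/test split mentioned above can be illustrated with a minimal sketch. The project itself was carried out in R; Python is used here purely for illustration. The 80/20 ratio, the fixed seed, the function name and the placeholder match records are all assumptions, not details taken from the project.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for repeatability
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical match records: (match_id, total_goals)
matches = [(i, i % 5) for i in range(100)]
train, test = train_test_split(matches, test_fraction=0.2)
print(len(train), len(test))
```

Holding the test rows back until the model has been fitted on the training rows gives an estimate of predictive performance on matches the model has not seen.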
4.1
Task | Start Date | Duration | End Date | Status
PROJECT PROPOSAL
Project Proposal Submitted | 28-Sep-14 | | 28-Sep-14 | Completed
REQUIREMENTS SPECIFICATION
Requirements Specification | 01-Oct-14 | 12 | 12-Oct-14 | Completed
Requirements Specification Submitted | 12-Oct-14 | 1 | 12-Oct-14 | Completed
MANAGEMENT REPORT
Management Report 01 | 31-Oct-14 | 1 | 31-Oct-14 | Completed
Management Report 02 | 07-Dec-14 | 1 | 07-Dec-14 | Completed
Management Report 03 | 20-Dec-14 | 1 | 20-Dec-14 | Completed
KEY RESEARCH AREAS
Germany's Weather | 01-Nov-14 | 12 | 12-Nov-14 | Completed
Statistical Tools | 15-Sep-14 | 60 | 13-Nov-14 | Completed
Sports Performance Factors | 08-Dec-14 | 7 | 14-Dec-14 | Completed
Stadium Design | 08-Dec-14 | 7 | 14-Dec-14 | Completed
Technology and Tools | 15-Sep-14 | 55 | 08-Nov-14 | Completed
DATA EXTRACTION
Download Weather Files | 29-Sep-14 | 1 | 29-Sep-14 | Completed
Download Football Files | 29-Sep-14 | 1 | 29-Sep-14 | Completed
FILTER WEATHER FILES
Collate Bundesliga Stadium list | 14-Oct-14 | 20 | 02-Nov-14 | Completed
Extract required weather stations | 02-Nov-14 | 1 | 02-Nov-14 | Completed
DATA TRANSFORMATION
Write R script to clean and transform weather files | 01-Nov-14 | 5 | 05-Nov-14 | Completed
Write R script to clean and transform football files | 30-Oct-14 | 2 | 31-Oct-14 | Completed
Bind and clean all football files | 30-Oct-14 | 2 | 31-Oct-14 | Completed
JOIN all weather parameters for each station | 05-Nov-14 | 1 | 05-Nov-14 | Completed
DATABASE MANAGEMENT
Design Relational Database | 01-Nov-14 | 2 | 02-Nov-14 | Omitted
Insert Data into SQL database | 06-Nov-14 | 1 | 06-Nov-14 | Omitted
DATA ANALYSIS
Analyse the database | 01-Dec-14 | 20 | 20-Dec-14 | Completed
Graphing and Visualisation | 01-Dec-14 | 20 | 20-Dec-14 | Completed
PREDICTION
Predictive Modelling | 01-Jan-15 | | 03-Jan-15 | In Progress
REPORT WRITING
Introduction | 31-Oct-14 | 4 | 03-Nov-14 | Completed
Literature Review | 08-Dec-14 | 8 | 15-Dec-14 | Completed
Data Set Description | 08-Dec-14 | 5 | 12-Dec-14 | Completed
Analysis and Evaluation | 12-Dec-14 | 8 | 19-Dec-14 | Completed
Conclusion | 28-Dec-14 | 8 | 04-Jan-15 | In Progress
Checking & Review | 28-Dec-14 | 8 | 04-Jan-15 | In Progress
Print 3 Copies and Bind | 05-Jan-15 | 1 | 05-Jan-15 | To be started
Submit Dissertation | 06-Jan-15 | 1 | 06-Jan-15 | To be started
PROJECT PRESENTATION
Prepare & Practice Presentation | 12-Jan-15 | 7 | 18-Jan-15 | To be started
Make Presentation | 19-Jan-15 | 5 | 23-Jan-15 | To be started
All analysis and graphing have been completed and added to the report.
All primary sections of the report have been drafted and most have been finished, with just the
conclusion yet to be completed.
Project Management Plan (03) revised and updated 30th December 2014 (See Appendix A)
With four dependent variables (goal outcome), three subsets of the data (All data, B1 and B2) and
8+ independent variables, there is a massive amount of graphing and analysis that potentially
needs to be undertaken: around 96 distinct cases (4 x 3 x 8) that need analysis. Deciding how to
approach this and which cases to omit has been the greatest challenge. It would have been better
to focus on just one or two dependent variables.
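The count of cases quoted above can be checked by enumerating every combination of dependent variable, data subset and independent variable. The sketch below is illustrative only: the subset labels come from the text, while the variable names are placeholders because only their counts are given here, and exactly eight independent variables are assumed (the text says 8+).

```python
from itertools import product

# Placeholder names: only the counts (four dependent, 8+ independent) are stated.
dependent_vars = [f"goal_outcome_{i}" for i in range(1, 5)]
subsets = ["All data", "B1", "B2"]
independent_vars = [f"weather_var_{i}" for i in range(1, 9)]  # assumes exactly 8

# Every (dependent variable, subset, independent variable) triple is one case.
cases = list(product(dependent_vars, subsets, independent_vars))
print(len(cases))  # 4 * 3 * 8 = 96

# Restricting the analysis to two dependent variables halves the workload.
reduced = list(product(dependent_vars[:2], subsets, independent_vars))
print(len(reduced))  # 2 * 3 * 8 = 48
```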
Personal and Work/College factors continue to detrimentally impact on the project plan.
The full RISK log is provided in Appendix B. Summary of the RAID log: -
Risks
As the project is essentially finished, the time risk is now reduced and the topic closed.
Assumptions
There are no ongoing assumptions
Issues
All issues now resolved
Dependencies
Collecting real match data has been omitted as the number of matches is too low to be viable for
analysis. The existing data set will be used instead.
Overall the project plan has been adhered to. There has been some slippage over the
Christmas period, but not detrimentally so.
The updated plan now reflects the outstanding tasks that must be completed over the next
4-6 days to finish the project.
Contingency Planning
Sections on contingency planning continue to be monitored and back-ups are in progress. See
Appendix C.
RISKS

Date Raised: 28/09/2014
Risk Description: Available time until project deadline is limited
Mitigation Plan: Ensure constant review of project plan and keep to project plan deadlines.
Severity: 3
Status: Closed

Date Raised: 15/09/2014
Risk Description: Project data is lost due to PC or USB key loss/damage
Mitigation Plan: A dedicated Google Drive space has been set up in addition to storage on a PC and USB key, creating three distinct storage places. Back up latest files at least once a week or after significant work development.
Severity: 15
Status: Closed
Date Closed: 15/10/2014

Date Raised: 20/09/2014
Mitigation Plan: Ensure ongoing research and adherence to project plan as well as timely undertaking of literature review to ensure all knowledge areas are complete.
Severity: 3
Status: Closed
Date Closed: 01/12/2014

Date Raised: 28/09/2014
Mitigation Plan: Pull back project timeline to allow for printing before Christmas. Identify businesses that provide this service and determine opening hours as early as possible.
Severity: 15
Status: Closed
Date Closed: 01/12/2014
ISSUES
Date Raised: 14/10/2014, 14/10/2014, 25/10/2014, 14/10/2014, 25/10/2014
Issue Description: Bundesliga total team range over 21 years is significantly higher than anticipated, adding to work and creating lots of one-off teams over the period
Impact Description: Adds to time required to collate stadia list and identify stadia and potentially affects the analysis where single teams and stadia are present over the
Impact: Medium, High, Low, High
Status: Closed, Closed, Closed, Closed, Closed
ASSUMPTIONS
ID
Date Raised
30/10/2014
Assumption Description
Action to Validate
Impact if Assumption Incorrect
Identify businesses required to bind
report and clarify opening hours
More time in project plan currently
well before Christmas period
not being utilised.
Status
Closed
DEPENDENCIES

Date Raised: 20/10/2014
Dependency Description: Analysis and database creation is dependent upon stadium list being compiled
Location: Internal
Deliverables: Bound files for weather station locations and stadium lists.
Delivery Date: 02/11/2014
Importance: High
Status: Closed

Date Raised: 01/12/2014
Dependency Description: Real match predictions need weather forecasts to be captured and recorded now for later analysis
Location: External
Deliverables: Weather forecast details for all weather parameters to be taken on day of match prior to being played
Delivery Date: 20/12/2014
Importance: Low
Status: Closed