
Declaration Cover Sheet for Project Submission

SECTION 1 Student to complete


Name:

Student ID:
Supervisor:

SECTION 2 Confirmation of Authorship


The acceptance of your work is subject to your signature on the following declaration:
I confirm that I have read the College statement on plagiarism (summarised overleaf and printed in full in
the Student Handbook) and that the work I have submitted for assessment is entirely my own work.
Signature:_______________________________________________ Date:____________

NB. If it is suspected that your assignment contains the work of others falsely represented as your own, it
will be referred to the College's Disciplinary Committee. Should the Committee be satisfied that plagiarism
has occurred this is likely to lead to your failing the module and possibly to your being suspended or
expelled from college.
Complete the sections above and attach it to the front of one of the copies of your assignment.

National College of Ireland


Hdip
2013/2014

Alastair Macnair
x13129325
alastair.macnair@ncirl.ie

Can the weather kill goals?

The effects of weather on goal outcome for football
matches played within the German Bundesliga
Dissertation

Table of Contents

Executive Summary
1 Introduction
  1.1 Background
  1.2 Aims
    1.2.1 Research Questions
  1.3 Solution Overview
  1.4 Structure
2 Literature Review and Related Work
  2.1 Introduction
  2.2 Statistical analysis in football
    2.2.1 Football Tradition versus Data Analysis
    2.2.2 Statistical Methods in sports
  2.3 The effect of weather on sports performance
    2.3.1 Weather, Altitude and measurement Indices
    2.3.2 Weather Effects in Sport
  2.4 Conclusion
3 System and Datasets
  3.1 Design and Architecture
  3.2 Datasets
    3.2.1 Data Set 1 - Bundesliga Football Results
    3.2.2 Data Set 2 - Weather Observation Data for Germany
    3.2.3 Data Set 3 - Stadiums
    3.2.4 Data Set 4 - Weather Stations
    3.2.5 Limitations of the Data
  3.3 Data Processing
    3.3.1 Introduction
    3.3.2 Football Match Data
    3.3.3 Stadium Data Set creation
    3.3.4 Distance Matrix Calculation
    3.3.5 Checking Altitude Differential
  3.4 Dealing with Missing Data
  3.5 Mashup
  3.6 Feature Engineering
4 Testing, Evaluation & Error Checking
  4.1 Introduction
    4.1.1 Test 01 - Check Downloaded files
    4.1.2 Test 02 - Football match data set
    4.1.3 Test 03 - Weather Station data
    4.1.4 Test 04 - Stadium Data
    4.1.5 Test 05 - Distance Matrix
    4.1.6 Test 06 - Imputed Data
    4.1.7 Test 07 - Final Checking
  4.2 Data analysis
    4.2.1 Introduction
    4.2.2 Data Analysis Process and Methodology
    4.2.3 Exploratory Analysis
    4.2.4 Total Goals Analysis
    4.2.5 Over/Under 2.5 goals scored
    4.2.6 Goal Difference
    4.2.7 Home/Away Win or Draw
  4.3 Altitude Effects
  4.4 Scatter plots and correlation
  4.5 Data Mining and Predictive Modelling
  4.6 Analysis Conclusion
5 Conclusions
    5.1.1 Introduction
    5.1.2 Theoretical Implications
    5.1.3 Conclusion
6 Further development or research
7 References
8 Appendix
  8.1 Glossary of Terms
  8.2 Geography
    8.2.1 Map of Europe and Germany
    8.2.2 Map of Germany
    8.2.3 States of Germany (and created Regions)
    8.2.4 Weather Station & Stadium Locations
    8.2.5 Final weather station and Stadium locations
  8.3 Data Sets
    8.3.1 Football Data
    8.3.2 Weather Observation Data
    8.3.3 Stadium Data
    8.3.4 Weather Station Data
  8.4 Data Set Variables
  8.5 Naïve Bayes probability outcomes
  8.6 List of Figures and Tables
  8.7 List of Tables
  8.8 Initial Project Plan
  8.9 Initial Requirement Specification
  8.10 Management Progress Reports
    8.10.1 Management Progress Report 1
    8.10.2 Management Progress Report 2
    8.10.3 Management Progress Report 3
  8.11 Other Material Used

Executive Summary
This study seeks to investigate the relationship between football match goal
outcome and weather effects within Germany's top two football leagues,
Bundesliga 1 and 2. Twenty-one seasons of historic football results for both
Bundesliga tiers, from 1993 until 2013, are linked to weather observation data
by matching each of the 80+ weather stations around Germany with the nearest
stadium where the matches were played. Five weather variables are used in the
study: average daily temperature, rainfall, cloud cover, wind speed and humidity.
These are investigated against four Goal Outcome measures: Total Goals, Goal
Difference, Under/Over 2.5 and Home/Away/Draw results. The R Studio open source
software is used to investigate and determine whether any relationship between
them exists.

The study finds that temperature has a small but measurable effect on Total Goals
scored, reducing goal averages very slightly as temperature decreases,
predominantly during the winter season. Other effects such as humidity and cloud
cover have no measurable effect, with goals scored being consistent across the
entire range. Wind and rain also seem to have no obvious trend except at the
extreme ranges, where they interfere with all goal outcome variables. However,
this interference effect has no obvious trend or pattern, is generally
unpredictable and is based on a very low sample size, given the inherently rare
nature of such events, and as such the results should be treated cautiously.
Overall there is no clear relationship or correlation between weather and goal
outcome, and the study raises the question of how well we really understand the
effect of such factors on sport.


1 Introduction
1.1 Background

The sports industry within the European Union (EU) is estimated to contribute €294
billion to the economy, based on a recent study by the European Commission
(2012), and supports 2% of all EU jobs or around 4.46 million employees. Germany
provides the highest number of these jobs at 1.15 million, or almost 27% of all the
sports-related jobs within the EU. The Bundesliga, Germany's top football league,
is one of the "big five" leagues within Europe, alongside those of England, Spain,
Italy and France. In 2013/14 the combined revenue of these top five leagues grew by
5% to €9.8 billion, representing almost half of the entire European football market,
which was valued at €19.9 billion in the same period (Deloitte, 2014). The German
Bundesliga is characterised by strong cost control, with the lowest wage to revenue
ratio at 51%, which resulted in it being one of only two leagues in Europe to
generate an operating profit (€264m) for the sixth successive year in the 2013/14
season. In this context football should not be seen solely as a hobby or weekly
television event but as an increasingly important part of Europe's economy and a
major provider of employment within the European Union.

In 2012 Germany passed new laws to promote liberalisation of its gambling and
betting market. Marketing company MECN (2012) estimates that the German
sports betting market will grow to around 1.5 Billion in 2015 alone. This is part of
a global sports market which is estimated to be valued at around 733 Billion (BBC,
2014), with over 70% of that total being derived from football matches. Betting on
football matches became popular around the early 1920s within the UK with the
creation of the Football Pools (2014), one of the oldest gaming companies in the
world, allowing fans to predict matches and win money if they proved to be correct.
Given the value of this market there is a critical need for gaming companies to
ensure they fully understand the products they are providing, and all of the
variables that affect the odds of the betting instruments they provide to customers,
as mistakes could be costly.

Finally, trainers, teams and players are always seeking to gain competitive
advantage to ensure continuing or future success. Utilising statistical information
and data analytics is becoming increasingly important within the football and sports
sector as these groups seek to become both smarter and more efficient (Lewis,
2014). The majority of this analysis currently focusses on the players and there
has been very little analysis of the influence of external factors such as weather.
There are claims that weather, in particular temperature, could be a factor in the
outcome of football matches (British Weather Services, 2014).

The weather in Germany is characterised as a typically moderate and temperate
climate with temperatures ranging from around zero degrees in winter to the
mid-twenties in the summer (ECA&D, 2014). The use of a more temperate climate like
Germany's reduces the known influence that extreme temperatures have on sports
performance and health (Hong, 2014). However, throughout the 21-year period
there have been colder and warmer periods ranging from -15°C to 28°C as well
as wetter and windier periods (ECA&D, 2014). Germany is large enough to have
distinct regional weather patterns (Encyclopedia Britannica, 2014) and the
Bundesliga stadiums are distributed around Germany widely enough (Appendix
8.2.3) to be able to see if this regional weather has any measurable effect on goal
outcome results.

1.2 Aims

The aim of the study is to determine if specific weather factors, either
independently or in combination, including temperature, precipitation, wind speed,
humidity and cloud cover, have any measurable effect on the goal outcome of
football matches played within the German Bundesliga 1 and 2. The proposed
theory is that weather conditions such as lower or higher temperatures, wind or
rainfall may have an effect on the goal outcome. Football match data and historic
recorded weather observations can be linked using Germany's weather station
network, which is located throughout the entire country and consists of around 100
potentially viable weather stations, geo-located against the stadiums where each
match was played. By linking these data sets it will be determined if weather
effects play a role in goal outcome when considered across a time period of 21
years ranging from the 1993/94 season until the 2013/14 season.
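The geo-location step described above can be sketched as a nearest-neighbour search over great-circle distances. The following is a minimal illustration only, written in Python rather than the R environment used in the study; the station names and coordinates are hypothetical placeholders, and the haversine formula is an assumed choice of distance measure rather than the method confirmed by the study.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(stadium, stations):
    """Return (station_name, distance_km) for the station closest to the stadium."""
    lat0, lon0 = stadium
    distances = {
        name: haversine_km(lat0, lon0, lat, lon)
        for name, (lat, lon) in stations.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical coordinates (latitude, longitude), for illustration only.
stations = {
    "Station A": (48.14, 11.58),
    "Station B": (52.52, 13.40),
    "Station C": (50.94, 6.96),
}
stadium = (48.22, 11.62)

name, dist = nearest_station(stadium, stations)
print(name, round(dist, 1), "km")
```

Computing the distance for every stadium and station pair in this way gives a full distance matrix, from which the closest viable station can be selected for each stadium.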

Goal outcome within this study is defined by four commonly used betting
instruments (PKR, 2014) as a means to assess and compare the data, identify any
relationships, and if viable provide a basis to make predictions. (See Appendix 8.1
for definitions.)

1) The under/over 2.5 goals scored (UO)
2) The Home / Away team win or Draw (HAD)
3) Goal difference (GD)
4) Total Match Goals scored (TG)
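As an illustration of how these four measures relate to a single match score, the sketch below derives each of them from the home and away goals. It is written in Python for illustration (the study itself uses R), and it assumes that goal difference is taken as home goals minus away goals; the definitions in Appendix 8.1 take precedence if they differ.

```python
def goal_outcomes(home_goals, away_goals):
    """Derive the four goal outcome measures from one match score."""
    total = home_goals + away_goals
    # Assumption: goal difference is home minus away.
    diff = home_goals - away_goals
    if diff > 0:
        result = "H"  # home win
    elif diff < 0:
        result = "A"  # away win
    else:
        result = "D"  # draw
    return {
        "TG": total,
        "GD": diff,
        "UO": "Over" if total > 2.5 else "Under",
        "HAD": result,
    }


print(goal_outcomes(2, 1))
print(goal_outcomes(1, 1))
```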

If a particular pattern or lean towards an uneven spread is found due to one or
more weather effects then this would be of particular interest to managers, trainers,
players and in particular those companies that provide such betting instruments as
products where that weather effect is not currently an in-built factor.

Secondary objectives will consider the differences, if any, between the first and
second leagues and also if a particular geographical location or time of year affects
the goal outcome.


1.2.1 Research Questions


The primary problem being considered is the lack of information regarding the
relationship between weather factors such as temperature, precipitation, humidity,
wind and cloud cover and the effect that this may have on goal outcome in football
matches. The primary research question being considered is: -

Is there any relationship between weather effects and goal outcome for football
matches played within the Bundesliga 1 & 2?

From this the null hypothesis H0 and the alternative hypothesis H1 that will be
tested are established: -

H0: There is no relationship between goal outcome in football matches and daily
corresponding average values of temperature, wind speed, precipitation, humidity
and cloud cover.

H1: There is a relationship between goal outcome in football matches and daily
corresponding average values of temperature, wind speed, precipitation, humidity
and cloud cover.

The alternative hypothesis is non-directional and therefore a two-tailed test will
be applied where appropriate, with a significance level of 5%.
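To make the two-tailed decision rule concrete, the sketch below runs a permutation test on the Pearson correlation between a weather variable and a goal measure, and compares the resulting p-value with the 5% level. This is an illustrative Python sketch only: the data are synthetic and invented for the example, the permutation test is an assumed stand-in for whichever tests the study applies in R, and the function names are hypothetical.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def two_tailed_permutation_p(xs, ys, n_perm=2000, seed=1):
    """Two-tailed p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_perm + 1)


# Synthetic data, NOT taken from the study: temperature in degrees C and
# an invented average-goals series with a built-in upward trend.
temps = list(range(-5, 25))
avg_goals = [2.5 + 0.03 * t + 0.05 * ((i % 3) - 1) for i, t in enumerate(temps)]

ALPHA = 0.05
p_value = two_tailed_permutation_p(temps, avg_goals)
print("p =", round(p_value, 4), "| reject H0:", p_value < ALPHA)
```

Because the absolute value of the correlation is used, departures in either direction count as evidence against H0, which is what makes the test two-tailed.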

Secondary Research Questions


Within the context of the broader research question there are further questions that
will be considered: -

1) Is there any measurable difference between goal outcomes within the
Bundesliga 1 and 2 due to the effects of weather?
2) Is goal outcome affected by just one, or multiple weather variables?
3) Does regional weather affect goal outcome?
4) Can the goal outcome of future matches be better predicted using selected
betting instruments and weather factors?
5) Do some seasons, months or time periods see goal outcome affected more
due to weather effects?
6) Can goal outcome be better predicted using combined weather indices such as
apparent temperature?

1.3 Solution Overview

The project will primarily utilise the open source R Studio (2014) statistical
programming package to import, clean, merge, feature engineer and finally
analyse the data, and then provide all graphical and statistical outputs as required
throughout the study. This will be supported by programs such as Microsoft Excel,
Notepad and PeaZip (for extracting files) to ensure that all parts of the data can be
accessed, checked and manipulated as required at all stages of the project. The
R Studio platform will be used to handle the 200+ raw data files and apply the
relevant changes and functions to the files to remove unwanted information and
ultimately merge them all together into a single coherent data frame where each
match played is linked with its corresponding weather observation data across the
21 years of play. The stadium information will need to be manually extracted and
created as a fourth data set to link the weather stations and match results.
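The linking step described above can be pictured as two look-ups: stadium to nearest weather station, then station and match date to the daily observation. The sketch below shows this in Python for illustration only (the study performs the merge in R); every field name, station code and value is a hypothetical placeholder rather than data from the study.

```python
# Hypothetical match records.
matches = [
    {"date": "2013-08-09", "stadium": "Stadium X", "home_goals": 3, "away_goals": 1},
    {"date": "2013-08-10", "stadium": "Stadium Y", "home_goals": 0, "away_goals": 0},
]

# Hypothetical stadium data set: stadium -> nearest weather station code.
stadium_to_station = {"Stadium X": "ST001", "Stadium Y": "ST002"}

# Hypothetical daily observations keyed by (station code, date).
weather = {
    ("ST001", "2013-08-09"): {"temp_c": 21.5, "rain_mm": 0.0},
}


def link_matches(matches, stadium_to_station, weather):
    """Attach the daily weather observation to each match via its stadium's station."""
    linked = []
    for match in matches:
        station = stadium_to_station.get(match["stadium"])
        obs = weather.get((station, match["date"]))
        row = dict(match, station=station)
        # Missing observations are kept as None so they can be handled later.
        row.update(obs if obs is not None else {"temp_c": None, "rain_mm": None})
        linked.append(row)
    return linked


linked = link_matches(matches, stadium_to_station, weather)
for row in linked:
    print(row)
```

Keeping unmatched rows rather than dropping them makes the extent of missing weather data visible before any decision is taken on how to deal with it.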

The study will contribute to the overall body of knowledge of statistical analysis in
sport and also develop the understanding of the effects of weather on sport and
football in particular. The study will also help progress understanding on the issues
of handling and analysing large data sets.

1.4 Structure

The study is structured as follows:

Section 1 - A background to the study, why the study is important and what
benefits it will potentially realise through its undertaking
Section 2 - A literature review to assess the existing body of knowledge in the
areas of weather effects on sports performance, sports statistics, stadium design
issues and data mining and analysis techniques
Section 3 - The system architecture and datasets utilised for the study and a
process flow description of how the various data sets were combined along with
key areas of interest detailed
Section 4 - Testing, evaluation and primary analysis of the data sets
Section 5 - Study conclusion and recommendations
Section 6 - Areas where the study could benefit from further future development
Section 7 - Supporting references
Section 8 - Appendixes. All supporting information such as tables and graphs and
preliminary or supporting documents such as management progress reports


2 Literature Review and Related Work


2.1 Introduction

This literature review's primary aim is to examine all relevant and recent research
relating to the effects of weather on sport, in particular football, in relation to its
impact on achieved or measured performance levels. To support this objective,
research has also been conducted into climatic and meteorological weather
effects and types of measurement index. Additionally, statistical and data analysis
approaches in sports, as well as general analytical and relevant data mining
techniques, have been investigated.

A range of sources were used including Google Scholar, Google Books, IEEE,
ACE, Directory of Open Access Journals and Wiley as well as books, articles and
websites to support this. The primary search terms used included (but were not
limited to): Football, Soccer, Bundesliga, Weather, Sport, Rain, Wind, Temperature,
Humidity, Okta, Germany, Stadium, Environment, Performance, Statistics, data
mining and data analysis. Specific date parameters were not incorporated into the
search, although a number of key relevant papers that were found were from as
early as 1970. Papers that did focus on weather effects in sports were both
quantitative and qualitative. Overall there is a reasonably good level of research in
data analytics on sports and weather effects on sport, with recent publications in
2014 indicating a strong ongoing interest in this area.

2.2 Statistical analysis in football

2.2.1 Football Tradition versus Data Analysis


Football is a game that has long been dominated by seven words, "That's the way
it's always been done", highlighting the strong belief systems and dogmas that
dominate the game and still continue to exert influence over players, coaches and
fans alike (Anderson and Sally, 2013). Tradition exerts such a strong social
influence that knowledge can remain static, being rarely challenged or questioned,
with the suggestion that football as a game is full of unexamined clichés and myths
which have never been tested against real-life data (Kuper and Szymanski, 2012).

However, anecdotal evidence within football, once commonly accepted, is now
being proven to be incorrect using statistical analysis, for example the belief that
corner kicks increase the chances of scoring (Anderson and Sally, 2013).
Seeking to understand issues like these can potentially provide competitive
advantage to those competing and help shape and adapt the way they play.

2.2.2 Statistical Methods in sports


As with most statistical analysis, variables in sport are classified as either
independent or dependent. For example, independent variables such as various
weather factors may have an effect on multiple dependent variables such as goal
outcome (O'Donoghue and Holmes, 2014). McGarry, O'Donoghue and Sampaio
(2013) suggest two statistical methods that could be of benefit in sports
performance analysis. Firstly, they suggest principal component analysis
(PCA) as a way of reducing the often large number of variables down into a smaller
manageable group which contains those components that account for the largest
amount of variance. They also suggest that regression analysis is an important tool
in being able to predict or forecast game outcome. Analysis of goals scored in
worldwide domestic matches by Greenhough, Birch, Chapman and Rowlands
(2002) using thin-tailed binomial Poisson and negative binomial distributions has
shown that beyond the lower score range such models do not fit well, except in
some cases such as the English Premier League, and that a best-fit extremal
distribution is also needed. A simpler approach taken by the British Weather
Service (2013) was to use average goals scored and counts of the Under/Over 2.5
goals scored, together with whether match temperature was above or below zero
degrees, to draw conclusions on the effect of cold on score outcome, suggesting
that it was in fact highly correlated. Hamilton (2014) developed this for the same
data by using a logistic regression to better explore if any relationship actually
exists between the match outcome and weather, and found that there was almost no
change in the Under/Over 2.5 goals scored probability due to temperature, but noted
that the study only used 16 games on a single day.
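To show how a goal distribution connects to the Under/Over 2.5 measure, the sketch below computes the probability of three or more total goals under a simple Poisson model. This is an illustrative Python example only: the single-parameter Poisson model is an assumption made for the sketch (the literature above notes that such models fit poorly beyond the lower score range), and the mean goals values are hypothetical rather than taken from the study.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k goals when goals follow a Poisson(lam) distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def prob_over_2_5(lam):
    """Probability that total goals exceed 2.5, i.e. at least 3 goals are scored."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(3))


# Hypothetical mean total goals per match, for illustration only.
for mean_goals in (2.5, 2.9):
    print(mean_goals, round(prob_over_2_5(mean_goals), 3))
```

Under this model, any weather effect that shifts the mean number of goals would also shift the Over 2.5 probability, which is one way the two measures can be compared.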

Analytics company Kickdex (2014) again adopted a simpler approach, using the
Goal Difference metric against totals of whether the favorite did or did not win in
rainy conditions for any given Goal Difference score. Gelade and Dobson (2007)
used linear regression modelling and factor analysis in their investigation of
worldwide football team performance. O'Donoghue and Holmes (2014) also
suggest linear modelling and correlation techniques as being especially useful for
making predictive models but note that these are for making average
generalisations only and not as models for predicting individual match results.
They also note the use of non-parametric tests in sports analysis due to most
sports data not following the assumptions applicable to data which is normally
distributed. Peters and O'Donoghue (2013) used logistic regression to analyse the
performance of the Qatari football team and the influence of temperature on home
advantage. Finally, the use of data mining methods such as Naïve Bayes allows for
probability outcomes of binary goal outcome results such as Under/Over 2.5 to be
determined based on variables such as weather (Witten, Frank and Hall, 2011).
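The Naïve Bayes idea mentioned above can be sketched with a small categorical classifier that estimates the probability of an Under or Over 2.5 result from binned weather conditions. The example is written in Python for illustration only (the study works in R); the training rows are invented, the category names are hypothetical, and add-one (Laplace) smoothing is an assumed design choice.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (label, feature) -> value counts
    feature_values = defaultdict(set)      # feature -> distinct values seen
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            feature_counts[(label, feature)][value] += 1
            feature_values[feature].add(value)
    return class_counts, feature_counts, feature_values


def predict_proba(model, row):
    """Posterior class probabilities, assuming features are independent given the class."""
    class_counts, feature_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability of the class
        for feature, value in row.items():
            # Add-one (Laplace) smoothing avoids zero probabilities.
            numerator = feature_counts[(label, feature)][value] + 1
            denominator = count + len(feature_values[feature])
            score *= numerator / denominator
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


# Invented training data, NOT from the study.
rows = [
    {"temp": "cold", "rain": "wet"},
    {"temp": "cold", "rain": "dry"},
    {"temp": "warm", "rain": "dry"},
    {"temp": "warm", "rain": "dry"},
    {"temp": "warm", "rain": "wet"},
    {"temp": "cold", "rain": "wet"},
]
labels = ["Under", "Under", "Over", "Over", "Over", "Under"]

model = train_naive_bayes(rows, labels)
probs = predict_proba(model, {"temp": "warm", "rain": "dry"})
print(probs)
```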

Overall there is a range of techniques and models applicable to sports analysis,
with an emphasis on simplicity, whether drawing generalisations as to overall
trends or making predictions of future match outcomes.


2.3 The effect of weather on sports performance

2.3.1 Weather, Altitude and measurement Indices.


Germany has a temperate climate which is classified as Cfb based on the
Köppen-Geiger climate classification system (Rubel and Kottek, 2013). This
identifies it as having a warm temperate climate (C), being fully humid (f) and
having a generally warm summer (b). Recorded data for Germany (ECA&D, 2014)
indicate extreme temperatures ranging from -36°C to +40°C although these are
the exception and not typical of the overall climate. There have also been a number
of more extreme events classified for Germany occurring from 1999 onwards, but
despite these Germany is on average a temperate climatic region.

Physical features such as altitude can affect weather, influencing temperature and
wind speed due to changes in atmospheric pressure and terrain. The lapse rate,
which is the rate at which temperature decreases for an increase in altitude, is
approximately a 0.65°C drop in temperature per 100m in height gain. However,
the actual lapse rate is specific to each location and the conditions at the time
(Stone and Carlson, 1979). The use of equivalent indices such as heat index, wind
chill and apparent temperature has been noted as being of particular use for
sports due to the combined interaction of weather effects being considered (Perry,
2004). One index of particular interest is the apparent temperature index
(Steadman, 1984) which was first outlined in 1979 and then revised to its currently
used format in 1984. A simplified version that uses temperature (T), wind speed
(V) and humidity (e) is most commonly used today and is often described as a
"feels like" temperature, combining the wind chill and heat index to provide an
adjusted temperature that takes into account wind speed and humidity.
Steadman's (1984) approach uses a linear model which is applicable to most
outdoor conditions because it contains wind speed, but it should be noted that there
can be no accurate formula for what a person actually feels like.
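The two calculations described above can be sketched as follows. The lapse rate adjustment uses the approximate 0.65°C per 100m figure quoted in the text. The apparent temperature function uses the widely quoted simplified (non-radiation) linear form with temperature, water vapour pressure and wind speed; treating these exact coefficients as the version used in this study is an assumption, and the code is written in Python for illustration only rather than in R.

```python
import math

LAPSE_RATE_C_PER_100M = 0.65  # approximate value quoted in the text


def adjust_for_altitude(temp_c, height_gain_m):
    """Estimate temperature after a height gain using the approximate lapse rate."""
    return temp_c - LAPSE_RATE_C_PER_100M * height_gain_m / 100.0


def vapour_pressure_hpa(temp_c, rel_humidity_pct):
    """Water vapour pressure e (hPa) from air temperature and relative humidity."""
    return rel_humidity_pct / 100.0 * 6.105 * math.exp(17.27 * temp_c / (237.7 + temp_c))


def apparent_temperature(temp_c, rel_humidity_pct, wind_ms):
    """Simplified apparent temperature: AT = T + 0.33*e - 0.70*V - 4.00.

    Assumption: wind speed V is in metres per second and the coefficients
    are those of the commonly quoted non-radiation form.
    """
    e = vapour_pressure_hpa(temp_c, rel_humidity_pct)
    return temp_c + 0.33 * e - 0.70 * wind_ms - 4.00


# Hypothetical inputs, for illustration only.
print(round(adjust_for_altitude(10.0, 500), 2))
print(round(apparent_temperature(20.0, 50.0, 2.0), 2))
```

Higher humidity raises the apparent temperature and higher wind speed lowers it, which is how a single index can combine the heat index and wind chill effects.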


2.3.2 Weather Effects in Sport


Sports performance analysis is the process by which the various persons involved
within a sport such as coaches, analysts or physiologists come together to break
down a game's performance from observed data and then identify those factors
which contributed towards either a good or bad performance (McGarry,
O'Donoghue and Sampaio, 2013). The effects that weather and environmental
factors have on sport are an area where potentially considerable improvements
could be made, according to Thornes (1977), to improve sports management,
performance and economic performance. There is evidence to suggest that some
sports are more adversely affected than others, with endurance sports, in particular
cycling, being affected more by the weather (Pezzoli, Cristofori, Moncalero,
Giacometto and Boscolo, 2013). The study also found that most sports were
affected by three primary characteristics, namely temperature, humidity and wind.
Rain was also noted as a factor for some, but not all, sports.

Football is categorised as a "weather interference" sport in that it is ideally suited
to "weather-less" days which are warm, dry, overcast, bright and with little wind. In
this category both teams experience the same weather conditions at the same time
and football's structure of two halves where teams swap sides compensates for the
effects of factors such as wind. Weather for football therefore acts as an equal
interference factor (Thornes, 1977). In contrast, Perry (2004) suggests that factors
like wind in football could provide an unequal advantage to one team over the
other. Perry suggests that wind and rain have a significant subjective effect on
football matches played while temperature and other weather effects have only a
very limited impact.

Stephen Hawking's research into England's chances of success also factored in
weather variables and altitude (Paddy Power, 2014). The study showed that a 5
degree Celsius increase in temperature could significantly reduce England's
chances of winning, as would playing at over 500m altitude. Taylor and Rollo (2014)
also note that low altitudes, which range between 500-2000m, would impair
aerobic performance due to reduced air pressure, although this can
vary and is specific to geographical location and individual player conditioning.
Acclimatisation has also been shown by Gelade and Dobson (2007) to have a
potentially significant effect on international football where teams travel to a
different climatic region that they are unaccustomed to.

Precipitation can affect overall play conditions, making the ball drag, lose spin
and skid on the playing surface. This can affect each team's ability to control the
ball and increase the difficulty for the goalkeeper in making saves (Kickdex, 2014).
The analytics company suggests that rain does have an effect on playing
conditions and goal outcome by altering the outcome for the favorite (the team with
the shortest pre-match odds), suggesting they are more likely to prevail if it doesn't
rain. Their data indicates that the favorite's chances of winning are reduced to 50%
when there is rain, compared to a 67% chance of winning when there is no rain,
and also that the overall total number of goals per match increases significantly
if rain occurs. However, the Kickdex (2014) study only considers 147 matches
played within London and only provides results. Rain has also been shown to
affect ball characteristics, and studies of how a ball moves in flight (Carre, Asai,
Akatsuka and Haake, 2002) show that precipitation affects the trajectory such that
it is more likely to be off target.

Thornes (1977) notes that lower temperatures can be a hazard as blood flow to the
hands and feet is lost much more quickly in colder conditions, affecting sports like
football where players and goalkeepers rely on their extremities to maintain
performance. Riley and Williams (2003) also suggest that colder weather reduces limb
temperatures, which would detrimentally affect motor performance as well as strength
and power. In fact muscle power was found to be reduced by 5% for every 1°C drop in
muscle temperature below normal. Very extreme hot or cold temperatures are known to
directly affect both performance and health (Hong, 2014). However, the contribution
of weather factors where extreme temperature is not involved is still not clearly
understood or established. A qualitative study by Olszak (2012) found that while
players' perceptions of how games were affected by weather varied, these perceptions
generally showed no statistically significant relationship with measured effects.

The effect of temperature on the ball, which is made of viscoelastic materials, is
also a factor. With temperatures approaching zero degrees Celsius a goalkeeper has
7% more time to react to a penalty than at higher temperatures, when the ball moves
quicker (Wiart, Kelley, James and Allen, 2012). The flight of the ball is also affected,
with colder conditions causing the ball to drop and move slower overall with less
power than at warmer temperatures. However, as Riley and Williams (2003) point
out, in colder weather the goalkeeper is most susceptible to reduced limb
temperature and dexterity unless they keep highly active. A study in baseball (Kraft
and Skeeter, 1995) found that temperature appeared to play a significant role in
how far a batter could hit a ball, compared to factors such as wind speed which
are specific to a particular stadium and to local terrain and turbulence effects.
Advanced Football Analytics (2014) also found that temperature significantly
affected the success rate of field goals in American football, with lower
temperatures reducing the distance from which players can successfully score.

Finally, stadiums have been shown to affect weather factors (Szucs, Allard and
Moreau, 2009). Wind in particular can be mitigated significantly where stadiums
are enclosed or have roofs, although both temperature and humidity remain
unaffected. However, wind channeling effects and turbulence are highly localised
and specific to each stadium and the surrounding area. Kraft and Skeeter (1995) also
noted that such effects were hard to predict.


2.4 Conclusion

This review examined weather effects on sport and the statistical methods used in
sports to understand how such variables affect performance and goal outcome. In
relation to weather effects, multiple studies suggest that such factors affect sports
like football and player performance, although these studies often use a qualitative
approach to justify this. However, there is some conflicting opinion on which
weather effects matter and on the extent to which measurable outcomes, such as
goal outcome, are affected. In that regard there is a lack of quantitative knowledge
in this field. Some of the sources used are commercial in nature, so caution must be
taken with the data and results they present, which may not be entirely unbiased or
subject to independent checking.


3 System and Datasets


3.1 Design and Architecture

The overall system architecture is shown below in figure 1. The data sources that
form the basis of this study are freely available for download via a PC with internet
access. A stand-alone PC with the open source R Studio (2014) software is used
at all stages to gather, clean, combine and analyse the data and then make
predictions. R Studio requires online access to Google's Mapping API through a
number of additional packages to facilitate geolocation functions, calculation of
altitude and the creation of distance matrices to determine the nearest weather
station to each stadium. The results of the study and any viable predictions are
then provided to the customer as the end product.

Figure 1: System Architecture Design


3.2 Datasets

The study combines four datasets into a single data table for feature engineering
and analysis. One of the data sets, stadiums, was not available in any single readily
useable format and required the stadium names to be manually obtained and
entered for each unique team using secondary sources of information. All the other
data sets could be downloaded from their respective sources, as outlined below, in
either txt or csv file format. When combined these four data sets provide all 12,926
match results with the best available weather observation information specific to
the location where each match was played and on the correct date.

3.2.1 Data Set 1 Bundesliga Football Results


Useable football data for the Bundesliga 1 and 2 is available from an online source
(Football-UK, 2014) which provides a single csv file for every season of play. The
data covers the seasons from 1993 up to 2013 (21 in total), the most recent of which
came to an end in May 2014. In all cases the data provides the date the game was
played, the home and away team names and the home and away team goals scored
at full time. Files for recent seasons contain additional fields, but these are not
consistently available for earlier seasons and so cannot be used. There are typically
306 games played per season in each league, and across two leagues and 21 years
the files contain 12,926 games in total which will be subject to analysis. An example
of the raw football data is provided in appendix 8.3.1.

3.2.2 Data Set 2 Weather Observation Data for Germany


Weather data for Germany was obtained from the European Climate Assessment
& Dataset (ECA&D, 2014). There are 84 weather stations for Germany which
contain every weather variable required in the study. The weather observation data
is the actual recorded weather information such as rainfall, temperature and
humidity. This is distinct from the weather station itself, which is the physical
location at which weather observation data is recorded. Weather observation data
is provided as a single comma delimited text file for every unique weather variable
at every weather station (see Appendix 8.3.2 for an example). The unzipped file
size of all five weather variables being considered is 8GB as they contain the entire
European dataset. In total 32 stations are used in the final data set, each having five
variables, resulting in 160 individual text files which need to be extracted from the
data set for actual use. Each txt file contains on average 25,591 lines of data, which
gives approximately 4.1 million lines of raw data (160 x 25,591). The five average
daily variables used are: Temperature (°C), Rainfall (mm), Cloud Cover (Oktas),
Wind Speed (m/s) and Humidity (%).

3.2.3 Data Set 3 - Stadiums


The stadium is the location at which each match is played and provides the
physical link to the nearest weather station. In a league structure each game is
played at the home team's location. The home team listed in the match results is
therefore used in conjunction with three online stadium databases (Appendix 8.3.3)
and Google Maps to physically locate every stadium. Extracting the stadium name
information was undertaken manually for each participating team. Once each
name was obtained, the stadium's decimal coordinates were retrieved through R
using the Google API, providing the geolocation, distance, mapping and altitude
information needed to match each stadium to its nearest weather station.

3.2.4 Data Set 4 - Weather Stations


The weather station data is obtained from ECA&D (2014) as a separate entity to
the observation data and provides a list of every weather station within Europe.
The lists contain latitude and longitude information, which is the critical geolocation
information needed to match each stadium (and therefore the match played) to the
actual weather observation data. More than one stadium can be covered by a single
weather station, some weather stations are located nowhere near any of the stadia,
and there are potential errors in the recorded observation data for some stations.
As a result only a subset of the stations covering Germany is used, with some
crossover. An example of the weather station data lists is provided in Appendix
8.3.4. In total 32 stations out of the 84 viable possibilities were used in the study.

3.2.5 Limitations of the Data


The data sets are considered to be relatively robust and accurate in their individual
states. The limitation is the way in which they are being combined, in particular with
the weather data. The weather data provides numeric data, for example rainfall,
for every 24-hour period, reflecting daily figures. Football matches last 105 minutes
including the half time break, which represents only 7.3% of this overall weather
period (105 of 1,440 minutes). The weather data figures being used may not be exactly
representative of the actual conditions experienced when the match was played,
although they do represent the general average weather conditions of that day. In
that regard the study is not looking at specific days as such but considering any
overall trend between warmer and cooler periods, or wetter and drier periods of
time.

German football teams can have multiple variations on their names which are in
current use. The match data files and stadium database names lack unique
identifiers and are sufficiently different to prevent string matching using tools like
Levenshtein distance, requiring manual selection of stadium names for each team.

Stadium names also vary as many are named after sponsors, which can change each
year. Commonly used, older names may prevail and are not always consistent with
current Google Maps information. As such, geocoding for stadiums needs to be
checked manually to ensure the right stadium has been matched.


3.3 Data Processing

3.3.1 Introduction
The study's objective is to analyse football match outcome against weather
effects. To achieve this the four datasets outlined above must be merged into a
coherent and valid single data frame for analysis. Each of the four sets needs to
be treated individually, to remove unwanted information and ensure consistent
formatting, and then combined to match viable weather observations and weather
stations with stadium locations. As the data flow diagram in figure 2 shows, this is
not just a simple join of two data sets but an iterative process in which both weather
stations and their observation data need to be verified and checked.
In matching stations to stadiums there are two key difficulties. Firstly, weather
stations are located at a variety of altitudes and some are in mountainous regions,
where the height and therefore temperature differential is large enough to make
them unusable. Secondly, observation data for some variables is missing to such an
extent that it also renders that station unusable. The size and number of these files
makes it impossible to check observation files manually.

Removing stations requires the distance matrix to be recalculated and the next
nearest station selected, with altitude and observation checks undertaken again.
This process continues until no more errors are found. The overall data flow
diagram is shown below in figure 2. The colours indicate each data set's general
role in the overall process, with blue being the weather station data, green the
match data, grey the weather observation data and red the stadium data. The
merged data set is denoted orange at the point where all four sets are combined.
A detailed description of the key processes is given below.


Figure 2: Data flow diagram (inputs: Weather Station data set, Match data set,
Stadium data set, Weather Observation data; output: Final combined data set)


3.3.2 Football Match Data


All 42 files were downloaded simultaneously from Football-UK (2014) with the Firefox
web browser using the Download it All add-on. The files were vertically bound into
a single data frame and the blank rows were removed. All unwanted columns were
dropped, retaining only the six key data fields (Division, Date, Home and Away team
name and Home and Away team score) which are constant across all years.
Some team names are corrected at this stage as they have been written differently
over time. A list of all unique teams was generated from this for each division, which
will eventually form the basis of the stadium data set.

3.3.3 Stadium Data Set creation


The key part of the stadium set creation process is manual, in that stadium names
must be selected from a suitable database and entered into a csv file. Because
many teams have multiple similar names, string matching tools such as those that
employ Levenshtein distance were found not to be feasible. Four online databases
as outlined in Appendix 8.3.3 were used to obtain the correct stadium name. Once
done, the Google API was used through the geocode function within R Studio to
automatically generate latitude and longitude co-ordinates for each stadium.

3.3.4 Distance Matrix Calculation


Calculating the distance between each stadium and weather station is a critical
part of the process. Two approaches were taken to calculate these distances. The
primary method uses a point distance calculation which takes each pair of stadium
coordinates and calculates the distance to every weather station on the current list,
as shown in figure 3. The output matrix consists of three columns: Stadium Name,
Weather Station ID (STAID) and the distance in metres between them. The matrix
is then reduced using a minimum distance function to output a final list of each
stadium and its single nearest matching weather station.
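The nearest-station step described above can be sketched as follows. This is an illustration only: the study's own code was written in R (figure 3), whereas this sketch uses Python and assumes a haversine great-circle formula for the point distance. The stadium names, station IDs and coordinates are hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in decimal degrees."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def distance_matrix(stadiums, stations):
    """Long-format matrix: one (stadium name, STAID, metres) row per pair."""
    return [(name, staid, haversine_m(slat, slon, wlat, wlon))
            for name, (slat, slon) in stadiums.items()
            for staid, (wlat, wlon) in stations.items()]

def nearest_station(matrix):
    """Reduce the matrix to the minimum-distance station for each stadium."""
    best = {}
    for name, staid, dist in matrix:
        if name not in best or dist < best[name][1]:
            best[name] = (staid, dist)
    return best

# Hypothetical coordinates, for illustration only.
stadiums = {"Stadium A": (48.22, 11.62), "Stadium B": (53.59, 9.90)}
stations = {1: (48.16, 11.54), 2: (53.63, 9.99), 3: (52.46, 13.30)}

matrix = distance_matrix(stadiums, stations)
nearest = nearest_station(matrix)
for name, (staid, dist) in nearest.items():
    print(name, staid, round(dist))
```

When a station is later rejected, removing it from `stations` and rerunning the two functions selects the next nearest candidate, which mirrors the iterative process described in section 3.3.1.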


Figure 3: Distance matrix Calculation Code in R Studio

Finally this list is merged with the original stadium list which adds the weather
station ID (STAID) as a new column against each stadium name. STAID is critical
as a unique identifier to merge the weather observation data later on as the
weather observation data uses this.

The process has a manual element in that the observation weather files for each
of the five variables need to be selected (based on the results of the distance
calculation) and moved into a working folder for merging and checking later on.
Subject to this check, files may subsequently need to be removed or added, and
the list of weather stations must be manually corrected to remove stations
with missing observation data before the process is repeated.

3.3.5 Checking Altitude Differential


While the weather station raw data includes altitude data, the stadium altitude must
be obtained using the RJSONIO package within R. It is the height difference that
is being checked. A large difference between the stadium and the weather station
could result in the temperatures being unrepresentative of the conditions at the
stadium. Height differentials greater than 400m, or about 2.5°C, are omitted from
the study.
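A minimal sketch of the altitude check follows. The 400m threshold is taken from the text above; the lapse rate of 6.5°C per 1,000m is the standard atmosphere value and is an assumption here, since the source does not state how the 2.5°C figure was derived. The altitudes in the example are hypothetical.

```python
MAX_DIFF_M = 400.0                  # threshold stated in the text
LAPSE_RATE_C_PER_M = 6.5 / 1000.0   # assumed standard lapse rate, not from the source

def altitude_check(stadium_alt_m, station_alt_m):
    """Return the height differential, the implied temperature offset and usability."""
    diff = abs(stadium_alt_m - station_alt_m)
    return {
        "diff_m": diff,
        "temp_offset_c": diff * LAPSE_RATE_C_PER_M,
        "usable": diff <= MAX_DIFF_M,
    }

# Hypothetical altitudes: a lowland stadium against a valley and a mountain station.
print(altitude_check(105.0, 160.0))
print(altitude_check(105.0, 980.0))
```

Under this assumed lapse rate a 400m differential corresponds to roughly 2.6°C, which is in line with the "about 2.5°C" quoted above.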


3.4 Dealing with Missing Data

The weather observation data contains a variety of missing values. These are
identified during the checking process. Some missing values are one-offs with
no discernible pattern and are classified as Missing Completely at Random
(MCAR) as they are no more or less likely to be missing than any other value.
However, the error files also highlighted data which is almost certainly Missing Not
At Random (MNAR), such as October 2001 and February 2014, months in which
values were missing for a large number of the weather stations. Generally
missing data fell into three categories: 1) MCAR, typically one or two
consecutive days only; 2) MNAR, typically 10-20 days for specific recurring
months and years (figure 4); 3) large scale missing data covering multiple months
and often years for an entire weather variable.
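One way to separate these three categories automatically is to measure runs of consecutive missing values. The sketch below is illustrative only: it uses the -999.9 sentinel shown in figure 4, and the length thresholds (up to 2 days, up to 30 days, more than 30 days) are assumptions loosely based on the categories above and on the 30-day limit in Test 05. The series is made up.

```python
MISSING = -999.9  # sentinel value as it appears in figure 4

def missing_runs(values, sentinel=MISSING):
    """Return (start_index, length) for each run of consecutive missing values."""
    runs, start = [], None
    for i, v in enumerate(values):
        if v == sentinel:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # series ends inside a run
        runs.append((start, len(values) - start))
    return runs

def classify_run(length):
    """Assumed thresholds: 1-2 days isolated, 3-30 days cluster, over 30 large scale."""
    if length <= 2:
        return "isolated"
    if length <= 30:
        return "cluster"
    return "large-scale"

# Made-up daily series: one isolated gap, one 12-day gap, one 45-day gap.
series = [3.1, MISSING, 2.0] + [MISSING] * 12 + [4.0] + [MISSING] * 45
runs = missing_runs(series)
labels = [classify_run(length) for _, length in runs]
print(list(zip(runs, labels)))
```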

There are a variety of ways to deal with missing data. The approach taken in this
study is to preserve as much of the data as possible, as the percentage of
weather observation data affected is small. Of the 241,753 total lines just 1,323
lines contain errors, equating to 0.5%. If this was evenly represented when the files
are joined then there would be 70 lines out of 12,926 with errors. However, as
match data is based on weekly occurring events and errors are clustered across
10-20 consecutive days, the real error rate would be much lower. Deleting these
values was considered, but imputation (Yuan, 2010) was chosen in order to
provide unbiased replacement values for those occurrences.
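The error rate quoted above can be reproduced with a short calculation. All three input figures come from the text; the variable names are illustrative.

```python
total_lines = 241_753   # weather observation lines (from the text)
error_lines = 1_323     # lines containing missing values (from the text)
matches = 12_926        # match rows in the final data set (from the text)

error_rate = error_lines / total_lines
expected_bad_matches = error_rate * matches  # if errors were spread evenly

print(f"error rate: {error_rate:.2%}")
print(f"expected affected matches: {expected_bad_matches:.0f}")
```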


STAID  DATE        RR    HU  TG    CC  FG
51     19/09/2001  12.8  86  10.7  6   -999.9
51     20/09/2001  11.9  97  12.6  8   -999.9
51     21/09/2001  0.2   86  13.1  6   -999.9
51     22/09/2001  0     84  10.6  5   -999.9
51     23/09/2001  0     84  10.9  5   -999.9
51     24/09/2001  0     88  13    6   -999.9
51     25/09/2001  0     93  10.8  7   -999.9
51     26/09/2001  0     83  13.3  4   -999.9
51     27/09/2001  0     82  15.8  5   -999.9
51     28/09/2001  0     79  15.3  4   -999.9
51     29/09/2001  6.4   92  13.8  7   -999.9
51     30/09/2001  0.7   89  16.3  5   -999.9

Figure 4: Example of missing data for wind speed for 12 days in Sept 2001

The Amelia package (Cran, 2013), which is accessed through R Studio, allows
missing values to be imputed in place of NA values. Filling in the missing
values using an algorithm that approximates a best fit based on values on either
side avoids bias, and the package has features that make it highly applicable to
time series data.

3.5 Mashup

A mashup takes independent data sets and recombines them to reveal
previously unknown information. The three separate data sets are now combined
to create a new and unique data set. As the station codes (STAID) were added to
the stadium data set earlier, the weather station list is not required: the STAID
code bridges the gap between each stadium and the observation data, allowing
them to be merged.
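The merge described above can be sketched as a two-step lookup: home team to STAID, then (STAID, date) to the day's observations. This is a Python illustration rather than the study's R code. The team names, the second STAID and the dictionary field names are hypothetical; the observation values for STAID 51 on 22/09/2001 are taken from figure 4.

```python
def mashup(matches, stadiums, observations):
    """Join matches to weather observations via the home team's STAID and the match date."""
    staid_by_team = {s["team"]: s["staid"] for s in stadiums}
    obs_by_key = {(o["staid"], o["date"]): o for o in observations}
    merged, unmatched = [], []
    for m in matches:
        staid = staid_by_team.get(m["home"])
        obs = obs_by_key.get((staid, m["date"]))
        if obs is None:
            # A missing observation file or date would silently drop rows in a plain join,
            # so unmatched rows are kept separately for checking.
            unmatched.append(m)
            continue
        row = dict(m)
        row["staid"] = staid
        row.update({k: v for k, v in obs.items() if k not in ("staid", "date")})
        merged.append(row)
    return merged, unmatched

matches = [
    {"date": "22/09/2001", "home": "Team A", "away": "Team B", "FTHG": 2, "FTAG": 1},
    {"date": "23/09/2001", "home": "Team B", "away": "Team A", "FTHG": 0, "FTAG": 0},
]
stadiums = [{"team": "Team A", "staid": 51}, {"team": "Team B", "staid": 77}]
observations = [
    {"staid": 51, "date": "22/09/2001", "RR": 0.0, "HU": 84, "TG": 10.6, "CC": 5},
]

merged, unmatched = mashup(matches, stadiums, observations)
print(merged)
print(unmatched)
```

Keeping the unmatched rows makes the row-count check in Test 07 straightforward: the merged and unmatched counts should sum to the number of matches.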


3.6 Feature Engineering

Feature Engineering is the process of extracting even more potentially useful
information from the existing data (Domingos, 2011) and is an important process
in knowledge discovery and analysis, especially in data mining. In total 14 new
columns were created, each containing a unique variable that allows for enhanced
subsetting, simplification, grouping or analysis of the data set. The new variables
each fall into one of four categories: Time, Weather, Goal Outcome or
Geographical.

Time Variables
Month: Extracted from the date field. Creates 12 standard calendar months;
January, February, March etc. (Categorical)
Year: Extracted from the date field. Creates a four digit numeric value for the year.
(Numeric)
Season: Groups together 3 consecutive months to generate spring, summer,
autumn and winter categories. Summer is June, July & August. (Categorical)

Weather Variables
BScale: Creates a simplified categorical wind variable based on the Beaufort scale
of measurement with values from 0 to 9. (Numeric)
HUscale: Creates a simplified scale from 1 to 10 for humidity range in 10%
increments. (Numeric)
Rain: Rain is grouped into 6 categories; No Rain, Light rain, Moderate, Heavy,
Very Heavy and Violent. (Categorical)
Atemp: Apparent Temperature is derived from temperature, humidity and
wind speed and provides a "real feel" equivalent (Steadman, 1984). (Numeric)
ATempScale: A simplification of the above range into 5 general temperature
ranges of 10 degrees Celsius each. (Categorical)
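As an illustration of how an apparent temperature can be derived from the three inputs, the sketch below uses a widely quoted non-radiation formula attributed to Steadman, AT = Ta + 0.33e - 0.70ws - 4.00, where e is the water vapour pressure in hPa. The source does not state which formula was used for Atemp, so this choice is an assumption. The function and parameter names are illustrative.

```python
import math

def vapour_pressure_hpa(temp_c, humidity_pct):
    """Water vapour pressure (hPa) from air temperature (°C) and relative humidity (%)."""
    return (humidity_pct / 100.0) * 6.105 * math.exp(17.27 * temp_c / (237.7 + temp_c))

def apparent_temperature(temp_c, humidity_pct, wind_ms):
    """Apparent temperature (°C), assuming the non-radiation Steadman-style formula."""
    e = vapour_pressure_hpa(temp_c, humidity_pct)
    return temp_c + 0.33 * e - 0.70 * wind_ms - 4.00

# Inputs set to the mean TG, HU and FG values reported in Table 1.
at_mean = apparent_temperature(9.1, 77.4, 3.48)
print(round(at_mean, 2))
```

In this form humidity raises the apparent temperature while wind lowers it, which matches the qualitative description of the Atemp variable.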

Goal Outcome
TotalGoals: The total goals scored per match. (Numeric)
GDiff: The difference between home team goals and away team goals. (Numeric)
OverUnder2.5: Calculates, based on total goals, whether the result is Under 2.5
goals or Over 2.5 goals. (Categorical)
H_A_Win: Creates a match result value of HomeWin, AwayWin or Draw.
(Categorical)
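The four goal outcome variables follow directly from the full-time home and away goals (FTHG and FTAG in the match data). A minimal Python sketch, with the output labels taken from the definitions above:

```python
def goal_features(fthg, ftag):
    """Derive the goal outcome variables from full-time home and away goals."""
    total = fthg + ftag
    diff = fthg - ftag
    if diff > 0:
        result = "HomeWin"
    elif diff < 0:
        result = "AwayWin"
    else:
        result = "Draw"
    return {
        "TotalGoals": total,
        "GDiff": diff,
        "OverUnder2.5": "Over" if total > 2.5 else "Under",
        "H_A_Win": result,
    }

print(goal_features(2, 1))
print(goal_features(0, 0))
```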

Geographical
Area: Germany comprises 16 states (Appendix 8.2.3), which can be obtained
using the output=more parameter of the geocode function within R Studio.
Region: Combines the Areas above into larger general geographical weather
regions based on typical climate conditions experienced around the country
(Encyclopedia Britannica, 2014 and Appendix 8.2.3).


4 Testing, Evaluation & Error Checking


4.1 Introduction

The processing stage as outlined in figure 2 and section 3.3 was subjected to
rigorous testing, error checking and evaluation at every stage to ensure that the
data sets were correct and accurate.

4.1.1 Test 01 Check Downloaded files


Downloaded files, including football data, weather observations and weather
stations, are checked visually to make sure the download was successful and the
files have been obtained. One or two files are opened to validate their integrity
and confirm that the information is present as expected.

4.1.2 Test 02 Football match data set


1) Test for NA rows and remove from data frame.
2) Check final row count (12,926) against matches played based on number of
participating teams.
3) Unique team names checked & verified. Duplicates and spelling corrected to
remove false copies. Checked against Fussball Archiv (2014)

4.1.3 Test 03 Weather Station data


1) Identify the weather variable with the fewest weather stations using nrow.
2) Check lat/lon coordinates have been correctly transformed into decimal codes.

4.1.4 Test 04 Stadium Data


1) Check number of teams in stadium list is correct.
2) Check stadium names are valid within the Google Maps API using geocode.

4.1.5 Test 05 Distance Matrix


3) Create error file of all combined -9999 values for weather observation data
based on current weather station selection. Remove large scale missing values
greater than 30 consecutive days.
4) Spot check results using single line code version and Google Maps.

4.1.6 Test 06 Imputed Data


To ensure that the imputation process did not fundamentally adjust the original
data a statistical summary was undertaken before and after. Analysis found that
the imputed data had no overall effect on the original data. In particular no values
occurred at the extreme ranges and neither the mean nor median was altered.
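The before-and-after comparison can be sketched as below. This only illustrates the check itself, not the Amelia imputation: the filled values in the example are hypothetical stand-ins, and the function names are illustrative.

```python
import statistics

def summarise(values):
    """Basic summary statistics for a numeric series."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

def imputation_check(before, after):
    """Compare observed values (None = missing) with the series after imputation."""
    observed = [v for v in before if v is not None]
    s_before, s_after = summarise(observed), summarise(after)
    return {
        # Imputed values should not fall outside the observed range.
        "within_range": s_before["min"] <= s_after["min"] and s_after["max"] <= s_before["max"],
        "mean_shift": abs(s_after["mean"] - s_before["mean"]),
        "median_shift": abs(s_after["median"] - s_before["median"]),
    }

# Hypothetical wind speed series with two gaps and stand-in filled values.
before = [3.0, 4.0, None, None, 5.0, 4.0]
after = [3.0, 4.0, 4.3, 4.7, 5.0, 4.0]
check = imputation_check(before, after)
print(check)
```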

Figure 5: Weather data before Imputation

Figure 6: Weather data after Imputation

If the imputation process had altered the data significantly then removing these
values would have been the next best alternative course of action.

4.1.7 Test 07 Final Checking


Prior to feature engineering the final merged data set is checked for:

Row count: Missing rows may indicate a missing weather observation file which
would remove all associated values during the join process. Use nrow.
NAs: Values missed by the imputation process. Filter for NA, -9999 values and
use summary analysis to check.
Validity: Several random rows are checked against the raw data sets to ensure
the right dates have been joined with every weather variable. Manual check.

The file is saved as a csv file to preserve all processes and make retrieval easier
for the next stages.


4.2 Data analysis

4.2.1 Introduction
The study's primary objective is to determine if any relationship exists between
weather effects and goal outcome, and R Studio provides a platform to investigate
the data set using both inbuilt and additional packages. Not all of the data set
contains information that will be analysed. Categorical data cannot be analysed
using traditional statistical methods, although it can be tabulated, graphed and
explored using R Studio and data mining techniques.

4.2.2 Data Analysis Process and Methodology


The analysis is split into 5 main sections:
1) Exploratory. Basic investigation of the variables using histograms and tables
to describe the main variables being considered.
2) Goal Outcome Analysis. Each of the four goal outcome variables is compared
against Apparent Temperature, Rainfall, Wind Speed, Humidity, Cloud Cover,
Calendar Month, Season and Region. In each case the categorical expression
is used, as developed through feature engineering, to generalise the results as
much as possible. If any potential trend is identified then where possible this is
tested using appropriate inferential statistics.
3) Altitude Analysis. A general overview of altitude.
4) Scatter Plots and Correlation. The raw numeric weather data is compared to
total goals and goal difference to determine if any relationship exists. A multiple
linear regression model will be assessed to see if one or more of the variables
are related.
5) Data Mining. Naive Bayes will be used for the Under/Over 2.5 event to
determine if the spread of matches can be predicted using weather variables
by splitting the data set into 80% training and 20% test sets.
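The data mining step in item 5 can be illustrated with a small categorical Naive Bayes classifier and an 80/20 split. This is a Python sketch, not the R implementation used in the study. The add-one (Laplace) smoothing and the fixed random seed are assumptions, and the toy rows are synthetic: they are constructed so that rain perfectly separates the classes and say nothing about the real data.

```python
import math
import random
from collections import Counter, defaultdict

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and split them into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_naive_bayes(rows, features, target):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)  # (feature, class) -> counts of each value
    levels = defaultdict(set)            # feature -> distinct values seen
    for r in rows:
        for f in features:
            value_counts[(f, r[target])][r[f]] += 1
            levels[f].add(r[f])
    return class_counts, value_counts, levels

def predict(model, row, features):
    """Return the class with the highest log posterior, using add-one smoothing."""
    class_counts, value_counts, levels = model
    n = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, cc in class_counts.items():
        score = math.log(cc / n)
        for f in features:
            score += math.log((value_counts[(f, c)][row[f]] + 1) / (cc + len(levels[f])))
        if score > best_score:
            best, best_score = c, score
    return best

# Synthetic toy data, for illustration only.
features = ["Rain", "Season"]
seasons = ["spring", "summer", "autumn", "winter"]
rows = []
for i in range(100):
    rain = "Rain" if i % 2 else "No Rain"
    rows.append({"Rain": rain, "Season": seasons[i % 4],
                 "OverUnder2.5": "Over" if rain == "Rain" else "Under"})

train, test = train_test_split(rows)
model = fit_naive_bayes(train, features, "OverUnder2.5")
hits = sum(predict(model, r, features) == r["OverUnder2.5"] for r in test)
accuracy = hits / len(test)
print(len(train), len(test), accuracy)
```

On real match data the accuracy should be compared against the share of the majority class in the test set, since a model that always predicts that class provides the baseline.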


4.2.3 Exploratory Analysis


Descriptive analysis gives a general overall understanding of the data and
includes some bivariate analysis. There are four categories of data
to be considered: Time, Goal Outcome, Weather Observations and Geography.
Both Time and Geography are in part weather items as they contain variables such
as season and region (Encyclopedia Britannica, 2014) which are associated with
a general weather type. Of the 31 variables in the data set 13 are numeric, 12
are categorical and 6 are unused. The numerical values can be directly
investigated using the statistical function describe. A full list of all the data set
variables is provided in Appendix 8.4.

Table 1: Summary Overview of numerical variables

Variable     mean     sd       median   trimmed  mad     min      max      range    se
FTHG         1.63     1.33     1.00     1.50     1.48    0.00     9.00     9.00     0.01
FTAG         1.17     1.13     1.00     1.02     1.48    0.00     9.00     9.00     0.01
alt          154.18   147.09   105.00   129.16   94.89   4.00     553.00   549.00   1.29
RR           1.94     4.04     0.10     0.94     0.15    0.00     49.90    49.90    0.04
HU           77.40    12.08    79.00    78.33    11.86   24.00    100.00   76.00    0.11
TG           9.10     6.38     9.20     9.11     6.52    -15.80   28.70    44.50    0.06
CC           5.43     2.18     6.00     5.67     1.48    0.00     8.00     8.00     0.02
FG           3.48     1.78     3.10     3.28     1.63    0.00     21.60    21.60    0.02
TotalGoals   2.80     1.71     3.00     2.72     1.48    0.00     13.00    13.00    0.02
GDiff        0.46     1.78     0.00     0.46     1.48    -8.00    9.00     17.00    0.02
BScale       2.51     0.90     2.00     2.47     1.48    0.00     9.00     9.00     0.01
HUscale      8.19     1.24     8.00     8.28     1.48    3.00     10.00    7.00     0.01
Atemp        5.76     7.67     5.40     5.65     8.15    -20.30   30.10    50.40    0.07

A preliminary summary (table 1) reveals some useful information for the data set
as a whole. As expected, more home goals are scored than away goals, although
both range from zero to nine. The upper end is very high compared to the mean,
indicating that these high scoring events are very rare. Goal difference is positive
on average, indicating again that home team wins are more prevalent. For the
weather data, rainfall is typically low with occasional heavy downpours, based on
the mean and range. Humidity is often quite high, as is the cloud cover. The
apparent temperature has a lower mean than temperature alone, suggesting that
wind speed plays a greater role in reducing the "feels like" temperature than
humidity does in increasing it.

A range of graphs and charts considering goal structure and distribution and the
primary weather variables are considered.

Figure 7: Mean goals scored per match for all teams

Mean goals follow a relatively smooth curve, although the top 7 teams do show a
stepped increase in average goals scored. The overall mean is 2.8 goals per match.


Figure 8: Total goals scored for all teams

Total goals scored in figure 8 show a stepped divide: teams have either scored
800 goals or more, or fewer than 450 goals, with only one team between these
limits. Some teams have remained in the leagues for longer than others, and in
Bundesliga 2 (B2) there is much more volatility in team movement, as 66 teams
have played in B2 over the 21 years compared to 37 in B1.

Figure 9 shows the distribution of total goals for each stadium. The Allianz
Arena in Munich has the highest goal density because it has two teams
that have consistently participated in the B1 and/or B2 league for the entire
21 year period. Generally the overall density of goals scored is highest in the west
of the country where most teams and stadia are located.


Figure 9: Total Goals scored per Stadium

Bundesliga 1: mean = 2.87, sd = 1.71
Bundesliga 2: mean = 2.72, sd = 1.72

Figure 10: Goal Distribution for B1 (top) and B2 (bottom)


Figure 10 shows that goal distributions are positively (right) skewed, and that the
distribution structure, mean and standard deviation are virtually identical between
the two leagues, with B1 having a slightly higher overall scoring average than B2.

Figure 11: Goal Difference distribution histogram

As with goals scored, the goal difference distributions for B1 and B2 in figure 11
are closely matched.


Figure 12: Histogram of Average Temperature

Average temperature is normally distributed, as shown by figure 12 above, with a
mean temperature of 9.1°C.

Figure 13: Histogram of Rainfall for Germany for all 21 years

Rainfall is very heavily positively (right) skewed, as in figure 13, with the majority
of all instances, 6500 days (over 50%) in the data set, having no recorded rain at
all. Heavier periods of rain are increasingly intermittent and rare, with most rainfall
being either light or moderate.

Figure 14: Histogram of Cloud cover

With a median of 6 oktas, Germany can be considered a fairly overcast country, with
cloud cover being a prominent feature year round (as in figure 14).
In fact clear days with little or no cloud (categories 0, 1 and 2) combined account
for just 12.5% of the total time period.


Figure 15: Histogram of wind speed

Wind speed is typically low, with almost all occurrences being below 6m/s. This
roughly equates to Beaufort scale 3 and accounts for 87% of all
matches in the data set, with only 13% of all matches seeing higher wind speeds.

Figure 16: Histogram of density for average temperature and season.


An analysis of season make-up and average daily temperature, as in figure 16,
reveals distinct distributions for summer and winter with some slight overlap,
having warm and cold average temperatures respectively. Spring
and autumn however are much more similar, having an almost identical average
temperature and distribution. Analysis shows that the ranges, quartiles and
medians of these two seasons are almost the same. The other weather factors
show even less distinction between seasons, with significant overlap and with wind,
humidity and rain occurring at nearly all times of the year.

4.2.4 Total Goals Analysis


Total Goals are expressed in the plots below as average goals (of the total) to
provide a more meaningful comparison between groups for each weather variable.
The division grouping is also used in these plots to determine if there are any
differences between the two leagues. The average goals scored for all matches
played is 2.8; it is slightly higher at 2.88 for B1 and slightly lower at 2.72 for B2.

Figure 17: Total Goals (Mean) graph for Apparent Temperature and Rainfall


Apparent temperature appears to indicate a decrease in total goals scored as it
decreases. 32.3% of all games played are below zero degrees. An ANOVA test of
Total Goals against apparent temperature returns a p-value of 5.62e-06, indicating
that the group means are not identical and that there is a statistically
significant difference between them. A boxplot analysis in figure 18 shows the 5
different groups. However, in real terms 20% of all games are played in the
winter (60 per season), and a 0.3 drop in the average would be equivalent to
about 150 goals being scored rather than an average of about 170, a difference
of about 20 for that season. The balancing of the OverUnder2.5 statistic,
however, flattens the odds compared to other times when the home team has an
advantage (see 4.4.2). Statistically the effects are noticeable, but they are
small enough that they are not practically important and goal outcome is not
significantly affected.
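
The arithmetic behind this comparison can be reproduced as a quick check. The sketch below uses only figures quoted in the paragraph above (roughly 60 winter matches per season, an overall mean of 2.8 goals and a drop of 0.3); the variable names are illustrative and not taken from the study.

```python
# Figures taken from the text; the season length is the author's approximation.
winter_matches = 60     # roughly 20% of a season's matches
overall_mean = 2.8      # mean total goals per match across the data set
drop_in_mean = 0.3      # approximate fall in the mean in the coldest conditions

expected_goals = winter_matches * overall_mean
cold_goals = winter_matches * (overall_mean - drop_in_mean)
shortfall = expected_goals - cold_goals

print(f"expected: {expected_goals:.0f}, cold: {cold_goals:.0f}, "
      f"shortfall: {shortfall:.0f}")
```

This gives about 168 goals against 150, a shortfall of 18 over a whole winter, which the text rounds to 170 and 20.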

Figure 18: Boxplot of Apparent Temperature ranges against Total Goals

Rainfall indicates a possible increase in goals scored for very heavy rain;
however, this category comprises only 49 matches. The violent category has only 5
matches. While a sample size of 30 is typically considered statistically sufficient,
in this case the unknown nature of the occurrence (see Limitations of the Data,
3.2.5) makes it less reliable and the results should be treated with suitable caution.
There is no clear indication that rain affects total goals scored in relation to a
particular trend or pattern, although it may act as an interference factor.

Figure 19: Total Goals graph for Wind Speed and Humidity

The wind scale histogram demonstrated that the majority of all occurrences (87%)
were within the first four categories, which are relatively low wind speeds. At
Beaufort Scale 6 (BS6) there are just 28 matches, followed by 3, 6 and 2 matches
for BS7, 8 and 9 respectively. While the average scores in figure 19 indicate that
wind is having an interference effect by increasing goals scored, the sample is
too small to draw any reliable conclusions.

Likewise, humidity scale 3 consists of just 7 matches (2 for B1) and, while there
are 67 for scale 4, the sample size is also probably too low. Analysis of the 7
matches shows that these were all played in mild conditions with no other
significant weather effects being present at the time. There is no clear trend or
pattern to suggest that humidity has an effect on Total Goals outcome.


Figure 20: Total Goals scored for Cloud Cover and Month

Cloud cover in figure 20 shows no clear trend or pattern, and although B1 teams
seem to perform very slightly better in clearer conditions the same cannot be said
for B2 teams. The overall effect is minimal. Turning to the monthly totals, a
typical football season in Germany runs from August through to May. However,
some seasons started early in July (86 matches) or finished later in
June (128 matches). May and June, representing 10% of the data, together seem
to show much higher averages at the end of the season, which could result from
rising temperatures at this time but could also be due to factors other than
weather.


Figure 21: Mean Goals Scored by Season and Region

In Bundesliga 1 average goals scored remain high all year, but there is a small but
noticeable drop of 7.5%, from 2.95 to 2.73 average goals scored, as the season
moves from autumn into the winter period. In Bundesliga 2 this is less noticeable,
at a 3.25% drop, although the winter period does have the lowest average goals
scored. A Kruskal-Wallis test, applied because the data are not normally
distributed, shows that this is statistically significant, although again it is of
limited practical use. Playing in the east of the country (408 B1 and 1079 B2
matches for this region) seems to affect B1 teams, with average goals lower than
in any other region. The South East sees the biggest differential between B1 and
B2 teams despite a comparable 850 B1 matches and 1005 B2 matches being played
there. Kruskal-Wallis tests indicate no statistical significance for humidity and
cloud cover, indicating that those groups are very similar. All other variables
show a difference between groups.


4.2.5 Over/Under 2.5 goals scored


The overall over/under (OU) scores for the data set are 6818 (over) versus 6108
(under). This reflects an average proportion of 52.7% over, versus 47.3% under
which is a fairly even spread overall.

Figure 22: OU results against Apparent Temperature and Rainfall.

Apparent Temperature in figure 22 (left) appears to show the OU ratio flatten out
at temperatures of zero degrees and below compared to temperatures above
freezing, where the over metric has a clear advantage. It is also slightly flatter at
higher temperatures in the 20-31 category. Rainfall does not appear to show any
change in OU trends for changing rainfall intensity.


Figure 23: OU against wind speed and humidity

In figure 23 (left), at Beaufort scale 4 (1382 matches) and above, there is no
distinct difference between over and under results; in fact the under 2.5 goals
count is slightly higher for BS4 and the split is flat for BS5. Wind is potentially
having an interference effect by reducing the number of goals being scored
slightly. Humidity (right), however, shows no trend or pattern, with all ranges
being similar.

Figure 24: OU against Cloud Cover and Calendar month


Cloud cover appears to have no measurable effect on OU outcome. A monthly
breakdown indicates a very slightly higher differential in August, September,
October and November, around 1% above average. May, however, sees over scores
59% of the time and June 65% of the time, which is a noticeable shift against the
average odds. There has not been a June match since 2000, although matches in
May occur every year.

Figure 25: Mean Goals Scored by Season

Winter shows some levelling-out effect compared to the other three seasons due
to lower mean goals scored. Region seems to be having some effect, with the East
going against the overall average and the South East being generally flat compared
to the coastal and west regions.


Figure 26: Apparent Temperature and Wind Speed Proportion Graphs

Two graphs of further interest are Apparent Temperature and Wind Speed. These
are shown above in figure 26, normalised to show the proportional difference
between groups. For apparent temperature there is less of a difference between
the ranges than the previous plots might suggest, although the probability of
an over result increases slightly to 55% for the two warmest categories. At Beaufort
scale 6, 7 and 8, over results are achieved 71%, 67% and 67% of the time
respectively. However, the number of matches played, at just 28, 3 and 6 for the
three ranges respectively, is a very low sample size. The switch to all results being
under for the two matches played at BS9 highlights the potential volatility of this,
although there is some evidence that high wind speeds are potentially causing
interference leading to higher goals scored.
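
The weakness of these small samples can be illustrated with a confidence interval for each proportion. The sketch below uses the Wilson score interval, which is not a method used in the study; the over counts (20 of 28, 2 of 3, 4 of 6) are assumptions inferred from the percentages and match counts quoted above, and the function name is illustrative.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# ASSUMPTION: over counts inferred from the reported percentages
# (71% of 28, 67% of 3, 67% of 6); they are not taken from the data set.
samples = {"BS6": (20, 28), "BS7": (2, 3), "BS8": (4, 6)}
intervals = {k: wilson_interval(s, n) for k, (s, n) in samples.items()}

for scale, (lo, hi) in intervals.items():
    print(f"{scale}: {lo:.2f} to {hi:.2f}")
```

Under these assumed counts the interval for BS6 runs from roughly 0.53 to 0.85, and the intervals for BS7 and BS8 each span more than half of the possible range, which is consistent with treating these categories as too small to support firm conclusions.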


4.2.6 Goal Difference


Goal difference is a numeric representation of the Home/Away/Draw result: a
value of zero is a draw, a negative value represents an away win and its margin,
and a positive value a home win. There is a very large range, with goal
differences from -9 to +9 giving up to 19 distinct values to be plotted against
each weather effect, which is largely impractical. Again a proportional plot is
used to determine if there are any trends or patterns across the data as a whole,
as a frequency plot is ineffective for such large ranges. Even with a proportional
plot, most goal differences beyond +3 and -3 are not visible, as the number of
games featuring such scores is very low.
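
All four goal outcomes used in the study follow directly from the final score, as defined in the glossary (section 8.1). A minimal sketch of that derivation is given below; the function name, dictionary keys and result labels are illustrative choices, not identifiers from the study's data set.

```python
def goal_outcomes(home_goals, away_goals):
    """Derive the four goal outcomes from a final score."""
    total_goals = home_goals + away_goals
    goal_difference = home_goals - away_goals   # positive = home win

    if goal_difference > 0:
        result = "Home"
    elif goal_difference < 0:
        result = "Away"
    else:
        result = "Draw"

    # The 2.5 threshold means 0-2 goals is "Under" and 3 or more is "Over".
    over_under = "Over" if total_goals > 2.5 else "Under"

    return {
        "total_goals": total_goals,
        "goal_difference": goal_difference,
        "result": result,
        "over_under": over_under,
    }

print(goal_outcomes(1, 3))
```

For a 1-3 score this returns a total of 4, a goal difference of -2, an away win and an over result.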

Figure 27: Goal Difference (GD) vs Apparent Temperature and Rainfall

Apparent temperature indicates more draw results (GD=0) towards lower
temperatures and fewer home wins, as per the Home/Away/Draw results. There
appears to be a slight increase in away wins at goal differences of -3 and -4 at
very low temperatures. Rainfall again shows interference for heavy and violent
rainfall, but as 99.6% of all matches played are within the four left categories
no clear trend can be inferred given the very low sample size.


Figure 28: Goal Difference vs Wind speed and Humidity

Wind speed in figure 28 shows the same interference at BS6 and above, while at
lower speeds the number of draw results increases slightly as wind speed rises.
There is no clear trend for humidity.

Figure 29: Goal Difference vs Cloud Cover and calendar month

Figure 29 indicates no trend or pattern between GD and cloud cover or month.


Figure 30: Goal Difference vs season and Region

Figure 30 indicates no significant trends or patterns between season and region
for goal difference. On the whole, the graphical representation of goal difference
as a more detailed expression of the Home/Away/Draw result did not yield any
useful information beyond what those results already provide.


4.2.7 Home/Away Win or Draw


For the data as a whole there are 6074 Home wins (47%), 3395 Away wins (26.3%)
and 3457 Draws (26.7%). The Home/Away or Draw results (HAD) are represented
as proportion plots because the differences in the number of matches played
between groups were so large that frequency plots made any visual analysis
unfeasible.

Figure 31: HAD vs Apparent Temperature and Rainfall

Apparent Temperature in figure 31 (left) shows a slight decrease in home win
advantage as temperature decreases. Away wins do not increase, however, and the
difference is made up by draw results. Rainfall results show a fairly stable HAD
outcome except for the last two ranges, which have very low sample sizes.


Figure 32: HAD vs Wind Speed and Humidity

Wind in figure 32 (left) shows a slight decrease in home wins as wind speed
increases from BS2 to BS5, after which there is significant interference. Home wins
appear to be benefitting but the small sample size precludes drawing any
conclusions. Humidity indicates a stable pattern of HAD results excluding
categories three and four, which have low sample sizes.

Figure 33: HAD vs Cloud Cover and Calendar month


Figure 33, for cloud cover and calendar month, shows no variation in HAD results.

Figure 34: HAD vs season and Region

Both graphs in figure 34 show no significant changes in HAD result.

Home/Away or Draw results do not seem to be affected by weather effects overall.
There is some evidence that falling temperature reduces home wins and increases
draws slightly, and that increasing wind speed also reduces home wins. However,
the effect is very small. As with previous results, the low sample size for the
extreme ranges of wind and rainfall prevents drawing any conclusions.


4.3 Altitude Effects

Altitude, while not a direct weather effect, does affect player performance through
a reduction in atmospheric pressure and is therefore an external environmental
effect which could be considered an indirect weather effect.

Figure 35: Goal Outcome against Altitude

Mean goals scored trend very slightly downward with altitude and drop to just
2.38 where the altitude is 500m or over. The OU metric shows a greater skew
towards under results, at 59%. There have been 323 games played at 500m or more,
at two stadia, which is a reasonable sample size. Pressure tends to be constant,
so there is reduced risk that the recorded pressure and the matches did not occur
at the same time. The issue with these results is that individual team results
are affecting them. Unterhaching and Augsburg are the residents at these stadia
and, with goal averages of 2.25 and 2.58 respectively (figure 7), are very
low-scoring teams overall; it is this that is being represented rather than the
effects of altitude.


4.4 Scatter plots and correlation

In looking for any linear relationship and correlation, the dependent variable
Total Goals is used against the five numeric independent variables of temperature
(°C), wind speed (m/s), humidity (%), cloud cover (Oktas) and rainfall (mm). The
addition of the B1 and B2 variable shows how the two leagues differ from each
other. The output results are shown in figure 36 below.

Figure 36: Pairs and correlation analysis of numeric data

Overall there appears to be no real correlation between any of the weather effects
and total goals scored, with temperature having the strongest correlation at just
0.05. This would appear to be supported by the goal outcome analysis, which
showed evidence only of small changes overall and at extreme ranges, with no
obvious trends or patterns.


Figure 37: Scatter plot of total goals against Daily average temperature

The lack of relationship is highlighted by the scatter plot in figure 37 above, which
shows an almost flat linear regression line between temperature and Total Goals,
with a correlation value of just 0.0527. Multiple linear models using all the
available weather variables were also created and tested for both Total Goals and
Goal Difference. The results indicated no linear relationship between them.
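
For a simple linear regression with a single predictor, the coefficient of determination equals the square of the correlation coefficient, so the reported correlation can be converted into a share of variance explained. The short sketch below applies this to the value quoted above; the variable names are illustrative.

```python
r = 0.0527            # correlation reported for temperature vs Total Goals
r_squared = r ** 2    # share of variance explained by a one-predictor linear fit

print(f"R^2 = {r_squared:.4f} ({r_squared:.2%} of variance)")
```

On this reading, daily average temperature accounts for about 0.28% of the variation in total goals, which is why the regression line appears almost flat.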


4.5 Data Mining and Predictive Modelling

Although the results indicate there is no relationship between goal outcome and
weather, data mining techniques could allow for knowledge discovery and
potentially even predictive modelling. An approach based on a Naïve Bayes
algorithm was selected as it uses data from previous events to try to predict future
outcomes. The event being predicted is the OverUnder2.5 goals variable, using the
feature-engineered weather factors Rain, HUscale, CC, ATempScale and
BScale. These are all factorised for use within the algorithm. The advantage of
Naïve Bayes is that it assumes all of the features in the data set are of
equal importance, which makes it good at detecting potentially weak effects, as
could be likely in this instance.
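
The way a Naïve Bayes classifier combines categorical factors can be sketched in a few lines. The example below is a minimal standard-library illustration with Laplace smoothing; it is not the study's implementation, and the six rows are invented purely to exercise the code, so they carry no information about the real data. The function names and the two factor columns (a rain category and a Beaufort category) are assumptions made for the example.

```python
from collections import Counter, defaultdict
import math

def train(rows, labels):
    """Count class and per-feature value frequencies from categorical rows."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, label) -> value counts
    levels = defaultdict(set)             # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            levels[i].add(value)
    return class_counts, value_counts, levels

def predict(model, row, alpha=1.0):
    """Return the label with the highest posterior, using Laplace smoothing."""
    class_counts, value_counts, levels = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        log_score = math.log(n_label / total)                # log prior
        for i, value in enumerate(row):
            numerator = value_counts[(i, label)][value] + alpha
            denominator = n_label + alpha * len(levels[i])
            log_score += math.log(numerator / denominator)   # log likelihood
        scores[label] = log_score
    return max(scores, key=scores.get)

# Invented toy rows: (rain category, Beaufort category) -> over/under label.
rows = [("none", "BS2"), ("none", "BS3"), ("light", "BS2"),
        ("heavy", "BS6"), ("heavy", "BS5"), ("light", "BS6")]
labels = ["over", "over", "over", "under", "under", "under"]

model = train(rows, labels)
print(predict(model, ("none", "BS2")), predict(model, ("heavy", "BS6")))
```

Each factor contributes one multiplicative term to the score regardless of how strongly it is related to the outcome, which is the sense in which the features are treated as equally important.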

The data set is split at random into an 80% training set and a 20% test set to allow
the algorithm to learn any rules and then apply them in predicting goal outcome on
the test data set. The results on the 2586-row test data set showed that 1366 were
correctly predicted, equating to a 52.8% success rate. As the Over/Under metric is
essentially an even spread, this indicates that the Naïve Bayes predictive model
using weather variables is potentially no better than making a guess where the
probability of getting a right result is essentially 50/50. An output of the Naïve
Bayes probabilities is provided in Appendix 8.5.
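
A useful yardstick for this result is the accuracy of always predicting the more common class. The sketch below compares the reported accuracy with that baseline, using the over and under counts from section 4.2.5. It assumes the test split has roughly the same over/under proportions as the full data set, which the text does not state.

```python
correct, test_rows = 1366, 2586   # reported test-set results
over, under = 6818, 6108          # OU counts for the full data set (section 4.2.5)

accuracy = correct / test_rows
# ASSUMPTION: the test split mirrors the full data set's over/under proportions.
majority_baseline = max(over, under) / (over + under)

print(f"model accuracy:         {accuracy:.1%}")
print(f"always-'over' baseline: {majority_baseline:.1%}")
```

The model's 52.8% is within a tenth of a percentage point of the 52.7% obtained by predicting "over" for every match, which supports the conclusion that the weather factors add no predictive value.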

4.6 Analysis Conclusion

The study has shown that mean goals scored are in fact reduced slightly as
temperature drops, which is reflected seasonally, with much colder apparent
temperatures of below -10°C seeing a slight further drop. While this is
statistically significant, the overall change in mean goals is not sufficiently
large from a practical perspective to have any meaningful impact. Humidity, cloud
cover, rainfall and wind speed appear to have no measurable effect on goal
outcome, although wind and to some extent rainfall show interference at the
extreme limits, for just 0.4% of the total matches played. These extreme events are
too rare and the overall sample of matches too low to infer any specific pattern or
trend, although very high wind speeds seem to increase goals scored and favour
home wins. Realistically, the most we can infer is that there is potentially an
interference effect.

Goal difference, OverUnder2.5 and Home/Away/Draw outcomes were all
generally consistent in that they showed no overall trend for weather effects. May
showed a greater share of over results, at 59%, which is a reasonably large
shift. Altitude as an indirect environmental factor did indicate that matches
played at 500m and above skewed the average goals and OU2.5 metric, but this
was attributed to the two resident teams scoring fewer goals on average.

Predictive modelling using the dataset or data mining techniques is not possible
due to the lack of relationship between goal outcome and weather effects. No
pattern or trend could be extracted from the data using a Naïve Bayes algorithm.


5 Conclusions
5.1.1 Introduction
The aim of the study was to establish if there is any relationship between weather
effects including temperature, humidity, wind speed, rainfall and cloud cover
against football match goal outcome within the German Bundesliga 1 & 2 leagues
over the last 21 years. The sports industry has been shown to be an economically
important part of western economies as outlined in chapter 1 and teams and
betting product providers are seeking to gain competitive advantage wherever
possible. The study seeks to redress the lack of knowledge with respect to
understanding how or if weather affects goal outcome in football matches. Existing
knowledge in this area cannot answer with certainty to what extent weather
variables affect match outcome or which weather factors play a role. The study
seeks to answer one primary question:

Is there any relationship between weather effects and goal outcome for football
matches played within the Bundesliga 1 & 2?

Supporting this, a number of secondary research questions were also asked, as
outlined in section 1.2.2. A summary of the findings is as follows:

Is there any difference between B1 and B2 due to weather effects?


The analysis shows that goal distributions are almost identical between the
groups and, while B2 has a slightly lower overall average goal score, there is no
discernible pattern or trend that would indicate that the second tier is affected
any differently by weather effects than the first tier.


Is goal outcome affected by just one, or multiple weather variables?


Goal outcome does not appear to be directly affected by any weather variable.
Slight effects are noted for reduced goal averages with temperature, reduced
home wins with increasing wind speed and higher over results in May. Multiple
linear regression modelling indicates no interaction effects between the variables.

Does regional weather affect games played in that area and goal outcome?
Region does not appear to affect goal outcome, and although average goals showed
a greater differential between B1 & B2 leagues all other goal outcomes showed no
change due to region. The OU2.5 spread was found to be uneven, with
the probability of an under score increasing to 64%.

Can the goal outcome of future matches be better predicted using selected
betting instruments and weather factors?
There is nothing to suggest that any of the four goal outcomes can be predicted
from weather factors. High winds and high rainfall showed interference effects,
such that wind speeds of BS6, 7 & 8 indicated a 0.7 probability of an over result
and probably a home win result, but the sample size is too low (less than 0.4% of
matches) to say with any certainty that this is correct. Analysis using the Naïve
Bayes algorithm to predict the Over/Under result was unable to make
predictions beyond a 52.8% accuracy level, which is no better than a random guess.

Do some seasons, months or time periods see goal outcome affected more
due to weather effects?
Seasonally there was a small but measurable drop in average goals scored moving
from autumn into winter. This was matched by a flattening of the OverUnder2.5
goals scored to an almost even spread. However, the effect was slight and
has limited practical application and impact.

Can goal outcome be better predicted using combined weather indices such
as apparent temperature?
The use of apparent temperature has the effect of extending the temperature range
compared with using air temperature alone. However, there is nothing to
suggest that apparent temperature offers any significant benefit over air
temperature, and no predictive advantage can be gained from using it based on its
use in the study.
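
The range-extending effect can be illustrated with a worked example. The study cites Steadman (1984) for apparent temperature, but the exact formula it used is not given in this section; the sketch below assumes the simplified non-radiation form of Steadman's index as published by the Australian Bureau of Meteorology, and the input values are illustrative rather than taken from the data set.

```python
import math

def apparent_temperature(air_temp_c, rel_humidity_pct, wind_speed_ms):
    """Non-radiation approximation of Steadman's apparent temperature (deg C).

    ASSUMPTION: this is the simplified form published by the Australian Bureau
    of Meteorology; the study's own calculation may differ.
    """
    # Water vapour pressure in hPa from air temperature and relative humidity.
    vapour_pressure = (rel_humidity_pct / 100.0) * 6.105 * math.exp(
        17.27 * air_temp_c / (237.7 + air_temp_c)
    )
    return air_temp_c + 0.33 * vapour_pressure - 0.70 * wind_speed_ms - 4.00

cold_windy = apparent_temperature(0.0, 80.0, 5.0)    # illustrative inputs
warm_humid = apparent_temperature(25.0, 80.0, 1.0)   # illustrative inputs

print(f"0 C, 80% humidity, 5 m/s wind  -> {cold_windy:.1f} C")
print(f"25 C, 80% humidity, 1 m/s wind -> {warm_humid:.1f} C")
```

With these inputs a 0°C day with moderate wind has an apparent temperature of about -5.9°C, while a humid 25°C day comes out above 25°C, so the index stretches the scale at both ends relative to air temperature.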

Is there any relationship between weather effects and goal outcome for
football matches played within the Bundesliga 1 & 2?
Aside from several small effects as outlined above and some possible interference
effects for extreme events there is no relationship between goal outcome and
weather effects. The null hypothesis (see section 1.2.2) is therefore accepted.

5.1.2 Theoretical Implications


Based on the results of this analysis, the theoretical assumptions behind the effect
that weather has on football matches, which were explored in section 2, should be
re-examined (Kuper and Szymanski, 2012). Claims that weather, and in particular
cold weather, detrimentally affects goal outcome (British Weather Services, 2013)
are not really supported by a more extensive analysis across the 12,926 matches
considered in this study. Recent studies (Perry, 2004) suggested that wind and
rain have a significant impact while temperature does not. However, if anything
this study has found the opposite to be true, with temperature having a small effect
overall and wind and rain only having an interference effect, if at all, at the upper
extreme limits, which are rare, at less than 0.4% of matches played, and, due to the
small sample sizes, not reliable enough to draw patterns or trends from. However,
the use of internal factors such as betting odds in conjunction with rainfall (Kickdex,
2014) could yield additional information not considered within this study. Other
studies have made generalist statements suggesting that most sports are
affected by humidity, temperature and wind (Pezzoli, Cristofori, Moncalero,
Giacometto and Boscolo, 2013), but again for the climatic region of Germany these
factors almost certainly play no role in affecting goal outcome and overall
performance. Thornes' (1977) suggestion that colder conditions are a hazard to
goalkeepers in maintaining performance is also not borne out by the lower goals
scored in colder conditions, although the effect on players' extremities overall could
be a factor in lower than average goals scored during colder months.

Overall the existing body of knowledge has been perhaps too general in its
approach and has lacked analytical methods to quantify such claims, potentially
overplaying the effects weather is really having on sports like football. This could
be avoided by being specific about the sport being analysed and limiting the study
to a specific geographic region to ensure accuracy. Almost certainly a larger
sample size is required to better understand the effects of high wind and rain.

5.1.3 Conclusion
Despite ongoing interest and opinion on the effects of weather on football matches,
there does not appear to be any relationship between weather effects and football
match goal outcome. Temperature effects are minimal and, while they do reduce
average goals scored in colder conditions, the effect is not large enough to be
practically significant and could be attributed to external factors. All other
variables also do not appear to have any measurable practical impact, other than
several slight and often localised effects which are too small or rare in occurrence
to draw any meaningful conclusions, except to say they potentially have an
interference effect.


The weather, it seems, does not kill goals, and football as a sport within the
Bundesliga 1 & 2 Leagues seems to be unaffected by weather effects overall.


6 Further development or research


The study undertaken was limited to a single European country and to a specific
time period. The project could be developed to include a greater range of countries
both within and beyond Europe and a longer period of matches played. There is
match data available for the Bundesliga (Fussball Archiv, 2014) for even earlier
time periods, although this would have required web scraping to obtain.
By including countries that have a warmer or colder climate, such as Norway and
Greece, the way these teams are affected by overseas temperature regimes and
climates they are unaccustomed to could be investigated through tournaments such
as the World Cup, European Champions League and the Euros. Some weather variables
provided by the ECA&D (2014) were unused, such as snow depth, wind direction,
sunshine duration and pressure, although based on the ones used it is not clear
whether these would yield any more useful results.

The inclusion of a stadium factor could help investigate the effects stadiums have
as they range from almost totally open fields to being fully enclosed venues with
roofs. This adds further complexity however in data preparation as stadiums have
changed over time due to refurbishment and teams moving.

As part of a broader and long-term study, the installation of dedicated weather
stations at every major stadium in Europe would allow for the recording of
continuous weather data at the specific point at which matches are played. As
many stadia are also used for athletics events, this could also support study in
those areas of sport and help broaden the research base being considered. By
measuring some parameters such as wind speed outside the stadium as well as at
pitch level, the way stadia mitigate weather factors such as wind could be better
understood.


7 References
Anderson, C., and Sally, D. (2013) The Numbers Game: Why everything you know
about football is wrong. Penguin Books.
BBC (2014) Football Betting The global industry worth Billion. [Online]. BBC.
Available at: http://www.bbc.com/sport/0/football/24354124 [Accessed 29th May
2014].
British Weather Services (2013) Cold Kills Goals The stats [Online]. British
Weather Services. Available from: http://www.britishweatherservices.co.uk/coldkills-goals-the-stats/ [Accessed 15 November 2014].
Advanced Football Analytics (2014) Temperature and Field Goals, Advanced
Football Analytics. Available from:
http://www.advancedfootballanalytics.com/index.php/home/research/weather/165
-temperature-and-field-goals [Accessed 18th December 2014].
Carr, M. J., Asai, T., Akatsuka, T., and Haake, S. J. (2002) The curve kick of a
football II: flight through the air. Sports Engineering, 5(4): 193-200.
Cran (2014) Amelia II: A Program for missing data [Online]. Cran r-project.
Available from: http://cran.r-project.org/web/packages/Amelia/index.html
[Accessed 15th November 2014].
Deloitte (2014) Premium Blend: A review of football finance. [Online]. Deloitte.
Available from: http://www.deloitte.com/assets/DcomItaly/Local%20Assets/Documents/Pubblicazioni/uk-sbg-annual-review-of-footballfinance-2014.pdf [Accessed 15th November 2014].
Domingos, P. (2012) A few useful things to know about machine learning.
Communications of the ACM, 55(10): 78-87.

ECA&D (2014) European Climate Assessment & Dataset Project [Online].
European Climate Assessment & Dataset Project. Available from:
http://eca.knmi.nl/ [Accessed 20th September 2014].
Encyclopedia Britannica (2014) Germany - Climate [Online]. Encyclopedia
Britannica. Available from:
http://www.britannica.com/EBchecked/topic/231186/Germany/57996/Climate
[Accessed 28th May 2014].
European Commission (2012) Study on the Contribution of Sport to Economic
Growth and Employment in the EU. [Online]. European Commission. Available
from: http://ec.europa.eu/sport/library/studies/study-contribution-spors-economicgrowth-final-rpt.pdf [Accessed 1st June 2014].
Football Pools (2014) The Pioneers of Football Pools [Online]. Football Pools.
Available from:
http://www.footballpools.com/cust?action=GoHelp&help_page=about_us
[Accessed 1st June 2014]
Fussball Archiv (2014) Das Deutsch Fussball Archiv [Online]. Fussball Archiv.
Available from: http://www.f-archiv.de/ [Accessed 15th October 2014].
Gelade, G.A., and Dobson, P. (2007) Predicting the Comparative Strengths of
National Football Teams. Social Science Quarterly, 88(1): 244-258.
Greenhough, J., Birch, P. C., Chapman, S. C., & Rowlands, G. (2002) Football
goal distributions and extremal statistics. Physica A: Statistical Mechanics and its
Applications, 316(1): 615-624.


Hamilton, H. (2014) Does the cold really kill Goals? Howard Hamilton Blog, 1st
May. Available from: http://www.soccermetrics.net/leaguecompetitions/temperature-vs-goals-study-premier-league [Accessed 24th May
2014].

Hong, Y (eds.) (2014) Routledge Handbook of ergonomics in sport and exercise.
New York: Routledge.
Kickdex (2014) Does rain level the playing field?, Kickdex Blog, 28th November.
Available from:
http://blog.kickdex.com/post/68368668405/does-rain-level-the-playing-field
[Accessed 30 November 2014].


Kraft, M. D., & Skeeter, B. R. (1995) The effect of meteorological conditions on fly
ball distances in North American Major League Baseball games. The
Geographical Bulletin, 37(1): 40-48.

Kuper, S., and Szymanski, S. (2012) Soccernomics. Philadelphia: Nation Books


Lewis, T. (2014) How computer analysts took over at Britain's top football clubs
[Online]. The Guardian, 9th March, Available from:
http://www.theguardian.com/football/2014/mar/09/premier-league-football-clubscomputer-analysts-managers-data-winning [Accessed 28th May 2014].
MECN (2012) The German Gambling and betting market [Online]. MECN.
Available from:
http://www.mecn.net/German_Betting_and_Gambling_Market-Report_Summary.pdf
[Accessed 5th November 2014].


McGarry, T., O'Donoghue, P., and Sampaio, J. (eds) (2013) Routledge Handbook
of Sports Performance Analysis. New York: Routledge


O'Donoghue, P., and Holmes, L. (2014) Data Analysis in Sport. New York:
Routledge
Paddy Power (2014) Stephen Hawking Exclusive: The maths that show us how
England can triumph in the world cup [Online]. Paddy Power. Available from:
http://blog.paddypower.com/2014/06/18/stephen-hawking-exclusive-the-mathsthat-show-us-how-england-can-win-the-world-cup/ [Accessed 10th December
2014].

Perry, A. (2004). Sports tourism and climate variability. Advances in Tourism Cli.

Peters, D. M., and O'Donoghue, P. (Eds.). (2013) Performance analysis of sport
IX. New York: Routledge

Pezzoli, A., Cristofori, E., Moncalero, M., Giacometto, F., and Boscolo, A. (2013)
Climatological Analysis, Weather Forecast and Sport Performance: Which are the
Connections?, Journal Climatol Weather Forecasting 1: e105

PKR. (2014) Under Over Betting [Online]. PKR. Available from:
http://bet.pkr.com/en/get-started/bet-types/under-over/ [Accessed 28th May 2014].

RStudio (2014) RStudio. [Online]. RStudio. Available from:
http://www.rstudio.com/ [Accessed 1st September 2014]


Riley, T., Williams, A.M. (eds.) (2003) Science and Soccer. 2nd Edition. London:
Routledge.
Rubel, F., and Kottek, M. (2010) Observed and projected climate shifts 1901-2100
depicted by world maps of the Köppen-Geiger climate classification.
Meteorologische Zeitschrift, 19(2): 135-141.


Steadman, R.G. (1984) A universal scale of apparent temperature. J. Appl.
Meteor., 23, 1674-1687.
Stone, P.H., and Carlson, J.H. (1979) Atmospheric Lapse rates regimes and
their parametrization. Journal of the Atmospheric Sciences, 36(3): 415-423.
Szucs, A., Allard, F., and Moreau, S. (2009) Open Stadium Design Aspects for
cold climates. PLEA2009 - 26th Conference on Passive and Low Energy
Architecture, Quebec City, Canada, 22-24.
Thornes, J. E. (1977). The effect of weather on sport. Weather, 32(7): 258-268.
Wiart, N, J Kelley, D James and T Allen (2011) Effect of temperature on the
dynamic properties of soccer balls, Journal of Sports Engineering and Technology
225.

Witten, I. H., Frank, E., and Hall, M.E (2011) Data Mining: Practical machine
learning tools and techniques. 3rd ed. Morgan Kaufmann.
World Stadiums (2014) World Stadiums [Online]. World Stadiums. Available from:
http://www.worldstadiums.com/ [Accessed 15th September 2014]
Yuan, Y. C (2010) Multiple imputation for missing data: Concepts and new
development (Version 9.0). SAS Institute: Rockville.


8 Appendix
8.1 Glossary of Terms

App. Temperature: Apparent Temperature is a measurement index that combines
air temperature, humidity and wind speed to provide an equivalent "feels like"
temperature.

CSV: Comma Separated Value file format.

Goal Difference: The difference between the Home and Away goals scored. This
will return a positive value where the home team wins or a negative value where
the away team wins.

Goal Outcome: Goal Outcome for the purpose of the study refers to one of four
measurements: Under/Over 2.5 Goals, Goal Difference, Home/Away Win or Total
Goals.

Home/Away Win: Either the home team wins, the away team wins or it is a draw.

Naïve Bayes: A data mining algorithm that is used to classify and predict based
on probabilities.

PCA: Principal Component Analysis is a technique used to reduce the number of
variables to a smaller number of components which account for the largest amount
of variance.

UEFA: Union of European Football Associations is the administrative body for
association football in Europe.

Under/Over 2.5: A commonly used betting instrument which allows the customer to
bet that the total goals scored within a match will be less than or greater than
2.5. As average goals tend to be around 2.8 this is typically regarded as an even
spread bet. The result is binary and will be either under or over.

R Studio: An open source statistical analysis program used to obtain knowledge
from data sets and provide a range of graphical and tabular outputs.

Total Goals: The total goals scored in a match is the home team goals plus the
away team goals.

8.2 Geography

8.2.1 Map of Europe and Germany

Germany, shown above in orange, as part of the European Union (EU).

Image Source: http://wrm.org.uy/wp-content/uploads/2012/12/map-europe-germany.png


8.2.2 Map of Germany

Germany Map showing primary cities, towns and major topography like mountains
and plains.

Source: http://www.worldatlas.com/webimage/countrys/europe/lgcolor/decolor.gif


8.2.3 States of Germany (and created Regions)

Map regions: Coastal & NW, East, West, South West

Germany has 16 distinct states, three of which (Berlin, Bremen and Hamburg) are
city states. These were simplified by combining them into four regions as shown
above.

Coastal and NW Region: Bremen, Niedersachsen, Hamburg, Schleswig-Holstein, Mecklenburg-Vorpommern
East Region: Brandenburg, Berlin, Sachsen, Thuringen
South East: Bavaria
West: Northrhine-Westphalia, Rhineland-Palatinate, Hessen, Saarland, Baden-Wurttemberg

Source: http://www.itcwebdesigns.com/tour_germany/map_german_states.gif


8.2.4 Weather Station & Stadium Locations

R Studio Mapping showing all of the 73 stadiums for Bundesliga 1 & 2 and all of
the possible 84 weather stations that contain every weather variable prior to
matching and selecting the nearest. Some weather stations are at map edges or
out at sea and will not be any use. Overall there is a good match between the two
locations although this assumes that all weather stations are viable at this stage.
The Dusseldorf area shown within the red dashed box is shown over the page as
a point of interest.
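
The matching of each stadium to its nearest weather station can be sketched as follows. This is a minimal illustration in Python rather than the R Studio distance matrix code used in the study (Figure 3). It assumes the great-circle (haversine) distance, and the coordinates in the usage note are illustrative values, not the real station or stadium positions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station(stadium, stations):
    """Return (station id, distance in km) of the station closest to the stadium.

    `stadium` and each entry of `stations` are dicts with "lat" and "lon" keys;
    stations also carry a "staid" key. These key names are hypothetical.
    """
    def distance(station):
        return haversine_km(stadium["lat"], stadium["lon"],
                            station["lat"], station["lon"])

    best = min(stations, key=distance)
    return best["staid"], distance(best)


# Illustrative (not real) coordinates for one stadium and two stations.
EXAMPLE_STADIUM = {"lat": 51.26, "lon": 6.73}
EXAMPLE_STATIONS = [
    {"staid": 479, "lat": 51.30, "lon": 6.77},
    {"staid": 494, "lat": 48.43, "lon": 10.93},
]
```

Calling `nearest_station(EXAMPLE_STADIUM, EXAMPLE_STATIONS)` returns the ID of the closer station together with its distance, which corresponds to the `STAID` and `dist` columns of the finished data set.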


A zoomed map showing the Dusseldorf area. This area has the highest density of
stadiums and teams linked to a single weather station: 6716 of all games played
(52%) are linked to this weather station ID (479). One of the project risks was
that stations like this could prove unusable, which could jeopardise the entire
study.


8.2.5 Final weather station and Stadium locations

Map showing the final 32 weather stations with useable observation data for all
five weather variables and all 73 stadiums for both Bundesliga 1&2.

8.3 Data Sets

8.3.1 Football Data


Example of a raw data file for Bundesliga 1 results for the 2002/2003 season.

Div  Date        HomeTeam        AwayTeam        FTHG  FTAG  FTR  HTHG  HTAG  HTR
D1   09/08/2002  Dortmund        Hertha          2     2     D    2     1     H
D1   10/08/2002  Cottbus         Leverkusen      1     1     D    0     0     D
D1   10/08/2002  M'gladbach      Bayern Munich   0     0     D    0     0     D
D1   10/08/2002  Nurnberg        Bochum          1     3     A    0     2     A
D1   10/08/2002  Schalke 04      Wolfsburg       1     0     H    0     0     D
D1   10/08/2002  Stuttgart       Kaiserslautern  1     1     D    1     0     H
D1   11/08/2002  Bielefeld       Werder Bremen   3     0     H    1     0     H
D1   11/08/2002  Hamburg         Hannover        2     1     H    0     1     A
D1   14/08/2002  Munich 1860     Hansa Rostock   0     2     A    0     1     A
D1   17/08/2002  Bayern Munich   Bielefeld       6     2     H    3     0     H
D1   17/08/2002  Bochum          Cottbus         5     0     H    3     0     H
D1   17/08/2002  Hannover        Munich 1860     1     3     A    1     2     A
D1   17/08/2002  Hansa Rostock   Nurnberg        2     0     H    1     0     H
D1   17/08/2002  Hertha          Stuttgart       1     1     D    0     1     A
D1   17/08/2002  Kaiserslautern  Schalke 04      1     3     A    1     0     H
D1   17/08/2002  Leverkusen      Dortmund        1     1     D    1     0     H
D1   18/08/2002  Werder Bremen   Hamburg         2     1     H    1     1     D
D1   18/08/2002  Wolfsburg       M'gladbach      1     0     H    0     0     D
D1   24/08/2002  Bielefeld       Wolfsburg       1     0     H    1     0     H
D1   24/08/2002  Cottbus         Hansa Rostock   0     4     A    0     2     A
D1   24/08/2002  Dortmund        Stuttgart       3     1     H    1     0     H
D1   24/08/2002  Hamburg         Bayern Munich   0     3     A    0     1     A

Football-Data (2014) provides a single file in csv format for each season of play
and for each division separately. Data is available from 1993 onwards. There are
typically 306 matches for each season based on 18 teams participating in each
League. There are 52 columns of data per file for most files containing the date,
final time results, half time results, where the game was played and a variety of
betting information. For earlier years not all of this information was recorded so the
files are inconsistent in their structure and layout. Twenty-one seasons of matches
(306 games per season) across two leagues equates to 12,926 games played in
total. Note that this is slightly higher than expected as some years featured 20
teams in a season resulting in more games being played. This data is spread
across 42 csv files. The first six columns represent consistent elements across all
42 files from which all four goal outcome measurement metrics can be calculated.
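
As an illustration of how the four goal outcome metrics follow from the full time home goals (FTHG) and away goals (FTAG), a minimal Python sketch is given below. The dictionary keys mirror the column names of the finished data set in section 8.4. The function name and the "HomeWin" label are assumptions made for the example, since only the "AwayWin" and "Draw" labels are visible in the data set summary.

```python
def goal_outcomes(fthg, ftag, line=2.5):
    """Derive the four goal outcome metrics from full time home and away goals."""
    total_goals = fthg + ftag
    goal_diff = fthg - ftag  # positive: home win, negative: away win

    if goal_diff > 0:
        result = "HomeWin"  # assumed label for the third factor level
    elif goal_diff < 0:
        result = "AwayWin"
    else:
        result = "Draw"

    # Total goals are whole numbers, so they can never equal the 2.5 line.
    over_under = "Over" if total_goals > line else "Under"

    return {
        "TotalGoals": total_goals,
        "GDiff": goal_diff,
        "H_A_Win": result,
        "OverUnder2.5": over_under,
    }
```

For the first match in the example file (Dortmund 2, Hertha 2) this gives a total of 4 goals, a goal difference of 0, a draw and an "Over" result.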


8.3.2 Weather Observation Data


Examples of the five historic weather observation data sets are provided below.
The European Climate Assessment & Database Project (2014) provides data for
weather stations across Europe. The five variables are: Temperature, Humidity,
Rainfall, Wind speed and Cloud cover.

Temperature
Historic weather data example for Germany for Daily Mean Temperature

The above data sample is taken from station number 494 (Augsburg) for mean
daily temperature. The text files are comma delimited and provide (from left to right)
station number, source identifier, date (yyyy/mm/dd), temperature and quality
code. Each location file contains around 25,591 lines of data. The temperature is
provided in units of 0.1 degrees Celsius in its current html format and requires a
decimal point to read correctly. For example the first line of data above for the 28th
March records a daily mean temperature of 6.5 degrees Celsius with no known errors
or missing data. Below freezing levels are identified with a minus symbol (none shown
in the example above).
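
The conversion described above can be sketched in Python as follows. The field order follows the description in the text. Everything else is an assumption made for illustration: the function name, the sample line in the usage note (including its source identifier and year), and the treatment of -9999 as a missing value marker. The date is accepted with or without slashes because the exact layout of the raw file is not reproduced here.

```python
from datetime import datetime

MISSING_VALUE = -9999  # assumed marker for a missing observation


def parse_observation(line, scale=0.1):
    """Parse one comma delimited observation line.

    Expected fields, left to right: station number, source identifier,
    date, observed value (stored as an integer in units of `scale`) and
    quality code.
    """
    staid, souid, raw_date, raw_value, quality = [f.strip() for f in line.split(",")]
    day = datetime.strptime(raw_date.replace("/", ""), "%Y%m%d").date()
    stored = int(raw_value)
    # Apply the scale so that, for example, 65 becomes 6.5 degrees Celsius.
    value = None if stored == MISSING_VALUE else round(stored * scale, 1)
    return {
        "staid": int(staid),
        "souid": int(souid),
        "date": day,
        "value": value,
        "quality": int(quality),
    }
```

With a hypothetical line such as `"494, 1234, 20020328, 65, 0"` the function returns a value of 6.5. The same function applies to the rainfall (0.1 mm) and wind speed (0.1 m/s) files, while `scale=1` would suit variables that are stored in whole units.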


Humidity
Historic weather data example for Germany for Daily Humidity levels

The above ECA&D (2014) data sample is taken from station number 494
(Augsburg) for daily humidity. The text files are comma delimited and provide (from
left to right) station number, source identifier, date (yyyy/mm/dd), humidity in
percent and quality code.

Rainfall
Historic weather data example for Germany for Daily precipitation levels.

The above ECA&D (2014) data sample is taken from station number 494
(Augsburg) for daily precipitation. The text files are comma delimited and provide
(from left to right) station number, source identifier, date (yyyy/mm/dd),
precipitation in units of 0.1 mm (as presented in the html format) and quality code.


Wind Speed
Historic weather data example for Germany for Daily mean wind speed.

The above ECA&D (2014) data sample is taken from station number 494
(Augsburg) for daily average wind speed. The text files are comma delimited and
provide (from left to right) station number, source identifier, date (yyyy/mm/dd),
average wind speed in units of 0.1 m/s (as presented in the html format) and quality
code. The actual wind speed is the above figure divided by 10. For example the first
value for April shown above would be 1.5 m/s.

Cloud Cover
Historic weather data for Germany for Cloud Cover.

The above ECA&D (2014) data sample is taken from station number 494
(Augsburg) for daily average cloud cover. The text files are comma delimited and
provide (from left to right) station number, source identifier, date (yyyy/mm/dd),
cloud cover in oktas and quality code. The oktas scale provides a measure of
cloud cover from 0 to 8 subject to the overall portion of sky covered. Zero
represents a totally clear sky while 8 would be totally overcast.


8.3.3 Stadium Data


Four online databases were used as shown below with an example of the structure
and format of the information. Three were stadium databases; the fourth was
Google Maps, used for checking stadium names where geocoding indicated an error.
German stadium names can change frequently to reflect sponsors.

Stadium Guide.com
Contains 34 listings of present day stadia currently in use.

Source: http://www.stadiumguide.com/present/germany/

World Stadiums.com
Lists 523 stadiums organised by state location. Includes all types of stadia
including football.


Source: http://www.worldstadiums.com/europe/countries/germany.shtml

Stadium Database.com
Contains listings for all current Bundesliga 1 & 2 clubs and a large number of other
stadia

Source: http://stadiumdb.com/stadiums/ger


Google Maps.ie
Large online mapping tool with all current stadiums listed.

Source: https://www.google.ie/maps


8.3.4 Weather Station Data


Example of weather station list; Average Daily Temperature stations shown.

There is one station list for each weather variable as not every station records
every weather variable. Each list is a comma delimited text file which can be
downloaded from the ECA&D (2014) website directly using R Studio. Lists contain
all European stations, which are identified through two-letter ISO 3166 country
codes. The Station ID (STAID) is the unique identifier that allows the observation
data to be matched to each weather station. Latitude, longitude and altitude are
also provided. Altitude is critical to ensure each weather station and stadium are
at comparable heights.

8.4 Data Set Variables

The finished data set used for analysis has 12,926 rows each representing a single
match and 31 columns or variables. 14 columns were feature engineered.
> str(master)
'data.frame': 12926 obs. of 31 variables:
$ STAID: int 4012 472 3990 2763 2758 4005 3990 479 477 491 ...
$ HomeTeam: Factor w/ 73 levels "Aachen","Aalen".
$ Div: Factor w/ 2 levels "D1","D2": 2 2 2 2 2 2 2 2 2 2 ...
$ AwayTeam: Factor w/ 74 levels "Aachen","Aalen",..
$ FTHG: int 1 4 1 0 0 0 0 3 3 2 ...
$ FTAG: int 1 0 1 0 1 1 0 1 0 1 ...
$ Stadium_Name: Factor w/ 72 levels "Allianz Arena",.
$ lat: num 49.5 54.1 50.9 48.8 52.7 ...
$ lon: num 8.5 12.1 11.58 9.19 7.3 ...
$ alt: int 96 22 149 481 21 55 318 38 57 276 ...
$ dist: num 5.22 10.71 38.59 7.7 69.38 ...
$ RR: num 24 7.9 2.5 8.9 0.3 2.2 2.5 0.3 3.3 4.9 ...
$ HU: int 94 95 75 87 82 84 75 83 81 94 ...
$ TG: num 18 16.1 17.7 17.8 16.4 16.4 17.7 17.4 16.4 16.6 ...
$ CC: int 8 6 6 8 7 5 6 5 5 8 ...
$ FG: num 2.3 5.6 6 2.9 4.9 4.5 6 4.9 4.1 3.9 ...

The following data variables were feature engineered from the raw data set above:
$ month: Factor w/ 12 levels "April","August",..
$ year: int 1993 1993 1993 1993 1993 1993 1993 1993 1993 1993 ...
$ TotalGoals: int 2 4 2 0 1 1 0 4 3 3 ...
$ OverUnder2.5: Factor w/ 2 levels "Over","Under": 2 1 2 2 2 2 2 1 1 1 ...
$ H_A_Win: Factor w/ 3 levels "AwayWin","Draw",..: 2 3 2 2 1 1 2 3 3 3 ...
$ GDiff: int 0 4 0 0 -1 -1 0 2 3 1 ...
$ BScale: int 2 4 4 2 3 3 4 3 3 3 ...
$ HUscale: int 10 10 8 9 9 9 8 9 9 10 ...
$ Rain: Factor w/ 6 levels "Heavy","Light",..: 1 3 3 3 2 3 3 2 3 3 ...
$ season: Factor w/ 4 levels "Autumn","Spring",..
$ Atemp: num 18.8 13.9 14.5 17.6 14 14.4 14.5 15.4 14.5 15.7 ...
$ Area: Factor w/ 15 levels "baden-wurttemberg",..
$ Region: Factor w/ 4 levels "Coastal & NW",..
$ ATempScale: Factor w/ 5 levels "<-10","<0","0-10",..
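
Three of the weather-derived variables above can be reproduced from the sample rows, as sketched below in Python. The rules are inferences rather than the study's documented code. `BScale` is consistent with the standard Beaufort force thresholds applied to `FG` (wind speed in m/s). `HUscale` is consistent with humidity divided by 10 and rounded up. `Atemp` is consistent with a commonly quoted Steadman-type apparent temperature formula (the version without a radiation term). Function names are hypothetical.

```python
from math import ceil, exp

# Upper wind speed bound (m/s) of Beaufort forces 0 to 11; faster is force 12.
BEAUFORT_UPPER_MS = [0.2, 1.5, 3.3, 5.4, 7.9, 10.7, 13.8, 17.1, 20.7, 24.4, 28.4, 32.6]


def beaufort_force(wind_ms):
    """Beaufort force (0-12) for a wind speed in m/s."""
    for force, upper in enumerate(BEAUFORT_UPPER_MS):
        if wind_ms <= upper:
            return force
    return 12


def humidity_band(humidity_pct):
    """Humidity band inferred from the sample rows: percent / 10, rounded up."""
    return ceil(humidity_pct / 10)


def apparent_temperature(temp_c, humidity_pct, wind_ms):
    """Steadman-type apparent temperature (no radiation term), in deg C.

    AT = Ta + 0.33 * e - 0.70 * ws - 4.00, where e is the water vapour
    pressure in hPa. Assumed here because it matches the sample Atemp values.
    """
    vapour_pressure = (humidity_pct / 100) * 6.105 * exp(17.27 * temp_c / (237.7 + temp_c))
    return temp_c + 0.33 * vapour_pressure - 0.70 * wind_ms - 4.00
```

For the first row of the data set (`TG` 18, `HU` 94, `FG` 2.3) these functions give Beaufort force 2, humidity band 10 and an apparent temperature that rounds to 18.8, in line with the `BScale`, `HUscale` and `Atemp` values listed.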

8.5 Naïve Bayes probability outcomes

The majority of all the probabilities are essentially even except for those specific
cases at extreme ends and for altitude >500m.

8.6 List of Figures and Tables

Figure 1: System Architecture Design ................................................................ 20


Figure 2: Data flow diagram................................................................................ 25
Figure 3: Distance matrix Calculation Code in R Studio ..................................... 27
Figure 4: Example of missing data for wind speed for 12 days in Sept 2001 ..... 29
Figure 5: Weather data before Imputation .......................................................... 33
Figure 6: Weather data after Imputation ............................................................. 33
Figure 7: Mean goals scored per match for all teams ......................................... 37
Figure 8: Total goals scored for all teams ........................................................... 38
Figure 9: Total Goals scored per Stadium .......................................................... 39
Figure 10: Goal Distribution for B1 (top) and B2 (bottom)................................... 39
Figure 11: Goal Difference distribution histogram ............................................... 40
Figure 12: Histogram of Average Temperature .................................................. 41
Figure 13: Histogram of Rainfall for Germany for all 21 years ............................ 41
Figure 14: Histogram of Cloud cover .................................................................. 42
Figure 15: Histogram of wind speed ................................................................... 43
Figure 16: Histogram of density for average temperature and season. .............. 43
Figure 17: Total Goals (Mean) graph for Apparent Temperature and Rainfall .... 44
Figure 18: Boxplot of Apparent Temperature ranges against Total Goals .......... 45
Figure 19: Total Goals graph for Wind Speed and Humidity............................... 46
Figure 20: Total Goals scored for Cloud Cover and Month................................. 47
Figure 21: Mean Goals Scored by Season and Region ...................................... 48
Figure 22: OU results against Apparent Temperature and Rainfall. ................... 49
Figure 23: OU against wind speed and humidity ................................................ 50
Figure 24: OU against Cloud Cover and Calendar month .................................. 50
Figure 25: Mean Goals Scored by Season ......................................................... 51
Figure 26: Apparent Temperature and Wind Speed Proportion Graphs ............. 52
Figure 27: Goal Difference (GD) vs Apparent Temperature and Rainfall ........... 53
Figure 28: Goal Difference vs Wind speed and Humidity ................................... 54
Figure 29: Goal Difference vs Cloud Cover and calendar month........................ 54
Figure 30: Goal Difference vs season and Region ............................................. 55
Figure 31: HAD vs Apparent Temperature and Rainfall ..................................... 56
Figure 32: HAD vs Wind Speed and Humidity .................................................... 57
Figure 33: HAD vs Cloud Cover and Calendar month ........................................ 57
Figure 34: HAD vs season and Region .............................................................. 58
Figure 35: Goal Outcome against Altitude .......................................................... 59
Figure 36: Pairs and correlation analysis of numeric data .................................. 60
Figure 37: Scatter plot of total goals against Daily average temperature ........... 61

8.7 List of Tables

Table 1: Summary Overview of numerical variables ........................................... 36

8.8 Initial Project Plan

Project Proposal

Can the weather kill goals?


The effects of weather on goal outcome for football
matches played within the German Bundesliga

Alastair Macnair, x13129325, Alastair.MacNair@student.ncirl.ie

Higher Diploma in Data Analytics

4th June 2014

Table of Contents

1. Objectives .............................................. 3
2. Background .............................................. 4
3. Literature Review ....................................... 6
4. Research Question ....................................... 7
5. Requirements Elicitation and Analysis ................... 9
6. Special Resources required .............................. 13
7. Project Plan ............................................ 14
8. Consultation ............................................ 14
9. Declaration ............................................. 14
Appendix A Examples of the Football Data Sets .............. 15
Appendix B Examples of the Weather Data Sets ............... 16
Appendix C Map of Current Bundesliga 1 & 2 Stadium locations 21
Appendix D Project Plan Gantt chart ........................ 22
Appendix E Map showing Principal Regions of Germany ........ 23
Appendix F Project Proposal Revisions ...................... 24
References ................................................. 25

1. Objectives
The study will seek to assess whether certain weather factors such as temperature, cloud cover,
precipitation, wind and humidity have any determinable effect on the goal outcome of football
matches within the Bundesliga 1 & 2 football leagues held within Germany when considered
across twenty seasons of historic play. The theory is that weather conditions, in particular lower
temperatures, may have a detrimental impact on goals scored, although warmer temperatures
will also be considered. By linking daily historic weather data for specific weather stations with
stadiums and the dates and results of matches played, it will be determined whether weather
plays any role in goal outcome when considered over a significant time period.

Secondary objectives will consider if any difference exists between the first and second league
in relation to weather effects on goal outcome and also whether any particular stadium affects
goal outcome due to its geographical location or size for the teams that play there. The results
will also be compared against a particular betting instrument, the over/under
goals scored bet (PKR, 2014), to see if any meaningful predictions can be made regarding total
match goals scored. Any possible lean towards an uneven spread (more goals under than over
due to weather factors) would be of particular interest to football teams, coaches, trainers and
in particular those companies that provide such betting instruments.

Summary of all objectives


Objective #1 Determine if there is any link between goals scored and weather effects
within the Bundesliga 1&2 football Leagues.
Objective #2 Determine if there is any difference between the Bundesliga 1 & Bundesliga
2 due to the effects of weather, location or smaller stadiums.
Objective #3 Determine if just single or multiple weather parameters predominantly affect
goal outcome.
Objective #4 Investigate if stadium location and regional local weather affects games
played there and match outcome.
Objective #5 Compare the outcomes of matches to under/over goal difference betting
instruments to determine if the spread of match results could have been better
predicted using the results of the analysis.
Objective #6 Attempt to use the data to predict goal outcome for a number of future
matches using weather predictions and selected betting instruments.
Objective #7 Determine if goal difference between teams is greater in colder weather and
if sustained cold weather affects a team's performance over time.
Objective #8 Use analysis software including but not limited to Excel, Python, R and SQL
to gain knowledge in their use for analysing large data sets.

2. Background
A recent study by the European Commission (2012) placed the contribution of sport
to the economy at 294 billion euros. Additionally the betting market for
sports is estimated to be around 733 billion globally (BBC, 2014) with 70% of that income
coming from football matches. Betting on football matches became popular in the early
1920s with the creation of the Football Pools (2014) in the UK, the oldest gaming company in
the world, which allowed fans to predict matches and win money if those predictions proved
to be correct. With some individual bets now reaching figures of over 200,000 (BBC, 2014)
it is important for gaming companies to be able to understand the level of risk they are being
exposed to, as mistakes could be costly.

Additionally trainers and teams are always looking to gain competitive advantage to ensure
success, and the use of statistical information and data analytics is becoming increasingly
important within football as more and more managers and teams use data analysis to become
smarter and more efficient (Lewis, 2014). While nearly all analysis focusses on the players,
there has been much less analysis of external factors. There is some evidence, and a number
of studies, to indicate that weather factors, predominantly temperature, may be a factor in the
outcome of European football matches (Hamilton, 2014). While extreme hot or
cold temperatures are known to directly affect both performance and
health through their effects on human physiology (Hong, 2014), the overall contribution that
weather makes where extreme temperature is not a factor, in particular to the goal outcome
of football matches, is still not clearly understood or established.

Germany has a moderate and temperate climate (see Appendix 2(e) for a typical weather year)
with temperatures ranging on average from just below zero degrees Celsius in winter to around
the mid-twenties during the summer (ECA, 2014). The use of a moderate temperate climate
seeks to reduce as much as possible any effects of very extreme temperatures. However there
are colder areas such as Munich which can see temperatures drop to around -10°C, which can
affect performance (Hong, 2014), although at around 0°C there should be very little drop in
performance for persons engaged in moderate exercise even if wearing t-shirts. Germany is
large enough to have distinct regions of specific weather patterns (Encyclopaedia Britannica,
2014) with variable frequency of temperature, humidity and precipitation experienced in
different regions and throughout the year. The Bundesliga stadiums are distributed around
Germany widely enough to see if regional weather plays a role in match outcome (Appendix
C & E).

The study considers two primary data sets which were identified as being
suitable for analysis and for meeting the study's objectives. The first is the Bundesliga football
league results, for which reliable historic data exists for its entire history since 1963. Within this a
selection of data will be considered for the period 1993 until 2013 which represents twenty-one
seasons (years) of play. Useable football score data has been identified from an online provider
(Football-UK, 2014) which provides one csv file (refer to Appendix A) for each season played
detailing every game played within the season, the date played, half time and full time scores
and where it was played as well as a range of other match information. There are around 306
games played per season per league so the football data set will comprise around 306 (games)
x 21 (seasons) x 2 (leagues) = 12,852 football matches being analysed in total. Each of the csv
files is relatively small at around 100 KB in size.
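
The match count estimate above can be reproduced with a short Python sketch. It assumes a double round robin format in which every team hosts every other team once, which is what yields 306 matches for 18 teams. The function names are illustrative only.

```python
def matches_per_season(teams):
    """Matches in a double round robin: each team hosts every other team once."""
    return teams * (teams - 1)


def expected_matches(teams, seasons, leagues):
    """Total matches across all seasons and leagues for a fixed league size."""
    return matches_per_season(teams) * seasons * leagues
```

`matches_per_season(18)` gives 306 and `expected_matches(18, 21, 2)` gives 12,852. A season played with 20 teams would instead have `matches_per_season(20)` = 380 matches, so each such season adds 74 matches to the estimate.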

Compared to this will be daily weather data for Germany obtained from the European Climate
Assessment & Database (ECA, 2014). This site provides data on numerous weather stations
positioned around Europe which can be matched geographically using name, latitude and
longitude co-ordinates to each stadium being considered to within a few miles. The data is
available as a number of individual text files, one for each unique weather station and weather
variable (refer to Appendix B). The blended data was selected for use, which combines weather
data from different sources, although checking shows no difference for the weather stations
being used. The files contain comma delimited text in their raw format and the uncompressed
size of the files for each weather variable ranges from 200MB to 4GB, containing
approximately 400 to 5,000 individual weather stations in each zipped folder for that particular
weather variable. Each file provides data for approximately 67 years equating to 25,591 lines
of raw data for each weather station and single weather variable. There are 18 stadiums in the
Bundesliga 1 and the same in the Bundesliga 2. However, over the 21 year period 38
teams have played in the top tier, with a similar number expected in the second. There
will be instances of cross over where teams share stadiums, and the closeness of stadia may
allow for a common weather station to be used. This will be subject to a more detailed
assessment during the initial stages of the project. This means that there could be approximately
30 distinct weather files for each weather parameter. The total approximate size of the raw data
will therefore be 25,591 x 30 = 767,730 lines of data for each weather variable. It is intended
to consider five variables: temperature, precipitation, cloud cover, wind and humidity, which
will equate to approximately 3.8 million lines of raw data prior to selecting the relevant lines that
equate to the 12,852 actual football matches that were played.
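
The size estimate above is simple arithmetic and can be checked with the following Python sketch. The default values are the figures quoted in the text; the function name is illustrative.

```python
def raw_line_estimate(lines_per_file=25_591, stations=30, variables=5):
    """Return (lines per weather variable, lines across all variables)."""
    per_variable = lines_per_file * stations
    return per_variable, per_variable * variables
```

`raw_line_estimate()` returns 767,730 lines per weather variable and 3,838,650 lines across the five variables, which is the figure of roughly 3.8 million quoted above.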

There are some important limitations to the data being considered, in particular the weather
data. The data being used provides daily averages which may not equate to the conditions
experienced during the time the match was played. For example rain may have fallen before,
after or during the match. The study is seeking to determine if any relationship exists between
the historic weather data and goal outcome and so some caution is advised as links could be
established where none really exist. However, the primary objective is to consider any overall
trend over the course of a playing season i.e. changes in seasons and over months rather than
specific matches.

3. Literature Review
The literature review will at this stage examine primarily factors relating to the statistical
analysis of sports and the effects of weather on sports but should also extend to consider
statistical analysis in general, prediction analysis, climate and weather, sports performance and
stadium design. Literature has been sourced from Google Scholar, CiteSeerX,
Google Books and the Directory of Open Access Journals along with articles, websites and
other sources.

Statistical Analysis in sports


Sports performance analysis is the process by which the various persons involved within a sport
such as coaches, analysts or physiologists come together to break down a game's performance
from observed data and then identify those factors which contributed towards either a good or
bad performance (McGarry, O'Donoghue and Sampaio, 2013). A lot of commonly accepted
anecdotal evidence within football, such as corner kicks increasing the chances of scoring,
has been proven to be incorrect using statistical analysis (Anderson and Sally, 2013). The authors
propose that understanding issues like this provides competitive advantage through knowledge,
justifying the time and expense of undertaking such analysis in the first place.

The effect of cold weather in Sport


The effects that weather and environmental factors have on sport are an area where, according
to Thornes (1977), considerable improvements could potentially be made to sports
management, performance and economic performance. There is evidence to suggest that some
sports are more adversely affected than others, with endurance sports, in particular cycling,
being affected by the weather (Pezzoli, Cristofori, Moncalero, Giacometto and Boscolo, 2013).
The study also found that most sports were affected by three primary characteristics, namely
temperature, humidity and wind. Rain was also a factor in a number of cases for some but not
all sports. Riley and Williams (2003) indicate that colder weather reduces limb temperatures,
which would detrimentally affect motor performance as well as strength and power. In fact
muscle power was found to be reduced by 5% for every 1°C drop in muscle temperature below
normal.

The effect of temperature on ball properties is also a possible environmental factor: with
temperatures approaching zero degrees Celsius a goalkeeper has 7% more time to react to a
penalty than at higher temperatures, when the ball moves quicker (Wiart, Kelley, James and
Allen, 2012). The flight of the ball is also affected, with colder conditions causing the ball to
drop and move slower overall with less power than at warmer temperatures. However, as Riley
and Williams (2003) point out, in colder weather the goalkeeper is most susceptible to reduced
limb temperature and dexterity unless they keep highly active.

4. Research Question
The problem being considered is that there is a lack of information regarding the effects that
weather factors like temperature, precipitation, humidity and wind may have on goals scored
in football matches. The primary research question being considered is:

Does the weather affect the goal outcome in football matches within the Bundesliga 1 & 2?

From this the Null Hypothesis H0 and the hypothesis that will be tested, H1, are established:

H0: There is no relationship between goal outcome in football matches and daily corresponding
average values of temperature, wind, precipitation and humidity.
H1: There is a relationship between goal outcome in football matches and daily corresponding
average values of temperature, wind, precipitation and humidity.

The Null hypothesis is non-directional and therefore a two-tailed test will be applied where
appropriate with a significance level of 5%.

Figure 1: Graphical representation of a two tailed test with rejection regions.
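
The decision rule for a two-tailed test at the 5% level can be sketched in Python as below. This is only an illustration: it assumes a test statistic that is approximately standard normal under the null hypothesis, whereas the specific tests to be applied are not fixed at this stage of the proposal. The function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def critical_value(alpha=0.05):
    """Boundary of the rejection regions; alpha is split across both tails."""
    return STANDARD_NORMAL.inv_cdf(1 - alpha / 2)


def two_tailed_p_value(z):
    """Probability of a statistic at least as extreme as z in either direction."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


def reject_null(z, alpha=0.05):
    """True when the statistic falls in either rejection region."""
    return two_tailed_p_value(z) < alpha
```

At a 5% significance level the critical value is approximately 1.96, so the null hypothesis is rejected when the statistic lies below -1.96 or above 1.96, regardless of the direction of the effect.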

Within the context of the broader research question there are further questions that will be
considered:

(i) Is the Bundesliga 2 League's goal outcome affected more by weather factors than
the Bundesliga 1 League's?
(ii) Can goal outcome at any particular stadium be attributed to any possible regional
weather effects?
(iii) Does a single weather variable affect match outcome or are multiple factors
required?
(iv) Do smaller stadiums have a greater effect on goal outcome due to greater exposure
to the weather?

These are components of the primary research question and will be investigated. Appropriate
hypothesis testing will need to be established for these questions. Further questions will be
developed for the project.

Predictions
Additionally, the results of the analysis may be used to undertake match outcome prediction for
goals scored using next-day weather forecasts. Rather than predicting the actual total goals for a
match with any accuracy, it is more likely that the average goals scored under the general weather
conditions experienced over a time period can be predicted. A betting instrument such as
Under/Over (x) goals will be used, with x based on the average number of goals per game and
league across the period being considered. For example, if the average goals scored was 2.7 then
Under/Over 2.5 goals would be used as the instrument to see whether the results can reliably
identify a significant push or pull above or below this level, which could indicate that
predictions can be made. As the predictions are dependent on weather, the time period will
typically be 1 to 3 days in line with weather forecasting but could increase to 10 days.
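
A minimal Python sketch of how the Under/Over line could be chosen from an average is given below. It assumes the rule implied by the example above (the half-goal line nearest to the average); the function names are illustrative and the goal totals in the usage note are invented.

```python
import math
from typing import Sequence


def under_over_line(average_goals: float) -> float:
    """Return the half-goal line nearest to the average (assumed selection rule)."""
    # The only x.5 value inside [n, n + 1) is n + 0.5, so it is always the nearest.
    return math.floor(average_goals) + 0.5


def over_share(total_goals: Sequence[int], line: float) -> float:
    """Return the fraction of matches whose total goals finished over the line."""
    if not total_goals:
        raise ValueError("no matches supplied")
    return sum(goals > line for goals in total_goals) / len(total_goals)
```

For an average of 2.7 goals, `under_over_line(2.7)` gives 2.5. Comparing `over_share` for matches played in different weather conditions would then show any push or pull above or below that line.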

The research will be limited to only stadium locations within Germany, the weather data
identified and goals scored for a match. No other in match data or statistics will be used such
as corners or passes. Individual players will not be considered nor will any other variables other
than those indicated and referenced.

5. Requirements Elicitation and Analysis


Requirements elicitation is a preliminary stage in which the requirements of the process are
specified and defined which then leads to the correct solution being designed and implemented.
Undertaking requirements elicitation is primarily a process to understand a particular problem
which comes typically from a business need. The objective of requirements elicitation is to
identify all of the requirements, or as many of them as is feasibly possible (Kasirun, 2005). At
this stage the requirements are a preliminary step towards a more detailed project specification
later on during the second semester when the dissertation will be initiated, undertaken and
completed.

Elicitation techniques are the systems and tools used to bring forth the requirements and help
develop and find understanding. For this part of the process the tools used are Brainstorming
and Document Analysis as outlined in the BABOK Guide (IIBA, 2009).

The brainstorming process was utilised primarily at this stage to help stimulate ideas on the
project. This did not take the format of a scheduled session but instead was an ongoing process
where ideas were jotted down in a notebook as and when they came to mind. Deliberately, no
critiquing or analysis of the ideas was undertaken as this is contrary to the brainstorming process,
which is to develop new ideas.

Before determining the functional and non-functional project requirements it is useful to first
restate the problem being considered which was explored in the previous section: - The
problem being considered is that there is a lack of information regarding the effects that
weather factors like temperature, precipitation, humidity, cloud cover and wind may have on
goals scored in football matches. From this we can then look to determine the project
requirements.

Project Scope
The project is a Big Data Analysis study which will use a relational database, most likely SQL
based, in conjunction with R Studio to undertake analysis of a large data set to find trends,
patterns, links and predictions, supported by graphs and tables to present results.

General Description
The database will be created and designed to facilitate the querying and manipulation of a large
amount of data to allow for the effects of weather such as temperature, humidity, precipitation
and wind on total goals scored in football matches to be analysed to determine if a relationship
exists. The aim is that the analysis will provide insight into the possible effects of weather on
sports like football.

The database must be designed in such a way that all the entities and their relationships are
robust and well understood and that the data has been normalised prior to database creation.
The ability to handle very large queries and joins will be required as joining tables with thousands
of rows has a multiplying effect within SQL databases which can place significant demands on the
processing ability of computers. If the database cannot function properly then either the number
of data points will have to be restricted or the amount of analysis limited which will not provide
a sufficient amount of information for a robust analysis which could damage the study as a
whole. The core function of the project is to compare the two primary datasets which must be
central to any design approach implemented.
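
The multiplying effect of joins can be illustrated with a small in-memory SQLite sketch in Python. The table names, column names, station 495 and all values below are assumptions made purely for illustration; they are not the project schema or real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE matches (
        match_id   INTEGER PRIMARY KEY,
        match_date TEXT NOT NULL,     -- ISO 8601 date
        station_id INTEGER NOT NULL,  -- weather station assigned to the stadium
        home_goals INTEGER NOT NULL,
        away_goals INTEGER NOT NULL
    );
    CREATE TABLE weather (
        station_id  INTEGER NOT NULL,
        obs_date    TEXT NOT NULL,
        mean_temp_c REAL,
        PRIMARY KEY (station_id, obs_date)
    );
    """
)
# Invented sample rows: two matches, two stations, two days of weather.
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?, ?, ?)",
    [(1, "2013-07-19", 494, 2, 1), (2, "2013-07-20", 495, 0, 0)],
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        (494, "2013-07-19", 21.4),
        (494, "2013-07-20", 22.0),
        (495, "2013-07-19", 19.8),
        (495, "2013-07-20", 20.3),
    ],
)

# An unconstrained join pairs every match with every weather row (2 x 4 rows).
cross_rows = conn.execute(
    "SELECT COUNT(*) FROM matches CROSS JOIN weather"
).fetchone()[0]

# Joining on station and date returns exactly one weather row per match.
keyed_rows = conn.execute(
    """
    SELECT m.match_id, m.home_goals + m.away_goals AS total_goals, w.mean_temp_c
    FROM matches AS m
    JOIN weather AS w
      ON w.station_id = m.station_id AND w.obs_date = m.match_date
    ORDER BY m.match_id
    """
).fetchall()

if __name__ == "__main__":
    print(cross_rows)   # 8
    print(keyed_rows)   # [(1, 3, 21.4), (2, 0, 20.3)]
```

With thousands of matches and tens of thousands of weather rows, the unconstrained product grows into many millions of rows, which is why keyed joins on station and date are central to the design.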

System Interfaces
The database will be a self-contained system; however, it may interface with a PC or a server
located on Amazon Web Services or Windows Azure (to be decided subject to further research).
It may also need to receive input data from other programs such as Microsoft Excel, R or Python
and will be required to export back to Excel and R Studio for ongoing graphing and analysis.

Preliminary list of Functional Requirements


The purpose of the project is to utilise a database to either accept or reject the null hypothesis
as set out within the specified project timeline and to produce a dissertation report.
1. The weather data cleaning preparation tool (R or Python) must be able to discard the dates
and associated data that are not relevant to reduce the weather file size.
2. The weather data cleaning preparation tool (R or Python) must be able to read, re-organise
and output the data files into a readable and standardised format for entry into the SQL
database.
3. The data preparation must ensure that dates from both files are in a standardised ISO format
that are compatible with each other.
4. The weather stations should have specific identity codes matched to each stadium.
5. The SQL database system must be able to export results data out to other programs for
analysis, graphing and visualisation.
6. The SQL database being used for analysis must be able to hold several thousand entries.
7. The SQL database must be able to filter and select different columns and rows of
information for analysis and comparison.
8. The data outputted should be produced in a form that is capable of being analysed
using a variety of statistical tools (it is assumed at this stage that all of these will be utilised
to some extent, subject to verification during the next stage) including:
a. z-test (hypothesis testing)
b. Power analysis (due to the large sample size)
c. Analysis of variance (ANOVA) to compare each season of play and other sub
groups of means.
d. Mean (there will be multiple means considered)
e. Calculation of Standard deviation(s)
f. Calculation of Variance(s) and Covariance.
g. Time series analysis (for possible prediction analysis)
h. Cluster analysis
i. Correlation Analysis (Calculation of r)
j. Simple linear regression, multiple and logistic regression tools.
9. The SQL database should be designed so that comparison against weather variables can be
made against the following football variables:
a. The entire range of matches played by date of match.
b. Each season of play (by Individual selection.)
c. By Stadium location.
d. By Team.
e. By a pre-determined or local region (Refer to Appendix E.)
10. The SQL database should be designed so that comparison against football variables can be
made against one, two, three or all of the following weather statistics:
a. Temperature
b. Humidity
c. Precipitation
d. Wind
e. Cloud Cover
11. The database team table should provide the numbers of years they have played in each
league as not every team will have played for the entire time period being analysed.
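
Functional requirement 3 (standardised ISO dates) could be met along the following lines. This Python sketch assumes the football files store dates as dd/mm/yy or dd/mm/yyyy and the weather files as yyyymmdd or yyyy/mm/dd; both formats are assumptions to be confirmed against the raw files, and the function names are illustrative.

```python
from datetime import datetime


def football_date_to_iso(text):
    """Convert a dd/mm/yy or dd/mm/yyyy string to ISO 8601 (yyyy-mm-dd)."""
    for fmt in ("%d/%m/%y", "%d/%m/%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next assumed format
    raise ValueError(f"unrecognised football date: {text!r}")


def weather_date_to_iso(text):
    """Convert a yyyymmdd or yyyy/mm/dd string to ISO 8601 (yyyy-mm-dd)."""
    compact = text.strip().replace("/", "")
    return datetime.strptime(compact, "%Y%m%d").date().isoformat()
```

Once both data sets carry the same yyyy-mm-dd text, the match date and the weather observation date can be compared directly in a join.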

Preliminary list of Non Functional requirements


Non-functional requirements are outlined below. They include:
1. The methodology section should enable another person to reproduce the research project
in its entirety and from the same data obtain the same/similar results.
2. The project and research objectives should be able to be understood by non-experts.
3. The data being used should be verified as authentic and reliable.
4. The author must invest a minimum of three hours a week on the project based on the project
plan.
5. The author must attend all lectures and tutorials within semester 2.
6. The database should be able to achieve a reasonable level of performance in its required
operation.
7. The project must be stored electronically on three different media sources at all times and
at least be updated once a week.
8. The project must be completed by the specified date.

6. Special Resources required


The proposed project will require a number of programs to undertake the required analysis and
then production of results:
1/ Microsoft Excel - Required to read and open primary football data files and do basic checks,
tables and graphical outputs.
2/ Microsoft Word - To generate written reports.
3/ R - R will be the primary program used to prepare, analyse, graph and tabulate the data.
It will be used to clean up all the football files, removing unwanted columns and binding all
years of play into one file. Weather data will also be cleaned up by removing unwanted lines and
error checking for NULL values.
4/ SQL - The data lends itself towards a relational database such as SQL where the weather
data can be combined with the football data based on temperature, precipitation, humidity,
wind, geographic location or team, for example.
6/ Map Reduce/Hadoop & Python - The use of a distributed computer system could offer
potential benefits for speed of computation as the data set may be too large to handle efficiently
on a single user PC. This will be investigated as to its necessity as the project develops.
7/ PeaZip - A program that can easily un-compress a variety of large file formats, to be used
for the weather data.
8/ Microsoft PowerPoint - To create the project presentation.
9/ Adobe Photoshop - May be required to assist with image manipulation for the project and
presentation.
10/ Browser add-on for Mozilla, Download it all! - To quickly extract and download all 42
football csv files.
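
The football file clean-up described under item 3 (removing unwanted columns and binding all years of play into one file) is planned for R. Purely as an illustration of the same steps, a Python sketch is shown below; the column names Date, HomeTeam, AwayTeam, FTHG and FTAG are assumed headers, and the function name and season labels are hypothetical.

```python
import csv

# Assumed column names to keep; every other column (e.g. betting odds) is dropped.
KEEP = ("Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG")


def bind_seasons(files):
    """Yield one trimmed row per match from every (season_label, file) pair."""
    for season, handle in files:
        for row in csv.DictReader(handle):
            if not row.get("Date"):
                continue  # skip blank or comma-only trailing lines
            trimmed = {name: (row.get(name) or "") for name in KEEP}
            trimmed["Season"] = season  # remember which file the row came from
            yield trimmed
```

Each item passed in is a season label together with an open text file, so the 42 season files can be read in a loop and written back out as a single combined file.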

At this stage there may be additional programs that may be useful but have not yet been
identified as being a requirement. This will be a part of the project plan to determine what
technologies should be used.

7. Project Plan
The project plan is provided in Appendix D and shows the general expected timeline for project
delivery in the second semester. The first half of the project is planned for research, preparing
all the data, building databases and becoming familiar with them as well as the initial parts of
the thesis. The second part focuses on the analysis, findings and writing the analysis which are
key parts of the project process. The plan has been updated based on confirmation of the
submission date in early January and additional deadlines for management reports and the
presentation.

8. Consultation
The project proposal was discussed with NCI Lecturer Padraig De Burca. The discussion took
place 26th May 2014 and took the form of an informal discussion after scheduled classes.
Padraig provided valuable feedback relating to the potential for use of SQL to build a database
of all the normalised match and weather variables which can then be queried in multiple ways
with the results being outputted to other programs like Excel to generate graphs. The significant
benefit of using SQL would firstly be the speed with which stadiums, teams, results and even
certain weather conditions can be isolated for comparative analysis. Secondly, it would limit the
amount of preparation the weather files need, as there would be no need to eliminate all the
dates where games were not played; the data file would simply be clipped at the start date to
eliminate the largest unneeded section prior to 1993. This would create potentially redundant
data within the database and may affect the time taken to undertake joins but could be quicker
than trying to eliminate certain dates in the raw weather files as there are potentially 70-100
individual weather files.

As a result of the consultation several possible new ways to view the data were considered.
Firstly, it opens the possibility of considering the past few days of weather prior to any match,
which had not been thought of. Secondly, it allows the comparison of sequential matches played
by the same team in different locations to see if any general ongoing weather, such as sustained
cold, has a compounding effect. Padraig also noted that SQL has some graphing capabilities
which will be investigated as to their potential use.
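
The idea of considering the weather over the days leading up to a match can be sketched as a simple look-back average. The Python below is a hypothetical illustration: the function name, the three day default window and the dictionary layout are all assumptions.

```python
from datetime import timedelta
from statistics import mean


def prior_days_mean(daily, match_day, days=3):
    """Average the daily values recorded on the `days` days before match_day.

    `daily` maps a datetime.date to a reading; missing or None readings are
    ignored, and None is returned when the window holds no readings at all.
    """
    window = [match_day - timedelta(days=offset) for offset in range(1, days + 1)]
    values = [daily[day] for day in window if daily.get(day) is not None]
    return mean(values) if values else None
```

The match day itself is excluded from the window, so the result describes only the conditions leading up to the game.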

9. Declaration
By submitting this proposal through the NCI Moodle system, I declare that unless otherwise specified,
all content in this proposal is my own work and has not been copied from other sources.

Appendix A Examples of the Football Data Sets


Data Set 1 Football results for the Bundesliga 1 & 2. Excerpt below shows Bundesliga 2
results for July 2013.

Football-Data (2014) provides a full season of play for either Bundesliga 1 or 2 as a csv file
available for download. Each csv file contains the results for one entire season of play. There
are 306 matches in total for each season, which corresponds to 18 teams each playing the other
17 home and away. There are 52 columns of data per file for most files containing the date, full
time results, half time results, where the game was played and a variety of betting information.
For earlier years not all this information was recorded. Twenty-one seasons of historic football
data for both leagues equates to 306 (games per season) x 21 (seasons) x 2 (leagues) = 12,852
lines of data for the football matches which in its raw form exists in 42 corresponding csv files.
Total goals is not a parameter but any program or database such as SQL could calculate this
from the home and away goals scored columns.
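
The arithmetic above, and the derivation of total goals, can be checked with a few lines of Python. The column names FTHG and FTAG (full time home and away goals) are assumed headers, and the function name is illustrative.

```python
TEAMS = 18
SEASONS = 21
LEAGUES = 2

# Each team hosts every other team once per season.
matches_per_season = TEAMS * (TEAMS - 1)
total_rows = matches_per_season * SEASONS * LEAGUES


def total_goals(row):
    """Derive total goals from the home and away goal columns (assumed names)."""
    return int(row["FTHG"]) + int(row["FTAG"])


if __name__ == "__main__":
    print(matches_per_season)  # 306
    print(total_rows)          # 12852
```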

Appendix B Examples of the Weather Data Sets


Data Set 2(a) Historic weather data example for Germany for Daily Mean Temperature at
station 494 (Augsburg, Germany)

The European Climate Assessment & Dataset Project (2014) provides data for weather
stations across Europe. The above data sample is taken from station number 494 (Augsburg)
for mean daily temperature. The text files are comma delimited and provide (from left to right)
station number, source identifier, date (yyyy/mm/dd), temperature and quality code. This file
contains 67 years, 3 months and 29 days of data which equates to around 25,591 lines of data
for each of the locations. The year the station began monitoring varies but typically covers a
significant time period in all cases. The temperature is provided in units of 0.1 degrees Celsius
in its current html format and must be divided by 10 to read correctly. For example the first line
of data above for the 28th March records a daily mean temperature of 6.5 degrees Celsius with no
known errors or missing data. Below freezing levels are identified with a minus symbol (none
shown in example above).
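
Reading one line of such a file and applying the divide by 10 scaling could look like the Python sketch below. The field order follows the description above; the -9999 missing value marker, the tolerance of both yyyymmdd and yyyy/mm/dd dates and the function name are assumptions to be checked against the downloaded files.

```python
from datetime import datetime

MISSING = -9999  # assumed marker for a missing reading


def parse_observation(line, divisor=10):
    """Split one comma delimited data line and scale the reading to real units."""
    station, source, day, raw, quality = [part.strip() for part in line.split(",")]
    raw_value = int(raw)
    return {
        "station_id": int(station),
        "source_id": int(source),
        # Accept yyyymmdd or yyyy/mm/dd by removing any slashes first.
        "date": datetime.strptime(day.replace("/", ""), "%Y%m%d").date(),
        # Readings are stored in tenths of a unit, e.g. 65 means 6.5.
        "value": None if raw_value == MISSING else raw_value / divisor,
        "quality": int(quality),
    }
```

Header lines would need to be skipped before calling the function. The same function applies to the precipitation and wind speed files, which are also recorded in tenths of a unit.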

Data Set 2(b) Historic weather data example for Germany for Daily Humidity levels at
station 494 (Augsburg, Germany)

The above ECA (2014) data sample is taken from station number 494 (Augsburg) for daily
humidity. The text files are comma delimited and provide (from left to right) station number,
source identifier, date (yyyy/mm/dd), humidity in percent and quality code. This file also
contains 67 years, 3 months and 29 days of data which equates to around 25,591 lines of data
for each of the locations. The year the station began monitoring varies but typically covers a
significant time period in all cases.

Data Set 2(c) Historic weather data example for Germany for Daily precipitation levels at
station 494 (Augsburg, Germany)

The above ECA (2014) data sample is taken from station number 494 (Augsburg) for daily
precipitation. The text files are comma delimited and provide (from left to right) station
number, source identifier, date (yyyy/mm/dd), precipitation in 0.1mm and quality code. This
file also contains 67 years, 3 months and 29 days of data which equates to around 25,591 lines
of data for each of the locations. The year each station began monitoring varies but typically
covers a significant time period in all cases.

Data Set 2(d) Historic weather data example for Germany for Daily mean wind speed at
station 494 (Augsburg, Germany)

The above ECA (2014) data sample is taken from station number 494 (Augsburg) for daily
average wind speed. The text files are comma delimited and provide (from left to right) station
number, source identifier, date (yyyy/mm/dd), average wind speed in 0.1m/s and quality code.
This file also contains 67 years, 3 months and 29 days of data which equates to around 25,591
lines of data for each of the locations. In this data set all records prior to 1960 are Null. The
actual wind speed is the above figure divided by 10. For example the first value for April shown
above would be 1.5m/s.

Data Set 2(e) Historic weather data for Germany for Cloud Cover
The cloud cover data files (not shown) are based on the oktas scale which provides a measure
of cloud cover from 0 to 8 subject to the overall portion of sky covered. Zero represents a
totally clear sky while 8 would be totally overcast.

Example Weather Year 2(e) -Typical Weather Year for Mean Daily Temperature for weather station 494 Augsburg

Appendix C Map of Current Bundesliga 1 & 2 Stadium locations.

Image Source: Total Football Forums, http://www.totalfootballforums.com/forums/topic/76502german-football-fans/

Appendix D Project Plan Gantt chart


September
October
November
December
January
Wk_01 Wk_02 Wk_03 Wk_04 Wk_05 Wk_06 Wk_07 Wk_08 Wk_09 Wk_10 Wk_11 Wk_12 Wk_13 Wk_14 Wk_15 Wk_16 Wk_17 Wk_18 Wk_19 Wk_20 Wk_21
8/9/14 15/9/14 22/9/14 29/9/14 6/10/14 13/10/14 20/10/14 27/10/14 3/11/14 10/11/14 17/11/14 24/11/14 1/12/14 8/12/14 15/12/14 22/12/14 29/12/14 5/1/15 12/1/15 19/1/15 26/1/15
Task
Revised Proposal (28/09/14)
Statistical research (ongoing)
Technology & Tools research
Data Cleaning & Preparation
Normalisation & ERD
Requirements Specification
SQL/R Set up and programming
System Testing
Introduction
Literature Review
Methodology
Data Analysis (pre-testing)
Data Analysis & Programming
Discussion
Graphing and Visualisation
Refinement
Conclusion
Final Checking
Printing and Binding (x3 copies)
Submission (06/01/15)
Management Reports
Write Presentation
Practice Presentation
Presentations

Thesis writing
Supporting Processes
Key Landmarks

Notes
1/ Dates shown are week commencing for the Monday of each week.

Appendix E Map showing Principal Regions of Germany

Note: The regions are a base point for further study as it is accepted that these region
locations do not necessarily equate to accepted regional weather.
Image Source: 24point0. http://www.24point0.com/ppt-shop/media/catalog/product/r/e/regionsmap-of-germany-ppt-slides.jpg

Appendix F Project Proposal Revisions


2/ Background
Extra season of play added increasing size and additional weather factor included (cloud
cover) also increasing the raw data size. (Minor Change)

4/ Research Question
A few extra sub research questions added and in the predictions section the limitations of
predictions are based on forecasting which is realistically limited to a few days.

6/ Special Resources
This area has been updated to better reflect the actual technology being used and for which
specific purpose based on time spent investigating each technology and undertaking small
scale tests.

7/ Project Plan
Updated to reflect known dates and revised to better break down sub components.

Appendix B
Cloud cover information added (without example picture) to note inclusion of this weather
data set in the project.

Appendix D
Project plan updated to reflect additional information such as key dates as outlined in section
seven.

Overall changes are considered minor with changes not exceeding 2-3% of the originally
submitted proposal.

References
Anderson, C., and Sally, D. (2013) The Numbers Game: Why everything you know about
football is wrong. Penguin Books.
BBC (2014) Football Betting The global industry worth Billion. [Online]. BBC. Available
at: http://www.bbc.com/sport/0/football/24354124 [Accessed 29th May 2014]
Encyclopaedia Britannica (2014) Germany - Climate [Online]. Encyclopaedia Britannica.
Available from:
http://www.britannica.com/EBchecked/topic/231186/Germany/57996/Climate [Accessed
28th May 2014].
European Commission (2012) Study on the Contribution of Sport to Economic Growth and
Employment in the EU. [Online]. European Commission. Available from:
http://ec.europa.eu/sport/library/studies/study-contribution-spors-economic-growth-finalrpt.pdf [Accessed 1st June 2014].
Football-Data (2014) Data-Files: Germany [Online]. Football-Data. Available from:
http://football-data.co.uk/germanym.php [Accessed 21st May 2014]
Football Pools (2014) The Pioneers of Football Pools [Online]. Football Pools. Available
from: http://www.footballpools.com/cust?action=GoHelp&help_page=about_us [Accessed
1st June 2014]
Hamilton, H. (2014) Does the cold really kill Goals? Howard Hamilton Blog, 1st May.
Available from: http://www.soccermetrics.net/league-competitions/temperature-vs-goalsstudy-premier-league [Accessed 24th May 2014]

Hong, Y (eds.) (2014) Routledge Handbook of ergonomics in sport and exercise. New York:
Routledge.

IIBA (2009) A Guide to the business analysis body of knowledge (BABOK Guide.)
International Institute of Business Analysis: Toronto, Canada.

Kasirun, Z.M. (2005) A survey on the requirements elicitation practices among courseware
developers, Malaysian Journal of Computer Science, Vol. 18 No. 1, June 2005, pp. 70-77.
Lewis, T. (2014) How computer analysts took over at Britain's top football clubs [Online].
The Guardian, 9th March, Available from:
http://www.theguardian.com/football/2014/mar/09/premier-league-football-clubs-computeranalysts-managers-data-winning [Accessed 28th May 2014].
McGarry, T., O'Donoghue, P., and Sampaio, J. (eds) (2013) Routledge Handbook of Sports
Performance Analysis. New York: Routledge

Pezzoli, A., Cristoforu, E., Moncalero, M., Giacometto, F., and Boscolo, A. (2013)
Climatological Analysis, Weather Forecast and Sport Performance: Which are the
Connections? Journal Climatol Weather Forecasting 1: e105
PKR. (2014) Under / Over Betting [Online]. PKR. Available from:
http://bet.pkr.com/en/get-started/bet-types/under-over/ [Accessed 28th May 2014].
Reilly, T., and Williams, A.M. (eds.) (2003) Science and Soccer. 2nd Edition. London: Routledge.

Thornes, J. E. (1977) The Effect of Weather on Sport. Weather, 32: 258-268.


Weather Online (2014) Climate Germany [Online]. Weather Online. Available from:
http://www.weatheronline.co.uk/reports/climate/Germany.htm [Accessed 28th May 2014].

Wiart, N., Kelley, J., James, D., and Allen, T. (2011) Proceedings of the Institution of
Mechanical Engineers, Part P: Journal of Sports Engineering and Technology 2011 225: 189

8.9 Initial Requirement Specification

HDSDAJAN 2014

Requirements
Specification (RS)
The effects of weather on goal outcome
for football matches played within the
German Bundesliga

Alastair Macnair
10/12/2014

Requirements Specification

Requirements Specification (RS)

Document Control

Revision History
Date         Version   Scope of Activity   Prepared   Reviewed   Approved
12/10/2005   1         Create              AM         X          X

Distribution List
Name                Title      Version
Ioana Ghergulescu   Lecturer   1

Related Documents
Title                           Comments
Title of Use Case Model
Title of Use Case Description

Table of Contents
Requirements Specification (RS)
Document Control
    Revision History
    Distribution List
    Related Documents
1 Introduction
    1.1 Purpose
    1.2 Project Scope
    1.3 Definitions, Acronyms, and Abbreviations
2 Requirements Specification
    2.1 Functional requirements
        2.1.1 Use Case Diagram
        2.1.2 Requirement 1: Extract/Collect Required Data Sets
        2.1.3 Requirement 2: Filter Weather files
        2.1.4 Requirement 3: Transform Data
        2.1.5 Requirement 4: Database Management
        2.1.6 Requirement 5: Data Analysis
        2.1.7 Requirement 6: Data Prediction Modelling
        2.1.8 Requirement 7: Report Production and Output
    2.2 Non-Functional Requirements
        2.2.1 Recover requirement
        2.2.2 Reliability requirement
        2.2.3 Extendibility requirement
        2.2.4 Resource utilization requirement
3 Interface requirements
    3.1 Application Programming Interfaces (API)
4 System Architecture
5 System Evolution
6 Special resources required

1 Introduction
1.1 Purpose
The purpose of this document is to set out the requirements for the development
of an analytical study between weather data variables recorded across Germany
and the goal outcome for matches played within the Bundesliga 1 & 2 Leagues.
The analysis of weather and match outcome built on 21 years of historical data will
enable all relevant users to gain a better insight into understanding the effects of
weather through the varied weather variables being considered and future match
outcome through predictive analysis.
The intended primary customers will include predominantly football teams,
coaches, and trainers and in particular those companies that provide betting
instrument products for football matches. Secondary customers however could
easily include those involved in any competitive or non-competitive sport where
weather is a factor.

1.2 Project Scope


The scope of the project is to develop an understanding of the relationship between
weather conditions and football match goal outcome within the Bundesliga 1 & 2
leagues played in Germany. One of the project's primary outcomes is to provide all
of the identified customers with a better understanding, knowledge and statistical
insight into weather effects on goal outcome that can be used to improve match
strategy and winning performance levels through the individual objectives outlined
below.
The data to be used as the basis of the analysis is comprised of ECA&D weather
data files and Football data UK files which are both freely available for noncommercial use.
The primary project objectives are listed below:

Objective #1 Determine if there is any link between goals scored and weather effects
within the Bundesliga 1&2 football Leagues.
Objective #2 Determine if there is any difference between the Bundesliga 1 &
Bundesliga 2 due to the effects of weather, location or smaller stadiums.
Objective #3 Determine if just single or multiple weather parameters predominantly
affect goal outcome.
Objective #4 Investigate if stadium location and regional local weather affects games
played there and match outcome.
Objective #5 Compare the outcomes of matches to under/over goal difference betting
instruments to determine if the spread of match results could have been
better predicted using the results of the analysis.
Objective #6 Attempt to use the data to predict goal outcome for a number of future
matches using weather predictions and selected betting instruments.
Objective #7 Determine if goal difference between teams is greater in periods of colder
or warmer weather and, if sustained, whether this affects a team's
performance over time.
Objective #8 Use analysis software including but not limited to Excel, Python, R and
SQL to gain knowledge in their use for analysing large data sets.

Successful Outcome Criteria: The project will be considered successful if the
primary research question(s) can be answered. In the context of this study, a
finding that weather has no determinable effect, or an outcome in which only one or
two of the project objectives can be answered, will be considered equally successful.

There are a number of restrictions and limitations of the project:


Weather Data: The weather data uses daily averages which may not equate to
the conditions being experienced during the actual match. For example rain may
have fallen before, after or during the match. The intention is to consider general
trends rather than specific instances. i.e. colder weather, warmer weather, very
wet, very dry. The weather data could be used to generate secondary parameters
either on their own or by combining two or more variables. For example wind chill
factor, Beaufort scale and heat index.
Football Data: The only variables being considered are the goal outcome (each
team and total) and the respective geolocation based on where the match was
played. No other variables such as the players, corners or other in match data will
be used.
Software and tools: All software and tools being used such as MySQL, R Studio,
Python and Microsoft Office are available within the College without cost or
unreasonable restriction.

Data use: Both data providers offer the data for free use. However the ECA&D in
their data policy document note that the data cannot be used for commercial
purposes. While the study is not commercial in nature any future usage of the study
must take this into account where and if applicable.
Time: The project must be completed within the specified time frame from the 13th
October 2014 until the deadline of 5th January 2015.
Budget: There is no required budget as all the required software and tools are
available without cost penalty within the College.
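
One of the secondary weather parameters mentioned under Weather Data above, wind chill, can be derived from the temperature and wind speed values. The Python sketch below uses the North American wind chill index (air temperature in degrees Celsius, wind speed in km/h), which is only defined for temperatures at or below 10 degrees Celsius and wind speeds above 4.8 km/h. The choice of this particular index and the function name are assumptions; the project may adopt a different measure.

```python
def wind_chill_c(temp_c, wind_ms):
    """Return the wind chill index in degrees Celsius.

    temp_c is the air temperature in degrees Celsius and wind_ms the wind
    speed in metres per second (the unit of the weather files after dividing
    the recorded value by 10). Outside the range where the index is defined
    the air temperature is returned unchanged.
    """
    wind_kmh = wind_ms * 3.6  # the formula expects km/h
    if temp_c > 10.0 or wind_kmh <= 4.8:
        return temp_c
    v = wind_kmh ** 0.16
    return 13.12 + 0.6215 * temp_c - 11.37 * v + 0.3965 * temp_c * v
```

For example, an air temperature of -5 degrees Celsius with a 20 km/h wind gives an index of roughly -11.6, noticeably colder than the measured temperature.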

1.3 Definitions, Acronyms, and Abbreviations


CSV

Comma Separated Values file format

ECA&D

European Climate Assessment & Data Set Project

Firefox

Open source internet browser

MySQL

Relational Database Program

PeaZip

Program to unzip and extract large zip files

Python

Open Source programming program

R Studio

Open Source statistical analysis program

Txt

Text file format

2 Requirements Specification
This section provides an overall description of the project and detailed descriptions
of the functional requirements that represent the key steps and processes that are
essential to ensure successful project completion.

2.1 Functional requirements


2.1.1 Use Case Diagram
The Use Case Diagram below provides a project overview of all the functional
requirements. The Data Analyst is involved in every step, with each use case using
the preceding use case from the first to the last step as indicated.

Figure 01_ Overall Use Case Diagram

2.1.2 Requirement 1: Extract/Collect Required Data Sets


2.1.2.1 Scope and Description

2.1.2.1.1 Description Overview & Priority


The data sets are downloaded from the websites identified. Downloading the data
is essential to being able to undertake analysis.

2.1.2.1.2 Inputs

Weather Data Website: http://eca.knmi.nl/


Football Data Website: http://football-data.co.uk/
PC with Firefox web browser
High Speed internet access
PeaZip Program
Storage device with at least 10GB of free space for all files

2.1.2.1.3 Processing
1/ Check data usage policy and obtain any permissions to use data sets.
2/ Access the weather website and download the five zip files plus station lists.
3/ Access the football website and use Firefox to download all 42 files across all
individual links simultaneously.
4/ Check files (sample) are viable and contain data as expected.

2.1.2.1.4 Outputs
Weather files: There will be around 5000 individual weather files in comma
delimited txt format. There will be five station lists, one for each weather variable
in comma delimited txt. Station lists provide a full description of every station for
that weather variable with a country code and location allowing for files to be
identified.
Football files: There will be 42 individual csv files: 21 for Bundesliga 1 and 21
for Bundesliga 2, representing each season of play.

2.1.2.1.5 Error handling


If the files or data are unavailable, corrupt, missing or damaged, or the links do not
work, then contact the site administrator to resolve.
2.1.2.2 Use Case 001 Data Collection/Extraction
Scope
The scope of this use case is to acquire the weather and football data sets
and any other supporting information required at this stage.
Description
The data analyst accesses each website and downloads the files to a
dedicated project directory.
Use Case Diagram


Figure 02 Use Case 001


Flow Description
Precondition
There is an active internet connection. There is a PC with internet browser
and suitable storage device with at least 10GB of storage to hold the files
when un-zipped.
Activation
The use case starts when the websites are accessed for the purposes of
downloading the data sets.
Main flow
1. The ECA&D Website is accessed
2. The weather blended ZIP files for Average Temperature, Rainfall,
Cloud Cover, Humidity and Average Wind Speed are downloaded to
the dedicated project directory.
3. The Station list txt files for each weather variable are downloaded.
4. The Football Data UK website is accessed
5. The 42 individual links to the CSV files for each season of play and
league are downloaded in a single operation, using the Firefox add-on
download it all, into the dedicated project directory.
6. The ZIP files are opened using PeaZip to unpack the files and
visually check that the operation was successful.
7. Two random CSV files are opened to visually confirm that the data
is not compromised and is all present.
Termination
The project folder shows all the required files as being present.
Post condition
The data sets are ready to be used.
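
The visual checks in steps 6 and 7 could be supplemented by a small script. The following is a minimal sketch only, not part of the project as specified: the function name check_downloads and the flat directory layout (ZIP and CSV files side by side in the project directory) are assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def check_downloads(project_dir, expected_csv_count=42):
    """Return a list of problems found in the project directory (hypothetical helper)."""
    project_dir = Path(project_dir)
    problems = []

    # Every ZIP archive should open and pass its CRC check.
    for zip_path in sorted(project_dir.glob("*.zip")):
        try:
            with zipfile.ZipFile(zip_path) as archive:
                bad_member = archive.testzip()  # None when all members are intact
                if bad_member is not None:
                    problems.append(f"{zip_path.name}: corrupt member {bad_member}")
        except zipfile.BadZipFile:
            problems.append(f"{zip_path.name}: not a valid ZIP file")

    # The football download should yield the expected number of non-empty CSV files.
    csv_files = sorted(project_dir.glob("*.csv"))
    if len(csv_files) != expected_csv_count:
        problems.append(
            f"expected {expected_csv_count} CSV files, found {len(csv_files)}"
        )
    for csv_path in csv_files:
        if csv_path.stat().st_size == 0:
            problems.append(f"{csv_path.name}: file is empty")

    return problems
```

An empty list would indicate that the termination condition (all required files present) is plausibly met; it does not replace opening a sample of files to inspect the content.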

2.1.3 Requirement 2: Filter Weather files


2.1.3.1 Scope and Description

2.1.3.1.1 Description Overview & Priority


The five zipped weather files each contain between 500 and 2600 individual
weather station txt files. Together they represent the entire published weather
station data set across Europe, comprising 5000+ txt files. The required txt files
for each weather parameter need to be identified and extracted based on their
unique numeric file name. The remaining weather files can be discarded. This
requirement is high priority, as it is not possible to proceed until the correct files
and locations are identified.

2.1.3.1.2 Inputs
The five zip files from use case 001.
The station list comma delimited txt file for each of the ZIP files.
Football data files.

2.1.3.1.3 Processing
The football data files are filtered to provide a list of the teams that played for each
year of play. From this a master list is created that shows every team that played
in every season and the stadium location is matched to this with its decimal
degrees location.
The station lists are copied into Microsoft Excel. Each parameter is filtered using
the two digit country code to leave just Germany's entries. The smallest list is used,
as all five weather variables for each location must come from the same station.
The latitude and longitude data is split into its three constituent parts and the
decimal equivalent calculated. A geolocation program using R and the list of
weather stations in Excel is used to determine the closest weather station for each
of the stadiums. The weather station code is matched to each stadium. This
eventually provides a list of the actual weather stations that will be used. These
files are extracted and the rest discarded.
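
The coordinate conversion and nearest-station matching described above can be sketched as follows. This is an illustration in Python rather than the R and Google Maps packages named in this specification, and it swaps in the haversine great-circle formula as the distance measure. The function names and the station dictionary layout are assumptions.

```python
import math


def dms_to_decimal(degrees, minutes, seconds):
    """Convert degrees, minutes, seconds to decimal degrees.

    The sign is taken from the degrees part, which is sufficient for
    German coordinates (all positive).
    """
    sign = -1 if degrees < 0 else 1
    return sign * (abs(degrees) + minutes / 60 + seconds / 3600)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_station(stadium, stations):
    """Return the id of the station closest to the stadium.

    stadium is a (lat, lon) pair; stations maps station id -> (lat, lon).
    """
    return min(stations, key=lambda sid: haversine_km(*stadium, *stations[sid]))
```

Because only stations present in all five filtered lists are candidates, the stations argument would be built from the smallest list.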

2.1.3.1.4 Outputs
An Excel file that contains a list of every team that has played across all 21 seasons
of play for both leagues, with their respective stadiums. A series of txt files (comma
delimited) that are the actual weather files needed for analysis with the football
data.
2.1.3.2 Use Case 002 Filter weather files
Scope
The scope of this use case is to filter the weather files and identify just those
particular weather stations that are relevant to the study and match actual
football stadia as closely as possible.
Description
This use case describes the process of identifying only those weather files
that match the actual stadia used over the 21 years of Bundesliga 1&2 match
history.
Use Case Diagram

Figure 03 Use Case Diagram 02


Flow Description
Precondition
The data files from use case 001 are available, together with a PC with Microsoft
Excel and R Studio with the Google maps and map distance packages installed.
Activation
This use case starts when the data analyst completes use case 001.
Main flow
1. The participating teams for each season of play are entered into an
Excel file to provide an overall summary list of every unique team that
has played across all 21 seasons of play for both Bundesliga 1 & 2.
2. The decimal degrees location of each stadium is established and
added to the master list.
3. The weather station files are filtered on DE for Germany for each of
the five weather parameters. The smallest list is used to ensure that all
five variables exist for each unique weather station being used.
4. The stadium location is matched to the nearest weather station using
R Studio distance location. The results are entered onto the master
list to provide a complete description of each stadium and the weather
station that it relates to.
5. The relevant individual files are extracted from each of the five zip files.
All other files are removed and discarded.
Termination
All the non-relevant weather files are discarded and the project folder
contains only the files needed.
Post condition
The correct weather files to be used are located in the project directory.

2.1.4 Requirement 3: Transform Data


2.1.4.1 Scope and Description

2.1.4.1.1 Description Overview & Priority


The weather and football data files are prepared and treated to remove all
unrequired data, columns, characters and noise. The data is
error checked for NULL or missing values.


2.1.4.1.2 Inputs
The weather files and football data files from use case 002.

2.1.4.1.3 Processing
Football data: The 21 season files for each league are loaded into R and bound
into a single file per league. All unwanted columns and
data are removed. The date is reformatted to a standard ISO format.
Weather data: The weather files are checked for null values.
All unwanted columns and headers are removed and the numeric element
corrected by a factor of ten. Each of the five elements is joined on date to provide
a single weather file for each weather station location containing all five
parameters. All weather information prior to 1993 is removed.
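
The weather processing above can be sketched as below. This is a hedged illustration in Python, not the project's R script. It assumes that dates arrive as YYYYMMDD text, that the "factor of ten" correction means dividing the stored integer by ten, and that "prior to 1993" means before 1 January 1993. The function names are hypothetical.

```python
from datetime import date

MISSING = -9999  # missing-value marker used in the weather files
CUTOFF = date(1993, 1, 1)  # assumption: keep observations from this day onwards


def clean_series(rows, divisor=10):
    """Clean one weather variable.

    rows is an iterable of (yyyymmdd_text, raw_integer) pairs.
    Returns {date: float}, with None where the value was missing.
    """
    cleaned = {}
    for day_text, raw in rows:
        day = date(int(day_text[:4]), int(day_text[4:6]), int(day_text[6:8]))
        if day < CUTOFF:
            continue  # drop everything prior to 1993
        cleaned[day] = None if raw == MISSING else raw / divisor
    return cleaned


def join_on_date(series_by_variable):
    """Inner-join several {date: value} series into {date: {variable: value}}."""
    common_days = set.intersection(*(set(s) for s in series_by_variable.values()))
    return {
        day: {name: series[day] for name, series in series_by_variable.items()}
        for day in sorted(common_days)
    }
```

An inner join is used so that each retained date carries all five parameters; dates missing from any one series are dropped.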

2.1.4.1.4 Outputs
Two CSV files. One for each of the Bundesliga 1&2 leagues.
A single CSV file for each weather station location containing all five weather
variables.

2.1.4.1.5 Error handling


If the error checking process detects null values for weather data (these are shown
as -9999), they will need to be checked against actual dates played to see if a conflict
occurs. If the error spans only a few days, a comparable value will be
interpolated from the data either side. If the error persists across a large range,
this weather station will need to be discarded for all five variables and a new
one (the next nearest) selected in its place.
2.1.4.2 Use Case 003 Transform Data
Scope
The scope of this use case is to clean up the data files by removing all
unwanted information, bind and join the files together to create a smaller
number of files with common information and error check the weather data
for null values to determine if an alternative station location will need to be
used.
Description
This use case describes the process of cleaning, transforming and error
checking the data prior to its use within a database and for analysis.
Use Case Diagram


Figure 04 Use Case 003


Flow Description
Precondition
All of the required files have been extracted and downloaded and unzipped
where required.
Activation
This use case starts when the data analyst starts the R script to begin
cleaning and transforming the data files.
Main flow
1. The R script opens each weather file and clips the data by removing
unwanted header information and removing all dates prior to 1993. All
unwanted columns are removed leaving just the date and weather
variable. R closes each file and saves as a csv.
2. A Python script opens each weather file and checks for NULL values.
An output file is created listing all detected values.
3. If an error is detected, the error handling process commences (A1).
4. If there is no error across all five variables the files are joined and
saved as a csv.
5. The football data files are bound vertically and all unwanted columns
removed. The date is reformatted to a standard ISO format.
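
Step 5 of the main flow can be sketched as follows, in Python rather than the R script the project uses. The column names in KEEP and the dd/mm/yy and dd/mm/yyyy input date formats are assumptions about the football files, not something this specification states.

```python
import csv
from datetime import datetime

# Assumed names of the columns to retain; all others are dropped.
KEEP = ["Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG"]


def to_iso(day_text):
    """Parse dd/mm/yy or dd/mm/yyyy and return an ISO 8601 date (yyyy-mm-dd)."""
    for pattern in ("%d/%m/%y", "%d/%m/%Y"):
        try:
            return datetime.strptime(day_text, pattern).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {day_text!r}")


def bind_seasons(season_files):
    """Bind season files vertically, keeping only the wanted columns.

    season_files is an iterable of open text files, one per season.
    Returns a list of dict rows with the date in ISO format.
    """
    bound = []
    for handle in season_files:
        for row in csv.DictReader(handle):
            if not row.get("Date"):
                continue  # skip blank trailing rows
            trimmed = {name: row[name] for name in KEEP}
            trimmed["Date"] = to_iso(trimmed["Date"])
            bound.append(trimmed)
    return bound
```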

Alternate flow
A1 : Detection of NULL values in weather files
1. Where there are isolated null values, the dates are checked against
actual match dates, as matches are typically only played at weekends.
If the dates do not match, continue with the main flow. If the dates do
conflict, interpolate a value based on the values either side.
2. If there are a significant number of NULL values, denoting missing
values across a number of weeks or months, the weather station
is discarded and the next nearest one is selected in lieu.
3. The use case continues at position (1) or (4) of the main flow,
depending on whether the new weather file has already been cleaned and
treated.
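
The decision logic of alternate flow A1 can be sketched as below. This is an illustrative outline only. The seven day threshold separating an isolated gap from a significant one is an assumption, since the flow only distinguishes "isolated" values from gaps of "weeks or months", and linear interpolation is one possible reading of "a value based on the values either side".

```python
from datetime import timedelta


def find_gaps(series):
    """Return (first_missing_day, last_missing_day) for each run of None values.

    series is {date: value or None} and is assumed to contain consecutive days.
    """
    gaps, run = [], []
    for day in sorted(series):
        if series[day] is None:
            run.append(day)
        elif run:
            gaps.append((run[0], run[-1]))
            run = []
    if run:
        gaps.append((run[0], run[-1]))
    return gaps


def handle_gap(series, gap, match_dates, max_gap_days=7):
    """Apply flow A1 to one gap; returns 'ignore', 'interpolated' or 'replace_station'."""
    first, last = gap
    length = (last - first).days + 1
    if length > max_gap_days:
        return "replace_station"  # significant gap: select the next nearest station
    missing_days = [first + timedelta(days=i) for i in range(length)]
    if not any(day in match_dates for day in missing_days):
        return "ignore"  # no match was played on the missing days
    before = series.get(first - timedelta(days=1))
    after = series.get(last + timedelta(days=1))
    if before is None or after is None:
        return "replace_station"  # nothing to interpolate from
    step = (after - before) / (length + 1)
    for i, day in enumerate(missing_days, start=1):
        series[day] = before + step * i  # linear interpolation, fills in place
    return "interpolated"
```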

Termination
Python will have completed all error handling and R will have bound or joined
all files.
Post condition
The project directory will contain the finished files for the weather and football.

2.1.5 Requirement 4: Database Management


2.1.5.1 Scope and Description

2.1.5.1.1 Description Overview & Priority


The data is entered into a relational database structure such as MySQL. All tables,
entities, and relationships are created as required.

2.1.5.1.2 Inputs
The cleaned files from use case 003. A relational database system like MySQL or
sqldf (an SQL add on for R) is used to manage the data. This will allow the data to
be manipulated and visualised during the analysis stage.

2.1.5.1.3 Processing
An entity relationship diagram is created with all key entities, attributes and
relationships. Primary and foreign keys are created or identified. The data is loaded
into the database.
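
As an illustration of what such a structure might look like, the sketch below creates three tables with primary and foreign keys. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table and column names are hypothetical and are not taken from the project's class diagram.

```python
import sqlite3

# Hypothetical schema: one row per stadium, per station-day, and per fixture.
SCHEMA = """
CREATE TABLE stadium (
    stadium_id  INTEGER PRIMARY KEY,
    team        TEXT NOT NULL,
    station_id  INTEGER NOT NULL
);
CREATE TABLE weather (
    station_id  INTEGER NOT NULL,
    obs_date    TEXT NOT NULL,
    avg_temp    REAL,
    rainfall    REAL,
    cloud_cover REAL,
    humidity    REAL,
    wind_speed  REAL,
    PRIMARY KEY (station_id, obs_date)
);
CREATE TABLE fixture (
    fixture_id  INTEGER PRIMARY KEY,
    match_date  TEXT NOT NULL,
    stadium_id  INTEGER NOT NULL REFERENCES stadium(stadium_id),
    home_goals  INTEGER NOT NULL,
    away_goals  INTEGER NOT NULL
);
"""

# Joins each fixture to the weather recorded at its stadium's station that day.
MATCH_WEATHER_QUERY = """
SELECT f.match_date, f.home_goals + f.away_goals AS total_goals,
       w.avg_temp, w.rainfall
FROM fixture AS f
JOIN stadium AS s ON s.stadium_id = f.stadium_id
JOIN weather AS w ON w.station_id = s.station_id
                 AND w.obs_date = f.match_date
"""


def create_database(path=":memory:"):
    """Create the tables and return an open connection."""
    connection = sqlite3.connect(path)
    connection.execute("PRAGMA foreign_keys = ON")  # enforce the foreign key
    connection.executescript(SCHEMA)
    return connection
```

Storing dates as ISO text lets the fixture and weather tables join directly on the date produced during the transformation stage.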


2.1.5.1.4 Outputs
A relational database, fully normalized with clear relationships and no null values
or errors.
2.1.5.2 Use Case 004 Database Management
Scope
The scope of this use case is to create the relational database structure that
will allow manipulation and visualisation of the data.
Description
This use case describes the process of the relational database creation and
its management.
Use Case Diagram

Figure 05 Use case 004


Flow Description
Precondition
Use case 003 is fully completed. There is access to a relational database
structure like MySQL to allow data to be imported.
Activation
This use case can commence any time after use case 003 is completed and
begins when a Python script is run which begins the process of loading the
files into the database.
Main flow
1. The database is created with all tables and relationships.
2. A Python script loads the football and weather data files.
3. The program returns a successful outcome message.
4. The database is created and ready for use.

Termination
The database is fully created and all data is loaded and there are no errors
or missing attributes or relationships. The python script returns a task
completed message.
Post condition
The data is now entered into a relational database and is ready for use.

2.1.6 Requirement 5: Data Analysis


2.1.6.1 Scope and Description

2.1.6.1.1 Description Overview & Priority


Data analysis is a core function of the project. The primary objectives of the
project relate to the analysis of the data sets that have
been prepared in the previous requirements and use cases.

2.1.6.1.2 Inputs
The relational database with all data loaded in.

2.1.6.1.3 Processing
Access the database, manipulate the data, and use MySQL and R Studio to run scripts
and statistical analysis on the generated queries based on the primary project
objectives. Create the required graphs, tables, mapping and other outputs to
visualise the results for compilation in the report and presentation. Interpret the
results and document them.

2.1.6.1.4 Outputs
A variety of graphs, tables, maps and charts to be included in the presentation and
report.
2.1.6.2 Use Case 005 Data Analysis
Scope
The scope of this use case is to undertake statistical and data mining
activities to determine how weather is related to match goal outcome.
Description
This use case describes the process of statistical analysis and data mining
activities on the data set followed by interpretation of the results and the
creation of outputs to describe and explain the data.
Use Case Diagram

Figure 06 Use Case 005


Flow Description
Precondition
The relational database in use case 004 is ready for use and accepting
queries.
Activation
This use case starts when the relational database (use case 004) is
completed.
Main flow
1. Undertake the data mining and statistical analysis based on the
project's primary objectives.
2. Generate Queries and scripts to support research and project
objectives.
3. Output tables, graphs, charts and maps to present the outcome of
the analysis.
4. Interpret the results and document what they mean.
5. Create the appropriate part of the report using the gathered
information.
Termination
The use case is terminated when the primary research objectives have been
answered and the relevant report section and presentation is completed.
Post condition
A report & presentation draft structure that provides discussion and
explanation of the results supported by graphs, tables, charts and maps.

2.1.7 Requirement 6: Data Prediction Modelling


2.1.7.1 Scope and Description

2.1.7.1.1 Description Overview & Priority


Predictive Modelling is the process by which the data analysis results can be used
in conjunction with weather data to make predictions on matches yet to be played.
This process has a high priority because being able to make predictions about data
provides a potentially useful tool for the customer.

2.1.7.1.2 Inputs
The interpreted results from use case 005.

2.1.7.1.3 Processing
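A random 80% of the data is designated as training data and the remainder as test
data. The models learn from the training data and are then applied to the test data
to see whether they can correctly determine match outcome. Depending on time
frames and availability, the model may also be applied to future matches using
match fixtures and weather forecasts.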


2.1.7.1.4 Outputs
A predictive model that allows for customers and

2.1.7.1.5 Error handling


If the analysis is inconclusive or there is no relationship between the data sets then
the scope for a predictive model is reduced. In this case it may be possible to
determine some general trends or patterns if such exist. If no relationship exists
at all then no predictive modelling may be possible.
2.1.7.2 Use Case 006 Predictive Modelling
Scope
The scope of this use case is to provide a predictive modelling program or
overall trends and insights that provides the customer(s) with information that
may provide a competitive edge in actual play or through betting instruments.
Description
This use case describes the process to determine goal outcome using match
fixture data and weather forecasting information between 1 and 5 days in
advance of any game being played.
Use Case Diagram

Figure 07 Use Case 006

Flow Description
Precondition
The primary data analysis is completed and all results have been interpreted
and documented. The results show trends and patterns that allow for
predictive modelling to be undertaken.
Activation
This use case starts when the analysis in use case 005 is complete and the
predictive analysis is undertaken using R.
Main flow
1. Predictive modelling process commences.
2. 80% of the data is selected at random and designated as training
data. The remainder will be the actual test data.
3. The programs and models learn from the training data and this is
then applied to the test data to see if the model is able to correctly
determine match outcome.
4. Depending on time frames and availability the model may also be
applied to actual future football matches using match fixtures and
predicted weather forecasting.
5. The results are documented and interpreted.
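
Steps 2 and 3 can be sketched as follows. The activation section states that the predictive analysis is undertaken using R; this Python outline only illustrates the 80/20 random split and the accuracy check. The function names, the fixed seed and the (features, outcome) row layout are assumptions.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows and split them into training and test sets.

    A fixed seed keeps the random split reproducible between runs.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, test_rows):
    """Share of test rows for which predict(features) equals the recorded outcome.

    Each row is a (features, outcome) pair; predict is any fitted model
    wrapped as a function of the features.
    """
    hits = sum(1 for features, outcome in test_rows if predict(features) == outcome)
    return hits / len(test_rows)
```

Comparing the accuracy on the test set against a simple baseline, such as always predicting the most common outcome in the training data, would indicate whether the weather variables add any predictive value.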
Alternate flow
A1 : No clear relationships
1. Where no clear relationship exists between the data sets and
predictive modelling is not applicable then any general trends or
patterns will be discussed if applicable. (Returns to main process step
5.)
Termination
The use case ends either when the best predictive model is produced based
on the data analysis or it is determined that no model can be created.
Post condition
A report and interpretation of the results providing either predictive modelling
or general trends and patterns where possible.


2.1.8 Requirement 7: Report Production and Output


2.1.8.1 Scope and Description

2.1.8.1.1 Description Overview & Priority


A report is created which outlines the entire project and all results, findings,
discussions and any predictive modelling.

2.1.8.1.2 Inputs
The primary data analysis results from use case 005 and predictive modelling
results from use case 006.

2.1.8.1.3 Processing
A clear and detailed report is created to allow the customer to review the project
and understand the data sets and all relationships and trends that exist. In addition
a short presentation will be created alongside this to present the key findings to
the customer.

2.1.8.1.4 Outputs
A detailed and clear printed and electronic report of 10,000 to 12,000 words in line
with the customer's requirements and a PowerPoint presentation file.
2.1.8.2 Use Case 007 Report Production and Output
Scope
The scope of this use case is to create a final report and presentation to be
presented to the customer.
Description
This use case describes the process of creating the final report and
presentation material.
Use Case Diagram


Figure 08 Use case 007


Flow Description
Precondition
The data analysis and predictive modelling use cases are completed.
Activation
This use case starts when the previous two use cases are completed.
Main flow
1. The data analysis is used to create the final report
2. The data analysis and report is used to create the presentation
3. Both the presentation and report are presented to the customer for
review.
Termination
The use case ends once the final and finished report and presentation are
fully completed and ready to be issued/presented to the customer.
Post condition
The report and presentation are delivered to the customer for review.


2.2 Non-Functional Requirements


Specifies any other particular non-functional attributes required by the system.

2.2.1 Recover requirement


The project data, analysis programs and output reports must be stored on at least
three different media that are independent of each other. They should include at
least one PC, one high capacity USB drive or external hard drive and a cloud
storage system such as Google Drive. In the event of accidental loss or damage,
all of the project data and analysis can then be easily recovered and reinstated.

2.2.2 Reliability requirement


The weather data is checked on an ongoing basis by ECA&D and the database is
updated to reflect the addition of new data as this is time series based. The football
data is also checked for errors and both data sets are deemed to be highly
accurate.

2.2.3 Extendibility requirement


There is no requirement to extend the project at this stage. However, due to the
ongoing addition of new weather data and ongoing football games being played
there is new data being created year on year which could add to the overall data
sets being used for this project.

2.2.4 Resource utilization requirement


The project will require a PC with a high speed internet connection to ensure timely
download of the estimated 1GB of data required. The required software packages
include Microsoft Office, Python, R Studio, a web browser,
PeaZip and MySQL.

3 Interface requirements
3.1 Application Programming Interfaces (API)
The analysis process will use Google maps to provide mapping visualisation tools
for use within R Studio via dedicated add-ons. Google mapping is used to plot
weather station and stadium locations and also to determine the closest distance
between them where it is not clear or there is a choice of weather stations in close
proximity. Some analysis output results may be presented using Google Mapping
visualisation tools.


4 System Architecture
The overall system architecture is shown as a high level diagram in Figure 09. The
system, with its steps and processes, is shown within the dotted line, which
represents the use cases outlined above.
Overall System Architecture

Figure 09 Overall system Architecture

UML Class Diagram for Database Structure
The database will have three classes as shown below with attributes. This is a draft
class diagram which also forms the basis of an Entity Relationship diagram.

Figure 10 UML Class diagram for the Database structure

5 System Evolution
As outlined in section 2.2.3 weather data and football results are time series data
which are being added to on an ongoing basis. Both the ECA&D and Football
results providers add to the data sets on a continual basis providing the option of
additional data to be included in any future study.
The system could also consider extending the range of countries, as the ECA&D
holds data for a large number of European countries, although outside of countries
like Germany the data is less detailed and less reliable, with
higher incidences of null values. The inclusion of very hot or wet countries like
Spain and Italy could reveal patterns or trends across Europe as a whole.

6 Special resources required


No special resources are required or anticipated at this stage.


8.10 Management Progress Reports


8.10.1 Management Progress Report 1

Management Progress Report 01

Highlight Report

The effects of weather on goal outcome for football matches played within the German Bundesliga

Release: Management Progress Report 01
Date: 31st October 2014
Authors: Alastair Macnair (x13129325)

1 Report History

1.1 Document Location

This document is only valid on the day it was printed.

1.2 Revision History

Revision date   Author             Version   Summary of Changes   Changes marked
31/10/2014      Alastair Macnair   01        Initial Issue

1.3 Approvals

This document requires the following approvals:

Name                Title                Date of Issue   Version
Ioana Ghergulescu   Project Supervisor   31/10/2014      01

1.4 Distribution

This document has additionally been distributed to:

Name   Title   Date of Issue   Status

Table of Contents

1 Report History
    1.1 Document Location
    1.2 Revision History
    1.3 Approvals
    1.4 Distribution
Highlight Report from 15th September 2014 to 31st October 2014
Highlight Report Purpose
Summary of Project Progress
    4.1 Project Plan Table and summary status
Key Milestones Achieved in this period
Problems encountered in this period
Highlighting Concerns (RAID Log)
Variance from Plan
Planned Work for Next Period (to 30-11-2014)
10 Appendix A Revised Project Plan Gantt Chart
11 Appendix B Risks, Assumptions, Issues and Dependencies (RAID) Logs

Highlight Report from 15th September 2014 to 31st October 2014.

Highlight Report Purpose

A Highlight report provides the Project Board and Client/Customer with a summary of the status of a
project at agreed stages and is used to monitor progress. The Project manager/Data Analyst uses the
Highlight report to advise the Project Board/Client of any potential problems or areas where they
could help.

Summary of Project Progress


Data Sets Downloaded and preliminary check indicates they are OK
Bundesliga 1 Team and Stadium List 100% completed, weather station list compiled
Bundesliga 2 Team and Stadium List 65% completed, weather station list yet to be completed
R script written to match decimal degree co-ordinates of each stadium to nearest weather station to create stadium lists above
R script written to bind all football match data files
R script written to strip and clean football and weather data files
Python script written to error check weather files for missing values
Introduction started
Literature Review started. Primary sections initially identified
Data description Section started
All 21 years of Bundesliga results bound into one file
Some research started in technology and statistical analysis

Some areas have been started slightly ahead of schedule, as the project table summary indicates.

The project plan Gantt chart was re-done from scratch to make the most of Excel's ability
to provide a simple project plan. A table was built using all currently identified tasks, although
some were omitted to ensure clarity of the overall chart. The revised table can be adjusted and
updated much more easily, and the Gantt chart (Appendix A) reflects changes automatically. The use
of colour allows specific sub groups of tasks to be better identified. The table uses a traffic light
system to identify sections completed, in progress and yet to be started. The Project plan table is
shown below: -

4.1 Project Plan Table and summary status


Task                                                    Start Date  Duration  End Date    Status

PROJECT PROPOSAL
  Project Proposal Submitted                            28-Sep-14             28-Sep-14   Completed
REQUIREMENTS SPECIFICATION
  Requirements Specification                            01-Oct-14   12        12-Oct-14   Completed
  Requirements Specification Submitted                  12-Oct-14   1         12-Oct-14   Completed
MANAGEMENT REPORT
  Management Report 01                                  31-Oct-14   1         31-Oct-14   Completed
  Management Report 02                                  30-Nov-14   1         30-Nov-14   To be started
  Management Report 03                                  20-Dec-14   1         20-Dec-14   To be started
KEY RESEARCH AREAS
  Germany's Weather                                     01-Nov-14   12        12-Nov-14   In Progress
  Statistical Tools                                     15-Sep-14   60        13-Nov-14   In Progress
  Sports Performance Factors                            01-Nov-14   12        12-Nov-14   To be started
  Stadium Design                                        01-Nov-14   10        10-Nov-14   To be started
  Technology and Tools                                  15-Sep-14   55        08-Nov-14   In Progress
DATA EXTRACTION
  Download Weather Files                                29-Sep-14   1         29-Sep-14   Completed
  Download Football Files                               29-Sep-14   1         29-Sep-14   Completed
FILTER WEATHER FILES
  Collate Bundesliga Stadium list                       14-Oct-14   20        02-Nov-14   In Progress
  Extract required weather stations                     02-Nov-14   1         02-Nov-14   To be started
DATA TRANSFORMATION
  Write R script to clean and transform weather files   01-Nov-14   5         05-Nov-14   In Progress
  Write R script to clean and transform football files  30-Oct-14   2         31-Oct-14   Completed
  Bind and clean all football files                     30-Oct-14   2         31-Oct-14   Completed
  JOIN all weather parameters for each station          05-Nov-14   1         05-Nov-14   To be started
DATABASE MANAGEMENT
  Design Relational Database                            01-Nov-14   2         02-Nov-14   In Progress
  Insert Data into SQL database                         06-Nov-14   1         06-Nov-14   To be started
DATA ANALYSIS
  Analyse the database                                  06-Nov-14   15        20-Nov-14   To be started
  Graphing and Visualisation                            15-Nov-14   6         20-Nov-14   To be started
PREDICTION
  Predictive Modelling                                  20-Nov-14             21-Nov-14   To be started
REPORT WRITING
  Introduction                                          31-Oct-14   4         03-Nov-14   In Progress
  Literature Review                                     04-Nov-14   8         11-Nov-14   In Progress
  Data Set Description                                  12-Nov-14   2         13-Nov-14   In Progress
  Discussion                                            20-Nov-14   14        03-Dec-14   To be started
  Conclusion                                            03-Dec-14   3         05-Dec-14   To be started
  Checking & Review                                     08-Dec-14   5         12-Dec-14   To be started
  Print 3 Copies and Bind                               15-Dec-14   3         17-Dec-14   To be started
  Submit Dissertation                                   06-Jan-15   1         06-Jan-15   To be started
PROJECT PRESENTATION
  Prepare & Practice Presentation                       06-Jan-15   14        19-Jan-15   To be started
  Make Presentation                                     19-Jan-15   5         23-Jan-15   To be started


Key Milestones Achieved in this period


Project Proposal document Updated and re-issued 28th September 2014
Requirements Specification document completed and issued 12th October 2014
Project Plan revised and updated 31st October 2014 (See Appendix A)

Problems encountered in this period


When the stadium list was compiled the total number of individual teams that have played over
the 21 year period was much larger than anticipated. Both the Bundesliga 1 & 2 leagues have a
total of 18 teams each at any one time. Bundesliga 1 has 38 unique teams and Bundesliga 2
features 67 teams over the 21 year period. This added significantly to the time needed to create
the stadium list.
Some teams have changed name or stadium over the time period.
The European Climate Assessment & Dataset website was updating its datasets for a large part of
October, during which the files were unavailable for download.

Highlighting Concerns (RAID Log)

The full RAID log is provided in Appendix B. Summary of the RAID log: -

Risks
Four risks have been identified that could affect the project in the next period. One has
been resolved relatively easily.
Assumptions
One long range assumption has been identified relating to the holiday period and available resources
over this period to complete the project.
Issues
Five issues were encountered in the reported period. Four were dealt with and the fifth is due to be
resolved within the next few days.
Dependencies
There is only one current dependency which is ensuring the cleaned and transformed data is
completed on time to allow for database creation and analysis which is a major and key part of the
project.

Variance from Plan


The original plan had assumed that more work would have been completed during October, which
has not been the case. Other external demands limited the time available to invest in the project.
The original plan assumed more work over the Christmas period than could be considered
realistic upon reflection. Based on bank holidays and potential business closures, the project has
been pulled back from the Christmas period.


Planned Work for Next Period (to 30-11-2014)

The vast majority of the key tasks will be undertaken in the next work period, including
the main analysis and report writing.

Finalise and complete stadium lists and weather file extraction


Clean and bind/join all required data files
Update all final data into database
Undertake analysis and all graphing tables and charts
Write first three primary report sections
Undertake ongoing research to support analysis and report writing


10 Appendix A Revised Project Plan Gantt Chart

PROJECT PROPOSAL
Project Proposal Submitted
REQUIREMENTS SPECIFICATION
Requirements Specification
Requirements Specification Submitted
MANAGEMENT REPORT
Management Report 01
Management Report 02
Management Report 03
KEY RESEARCH AREAS
Germany's Weather
Statistical Tools
Sports Performance Factors
Stadium Design
Technology and Tools
DATA EXTRACTION
Download Weather Files
Download Football Files
FILTER WEATHER FILES
Collate Bundesliga Stadium list
Extract required weather stations
DATA TRANSFORMATION
Write R script to clean and transform weather files
Write R script to clean and transform football files
Bind and clean all football files
JOIN all weather parameters for each station
DATABASE MANAGEMENT
Design Relational Database
Insert Data into SQL database
DATA ANALYSIS
Analyse the database
Graphing and Visualisation
PREDICTION
Predictive Modelling
REPORT WRITING
Introduction
Literature Review
Data Set Description
Discussion
Conclusion
Checking & Review
Print 3 Copies and Bind
Submit Dissertation
PROJECT PRESENTATION
Prepare & Practice Presentation
Make Presentation


11 Appendix B Risks, Assumptions, Issues and Dependencies (RAID) Logs

RISKS

Date Raised: 28/09/2014
Risk Description: Available time until project deadline is limited
Severity: 9
Mitigation Plan: Ensure constant review of project plan and keep to project plan deadlines.
Status: Open

Date Raised: 15/09/2014
Risk Description: Project data is lost due to PC or USB key loss/damage
Severity: 15
Mitigation Plan: A dedicated Google Drive space has been set up in addition to storage on a PC and USB key,
creating three distinct storage places. Back up latest files at least once a week or after
significant work development.
Status: Closed
Date Closed: 15/10/2014

Date Raised: 20/09/2014
Risk Description: Ability to apply appropriate statistical and algorithm applications to data limited due to
time and knowledge shortfalls, impacting on project quality
Severity: 3
Mitigation Plan: Ensure ongoing research and adherence to project plan as well as timely undertaking of
literature review to ensure all knowledge areas are complete
Status: Open

Date Raised: 28/09/2014
Risk Description: Period of Christmas is potentially much less useful than it appears, especially as to
working days and using external providers to print and bind prior to the deadline
Severity: 15
Mitigation Plan: Pull back project timeline to allow for printing before Christmas. Identify
businesses that provide this service and determine opening hours as early as possible
Status: Open


ISSUES

Date Raised: 14/10/2014
Issue Description: Bundesliga total team range over 21 years significantly higher than anticipated, adding to
work and creating lots of 'one-off' teams over the period
Impact Description: Adds to time required to collate stadia list and identify stadia and potentially affects
the analysis where single teams and stadia are present over the
Impact: Medium
Priority: High
Mitigation Plan: Allow more time and adjust project plan accordingly

Date Raised: 14/10/2014
Issue Description: Some teams have changed over the years and stadium location has changed
Impact Description: Over 21 years teams have changed in name and even location
Impact: Low
Priority: Low
Mitigation Plan: Number of teams upon analysis is only one or two. Make a reasonable decision to place and
re-name to ensure consistency across all 21 years

Date Raised: 25/10/2014
Issue Description: Project plan has slipped from original
Impact Description: Original plan had allowed for much more work to be completed in this period, placing
additional stress into next work period.
Impact: Medium
Priority: Medium
Mitigation Plan: Revise plan as realistically as possible and better allow for external pressures

Date Raised: 14/10/2014
Issue Description: ECA&D website undertook a major update and some data sets could not be downloaded
Impact Description: Unable to download primary data sets as more were added due to project proposal revision
Impact: High
Priority: High
Mitigation Plan: Check every day until data sets become available for download

Date Raised: 25/10/2014
Issue Description: Existing Gantt Chart cannot be easily adjusted or updated to reflect change
Impact Description: Unable to plan and manage project and significant time to adjust chart moving forward.
Danger is that this is not done and ability to manage is compromised.
Impact: Medium
Priority: Medium
Mitigation Plan: Revise plan to use changeable date format and days that automatically updates chart

Status: four of the five issues are Closed and one is Open.


ASSUMPTIONS

Date Raised: 30/10/2014
Assumption Description: Businesses that print and bind may be closed over the Christmas period
Reason for Assumption: Holidays and reduced demand may see these businesses close over an extended period
Action to Validate: Identify businesses required to bind report and clarify opening hours well before
Christmas period
Impact if Assumption Incorrect: More time in project plan currently not being utilised.
Status: Open

DEPENDENCIES

Date Raised: 20/10/2014
Dependency Description: Analysis and database creation is dependent upon stadium list being compiled
Location: Internal
Deliverables: Bound files for weather station locations and stadium lists.
Delivery Date: 02/11/2014
Importance: High
Status: Open

8.10.2 Management Progress Report 2

Management Progress Report 02

Highlight Report

The effects of weather on goal outcome for football matches played within the German Bundesliga

Release: Management Progress Report 02
Date: 7th December 2014
Authors: Alastair Macnair (x13129325)

1 Report History

1.1 Document Location

This document is only valid on the day it was printed.

1.2 Revision History

Revision date  Author            Version  Summary of Changes  Changes marked
31/10/2014     Alastair Macnair  01       Initial Issue
7/12/2014      Alastair Macnair  02       Progress Revision

1.3 Approvals

This document requires the following approvals:

Name               Title               Date of Issue  Version
Ioana Ghergulescu  Project Supervisor  31/10/2014     01
Ioana Ghergulescu  Project Supervisor  31/10/2014     01

1.4 Distribution

This document has additionally been distributed to:

Name  Title  Date of Issue  Status

Table of Contents

1 Report History
  1.1 Document Location
  1.2 Revision History
  1.3 Approvals
  1.4 Distribution
Highlight Report from 1st November 2014 to 7th December 2014
Highlight Report Purpose
Summary of Project Progress
  4.1 Project Plan Table and summary status
Key Milestones Achieved in this period
Problems encountered in this period
Highlighting Concerns (RAID Log)
Variance from Plan
Contingency Planning
10 Planned Work for Next Period (to 20-12-2014)
11 Appendix A Revised Project Plan Gantt Chart
  11.1 Detailed Project Plan for upcoming period
12 Appendix B Risks, Assumptions, Issues and Dependencies (RAID) Logs
13 Appendix C - Contingency Plans
  13.1 File, Electronic and Hardcopy protection, Backup and recovery Plan
  13.2 Critical Circumstance & Threat Contingency Plan


Highlight Report from 1st November 2014 to 7th December 2014

Highlight Report Purpose

A Highlight report provides the Project Board and Client/Customer with a summary of the status of a
project at agreed stages and is used to monitor progress. The Project manager/Data Analyst uses the
Highlight report to advise the Project Board/Client of any potential problems or areas where they
could help.

Summary of Project Progress

Stadiums and weather station (observation data) fully paired and calculated
Complete data set compiled and checked and 100% ready
R scripts written for Feature Engineering enhancements to Data set
Feature engineering elements added to data set. (Goal outcome measures, seasons etc.)
R scripts written for descriptive statistics and graphing
Report Section 01 Introduction, completed
Report Section 02 Literature Review, 20% Complete
Report Section 03 Data Sets, 25% complete
Report Section 04 Analysis, 5% complete
Research ongoing in statistical analysis and sports performance
Research ongoing in Data Mining techniques and predictive modelling

This period has seen progress in a number of areas. The Gantt chart has been split to provide an
overview and a separate, more detailed plan for the next work period to help better understand the
various sub-tasks required. Weather forecast data will be gathered prior to each match from
7th December until 20th December.

4.1 Project Plan Table and summary status

Start Date  Duration  End Date   Status         Task

PROJECT PROPOSAL
28-Sep-14   -         28-Sep-14  Completed      Project Proposal Submitted

REQUIREMENTS SPECIFICATION
01-Oct-14   12        12-Oct-14  Completed      Requirements Specification
12-Oct-14   1         12-Oct-14  Completed      Requirements Specification Submitted

MANAGEMENT REPORT
31-Oct-14   1         31-Oct-14  Completed      Management Report 01
07-Dec-14   1         07-Dec-14  Completed      Management Report 02
20-Dec-14   1         20-Dec-14  To be started  Management Report 03

KEY RESEARCH AREAS
01-Nov-14   12        12-Nov-14  Completed      Germany's Weather
15-Sep-14   60        13-Nov-14  Completed      Statistical Tools
08-Dec-14   7         14-Dec-14  In Progress    Sports Performance Factors
08-Dec-14   7         14-Dec-14  In Progress    Stadium Design
15-Sep-14   55        08-Nov-14  Completed      Technology and Tools

DATA EXTRACTION
29-Sep-14   1         29-Sep-14  Completed      Download Weather Files
29-Sep-14   1         29-Sep-14  Completed      Download Football Files

FILTER WEATHER FILES
14-Oct-14   20        02-Nov-14  Completed      Collate Bundesliga Stadium list
02-Nov-14   1         02-Nov-14  Completed      Extract required weather stations

DATA TRANSFORMATION
01-Nov-14   5         05-Nov-14  Completed      Write R script to clean and transform weather files
30-Oct-14   2         31-Oct-14  Completed      Write R script to clean and transform football files
30-Oct-14   2         31-Oct-14  Completed      Bind and clean all football files
05-Nov-14   1         05-Nov-14  Completed      JOIN all weather parameters for each station

DATABASE MANAGEMENT
01-Nov-14   2         02-Nov-14  Omitted        Design Relational Database
06-Nov-14   1         06-Nov-14  Omitted        Insert Data into SQL database

DATA ANALYSIS
01-Dec-14   20        20-Dec-14  In Progress    Analyse the database
01-Dec-14   20        20-Dec-14  In Progress    Graphing and Visualisation

PREDICTION
10-Dec-14   12        21-Dec-14  To be started  Predictive Modelling

REPORT WRITING
31-Oct-14   4         03-Nov-14  Completed      Introduction
08-Dec-14   8         15-Dec-14  In Progress    Literature Review
08-Dec-14   5         12-Dec-14  In Progress    Data Set Description
12-Dec-14   8         19-Dec-14  In Progress    Analysis and Evaluation
17-Dec-14   4         20-Dec-14  To be started  Conclusion
28-Dec-14   2         29-Dec-14  To be started  Checking & Review
02-Jan-15   2         03-Jan-15  To be started  Print 3 Copies and Bind
06-Jan-15   1         06-Jan-15  To be started  Submit Dissertation

PROJECT PRESENTATION
12-Jan-15   7         18-Jan-15  To be started  Prepare & Practice Presentation
19-Jan-15   5         23-Jan-15  To be started  Make Presentation
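
The start dates, durations and end dates in the plan table appear to follow an inclusive-day convention (end date = start date + duration - 1 day). The sketch below illustrates that arithmetic; it is written in Python purely for illustration (the project's own scripts were in R), and the function names `parse_plan_date` and `end_date` are hypothetical helpers, not part of the project.

```python
from datetime import date, timedelta

# Month abbreviations as they appear in the plan table (e.g. "15-Sep-14").
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def parse_plan_date(text):
    """Parse a plan-table date such as '15-Sep-14' (two-digit year assumed to be 20xx)."""
    day, month, year = text.split("-")
    return date(2000 + int(year), MONTHS[month], int(day))


def end_date(start, duration_days):
    """End date when the start day itself counts as day 1 of the task."""
    return start + timedelta(days=duration_days - 1)


# Rows taken from the plan table: (task, start, duration).
rows = [
    ("Statistical Tools", "15-Sep-14", 60),
    ("Collate Bundesliga Stadium list", "14-Oct-14", 20),
    ("Make Presentation", "19-Jan-15", 5),
]
for task, start, duration in rows:
    print(task, end_date(parse_plan_date(start), duration).strftime("%d/%m/%Y"))
```

Deriving the end date from the start date and duration, rather than typing it in, is one way a plan can update itself when a task is moved.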


Key Milestones Achieved in this period

All four data sets (Observations, Stations, Stadiums, and Matches) have been cleaned, corrected
and joined to provide a complete data set of all matches, stadiums and weather observations.
Problems encountered last period such as name changes were corrected. A small percentage of
missing observation data was overcome using Multiple Imputation.
Feature Engineering enhancements have been added to the data set.
Descriptive statistics and graphing have been undertaken on the data.
Project Report has seen some sections completed and most others started.
Project Management Plan revised and updated 7th December 2014 (See Appendix A)
Contingency Plans created.
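
The milestones above mention Multiple Imputation for the missing observations. The report does not say which imputation model or package was used, so the following is only a simplified Python sketch of the general idea (the project itself used R): create several completed copies of a variable, estimate the mean in each, and pool the results with Rubin's rules. The normal-draw imputation model and the names `impute_once` and `pooled_mean` are assumptions made for illustration; a full implementation would also draw the model parameters for each imputation.

```python
import random
import statistics


def impute_once(values, rng):
    """Fill each missing value (None) with a draw from a normal fitted to the observed values."""
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    return [v if v is not None else rng.gauss(mu, sigma) for v in values]


def pooled_mean(values, m=5, seed=0):
    """Estimate the mean from m imputed data sets and pool with Rubin's rules.

    Returns (pooled estimate, total variance of the estimate).
    """
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        completed = impute_once(values, rng)
        estimates.append(statistics.mean(completed))
        # Sampling variance of the mean within this completed data set.
        variances.append(statistics.variance(completed) / len(completed))
    within = statistics.mean(variances)
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return statistics.mean(estimates), total


# Hypothetical daily temperatures with two missing observations.
temperatures = [10.0, 12.0, None, 11.0, None, 13.0]
estimate, variance = pooled_mean(temperatures, m=10, seed=1)
print(round(estimate, 2), round(variance, 3))
```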


Problems encountered in this period

Although an R script works out the distances between stadia and weather stations, there is still a
manual element in adding or removing weather station observation files from the project folder.
This needed to be checked, as even one missing file was found to affect the final joins, creating a
list with missing values and making errors hard to detect.
Personal and Work/College factors continue to detrimentally impact on the project plan.
However, the majority of all other NCI commitments are now completed, which allows the
project to regain priority.
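
The first problem above could be reduced by checking for missing station files before any join is attempted. The sketch below is a Python illustration only (the project's script was written in R and is not reproduced here): it pairs a stadium with its nearest station using the haversine great-circle distance and lists any required station whose observation file is absent. The coordinates, station identifiers and the `{sid}.txt` file naming are hypothetical.

```python
import math
from pathlib import Path

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(stadium, stations):
    """Return the id of the station closest to the stadium (lat, lon) pair."""
    return min(stations, key=lambda sid: haversine_km(*stadium, *stations[sid]))


def missing_station_files(required_ids, folder, pattern="{sid}.txt"):
    """List required station ids that have no observation file in the folder."""
    folder = Path(folder)
    return sorted(
        sid for sid in required_ids
        if not (folder / pattern.format(sid=sid)).is_file()
    )


# Hypothetical stadium and station coordinates (latitude, longitude).
stations = {"A": (52.4, 13.5), "B": (48.1, 11.6)}
print(nearest_station((52.5, 13.4), stations))
```

Running `missing_station_files` on the project folder before joining would flag an absent file directly, rather than leaving it to surface later as missing values in the joined list.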

Highlighting Concerns (RAID Log)

The full RAID log is provided in Appendix B. Summary of the RAID log:

Risks
Time continues to be the biggest risk to meeting the required deadline. The previously unused
time over Christmas has been utilised and the project supervisor has confirmed that binding is not a
critical requirement. Work and research in data mining have developed knowledge in this area, and
creating contingency plans alongside existing data backup methods has limited risk in this area.
Assumptions
The project supervisor has confirmed that binding is not an essential requirement.
Issues
All issues are now resolved.
Dependencies
Predictions are dependent on collecting weather forecast data (not historical data). This data will need
to be recorded from a suitable forecast provider prior to every match. Without reliable
forecast data, real match data cannot be used.


Variance from Plan

As other NCI deadlines and commitments built up over November, project work was affected,
although it has continued, albeit at a slower than ideal pace. These other commitments are now
essentially completed, allowing more time for project work.
The updated plan now uses the time over Christmas that was previously omitted, to
ensure project deliverables can be achieved despite the delays in this period.
The MySQL database that was to be used has been omitted after review, as everything can be
achieved in RStudio, where calculations, tabulation and graphing will be quicker and easier.

Contingency Planning

Sections on contingency planning have been added for data loss and for external threats and
circumstances. See Appendix C.

10 Planned Work for Next Period (to 20-12-2014)


The next work period will still focus on key tasks such as the analysis and report writing.

Complete detailed analysis of data using descriptive statistics
Undertake detailed analysis of data using inferential statistics
Undertake detailed analysis of data using Data Mining techniques
Capture weather forecasts for remaining matches to be played in 2014.
Complete Report sections 2 & 3
Commence report section 4
Finalise research into sports performance analysis


11 Appendix A Revised Project Plan Gantt Chart


11.1 Detailed Project Plan for upcoming period


12 Appendix B Risks, Assumptions, Issues and Dependencies (RAID) Logs

RISKS

Date Raised: 28/09/2014
Risk Description: Available time until project deadline is limited
Severity: 9
Mitigation Plan: Ensure constant review of project plan and keep to project plan deadlines.
Status: Open

Date Raised: 15/09/2014
Risk Description: Project data is lost due to PC or USB key loss/damage
Severity: 15
Mitigation Plan: A dedicated Google Drive space has been set up in addition to storage on a PC and USB key,
creating three distinct storage places. Back up latest files at least once a week or after
significant work development.
Status: Closed
Date Closed: 15/10/2014

Date Raised: 20/09/2014
Risk Description: Ability to apply appropriate statistical and algorithm applications to data limited due to
time and knowledge shortfalls, impacting on project quality
Severity: 3
Mitigation Plan: Ensure ongoing research and adherence to project plan as well as timely undertaking of
literature review to ensure all knowledge areas are complete
Status: Closed
Date Closed: 01/12/2014

Date Raised: 28/09/2014
Risk Description: Period of Christmas is potentially much less useful than it appears, especially as to
working days and using external providers to print and bind prior to the deadline
Severity: 15
Mitigation Plan: Pull back project timeline to allow for printing before Christmas. Identify
businesses that provide this service and determine opening hours as early as possible
Status: Closed
Date Closed: 01/12/2014


ISSUES

Date Raised: 14/10/2014
Issue Description: Bundesliga total team range over 21 years significantly higher than anticipated, adding to
work and creating lots of 'one-off' teams over the period
Impact Description: Adds to time required to collate stadia list and identify stadia and potentially affects
the analysis where single teams and stadia are present over the
Impact: Medium
Priority: High
Mitigation Plan: Allow more time and adjust project plan accordingly
Status: Closed

Date Raised: 14/10/2014
Issue Description: Some teams have changed over the years and stadium location has changed
Impact Description: Over 21 years teams have changed in name and even location
Impact: Low
Priority: Low
Mitigation Plan: Number of teams upon analysis is only one or two. Make a reasonable decision to place and
re-name to ensure consistency across all 21 years
Status: Closed

Date Raised: 25/10/2014
Issue Description: Project plan has slipped from original
Impact Description: Original plan had allowed for much more work to be completed in this period, placing
additional stress into next work period.
Impact: Medium
Priority: Medium
Mitigation Plan: Revise plan as realistically as possible and better allow for external pressures
Status: Closed

Date Raised: 14/10/2014
Issue Description: ECA&D website undertook a major update and some data sets could not be downloaded
Impact Description: Unable to download primary data sets as more were added due to project proposal revision
Impact: High
Priority: High
Mitigation Plan: Check every day until data sets become available for download
Status: Closed

Date Raised: 25/10/2014
Issue Description: Existing Gantt Chart cannot be easily adjusted or updated to reflect change
Impact Description: Unable to plan and manage project and significant time to adjust chart moving forward.
Danger is that this is not done and ability to manage is compromised.
Impact: Medium
Priority: Medium
Mitigation Plan: Revise plan to use changeable date format and days that automatically updates chart
Status: Closed


ASSUMPTIONS

Date Raised: 30/10/2014
Assumption Description: Businesses that print and bind may be closed over the Christmas period
Reason for Assumption: Holidays and reduced demand may see these businesses close over an extended period
Action to Validate: Identify businesses required to bind report and clarify opening hours well before
Christmas period
Impact if Assumption Incorrect: More time in project plan currently not being utilised.
Status: Closed

DEPENDENCIES

Date Raised: 20/10/2014
Dependency Description: Analysis and database creation is dependent upon stadium list being compiled
Location: Internal
Deliverables: Bound files for weather station locations and stadium lists.
Delivery Date: 02/11/2014
Importance: High
Status: Closed

Date Raised: 01/12/2014
Dependency Description: Real match predictions need weather forecasts to be captured and recorded now for
later analysis
Location: External
Deliverables: Weather forecast details for all weather parameters to be taken on day of match prior to
being played
Delivery Date: 20/12/2014
Importance: High
Status: Open


13 Appendix C - Contingency Plans


13.1 File, Electronic and Hardcopy protection, Backup and recovery Plan
This contingency plan addresses how critical project data can be protected from events such
as PC hard drive failure, USB loss, power failure, Internet loss or any other incident that
threatens the project data, including all analysis and reports in electronic and
hard copy. This process should be enacted if any of these events occur.
Critical Electronic Project Data
All project data is backed up onto PC from the working USB key and also onto a dedicated
cloud storage file space. In the event that one of these is compromised or lost then an
alternative should be established immediately to ensure that multiple redundancy is
maintained at all times. A secondary hard drive or storage device should be obtained with the
most recent version of the project files copied onto it.
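
As an illustration of keeping several copies in step, the sketch below copies a working folder to each backup location and checks that every file arrived. It is a Python sketch under stated assumptions rather than the procedure actually used: the folder names are hypothetical, the cloud location is assumed to be a locally synced folder, and `back_up` and `is_complete_copy` are illustrative helper names.

```python
import shutil
from pathlib import Path


def back_up(source, destinations):
    """Copy the source folder into each destination folder and return the copies."""
    source = Path(source)
    copies = []
    for destination in destinations:
        target = Path(destination) / source.name
        # dirs_exist_ok lets a later run refresh an existing backup (Python 3.8+).
        shutil.copytree(source, target, dirs_exist_ok=True)
        copies.append(target)
    return copies


def is_complete_copy(source, copy):
    """True if every file under source exists in copy with the same size."""
    source, copy = Path(source), Path(copy)
    for original in source.rglob("*"):
        if original.is_file():
            mirrored = copy / original.relative_to(source)
            if not mirrored.is_file() or mirrored.stat().st_size != original.stat().st_size:
                return False
    return True
```

A typical call would be `back_up("project", ["pc_backup", "cloud_sync"])`, followed by `is_complete_copy` on each returned path, with all paths here being placeholders.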
Hard Copy File Data
Hard copies will be created towards the end of the project. They should be suitably protected
and bound to ensure they cannot be easily damaged and stored in a safe place until submitted
to the client or submitted early. 3 copies should be printed.
13.2 Critical Circumstance & Threat Contingency Plan
In the event of personal circumstances beyond anyone's control, such as illness or family
emergency, the following plan should be put in place.

1. Determine as best as possible the impact on the project plan.
2. If the time lost will be detrimental to the project to the extent where delivery is
compromised, immediately contact the Project Supervisor to advise in writing.
3. Ensure that personal circumstances forms are completed and the Project Supervisor is
kept informed as to the situation.

8.10.3 Management Progress Report 3

Management Progress Report 03

Highlight Report

The effects of weather on goal outcome for football matches played within the German Bundesliga

Release: Management Progress Report 03
Date: 30th December 2014
Authors: Alastair Macnair (x13129325)

1 Report History

1.1 Document Location

This document is only valid on the day it was printed.

1.2 Revision History

Revision date  Author            Version  Summary of Changes  Changes marked
31/10/2014     Alastair Macnair  01       Initial Issue
7/12/2014      Alastair Macnair  02       Progress Revision
30/12/2014     Alastair Macnair  03       Progress Revision

1.3 Approvals

This document requires the following approvals:

Name               Title               Date of Issue  Version
Ioana Ghergulescu  Project Supervisor  31/10/2014     01
Ioana Ghergulescu  Project Supervisor  30/12/2014     03

1.4 Distribution

This document has additionally been distributed to:

Name  Title  Date of Issue  Status

Table of Contents

1 Report History
  1.1 Document Location
  1.2 Revision History
  1.3 Approvals
  1.4 Distribution
Highlight Report from 7th December to 30th December 2014
Highlight Report Purpose
Summary of Project Progress
  4.1 Project Plan Table and summary status
Key Milestones Achieved in this period
Problems encountered in this period
Highlighting Concerns (RAID Log)
Variance from Plan
Contingency Planning
10 Planned Work for Next Period (to 06-01-2015)
11 Appendix A Revised Project Plan Gantt Chart
12 Appendix B Risks, Assumptions, Issues and Dependencies (RAID) Logs
13 Appendix C - Contingency Plans
  13.1 File, Electronic and Hardcopy protection, Backup and recovery Plan
  13.2 Critical Circumstance & Threat Contingency Plan


Highlight Report from 7th December to 30th December 2014.

Highlight Report Purpose

A Highlight report provides the Project Board and Client/Customer with a summary of the status of a
project at agreed stages and is used to monitor progress. The Project manager/Data Analyst uses the
Highlight report to advise the Project Board/Client of any potential problems or areas where they
could help.

Summary of Project Progress

Report Section 01 Completed
Report Section 02 Completed
Report Section 03 Completed
Report Section 04 90% complete
Report Section 05 65% complete
Report Section 06 50% complete
Data Analysis complete
Data Mining (predictive analysis) 50% complete

This period has seen the most work of the project, with many of the tasks completed. The Gantt chart has been
updated to reflect the few outstanding tasks left to complete the project. Gathering weather forecast
data has been omitted, as the number of real matches taking place (15) was considered too
small a sample for meaningful predictive modelling. Instead, the existing data set will be split
into training and test sets.
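
A minimal sketch of such a split is shown below, in Python for illustration only (the project used R). The 70/30 proportion, the fixed seed and the function name `train_test_split` are assumptions; the report does not state the proportions actually used.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows reproducibly and return (training rows, test rows)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical match identifiers standing in for rows of the data set.
matches = list(range(100))
train, test = train_test_split(matches)
print(len(train), len(test))
```

Fixing the seed keeps the split identical between runs, so model results can be compared against the same held-out matches.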

4.1 Project Plan Table and summary status

Start Date  Duration  End Date   Status         Task

PROJECT PROPOSAL
28-Sep-14   -         28-Sep-14  Completed      Project Proposal Submitted

REQUIREMENTS SPECIFICATION
01-Oct-14   12        12-Oct-14  Completed      Requirements Specification
12-Oct-14   1         12-Oct-14  Completed      Requirements Specification Submitted

MANAGEMENT REPORT
31-Oct-14   1         31-Oct-14  Completed      Management Report 01
07-Dec-14   1         07-Dec-14  Completed      Management Report 02
20-Dec-14   1         20-Dec-14  Completed      Management Report 03

KEY RESEARCH AREAS
01-Nov-14   12        12-Nov-14  Completed      Germany's Weather
15-Sep-14   60        13-Nov-14  Completed      Statistical Tools
08-Dec-14   7         14-Dec-14  Completed      Sports Performance Factors
08-Dec-14   7         14-Dec-14  Completed      Stadium Design
15-Sep-14   55        08-Nov-14  Completed      Technology and Tools

DATA EXTRACTION
29-Sep-14   1         29-Sep-14  Completed      Download Weather Files
29-Sep-14   1         29-Sep-14  Completed      Download Football Files

FILTER WEATHER FILES
14-Oct-14   20        02-Nov-14  Completed      Collate Bundesliga Stadium list
02-Nov-14   1         02-Nov-14  Completed      Extract required weather stations

DATA TRANSFORMATION
01-Nov-14   5         05-Nov-14  Completed      Write R script to clean and transform weather files
30-Oct-14   2         31-Oct-14  Completed      Write R script to clean and transform football files
30-Oct-14   2         31-Oct-14  Completed      Bind and clean all football files
05-Nov-14   1         05-Nov-14  Completed      JOIN all weather parameters for each station

DATABASE MANAGEMENT
01-Nov-14   2         02-Nov-14  Omitted        Design Relational Database
06-Nov-14   1         06-Nov-14  Omitted        Insert Data into SQL database

DATA ANALYSIS
01-Dec-14   20        20-Dec-14  Completed      Analyse the database
01-Dec-14   20        20-Dec-14  Completed      Graphing and Visualisation

PREDICTION
01-Jan-15   -         03-Jan-15  In Progress    Predictive Modelling

REPORT WRITING
31-Oct-14   4         03-Nov-14  Completed      Introduction
08-Dec-14   8         15-Dec-14  Completed      Literature Review
08-Dec-14   5         12-Dec-14  Completed      Data Set Description
12-Dec-14   8         19-Dec-14  Completed      Analysis and Evaluation
28-Dec-14   8         04-Jan-15  In Progress    Conclusion
28-Dec-14   8         04-Jan-15  In Progress    Checking & Review
05-Jan-15   1         05-Jan-15  To be started  Print 3 Copies and Bind
06-Jan-15   1         06-Jan-15  To be started  Submit Dissertation

PROJECT PRESENTATION
12-Jan-15   7         18-Jan-15  To be started  Prepare & Practice Presentation
19-Jan-15   5         23-Jan-15  To be started  Make Presentation


Key Milestones Achieved in this period

All analysis and graphing has been completed and added to the report.
All primary sections of the report have been drafted and most have been finished with just the
conclusion yet to complete.
Project Management Plan (03) revised and updated 30th December 2014 (See Appendix A)


Problems encountered in this period

With four dependent variables (goal outcome), three subsets of the data (All data, B1 and B2) and
8+ independent variables, there is a massive amount of graphing and analysis that potentially
needs to be undertaken: around 96 distinct cases (4 x 3 x 8) need analysis. Deciding how to approach
this and which cases to omit has been the greatest challenge. It would have been better
to focus on just one or two dependent variables.
Personal and Work/College factors continue to detrimentally impact on the project plan.
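
The size of the analysis space described above can be made concrete by enumerating the combinations. The Python sketch below is illustrative only (the project used R); the subset labels come from the text, while the outcome and weather variable names are hypothetical placeholders because the report does not list them here.

```python
from itertools import product

# Placeholder names: the four goal outcome measures are not named in this section.
GOAL_OUTCOMES = ["outcome_1", "outcome_2", "outcome_3", "outcome_4"]
# Subsets as described in the text.
SUBSETS = ["All data", "B1", "B2"]
# Placeholder names for eight independent (weather) variables.
WEATHER_VARIABLES = [f"weather_var_{i}" for i in range(1, 9)]


def analysis_cases(outcomes, subsets, predictors):
    """Every (outcome, subset, predictor) combination that could be analysed."""
    return list(product(outcomes, subsets, predictors))


all_cases = analysis_cases(GOAL_OUTCOMES, SUBSETS, WEATHER_VARIABLES)
print(len(all_cases))  # 4 * 3 * 8 = 96

# Restricting the analysis to two outcome measures halves the workload.
reduced = analysis_cases(GOAL_OUTCOMES[:2], SUBSETS, WEATHER_VARIABLES)
print(len(reduced))  # 2 * 3 * 8 = 48
```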

Highlighting Concerns (RAID Log)

The full RAID log is provided in Appendix B. Summary of the RAID log:

Risks
As the project is essentially finished, the time risk is now reduced and the topic closed.
Assumptions
There are no ongoing assumptions
Issues
All issues now resolved
Dependencies
Collecting real match data has been omitted as the number of matches is too low to be viable for
analysis. The existing data set will be used instead.

Variance from Plan

Overall the project plan has been adhered to. There has been some slippage over the
Christmas period, but not detrimentally so.
The updated plan now reflects the outstanding tasks that need to be completed over the next 4-6 days
to finish the project.

Contingency Planning

Sections on contingency planning continue to be monitored and back-ups are in progress. See
Appendix C.


10 Planned Work for Next Period (to 06-01-2015)


The next work period will see the completion of the project and the finalising of unfinished elements.

Complete Data mining and predictive modelling
Complete Conclusion
Print and submit finished report


11 Appendix A Revised Project Plan Gantt Chart

12 Appendix B Risks, Assumptions, Issues and Dependencies (RAID) Logs

RISKS

| ID | Date Raised | Risk Description | Likelihood | Impact | Severity | Mitigation Plan | Owner | Status | Date Closed |
|---|---|---|---|---|---|---|---|---|---|
| | 28/09/2014 | Available time until project deadline is limited | | | 3 | Ensure constant review of project plan and keep to project plan deadlines. | | Closed | |
| | 15/09/2014 | Project data is lost due to PC or USB key loss/damage | | | 15 | A dedicated Google Drive space has been set up in addition to storage on a PC and USB key, creating three distinct storage places. Back up latest files at least once a week or after significant work development. | | Closed | 15/10/2014 |
| | 20/09/2014 | Ability to apply appropriate statistical and algorithm applications to data limited due to time and knowledge shortfalls, impacting on project quality | | | 3 | Ensure ongoing research and adherence to project plan as well as timely undertaking of literature review to ensure all knowledge areas are complete. | | Closed | 01/12/2014 |
| | 28/09/2014 | Period of Christmas is potentially much less useful than it appears, especially as to working days and using external providers to print and bind prior to the deadline | | | 15 | Pull back project timeline to allow for printing before Christmas. Identify businesses that provide this service and determine opening hours as early as possible. | | Closed | 01/12/2014 |

ISSUES

| ID | Date Raised | Issue Description | Impact Description | Impact | Priority | Mitigation Plan | Owner | Status |
|---|---|---|---|---|---|---|---|---|
| | 14/10/2014 | Bundesliga total team range over 21 years significantly higher than anticipated, adding to work and creating lots of 'one-off' teams over the period | Adds to time required to collate stadia list and identify stadia and potentially affects the analysis where single teams and stadia are present over the | Medium | High | Allow more time and adjust project plan accordingly | | Closed |
| | 14/10/2014 | Over 21 years teams have changed in name and even location | Some teams have changed over the years and stadium location has changed | Low | Low | Number of teams upon analysis is only one or two. Make a reasonable decision to place and re-name to ensure consistency across all 21 years | | Closed |
| | 25/10/2014 | Project plan has slipped from original | Original plan had allowed for much more work to be completed in this period, placing additional stress into next work period. | Medium | Medium | Revise plan as realistically as possible and better allow for external pressures | | Closed |
| | 14/10/2014 | ECA&D website undertook a major update and some data sets could not be downloaded | Unable to download primary data sets as more were added due to project proposal revision | High | High | Check every day until data sets become available for download | | Closed |
| | 25/10/2014 | Existing Gantt Chart cannot be easily adjusted or updated to reflect change | Unable to plan and manage project and significant time to adjust chart moving forward. Danger is that this is not done and ability to manage is compromised. | Medium | Medium | Revise plan to use changeable date format and days that automatically update the chart | | Closed |

ASSUMPTIONS

| ID | Date Raised | Assumption Description | Reason for Assumption | Action to Validate | Impact if Assumption Incorrect | Status |
|---|---|---|---|---|---|---|
| | 30/10/2014 | Businesses that print and bind may be closed over the Christmas period | Holidays and reduced demand may see these businesses close over an extended period | Identify businesses required to bind report and clarify opening hours well before Christmas period | More time in project plan currently not being utilised. | Closed |

DEPENDENCIES

| ID | Date Raised | Dependency Description | Location | Deliverables | Delivery Date | Importance | Status |
|---|---|---|---|---|---|---|---|
| | 20/10/2014 | Analysis and database creation is dependent upon stadium list being compiled | Internal | Bound files for weather station locations and stadium lists. | 02/11/2014 | High | Closed |
| | 01/12/2014 | Real match predictions need weather forecasts to be captured and recorded now for later analysis | External | Weather forecast details for all weather parameters to be taken on day of match prior to being played | 20/12/2014 | Low | Closed |

13 Appendix C - Contingency Plans


13.1 File, Electronic and Hardcopy protection, Backup and recovery Plan
This contingency plan addresses how critical project data can be protected from events such
as PC hard drive failure, USB loss, power failure, Internet loss or any other incident that
threatens the project data, including all analysis and reports, electronic and hard copy.
This process should be enacted if any of these events occurs.
Critical Electronic Project Data
All project data is backed up onto PC from the working USB key and also onto a dedicated
cloud storage file space. In the event that one of these is compromised or lost then an
alternative should be established immediately to ensure that multiple redundancy is
maintained at all times. A secondary hard drive or storage device should be obtained with the
most recent version of the project files copied onto it.
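
As an illustration only, the backup routine described above could be automated along the following lines. This is a minimal sketch, not the procedure actually used in the project: the function name `back_up_project` and the idea of timestamped copies are assumptions, and a synced cloud folder is assumed to be reachable as an ordinary local directory.

```python
import shutil
from datetime import datetime
from pathlib import Path


def back_up_project(source: Path, destinations: list[Path]) -> list[Path]:
    """Copy the project folder into a timestamped folder at each destination.

    Destinations that are missing (for example an unplugged USB key) are
    skipped, so one unavailable device does not block the other copies.
    """
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    completed = []
    for destination in destinations:
        if not destination.is_dir():
            print(f"Skipping unavailable destination: {destination}")
            continue
        target = destination / f"{source.name}-{stamp}"
        try:
            shutil.copytree(source, target)
        except OSError as error:
            print(f"Backup to {destination} failed: {error}")
            continue
        completed.append(target)
    if len(completed) < 2:
        # The plan calls for multiple redundancy at all times.
        print("Warning: fewer than two backup copies were made; restore redundancy.")
    return completed
```

The function would be called with the working folder and the list of backup locations (PC folder, cloud-synced folder, secondary drive); the returned list shows which copies succeeded.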
Hard Copy File Data
Hard copies will be created towards the end of the project. They should be suitably protected
and bound to ensure they cannot be easily damaged, and stored in a safe place until submitted
to the client or submitted early. Three copies should be printed.
13.2 Critical Circumstance & Threat Contingency Plan
In the event of personal circumstances beyond anyone's control, such as illness or family
emergency, the following plan should be put in place.

1. Determine as best as possible the impact on the project plan.
2. If the time lost will be detrimental to the project to the extent that delivery is
compromised, immediately contact the Project Supervisor to advise in writing.
3. Ensure that personal circumstances forms are completed and the Project Supervisor is
kept informed of the situation.

8.11 Other Material Used


A CD containing all the code used in this study is attached to the front cover of this
document. Open the README.txt file for information.
