#Schools in the Big 10 = 14
#Newspapers in the US = 3767
#Newspapers in Big 10 Cities = 130

TABLE III: The number of retrieved documents
TABLE IV: The number of games played
type of a filter has the advantage over the conventional mentioned-sport filter, as it is able to retrieve documents which do not explicitly mention a sport by name. And unlike the keyword-based filter, this method is unsupervised and does not require domain knowledge. This unsupervised method can not only retrieve the relevant documents but can also classify them based on the gender of the mentioned players. We acknowledge that this retrieval method is somewhat limited, as it will not be able to retrieve documents which do not mention a player by name. As the current scope of our project deals with gendered documents, documents which do not explicitly make use of gendered language are not considered relevant to our experiments (see Table III).

The set of keywords used in this method was generated by using a spider to retrieve the list of Big Ten players (see Table II) from the NCAA statistics website [18] for the 2017 season (see Section V). This method can be repeated to get the list of players for any season. Using the filters created from this player list, we were able to retrieve 4,296 documents out of the original set of 332,750 (see Fig. 1).

B. Statistically Over-represented Words

Since the documents for each sport type can only have two classes, male or female, a TFIDF-based approach would not be able to identify the statistically over-represented words. Instead, like [17], we use log-odds ratios with informative Dirichlet priors [15] to identify words that are over-represented in our datasets. Given a background corpus α and a sport s, we calculate the log-odds ratio $\delta_{w_s}$ between the two classes, male (m) and female (f). The log-odds for a word in any sport corpus can be calculated as follows:

$$\delta^{(m-f)}_{w_s} = \log\frac{y^{m}_{w_s} + \alpha_w}{n^{m} + \alpha_0 - (y^{m}_{w_s} + \alpha_w)} - \log\frac{y^{f}_{w_s} + \alpha_w}{n^{f} + \alpha_0 - (y^{f}_{w_s} + \alpha_w)} \qquad (1)$$

Here $n^{m}$ is the size of the male corpus, $n^{f}$ is the size of the female corpus, $y^{m}_{w_s}$ is the count of word w in corpus m for a sport s, and similarly $y^{f}_{w_s}$ is the count of word w in corpus f for a sport s. $\alpha_0$ is the size of the background corpus and $\alpha_w$ is the frequency of word w in the background corpus. For our background corpus, we wanted to use a corpus which is not derived solely from news articles, as this might add an unwanted bias, and we opted for the Brown Corpus [8]. An unwanted artifact of our retrieval method is the over-representation of player names in our document sets, so in order to minimize this, we preprocessed our datasets to remove the detected first name, last name filter combinations.

IV. RESULTS

A. Coverage

Out of the 4,296 documents that were retrieved, 2,427 were labeled as male and 1,869 were labeled as female. One could argue that this difference in coverage might be a product of the fact that the men's teams have played more games in the 2017 season than their female counterparts. In order to investigate this claim, we aggregated the total number of games that each university played (see Table IV). In total 4,003 games were played, and out of these 2,243 were played by women's teams and 1,760 by men's teams.

Basketball was the outlier in terms of the number of articles published (see Fig. 2). Both the men's basketball teams and the women's basketball teams, compared to all the other 10 sports, received a disproportionately large amount of coverage for the number of games that they played.

We calculated the correlation coefficient between the men's sports and the coverage they receive to be 0.7555 and between the women's sports and the coverage they receive to be 0.9752 (excluding basketball in both cases). From this we can see that, separately, the coverage both men's and women's sports received is strongly correlated with the number of matches they play. However, when gender is ignored and both datasets are combined, the overall correlation coefficient drops to 0.4099 (excluding basketball).

If the gender of the players was not a significant feature, then the correlation of the combined dataset would have been similar to the correlation of the separate datasets. The drop in correlation can be interpreted as
being indicative of the fact that the difference between the numbers of articles published is gender dependent.

Fig. 2: Relationship between number of games played and number of articles published

This gender bias is further illustrated in Figure 3, which shows the proportion of articles that were published for each gender for each sport and the proportion of games played by each gender for each sport. From Tables III and IV, it can be seen that even though men and women played a comparable number of baseball/softball matches (49.0% and 51.0%, respectively), 79.0% of the articles published were about baseball and just 21.0% were about softball. This disparity is even more extreme in the case of lacrosse, in which even though men played just 42.2% of the games, 80.8% of the articles were published about the men's team. The only exceptions to this trend are basketball and volleyball. In the case of basketball, 51.1% of the games were played by the men's team and 47.6% of the articles were about the men's team.

For men, lacrosse had the greatest difference (38.6%) between the proportion of games played and the number of articles published. In the case of women, the greatest difference was for volleyball. Women played 87.6% of the volleyball games and were covered in 93.8% of the articles, a difference of 6.2%. Overall it can be observed that women are underrepresented in 4 out of the 6 sports under consideration.

We represent the relationship between the number of articles and games played, for both genders, using the linear regression models shown in equations 2 and 3.

$$A_m = 0.5489\,P_m + 127.49 \qquad (2)$$

$$A_f = 0.1272\,P_f + 89.99 \qquad (3)$$

In the above equations, $A_m$ and $A_f$ represent the number of articles published about men's and women's teams respectively, and $P_m$ and $P_f$ represent the number of games played by men's and women's teams respectively.

It can be seen from the above models that, on average, for every one game that a men's team plays about 0.5489 articles are published, but for every one game that a women's team plays, only 0.1272 articles are published.

B. Content

In order to analyze the linguistic difference in the contents of the documents, for each sport pair, we ranked all the words according to their log-odds scores (equation 1). In order to avoid the bias created by the over-representation of our search filters, we removed all the filter terms. We retrieved the top 10 and the bottom 10 ranks. The top 10 ranks represented words that were statistically over-represented for the male datasets, and the bottom 10 ranks represented words
that were statistically over-represented for the female datasets. Table V shows the 10 most over-represented words for each gendered sport.

Fig. 3: Gender and Proportionality. (a) Proportion of articles published for each gender. (b) Proportion of games played by each gender.

Using the consensus of a grad student and a domain expert, we were able to classify the retrieved words into three distinct types.

Type I: Technical Words: a word that is associated with the sport. It can include technical terms like ‘netminders’, location names like ‘loyola’, or events and trophies named after past players like ‘Naismith’ and ‘tenpac12’.

Type II: Player Names: Since our dataset does not contain the keywords we used to filter the documents, this type of word includes instances of just a player’s first name like ‘cara’ or last name like ‘seider’, names of players on opposing teams, or names of players who professionally play the sport like ‘olofsson’.

Type III: Unrelated Words: This type includes words not directly related to the sport, like ‘barack’.

Using this classification we can observe that the top 9 over-represented words for baseball were all Type II, whereas for softball all the words were Type I. It can also be seen that for all sports, the male sports had an over-representation of Type II words.

One explanation of this over-representation of Type II words can be that articles which cover women’s sports are more focused and primarily discuss one individual player (the filter word), while for men’s sports the articles are more diverse and do not just talk about one player but also discuss the opposing teams. This would explain the occurrences of names like ‘Jaylon’, who played basketball at the University of Evansville, which is not in the Big Ten. The occurrences of professional players like ‘Olofsson’ can be interpreted as indicative of the fact that players on men’s teams are more likely to be compared to professional players and mentioned in the same article as them.

From the sample of top 10 words, it can be observed that out of the 32 names present in the male corpus, only one, ‘kevin’, was a first name. For women, 3 out of the 13 names were first names. This result seems to be consistent with the findings of [14], who in their study of media coverage of cross-country skiing reported that commentators used men’s first names 5.1% of the time, whereas women’s first names were used 21.5% of the time. Last names, in contrast, were used 29.1% of the time for men and 18.2% of the time for women.

The disproportionate use of first names for women can be understood using the semiotics framework proposed by [6]. In this context, the use of first names can symbolize informality. [11] argues that members of dominant groups are more often referred to formally, using their last names, while the subordinate members are referred to more informally, thus creating a power differential. Language can be used as a tool to reinforce social power. In this context the disparity in the media’s use of language underscores sports as a masculine activity and projects women’s participation as anomic, as sports are incongruent with the female stereotype.

V. DISCUSSION AND LIMITATIONS

As discussed in the previous section, more Type I words are used in relation to women than in relation to men. The use of language can be considered to be ideologically mediated [2], and one could argue that the overuse of Type I words stems from the newspapers’ perception of their audience. The newspapers might believe that the readership of the articles about female athletes is not well versed in the nuances of the sport and that the use of technical terms is to help
this audience better understand the sport. Conversely, in the case of men's sports, the use of more Type II words might signify that the newspapers believe that sports are an integral part of the habitus [4] of the readers and as such they do not need to explain the technicalities to their audience. If sports are perceived as a masculine activity, then women athletes are considered outsiders [...] can be used as a tool to signal and reinforce the power of male athletes in the sporting world.

Using an unsupervised approach, we were able to [...] out to explore. [...] basic concepts. [...] professional.

At this point we would like to acknowledge that

[Table V, the most over-represented words for each gendered sport; row and column layout not recoverable. Labels: Volleyball, Ice Hockey, Soccer; Men, Women; Legend. Words in extraction order: rosenbaum, californias, followup, blaisdell, watkins, aanhpi, revzen, viejo, lorin, aclu, saintfrancis, jovanovski, boogaard, mcgrath, Reagan, wilkins, gillies, Mens, bray, Li, Goalscorers, Netminders, Hardfought, Secondbest, josephs, starters, Libero, boyle, Katy, Cara, szmatula, olofsson, calgary, topline, dumba, disney, lhes, ahl, Skillset, Scorers, Barack, Libero, loyola, Cara, Rick, hennepin, Threerun, freeland, renfroe, godley, stanek, torey, lhp, mindset, ashs, bannons, redskins, stinner, howen, sydow, brandi, kevin, andone, hurdler, boxing, 11seed, touted, perrys, 8of8, sportswright, skywalker, statencaa, jaylon, wentz, spokeswoman, Backandforth, allconference, Sevengame, mankato, Redhot, Mitt, browns, leidner, diggs, klans, staal]
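The word ranking behind Table V follows equation 1. The following is a minimal Python sketch of that computation, not the authors' code: the function names `log_odds_scores` and `rank_words`, the use of `collections.Counter` for word counts, the `prior_floor` value for words missing from the background corpus, and the toy counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_odds_scores(male_counts, female_counts, background_counts, prior_floor=0.01):
    """Score each word with the log-odds difference of equation 1.

    Positive scores lean male, negative scores lean female.
    prior_floor is an assumed stand-in prior for words that never occur
    in the background corpus; the paper does not say how these are handled.
    Assumes a non-empty background corpus with more than one word type,
    so that both denominators stay positive.
    """
    n_m = sum(male_counts.values())            # size of the male corpus
    n_f = sum(female_counts.values())          # size of the female corpus
    alpha_0 = sum(background_counts.values())  # size of the background corpus

    scores = {}
    for word in set(male_counts) | set(female_counts):
        alpha_w = max(background_counts.get(word, 0), prior_floor)
        y_m = male_counts.get(word, 0)
        y_f = female_counts.get(word, 0)
        odds_m = (y_m + alpha_w) / (n_m + alpha_0 - (y_m + alpha_w))
        odds_f = (y_f + alpha_w) / (n_f + alpha_0 - (y_f + alpha_w))
        scores[word] = math.log(odds_m) - math.log(odds_f)
    return scores


def rank_words(scores, k=10):
    """Return (k most male-leaning words, k most female-leaning words)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], ranked[::-1][:k]


# Toy counts, invented purely to exercise the functions.
male = Counter({"smith": 8, "goal": 2})
female = Counter({"smith": 1, "goal": 9})
background = Counter({"smith": 1, "goal": 1, "the": 98})

scores = log_odds_scores(male, female, background)
top_male, top_female = rank_words(scores, k=1)
print(top_male, top_female)  # ['smith'] ['goal']
```

The paper removes the filter terms before ranking; in this sketch that would amount to deleting those keys from `scores` before calling `rank_words`.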