
Online User Location Inference Exploiting Spatiotemporal Correlations in Social Streams

Yuto Yamaguchi (University of Tsukuba, Japan) <yuto_ymgc@kde.cs.tsukuba.ac.jp>
Toshiyuki Amagasa (University of Tsukuba, Japan) <amagasa@cs.tsukuba.ac.jp>
Hiroyuki Kitagawa (University of Tsukuba, Japan) <kitagawa@cs.tsukuba.ac.jp>
Yohei Ikawa (IBM Research - Tokyo, Japan) <yikawa@jp.ibm.com>

ABSTRACT

The location profiles of social media users are valuable for various applications, such as marketing and real-world analysis. As most users do not disclose their home locations, the problem of inferring home locations has been well studied in recent years. In fact, most existing methods perform batch inference using static (i.e., pre-stored) social media contents. However, social media contents are generated and delivered in real-time as social streams. In this situation, it is important to continuously update current inference results based on the newly arriving contents to improve the results over time. Moreover, it is effective for location inference to use the spatiotemporal correlation between contents and locations. The main idea of this paper is that we can infer the locations of users who simultaneously post about a local event (e.g., earthquakes). Hence, in this paper, we propose an online location inference method over social streams that exploits the spatiotemporal correlation, achieving 1) continuous updates with low computational and storage costs, and 2) better inference accuracy than that of existing methods. The experimental results using a Twitter dataset show that our method reduces the inference error to less than 68% of existing methods. The results also show that the proposed method can update inference results in constant time regardless of the amount of accumulated contents.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications - Data Mining; H.3.5 [Information Storage and Retrieval]: On-line Information Services - Web-based services

Keywords

online location inference; spatiotemporal; social streams

CIKM'14, November 3-7, 2014, Shanghai, China.
Copyright 2014 ACM 978-1-4503-2598-1/14/11. http://dx.doi.org/10.1145/2661829.2662039

[Figure 1: Geographical distributions of tweets containing the word "earthquake" (a) in ordinary times, and (b) after the earthquake at Hiroshima prefecture. Each circle represents the ratio of the number of tweets posted there. The distribution of the word occurrence changes over time.]

1. INTRODUCTION

The location profiles of social media users have become an important information source for analyzing social media. For example, real-world event detection [24], disaster analysis [26], and modeling outbreaks [19] require the home locations of individual users to identify the locations of the phenomena being analyzed. However, location profiles are not available for many users; it has been reported that many users do not disclose their home locations for various reasons (e.g., privacy) [9]. Hence, it is quite useful for many applications if we can infer the home locations of social media users.

For the last few years, the home location inference problem has been well studied. Most existing methods can be categorized as either a graph-based or a content-based approach. The graph-based approach [2, 11, 22] is based on the structure of social graphs where friends are connected. In contrast, the content-based approach [6, 8, 9] leverages user-generated contents in the form of texts. Social media contents are generated in real-time, which is called social streams. Such dynamism is one of the remarkable features of social media. However, we would say that most of the existing content-based methods do not fully utilize such dynamic characteristics, because they typically assume that contents are stored in advance and do not change.
Hence, we focus on the dynamic characteristics of contents, which poses the following two major research issues.

Continuous updates. Since social media contents are continuously generated and delivered at a high rate, it is desirable if we can keep updating the inference results. For example, suppose that a user has generated some contents about Massachusetts and the location inference result of this user is Massachusetts. If this user begins to post about the Red Sox, then the inference result can be updated to Boston. However, updating inference results based on social streams has two main challenges: computational and storage costs. Prior studies propose batch inference methods, which require the entire inference process to be repeated using all the old content as well as the newly arriving content when updating the results. Obviously, these methods suffer from unacceptable computational costs when dealing with frequently arriving social media contents. In addition, storage costs also become large because existing methods need to store all the old contents for future updates.

Spatiotemporal correlation. It is reported that social media users tend to simultaneously post when they experience a local event in the real world [24], which leads to a spatiotemporal correlation between contents and locations. Figure 1 shows the geographical distribution of occurrences of the word "earthquake" in Twitter (see Section 5 for dataset details). We can see that (a) the distribution does not correlate to any locations in ordinary times (i.e., the distribution is similar to the population distribution of Japan), while (b) the distribution appears to have a strong correlation to Hiroshima prefecture after an earthquake happened there, which indicates that the word "earthquake" is spatiotemporally correlated with Hiroshima prefecture. This correlation tells us that a user who uses the word "earthquake" right after the earthquake at Hiroshima is likely to live there. In fact, existing methods use only the static spatial correlation between contents and locations (e.g., place names and cuisines); in other words, existing methods do not exploit the spatiotemporal correlation.

In this paper we devise an online location inference method (OLIM), which can update inference results using only the newly arriving contents, without using previous contents. Concretely, our method is based on Bayesian inference over the Dirichlet-multinomial (Polya) distribution [18], which is a widely adopted distribution for modeling multivariate discrete random variables from the Bayesian point of view. Consequently, we can perform online updates in constant time regardless of the amount of accumulated contents, and the storage size stays constant. Moreover, we extend the definition of local words [6], which are the words that have a static correlation with a specific location. Based on our new definition of local words, we can exploit both the traditional static correlation and the spatiotemporal correlation being introduced, which results in higher accuracy than that of existing methods.

The following summarizes the contributions of this paper:

- We develop an online location inference method that continuously updates the inference results based solely on the newly arriving contents, achieving constant computational and storage cost regardless of the amount of accumulated contents.

- We introduce a new definition of local words that exploits the spatiotemporal correlation between contents and locations in addition to the traditional static correlation being used by existing methods.

- We perform intensive experiments using a Twitter dataset, which show that our proposed method outperforms existing methods in both accuracy and efficiency.

More concretely, the experimental results show that OLIM achieves a 28 km median error distance, which means that the error distance is reduced to less than 68% of existing methods. Our method is also theoretically and experimentally shown to be more efficient in computational time than existing content-based methods. Additionally, the results demonstrate that OLIM can identify and exploit novel temporally-local words that have spatiotemporal correlations to specific locations, in addition to traditional static local words.

The rest of this paper is organized as follows: Section 2 overviews studies on user location inference. Section 3 defines the terminology and states the problem being addressed in this paper. The proposed method is described in Section 4, and is experimentally evaluated in Section 5. Finally, Section 6 concludes the paper.

2. RELATED WORK

A lot of research has investigated the problem of user location inference in social media. This section briefly surveys the three types of existing location inference methods, namely, the content-based, graph-based, and hybrid approaches.

Content-based approaches. The content-based approach leverages user-generated contents. The idea of this approach is that users are likely to post contents about their residential areas more frequently than about other areas. For example, it is natural to assume that a user who lives in Boston is more likely to post about the Red Sox.

The content-based approach has its roots in Cheng et al. [6]. They introduced the concept of local words and used them to infer the home locations of Twitter users. Local words are those that have a strong correlation with a specific location. For example, it is reported that the word "rockets" is local because it is frequently posted by users living in Houston [6]. With this local word, Cheng et al. inferred that users who tend to use it live in Houston.

There have appeared various types of content-based methods other than the one by Cheng et al. Eisenstein et al. [8] and Hong et al. [10] proposed inference methods based on topic models. Yuan et al. [28] also proposed a topic model to infer W4 (Who, Where, When, and What) aspects of social media users. Kinsella et al. [12], Chandra et al. [3], and Chang et al. [4] extended Cheng et al.'s method [6] in several aspects. The inference method by Hecht et al. [9] adopted a Naive Bayes classifier [16]. Schulz et al. [25] reported that various indicators such as users' home pages and time zones are effective for inferring home locations. Pontes et al. [20] leveraged the check-in history, GPS tags, and other attributes to infer home locations.

This paper differs from these existing content-based methods in that 1) existing methods do not have a scheme for online updating, and 2) existing methods do not consider the spatiotemporal correlation.

Graph-based approaches. Graph-based approaches analyze social graphs, assuming that the home locations of connected users are likely to be close to each other.
The graph-based approach has its roots in Backstrom et al. [2]. Their method maximizes the likelihood of generating the Facebook social graph, assuming that short-distance edges have a larger likelihood than long-distance ones.

After Backstrom et al., a variety of graph-based methods have been proposed. The method by Clodoveu et al. [7] is based on a majority vote over friends' home locations. Abrol et al. [1] proposed TweetHood, which recursively infers friends' home locations. Jurgens [11] also developed a recursive inference method based on the label propagation method [5]. Rout et al. [22] take into account the populations of locations. McGee et al. [17] classified friends into informative ones that live close to the target user and others. Yamaguchi et al. [27] utilized users who attract a lot of local attention for location inference. Sadilek et al. [23] addressed the integrated problem of inferring Twitter users' trajectories and predicting links in the Twitter social graph. Our method differs from these studies in that it does not use social graphs.

Hybrid approaches. Hybrid approaches, which use both social media contents and social graphs, are proposed by Li et al. [15, 14]. They first proposed the unified discriminative influence model (UDI) [15], which models user relationships and venue names extracted from contents as a heterogeneous graph, assuming that each node (i.e., user or venue) has its own influence scope. Their method considers that nodes with a large influence scope (e.g., Lady Gaga) do not provide good clues for inferring location because numerous users in diverse locations follow them.

Li et al. proposed another model [14], which assigns multiple locations to a single user (e.g., home, work, and former home locations). These two hybrid approaches use only toponyms extracted by a gazetteer, meaning that local words other than toponyms are not considered.

3. DEFINITIONS

3.1 Problem Statement

User accounts and locations. Each user account u in U has its own home location l_u in L. The set of users is composed of two types of components, U = U^L + U^N, where U^L is the set of labeled users whose home locations are known, while U^N is the set of unlabeled users whose home locations are unavailable.

Social stream and time window. Each post p is defined via three elements: timestamp p_s, text p_t, and user p_u, and is denoted as p = (p_s, p_t, p_u). We assume that posts continuously arrive from a social stream SS in chronological order. For each word w, we define the time window T_w^(N). Time window T_w^(N) is the sliding window whose length N is a parameter specifying the number of occurrences of word w from labeled users, as shown in Figure 2. In this figure, N is set to 5, and the window slides toward the next word occurrence. Hereafter, for simplicity of description, we omit (N) from T_w^(N) if there is no ambiguity. Defining the window length by the number of word occurrences N has several advantages for location inference over social streams, as we discuss in later sections.

[Figure 2: Sliding time window. The window slides toward the next word occurrence.]
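To make the window mechanics concrete, the following is a minimal sketch of the per-word sliding window as a bounded queue, replaying the N = 5 example of Figure 2. This is our own illustration, not the authors' code; the function name observe and the single-letter location labels are ours.

from collections import deque, defaultdict

N = 5  # window length, counted in word occurrences (as in Figure 2)
window = defaultdict(deque)  # word -> locations of its last N occurrences

def observe(word, location):
    """Record one occurrence of `word` posted by a labeled user living at `location`."""
    q = window[word]
    q.append(location)
    if len(q) > N:       # window full: slide past the oldest occurrence
        q.popleft()

for loc in ["A", "A", "B", "A", "C", "B"]:
    observe("earthquake", loc)
print(window["earthquake"])  # deque(['A', 'B', 'A', 'C', 'B']); the first 'A' slid out

Note that the window advances per occurrence, not per unit of time, which is exactly the design choice the paper motivates in Section 4.1.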
Using the notations defined so far, our problem can be stated as follows:

Problem 1 (Online Location Inference). Each time a post p is observed from user u, infer the home location l_u^(k) of u based on p and the previous inference result l_u^(k-1), so that the distance between l_u^(k) and the true home location l_u^* is reduced compared to the previous result.

Note that since an unlabeled user's true home location l_u^* is unavailable, it is not guaranteed that the distance between the inferred home location and the true home location will decrease in the next inference.

3.2 Terminology

Population distribution. We define the population distribution R, where each probability value R(l) at location l in L denotes the ratio of labeled users whose home location is l. For example, in the U.S., the population distribution has high values at metropolises like New York and Chicago.

Word distribution. Each word w is assigned a word distribution Q_w, where each probability value Q_w(l) denotes the ratio of the number of occurrences of w at l in the current time window T_w. Precisely, an occurrence of word w at l means that a user who lives in l posts a post p that contains word w. For example, the word distribution of "Red Sox" has a high value at Boston because the word is likely to be posted at Boston frequently. Note that the word distribution Q_w changes over time as the corresponding time window T_w slides.

Local word. Intuitively, the word distributions of regular words tend to be similar to the population distribution because all users are equally likely to post such words. On the other hand, local words have a strong correlation to a specific location, so their word distributions are expected to differ from the population distribution. Furthermore, according to Figure 1, word distributions change over time, meaning that the set of local words varies at different times. Based on this observation, we define the local word as follows:

Definition 1 (Local word). A local word w is a word that satisfies the following condition:

KL(Q_w \| R) = \sum_{l \in L} Q_w(l) \log \frac{Q_w(l)}{R(l)} > d_{min},    (1)

where d_{min} is a predefined parameter.

Note that we determine whether or not word w is local whenever w is observed. We adopt KL divergence because it is
a standard measure for calculating the difference between two probability distributions. This definition of local words is a natural extension of the traditional static local word: if we use an unlimited time window length (i.e., N -> infinity), the above definition is equal to the traditional one, which does not consider temporal features.

Furthermore, by adopting KL divergence as the measure of locality, our algorithm can identify local words with multimodal word distributions (i.e., distributions with multiple peaks). For example, the word "rain" clearly has a multimodal distribution even when it is local. The definition of local words in [6], which uses the focus of the word distribution, cannot extract this kind of local word.
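As a concrete illustration of Definition 1, the locality test is a direct translation of Eqn. (1). The sketch below is ours, with toy numbers: it builds Q_w from a window's location counts and compares KL(Q_w || R) against d_min. Terms with Q_w(l) = 0 contribute nothing, and Q_w(l) > 0 implies R(l) > 0 because occurrences are counted only for labeled users living at l, so no smoothing is needed.

import math
from collections import Counter

def is_local(window_locs, R, d_min=1.0):
    """Definition 1: a word is local iff KL(Q_w || R) > d_min.
    window_locs: locations of the last N occurrences of the word.
    R: population distribution, R[l] = ratio of labeled users living at l."""
    N = len(window_locs)
    Q = {l: c / N for l, c in Counter(window_locs).items()}
    kl = sum(q * math.log(q / R[l]) for l, q in Q.items())
    return kl, kl > d_min

R = {"Tokyo": 0.6, "Tsukuba": 0.3, "Hiroshima": 0.1}  # toy population distribution
print(is_local(["Hiroshima"] * 8 + ["Tokyo"] * 2, R))  # bursty word: KL ~ 1.44 -> local
print(is_local(["Tokyo"] * 6 + ["Tsukuba"] * 3 + ["Hiroshima"], R))  # matches R: KL = 0 -> not local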
According to the above definition of local words, we obtain a sequence of KL divergence values for each word. Namely, the value of KL divergence is calculated each time the word occurs. Figure 3 shows an example of the sequence of KL divergence values of the word "earthquake", where the x-axis denotes the flow of time from left to right, and the y-axis denotes the value of KL divergence at the corresponding time point. We used the Twitter dataset described in Section 5 and the updating scheme described in Section 4.1. We observe that the word "earthquake" sometimes shows high values, and these bursts indeed correspond to real earthquakes in Japan, which indicates that a lot of users close to an earthquake generate contents about it when it happens. Further discussion appears in Section 5.3.

[Figure 3: Sequence of KL divergence of "earthquake". This word sometimes shows high values, which indicates that earthquakes happen at the corresponding time points.]
4. METHOD

This section describes our proposed method, called OLIM (Online Location Inference Method). OLIM can perform online inference with lower computational and storage costs than those of existing methods in the situation where posts continuously arrive from the social stream. Online inference maintains user distributions for all unlabeled users by continuously inferring and updating them based solely on the newly arriving contents. User u's home location can be inferred from the user distribution of u by choosing the location with the largest probability value.

Our algorithm treats the geographical space as a discrete set of locations L rather than as numerical coordinates of latitude and longitude. In other words, each unlabeled user will be assigned the most likely location label l in L. To construct L, we partition the geographical space by quad-tree partitioning, as used in [13], in the preprocessing step. Quad-tree partitioning is suitable for location inference because populated areas, which require fine-grained inference, are divided into small cells (i.e., locations), whereas depopulated areas are divided into large cells. Note that the number of locations affects the inference accuracy, which is experimentally examined in Section 5.
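A minimal sketch of this preprocessing step, under our reading of the quad-tree scheme of [13]: recursively split any cell holding more than a threshold of labeled users into four quadrants. The depth cap and the half-open boundary test are our simplifications, not details from the paper.

def quadtree(points, box, max_points=3000, depth=0, max_depth=20):
    """Partition box = (min_lon, min_lat, max_lon, max_lat) until no leaf
    holds more than max_points labeled users; the leaves form L."""
    if len(points) <= max_points or depth == max_depth:
        return [box]
    x0, y0, x1, y1 = box
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    cells = []
    for q in [(x0, y0, xm, ym), (xm, y0, x1, ym),
              (x0, ym, xm, y1), (xm, ym, x1, y1)]:
        inside = [(x, y) for x, y in points
                  if q[0] <= x < q[2] and q[1] <= y < q[3]]
        cells.extend(quadtree(inside, q, max_points, depth + 1, max_depth))
    return cells

Dense areas recurse into small cells while sparse areas stay coarse; Section 5.1 uses a split threshold of 3,000 labeled users per cell.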
Algorithm 1 shows the overall algorithm of OLIM, which receives the social stream SS, a set of users U = U^L + U^N, the time window length N, the threshold value d_min, and the set of locations L. OLIM first constructs the population distribution based on the home locations of labeled users in line 1. Concretely, the probability mass of the population distribution at location l is calculated as R(l) = |{u in U^L | l_u = l}| / |U^L|. Every time a post p arrives from the social stream SS, OLIM performs the steps from line 3 to line 8. If the user u of p is a labeled user, OLIM updates the values of KL divergence for each word in the text of p using u's home location l_u (Section 4.1). The maintained values of KL divergence are used to determine whether or not the corresponding word is local. On the other hand, if user u is an unlabeled user, OLIM updates the user distribution of u based on the local words in the text of p (Section 4.2). In the following subsections, we describe the details of each step.

Algorithm 1 OLIM
Input: SS, U, N, d_min, L
 1: R <- calcPopulation(U^L, L)
 2: for post p from SS do
 3:   u <- user(p)
 4:   if u in U^L then
 5:     updateKL(u, p, N, R)
 6:   else if u in U^N then
 7:     updateUD(u, p, d_min, L)
 8:   end if
 9: end for

4.1 Updating KL Divergence

By the definition of local words (Eqn. (1)), determining whether or not word w is local requires the KL divergence between the word distribution of w and the population distribution. Since the time complexity of calculating the KL divergence is O(|L|), recalculating it whenever word w is observed is computationally expensive. We therefore update the KL divergence by the following scheme, which requires O(1) time. Suppose that word w_i is observed at location l_k, and the oldest observation of w_i in the previous time window was at l_{k'}. OLIM updates the KL divergence of word w_i as:

KL(Q_{w_i} \| R) \leftarrow KL(Q_{w_i} \| R) + \Delta(i, k, k'),    (2)

\Delta(i, k, k') = \frac{1}{N} \left\{ \log \frac{R(l_{k'})}{R(l_k)} + n_{ik} \log \frac{n_{ik}+1}{n_{ik}} + n_{ik'} \log \frac{n_{ik'}-1}{n_{ik'}} + \log \frac{n_{ik}+1}{n_{ik'}-1} \right\},

where n_{ik} is the number of observations of w_i at l_k in the previous time window. The proof is omitted, as the above updating equation can be derived by a simple transformation. Note that the word distribution Q_{w_i} is calculated from the n_{ik} values as Q_{w_i}(l_k) = n_{ik}/N.
Here we discuss why we define the window length by the number of word occurrences N. The first reason is that the above updating scheme holds only if the window length is fixed to N: if the window length N varied over time, we would have to recalculate the value of KL divergence by definition, leading to a time complexity of O(|L|). The second reason is that the fixed window length guarantees that a sufficient number of word occurrences are in the window, which results in reliable values of KL divergence. Although one may think that a window that is too long in terms of time duration for rare words can degrade the inference accuracy, this is not the case. Suppose that word w is rare but local (e.g., a less-known location name); our algorithm can reliably extract such a local word given a sufficient number of word occurrences. In contrast, if the window length were fixed by time duration (e.g., 6 hours), non-local words with insufficient occurrences might happen to be extracted as local due to unreliable KL divergence values. In our preliminary study, we observed that thresholding to avoid overly long window lengths (i.e., discarding words with insufficient occurrences) results in low accuracy.

Algorithm 2 shows the procedure for updating KL divergence. Lines 2 through 12 update the KL divergence for each word contained in the text of the observed post p. window is a dictionary of queues, where each queue window[w] stores the locations where w was observed, chronologically. Note that the queues are initialized as empty.

When the length of the queue window[w_i] first reaches N, OLIM calculates the KL divergence based on the definition, which is stored as d_i (lines 4 through 6). On the other hand, for a word whose KL divergence is already calculated, OLIM updates the value by Eqn. (2).

Algorithm 2 updateKL()
Input: u, p, N, R
 1: for word w_i in p do
 2:   l_k <- getUserLocation(u)
 3:   window[w_i].enqueue(l_k)
 4:   if window[w_i].length == N then
 5:     n_ik <- n_ik + 1
 6:     d_i <- KL(Q_{w_i} || R)
 7:   else if window[w_i].length > N then
 8:     l_{k'} <- window[w_i].dequeue()
 9:     d_i <- d_i + Delta(i, k, k')
10:     n_ik <- n_ik + 1; n_ik' <- n_ik' - 1
11:   end if
12: end for
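The O(1) update is easy to sanity-check against a full recomputation. The sketch below (our code, natural-log KL) implements Delta(i, k, k') in the equivalent term-difference form, which is mathematically identical to the closed form of Eqn. (2) and also covers the corner cases n_ik = 0 and n_ik' = 1 that the closed form leaves implicit.

import math
from collections import Counter

R = {"A": 0.5, "B": 0.3, "C": 0.2}   # toy population distribution
win = ["A", "B", "A", "C", "A"]      # current window of word w_i, N = 5
N = len(win)

def term(c, l):
    """Contribution of location l with count c to KL(Q_w || R)."""
    return 0.0 if c == 0 else (c / N) * math.log((c / N) / R[l])

def kl_full(window):                 # O(|L|): the from-scratch computation
    return sum(term(c, l) for l, c in Counter(window).items())

def delta(n, k, k_old):
    """Eqn. (2) as a term difference: O(1), since only cells k and k_old change."""
    if k == k_old:
        return 0.0
    return (term(n[k] + 1, k) - term(n[k], k)
            + term(n[k_old] - 1, k_old) - term(n[k_old], k_old))

n = Counter(win)                     # n_ik per location: {'A': 3, 'B': 1, 'C': 1}
kl = kl_full(win)
kl += delta(n, "B", "A")             # next occurrence at 'B'; oldest ('A') slides out
n["B"] += 1; n["A"] -= 1
assert abs(kl - kl_full(win[1:] + ["B"])) < 1e-12  # matches full recomputation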
4.2 Updating User Distributions

Every time a post p is observed from an unlabeled user u, OLIM updates the user distribution of u based on the local words contained in the text of p. OLIM infers home locations by Bayesian inference based on the Dirichlet-multinomial (Polya) distribution [18] for the following reasons. First, it is natural to model user distributions as a multinomial distribution over the set of locations L, where the probability at location l is the probability that the user's home location is l. Since there may not be sufficient clues to infer user home locations (especially in the initial inference stage), it is also natural to adopt Bayesian inference with a Dirichlet distribution as the prior, which results in the Dirichlet-multinomial. Second, an online updating scheme for the probability distribution, which is our goal, can be achieved by Bayesian inference.

Let z_ij be the categorical random variable with |L| categories that denotes the location of the jth occurrence of local word w_i in the current time window. For i.i.d. categorical variables Z_i = (z_i1, z_i2, ..., z_iN) corresponding to the current time window of local word w_i, the marginal joint distribution is obtained by the Dirichlet-multinomial (Polya) distribution as:

P(Z_i | \alpha) = \int P(Z_i | \theta) P(\theta | \alpha) d\theta = \frac{\Gamma(A)}{\Gamma(N + A)} \prod_k \frac{\Gamma(n_{ik} + \alpha_k)}{\Gamma(\alpha_k)},    (3)

where \Gamma is the gamma function, n_{ik} is the number of z_ij's with the value k (this is the same value as n_{ik} in the previous section), and A = \sum_k \alpha_k. Note that the distribution P(Z_i | \theta) is the categorical (multinomial) distribution with parameter \theta, and P(\theta | \alpha) is the Dirichlet distribution with parameter \alpha.

Eqn. (3) denotes the probability of generating local word w_i at a specific time point from the geographical point of view. By integrating out the parameter \theta, we consider all possible values of the parameter when calculating the probability of generating w_i, instead of simply using the MLE or MAP estimate of the multinomial distribution.

Here, let z_u also be a categorical random variable with |L| categories that denotes the home location of user u. Supposing that unlabeled user u posts p containing a local word w_i, we can write the posterior predictive distribution of u's home location as:

P(z_u = k' | Z_i, \alpha) = \frac{P(z_u = k', Z_i | \alpha)}{P(Z_i | \alpha)} = \frac{\Gamma(N + A)}{\Gamma(N + 1 + A)} \cdot \frac{\Gamma(n_{ik'} + 1 + \alpha_{k'})}{\Gamma(n_{ik'} + \alpha_{k'})} = \frac{n_{ik'} + \alpha_{k'}}{N + A},

which means that u's home location can be inferred simply by counting the number of z_ij's with the value k and adding it to \alpha_k for all k. This predictive distribution is based on the assumption that a user is likely to live close to the local words the user used.

After the observation of w_i, suppose that u further posts p' containing another local word w_{i'}; we can rewrite the distribution as:

P(z_u = k' | Z_i, Z_{i'}, \alpha) = \frac{P(z_u = k', Z_i, Z_{i'} | \alpha)}{P(Z_i, Z_{i'} | \alpha)} = \frac{n_{ik'} + n_{i'k'} + \alpha_{k'}}{2N + A},

where we assume that Z_i and Z_{i'} are observed independently.
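The closed form can be verified numerically against the Gamma-function ratio implied by Eqn. (3); a small sketch (ours, using SciPy's log-Gamma, with toy counts and a toy prior):

import numpy as np
from scipy.special import gammaln

n_i = np.array([3.0, 2.0, 0.0])    # n_ik: window counts of local word w_i over |L| = 3 cells
alpha = np.array([0.5, 0.3, 0.2])  # Dirichlet prior; Section 5.1 sets alpha_k = R(l_k)
N, A = n_i.sum(), alpha.sum()

# Closed form: P(z_u = k | Z_i, alpha) = (n_ik + alpha_k) / (N + A)
pred = (n_i + alpha) / (N + A)

# The same quantity as a ratio of Dirichlet-multinomial likelihoods (Eqn. (3))
log_pred = (gammaln(N + A) - gammaln(N + 1 + A)
            + gammaln(n_i + 1 + alpha) - gammaln(n_i + alpha))
assert np.allclose(pred, np.exp(log_pred))
print(pred)  # [0.583 0.383 0.033], sums to 1; mass follows counts plus prior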

From the above discussion, we can infer the home location of user u after u posts local words w_1, ..., w_m as:

l_u = \arg\max_k P(z_u = k | Z_1, ..., Z_m, \alpha) = \arg\max_k \left( \sum_{i=1}^{m} n_{ik} + \alpha_k \right),    (4)

which indicates that all we have to do to infer l_u is maintain the value of \sum_{i=1}^{m} n_{ik} for every user u and location l_k.

Algorithm 3 shows the procedure for updating user distributions. B_uk maintains the sum of n_ik for each u in Eqn. (4). For each word w_i in the text of post p, the algorithm determines whether w_i is a local word or not by comparing its KL divergence value d_i with the threshold d_min. If w_i is a local word, the algorithm adds the value of n_ik to the maintained variable B_uk for all k. Before OLIM starts running, B_uk is initialized to 0 for all u and k. We can obtain the current inference result whenever we want by Eqn. (4).

Algorithm 3 updateUD()
Input: u, p, d_min, L
 1: for word w_i in p do
 2:   if d_i > d_min then
 3:     for location l_k in L do
 4:       B_uk <- B_uk + n_ik
 5:     end for
 6:   end if
 7: end for
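In code, Algorithm 3 and Eqn. (4) amount to one counter per (user, location) pair; a compact sketch under our own naming, where d holds the maintained KL values, n the per-word window counts, and alpha the Dirichlet hyperparameters:

from collections import defaultdict

B = defaultdict(lambda: defaultdict(float))   # B[u][k], implicitly 0 before OLIM runs

def update_ud(u, words, d, n, d_min):
    """Algorithm 3: for each local word in the post, add its window counts to B[u]."""
    for w in words:
        if d.get(w, 0.0) > d_min:             # w is currently a local word
            for k, c in n[w].items():         # n[w][k] = n_ik of the current window
                B[u][k] += c

def infer(u, alpha):
    """Eqn. (4): the inferred home location is argmax_k (sum_i n_ik + alpha_k)."""
    return max(alpha, key=lambda k: B[u].get(k, 0.0) + alpha[k])

d = {"earthquake": 1.4}                       # KL value above d_min = 1.0
n = {"earthquake": {"Hiroshima": 4, "Tokyo": 1}}
alpha = {"Tokyo": 0.6, "Tsukuba": 0.3, "Hiroshima": 0.1}
update_ud("u1", ["earthquake"], d, n, d_min=1.0)
print(infer("u1", alpha))                     # 'Hiroshima'

A user with no local-word observations falls back to argmax_k alpha_k, i.e., the prior, which is how OLIM attains full coverage in Section 5.2.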
4.3 Complexity Analysis

Table 1 shows the time complexities of our method and of major existing methods. The first three methods, including ours, are content-based methods; the next three are graph-based methods; and the last one is a hybrid method that uses both posts and the social graph. P is the set of accumulated posts, k is the number of neighbors in the social graph, and E is the set of edges in Li et al.'s heterogeneous graph [15]. For the content-based methods, including our method, the time complexities shown in the table are required for each update. We can see that the complexity of our method depends only on the number of locations |L|, indicating that each update takes constant time regardless of the amount of accumulated posts and the number of users. On the other hand, existing content-based methods must repeat the whole inference process using all the old contents, which requires O(|U||P||L|). These points are experimentally examined in Section 5.

Table 1: Time complexity.
Method          | Time complexity | posts | graph
OLIM (Proposed) | O(|L|)          | yes   | -
Cheng [6]       | O(|U||P||L|)    | yes   | -
Hecht [9]       | O(|U||P||L|)    | yes   | -
Backstrom [2]   | O(|U|k^2)       | -     | yes
Jurgens [11]    | O(|U|k^2)       | -     | yes
LMM [27]        | O(|U|k^2)       | -     | yes
UDI [15]        | O(|E|)          | yes   | yes

The complexities of the graph-based and hybrid approaches depend on the size of the graph. The graph-based methods all depend on the square of the number of neighbors, which results in huge computational cost if the graph has hubs (i.e., nodes with a large number of neighbors). Indeed, social graphs have hubs because their degree distributions follow a power law. The hybrid approach does not require high computational cost if the graph is sparse.

4.4 Discussion

We would note that the method proposed in this paper is purely based on word occurrences in user-generated contents. This simple scheme, as we will see later, achieves a significant improvement over existing inference methods, including graph-based and content-based ones. However, one may think that taking other information (e.g., check-in history or graph structure) into account, or incorporating some NLP techniques, could improve the inference accuracy. Here we discuss how such information and NLP techniques could be incorporated into our method, which can be done naturally. Evaluating the possible extensions discussed below is left for future work.

Other information. OLIM can be divided into two parts: identifying local words and exploiting them for user location inference. The first part is necessary because most words are not local and cannot be used for user location inference. On the other hand, check-in history, which clearly shows the locations of users, can be directly utilized for user location inference. Concretely, when we observe that user u checks in at a venue close to location l_k, we can update u's location distribution by incrementing the value B_uk (see Algorithm 3). However, we note that user check-ins do not always indicate the home location, meaning that the check-in history may degrade the accuracy of home location inference.
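As a speculative sketch of this extension (not part of OLIM as evaluated; nearest_cell and the weight are our assumptions), a check-in would feed the same counter:

def observe_checkin(B, u, venue_latlon, nearest_cell, weight=1.0):
    """Treat a check-in near cell l_k as direct evidence for that cell by
    incrementing B[u][k], exactly as Algorithm 3 does for local words."""
    k = nearest_cell(venue_latlon)   # map the venue to its quad-tree cell
    B[u][k] += weight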
Social graphs can also be incorporated naturally into our method. The simplest way to take them into account is to pool the posts of a user with those of its neighbors, based on the assumption that user u's neighbors (i.e., followers or followees) are also likely to post about u's home location. This is potentially effective against the data sparsity problem; in other words, even if a user does not post about his/her location, we can still infer the location based on the posts of its neighbors.

NLP techniques. One may think that extracting location names or landmark names with named entity recognition (NER) [21] techniques would improve our method. However, this is not the case because, as discussed later in Section 5.3, there exist location names that are not local. For example, in Japan, a lot of users mention the popular city name "Tokyo", which makes that word non-local (see Figure 5).

We should also note that synonyms and homonyms often degrade text-based methods. In this sense, word sense disambiguation techniques or topic models may improve the inference accuracy. Those techniques and our inference method are orthogonal, because their results can be utilized by our method without any modification. Moreover, according to the experimental results, OLIM achieves high inference accuracy without these techniques, which indicates that our method can handle synonyms and homonyms by filtering them out when such words are mentioned from distant locations.

5. EXPERIMENTS

We conducted the experiments listed below:

- Accuracy comparison: We compared the accuracy of our proposed method with the six existing methods listed in Table 1 (Section 5.2).

- KL divergence sequence: We examined the sequences of KL divergence values of various kinds of words over time (Section 5.3).

- Error reduction over time: We checked that the proposed method reduces the inference error by using continuously arriving posts (Section 5.4).
- Computational costs: We evaluated the computational cost of our proposed method and of existing content-based methods (Section 5.5).

- Parameter validations: We performed parameter validation using a validation user set (Section 5.6).

The datasets and evaluation metrics used in the experiments are detailed in Section 5.1.

5.1 Experimental Setups

Data. We randomly collected Twitter users who satisfy the following: (1) the location profile text, which describes the home location, can be geocoded into coordinates of latitude and longitude by using Yahoo! Geocoder (http://developer.yahoo.co.jp/webapi/map/openlocalplatform/v1/geocoder.html), (2) the geocoded coordinate is in Japan, and (3) the user has at least 200 tweets.

As a result, we obtained 201,570 labeled Twitter users in Japan. In addition, we collected the latest 200 tweets of each collected user (40,314,000 tweets in total) and 33,569,924 follow edges among these users by using the Twitter API. Nouns were extracted from the text of each tweet by using MeCab (http://code.google.com/p/mecab/), a widely used Japanese morphological analyzer. The geocoded coordinates are used as the true home locations l_u^*. The number of distinct locations in the dataset is 11,142, which is reduced to |L| by quad-tree partitioning. We divided the users in the dataset into a training set (90%), a validation set (5%), and a test set (5%). Users in the validation set and the test set, which are treated as unlabeled users, are used for parameter validation (Section 5.6) and accuracy evaluation, respectively. Although misreported location profiles could degrade location inference, Jurgens [11] has experimentally demonstrated that users tend to report accurate information. Thus, we assumed that the users' location profiles show their true home locations.

Evaluation metrics. The results were evaluated based on the error distance, which is the distance between the center point of the inferred location cell l_u and the true location l_u^*. Concretely, we used the Median Error Distance and the Precision (the ratio of users whose error distance is less than 100 miles, about 160 km), which are widely used [6, 15].
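Both metrics are straightforward to compute from predicted cell centers and true coordinates; the sketch below (ours) assumes great-circle (haversine) distances, which the paper does not specify:

import math, statistics

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) pairs, in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def evaluate(predicted, truth):
    """Median error distance, precision (error < 100 miles ~ 160.9 km), and coverage."""
    errors = [haversine_km(predicted[u], truth[u]) for u in truth if u in predicted]
    precision = sum(e < 160.9 for e in errors) / len(errors)
    return statistics.median(errors), precision, len(errors) / len(truth)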
Details of the proposed method. The parameters of OLIM are set to N = 400, d_min = 1.0, and |L| = 211, which were experimentally determined using the validation set in Section 5.6. Note that |L| is obtained by quad-tree partitioning where cells with more than 3,000 data points (i.e., labeled users) are partitioned. The hyperparameter of the Dirichlet distribution is set to alpha_k = R(l_k), which enables reasonable inference for users with no observations, based on the prior. Our code and data are available at https://github.com/yamaguchiyuto/olim.

5.2 Accuracy Comparison

In this experiment, we compared our proposed method with the six existing methods listed in Table 1 in terms of inference accuracy. None of the existing methods consider temporal features.

Figure 4(a) shows the accumulative precision at various error distances, where the x-axis denotes the error distance and the y-axis denotes the accumulative precision (i.e., the ratio of test users whose error distances are lower than x). According to this figure, our proposed method outperformed all compared existing methods at any error distance, meaning that OLIM correctly placed more users with small error distances than any of the other methods. Table 2 summarizes the results. We can see that OLIM achieves the best results for all metrics. Coverage is the ratio of test users for which an inference is made, regardless of the inference result. OLIM shows a coverage of 1.0 because it can infer the home locations of users from the prior distribution even if they have no observations. Note that even without the prior prediction, OLIM also shows a high coverage (about 0.98). OLIM successfully reduced the median error distance of the second-best method, Backstrom, to about 68%.

Table 2: Comparison summary.
Method        | Median ED (m) | Precision | Coverage
OLIM          | 28,507        | 0.663     | 1.000
Cheng [6]     | 200,851       | 0.478     | 0.960
Hecht [9]     | 212,333       | 0.461     | 1.000
Backstrom [2] | 41,946        | 0.571     | 0.981
Jurgens [11]  | 91,363        | 0.560     | 0.984
LMM [27]      | 89,881        | 0.611     | 0.929
UDI [15]      | 138,653       | 0.556     | 0.999

UDI, which is the state-of-the-art method, shows low accuracy although it achieves good results in the original paper [15]. The reason for this is thought to be the highly biased population distribution of Japan, where the majority of labeled users live in Tokyo, the capital of Japan. UDI adopts the concept of maximizing the likelihood of the whole graph, which prefers most unlabeled users to live around Tokyo even though some of them live far from Tokyo. On the other hand, OLIM and Backstrom consider the population distribution, which avoids such inferences.

5.3 KL Divergence Sequence

In this experiment, we investigated the sequences of KL divergence values of various kinds of words over time. Figure 5 shows the results for some representative words. The x-axis indicates the elapsed time, and the y-axis indicates the value of KL divergence at the corresponding time point.

We found that words can be categorized as local words (a and b), non-local words (c and d), and temporally-local words (e to h) by their different types of KL sequences. Examples of local words include "Tsukuba" and "San-nomiya", which are city names in Japan. This type of word shows high values of KL divergence at any time point. Such words are also used by existing content-based methods for location inference.

"Tokyo" and "Friends" are examples of non-local words. "Friends" is an obviously non-local word, while "Tokyo" is also non-local although it is a city name. Popular city names are frequently mentioned by many users from various locations [15], which results in low values of KL divergence.

Examples of temporally-local words include "Thunder", "Rain", "Fireworks", and "Hokkaido". We can see that the first three words show high values of KL divergence occasionally, while the last word occasionally shows low values. Such bursty behaviors are considered to be triggered by the real-world events the words describe. For example, when it thunders, many people around the thunderstorm tend to post about it, which results in high values.
[Figure 4: Experimental results. (a) Accumulative precision: our proposed method achieves the largest precision at any point. (b) Error reduction over time: OLIM, which considers the spatiotemporal correlation, shows the steep reduction. (c) Computational time: our method can update the results in constant time regardless of the amount of accumulated tweets.]

[Figure 5: Sequences of KL divergence of various kinds of words: (a) Tsukuba (city), (b) San-nomiya (city), (c) Tokyo (city), (d) Friends, (e) Thunder, (f) Rain, (g) Fireworks, (h) Hokkaido (city). (a) and (b) are local words, (c) and (d) are non-local words, and (e) to (h) are temporally-local words. Temporally-local words are not exploited for location inference by existing content-based methods.]

In contrast, interestingly, although the city name "Hokkaido" is generally a local word, its KL divergence drops when popular news about the city is mentioned from all over Japan. This type of behavior cannot be identified by existing methods that do not use spatiotemporal correlations. For this reason, our method outperformed existing content-based methods in terms of accuracy.

Although our focus in this paper is location inference, the extracted local words themselves are interesting. We are also interested in the local words of other cultures or countries, not limited to Japan. If a real-world event happens and people around the event post about it, temporally-local words are also expected to be extracted in other countries. Further experiments using other datasets are our major future work.

5.4 Error Reduction over Time

This experiment examined the ability of OLIM to reduce the inference error over time. All tweets in the dataset were treated as a pseudo social stream, and OLIM continuously received tweets from it chronologically. Each time 1% of the tweets (about 400,000 tweets) were processed, OLIM predicted the home locations of all test users by Eqn. (4).

We compared the full version of OLIM with a different version of OLIM, called OLIM-P. OLIM-P first extracts static local words (those with at least N occurrences) in a preprocessing step using all posts in the dataset. OLIM-P then runs Algorithm 1 without performing updateKL, because the values of KL divergence are already calculated for all words. This method cannot leverage temporally-local words, because the temporal features of words cannot be exploited without taking time windows into account.

Figure 4(b) shows the results, where the x-axis denotes the percentage of tweets processed, and the y-axis shows the values of the median error distance. Error bars denote the 95% confidence intervals. The results show that both variants of OLIM reduce the error distance as new tweets arrive. However, the full version of OLIM, which considers time windows, shows a greater improvement in reducing the error distance, suggesting that leveraging temporally-local words is effective for location inference.

5.5 Computational Cost

We compared the computational costs of the content-based methods, including our method. Graph-based methods are not compared in this experiment because they cannot update the inference results, and because their time complexities do not depend on the amount of tweets being used. In this experiment, each method predicted home locations every time 1% of the tweets were processed, similar to the previous section.
[Figure 6: Effects of parameters: (a) d_min, (b) N, (c) |L| (accuracy), (d) |L| (cost).]

We measured the computational time required for each update. Note that the existing content-based methods must repeat the whole inference process using all the contents to update the results every time 1% of the tweets arrive. Figure 4(c) shows the results, where the x-axis denotes the percentage of tweets processed and the y-axis shows the computational time. OLIM-R is our method, but it recalculates the KL divergence for each word every time the word occurs (i.e., it does not use the updating scheme of Eqn. (2)).

The results show that the computational times of the three variants of our method do not increase except in the initial part, meaning that our method can update the results in constant time regardless of the accumulated posts. In the initial part, many words reach N occurrences for the first time, which requires the initial calculation of KL divergence, leading to the initial increase in computational time. On the other hand, the computational times of Cheng and Hecht increase linearly as the amount of accumulated tweets increases. These results confirm the discussion of the time complexities of the content-based methods in Section 4.3. Furthermore, in addition to the efficiency of OLIM in updating the results, it can also efficiently update the model (i.e., the set of local words) by considering the spatiotemporal correlations. In contrast, Cheng and Hecht do not update their models because they do not consider the dynamic features of social streams.

OLIM successfully reduces the computational time of OLIM-R, which means that the updating scheme of KL divergence is efficient. Note that the inference results of these two variants are identical. Moreover, OLIM-P shows smaller computational time than the other two variants because OLIM-P calculates the KL divergence values of all words in preprocessing and therefore does not perform updateKL.

Regarding the storage cost, in contrast to Cheng and Hecht, our method can discard old contents because it does not need them to update the results. Therefore, the storage cost of our method does not increase over time.

5.6 Parameter Validation

This section describes the validation of the window length N, the threshold value d_min, and the number of locations |L| partitioned by the quad-tree. In these experiments, we used the validation user set. Figure 6 shows the results. The following paragraphs discuss the effect of each parameter.

Figure 6(a) shows the result of validating d_min, where the x-axis indicates the value of d_min, and the y-axis indicates the Median Error Distance and the Precision. Error bars denote the 95% confidence intervals. We can see that too small a d_min degrades the accuracy because words with small KL divergence values are used as local words. On the other hand, too large a d_min also shows poor results because the number of extracted local words becomes too small, leading to an increased number of users with no observations.

Figure 6(b) shows the result of validating the value of N. With a small N, non-local words may accidentally be used as local words, leading to lower accuracy. A too large N also results in lower accuracy because OLIM is then no longer able to identify the spatiotemporal correlation. For example, the word "Thunder" cannot be identified as local with a too large N, because the word shows high KL divergence only for short periods.

As mentioned in Section 4.1, we also tried to set a threshold on the window length in terms of time duration (e.g., 6 hours), because a too long time window may have a bad effect on inference. However, contrary to our prediction, setting the window length threshold resulted in poor accuracy. This is because of the existence of rare but local words with long window lengths (e.g., less-popular location names), which would be discarded if we set the window length threshold.

On the other hand, we would note that there exists a burst-and-lean effect for our definition of window length. For example, suppose that a tornado hits a place and many users around it start to post about it, which makes the word "tornado" local in the current window, and after that, no one posts about it. The word "tornado" will stay local for a long time, because its time window will not slide without further word occurrences. In this case, if someone far from the tornado happens to post the word, say, one week later, this local word causes a wrong inference. Although we observed very few such cases, suppressing this effect might result in more accurate inference.

Figures 6(c) and 6(d) show the accuracy and the cost for different numbers of quad-tree partitions. We obtain each value of |L| by quad-tree partitioning where cells with more than 500, 1,000, 1,500, 2,000, 3,000, 4,000, 5,000, or 10,000 data points are partitioned. We can see that too small an |L| results in low accuracy, which means that too coarse a partition cannot accomplish good inference. On the other hand, too large an |L| also shows low accuracy, suggesting that an overly fine-grained partition overfits without a sufficient number of observations of local words. Figure 6(d) indicates that the computational time is almost linearly proportional to |L|, which confirms the time complexity of OLIM.

6. CONCLUSION

This paper tackles the problem of online location inference in social media, where contents are generated and delivered in real-time. Our proposed method, named OLIM, can perform online inference based solely on the newly arriving content, which achieves constant computational and storage costs regardless of the accumulated contents.

In addition, OLIM exploits the spatiotemporal correlation between contents and locations, thereby achieving more accurate inference than that of existing methods. Experimental results demonstrate that the proposed method outperforms existing methods in terms of both accuracy and cost.

Our future work includes the following. First, exploiting both contents and social graphs for online inference is interesting; propagating observations of local words through social graphs is expected to be effective against the data sparsity problem. Second, we plan to conduct experiments on other datasets to further study temporally-local words in other cultures or countries. Third, our method can potentially predict users' current locations with a few modifications; we will explore a forgetting factor that assigns less weight to old observations.

Acknowledgements: This research was partly supported by the program "Research and Development on Real World Big Data Integration and Analysis" of the Ministry of Education, Culture, Sports, Science and Technology, Japan. This work was also supported in part by JSPS KAKENHI, Grant-in-Aid for JSPS Fellows #242322.

7. REFERENCES

[1] S. Abrol and L. Khan. Tweethood: Agglomerative clustering on fuzzy k-closest friends with variable depth for location mining. In SocialCom/PASSAT, pages 153-160, 2010.
[2] L. Backstrom, E. Sun, and C. Marlow. Find me if you can: improving geographical prediction with social and spatial proximity. In WWW, pages 61-70, 2010.
[3] S. Chandra, L. Khan, and F. B. Muhaya. Estimating twitter user location using social interactions - a content based approach. In SocialCom/PASSAT, pages 838-843, 2011.
[4] H.-W. Chang, D. Lee, M. Eltaher, and J. Lee. @phillies tweeting from philly? predicting twitter user locations with spatial word usage. In ASONAM, pages 111-118, 2012.
[5] O. Chapelle, B. Scholkopf, A. Zien, et al. Semi-Supervised Learning, volume 2. MIT Press, Cambridge, 2006.
[6] Z. Cheng, J. Caverlee, and K. Lee. You are where you tweet: a content-based approach to geo-locating twitter users. In CIKM, pages 759-768, 2010.
[7] C. A. Davis Jr., G. L. Pappa, D. R. R. de Oliveira, and F. de L. Arcanjo. Inferring the location of twitter messages based on user relationships. Transactions in GIS, 15(6):735-751, 2011.
[8] J. Eisenstein, B. O'Connor, N. A. Smith, and E. P. Xing. A latent variable model for geographic lexical variation. In EMNLP, pages 1277-1287, 2010.
[9] B. Hecht, L. Hong, B. Suh, and E. H. Chi. Tweets from justin bieber's heart: the dynamics of the location field in user profiles. In CHI, pages 237-246, 2011.
[10] L. Hong, A. Ahmed, S. Gurumurthy, A. J. Smola, and K. Tsioutsiouliklis. Discovering geographical topics in the twitter stream. In WWW, pages 769-778, 2012.
[11] D. Jurgens. That's what friends are for: Inferring location in online social media platforms based on social relationships. In ICWSM, 2013.
[12] S. Kinsella, V. Murdock, and N. O'Hare. "i'm eating a sandwich in glasgow": modeling locations with tweets. In SMUC, pages 61-68, 2011.
[13] R. Lee and K. Sumiya. Measuring geographical regularities of crowd behaviors for twitter-based geo-social event detection. In Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Location Based Social Networks, pages 1-10. ACM, 2010.
[14] R. Li, S. Wang, and K. C.-C. Chang. Multiple location profiling for users and relationships from social network and content. Proceedings of the VLDB Endowment, 5(11):1603-1614, 2012.
[15] R. Li, S. Wang, H. Deng, R. Wang, and K. C.-C. Chang. Towards social user profiling: unified and discriminative influence model for inferring home locations. In KDD, pages 1023-1031, 2012.
[16] A. McCallum, K. Nigam, et al. A comparison of event models for naive bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, volume 752, pages 41-48. Citeseer, 1998.
[17] J. McGee, J. Caverlee, and Z. Cheng. Location prediction in social media based on tie strength. In CIKM, pages 459-468, 2013.
[18] T. Minka. Estimating a dirichlet distribution, 2000.
[19] M. J. Paul and M. Dredze. You are what you tweet: Analyzing twitter for public health. In ICWSM, 2011.
[20] T. Pontes, G. Magno, M. A. Vasconcelos, A. Gupta, J. M. Almeida, P. Kumaraguru, and V. Almeida. Beware of what you share: Inferring home location in social networks. In ICDM Workshops, pages 571-578, 2012.
[21] A. Ritter, S. Clark, O. Etzioni, et al. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524-1534. Association for Computational Linguistics, 2011.
[22] D. P. Rout, K. Bontcheva, D. Preotiuc-Pietro, and T. Cohn. Where's @wally?: a classification approach to geolocating users based on their social ties. In HT, pages 11-20, 2013.
[23] A. Sadilek, H. A. Kautz, and J. P. Bigham. Finding your friends and following them to where you are. In WSDM, pages 723-732, 2012.
[24] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In WWW, pages 851-860, 2010.
[25] A. Schulz, A. Hadjakos, H. Paulheim, J. Nachtwey, and M. Muhlhauser. A multi-indicator approach for geolocalization of tweets. In ICWSM, 2013.
[26] S. Vieweg, A. L. Hughes, K. Starbird, and L. Palen. Microblogging during two natural hazards events: what twitter may contribute to situational awareness. In CHI, pages 1079-1088, 2010.
[27] Y. Yamaguchi, T. Amagasa, and H. Kitagawa. Landmark-based user location inference in social media. In COSN, pages 223-234, 2013.
[28] Q. Yuan, G. Cong, Z. Ma, A. Sun, and N. Magnenat-Thalmann. Who, where, when and what: discover spatio-temporal topics for twitter users. In KDD, pages 605-613, 2013.