
EXPERT OPINION

Contact Editor: Brian Brannon, bbrannon@computer.org

The Unreasonable Effectiveness of Data
Alon Halevy, Peter Norvig, and Fernando Pereira, Google

Eugene Wigner's article "The Unreasonable Effectiveness of Mathematics in the Natural Sciences"1 examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc². Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages.2 Perhaps when it comes to natural language processing and related fields, we're doomed to complex theories that will never have the elegance of physics equations. But if that's so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.

One of us, as an undergraduate at Brown University, remembers the excitement of having access to the Brown Corpus, containing one million English words.3 Since then, our field has seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long.4 In some ways this corpus is a step backwards from the Brown Corpus: it's taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It's not annotated with carefully hand-corrected part-of-speech tags. But the fact that it's a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus—along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions—captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks—if only we knew how to extract the model from the data.

Learning from Text at Web Scale

The biggest successes in natural-language-related machine learning have been statistical speech recognition and statistical machine translation. The reason for these successes is not that these tasks are easier than other tasks; they are in fact much harder than tasks such as document classification that extract just a few bits of information from each document. The reason is that translation is a natural task routinely done every day for a real human need (think of the operations of the European Union or of news agencies). The same is true of speech transcription (think of closed-caption broadcasts). In other words, a large training set of the input-output behavior that we seek to automate is available to us in the wild. In contrast, traditional natural language processing problems such as document classification, part-of-speech tagging, named-entity recognition, or parsing are not routine tasks, so they have no large corpus available in the wild. Instead, a corpus for these tasks requires skilled human annotation. Such annotation is not only slow and expensive to acquire but also difficult for experts to agree on, being bedeviled by many of the difficulties we discuss later in relation to the Semantic Web. The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn't available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results5 or from the accumulated evidence of Web-based text patterns and formatted tables,6 in both cases without needing any manually annotated data.

1541-1672/09/$25.00 © 2009 IEEE. IEEE Intelligent Systems, March/April 2009. Published by the IEEE Computer Society.
Another important lesson from statistical methods in speech recognition and machine translation is that memorization is a good policy if you have a lot of training data. The statistical language models that are used in both tasks consist primarily of a huge database of probabilities of short sequences of consecutive words (n-grams). These models are built by counting the number of occurrences of each n-gram sequence from a corpus of billions or trillions of words. Researchers have done a lot of work in estimating the probabilities of new n-grams from the frequencies of observed n-grams (using, for example, Good-Turing or Kneser-Ney smoothing), leading to elaborate probabilistic models. But invariably, simple models and a lot of data trump more elaborate models based on less data. Similarly, early work on machine translation relied on elaborate rules for the relationships between syntactic and semantic patterns in the source and target languages. Currently, statistical translation models consist mostly of large memorized phrase tables that give candidate mappings between specific source- and target-language phrases. Instead of assuming that general patterns are more effective than memorizing specific phrases, today's translation models introduce general rules only when they improve translation over just memorizing particular phrases (for instance, in rules for dates and numbers).

Similar observations have been made in every other application of machine learning to Web data: simple n-gram models or linear classifiers based on millions of specific features perform better than elaborate models that try to discover general rules. In many cases there appears to be a threshold of sufficient data. For example, James Hays and Alexei A. Efros addressed the task of scene completion: removing an unwanted, unsightly automobile or ex-spouse from a photograph and filling in the background with pixels taken from a large corpus of other photos.7 With a corpus of thousands of photos, the results were poor. But once they accumulated millions of photos, the same algorithm performed quite well. We know that the number of grammatical English sentences is theoretically infinite and the number of possible 2-Mbyte photos is 256^2,000,000. However, in practice we humans care to make only a finite number of distinctions. For many tasks, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need, without generative rules.

For those who were hoping that a small number of general rules could explain language, it is worth noting that language is inherently complex, with hundreds of thousands of vocabulary words and a vast variety of grammatical constructions. Every day, new words are coined and old usages are modified. This suggests that we can't reduce what we want to say to the free combination of a few abstract primitives.

For those with experience in small-scale machine learning who are worried about the curse of dimensionality and overfitting of models to data, note that all the experimental evidence from the last decade suggests that throwing away rare events is almost always a bad idea, because much Web data consists of individually rare but collectively frequent events. For many tasks, words and word combinations provide all the representational machinery we need to learn from text. Human language has evolved over millennia to have words for the important concepts; let's use them. Abstract representations (such as clusters from latent analysis) that lack linguistic counterparts are hard to learn or validate and tend to lose information. Relying on overt statistics of words and word co-occurrences has the further advantage that we can estimate models in an amount of time proportional to available data and can often parallelize them easily. So, learning from the Web becomes naturally scalable.

The success of n-gram models has unfortunately led to a false dichotomy. Many people now believe there are only two approaches to natural language processing:

• a deep approach that relies on hand-coded grammars and ontologies, represented as complex networks of relations; and
• a statistical approach that relies on learning n-gram statistics from large corpora.

In reality, three orthogonal problems arise:

• choosing a representation language,
• encoding a model in that language, and
• performing inference on the model.

Each problem can be addressed in several ways, resulting in dozens of approaches. The deep approach that was popular in the 1980s used first-order logic (or something similar) as the representation language, encoded a model with the labor of a team of graduate students, and did inference with complex inference rules appropriate to the representation language.
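To make the counting concrete, here is a minimal sketch of the memorize-and-smooth recipe behind the n-gram language models described earlier. It uses bigrams and simple add-one smoothing in place of the Good-Turing or Kneser-Ney smoothing the authors cite, and the toy corpus is invented for illustration:

```python
from collections import Counter

# Toy corpus standing in for billions of words; whitespace tokenization
# is assumed to be good enough for this illustration.
corpus = "the cat sat on the mat the cat ate".split()

# Memorize: count every unigram and bigram seen in the corpus.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def bigram_prob(w1, w2):
    """P(w2 | w1) with add-one smoothing, so an unseen bigram gets a
    small nonzero probability instead of zero."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

print(bigram_prob("the", "cat"))  # memorized: seen twice in the corpus
print(bigram_prob("cat", "on"))   # never seen, but not impossible
```

At production scale the model is the same shape, a lookup table of counts, only the counts come from trillions of words and the smoothing is far more careful.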



In the 1980s and 90s, it became fashionable to use finite state machines as the representation language, use counting and smoothing over a large corpus to encode a model, and use simple Bayesian statistics as the inference method. But many other combinations are possible, and in the 2000s, many are being tried. For example, Lise Getoor and Ben Taskar collect work on statistical relational learning—that is, representation languages that are powerful enough to represent relations between objects (such as first-order logic) but that have a sound, probabilistic definition that allows models to be built by statistical learning.8 Taskar and his colleagues show how the same kind of maximum-margin classifier used in support vector machines can improve traditional parsing.9 Stefan Schoenmackers, Oren Etzioni, and Daniel S. Weld show how a relational logic and a 100-million-page corpus can answer questions such as "what vegetables help prevent osteoporosis?" by isolating and combining the relational assertions that "kale is high in calcium" and "calcium helps prevent osteoporosis."10

Semantic Web versus Semantic Interpretation

The Semantic Web is a convention for formal representation languages that lets software services interact with each other "without needing artificial intelligence."11 A software service that enables us to make a hotel reservation is transformed into a Semantic Web service by agreeing to use one of several standards for representing dates, prices, and locations. The service can then interoperate with other services that use either the same standard or a different one with a known translation into the chosen standard. As Tim Berners-Lee, James Hendler, and Ora Lassila write, "The Semantic Web will enable machines to comprehend semantic documents and data, not human speech and writings."11

The problem of understanding human speech and writing—the semantic interpretation problem—is quite different from the problem of software service interoperability. Semantic interpretation deals with imprecise, ambiguous natural languages, whereas service interoperability deals with making data precise enough that the programs operating on the data will function effectively. Unfortunately, the fact that the word "semantic" appears in both "Semantic Web" and "semantic interpretation" means that the two problems have often been conflated, causing needless and endless consternation and confusion. The "semantics" in Semantic Web services is embodied in the code that implements those services in accordance with the specifications expressed by the relevant ontologies and attached informal documentation. The "semantics" in semantic interpretation of natural languages is instead embodied in human cognitive and cultural processes whereby linguistic expression elicits expected responses and expected changes in cognitive state. Because of a huge shared cognitive and cultural context, linguistic expression can be highly ambiguous and still often be understood correctly.

Given these challenges, building Semantic Web services is an engineering and sociological challenge. So, even though we understand the required technology, we must deal with significant hurdles:

• Ontology writing. The important easy cases have been done. For example, the Dublin Core defines dates, locations, publishers, and other concepts that are sufficient for card catalog entries. Bioformats.org defines chromosomes, species, and gene sequences. Other organizations provide ontologies for their specific fields. But there's a long tail of rarely used concepts that are too expensive to formalize with current technology. Project Halo did an excellent job of encoding and reasoning with knowledge from a chemistry textbook, but the cost was US$10,000 per page.12 Obviously we can't afford that cost for a trillion Web pages.
• Difficulty of implementation. Publishing a static Web page written in natural language is easy; anyone with a keyboard and Web connection can do it. Creating a database-backed Web service is substantially harder, requiring specialized skills. Making that service compliant with Semantic Web protocols is harder still. Major sites with competent technology experts will find the extra effort worthwhile, but the vast majority of small sites and individuals will find it too difficult, at least with current tools.
• Competition. In some domains, competing factions each want to promote their own ontology. In other domains, the entrenched leaders of the field oppose any ontology because it would level the playing field for their competitors. This is a problem in diplomacy, not technology. As Tom Gruber says, "Every ontology is a treaty—a social agreement—among people with some common motive in sharing."13 When a motive for sharing is lacking, so are common ontologies.



• Inaccuracy and deception. We know how to build sound inference mechanisms that take true premises and infer true conclusions. But we don't have an established methodology to deal with mistaken premises or with actors who lie, cheat, or otherwise deceive. Some work in reputation management and trust exists, but for the time being we can expect Semantic Web technology to work best where an honest, self-correcting group of cooperative users exists and not as well where competition and deception exist.

The challenges for achieving accurate semantic interpretation are different. We've already solved the sociological problem of building a network infrastructure that has encouraged hundreds of millions of authors to share a trillion pages of content. We've solved the technological problem of aggregating and indexing all this content. But we're left with a scientific problem of interpreting the content, which is mainly that of learning as much as possible about the context of the content to correctly disambiguate it. The semantic interpretation problem remains regardless of whether or not we're using a Semantic Web framework. The same meaning can be expressed in many different ways, and the same expression can express many different meanings. For example, a table of company information might be expressed in ad hoc HTML with column headers called "Company," "Location," and so on. Or it could be expressed in a Semantic Web format, with standard identifiers for "Company Name" and "Location," using the Dublin Core Metadata Initiative point-encoding scheme. But even if we have a formal Semantic Web "Company Name" attribute, we can't expect to have an ontology for every possible value of this attribute. For example, we can't know for sure what company the string "Joe's Pizza" refers to because hundreds of businesses have that name and new ones are being added all the time. We also can't always tell which business is meant by the string "HP." It could refer to Helmerich & Payne Corp. when the column is populated by stock ticker symbols but probably refers to Hewlett-Packard when the column is populated by names of large technology companies. The problem of semantic interpretation remains; using a Semantic Web formalism just means that semantic interpretation must be done on shorter strings that fall between angle brackets.

What we need are methods to infer relationships between column headers or mentions of entities in the world. These inferences may be incorrect at times, but if they're done well enough we can connect disparate data collections and thereby substantially enhance our interaction with Web data. Interestingly, here too Web-scale data might be an important part of the solution. The Web contains hundreds of millions of independently created tables and possibly a similar number of lists that can be transformed into tables.14 These tables represent structured data in myriad domains. They also represent how different people organize data—the choices they make for which columns to include and the names given to the columns. The tables also provide a rich collection of column values, and values that they decided belong in the same column of a table. We've never before had such a vast collection of tables (and their schemata) at our disposal to help us resolve semantic heterogeneity. Using such a corpus, we hope to be able to accomplish tasks such as deciding when "Company" and "Company Name" are synonyms, deciding when "HP" means Helmerich & Payne or Hewlett-Packard, and determining that an object with attributes "passengers" and "cruising altitude" is probably an aircraft.

Examples

How can we use such a corpus of tables? Suppose we want to find synonyms for attribute names—for example, when "Company Name" could be equivalent to "Company" and "price" could be equivalent to "discount." Such synonyms differ from those in a thesaurus because here, they are highly context dependent (both in tables and in natural language). Given the corpus, we can extract a set of schemata from the tables' column labels; for example, researchers reliably extracted 2.5 million distinct schemata from a collection of 150 million tables, not all of which had schemata.14 We can now examine the co-occurrences of attribute names in these schemata. If we see a pair of attributes A and B that rarely occur together but always occur with the same other attribute names, this might mean that A and B are synonyms. We can further justify this hypothesis if we see that data elements have a significant overlap or are of the same data type. Similarly, we can also offer a schema autocomplete feature for database designers. For example, by analyzing such a large corpus of schemata, we can discover that schemata that have the attributes Make and Model also tend to have the attributes Year, Color, and Mileage.
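A minimal sketch of both ideas, synonym candidates from schema co-occurrence and schema autocomplete, might look as follows. The mini-corpus of schemata and the thresholds are invented for illustration; the heuristic here ("never co-occur, but share context attributes") is a crude stand-in for the statistics a real system would use:

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus of extracted schemata (column-label sets);
# the corpus described in the article has 2.5 million of them.
schemata = [
    {"Make", "Model", "Year", "Color", "Mileage"},
    {"Make", "Model", "Year", "Price"},
    {"Make", "Model", "Color", "Mileage"},
    {"Company", "Location", "CEO"},
    {"Company Name", "Location", "CEO"},
    {"Company", "Location", "Stock Price"},
]

def cooccurring(attr):
    """All attributes that appear alongside attr in some schema."""
    ctx = set()
    for s in schemata:
        if attr in s:
            ctx |= s - {attr}
    return ctx

def synonym_candidates():
    """Pairs that never co-occur in one schema yet share context attributes."""
    attrs = {a for s in schemata for a in s}
    pairs = []
    for a, b in combinations(sorted(attrs), 2):
        together = any(a in s and b in s for s in schemata)
        shared = cooccurring(a) & cooccurring(b)
        if not together and len(shared) >= 2:
            pairs.append((a, b))
    return pairs

def autocomplete(partial, k=3):
    """Suggest attributes that most often co-occur with the given ones."""
    counts = Counter()
    for s in schemata:
        if partial <= s:
            counts.update(s - partial)
    return [a for a, _ in counts.most_common(k)]

print(synonym_candidates())            # ("Company", "Company Name") among them
print(autocomplete({"Make", "Model"})) # Year, Color, Mileage
```

Even on this toy corpus the loose heuristic surfaces some false pairs alongside ("Company", "Company Name"); the data-overlap and data-type checks mentioned above are what would filter those out at scale.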



Providing such feedback to schemata creators can save them time but can also help them use more common attribute names, thereby decreasing a possible source of heterogeneity in Web-based data. Of course, we'll find immense opportunities to create interesting data sets if we can automatically combine data from multiple tables in this collection. This is an area of active research.

Another opportunity is to combine data from multiple tables with data from other sources, such as unstructured Web pages or Web search queries. For example, Marius Paşca also considered the task of identifying attributes of classes.15 That is, his system first identifies classes such as "Company," then finds examples such as "Adobe Systems," "Macromedia," "Apple Computer," "Target," and so on, and finally identifies class attributes such as "location," "CEO," "headquarters," "stock price," and "company profile." Michael Cafarella and his colleagues showed this can be gleaned from tables, but Paşca showed it can also be extracted from plain text on Web pages and from user queries in search logs. That is, from the user query "Apple Computer stock price" and from the other information we know about existing classes and attributes, we can confirm that "stock price" is an attribute of the "Company" class. Moreover, the technique works not just for a few dozen of the most popular classes but for thousands of classes and tens of thousands of attributes, including classes like "Aircraft Model," which has attributes "weight," "length," "fuel consumption," "interior photos," "specifications," and "seating arrangement." Paşca shows that including query logs can lead to excellent performance, with 90 percent precision over the top 10 attributes per class.

So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. For natural language applications, trust that human language has already evolved words for the important concepts. See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words. Now go out and gather some data, and see what it can do.

References
1. E. Wigner, "The Unreasonable Effectiveness of Mathematics in the Natural Sciences," Comm. Pure and Applied Mathematics, vol. 13, no. 1, 1960, pp. 1–14.
2. R. Quirk et al., A Comprehensive Grammar of the English Language, Longman, 1985.
3. H. Kucera, W.N. Francis, and J.B. Carroll, Computational Analysis of Present-Day American English, Brown Univ. Press, 1967.
4. T. Brants and A. Franz, Web 1T 5-Gram Version 1, Linguistic Data Consortium, 2006.
5. S. Riezler, Y. Liu, and A. Vasserman, "Translating Queries into Snippets for Improved Query Expansion," Proc. 22nd Int'l Conf. Computational Linguistics (Coling 08), Assoc. for Computational Linguistics, 2008, pp. 737–744.
6. P.P. Talukdar et al., "Learning to Create Data-Integrating Queries," Proc. 34th Int'l Conf. Very Large Databases (VLDB 08), Very Large Database Endowment, 2008, pp. 785–796.
7. J. Hays and A.A. Efros, "Scene Completion Using Millions of Photographs," Comm. ACM, vol. 51, no. 10, 2008, pp. 87–94.
8. L. Getoor and B. Taskar, Introduction to Statistical Relational Learning, MIT Press, 2007.
9. B. Taskar et al., "Max-Margin Parsing," Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP 04), Assoc. for Computational Linguistics, 2004, pp. 1–8.
10. S. Schoenmackers, O. Etzioni, and D.S. Weld, "Scaling Textual Inference to the Web," Proc. 2008 Conf. Empirical Methods in Natural Language Processing (EMNLP 08), Assoc. for Computational Linguistics, 2008, pp. 79–88.
11. T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web," Scientific Am., 17 May 2001.
12. P. Friedland et al., "Towards a Quantitative, Platform-Independent Analysis of Knowledge Systems," Proc. Int'l Conf. Principles of Knowledge Representation, AAAI Press, 2004, pp. 507–514.
13. "Interview of Tom Gruber," AIS SIGSEMIS Bull., vol. 1, no. 3, 2004.
14. M.J. Cafarella et al., "WebTables: Exploring the Power of Tables on the Web," Proc. Very Large Data Base Endowment (VLDB 08), ACM Press, 2008, pp. 538–549.
15. M. Paşca, "Organizing and Searching the World Wide Web of Facts. Step Two: Harnessing the Wisdom of the Crowds," Proc. 16th Int'l World Wide Web Conf., ACM Press, 2007, pp. 101–110.

Alon Halevy is a research scientist at Google. Contact him at halevy@google.com.

Peter Norvig is a research director at Google. Contact him at pnorvig@google.com.

Fernando Pereira is a research director at Google. Contact him at pereira@google.com.
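The query-log technique from the Examples section, confirming that "stock price" is an attribute of the "Company" class from queries like "Apple Computer stock price," can be sketched in miniature. The instance set, query log, and support threshold below are invented for illustration and are far simpler than Paşca's actual system:

```python
from collections import defaultdict

# Hypothetical data: known instances of the "Company" class and a tiny
# search-query log; a real system uses millions of queries.
companies = {"apple computer", "adobe systems", "target", "macromedia"}

queries = [
    "apple computer stock price",
    "adobe systems stock price",
    "target headquarters",
    "apple computer ceo",
    "adobe systems ceo",
    "macromedia ceo",
    "weather in paris",
]

def class_attributes(instances, query_log, min_instances=2):
    """Candidate class attributes: query remainders that follow at least
    min_instances distinct instances of the class."""
    support = defaultdict(set)
    for q in query_log:
        for inst in instances:
            if q.startswith(inst + " "):
                support[q[len(inst) + 1:]].add(inst)
    ranked = sorted(support.items(), key=lambda kv: -len(kv[1]))
    return [attr for attr, insts in ranked if len(insts) >= min_instances]

print(class_attributes(companies, queries))  # ["ceo", "stock price"]
```

Here "ceo" and "stock price" are confirmed because several distinct companies are queried with them, while "headquarters" (one instance) and "weather in paris" (no instance) are not.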

