
contributed articles

doi: 10.1145/1364782.1364798

The Web must be studied as an entity in its own right to ensure it keeps flourishing and prevent unanticipated social effects.

by James Hendler, Nigel Shadbolt, Wendy Hall, Tim Berners-Lee, and Daniel Weitzner

Web Science: An Interdisciplinary Approach to Understanding the Web

Despite the Web’s great success as a technology and the significant amount of computing infrastructure on which it is built, it remains, as an entity, surprisingly unstudied. Here, we look at some of the technical and social challenges that must be overcome to model the Web as a whole, keep it growing, and understand its continuing social impact. A systems approach, in the sense of “systems biology,” is needed if we are to be able to understand and engineer the future Web.

Illustration by Marius Watz
60 Communications of the ACM | July 2008 | Vol. 51 | No. 7

Figure 1: The social interactions enabled by the Web put demands on the Web applications behind them, in turn putting further demands on the Web’s infrastructure. [Diagram layers: Social Interactions, Application Needs, Infrastructure Reqs.]

Figure 2: The Web presents new challenges to software engineering and application development. [Diagram labels: Idea, Design, Technology, Social, Issues, creativity, analysis, micro, macro, complexity.]

Despite the huge effect the Web has had on computing, as well as on the overall field of computer science, the best keyword indicator one can find in the ACM taxonomy, the one by which the field organizes many of its research papers and conferences, is “miscellaneous.” Similarly, if you look at CS curricula in most universities worldwide you will find “Web design” is taught as a service course, along with, perhaps, a course on Web scripting languages. You are unlikely to find a course that teaches Web architecture or protocols. It is as if the Web, at least below the browser, simply does not exist. Many “information schools” and “informatics departments” offer courses that focus on applications on the Web or on such topics as “Web 2.0,” but the protocols, architectures, and underlying principles of the Web per se are rarely covered.

Simplifying a bit, part of the reason for this is that networking has long been part of the systems curricula in many departments, and thus the Internet, defined via the TCP/IP networking protocols, has long been considered an important part of CS work. The Web, despite having its own protocols, algorithms, and architectural principles, is often viewed by people in the CS field as an application running on top of the Net, more than as an entity unto itself.

This is odd, as the Web is the most used and one of the most transformative applications in the history of computing, even of human communications. It has changed how those in academia teach, communicate, publish, and do research. In industry, it has not only created an entire sector (or, arguably, multiple sectors) but affected the communications and delivery of services across the entire industrial spectrum. In government, it has changed not only the nature of how governments communicate with their citizens but also how these populations communicate and even, in some cases, how they end up choosing their governments in the first place; recall the U.S. presidential debates in which candidates took questions online and through YouTube videos. It is estimated that the size of the human population is on the order of 10^10 people, whereas the number of separate Web documents is more than 10^11.

Computing has made significant contributions to the Web. Our everyday use of the Web depends on fundamental developments in CS that took place long before the Web was invented. Today’s search engines are based on, for example, developments in information retrieval with a legacy going back to the 1960s. The innovations of the 1990s [9, 23] provide the crucial algorithms underlying modern search and are fundamental to Web use. New resources (such as Hadoop, lucene.apache.org/hadoop/, an open-source software framework that supports data-intensive distributed applications on large clusters of commodity computers) make it possible for students to explore these algorithms and experiment with large-scale Web-programming practices like MapReduce parallelism [11] in a way not previously accessible beyond a few top universities.

Other aspects of human interaction on the Web have been studied elsewhere. Of special note, many interesting aspects of the use of the Web (such as social networking, tagging, data integration, information retrieval, and Web ontologies) have become part of a new “social computing” area at some of the top information schools. They offer classes in the general properties of networks and interconnected systems in both the policy and political aspects of computing and in the economics


of computer use. However, in many of these courses, the Web itself is treated as a specific instantiation of more general principles. In other cases, the Web is treated primarily as a dynamic content mechanism that supports the social interactions among multiple browser users. Whether in CS studies or in information-school courses, the Web is often studied exclusively as the delivery vehicle for content, technical or social, rather than as an object of study in its own right.

Here, we present the emerging interdisciplinary field of Web science [5, 6], taking the Web as its primary object of study. We show there is significant interplay among the social interactions enabled by the Web’s design, the scalable and open applications development mandated to support them, and the architectural and data requirements of these large-scale applications (see Figure 1). However, the study of the relationships among these levels is often hampered by the disciplinary boundaries that tend to separate the study of the underlying networking from the study of the social applications. We identify some of these relationships and briefly review the status of Web-related research within computing. We primarily focus on identifying emerging and extremely challenging problems researchers (in their role as Web scientists) need to explore.

What Is It?

Where physical science is commonly regarded as an analytic discipline that aims to find laws that generate or explain observed phenomena, CS is predominantly (though not exclusively) synthetic, in that formalisms and algorithms are created in order to support specific desired behaviors. Web science deliberately seeks to merge these two paradigms. The Web needs to be studied and understood as a phenomenon but also as something to be engineered for future growth and capabilities.

At the micro scale, the Web is an infrastructure of artificial languages and protocols; it is a piece of engineering. However, it is the interaction of human beings creating, linking, and consuming information that generates the Web’s behavior as emergent properties at the macro scale. These properties often generate surprising properties that require new analytic methods to be understood. Some are desirable and therefore to be engineered in; others are undesirable and if possible engineered out. We also need to keep in mind that the Web is part of a wider system of human interaction; it has profoundly affected society, with each emerging wave creating new challenges and opportunities in making information more available to wider sectors of the population than ever before.

It may seem that the best way to understand the Web is as a set of protocols that can be studied for their properties, with individual applications analyzed for their algorithmic properties. However, the Web wasn’t (and still isn’t) built using the specify, design, build, test development cycle CS has traditionally viewed as software engineering best practice.

Figure 2 outlines a new way of looking at Web development. A software application is designed based on an appropriate technology (such as algorithm and design) and with an envisioned “social” construct; it is indeed a contradiction in terms to talk about a Web application built for a single user on a single machine. The system is generally tested in a small group or deployed on a limited basis; the system’s “micro” properties are thus tested. In some cases, when more and more people accept the micro system, accelerating “viral” scaling occurs. For example, when Mosaic, the first popular Web browser, was released publicly in 1992, the number of users quickly grew by several orders of magnitude, with more than a million downloads in the first year; for more recent examples, consider photo-sharing on Flickr, video-uploading on YouTube, and social-networking sites like MySpace and Facebook.

The macro system, that is, the use of the micro system by many users interacting with one another in often-unpredicted ways, is far more interesting in and of itself and generally must be analyzed in ways that are different from the micro system. Also, these macro systems engender new challenges that do not occur at the micro scale; for example, the wide deployment of Mosaic led to a need for a way to find relevant material on the growing Web, and thus search became an important applica-


tion, and later an industry, in its own right. In other cases, the large-scale system may have emergent properties that were not predictable by analyzing the micro technical and/or social effects. Dealing with these issues can lead to subsequent generations of technology. For example, the enormous success of search engines has inevitably yielded techniques to game the algorithms (an unexpected result) to improve search rank, leading, in turn, to the development of better search technologies to defeat the gaming.

The essence of our understanding of what succeeds on the Web and how to develop better Web applications is that we must create new ways to understand how to design systems to produce the effect we want. The best we can do today is design and build in the micro, hoping for the best, but how do we know if we’ve built in the right functionality to ensure the desired macroscale effects? How do we predict other side effects and the emergent properties of the macro? Further, as the success or failure of a particular Web technology may involve aspects of social interaction among users, a topic we return to later, understanding the Web requires more than a simple analysis of technological issues but also of the social dynamic of perhaps millions of users.

Given the breadth of the Web and its inherently multi-user (social) nature, its science is necessarily interdisciplinary, involving at least mathematics, CS, artificial intelligence, sociology, psychology, biology, and economics. We invite computer scientists to expand the discipline by addressing the challenges following from the widespread adoption of the Web and its profound influence on social structures, political systems, commercial organizations, and educational institutions.

Beneath the Web Graph

One way to understand the Web, familiar to many in CS, is as a graph whose nodes are Web pages (defined as static HTML documents) and whose edges are the hypertext links among these nodes. This was named the “Web graph” in [22], which also included the first related analysis. The in-degree of the Web graph was shown in Kleinberg et al. [3] and Kumar et al. [24] to follow a power-law distribution; a similar effect was shown in Broder et al. [10] for the out-branching of vertices in the graph. An important result in Dill et al. [12] showed that large samples of the Web, generated through a variety of methods, all had similar properties, which matters because the Web graph keeps growing, reported in 2005 to be on the order of seven million new pages a day [17]. Various models have been proposed as to how the Web graph grows and which models best capture its evolution; see Donato et al. [14] for an analysis of a number of these models and their properties.

Along with analyses of this graph and its growth, a number of algorithms have been devised to exploit various properties of the graph. For example, the HITS algorithm [23] and PageRank [9] assume that the insertion of a hyperlink from one page to another can be taken as a sort of endorsement of the “authority” of the page being linked to, an assumption that led to the development of powerful search engines for finding pages on the Web. While modern search engines use a number of heuristics beyond these page-authority calculations, due in part to competitive pressure from those trying to spoof the algorithms and get a higher rank, these Web-graph-based models still form the heart of the critical crawlers and rank-assessment algorithms behind Web search.

The links in this Web graph represent single instantiations of the results of calling the HTTP protocol with a GET request that returns a particular representation (in this case an HTML page) of a document based on a universal resource identifier (URI) that serves as an identifier common across the entire Web. So, for example, the URI http://www.acm.org/publications/cacm typed into a standard Web browser invokes the hypertext transfer protocol (HTTP) and returns an HTML page that contains content describing the publication known as Communications of the ACM. Note, however, that the content itself contains other URIs that are themselves pointers to objects that are also displayed (such as icons and images) and that the formatting of the page itself may require retrieving other resources (such as cascading style sheets) or XML DTD documents. So what we might naively view as a single link from, say, a research group’s Web page to an article on a Communications page will actually involve a number of requests among a number of servers; at the time of this writing, typing the URI for Communications into a browser will cause more than 20 different HTTP-GET requests to occur for seven different types of Web formats. Crawlers can capture these links and create the Web graph as, essentially, a static snapshot of the linking of the Web.

However, the Web graph is just one abstraction of the Web based on one part of the processing and protocols underlying its function. While it is an important result that the Web graph is scale-free, it is the design of the protocols and services that we now call the Web that makes it possible for it to be this way. The Web was built around a set of core design components defined in The Architecture of The World Wide Web, Volume 1 [21] as “the identification of resources, the representation of resource state, and the protocols that support the interaction between agents and resources in the space.”

A feature of the Web is that, depending on the details of a request, different representations may be served up to different requesters. For example, the HTML produced may vary based on conditions hidden from the client (such as which particular machines in a back-end server farm process the request) and by the server’s customization of the response. Cookies, representing previous state, may also be used, causing different users to see different content (and thus have different links in the Web graph) based on earlier behavior and visits to the same or other sites. This sort of user-dependent state is not directly accounted for in current Web-graph models.

There are also other ways the Web, as an application of the Internet, cannot simply be analyzed using the model of a quasi-static graph of linked hypertext pages. For example, many Web sites use Web forms to access a wealth of information behind the servers, where that information, sometimes called “the deep Web,” is not visible in the Web model. For many sites, in which the application’s data forms a linked Web, the links are not explicit, and HTTP-POST requests are used instead of the HTTP-GETs in the Web graph. In other cases, these sites generate com-


plex URIs that use GET requests to pass on state [a], thus obscuring the identity of the actual resources.

URIs that carry state are used heavily in Web applications but are, to date, largely unanalyzed. For example, in a June 2007 talk, Udi Manber, Google’s VP of engineering, addressed the issue of why Web search is so difficult [25], explaining that on an average day, 20%–25% of the searches seen by Google have never been submitted before and that each of these searches generates a unique identifier (using server-specific encoding information). So a Web-graph model would represent only the requesting document (whether a user request or a request generated by, for example, a dynamic advertisement content request) linked to the www.google.com node. However, if, as is widely reported, Google receives more than 100 million queries per day, and if 20% of them are unique, then more than 20 million links, represented as new URIs that encode the search term(s), should show up in the Web graph every day, or around 200 per second. Do these links follow the same power laws? Do the same growth models explain these behaviors? We simply don’t know.

Analyzing the Web solely as a graph also ignores many of its dynamics (especially at short timescales). Many phenomena known to Web users (such as denial-of-service attacks caused by flooding a server and the need to click the same link multiple times before getting a response) cannot be explained by the Web-graph model and often can’t be expressed in terms amenable to such graph-based analysis. Representing them at the networking level, ignoring protocols and how they work, also misses key aspects of the Web, as well as a number of behaviors that emerge from the interactions of millions of requests hitting many thousands of servers every second. Web dynamics were analyzed more than a decade ago [20], but the combination of (i) the exponential growth in the amount of Web content, (ii) the change in the number, power, and diversity of Web servers and applications, and (iii) the increasing number of diverse users from everywhere in the world makes a similar analysis impossible today without creating and validating new models of the Web’s dynamics. Such models must also pay special attention to the details of the Web’s architecture, as well as to the complexity of the interactions actually taking place there.

Additionally, modern, sophisticated Web sites provide powerful user-interface functionality by running large script systems within the browser. These applications access the underlying remote data model through Web APIs. This application architecture allows users and entrepreneurs to quickly build many new forms of global systems using the processing power of users’ machines and the storage capacity of a mass of conventional Web servers. Like the basic Web, each such system is interesting mainly for its emergent macro-scale properties, of which we have little understanding. Are such systems stable? Are they fair? Do they effectively create a new form of currency? And if they do should it be regulated?

Similarly, many user-generated content sites now store personal information yet have rather simplistic systems to restrict access to a person’s “friends.” This information is not available to wide-scale analysis. Some other sites must be allowed to access the sites by posing as the user or as a friend; a number of three-party authentication protocols are being deployed to allow this. A complex system is thus being built piece by piece, with no invariants (such as “my employer will never see this picture”) assured for the user.

The purpose of this discussion is not to go into the detail of Web protocols or the relative merits of Web-modeling approaches but to stress that they are critical to the current and continued working of the Web. Understanding the protocols and issues is important to understanding the Web as a technical construct and to analyzing and modeling its dynamic nature. Our ability to engineer Web systems with desirable properties at scale requires that we understand these dynamics. This analysis and modeling are thus an important challenge to computer scientists if they are to be able to understand

a. These characters, including ?, #, =, and &, followed by keywords, may follow the last “slash” in the URI, thus making for the long URIs often generated by dynamic content servers.
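To make the discussion of state-carrying URIs concrete, here is a minimal Python sketch. It rests on assumptions that are not in the article: the endpoint `http://search.example.org/search`, the `q` and `lang` parameter names, and the helper names are all hypothetical. It shows query state being placed after the last slash with `?`, `=`, and `&`, how that state can be separated from the node a link-only graph model would record, and it reruns the back-of-the-envelope estimate using the article's own illustrative figures (100 million queries per day, 20% never seen before).

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint, used only for illustration.
BASE = "http://search.example.org/search"


def state_uri(terms, **extra):
    """Build a GET URI whose query string carries the request state."""
    params = {"q": " ".join(terms), **extra}
    return f"{BASE}?{urlencode(params)}"


def split_state(uri):
    """Separate the node a graph model would see from the carried state."""
    parts = urlsplit(uri)
    node = f"{parts.scheme}://{parts.netloc}{parts.path}"
    state = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return node, state


def new_uris_per_second(queries_per_day, unique_fraction):
    """Back-of-the-envelope rate of never-before-seen query URIs."""
    seconds_per_day = 24 * 60 * 60
    return queries_per_day * unique_fraction / seconds_per_day


uri = state_uri(["web", "science"], lang="en")
node, state = split_state(uri)
rate = new_uris_per_second(100_000_000, 0.20)

print(uri)
print(node, state)
print(f"about {rate:.0f} new URIs per second")
```

Every distinct query produces a distinct URI, yet all of them collapse onto the same node once the query string is stripped, which is the loss of information the text describes. The computed rate is a little over 230 per second, in line with the article's rounded "around 200 per second."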


the growth and behaviors of the future Web, as well as to engineer systems with desired properties in a way that is significantly less hit or miss.

From Power Laws to People

Mathematically based analysis of the Web involves another potential failing. Whereas the structure and use of various Web sites (taken mathematically) may have interesting properties, these properties may not be very useful in explaining the behavior of the sites over time. Consider the following example: Wikipedia (www.wikipedia.org), the online wiki-based encyclopedia, includes more than two million articles in English and more than six million in all languages combined. They are hyperlinked, and it is logical to ask whether the hyperlinks have structure similar to those on the Web in general or whether, since this is a managed corpus, they have yet other properties.

Answering can be done in a number of ways; Figure 3 shows the result of one of them. In this case, DBPedia (dbpedia.org), which is a dump of the link structure of Wikipedia using the labeled links of the resource description framework, or RDF, has been analyzed with respect to the use of the link labels; that is, we are looking at the structure of Wikipedia as opposed to the linguistic content of its pages. The figure shows the same kind of Zipf-like distribution found in the original Web graph analyses. There is also some evidence [16] and a lot of speculation [29] that similar effects can be seen in the use of tags in Web-based tagging systems. Current research is also exploring whether these results depart from such models as preferential attachment [3] used to explain the scale-free features of Web graphs.

Unfortunately, whatever explains these effects, another aspect of Wikipedia’s use is not explained by these models and does not necessarily follow from these properties. Wikipedia is built on top of the MediaWiki software package (www.mediawiki.org/wiki/MediaWiki), which is freely available and used in many other Web applications besides Wikipedia. While some of them have also been successful, many have failed to generate significant use. A purely “technological” explanation cannot account for this; rather, something about the organizational structures of Wikipedia and the needs of its users accounts for its success over other systems built from the same code base.

The model by which articles are created, edited, and tracked is provided by the underlying technology. The social model enabled by humans interacting in ways allowed by that technology is more difficult to explain. The dynamics of any “social machine” are highly complex, and dozens of academic papers, from multiple disciplines, have been written about it; en.wikipedia.org/wiki/Wikipedia:Wikipedia_in_academic_studies uses Wikipedia itself to maintain an up-to-date reference list.

The idea of a social machine was introduced in Weaving the Web [8], which hypothesized that the architectural design of the Web would allow developers, and thus end users, to use computer technology to help provide the management function for social systems as they were realized online. The social machine includes the underlying technology (MediaWiki in the case of Wikipedia) but also the rules, policies, and organizational structures used to manage the technology. Examples abound on the Web today. Consider the coupling of the application design of blogging-support systems (such as LiveJournal and WordPress) with the social mechanisms provided by blogrolls, permalinks, and trackbacks that have led to the so-called blogosphere. Similarly, the protocols used by social networking sites like MySpace and Facebook have much in common, but the success or failure of the sites hinges on the rules, policies, and user communities they support. Given that the success or failure of Web technologies often seems to rely on these social features, the ability to engineer successful applications requires a better understanding of the features and functions of the social aspects of the systems [b].

Today’s interactive applications are very early social machines, limited by the fact that they are largely isolated from one another. We hypothesize that (i) there are forms of social machine that will someday be significantly more effective than those we have today; (ii) that different social processes interlink in society and therefore must be interlinked on the Web; and (iii) that they are unlikely to be developed through a single deliberate effort in a single proj-

b. When we say “success” or “failure,” we are referring not to the business factors that determine whether, for example, Facebook or MySpace will attract more users but to the success or failure of the sites to provide the particular types of social interaction for which they are designed.

Figure 3: Results of an analysis of the link structure of Wikipedia with respect to the use of link labels, not the linguistic content of pages. [Log-log plot of the predicate occurrence distribution: P(k), from 1e-05 to 1, against k, from 1 to 1e+07.]
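The kind of link-label analysis summarized in Figure 3 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure actually used to produce the figure: the triples are invented toy data, the function name is hypothetical, and P(k) is taken here to mean the fraction of distinct link labels (predicates) that occur exactly k times.

```python
from collections import Counter


def predicate_occurrence_distribution(triples):
    """Return {k: P(k)}: the share of distinct predicates used exactly k times."""
    # How many times does each link label (predicate) occur?
    uses = Counter(predicate for _subject, predicate, _object in triples)
    # How many predicates share each occurrence count k?
    predicates_with_count = Counter(uses.values())
    total = len(uses)
    return {k: n / total for k, n in sorted(predicates_with_count.items())}


# Toy RDF-style (subject, predicate, object) triples, invented for illustration.
triples = [
    ("Paris", "capitalOf", "France"),
    ("Berlin", "capitalOf", "Germany"),
    ("Rome", "capitalOf", "Italy"),
    ("Paris", "locatedIn", "Europe"),
    ("Seine", "flowsThrough", "Paris"),
    ("France", "currency", "Euro"),
]

distribution = predicate_occurrence_distribution(triples)
for k, p in distribution.items():
    print(f"k={k}  P(k)={p:.2f}")
```

On a real dump, the (k, P(k)) pairs would be plotted on log-log axes; a roughly straight, downward-sloping line is what a Zipf-like distribution of link labels would look like.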


ect or site; rather, technology is needed to allow user communities to construct, share, and adapt social machines so successful models evolve through trial, use, and refinement.

A number of research challenges and questions must be resolved before a new generation of interacting social machines can be created and evolved this way:
• What are the fundamental theoretical properties of social machines, and what kinds of algorithms are needed to create them?;
• What underlying architectural principles are needed to guide the design and efficient engineering of new Web infrastructure components for this social software?;
• How can we extend the current Web infrastructure to provide mechanisms that make the social properties of information-sharing explicit and guarantee that the use of this information conforms to relevant social-policy expectations?; and
• How do cultural differences affect the development and use of social mechanisms on the Web? As the Web is indeed worldwide, the properties desired by one culture may be seen as counterproductive by others. Can Web infrastructure help bridge cultural divides and/or increase cross-cultural understanding?

In addition, a crucial aspect of human interaction with information is our ability to represent and reason over such attributes as trustworthiness, reliability, and tacit expectations about the use of information, as well as about privacy, copyright, and other legal rules. While some of this information is available on the Web today, we lack structures for formally representing and computing over them. Traditional cryptographic security research and well-known access-control-policy frameworks have failed to meet these challenges in today’s online environment and are thus insufficient as a foundation for the social machines of the future. Recent work on formal models for privacyb has demonstrated that traditional cryptographic approaches to privacy protection can fail in open Web environments. Similar problems with copyright enforcement have also hampered the flow of commercial and scholarly information on the Web [27]. To this end, an exemplar Web science research area we are pursuing involves interdisciplinary research toward augmenting Web architecture with technical and social conventions that increase individual accountability to social and legal rules governing information use [31]. Continued failure to develop scalable models for handling policy will impede the ability of the Web to be the best possible medium for exchanging cultural, scientific, and political information.

Further, we can see from the dramatic growth of new collaborative styles of creating and publishing information on the Web that many of the social institutions we rely on to judge trustworthiness and veracity are missing from our online information life. Being able to engineer the Web of the future requires not only understanding it as a computational structure but also how it interacts with and supports interaction among its users.

An important aspect of research exploring the influence of the Web on society involves online societies using Web infrastructure to support dynamic human interaction. This work (seen in trout.cpsr.org and other such efforts) explores how the Web can encourage more human engagement in the political sphere. Combining it with the emerging study of the Web and the coevolution of technology and social needs is an important focus of designing the future Web [30].

The Web of Data

This emerging area of study involves the heavy use of tagging provided by many of what are known as Web 2.0 technologies. Articles, blogs, photos, videos, and all manner of other Web resources may be annotated with user-generated keywords, or tags, that can later be used for searching or browsing these resources. Much has been made of how “folksonomies,” or taxonomies that emerge through the use of tags, can be used as metadata to help explain the content of the objects being described.

One aspect of tagging generating interest today is the need for “social context” in tagging [26]. Many tags involve terms that are extremely ambiguous in a general context. For example, first names are popular tags on Flickr,


though they are not good general search terms. On the other hand, in a specific social context (such as a particular person’s photos), the same tag can be useful since it can designate a particular individual. The use of a tag as metadata often depends on such a context, and the “network effect” in these sites is thus socially organized.19

A more ambitious use of metadata involves recent applications of semantic Web technologies7 and represents an important paradigm shift that is a significant element of emerging Web technologies. The semantic Web represents a new level of abstraction from the underlying network infrastructure, as the Internet and Web did earlier. The Internet allowed programmers to create programs that could communicate without concern for the network of cables through which the communication had to flow. The Web allows programmers and users to work with a set of interconnected documents without concern for the details of the computers storing and exchanging them.

The semantic Web will allow programmers and users alike to refer to real-world objects (people, chemicals, agreements, stars, whatever) without concern for the underlying documents in which these things, abstract and concrete, are described. While basic semantic Web technologies have been defined and are being deployed more widely, little work has sought to explain the effect of these new capabilities on the connections within the Web of people who use them.28

The semantic Web arena reflects two principal nexuses of activity. One centers on data (and the Web), the other on the domain (and semantics). The first, based largely on innovation in data-integration applications, focuses on developing Web applications that employ only limited semantics but provide a powerful mechanism for linking data entities using the URIs that are the basis of the Web. Powered by the Resource Description Framework (RDF), these applications focus largely on querying graph-oriented triple-store databases using the emerging SPARQL language, which helps create Web applications and portals that use REST-based models, integrating data from multiple sources without preexisting schema. The second, based largely on the Web Ontology Language, or OWL, looks to provide models that can be used to represent expressive semantic descriptions of application domains and provide inferencing power for both Web and non-Web applications that need a knowledge base.

Current research is exploring how the databases of the semantic Web relate to traditional database approaches and how to scale semantic Web stores to very large sizes.1 In terms of modeling, one goal is to develop tools to speed inference in large knowledge bases (without sacrificing performance), including how to exploit trade-offs between expressivity and reasoning to provide the capabilities needed for Web scale.15 A market is beginning to emerge for “bottom-up” tools driven by data and “top-down” technologies driven by Web ontologies. Creating back-ends for the semantic Web is being transitioned (bottom-up) from an arcane art into an emerging Web application programming approach, as new open-source technologies integrate well with traditional Web servers. At the same time, new tools support ontology development and deployment (top-down), and tens of thousands of OWL ontologies are available for jumpstarting new domain-modeling efforts. In addition, approaches using rule-based reasoning modified for the Web have also gained attention.4 Engineering the future Web includes the design and use of these emerging technologies, along with how they differ from traditional approaches to databases, in one case creating back-ends for the semantic Web, in the other new tools for ontology-based applications.

The semantic Web is a key emerging technology on the Web, but, as we’ve discussed, opinions differ as to what it is best for and, more important, what the macro effects might be. Our lack of a better understanding of how Web systems develop makes it difficult for us to know the kinds of effects the technology will produce at scale. What social consequences might there be from greater public exposure and the sharing of information hidden away in databases? A better understanding of how Web systems move from the micro to the macro scale would provide a better understanding of how they could be developed and what their potential societal effects might be.

Conclusion
The Web is different from most previously studied systems in that it is changing at a rate that may be of the same order as, or perhaps greater than, even the most knowledgeable researcher’s ability to observe it. An unavoidable fact is that the future of human society is now inextricably linked to the future of the Web. We therefore have a duty to ensure that future Web development makes the world a better place. Corporations have a responsibility to ensure that the products and services they develop on the Web don’t produce side effects that harm society, and governments and regulators have a responsibility to understand and anticipate the consequences of the laws and policies they enact and enforce.

We cannot achieve these aims until we better understand the complex, cross-disciplinary dynamics driving development on the Web, which is the main aim of Web science. Just as climate-change scientists have had to develop ways to gather and analyze evidence to prove or disprove theories about the effect of human behavior on the Earth’s climate, Web scientists need new methodologies for gathering evidence and finding ways to anticipate how human behavior will affect development of a system that is evolving at such an amazing rate. We also must consider what would happen to society if access to the Web were denied to some or all, and we must raise awareness among major corporations and governments that the consequences of what appear to be relatively small decisions can profoundly affect society in the future by affecting Web development today.

Computing plays a crucial role in the Web science vision, and much of what we know about the Web today is based on our understanding of it in a computational way. However, as we’ve explored here, significant research must still be done to be able to engineer future successful Web applications. We must understand the Web as a dynamic and changing entity, exploring the emergent behaviors that arise from the “macro”


interactions of people enabled by the Web’s technology base. We must therefore understand the “social machines” that may be the critical difference between the success or failure of Web applications and learn to build them in a way that allows interlinking and sharing.

Acknowledgments
Figure 2 is taken from talks Tim Berners-Lee gave in 2007 (www.w3.org/2007/Talks/1018-websci-mit-tbl/Overview.html). We also thank the other members of the WSRI Scientific Council (webscience.org/about/people/) for input relating to the goals of Web science and the interaction of the Web and computer and information sciences. We are indebted to Konstantin Mertsalov of Rensselaer Polytechnic Institute for the DBpedia analysis discussed in the section on power laws.

References
1. Abadi, D., Marcus, A., Madden, S., and Hollenbach, K. Scalable semantic Web data management using vertical partitioning. In Proceedings of the 33rd International Conference on Very Large Data Bases (Vienna, Austria, Sept. 23–27). VLDB Endowment, Heidelberg, 2007.
2. Backstrom, L., Dwork, C., and Kleinberg, J. Wherefore art thou R3579X? Anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th International World Wide Web Conference (Banff, Alberta, Canada, May 8–12). ACM Press, New York, 2007.
3. Barabasi, A. and Albert, R. Emergence of scaling in random networks. Science 286 (1999).
4. Berners-Lee, T., Connolly, D., Kagal, L., Scharf, Y., and Hendler, J. N3Logic: A logical framework for the World Wide Web. Theory and Practice of Logic Programming (2008).
5. Berners-Lee, T., Hall, W., Hendler, J., Shadbolt, N., and Weitzner, D. Creating a science of the Web. Science 311 (2006).
6. Berners-Lee, T., Hall, W., Hendler, J., O’Hara, K., Shadbolt, N., and Weitzner, D. A framework for Web science. Foundations and Trends in Web Science 1, 1 (Sept. 2006).
7. Berners-Lee, T., Hendler, J., and Lassila, O. The semantic Web. Scientific American (May 2001).
8. Berners-Lee, T. and Fischetti, M. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web. Harper Collins, New York, 1999.
9. Brin, S. and Page, L. The anatomy of a large-scale hypertextual Web search engine. Presented at the Sixth International World Wide Web Conference (Santa Clara, CA, Apr. 7–11, 1997).
10. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. Graph structure in the Web. In Proceedings of the Ninth International World Wide Web Conference (Amsterdam, The Netherlands, May 15–19). Elsevier, Amsterdam, The Netherlands, 2000.
11. Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth Symposium on Operating System Design and Implementation (San Francisco, Dec. 6–8). USENIX Association, Berkeley, CA, 2004.
12. Dill, S., Kumar, R., McCurley, K., Rajagopalan, S., Sivakumar, D., and Tomkins, A. Self-similarity in the Web. In Proceedings of the 27th International Conference on Very Large Data Bases (Rome, Italy, Sept. 11–14). Morgan Kaufmann Publishers, Inc., San Francisco, 2001.
13. Domingos, P., Golbeck, J., Mika, P., and Nowak, A. Social networks and intelligent systems. IEEE Intelligent Systems, Trends & Controversies 20, 1 (Jan./Feb. 2005).
14. Donato, D., Laura, L., Leonardi, S., and Millozzi, S. The Web as a graph: How far we are. ACM Transactions on Internet Technology 7, 1 (Feb. 2007).
15. Fokoue, A., Kershenbaum, A., Ma, L., Schonberg, E., and Srinivas, K. The Summary Abox: Cutting ontologies down to size. In Proceedings of the International Semantic Web Conference (Athens, GA, Nov. 5–9). Springer Berlin, Heidelberg, 2006.
16. Golder, S. and Huberman, B. The Structure of Collaborative Tagging Systems (2005); arxiv.org/abs/cs/0508082.
17. Gulli, A. and Signorini, A. The indexable Web is more than 11.5 billion pages. In the special-interest tracks and posters of the 14th International World Wide Web Conference (Chiba, Japan, May 10–14). ACM Press, New York, 2005.
18. Hendler, J. Web 3.0: Semantic Web chicken farms. IEEE Computer 41, 1 (Jan. 2008).
19. Hendler, J. and Golbeck, J. Metcalfe’s Law, Web 2.0, and the semantic Web. Journal of Web Semantics 6, 1 (Feb. 2008).
20. Huberman, B. and Lukose, R. Social dilemmas and Internet congestion. Science 277, 5325 (July 1997).
21. Jacobs, I. and Walsh, N. Architecture of the World Wide Web, Vol. One. W3C Recommendation, Dec. 15, 2004; www.w3.org/TR/webarch/.
22. Kleinberg, J., Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A. The Web as a graph: Measurements, models, and methods. In Proceedings of the Fifth Annual International Conference on Computing and Combinatorics (Tokyo, July 26–28). Springer, New York, 1999.
23. Kleinberg, J. Authoritative sources in a hyperlinked environment. Journal of the ACM 46, 5 (Sept. 1997).
24. Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A. Trawling the Web for emerging cyber communities. In Proceedings of the Eighth International World Wide Web Conference (Toronto, May 11–14). Elsevier North-Holland, Inc., New York, 1999.
25. Manber, U. Why Search Is a Hard Problem. Presentation at Supernova 2007 (San Francisco, June 16–18, 2008); www.readwriteweb.com/archives/udi_manber_search_is_a_hard_problem.php
26. Marcus, A. and Perez, A. m-YouTube mobile UI: Video selection based on social influence. In Proceedings of the 12th International HCI Conference (Beijing, July 22–27). Springer, 2007.
27. Samuelson, P. Copyright’s fair use doctrine and digital data. Commun. ACM 37, 1 (Jan. 1994), 21–27.
28. Shadbolt, N., Hall, W., and Berners-Lee, T. The semantic Web revisited. IEEE Intelligent Systems 21, 3 (May/June 2006).
29. Shirky, C. Power Laws, Weblogs, and Inequality. In Clay Shirky’s blog (2003); www.shirky.com/writings/powerlaw_weblog.html.
30. Shneiderman, B. Web science: A provocative invitation to computer science. Commun. ACM 50, 6 (June 2007), 25–27.
31. Weitzner, D., Abelson, H., Berners-Lee, T., Feigenbaum, J., Hendler, J., and Sussman, G. Information accountability. Commun. ACM 51, 6 (June 2008).
32. Weitzner, D., Hendler, J., Berners-Lee, T., and Connolly, D. Creating a policy-aware Web: Discretionary, rule-based access for the World Wide Web. In Web and Information Security, E. Ferrari and B. Thuraisingham, Eds. IRM Press, Hershey, PA, 2006.

Funding for this work comes from the U.S. National Science Foundation (Policy Aware Web and Transparency Aware Data Mining Projects), iARPA (End-to-End Semantic Accountability), the U.K. Engineering and Physical Sciences Research Council (Advanced Knowledge Technologies Project), and the U.S. Army Research Laboratory and U.K. Ministry of Defence (U.S./U.K. Information Technology Alliance). We also thank industrial and individual donors to the authors’ research at RPI, Southampton, and MIT and to the Web Science Research Initiative (www.webscience.org).

James Hendler (hendler@cs.rpi.edu) is the Tetherless World Chair of Computer and Cognitive Science at Rensselaer Polytechnic Institute, Troy, NY.

Nigel Shadbolt (nrs@ecs.soton.ac.uk) is professor of artificial intelligence and deputy head of the School of Electronics and Computer Science at Southampton University, Southampton, U.K.

Wendy Hall (wh@ecs.soton.ac.uk) is a professor of computer science at the University of Southampton, Southampton, U.K.

Tim Berners-Lee (timbl@csail.mit.edu) is the Director of the World Wide Web Consortium and holds the 3Com Founders chair and is a senior research scientist in the Laboratory for Computer Science and Artificial Intelligence at the Massachusetts Institute of Technology, Cambridge, MA.

Daniel Weitzner (djweitzner@csail.mit.edu) is director of the Massachusetts Institute of Technology Decentralized Information Group and principal research scientist in the MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA.
