
Internet Research: What’s hot in Search, Advertising & Cloud Computing

Rajeev Rastogi
Yahoo! Labs Bangalore
The most visited site on the internet
• 600 million+ users per month
• Super popular properties
  – News, finance, sports
  – Answers, flickr, del.icio.us
  – Mail, messaging
  – Search
Unparalleled scale
• 25 terabytes of data collected each day
  – Over 4 billion clicks every day
  – Over 4 billion emails per day
  – Over 6 billion instant messages per day
• Over 20 billion web documents indexed
• Over 4 billion images searchable

No other company on the planet processes as much data as we do!
Yahoo! Labs Bangalore
• Focus is on basic and applied research
  – Search
  – Advertising
  – Cloud computing
• University relations
  – Faculty research grants
  – Summer internships
  – Sharing data/computing infrastructure
  – Conference sponsorships
  – PhD co-op program
Web Search
What does search look like today?
Search results of the future: Structured abstracts
[Screenshots: structured abstracts for results from yelp.com, Gawker, babycenter, New York Times, epicurious, LinkedIn, answers.com, webmd]
Search results of the future: Query refinement

Search results of the future: Rich media
Technologies that are enabling search transformation
• Information extraction (structured abstracts)
• Web page classification (query refinement)
• Multimedia search (rich media)
Information extraction (IE)
• Goal: Extract structured records from Web pages
[Screenshot: a business listing page with extracted fields Name, Category, Address, Map, Phone, Price, Reviews]
Multiple verticals
• Business, social networking, video, …
• One schema per vertical
[Screenshots of example schemas, e.g., business: Name, Category, Address, Phone, Price; video: Title, Price, Posted by, Date, Rating, Views; social networking: Title, Education, Connections]
IE on the Web is a hard problem
• Web pages are noisy
• Pages belonging to different Web sites have different layouts

Web page types
• Template-based
• Hand-crafted
Template-based pages
• Pages within a Web site are generated using scripts and have very similar structure
  – Can be leveraged for extraction
• ~30% of crawled Web pages
• Information rich, frequently appear in the top results of search queries
• E.g. search query: “Chinese Mirch New York”
  – 9 template-based pages in the top 10 results
Wrapper Induction
• Enables extraction from template-based pages
[Diagram: annotate sample pages from a Website -> learn wrappers (XPath rules) -> apply wrappers to the Website’s pages -> extract records]
Example
• Generalize the XPath /html/body/div/div/div/div/div/div/span to /html/body//div//span
Filters
• Apply filters to prune among multiple candidates that match the XPath expression
• XPath: /html/body//div//span
• Regex filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
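A minimal sketch (not the production system) of applying such a wrapper, assuming Python with lxml: the generalized XPath selects candidate nodes, and the regex filter keeps only phone-like values.

```python
import re
from lxml import html

PHONE_XPATH = "/html/body//div//span"             # generalized XPath from the example
PHONE_RE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")   # regex filter for phone numbers

def extract_phone(page_source):
    tree = html.fromstring(page_source)
    # The XPath may match many spans; the filter prunes non-phone candidates.
    candidates = (node.text_content().strip() for node in tree.xpath(PHONE_XPATH))
    return [c for c in candidates if PHONE_RE.fullmatch(c)]

page = """<html><body><div><span>Chinese Mirch</span>
          <div><span>(212) 532-3663</span></div></div></body></html>"""
print(extract_phone(page))    # ['(212) 532-3663']
```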
Limitations of wrappers
• Won’t work across Web sites due to different page layouts
• Scaling to thousands of sites can be a challenge
  – Need to learn a separate wrapper for each site
  – Annotating example pages from thousands of sites can be time-consuming & expensive
Research challenge
• Unsupervised IE: Extract attribute values from the pages of a new Web site without annotating a single page from the site
• Only annotate pages from a few sites initially as training data
Conditional Random Fields (CRFs)
• Models the conditional probability distribution of a label sequence y = y_1, …, y_n given an input sequence x = x_1, …, x_n:

  P(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{|x|} \exp\Big( \sum_k \lambda_k f_k(y_t, y_{t-1}, x, t) \Big)

  – f_k: features; λ_k: weights
• Choose λ_k to maximize the log-likelihood of the training data
• Use the Viterbi algorithm to compute the label sequence y with the highest probability (a sketch follows)
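For the decoding step, a minimal Viterbi sketch in Python/numpy; it assumes the CRF’s weighted feature sums have already been folded into per-position emission scores and label-to-label transition scores (an illustration, not Yahoo!’s implementation):

```python
import numpy as np

def viterbi(emit, trans):
    """emit: (n, L) per-position label scores; trans: (L, L) transition scores.
    Returns the highest-scoring label sequence y_1..y_n (as label indices)."""
    n, L = emit.shape
    score = emit[0].copy()                 # best score of a path ending in label y at t=0
    back = np.zeros((n, L), dtype=int)     # backpointers for path recovery
    for t in range(1, n):
        cand = score[:, None] + trans + emit[t][None, :]   # cand[y_prev, y]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]           # best final label
    for t in range(n - 1, 0, -1):          # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```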
CRFs-based IE
• Web pages can be viewed as labeled sequences (e.g., Noise, Name, Category, Address, Phone)
• Train a CRF using pages from a few Web sites
• Then use the trained CRF to extract from the remaining sites
Drawbacks of CRFs
• Require too many training examples
• Have been used previously to segment short strings with similar structure
• However, may not work too well across Web sites that
  – contain long pages with lots of noise
  – have very different structure
An alternate approach that exploits site knowledge
• Build attribute classifiers for each attribute
  – Use pages from a few initial Web sites
• For each page from a new Web site
  – Segment the page into a sequence of fields (using static repeating text)
  – Use attribute classifiers to assign attribute labels to fields
• Use constraints to disambiguate labels (see the sketch after the example below)
  – Uniqueness: an attribute occurs at most once in a page
  – Proximity: attribute values appear close together in a page
  – Structural: relative positions of attributes are identical across pages of a Web site
Attribute classifiers + constraints example

Page 1: Chinese Mirch | Chinese, Indian | 120 Lexington Avenue, New York, NY 10016 | (212) 532 3663
        Name          | Category        | Address                                  | Phone

Page 2: Jewel of India | Indian   | 15 W 44th St, New York, NY 10016 | (212) 869 5544
        Name           | Category | Address                          | Phone

Page 3: 21 Club     | American       | 21 W 52nd St, New York, NY 10019 | (212) 582 7200
        Name, Noise | Category, Name | Address                          | Phone

Applying the uniqueness constraint (Name) and the precedence constraint (Name < Category) resolves Page 3:

Page 3: 21 Club | American | 21 W 52nd St, New York, NY 10019 | (212) 582 7200
        Name    | Category | Address                          | Phone
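A small brute-force sketch of the disambiguation step (assumed field and label sets from Page 3 above; a real system would also score assignments by classifier confidence):

```python
from itertools import product

# Candidate labels per field for Page 3, as produced by the attribute classifiers.
candidates = [{"Name", "Noise"}, {"Category", "Name"}, {"Address"}, {"Phone"}]
precedence = [("Name", "Category")]          # Name must precede Category

def consistent(assignment):
    attrs = [a for a in assignment if a != "Noise"]
    if len(attrs) != len(set(attrs)):        # uniqueness constraint
        return False
    for first, second in precedence:         # precedence constraint
        if first in attrs and second in attrs and attrs.index(first) > attrs.index(second):
            return False
    return True

survivors = [a for a in product(*candidates) if consistent(a)]
for s in survivors:
    print(s)
# ('Name', 'Category', 'Address', 'Phone') is among the survivors; in practice
# the surviving assignment with the highest classifier confidence is chosen.
```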
Other IE scenarios: Browse page extraction
[Screenshot: a browse page containing similar-structured records]
IE big picture/taxonomy
• Things to extract from
  – Template-based, browse, hand-crafted pages, text
• Things to extract
  – Records, tables, lists, named entities
• Techniques used
  – Structure-based (HTML tags, DOM tree paths) – e.g. wrappers
  – Content-based (attribute values/models) – e.g. dictionaries
  – Structure + content (sequential/hierarchical relationships among attribute values) – e.g. hierarchical CRFs
• Level of automation
  – Manual, supervised, unsupervised
Web Page Classification: Requirements
• Quality
  – High precision and recall
  – Leverage structured input (links, co-citations) and output (taxonomy)
• Scalability
  – Large numbers of training examples, features, and classes
  – Complex structured input and output
• Cost
  – Small human effort (for labeling of pages)
  – Compact classifier model
  – Low prediction time
Structured Output Learning
• Structured output examples
  – Multi-class
  – Taxonomy (e.g., Health → {Fitness, Medicine}; Sport → {Cricket → {One-day, Test}, Soccer})
• Naïve approach
  – Separate binary classifier per class
  – Separate classifier for each taxonomy level
• Better approach – single (SVM) classifier
  – Higher accuracy, more efficient
  – Sequential Dual Method (SDM)
    • Visit each example sequentially and solve the associated QP problem (in dual) efficiently
    • Order of magnitude faster
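A hedged sketch of the naive-vs-single contrast on toy data; scikit-learn’s LinearSVC stands in for the actual solver (SDM itself is a specialized sequential dual solver and is not shown):

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=50,
                           n_informative=10, n_classes=5, random_state=0)

# Naive approach: one binary classifier per class (one-vs-rest).
ovr = LinearSVC(multi_class="ovr").fit(X, y)
# Better approach: a single joint multiclass SVM (Crammer-Singer formulation).
single = LinearSVC(multi_class="crammer_singer").fit(X, y)
print(ovr.score(X, y), single.score(X, y))
```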
Classification With Relational Information
• Relational information
  – Web page links, structural similarity
• Graph representation
  – Pages as nodes (with labels)
  – Edge weights s(j,k): page similarity, co-citation/out-link existence, etc.
• Classification can be expressed as an optimization problem (see below)
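The objective itself did not survive extraction; a typical graph-regularization form (an assumption, not necessarily the slide’s exact formula) trades off agreement with per-page predictions against label smoothness over weighted edges:

```latex
\min_{\hat{y}} \; \sum_{j} \ell\left(\hat{y}_j,\, y_j\right)
\;+\; \lambda \sum_{(j,k)} s(j,k)\, \left(\hat{y}_j - \hat{y}_k\right)^2
```

Here \ell penalizes deviating from known or predicted labels, and the second term penalizes label disagreement between pages joined by high-weight edges.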
Multimedia Search
• Availability & consumption of multimedia content on the Internet is increasing
  – 500 billion images will be captured in 2010
• Leveraging content and metadata is important for MM search
• Some big technical challenges are:
  – Results diversity
  – Relevance
  – Image classification, e.g., pornography
Near-Duplicate Detection
• Multiple near-similar versions of an image exist on the internet
  – scaled, cropped, captioned, small scene change, etc.
• Near-duplicates adversely impact user experience
• Can we use a compact description and dedup in constant time?
• Fourier-Mellin Transform (FMT): translation, rotation, and scale invariant
• Signature generation using a small number of low-frequency coefficients
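A sketch of an FMT-style signature, assuming grayscale numpy images (an illustration of the idea, not the exact production scheme): the FFT magnitude removes translation, a log-polar resampling turns rotation and scale into shifts, and a second FFT magnitude removes those, leaving a few low-frequency coefficients as the signature.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def fmt_signature(img, n_coeffs=16):
    mag = np.abs(np.fft.fftshift(np.fft.fft2(img)))       # translation invariant
    h, w = mag.shape
    cy, cx = h / 2.0, w / 2.0
    n_r, n_theta = 64, 64
    r = np.exp(np.linspace(0, np.log(min(cy, cx)), n_r))  # log-spaced radii
    theta = np.linspace(0, 2 * np.pi, n_theta, endpoint=False)
    rows = cy + r[:, None] * np.sin(theta)[None, :]
    cols = cx + r[:, None] * np.cos(theta)[None, :]
    logpolar = map_coordinates(mag, [rows, cols], order=1)  # rotation/scale -> shifts
    spec = np.abs(np.fft.fft2(logpolar))                    # shift invariant
    return spec[:n_coeffs // 4, :n_coeffs // 4].ravel()     # low-frequency signature

def near_duplicate(a, b, tol=0.1):
    sa, sb = fmt_signature(a), fmt_signature(b)   # constant-size, constant-time compare
    return np.linalg.norm(sa - sb) / (np.linalg.norm(sa) + 1e-9) < tol
```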
Filtering noisy tags to improve relevance
• Measures such as IDF may assign high weights to noisy tags
  – Treat tag-sets as bag-of-words, a random collection of terms
• Boosting weights of tags based on their co-occurrence with other tags can filter out noise (a small sketch follows the table)

  IDF weight   Tag        Co-occur weight   Tag
  10.2765      hinduism   8.8989            child
  8.6589       hindu      8.8033            smile
  8.6259       finger     8.338             happy
  7.8524       kerala     7.982             mother
  7.3432       mother     6.0989            women
  6.7895       smile      4.8763            family
  6.6507       child      4.208             india
  6.576        women      2.9307            hinduism
  6.5535       point      2.8871            hindu
  6.4512       happy      2.8318            orange
  6.0312       orange     1.4355            kerala
  5.2129       india      0.2292            point
  4.312        family     0                 finger
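A minimal sketch, assuming the boost scores each tag by how strongly it co-occurs (corpus-wide) with the other tags on the same image; the exact weighting behind the table above is not specified on the slide.

```python
from collections import Counter
from itertools import combinations

def cooccur_counts(tag_sets):
    counts = Counter()
    for tags in tag_sets:                     # count corpus-wide tag pairs
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts

def boosted_weights(tags, counts):
    tags = sorted(set(tags))
    score = {}
    for t in tags:                            # average co-occurrence with the image's other tags
        others = [u for u in tags if u != t]
        score[t] = sum(counts[tuple(sorted((t, u)))] for u in others) / max(len(others), 1)
    return score                              # tags like "finger"/"point" score near 0

corpus = [["child", "smile", "mother"], ["child", "happy", "family"],
          ["finger", "hinduism"], ["child", "smile", "happy"]]
print(boosted_weights(["child", "smile", "finger"], cooccur_counts(corpus)))
# {'child': 1.0, 'finger': 0.0, 'smile': 1.0}
```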
Online Advertising

Sponsored search ads
[Screenshot: a search results page for a query, with a sponsored ad displayed alongside the results]
How it works
• Advertiser submits bids to the ad index, e.g., “I want to bid $5 on canon camera”, “I want to bid $2 on cannon camera”
• The sponsored search engine decides when/where to show the ad on the search results page
• Advertiser pays only if the user clicks on the ad
Ad selection criterion
• Problem: which ads to show from among the ads containing the keyword?
• Ads with the highest bid may not maximize revenue
• Choose ads with maximum expected revenue
  – Weigh bid amount with click probability

  Ad   Bid   Click Prob   Expected Revenue
  A1   $4    0.1          0.4
  A2   $2    0.7          1.4
  A3   $3    0.3          0.9
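The selection rule from the table, as a tiny sketch: rank by bid times estimated click probability rather than by bid alone.

```python
ads = [("A1", 4.0, 0.1), ("A2", 2.0, 0.7), ("A3", 3.0, 0.3)]  # (ad, bid $, click prob)
best = max(ads, key=lambda a: a[1] * a[2])                    # maximize expected revenue
print(best)   # ('A2', 2.0, 0.7): expected revenue 1.4 beats A1 (0.4) and A3 (0.9)
```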
Contextual Advertising
[Screenshot: a Web page with contextual ads alongside the content]
Contextual ads
• Similar to sponsored search, but now ads are shown on general Web pages as opposed to only search pages
  – Advertisers bid on keywords
  – Advertiser pays only if the user clicks; Y! & the publisher share the paid amount
  – Ad matching engine ranks ads based on expected revenue (bid amount * click probability)
Estimating click probability
• Use a logistic regression model:

  p(\text{click} \mid \text{ad}, \text{page}, \text{user}) = \frac{1}{1 + \exp\left( -\sum_i w_i f_i(\text{ad}, \text{page}, \text{user}) \right)}

  – f_i: i-th feature of (ad, page, user)
  – w_i: weight for feature f_i
• Training data: ad click logs (all clicks + non-click samples)
• Optimize log-likelihood to learn weights (a sketch follows)
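A hedged sketch of training such a click model: features of (ad, page, user) become a sparse vector, and scikit-learn’s LogisticRegression maximizes the (regularized) log-likelihood. Feature names here are illustrative, not the production feature set.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# One row per impression from the click logs; label 1 = click, 0 = no click.
events = [
    ({"ad_title:apple": 1, "page_body:ipod": 1, "user_age:20-30": 1}, 1),
    ({"ad_title:apple": 1, "page_body:finance": 1}, 0),
    ({"ad_title:camera": 1, "page_body:ipod": 1, "user_age:20-30": 1}, 0),
    ({"ad_title:apple": 1, "page_body:ipod": 1}, 1),
]
vec = DictVectorizer()
X = vec.fit_transform(f for f, _ in events)   # sparse feature matrix
y = [label for _, label in events]
model = LogisticRegression().fit(X, y)        # learns the weights w_i
print(model.predict_proba(X[:1])[0, 1])       # p(click) for the first impression
```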
Features
• Ad: bid terms, title, body, category, …
• Page: url, title, keywords in body, category, …
• User
  – Geographic (location, time)
  – Demographic (age, gender)
  – Behavioral
• Combine the above to get (billions of) richer features
  – E.g.: (apple ∈ ad title) ∧ (ipod ∈ page body) ∧ (20 < user age < 30)
• Select the subset that leads to improvement in likelihood
Banner ads
• Show Web page with display ads
[Screenshot: a Web page with a banner ad; creates brand awareness]
How it works
• Advertiser requests impressions from the banner ad engine, e.g., “I want 1M impressions on finance.yahoo.com, gender = male, age = 20-30, during the month of April 2009”
• Engine guarantees 1M impressions
• Advertiser pays a fixed price
  – No dependence on clicks
• Engine does admission control, decides allocation of ads to pages
Allocation Example
[Figure: a 2×2 supply grid over Gender (Male/Female) × Age (20-30, >30) with (Qty, Price) cells including (10M, $20), (10M, $10), (10M, $10), (6M, $10); demands: (Gender=Male, 12M) and (Age>30, 12M). A suboptimal allocation leaves unallocated inventory worth $60M; the optimal allocation leaves $120M]
Research problem
• Goal: Allocate demands so that the value of unallocated inventory is maximized
• Similar to the transportation problem
Transportation problem
[Figure: bipartite graph with demand nodes d_i on the left, supply nodes on the right with supply s_j and price p_j, and edge variables x_{ij} from each demand i to its eligible regions R_i]
• x_{ij}: units of demand i allocated to region j
• Demand constraints: \sum_{j \in R_i} x_{ij} \ge d_i for each demand i
• Supply constraints: \sum_i x_{ij} \le s_j for each region j
• Objective: maximize \sum_j p_j \big( s_j - \sum_i x_{ij} \big)
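A sketch of this LP with scipy (illustrative data, not the slide’s exact numbers): maximizing the leftover value is equivalent to minimizing the value of allocated inventory.

```python
import numpy as np
from scipy.optimize import linprog

s = np.array([10.0, 10.0, 6.0])    # supply s_j per region
p = np.array([10.0, 20.0, 10.0])   # price p_j per unit in region j
d = np.array([12.0, 4.0])          # demand quantities d_i
R = [[0, 1], [1, 2]]               # eligible regions R_i for each demand i

n_i, n_j = len(d), len(s)
c = np.tile(p, n_i)                         # minimize sum_ij p_j * x_ij
A_ub, b_ub = [], []
for i in range(n_i):                        # -sum_{j in R_i} x_ij <= -d_i
    row = np.zeros(n_i * n_j)
    for j in R[i]:
        row[i * n_j + j] = -1.0
    A_ub.append(row); b_ub.append(-d[i])
for j in range(n_j):                        # sum_i x_ij <= s_j
    row = np.zeros(n_i * n_j)
    row[j::n_j] = 1.0
    A_ub.append(row); b_ub.append(s[j])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
x = res.x.reshape(n_i, n_j)
print(x, (p * (s - x.sum(axis=0))).sum())   # allocation and leftover inventory value
```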
Ads taxonomy

               Sponsored search   Contextual   Banner (G)   Banner (NG)
  Shown on:    Search pages       Web pages    Web pages    Web pages
  Targeting:   Keywords           Keywords     Attributes   Attributes
  Guarantees:  NG                 NG           G            NG
  Model:       CPC                CPC          CPM          CPM/CPC
Major trend: Ads convergence
• Today: separate systems for contextual (CPC) & display (CPM)
• Tomorrow: a unified ads marketplace (Y! Ad Exchange, CPC & CPM)
  – Unify contextual & display
  – Increase supply & demand
  – Enable better matching
  – CPC, CPM ads compete
• Publisher creates the supply of pages; advertiser creates the demand
Research challenge
• Which ad to select between competing CPC, CPM ads?
  – Use eCPM
• For CPM ads: eCPM = bid
• For CPC ads: eCPM = bid * Pr(click)
  – Select the ad with max eCPM to maximize revenue
• Problem: the ad with the highest actual eCPM may not get selected (see the simulation below)
  – eCPMs are “estimated” based on historical data, which can differ from actual eCPMs
  – Variance in estimated eCPMs is higher for CPC ads
  – Selection gets biased towards ads with higher variance, as they have a higher probability of over-estimated eCPMs
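A small simulation of this bias (illustrative numbers): the CPC ad’s true eCPM is lower, but its noisier estimate wins the max-eCPM comparison far more often than it should, dragging down realized revenue.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 100_000
cpm_true, cpc_true = 1.00, 0.90                    # the CPM ad is actually better
cpm_est = cpm_true + rng.normal(0, 0.05, trials)   # low-variance estimate
cpc_est = cpc_true + rng.normal(0, 0.30, trials)   # high-variance estimate

pick_cpc = cpc_est > cpm_est                       # max-eCPM selection rule
realized = np.where(pick_cpc, cpc_true, cpm_true)  # actual eCPM of the winner
print(pick_cpc.mean())    # ~0.37: the worse ad wins over a third of the time
print(realized.mean())    # ~0.96 < 1.00: revenue lost to estimation variance
```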
Cloud Computing

Much of the stuff we do is compute/data-intensive
• Search
  – Index 100+ billion crawled Web pages
  – Build the Web graph, compute PageRank
• Advertising
  – Construct ML models to predict click probability
• Cluster, classify Web pages
  – Improve search relevance, ad matching
• Data mining
  – Analyze TBs of Web logs to compute correlations between (billions of) user profiles and page views
Solution: Cloud computing
• A cloud consists of
  – 1000s of commodity machines (e.g., Linux PCs)
  – Software layer for
    • Distributing data across machines
    • Parallelizing application execution across the cluster
    • Detecting and recovering from failures
  – Yahoo!’s software layer is based on the open-source Hadoop
Cloud computing benefits
• Enables processing of massive compute-intensive tasks
• Reduces computing and storage costs
  – Resource sharing leads to efficient utilization
  – Commodity hardware, open source
• Shields application developers from the complexity of building reliability and scalability into their programs
  – In large clusters, machines fail every day
  – Parallel programming is hard
Cloud computing at Yahoo!
• 10,000s of nodes running Hadoop, TBs of RAM, PBs of disk
• Multiple clusters, the largest a 1600-node cluster
Hadoop’s Map/Reduce Framework
• Framework for parallel computation over massive data sets on large clusters
• As an example, consider the problem of creating an index for word search
  – Input: thousands of documents/web pages (e.g., document 1: “Farmer1 has the following animals: bees, cows, goats. Some other animals …”)
  – Output: a mapping of word to document IDs (e.g., Animals: 1, 2, 3, 4, 12; Bees: 1, 2, 23, 34; Dog: 3, 9; Farmer1: 1, 7)

Hadoop’s Map/Reduce: Index example (contd.)
[Figure: input splits on Machines 1-3 feed Map tasks; their sorted intermediate output (e.g., Animals: 1,3 / Animals: 2,12 / Bees: 23 / Dog: 3 / Dog: 9 / Farmer1: 7) is shuffled by key to Reduce tasks on Machines 4-5, which merge it into final posting lists (Animals: 1,2,3,12; Bees: 23; Dog: 3,9; Farmer1: 7)]
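A pure-Python sketch of the same index example (no Hadoop needed to see the idea): map emits (word, doc_id) pairs, shuffle groups the intermediate output by key, and reduce merges each group into a posting list.

```python
from collections import defaultdict

docs = {1: "farmer1 has the following animals bees cows goats",
        3: "some other animals dog",
        9: "dog",
        7: "farmer1"}

def map_phase(doc_id, text):
    for word in set(text.split()):        # emit (word, doc_id) for each distinct word
        yield word, doc_id

def shuffle(pairs):                       # group intermediate output by key
    groups = defaultdict(list)
    for word, doc_id in pairs:
        groups[word].append(doc_id)
    return groups

def reduce_phase(word, doc_ids):          # merge into a sorted posting list
    return word, sorted(doc_ids)

pairs = [kv for d, text in docs.items() for kv in map_phase(d, text)]
index = dict(reduce_phase(w, ids) for w, ids in shuffle(pairs).items())
print(index["animals"], index["dog"], index["farmer1"])   # [1, 3] [3, 9] [1, 7]
```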
Research challenges

Data Distribution and Replication
• Data blocks for a given job are distributed and replicated across nodes within a rack and across racks
• Challenges:
  – Optimize distribution to provide maximum locality
  – Optimize replication to provide best fault tolerance

Job Scheduling
• Job queues based on priorities and SLAs (e.g., SDS: Q1, 40%; YST: Q2, 35%; ATG: Qm, 25%)
• Challenges:
  – Schedule jobs to maximize resource utilization while preserving SLAs
  – Schedule jobs to maximize data locality
  – Performance modeling
Summary
• The Internet is an exciting place; plenty of research is needed to improve
  – User experience
  – Monetization
  – Scalability
• Search -> information extraction, classification, …
• Advertising -> click prediction, ad placement, …
• Cloud computing -> job scheduling, performance modeling, …
• Solving these problems will require techniques from multiple disciplines: ML, statistics, economics, algorithms, systems, …
