
XML: An introduction

Overview

- HTML Model
- XML Model
- RDBMS vs. XML
- XML Schema
- XML Tools
- …

What is a Data Model?
- Structure of data
- Mathematical representation of data.
- Examples: relational model = tables; semi-structured model = trees/graphs.
- Operations on data.
- Constraints.

History of HTML
- HTML: Hyper-Text Markup Language
  - Invented by Tim Berners-Lee and Robert Cailliau at CERN in 1991
- What is hyper-text?
  - A document that contains links to other documents (and text, sound, images...)
  - Invented around 1945 by Vannevar Bush
- What is a markup language?
  - A notation for writing text with markup tags (<tag>)
  - Tags indicate the structure of the text
  - Tags have names and attributes
  - Tags may enclose a part of the text
  - Invented around 1970 by Charles F. Goldfarb (SGML)

HTML
- HTML was designed to display data and to focus on how data looks.

XML
- XML is a framework for defining markup languages:
  - XML was designed to describe data, not its formatting.
  - There is no fixed collection of markup tags. You define your own tags, tailored to your kind of information.
  - This allows tailor-made markup for any imaginable application domain.
  - XML uses a schema language (e.g., DTD, XML Schema) to formally describe the data.
- XML separates syntax from semantics to provide a common framework for structuring information.
  - Web browser rendering semantics are defined separately by stylesheets.
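
As a minimal sketch of what "define your own tags" means in practice, the Python snippet below builds a tiny document with the standard-library ElementTree module. The tag and attribute names (`video`, `vid`, `category`, `rating`) are assumptions chosen for illustration; XML itself prescribes none of them.

```python
import xml.etree.ElementTree as ET

# Tag and attribute names are invented for this example; XML fixes no vocabulary.
video = ET.Element("video", {"vid": "100"})
ET.SubElement(video, "category").text = "Comedy"
ET.SubElement(video, "rating").text = "5"

# Serialize the tree back to markup text.
xml_text = ET.tostring(video, encoding="unicode")
print(xml_text)
```

The markup describes what the data is (a video with a category and a rating) and says nothing about how it should look on screen.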

XML
- XML is not a replacement for HTML:
  - HTML should ideally be just another XML language
  - In fact, XHTML is just that
  - XHTML is a (very popular) XML language for hypertext markup
- HTML is about displaying information, but XML is about describing information.

HTML vs. XML

HTML:
<center>
  <h1>SIGMOD</h1>
  <p><b>
    <u>ACM</u> <a href="sigmod02.org">SIGMOD Conference</a>,
    Madison, WI, 2002
  </b></p>
</center>

XML:
<event eID="sigmod02">
  <acronym>SIGMOD</acronym>
  <society>ACM</society>
  <url>www.sigmod02.org</url>
  <loc>
    <city>Madison</city>
    <state>WI</state>
  </loc>
  <year>2002</year>
</event>
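
Because the XML version labels each piece of data, a program can pick fields out by name. The sketch below uses Python's standard ElementTree module on the event document from this slide; the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

EVENT_XML = """<event eID="sigmod02">
  <acronym>SIGMOD</acronym>
  <society>ACM</society>
  <url>www.sigmod02.org</url>
  <loc>
    <city>Madison</city>
    <state>WI</state>
  </loc>
  <year>2002</year>
</event>"""

event = ET.fromstring(EVENT_XML)

# Fields are addressed by tag name and path, not by position on the page.
event_id = event.get("eID")
city = event.findtext("loc/city")
year = int(event.findtext("year"))
print(event_id, city, year)
```

Doing the same against the HTML version would mean guessing that the text after the link happens to contain a city, a state, and a year.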

HTML vs. XML
- XML needs a stylesheet to define browser presentation semantics

[Diagram: HTML -> Browser; XML + CSS/XSL -> Browser]

[Diagram: with HTML, data and format reach the browser together as one input; with XML, data and format are separate inputs to the browser]
Database Perspective
- A DB must support
  - Capture
  - Storage
  - Retrieval
  - Exchange
- XML was originally seen as the language for exchanging data over the Web
  - Replacing EDI (Electronic Data Interchange)

RDBMS vs. XML Example
- AFV receives 100+ videos every week
- Build a DB that can answer the following queries:
  - Who sent which videos?
  - Show me all videos in the Cat category
  - How many videos has the database received since Jan 1, 2003?
  - Which video has the best rating for the 1st week of Jan?
  - Where does the sender James live? Phone? Gender?
  - How many videos has James sent so far?
  - Show me all the ghost videos (ones without sender information)

ER Model

[ER diagram: entity Videos (VID, Category, Date, Rating) and entity Owners (Name, Phone, Gender, Address), connected by the relationship Sends]

ER => RDBMS

[Diagram: the same ER model, with the Videos entity, the Sends relationship, and the Owners entity each mapped to a table of the same name]

RDBMS

Video
VID   Category   Date         Rating
100   Comedy     2005/1/1     5
200   Action     2005/1/10    4
300   SF         2004/12/31   5

Sends
Name    Phone      VID
Jenny   564-3456   100
Tom     123-4567   200
Tom     123-4567   300

Owners
Name    Phone      Address        Gender
Jenny   564-3456   1050 Harvard   F
Tom     123-4567   132 W 15th     M
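
To make the three tables concrete, here is a minimal sketch that loads them into an in-memory SQLite database and answers two of the example queries. The column types and the ISO date format are assumptions made for the example (the slides give no DDL), and the count query uses Tom because James does not appear in the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Video  (vid INTEGER PRIMARY KEY, category TEXT, date TEXT, rating INTEGER);
CREATE TABLE Owners (name TEXT, phone TEXT, address TEXT, gender TEXT);
CREATE TABLE Sends  (name TEXT, phone TEXT, vid INTEGER);
INSERT INTO Video  VALUES (100, 'Comedy', '2005-01-01', 5),
                          (200, 'Action', '2005-01-10', 4),
                          (300, 'SF',     '2004-12-31', 5);
INSERT INTO Owners VALUES ('Jenny', '564-3456', '1050 Harvard', 'F'),
                          ('Tom',   '123-4567', '132 W 15th',   'M');
INSERT INTO Sends  VALUES ('Jenny', '564-3456', 100),
                          ('Tom',   '123-4567', 200),
                          ('Tom',   '123-4567', 300);
""")

# "Who sent which videos?"
who_sent = conn.execute("SELECT name, vid FROM Sends ORDER BY vid").fetchall()

# "How many videos has Tom sent so far?"
tom_count = conn.execute(
    "SELECT COUNT(*) FROM Sends WHERE name = ?", ("Tom",)
).fetchone()[0]

print(who_sent, tom_count)
```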

Changes #1
- VHS video => VHS, CD, DVD
- 100+ videos => 1 million videos

RDBMS

Video
VID       Category   Format   Date         Rating
100       Comedy     VHS      2005/1/1     5
…
1000000   SF         DVD      2004/12/31   5

Sends
Name    Phone      VID
Jenny   564-3456   100
…
Tom     123-4567   1000000

Owners
Name    Phone      Address        Gender
Jenny   564-3456   1050 Harvard   F
Tom     123-4567   132 W 15th     M


Changes #2
- Arbitrary name formats for owners
  - E.g., J. Doe vs. Dr. “Jonny” John Jay Doe Jr
- 100+ different ways to capture owners’ information
  - “1781 Louisiana St #200, Lawrence, KS, 66046”
  - adr1=“1781 Louisiana St #200”, adr2=“Lawrence, KS, 66046”
  - street=“1781 Louisiana St #200”, city=“Lawrence”, state=“KS”, zip=“66046”
- 100+ different video formats with varying properties => 1000+ attributes for Videos

RDBMS: Finest Granularity

Video
VID   Category   Date         Rating   Att1   Att2   …   Att1000
100   Comedy     2005/1/1     5               10     …   T1
200   Action     2005/1/10    4        20            …
300   SF         2004/12/31   5                      …   S20

Sends
Prefix   NN      FN      MN    LN    Suffix   Phone      VID
                 Jenny                        564-3456   100
Dr.      Jonny   John    Jay   Doe   Jr       123-4567   200
Dr.      Jonny   John    Jay   Doe   Jr       123-4567   300

Owners
Prefix   NN      FN      MN    LN    Suffix   Phone      Street         City     State   Zip     Gender
                 Jenny                        564-3456   1050 Harvard   Denver   CO      66049   F
Dr.      Jonny   John    Jay   Doe   Jr       123-4567   10 S. Beaver            CO              M

RDBMS: Coarsest Granularity

Video
VID   Category   Date         Rating   Att1to1000
100   Comedy     2005/1/1     5        10, T1
200   Action     2005/1/10    4        20
300   SF         2004/12/31   5        S20

Sends
Name                          Phone      VID
Jenny                         564-3456   100
Dr. “Jonny” John Jay Doe Jr   123-4567   200
Dr. “Jonny” John Jay Doe Jr   123-4567   300

Owners
Name                          Phone      Address                          Gender
Jenny                         564-3456   1050 Harvard, Denver, CO 66049   F
Dr. “Jonny” John Jay Doe Jr   123-4567   1322 W 15th, CO                  M

RDBMS: Ideal Case

Video
VID   Category   Date         Rating   Att1   Att2   …   Att1000
100   Comedy     2005/1/1     5               10     …   T1
200   Action     2005/1/10    4        20            …
300   SF         2004/12/31   5                      …   S20

Sends (sub-fields inside one cell are separated by |)
Name                                  Phone      VID
Jenny                                 564-3456   100
Dr. | Jonny | John | Jay | Doe | Jr   123-4567   200
Dr. | Jonny | John | Jay | Doe | Jr   123-4567   300

Owners
Name                                  Phone      Address                              Gender
Jenny                                 564-3456   1050 Harvard | Denver | CO | 66049   F
Dr. | Jonny | John | Jay | Doe | Jr   123-4567   132 W 15th | KS                      M

Violation of 1NF: the Name and Address cells hold nested sub-fields rather than atomic values.

XML

Video
VID   Category   Date         Rating   Att1   Att2   …   Att1000
100   Comedy     2005/1/1     5               10     …   T1
200   Action     2005/1/10    4        20            …
300   SF         2004/12/31   5                      …   S20

<VideoTable>
  <Video vid="100" category="comedy" date="2005/1/1"
         rating="5" att2="10" att1000="T1" />
  <Video vid="200" category="action" date="2005/1/10"
         rating="4" att1="20" />
  <Video vid="300" category="SF" date="2004/12/31"
         rating="5" att1000="S20" />
</VideoTable>
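
The point of this slide is that each `<Video>` carries only the attributes it actually has, so the 1000-column table shrinks to a few attributes per row. A minimal Python sketch, using the standard ElementTree module on the document above, shows how a reader sees the missing ones:

```python
import xml.etree.ElementTree as ET

VIDEO_XML = """<VideoTable>
  <Video vid="100" category="comedy" date="2005/1/1" rating="5" att2="10" att1000="T1" />
  <Video vid="200" category="action" date="2005/1/10" rating="4" att1="20" />
  <Video vid="300" category="SF" date="2004/12/31" rating="5" att1000="S20" />
</VideoTable>"""

table = ET.fromstring(VIDEO_XML)

# get() returns None for an attribute that a given <Video> does not carry,
# so no placeholder columns are needed in the document itself.
att1000 = {v.get("vid"): v.get("att1000") for v in table.findall("Video")}
print(att1000)
```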

XML

[Diagram: two alternative Address trees. In one, Address has children Street, City, State, Zip; in the other, Address has children Adr1 and Adr2, and Adr2 has child State]

Owners (sub-fields inside one cell are separated by |)
Name                                  Phone      Address                                Gender
Jenny                                 564-3456   1050 Harvard | Lawrence | KS | 66049   F
Dr. | Jonny | John | Jay | Doe | Jr   123-4567   132 W 15th | KS                        M

<OwnerTable>
  <Owner phone="564-3456" gender="F">
    <Name FN="Jenny" />
    <Address>
      <Street>1050 Harvard</Street><City>Denver</City>
      <State>CO</State><Zip>66049</Zip>
    </Address>
  </Owner>
  <Owner phone="123-4567" gender="M">
    <Name Prefix="Dr." NN="Jonny" FN="John" MN="Jay" LN="Doe" Suffix="Jr." />
    <Address> <Adr1>1322 W 15th</Adr1> <Adr2><State>CO</State></Adr2> </Address>
  </Owner>
</OwnerTable>
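
The two owners use different address layouts, yet one query can still serve both. The following sketch (Python standard library; variable names are illustrative assumptions) looks up `<State>` at any depth below each `<Address>`:

```python
import xml.etree.ElementTree as ET

OWNER_XML = """<OwnerTable>
  <Owner phone="564-3456" gender="F">
    <Name FN="Jenny" />
    <Address>
      <Street>1050 Harvard</Street><City>Denver</City>
      <State>CO</State><Zip>66049</Zip>
    </Address>
  </Owner>
  <Owner phone="123-4567" gender="M">
    <Name Prefix="Dr." NN="Jonny" FN="John" MN="Jay" LN="Doe" Suffix="Jr." />
    <Address><Adr1>1322 W 15th</Adr1><Adr2><State>CO</State></Adr2></Address>
  </Owner>
</OwnerTable>"""

owners = ET.fromstring(OWNER_XML)

# ".//State" matches a <State> descendant at any depth, so the flat layout
# and the Adr1/Adr2 layout are both covered by the same path.
states = {
    owner.get("phone"): owner.find("Address").findtext(".//State")
    for owner in owners.findall("Owner")
}
print(states)
```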

RDBMS vs. XML

RDBMS
- Structured model
- Large scale data
- Limited semantics
- Focus: how to handle large size data efficiently?

XML
- Unstructured or semi-structured model
- Small to medium scale data
- Document vs. data
- Flexible and rich semantics
- Focus: how can one efficiently handle a large number of small data items with various formats?

A conceptual view of XML
- An XML document is an ordered, labeled tree
- Character data leaf nodes contain the actual data (text strings)
- Element nodes are each labeled with
  - a name (often called the element type), and
  - a set of attributes, each consisting of a name and a value,
  and can have child nodes

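
The tree view can be sketched with a short recursive walk. This example uses Python's standard ElementTree module, which stores character data in each element's `text` field rather than as separate leaf nodes; the `describe` helper and the `id` attribute are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

def describe(node, depth=0):
    """Return one line per element: name, attributes, and text, indented by depth."""
    text = (node.text or "").strip()
    lines = [f"{'  ' * depth}{node.tag} {node.attrib} {text!r}"]
    for child in node:  # children are visited in document order
        lines.extend(describe(child, depth + 1))
    return lines

doc = ET.fromstring('<loc id="1"><city>Madison</city><state>WI</state></loc>')
lines = describe(doc)
print("\n".join(lines))
```

Each printed line corresponds to one element node: its label, its attribute set, and the character data it directly contains.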
A concrete view of XML
- An XML document is a text with markup tags and other meta-information.
- Markup tags denote elements:

...<foo attr="val" ...>bar</foo>...
   |    |              |  |
   |    |              |  a matching element end tag
   |    |              the contents "bar" of the element
   |    an attribute with name attr and value val, enclosed by ' or "
   an element start tag with name foo

- An XML document must be well-formed:
  - start and end tags must match
  - element tags must be properly nested

Example of XML document

<?xml version="1.0"?>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

- The XML declaration (1st line) should always be included; the root element is <note>
- In XML all elements must have a closing tag:
  - <p>foo<p>bar => <p>foo</p><p>bar</p>
- XML tags are case sensitive:
  - <Message>This is incorrect</message>
- Elements must be properly nested within each other:
  - <b><i>This is incorrect</b></i>
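
The well-formedness rules above can be checked mechanically. A minimal sketch, assuming Python's standard ElementTree parser as the checker (the `is_well_formed` helper is a name invented for this example):

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Each sample maps to the verdict the rules above predict.
checks = {
    "<p>foo<p>bar": False,                          # missing closing tags
    "<p>foo</p>": True,
    "<Message>This is incorrect</message>": False,  # case mismatch
    "<b><i>This is incorrect</b></i>": False,       # improper nesting
    "<b><i>This is correct</i></b>": True,
}
results = {doc: is_well_formed(doc) for doc in checks}
print(results == checks)
```

Unlike an HTML browser, an XML parser does not try to repair such input; it rejects the whole document.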
Element vs. Attribute
- The same information can be captured by either elements or attributes in XML

As elements:
<event eID="sigmod02">
  <acronym>SIGMOD</acronym>
  <society>ACM</society>
  <url>www.sigmod02.org</url>
  <loc>
    <city>Madison</city>
    <state>WI</state>
  </loc>
  <year>2002</year>
</event>

As attributes:
<event
  eID="sigmod02"
  acronym="SIGMOD"
  society="ACM"
  url="www.sigmod02.org"
  city="Madison"
  state="WI"
  year="2002"/>
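
A short sketch of why the two encodings are interchangeable from a reader's point of view: the hypothetical `field` helper below looks for a value as an attribute first and as a child element second, and returns the same answer for both forms. The documents are shortened versions of the slide's example.

```python
import xml.etree.ElementTree as ET

as_elements = ET.fromstring(
    '<event eID="sigmod02"><acronym>SIGMOD</acronym><year>2002</year></event>'
)
as_attributes = ET.fromstring(
    '<event eID="sigmod02" acronym="SIGMOD" year="2002"/>'
)

def field(event, name):
    """Look for name as an attribute first, then as a child element."""
    value = event.get(name)
    return value if value is not None else event.findtext(name)

same = all(
    field(as_elements, n) == field(as_attributes, n)
    for n in ("eID", "acronym", "year")
)
print(same)
```

One practical difference remains: an attribute holds a single string, while an element can hold nested structure such as `<loc>`.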

Applications of XML
- XML is a meta-language for creating other languages; the main application of XML is making new languages
- XHTML: W3C's XMLization of HTML 4.0.

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head><title>Hello world!</title></head>
  <body><p>foobar</p></body>
</html>

Applications of XML (cont.)
- CML: Chemical Markup Language

<molecule id="METHANOL">
  <atomArray>
    <stringArray builtin="elementType">C O H H H H</stringArray>
    <floatArray builtin="x3" units="pm"> -0.748 0.558 -1.293 -1.263
      -0.699 0.716 </floatArray>
  </atomArray>
</molecule>

- There are 1000+ new markup languages built with XML (e.g., www.schema.net)
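
Since CML is just XML, a generic parser can read it without any chemistry-specific tooling. The sketch below (Python standard library; the pairing of each element symbol with its `x3` value is an assumption about how the two arrays line up) extracts the atoms from the methanol fragment:

```python
import xml.etree.ElementTree as ET

CML = """<molecule id="METHANOL">
  <atomArray>
    <stringArray builtin="elementType">C O H H H H</stringArray>
    <floatArray builtin="x3" units="pm"> -0.748 0.558 -1.293 -1.263
      -0.699 0.716 </floatArray>
  </atomArray>
</molecule>"""

mol = ET.fromstring(CML)

# Both arrays are whitespace-separated lists stored as element text.
elements = mol.findtext(".//stringArray[@builtin='elementType']").split()
x3 = [float(v) for v in mol.findtext(".//floatArray[@builtin='x3']").split()]

# Assumption: the i-th symbol and the i-th x3 value describe the same atom.
atoms = list(zip(elements, x3))
print(mol.get("id"), atoms)
```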

Why is XML important?
- Technically, … nothing; it is just the same old simple tree model…
- Non-technically, …
  - Hot ($$$)
  - The standard for representation of Web information
  - The real force of XML is generic languages and tools!
  - By building on XML, you get a massive (standard) infrastructure for free

Conclusion
- XML is an important language that one should learn
- Plenty of research issues for database researchers
  - XML query language issues
  - Conversion between XML and other (e.g., relational) models
  - Storage for native XML databases
  - Novel indexing
  - System design and implementation

Further References
- World-Wide Web Consortium: www.w3c.org/xml/
- XML Cover Page: www.oasis-open.org/cover/
- XML Articles: www.xml.com
- Latest XML News: www.xmlhack.com
- XML Tutorial:
  - www.w3schools.com/xml/default.asp
  - www.brics.dk/~amoeller/XML
  - www.xml101.com/xml/default.asp
- XML WIKI: en.wikibooks.org/wiki/XML

Slides prepared by Dr. Bo Luo.


<End hasQuestion="Yes">
  <Thanks/>
</End>
