
The Simple API for XML (SAX)

Part I

Ethan Cerami
New York University

©Copyright 2003-2004. These slides are
based on material from the upcoming book,
“XML and Bioinformatics” (Springer-Verlag)
by Ethan Cerami. Please email
cerami@cs.nyu.edu for permission to copy.
10/17/08 Simple API for XML (SAX), Part I 1
Road Map
 SAX Overview
 What is SAX?
 Advantages/Disadvantages
 Basic SAX Examples
 About Xerces 2 Parser
 XMLReader Interface
 ContentHandler Interface
 Extending the SAX Default Handler
 Checking for Well-Formedness



SAX Overview



Introduction to SAX
 The Simple API for XML (SAX) is a standard, event-
based interface for parsing XML documents.
 Versions:
 SAX 1.0: original standard
 SAX 2.0: current standard
 SAX is a de facto standard, supported by most XML
parsers today.
 Unlike DOM, it is not an official W3C standard.
 SAX was originally built explicitly for Java, but SAX
now exists for other languages, including Perl,
Python, etc.



SAX Interface
 At its core, SAX is simply a series of
interfaces that are implemented by an
XML parser.
 Because different parsers implement
the same SAX interface, you can
easily swap in/out different parsers.



SAX Interface
[Diagram: Java App -> SAX Interface -> XML Parser (Xerces, Crimson, or Ælfred) -> XML Document]

Implementation details are hidden
behind the SAX interface. You can
therefore swap parsers in/out.
Same idea as JDBC.


Advantages/Disadvantages
 Advantages
 Widely implemented: supported by just about
every XML parser
 Fast Performance
 Low Memory Overhead
 Disadvantages
 Does not provide an easy to navigate XML tree
like DOM or JDOM.
 Does not provide an easy mechanism for
creating/modifying XML documents.



Basic SAX Example



Xerces 2 Parser
 All of our examples will use the Xerces 2 Parser.
 Xerces 2 is the latest open source XML parser from
the Apache XML Group.
 The Distribution is available at:
http://xml.apache.org/xerces2-j/
 The distribution includes two JAR files:
 xmlParserAPIs.jar: includes the relevant XML
APIs, including DOM Level 2, SAX 2.0, and JAXP
1.2.
 xercesImpl.jar: includes the Xerces
implementation of the XML APIs.



SAXBasic.java
 The first example illustrates the simplest SAX
functionality:
 Creates an XML Parser object
 Parses a document specified on the command line
 Receives SAX events and prints these to the
console.
 First, let’s examine a sample XML document.
Then view the output when this document is
parsed.



Sample XML Document
<?xml version='1.0' standalone='no' ?>
<!DOCTYPE DASDNA SYSTEM
'http://servlet.sanger.ac.uk:8080/das/dasdna.dtd' >
<DASDNA>
<SEQUENCE id="1" version="8.30" start="1000" stop="1050">
<DNA length="51">
taatttctcccattttgtaggttatcacttcactctgttgactttcttttg
</DNA>
</SEQUENCE>
<SEQUENCE id="2" version="8.30" start="1000" stop="1050">
<DNA length="51">
taatgcaactaaatccaggcgaagcatttcagcttaaccccgagacttttg
</DNA>
</SEQUENCE>
</DASDNA>

Document contains two sequences of DNA.

Sample Output
Start Document
Start Element: DASDNA
Start Element: SEQUENCE
Start Element: DNA
Characters:
taatttctcccattttgtaggttatcacttcactctgttgactttcttttg
Characters:

End Element: DNA


End Element: SEQUENCE
Start Element: SEQUENCE
Start Element: DNA
Characters:
taatgcaactaaatccaggcgaagcatttcagcttaaccccgagacttttg
Characters:

End Element: DNA


End Element: SEQUENCE
End Element: DASDNA
End Document



package com.oreilly.bioxml.sax;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

import java.io.IOException;

/**
 * Basic SAX Example.
 * Illustrates basic implementation of the SAX Content Handler.
 */
public class SAXBasic implements ContentHandler {

    public void startDocument() throws SAXException {
        System.out.println("Start Document");
    }

    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // Only the slice [start, start + length) of ch belongs to this event.
        String str = new String(ch, start, length);
        System.out.println("Characters: " + str);
    }

    public void endDocument() throws SAXException {
        System.out.println("End Document");
    }

    public void endElement(String namespaceURI, String localName,
            String qName) throws SAXException {
        System.out.println("End Element: " + localName);
    }

    public void endPrefixMapping(String prefix) throws SAXException {
        // No-op
    }

    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        // No-op
    }

    public void processingInstruction(java.lang.String target,
            java.lang.String data) throws SAXException {
        // No-op
    }

    public void setDocumentLocator(Locator locator) {
        // No-op
    }

    public void skippedEntity(String name) throws SAXException {
        // No-op
    }

    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) throws SAXException {
        System.out.println("Start Element: " + localName);
    }

    public void startPrefixMapping(String prefix, String uri)
            throws SAXException {
        // No-op
    }

    /**
     * Prints Command Line Usage
     */
    private static void printUsage() {
        System.out.println ("usage: SAXBasic xml-file");
        System.exit(0);
    }

    /**
     * Main Method
     * Options for instantiating XMLReader Implementation:
     * 1) XMLReader parser = XMLReaderFactory.createXMLReader();
     * 2) XMLReader parser = XMLReaderFactory.createXMLReader
     *        ("org.apache.xerces.parsers.SAXParser");
     * 3) XMLReader parser = new org.apache.xerces.parsers.SAXParser();
     */
    public static void main(String[] args) {
        if (args.length != 1) {
            printUsage();
        }
        try {
            SAXBasic saxHandler = new SAXBasic();
            XMLReader parser = XMLReaderFactory.createXMLReader
                ("org.apache.xerces.parsers.SAXParser");
            parser.setContentHandler(saxHandler);
            parser.parse(args[0]);
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}



Main SAX Interfaces
 SAX provides two main interfaces:
 XMLReader: implemented by the XML
parser.
 ContentHandler: implemented by your
application in order to receive SAX events.
 Each time an event occurs, e.g. start element,
end element, the XML parser calls the
ContentHandler and informs you of the specific
event.



XMLReader Interface
 You have three main options for instantiating
an XMLReader class.
 Option 1: Use the SAX XMLReaderFactory
class with no arguments:
 XMLReader parser =
XMLReaderFactory.createXMLReader();
 The factory will attempt to instantiate an
XMLReader based on system defaults.



Option 1: Continued
 You can specify a system property from the java
command line via the -D option.
 For example, the following line invokes the SAXBasic
class and specifies the Xerces2 XML Parser:
 javaw.exe -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser com.oreilly.bioxml.sax.SAXBasic sample.xml
 The advantage of using system properties is that you
can dynamically change parsers at any time without
recompiling any code.
 If the Factory is unable to determine any valid system
defaults, it will throw a SAXException, with a specific
message: "System property org.xml.sax.driver not
specified."
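
The lookup described above can be sketched as follows. This is a minimal illustration and not part of the original slides: the class name DriverLookup and the method describeReader are hypothetical. It assumes a JDK that still ships XMLReaderFactory; the class is marked deprecated in newer Java releases, where the factory may fall back to a bundled parser rather than throw.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper class, for illustration only.
final class DriverLookup {

    /** Returns the class name of the XMLReader the factory picked, or an error note. */
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static String describeReader() {
        // Set with -Dorg.xml.sax.driver=... on the command line; may be null.
        String requested = System.getProperty("org.xml.sax.driver");
        try {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            return "requested=" + requested
                    + ", actual=" + parser.getClass().getName();
        } catch (SAXException e) {
            // Thrown when no usable XMLReader implementation can be found.
            return "requested=" + requested + ", error=" + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeReader());
    }
}
```

Running it with and without the -D option shows which implementation the factory selected, without recompiling.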



Using Different Parsers
 The specific class that implements the
XMLReader interface varies from
parser to parser. For example:
 For the Xerces XML Parser, it's
org.apache.xerces.parsers.SAXParser.
 For the Crimson XML Parser, it's
org.apache.crimson.parser.XMLReaderImpl.



Option 2
 Call the XMLReaderFactory with a
String argument indicating the class
name that implements the XMLReader
interface:
 For example:
 XMLReader parser =
XMLReaderFactory.createXMLReader
("org.apache.xerces.parsers.SAXParser");



Option 3
 Instantiate the XMLReader
implementation directly:
 For example:
 XMLReader parser = new
org.apache.xerces.parsers.SAXParser();
 This option works fine. However, note
that if you switch parsers, you will need
to recompile.



Using an XMLReader
 Once you have an XMLReader class, you can
call the parse() method to start parsing:

XMLReader parser = XMLReaderFactory.createXMLReader
    ("org.apache.xerces.parsers.SAXParser");
parser.parse("simple.xml");

 You can pass a local file name or an absolute
URL to the parse() method.
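
Besides a file name or URL, the XMLReader interface also defines parse(InputSource), which is handy for XML held in memory. The sketch below is an illustration under stated assumptions: the class ParseFromString and its method elementNames are hypothetical, and it obtains the reader from the JDK's bundled JAXP SAXParserFactory instead of the Xerces distribution so that it runs without extra JAR files. It uses the helper class org.xml.sax.helpers.DefaultHandler to receive events.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
final class ParseFromString {

    /** Parses XML text and returns the element names in document order. */
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so that localName is populated
            XMLReader parser = factory.newSAXParser().getXMLReader();
            parser.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                        String qName, Attributes atts) {
                    names.add(localName);
                }
            });
            // Wrap the in-memory text in an InputSource instead of a file name.
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<DASDNA><SEQUENCE><DNA>taat</DNA></SEQUENCE></DASDNA>";
        System.out.println(elementNames(xml));
    }
}
```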



ContentHandler Interface
 The ContentHandler receives all SAX events.
 In total, there are 11 defined events.
 The most important events/methods are defined
below:

characters            Receive notification of character data.
endDocument           Receive notification of the end of a document.
endElement            Receive notification of the end of an element.
ignorableWhitespace   Receive notification of ignorable whitespace in element content.
setDocumentLocator    Receive an object for locating the origin of SAX document events.
startDocument         Receive notification of the beginning of a document.
startElement          Receive notification of the beginning of an element.
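
As one way to see setDocumentLocator at work, the following sketch stores the Locator and uses it to tag each start element with a line number. The class LocatorTrace and its method trace are hypothetical names, and the reader comes from the JDK's bundled JAXP SAXParserFactory as an assumption, not from the Xerces distribution used elsewhere in these slides.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only.
final class LocatorTrace extends DefaultHandler {
    private Locator locator;                       // supplied by the parser
    private final List<String> events = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        // Called before startDocument if the parser provides a locator.
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes atts) {
        int line = (locator == null) ? -1 : locator.getLineNumber();
        events.add(localName + "@line " + line);
    }

    /** Parses XML text and returns one "name@line N" entry per start element. */
    static List<String> trace(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so that localName is populated
            XMLReader parser = factory.newSAXParser().getXMLReader();
            LocatorTrace handler = new LocatorTrace();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler.events;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(trace("<DASDNA>\n<SEQUENCE/>\n</DASDNA>"));
    }
}
```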



Character “Chunking”
 Suppose you have the following piece of XML:
<DNA length="51">
taatgcaactaaatccaggcgaagcatttcagcttaaccccg
</DNA>
 You will receive a start element event, followed by one or more
character events.
 Parsers are free to split the character data across calls to the
characters() method any way they want. For example, one parser might do the following:
 characters (“t”);
 characters (“a”);
 characters (“a”);
 Another parser might do this:
 characters (“taatgcaactaaatccagg”);
 characters (“cgaagcatttcagcttaaccccg”);



Character Chunking
 Your application needs to be able to handle either of these
strategies.
 To do this, it is best to store character data in some kind of
buffer, like StringBuffer.
 For example:
/**
* Processes Character Events via Buffer
*/
public void characters(char[] ch, int start, int length)
throws SAXException {
String str = new String(ch, start, length);
currentText.append(str);
}
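
A fuller sketch of the buffering approach is shown below: the buffer is cleared at each DNA start element, filled by however many characters() calls the parser makes, and read at the matching end element. The class DnaCollector and its method collect are hypothetical. The sketch uses StringBuilder in place of StringBuffer (an unsynchronized substitute, not the slide's exact choice) and assumes the JDK's bundled JAXP parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only.
final class DnaCollector extends DefaultHandler {
    private final StringBuilder currentText = new StringBuilder();
    private final List<String> sequences = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes atts) {
        if ("DNA".equals(localName)) {
            currentText.setLength(0);   // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called once or many times per element; just accumulate.
        currentText.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("DNA".equals(localName)) {
            // The buffer now holds the complete text, however it was chunked.
            sequences.add(currentText.toString().trim());
        }
    }

    /** Parses XML text and returns the trimmed content of every DNA element. */
    static List<String> collect(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so that localName is populated
            XMLReader parser = factory.newSAXParser().getXMLReader();
            DnaCollector handler = new DnaCollector();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(new StringReader(xml)));
            return handler.sequences;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect(
            "<DASDNA><SEQUENCE><DNA>\ntaat\n</DNA></SEQUENCE></DASDNA>"));
    }
}
```

Because the text is only read in endElement, the result is the same whether the parser delivers one character per call or the whole sequence at once.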



Using ContentHandlers
 To receive events, you must:
 Implement the ContentHandler interface
 Register your content handler with the XML
parser:

XMLReader parser = XMLReaderFactory.createXMLReader
    ("org.apache.xerces.parsers.SAXParser");
parser.setContentHandler(saxHandler);
parser.parse(args[0]);



ContentHandler Implementation
 Here’s a sample implementation that just outputs
information about each event:

public void characters(char[] ch, int start, int length)
        throws SAXException {
    String str = new String(ch, start, length);
    System.out.println("Characters: " + str);
}

public void endElement(String namespaceURI, String localName,
        String qName) throws SAXException {
    System.out.println("End Element: " + localName);
}



Using the SAX Default Handler



SAX Default Handler
 In total, an implementation of
ContentHandler must implement 11 methods.
 You usually don’t need to intercept all 11 of
these events.
 It is therefore much easier to extend the SAX
DefaultHandler.
 The DefaultHandler provides no-op
implementations of all methods. You can
therefore simply override those that you want.
 The next few slides provide an example.



package com.oreilly.bioxml.sax;

import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.Attributes;
import org.xml.sax.XMLReader;

import java.io.IOException;

/**
 * Basic SAX Example.
 * Illustrates extending of DefaultHandler
 */
public class SAXDefaultHandler extends DefaultHandler {

    public void startDocument() throws SAXException {
        System.out.println("Start Document");
    }

    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String str = new String(ch, start, length);
        System.out.println("Characters: " + str);
    }

    public void endDocument() throws SAXException {
        System.out.println("End Document");
    }

    public void endElement(String namespaceURI, String localName,
            String qName) throws SAXException {
        System.out.println("End Element: " + localName);
    }

    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) throws SAXException {
        System.out.println("Start Element: " + localName);
    }

    /**
     * Prints Command Line Usage
     */
    private static void printUsage() {
        System.out.println ("usage: SAXDefaultHandler xml-file");
        System.exit(0);
    }

    /**
     * Main Method
     */
    public static void main(String[] args) {
        if (args.length != 1) {
            printUsage();
        }
        try {
            SAXDefaultHandler saxHandler = new SAXDefaultHandler();
            XMLReader parser = XMLReaderFactory.createXMLReader
                ("org.apache.xerces.parsers.SAXParser");
            parser.setContentHandler(saxHandler);
            parser.parse(args[0]);
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Only override those methods that you need.
By extending the DefaultHandler, your code is much
more compact and concise.
The output of this program is
identical to the first example.


Checking for Well-Formedness



Defaults
 By default, the Xerces XML parser (and
most other parsers) will check for well-
formedness, but they will not
automatically check for validity.
 Consider the document on the next
slide, which is not well-formed.



Sample Document: Not Well-formed
<?xml version='1.0' standalone='no' ?>
<!DOCTYPE DASDNA SYSTEM
'http://servlet.sanger.ac.uk:8080/das/dasdna.dtd' >
<DASDNA>
<SEQUENCE id="1" version="8.30" start="1000" stop="1050">
<DNA length="51">
taatttctcccattttgtaggttatcacttcactctgttgactttcttttg
</SEQUENCE>
<SEQUENCE id="2" version="8.30" start="1000" stop="1050">
<DNA length="51">
taatgcaactaaatccaggcgaagcatttcagcttaaccccgagacttttg
</DNA>
</SEQUENCE>
</DASDNA>
This document is not well-formed
because I deleted one of the
</DNA> end tags.



Sample Output
Start Document
Start Element: DASDNA
Start Element: SEQUENCE
Start Element: DNA
Characters:
taatttctcccattttgtaggttatcacttcactctgttgactttcttttg
Characters:

[Fatal Error] ensemble_dna_error.xml:8:5: The element type "DNA" must be terminated by the matching end-tag "</DNA>".
org.xml.sax.SAXParseException: The element type "DNA" must be terminated by the matching end-tag "</DNA>".
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at com.oreilly.bioxml.sax.SAXBasic.main(SAXBasic.java:101)

This is a fatal error. The parser therefore throws a SAXParseException.



Try / Catch Clause
try {
    SAXDefaultHandler saxHandler = new SAXDefaultHandler();
    XMLReader parser = XMLReaderFactory.createXMLReader
        ("org.apache.xerces.parsers.SAXParser");
    parser.setContentHandler(saxHandler);
    parser.parse(args[0]);
} catch (SAXException e) {
    // Indicates a fatal parsing error, such as errors in well-formedness.
    e.printStackTrace();
} catch (IOException e) {
    // Indicates an IO Error, such as failed network connection.
    e.printStackTrace();
}
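
Since the parser reports well-formedness errors as a SAXParseException (a subclass of SAXException), an application can catch it separately to report where the error occurred. The sketch below is an illustration only: the class WellFormedCheck and its method check are hypothetical, and it assumes the JDK's bundled JAXP parser, which may also print its own "[Fatal Error]" line to standard error.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
final class WellFormedCheck {

    /** Returns "well-formed" or a short description of the first problem. */
    static String check(String xml) {
        try {
            XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.setContentHandler(new DefaultHandler()); // ignore all events
            parser.parse(new InputSource(new StringReader(xml)));
            return "well-formed";
        } catch (SAXParseException e) {
            // Carries the position at which the parser gave up.
            return "not well-formed at line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber() + ": " + e.getMessage();
        } catch (SAXException | ParserConfigurationException e) {
            return "parser error: " + e.getMessage();
        } catch (IOException e) {
            return "I/O error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<DASDNA><DNA>taat</DNA></DASDNA>"));
        System.out.println(check("<DASDNA><DNA>taat</DASDNA>"));
    }
}
```

The SAXParseException clause has to come before the more general SAXException clause, otherwise the compiler rejects it as unreachable.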



Summary
 SAX is a standard, event-based interface for
parsing XML documents.
 It is a de facto standard, not an official W3C
standard.
 XML Parsers must implement the XMLReader
interface.
 Applications must implement the
ContentHandler interface.
 For more concise programs, extend the SAX
Default Handler.
 Make sure to surround calls to parse() with a
try/catch clause.
