XML: Foundations, Techniques, and Applications
SAX (Simple API for XML)

eXtensible Markup Language
Lecturer: Phan Vo Minh Thang MSc.
<?xml version="1.0"?>
<material>
  XML Lectures Notes
  <section id="08"> Simple API for XML </section>
</material>

Sample Document

    <transaction>
      <account>89-344</account>
      <buy shares="100">
        <ticker exch="NASDAQ">WEBM</ticker>
      </buy>
      <sell shares="30">
        <ticker exch="NYSE">GE</ticker>
      </sell>
    </transaction>

DOM Parser
- DOM = Document Object Model
- The parser creates a tree object out of the document
- The user accesses data by traversing the tree
- The API allows for constructing, accessing and manipulating the structure and content of XML documents

Document as Tree

    transaction
      account: 89-344
      buy (shares = 100)
        ticker (exch = NASDAQ): WEBM
      sell (shares = 30)
        ticker (exch = NYSE): GE

Methods like: getRoot, getChildren, getAttributes, etc.

Advantages and Disadvantages of DOM
Advantages:
- Natural and relatively easy to use
- Can repeatedly traverse the tree
Disadvantages:
- High memory requirements: the whole document is kept in memory
- Must parse the whole document before use

SAX Parser
- SAX = Simple API for XML
- The parser generates events while reading the document (no tree is built)
- The parser calls methods (that you write) to deal with the events
- Similar to an I/O stream: it goes in one direction only
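The event model described above can be sketched with the SAX classes bundled with the JDK. This is only an illustration under assumptions: the class name EventTrace is hypothetical, and the parser is obtained through JAXP's SAXParserFactory so that the example runs without an external parser library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records SAX events in the order the parser reports them.
final class EventTrace extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("Start tag: " + qName);
        for (int i = 0; i < atts.getLength(); i++) {
            events.add("Attribute: " + atts.getQName(i) + " Value: " + atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver text in several chunks; whitespace-only chunks are skipped here.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("Text: " + text);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("End tag: " + qName);
    }

    // Parses the given XML string once, front to back, and returns the recorded events.
    static List<String> trace(String xml) throws Exception {
        EventTrace handler = new EventTrace();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<transaction><account>89-344</account>"
                + "<buy shares=\"100\"><ticker exch=\"NASDAQ\">WEBM</ticker></buy>"
                + "</transaction>";
        trace(xml).forEach(System.out::println);
    }
}
```

Nothing is kept in memory except what the handler chooses to store, which is why the events list is the only state in this sketch.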
Document as Events

    <transaction>
      <account>89-344</account>
      <buy shares="100">
        <ticker exch="NASDAQ">WEBM</ticker>
      </buy>
      <sell shares="30">
        <ticker exch="NYSE">GE</ticker>
      </sell>
    </transaction>

    Start tag: transaction
    Start tag: account
    Text: 89-344
    End tag: account
    Start tag: buy
    Attribute: shares  Value: 100

Advantages and Disadvantages of SAX
Advantages:
- Requires little memory
- Fast
Disadvantages:
- Cannot reread
- Less natural for object oriented programmers (perhaps)

Which should we use? DOM vs. SAX
- If your document is very large and you only need a few elements, use SAX
- If you need to manipulate (i.e., change) the XML, use DOM
- If you need to access the XML many times, use DOM

XML Parsers
There are several different ways to categorise parsers:
- Validating versus non-validating parsers
- DOM parsers versus SAX parsers
- Parsers written in a particular language (Java, C++, Perl, etc.)

Validating Parsers
- A validating parser makes sure that the document conforms to the specified DTD
- This is time consuming, so a non-validating parser is faster
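The difference between a validating and a non-validating parser can be sketched as follows. The class name ValidationDemo and the small DTD are hypothetical; the parser comes from JAXP's SAXParserFactory, which is an assumption made so the example runs on a plain JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: counts validity errors with validation on or off.
final class ValidationDemo {

    // Well-formed, but <sell> is not allowed by the (made-up) internal DTD.
    static final String INVALID_DOC =
            "<!DOCTYPE transaction ["
            + "<!ELEMENT transaction (account)>"
            + "<!ELEMENT account (#PCDATA)>"
            + "]>"
            + "<transaction><sell/></transaction>";

    static int countErrors(String xml, boolean validating) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validating); // DTD validation on or off
        final int[] errors = {0};
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        errors[0]++; // validity violations are reported as recoverable errors
                    }
                });
        return errors[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("validating:     " + countErrors(INVALID_DOC, true) + " error(s)");
        System.out.println("non-validating: " + countErrors(INVALID_DOC, false) + " error(s)");
    }
}
```

The non-validating run still checks well-formedness; it only skips the comparison against the DTD.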
Using an XML Parser
Three basic steps:
1. Create a parser object
2. Pass the XML document to the parser
3. Process the results
Generally, writing out XML is not in the scope of parsers (though some may implement proprietary mechanisms)

The SAX Parser
- The SAX parser is an event-driven API
- An XML document is sent to the SAX parser
- The XML file is read sequentially
- The parser notifies the class when events happen, including errors
- The events are handled by the API methods that the programmer implemented

SAX API components (figure):
- Used to create a SAX parser
- Handles document events: start tag, end tag, etc.
- Handles parser errors
- Handles DTDs and entities

Problem
- The SAX interface is an accepted standard
- There are many implementations
- We would like to be able to change the implementation used without changing any code in the program
- How is this done?

Factory Design Pattern
- Have a Factory class that creates the actual parsers
- The Factory checks the value of a system property that states which implementation should be used
- In order to change the implementation, simply change the system property
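The factory idea can be sketched independently of SAX. Everything in this example is hypothetical (the DemoParser interface, its two implementations, and the property name demo.parser.impl); it only illustrates how a system property can select the implementation class at run time.

```java
// Hypothetical interface standing in for "the standard API".
interface DemoParser {
    String name();
}

// Two hypothetical implementations of the same interface.
final class FastParser implements DemoParser {
    public String name() { return "fast"; }
}

final class StrictParser implements DemoParser {
    public String name() { return "strict"; }
}

// The factory reads a system property and instantiates the named class by reflection.
final class DemoParserFactory {
    static final String PROPERTY = "demo.parser.impl"; // hypothetical property name

    static DemoParser create() throws ReflectiveOperationException {
        // Fall back to FastParser when the property is not set.
        String className = System.getProperty(PROPERTY, FastParser.class.getName());
        return (DemoParser) Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(create().name());               // default implementation
        System.setProperty(PROPERTY, StrictParser.class.getName());
        System.out.println(create().name());               // implementation named by the property
    }
}
```

Client code only ever mentions DemoParser and DemoParserFactory, so switching implementations requires changing the property, not the program.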
Creating a SAX Parser
Import the following packages:

    org.xml.sax.*;
    org.xml.sax.helpers.*;

Set the following system property:

    System.setProperty("org.xml.sax.driver",
                       "org.apache.xerces.parsers.SAXParser");

Create the instance from the Factory:

    XMLReader reader = XMLReaderFactory.createXMLReader();

Receiving Parsing Information
- A SAX parser calls methods such as startDocument, startElement, etc., as it runs
- In order to react to such events we must:
  - implement the ContentHandler interface
  - set the parser's content handler with an instance of our class

ContentHandler

    // Methods (partial list)
    public void startDocument();
    public void endDocument();
    public void characters(char[] ch, int start, int length);
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts);
    public void endElement(String namespaceURI, String localName,
                           String qName);

Namespaces and Element Names

    <?xml version='1.0' encoding='utf-8'?>
    <forsale date="12/2/03"
             xmlns:xhtml="urn:http://www.w3.org/1999/xhtml">
      <book>
        <title> <xhtml:em> DBI: </xhtml:em> The Course I Wish I never Took </title>
        <comment> My <xhtml:b> favorite </xhtml:b> book! </comment>
      </book>
    </forsale>
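A minimal sketch of how the three name arguments of startElement look for a document like the one above. The class name NameTrace is hypothetical, and the parser is obtained through JAXP's SAXParserFactory (an assumption), which must be made namespace-aware explicitly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records the names passed to startElement.
final class NameTrace extends DefaultHandler {
    final List<String> lines = new ArrayList<>();

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts) {
        lines.add("namespaceURI=" + namespaceURI
                + " localName=" + localName
                + " qName=" + qName);
    }

    static List<String> trace(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // JAXP factories are not namespace-aware by default
        NameTrace handler = new NameTrace();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<forsale xmlns:xhtml=\"urn:http://www.w3.org/1999/xhtml\">"
                + "<book><title><xhtml:em>DBI:</xhtml:em> The Course</title></book>"
                + "</forsale>";
        trace(xml).forEach(System.out::println);
    }
}
```

Elements without a prefix and without a default namespace declaration are reported with an empty namespaceURI.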
Namespaces and Element Names (cont.)
For the document above, startElement receives:

    <xhtml:em>
      namespaceURI = urn:http://www.w3.org/1999/xhtml
      localName    = em
      qName        = xhtml:em

    <book>
      namespaceURI = ""
      localName    = book
      qName        = book

Receiving Parsing Information (cont.)
- An easy way to implement the ContentHandler interface is to extend DefaultHandler, which implements this interface (and a few others) with empty methods
- To actually parse a document, create an InputSource from the document and supply the input source to the parse method of the XMLReader

    import java.io.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public class InfoWithSax extends DefaultHandler {

        public static void main(String[] args) {
            System.setProperty("org.xml.sax.driver",
                               "org.apache.xerces.parsers.SAXParser");
            try {
                XMLReader reader = XMLReaderFactory.createXMLReader();
                reader.setContentHandler(new InfoWithSax());
                reader.parse(new InputSource(new FileReader(args[0])));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        public void startDocument() throws SAXException {
            System.out.println("START DOCUMENT");
        }

        public void endDocument() throws SAXException {
            System.out.println("END DOCUMENT");
        }

        int depth;            // current nesting level
        String indent = "  "; // indentation per level (assumed: two spaces)

        // Prints "header: value", indented by the current depth.
        private void println(String header, String value) {
            for (int i = 0; i < depth; i++)
                System.out.print(indent);
            System.out.println(header + ": " + value);
        }

        public void characters(char buf[], int offset, int len)
                throws SAXException {
            String s = (new String(buf, offset, len)).trim();
            if (!"".equals(s))
                println("CHARACTERS", s);
        }

        public void endElement(String namespaceURI, String localName,
                               String name) throws SAXException {
            depth--;
            String elementName = name;
            if (!"".equals(namespaceURI) && !"".equals(localName))
                elementName = namespaceURI + ":" + localName;
            println("END ELEMENT", elementName);
        }

        public void startElement(String namespaceURI, String localName,
                                 String name, Attributes attrs)
                throws SAXException {
            String elementName = name;
            if (!"".equals(namespaceURI) && !"".equals(localName))
                elementName = namespaceURI + ":" + localName;
            println("START ELEMENT", elementName);
            if (attrs != null && attrs.getLength() > 0) {
                for (int i = 0; i < attrs.getLength(); i++)
                    println("ATTRIBUTE",
                            attrs.getLocalName(i) + " = " + attrs.getValue(i));
            }
            depth++;
        }
    } // end of class InfoWithSax

Bachelor Tags
What do you think happens when the parser parses a bachelor (empty-element) tag?
    <rating stars="five" />

Attributes Interface
- Elements may have attributes
- There is no distinction between attributes that are specified explicitly in the document and those that are supplied by the DTD (with a default value)

Attributes Interface (cont.)

    int getLength();
    String getQName(int i);
    String getType(int i);
    String getValue(int i);
    String getType(String qname);
    String getValue(String qname);
    etc.

Attribute Types
The following are possible types for attributes:
"CDATA", "ID", "IDREF", "IDREFS", "NMTOKEN", "NMTOKENS", "ENTITY", "ENTITIES", "NOTATION"

Setting Features
It is possible to set the features of a parser using the setFeature method.
Examples:

    reader.setFeature("http://xml.org/sax/features/namespaces", true);
    reader.setFeature("http://xml.org/sax/features/validation", false);

For a full list, see: http://www.saxproject.org/?selected=get-set
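The setFeature calls above can be tried out as follows. This is a sketch under assumptions: the class FeatureDemo and its helper methods are hypothetical, the XMLReader is obtained from JAXP's SAXParserFactory rather than from XMLReaderFactory, and the URI http://example.org/no-such-feature is made up to show what happens with a feature the parser does not know.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical demo class: sets two standard SAX features and queries them back.
final class FeatureDemo {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    static XMLReader configuredReader() throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setFeature(NAMESPACES, true);   // report namespace URIs and local names
        reader.setFeature(VALIDATION, false);  // do not validate against a DTD
        return reader;
    }

    // Returns false when the parser rejects the feature name.
    static boolean isRecognized(XMLReader reader, String feature) {
        try {
            reader.getFeature(feature);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = configuredReader();
        System.out.println("namespaces: " + reader.getFeature(NAMESPACES));
        System.out.println("validation: " + reader.getFeature(VALIDATION));
        System.out.println("made-up feature recognized: "
                + isRecognized(reader, "http://example.org/no-such-feature"));
    }
}
```

Features should be set before parse is called; some features cannot be changed while a parse is in progress.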
ErrorHandler Interface
- We implement ErrorHandler to receive error events (similar to implementing ContentHandler)
- DefaultHandler implements ErrorHandler with empty methods, so we can extend it (as before)
- An ErrorHandler is registered with

      reader.setErrorHandler(handler);

- Three methods:

      void error(SAXParseException ex);
      void fatalError(SAXParseException ex);
      void warning(SAXParseException ex);

Extending the InfoWithSax Program

    public void warning(SAXParseException err) throws SAXException {
        System.out.println("Warning in line " + err.getLineNumber()
                           + " and column " + err.getColumnNumber());
    }

    public void error(SAXParseException err) throws SAXException {
        System.out.println("Oy vaavoi, an error!");
    }

    public void fatalError(SAXParseException err) throws SAXException {
        System.out.println("OY VAAVOI, a fatal error!");
    }

Will these methods be called in the case of a problem?

Lexical Events
- Lexical events have to do with the way that a document was written and not with its content
- Examples:
  - A comment is a lexical event (<!-- comment -->)
  - The use of an entity is a lexical event (&gt;)
- These can be dealt with by implementing the LexicalHandler interface, and set on a parser by

      reader.setProperty("http://xml.org/sax/properties/lexical-handler",
                         mylexicalhandler);
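Registering a lexical handler can be sketched as follows. The class name LexicalTrace is hypothetical; it extends org.xml.sax.ext.DefaultHandler2, which provides empty implementations of the LexicalHandler methods, and the XMLReader is taken from JAXP's SAXParserFactory (an assumption made so the example runs on a plain JDK).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class: records comments and CDATA section boundaries.
final class LexicalTrace extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        events.add("comment: " + new String(ch, start, length).trim());
    }

    @Override
    public void startCDATA() {
        events.add("startCDATA");
    }

    @Override
    public void endCDATA() {
        events.add("endCDATA");
    }

    static List<String> trace(String xml) throws Exception {
        LexicalTrace handler = new LexicalTrace();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // Lexical handlers are registered as a property, not through a dedicated setter.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        trace("<a><!-- note --><![CDATA[x < y]]></a>").forEach(System.out::println);
    }
}
```

The text inside the CDATA section still arrives through the ordinary characters callback; the lexical events only mark where the section starts and ends.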
LexicalHandler

    // Methods (partial list)
    public void startEntity(String name);
    public void endEntity(String name);
    public void comment(char[] ch, int start, int length);
    public void startCDATA();
    public void endCDATA();

Info
- Course name: Special Selected Topic in Information System
- Section: Simple API for XML
- Number of slides: 36
- Updated date: 12/02/2006
- Contact: Mr. Phan Vo Minh Thang (minhthangpv@hcmuaf.edu.vn)