
UNIT V

CURRENT TRENDS: Object Oriented Databases - Need for Complex Data Types - OO Data Model - Nested Relations - Complex Types - Inheritance - Reference Types. Distributed Databases - Homogeneous and Heterogeneous - Distributed Data Storage. XML - Structure of XML Data - XML Document Schema - Querying and Transformation. Data Mining and Data Warehousing.
_____________________________________________________________________________________________________

Object Oriented Databases


o Need for Complex Data Types
o Object Oriented Data Model
o Object Oriented Languages

Need for Complex Data types


Traditional database applications consist of data processing tasks such as banking and payroll management. Such applications have conceptually simple data types: the basic data items are records that are fairly small and whose fields are atomic, that is, not further structured.

In recent years, the need for more complex data types has grown. For example, an address could be viewed as an atomic data item of type string, but this view hides details such as street, city, state, and pin code. On the other hand, if the address were broken into separate components, writing queries would become more complicated. A better alternative is an object-oriented DBMS (OODBMS), which supports structured types and so allows a type address with subparts street, city, state, and pin code.

Another example is multivalued attributes: with complex types they can be stored directly, without creating a separate relation to hold the values as first normal form (1NF) requires.

Applications that need complex data types include Computer-Aided Design (CAD), Computer-Aided Software Engineering (CASE), multimedia and image databases, and document/hypertext databases.

Object oriented data Model


Relational database technology has not handled the needs of complex information systems well. The problem is that relational databases require the application developer to force an information model into tables, where relationships between entities are defined by values. Relational database design is largely a process of figuring out how to represent real-world objects within the confines of tables so that performance is good and data integrity can be preserved.

Object database design is quite different. For the most part, it is a fundamental part of the overall application design process. The object classes used by the programming language are the classes used by the ODBMS. Because the two models are consistent, there is no need to transform the program's object model into something unique to the database manager.

An object-oriented data model consists of:
1. Static properties such as objects, attributes, and relationships.
2. Integrity rules over objects and operations.
3. Dynamic properties such as operations or rules defining new database states based on applied state changes.

Object-oriented databases can model all three of these components directly within the database, supporting a complete problem modeling capability. Earlier databases directly supported only points 1 and 2 and relied on applications to define the dynamic properties of the model. The disadvantage of that approach is that the dynamic properties could not be applied uniformly in all database usage scenarios, since they were defined outside the database in autonomous applications. Object-oriented databases provide a unifying paradigm that integrates all three aspects of data modeling and applies them uniformly to all users of the database.

Advantages:
o It can handle a large number of different data types such as text, numbers, pictures, voice, and video.
o It combines object-oriented programming with database technology to provide integrated application development systems.
o It supports object-oriented programming concepts such as inheritance, polymorphism, and dynamic binding.

Disadvantages:
o It is difficult to maintain.
o OODBMSs are not suited for all applications.

Main Concepts of OO data model.


o Object Structure
o Object Classes
o Inheritance
o Object Identity

Object Structure

The object-oriented paradigm is based on encapsulating the data and code related to an object into a single unit whose contents are not visible to the outside world. An object corresponds to an entity in the E-R model. All interactions between an object and the rest of the system are via messages. In general, an object has associated with it:
o A set of variables that contain the data for the object.
o A set of messages to which the object responds; each message may have zero, one, or more parameters.
o A set of methods, each of which is a body of code to implement a message; a method returns a value as the response to the message.

Object Classes

There are many similar objects in a database. They respond to the same messages, use the same methods, and have variables of the same name and type. It would be wasteful to define each such object separately, so we group similar objects to form a class. Each object is called an instance of its class, and all objects in a class share a common definition. Examples of classes in a bank database are employees, customers, and accounts. The following pseudocode describes the class customer; the definition shows the variables and the messages to which objects of the class respond.

    class customer {
        /* variables */
        string name;
        string address;
        int customer_code;
        /* messages */
        void get_customer_details();
        void set_customer_address();
    };

In this definition, each object of class customer contains the variables name, address, and customer_code. It also responds to two messages, get_customer_details and set_customer_address. The following code describes the implementation of these two messages.

    /* reads all three variables of the customer object */
    void customer::get_customer_details() {
        cout << "Enter customer details";
        cin >> name >> address >> customer_code;
    }

    /* replaces only the address of the customer object */
    void customer::set_customer_address() {
        cout << "Enter new address";
        cin >> address;
    }

Inheritance

An object-oriented database schema usually requires a large number of classes, and several of them are similar. For example, in an object-oriented database for a banking system, the class of bank customers is similar to the class of bank employees: both share some variables and messages, such as name and address. However, there are variables and messages specific to each class, such as salary for employees and credit-rating for customers. It is desirable to define a representation for the common variables in one place, called the base class.

Every employee is a person, so employee is a specialization of person. Similarly, customer is a specialization of person. We therefore create the classes person, employee, and customer:
o Variables/messages applicable to all persons are associated with class person.
o Variables/messages specific to employees are associated with class employee; similarly for customer.

The classes are placed into a specialization (IS-A) hierarchy:
o Variables/messages belonging to class person are inherited by class employee as well as by class customer.
o The result is a class hierarchy.

Class Hierarchy Definition

    class person {
        string name;
        string address;
    };
    class customer isa person {
        int credit-rating;
    };
    class employee isa person {
        date start-date;
        int salary;
    };
    class officer isa employee {
        int office-number;
        int expense-account-number;
    };
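As a rough illustration only, the person/customer/employee/officer hierarchy can be sketched in Python, where "isa" corresponds to subclassing. The use of dataclasses, the underscore spellings of the variable names, and the sample values are assumptions made for this sketch; they are not part of the notes' pseudocode.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class Person:
    name: str
    address: str


@dataclass
class Customer(Person):       # customer isa person
    credit_rating: int


@dataclass
class Employee(Person):       # employee isa person
    start_date: date
    salary: int


@dataclass
class Officer(Employee):      # officer isa employee
    office_number: int
    expense_account_number: int


# Hypothetical sample values, for illustration only.
officer = Officer("Asha", "12 Main St", date(2020, 1, 6), 50000, 14, 7001)

# Variables declared on Person and Employee are inherited by Officer.
inherited = [f.name for f in fields(Officer)]
```

An Officer object is also an Employee and a Person, but not a Customer, which mirrors the IS-A relationships in the hierarchy.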

Multiple Inheritance

[Figure: class DAG for the banking example]

With multiple inheritance, a class may have more than one superclass. The class/subclass relationship is then represented by a directed acyclic graph (DAG).
o Multiple inheritance is particularly useful when objects can be classified in more than one way, and the classifications are independent of each other. E.g., temporary/permanent is independent of officer/secretary/teller.
o Create a subclass for each combination of subclasses. There is no need to create subclasses for combinations that are not possible in the database being modeled.
o A class inherits variables and methods from all its superclasses.
o There is potential for ambiguity when a variable/message N with the same name is inherited from two superclasses A and B.
    o There is no problem if the variable/message is defined in a shared superclass.
    o Otherwise, do one of the following: flag it as an error, rename the variables, or choose one.

More Examples of Multiple Inheritance

Conceptually, an object can belong to each of several subclasses.
o A person can play the roles of student, teacher, or footballPlayer, or any combination of the three. E.g., a student teaching assistant who also plays football.
o Multiple inheritance can be used to model the roles of an object, that is, to allow an object to take on any one or more of a set of types.
o However, many systems insist that an object have a most-specific class: there must be one class that the object belongs to which is a subclass of all the other classes the object belongs to.
o In such systems we create subclasses such as student-teacher and student-teacher-footballPlayer for each combination.
o When many combinations are possible, creating a subclass for each combination can become cumbersome.

Object Identity

An object retains its identity even if some or all of the values of its variables or the definitions of its methods change over time. Object identity is a stronger notion of identity than that found in programming languages or in data models not based on object orientation. The forms of identity are:
o Value: a data value is used for identity, e.g., the primary key value used in relational systems.
o Name: a name supplied by the user, as used for variables in procedures.
o Built-in: identity is built into the data model or programming language, and no user-supplied identifier is required. This is the form of identity used in object-oriented systems.

Object identifiers are used to uniquely identify objects.
o Object identifiers are unique: no two objects have the same identifier, and each object has only one object identifier.
o An identifier can be stored as a field of an object to refer to another object. E.g., the spouse field of a person object may be an identifier of another person object.
o Identifiers can be system generated (created by the database) or external (such as a social-security number).
o System-generated identifiers are easier to use, but cannot be used across database systems, and may be redundant if a unique identifier already exists.

Object Containment

o Each component in a design may contain other components; this can be modeled as containment of objects.
o Objects containing other objects are called composite objects.
o Multiple levels of containment create a containment hierarchy, whose links are interpreted as is-part-of, not is-a.
o Containment allows data to be viewed at different granularities by different users.
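The difference between value-based and built-in identity can be sketched in Python, whose objects carry a built-in identity that is independent of their field values. This is only an analogy: Python's `id()` and `is` give identity within a single program run, not the persistent identifiers an OODBMS provides, and the Person class and sample values here are hypothetical.

```python
class Person:
    def __init__(self, name, address):
        self.name = name
        self.address = address
        self.spouse = None    # will hold a reference to another Person object


a = Person("Hayes", "Main St")
b = Person("Hayes", "Main St")    # same values as a, but a different object

same_values = (a.name, a.address) == (b.name, b.address)
same_object = a is b

# A field of one object refers to another object, like the spouse example.
a.spouse = b

# Changing a variable's value does not change the object's identity.
identity_before = id(a)
a.address = "Park Ave"
identity_after = id(a)
```

Here `same_values` is true while `same_object` is false, and `a.spouse` keeps referring to `b` regardless of how the values stored in either object change.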

Object oriented Languages


Object-oriented concepts can be incorporated into a programming language that is used to manipulate the database in two ways:
o Object-relational systems: add complex data types and object orientation to a relational language.
o Persistent programming languages: extend an object-oriented programming language to deal with databases by adding concepts such as persistence and collections.

Persistent Programming Languages

An object-oriented programming language that has been extended to deal with databases is called a persistent programming language. It deals with persistent data, that is, data that continues to exist even after the program that created it has terminated.
o The language is extended with constructs to handle persistent data.
o The programmer can manipulate persistent data directly, with no need to fetch it into memory and store it back to disk (unlike embedded SQL).

Differences between a persistent programming language and embedded SQL:
o Data can be manipulated directly from the programming language, with no need to go through SQL.
o There is no need for explicit format (type) changes. Without a persistent programming language, format changes become a burden on the programmer.
o Objects are manipulated in memory, with no need to explicitly load them from or store them to the database.

Drawbacks of persistent programming languages:
o Because most programming languages are powerful, it is easy to make programming errors that damage the database.
o The complexity of the languages makes automatic high-level optimization more difficult.
o They do not support declarative querying as well as relational databases do.

Persistence of objects:
1. Persistence by class: explicit declaration of persistence.
2. Persistence by creation: special syntax to create persistent objects.
3. Persistence by marking: objects are made persistent after creation.
4. Persistence by reachability: an object is persistent if it is declared explicitly to be so or is reachable from a persistent object.

Object Identity and Pointers

Degrees of permanence of object identity:
o Intraprocedure: only during execution of a single procedure.
o Intraprogram: only during execution of a single program or query.
o Interprogram: across program executions, but not if the data-storage format on disk changes.
o Persistent: interprogram, plus persistent across data reorganizations.

Persistent versions of C++ and Java have been implemented:
o C++: ODMG C++, ObjectStore.
o Java: Java Data Objects (JDO).

Storage and Access of Persistent Objects

Storage: Each object contains data and code. The data part is stored separately for each object. The code that implements the methods of a class should be stored in the database as part of the database schema. However, many implementations store the code in files outside the database.

Access: There are several ways to find objects in the database:
1. Give names to objects, just as we give names to files.
2. Expose object identifiers or persistent pointers to the objects, which can be stored externally.
3. Store collections of objects, and allow programs to iterate over a collection to find the required objects.

Comparison of O-O and O-R Databases

o Relational systems: simple data types, powerful query languages, high protection.
o Persistent-programming-language-based OODBs: complex data types, integration with the programming language, high performance.
o Object-relational systems: complex data types, powerful query languages, high protection.

_____________________________________________________________________________________________________

Object Relational Database


o Nested Relations
o Complex Types
o Inheritance
o Reference Types

Nested Relations

The nested relational model is an extension of the relational model in which domains may be either atomic or relation valued. Thus, the value of a tuple on an attribute may be a relation, and relations may be contained within relations. A complex object can therefore be represented by a single tuple of a nested relation, which leads to an easier-to-understand model. We illustrate nested relations with an example from a library.

Example: library information system. Each book has
o a title,
o a set of authors,
o a publisher, and
o a set of keywords.

Because authors and keywords are set valued rather than atomic, such a nested relation violates First Normal Form (1NF).
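The trade-off can be sketched in Python. The single nested tuple below keeps the author and keyword sets inside one record, while the flattened form repeats the title and publisher once per author/keyword combination. The dictionary layout and helper name are assumptions made for illustration; the sample values reuse the Compilers book that appears later in these notes.

```python
# One nested tuple: set-valued attributes live inside the book record.
book = {
    "title": "Compilers",
    "authors": ["Smith", "Jones"],
    "publisher": "McGraw-Hill",
    "keywords": ["parsing", "analysis"],
}


def flatten_to_1nf(nested):
    """Return atomic-valued rows, one per (author, keyword) combination."""
    return [
        (nested["title"], author, nested["publisher"], keyword)
        for author in nested["authors"]
        for keyword in nested["keywords"]
    ]


flat_books = flatten_to_1nf(book)
```

With two authors and two keywords, the flat relation needs four rows to carry what the nested relation holds in one tuple.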

Complex Types
Using complex types we can represent E-R model concepts, such as identity of entities, multivalued attributes, and generalization and specialization, directly, without a complex translation to the relational model. Complex types include:
o Collection and large object types
o Structured types
o Inheritance
o Reference types

Collection and Large Object Types

Sets, multisets, and arrays are instances of collection types. Consider the following SQL statement:

    create table books (
        ...
        keyword-set setof(varchar(20))
        ...
    )

The table definition above allows attributes that are sets, which ordinary relational databases do not support. Thus the multivalued attribute keyword is represented directly by a collection type. An array is declared as follows:

    author-array varchar(20) array [10]

Here, author-array is an array of up to 10 author names.

Sometimes the database must store items that are large in size, such as a photograph of a person, a medical image, or a video clip. SQL has been extended to support such large-object data types. SQL:1999 provides two large-object data types, one for character data (clob) and one for binary data (blob). The letters "lob" in these data types stand for Large Object. For example:

    book-review clob(10KB)
    image blob(10MB)
    movie blob(2GB)

Structured Types

Structured types can be declared and used in SQL:

    create type Name as
        (firstname varchar(20),
         lastname varchar(20))
        final

    create type Address as
        (street varchar(20),
         city varchar(20),
         zipcode varchar(20))
        not final

Note: final and not final indicate whether subtypes can be created.

Structured types can be used to create tables with composite attributes:

    create table customer (
        name Name,
        address Address,
        dateOfBirth date)

Dot notation is used to reference components: name.firstname

User-defined row types:

    create type CustomerType as (
        name Name,
        address Address,
        dateOfBirth date)
        not final

We can then create a table whose rows are of a user-defined type:

    create table customer of CustomerType

Methods

A method declaration can be added to a structured type:

    method ageOnDate (onDate date)
        returns interval year

The method body is given separately:

    create instance method ageOnDate (onDate date)
        returns interval year
        for CustomerType
    begin
        return onDate - self.dateOfBirth;
    end

We can now find the age of each customer:

    select name.lastname, ageOnDate (current_date)
    from customer

Inheritance
Suppose that we have the following type definition for people:

    create type Person
        (name varchar(20),
         address varchar(20))

We can use inheritance to define the student and teacher types:

    create type Student
        under Person
        (degree varchar(20),
         department varchar(20))

    create type Teacher
        under Person
        (salary integer,
         department varchar(20))

Subtypes can redefine methods by using "overriding method" in place of "method" in the method declaration.

Multiple Inheritance

SQL:1999 and SQL:2003 do not support multiple inheritance. If our type system did support it, we could define a type for teaching assistants as follows:

    create type TeachingAssistant
        under Student, Teacher

To avoid a conflict between the two occurrences of department, we can rename them:

    create type TeachingAssistant
        under Student with (department as student_dept),
              Teacher with (department as teacher_dept)

Array and Multiset Types in SQL

Example of array and multiset declarations:

    create type Publisher as
        (name varchar(20),
         branch varchar(20))

    create type Book as
        (title varchar(20),
         author-array varchar(20) array [10],
         pub-date date,
         publisher Publisher,
         keyword-set varchar(20) multiset)

    create table books of Book

This is similar to the nested relation books, but with an array of authors instead of a set.

Creation of Collection Values

Array construction:

    array ['Silberschatz', 'Korth', 'Sudarshan']

Multiset construction:

    multiset ['computer', 'database', 'SQL']

To create a tuple of the type defined by the books relation:

    ('Compilers', array['Smith', 'Jones'],
     Publisher ('McGraw-Hill', 'New York'),
     multiset ['parsing', 'analysis'])

To insert the preceding tuple into the relation books:

    insert into books
    values
        ('Compilers', array['Smith', 'Jones'],
         Publisher ('McGraw-Hill', 'New York'),
         multiset ['parsing', 'analysis'])

Object-Identity and Reference Types

Define a type Department with a field name and a field head that is a reference to the type Person, with table people as scope:

    create type Department (
        name varchar (20),
        head ref (Person) scope people)

We can then create a table departments as follows:

    create table departments of Department

We can omit the declaration scope people from the type declaration and instead make an addition to the create table statement:

    create table departments of Department
        (head with options scope people)

Initializing Reference-Typed Values

To create a tuple with a reference value, we can first create the tuple with a null reference and then set the reference separately:

    insert into departments
        values ('CS', null)

    update departments
        set head = (select p.person_id
                    from people as p
                    where name = 'John')
        where name = 'CS'

_____________________________________________________________________________________________________

Distributed Databases
In a distributed database system:
o Data is spread over multiple machines (also referred to as sites or nodes).
o A network interconnects the machines.
o Data is shared by users on multiple machines.

Types of Transactions

Distributed database systems support two types of transaction:
o A local transaction accesses data only at the single site at which the transaction was initiated.
o A global transaction either accesses data at a site different from the one at which the transaction was initiated, or accesses data at several different sites.

An Example of a Distributed Database

Consider a banking system consisting of four branches in four cities. Each branch has its own computer, with a database of all the accounts maintained at that branch. There also exists one single site that maintains information about all the branches of the bank.

Each branch maintains a relation account(Account-schema), where
    Account-schema = (account-number, branch-name, balance)
The site containing information about all the branches of the bank maintains the relation branch(Branch-schema), where
    Branch-schema = (branch-name, branch-city, assets)
There are other relations maintained at the various sites.

Example of a local transaction: Consider a transaction to add $50 to account number A-177, located at the Valleyview branch. If the transaction was initiated at the Valleyview branch, it is local; otherwise it is a global transaction.

Example of a global transaction: A transaction to transfer $50 from account A-177 to account A-305, which is located at the Hillside branch, is a global transaction, since accounts at two different sites are accessed.

Types of Distributed Databases

Distributed databases are classified as:
1. Homogeneous distributed databases
2. Heterogeneous distributed databases

Homogeneous distributed databases. In a homogeneous distributed database:
o All sites have identical software.
o Sites are aware of each other and agree to cooperate in processing user requests.
o Each site surrenders part of its autonomy in terms of its right to change schemas or software.
o The system appears to the user as a single system.

Heterogeneous distributed databases. In a heterogeneous distributed database:
o Different sites may use different schemas and software. Differences in schema are a major problem for query processing; differences in software are a major problem for transaction processing.
o Sites may not be aware of each other and may provide only limited facilities for cooperation in transaction processing.
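The local/global distinction reduces to comparing the initiating site with the set of sites whose data is accessed. The following Python sketch makes that rule explicit; the function name and the account-to-site table are hypothetical, with the two accounts placed at the branches named in the example.

```python
# Hypothetical directory mapping each account to the site that stores it.
ACCOUNT_SITE = {"A-177": "Valleyview", "A-305": "Hillside"}


def classify_transaction(origin_site, accounts):
    """Return 'local' if every accessed account is stored at the
    originating site, and 'global' otherwise."""
    accessed_sites = {ACCOUNT_SITE[account] for account in accounts}
    return "local" if accessed_sites <= {origin_site} else "global"


deposit_at_home = classify_transaction("Valleyview", ["A-177"])
deposit_from_elsewhere = classify_transaction("Hillside", ["A-177"])
transfer = classify_transaction("Valleyview", ["A-177", "A-305"])
```

The deposit is local only when it is initiated at Valleyview, and the transfer is global wherever it starts because it touches two sites.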

Distributed Data Storage

Assume the relational data model. A relation can be stored in a distributed database in the following ways:
i) Replication: the system maintains multiple copies of the data, stored at different sites, for faster retrieval and fault tolerance.
ii) Fragmentation: the relation is partitioned into several fragments stored at distinct sites.
iii) Replication and fragmentation combined: the relation is partitioned into several fragments, and the system maintains several identical replicas of each fragment.

Data Replication

A relation or fragment of a relation is replicated if it is stored redundantly at two or more sites. Full replication of a relation is the case where the relation is stored at all sites. Fully redundant databases are those in which every site contains a copy of the entire database.

Advantages of replication:
o Availability: failure of a site containing relation r does not result in unavailability of r if replicas exist.
o Parallelism: queries on r may be processed by several nodes in parallel.
o Reduced data transfer: relation r is available locally at each site containing a replica of r.

Disadvantages of replication:
o Increased cost of updates: each replica of relation r must be updated.
o Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented. One solution is to choose one copy as the primary copy and apply concurrency control operations on the primary copy.

Data Fragmentation

Fragmentation is the division of relation r into fragments r1, r2, ..., rn which contain sufficient information to reconstruct relation r.

Horizontal fragmentation: a relation r is partitioned into a number of subsets r1, r2, ..., rn. Each tuple of relation r must belong to at least one of the fragments, so that the original relation can be reconstructed if needed. In general, a horizontal fragment can be defined as a selection on the global relation r. That is, we use a predicate Pi to construct fragment ri:
    ri = σ_Pi(r)
We reconstruct the relation r by taking the union of all fragments; that is,
    r = r1 ∪ r2 ∪ ... ∪ rn

Vertical fragmentation: the schema for relation r is split into several smaller schemas.
o All schemas must contain a common candidate key (or superkey) to ensure the lossless-join property.
o A special attribute, the tuple-id attribute, may be added to each schema to serve as a candidate key.
Vertical fragmentation of r(R) involves the definition of several subsets of attributes R1, R2, ..., Rn of the schema R so that
    R = R1 ∪ R2 ∪ ... ∪ Rn
Each fragment ri of r is defined by
    ri = Π_Ri(r)
The fragmentation should be done in such a way that we can reconstruct relation r from the fragments by taking the natural join
    r = r1 ⋈ r2 ⋈ ... ⋈ rn

Example: relation account with the following schema
    Account = (branch_name, account_number, balance)

Example: Horizontal Fragmentation of account Relation

Example: Vertical Fragmentation of employee_info Relation
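Both kinds of fragmentation, and the reconstruction of the original relation, can be sketched in Python on the account relation. Rows are modeled as dictionaries; the balances, the third account, the tuple_id column, and all helper names are assumptions made for this sketch and do not come from the missing figures.

```python
# Hypothetical contents of the account relation, with a tuple-id added.
account = [
    {"tuple_id": 1, "branch_name": "Hillside", "account_number": "A-305", "balance": 500},
    {"tuple_id": 2, "branch_name": "Valleyview", "account_number": "A-177", "balance": 205},
    {"tuple_id": 3, "branch_name": "Hillside", "account_number": "A-226", "balance": 336},
]


def select(relation, predicate):
    """Selection: keep the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Projection: keep only the listed attributes of every tuple."""
    return [{a: row[a] for a in attributes} for row in relation]


def natural_join(left, right):
    """Natural join of two non-empty relations on their common attributes."""
    common = set(left[0]) & set(right[0])
    return [{**l, **r} for l in left for r in right
            if all(l[a] == r[a] for a in common)]


def as_set(relation):
    """Order-independent view of a relation, used for comparison."""
    return {tuple(sorted(row.items())) for row in relation}


# Horizontal fragmentation: one fragment per branch, rebuilt by union.
r1 = select(account, lambda t: t["branch_name"] == "Hillside")
r2 = select(account, lambda t: t["branch_name"] == "Valleyview")
rebuilt_horizontal = r1 + r2

# Vertical fragmentation: both fragments keep tuple_id, rebuilt by natural join.
v1 = project(account, ["tuple_id", "branch_name"])
v2 = project(account, ["tuple_id", "account_number", "balance"])
rebuilt_vertical = natural_join(v1, v2)
```

Both reconstructions yield the original relation. Dropping tuple_id from either vertical fragment would leave the fragments with no common attribute, so the join would degenerate into a cross product and the original tuples could no longer be recovered.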

Advantages of Fragmentation

Horizontal:
o Allows parallel processing on fragments of a relation.
o Allows a relation to be split so that tuples are located where they are most frequently accessed.
Vertical:
o Allows tuples to be split so that each part of the tuple is stored where it is most frequently accessed.
o The tuple-id attribute allows efficient joining of vertical fragments.
o Allows parallel processing on a relation.
Vertical and horizontal fragmentation can be mixed: fragments may be successively fragmented to an arbitrary depth.

Advantages of Distributed Systems
1. Sharing data: users at one site are able to access data residing at other sites.
2. Autonomy: each site is able to retain a degree of control over the data stored locally.
3. Availability: if one site fails, the remaining sites may be able to continue operating.

Disadvantages of Distributed Systems
1. Software development cost: it is more difficult to implement a distributed database system, so it is more costly.
2. Greater potential for bugs: since the sites that constitute the distributed system operate in parallel, it is harder to ensure the correctness of algorithms, especially during failures of part of the system and during recovery from failure.
3. Increased processing overhead: the exchange of messages and the additional computation required to achieve intersite coordination are a form of overhead that does not arise in centralized systems.

_____________________________________________________________________________________________________

XML
Introduction

XML: Extensible Markup Language
o Defined by the World Wide Web Consortium (W3C).
o Originally intended as a document markup language, not a database language.
    o Documents have tags giving extra information about sections of the document, e.g.
          <title> XML </title>
          <slide> Introduction </slide>
    o Derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML.
    o Extensible, unlike HTML: users can add new tags, and separately specify how the tag should be handled for display.
o The ability to specify new tags and to create nested tag structures made XML a great way to exchange data, not just documents.
    o Much of the use of XML has been in data exchange applications, not as a replacement for HTML.
o Tags make data (relatively) self-documenting, e.g.

    <bank>
        <account>
            <account-number> A-101 </account-number>
            <branch-name> Downtown </branch-name>
            <balance> 500 </balance>
        </account>
        <depositor>
            <account-number> A-101 </account-number>
            <customer-name> Johnson </customer-name>
        </depositor>
    </bank>

XML: Motivation

o Data interchange is critical in today's networked world.
o XML has become the basis for all new-generation data interchange formats. Earlier-generation formats were based on plain text with line headers indicating the meaning of fields.
o Each XML-based standard defines what the valid elements are, using XML type specification languages to specify the syntax:
    o DTD (Document Type Definition)
    o XML Schema
o XML allows new tags to be defined as required.
o A wide variety of tools is available for parsing, browsing, and querying XML documents/data.

Structure of XML Data

o Tag: a label for a section of data.
o Element: a section of data beginning with <tagname> and ending with the matching </tagname>.
o Elements must be properly nested.
    o Proper nesting: <account> <balance> ... </balance> </account>
    o Improper nesting: <account> <balance> ... </account> </balance>
    o Formally: every start tag must have a unique matching end tag that is in the context of the same parent element.
o Every document must have a single top-level element.

Example of Nested Elements

    <bank-1>
        <customer>
            <customer-name> Hayes </customer-name>
            <customer-street> Main </customer-street>
            <customer-city> Harrison </customer-city>
            <account>
                <account-number> A-102 </account-number>
                <branch-name> Perryridge </branch-name>
                <balance> 400 </balance>
            </account>
            <account>
                ...
            </account>
        </customer>
        ...
    </bank-1>

A mixture of text with sub-elements is legal in XML. Example:

    <account>
        This account is seldom used any more.
        <account-number> A-102 </account-number>
        <branch-name> Perryridge </branch-name>
        <balance> 400 </balance>
    </account>

This is useful for document markup, but discouraged for data representation.

Attributes

Elements can have attributes:

    <account acct-type = "checking">
        <account-number> A-102 </account-number>
        <branch-name> Perryridge </branch-name>
        <balance> 400 </balance>
    </account>

o Attributes are specified by name=value pairs inside the starting tag of an element.
o An element may have several attributes, but each attribute name can occur only once:
      <account acct-type = "checking" monthly-fee = "5">

Attributes vs. Subelements

Distinction between subelement and attribute:
o In the context of documents, attributes are part of the markup, while subelement contents are part of the basic document contents.
o In the context of data representation, the difference is unclear and may be confusing. The same information can be represented in two ways:
      <account account-number = "A-101"> ... </account>
      <account> <account-number>A-101</account-number> ... </account>
o Suggestion: use attributes for identifiers of elements, and use subelements for contents.
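The two representations of account-number can be read back with Python's standard xml.etree.ElementTree module, which exposes attributes and subelements through different calls. This is a minimal sketch; the variable names are illustrative, and the last step simply checks that the parser rejects a repeated attribute name.

```python
import xml.etree.ElementTree as ET

# The same information, once as an attribute and once as a subelement.
as_attribute = ET.fromstring('<account account-number="A-101" acct-type="checking"/>')
as_subelement = ET.fromstring(
    "<account><account-number>A-101</account-number></account>"
)

from_attribute = as_attribute.get("account-number")           # attribute lookup
from_subelement = as_subelement.findtext("account-number")    # child element text

# An attribute name may occur only once per element; the parser rejects repeats.
try:
    ET.fromstring('<account acct-type="checking" acct-type="savings"/>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True
```

Both lookups return the same string, which is why the choice between the two forms is a modeling convention rather than a difference in expressive power.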

Namespaces

o XML data has to be exchanged between organizations.
o The same tag name may have different meanings in different organizations, causing confusion in exchanged documents.
o Specifying a unique string as an element name avoids confusion.
o Better solution: use unique-name:element-name.
o Avoid using long unique names all over the document by using XML namespaces:

    <bank xmlns:FB="http://www.FirstBank.com">
        ...
        <FB:branch>
            <FB:branchname> Downtown </FB:branchname>
            <FB:branchcity> Brooklyn </FB:branchcity>
        </FB:branch>
        ...
    </bank>

XML Document Schema

o Database schemas constrain what information can be stored, and the data types of stored values.
o XML documents are not required to have an associated schema.
o However, schemas are very important for XML data exchange; otherwise, a site cannot automatically interpret data received from another site.
o Two mechanisms for specifying XML schema:
    o Document Type Definition (DTD): widely used.
    o XML Schema: newer, increasing use.

Document Type Definition (DTD)

o The type of an XML document can be specified using a DTD.
o A DTD constrains the structure of XML data:
    o What elements can occur.
    o What attributes an element can/must have.
    o What subelements can/must occur inside each element, and how many times.
o A DTD does not constrain data types: all values are represented as strings in XML.
o DTD syntax:
      <!ELEMENT element (subelements-specification) >
      <!ATTLIST element (attributes) >

Element Specification in DTD

Subelements can be specified as
o names of elements, or
o #PCDATA (parsed character data), i.e., character strings, or
o EMPTY (no subelements) or ANY (anything can be a subelement).

Example:

    <!ELEMENT depositor (customer-name, account-number)>
    <!ELEMENT customer-name (#PCDATA)>
    <!ELEMENT account-number (#PCDATA)>

A subelement specification may contain regular expressions:

    <!ELEMENT bank ( ( account | customer | depositor)+)>

Notation:
    |  alternatives
    +  1 or more occurrences
    *  0 or more occurrences

Bank DTD

    <!DOCTYPE bank [
        <!ELEMENT bank ( ( account | customer | depositor)+)>
        <!ELEMENT account (account-number, branch-name, balance)>
        <!ELEMENT customer (customer-name, customer-street, customer-city)>
        <!ELEMENT depositor (customer-name, account-number)>
        <!ELEMENT account-number (#PCDATA)>
        <!ELEMENT branch-name (#PCDATA)>
        <!ELEMENT balance (#PCDATA)>
        <!ELEMENT customer-name (#PCDATA)>
        <!ELEMENT customer-street (#PCDATA)>
        <!ELEMENT customer-city (#PCDATA)>
    ]>

Limitations of DTDs

o No typing of text elements and attributes.
o Difficult to specify unordered sets of subelements.
o IDs and IDREFs are untyped.

XML Schema

XML Schema is a more sophisticated schema language which addresses the drawbacks of DTDs. It supports
    o Typing of values
        E.g. integer, string, etc.
        Also, constraints on min/max values
    o User-defined types
    o Is itself specified in XML syntax, unlike DTDs
    o Is integrated with namespaces
    o Many more features
        List types, uniqueness and foreign key constraints, inheritance ..
BUT: significantly more complicated than DTDs, not yet widely used.

XML Schema Version of Bank DTD

    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="bank" type="BankType"/>
    <xsd:element name="account">
        <xsd:complexType>
            <xsd:sequence>
                <xsd:element name="account-number" type="xsd:string"/>
                <xsd:element name="branch-name" type="xsd:string"/>
                <xsd:element name="balance" type="xsd:decimal"/>
            </xsd:sequence>
        </xsd:complexType>
    </xsd:element>
    .. definitions of customer and depositor ..
    <xsd:complexType name="BankType">
        <xsd:sequence>
            <xsd:element ref="account" minOccurs="0" maxOccurs="unbounded"/>
            <xsd:element ref="customer" minOccurs="0" maxOccurs="unbounded"/>
            <xsd:element ref="depositor" minOccurs="0" maxOccurs="unbounded"/>
        </xsd:sequence>
    </xsd:complexType>
    </xsd:schema>

Advantages of XML Schema over DTD

1. It allows the text that appears in elements to be constrained to specific types.
2. It allows user-defined types to be created.
3. It allows uniqueness and foreign key constraints.
4. It is integrated with namespaces to allow different parts of a document to conform to different schemas.

5. It allows types to be restricted to create specialized types, for instance by specifying minimum and maximum values.
6. It allows complex types to be extended by using a form of inheritance.

Disadvantages

1. XML Schema carries a higher cost (authoring and processing effort) than a DTD.
2. It is more complicated than a DTD.

Querying and Transforming XML Data

Translation of information from one XML schema to another
Querying on XML data
The above two are closely related, and are handled by the same tools.
Standard XML querying/translation languages
    o XPath: simple language consisting of path expressions
    o XSLT: simple language designed for translation from XML to XML and XML to HTML
    o XQuery: an XML query language with a rich set of features

XPath

XPath is used to address (select) parts of documents using path expressions.
A path expression is a sequence of steps separated by /
    o Think of file names in a directory hierarchy
Result of path expression: set of values that along with their containing elements/attributes match the specified path
E.g. /bank-2/customer/customer-name evaluated on the bank-2 data we saw earlier returns
    <customer-name>Joe</customer-name>
    <customer-name>Mary</customer-name>
E.g. /bank-2/customer/customer-name/text( )
    o returns the same names, but without the enclosing tags
The initial / denotes the root of the document (above the top-level tag).
Path expressions are evaluated left to right
    o Each step operates on the set of instances produced by the previous step
Selection predicates may follow any step in a path, in [ ]
    o E.g. /bank-2/account[balance > 400] returns account elements with a balance value greater than 400
    o /bank-2/account[balance] returns account elements containing a balance subelement

Attributes are accessed using @
    o E.g. /bank-2/account[balance > 400]/@account-number returns the account numbers of those accounts with balance > 400
    o IDREF attributes are not dereferenced automatically (more on this later)

Functions in XPath

XPath provides several functions
    o The function count() counts the number of elements in the set generated by a path
        E.g. /bank-2/account[count(./customer) > 2] returns accounts with > 2 customers
    o Also a function for testing the position (1, 2, ..) of a node w.r.t. its siblings
Boolean connectives "and" and "or" and the function not() can be used in predicates.
IDREFs can be referenced using the function id()
    o id() can also be applied to sets of references such as IDREFS and even to strings containing multiple references separated by blanks
    o E.g. /bank-2/account/id(@owner) returns all customers referred to from the owners attribute of account elements.

More XPath Features

Operator | is used to implement union
    o E.g. /bank-2/account/id(@owner) | /bank-2/loan/id(@borrower) gives customers with either accounts or loans
    o However, | cannot be nested inside other operators.
// can be used to skip multiple levels of nodes
    o E.g. /bank-2//customer-name finds any customer-name element anywhere under the /bank-2 element, regardless of the element in which it is contained.

XSLT

A stylesheet stores formatting options for a document, usually separately from the document
    o E.g. an HTML style sheet may specify font colors and sizes for headings, etc.
The Extensible Stylesheet Language (XSL) was originally designed for generating HTML from XML.
XSLT is a general-purpose transformation language
    o Can translate XML to XML, and XML to HTML
XSLT transformations are expressed using rules called templates
    o Templates combine selection using XPath with construction of results
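The XPath path expressions described earlier in this section can be tried out with a short script. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, which implements only a limited subset of XPath: paths are written relative to the root element, and numeric predicates and id() are not available, so those parts are emulated in plain Python. The small bank-2 document, its values, the owners attribute layout and all variable names are hypothetical stand-ins, not the original bank-2 data.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the bank-2 data; all values are illustrative.
BANK_XML = """
<bank-2>
  <account account-number="A-401" owners="C100 C102">
    <branch-name>Downtown</branch-name>
    <balance>500</balance>
  </account>
  <account account-number="A-402" owners="C102">
    <branch-name>Perryridge</branch-name>
    <balance>300</balance>
  </account>
  <customer customer-id="C100">
    <customer-name>Joe</customer-name>
  </customer>
  <customer customer-id="C102">
    <customer-name>Mary</customer-name>
  </customer>
</bank-2>
"""

root = ET.fromstring(BANK_XML)  # root is the <bank-2> element itself

# /bank-2/customer/customer-name/text()
names = [e.text for e in root.findall("./customer/customer-name")]

# /bank-2/account[balance]: accounts that contain a balance subelement
with_balance = root.findall("./account[balance]")

# /bank-2/account[balance > 400]/@account-number
# ElementTree has no numeric predicates, so the comparison is done in Python.
rich = [a.get("account-number")
        for a in root.findall("./account")
        if float(a.findtext("balance")) > 400]

# /bank-2//customer-name: skip any number of levels below the root
anywhere = [e.text for e in root.findall(".//customer-name")]

# /bank-2/account/id(@owners), emulated with a dictionary keyed on customer-id
by_id = {c.get("customer-id"): c for c in root.findall("./customer")}
owners = {by_id[ref].findtext("customer-name")
          for a in root.findall("./account")
          for ref in a.get("owners").split()}

print(names, len(with_balance), rich, anywhere, sorted(owners))
```

A full XPath processor would evaluate the commented expressions directly; the sketch only mirrors their meaning step by step.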

XSLT Templates

Example of XSLT template with match and select part
    <xsl:template match="/bank-2/customer">
        <xsl:value-of select="customer-name"/>
    </xsl:template>
    <xsl:template match="*"/>
The match attribute of xsl:template specifies a pattern in XPath.
Elements in the XML document matching the pattern are processed by the actions within the xsl:template element
    o xsl:value-of selects (outputs) specified values (here, customer-name)
For elements that do not match any template
    o Attributes and text contents are output as is
    o Templates are recursively applied on subelements
The <xsl:template match="*"/> template matches all elements that do not match any other template
    o Used to ensure that their contents do not get output.
If an element matches several templates, only one is used
    o Which one depends on a complex priority scheme/user-defined priorities
    o We assume only one template matches any element

Creating XML Output

Any text or tag in the XSL stylesheet that is not in the xsl namespace is output as is.
E.g. to wrap results in new XML elements:
    <xsl:template match="/bank-2/customer">
        <customer>
            <xsl:value-of select="customer-name"/>
        </customer>
    </xsl:template>
    <xsl:template match="*"/>
    o Example output:
        <customer> Joe </customer>
        <customer> Mary </customer>
Note: Cannot directly insert an xsl:value-of tag inside another tag
    o E.g. cannot create an attribute for <customer> in the previous example by directly using xsl:value-of
    o XSLT provides a construct xsl:attribute to handle this situation
        xsl:attribute adds an attribute to the preceding element
        E.g.
    <customer>

        <xsl:attribute name="customer-id">
            <xsl:value-of select="customer-id"/>
        </xsl:attribute>
    </customer>
results in output of the form
    <customer customer-id="...."> ....
xsl:element is used to create output elements with computed names.

Structural Recursion

The action of a template can be to recursively apply templates to the contents of a matched element.
E.g.
    <xsl:template match="/bank">
        <customers>
            <xsl:apply-templates/>
        </customers>
    </xsl:template>
    <xsl:template match="customer">
        <customer>
            <xsl:value-of select="customer-name"/>
        </customer>
    </xsl:template>
    <xsl:template match="*"/>
Example output:
    <customers>
        <customer> John </customer>
        <customer> Mary </customer>
    </customers>

XQuery

XQuery is a general purpose query language for XML data.
Currently being standardized by the World Wide Web Consortium (W3C).
Alpha version of XQuery engine available free from Microsoft.
XQuery is derived from the Quilt query language, which itself borrows from SQL, XQL and XML-QL.
XQuery uses a for ... let ... where ... return ... syntax
    for ⇔ SQL from
    where ⇔ SQL where
    return ⇔ SQL select
    let allows temporary variables, and has no equivalent in SQL

FLWR Syntax in XQuery

The for clause uses XPath expressions, and the variable in the for clause ranges over values in the set returned by XPath.
Simple FLWR expression in XQuery
    o find all accounts with balance > 400, with each result enclosed in an <account-number> .. </account-number> tag
        for $x in /bank-2/account
        let $acctno := $x/@account-number
        where $x/balance > 400
        return <account-number> { data($acctno) } </account-number>
The let clause is not really needed in this query, and the selection can be done in XPath. The query can be written as:
        for $x in /bank-2/account[balance > 400]
        return <account-number> { data($x/@account-number) } </account-number>

Storage of XML Data

XML data can be stored in
    o Non-relational data stores
        Flat files: natural for storing XML
        XML database: database built specifically for storing XML data, supporting DOM model and declarative querying
            Currently no commercial-grade systems
    o Relational databases
        Data must be translated into relational form
        Advantage: mature database systems
        Disadvantages: overhead of translating data and queries
_____________________________________________________________________________________________________
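As a rough illustration of the relational storage option listed at the end of the XML section above (data must be translated into relational form), the sketch below loads account elements into a table and then runs the relational counterpart of the "balance > 400" query. It assumes Python's standard sqlite3 and xml.etree.ElementTree modules; the table layout, the function name load_accounts and the sample values are hypothetical choices made only for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical account data in the style of the bank-2 examples.
ACCOUNTS_XML = """
<bank-2>
  <account account-number="A-401">
    <branch-name>Downtown</branch-name><balance>500</balance>
  </account>
  <account account-number="A-402">
    <branch-name>Perryridge</branch-name><balance>300</balance>
  </account>
  <account account-number="A-403">
    <branch-name>Brooklyn</branch-name><balance>900</balance>
  </account>
</bank-2>
"""


def load_accounts(xml_text, conn):
    """Translate each <account> element into one row of a relational table."""
    conn.execute(
        "CREATE TABLE account ("
        "account_number TEXT PRIMARY KEY, branch_name TEXT, balance REAL)"
    )
    root = ET.fromstring(xml_text)
    rows = [
        (a.get("account-number"),
         a.findtext("branch-name"),
         float(a.findtext("balance")))
        for a in root.findall("./account")
    ]
    conn.executemany("INSERT INTO account VALUES (?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_accounts(ACCOUNTS_XML, conn)

# Relational counterpart of selecting /bank-2/account[balance > 400]
# and returning each account number.
high_balance = [
    row[0]
    for row in conn.execute(
        "SELECT account_number FROM account "
        "WHERE balance > 400 ORDER BY account_number"
    )
]
print(loaded, high_balance)
```

The translation step in load_accounts is the overhead the notes mention: both the data and every query have to be mapped between the XML form and the relational form.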

Data Warehousing and Data Mining

Data Warehousing


Data sources often store only current data, not historical data.
Corporate decision making requires a unified view of all organizational data, including historical data.
A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
    o Greatly simplifies querying, permits study of historical trends
    o Shifts decision support query load away from transaction processing systems

Design Issues

When and how to gather data
    o Source driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g. at night)
    o Destination driven architecture: the warehouse periodically requests new information from data sources
    o Keeping the warehouse exactly synchronized with data sources (e.g. using two-phase commit) is too expensive
        Usually OK to have slightly out-of-date data at the warehouse
        Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
What schema to use
    o Schema integration

More Warehouse Design Issues

Data cleansing
    o E.g. correct mistakes in addresses (misspellings, zip code errors)
    o Merge address lists from different sources and purge duplicates
How to propagate updates
    o Warehouse schema may be a (materialized) view of schema from data sources
What data to summarize
    o Raw data may be too large to store on-line
    o Aggregate values (totals/subtotals) often suffice
    o Queries on raw data can often be transformed by the query optimizer to use aggregate values
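The point that aggregate values often suffice, and that a query on raw data can be answered from stored totals instead, can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the sales rows, the (item, store) grouping and the function names summarize and total_by_item are hypothetical, and a real warehouse would perform this rewrite inside its query optimizer.

```python
from collections import defaultdict

# Hypothetical raw sales rows: (item, store, units sold).
raw_sales = [
    ("shirt", "NYC", 2),
    ("shirt", "NYC", 3),
    ("shirt", "LA", 1),
    ("hat", "NYC", 4),
    ("hat", "LA", 5),
]


def summarize(rows):
    """Precompute subtotals per (item, store); this is what the warehouse keeps."""
    totals = defaultdict(int)
    for item, store, units in rows:
        totals[(item, store)] += units
    return dict(totals)


def total_by_item(summary):
    """Answer 'total units per item' from the subtotals, not from raw rows."""
    per_item = defaultdict(int)
    for (item, _store), units in summary.items():
        per_item[item] += units
    return dict(per_item)


summary = summarize(raw_sales)
per_item = total_by_item(summary)
print(summary)
print(per_item)
```

The per-item totals computed from the summary equal those that would be computed from the raw rows, which is why the raw data need not be kept on-line for this kind of query.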

Warehouse Schemas

Dimension values are usually encoded using small integers and mapped to full values via dimension tables.
The resultant schema is called a star schema
    o More complicated schema structures
        Snowflake schema: multiple levels of dimension tables
        Constellation: multiple fact tables

[Figure: Data Warehouse Schema]

Snowflake Schema: The snowflake schema is a variant of the star schema model where some dimension tables are normalized, thereby further splitting the data into additional tables. The major difference between snowflake and star is that the dimension tables of the snowflake may be kept in normalized form to reduce redundancies.

Advantages
Such tables are easy to maintain and save storage space.

Disadvantages
The snowflake structure can reduce the effectiveness of browsing, since more joins are needed to execute a query. Therefore, system performance degrades. Hence the snowflake schema is not as popular as the star schema.
[Figure: snowflake schema example. Recoverable column labels: SupplierKey, Suppliertype; Store-id, City, State, Country, Supplierkey]
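A star schema of the kind described above can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the table names (sales, store, item), their columns and the sample rows are hypothetical, chosen to show a fact table whose small-integer keys are mapped to full values through dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: map small integer keys to full descriptive values.
CREATE TABLE store (store_id INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE item  (item_id  INTEGER PRIMARY KEY, item_name TEXT, category TEXT);

-- Fact table: one row per sale, holding only keys and measures.
CREATE TABLE sales (
    item_id  INTEGER REFERENCES item(item_id),
    store_id INTEGER REFERENCES store(store_id),
    units    INTEGER
);

INSERT INTO store VALUES (1, 'Chennai', 'Tamil Nadu', 'India');
INSERT INTO store VALUES (2, 'Brooklyn', 'New York', 'USA');
INSERT INTO item  VALUES (1, 'shirt', 'clothing');
INSERT INTO item  VALUES (2, 'hat', 'clothing');
INSERT INTO sales VALUES (1, 1, 10);
INSERT INTO sales VALUES (2, 1, 5);
INSERT INTO sales VALUES (1, 2, 7);
""")

# Star schema query: a single join from the fact table to a dimension table.
units_by_country = conn.execute("""
    SELECT s.country, SUM(f.units)
    FROM sales AS f JOIN store AS s ON f.store_id = s.store_id
    GROUP BY s.country
    ORDER BY s.country
""").fetchall()
print(units_by_country)
```

In a snowflake variant, the city, state and country columns would move into further normalized tables, so the same question would need additional joins; that is the browsing cost mentioned under Disadvantages.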

Advantages of Data Warehouse

1. It provides architecture and tools for business executives to understand, organize and use their data to make strategic decisions.
2. Data warehouse systems are valuable tools in today's competitive, fast-evolving world.
3. It is a marketing weapon, a way to keep customers by learning more about their needs.

Application of Data Warehouse

1. Data warehouses are used extensively in banking and financial services and in consumer goods.
2. They are mainly used for generating reports and answering predefined queries.
3. They are used for strategic purposes, performing multidimensional analysis.
4. They are used for knowledge discovery and strategic decision-making using data mining tools.

Difficulties of implementing Data Warehouse

1. Every time a source database changes, the data warehouse administrator must consider the possible interactions with other elements of the warehouse.
2. A team of highly skilled technical experts with overlapping areas of expertise will likely be needed, rather than a single individual.

Data Mining

Data mining is the process of semi-automatically analyzing large databases to find useful patterns.
Prediction based on past history
    o Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ..) and past history
    o Predict if a pattern of phone calling card usage is likely to be fraudulent
Some examples of prediction mechanisms:
    o Classification
        Given a new item whose class is unknown, predict to which class it belongs
    o Regression formulae
        Given a set of mappings for an unknown function, predict the function result for a new parameter value

[Figure: Data mining as a confluence of multiple disciplines: database technology, statistics, information science, machine learning, visualization, and other disciplines]

Data Mining Issues

User interaction/Visualization
Incorporation of background knowledge
Noisy or incomplete data
Efficiency and scalability
Parallel and distributed mining
Incremental learning/Mining time-changing phenomena
Mining from image/video/audio data
Mining unstructured data

Requirements of data mining

1. Quality
2. The right data
3. Sample size
4. The right tool


Steps for Data Mining

1. The initial exploration
2. Model building or pattern identification with validation/verification
3. Deployment

Stage 1: The initial exploration
This stage usually starts with data preparation, which may involve cleaning data, data transformation, selecting subsets of records and performing some preliminary feature selection operations to bring the number of variables to a manageable range.

Stage 2: Model building or pattern identification with validation/verification
This stage involves considering various models and choosing the best one based on their predictive performance.

Stage 3: Deployment
The final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.

Descriptive Patterns

    o Associations
        Find books that are often bought by similar customers. If a new such customer buys one such book, suggest the others too.
    o Associations may be used as a first step in detecting causation
        E.g. association between exposure to chemical X and cancer
    o Clusters
        E.g. typhoid cases were clustered in an area surrounding a contaminated well
        Detection of clusters remains important in detecting epidemics

Classification Rules

Classification rules help assign new objects to classes.
    o E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
Classification rules for the above example could use a variety of data, such as educational level, salary, age, etc.
    ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent

    ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
Rules are not necessarily exact: there may be some misclassifications.
Classification rules can be shown compactly as a decision tree.

[Figure: Decision Tree]
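The two example classification rules can also be read as a small decision procedure, which is essentially what a decision tree encodes. The sketch below is a hypothetical rendering in Python: the function name classify_credit is invented for illustration, the income bounds of the second rule are assumed to be inclusive (25,000 to 75,000), and returning None when no rule applies is a choice made here, not something the notes specify.

```python
def classify_credit(degree, income):
    """Apply the two example rules; return None when neither rule fires."""
    # Rule 1: masters degree and income above 75,000 gives excellent credit.
    if degree == "masters" and income > 75_000:
        return "excellent"
    # Rule 2: bachelors degree and income between 25,000 and 75,000
    # (bounds assumed inclusive) gives good credit.
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "good"
    # The notes give no rule for the remaining cases.
    return None


applicants = [("masters", 90_000), ("bachelors", 50_000), ("bachelors", 10_000)]
for degree, income in applicants:
    print(degree, income, classify_credit(degree, income))
```

A learned decision tree would contain many more tests, and, as noted above, its rules need not be exact.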

Construction of Decision Trees

Training set: a data sample in which the classification is already known.
Greedy top-down generation of decision trees.
    o Each internal node of the tree partitions the data into groups based on a partitioning attribute, and a partitioning condition for the node
    o Leaf node: all (or most) of the items at the node belong to the same class, or all attributes have been considered, and no further partitioning is possible.

Association Rules

Retail shops are often interested in associations between different items that people buy.
    o Someone who buys bread is quite likely also to buy milk
    o A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
Association information can be used in several ways.
    o E.g. when a customer buys a particular book, an online shop may suggest associated books.
Association rules:

    bread ⇒ milk
    DB-Concepts, OS-Concepts ⇒ Networks
    o Left hand side: antecedent, right hand side: consequent
    o An association rule must have an associated population; the population consists of a set of instances
        E.g. each transaction (sale) at a shop is an instance, and the set of all transactions is the population
Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
    o E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low.
Confidence is a measure of how often the consequent is true when the antecedent is true.
    o E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.
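The definitions of support and confidence given above translate directly into a few lines of code. The following is a minimal sketch in Python; the ten sample transactions and the function names support and confidence are hypothetical, chosen so that the rule bread ⇒ milk comes out with the 80 percent confidence used in the example.

```python
# Hypothetical population: each set is one transaction (sale) at a shop.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "cereal"},
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk", "screwdriver"},
    {"cereal"},
    {"eggs"},
    {"cereal", "eggs"},
    {"milk"},
]


def support(population, antecedent, consequent):
    """Fraction of the population satisfying both antecedent and consequent."""
    both = antecedent | consequent
    return sum(1 for t in population if both <= t) / len(population)


def confidence(population, antecedent, consequent):
    """Fraction of instances with the antecedent that also have the consequent."""
    with_antecedent = [t for t in population if antecedent <= t]
    if not with_antecedent:
        return 0.0  # rule never applies; a convention chosen for this sketch
    both = antecedent | consequent
    return sum(1 for t in with_antecedent if both <= t) / len(with_antecedent)


print(support(transactions, {"bread"}, {"milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
print(support(transactions, {"milk"}, {"screwdriver"}))
```

Here bread appears in five transactions and four of those also contain milk, so the rule has support 4/10 and confidence 4/5, while milk ⇒ screwdriver is supported by only one transaction in ten.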

Other Types of Associations

Basic association rules have several limitations.
Deviations from the expected probability are more interesting
    o E.g. if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both
    o We are interested in positive as well as negative correlations between sets of items
        Positive correlation: co-occurrence is higher than predicted
        Negative correlation: co-occurrence is lower than predicted
Sequence associations / correlations
    o E.g. whenever bonds go up, stock prices go down in 2 days
Deviations from temporal patterns
    o E.g. deviation from a steady growth
    o E.g. sales of winter wear go down in summer

Clustering

Clustering: Intuitively, finding clusters of points in the given data such that similar points lie in the same cluster.
Can be formalized using distance metrics in several ways
    o Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized
        Centroid: point defined by taking the average of coordinates in each dimension.

    o Another metric: minimize the average distance between every pair of points in a cluster
Has been studied extensively in statistics, but on small data sets
    o Data mining systems aim at clustering techniques that can handle very large data sets
    o E.g. the Birch clustering algorithm

Hierarchical Clustering

Example from biological classification
    o (the word classification here does not mean a prediction mechanism)

Other examples: Internet directory systems (e.g. Yahoo)
Agglomerative clustering algorithms
    o Build small clusters, then cluster small clusters into bigger clusters, and so on
Divisive clustering algorithms
    o Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones

Other Types of Mining

Text mining: application of data mining to textual documents
    o cluster Web pages to find related pages
    o cluster pages a user has visited to organize their visit history
    o classify Web pages automatically into a Web directory
Data visualization systems help users examine large volumes of data and detect patterns visually
    o Can visually encode large amounts of information on a single screen
    o Humans are very good at detecting visual patterns

Application of Data Mining

1. Marketing
2. Finance
3. Manufacturing
4. Health care
_____________________________________________________________________________________________________
