
Exploring Correlated Subspaces for Efficient Query Processing in Sparse Databases

Abstract

Sparse data is becoming increasingly common in many real-life applications. However, relatively little attention has been paid to modeling sparse data effectively, and existing approaches such as the conventional "horizontal" and "vertical" representations fail to provide satisfactory performance for both storage and query processing, as they are too rigid and generally ignore dimension correlations. In this project, we propose a new approach, named HoVer, to store and query sparse datasets in an unmodified RDBMS, where HoVer stands for Horizontal representation over Vertically partitioned subspaces. Based on the dimension correlations of sparse datasets, a novel mechanism vertically partitions a high-dimensional sparse dataset into multiple lower-dimensional subspaces, such that dimensions are highly correlated within a subspace and largely unrelated across subspaces. Original data objects can therefore be represented in the horizontal format within their respective subspaces. With the novel HoVer representation, users can write SQL queries over the original horizontal view, and these queries can easily be rewritten into queries over the subspace tables. Experiments over synthetic and real-life datasets show that our approach is effective in finding correlated subspaces and yields superior performance for the storage and query of sparse data.

Introduction

With continuous advances in network and storage technology, there has been dramatic growth in the amount of very high-dimensional sparse data from a variety of new application domains, such as bioinformatics, time series, and, perhaps most importantly, e-commerce, which poses significant challenges to RDBMSs. The main characteristics of these sparse data sets may be summarized as follows:

High dimensionality: The dimensionality of the feature vectors may be very high, i.e., the number of possible attributes over all objects is huge. For example, in some e-commerce applications, each participant may declare their own idiosyncratic attributes for products, which results in data sets with thousands of attributes.

Sparsity: Each object may have only a small subset of attributes, called its active dimensions; significant values appear only in these few active dimensions, and different objects may have different active dimensions. For example, an e-commerce data set may have thousands of attributes, most of which are null for any particular product.

Correlation: Since each object has only a few active dimensions, similar objects are likely to share the same or similar active dimensions. For example, in recommendation systems, it is important to find homogeneous groups of users with similar ratings on subsets of the attributes. Therefore, it is possible to find certain subspaces shared by similar objects.

In existing RDBMSs, objects are conventionally stored using a horizontal format, called the horizontal representation in this project. In this format, one column corresponds to an attribute, and if an object lacks a particular attribute, the corresponding column in the object's row is null. Storing a sparse data set in the horizontal format is straightforward and easy to implement. However, the format is not suitable for sparse databases, as it suffers from sparsity and frequent schema evolution, so the space and time performance may not be satisfactory; in addition, the number of columns in a horizontal table is typically limited to around 1,000 in commercial DBMSs, which is not enough for many real-life applications. Moreover, if the number of columns in a horizontal table exceeds that limit, a record may not reside in a single disk page, and the resulting page overflow will significantly degrade performance. Over the last decade, commercial RDBMSs such as DB2, SQL Server, and Oracle have improved their null storage and handling capabilities, which results in smaller horizontal tables and better query performance; nevertheless, the approach proposed in this project still uniformly outperforms the horizontal representation.

An alternative is known as the vertical format, called the vertical representation in this project. In this format, each active dimension of an object is represented by the object identifier, the attribute name, and the value. The vertical format can scale to thousands of attributes, avoids storing null values, and supports evolving schemas; however, writing queries over this format is cumbersome and error-prone, and an expensive multiway self-join must be conducted if the objects in the query result need to be returned in the conventional horizontal format.

From the above introduction, we know that both the horizontal and the vertical representations have advantages and disadvantages. Therefore, an optimal representation should retain the advantages and alleviate the drawbacks. In this project, we propose a new approach that combines the horizontal and the vertical representations and can store and query sparse data sets in an unmodified RDBMS. This novel representation is named HoVer, which stands for Horizontal representation over vertically partitioned subspaces. The HoVer representation efficiently finds a middle ground between the horizontal and the vertical representations whenever there are dimension correlations to exploit. In HoVer, we first vertically partition the data set into multiple lower-dimensional subspaces, and objects are represented in horizontal format in the subspace tables. Partitioning the sparse data space into meaningful subspaces is a nontrivial task; however, sparse data sets generally have some helpful properties in nature, such as sparsity and correlation.

Therefore, we can design an effective mechanism to split the data space into multiple subspaces: we define the correlated degree between dimensions and cluster highly correlated dimensions into a subspace. After partitioning, the original sparse data set can be transformed into the HoVer format. The horizontal, HoVer, and vertical approaches can all be framed on top of an unmodified RDBMS: users write SQL queries over the conventional horizontal view; for the HoVer and vertical approaches, the SQL queries are rewritten into queries over the subspace tables and the vertical table stored by the unmodified RDBMS, respectively, and the query results returned by the RDBMS are all in the horizontal format. A comprehensive experimental study demonstrates the superiority of our approach, as it fully exploits the properties of sparse data.

Existing system:

In existing RDBMSs, objects are conventionally stored using a horizontal format, called the horizontal representation in this project. In this format, one column corresponds to an attribute, and if an object lacks a particular attribute, the corresponding column in the object's row is null. Fig. 1 shows an example of storing a sparse data set using the horizontal format, which is straightforward and can be easily implemented. The format is not suitable for sparse databases, however, as it suffers from sparsity and frequent schema evolution, so the space and time performance may not be satisfactory; in addition, the number of columns in a horizontal table is typically limited to around 1,000 in commercial DBMSs, which is not enough for many real-life applications. If the number of columns in a horizontal table exceeds 1,000, a record may not reside in a single disk page, and the page overflow will significantly degrade performance. Over the last decade, commercial RDBMSs such as DB2, SQL Server, and Oracle have improved their null storage and handling capabilities, which results in smaller horizontal tables and better query performance.
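Since Fig. 1 itself is not reproduced here, the following is a minimal sketch of what such a horizontal table might look like; the schema and values are illustrative assumptions rather than the figure's actual data.

-- Horizontal table: one column per attribute, NULL where inactive.
CREATE TABLE H (
    OID INT PRIMARY KEY,
    D1 INT NULL, D2 INT NULL, D3 INT NULL, D4 INT NULL
);

-- A sparse object populates only its few active dimensions;
-- every unlisted column defaults to NULL.
INSERT INTO H (OID, D1, D2) VALUES (1, 5, 7);
INSERT INTO H (OID, D2) VALUES (2, 3);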

An alternative is known as the vertical format, called the vertical representation in this project. Fig. 2 shows an example of storing a sparse data set using the vertical format; each active dimension of an object is represented by the object identifier, the attribute name, and the value. The vertical format can scale to thousands of attributes, avoids storing null values, and supports evolving schemas; however, writing queries over this format is cumbersome and error-prone, and an expensive multiway self-join must be conducted if the objects in the query result need to be returned in the conventional horizontal format.
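As Fig. 2 is likewise not reproduced here, the sketch below shows a plausible vertical table and the kind of multiway self-join needed to return results horizontally; table and attribute names are assumptions carried over from the previous sketch.

-- Vertical table: one row per (object, active attribute) pair.
CREATE TABLE V (
    OID INT,
    Attr VARCHAR(30),
    Val INT
);

-- Reconstructing just two attributes horizontally already needs one
-- self-join per attribute against the list of object identifiers:
SELECT o.OID, v1.Val AS D1, v2.Val AS D2
FROM (SELECT DISTINCT OID FROM V) AS o
LEFT JOIN V v1 ON v1.OID = o.OID AND v1.Attr = 'D1'
LEFT JOIN V v2 ON v2.OID = o.OID AND v2.Attr = 'D2';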

From the above introduction, we know that both the horizontal and the vertical representations have advantages and disadvantages: the horizontal representation has many nulls but simple queries, while the vertical representation has no nulls but more complex queries. Therefore, an optimal representation should retain the advantages and alleviate the drawbacks. In this project, we propose a new approach that combines the horizontal and the vertical representations and can store and query sparse data sets in an unmodified RDBMS. This novel representation is named HoVer, which stands for Horizontal representation over vertically partitioned subspaces.

Proposed system:

THE HoVer REPRESENTATION

As we introduced previously, the pure horizontal or vertical representation may yield unsatisfactory performance in sparse databases. Therefore, we propose a new representation called HoVer, which can effectively exploit the characteristics of sparse data sets, such as sparsity and dimension correlation. We aim at achieving good space and time performance for storing and querying high-dimensional sparse data sets.

Although the dimensionality of sparse data sets can be very high, up to thousands, a single data object typically has only a few active dimensions, and similar objects have a better chance of sharing similar active dimensions. A closer inspection of e-commerce sparse data sets shows that they typically contain a wide variety of items organized into hierarchically grouped categories; items that belong to a common category are likely to have common attributes, while those within the same subcategory are likely to have even more attributes in common. RDF data likewise shows that the attributes of similar subjects tend to be defined together. This motivates us to find certain subspaces shared by similar data groups and to split the full space into lower-dimensional subspaces.

Some previous research has focused on subspace clustering. In general, subspace clustering is the task of automatically detecting all clusters in the original feature space, either by directly computing the subspace clusters or by selecting interesting subspaces for clustering. However, such approaches are very time-inefficient and hence cannot scale to high-dimensional spaces. For example, one previously proposed algorithm takes 5 hours for a 30-dimensional data set, jumping to 30 hours for a 50-dimensional data set.

For sparse data sets with thousands of dimensions, such approaches are unacceptable in real-life applications. Moreover, our purpose is to split the full space into subspaces that yield superior performance for the storage and query of sparse data, so these approaches are not suitable for this scenario. Here we introduce how to represent sparse data sets using the novel HoVer representation. First, we design an efficient and effective approach to find correlated dimensions. After that, we partition the original full space into subspaces and store the original sparse data set using multiple tables, where each table corresponds to a certain subspace.

Correlated Degree Determination

Before subspace selection, we first consider how to measure the correlation between two dimensions. Suppose that the sparse data set is n-dimensional and has N tuples; we generate a table to represent the relations between the dimensions of the data set. We call this table the correlation table for ease of presentation.

Definition 1 (Correlation Table). The correlation table C represents the correlation of dimensions in a sparse data set, and is an upper triangular matrix. An entry C[i][j], where i <= j, counts the number of tuples in which dimensions i and j are active simultaneously; in particular, the diagonal entry C[i][i] counts the number of tuples in which dimension i is active.

Given the sparse data set shown in Fig. 1, we can generate the corresponding correlation table shown in Fig. 5. For example, C[1][1] = 4, which means that dimension 1 is active in four tuples, and C[1][3] = 1, which means that dimensions 1 and 3 are active simultaneously in only one tuple. Algorithm 1 illustrates an efficient way to generate the correlation table for a sparse data set. We first initialize the correlation table C, setting each entry of the upper triangular matrix to 0. After that, the sparse data set is scanned, and the tuples are processed one by one. Each tuple is converted into an array of length n; in this array, an entry is set to 1 if the corresponding dimension is active, and to 0 otherwise. With this array, we accumulate the correlation information into the correlation table: for each active dimension i, we access the ith row of the upper triangular matrix, scan the array, and increase C[i][j] by 1 for every j such that dimensions i and j are both active. The algorithm is very efficient since the sparse data set only needs to be scanned once; it is also time-efficient because no distance computation is involved.
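Although Algorithm 1 is described as a file scan, the correlation table can equally be sketched as a single SQL aggregate over the vertical table V(OID, Attr, Val) introduced earlier; this is an illustrative alternative under that assumed schema, not the paper's implementation.

-- Upper triangle of the correlation table: rows with dim_i = dim_j give
-- the diagonal entries C[i][i], the rest give the co-occurrence counts
-- C[i][j]; attribute names stand in for the dimension indexes.
SELECT a.Attr AS dim_i, b.Attr AS dim_j, COUNT(*) AS cnt
FROM V a
JOIN V b ON a.OID = b.OID AND a.Attr <= b.Attr
GROUP BY a.Attr, b.Attr;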

After the correlation table is created, it can be incrementally maintained in the presence of updates: we only need to revise the entries of the correlation table that correspond to pairs of columns of a row which cease to be active simultaneously or begin to be active simultaneously. In the presence of insertions and deletions, the table can be maintained in a similar way.

The information in the correlation table can be utilized to evaluate the correlation between any two dimensions. We first define the correlated degree between two dimensions which can facilitate subspace partitioning of high-dimensional sparse data.

Definition 2 (Correlated Degree). The correlated degree corr(i, j) measures the correlation between two dimensions i and j, where i < j, in a sparse data set:

corr(i, j) = C[i][j] / (C[i][i] + C[j][j] - C[i][j]).

We use the correlated degree defined in Definition 2 to measure the correlation between two dimensions, where C[i][i] (respectively, C[j][j]) represents the number of tuples in which dimension i (respectively, j) is active, and C[i][j] represents the number of tuples in which dimensions i and j are active simultaneously. According to set theory, C[i][i] + C[j][j] - C[i][j] characterizes the number of tuples in which at least one of the two dimensions i and j is active. The correlated degree defined in Definition 2 hence characterizes the ratio between the number of tuples in which the two dimensions are active simultaneously and the number of tuples in which at least one of the two dimensions is active. In fact, the correlated degree used here is similar to the Jaccard coefficient used for finding hidden schemas in sparse data sets. There are many other measures of the correlation between two dimensions, including classical statistical measures and more general families of measures studied in the literature. In our work, the correlation between two dimensions should be measured according to the distribution of the active entries; the detailed attribute values are negligible.

At first glance, the lift-style measure

lift(i, j) = P(i, j) / (P(i) * P(j)),

where P(i) = C[i][i] / N means that dimension i is active (with a slight abuse of terminology, P(i) characterizes the ratio, rather than the probability, that dimension i is active) and P(i, j) = C[i][j] / N, seems to be a good choice for measuring the correlation. According to probability theory, as the value of lift(i, j) increases, the correlation between the two dimensions increases at the same time, and the two dimensions are independent in the case lift(i, j) = 1. But the value of lift(i, j) is highly influenced by the active densities of the two dimensions, which makes it ineligible for measuring the correlation. A variation of lift(i, j), in which P(i) is redefined as the ratio of tuples in which dimension i is active among the tuples in which at least one of the two dimensions is active, i.e., P'(i) = C[i][i] / (C[i][i] + C[j][j] - C[i][j]), also seems to be a good choice. However, since the two dimensions do not have to be active simultaneously in every row, we can prove that this variation never exceeds 1, which means that under it the two dimensions can never appear positively correlated; the problem with this variation is therefore that it cannot accurately measure the correlation between two dimensions in some cases. According to our correlation measure criteria and the above analysis, we select the correlated degree defined in Definition 2 to measure the correlation between two dimensions in sparse data sets, where 0 <= corr(i, j) <= 1, and the correlation between the two dimensions i and j increases with the value of corr(i, j).
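Assuming the correlation table has been materialized as a table CT(dim_i, dim_j, cnt) as in the earlier sketch, the correlated degree of Definition 2 is one further query; the table and column names are illustrative assumptions.

-- corr(i, j) = C[i][j] / (C[i][i] + C[j][j] - C[i][j])
SELECT p.dim_i, p.dim_j,
       CAST(p.cnt AS FLOAT) / (di.cnt + dj.cnt - p.cnt) AS corr
FROM CT p
JOIN CT di ON di.dim_i = p.dim_i AND di.dim_j = p.dim_i  -- C[i][i]
JOIN CT dj ON dj.dim_i = p.dim_j AND dj.dim_j = p.dim_j  -- C[j][j]
WHERE p.dim_i < p.dim_j;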

Subspace Selection

An optimal subspace partitioning should enjoy two properties: all dimensions are highly correlated within a subspace, while being largely unrelated across subspaces. If the number of subspaces chosen is too small, dimensions which are not highly correlated may be clustered into the same subspace, and the subspace tables remain very sparse. On the other hand, if the number of subspaces chosen is too large, dimensions which are highly correlated may be distributed into different subspaces; since highly correlated dimensions are often defined and accessed together, the join operations needed to access dimensions distributed across different subspaces are rather expensive.

According to the above analysis, the number of subspaces should be determined by the subspace selection algorithm itself, based on the dimension correlations of the sparse data set. Because the underlying storage and query processing details of the RDBMS may influence performance, a perfect subspace clustering typically does not exist. Therefore, the main aim of our subspace selection algorithm is to find the subspaces efficiently while yielding superior performance for the storage and query of sparse data. First of all, any two dimensions in a subspace should be highly correlated, which ensures that the subspace tables no longer suffer from sparsity. Next, in order to ensure that highly correlated dimensions, which are often defined and accessed together, can be clustered into the same subspace, the number of subspaces should be as small as possible.

Therefore, our subspace selection problem can be formally defined as follows: given a correlated degree threshold ε and a sparse data set whose full space contains n dimensions, we partition the original full space into m subspaces s1, ..., sm, where every dimension belongs to exactly one subspace (for any two subspaces si and sj with i ≠ j, si ∩ sj = ∅). Our objective is that the correlated degree between any two dimensions in a subspace is no less than ε and the number of subspaces m is minimized.

Our subspace selection problem can be mapped to the Minimum Clique Partition problem. Given a graph G = (V, E), the Minimum Clique Partition problem partitions V into disjoint subsets V1, ..., Vm such that, for each Vi, the subgraph induced by Vi is a complete graph, and the objective is that the number of partitions m is minimized. If we map each dimension in the sparse data set to a node in the graph and add an edge linking two nodes whenever the correlated degree between the corresponding dimensions is no less than the correlated degree threshold, our subspace selection problem becomes exactly the Minimum Clique Partition problem. Unfortunately, the Minimum Clique Partition problem is NP-complete, which means that we should use a heuristic algorithm that approximates optimal partitions by grouping together correlated dimensions.

Algorithm 2 presents how to generate subspaces from a given correlation table in a heuristic manner. While there exist unclassified dimensions, i.e., dimensions not yet included in any subspace, we pick the unclassified dimension D with the highest diagonal correlation table value, which is the most active dimension left in the sparse data set. Then a new subspace s containing D is generated, and all unclassified dimensions are examined: if the correlated degree between an unclassified dimension d and each dimension d' already in s is not less than the given correlated degree threshold ε, then d is added to subspace s. The algorithm clearly ensures that the correlated degree between any two dimensions in a subspace is no less than the correlated degree threshold, and it minimizes the number of subspaces in a greedy manner, i.e., it tries to add as many unclassified dimensions as possible to the current subspace.

The correlated degree threshold has a great influence on subspace generation. With a larger threshold, the non-null density of each subspace is higher, i.e., the dimensions in each subspace are more highly correlated, but more subspaces are generated. With a smaller threshold, fewer subspaces are generated, but the non-null density of each subspace is lower. In practice, the optimal correlated degree threshold varies across data sets.

Given the correlation table shown in Fig. 5, we are able to partition the original 8-dimensional space into multiple subspaces. At the beginning, D2 is selected as the first dimension of subspace s1, since D2 has the maximal diagonal correlation table value, i.e., 5. With a sufficiently low correlated degree threshold, D1, D3, and D4 are subsequently added to subspace s1, i.e., s1 = {D1, D2, D3, D4}. After that, we can use the same strategy to generate the other two subspaces. If the threshold is raised, e.g., to 0.5, four subspaces are generated instead, with s1 split into {D1, D2} and {D3, D4}. We can see that as the correlated degree threshold increases, the number of subspaces increases at the same time.

Vertical Partition

The HoVer representation is a vertical partition of the corresponding horizontal representation. The OID attribute exists in each subspace table for linking the data items that are partitioned into multiple subspaces. Transforming a horizontal representation into a HoVer representation is lossless, since the candidate key, i.e., OID, is contained in each subspace table. Fig. 6 shows the HoVer representation corresponding to the horizontal representation shown in Fig. 1 under a lower correlated degree threshold; as shown in Fig. 7, if we increase the threshold to 0.5, the subspace D1234 will be further split into two subspaces, D12 with objects {1, 2, 3, 4, 6} and D34 with objects {3, 5, 6}. We can see that if none of the dimensions of a subspace is active in a horizontally represented tuple, that tuple is absent from the subspace table after vertical partition. It is easy to convert horizontally represented data to the HoVer representation: for each horizontally represented tuple, if at least one dimension (not including OID) of a subspace is active, the OID along with the subspace dimensions is projected and inserted into the subspace table. For example, converting the horizontally represented table H shown in Fig. 1 to the subspace table shown in Fig. 6b (taking, for illustration, a subspace over dimensions D3 and D4) can be characterized by a relational algebraic expression of the form

π_{OID, D3, D4}(σ_{D3 ≠ null ∨ D4 ≠ null}(H)).
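In SQL, the same conversion is a simple projection with a filter; the table and column names below follow the running example and are illustrative assumptions.

-- Populate the subspace table for dimensions D3 and D4 from the
-- horizontal table H, keeping only tuples with an active dimension.
INSERT INTO D34 (OID, D3, D4)
SELECT OID, D3, D4
FROM H
WHERE D3 IS NOT NULL OR D4 IS NOT NULL;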

Schema Evolution

When a new column is added, a new subspace containing only that column is created, and the correlation table is updated accordingly. Since the correlation table is incrementally maintained, the new column may later be merged into another subspace when subspaces are reorganized. When a column is deleted, we only need to delete the column from the corresponding subspace and update the correlation table accordingly.
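As a hedged sketch, adding a new attribute (here a hypothetical D9) under HoVer therefore amounts to creating a fresh single-column subspace table; D9 may later be merged into another subspace when subspaces are reorganized from the updated correlation table.

-- New attribute starts life as its own single-column subspace.
CREATE TABLE D9 (
    OID INT,
    D9 INT
);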

QUERY PROCESSING IN HoVer

In this section, we introduce how the queries over the horizontal representation can be processed over the HoVer representation.

Query Rewriting

Our ultimate purpose is to define horizontally represented views over the HoVer representation. Users issue traditional SQL queries over the horizontal view, and these queries are rewritten into queries over the underlying HoVer representation. Generally, the reconstruction of the horizontal table H from the subspace tables s1, ..., sm can be characterized by a relational algebraic expression of the form

H = π_{OID, D1, ..., Dn}(O ⟕ s1 ⟕ ... ⟕ sm),

where O is the OID list, which contains all the OIDs in the horizontal table, and ⟕ denotes the left outer join on OID. Hence, we should maintain an OID list during vertical partition. For example, the horizontal table H shown in Fig. 1 can be reconstructed from the subspace tables shown in Fig. 6 by left outer joining the OID list with each subspace table on OID and projecting onto the original schema.
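The sketch below spells this reconstruction out in SQL for three hypothetical subspace tables; the groupings D1234, D56, and D78 are assumptions for illustration, and OIDLIST is assumed to hold every OID exactly once.

-- Horizontal view reconstructed from the subspace tables.
CREATE VIEW H_VIEW AS
SELECT o.OID, a.D1, a.D2, a.D3, a.D4, b.D5, b.D6, c.D7, c.D8
FROM OIDLIST o
LEFT JOIN D1234 a ON a.OID = o.OID
LEFT JOIN D56 b ON b.OID = o.OID
LEFT JOIN D78 c ON c.OID = o.OID;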

In our work, the dimensions of the original sparse data space are clustered into subspaces, and a horizontal table is vertically partitioned into subspace tables. In many real-life applications, dimensions with a high correlated degree are likely to characterize similar topics and have a high probability of being accessed together; hence, they should be stored in the same subspace table. We can take advantage of this characteristic and access as few subspace tables as possible during query evaluation, as the sketch below illustrates.
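For instance, a user query that touches only dimensions stored together can be rewritten to read a single dense subspace table with no joins at all; names follow the hypothetical view above.

-- User query over the horizontal view:
SELECT OID, D3, D4 FROM H_VIEW WHERE D3 > 10;

-- Rewritten query over the HoVer representation:
SELECT OID, D3, D4 FROM D1234 WHERE D3 > 10;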

FEASIBILITY STUDY

The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements of the system is essential. Three key considerations are involved in the feasibility analysis:

1. Economical feasibility
2. Technical feasibility
3. Operational feasibility

ECONOMICAL FEASIBILITY

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, so the expenditures must be justified. The developed system is well within budget, which was achieved because most of the technologies used are freely available; only the customized products had to be purchased.

OPERATIONAL FEASIBILITY

This aspect of the study checks the level of acceptance of the system by the users. It includes the process of training the users to use the system efficiently. Users must not feel threatened by the system; instead they must accept it as a necessity. The level of acceptance by the users depends on the methods employed to educate users about the system and to make them familiar with it. Their level of confidence must be raised so that they are also able to offer constructive criticism, which is welcomed, as they are the final users of the system.

TECHNICAL FEASIBILITY

This study is carried out to check the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client.

SYSTEM SPECIFICATION

S/W REQUIREMENTS

Windows XP
MS-SQL Server
MS Visual Studio 2005

H/W REQUIREMENTS

Processor : Dual Core
CPU Clock Speed : 651 MHz
External Memory : 512 MB (min)
Hard Disk Drive : 40 GB (min)
Mouse : Logitech Mouse
Keyboard : Logitech 104-key Keyboard
Monitor : 15.6" LCD Monitor

SOFTWARE SPECIFICATION
FRONT END - .NET FRAMEWORK

.NET is a "software platform": a language-neutral environment for developing rich .NET experiences and building applications that can easily and securely operate within it. When applications are deployed, they target .NET and will execute wherever .NET is implemented, instead of targeting a particular hardware/OS combination. The components that make up the .NET platform are collectively called the .NET Framework.

The .NET Framework is a managed, type-safe environment for developing and executing applications. It manages all aspects of program execution, such as allocating memory for the storage of data and instructions, granting and denying permissions to the application, managing execution of the application, and reclaiming memory for resources that are no longer needed.

The .NET Framework is designed for cross-language compatibility, meaning that an application written in Visual Basic .NET may reference a DLL file written in C# (C-Sharp), and a Visual Basic .NET class might be derived from a C# class, or vice versa. The .NET Framework consists of two main components:

Common Language Runtime (CLR)
Class Libraries

COMMON LANGUAGE RUNTIME (CLR)

The CLR is described as the "execution engine" of .NET. It provides the environment within which programs run. It is the CLR that manages the execution of programs and provides core services, such as code compilation, memory allocation, thread management, and garbage collection. Through the Common Type System (CTS), it enforces strict type safety, and it ensures that code is executed in a safe environment by enforcing code access security. The software version of .NET is actually the CLR version.

WORKING OF THE CLR

When a .NET program is compiled, the output of the compiler is not an executable file but a file that contains a special type of code called Microsoft Intermediate Language (MSIL), a low-level set of instructions understood by the common language runtime. MSIL defines a set of portable instructions that are independent of any specific CPU. It is the job of the CLR to translate this intermediate code into executable code when the program is executed, enabling the program to run in any environment for which the CLR is implemented; that is how the .NET Framework achieves portability. MSIL is turned into executable code by a JIT (Just-In-Time) compiler: when .NET programs are executed, the CLR activates the JIT compiler, which converts MSIL into native code on a demand basis as each part of the program is needed. Thus the program executes as native code, running as fast as if it had been compiled directly to native code, while retaining the portability benefits of MSIL.

CLASS LIBRARIES

The class library is the second major component of the .NET Framework and is designed to integrate with the common language runtime. It gives programs access to the runtime environment and consists of a large amount of prewritten code that all applications created in VB .NET and Visual Studio .NET will use. The code for elements such as forms and controls in VB .NET applications actually comes from this class library.

BACK END - SQL

SQL stands for Structured Query Language. SQL is used to communicate with a database; according to ANSI (American National Standards Institute), it is the standard language for relational database management systems. SQL statements are used to perform tasks such as updating data in a database or retrieving data from a database. Some common relational database management systems that use SQL are Oracle, Sybase, Microsoft SQL Server, Access, and Ingres. Although most database systems use SQL, most of them also have their own additional proprietary extensions that are usually only used on their own system. However, the standard SQL commands such as "Select", "Insert", "Update", "Delete", "Create", and "Drop" can be used to accomplish almost everything that one needs to do with a database. This tutorial provides instruction on the basics of each of these commands and allows you to put them into practice using the SQL Interpreter.

CREATE A TABLE

To create a new table, enter the keywords create table followed by the table name, an open parenthesis, the first column name, the data type for that column, any optional constraints, and a closing parenthesis after the end of the last column definition. Make sure you separate each column definition with a comma, and end every SQL statement with ";". Table and column names must start with a letter and can be followed by letters, numbers, or underscores, not exceeding a total of 30 characters in length. Do not use any SQL reserved keywords as names for tables or columns (such as "select", "create", "insert", etc.). Data types specify what type of data a particular column can hold. If a column called "Last_Name" is to be used to hold names, then that particular column should have a "varchar" (variable-length character) data type.

INSERTING INTO A TABLE

The insert statement is used to insert or add a row of data into the table. To insert records into a table, enter the keywords insert into followed by the table name, an open parenthesis, a list of column names separated by commas, a closing parenthesis, the keyword values, and the list of values enclosed in parentheses. The values that you enter will be held in the rows, matching up with the column names that you specify. Strings should be enclosed in single quotes; numbers should not.

insert into "tablename" (first_column,...last_column) values (first_value,...last_value);

UPDATING RECORDS

The update statement is used to update or change records that match a specified criteria. This is accomplished by carefully constructing a where clause.

update "tablename" set "columnname" = "newvalue" [,"nextcolumn" = "newvalue2"...] where "columnname" OPERATOR "value" [and|or "column" OPERATOR "value"];

DELETING RECORDS

The delete statement is used to delete records or rows from the table.

delete from "tablename" where "columnname" OPERATOR "value" [and|or "column" OPERATOR "value"];

DROP A TABLE

The drop table command is used to delete a table and all of its rows. To delete an entire table including all of its rows, issue the drop table command followed by the table name. Drop table is different from deleting all of the records in the table: deleting all of the records leaves the table, including its column and constraint information, whereas dropping the table removes the table definition as well as all of its rows.

drop table "tablename";

List of Modules

Data entry:

Get the details of the user. To demonstrate the efficiency of our HoVer method, the database needs many records; in this module we collect the data from the user for processing.

Horizontal Representation:

The horizontal format is straightforward and can be easily implemented. However, it is not suitable for sparse databases, as it suffers from sparsity and frequent schema evolution, so the space and time performance may not be satisfactory; in addition, the number of columns in a horizontal table is typically limited to around 1,000 in commercial DBMSs, which is not enough for many real-life applications.

Vertical representation:

In the vertical format, each active dimension of an object is represented by the object identifier, the attribute name, and the value. The vertical format can scale to thousands of attributes, avoids storing null values, and supports evolving schemas; however, writing queries over this format is cumbersome and error-prone, and an expensive multiway self-join must be conducted if the objects in the query result need to be returned in the conventional horizontal format.

HoVer representation:

HoVer stands for Horizontal representation over Vertically partitioned subspaces. The HoVer representation efficiently finds a middle ground between the horizontal representation and the vertical representation whenever there are dimension correlations to exploit. In HoVer, we first vertically partition the data set into multiple lower-dimensional subspaces, and objects are represented in horizontal format in the subspace tables. Partitioning the sparse data space into meaningful subspaces is a nontrivial task.

DFD:

ARCHITECTURE:

SYSTEM FLOW DIAGRAM:

INPUT DESIGN

Input design is one of the most important phases of system design. It is the process in which the inputs received by the system are planned and designed so as to obtain the necessary information from the user, while eliminating information that is not required. The aim of input design is to ensure the maximum possible level of accuracy and to ensure that the input is accessible and understood by the user.

Input design is a part of the overall system design that requires very careful attention. If the data going into the system is incorrect, then the processing and output will magnify the errors.

The objectives considered during input design are:

Nature of input processing.
Flexibility and thoroughness of validation rules.
Handling of properties within the input documents.
Screen design to ensure accuracy and efficiency of the input relationship with files.

Careful design of the input also involves attention to error handling, controls, batching, and validation procedures.

Input design features can ensure the reliability of the system and produce results from accurate data, or they can result in the production of erroneous information.

OUTPUT DESIGN

The term output applies to information produced by an information system, whether printed or displayed. While designing the output, we should identify the specific output that is needed to meet the information requirements, select a method to present the information, and create a document, report, or other format that contains the information produced by the system.

TYPES OF OUTPUT

Whether the output is a formatted report or a simple listing of the contents of a file, a computer process will produce the output as:

A document
A message
Retrieval from a data store
Transmission from a process or system activity
Directly from an output source

The output of our project is the query result over the sparse database, returned in the conventional horizontal format.

SOFTWARE TESTING FUNDAMENTALS

Testing presents an interesting task for software engineers. Earlier in the software process, the engineer attempts to build software from an abstract concept into a tangible implementation; in testing, the engineer creates a series of test cases that are intended to demolish the software that has been built. To test any program, we need a description of its expected behavior and a method of determining whether the observed behavior conforms to the expected behavior; for this we need a test oracle. A test oracle is a mechanism, different from the program itself, that can be used to check the correctness of the program's output for the test cases. A human oracle is a human being who mostly computes by hand what the output of the program should be; since human oracles can make mistakes, a test oracle is defined in the tool to automate testing and avoid mistakes.

Testing principles:

All tests should be traceable to requirements.
Tests should be planned long before testing begins; that is, test planning can begin as soon as the requirement model is complete.
Testing should begin in the small and progress towards testing in the large: the first tests planned and executed generally focus on individual program modules, and as testing progresses, the focus shifts to finding errors in integrated clusters of modules and ultimately in the entire system.

UNIT TESTING

In unit testing, the programs making up the system are tested; for this reason it is sometimes called program testing. The software units in a system are the module routines that are assembled and integrated to perform a specific function. Unit testing focuses on each module independently of the others, to locate the errors in coding and logic that are contained within the module alone. Setting breakpoints in the code makes it easy to find the error location when the input is given. The unit test is always white-box oriented, and since each module in the system receives input and generates output, test cases are needed to cover the expected range. The modules of this system were tested separately, and the unit tests were successful.

ACCEPTANCE TESTING

This is the final stage in the testing process before the system is accepted for operational use. The system is tested with data supplied by the system procurer rather than with simulated test data. Acceptance testing may reveal errors and omissions in the system requirements definition, because the real data exercises the system in different ways from the test data. It may also reveal requirements problems where the system's facilities do not really meet the user's needs or the system performance is unacceptable; however, this system met all the requirements of the user and performed well.

INTEGRATION TESTING

Integration-level testing focuses on the transfer of data and control across a program's internal and external interfaces. External interfaces are those with other software, system hardware, and the users, and can be described as communication links.

PERFORMANCE TESTING

Performance testing helps ensure that a product performs its functions at the required speed. Planning for performance testing starts at the beginning of the project, when product goals and requirements are defined; performance testing is part of the product's initial engineering plan.

SYSTEM TESTING

System-level testing demonstrates that all specified functionality exists and that the software product is trustworthy. This testing verifies the as-built program's functionality and performance with respect to the requirements for the software product as exhibited on the specified operating platform(s). System-level software testing addresses functional concerns and the following elements of a device's software that are related to the intended use(s):

Performance issues (e.g., response times, reliability measurements) and response to stress conditions, e.g., behavior under maximum load or continuous use.
Operation of internal and external security features.
Effectiveness of recovery procedures, including disaster recovery.
Usability.
Compatibility with other software products.
Behavior in each of the defined hardware configurations.
Accuracy of documentation.

Test Plan

Before going for testing, first decide the type of testing; for this system, unit testing is carried out. Before testing, the following objectives are taken into consideration:

To ensure that information properly flows in and out of the program.
To find out whether the local data structures maintain their integrity during all steps in an algorithm's execution.
To ensure that the module operates properly at boundaries established to limit or restrict processing.
To find out whether all statements in the module have been executed at least once.
To find out whether error-handling paths are working correctly or not.

TEST CASES

A test case is a set of conditions or variables under which a tester determines whether a requirement or use case of an application is partially or fully satisfied. It may take many test cases to determine that a requirement is fully satisfied. In order to fully verify that all the requirements of an application are met, there must be at least one test case for each requirement, unless a requirement has sub-requirements; in that situation, each sub-requirement must have at least one test case. A written test case has a known input and an expected output, which are worked out before the test is executed: the known input should test a precondition, and the expected output should test a postcondition. Test cases uncover errors in the following categories:

Erroneous initialization or default values and inconsistent data types.
Incorrect (misspelled or truncated) variable names.
Underflow, overflow, and addressing exceptions.

SYSTEM IMPLEMENTATION

Implementation is the most crucial stage in achieving a successful system and in giving the users confidence that the new system is workable and effective. This type of conversion is relatively easy to handle, provided there are no major changes in the system.

Each program is tested individually at the time of development using test data, and it has been verified that the programs link together in the way specified in the program specifications; the computer system and its environment are tested to the satisfaction of the user. The system that has been developed is accepted and proved to be satisfactory for the user, and so the system is going to be implemented very soon. A simple operating procedure is included so that the user can understand the different functions clearly and quickly.

Initially, as a first step, the executable form of the application is created and loaded onto the common server machine, which is accessible to all users, and the server is connected to a network. The final stage is to document the entire system, covering its components and operating procedures.

Implementation is the stage of the project when the theoretical design is turned into a working system. Thus it can be considered the most critical stage in achieving a successful new system and in giving the users confidence that the new system will work and be effective. Files are downloaded from the server with minimal retrieval time.

Conclusion:

In this project, we have addressed the problem of efficient query processing over sparse databases. To alleviate the problems caused by the sparsity and high dimensionality of sparse data, we proposed a new approach named HoVer. Guided by the characteristics of sparse data sets, we vertically partition the high-dimensional sparse data into multiple lower-dimensional subspaces, such that all the dimensions in each subspace are highly correlated. The experimental results show that our proposed scheme can find correlated subspaces effectively and yields superior storage and query performance for conducting queries in sparse databases.
