Robin Bloor, Ph.D.
WHITE PAPER
Copyright 2011, The Bloor Group. All rights reserved. Neither this publication nor any part of it may be reproduced, transmitted, or stored in any form or by any means without either the prior written permission of the copyright holder or the issue of a license by the copyright holder. The Bloor Group is the sole copyright holder of this publication.
22214 Oban Drive Spicewood TX 78669 Tel: 512-524-3689
Executive Summary
This white paper was commissioned by Algebraix Data. The goal of the paper is to provide a definition of what a cloud database is and, in the light of that definition, examine the suitability of Algebraix Data's technology to fulfill the role of a cloud database. Here is a brief summary of the contents of this paper:

We define a cloud DBMS (CDBMS) to be a distributed database that can deliver a query service across multiple distributed database nodes located in multiple data centers, including cloud data centers. Querying distributed data sources is precisely the problem that businesses will encounter as cloud computing grows in popularity. Such a database also needs to deliver high availability and cater for disaster recovery.

In our view, a CDBMS only needs to provide a query service. SOA already delivers connectivity and integration for transactional systems, so we see no need for a CDBMS to cater for transactional traffic, only query traffic. A CDBMS needs to scale across large computer grids, but it also needs to be able to span multiple data centers and, as far as is possible, cater for slow network connections.

We review traditional databases, focusing primarily on relational databases and column store databases, concluding that such databases, as currently engineered, could not fulfill the role of a CDBMS. They have centralized architectures, and such architectures would encounter a scalability limit at some point, both within and between data centers. We conclude that a distributed peer-to-peer architecture is needed to satisfy the CDBMS characteristics that we have defined.

We move on to examine the Hadoop/MapReduce environment and its suitability as a CDBMS. It has much better scalability for many workloads than relational or column store databases because of its distributed architecture. However, it was not built for mixed workloads, for complex data structures, or even for multitasking. In its current form it emphasizes fault tolerance.
It succeeds as a database for very large volumes of data, but does not have the characteristics of a CDBMS.

Finally, we examine Algebraix Data's technology as implemented in its database product A2DB. Our conclusion is that it has an architecture which is suitable for deployment as a CDBMS. Our view is as follows: A2DB's unique capability to reuse intermediate results of queries that it has previously executed contributes to its delivering high performance at a single node. The same performance characteristics can be employed to speed up queries that join information between a local node and remote nodes, whether in the same data center or in a remote data center. Algebraix Data's technology is capable of global optimization, balancing the performance requirements of both global and local queries. Additionally, the technology can deliver high availability/fault tolerant operation.
We are aware that Algebraix Data has not deployed and tested its database A2DB in the role of a CDBMS; hence our conclusion is not that it qualifies as a CDBMS, but that it has an architecture that would enable it to be tested in this role.
[Figure 1: A user query arriving via the Internet at a CDBMS of eight peer nodes (Nodes 1-8), each managing its own data, spread across Cloud Data Center 1, Data Center 1 and Data Center 2.]
Figure 1. A CDBMS

Consider a company that has hosted transactional web applications in some remote data center, plus local applications (including BI applications) split between two data centers. Such a situation is illustrated in Figure 1. It is the typical situation that companies will have to deal with as we move forward. In practice, a query can originate from anywhere: from a PC within the corporation, which is connected by a fast line to the local data center; from a PC in the home via a VPN line; from a laptop via a WiFi connection; or from a smart phone via a 3G or 4G connection. For that reason we represent a query here as coming through the Internet, implying that the response will possibly travel through the Internet too. The CDBMS will not concentrate all query traffic through a single node. A peer-to-peer architecture will be far more scalable, with any single node able to receive any query. In such
[Figure 3: CDBMS Node A managing files A1 and A5 and databases A2, A3 and A4, splitting into two nodes that divide responsibility for those data sources.]
Figure 3. Cloud Database Node Splitting

Consider the situation illustrated in Figure 3, where Node A of the CDBMS is managing queries for files A1 and A5 and databases A2, A3 and A4. If the workload gets too great for the resources at its disposal then, assuming that there is another server available to use, it could split like an amoeba as indicated. The original node might take responsibility for file A1 and databases A2 and A3, while the newly created node takes responsibility for A4 and A5. In order to do this, Node A would have to keep a full history of query traffic so that it would be able to calculate the optimal division as it split in two. Similarly, there would need to be a reverse procedure that amalgamated two local nodes in the event that the query workload diminished. In concept, that takes care of queries that only access local data that Node A has responsibility for. However, there will necessarily be queries that span multiple nodes.
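The split described above - dividing data sources between two nodes so that historical query load is balanced - can be sketched in a few lines of Python. This is an illustrative toy, not Algebraix Data's actual algorithm: the per-source query counts, and the assumption that a simple 50/50 load split is the goal, are ours.

```python
from itertools import combinations

def split_node(query_counts):
    """Given per-source query counts observed at a node, choose a two-way
    split of the sources that best balances the historical query load.
    `query_counts` maps a source name (a file or database) to the number
    of past queries that touched it."""
    sources = list(query_counts)
    total = sum(query_counts.values())
    best, best_gap = None, float("inf")
    # Try every proper subset for the original node; the rest moves
    # to the newly created node (fine for the handful of sources a
    # node manages; a real system would use a heuristic).
    for r in range(1, len(sources)):
        for keep_candidate in combinations(sources, r):
            load = sum(query_counts[s] for s in keep_candidate)
            gap = abs(2 * load - total)  # distance from a 50/50 split
            if gap < best_gap:
                best_gap, best = gap, set(keep_candidate)
    return best, set(sources) - best

# Hypothetical workload for the sources in Figure 3:
keep, move = split_node({"A1": 40, "A2": 35, "A3": 30, "A4": 25, "A5": 20})
```

The reverse procedure (amalgamation) is trivial by comparison: merge the two count tables and treat the union as one node's responsibility.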
Distributed Queries
Consider the major entities that a company holds as data: customer, product, sales transaction, staff member, supplier, purchase transaction and so on. They crop up in many applications. Consequently, many queries that seek information on these major entities will inevitably span multiple nodes of a CDBMS. Even if we could find a convenient way to distribute and cluster the applications around these entities, there would be many queries that spanned multiple nodes. Most query-oriented databases, whether column store databases or traditional relational databases, could be configured to handle single-node queries. Technically, the fundamental challenge for the CDBMS is to handle distributed queries effectively. A distributed query which accesses multiple nodes of the CDBMS can be thought of as an amalgamation (a union) of several queries that access individual nodes of the CDBMS. This is illustrated in Figure 4. Note that the resolution of a query in this manner could result in more than one result set from each node, as illustrated. Once the answers have been calculated, the CDBMS has to determine which node will join them together.
[Figure 4: A query decomposed into four subqueries (SQ-1 to SQ-4) executed on CDBMS Nodes 2, 5 and 8 across two data centers, whose answers are then joined to resolve the query.]
Figure 4. CDBMS: Distributed Queries

The best node to choose is the one that is least cost in respect of time. That can depend upon many physical factors: not just the volume of data that needs to be transmitted, but the network speeds and how long it will take each node to carry out its work. It could even depend upon which node is currently busiest. The challenge is to find the fastest solution, but the problem is not a trivial one.
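A minimal cost model makes the trade-off concrete. The sketch below picks the join node by estimating, for each candidate, the time to ship the other nodes' result sets to it plus a crude measure of how busy it is. Every input here is hypothetical - result-set sizes, pairwise bandwidths and load factors are stand-ins for measurements a real CDBMS would maintain.

```python
def pick_join_node(result_sizes_mb, bandwidth_mbps, busy_factor):
    """Choose the cheapest node to perform the final join.
    result_sizes_mb : node -> size of that node's result set (MB)
    bandwidth_mbps  : (from_node, to_node) -> effective bandwidth (MB/s)
    busy_factor     : node -> seconds of join work per MB, reflecting load
    """
    best_node, best_cost = None, float("inf")
    for candidate in result_sizes_mb:
        # Time to transfer every other node's result to the candidate.
        transfer = sum(
            size / bandwidth_mbps[(node, candidate)]
            for node, size in result_sizes_mb.items()
            if node != candidate
        )
        # Time for the candidate to join the combined data, scaled by load.
        compute = busy_factor[candidate] * sum(result_sizes_mb.values())
        if transfer + compute < best_cost:
            best_node, best_cost = candidate, transfer + compute
    return best_node
```

With equal load factors, the model sends the join to the node that already holds the largest result set, which matches the intuition that the bulk of the data should not move.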
The IT industry never even tried to agree on a standard file format that exposed the metadata of a file. Thus the commonly used operating systems never provided such a file type. This
[Figure 5: A columnar database scaling up and out by adding servers, each configured with multiple CPUs and as much memory as possible, with table columns split across the servers.]
Figure 5. Column Store DBMS Scalability

This gave rise to the scalability approach illustrated in Figure 5, which depicts the general approach of the column store DBMS to scalability. First of all, data is compressed when it is loaded, resulting in a much smaller volume of data; one twentieth of the original raw data is achievable. Then the data is stored in columns. The columns may also be split up between disks and between servers. This ensures good parallelism. A query may need to read the whole of a column from a table, for example, so if the column is split between 12 disks that are split between two servers, then the data retrieval may be 12 times faster. Furthermore, the servers will most likely be configured for a high level of memory so that a good deal of the data is already in memory. The caching algorithms will probably split a fair amount of the memory equally between the disks to balance the average workload. In addition to this, multiple processes will be running, and they will be distributed between multiple cores in the CPUs on each server.
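The arithmetic behind this approach is worth making explicit. The helper below estimates a full-column scan time from the compression ratio and the number of disks the column is striped across; the figures fed to it (24 GB raw, 20:1 compression, 12 disks at 100 MB/s) are illustrative, not vendor benchmarks.

```python
def scan_seconds(raw_gb, compression_ratio, disks, mb_per_s_per_disk):
    """Rough scan-time estimate for one column under the scheme above:
    compress the data, stripe it evenly across disks, and read all
    stripes in parallel. Ignores seek time, caching and CPU cost."""
    compressed_mb = raw_gb * 1024 / compression_ratio
    per_disk_mb = compressed_mb / disks
    return per_disk_mb / mb_per_s_per_disk

# 24 GB of raw column data at 20:1 compression across 12 disks,
# each sustaining 100 MB/s, scans in roughly one second.
estimate = scan_seconds(24, 20, 12, 100)
```

The same formula shows why both levers matter: halving the compression ratio or the disk count doubles the scan time.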
[Figure 8: An Algebraix Data CDBMS node. A logical layer (the XSN Translator and the Universe Manager holding the algebraic model) sits above a physical Resource Manager (CPU/cores, memory, disk) and the node's managed data; the node answers local and remote access and delivers answers while peering with other CDBMS nodes (j and k), each fronting its own DBMS data sources.]
Figure 8. Algebraix Data's Technology in a Distributed Operation

The reuse of intermediate result sets proves valuable in a distributed environment and a cloud environment. Figure 8 illustrates this. The distributed architecture is peer-to-peer, so there could be many such nodes, even thousands, all functioning in the same way. On the left of the diagram are the data sources that this particular node takes input from and is responsible for. In order to load the database node it is only necessary to create load files of the source databases. The database doesn't immediately load the data; it just loads the metadata from those files. The way the technology works is that there is no data load per se. As queries arrive, it references the load files (or log files or other data files) and gradually accumulates intermediate result sets, which constitute its managed data store, as illustrated. It uses physically efficient mechanisms to store such data, the same techniques as the typical column store database: no indexes, data compression and data partitioning. There is complete separation between the logical representation of the data sets stored and the physical storage of those data sets.

It works in the following way: The XSN Translator translates a query into an algebraic representation that corresponds with the algebraic sets defined at a logical level in the Universe Manager. (XSN stands for Extended Set Notation.) The Universe Manager holds a logical model of all the database's sets and their relations. The Optimizer first works out which stored sets might participate in a solution. It may deduce it has to go to source data (load files) for all or part of the data requested by the query.
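The core idea - answer from accumulated intermediate results when possible, go to source data only when necessary - can be illustrated with a toy cache. A2DB reasons algebraically over sets to match partial results to new queries; the sketch below substitutes exact-expression matching for that algebra, so it is a stand-in for the concept only, and the names used are ours.

```python
class IntermediateStore:
    """Toy model of a node's managed data store of intermediate results.
    `execute` is the expensive path that reads the source load files."""

    def __init__(self, execute):
        self._execute = execute   # function: query expression -> result set
        self._cache = {}          # expression -> stored intermediate result

    def query(self, expression):
        if expression in self._cache:
            # Reuse a previously accumulated intermediate result set.
            return self._cache[expression]
        # Otherwise go to source data and keep the result for future reuse.
        result = self._execute(expression)
        self._cache[expression] = result
        return result
```

In the real system the match need not be exact: an intermediate result can contribute to any query whose algebraic representation it participates in, which is what makes the approach powerful.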
It calculates which node is the best node to execute the master query by estimating the resource cost of transporting result data from one location to another. If it decides to pass that responsibility to another node, it behaves as follows: It passes all the other subqueries to the nodes where they need to execute. It also informs each node where to deliver the result of its subquery. It then executes its own subquery and passes the result to the master node when local processing completes. At that point it has finished with that query.
If it has determined that it is, itself, the best node to execute the master query, it behaves as follows: It passes all the other subqueries to the nodes where they need to execute. It gives itself as the return address for the results of those subqueries. It executes its own subquery. When it receives all the remote result sets, it executes the master query. Finally, it dispatches the end result to the program that sent the query.
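The two cases above differ only in where the results are delivered; in both, every node (the master included) runs its own subquery and the master joins whatever arrives. The sketch below simulates that flow in-process - the node names, the result tables and the set-union join are all invented for illustration.

```python
def run_distributed_query(subqueries, node_results, master, join):
    """Simulate the master-query protocol described above.
    subqueries   : node -> the subquery assigned to that node
    node_results : (node, subquery) -> that subquery's local result
    master       : the node chosen to execute the master query
    join         : function merging the collected result sets
    """
    delivered = []
    for node, sq in subqueries.items():
        # Each node executes its subquery locally...
        result = node_results[(node, sq)]
        # ...and sends the result to the master (the return address it
        # was given); the master's own result simply stays put.
        delivered.append((node, result))
    # Once all result sets have arrived, the master executes the join
    # and the merged answer is dispatched to the querying program.
    return join([result for _, result in delivered])
```

A real implementation would overlap the transfers and the master's local work; the point here is only the shape of the protocol.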
Note that in carrying out such a distributed query, the database gathers some remote result sets at the node that masters the distributed query. It will save these results as remote result sets.
Failover
With Hadoop, failure of any node can be catered for. The same is true of Algebraix Data's technology. It is fairly easy to configure complete node mirrors so that a standby node can take over immediately if an active node fails. It would be more economic, though, to use a SAN at each data center and only mirror data that is written to disk (the intermediate results). Then if a node fails, it will be possible to recover the node from the SAN. This injects a greater delay into the recovery process, as the recovered database would have to recreate the last known state of the failed node. In practice, Algebraix Data's technology can run on commodity servers. While it may appear that it has a substantial requirement for data storage because of its strategy of storing intermediate results, in practice this is not the case. This is because, after a suitable time has passed, the database deletes the intermediate results it didn't reuse. The database rarely requires the deployment of additional storage (such as NAS or a SAN). For atypical workloads, special configurations can be deployed for any given node.
Node Splitting
Node splitting becomes necessary when the query load for a node becomes too great. The need becomes apparent when the performance of the node begins to decline. However, node splitting is simple to achieve: a replica of the node is created, and the data sources that the new node will be responsible for are defined, deleting those it will not be responsible for from the Universe Manager. The technology can estimate what the best split is likely to be from an analysis of past query workloads. It can also recognize which intermediate results are derived from which source files or databases, so it reclassifies those intermediate results as remote rather than local. The original node is reconfigured in the same way, deleting the data sources that it is no longer responsible for. The nature of the changes is then relayed to all the nodes in the CDBMS.
Data Growth
Most source data will consist of databases that are themselves being added to on a regular basis. That data growth is best dealt with by feeding database log file images to the database. For other applications which simply use file systems, it is best to feed the equivalent of an update audit trail to the database. There is a specific reason for this. Algebraix Data's technology does not cater for updated data in the way most databases do. Typically, database updates destroy data by over-writing one value with another. This database technology is different. It treats updates as additional (i.e. new) data. In effect, they become non-destructive updates, with a record of the previous values remaining. For deletions, it simply marks the set of data or a data item as no longer current. To achieve these things, the database adds a time stamp to all data as it arrives and is used (if such a time stamp does not exist in the source data). All queries to the database either specify the time that applies, so that the result has an "as at" date/time, or omit the time, in which case the current data applies.
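The update model just described - timestamped, append-only writes with tombstones for deletion and "as at" reads - can be captured in a small sketch. This is our own illustration of the general technique, not Algebraix Data's implementation, and it assumes writes arrive in timestamp order.

```python
import bisect

class VersionedValue:
    """Non-destructive updates: every write appends a timestamped
    version, deletion marks the value as no longer current, and a
    read may ask for the value 'as at' any point in time."""

    def __init__(self):
        self._times, self._values = [], []

    def write(self, ts, value):
        # Updates never overwrite; prior values remain on record.
        self._times.append(ts)
        self._values.append(value)

    def delete(self, ts):
        # A tombstone: the value is simply not current after ts.
        self.write(ts, None)

    def as_at(self, ts=None):
        if ts is None:
            # No time specified: return the current value.
            return self._values[-1] if self._values else None
        # Latest version written at or before ts.
        i = bisect.bisect_right(self._times, ts)
        return self._values[i - 1] if i else None
```

Feeding log file images or an update audit trail, as the text recommends, suits this model exactly: each log record is one more timestamped `write`.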
www.TheVirtualCircle.com www.BloorGroup.com