Robin Bloor, Ph.D.
WHITE PAPER
Copyright 2011, The Bloor Group. All rights reserved. Neither this publication nor any part of it may be reproduced, transmitted, or stored in any form or by any means without either the prior written permission of the copyright holder or the issue of a license by the copyright holder. The Bloor Group is the sole copyright holder of this publication.
22214 Oban Drive Spicewood TX 78669 Tel: 512-524-3689
Executive Summary
This white paper was commissioned by Algebraix Data. The goal of the paper is to provide a definition of what a cloud database is and, in the light of that definition, examine the suitability of Algebraix Data's technology to fulfill the role of a cloud database. Here is a brief summary of the contents of this paper:

We define a cloud DBMS (CDBMS) to be a distributed database that can deliver a query service across multiple distributed database nodes located in multiple data centers, including cloud data centers. Querying distributed data sources is precisely the problem that businesses will encounter as cloud computing grows in popularity. Such a database also needs to deliver high availability and cater for disaster recovery.

In our view, a CDBMS only needs to provide a query service. SOA already delivers connectivity and integration for transactional systems, so we see no need for a CDBMS to cater for transactional traffic, only query traffic. A CDBMS needs to scale across large computer grids, but it also needs to be able to span multiple data centers and, as far as is possible, cater for slow network connections.

We review traditional databases, focusing primarily on relational databases and column store databases, concluding that such databases, as currently engineered, could not fulfill the role of a CDBMS. They have centralized architectures, and such architectures would encounter a scalability limit at some point, both within and between data centers. We conclude that a distributed peer-to-peer architecture is needed to satisfy the CDBMS characteristics that we have defined.

We move on to examine the Hadoop/MapReduce environment and its suitability as a CDBMS. It has much better scalability for many workloads than relational or column store databases because of its distributed architecture. However, it was not built for mixed workloads, for complex data structures, or even for multitasking. In its current form it emphasizes fault tolerance.
It succeeds as a database for very large volumes of data, but does not have the characteristics of a CDBMS.

Finally, we examine Algebraix Data's technology as implemented in its database product A2DB. Our conclusion is that it has an architecture which is suitable for deployment as a CDBMS. Our view is as follows: A2DB's unique capability to reuse intermediate results of queries that it has previously executed contributes to its delivering high performance at a single node. The same performance characteristics can be employed to speed up queries that join information between a local node and remote nodes, whether in the same data center or in a remote data center. Algebraix Data's technology is capable of global optimization, balancing the performance requirements of both global and local queries. Additionally, the technology can deliver high availability/fault tolerant operation.
We are aware that Algebraix Data has not deployed and tested its database A2DB in the role of a CDBMS; hence our conclusion is not that it qualifies as a CDBMS, but that it has an architecture that would enable it to be tested in this role.
[Figure 1: A user query arriving via the Internet at a CDBMS of eight peer nodes (Nodes 1-8), each managing its own data, spread across Cloud Data Center 1, Data Center 1 and Data Center 2.]
Figure 1. A CDBMS

Consider a company that has hosted transactional web applications in some remote data center, plus local applications (including BI applications) split between two data centers. Such a situation is illustrated in Figure 1. It is the typical situation that companies will have to deal with as we move forward. In practice, a query can originate from anywhere: from a PC within the corporation, which is connected by a fast line to the local data center; from a PC in the home via a VPN line; from a laptop via a WiFi connection; or from a smart phone via a 3G or 4G connection. For that reason we represent a query here as coming through the Internet, implying that the response will possibly travel through the Internet too. The CDBMS will not concentrate all query traffic through a single node. A peer-to-peer architecture will be far more scalable, with any single node able to receive any query. In such
[Figure 3: CDBMS Node A managing files A1 and A5 and databases A2, A3 and A4, splitting into two nodes that divide responsibility for those data sources.]
Figure 3. Cloud Database Node Splitting

Consider the situation illustrated in Figure 3, where Node A of the CDBMS is managing queries for files A1 and A5 and databases A2, A3 and A4. If the workload gets too great for the resources at its disposal then, assuming that there is another server available to use, it could split like an amoeba as indicated. The original node might take responsibility for file A1 and databases A2 and A3, while the newly created node takes responsibility for A4 and A5. In order to do this, Node A would have to keep a full history of query traffic so that it would be able to calculate the optimal division as it split in two. Similarly, there would need to be a reverse procedure that amalgamated two local nodes in the event that the query workload diminished. In concept, that takes care of queries that only access local data that Node A has responsibility for. However, there will necessarily be queries that span multiple nodes.
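The split described above - dividing data sources between two nodes so that historical query load is balanced - can be sketched in a few lines of Python. This is an illustrative toy, not Algebraix Data's actual algorithm: the per-source query counts, and the assumption that a simple 50/50 load split is the goal, are ours.

```python
from itertools import combinations

def split_node(query_counts):
    """Given per-source query counts observed at a node, choose a two-way
    split of the sources that best balances the historical query load.
    `query_counts` maps a source name (a file or database) to the number
    of past queries that touched it."""
    sources = list(query_counts)
    total = sum(query_counts.values())
    best, best_gap = None, float("inf")
    # Try every proper subset for the original node; the rest moves
    # to the newly created node (fine for the handful of sources a
    # node manages; a real system would use a heuristic).
    for r in range(1, len(sources)):
        for keep_candidate in combinations(sources, r):
            load = sum(query_counts[s] for s in keep_candidate)
            gap = abs(2 * load - total)  # distance from a 50/50 split
            if gap < best_gap:
                best_gap, best = gap, set(keep_candidate)
    return best, set(sources) - best

# Hypothetical workload for the sources in Figure 3:
keep, move = split_node({"A1": 40, "A2": 35, "A3": 30, "A4": 25, "A5": 20})
```

The reverse procedure (amalgamation) is trivial by comparison: merge the two count tables and treat the union as one node's responsibility.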
Distributed Queries
Consider the major entities that a company holds as data: customer, product, sales transaction, staff member, supplier, purchase transaction and so on. They crop up in many applications. Consequently, many queries that seek information on these major entities will inevitably span multiple nodes of a CDBMS. Even if we could find a convenient way to distribute and cluster the applications around these entities, there would be many queries that spanned multiple nodes. Most query-oriented databases, whether column store databases or traditional relational databases, could be configured to handle single-node queries. Technically, the fundamental challenge for the CDBMS is to handle distributed queries effectively. A distributed query which accesses multiple nodes of the CDBMS can be thought of as an amalgamation (a union) of several queries that access individual nodes of the CDBMS. This is illustrated in Figure 4. Note that the resolution of a query in this manner could result in more than one result set from each node, as illustrated. Once the answers have been calculated, the CDBMS has to determine which node will join them together.
[Figure 4: A query decomposed into four subqueries (SQ-1 to SQ-4) executed on CDBMS Nodes 2, 5 and 8 across two data centers, whose answers are then joined to resolve the query.]
Figure 4. CDBMS: Distributed Queries

The best node to choose is the one that is least cost in respect of time. That can depend upon many physical factors: not just the volume of data that needs to be transmitted, but the network speeds and how long it will take each node to carry out its work. It could even depend upon which node is currently busiest. The challenge is to find the fastest solution, but the problem is not a trivial one.
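A minimal cost model makes the trade-off concrete. The sketch below picks the join node by estimating, for each candidate, the time to ship the other nodes' result sets to it plus a crude measure of how busy it is. Every input here is hypothetical - result-set sizes, pairwise bandwidths and load factors are stand-ins for measurements a real CDBMS would maintain.

```python
def pick_join_node(result_sizes_mb, bandwidth_mbps, busy_factor):
    """Choose the cheapest node to perform the final join.
    result_sizes_mb : node -> size of that node's result set (MB)
    bandwidth_mbps  : (from_node, to_node) -> effective bandwidth (MB/s)
    busy_factor     : node -> seconds of join work per MB, reflecting load
    """
    best_node, best_cost = None, float("inf")
    for candidate in result_sizes_mb:
        # Time to transfer every other node's result to the candidate.
        transfer = sum(
            size / bandwidth_mbps[(node, candidate)]
            for node, size in result_sizes_mb.items()
            if node != candidate
        )
        # Time for the candidate to join the combined data, scaled by load.
        compute = busy_factor[candidate] * sum(result_sizes_mb.values())
        if transfer + compute < best_cost:
            best_node, best_cost = candidate, transfer + compute
    return best_node
```

With equal load factors, the model sends the join to the node that already holds the largest result set, which matches the intuition that the bulk of the data should not move.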
The IT industry never even tried to agree on a standard file format that exposed the metadata of a file. Thus the commonly used operating systems never provided such a file type. This
[Figure 5: A columnar database scaling up and out by adding servers, each configured with multiple CPUs and as much memory as possible, with table columns split across the servers.]
Figure 5. Column Store DBMS Scalability

This gave rise to the scalability approach illustrated in Figure 5, which depicts the general approach of the column store DBMS to scalability. First of all, data is compressed when it is loaded, resulting in a much smaller volume of data; one twentieth of the original raw data is achievable. Then the data is stored in columns. The columns may also be split up between disks and between servers. This ensures good parallelism. A query may need to read the whole of a column from a table, for example, so if the column is split between 12 disks that are split between two servers, then the data retrieval may be 12 times faster. Furthermore, the servers will most likely be configured for a high level of memory so that a good deal of the data is already in memory. The caching algorithms will probably split a fair amount of the memory equally between the disks to balance the average workload. In addition to this, multiple processes will be running, and they will be distributed between multiple cores in the CPUs on each server.
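The arithmetic behind this approach is worth making explicit. The helper below estimates a full-column scan time from the compression ratio and the number of disks the column is striped across; the figures fed to it (24 GB raw, 20:1 compression, 12 disks at 100 MB/s) are illustrative, not vendor benchmarks.

```python
def scan_seconds(raw_gb, compression_ratio, disks, mb_per_s_per_disk):
    """Rough scan-time estimate for one column under the scheme above:
    compress the data, stripe it evenly across disks, and read all
    stripes in parallel. Ignores seek time, caching and CPU cost."""
    compressed_mb = raw_gb * 1024 / compression_ratio
    per_disk_mb = compressed_mb / disks
    return per_disk_mb / mb_per_s_per_disk

# 24 GB of raw column data at 20:1 compression across 12 disks,
# each sustaining 100 MB/s, scans in roughly one second.
estimate = scan_seconds(24, 20, 12, 100)
```

The same formula shows why both levers matter: halving the compression ratio or the disk count doubles the scan time.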
[Figure 8: An Algebraix Data CDBMS node. A logical layer (the XSN Translator and the Universe Manager holding the algebraic model) sits above a physical Resource Manager (CPU/cores, memory, disk) and the node's managed data; the node answers local and remote access and delivers answers while peering with other CDBMS nodes (j and k), each fronting its own DBMS data sources.]
Figure 8. Algebraix Data's Technology in a Distributed Operation

The reuse of intermediate result sets proves valuable in a distributed environment and a cloud environment. Figure 8 illustrates this. The distributed architecture is peer-to-peer, so there could be many such nodes, even thousands, all functioning in the same way. On the left of the diagram are the data sources that this particular node takes input from and is responsible for. In order to load the database node it is only necessary to create load files of the source databases. The database doesn't immediately load the data; it just loads the metadata from those files. The way the technology works is that there is no data load per se. As queries arrive, it references the load files (or log files or other data files) and gradually accumulates intermediate result sets, which constitute its managed data store, as illustrated. It uses physically efficient mechanisms to store such data, the same techniques as the typical column store database: no indexes, data compression and data partitioning. There is complete separation between the logical representation of the data sets stored and the physical storage of those data sets.

It works in the following way: The XSN Translator translates a query into an algebraic representation that corresponds with the algebraic sets defined at a logical level in the Universe Manager. (XSN stands for Extended Set Notation.) The Universe Manager holds a logical model of all the database's sets and their relations. The Optimizer first works out which stored sets might participate in a solution. It may deduce it has to go to source data (load files) for all or part of the data requested by the query.
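The core idea - answer from accumulated intermediate results when possible, go to source data only when necessary - can be illustrated with a toy cache. A2DB reasons algebraically over sets to match partial results to new queries; the sketch below substitutes exact-expression matching for that algebra, so it is a stand-in for the concept only, and the names used are ours.

```python
class IntermediateStore:
    """Toy model of a node's managed data store of intermediate results.
    `execute` is the expensive path that reads the source load files."""

    def __init__(self, execute):
        self._execute = execute   # function: query expression -> result set
        self._cache = {}          # expression -> stored intermediate result

    def query(self, expression):
        if expression in self._cache:
            # Reuse a previously accumulated intermediate result set.
            return self._cache[expression]
        # Otherwise go to source data and keep the result for future reuse.
        result = self._execute(expression)
        self._cache[expression] = result
        return result
```

In the real system the match need not be exact: an intermediate result can contribute to any query whose algebraic representation it participates in, which is what makes the approach powerful.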
It calculates which node is the best node to execute the master query by estimating the resource cost of transporting result data from one location to another. If it decides to pass that responsibility to another node, it behaves as follows: It passes all the other subqueries to the nodes where they need to execute. It also informs each node where to deliver the result of its subquery. It then executes its own subquery and passes the result to the master node when local processing completes. At that point it has finished with that query.
If it has determined that it is, itself, the best node to execute the master query, it behaves as follows: It passes all the other subqueries to the nodes where they need to execute. It gives itself as the return address for the results of those subqueries. It executes its own subquery. When it receives all the remote result sets, it executes the master query. Finally, it dispatches the end result to the program that sent the query.
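The two cases above differ only in where the results are delivered; in both, every node (the master included) runs its own subquery and the master joins whatever arrives. The sketch below simulates that flow in-process - the node names, the result tables and the set-union join are all invented for illustration.

```python
def run_distributed_query(subqueries, node_results, master, join):
    """Simulate the master-query protocol described above.
    subqueries   : node -> the subquery assigned to that node
    node_results : (node, subquery) -> that subquery's local result
    master       : the node chosen to execute the master query
    join         : function merging the collected result sets
    """
    delivered = []
    for node, sq in subqueries.items():
        # Each node executes its subquery locally...
        result = node_results[(node, sq)]
        # ...and sends the result to the master (the return address it
        # was given); the master's own result simply stays put.
        delivered.append((node, result))
    # Once all result sets have arrived, the master executes the join
    # and the merged answer is dispatched to the querying program.
    return join([result for _, result in delivered])
```

A real implementation would overlap the transfers and the master's local work; the point here is only the shape of the protocol.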
Note that in carrying out such a distributed query, the database gathers some remote result sets at the node that masters the distributed query. It will save these results as remote result sets.
Failover
With Hadoop, failure of any node can be catered for. The same is true of Algebraix Data's technology. It is fairly easy to configure complete node mirrors so that a standby node can take over immediately if an active node fails. It would be more economic, though, to use a SAN at each data center and only mirror data that is written to disk (the intermediate results). Then if a node fails, it will be possible to recover the node from the SAN. This injects a greater delay into the recovery process, as the recovered database would have to recreate the last known state of the failed node. In practice, Algebraix Data's technology can run on commodity servers. While it may appear that it has a substantial requirement for data storage because of its strategy of storing intermediate results, in practice this is not the case. This is because, after a suitable time has passed, the database deletes the intermediate results it didn't reuse. The database rarely requires the deployment of additional storage (such as NAS or a SAN). For atypical workloads, special configurations can be deployed for any given node.
Node Splitting
Node splitting becomes necessary when the query load for a node becomes too great. The need becomes apparent when the performance of the node begins to decline. However, node splitting is simple to achieve: a replica of the node is created, and the data sources that the new node will be responsible for are defined, deleting those it will not be responsible for from the Universe Manager. The technology can estimate what the best split is likely to be from an analysis of past query workloads. It can also recognize which intermediate results are derived from which source files or databases, so it reclassifies those intermediate results as remote rather than local. The original node is reconfigured in the same way, deleting the data sources that it is no longer responsible for. The nature of the changes is then relayed to all the nodes in the CDBMS.
Data Growth
Most source data will consist of databases that are themselves being added to on a regular basis. That data growth is best dealt with by feeding database log file images to the database. For other applications which simply use file systems, it is best to feed the equivalent of an update audit trail to the database. There is a specific reason for this. Algebraix Data's technology does not cater for updated data in the way most databases do. Typically, database updates destroy data by over-writing one value with another. This database technology is different. It treats updates as additional (i.e. new) data. In effect, they become non-destructive updates, with a record of the previous values remaining. For deletions, it simply marks the set of data or a data item as no longer current. To achieve these things, the database adds a time stamp to all data as it arrives and is used (if such a time stamp does not exist in the source data). All queries to the database either specify the time that applies, so that the result has an "as at" date/time, or omit the time, in which case the current data applies.
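The update model just described - timestamped, append-only writes with tombstones for deletion and "as at" reads - can be captured in a small sketch. This is our own illustration of the general technique, not Algebraix Data's implementation, and it assumes writes arrive in timestamp order.

```python
import bisect

class VersionedValue:
    """Non-destructive updates: every write appends a timestamped
    version, deletion marks the value as no longer current, and a
    read may ask for the value 'as at' any point in time."""

    def __init__(self):
        self._times, self._values = [], []

    def write(self, ts, value):
        # Updates never overwrite; prior values remain on record.
        self._times.append(ts)
        self._values.append(value)

    def delete(self, ts):
        # A tombstone: the value is simply not current after ts.
        self.write(ts, None)

    def as_at(self, ts=None):
        if ts is None:
            # No time specified: return the current value.
            return self._values[-1] if self._values else None
        # Latest version written at or before ts.
        i = bisect.bisect_right(self._times, ts)
        return self._values[i - 1] if i else None
```

Feeding log file images or an update audit trail, as the text recommends, suits this model exactly: each log record is one more timestamped `write`.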
www.TheVirtualCircle.com www.BloorGroup.com