nosql databases
eMag Issue 47 - Nov 2016
NoSQL
ARTICLE
ARTICLE
INTERVIEW
Using Redis as a
Time-Series
Database
Current
State of NoSQL
Databases
Synchronization of data across systems is expensive and impractical when running systems
at scale. Traditional approaches for performing computations or information dissemination are not viable. In this article Basho Sr. Software Engineer Chris Meiklejohn explores
the basic building blocks for crafting deterministic applications that guarantee convergence
of data without synchronization.
Using Redis as a
Time-Series Database
Building a Mars-Rover
Application with DynamoDB
DynamoDB is a NoSQL database service that
aims to be easily managed, so you don't have to
worry about administrative burdens such as operating and scaling. This article shows how to use
Amazon DynamoDB to create a Mars Rover application. You can use the same concepts described
in this post to build your own web application.
In this article, author Dan Macklin discusses the transition to Riak NoSQL and Erlang
based architecture coupled with Convergent
Replicated Data Types (CRDTs) and lessons
learned with the transition.
NoSQL databases have been around for several years now and have
become a choice of data storage for managing semi-structured and
unstructured data. These databases offer a lot of advantages in terms
of linear scalability and better performance for both data writes and
reads. InfoQ spoke with four panelists to get different perspectives on
the current state of NoSQL databases.
SRINI PENCHIKALA
A LETTER FROM
THE EDITOR
NoSQL databases have been around for several years
now and have become the preferred choice of data
management for a variety of business use cases.
With the emergence of other trends like distributed
systems, cloud computing, social media, mobile devices, and Internet of Things (IoT), the need for NoSQL
database solutions has only become more critical in
recent years.
NoSQL databases offer a lot of advantages compared
to the traditional relational databases, in terms of linear scalability, better performance and cost effectiveness for managing the data.
They provide features like data partitioning and replication out of the box, which are critical for running
applications in distributed system environments.
NoSQL databases also offer built-in integration with
big data technologies like Hadoop and Spark.
It's important to take a look at the current state of
NoSQL databases and learn about what's happening
now in the NoSQL space and what's coming up in the
future for these database technologies.
This eMag focuses on the current state of NoSQL
databases. It includes articles, a presentation and a
virtual panel discussion covering a variety of topics
ranging from highly distributed computations and
time-series databases to what it takes to transition to a
NoSQL database solution.
The Highly Distributed Computations without Synchronization article covers the concept of Con-
Conflict-free replicated
data types
Figure 1
Distributed
deterministic dataflow
Figure 2
What's a join-semilattice?
Generalizing to join-semilattices
Distribution
Replication of
applications
Let's look at an example of an application that requires communication between a series of clients
and servers: an eventually consistent advertisement counter.
Figure 3
Model
The model assumes Dynamo-style partitioning and replication of data. It uses hash-space
partitioning and consistent hashing to break up the hash space
into a group of disjoint replication sets, each of which has a
group of replicas responsible for
Replication of variables
When partitioning and replicating variables, we assume the client application will run outside
of the cluster or spread across a
series of nodes internal or external to the cluster. Each operation,
such as bind or read, is turned
into a request and sent across the
network to the cluster responsible for managing the constraint
Advertisement counter
servers responsible for tracking advertisement impressions for all clients and
clients responsible for incrementing the advertisement
impressions.
This example uses a grow-only
counter, a counter that can handle concurrent increment operations in a safe and convergent
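As a rough illustration of the grow-only counter described above, the following Python sketch keeps one count per replica and merges two copies by taking the per-replica maximum. The function names and replica identifiers are assumptions made for this example; they are not taken from the article's own implementation.

```python
def increment(counter, replica, amount=1):
    """Return a copy of `counter` with `replica`'s own slot increased."""
    updated = dict(counter)
    updated[replica] = updated.get(replica, 0) + amount
    return updated


def value(counter):
    """The counter's value is the sum of all per-replica slots."""
    return sum(counter.values())


def merge(left, right):
    """Pointwise maximum: commutative, associative, and idempotent."""
    replicas = left.keys() | right.keys()
    return {r: max(left.get(r, 0), right.get(r, 0)) for r in replicas}


# Two hypothetical servers record impressions independently.
server_a = increment(increment({}, "server_a"), "server_a")
server_b = increment({}, "server_b")

merged = merge(server_a, server_b)
print(value(merged))
```

Because each replica only ever raises its own slot, merging in any order, or merging the same state twice, yields the same total.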
Different distribution
models
Where do we go from
here?
Feedback
Conflict-free
Replicated Data
Types (CRDTs)
provide the solution
to the semantic
resolution problem
described in
Amazon's Dynamo
paper.
Causal+ consistency
Redis has been used for storing and analyzing time-series data since
its creation. Initially intended as a buffer and destination for logging,
Redis has grown to include five explicit and three implicit structures/
types that offer different methods for analyzing data. This article
intends to introduce the reader to the most flexible method of using
Redis for time-series analysis.
A note on race
conditions and
transactions
Use cases
Initial questions about Redis and its use as a time-series database concern the use or purpose of a
time-series database itself. Its use cases relate to the
data involved: specifically, that the data is structured as a series of events or samples of one or more
values or metrics over time. A few examples include
(but are not limited to):
Storing events
Event analysis
Next steps
geo-indexed sorted-set commands.6
1. Read-only scripts can be interrupted if we have enabled the lua-time-limit configuration option and the script has been
executing for longer than the configured limit.
2. When scores are equal, items are sub-ordered by the lexicographic ordering of the members themselves.
3. While we generally use colons as name/namespace/data separators when operating with Redis data in this article, you can feel
free to use whatever character you like. Other users of Redis use periods, semicolons, and more. Picking some character that
doesn't usually appear in your keys or data is a good idea.
4. ZRANGE and ZREVRANGE offer the ability to retrieve elements from a sorted set based on their sorted position, indexed 0 from the
minimum score in the case of ZRANGE and indexed 0 from the maximum score in the case of ZREVRANGE.
5. ZCOUNT as a command does count the values in a range of data in a sorted set, but does so by starting at one endpoint and incrementally walking through the entire range. For ranges with many items, this can be quite expensive. As an alternative, we can
use ZRANGEBYSCORE and ZREVRANGEBYSCORE to discover the members at both the starting and ending points of the range. By
using ZRANK on the members of both ends, we can discover the indices of those members in the sorted set. And with both indices,
a quick subtraction of the two (plus one) will return the same answer with far less computational overhead, even if it may take a
few more calls to Redis.
6. Much like the Z*LEX commands introduced in Redis 2.8.9, which use sorted sets to provide limited prefix searching, Redis 3.2
offers limited geographic searching and indexing with GEO* commands.
Data model
In addition to using primary keys to access and manipulate specified items, DynamoDB also allows us to
search for specific data with query, update, and scan:
Query: A query operation finds items in a table using only primary key attribute values. We
must provide a hash key attribute/value pair and
optionally a range key attribute/value pair. For
instance, in the MSL Image Explorer app, we can
query for a specific picture by setting imageid = 201.
Primary key
Secondary indexes
Instead of scanning the entire table, which can sometimes be inefficient, we can create secondary indexes
to help the querying process. Secondary indexes on
a table will help optimize querying on non-key attributes. DynamoDB supports two kinds of secondary
indexes: a local secondary index that has the same
hash key as the table but a different range key, and
a global secondary index that has hash and range
keys that can differ from those on the table.
Secondary indexes can be thought of as separate tables that are first grouped by the index hash key then
by the index range key. For example, in the marsDemoImages table, we might want to look up images
from a specific mission and instrument, filtered by a
time range. To do this, we could create a secondary
index grouped first by the Mission+Instrument attribute (hash key) then by the TimeStamp attribute
(range key).
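To make the "grouped by hash key, then by range key" picture concrete, here is a small in-memory Python sketch. It is only a model of the idea, not DynamoDB's API: the helper names, the sample items, and the integer timestamps are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(items, hash_attr, range_attr):
    """Group items by the hash attribute, then sort each group by the range attribute."""
    index = defaultdict(list)
    for item in items:
        index[item[hash_attr]].append((item[range_attr], item["imageid"]))
    for rows in index.values():
        rows.sort()
    return index


def query_index(index, hash_value, start, end):
    """Return image ids for one hash key whose range key lies in [start, end]."""
    rows = index.get(hash_value, [])
    range_keys = [range_key for range_key, _ in rows]
    lo = bisect_left(range_keys, start)
    hi = bisect_right(range_keys, end)
    return [imageid for _, imageid in rows[lo:hi]]


# Hypothetical items shaped like rows of the marsDemoImages table.
items = [
    {"imageid": 201, "Mission+Instrument": "curiosity+mastcam", "TimeStamp": 100},
    {"imageid": 202, "Mission+Instrument": "curiosity+mastcam", "TimeStamp": 250},
    {"imageid": 203, "Mission+Instrument": "curiosity+navcam", "TimeStamp": 150},
    {"imageid": 204, "Mission+Instrument": "curiosity+mastcam", "TimeStamp": 400},
]

index = build_index(items, "Mission+Instrument", "TimeStamp")
print(query_index(index, "curiosity+mastcam", 100, 300))
```

The lookup touches only the rows under one hash key and uses binary search on the sorted range keys, which is the reason an index avoids scanning the whole table.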
3. update execution.
Query execution
Update execution
We explain how
JSON document
support for
DynamoDB has
made building
the Mars
Rover demo
application easy
and intuitive
Code 1
Code 2
Code 3
user votes for this photo. Next, we
try to put the item into the userVotes table, with the condition in
place: (Code 4)
fields to update in a simple and intuitive way. Let's take a look at how
the incrementVotesCount function
works: (Code 5)
Retrieve thumbnail images from DynamoDB
the JSON-document SDK, which
provides much more support for
Code 4
Code 5
Conclusions
You can use the concepts described in this post to build your
own web application. Let's recap
the process:
1.
2.
3.
4.
Dan Macklin is a hands-on technical manager who loves to learn, make things happen, and get
things done. After running his own business for 10 years, Dan is now the head of Research and
Development at Bet365.
Erlang
NoSQL databases
In operation-based CRDTs, commutative operations are propagated and applied to all data
replicas in situ, and as such their
implementation resembles a
distributed log-ship. Whilst this
approach reduces transmission
volumes, the fact that these operations are not idempotent
means that practical systems require additional network protocol guarantees that are not easy
to implement. Therefore, most
state-of-the-art production systems end up going with state-based CRDTs, in which full state
updates are sent to all replicas.
Upon receiving a new state, the
replica queries its own state, runs
an update function (which must
monotonically increase the state;
for example, adding a value
to an ordered set is a monotonic
operation), and finally calls a
merge function (which must be
commutative, associative, and
idempotent) to bring causal consistency to the new and previous
states.
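The requirements on the update and merge functions can be checked on a very small state-based CRDT. The Python sketch below uses a grow-only set, which is simpler than the types discussed in this article; the function names are assumptions chosen for the example.

```python
def update(state, value):
    """Monotonic update: the new state always contains the old one."""
    return state | {value}


def merge(left, right):
    """Set union is commutative, associative, and idempotent."""
    return left | right


# Two replicas start from the same state and diverge.
start = frozenset({"a"})
replica_1 = update(start, "b")
replica_2 = update(start, "c")

# Exchanging full states in either order converges to the same result.
converged = merge(replica_1, replica_2)
print(sorted(converged))
```

Since the merge is idempotent, a replica that receives the same full state twice ends up exactly where it was, which is why state-based designs do not depend on exactly-once delivery.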
Some examples will hopefully clarify this. At Bet365, we
ended up using the ORSWOT
(observed-remove set without
tombstones) CRDT as it facilitated add and remove operations in
a relatively space-efficient manner. Each ORSWOT is stored as a
value in a key-value store (Figure 1).
The grey text represents
our data. In this case, it's a
set of Erlang binary strings
([<<Data1>>, <<Data2>>,
<<Data3>>]). The green text
represents a version vector (a
type of server-side vector clock)
whose job is to keep track of the
entire set's top-level causal history. The blue text next to each
element represents its dotted
version vector (or dot). The dot
stores the actor and its associated count for the last mutation.
Figure 1
So let's look at what happens during an add operation. For simplicity, let's set our initial state:
{[{x,1}], [{<<Data1>>,[{x,1}]}]}
Adding <<Data2>> to this set using the unique
actor y results in:
{[{x,1},{y,1}], [{<<Data1>>,[{x,1}]}, {<<Data2>>,[{y,1}]}]}
Note that the new actor y has been added to the version vector and <<Data2>> has been added to the
set with a birth dot of [{y,1}].
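The add operation can be mirrored in a few lines of Python. This is a sketch under assumptions, not the production Erlang code: an ORSWOT is modelled as a pair of dictionaries (version vector, then element-to-dot), strings stand in for Erlang binaries, and the function name is hypothetical.

```python
def orswot_add(orswot, element, actor):
    """Return a new ORSWOT with `element` added by `actor`."""
    clock, entries = orswot
    clock = dict(clock)
    # Bump the actor's counter in the set-level version vector.
    clock[actor] = clock.get(actor, 0) + 1
    entries = dict(entries)
    # Record the birth dot: the actor and its new counter value.
    entries[element] = {actor: clock[actor]}
    return clock, entries


# Mirrors the initial state {[{x,1}], [{<<Data1>>,[{x,1}]}]}.
initial = ({"x": 1}, {"Data1": {"x": 1}})
after_add = orswot_add(initial, "Data2", "y")
print(after_add)
```

The result corresponds to the state shown above: actor y appears in the version vector with a count of 1, and the new element carries the dot {y,1}.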
That was easy enough, but what happens if we need
to merge two concurrently updated ORSWOTs (ORSWOT A and ORSWOT B)?
First, we'll set up some new data for our example.
Here is ORSWOT A:
ORSWOT A = {[{x,1},{y,2}],
[{<<Data1>>,[{x,1}]}, {<<Data2>>,[{y,1}]}, {<<Data3>>,[{y,2}]}]}
ORSWOT A has seen the addition of:
1. element <<Data1>> via actor x,
2. element <<Data2>> via actor y, and
3. element <<Data3>> via actor y.
We will merge ORSWOT A with the following ORSWOT B:
ORSWOT B = {[{x,1},{y,1},{z,2}],
[{<<Data2>>,[{y,1}]}, {<<Data3>>,[{z,1}]}, {<<Data4>>,[{z,2}]}]}
ORSWOT B has seen:
1.
2.
3.
4.
5.
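One way to read the merge of ORSWOT A and ORSWOT B is sketched below in Python. The rule used here is an assumption modelled on the usual description of ORSWOTs: a dot held by both sides is kept, while a dot held by only one side is kept only if the other side's version vector has not already seen it (otherwise the other side is taken to have removed the element). The function names and the use of Python sets of (actor, count) pairs for dots are choices made for this illustration only.

```python
def merge_clocks(clock_a, clock_b):
    """Pointwise maximum of two version vectors (actor -> counter)."""
    actors = clock_a.keys() | clock_b.keys()
    return {actor: max(clock_a.get(actor, 0), clock_b.get(actor, 0)) for actor in actors}


def unseen(dots, clock):
    """Dots that the given version vector has not observed yet."""
    return {(actor, count) for actor, count in dots if count > clock.get(actor, 0)}


def orswot_merge(a, b):
    clock_a, entries_a = a
    clock_b, entries_b = b
    merged_entries = {}
    for element in entries_a.keys() | entries_b.keys():
        dots_a = entries_a.get(element, set())
        dots_b = entries_b.get(element, set())
        keep = (
            (dots_a & dots_b)                    # known to both sides
            | unseen(dots_a - dots_b, clock_b)   # new to B, so not removed by B
            | unseen(dots_b - dots_a, clock_a)   # new to A, so not removed by A
        )
        if keep:
            merged_entries[element] = keep
    return merge_clocks(clock_a, clock_b), merged_entries


orswot_a = ({"x": 1, "y": 2},
            {"Data1": {("x", 1)}, "Data2": {("y", 1)}, "Data3": {("y", 2)}})
orswot_b = ({"x": 1, "y": 1, "z": 2},
            {"Data2": {("y", 1)}, "Data3": {("z", 1)}, "Data4": {("z", 2)}})

merged = orswot_merge(orswot_a, orswot_b)
print(merged)
```

Under these assumptions, <<Data1>> is dropped because B's version vector already covers {x,1} while B no longer holds the element, <<Data3>> keeps both of its dots, and the merged version vector is [{x,1},{y,2},{z,2}].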
Conclusions
Riak uses a
mechanism called
Vector Clocks to
detect concurrent
updates to data
without the need
for a central clock.
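The idea of detecting concurrent updates without a central clock can be shown with a short Python comparison of two vector clocks. This is a generic sketch rather than Riak's implementation; the function names and the dictionary representation are assumptions.

```python
def descends(a, b):
    """True if clock `a` has seen every event recorded in clock `b`."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a, b):
    """Classify the causal relationship between two vector clocks."""
    a_sees_b = descends(a, b)
    b_sees_a = descends(b, a)
    if a_sees_b and b_sees_a:
        return "equal"
    if a_sees_b:
        return "a after b"
    if b_sees_a:
        return "a before b"
    # Neither side has seen all of the other's events.
    return "concurrent"


print(compare({"x": 1, "y": 2}, {"x": 1, "y": 1, "z": 2}))
print(compare({"x": 1, "y": 1}, {"x": 1}))
```

The first pair is reported as concurrent because each clock contains an event the other has not seen; the second pair is ordered, so no conflict needs resolving.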
Virtual Panel:
Current State of NoSQL Databases
by Srini Penchikala
THE PANELISTS
Seema Jethani was until recently the director of product management at Basho Technologies
for Basho's flagship products, the distributed NoSQL databases Riak KV and Riak TS. Prior to
joining Basho, she held product management and strategy positions at Dell, Enstratius, and IBM.
She can be found on Twitter as @seemaj.
Perry Krug is a principal solutions architect and customer advocate for Couchbase. Perry has
worked with hundreds of users and companies to deploy and maintain Couchbase's NoSQL
database technology. He has over 10 years of experience in high-performance caching and
database systems.
Jim Webber is chief scientist with Neo Technology, the company behind the popular open-source graph database Neo4j, where he works on graph-database server technology and writes
open-source software. Jim is interested in using big graphs like the Web for building distributed
systems, which led him to co-write the book REST in Practice. He previously wrote Developing
Enterprise Web Services: An Architect's Guide.
Tim Berglund is a teacher, author, and technology leader with DataStax, where he serves as the
director of training. He can frequently be found speaking at conferences in the United States and
all over the world. He is the co-presenter of various O'Reilly training videos on topics ranging
from Git to distributed systems, and is the author of Gradle Beyond the Basics. He tweets as
@tlberglund, blogs very occasionally, and lives in Littleton, Colo. with his wife and their youngest
child.
NoSQL databases have been around for several years and have become
the preferred choice of data storage for managing semi-structured and
unstructured data.
These databases offer a lot of advantages in terms of linear scalability and better performance
for both data writes and reads.
With the emergence of Internet of Things (IoT) devices and
sensors and their generation of
time-series data, it's important
to take a look at the current state
of NoSQL databases and learn
about what's happening now
and what's coming up in the future for these databases.
InfoQ spoke with four panelists
from different NoSQL-database
organizations to get different
perspectives on the current state
of NoSQL databases.
At present, my work is dedicated to producing educational resources for Cassandra. Some Cassandra applications, like storing
time-series data, have specific
data models that work best and
can be learned as canned solutions. More general business-domain modeling, such as is intuitive for many developers using
a relational database, requires a
specific methodology that differs
from the received relational tradition. At one point, we were all
new to relational data modeling,
and we had to learn how to do it
well. It's the same thing with the
NoSQL databases. Someone has
to sit us down and explain to us
how they work and how best to
represent the world using the
data models they expose.
Having a graph view of your system enables you to have a predictive and reactive analysis of
faults and contention. You can
ascribe value to points of failure
and reason about their costs and
roll all of this up to the end user.
You can locate single points of
failure, too, and you can keep
your model up to date with live
monitoring data that then gives
an empirically tempered view of
the dependability of your whole
system and the risks to which it is
subjected.
ning and managing NoSQL databases, security has not been until
more recently. That's not to say
that some technologies didn't
provide better or worse capabilities in each area, but monitoring
has been a topic of discussion
and improvement for much longer and I think is in a much better
state overall.
I think security is the more interesting topic to talk about. The
early adopters of NoSQL technology didn't place a high value on or
have a great need for very robust
security capabilities. When faced
with an endless list of possible
features/improvements, the creators of NoSQL technologies followed what was most important
to their consumers. Over the last
few years, that level of value/importance on security has shifted
directly in line with the kinds of
applications and organizations
adopting NoSQL and the creators of those technologies have
followed suit.
In my opinion, it is not valid to
compare NoSQL to RDBMSs in
terms of security. RDBMSs have
had 30 to 40 years of history to
build those features. Looking
back into their history will show
that they made similar use-case-driven decisions to those NoSQL
has made in its early years. I have
no doubt that security will play
a more and more important role
for NoSQL and that the leading
technologies will continue to
build the features that their users
require.
Webber: I actually don't know
what the general case is, but I
imagine it's reasonable since
NoSQL databases underpin lots
of production systems. In Neo4j's
case, we have long had security
and monitoring baked into the
product and have a team whose
entire responsibility is these kinds
of operability opportunities. In
Performance
requirements always
have to win. You can't
turn a failing latency SLA
into a success if you are
asking the underlying
database to outdo its
best-case performance.
PREVIOUS ISSUES
45
A Preview of C# 7
44
Cloud Lock-In
46
Architectures You've
Always Wondered
About
Technology choices are made, and because of a variety of reasons--such as multi-year licensing cost,
tightly coupled links to mission-critical systems,
long-standing vendor relationships--you feel locked
into those choices. In this InfoQ eMag, we explore the
topic of cloud lock-in from multiple angles and look
for the best ways to approach it.
Exploring Container
Technology in the Real
World
43