This article introduces NoSQL databases: their properties, types and data models, and how they differ from a standard RDBMS.
RDBMS databases have been around for nearly three decades. But in the era of social media, smartphones and the cloud, we generate large volumes of data at high velocity, and that data ranges from simple text messages to high-resolution video files. Traditional RDBMS products have struggled to cope with the velocity, volume and variety of data in this new era. In addition, many RDBMS products are commercially licensed and are typically run on enterprise-class, proprietary hardware. This has paved the way for open source NoSQL databases, whose basic properties are a dynamic schema, distributed operation and horizontal scalability on commodity hardware.
2. Properties of NoSQL
NoSQL is commonly expanded as Not Only SQL. The basic qualities of NoSQL databases are that they are schemaless, distributed and horizontally scalable on commodity hardware. NoSQL databases offer a variety of functions to solve different problems with a variety of data types, whereas a “blob” used to be the only data type available in an RDBMS for storing unstructured data.
2.1 Dynamic Schema
NoSQL databases allow the schema to be flexible. New columns can be added at any time, rows may or may not have values for those columns, and data types are not strictly enforced for columns. This flexibility is handy for developers, especially when they expect frequent changes over the product life cycle.
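A minimal sketch of what a dynamic schema means in practice, using plain Python dictionaries as stand-ins for rows. The "users" table and its field names are made up for illustration, and no real database is involved.

```python
# Hypothetical "users" table held in memory; each row is a free-form dict.
users = []

# The first row has only two columns.
users.append({"id": 1, "name": "Asha"})

# A later row adds a new "email" column without any schema migration.
users.append({"id": 2, "name": "Ravi", "email": "ravi@example.com"})

# No data type is enforced: here "id" is a string and "tags" is a list.
users.append({"id": "u-3", "name": "Mei", "tags": ["admin", "beta"]})


def column_values(rows, column):
    """Return the value of `column` for every row, or None where it is absent."""
    return [row.get(column) for row in rows]


emails = column_values(users, "email")
print(emails)
```

Rows that never received an "email" value simply report None, which is the "rows may or may not have values" behaviour described above.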
2.2 Variety of Data
NoSQL databases support a wide variety of data: structured, semi-structured and unstructured. Logs, image files (such as JPEGs), videos, graphs, JSON and XML can be stored and operated on as they are, without any pre-processing. This reduces the need for ETL (Extract, Transform, Load).
2.3 High Availability Cluster
NoSQL databases support distributed storage on commodity hardware. They also support high availability through horizontal scaling, that is, by adding more nodes to the cluster. This feature lets NoSQL databases benefit from the elastic nature of cloud infrastructure services.
2.4 Open Source
Most popular NoSQL databases are open source software. They are free to use, and most of them can also be used freely in commercial products. The open source codebase can be modified to meet business needs. Open source licenses vary in their terms, so users must be aware of the license agreement of the database they choose.
2.5 NoSQL – Not Only SQL
NoSQL databases do not depend only on SQL to retrieve data. They provide rich APIs to perform DML (CRUD) operations. These APIs are more developer-friendly and are supported in a variety of programming languages.
3. Types of NoSQL
There are four main types of NoSQL databases: key-value databases, column-oriented databases, document-oriented databases and graph databases. At a very high level, most of these databases follow a structure similar to that of an RDBMS.
A database server might contain many databases. Each database might contain one or more tables. A table in turn has rows and columns that store the actual data. This hierarchy is common across NoSQL databases, although the terminology varies.
3.1 Key Value Database
Many key-value databases are based on the Dynamo white paper published by Amazon. A key-value database lets the user store data in a simple <key> : <value> format, where the key is used to retrieve the value from the table.
3.1.1 Data Model
A table contains many key spaces, and each key space can have many identifiers under which key-value pairs are stored. A key space is similar to a column in a typical RDBMS, and the group of identifiers under the key space can be thought of as rows.
Key-value databases are suitable for building simple, highly available applications. Since most key-value databases support in-memory storage, they can also be used to build caching mechanisms.
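The key space and identifier layout can be sketched with a toy in-memory store. The KeyValueStore class and its put, get and delete methods are hypothetical names chosen for this illustration; they do not correspond to the API of any particular product.

```python
class KeyValueStore:
    """Toy in-memory key-value store; not a client for a real database."""

    def __init__(self):
        # key space name -> {identifier (key): value}
        self._keyspaces = {}

    def put(self, keyspace, key, value):
        """Store `value` under `key`, creating the key space on first use."""
        self._keyspaces.setdefault(keyspace, {})[key] = value

    def get(self, keyspace, key, default=None):
        """Retrieve the value for `key`, or `default` if it is not present."""
        return self._keyspaces.get(keyspace, {}).get(key, default)

    def delete(self, keyspace, key):
        """Remove `key`; return True if it existed."""
        space = self._keyspaces.get(keyspace, {})
        if key in space:
            del space[key]
            return True
        return False


store = KeyValueStore()
store.put("sessions", "user:42", {"cart": ["book"]})
cached = store.get("sessions", "user:42")
missing = store.get("sessions", "user:99")
```

The value is opaque to the store: it is only ever looked up by its key, which is why this model suits caches and session data.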
3.2 Column oriented Database
Column-oriented databases are based on the Bigtable white paper published by Google. They take a different approach from a traditional RDBMS in that more and more columns can be added, producing a wider table. Since the table can become very wide, columns can be grouped under a family name, called a “Column Family” or “Super Column”. The column family is optional in some column databases. In keeping with the common philosophy of NoSQL databases, column values can be sparsely distributed.
3.2.1 Data Model
The table contains column families (optional). Each column family contains many columns. Column values are stored as key-value pairs and may be sparsely distributed.
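As a rough illustration of this model, a sparse, column-family table can be approximated with nested dictionaries. The row keys, family names and the put and get helpers are assumptions made for this sketch and are not taken from Cassandra or HBase.

```python
from collections import defaultdict

# row key -> column family -> column name -> value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    """Write one cell; rows and families are created on demand."""
    table[row_key][family][column] = value


def get(row_key, family, column, default=None):
    """Read one cell. Uses .get() at every level so reads never create entries."""
    return table.get(row_key, {}).get(family, {}).get(column, default)


put("user1", "profile", "name", "Asha")
put("user1", "profile", "city", "Chennai")
put("user2", "profile", "name", "Ravi")
put("user2", "activity", "last_login", "2015-01-01")

city_user1 = get("user1", "profile", "city")
city_user2 = get("user2", "profile", "city")  # never written, so the cell is absent
```

Only the cells that were written take up space; "user2" has no "city" column at all rather than a stored NULL.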
Column-oriented databases are an alternative to typical data warehousing databases (e.g. Teradata), and they are suitable for OLAP-style applications.
Examples: Apache Cassandra, HBase
3.3 Document-oriented Database
Document-oriented databases store semi-structured data, such as JSON, XML, YAML or even a Word document. The unit of data is called a document (similar to a row in an RDBMS). The table that contains a group of documents is called a “Collection”.
3.3.1 Data Model
A database contains many collections. A collection contains many documents. Each document might be a JSON document, an XML document, a YAML document or even a Word document.
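The database, collection and document hierarchy can be sketched in a few lines using JSON-style documents. The collection names and the insert and find helpers are hypothetical and only loosely mimic what a document database offers.

```python
import json

# database -> collection name -> list of documents
database = {"articles": [], "authors": []}


def insert(collection, document):
    """Store a copy of the document; the JSON round trip checks it is serialisable."""
    database[collection].append(json.loads(json.dumps(document)))


def find(collection, **criteria):
    """Return documents whose top-level fields equal all the given criteria."""
    return [
        doc
        for doc in database[collection]
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Documents in the same collection need not share the same fields.
insert("articles", {"title": "Intro to NoSQL", "tags": ["nosql", "db"], "author": {"name": "Asha"}})
insert("articles", {"title": "Graph basics", "year": 2015})

matches = find("articles", year=2015)
```

Nested values such as the embedded "author" object are stored inside the document itself, which is what makes this model a natural fit for JSON-based web services.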
Document databases are suitable for Web based applications and applications exposing RESTful services.
3.4 Graph Database
A real-world graph contains vertices and edges, which are called nodes and relations in a graph database. Graph databases allow us to store and manipulate nodes, relations, and the attributes of both.
Graph databases work best when the graphs are directed, i.e. when the relations between nodes have a direction.
3.4.1 Data Model
A graph database is a two-dimensional representation of a graph. The graph is similar to a table: each graph has Node, Node Properties, Relation and Relation Properties as columns, and each row holds values for these columns. The values in the properties columns can be key-value pairs.
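A small sketch of this model, with nodes, directed relations and key-value properties on both. The node names, the FOLLOWS relation type and the two helper functions are invented for the example; a real graph database would expose its own query language or API.

```python
# Nodes carry key-value properties.
nodes = {
    "alice": {"label": "Person", "age": 30},
    "bob": {"label": "Person", "age": 25},
    "carol": {"label": "Person", "age": 35},
}

# Each relation is directed: (start node, relation type, end node, properties).
relations = [
    ("alice", "FOLLOWS", "bob", {"since": 2012}),
    ("bob", "FOLLOWS", "carol", {"since": 2014}),
]


def neighbours(node, rel_type):
    """Nodes reached from `node` over one outgoing relation of the given type."""
    return [end for start, kind, end, _ in relations if start == node and kind == rel_type]


def reachable(node, rel_type):
    """All nodes reachable over any number of hops, found by depth-first traversal."""
    seen, stack = set(), [node]
    while stack:
        for nxt in neighbours(stack.pop(), rel_type):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


direct = neighbours("alice", "FOLLOWS")
indirect = reachable("alice", "FOLLOWS")
```

The multi-hop traversal is the kind of query that would need repeated self-joins in an RDBMS.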
Graph databases are suitable for social media and network problems, which involve complex queries with many joins.
Examples: Neo4j, OrientDB, HyperGraphDB, GraphBase, InfiniteGraph
4. Possible Problem Areas
The following are important areas to consider when choosing a NoSQL database for a given problem statement.
4.1 ACID Transactions
Most NoSQL databases do not offer full ACID transactions across multiple records in the way an RDBMS does. MongoDB, Couchbase and Cassandra, for example, were designed this way, although newer releases of some products have since added transaction support. [Note: To know more about ACID transaction capabilities, refer to the appendix below].
4.2 Proprietary APIs / SQL Support
Some NoSQL databases do not support Structured Query Language and offer only an API. There is no common standard for these APIs: every database implements its own, so there is an overhead in learning them and in developing a separate adaptor layer for each database. Other NoSQL databases offer a SQL-like language but do not support all standard SQL features.
4.3 No JOIN Operator
Due to the nature of the schema or data model, not all NoSQL databases support JOIN operations by default, whereas JOIN is a core feature of an RDBMS. The query language in Couchbase supports join operations. In HBase, joins can be achieved by integrating with Hive. MongoDB originally offered no join operation, although later releases added the $lookup aggregation stage, which performs a left outer join.
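When the database offers no JOIN, the application often has to combine the data itself. The following is a minimal sketch of such an application-side join, assuming two collections that were fetched separately; the field names and the join_orders_with_customers helper are hypothetical.

```python
# Two hypothetical collections fetched separately from a store without JOIN support.
orders = [
    {"order_id": 1, "customer_id": "c1", "total": 250},
    {"order_id": 2, "customer_id": "c2", "total": 90},
    {"order_id": 3, "customer_id": "c1", "total": 40},
]
customers = [
    {"customer_id": "c1", "name": "Asha"},
    {"customer_id": "c2", "name": "Ravi"},
]


def join_orders_with_customers(orders, customers):
    """Hash join: index one side by key, then look up each row of the other side."""
    by_id = {c["customer_id"]: c for c in customers}
    return [
        {**order, "customer_name": by_id[order["customer_id"]]["name"]}
        for order in orders
        if order["customer_id"] in by_id  # inner-join semantics: skip unmatched orders
    ]


joined = join_orders_with_customers(orders, customers)
```

The alternative design in many NoSQL data models is to denormalise, for example by embedding the customer name in each order, so that no join is needed at read time.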
4.4 Leeway of CAP Theorem
Most NoSQL databases take the leeway suggested by the CAP theorem and support only two of the three properties: Consistency, Availability and Partition tolerance. They do not support all three. [Note: Please refer to the appendix to know more about the CAP theorem].
NoSQL databases solve problems where RDBMS could not succeed, in both functional and non-functional areas. In this article we have seen the basic properties, generic data models, types and features of NoSQL databases. To proceed further, pick any one NoSQL database and get hands-on.
Appendix A Theories behind Databases
A.1 ACID Transactions
ACID is an acronym for Atomicity, Consistency, Isolation and Durability, the four properties used to assess how reliably a database handles transactions.
Atomicity means that database transactions must be atomic in nature. It is also called the all-or-nothing rule. Databases must ensure that a single failure results in a rollback of the entire transaction up to the commit point. The transaction must be committed only if all of its operations are successful.
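The all-or-nothing rule can be demonstrated with SQLite through Python's standard sqlite3 module. This is only a sketch: the accounts table, the CHECK constraint and the transfer function are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "a", "b", 30)       # both updates succeed and are committed
failed = transfer(conn, "a", "b", 500)  # the debit violates the CHECK, so the credit is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the failed transfer the credit to "b" had already been applied inside the transaction, yet after the rollback neither balance has changed.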
Consistency means that databases must ensure that only valid data is stored. In an RDBMS, this is all about enforcing the schema. In NoSQL, consistency varies depending on the type of database. For example, in a graph database such as Neo4j, consistency ensures that a relationship must have a start and an end node. MongoDB automatically creates a unique _id for each document, by default a 12-byte ObjectId that is usually displayed as a 24-character hexadecimal string.
Isolation means that databases allow multiple transactions to run in parallel without interfering with each other. For example, when read and write operations happen in parallel, the read will not see the effect of the write until the write transaction is committed. Until then, the read operation sees only the previously committed data.
Durability means that databases must ensure that committed transactions are persisted to storage. There must be appropriate transaction and commit logs to enforce writing to disk.
A.2 Brewer’s CAP-Theorem
The CAP theorem states that any networked shared-data system can have at most two of three desirable properties: Consistency, Availability and Partition tolerance.
A.2.1. Consistency
In a distributed database system, all the nodes must see the same data at the same time.
A.2.2. Availability
The database system must be available to service every request it receives. Basically, the DBMS must be a highly available system.
A.2.3. Partition Tolerance
The database system must continue to operate despite arbitrary partitioning due to network failures.
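The trade-off can be made concrete with a deliberately simplified simulation of two replicas. The TinyCluster class and its "CP" and "AP" modes are invented for this sketch; real systems use quorums, timeouts and conflict resolution that are not modelled here.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class TinyCluster:
    """Two replicas; `mode` is "CP" (prefer consistency) or "AP" (prefer availability)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False  # True when the replicas cannot reach each other

    def write(self, replica_index, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Cannot replicate, so refuse the write rather than let copies diverge.
                raise RuntimeError("unavailable during partition")
            # AP: accept the write locally; the other replica is now stale.
            self.replicas[replica_index].data[key] = value
            return
        for replica in self.replicas:
            replica.data[key] = value

    def read(self, replica_index, key):
        return self.replicas[replica_index].data.get(key)


ap = TinyCluster("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap.write(0, "x", 2)  # accepted, but only replica 0 sees it
ap_reads = (ap.read(0, "x"), ap.read(1, "x"))

cp = TinyCluster("CP")
cp.write(0, "x", 1)
cp.partitioned = True
try:
    cp.write(0, "x", 2)
    cp_write_accepted = True
except RuntimeError:
    cp_write_accepted = False
cp_reads = (cp.read(0, "x"), cp.read(1, "x"))
```

During the partition the AP cluster stays writable but its replicas disagree, while the CP cluster keeps its replicas in agreement by rejecting the write.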