Features of Apache Big Data Streaming Frameworks


The idea of this article is to explain the general features of Apache big data stream processing frameworks and to provide a crisp comparative analysis of Apache's streaming frameworks against these generic features, so that it is easier to select the right framework for application development.

1. Introduction

In the big data world, there are many tools and frameworks available to process large volumes of data in offline or batch mode. But the need for real time processing, to analyze data arriving at high velocity on the fly and provide analytics or enrichment services, is also high. Over the last couple of years this has been an ever changing landscape, with many new streaming frameworks entering the field, so choosing a real time processing engine becomes a challenge.

2. Design

The real time streaming engines interact with stream or messaging frameworks such as Apache Kafka, RabbitMQ or Apache Flume to receive data in real time.

They process the data inside a cluster computing engine, which typically runs on top of a cluster manager such as Apache YARN, Apache Mesos or Apache Tez.

The processed data is sent back to message queues (Apache Kafka, RabbitMQ, Flume) or written into storage such as HDFS or NFS.



3. Characteristics of Real Time Stream Process Engines

3.1 Programming Models

There are two types of programming models present in real time streaming frameworks.

3.1.1 Compositional

This approach provides basic components from which the streaming application can be composed. For example, in Apache Storm, a spout is used to connect to different sources and receive the data, and bolts are used to process the received data.
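
As a rough, framework-agnostic illustration in plain Python (this is not the actual Storm API; the class and method names are illustrative), a compositional pipeline wires an explicit source component to an explicit processing component:

# A minimal compositional-style sketch in plain Python.
class SaleSource:
    """Emits raw events, similar in spirit to a Storm spout."""
    def events(self):
        yield {"shop": 100, "amount": 250}
        yield {"shop": 101, "amount": 900}

class TotalsBolt:
    """Processes received events, similar in spirit to a Storm bolt."""
    def __init__(self):
        self.total = 0
    def execute(self, event):
        self.total += event["amount"]

source, bolt = SaleSource(), TotalsBolt()
for event in source.events():      # a real framework would do this wiring for us
    bolt.execute(event)
print(bolt.total)                  # 1150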

3.1.2. Declarative

This is more of a functional programming approach, where the framework allows us to define higher order functions. These declarative APIs provide more advanced operations like windowing or state management and are considered more flexible.
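
In contrast, a declarative pipeline only describes what to compute by chaining higher order functions over the stream. The following minimal sketch uses plain Python and is not tied to any specific framework:

# A minimal declarative-style sketch using plain Python higher order functions.
events = [{"shop": 100, "amount": 250}, {"shop": 101, "amount": 900}, {"shop": 100, "amount": 75}]

large_sales = filter(lambda e: e["amount"] > 100, events)   # declare the filtering step
amounts = map(lambda e: e["amount"], large_sales)           # declare the projection step
print(sum(amounts))                                         # 1150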

3.2  Message Delivery Guarantee

There are three message delivery guarantee mechanisms. They are: at most once, at least once and exactly once.

3.2.1 At most once

This is a best effort delivery mechanism. The message is delivered either once or not at all, so messages may be lost, but duplicates are never produced.

3.2.2 At least once

This mechanism ensures that the message is delivered at least once. But in the process of delivering it at least once, the framework might deliver the message more than once, so duplicate messages might be received and processed. This can result in unnecessary complications where the processing logic is not idempotent.

3.2.3 Exactly once

The framework ensures that the message is delivered and processed exactly once. The message delivery is guaranteed and there won't be any duplicate messages. So, the "exactly once" delivery guarantee is considered the best of all.
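
In practice, many applications approximate exactly once behaviour by running on an at least once channel and making the processing idempotent, for example by remembering the ids of messages that have already been handled. A minimal sketch in plain Python (the message id field is illustrative):

processed_ids = set()                      # state used to detect duplicate deliveries

def handle(message):
    if message["id"] in processed_ids:     # duplicate delivery from an at least once channel
        return
    processed_ids.add(message["id"])
    # ... actual processing of the message goes here ...

handle({"id": 1, "payload": "event-a"})
handle({"id": 1, "payload": "event-a"})    # second delivery of the same message is ignored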

3.3 State Management

State management defines the way events are accumulated inside the framework before it actually processes the data. This is a critical factor when deciding on a framework for real time analytics.

3.3.1 Stateless processing

The frameworks which process the incoming events independently, without knowledge of any previous events, are considered to be stateless. Data enrichment and simple data processing applications might need only this kind of processing.

3.3.2 Stateful Processing

The stream processing frameworks can make use of previous events to process the incoming events, by storing them in a cache or an external database. Real time analytics applications need stateful processing, so that they can collect data for a specific interval and process it before recommending any suggestions to the user.
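
A minimal, framework-agnostic sketch of stateful processing in plain Python: a running total is kept per shop across events, so each new event is processed with knowledge of the previous ones.

from collections import defaultdict

running_totals = defaultdict(int)              # state retained across events

def process(event):
    running_totals[event["shop"]] += event["amount"]
    return running_totals[event["shop"]]       # aggregate so far for this shop

print(process({"shop": 100, "amount": 250}))   # 250
print(process({"shop": 100, "amount": 75}))    # 325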

3.4 Processing Modes

Processing mode defines how the incoming data is processed. There are three processing modes: event, micro batch and batch.

3.4.1 Event Mode

Each and every incoming message is processed independently. It may or may not maintain the state information.

3.4.2 Micro Batch

The incoming events are accumulated for a specific time window and the collected events are processed together as a batch.
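
A rough sketch of micro batching in plain Python: events are buffered until a window fills up and the collected events are then handed over as one batch (a real framework would typically use a time window; the batch size here is illustrative).

def micro_batches(stream, batch_size=3):
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:   # window is full, emit the collected events together
            yield batch
            batch = []
    if batch:                          # flush the final partial batch
        yield batch

for batch in micro_batches(range(7), batch_size=3):
    print(batch)                       # [0, 1, 2], then [3, 4, 5], then [6]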

3.4.3 Batch

The incoming events are processed like a bounded stream of inputs. This allows a large, finite set of incoming events to be processed.

3.5 Cluster Manager

The real time processing frameworks run in a cluster computing environment and might need a cluster manager. Support for a cluster manager is critical to meet the scalability and performance requirements of the application. The frameworks might run in standalone mode, with their own cluster manager, or on Apache YARN, Apache Mesos or Apache Tez.

3.5.1 Standalone Mode

The support to run in standalone mode is useful during the development phase, where developers can run the code in their development environment without needing to deploy it to a large cluster computing environment.

3.5.2 Proprietary Cluster Manager

Some real time processing frameworks ship with their own cluster manager; for example, Apache Spark has its own standalone cluster manager, which is bundled with the software. This reduces the overhead of installing, configuring and maintaining other cluster managers such as Apache YARN or Apache Mesos.

3.5.3 Support for Industry Standard Cluster Managers

If you already have a big data environment and want to leverage the cluster for real time processing, then support for the existing cluster manager is very critical. The real time stream processing framework must support Apache YARN, Apache Mesos or Apache Tez.

3.6 Fault Tolerance

Most of the big data frameworks follow a master slave architecture. Basically, the master is responsible for running the job on the cluster and monitoring the clients in the cluster. So, the framework must handle failures at the master node as well as failures at the client nodes. Some frameworks might need external tools like monit/supervisord to monitor the master node. For example, Apache Spark Streaming has its own monitoring process for the master (driver) node: if the master node fails it is automatically restarted, and if a client node fails the master takes care of restarting it. But in Apache Storm, the master has to be monitored using a tool like monit.

3.7 External Connectors

The framework must support seamless connections to external data generation sources such as Twitter feeds, Kafka, RabbitMQ, Flume, RSS feeds, Hekad, etc. The framework must provide standard built-in connectors as well as a provision to extend the connectors to various streaming data sources.

3.7.1 Social Media Connectors – Twitter / RSS Feeds

3.7.2 Message Queue Connectors -Kafka / RabbitMQ

3.7.3 Network Port Connectors – TCP/UDP Ports

3.7.4 Custom Connectors – Support for developing customized connectors to read from custom applications.

3.8 Programming Language Support

Most of these frameworks support JVM languages, especially Java and Scala. Some also support Python. The selection of a framework might depend on the language of choice.

3.9 Reference Data Storage & Access

The real time processing engines might need to refer to databases to enrich or aggregate the given data. So, the framework must provide features for integrating and efficiently accessing the reference data. Some frameworks provide ways to cache the reference data in memory internally (e.g. the Apache Spark broadcast variable). Apache Samza and Apache Flink support storing the reference data locally on each cluster node, so that jobs can access it without connecting to the database over the network.

Following are the various methods available in the big data streaming frameworks:

3.9.1 In-memory cache: Allows reference data to be stored inside the cluster nodes, which improves performance by reducing the delay of connecting to external databases.

3.9.2 Per Client Database Storage: Allows data to be stored in third party database systems like MySQL, SQLite, MongoDB etc. inside the streaming cluster. Also provides API support to connect to and retrieve data from those databases, along with efficient database connection handling.

3.9.3 Remote DBMS Connection: These systems support connecting to external databases outside the streaming cluster. This is considered less efficient due to the higher latency and bottlenecks introduced by network communication.

3.10 Latency and throughput

Though hardware configuration plays a major role in latency and throughput, some design factors of the frameworks also affect performance. The factors are: network I/O, efficient use of memory, reduced disk access and in-memory caching of reference data. For example, the Apache Kafka Streams API provides higher throughput and lower latency due to reduced network I/O, since the messaging framework and the computing engine live in the same cluster. Similarly, Apache Spark uses memory to cache data, thereby reducing disk access, which results in lower latency and higher throughput.

4. Feature Comparison Table

The following table provides a comparison of the Apache streaming frameworks against the features discussed above.


The above frameworks support both stateful and stateless processing modes.

5. Conclusion

This article summarizes the various features of streaming frameworks, which are critical selection criteria for a new streaming application. Every application is unique and has its own specific functional and non-functional requirements, so the right framework completely depends on those requirements.

6. References

6.1 Apache Spark Streaming – http://spark.apache.org/streaming/

6.2 Apache Storm – http://storm.apache.org/

6.3 Apache Flink – https://flink.apache.org/

6.4 Apache Samza – http://samza.apache.org/

6.5 Apache Kafka Streaming API – http://kafka.apache.org/documentation.html#streams



R Data Frame – The One Data Type for Predictive Analytics


The idea of this article is to introduce the R language’s high level data structure named data frame and its usage in the context of programming with predictive machine learning algorithms.


1. Introduction

The R data frame is a high level data structure which is equivalent to a table in database systems. It is highly useful for working with machine learning algorithms, and it is very flexible and easy to use.

The standard definition says data frames are "tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software" [1].

2. CRUD Operations

2.1 Create

2.1.1 Creating a new Data Frame

The data.frame() API will create a new dataFrame.  This creates a data frame object, which can be later updated with rows and columns.

dataFrame <- data.frame()

2.1.2 Creating a DataFrame from a CSV file

This is the most standard way of creating a data frame while working with machine learning problems. The read.csv() API creates a new data frame and loads it with the contents of the CSV file. The columns are named using the first row of the CSV file.

dataFrame <- read.csv("C:\\DATA\\training.csv")

2.1.3 Creating a CSV file from Data Frame

The write.csv() API is used to persist the contents of a data frame into a CSV file. In a machine learning problem, once we perform the prediction we want to persist the results into storage, and write.csv() is the answer to that requirement. A minimal call looks like the following:

write.csv(dataFrame,"C:\\Data\\NewTraining.csv")

2.1.4 Creating a CSV file from Data Frame without heading

By default write.csv() adds a row id column to the output file. If you don't want this, set row.names to FALSE.

write.csv(dataFrame,"C:\\Data\\NewTraining.csv",row.names = FALSE)

2.1.5 Create a vector from a Data Frame

Basically each and every column of the data frame is a vector.  We can create a new vector from a column of a data frame.

newVector <- dataFrame$ColumnName

2.2 Read

2.2.1 The str() API

The str() API displays the structure of the data frame: the number of observations, the column names, their data types and a sample of the values.

str(dataFrame)

2.2.2 Filter the DataFrame Based on the Value set for column

In the following example, we will select only the records where the shopId is 100.  We can also use other relational operators and combine it with logical operators as well.

newFrame <- dataFrame[dataFrame$ShopId == '100',]

2.2.3 Sort the DataFrame on a column

The below code snippet sorts the dataFrame based on the Date column in the dataFrame and stores the result in a new data frame object.

newFrame <- dataFrame[order(dataFrame$Date),]

2.3 Update

2.3.1 Adding Rows

The rbind() API allows a new row to be appended to an existing data frame. This API will throw an error if the row does not have the same columns as the data frame. The newRow must be a vector.

NewDataFrame <- rbind(dataFrame,newRow)

2.3.2 Adding Columns

The cbind() API allows a new column to be appended to an existing data frame. This API will throw an error if the newColumn does not have the same number of rows as the data frame. The newColumn must be a vector.

newDataFrame <- cbind(dataFrame,newColumn)

2.3.3 Dynamically Adding Column

Following is an alternate way to add a new column using a vector to an existing data frame.

dataFrame$newColumn <- newVector

The newColumn will be appended to the dataFrame, with its values populated from newVector.

2.4 Delete

The standard rm() API is used to delete a data frame.

rm(dataFrame)

3. Prediction Models with DataFrame

We will take a simple sales prediction problem, where a shop wants to predict the expected sales based on the past history of 10 days. Most R scripts take a general format: 1) load the training data, 2) load the test data, 3) build the model with the training data, 4) predict on the test data with the model and finally 5) write the predicted values to storage.

3.1 Training Data

The prediction model is built from the training data: the machine learning algorithm is run on the training data to build a model, so that we can perform predictions using this model later.

We have the training data in a CSV file with the following columns: Day specifies the number of the day, Customers specifies the total number of customers who visited the shop, Promo specifies whether the shop ran a promotion on that day, Holiday specifies whether it was a holiday in that state, and Sales specifies the amount of sales on that day.

For example, the first row says that on day 1, 55 customers visited the shop and the sales were 5488. It was a regular weekday and there was no promotion or holiday on that day.

Sales Data ( Training Data)


3.2 Test Data

The test data is also a CSV file, for which we have to predict the sales. It contains a similar set of columns, except Sales. For example, we have to predict the sales for day 1, where 53 customers visited and it is a regular working day without a promotion or a holiday.

Test Data

3.3 Linear Regression Model

#Create a dataframe using the sales.txt to train the model
trainingData <- read.csv("C:\\Data\\Sales.txt")
#Create a dataframe to using Test.txt for which the predictions to be computed
testData <- read.csv("C:\\Data\\Test.txt")

#Build the linear regression model using the glm() function
#This model predicts the Sales using Day, Customers, Promo & Holiday

Model <- glm(Sales ~ Day + Customers + Promo + Holiday, data = trainingData)

#Apply the model in predict() api with the test data
#It returns the predicted sales in a vector

predictSales <- predict(Model,testData)

#Round off the predicted sales values
predictSales <- round(predictSales,digits=2)

#Create a new column in testData framework with predicted Sales values
testData$Sales <- predictSales

#Write the result into a file.
write.csv(testData, file = "C:\\Result.txt",row.names=FALSE)

3.4 Predicted Results

Now we get the results in a format similar to our Sales.txt.  Actually it contains all the fields of Test.txt and the respective Sales data predicted by the algorithm.

Linear Regression Model Results

3.5 Random Forest Model

To run the Random Forest algorithm, we should have the randomForest package installed in R. We can execute the below command to install the package.

install.packages("randomForest")

Below is the R script to predict the sales using the Random Forest algorithm:

#Load the randomForest library
library(randomForest)

#Create a dataframe for Sales data to train the model
trgData <- read.csv("C:\\Data\\Sales.txt")

#Create a dataframe with the test data, for which the sales to be predicted
testData <- read.csv("C:\\Data\\Test.txt")

#Build the model using Random Forest algorithm
#This model predicts the sales using Day, Customers, Promo and Holiday
Model <- randomForest(Sales ~ Day + Customers + Promo + Holiday, data = trgData)

#Predict the sales using the above model, for the test data
#predictSales is a vector which contains the predicted values
predictSales <- predict(Model,testData)

#Round off the sales numbers
predictSales <- round(predictSales,digits=2)

#Add additional column to the test data and copy the predicted sales values
testData$Sales <- predictSales

#Create a output file using the computed values.
write.csv(testData, file = "C:\\Data\\Result.txt",row.names=FALSE)

3.6 Predicted Output (Random Forest)

Now the results for Test.txt include the predicted Sales value for each and every day. For example, take day 5, where 79 customers visited the shop, it was a holiday and the shop was running a promotion on that day: the algorithm predicted the expected sales as 7889.6 for that day.

Random Forest Model Results


4. Summary

I have briefly explained the core concepts of the R data frame in the context of machine learning algorithms, based on my experience in Kaggle competitions. This article should give a quick heads up on data analytics. To learn further and gain expertise, I would suggest starting with the books R Cookbook [2] and Machine Learning with R [3]. Hope this helps.

5. References

[1] R Data Frame Documentation – https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html

[2] R Cookbook – O'Reilly Media

[3] Machine Learning with R – https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition

[4] Download R – https://cran.rstudio.com

[5] Download R Studio – www.rstudio.com/products/rstudio/download/

File Handling in Amazon S3 with Python Boto Library


Understand Python Boto library for standard S3 workflows.

1. Introduction

Amazon Web Services (AWS) Simple Storage Service (S3) is storage as a service provided by Amazon. It is a general purpose object store; the objects are grouped under a namespace called "buckets". Bucket names are unique across the whole of AWS S3.

Boto library is the official Python SDK for software development.  It provides APIs to work with AWS services like EC2, S3 and others.

In this article we will focus on how to use Amazon S3 for regular file handling operations using Python and the Boto library.

2. Amazon S3 & Workflows

In Amazon S3, the user has to first create a bucket. The bucket is a namespace which has a unique name across AWS. Users can set access privileges on it based on their requirements. Buckets can contain objects. The objects are referred to as key-value pairs, where the key is the identifier used to operate on the object. The key must be unique inside the bucket. The object can be of any type; it can be used to store strings, integers, JSON, text files, sequence files, binary files, pictures & videos. To understand more about Amazon S3, refer to the Amazon documentation [2].

Following are the possible work flow of operations in Amazon S3:

  • Create a Bucket
  • Upload file to a bucket
  • List the contents of a bucket
  • Download a file from a bucket
  • Move files across buckets
  • Delete a file from bucket
  • Delete a bucket

3. Python Boto Library

The Boto library is the official Python SDK for AWS. It supports Python 2.7; work for Python 3.x is ongoing. The code snippets in this article were developed using boto v2.x. To install the boto library, the pip command can be used as below:

pip install -U boto


Also, in the below code snippets I have used the connect_s3() API, passing the access credentials as arguments. This provides the connection object to work with. But if you don't want to hard-code the access credentials in your program, there are other ways to do it. We can create the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The other way is to create a credentials file named "credentials" under the .aws directory in the user's home directory. The file should contain the below:

File Name : ~/.aws/credentials

aws_access_key_id = ACCESS_KEY
aws_secret_access_key = SECRET_KEY
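
When the credentials are supplied through environment variables or the credentials file shown above, connect_s3() can be called without arguments and boto resolves the credentials automatically (a minimal sketch):

import boto

#No keys passed here; boto picks them up from the environment or the credentials file
conn = boto.connect_s3()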

4. S3 Work flow Automation

4.1 Create a bucket

The first operation to be performed, before any other operation on S3, is to create a bucket. The create_bucket() API of the connection object performs this. The bucket is the namespace under which all the objects of the user can be stored.

import boto

keyId = "your_aws_key_id"
#Connect to S3 with access credentials 
conn = boto.connect_s3(keyId,sKeyId)  

#Create the bucket in a specific region.
bucket = conn.create_bucket('mybucket001',location='us-west-2')

In the create_bucket() API, the bucket name ('mybucket001') is the mandatory parameter. The location is an optional parameter; if the location is not given, the bucket will be created in the default region of the user.

The create_bucket() call might throw an error if a bucket with the same name already exists, since bucket names are unique across the whole of S3. The naming convention for buckets depends on the rules enforced by the AWS region; generally, bucket names must be in lower case.

4.2 Upload a file

To upload a file into S3, we can use set_contents_from_file() api of the Key object.  The Key object resides inside the bucket object.

import boto
from boto.s3.key import Key

keyId = "your_aws_key_id"
sKeyId= "your_aws_secret_key_id"

file = open(fileName)

conn = boto.connect_s3(keyId,sKeyId)
bucket = conn.get_bucket(bucketName)
#Get the Key object of the bucket
k = Key(bucket)
#Create a new key with id as the name of the file
k.key = fileName
#Upload the file
result = k.set_contents_from_file(file)
#result contains the size of the file uploaded

4.3 Download a file

To download the file, we can use get_contents_to_file() api.

import boto
from boto.s3.key import Key

keyId ="your_aws_key_id"

conn = boto.connect_s3(keyId,sKeyId)
bucket = conn.get_bucket(bucketName)

#Get the Key object of the given key, in the bucket
k = Key(bucket,srcFileName)

#Get the contents of the key into a local file
localFile = open(srcFileName,"wb")
k.get_contents_to_file(localFile)
localFile.close()

4.4 Move a file from one bucket to another

Moving a file from one bucket to another can be achieved by copying the object from the source bucket to the destination bucket (and then deleting the source object if required). The copy_key() API of the destination bucket object copies the object from the given source bucket into the destination bucket.

import boto

keyId = "your_aws_access_key_id"

conn = boto.connect_s3(keyId,sKeyId)
srcBucket = conn.get_bucket('mybucket001')   #Source Bucket Object
dstBucket = conn.get_bucket('mybucket002')   #Destination Bucket Object
fileName = "abc.txt"
#Call the copy_key() from the destination bucket
dstBucket.copy_key(fileName, srcBucket.name, fileName)

 4.5 Delete a file

To delete an object inside a bucket, we have to retrieve the key of the object and call the delete() API of the Key object. The Key object can be created by calling Key() with the bucket object and the object name.

import boto
from boto.s3.key import Key

keyId = "your_aws_access_key"
sKeyId = "your_aws_secret_key"
srcFileName="abc.txt"      #Name of the file to be deleted
bucketName="mybucket001"   #Name of the bucket, where the file resides

conn = boto.connect_s3(keyId,sKeyId)   #Connect to S3
bucket = conn.get_bucket(bucketName)   #Get the bucket object

k = Key(bucket,srcFileName)            #Get the key of the given object

k.delete()                             #Delete the object

4.6 Delete a bucket

The delete_bucket() api of the connection object deletes the given bucket in the parameter.

import boto

keyId = "your_aws_access_key_id"
sKeyId= "your_aws_secret_key_id"
conn = boto.connect_s3(keyId,sKeyId)
bucket = conn.delete_bucket('mybucket002')

The delete_bucket() call will fail, if there are objects inside the bucket.

4.7 Empty a bucket

Emptying a bucket can be achieved by deleting all the objects inside the bucket. The list() API of the bucket object (bucket.list()) will provide all the objects inside the bucket. By calling the delete() API for each of those objects, we can delete them.

import boto

keyId = "your_aws_access_key_id"
sKeyId= "your_aws_secret_key_id"

conn = boto.connect_s3(keyId,sKeyId)     #Connect to S3
bucket = conn.get_bucket(bucketName)     #Get the bucket Object

for i in bucket.list():
    i.delete()                           #Delete the object

4.8 List All Buckets

The get_all_buckets() of the connection object returns list of all buckets for the user.  This can be used to validate existence of the bucket once you have created or deleted a bucket.

import boto

keyId = "your_aws_access_key_id"
sKeyId= "your_aws_secret_key_id"

conn = boto.connect_s3(keyId,sKeyId)      #Connect to S3
buckets = conn.get_all_buckets()          #Get the bucket list
for i in buckets:
    print(i.name)                         #Print the name of each bucket

5 Summary

The boto library provides the connection, bucket and key objects, which exactly represent the design of S3. By understanding the various methods of these objects, we can perform all the possible operations on S3 using the boto library.

Hope this helps.

6. References

[1] Boto S3 API Documentation – http://boto.cloudhackers.com/en/latest/ref/s3.html

[2] Amazon S3 Documentation – https://aws.amazon.com/documentation/s3/




A Primer On Open Source NoSQL Databases


The idea of this article is to understand NoSQL databases, its properties, various types, data model and how it differs from standard RDBMS.

1. Introduction

The RDBMS databases have been here for nearly three decades now. But in the era of social media, smart phones and cloud, we generate large volumes of data at high velocity. Also, the data varies from simple text messages to high resolution video files. Traditional RDBMSs are not able to cope with the velocity, volume and variety of data requirements of this new era. Also, most RDBMS software is licensed and needs enterprise class, proprietary, licensed hardware. This has clearly made way for open source NoSQL databases, whose basic properties are a dynamic schema and distributed, horizontally scalable deployment on commodity hardware.

2. Properties of NoSQL

NoSQL is an acronym for Not Only SQL. The basic qualities of NoSQL databases are that they are schemaless, distributed and horizontally scalable on commodity hardware. NoSQL databases offer a variety of functions to solve various problems with a variety of data types, where "blob" used to be the only data type in an RDBMS for storing unstructured data.

2.1 Dynamic Schema

NoSQL databases allow the schema to be flexible. New columns can be added at any time, rows may or may not have values for those columns, and there is no strict enforcement of data types for columns. This flexibility is handy for developers, especially when they expect frequent changes during the course of the product life cycle.

2.2 Variety of Data

NoSQL databases support any type of data. They allow structured, semi-structured and unstructured data to be stored. Logs, image files, videos, graphs, JPEGs, JSON and XML can be stored and operated on as-is without any pre-processing, which reduces the need for ETL (Extract – Transform – Load).

2.3 High Availability Cluster

NoSQL databases support distributed storage using commodity hardware. They also support high availability through horizontal scalability. This feature enables NoSQL databases to get the benefit of the elastic nature of cloud infrastructure services.

2.4 Open Source

NoSQL databases are open source software. The usage of the software is free, and most of them are free to use in commercial products. The open source codebase can be modified to solve specific business needs. There are minor variations in the open source software licenses, so users must be aware of the license agreements.

2.5 NoSQL – Not Only SQL

NoSQL databases do not depend only on SQL to retrieve data. They provide rich API interfaces to perform DML and CRUD operations. These APIs are more developer friendly and are supported in a variety of programming languages.

3. Types of No-SQL

There are four types of NoSQL databases. They are: key-value databases, column oriented databases, document oriented databases and graph databases. At a very high level most of these databases follow a structure similar to RDBMS databases.

The database server might contain many databases. Each database might contain one or more tables, and each table in turn has rows and columns to store the actual data. This hierarchy is common across all NoSQL databases, but the terminology might vary.

3.1 Key Value Database

Key-value databases were developed based on the Dynamo white paper published by Amazon. A key-value database allows the user to store data in a simple <key> : <value> format, where the key is used to retrieve the value from the table.

3.1.1 Data Model

The table contains many key spaces, and each key space can have many identifiers to store key-value pairs. The key space is similar to a column in a typical RDBMS, and the group of identifiers presented under the key space can be considered as rows.

Key-value databases are suitable for building simple, non-complex, highly available applications. Since most key-value databases support in-memory storage, they can also be used for building caching mechanisms.

3.1.3 Example:

DynamoDB, Redis
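
For instance, with Redis the key-value model can be exercised from Python using the redis-py client (a minimal sketch against a local server; the key name is illustrative):

import redis

r = redis.Redis(host='localhost', port=6379)   #Connect to a local Redis server
r.set('shop:100:sales', 5488)                  #Store a value under a key
print(r.get('shop:100:sales'))                 #Retrieve the value using the key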

3.2 Column oriented Database

Column oriented databases were developed based on the Bigtable white paper published by Google. They take a different approach from a traditional RDBMS, supporting the addition of more and more columns to create very wide tables. Since a table can become very broad, columns can be grouped under a family name, called a "Column Family" or "Super Column". The column family is optional in some of the column databases. As per the common philosophy of NoSQL databases, the values for the columns can be sparsely distributed.

3.2.1 Data Model

The table contains column families (optional).  Each column family contains many columns.  The values for columns might be sparsely distributed with key-value pairs.


Column oriented databases are an alternative to typical data warehousing databases (e.g. Teradata) and are suitable for OLAP kinds of applications.

3.2.2 Example

Apache Cassandra, HBase

3.3 Document-oriented Database

Document oriented databases support the storage of semi-structured data. It can be JSON, XML, YAML or even a Word document. The unit of data is called a document (similar to a row in an RDBMS). The table which contains a group of documents is called a "Collection".

3.3.1 Data Model

The Database contains many Collections.  A Collection contains many documents.  Each document might contain a JSON document or XML document or YAML or even a Word Document.


Document databases are suitable for Web based applications and applications exposing RESTful services.

3.3.2 Example

MongoDB, CouchBaseDB
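
For instance, with MongoDB a JSON-like document can be stored and queried from Python using the pymongo driver (a minimal sketch against a local server; the database and collection names are illustrative):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)            #Connect to a local MongoDB server
orders = client.shopdb.orders                       #Database "shopdb", collection "orders"
orders.insert_one({"shopId": 100, "sales": 5488})   #Store a document
print(orders.find_one({"shopId": 100}))             #Query it back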

3.4 Graph Database

A real world graph contains vertices and edges; in a graph database they are called nodes and relations. Graph databases allow us to store and perform data manipulation operations on nodes, relations, and the attributes of nodes and relations.

Graph databases work best with directed graphs, i.e. when the relations between nodes have a direction.

3.4.1 Data Model

The graph database is a two dimensional representation of a graph. The graph is similar to a table: each graph contains Node, Node Properties, Relation and Relation Properties as columns, and each row has values for these columns. The values of the properties columns can be key-value pairs.

Graph databases are suitable for social media and network problems, which involve complex queries with many joins.

3.4.2 Example

Neo4j, OrientDB, HyperGraphDB, GraphBase, InfiniteGraph

4. Possible Problem Areas

Following are the important areas to be considered while choosing a NoSQL database for given problem statement.

4.1 ACID Transactions:

Most of the NoSQL databases do not support ACID transactions. E.g. MongoDB, CouchBase, Cassandra.  [Note: To know more about ACID transaction capabilities, refer the appendix below].

4.2 Proprietary APIs / SQL Support

Some NoSQL databases do not support the Structured Query Language; they only support an API interface. There is no common standard for these APIs. Every database follows its own way of implementing APIs, so there is an overhead of learning, and of developing separate adaptor layers for, each and every database. Some NoSQL databases do not support all standard SQL features.

4.3 No JOIN Operator

Due to the nature of the schema or data model, not all NoSQL databases support JOIN operations by default, whereas in RDBMS JOIN operation is a core feature.  The query language in Couchbase supports join operations.  In HBase it can be achieved by integrating with Hive.  MongoDB does not support it currently.

4.4 Leeway of the CAP Theorem

Most NoSQL databases take the leeway suggested by the CAP theorem and support only two of the three properties: consistency, availability and partition tolerance. They do not support all three qualities. [Note: Please refer to the appendix to know more about the CAP theorem.]

5. Summary

NoSQL databases solve problems where an RDBMS could not succeed, in both functional and non-functional areas. In this article we have seen the basic properties, generic data models, various types and features of NoSQL databases. To proceed further, start using any one of the NoSQL databases and get hands-on experience.


Appendix A Theories behind Databases

A.1 ACID Transactions

ACID is an acronym for Atomicity, Consistency, Isolation and Durability. These four properties are used to measure the reliability of database transactions.

A.1.1 Atomicity

Atomicity means that database transactions must be atomic in nature. This is also called the all or nothing rule. Databases must ensure that a single failure results in a rollback of the entire transaction until the commit point; the transaction is committed only if all of its operations are successful.

A.1.2 Consistency

Databases must ensure that only valid data is allowed to be stored. In an RDBMS, it is all about enforcing the schema. In NoSQL, consistency varies depending on the type of DB. For example, in a graph DB such as Neo4j, consistency ensures that every relationship has a start and an end node. In MongoDB, a unique row id (ObjectId) is created automatically as a 12 byte value.

A.1.3 Isolation

Databases allow multiple transactions in parallel. For example, when read and write operations happen in parallel, the read will not know about the write operation until the write transaction is committed. The read operation will see only the old data until the write transaction is fully committed.

A.1.4 Durability

Databases must ensure that committed transactions are persisted into storage. There must be appropriate transaction and commit logs available to enforce writing into hard disk.

A.2 Brewer’s CAP-Theorem

The CAP theorem states that any networked shared-data system can have at most two of three desirable properties. They are: consistency, availability and partition tolerance.


A.2.1 Consistency

In a distributed database system, all the nodes must see the same data at the same time.


A.2.2 Availability

The database system must be available to service a received request. Basically, the DBMS must be a highly available system.

A.2.3. Partition Tolerance

The database system must continue to operate despite arbitrary partitioning due to network failures.

GIT Command Reference


Git is a software that allows you to keep track of changes made to a project over time.  Git works by recording the changes you make to a project, storing those changes, then allowing you to reference them as needed.

GIT Project has three parts.

1. Working Directory : The directory where you will be doing all the work.  Creating, editing, deleting and organizing files.

2. Staging Area : The place where you will list changes you make to the working directory.

3. Repository :  A place where GIT permanently stores those changes as different versions of those projects.

GIT WorkFlow:

The Git workflow consists of editing files in the working directory, adding files to the staging area, and saving changes to the Git repository. Saving changes to the repository is called a commit.


I. BASIC GIT Commands

  1. git init – Turns the current working directory into a GIT Project
  2. git status – Prints the current status of git working directory.
  3. git add <filename> – Adds the file into the Staging Area.  [ After adding, verify the same with the git status command ]
  4. git add <list of files> – The add command also takes a list of files.
  5. git diff <filename> – Displays the diff between the file in the staging area and the current working directory
  6. git commit -m “Commit Comment” – Permanently stores changes from the staging area into the Git repository.
  7. git log – Prints the earlier versions of the project which are stored in chronological order.


II. Backtracking Changes
In Git, the commit you are currently on is known as the HEAD commit. In many cases, the most recently made commit is the HEAD commit.

SHA – The git log command displays the commit log, which contains the SHA value of each commit. The short SHA used in commands is the first 7 digits of the full SHA.

  1. git show HEAD – Displays the HEAD commit
  2. git reset HEAD <filename> – Unstages file changes in the Staging area.
  3. git checkout HEAD <filename> – Discards the changes in the Working Directory.
  4. git reset <SHA> –  It will reset back to the level of commit.


III. Branching

Git allows us to create branches to experiment with versions of a project. Imagine you want to develop a new API: until you are ready to merge that API into the master branch, it should not be available there. So, in this scenario we create a branch, develop the new API on it and then merge it into master.

  1. git branch – Shows the current branches and current active branch you are in.
  2. git branch <new branch name > – Create a new branch
  3. git checkout <branchname> – Switches to the branch
  4. git merge <branchname> – This command is issued from master to merge branch into the master.
  5. git branch -d <branchname> – Delete the branch.


IV. Collaboration

Git offers a suite of collaboration tools for working with others' projects.

  1. git clone <remote_location> <clone_name> – Creates a new replica of git repository from remote repository
  2. git fetch – Updates the clone by downloading new commits from the remote; it does not merge them into your local branches.
  3. git merge origin/master – Merge the local master with Origin Master.
  4. git push origin <branch> – Push your work to the origin.


V. Example Workflow to Add Files

  1. git clone <remote_location>
    1.     Eg. git clone http://www.abc.com/abc/bcd
  2. git add <new_files>
    1.     git add abc.py bcd.py
  3. git status
  4. git commit -m "Committing abc.py"
  5. git push origin master


VI. Reference

  1. https://confluence.atlassian.com/bitbucketserver/basic-git-commands-776639767.html

Python Collections : High Performing Containers For Complex Problems



1. Introduction

Python is known for its powerful general purpose built-in data types like list, dict, tuple and set. But Python also has collection objects similar to those in Java and C++. These objects are developed on top of the general built-in containers, with additional functionality that can be used in special scenarios.

The objective of this article is to introduce the Python collection objects and explain them with appropriate code snippets. The collections library contains the collection objects namedtuple (v2.6), deque (v2.4), ChainMap (v3.3), Counter (v2.7), OrderedDict (v2.7) and defaultdict (v2.5). Python 3.x also has UserDict, UserList and UserString to create custom container types (not in the scope of this article), which deserve a separate article.

NOTE: Python 2.x users might be aware of the releases in which these objects were introduced (shown above in parentheses). All these objects are available in Python 3.x from 3.1 onwards, except ChainMap, which was introduced in v3.3. All the code snippets in this article were executed in a Python 3.5 environment.

2. Namedtuple

As the name suggests, a namedtuple is a tuple with names. In a standard tuple we access the elements using indexes, whereas a namedtuple allows the user to define names for the elements. This is very handy, especially when processing CSV (comma separated value) files and working with complex and large datasets, where code based on indexes becomes messy (and not so Pythonic).

2.1 Example 1

Namedtuples are available in collections library in python. We have to import collections library before using any of container object from this library.

>>>from collections import namedtuple
>>>saleRecord = namedtuple('saleRecord','shopId saleDate salesAmount totalCustomers')
>>>#Assign values to a named tuple (the values below are illustrative)
>>>shop11 = saleRecord(11,'2015-01-01',2300,150)
>>>shop12 = saleRecord(shopId=12,saleDate='2015-01-01',salesAmount=1512,totalCustomers=125)

In the above code snippet, in the first line we import namedtuple from the collections library. In the second line we create a namedtuple called "saleRecord", which has shopId, saleDate, salesAmount and totalCustomers as fields. Note that namedtuple() takes two string arguments: the first argument is the name of the tuple and the second argument is the list of field names separated by spaces or commas. In the above example a space is used as the delimiter.
We have also created two tuples here, shop11 and shop12. For shop11, the values are assigned to the fields based on the order of the fields, and for shop12 the values are assigned using the names.

2.2 Example 2

>>>#Reading as a namedtuple
>>>print("Shop Id =",shop12.shopId)
>>>print("Sale Date=",shop12.saleDate)
>>>print("Sales Amount =",shop12.salesAmount)
>>>print("Total Customers =",shop12.totalCustomers)

The above code makes it pretty clear that the tuple is accessed using the names. It is also possible to access the elements using the indexes of the tuple, which is the usual way.

2.3 Interesting Methods and Members

2.3.1 _make

The _make method is used to convert a given iterable (list, tuple, etc.) into a namedtuple.

>>>#Convert a list into a namedtuple
>>>aList = [101,"2015-01-02",1250,199]
>>>shop101 = saleRecord._make(aList)
>>>shop101
saleRecord(shopId=101, saleDate='2015-01-02', salesAmount=1250, totalCustomers=199)

>>>#Convert a tuple into a namedtuple
>>>aTup =(108,"2015-02-28",1990,189)
>>>shop108 = saleRecord._make(aTup)
>>>shop108
saleRecord(shopId=108, saleDate='2015-02-28', salesAmount=1990, totalCustomers=189)

2.3.2 _fields

The _fields is a tuple, which contains the names of the tuple.

>>>saleRecord._fields
('shopId', 'saleDate', 'salesAmount', 'totalCustomers')

2.4 CSV File Processing

As we discussed namedtuple will be very handy while processing a csv data file, where we can access the data using names instead of indexes, which make the code more meaningful and efficient.

from csv import reader
from collections import namedtuple

saleRecord = namedtuple('saleRecord','shopId saleDate totalSales totalCustomers')
fileHandle = open("salesRecord.csv","r")
for fieldsList in csvFieldsList:
    shopRec = saleRecord._make(fieldsList)
    overAllSales += shopRec.totalSales;

print("Total Sales of The Retail Chain =",overAllSales)

In the above code snippet, we have the file salesRecord.csv, which contains the sales records of the shops of a particular retail chain. It contains values for the fields shopId, saleDate, totalSales and totalCustomers. The fields are delimited by commas and the records are delimited by new lines.
The csv reader() reads the file and provides an iterator. The iterator, "csvFieldsList", provides the list of fields for every single row of the csv file. As we know, _make() converts the list into a namedtuple, and the rest of the code is self explanatory.



3. Counter

Counter is used for rapid tallies. It is a dictionary where the elements are stored as keys and their counts are stored as values.

3.1 Creating Counters

The Counter() class takes an iterable object as an argument, computes the count for each element in the object and presents them as key-value pairs.

>>>from collections import Counter
>>>listOfInts = [1,1,1,1,2,2,2,3,3,4]
>>>cnt = Counter(listOfInts)
>>>cnt
Counter({1: 4, 2: 3, 3: 2, 4: 1})

In the above code snippet, listOfInts is a list which contains numbers. It is passed to Counter() and we get cnt, which is a container object. The cnt is a dictionary which contains the unique numbers present in the given list as keys, and their respective counts as the values.

3.2 Accessing Counters

Counter is a subclass of dictionary.  So it can be accessed the same as dictionary.   The “cnt” can be handled as a regular dictionary object.

>>> cnt.items()
dict_items([(1, 4), (2, 3), (3, 2), (4, 1)])
>>> cnt.keys()
dict_keys([1, 2, 3, 4])
>>> cnt.values()
dict_values([4, 3, 2, 1])

3.3 Interesting Methods & Usecases

3.3.1 most_common

The most_common(n) method of the Counter class provides the most commonly occurring keys. The n is used as a rank; for example, n = 2 will provide the top two keys.

>>>name = "Saravanan Subramanian"
[('a', 7)]
[('a', 7), ('n', 4)]
[('a', 7), ('n', 4), ('r', 2)]

In the above code, we can see that the string is parsed into independent characters as keys, with their respective counts stored as values. So, letterCnt.most_common(1) provides the top letter which has the highest number of occurrences.

3.3.2 Operations on Counter

The Counter() class is also called a multiset. It supports addition, subtraction, union and intersection operations on Counter objects.

>>> a = Counter(x=1,y=2,z=3)
>>> b = Counter(x=2,y=3,z=4)
>>> a+b
Counter({'z': 7, 'y': 5, 'x': 3})
>>> a-b       #This will result in negative values & will be omitted
Counter()
>>> b-a
Counter({'y': 1, 'x': 1, 'z': 1})
>>> a & b    #Chooses the minimum values from their respective pair
Counter({'z': 3, 'y': 2, 'x': 1})
>>> a | b   #Chooses the maximum values from their respective pair
Counter({'z': 4, 'y': 3, 'x': 2})

4. Default Dictionary

The defaultdict() is available as part of the collections library. It allows the user to specify a function to be called when a key is not present in the dictionary.

In a standard dictionary, accessing an element whose key is not present will raise a KeyError. So, this is a problem when working with collections (list, set, etc.), especially while creating them.

So, when the dictionary is queried for a key which does not exist, the function passed as the default_factory argument of defaultdict() is called to set a value for the given key in the dictionary.

4.1 Creating Default Dictionary

The defaultdict() takes, as its argument, a function with no arguments which returns the default value.

4.1.1 Example 1

>>> from collections import defaultdict
>>> booksIndex = defaultdict(lambda:'Not Available')
>>> booksIndex['a']='Arts'
>>> booksIndex['b']='Biography'
>>> booksIndex['c']='Computer'
>>> print(booksIndex)
defaultdict(<function <lambda> at 0x030EB3D8>, {'c': 'Computer', 'b': 'Biography', 'a': 'Arts'})
>>> booksIndex['z']
'Not Available'
>>> print(booksIndex)
defaultdict(<function <lambda> at 0x030EB3D8>, {'c': 'Computer', 'b': 'Biography', 'z': 'Not Available', 'a': 'Arts'})

In the above example, booksIndex is a defaultdict which sets 'Not Available' as the value when any non-existent key is accessed. We have added values for the keys a, b & c into the defaultdict. The print(booksIndex) shows that the defaultdict contains values only for these keys. While trying to access the value for the key 'z', which we have not set, it returned the value 'Not Available' and updated the dictionary.

4.1.2 Example 2

>>> titleIndex = [('a','Arts'),('b','Biography'),('c','Computer'),('a','Army'),('c','Chemistry'),('d','Dogs')]
>>> rackIndices = defaultdict(list)
>>> for id,title in titleIndex:
...     rackIndices[id].append(title)
...
>>> rackIndices.items()
dict_items([('d', ['Dogs']), ('b', ['Biography']), ('a', ['Arts', 'Army']), ('c', ['Computer', 'Chemistry'])])

In the above example, titleIndex contains a list of tuples. We want to aggregate this list of tuples to identify the titles for each alphabet. So, we can have a dictionary where the key is the alphabet and the value is the list of titles. Here we used a defaultdict with "list" as the function to be called for missing elements. So for each new element, list() is called and creates an empty list object; the consecutive append() calls on that list then add elements to it.

5. Ordered Dictionary

The ordered dictionary maintains the order in which elements are added to the dictionary, whereas the standard dictionary does not maintain the insertion order.

5.1 Ordered Dictionary Creation

An ordered dictionary is created using OrderedDict() from the collections library. It is a subclass of the regular dictionary, so it inherits all the other methods and behaviours of the regular dictionary.

>>> from collections import OrderedDict
>>> dOrder=OrderedDict()
>>> dOrder['a']='Alpha'
>>> dOrder['b']='Bravo'
>>> dOrder['c']='Charlie'
>>> dOrder['d']='Delta'
>>> dOrder['e']='Echo'
>>> dOrder
OrderedDict([('a', 'Alpha'), ('b', 'Bravo'), ('c', 'Charlie'), ('d', 'Delta'), ('e', 'Echo')])
>>> dOrder.keys()
odict_keys(['a', 'b', 'c', 'd', 'e'])
>>> dOrder.values()
odict_values(['Alpha', 'Bravo', 'Charlie', 'Delta', 'Echo'])
>>> dOrder.items()
odict_items([('a', 'Alpha'), ('b', 'Bravo'), ('c', 'Charlie'), ('d', 'Delta'), ('e', 'Echo')])

5.2 Creating from other iterable items

An OrderedDict can also be created by passing a dictionary or a list of key-value pair tuples.

>>> from collections import OrderedDict
>>> listKeyVals = [(1,"One"),(2,"Two"),(3,"Three"),(4,"Four"),(5,"Five")]
>>> x = OrderedDict(listKeyVals)
>>> x
OrderedDict([(1, 'One'), (2, 'Two'), (3, 'Three'), (4, 'Four'), (5, 'Five')])

5.3 Sort and Store

One of the interesting use cases for OrderedDict is the rank problem. For example, consider a dictionary which contains student names and their marks; we have to find the best students and rank them according to their marks. OrderedDict is the right choice here. Since OrderedDict remembers the order of addition and sorted() can sort the dictionary items, we can combine both to create a rank list based on the student marks. Please check the example below:

>>> studentMarks={}
>>> studentMarks["Saravanan"]=100
>>> studentMarks["Subhash"]=99
>>> studentMarks["Raju"]=78
>>> studentMarks["Arun"]=85
>>> studentMarks["Hasan"]=67
>>> studentMarks
{'Arun': 85, 'Subhash': 99, 'Raju': 78, 'Hasan': 67, 'Saravanan': 100}
>>> sorted(studentMarks.items(),key=lambda t:t[0])
[('Arun', 85), ('Hasan', 67), ('Raju', 78), ('Saravanan', 100), ('Subhash', 99)]
>>> sorted(studentMarks.items(),key=lambda t:t[1])
[('Hasan', 67), ('Raju', 78), ('Arun', 85), ('Subhash', 99), ('Saravanan', 100)]
>>> sorted(studentMarks.items(), key = lambda t:-t[1])
[('Saravanan', 100), ('Subhash', 99), ('Arun', 85), ('Raju', 78), ('Hasan', 67)]
>>> rankOrder = OrderedDict(sorted(studentMarks.items(), key = lambda t:-t[1]))
>>> rankOrder
OrderedDict([('Saravanan', 100), ('Subhash', 99), ('Arun', 85), ('Raju', 78), ('Hasan', 67)])

In the above example, studentMarks is a dictionary which contains the student names as keys and their marks as the values. It is sorted by value, passed to OrderedDict and stored in rankOrder. Now rankOrder contains the highest scoring student as the first entry, the next highest as the second entry and so on, and this order is preserved in the dictionary.

6. Deque

Deque means double ended queue and it is pronounced "deck". It is an extension of the standard list data structure. The standard list allows the user to append or extend elements only at the end, but a deque allows the user to operate on both ends, so that the user can implement both stacks and queues.

6.1 Creation & Performing Operations on Deque

The deque() is available in the collections library. It takes an iterable as an argument and an optional maximum length. If maxlen is set, it ensures that the deque length does not exceed maxlen.

>>> from collections import deque
>>> aiao = deque([1,2,3,4,5],maxlen=5)
>>> aiao
deque([1, 2, 3, 4, 5], maxlen=5)
>>> aiao.append(6)
>>> aiao
deque([2, 3, 4, 5, 6], maxlen=5)
>>> aiao.appendleft(1)
>>> aiao
deque([1, 2, 3, 4, 5], maxlen=5)

In the above example, we created a deque with maxlen 5; once we appended a 6th element on the right, it pushed the first element out on the left. Similarly, it pushes out the last element on the right when we append an element on the left.

6.2 Operations on Right

Operations on the right are the same as the usual operations on a list. The methods append(), extend() and pop() operate on the right side of the deque.

>>> aiao.append(6)
>>> aiao
deque([2, 3, 4, 5, 6], maxlen=5)
>>> aiao.extend([7,8,9])
>>> aiao
deque([5, 6, 7, 8, 9], maxlen=5)
>>> aiao.pop()
9

6.3 Operation on the Left

The special feature of performing operations on the left is supported by set of methods like appendleft(), extendleft(), popleft().

>>> aiao = deque([1,2,3,4,5],maxlen=5)
>>> aiao.appendleft(0)
>>> aiao
deque([0, 1, 2, 3, 4], maxlen=5)
>>> aiao.extendleft([-1,-2,-3])
>>> aiao
deque([-3, -2, -1, 0, 1], maxlen=5)
>>> aiao.popleft()
-3

6.4 Example 2 (without maxlen)

If the maxlen value is not set, the deque does not perform any trimming operations to maintain the size of the deque.

>>> aiao = deque([1,2,3,4,5])
>>> aiao.appendleft(0)
>>> aiao
deque([0, 1, 2, 3, 4, 5])
>>> aiao.extendleft([-1,-2,-3])
>>> aiao
deque([-3, -2, -1, 0, 1, 2, 3, 4, 5])

From the above example, the deque aiao continues to grow for the append and extend operations performed on it.

7. ChainMap

ChainMap allows multiple dictionaries to be combined into a single view, so that operations can be performed on a single logical entity. ChainMap() does not create a new dictionary; instead it maintains references to the original dictionaries, and all operations are performed on the referred dictionaries.

7.1 Creating ChainMap

>>> from collections import ChainMap
>>> x = {'a':'Alpha','b':'Beta','c':'Cat'}
>>> y = { 'c': "Charlie", 'd':"Delta", 'e':"Echo"}
>>> z = ChainMap(x,y)
>>> z
ChainMap({'c': 'Cat', 'b': 'Beta', 'a': 'Alpha'}, {'d': 'Delta', 'c': 'Charlie', 'e': 'Echo'})
>>> list(z.keys())
['b', 'd', 'c', 'e', 'a']
>>> list(z.values())
['Beta', 'Delta', 'Cat', 'Echo', 'Alpha']
>>> list(z.items())
[('b', 'Beta'), ('d', 'Delta'), ('c', 'Cat'), ('e', 'Echo'), ('a', 'Alpha')]

We have created the ChainMap z from the dictionaries x & y. The ChainMap z holds references to the dictionaries x and y. A ChainMap will not maintain duplicate keys; it returns the first value it finds, 'Cat', for the key 'c'. So, basically it skips the second occurrence of the same key.

>>> x
{'c': 'Cat', 'b': 'Beta', 'a': 'Alpha'}
>>> y
{'d': 'Delta', 'c': 'Charlie', 'e': 'Echo'}
>>> x.pop('c')
'Cat'
>>> x
{'b': 'Beta', 'a': 'Alpha'}
>>> list(z.keys())
['d', 'c', 'b', 'e', 'a']
>>> list(z.values())
['Delta', 'Charlie', 'Beta', 'Echo', 'Alpha']
>>> list(z.items())
[('d', 'Delta'), ('c', 'Charlie'), ('b', 'Beta'), ('e', 'Echo'), ('a', 'Alpha')]

In the above code, we have removed the key ‘c’ from dict x. Now the ChainMap points the value for key ‘c’ to “Charlie”, which is present in y.

8. Summary

We have seen the various Python collection data types and understood them with examples and use cases. The official Python documentation [1] can be referred to for further reading.

9. References

[1] Python collections documentation – https://docs.python.org/3.5/library/collections.html

RabbitMQ : A Cloud based Message Oriented Middleware


In this article we will understand RabbitMQ, a message broker middleware recommended by OpenStack for cloud deployments. It complies with the AMQP standard and is developed in Erlang. The code examples are developed using Python and the pika library.

1. Message Broker

A message broker is a software component that enables communication across applications in an enterprise application cluster. It is also known as Message Oriented Middleware (MOM) in Service Oriented Architecture (SOA). The applications in the enterprise cluster use the message broker like a mail exchange or a post office to send messages to other applications.

RabbitMQ complies with the AMQP standard, which is an open standard for business messaging between applications and organizations. It is a binary protocol rather than an interface specification. The AMQP standard enables messaging as a cloud service, advanced publish-subscribe patterns, custom header based routing and programming language independence.

 2. RabbitMQ Model

The RabbitMQ model consists of various components. They are: producer (sender), consumer (receiver), exchanges, bindings and message queues. These components work together as explained below:

  • The producer sends a message to an exchange
  • Exchange forwards the message to the queues based on the bindings
  • The bindings are set by queues to attach to an exchange
  • Consumers (receivers) receive the messages from their respective message queues.
Figure: RabbitMQ ecosystem

The producer and consumer of the messages are external entities. The Exchanges, Queues and Bindings are internal entities of the message broker.

2.1 Exchanges

Exchanges are the core of RabbitMQ.  The producer sends messages to exchanges, which forward them to the right consumers.  Exchanges make forwarding decisions based on the exchange type, the routing_key in the message, custom header fields in the message, and the routing_key registered by the queue with the exchange.

Exchanges can be configured as durable, so that they survive restarts of the RabbitMQ server. Exchanges can be “internal” to the server, so that only other exchanges inside the server can publish to them.  Exchanges can also be configured to auto-delete when no more queues are bound to them.
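
As a minimal Pika sketch of these options (the exchange name and flag values here are purely illustrative), a durable, non-auto-delete direct exchange can be declared as follows:

channel.exchange_declare(exchange='SMS-Exchange',
                         exchange_type='direct',   # older pika versions named this parameter 'type'
                         durable=True,             # survive broker restarts
                         auto_delete=False)        # keep the exchange even when no queues are bound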

2.2 Queues

Queues are used to forward the messages to target consumers and ensure orderly message delivery. A consumer binds itself to a queue with a callback function, so that it will be notified on receiving a message.

Queues can be configured as durable, so that they survive restarts of the RabbitMQ server. A queue can also be configured as “exclusive” to a connection; once the connection is closed, the queue is deleted. Auto-delete queues are deleted when no consumer is subscribed to the queue.
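
A minimal Pika sketch of these queue options (the queue name and flag values are illustrative only):

channel.queue_declare(queue='SMS-Queue',
                      durable=True,       # survive broker restarts
                      exclusive=False,    # usable by more than one connection
                      auto_delete=False)  # keep the queue even with no subscribers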

2.3 Bindings

Bindings are used by the exchanges to make the routing decision for a message. Basically, bindings connect the exchanges with the queues, using the exchange type, the routing key set by the queue, custom message headers and the routing key present in the message sent by the producer.  If no binding matches a message, the message can be dropped or returned to the producer, based on the properties set in the message.

3. Producer and Consumer

In a typical banking system, when a customer withdraws money from an ATM or swipes a credit card at a shop, the customer is notified with an SMS or an email to prevent fraud.  In this use case, the core banking system sends a message to the messaging sub-systems and continues processing other customer requests.  The messaging sub-systems listening on the queues are notified and take appropriate actions based on subscriber preferences.  In this example, the core banking system is the producer of the messages and the messaging sub-systems are the consumers of the messages.

3.1 Producer

As per the RabbitMQ design, the producer sends the message to an exchange.  The behaviour of the exchange is defined by the exchange type, the routing_key and the bindings set by the message queues.  Let us see how to create a producer using the Python Pika library.  If you do not have a RabbitMQ setup, please refer to the Appendix-A section to install and configure RabbitMQ.

import pika

""" Establish connection and declare the exchange """
connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()
channel.exchange_declare(exchange="SMS-Exchange", exchange_type="direct")

""" Publish a Message on the Channel """
Msg = "Balance as of today is 200.00$"
channel.basic_publish(exchange="SMS-Exchange", routing_key="SMS-Alert", body=Msg)

""" Close the connection """
connection.close()

In the above code segment, we first establish a connection to the server where RabbitMQ is hosted.  RabbitMQ allows us to establish multiple channels within an existing connection, and all communication operations are tied to a channel.  Next, exchange_declare creates an exchange named “SMS-Exchange” with its type set to “direct”.  If the exchange already exists, this statement will not raise an error; it simply reuses it, as long as there is no conflict in the exchange type.

The function “channel.basic_publish” is used to send a message to the exchange.  The producer has to mention the name of the exchange and the routing_key along with the actual message.  In a “direct” exchange the “routing_key” is used to make the message forwarding decision.  Finally, close the connection if it is not needed any more.

3.2 Consumer

As per the RabbitMQ design, the consumer is the target application for which the messages are intended.  The consumer must register itself to a queue and bind it with the exchange.  If more than one consumer registers to a queue, then RabbitMQ sends the messages to the consumers in round-robin fashion.

import pika

""" define a callback function to receive the message """
def callback(channel,method,properties,body):
    print "[X] Received Message ",body

""" Establish connection """
connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

""" Declare the queue & bind to the exchange with routing key """
channel.queue_declare(queue="SMS-Queue")
channel.queue_bind(exchange="SMS-Exchange", queue="SMS-Queue", routing_key="SMS-Alert")

""" register the call back function (pika 1.x uses the on_message_callback keyword instead) """
channel.basic_consume(callback, queue="SMS-Queue", no_ack=True)

""" listen to the queue """
channel.start_consuming()

The connection establishment is exactly the same for both producer and consumer.  But the consumer has to declare a queue and bind the queue with the exchange along with the routing_key.  In the above code, the “SMS-Queue” is created and bound with “SMS-Exchange” using the routing_key “SMS-Alert”.  This instructs the SMS-Exchange to forward messages with the routing_key “SMS-Alert” to “SMS-Queue”.

The callback function registered with the queue will be called once a message is received on the queue.  The last statement, “channel.start_consuming()”, is a blocking call, where the consumer waits for messages on the registered queue.

3.3 Remote Connection

In the above examples we have connected to a RabbitMQ server running on localhost.  Let us see a code snippet showing how to connect to a remote RabbitMQ server:

import pika
credits = pika.PlainCredentials('scott','tiger')
params = pika.ConnectionParameters('',5672,'/',credits)
connection = pika.BlockingConnection(parameters=params)

The above code first creates a credentials object with the user credentials.  Then we create a parameters object with the IP address, port number, virtual host path and the credentials object.  This parameters object is passed to the BlockingConnection method to establish a connection with the given parameters.  Once you establish a connection, the rest of the code is exactly the same for creating a producer and a consumer.

In this exercise we have discussed developing a producer and a consumer using a blocking connection.  RabbitMQ also supports asynchronous connections.  Please find the examples at http://pika.readthedocs.org/en/latest/examples/asynchronous_publisher_example.html .

4. Exchange Types

In this section we will learn how to develop various types of exchanges.

4.1 Direct Exchange

A direct exchange delivers messages to queues based on the message routing key. A queue binds to the exchange with a routing key. When a new message with that routing key arrives at the direct exchange, the message is routed to that queue. Direct exchanges are used to distribute tasks between multiple workers in a round-robin manner: RabbitMQ load balances across the consumers when multiple consumers listen on the same queue. A direct exchange is ideal for unicast routing of messages.

Producer: The below code segment creates an exchange named “Direct-X” and sets the exchange type to “direct”.  It then publishes a message to this exchange with the routing_key set to “Key1”.

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

""" Declare an exchange with type as direct """
channel.exchange_declare(exchange="Direct-X", exchange_type="direct")

""" Publish a Message on the Channel """
Msg = raw_input("Please enter the message :")
channel.basic_publish(exchange="Direct-X", routing_key="Key1", body=Msg)

""" Close the connection """
connection.close()


The below code snippet creates a queue named “Direct-Q1” and registers it with the exchange “Direct-X” for the messages with the routing_key “Key1”.

import pika

""" define a callback function to receive the message """
def callback(channel,method,properties,body):
    print "[X] Received Message ",body

""" Establish connection """
connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

""" Declare the queue & bind to the exchange with routing key """
channel.queue_declare(queue="Direct-Q1")
channel.queue_bind(exchange="Direct-X", queue="Direct-Q1", routing_key="Key1")

""" register the call back function (pika 1.x uses the on_message_callback keyword instead) """
channel.basic_consume(callback, queue="Direct-Q1", no_ack=True)

""" listen to the queue """
channel.start_consuming()

Note: If more than one consumer listens to the same queue, RabbitMQ load balances the messages across the consumers in round-robin fashion.

4.2 Fanout Exchange

A fanout exchange routes messages to all the queues that are bound to it, irrespective of the routing key. It is most suitable for broadcast.

Create a producer which sends a message to the exchange “Fanout-X”.

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

""" Declare an exchange with type as fanout """
channel.exchange_declare(exchange="Fanout-X", exchange_type="fanout")

""" Publish a Message on the Channel (the routing key is ignored by a fanout exchange) """
Msg = raw_input("Please enter the message :")
channel.basic_publish(exchange="Fanout-X", routing_key="Key1", body=Msg)

""" Close the connection """
connection.close()

Consumer 1:

Creates a queue “Fanout-Q1” and binds it to the exchange “Fanout-X”, which is of the fanout exchange type. Even if the consumer registers a routing_key, it will not have any effect on the exchange’s forwarding decision, because it is a fanout exchange.

import pika

""" define a callback function to receive the message """
def callback(channel,method,properties,body):
    print "[X] Received Message ",body

""" Establish connection """
connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

""" Declare the queue & bind to the exchange (the routing key is ignored for fanout) """
channel.queue_declare(queue="Fanout-Q1")
channel.queue_bind(exchange="Fanout-X", queue="Fanout-Q1", routing_key="Key1")

""" register the call back function (pika 1.x uses the on_message_callback keyword instead) """
channel.basic_consume(callback, queue="Fanout-Q1", no_ack=True)

""" listen to the queue """
channel.start_consuming()

Consumer 2: Creates another queue “Fanout-Q2” and binds it to the same exchange “Fanout-X”.

import pika

""" define a callback function to receive the message """
def callback(channel,method,properties,body):
    print "[X] Received Message ",body

""" Establish connection """
connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

""" Declare the queue & bind to the exchange (the routing key is ignored for fanout) """
channel.queue_declare(queue="Fanout-Q2")
channel.queue_bind(exchange="Fanout-X", queue="Fanout-Q2", routing_key="Key1")

""" register the call back function (pika 1.x uses the on_message_callback keyword instead) """
channel.basic_consume(callback, queue="Fanout-Q2", no_ack=True)

""" listen to the queue """
channel.start_consuming()

Now you will see that the message sent by the publisher will be received by both consumers.

4.3 Topic Exchange

Topic exchanges route messages to one or more queues based on the message routing key and the key pattern used to bind a queue to the exchange. This exchange is used to implement the publish/subscribe pattern. A topic exchange is suitable for multicast.
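
In a topic binding pattern, “*” matches exactly one dot-separated word and “#” matches zero or more words. For example, assuming routing keys of the form country.department.action as used by the producer below:

"*.sales.*"   # matches in.sales.put, in.sales.post, in.sales.delete
"#.post"      # matches any routing key ending with "post"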

Producer creates an exchange named “Topic-X” of type “topic” and sends various messages with different key values.

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

""" Declare an exchange with type as topic """
channel.exchange_declare(exchange="Topic-X", exchange_type="topic")

""" Publish messages on the Channel with different routing keys """
Msg = raw_input("Please enter the message :")
channel.basic_publish(exchange = "Topic-X",routing_key="in.sales.put",body=Msg)
channel.basic_publish(exchange = "Topic-X",routing_key="in.sales.post",body=Msg)
channel.basic_publish(exchange = "Topic-X",routing_key="in.sales.delete",body=Msg)
channel.basic_publish(exchange = "Topic-X",routing_key="in.rnd.put",body=Msg)
channel.basic_publish(exchange = "Topic-X",routing_key="in.rnd.post",body=Msg)
channel.basic_publish(exchange = "Topic-X",routing_key="in.rnd.delete",body=Msg)

""" Close the connection """
connection.close()

Consumer 1: Creates a queue named “Topic-Q1” and binds it with the exchange “Topic-X” for all the “post” messages.

import pika

""" define a callback function to receive the message """
def callback(channel,method,properties,body):
    print "[X] Received Message ",body

""" Establish connection """
connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

""" Declare the queue & bind to the exchange ("*.*.post" matches all the "post" routing keys above) """
channel.queue_declare(queue="Topic-Q1")
channel.queue_bind(exchange="Topic-X", queue="Topic-Q1", routing_key="*.*.post")

""" register the call back function (pika 1.x uses the on_message_callback keyword instead) """
channel.basic_consume(callback, queue="Topic-Q1", no_ack=True)

""" listen to the queue """
channel.start_consuming()

Consumer 2: Creates another queue named “Topic-Q2” and binds with the same exchange “Topic-X” for all messages coming from “rnd”.

import pika

""" define a callback function to receive the message """
def callback(channel,method,properties,body):
    print "[X] Received Message ",body

""" Establish connection """
connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

""" Declare the queue & bind to the exchange ("*.rnd.*" matches all the "rnd" routing keys above) """
channel.queue_declare(queue="Topic-Q2")
channel.queue_bind(exchange="Topic-X", queue="Topic-Q2", routing_key="*.rnd.*")

""" register the call back function (pika 1.x uses the on_message_callback keyword instead) """
channel.basic_consume(callback, queue="Topic-Q2", no_ack=True)

""" listen to the queue """
channel.start_consuming()

4.4 Headers Exchange

Header exchanges route messages based on attributes in the message header and ignore the routing key. If the message header attributes match the queue binding arguments, then the message is forwarded to those queues.

The producer can set key-value pairs (a dictionary) in the header of the message it sends.

import pika

""" A dictionary to store the custom headers """
header = {'Source':'Core-Banking'}

""" Establish Blocking Connection with RabbitMQ Server """
conn = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
chan = conn.channel()

""" Declare an exchange and set type as headers """
chan.exchange_declare(exchange='Header-X', exchange_type='headers')

""" Set the headers using the BasicProperties api """
prop = pika.BasicProperties(headers=header)

""" Publish the message to the exchange (the routing key is ignored by a headers exchange) """
msg = "Hello Message-System, I am Core-Banking"
chan.basic_publish(exchange='Header-X', routing_key='', body=msg, properties=prop)
The above producer creates an exchange named “Header-X” and sets the type to “headers”. Then it sends a message with the header key “Source” and the value “Core-Banking”. Basically, the message header carries the originator of the message, which in this example is “Core-Banking”.

The below consumer creates a queue named “Header-Q1” and binds it with “Header-X” along with the header key-value pair it is interested in.

import pika

exg = 'Header-X'
que = 'Header-Q1'
key, val = 'Source', 'Core-Banking'

""" Declare a call back function to receive the message """
def callback(ch,method,properties,body):
    print "[X] Received Msg",body

""" Establish blocking connection to the server """
conn = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
chan = conn.channel()

""" Declare a Queue """
chan.queue_declare(queue=que)

""" Bind the queue to the exchange and set arguments to match any of the given headers """
chan.queue_bind(exchange=exg, queue=que, arguments={key:val,'x-match':'any'})

print '[*] Wait for messages. To exit press CTRL+C'

""" Link the callback function to the queue (pika 1.x uses the on_message_callback keyword instead) """
chan.basic_consume(callback, queue=que, no_ack=True)

""" Wait for the messages to arrive """
chan.start_consuming()

Note that the queue_bind statement takes an additional argument named “arguments”, which takes the key-value pairs. If more than one key-value pair is present, ‘x-match’:‘any’ ensures that the message is delivered to this queue even if only a single entry matches.
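
For contrast, a minimal sketch of a stricter binding (the extra ‘Type’ header below is purely illustrative): with ‘x-match’:‘all’, every listed header must match before the message is delivered.

chan.queue_bind(exchange='Header-X', queue='Header-Q1',
                arguments={'Source':'Core-Banking', 'Type':'SMS', 'x-match':'all'})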

5. Summary

We have seen how RabbitMQ works and understood its various components and messaging models.  RabbitMQ supports highly available clustering environments, which ensure zero downtime, higher throughput and increased capacity, making it suitable for cloud based installations.

6. References

1. Message Broker – https://msdn.microsoft.com/en-us/library/ff648849.aspx
2. AMQP – https://www.rabbitmq.com/tutorials/amqp-concepts.html
3. RabbitMQCtl Command Reference – https://www.rabbitmq.com/man/rabbitmqctl.1.man.html
4. PIKA Documentation – http://pika.readthedocs.org/en/latest/modules/index.html
5. AMQP Standards – https://www.amqp.org/

Appendix – A

A. Installation & Configuration

The RabbitMQ infrastructure is installed on CentOS 6.5.

A.1 Installation

Please install RabbitMQ, Erlang ( RabbitMQ runs on Erlang), Python & Pika.

A.2 Install RabbitMQ Management Plugin

RabbitMQ comes with a web based management console.  It is bundled along with the RabbitMQ installation we have already performed.  To enable the management console, execute the below command on your Linux machine:

rabbitmq-plugins enable rabbitmq_management

The RabbitMQ management console uses port 15672, so this port must be opened.  Please refer to https://www.rabbitmq.com/management.html for further information on the Management Plugin.

After enabling the management plug-in, you can open the http://ipaddress:15672/ in your browser to see the Web Management Interface of RabbitMQ.

A.3 Configuring RabbitMQ (vhost/user/privileges)

Once you have installed RabbitMQ it is ready for production.  But we have to make a few mandatory configuration changes in order to allow remote clients, because the default configuration and user privileges will not allow remote clients to access the server.

The idea here is to create a virtual environment within the RabbitMQ server and provide access to a user, who can then create message queues and exchanges, and read from and write to those resources.  So the users are restricted to a virtual environment and the resources are contained within that environment.

A.3.1 Create a Virtual Host
rabbitmqctl add_vhost <virtual_host_name>

Note: RabbitMQ has a default virtual host named "/".

A.3.2 Create a New User with password
rabbitmqctl add_user <username> <password>
Eg. rabbitmqctl add_user scott tiger

Note: RabbitMQ has a default user named "guest", but it cannot be used for remote access by external clients. It is only allowed when the server is accessed locally.

A.3.3 Assign permissions to the user to access a vhost
rabbitmqctl set_permissions [-p vhostpath] {user} {conf} {write} {read}
Eg: rabbitmqctl set_permissions -p "/" "scott" ".*" ".*" ".*"
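
Putting A.3.1 to A.3.3 together, a hypothetical end-to-end setup (the vhost name "sandbox" is just an example) would look like:

rabbitmqctl add_vhost sandbox
rabbitmqctl add_user scott tiger
rabbitmqctl set_permissions -p sandbox scott ".*" ".*" ".*"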

Refer https://www.rabbitmq.com/access-control.html for more information.

A.4 Firewall Settings

We have to open the necessary ports in the Linux firewall in order to allow remote clients.  Perform the following operations on iptables to open the ports 5672 & 15672.

sudo iptables -I INPUT -p tcp --dport 5672 --syn -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 15672 --syn -j ACCEPT

To make the changes permanent, make the configuration changes in the file /etc/sysconfig/iptables and restart the iptables service:

sudo service iptables restart

A.5 Linux commands to manage the RabbitMQ server

service rabbitmq-server start
service rabbitmq-server stop
service rabbitmq-server status
service rabbitmq-server restart

To configure RabbitMQ to start automatically at boot:

chkconfig rabbitmq-server on

A.6 RabbitMQ Commands

  • List all exchanges – rabbitmqctl list_exchanges
  • List all queues – rabbitmqctl list_queues
  • List all bindings – rabbitmqctl list_bindings
  • List consumers – rabbitmqctl list_consumers
  • List connections – rabbitmqctl list_connections


Microservices Design Principles

Posted on Updated on

The objective of this post is to understand micro services, the relevant software architecture, design principles and the constraints to be considered while developing micro services.

1. Micro Services

Micro services are small autonomous systems that provide a solution that is unique and distinct within the eco-system. A micro service runs as a full-stack module and collaborates with other micro-services that are part of the eco-system.  Sam Newman defines micro services as “small, focused and doing one thing very well” in his book “Building Microservices”.

Micro services are created by slicing and dicing a single large monolithic system into many independent autonomous systems.  A micro service can also be a pluggable add-on component that works along with an existing system, or a green field project.

2. Eco system

Though the concept of micro services is not new, the evolution of cloud technologies, agile methodologies, continuous integration and automatic provisioning (DevOps) tools led to the rise of micro services.

2.1 Cloud Technologies

One of the important features of the cloud is “elasticity”.  The cloud allows the user to dynamically scale up and scale down the performance and capacity of a system by increasing or decreasing infrastructure resources such as virtual machines, storage, databases, etc.  If the software is one single large monolithic system, it cannot effectively utilize this capability of the cloud infrastructure, because the inner sub-modules and the communication pipes across the system could be the bottleneck, which cannot scale appropriately.

Since micro-services are small, independent and full stack systems, they can efficiently use the elastic nature of the cloud infrastructure.  Increasing or decreasing the number of instances of a micro-service directly impacts the performance and capacity of the system proportionately.

2.2 Dev Ops

DevOps is a methodology that focuses on speeding up the process from software development to customer deployment.  This methodology concentrates on improving the communication and collaboration between software development and IT operations through integration, automation and cooperation.

The micro services architecture supports the objectives of both software engineers and IT professionals. Being a small, independent component, a micro service is relatively easier to develop, test, deploy and recover (on failure) when compared to a large monolithic architecture.

2.3 Agile Methodologies

Agile is a software development process model, which evolved from the Extreme Programming (XP) and Iterative and Incremental (2I) development process models.  Agile is best suited for small teams working on software deliverables where the requirement volatility is high and the time to market is short.

As per the agile manifesto, agile prefers :

  • Individuals and interactions over Processes and Tools
  • Working Software over comprehensive documentation
  • Customer Collaboration over contract negotiation
  • Responding to Change over following a plan

A small dynamic team working in an agile process model and developing a micro service that is small, independent and full-stack will have complete product ownership with clear boundaries of responsibility.

3. Design of Micro Services

3.1 Characteristics of Micro Services

Micro services are designed to be small, stateless, in(ter)dependent & full-stack applications so that they can be deployed in cloud infrastructure.

Small : Micro services are designed to be small, but defining “small” is subjective.  Estimation techniques like lines of code, function points or use cases may be used, but these are not recommended estimation techniques in agile.

In the book Building Microservices, the author Sam Newman suggests a few techniques to define the size of a micro service: it should be small enough to be owned by a small agile development team, re-writable within one or two agile sprints (typically two to four weeks), and its complexity should not require refactoring or further division into another micro service.

Stateless : A stateless application handles every request with the information contained only within it. Micro services must be stateless, and they must service requests without remembering previous communications from the external system.

In(ter)dependent : Micro services must service the request independently, but it may collaborate with other micro services within the eco-system.  For example, a micro service that generates a unique report after interacting with other micro services is an interdependent system. In this scenario, other micro services which only provide the necessary data to reporting micro services may be independent services.

Full-Stack Application : A full stack application is individually deployable. It has its own server, network & hosting environment.  The business logic, data model and the service interface (API / UI) must be part of the entire system.  A micro service must be a full stack application.

3.2 Architecture Principles

Though SOA is one of the important architecture styles that helps in designing micro services, there are a few more architecture styles and design principles that need to be considered while designing micro services.  They are:

3.2.1 Single Responsibility Principle (Robert C Martin)

Each micro-service must be responsible for a specific feature or functionality, or an aggregation of cohesive functionality.  The rule of thumb for applying this principle is: “Gather together those things which change for the same reason, separate those things which change for different reasons”.

3.2.2 Domain Driven Design

Domain driven design is an architectural principle in line with the object oriented approach. It recommends designing systems to reflect the real world domains.  It considers the business domain, its elements and behaviors, and the interactions between business domains.  For example, in the banking domain, individual micro services can be designed to handle various business functions such as retail banking, on-line banking, on-line trading etc. The retail banking micro-service can then offer related services, e.g. opening a bank account, cash withdrawals, cash deposits, etc.

3.2.3 Service Oriented Architecture

The Service Oriented Architecture (SOA) is an architecture style which enforces certain principles and philosophies.  Following are the principles of SOA to be adhered to while designing micro-services for the cloud.

Encapsulation

The services must encapsulate the internal implementation details, so that the external systems utilizing the services need not worry about the internals. Encapsulation reduces the complexity and enhances the flexibility (adaptability to change) of the system.

Loose Coupling

The changes in one micro-service should have zero or minimum impact on the other services in the eco-system.  This principle also suggests having loosely coupled communication methods between the micro services.  As per SOA, RESTful APIs are more suitable than Java RMI, where the latter enforces a technology on the other micro-services.

Separation of Concern

Develop the micro-services based on distinct features with zero overlap with other functions. The main objective is to reduce the interaction between services so that they are highly cohesive and loosely coupled. Separating the functionality across the wrong boundaries will lead to tight coupling and increased complexity between services.

The above core principles provide only a gist of SOA.  There are more principles and philosophies of SOA which fit nicely into the design principles of micro-services for the cloud.

3.2.4 Hexagonal Architecture

This architecture style was proposed by Alistair Cockburn.  It allows an application to be equally driven by users, programs, automated tests or batch scripts, and to be developed and tested in isolation from its eventual run-time devices and databases.  It is also called the “Ports and Adapters” architecture, where the ports and adapters encapsulate the core application so that it responds uniformly to external requests.  The ports and adapters handle the external messages and convert them into the appropriate functions or methods exposed by the inner core application.  A typical micro service exposes RESTful APIs for external communication, a message broker interface (e.g. RabbitMQ, HornetQ, etc.) for event notification and database adapters for persistence, which makes hexagonal architecture a most suitable style for micro service development.
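
A minimal Python sketch of the ports-and-adapters idea (all class and function names here are hypothetical, purely for illustration): the core application depends only on an abstract port, and each adapter translates an external technology into that port.

class NotificationPort(object):
    """ Port: the only contract the core application knows about """
    def send(self, customer_id, text):
        raise NotImplementedError

class SmsAdapter(NotificationPort):
    """ Adapter: translates the port call into a concrete technology, e.g. an SMS gateway """
    def send(self, customer_id, text):
        print("SMS to %s: %s" % (customer_id, text))

class CoreBanking(object):
    """ Inner core application: depends only on the port, not on any adapter """
    def __init__(self, notifier):
        self.notifier = notifier

    def withdraw(self, customer_id, amount):
        # ... business logic would go here ...
        self.notifier.send(customer_id, "You withdrew %s" % amount)

CoreBanking(SmsAdapter()).withdraw("CUST-1", 200)

Swapping SmsAdapter for, say, an email or message-queue adapter requires no change to the core, which is the point of the style.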

Though there are many architectural styles & principles, the above items have high relevance to micro services.

4 Design Constraints

The design constraints (non-functional requirements) are important decision makers while designing micro services.  The success of a system completely depends on Availability, Scalability, Performance, Usability and Flexibility.

4.1 Availability

The golden rule for availability says: anticipate failures and design accordingly, so that the system is available 99.999% of the time (“five nines”).  It means the system can be down for only about 5.26 minutes in an entire year (525,600 minutes × 0.001%).  The cluster model is used to support high availability, where groups of services run in Active-Active or Active-Standby mode.

So while designing micro services, it must be designed for appropriate clustering and high-availability model.  The basic properties of micro-services such as stateless, independent & full stack will help us to run multiple instances in parallel in active-active or active-standby mode.

4.2 Scalability

Micro services must be scalable both horizontally and vertically.  Being horizontally scalable, we can run multiple instances of a micro-service to increase the performance of the system.  The design of the micro services must support horizontal scaling (scale-out).

Micro-services should also be scalable vertically (scale-up).  If a micro-service hosted on a system with a medium configuration such as AWS EC2 t2.small (1 core, 2 GB memory) is moved to m4.10xlarge (40 cores & 160 GB memory), it should scale accordingly.  Similarly, downsizing the system capacity must also be possible.

4.3 Performance

Performance is measured by throughput and response time (e.g. 2500 TPS – transactions per second).  The performance requirements must be available at the beginning of the design phase itself. There are technology and design choices that will affect the performance.  They are:

  • Synchronous or Asynchronous communication
  • Blocking or Non-blocking APIs
  • RESTful API or RPC
  • XML or JSON (choice of data format)
  • SQL or NoSQL
  • HornetQ or RabbitMQ
  • MongoDB or Cassandra or CouchBase

So, appropriate technology and design decisions must be taken early, to avoid re-work at a later stage.

4.4 Usability

The usability aspect of the design focuses on hiding the internal design, architecture, technology and other complexities from the end user or other systems.  Most of the time, micro services expose APIs to the end user as well as to other micro-services.  So, the APIs must be designed in a normalized way, so that it is easy to achieve the required services with a minimal number of API calls.

4.5 Flexibility

Flexibility measures adaptability to change.  In the micro-services eco-system, where each micro-service is owned by a different team and developed with agile methodology, change happens faster than in other systems.  The micro-services may not inter-operate if they don’t adapt to or accommodate the change in other systems.  So, there must be a proper mechanism in place to avoid such scenarios, which could include publishing the APIs, documenting the functional changes and clear communication plans.

This briefly summarizes the important design constraints for micro-services.

5. New Problem Spaces

Though there are many positives with micro-services, they can create some new challenges.

5.1 Complete Functional Testing

End to end functional testing will be a great challenge in a micro-services environment, because we might need to deploy many micro-services to validate a single business function. Each micro-service might have its own way of installation and configuration.

5.2 Data Integrity across the eco-system

Micro services run independently and asynchronously, and they communicate with each other through proper protocols or APIs. This could result in momentary data integrity issues, or data going out of sync due to failures. So we might need additional services to monitor data integrity issues.

5.3 Increased Complexity

The complexity increases many fold when a single monolith is split into ten to twenty micro-services, and the introduction of load balancers, monitoring, logging and auditing servers into the eco-system increases the operational overhead.  Also, the competency needed to manage and deploy the micro-services becomes very critical, where the IT admins and DevOps engineers need to be aware of the plethora of technologies used by the independent agile development teams.

The articles “Microservices – Not a free lunch!” and “Service Disoriented Architecture” clearly warn us to be aware of the issues with micro services, though they greatly support and favour this architecture style.

6. Summary

The micro services architecture style offers many advantages; as we discussed, it is most suitable for cloud infrastructure, it speeds up deployment and recovery, and it minimizes the damage in case of failures.  This article consolidates the needed knowledge areas in design, architecture and design constraints for designing micro-services. Thank you.

7. References

[1] Domain Driven Design – Quickly http://www.infoq.com/minibooks/domain-driven-design-quickly

[2] MSDN Software Architecture – https://msdn.microsoft.com/en-us/library/ee658093.aspx

[3] Building Microservices – Sam Newman

[4] Hexagonal Architecture – http://alistair.cockburn.us/Hexagonal+architecture



Functional Programming in Python

Posted on Updated on

The idea of this blog is to understand functional programming concepts using Python.

1. Programming Paradigms

There are three programming paradigms: Imperative Programming, Functional Programming and Logic Programming.  Most programming languages support only the imperative style.  The imperative style has a direct relationship with machine language: features of imperative style like the assignment operator, conditional execution and loops are directly derived from machine language.  The procedural and object oriented programming languages such as C, C++ and Java are all imperative programming languages.

Logic programming is a completely different style; the program does not contain the solution to the problem, instead it is written in terms of facts and rules.  Prolog, ASP & Datalog are some logic programming languages.

2. Functional programming  

Wikipedia definition says  ” In computer science, functional programming is a programming paradigm—a style of building the structure and elements of computer programs—that treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data”.

Following are the characteristics of functional programming:

Immutability :

Functional languages support immutable objects by default; mutable objects must be declared explicitly and consciously.  In contrast, imperative style languages support mutable objects by default, and immutable objects must be declared explicitly from a different library.

Functions are first class citizens:

Functions are first class citizens and are handled like any other object.  It means a function can be stored in a variable, passed as an argument to other functions, and returned from a function.
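
A minimal sketch of all three points (the function names used here are only illustrative):

def shout(text):
    return text.upper()

def politely(func):
    """ takes a function as an argument and returns a new function """
    def wrapper(text):
        return func(text) + ", please"
    return wrapper

speak = shout                    # a function stored in a variable
print(speak("wait"))             # WAIT
print(politely(shout)("wait"))   # WAIT, please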

Lexical Scoping:

In functional programming, the scope of a function depends on the location of its definition.  If a function is defined inside another function, then its scope is only within the outer function; it cannot be referred to outside the outer function.

3. Type of Functions

3.1 Higher Order Functions: 

A function can take another function as an argument and may return a function.

def FindFormula(shape):
    if (shape == "Square"):
        return SquareArea
    if (shape == "Circle"):
        return CircleArea

def SquareArea(a):
    return (a * a)

def CircleArea(r):
    return (22/7*r * r)

def findShape(sides):
    if (sides == 4):
        return "Square"
    if (sides == 0 ):
        return "Circle"

if __name__ == "__main__":
    size = 5
    area = FindFormula(findShape(sides=4))(size)
    print("Area = ",area)

3.2 Anonymous Functions

A function without a name is called an anonymous function.  Generally these functions are defined inside other functions and called immediately.

area = lambda x : 22/7*x*x

Anonymous functions are created with the “lambda” keyword.  We can use this syntax wherever a function object is needed.
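
A typical place where a lambda is handy is as the key argument of sorted() or inside map(); a small sketch (the sample data is made up for illustration):

radii = [1, 2.5, 7]
areas = list(map(lambda r: 22.0/7 * r * r, radii))
print(areas)                                                      # [3.14..., 19.64..., 154.0]
print(sorted(["circle", "sq", "hexagon"], key=lambda s: len(s)))  # ['sq', 'circle', 'hexagon']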

3.3 Nested Functions

Functions can be defined within the scope of another function.  The inner function is only in scope inside the outer function.  It is useful when the inner function is being returned or when it is being passed into another function.

def oFunction(x,y):
    def SqArea(x,y):
        return (x*y)
    return SqArea(x,y)

area = oFunction(10,20)

In the above example, the function SqArea is strictly inside the scope of oFunction().

3.4 Currying

A simple definition by Cay S. Horstmann in his book “Scala for the Impatient”: currying is the process of turning a function that takes two arguments into a function that takes one argument. That function returns a function that consumes the second argument.

The generalized definition of currying: a function that takes multiple arguments is turned into a chain of functions, each taking one argument and returning the next function, until the last one returns the result.

def curried_pow(x):
    def h(y):
        return pow(x,y)
    return h

print curried_pow(2)(3)

3.5 Closures

A closure is a persistent scope which holds on to local variables even after the code execution has moved out of that block. The inner function remembers the state of the outer function even after the outer function has completed execution.

Generally this behavior is not possible in imperative programming styles, because a function is executed in a separate stack frame; once the function completes its execution the stack frame is freed up.  But in functional programming, as long as the inner function remembers the state of the outer function, that state is not freed up.

def sumA(a):
    def sumB(b):
        return a + b    # sumB remembers 'a' from the enclosing sumA
    return sumB

x = sumA(10)
y = x(20)
print (y)               # prints 30

Benefits of Closures:
Closures avoid the use of global variables and provide data hiding. In scenarios where the developer does not want to go for an object oriented program, closures can be handy to implement abstraction.

3.6 Tail Recursion

A recursive function is called tail recursive when the recursive call is the last statement in the function.  In languages that optimize tail calls, it is much more efficient than a regular recursive function because, since the recursive call is the last statement, there is nothing to be saved on the stack for the current function call.  Hence tail recursion is very light on memory utilization.

Regular Recursive Function for Factorial in Python:

def fact(n):
    if (n == 0):
        return 1
    return n * fact(n-1)

The same can be converted to tail recursion as below, by using an additional accumulator argument:

def fact(n):
    return tailFact(n, 1)

def tailFact(n, a):
    if (n == 0):
        return a
    return tailFact(n-1, n*a)

4 Benefits of Functional Programming

Clarity & Complexity : Functional programming supports writing modular, structured and hierarchical code.  The use of anonymous functions and local functions makes it easier to organize the code hierarchically.  Currying and closures reduce the complexity of the program.

Concurrency : Programming for multi-core architectures is a challenge, where a single piece of code can be executed by many threads in parallel.  So, issues of code re-entrancy and shared memory arise. In imperative style programming we must use synchronization techniques to avoid these problems, which leads to a performance impact.  The philosophies of functional programming suit the needs of multi-core programming by enforcing immutability.

Memory Efficient : Functional programming style is memory efficient.  Anonymous functions, nested functions and tail-recursion styles are very light on memory.  Because of lexical scoping, once a function is out of scope it can be removed from memory, and tail recursion is very efficient because it does not hold on to stack frames.

5 Summary

We have briefly discussed the functional programming paradigm & its benefits.  The functional programming style is most suitable for developing algorithms, analytics, data mining and machine learning algorithms.  Most modern programming languages support functional styles, and Java is catching up with this style in its latest version (Java 8).


6 References

[1] Haskell Wiki – https://wiki.haskell.org/Functional_programming

[2] Currying – https://mtomassoli.wordpress.com/2012/03/18/currying-in-python/

Why Scala ?

Posted on Updated on


  • Scala – means Scalable Language ( pronounced as scah-lah)
  • Scala builds on the Java platform and, in fact, internally uses many of the Java libraries.
  • Scala is recommended for algorithm development , big data processing & multi-threaded applications in a multi-core environment.
  • The Scala compiler, “scalac”, generates Java byte code, which can run on the JVM.
  • Scala is a statically typed language, which suits large projects.
  • Scala improves the productivity of the developers after the initial learning curve.

Language Features

1. JVM Language

Scala is a JVM language.  Scala generates Java byte code, which can run on the JVM.  Scala supports the use of existing Java libraries in Scala code; Java libraries & Scala libraries are inter-operable.

2. Functional Programming Language along with Object Oriented Programming

Scala supports the functional style along with OOP, unlike Java which is strictly OOP, or other functional programming languages which are primarily functional (Haskell, Clojure).

  • Immutable objects by default : Scala’s vals and collections are immutable by default.  If the programmer wants mutable objects, they must be chosen consciously (e.g. from scala.collection.mutable).
  • Higher order functions : In Scala a function is an object.  Scala supports anonymous functions, tail recursion, closures etc.

3. Concurrency Support in Multi-Core Environment

  • Actor – similar to a Java thread, but with advanced thread management and synchronous/asynchronous message passing across threads.  This communication mechanism avoids the need for thread synchronization.
  • No support for static classes/variables.  If similar behavior is needed, the programmer has to create a companion object for the class.  This feature avoids the need to protect critical regions that act on global/static memory.
  • Threads (actors) can be created, run forever and execute based on the messages they receive.
  • private[this] – a feature to make a private member perfectly private.  A private member of an object can normally be accessed by other objects of the same class; this can be restricted with private[this].

4. Removes boiler plate coding:

  • Object creation – simplified
  • Collection library – minimal set of methods to handle most scenarios

5. More Features & Less typing

  • Scala creates default code for commonly used behaviors
    • Automatically creates constructors for class
    • Class can have arguments, which can act as argument for default constructor
    • Automatically generates getter & setter functions for class members
  • Simple File Handling & IO
    • No need to worry about [File/Input/Output/Buffer] [Stream / Reader / Writer] (Eg. Input Stream Reader => Buffer Reader => ReadInt )
    • Simple functions for input and output statements ( Because Scala supports functions)
  • More language features
    • Easy to create Singleton – Object without a Class definition
    • No primitive data types – all data types are objects: Byte, Char, Int, Long, Double, Float and String.
  • Less typing – Each key stroke is valuable
    • No semicolon
    • No explicit return statement needed for functions (every Scala statement is an expression which returns a value)


After the initial learning curve, Scala improves the productivity of engineers.  Since Scala introduces some new syntax and new symbols, the initial learning curve is slightly bigger than for other modern languages.  At the same time, an engineer who is an experienced Java programmer will greatly appreciate the features of the Scala language.