Boto3 – Amazon S3 As Python Object Store

Use Amazon Simple Storage Service(S3) as an object store to manage Python data structures.

1.Introduction

Amazon S3 is extensively used as a file storage system to store and share files across the internet.  It can store any type of object and is a simple key-value store.  It can be used to store objects created in any programming language, such as Java, JavaScript or Python.  Amazon DynamoDB recommends using S3 to store items larger than 400 KB.  This article focuses on using S3 as an object store from Python.

2. Pre-requisites

Boto3 is the official AWS SDK for accessing AWS services using Python code.  Please ensure Boto3 and awscli are installed on the system.

$pip install boto3
$pip install awscli

Also, configure the AWS credentials using the “aws configure” command, or set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to store your keys in the environment.  Please DO NOT hard-code your AWS keys inside your Python program.

To configure the AWS credentials, first install awscli and then use the “aws configure” command to set them up.  For more details, refer to AWS CLI Setup [iii] and Boto3 Credentials [iv].

Configure the AWS credentials using command:

$aws configure

Do a quick check to ensure you can reach AWS.

$aws s3 ls

The above command should list the S3 buckets created in your AWS account.  The AWS account is selected based on the credentials configured.  In case multiple AWS accounts are configured, use the “--profile” option in the AWS CLI.  If you don’t mention the “--profile” option, the CLI uses the profile “default”.

Use the below commands to configure development profile named “dev” and validate the settings.

$aws configure --profile dev
$aws s3 ls --profile dev

The above command shows the S3 buckets present in the account that belongs to the “dev” profile.

3. Connecting to S3

3.1 Connecting to Default Account (Profile)

The client() API connects to the specified service in AWS.  The below code snippet connects to S3 using the default profile credentials and lists all the S3 buckets.

import boto3

s3 = boto3.client('s3')
buckets = s3.list_buckets()
for bucket in buckets['Buckets']:
    print(bucket['CreationDate'].ctime(), bucket['Name'])

3.2 Connecting to Specific Account (Profile)

To connect to a specific account, first create a session using the Session() API.  The Session() API allows you to specify the profile name and region.  It also allows you to specify the AWS credentials.

The below code snippet connects to an AWS account configured using “dev” profile and lists all the S3 buckets.

import boto3

session = boto3.Session(profile_name="dev", region_name="us-west-2")
s3 = session.client('s3')
buckets = s3.list_buckets()
for bucket in buckets['Buckets']:
    print(bucket['CreationDate'].ctime(), bucket['Name'])

4. Storing and Retrieving a Python LIST

Boto3 supports the put_object() and get_object() APIs to store and retrieve objects in S3.  But the objects must be serialized before storing.  The Python pickle library supports serialization and deserialization of objects, and is available by default in the Python installation.

The pickle.dumps() and pickle.loads() APIs are used to serialize and deserialize Python objects.

4.1 Storing a List in S3 Bucket

Ensure the Python object is serialized before writing it into the S3 bucket.  The list object must be stored using a unique “key”.  If the key is already present, the existing object will be overwritten.

import boto3
import pickle

s3 = boto3.client('s3')
myList=[1,2,3,4,5]

#Serialize the object 
serializedListObject = pickle.dumps(myList)

#Write to Bucket named 'mytestbucket' and 
#Store the list using key myList001

s3.put_object(Bucket='mytestbucket',Key='myList001',Body=serializedListObject)

The put_object() API may raise a “NoSuchBucket” error if the bucket does not exist in your account.

NOTE:  Please change the bucket name to your own S3 bucket name.  I do not own this bucket.
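
If you prefer to handle that case in code rather than let the call fail, one option is to catch botocore’s ClientError and inspect the error code.  A minimal sketch, assuming the same placeholder bucket and key names as above:

import boto3
import pickle
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
serializedListObject = pickle.dumps([1, 2, 3, 4, 5])

try:
    s3.put_object(Bucket='mytestbucket', Key='myList001', Body=serializedListObject)
except ClientError as e:
    #The error code explains why the call failed, e.g. 'NoSuchBucket' or 'AccessDenied'
    if e.response['Error']['Code'] == 'NoSuchBucket':
        print('Bucket does not exist - create it or fix the bucket name')
    else:
        raise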

4.2 Retrieving a List from S3 Bucket

The list is stored as a stream object inside the Body field of the get_object() response.  It can be read using the read() API of that stream.  The call can throw a “NoSuchKey” exception if the key is not present.

import boto3
import pickle

#Connect to S3
s3 = boto3.client('s3')

#Read the object stored in key 'myList001'
object = s3.get_object(Bucket='mytestbucket',Key='myList001')
serializedObject = object['Body'].read()

#Deserialize the retrieved object
myList = pickle.loads(serializedObject)

print(myList)

5 Storing and Retrieving a Python Dictionary

Python dictionary objects can be stored and retrieved in the same way using put_object() and get_object() APIs.

5.1 Storing a Python Dictionary Object in S3

import boto3
import pickle


#Connect to S3 default profile
s3 = boto3.client('s3')

myData = {'firstName':'Saravanan','lastName':'Subramanian','title':'Manager', 'empId':'007'}
#Serialize the object
serializedMyData = pickle.dumps(myData)

#Write to S3 using unique key - EmpId007
s3.put_object(Bucket='mytestbucket',Key='EmpId007',Body=serializedMyData)

5.2 Retrieving Python Dictionary Object from S3 Bucket

Use the get_object() API to read the object.  The data is stored as a stream inside the Body object and can be read using the read() API.

import boto3
import pickle

s3 = boto3.client('s3')

object = s3.get_object(Bucket='mytestbucket',Key='EmpId007')
serializedObject = object['Body'].read()

myData = pickle.loads(serializedObject)

print(myData)

6 Working with JSON

When working with a Python dictionary, it is recommended to store it as JSON if the consumer applications are not written in Python or do not support the pickle library.

The json.dumps() API converts a Python dictionary into a JSON string, and json.loads() converts a JSON string into a Python dictionary.

6.1 Storing a Python Dictionary Object As JSON in S3 bucket

import boto3
import json

s3 = boto3.client('s3')

myData = {'firstName':'Saravanan','lastName':'Subramanian','title':'Manager', 'empId':'007'}
serializedMyData = json.dumps(myData)

s3.put_object(Bucket='mytestbucket',Key='EmpId007',Body=serializedMyData)

6.2 Retrieving a JSON from S3 bucket

import boto3
import json

s3 = boto3.client('s3')
object = s3.get_object(Bucket='mytestbucket',Key='EmpId007')
serializedObject = object['Body'].read()

myData = json.loads(serializedObject)

print(myData)

7 Upload and Download a Text File

Boto3 supports the upload_file() and download_file() APIs to transfer files between your local file system and S3.  As per S3 conventions, if the key contains a “/” (forward slash), the parts before the slash are treated as sub-folders.

7.1 Uploading a File

import boto3

s3 = boto3.client('s3')
s3.upload_file(Bucket='mytestbucket', Key='subdir/abc.txt', Filename='./abc.txt')

7.2 Downloading a File from S3 Bucket

import boto3

s3 = boto3.client('s3')
s3.download_file(Bucket='mytestbucket',Key='subdir/abc.txt',Filename='./abc.txt')

8 Error Handling

The Boto3 APIs can raise various exceptions depending on the condition; “DataNotFoundError”, “NoSuchKey”, “HttpClientError”, “ConnectionError” and “SSLError” are a few of them.  The Boto3 exceptions inherit from the Python “Exception” class, so they can be handled with standard exception handling in the code.

import boto3

try:
    s3 = boto3.client('s3')
except Exception as e:
    print("Exception ", e)

 

9.Summary

Storing Python objects in an external store has many use cases.  For example, a game developer can store the intermediate state of objects and fetch them when the gamer resumes from where they left off, and an API developer can use the S3 object store as a simple key-value store, to mention a few.  Please refer to the URLs in the References section to learn more.  Thanks.

References

[i] Boto3 – https://boto3.amazonaws.com/v1/documentation/api/latest/index.html

[ii] Boto3 S3 API – https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html

[iii] AWS CLI – https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html

[iv] AWS Boto3 Credentials – https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html

[v] Python Pickle Library – https://docs.python.org/3/library/pickle.html

[vi] Boto3 Exceptions – https://github.com/boto/botocore/blob/develop/botocore/exceptions.py


Six Thinking Hats on Angular Vs React

The idea of this article is to compare and contrast Angular and React, so that we can find the suitable one for our needs.

1.Introduction

There is so much information on the internet that we are sure to get into analysis paralysis when trying to decide between Angular and React for the next web application.  So, I thought of applying the “Six Thinking Hats” methodology to organize my thoughts and classify the information and data points.

2.Six Thinking Hats – Decision Making Process

For each hat, the observations for Angular and React are listed separately below.

WHITE – The Facts, Just Facts

Angular:

  • Angular is a complete end-to-end framework
  • It supports MVC
  • TypeScript
  • Two-way binding by default
  • Asynchronous processing
  • Developed and maintained by Google
  • Focus on Single Page Applications
  • Command line support for development tools

React:

  • Is a view library
  • Virtual DOM technology
  • JSX – supports pure JavaScript coding for HTML/CSS
  • One-way binding by default
  • State management using Redux
  • Developed and maintained by Facebook
  • Supports Multi Page Applications
  • Command line support for development tools

YELLOW – Values and Benefits

Angular:

  • Enables Java developers to quickly develop web applications
  • Object-oriented programming style
  • Dependency injection

React:

  • Very small, because it is just the view component
  • JSX abstracts HTML and CSS, so it is all JavaScript code
  • Functional programming style
  • Virtual DOM

BLACK – Difficulties and Dangers

Angular:

  • Initial learning curve of TypeScript for non-JavaScript developers
  • More hierarchy and structure, which might look complex to some JavaScript developers
  • Slightly heavier application size when compared to React

React:

  • Fear of JavaScript fatigue
  • Scaling the application with more and more functionality
  • JSX limitations (why not direct HTML/CSS)
  • Everything is a component; no Controller/Service
  • JavaScript library life cycle management

RED – Emotions and Feelings

Angular:

  • Happy that it supports OOP
  • Code structure is enforced with the MVC pattern
  • An everything-under-the-hood solution

React:

  • Worried about finding a new JavaScript library every time we need a new function/technology (JavaScript fatigue)
  • Worried that code may become unorganized over a period of time, since the code structure must be enforced by the maintainers and developers

GREEN – New Ideas and Possibilities

Angular:

  • TypeScript quickly enables Java developers to become web developers
  • Highly opinionated and enforces code structure
  • Different components for Models, Views, Services and Routers
  • No need for additional / external libraries
  • Quick to develop an application

React:

  • Suitable for a small team of JavaScript experts focused on Web UI development
  • Good for teams that have already worked on other JavaScript libraries such as Ember, Backbone or Dojo

BLUE – Thinking about Thinking

Angular is suitable for:

  • Large and complex applications
  • Full stack developers with Java/C# knowledge
  • Developing clean and structured code

React is suitable for:

  • Large web applications which are a collection of many small applications
  • Teams with experience in JavaScript who are ready to build everything on their own
  • Teams focusing only on Web UI development

3. Summary

As discussed, my decision would be based on the team’s competency, willingness to explore new technology, the nature of the application and project timelines.  The points discussed under the “Red Hat” may not be acceptable to everyone, but individuals’ emotions and feelings might affect the final decision.  Overall, I feel both Angular and React are capable and mature technologies, each with its own unique way of building web apps.


 

Naming Conventions from Uncle Bob’s Clean Code Philosophy

This article is the result of reading the book “Clean Code” by Robert C. Martin.  All the recommendations made here are suggested by Robert C. Martin.  This article covers the abstract of the naming conventions suggested by Clean Code.

1. Introduction

The clean code philosophy seems to have emerged from the fundamentals of Software Craftsmanship.  According to many software experts who have signed the Manifesto for Software Craftsmanship ( http://manifesto.softwarecraftsmanship.org/ ), writing well-crafted and self-explanatory software is almost as important as writing working software.

Following are snippets from the book that show why naming is hard and disputable, and the mind-set needed to address this problem.

The hardest thing about choosing good names is that it requires good descriptive skills and a shared cultural background.  This is a teaching issue rather than a technical, business or management issue.

Focus on whether the code reads like paragraphs and sentences, or at least tables and data structure.

2.Naming Conventions

2.1 Use Intention-Revealing Names

Basically, don’t give a variable, class or method a name that needs some explanation in comments.

For example

int d; // elapsed time in days  

int s; // elapsed time in seconds

 

Can be named as :

int elapsedTimeInDays;  

int daysSinceCreation; 

int daysSinceModification;  

int fileAgeInDays;

 

2.2 Avoid Disinformation & Encodings

Referring to a group of accounts as “accountList” is disinformation when it is not actually a List.  Instead, name it “accountGroup”, “accounts” or “bunchOfAccounts”.

Avoid unnecessary encodings of datatypes along with variable name.

String nameString;

Float salaryFloat;

The encoding may not be necessary, since it is very well known that in an employee context the name is going to be a sequence of characters.  The same goes for salary, which is a decimal/float.

 

2.3 Make Meaningful Distinctions

If there are two different things in the same scope,  you might be tempted to change one name in an arbitrary way.

What is stored in cust, which is different from customer ?

String cust;  

String customer;

How are these classes different?

class ProductInfo  

class ProductData

 

How do you differentiate these methods in the same class ?

void getActiveAccount()

void getActiveAccounts()

void getActiveAccountsInfo()

So, the suggestion here is to make meaningful distinctions.

2.4 Use Pronounceable Names

Ensure that names are pronounceable.  Nobody wants tongue twisters.

This

Date genymdhms;

Date modymdhms;

Int pszqint;

vs

Date generationTimeStamp;

Date modificationTimestamp;

Int publicAccessCount;

 

2.5 Use Searchable Names

2.5.1 Avoid single-letter names and numeric constants if they are not easy to locate across the body of the text.

grep "MAX_STUDENTS_PER_TEACHER"  *.java

vs

grep 15 *.java

2.5.2 Use single-letter names only as local variables in short methods.  One important suggestion here:

The length of a name should correspond to the size of its scope.

2.6.Don’t be cute / Don’t use offensive words

2.6.1

holyHandGrenade()

vs

deleteAllItems()

2.6.2

whack()

vs

kill()

2.6.3

eatMyShorts()

vs

abort()

 

2.7 Pick One Word Per Concept

Use the same word for the same concept across the code base.

2.7.1 FetchValue() Vs GetValue() Vs RetrieveValue()

If you are using fetchValue() to return the value of something, use the same concept everywhere; mixing getValue() and retrieveValue() into the code instead of consistently using fetchValue() will confuse the reader.

2.7.2 dataFetcher() vs dataGetter() vs dataRetriever()

If all three methods do the same thing, don’t mix and match across the code base.  Instead, stick to one.

2.7.3 FeatureManager vs FeatureController

Stick to either Manager or Controller.  Again, be consistent across the code base for the same thing.

2.8 Don’t Pun

This is exactly the opposite of the previous rule: avoid using the same word for two different purposes.  Using the same term for two different ideas is essentially a pun.

For example, you might have two classes with the same add() method.  In one class it performs addition of two values, in the other it inserts an element into a list.  So choose wisely; in the second class use insert() or append() instead of add().

2.9 Solution Domain Names vs Problem Domain Names

The clean code philosophy suggests using solution domain names, such as names of algorithms and design patterns, whenever needed, and using problem domain names as the second choice.

For example, accountVisitor means a lot to a programmer who is familiar with the Visitor design pattern.  A “JobQueue” definitely makes more sense to a programmer.

2.10 Add Meaningful Context, as a Last Resort

There are a few names which are meaningful in themselves; most are not.  Instead, you have to place names in context for your reader by enclosing them in well-named classes, functions or namespaces.  If that is not possible, prefixing the names may be necessary as a last resort.

For example, the member variable named “state” inside an Address class shows what it contains.  But the same “state” may be misleading if used to store the name of a state without context.  So, in such scenarios, name it something like “addressState” or “stateName”.

The “state” may make more sense in a context where you are coding about addresses, when named “addrState”.

3. Summary

Robert C Martin says, “One difference between a smart programmer and a professional programmer is that the professional understands that clarity is king.  Professionals use their powers for good and write code that others can understand“.  I hope this article helps you to be more professional and write clean code; for further reading refer to [3].

4. References

[1] https://cleancoders.com/

[2] https://blog.cleancoder.com/

[3] https://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882/ref=sr_1_1?ie=UTF8&qid=1520625319&sr=8-1&keywords=clean+coding

[4] http://manifesto.softwarecraftsmanship.org/#/en

A Appendix

A.1 Manifesto for Software Craftsmanship

As aspiring Software Craftsmen we are raising the bar of professional software development by practicing it and helping others learn the craft. Through this work we have come to value:

Not only working software,
but also well-crafted software
Not only responding to change,
but also steadily adding value
Not only individuals and interactions,
but also a community of professionals
Not only customer collaboration,
but also productive partnerships

 

That is, in pursuit of the items on the left we have found the items on the right to be indispensable.

Docker Commands: From Development to Deployment

The objective of this article is to understand the end-to-end flow from container development to deployment in the target environment, and to list the Docker commands needed for every action.

1 Introduction

The overall process consists of i) developing a container image with your code, dependent software and configurations, ii) running and testing the container in the development environment, iii) publishing the container image to Docker Hub and finally iv) deploying the docker image and running the container in the target environment.  This article assumes that you have installed the Docker engine in the development and target environments.  Please refer to 7.3 for installation instructions.

2 Develop Container Image

To build the container image, we have to create a Dockerfile which contains all the necessary information.  Please refer to https://nodejs.org/en/docs/guides/nodejs-docker-webapp/ to develop the Dockerfile.

2.1 Build docker container

$docker build -t containername .

The command uses the Dockerfile present in your current directory.  If the Dockerfile has a different name or is in a different location, we can use the -f flag to specify the Dockerfile name.  The “docker build” command builds the container image with the name specified by the “-t” flag.

$docker build -t myapp .


2.2 Docker Image Naming Convention

We can give the docker image any name when we run it locally.  It could be as simple as “myapp” as shown above.  But if you want to publish the image to Docker Hub, there is a specific naming convention to be followed.  This convention helps the docker tools publish the container image into the right namespace and repository.

The format:

NameSpace/Repository:Version

So, I built the docker image using the above convention:

$docker build -t saravasu/techietweak:001 .

We can also use the “docker tag” command to create an image from an existing image.  The “docker tag” command is explained below.

 

2.3 List All Images in the Docker

$docker images


3 Run the container

3.1 Start the Docker Container

Use the “docker run” command to start the docker container.

$docker run -d -p 8080:8080 saravasu/techietweak:001


The “-d” option runs the container in detached mode, so that the container continues to run even if the terminal is closed.

The “-p” option is used to map the ports.  In this example, “-p 8080:8080”, the first port number is the port used on the docker host and the second port number is the port used by the docker container.  As per this command, all the traffic that comes to the docker host port will be forwarded to the docker container port.

3.2 Check current running containers

$docker ps

 


From the above output, we can see that the docker container is running with the name “trusting_snyder”.

To list all the containers irrespective of their state, use the “-a” switch.

$docker ps -a

3.3 Show the console logs of the running container

$docker logs <containerName>


The container name can be found in the output of the “docker ps” command.

3.4 Login to the container

$docker exec -it containerId /bin/bash

The above command will drop you into the “bash” shell of the container.


3.5  Stop the running container

$docker stop <containername>


3.6  Remove the container image from the docker

$docker rmi imageId

Note: Find the imageId of the image using the command “docker images” or “docker images -a”.

$docker rmi -f <List Of Image Ids>


The above command will forcefully delete the given image.

3.7  Clean up your docker / Delete All container images in the local docker

$docker rmi -f $(docker images | tr -s ' ' ' ' | cut -d' ' -f3)

4 Publish The Container Image

The docker container images can be published to your local dockyard (private registry) or the public Docker Hub.  The process and commands are the same for both.  To publish your docker image to Docker Hub, first create your namespace and repository at http://hub.docker.com.

I have used my namespace “saravasu” and the repository “techietweak” for this exercise.


4.1 Login To Docker Hub

$docker login

If you want to log in to your local repository, provide the URL.  If the URL is not specified, this command will log in to Docker Hub.

$docker login http://localhost:8080


 

4.2 Tag The Container Image

To push the docker container image into docker hub, it must be tagged in a specific format.

The format is <Namespace>/<Repository>:<Version>.  If you don’t specify the version, it is taken as “latest”.  In the below command, I am tagging the image “myapp” as “saravasu/techietweak:001”, where “saravasu” is my namespace (login id), techietweak is the repository and 001 is the version.

$docker tag myapp:latest saravasu/techietweak:001

4.3 Push the docker image into docker hub

$docker push saravasu/techietweak:001


4.4 Check the Container Images in Docker Hub

Now log in to your Docker Hub account and check for the image in the respective repository.


5 Deploy The Container

5.1 Pull The Docker Container Image

Log in to Docker Hub from the host machine in the target environment and pull the container image from Docker Hub.  If you want to pull it from your private dockyard, use the command “$docker login <hostname>” to specify the hostname of the private dockyard.

$docker login

The above command will login to https://hub.docker.com, since the host name is not specified.

$docker pull saravasu/techietweak:001


5.2 Check the image

The docker pull command downloads the container image from Docker Hub.  We can validate this by using the “docker images” command.

$docker images


5.3 Run the container

Now we can run the docker container in the same way we ran it in the development environment, and test it in the same way we did before.

$docker run -d -p 8080:8080 saravasu/techietweak:001


 

The docker run command starts the container.  To validate this, we can use the “docker ps” command.  Docker has created a new container and it is running with the name “naughty_lewin”.

As we see above, the docker engine gives a random name to the running container.  But this could be a problem in automation, so it is always good to specify a name we want to refer to.  This can be achieved using the “--name” parameter.

$docker run -d -p 8080:8080 --name "myNodeJsWebContainer" saravasu/techietweak:001


6 Summary

This article captures the overall flow and the commands necessary to develop a container image, run it in the local environment, publish the image to Docker Hub and run the container in the target environment.  For further study, detailed documentation is available on the Docker website [Ref 7.1].

7 References

7.1 Dockerfile Reference https://docs.docker.com/engine/reference/builder/

7.2 Dockerize Node.js Web App https://nodejs.org/en/docs/guides/nodejs-docker-webapp/

7.3 Docker Installation : https://docs.docker.com/engine/installation/

 

Features of Apache Big Data Streaming Frameworks

The idea of this article is to explain the general features of Apache big data stream processing frameworks, and to provide a crisp comparative analysis of Apache’s big data streaming frameworks against these generic features, so that it is easier to select the right framework for application development.

1. Introduction

In the big data world, there are many tools and frameworks available to process large volumes of data in offline or batch mode.  But the need for real-time processing, to analyze data arriving at high velocity on the fly and provide analytics or enrichment services, is also high.  Over the last couple of years this has been an ever-changing landscape, with many new streaming framework entrants, so choosing the real-time processing engine becomes a challenge.

2. Design

The real-time streaming engines interact with stream or messaging frameworks such as Apache Kafka, RabbitMQ or Apache Flume to receive the data in real time.

They process the data inside a cluster computing engine, which typically runs on top of a cluster manager such as Apache YARN, Apache Mesos or Apache Tez.

The processed data is sent back to message queues (Apache Kafka, RabbitMQ, Flume) or written into storage such as HDFS or NFS.


 

3. Characteristics of Real Time Stream Process Engines

3.1 Programming Models

There are two types of programming models present in real-time streaming frameworks.

3.1.1 Compositional

This approach provides basic components, using which the streaming application can be created.  For example, in Apache Storm, the spout is used to connect to different sources and receive the data, and bolts are used to process the received data.

3.1.2. Declarative

This is more of a functional programming approach, where the framework allows us to define higher-order functions.  These declarative APIs provide more advanced operations like windowing or state management, and they are considered more flexible.

3.2  Message Delivery Guarantee

There are three message delivery guarantee mechanisms: at most once, at least once and exactly once.

3.2.1 At most once

This is a best-effort delivery mechanism.  The message is delivered at most one time, so it may be lost, but it will not be delivered twice; duplicate events are not a concern, while message loss is possible.

3.2.2 At least once

This mechanism ensures that the message is delivered at least once.  But in the process of delivering at least once, the framework might deliver the message more than once, so duplicate messages might be received and processed.  This can result in unnecessary complications where the processing logic is not idempotent.

3.2.3 Exactly once

The framework ensures that the message is delivered and processed exactly once.  The message delivery is guaranteed and there won’t be any duplicate messages.  So, the “exactly once” delivery guarantee is considered the best of all.

3.3 State Management

State management defines the way events are accumulated inside the framework before it actually processes the data.  This is a critical factor when deciding on a framework for real-time analytics.

3.3.1 Stateless processing

Frameworks which process the incoming events independently, without knowledge of any previous events, are considered stateless.  Data enrichment and data processing applications might need only this kind of processing.

3.3.2 Stateful Processing

Stream processing frameworks can make use of previous events to process the incoming events, by storing them in a cache or in external databases.  Real-time analytics applications need stateful processing, so that the framework can collect data for a specific interval and process it before it recommends any suggestions to the user.

3.4 Processing Modes

Processing mode defines how the incoming data is processed.  There are three processing modes: event, micro-batch and batch.

3.4.1 . Event Mode

Each and every incoming message is processed independently. It may or may not maintain the state information.

3.4.2 Micro Batch

The incoming events are accumulated for a specific time window and the collected events are processed together as a batch.
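
As an illustration of micro-batching, the following is a minimal sketch using the PySpark DStream API, where the 5-second batch interval defines the micro-batch window (the host, port and application name are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MicroBatchWordCount")
ssc = StreamingContext(sc, 5)   #Collect incoming events into 5-second micro-batches

#Each micro-batch of lines read from the socket is processed as one small batch job
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()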

3.4.3 Batch

The incoming events are processed like a bounded stream of inputs.  This allows processing of a large, finite set of incoming events.

3.5 Cluster Manager

Real-time processing frameworks running in a cluster computing environment might need a cluster manager.  Support for a cluster manager is critical to meet the scalability and performance requirements of the application.  The frameworks might run in standalone mode, on their own cluster manager, or on Apache YARN, Apache Mesos or Apache Tez.

3.5.1 Standalone Mode

Support for running in standalone mode is useful during the development phase, where developers can run the code in their development environment and do not need to deploy it into a large cluster computing environment.

3.5.2 Proprietary Cluster Manager

Some real-time processing frameworks support their own cluster managers; for example, Apache Spark has its own standalone cluster manager, which is bundled with the software.  This reduces the overhead of installing, configuring and maintaining other cluster managers such as Apache YARN or Apache Mesos.

3.5.3 Support for Industry Standard Cluster Managers

If you already have a big data environment and want to leverage the cluster for real-time processing, then support for the existing cluster manager is very critical.  The real-time stream processing framework must support Apache YARN, Apache Mesos or Apache Tez.

3.6 Fault Tolerance

Most of the big data frameworks follow a master-slave architecture.  Basically, the master is responsible for running the job on the cluster and monitoring the clients in the cluster.  So, the framework must handle failures at the master node as well as failures in the client nodes.  Some frameworks might need external tools like monit or supervisord to monitor the master node.  For example, Apache Spark Streaming has its own monitoring process for the master (driver) node: if the master node fails, it is automatically restarted, and if a client node fails, the master takes care of restarting it.  But in Apache Storm, the master has to be monitored using monit.

3.7 External Connectors

The framework must support seamless connections to external data generation sources such as Twitter feeds, Kafka, RabbitMQ, Flume, RSS feeds, Hekad, etc.  The frameworks must provide standard in-built connectors as well as provision to extend the connectors to connect to various streaming data sources.

3.7.1 Social Media Connectors – Twitter / RSS Feeds

3.7.2 Message Queue Connectors -Kafka / RabbitMQ

3.7.3 Network Port Connectors – TCP/UDP Ports

3.7.4 Custom Connectors – Support to develop customized connectors to read from custom applications.

3.8 Programming Language Support

Most of these frameworks support JVM languages, especially Java and Scala.  Some also support Python.  The selection of the framework might depend on the language of choice.

3.9 Reference Data Storage & Access

Real-time processing engines might need to refer to some databases to enrich or aggregate the given data.  So, the framework must provide features for integration with, and efficient access to, the reference data.  Some frameworks provide ways to internally cache the reference data in memory (e.g. the Apache Spark broadcast variable).  Apache Samza and Apache Flink support storing the reference data internally in each cluster node, so that jobs can access it locally without connecting to the database over the network.
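
For example, here is a minimal PySpark sketch of caching reference data with a broadcast variable, so that each node enriches events locally instead of querying an external database (the lookup table and events are made up for illustration):

from pyspark import SparkContext

sc = SparkContext("local[2]", "ReferenceDataEnrichment")

#Reference data is shipped once to every worker node and cached there
shopNames = sc.broadcast({"100": "Chennai Store", "200": "Bangalore Store"})

events = sc.parallelize([("100", 5488), ("200", 7210), ("100", 6120)])
enriched = events.map(lambda e: (shopNames.value[e[0]], e[1]))
print(enriched.collect())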

Following are the various methods available in the big data streaming frameworks:

3.9.1 In-memory cache: Allows reference data to be stored inside the cluster nodes, which improves performance by reducing the delay of connecting to external databases.

3.9.2 Per-client database storage: Allows data to be stored in third-party database systems like MySQL, SQLite, MongoDB etc. inside the streaming cluster.  Also provides API support to connect to and retrieve data from those databases, along with efficient database connection methodologies.

3.9.3 Remote DBMS connection: These systems support connecting to external databases outside the streaming cluster.  This is considered less efficient due to the higher latency and the bottlenecks introduced by network communication.

3.10 Latency and throughput

Though hardware configuration plays a major role in latency and throughput, some design factors of the frameworks affect performance.  The factors are: network I/O, efficient use of memory, reduced disk access and in-memory caching of reference data.  For example, the Apache Kafka Streams API provides higher throughput and lower latency due to reduced network I/O, since the messaging framework and the computing engine are in the same cluster.  Similarly, Apache Spark uses memory to cache the data, thereby reducing disk access, which results in lower latency and higher throughput.

4. Feature Comparison Table

The following table provides a comparison of the Apache streaming frameworks against the features discussed above.

(Feature comparison table of the Apache streaming frameworks)

The above frameworks support both stateful and stateless processing modes.

5. Conclusion

This article summarizes the various features of the streaming frameworks, which are critical selection criteria for a new streaming application.  Every application is unique and has its own specific functional and non-functional requirements, so the right framework depends completely on those requirements.


 

 

R Data Frame – The One Data Type for Predictive Analytics

The idea of this article is to introduce the R language’s high level data structure named data frame and its usage in the context of programming with predictive machine learning algorithms.

1.Introduction

The R data frame is a high level data structure which is equivalent to a table in database systems.  It is highly useful to work with machine learning algorithms and it is very flexible and easy to use.

The standard definition says data frames are “tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R‘s modeling software.”

2. CRUD Operations

2.1 Create

2.1.1 Creating a new Data Frame

The data.frame() API creates a new data frame.  This creates an empty data frame object, which can later be updated with rows and columns.

dataFrame <- data.frame()

2.1.2 Creating a DataFrame from a CSV file

This is the most standard way of creating a data frame while working with machine learning problems.  The read.csv() API creates a new data frame and loads it with the contents of the CSV file.  The columns are named from the first row of the CSV file.

dataFrame <- read.csv("C:\\DATA\\training.csv")

2.1.3 Creating a CSV file from Data Frame

The write.csv() API is used to persist the contents of the data frame into a CSV file.  In a machine learning problem, once we perform the prediction we want to persist the results into storage, so write.csv() is the answer to that requirement.

write.csv(DataFrame,"C:\\Data\\NewTraining.csv")

2.1.4 Creating a CSV file from Data Frame without heading

By default write.csv() adds a row id column to the output file.  If you don’t want that, set row.names to FALSE.

write.csv(DataFrame,"C:\\Data\\NewTraining.csv",row.names = FALSE)

2.1.5 Create a vector from a Data Frame

Basically each and every column of the data frame is a vector.  We can create a new vector from a column of a data frame.

newVector <- dataFrame$ColumnName

2.2 Read

2.2.1 The str() API

The str() API displays the structure of the data frame: the number of observations and variables, and the type and first few values of each column.

str(dataFrame)

2.2.2 Filter the DataFrame Based on the Value set for column

In the following example, we select only the records where the ShopId is 100.  We can also use other relational operators and combine them with logical operators.

newFrame <- dataFrame[dataFrame$ShopId == '100',]

2.2.3 Sort the DataFrame on a column

The below code snippet sorts the dataFrame based on the Date column in the dataFrame and stores the result in a new data frame object.

newFrame <- dataFrame[order(dataFrame$Date),]

2.3 Update

2.3.1 Adding Rows

The rbind() API appends a new row to an existing data frame.  This API will throw an error if the row does not have the same columns as the data frame.  The newRow must be of the vector() data type.

NewDataFrame <- rbind(dataFrame,newRow)

2.3.2 Adding Columns

The cbind() API appends a new column to an existing data frame.  This API will throw an error if the newColumn does not have the same number of rows as the data frame.  The newColumn must be of the vector() data type.

newDataFrame <- cbind(dataFrame,newColumn)

2.3.3 Dynamically Adding Column

Following is an alternate way to add a new column to an existing data frame using a vector.

dataFrame$newColumn <- newVector

The newColumn will be appended to the dataFrame, with its values populated from newVector, a vector whose length matches the number of rows in the data frame.

2.4 Delete

The standard rm() API is used to delete the data frame.

rm(dataFrame)

3. Prediction Models with DataFrame

We will take a simple sales prediction problem, where a shop wants to predict the expected sales based on the past history of 10 days.  Most R prediction scripts take a general form: 1) load the training data, 2) load the test data, 3) build the model with the training data, 4) predict on the test data with the model and finally 5) write the predicted values into storage.

3.1 Training Data

The prediction model is built with the training data: the machine learning algorithm is run on the training data to build the model, so that we can perform predictions using this model later.

We have the training data in a CSV file with the following columns: Day – the number of the day, Customers – the total number of customers who visited the shop, Promo – whether the shop ran a promotion on that day, Holiday – whether it is a holiday in that state, and Sales – the amount of sales on that day.

For example, the first row says that on day 1, 55 customers visited the shop and the sales figure was 5488.  It was a regular weekday and there was no promotion or holiday on that day.

Sales Data (Training Data)

 

3.2 Test Data

The test data is also a CSV file, for which we have to predict the sales.  It contains a similar set of columns, except Sales.  For example, we have to predict the sales for day 1, where 53 customers visited and it is a regular working day without a promotion or a holiday.

Test Data

3.3 Linear Regression Model

#Create a dataframe using the sales.txt to train the model
trainingData <- read.csv("C:\\Data\\Sales.txt")
#Create a dataframe to using Test.txt for which the predictions to be computed
testData <- read.csv("C:\\Data\\Test.txt")

#Build the linear regression model using the glm() function
#This model predicts Sales using Day, Customers, Promo & Holiday

Model <- glm(Sales ~ Day + Customers + Promo + Holiday, data = trainingData)

#Apply the model in predict() api with the test data
#It returns the predicted sales in a vector

predictSales <- predict(Model,testData)

#Round off the predicted sales values 
predictSales <- round(predictSales,digits=2)

#Create a new column in testData framework with predicted Sales values
testData$Sales <- predictSales

#Write the result into a file.
write.csv(testData, file = "C:\\Result.txt",row.names=FALSE)

3.4 Predicted Results

Now we get the results in a format similar to our Sales.txt.  It contains all the fields of Test.txt plus the respective Sales values predicted by the algorithm.

Linear Regression Model Results

3.5 Random Forest Model

To run the Random Forest algorithm, we should have the package installed in R.  We can execute the below command to install the package.

install.packages("randomForest")

Below is the R script to predict the sales using the Random Forest algorithm:

#Load the randomForest library
library(randomForest)

#Create a dataframe for Sales data to train the model
trgData <- read.csv("C:\\Data\\Sales.txt")

#Create a dataframe with the test data, for which the sales to be predicted
testData <- read.csv("C:\\Data\\Test.txt")

#Build the model using Random Forest algorithm
#This model predicts the sales using Day, Customers, Promo and Holiday
Model <- randomForest(Sales ~ Day + Customers + Promo + Holiday, data = trgData)

#Predict the sales using the above model, for the test data
#predictSales is a vector which contains the predicted values
predictSales <- predict(Model,testData)

#Round off the sales numbers
predictSales <- round(predictSales,digits=2)

#Add additional column to the test data and copy the predicted sales values
testData$Sales <- predictSales

#Create a output file using the computed values.
write.csv(testData, file = "C:\\Data\\Result.txt",row.names=FALSE)

3.6 Predicted Output (Random Forest)

Now the results for Test.txt include the predicted Sales value for each and every day.  For example, take day 5, where 79 customers visited the shop, it was a holiday and the shop was running a promotion: the algorithm predicted the expected sales as 7889.6 for that day.

Random Forest Model Results

4.Summary

I have briefly explained the core concepts of the R data frame in the context of machine learning algorithms, based on my experience in Kaggle competitions.  This article gives a quick heads-up on data analytics.  To learn further and gain expertise, I would suggest reading the books R Cookbook [2] and Machine Learning with R [3].  Hope this helps.

5. References

[1] R Data Frame Documentation – https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html

[2] R Cookbook – O’Reilly Publications

[3] Machine Learning with R – https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition

[4] Download R – https://cran.rstudio.com

[5] Download R Studio – www.rstudio.com/products/rstudio/download/

File Handling in Amazon S3 with Python Boto Library

Understand Python Boto library for standard S3 workflows.

1. Introduction

Amazon Web Services (AWS) Simple Storage Service (S3) is storage as a service provided by Amazon.  It is a general-purpose object store, where objects are grouped under a namespace called a “bucket”.  Bucket names are unique across the whole of AWS S3.

The Boto library is the official Python SDK for AWS software development.  It provides APIs to work with AWS services like EC2, S3 and others.

In this article we will focus on how to use Amazon S3 for regular file handling operations using Python and the Boto library.

2. Amazon S3 & Work Flows

In Amazon S3, the user has to first create a bucket.  The bucket is a namespace with a unique name across AWS.  The users can set access privileges on it based on their requirements.  Buckets contain objects.  An object is referred to as a key-value pair, where the key is the identifier used to operate on the object.  The key must be unique inside the bucket.  The object can be of any type; it can be used to store strings, integers, JSON, text files, sequence files, binary files, pictures and videos.  To understand more about Amazon S3, refer to the Amazon documentation [2].

Following are the possible work flows of operations in Amazon S3:

  • Create a Bucket
  • Upload file to a bucket
  • List the contents of a bucket
  • Download a file from a bucket
  • Move files across buckets
  • Delete a file from bucket
  • Delete a bucket

3. Python Boto Library

The Boto library is the official Python SDK for AWS software development.  It supports Python 2.7; work for Python 3.x is ongoing.  The code snippets in this article were developed using boto v2.x.  To install the boto library, the pip command can be used as below:

pip install -U boto

 

Also, in the below code snippets, I have used the connect_s3() API, passing the access credentials as arguments.  This provides the connection object to work with.  But if you don’t want to code the access credentials in your program, there are other ways to do it.  We can set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.  The other way is to create a credentials file named “credentials” under the .aws directory in the user’s home directory.  The file should contain the below:

File Name : ~/.aws/credentials

[default]
aws_access_key_id = ACCESS_KEY
aws_secret_access_key = SECRET_KEY
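
With either of those mechanisms in place, connect_s3() can be called without passing the keys explicitly.  A small sketch of what that looks like (the remaining snippets below keep the explicit-argument style):

import boto

#Picks up credentials from the environment variables or the ~/.aws/credentials file
conn = boto.connect_s3()
for bucket in conn.get_all_buckets():
    print(bucket.name)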

4. S3 Work flow Automation

4.1 Create a bucket

The first operation to be performed, before any other operation to access S3, is to create a bucket.  The create_bucket() API of the connection object performs this.  The bucket is the namespace under which all the objects of the user are stored.

import boto

keyId = "your_aws_key_id"
sKeyId="your_aws_secret_key_id"
#Connect to S3 with access credentials 
conn = boto.connect_s3(keyId,sKeyId)  

#Create the bucket in a specific region.
bucket = conn.create_bucket('mybucket001',location='us-west-2')

In the create_bucket() API, the bucket name (‘mybucket001’) is the mandatory parameter, which is the name of the bucket.  The location is an optional parameter; if the location is not given, then the bucket is created in the user’s default region.

The create_bucket() call might throw an error if a bucket with the same name already exists, since bucket names are unique across the system.  The naming convention of the bucket depends on the rules enforced by the AWS region.  Generally, bucket names must be in lower case.

4.2 Upload a file

To upload a file into S3, we can use the set_contents_from_file() API of the Key object.  The Key object resides inside the bucket object.

import boto
from boto.s3.key import Key

keyId = "your_aws_key_id"
sKeyId= "your_aws_secret_key_id"

fileName="abcd.txt"
bucketName="mybucket001"
file = open(fileName)

conn = boto.connect_s3(keyId,sKeyId)
bucket = conn.get_bucket(bucketName)
#Get the Key object of the bucket
k = Key(bucket)
#Create a new key with id as the name of the file
k.key=fileName
#Upload the file
result = k.set_contents_from_file(file)
#result contains the size of the file uploaded

4.3 Download a file

To download the file, we can use the get_contents_to_filename() API of the Key object.

import boto
from boto.s3.key import Key

keyId ="your_aws_key_id"
sKeyId="your_aws_secret_key_id"
srcFileName="abc.txt"
destFileName="s3_abc.txt"
bucketName="mybucket001"

conn = boto.connect_s3(keyId,sKeyId)
bucket = conn.get_bucket(bucketName)

#Get the Key object of the given key, in the bucket
k = Key(bucket,srcFileName)

#Get the contents of the key into a file 
k.get_contents_to_filename(destFileName)

4.4 Move a file from one bucket to another

Moving a file from one bucket to another can be achieved by copying the object from the source bucket to the destination bucket and then deleting the source object.  The copy_key() API of the destination bucket object copies the object from the given source bucket into the destination bucket.

import boto

keyId = "your_aws_access_key_id"
sKeyId="your_aws_secret_key_id"

conn = boto.connect_s3(keyId,sKeyId)
srcBucket = conn.get_bucket('mybucket001')   #Source Bucket Object
dstBucket = conn.get_bucket('mybucket002')   #Destination Bucket Object
fileName = "abc.txt"
#Call the copy_key() from destination bucket
dstBucket.copy_key(fileName,srcBucket.name,fileName)
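
Since copy_key() only copies, a true move also needs the source object to be removed afterwards.  A minimal sketch of that last step, reusing srcBucket and fileName from the snippet above and assuming the copy succeeded:

from boto.s3.key import Key

#Remove the original object from the source bucket to complete the "move"
srcKey = Key(srcBucket, fileName)
srcKey.delete()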

 4.5 Delete a file

To delete a file inside a bucket, we have to retrieve the key of the object and call the delete() API of the Key object.  The Key object can be retrieved by calling Key() with the bucket object and the object name.

import boto
from boto.s3.key import Key

keyId = "your_aws_access_key"
sKeyId = "your_aws_secret_key"
srcFileName="abc.txt"      #Name of the file to be deleted
bucketName="mybucket001"   #Name of the bucket, where the file resides

conn = boto.connect_s3(keyId,sKeyId)   #Connect to S3
bucket = conn.get_bucket(bucketName)   #Get the bucket object

k = Key(bucket,srcFileName)            #Get the key of the given object

k.delete()                             #Delete the object

4.6 Delete a bucket

The delete_bucket() API of the connection object deletes the bucket given as the parameter.

import boto

keyId = "your_aws_access_key_id"
sKeyId= "your_aws_secret_key_id"
conn = boto.connect_s3(keyId,sKeyId)
bucket = conn.delete_bucket('mybucket002')

The delete_bucket() call will fail, if there are objects inside the bucket.

4.7 Empty a bucket

Emptying a bucket can be achieved by deleting all the objects inside the bucket.  The list() API of the bucket object (bucket.list()) provides all the objects inside the bucket.  By calling the delete() API on those objects, we can delete them.

import boto

keyId = "your_aws_access_key_id"
sKeyId= "your_aws_secret_key_id"

bucketName="mybucket002"
conn = boto.connect_s3(keyId,sKeyId)     #Connect to S3
bucket = conn.get_bucket(bucketName)     #Get the bucket Object

for i in bucket.list():
    print(i.key)
    i.delete()                           #Delete the object

4.8 List All Buckets

The get_all_buckets() API of the connection object returns the list of all buckets for the user.  This can be used to validate the existence of a bucket once you have created or deleted one.

import boto

keyId = "your_aws_access_key_id"
sKeyId= "your_aws_secret_key_id"

conn = boto.connect_s3(keyId,sKeyId)      #Connect to S3
buckets = conn.get_all_buckets()          #Get the bucket list
for i in buckets:
    print(i.name)

5 Summary

The boto library provides connection, bucket and key objects which exactly represent the design of S3.  By understanding the various methods of these objects, we can perform all the possible operations on S3 using the boto library.

Hope this helps.

6. References

[1] Boto S3 API Documentation – http://boto.cloudhackers.com/en/latest/ref/s3.html

[2] Amazon S3 Documentation – https://aws.amazon.com/documentation/s3/

 

 

 

A Primer On Open Source NoSQL Databases

The idea of this article is to understand NoSQL databases: their properties, various types, data models and how they differ from a standard RDBMS.

1. Introduction

RDBMS databases have been around for nearly three decades now.  But in the era of social media, smartphones and the cloud, we generate large volumes of data at high velocity.  The data also varies from simple text messages to high-resolution video files.  The traditional RDBMS is not able to cope with the velocity, volume and variety of data requirements of this new era.  Also, most RDBMS software is licensed and needs enterprise-class, proprietary, licensed hardware.  This has clearly made way for open source NoSQL databases, whose basic properties are a dynamic schema and distributed, horizontally scalable deployment on commodity hardware.

2. Properties of NoSQL

NoSQL is the acronym for Not Only SQL.  The basic qualities of NoSQL databases are that they are schemaless, distributed and horizontally scalable on commodity hardware.  NoSQL databases offer a variety of functions to solve various problems with a variety of data types, where “blob” used to be the only data type in an RDBMS for storing unstructured data.

2.1 Dynamic Schema

NoSQL databases allow the schema to be flexible.  New columns can be added at any time.  Rows may or may not have values for those columns, and there is no strict enforcement of data types for columns.  This flexibility is handy for developers, especially when they expect frequent changes during the course of the product life cycle.

2.2 Variety of Data

NoSQL databases support any type of data: structured, semi-structured and unstructured.  They allow logs, image files, videos, graphs, JPEGs, JSON and XML to be stored and operated on as is, without any pre-processing.  So they reduce the need for ETL (Extract – Transform – Load).

2.3 High Availability Cluster

NoSQL databases support distributed storage using commodity hardware.  They also support high availability through horizontal scalability.  This feature enables NoSQL databases to benefit from the elastic nature of cloud infrastructure services.

2.4 Open Source

NoSQL databases are open source software.  The usage of the software is free and most of them are free to use in commercial products.  The open source code base can be modified to solve business needs.  There are minor variations in the open source software licenses, so users must be aware of the license agreements.

2.5 NoSQL – Not Only SQL

NoSQL databases do not depend only on SQL to retrieve data.  They provide rich API interfaces to perform DML and CRUD operations.  These APIs are more developer-friendly and are supported in a variety of programming languages.

3. Types of No-SQL

There are four types of NoSQL databases: key-value databases, column-oriented databases, document-oriented databases and graph databases.  At a very high level, most of these databases follow a structure similar to that of RDBMS databases.

The database server might contain many databases.  Each database might contain one or more tables inside it.  The table in turn will have rows and columns to store the actual data.  This hierarchy is common across all NoSQL databases, but the terminology might vary.

3.1 Key Value Database

Key-value databases were developed based on the Dynamo white paper published by Amazon.  A key-value database allows the user to store data in a simple <key> : <value> format, where the key is used to retrieve the value from the table.

3.1.1 Data Model

The table contains many key spaces, and each key space can have many identifiers to store key-value pairs.  The key space is similar to a column in a typical RDBMS, and the group of identifiers presented under the key space can be considered as rows.

Key-value databases are suitable for building simple, non-complex, highly available applications.  Since most key-value databases support in-memory storage, they can be used for building caching mechanisms.

3.1.2 Example

DynamoDB, Redis
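
As a quick illustration of the key-value model, here is a minimal sketch using the redis-py client against a local Redis server (the key name is made up):

import redis

#Connect to a local Redis server
r = redis.Redis(host='localhost', port=6379, db=0)

#Store and retrieve a value by its key
r.set('user:1001:name', 'Saravanan')
print(r.get('user:1001:name'))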

3.2 Column oriented Database

Column-oriented databases were developed based on the Bigtable white paper published by Google.  They take a different approach from a traditional RDBMS, supporting the addition of more and more columns to build wider tables.  Since a table can be very broad, they support grouping columns under a family name, called a “Column Family” or “Super Column”.  The column family can also be optional in some column databases.  As per the common philosophy of NoSQL databases, the values of the columns can be sparsely distributed.

3.2.1 Data Model

The table contains column families (optional).  Each column family contains many columns.  The values for the columns may be sparsely distributed as key-value pairs.

[Figure: Column-oriented data model]

Column-oriented databases are an alternative to typical data warehousing databases (e.g. Teradata) and are suitable for OLAP kinds of applications.

3.2.2 Example

Apache Cassandra, HBase
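
For a flavour of the column-family model in Python, here is a small sketch assuming a Cassandra node running locally and the DataStax cassandra-driver package; the keyspace, table and column names are made up.

from cassandra.cluster import Cluster   # pip install cassandra-driver

cluster = Cluster(['127.0.0.1'])         # assumption: Cassandra running locally
session = cluster.connect()

session.execute("CREATE KEYSPACE IF NOT EXISTS retail "
                "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
session.execute("CREATE TABLE IF NOT EXISTS retail.sales "
                "(shopid int PRIMARY KEY, saledate text, totalsales int)")

# Rows may populate the columns sparsely; unset columns simply remain null
session.execute("INSERT INTO retail.sales (shopid, totalsales) VALUES (11, 2300)")
row = session.execute("SELECT * FROM retail.sales WHERE shopid = 11").one()
print(row.shopid, row.totalsales)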

3.3 Document-oriented Database

Document-oriented databases support storing semi-structured data.  A document can be JSON, XML, YAML or even a Word document.  The unit of data is called a document (similar to a row in an RDBMS).  The table which contains a group of documents is called a “Collection”.

3.3.1 Data Model

The database contains many Collections.  A Collection contains many documents.  Each document might be a JSON document, an XML document, YAML or even a Word document.

[Figure: Document-oriented data model]

Document databases are suitable for Web based applications and applications exposing RESTful services.

3.3.2 Example

MongoDB, CouchBaseDB
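
The sketch below shows the document model in Python, assuming a MongoDB server running locally and the pymongo package; the database, collection and field names are made up.  Note that the two documents stored in the same collection do not share the same fields.

from pymongo import MongoClient   # pip install pymongo

client = MongoClient('localhost', 27017)   # assumption: MongoDB running locally
db = client['bookstore']                   # database
books = db['books']                        # collection of documents

# Documents in one collection need not share the same schema
books.insert_one({'title': 'Python Collections', 'tags': ['python', 'containers']})
books.insert_one({'title': 'NoSQL Basics', 'pages': 250})

print(books.find_one({'title': 'NoSQL Basics'}))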

3.4 Graph Database

A real-world graph contains vertices and edges.  In a graph database they are called nodes and relationships.  Graph databases allow us to store and perform data manipulation operations on nodes, relationships and the attributes of nodes and relationships.

Graph databases work best with directed graphs, i.e. when the relationships between nodes have a direction.

3.4.1 Data Model

A graph database is a two-dimensional representation of a graph.  The graph is similar to a table.  Each graph contains Node, Node Properties, Relation and Relation Properties as columns, and each row holds values for these columns.  The values for the properties columns can be key-value pairs.

[Figure: Graph data model]

Graph databases are suitable for social media and network problems which involve complex queries with many joins.

3.4.2 Example

Neo4j, OrientDB, HyperGraphDB, GraphBase, InfiniteGraph
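
The following minimal sketch stores two nodes and a relationship, assuming a Neo4j server reachable over Bolt and the official neo4j Python driver; the URI, credentials, labels and property names are made up.

from neo4j import GraphDatabase   # official driver: pip install neo4j

# Assumption: a local Neo4j server with these made-up credentials
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes, a relationship and their properties in one Cypher statement
    session.run("CREATE (a:Person {name: $a})-[:FRIEND_OF {since: 2015}]->(b:Person {name: $b})",
                a="Alice", b="Bob")
    result = session.run("MATCH (a:Person)-[:FRIEND_OF]->(b:Person) RETURN a.name AS a, b.name AS b")
    for record in result:
        print(record["a"], "->", record["b"])

driver.close()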

4. Possible Problem Areas

The following are important areas to be considered while choosing a NoSQL database for a given problem statement.

4.1 ACID Transactions

Most NoSQL databases (e.g. MongoDB, CouchBase, Cassandra) do not support full ACID transactions across multiple records.  [Note: To know more about ACID transaction capabilities, refer to the appendix below.]

4.2 Proprietary APIs / SQL Support

Some NoSQL databases do not support a Structured Query Language and only offer an API interface.  There is no common standard for these APIs; every database follows its own way of implementing them, so there is an overhead of learning, and of developing separate adaptor layers for, each database.  Even those NoSQL databases that do offer a query language may not support all standard SQL features.

4.3 No JOIN Operator

Due to the nature of the schema and data model, not all NoSQL databases support JOIN operations by default, whereas in an RDBMS the JOIN operation is a core feature.  The query language in Couchbase supports join operations, and in HBase joins can be achieved by integrating with Hive.  MongoDB offers only limited join-like functionality through its aggregation framework.

4.4 Leeway of the CAP Theorem

Most NoSQL databases take the leeway suggested by the CAP theorem and support only two of the three properties: Consistency, Availability and Partition tolerance.  They do not support all three qualities at once. [Note: Please refer to the appendix to know more about the CAP theorem.]

5. Summary

NoSQL databases solve problems where an RDBMS could not succeed, in both functional and non-functional areas.  In this article we have seen the basic properties, generic data models, various types and features of NoSQL databases.  To proceed further, start using any one of the NoSQL databases and get hands-on experience.

 

Appendix A Theories behind Databases

A.1 ACID Transactions

ACID is an acronym for Atomicity, Consistency, Isolation and Durability.  These four properties are used to measure the reliability of database transactions.

A.1.1 Atomicity

Atomicity means that database transactions must be atomic in nature; this is also called the all-or-nothing rule. The database must ensure that a single failure results in the rollback of the entire transaction up to the commit point; the transaction is committed only if all of its operations succeed.

A.1.2 Consistency

Databases must ensure that only valid data is allowed to be stored. In an RDBMS, this is all about enforcing the schema. In NoSQL, consistency varies depending on the type of database. For example, in a graph database such as Neo4j, consistency ensures that every relationship has a start and an end node. In MongoDB, a unique 12-byte ObjectId (rendered as a 24-character hex string) is automatically created as the row identifier.

A.1.3 Isolation

Databases allow multiple transactions in parallel. For example, when read and write operations happen in parallel, the read will not see the write operation until the write transaction is committed. The read operation will see only the previously committed data until the write transaction completes its commit.

A.1.4 Durability

Databases must ensure that committed transactions are persisted to storage. There must be appropriate transaction and commit logs available to ensure the data is written to disk.
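
To make the all-or-nothing and durability ideas concrete, here is a small Python sketch using the built-in sqlite3 module; the table and account values are made up.  If either UPDATE fails, rollback() discards the whole transfer.

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # Both updates must succeed or neither should be applied (atomicity)
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()        # changes become durable only at commit
except sqlite3.Error:
    conn.rollback()      # a failure rolls the whole transaction back

print(conn.execute("SELECT * FROM accounts").fetchall())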

A.2 Brewer’s CAP-Theorem

The CAP theorem states that any networked shared-data system can have at most two of three desirable properties: Consistency, Availability and Partition tolerance.

A.2.1 Consistency

In a distributed database system, all the nodes must see the same data at the same time.

A.2.2 Availability

The database system must be available to service every request it receives; basically, the DBMS must be a highly available system.

A.2.3 Partition Tolerance

The database system must continue to operate despite arbitrary partitioning due to network failures.

GIT Command Reference

Git is software that allows you to keep track of changes made to a project over time.  Git works by recording the changes you make to a project, storing those changes, and then allowing you to reference them as needed.

A GIT project has three parts.

1. Working Directory : The directory where you will be doing all the work.  Creating, editing, deleting and organizing files.

2. Staging Area : The place where you will list changes you make to the working directory.

3. Repository :  A place where GIT permanently stores those changes as different versions of the project.

GIT WorkFlow:

The Git workflow consists of editing files in the working directory, adding files to the staging area, and saving changes to the GIT repository.  Saving changes to the GIT repository is called a commit.

 

I. BASIC GIT Commands

  1. git init – Turns the current working directory into a GIT Project
  2. git status – Prints the current status of git working directory.
  3. git add <filename> – Adds the file into the Staging Area.  [ After adding, verify with the git status command ]
  4. git add <list of files> – The add command also takes a list of files.
  5. git diff <filename> – Displays the diff between the file in the staging area and the current working directory.
  6. git commit -m “Commit Comment” – Permanently stores changes from the staging area into the GIT repository.
  7. git log – Prints the earlier versions of the project which are stored in chronological order.

 

II. Backtracking Changes
In GIT the commit you are currently on is known as the HEAD commit.  In many cases, the most recently made commit is the HEAD commit.  SHA – The git log command displays the commit log, and each commit has a SHA value.  The short SHA is the first 7 characters of the full SHA.

  1. git show HEAD – Displays the HEAD commit
  2. git reset HEAD <filename> – Unstages file changes in the Staging area.
  3. git checkout HEAD <filename> – Discards the changes in the Working Directory.
  4. git reset <SHA> – Resets the project back to the specified commit.

 

III. GIT BRANCHING
GIT allows us to create branches to experiment with versions of a project.  Imagine you want to develop a new API: until you are ready to merge that API into the master branch, it should not be available there.  So, in this scenario we create a branch, develop the new API on it and then merge it into master.

  1. git branch – Shows the current branches and current active branch you are in.
  2. git branch <new branch name> – Creates a new branch.
  3. git checkout <branchname> – Switches to the branch.
  4. git merge <branchname> – This command is issued from master to merge the branch into master.
  5. git branch -d <branchname> – Delete the branch.

 

IV. GIT COLLABORATION
Git offers a suite of collaboration tools for working with others’ projects.

  1. git clone <remote_location> <clone_name> – Creates a new local replica of a git repository from the remote repository.
  2. git fetch – Updates the clone by downloading new commits and branches from the remote; it does not merge them into your local branches.
  3. git merge origin/master – Merges origin/master into the local master branch.
  4. git push origin <branch> – Push your work to the origin.

 

V. Example Workflow to Add Files

  1. git clone <remote_location>
    1.     Eg. git clone http://www.abc.com/abc/bcd
  2. git add <new_files>
    1.     git add abc.py bcd.py
  3. git status
  4. git commit -m “Committing abc.py”
  5. git push origin master

 


VI. Reference

  1. https://confluence.atlassian.com/bitbucketserver/basic-git-commands-776639767.html

Python Collections : High Performing Containers For Complex Problems

1.Introduction

Python is known for its powerful general purpose built-in data types like list, dict, tuple and set.  But Python also has collection objects, similar to those in Java and C++.  These objects are developed on top of the general built-in containers with additional functionality and can be used in special scenarios.

The objective of this article is to introduce the Python collection objects and explain them with appropriate code snippets.  The collections library contains the collection objects namedtuple (v2.6), deque (v2.4), ChainMap (v3.3), Counter (v2.7), OrderedDict (v2.7) and defaultdict (v2.5).  Python 3.x also has UserDict, UserList and UserString to create custom container types, which deserve a separate article and are not in the scope of this one.

NOTE: Python 2.x users may recognize the releases in which these objects were introduced.  All of these objects are available in Python 3.x from 3.1 onwards, except ChainMap, which was introduced in v3.3.  All the code snippets in this article were executed in a Python 3.5 environment.

2. Namedtuple

As the name suggests, a namedtuple is a tuple with names.  In a standard tuple we access the elements using the index, whereas a namedtuple allows the user to define names for the elements.  This is very handy, especially when processing CSV (comma separated value) files and working with complex, large datasets, where code that relies on indices becomes messy (and not so pythonic).

2.1 Example 1

Namedtuples are available in the collections library in Python. We have to import the collections library before using any container object from this library.

>>>from collections import namedtuple
>>>saleRecord = namedtuple('saleRecord','shopId saleDate salesAmount totalCustomers')
>>>
>>>
>>>#Assign values to a named tuple 
>>>shop11=saleRecord(11,'2015-01-01',2300,150) 
>>>shop12=saleRecord(shopId=12,saleDate="2015-01-01",salesAmount=1512,totalCustomers=125)

In the above code snippet, in the first line we import namedtuple from the collections library. In the second line we create a namedtuple called “saleRecord”, which has shopId, saleDate, salesAmount and totalCustomers as fields. Note that namedtuple() takes two string arguments: the first argument is the name of the tuple and the second argument is the list of field names separated by space or comma. In the above example space is used as the delimiter.
We have also created two tuples here, shop11 and shop12.  For shop11, the values are assigned to the fields based on the order of the fields; for shop12, the values are assigned using the field names.

2.2 Example 2

>>>#Reading as a namedtuple
>>>print("Shop Id =",shop12.shopId)
12
>>>print("Sale Date=",shop12.saleDate)
2015-01-01
>>>print("Sales Amount =",shop12.salesAmount)
1512
>>>print("Total Customers =",shop12.totalCustomers)
125

The above code makes it clear that the tuple elements are accessed using their names. It is also possible to access them using the indexes of the tuple, which is the usual way.
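
For instance, continuing with the shop12 record created above, index access still behaves like a plain tuple:

>>>shop12[0]
12
>>>shop12[1]
'2015-01-01'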

2.3 Interesting Methods and Members

2.3.1 _make

The _make method is used to convert a given iterable (list, tuple, etc.) into a namedtuple.

>>>#Convert a list into a namedtuple
>>>aList = [101,"2015-01-02",1250,199]
>>>shop101 = saleRecord._make(aList)
>>>print(shop101)
saleRecord(shopId=101, saleDate='2015-01-02', salesAmount=1250, totalCustomers=199)

>>>#Convert a tuple into a namedtuple
>>>aTup =(108,"2015-02-28",1990,189)
>>>shop108=saleRecord._make(aTup)
>>>print(shop108)
saleRecord(shopId=108, saleDate='2015-02-28', salesAmount=1990, totalCustomers=189)
>>>

2.3.2 _fields

The _fields member is a tuple which contains the field names of the namedtuple.

>>>print(shop108._fields)
('shopId', 'saleDate', 'salesAmount', 'totalCustomers')

2.4 CSV File Processing

As discussed, a namedtuple is very handy while processing a CSV data file, where we can access the data using names instead of indexes, which makes the code more meaningful and readable.

import csv
from collections import namedtuple

saleRecord = namedtuple('saleRecord','shopId saleDate totalSales totalCustomers')
overAllSales = 0

with open("salesRecord.csv","r") as fileHandle:
    csvFieldsList = csv.reader(fileHandle)
    for fieldsList in csvFieldsList:
        shopRec = saleRecord._make(fieldsList)
        overAllSales += int(shopRec.totalSales)   # CSV values are read as strings

print("Total Sales of The Retail Chain =", overAllSales)

In the above code snippet, we have the file salesRecord.csv, which contains the sales records of the shops of a particular retail chain. It contains values for the fields shopId, saleDate, totalSales and totalCustomers. The fields are delimited by commas and the records are delimited by new lines.
The csv.reader() reads the file and provides an iterator. The iterator, “csvFieldsList”, provides the list of fields for every single row of the csv file. As we know, _make() converts the list into a namedtuple, and the rest of the code is self explanatory.

 

3.Counter

Counter is used for rapid tallies.  It is a dictionary, where the elements are stored as keys and their counts are stored as values.

3.1 Creating Counters

The Counter() class takes an iterable object as an argument, computes the count for each element in the object and presents them as key-value pairs.

>>>from collections import Counter
>>>listOfInts=[1,2,3,4,1,2,3,1,2,1]
>>>cnt=Counter(listOfInts)
>>>print(cnt)
Counter({1: 4, 2: 3, 3: 2, 4: 1})

In the above code snippet, listOfInts is a list which contains numbers. It is passed to Counter() and we get cnt, which is a container object. The cnt is a dictionary which contains the unique numbers present in the given list as keys and their respective counts as values.

3.2 Accessing Counters

Counter is a subclass of dictionary, so it can be accessed just like a dictionary.  The “cnt” object can be handled as a regular dictionary object.

>>> cnt.items()
dict_items([(1, 4), (2, 3), (3, 2), (4, 1)])
>>> cnt.keys()
dict_keys([1, 2, 3, 4])
>>> cnt.values()
dict_values([4, 3, 2, 1])

3.3 Interesting Methods & Usecases

3.3.1 most_common

The most_common(n) method of the Counter class returns the most commonly occurring elements. The argument n limits the result to the top n elements; for example, n = 2 will provide the top two keys.

>>>name = "Saravanan Subramanian"
>>>letterCnt=Counter(name)
>>>letterCnt.most_common(1)
[('a', 7)]
>>>letterCnt.most_common(2)
[('a', 7), ('n', 4)]
>>>letterCnt.most_common(3)
[('a', 7), ('n', 4), ('r', 2)]

In the above code, we can see that the string is parsed into individual characters as keys, and their respective counts are stored as values. So letterCnt.most_common(1) provides the top letter with the highest number of occurrences.

3.3.2 Operations on Counter

A Counter is also called a multiset. It supports addition, subtraction, union and intersection operations on Counter objects.

>>> a = Counter(x=1,y=2,z=3)
>>> b = Counter(x=2,y=3,z=4)
>>> a+b
Counter({'z': 7, 'y': 5, 'x': 3})
>>> a-b       #This will result in negative values & will be omitted
Counter()    
>>> b-a
Counter({'y': 1, 'x': 1, 'z': 1})
>>> a & b    #Chooses the minimum values from their respective pair
Counter({'z': 3, 'y': 2, 'x': 1})
>>> a | b   #Chooses the maximum values from their respective pair
Counter({'z': 4, 'y': 3, 'x': 2})

4. Default Dictionary

The defaultdict() is available as part of the collections library. It allows the user to specify a factory function to be called when a key is not present in the dictionary.

In a standard dictionary, accessing an element whose key is not present will raise a “KeyError”. So this is a problem when working with collections (list, set, etc.), especially while creating them.

So, when the dictionary is queried for a key which does not exist, the function passed as the default_factory argument to defaultdict() is called to produce a value, and that value is set for the given key in the dictionary.

4.1 Creating Default Dictionary

The defaultdict() takes as its argument a function that can be called without arguments and returns the default value.

4.1.1 Example 1

>>> from collections import defaultdict
>>> booksIndex = defaultdict(lambda:'Not Available')
>>> booksIndex['a']='Arts'
>>> booksIndex['b']='Biography'
>>> booksIndex['c']='Computer'
>>> print(booksIndex)
defaultdict(<function <lambda> at 0x030EB3D8>, {'c': 'Computer', 'b': 'Biography', 'a': 'Arts'})
>>> booksIndex['z']
'Not Available'
>>> print(booksIndex)
defaultdict(<function <lambda> at 0x030EB3D8>, {'c': 'Computer', 'b': 'Biography', 'z': 'Not Available', 'a': 'Arts'})
>>> 

In the above example, booksIndex is a defaultdict which uses ‘Not Available’ as the value whenever a non-existent key is accessed. We have added values for the keys a, b and c into the defaultdict. The print(booksIndex) shows that the defaultdict contains values only for these keys. While trying to access the value for the key ‘z’, which we have not set, it returned the value ‘Not Available’ and updated the dictionary.

4.1.2 Example 2

>>> titleIndex = [('a','Arts'),('b','Biography'),('c','Computer'),('a','Army'),('c','Chemistry'),('d','Dogs')]
>>> rackIndices = defaultdict(list)
>>> for id,title in titleIndex:
	rackIndices[id].append(title)	
>>> rackIndices.items()
dict_items([('d', ['Dogs']), ('b', ['Biography']), ('a', ['Arts', 'Army']), ('c', ['Computer', 'Chemistry'])])
>>> 

In the above example, titleIndex contains a list of tuples. We want to aggregate this list of tuples to identify the titles for each alphabet letter. So we can have a dictionary where the key is the letter and the value is the list of titles. Here we used a defaultdict with list as the factory function for missing keys. For each missing key, list() is called to create an empty list object, and the subsequent append() calls add elements to that list.

5. Ordered Dictionary

The ordered dictionary maintains the order in which elements are added to the dictionary, whereas the standard dictionary does not guarantee the order of insertion.

5.1 Ordered Dictionary Creation

An Ordered Dictionary is created using OrderedDict() from the collections library. It is a subclass of the regular dictionary, so it inherits all the other methods and behaviours of a regular dictionary.

>>> from collections import OrderedDict
>>> dOrder=OrderedDict()
>>> dOrder['a']='Alpha'
>>> dOrder['b']='Bravo'
>>> dOrder['c']='Charlie'
>>> dOrder['d']='Delta'
>>> dOrder['e']='Echo'
>>> dOrder
OrderedDict([('a', 'Alpha'), ('b', 'Bravo'), ('c', 'Charlie'), ('d', 'Delta'), ('e', 'Echo')])
>>> dOrder.keys()
odict_keys(['a', 'b', 'c', 'd', 'e'])
>>> dOrder.values()
odict_values(['Alpha', 'Bravo', 'Charlie', 'Delta', 'Echo'])
>>> dOrder.items()
odict_items([('a', 'Alpha'), ('b', 'Bravo'), ('c', 'Charlie'), ('d', 'Delta'), ('e', 'Echo')])
>>> 

5.2 Creating from other iteratable items

An OrderedDict can also be created by passing a dictionary or a list of (key, value) tuples.

>>> from collections import OrderedDict
>>> listKeyVals = [(1,"One"),(2,"Two"),(3,"Three"),(4,"Four"),(5,"Five")]
>>> x = OrderedDict(listKeyVals)
>>> x
OrderedDict([(1, 'One'), (2, 'Two'), (3, 'Three'), (4, 'Four'), (5, 'Five')])
>>> 

5.3 Sort and Store

One interesting use case for OrderedDict is the rank problem. For example, consider a dictionary that contains student names and their marks; we have to find the best student and rank them according to their marks. OrderedDict is the right choice here: since OrderedDict remembers the order of addition and sorted() can sort the dictionary items, we can combine both to create a rank list based on the student marks. Please check the example below:

>>> studentMarks={}
>>> studentMarks["Saravanan"]=100
>>> studentMarks["Subhash"]=99
>>> studentMarks["Raju"]=78
>>> studentMarks["Arun"]=85
>>> studentMarks["Hasan"]=67
>>> studentMarks
{'Arun': 85, 'Subhash': 99, 'Raju': 78, 'Hasan': 67, 'Saravanan': 100}
>>> sorted(studentMarks.items(),key=lambda t:t[0])
[('Arun', 85), ('Hasan', 67), ('Raju', 78), ('Saravanan', 100), ('Subhash', 99)]
>>> sorted(studentMarks.items(),key=lambda t:t[1])
[('Hasan', 67), ('Raju', 78), ('Arun', 85), ('Subhash', 99), ('Saravanan', 100)]
>>> sorted(studentMarks.items(), key = lambda t:-t[1])
[('Saravanan', 100), ('Subhash', 99), ('Arun', 85), ('Raju', 78), ('Hasan', 67)]
>>> rankOrder = OrderedDict(sorted(studentMarks.items(), key = lambda t:-t[1]))
>>> rankOrder
OrderedDict([('Saravanan', 100), ('Subhash', 99), ('Arun', 85), ('Raju', 78), ('Hasan', 67)])

In the above example, studentMarks is a dictionary containing the student name as the key and the mark as the value. It is sorted by its values and passed to OrderedDict, and the result is stored in rankOrder. Now rankOrder contains the highest-scoring student as the first entry, the next highest as the second entry and so on, and this order is preserved in the dictionary.

6. Deque

Deque means double-ended queue, and it is pronounced “deck”. It is an extension of the standard list data structure. The standard list allows the user to append or extend elements efficiently only at the end, but a deque allows the user to operate on both ends, so that the user can implement both stacks and queues.

6.1 Creation & Performing Operations on Deque

The deque() is available in the collections library. It takes an iterable as an argument and an optional maximum length. If maxlen is set, the deque ensures that its length does not exceed maxlen.

>>> from collections import deque
>>> aiao = deque([1,2,3,4,5],maxlen=5)
>>> aiao
deque([1, 2, 3, 4, 5], maxlen=5)
>>> aiao.append(6)
>>> aiao
deque([2, 3, 4, 5, 6], maxlen=5)
>>> aiao.appendleft(1)
>>> aiao
deque([1, 2, 3, 4, 5], maxlen=5)

In the above example, we created a deque with maxlen 5; once we appended a 6th element on the right, it pushed out the first element on the left.  Similarly, it pushes out the last element on the right when we append an element on the left.

6.2 Operations on Right

Operations on the right are the same as performing operations on a list.  The methods append(), extend() and pop() operate on the right side of the deque.

>>> aiao.append(6)
>>> aiao
deque([2, 3, 4, 5, 6], maxlen=5)
>>> aiao.extend([7,8,9])
>>> aiao
deque([5, 6, 7, 8, 9], maxlen=5)
>>> aiao.pop()
9

6.3 Operation on the Left

Operations on the left are supported by a set of methods such as appendleft(), extendleft() and popleft().

>>> aiao = deque([1,2,3,4,5],maxlen=5)
>>> aiao.appendleft(0)
>>> aiao
deque([0, 1, 2, 3, 4], maxlen=5)
>>> aiao.extendleft([-1,-2,-3])
>>> aiao
deque([-3, -2, -1, 0, 1], maxlen=5)
>>> aiao.popleft()
-3

6.4 Example 2 (without maxlen)

If the maxlen value is not set, the deque does not perform any trimming operations to maintain the size of the deque.

>>> aiao = deque([1,2,3,4,5])
>>> aiao.appendleft(0)
>>> aiao
deque([0, 1, 2, 3, 4, 5])
>>> aiao.extendleft([-1,-2,-3])
>>> aiao
deque([-3, -2, -1, 0, 1, 2, 3, 4, 5])
>>> 

As the above example shows, the deque aiao continues to grow as append and extend operations are performed on it.

7. ChainMap

ChainMap allows multiple dictionaries to be combined into a single view, so that operations can be performed on a single logical entity.  The ChainMap() does not create any new dictionary; instead it maintains references to the original dictionaries, and all operations are performed on the referenced dictionaries.

7.1 Creating ChainMap

>>> from collections import ChainMap
>>> x = {'a':'Alpha','b':'Beta','c':'Cat'}
>>> y = { 'c': "Charlie", 'd':"Delta", 'e':"Echo"}
>>> z = ChainMap(x,y)
>>> z
ChainMap({'c': 'Cat', 'b': 'Beta', 'a': 'Alpha'}, {'d': 'Delta', 'c': 'Charlie', 'e': 'Echo'})
>>> list(z.keys())
['b', 'd', 'c', 'e', 'a']
>>> list(z.values())
['Beta', 'Delta', 'Cat', 'Echo', 'Alpha']
>>> list(z.items())
[('b', 'Beta'), ('d', 'Delta'), ('c', 'Cat'), ('e', 'Echo'), ('a', 'Alpha')]

We have created the ChainMap z from the dictionaries x and y. The ChainMap z holds references to the dictionaries x and y. A ChainMap does not keep duplicate keys; it returns the value ‘Cat’ for key ‘c’ from the first mapping and basically skips the second occurrence of the same key.

>>> x
{'c': 'Cat', 'b': 'Beta', 'a': 'Alpha'}
>>> y
{'d': 'Delta', 'c': 'Charlie', 'e': 'Echo'}
>>> x.pop('c')
'Cat'
>>> x
{'b': 'Beta', 'a': 'Alpha'}
>>> list(z.keys())
['d', 'c', 'b', 'e', 'a']
>>> list(z.values())
['Delta', 'Charlie', 'Beta', 'Echo', 'Alpha']
>>> list(z.items())
[('d', 'Delta'), ('c', 'Charlie'), ('b', 'Beta'), ('e', 'Echo'), ('a', 'Alpha')]
>>> 

In the above code, we have removed the key ‘c’ from the dict x.  Now the ChainMap resolves the key ‘c’ to “Charlie”, which is present in y.

8. Summary

We have seen the various Python collection data types and understood them with examples and use cases. The official Python documentation can be referred to for further reading.

9. References

[1] – Python Documentation – https://docs.python.org/3.5/library/collections.html