R Data Frame – The One Data Type for Predictive Analytics

This article introduces the data frame, R’s high-level data structure, and shows how it is used when programming with predictive machine learning algorithms.


1. Introduction

The R data frame is a high-level data structure that is equivalent to a table in a database system.  It is flexible, easy to use, and well suited to working with machine learning algorithms.

The R documentation [1] defines data frames as “tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R’s modeling software.”

2. CRUD Operations

2.1 Create

2.1.1 Creating a new Data Frame

The data.frame() API creates a new, empty data frame object, which can later be populated with rows and columns.

dataFrame <- data.frame()

2.1.2 Creating a DataFrame from a CSV file

This is the most common way of creating a data frame when working on machine learning problems.  The read.csv() API creates a new data frame and loads it with the contents of the CSV file.  The columns are named after the first row of the CSV file.  Note that backslashes in Windows paths must be escaped.

dataFrame <- read.csv("C:\\DATA\\training.csv")

2.1.3 Creating a CSV file from Data Frame

The write.csv() API is used to persist the contents of a data frame into a CSV file.  In a machine learning problem, once we have performed the prediction we want to persist the results to storage, and write.csv() answers that requirement.

write.csv(dataFrame,"C:\\Data\\NewTraining.csv")


2.1.4 Creating a CSV file from Data Frame without row names

By default write.csv() adds a row id column to the output file.  If you don’t want it, set row.names to FALSE.

write.csv(dataFrame,"C:\\Data\\NewTraining.csv",row.names = FALSE)

2.1.5 Create a vector from a Data Frame

Every column of a data frame is a vector, so we can create a new vector from a column of a data frame.

newVector <- dataFrame$ColumnName

2.2 Read

2.2.1 The str() API

The str() API compactly displays the structure of the data frame: the number of rows and columns, and the name, type and first few values of each column.

str(dataFrame)


2.2.2 Filter the DataFrame Based on the Value of a Column

In the following example, we select only the records where ShopId is 100.  We can also use other relational operators and combine them with logical operators.

newFrame <- dataFrame[dataFrame$ShopId == '100',]

2.2.3 Sort the DataFrame on a column

The following snippet sorts dataFrame by its Date column and stores the result in a new data frame object.

newFrame <- dataFrame[order(dataFrame$Date),]

2.3 Update

2.3.1 Adding Rows

The rbind() API appends a new row to an existing data frame.  This API can throw errors if the row does not match the columns of the data frame.  The newRow is given as a vector, or as a list or one-row data frame when the columns have different types.

NewDataFrame <- rbind(dataFrame,newRow)

2.3.2 Adding Columns

The cbind() API appends a new column to an existing data frame.  This API throws an error if newColumn cannot be matched to the number of rows of the data frame.  The newColumn must be a vector.

newDataFrame <- cbind(dataFrame,newColumn)

2.3.3 Dynamically Adding Column

The following is an alternate way to add a new column to an existing data frame using a vector.  Here newVector is a vector with one value per row of the data frame.

dataFrame$newColumn <- newVector

The newColumn will be appended to the dataFrame, populated with the values of newVector.

2.4 Delete

The standard rm() API is used to delete the dataFrame object from the workspace.

rm(dataFrame)


3. Prediction Models with DataFrame

We will take a simple sales prediction problem, in which a shop wants to predict its expected sales based on the history of the past 10 days.  Most R scripts for this follow a general format: 1) load the training data, 2) load the test data, 3) build the model with the training data, 4) predict the test data with the model and, finally, 5) write the predicted values to storage.

3.1 Training Data

The prediction model is built from training data: the machine learning algorithm is run on the training data and produces a model, which we can later use to make predictions.

We have the training data in a CSV file with the following columns: Day specifies the number of the day, Customers specifies the total number of customers who visited the shop, Promo specifies whether the shop ran a promotion on that day, Holiday specifies whether it was a holiday in that state, and Sales specifies the amount of sales on that day.

For example, the first row says that on day 1, 55 customers visited the shop and sales were 5488.  It was a regular weekday, with no promotion or holiday on that day.


Sales Data (Training Data)


3.2 Test Data

The test data is also a CSV file, for which we have to predict the sales.  It contains the same set of columns, except Sales.  For example, we have to predict the sales for day 1, when 53 customers visited and it was a regular working day without a promotion or a holiday.


Test Data

3.3 Linear Regression Model

#Create a data frame from Sales.txt to train the model
trainingData <- read.csv("C:\\Data\\Sales.txt")
#Create a data frame from Test.txt, for which the predictions are to be computed
testData <- read.csv("C:\\Data\\Test.txt")

#Build the linear regression model using the glm() function
#This model predicts Sales using Day, Customers, Promo & Holiday

Model <- glm(Sales ~ Day + Customers + Promo + Holiday, data = trainingData)

#Apply the model to the test data with the predict() api
#It returns the predicted sales in a vector

predictSales <- predict(Model,testData)

#Round off the predicted sales values
predictSales <- round(predictSales,digits=2)

#Create a new column in the testData data frame with the predicted Sales values
testData$Sales <- predictSales

#Write the result into a file.
write.csv(testData, file = "C:\\Result.txt",row.names=FALSE)

3.4 Predicted Results

The results are in a format similar to our Sales.txt: the file contains all the fields of Test.txt plus the Sales values predicted by the algorithm.


Linear Regression Model Results

3.5 Random Forest Model

To run the Random Forest algorithm, the randomForest package must be installed in R.  We can execute the command below to install the package.

install.packages("randomForest")


Below is the R script to predict the sales using the Random Forest algorithm:

#Load the randomForest library
library(randomForest)

#Create a dataframe for Sales data to train the model
trgData <- read.csv("C:\\Data\\Sales.txt")

#Create a dataframe with the test data, for which the sales to be predicted
testData <- read.csv("C:\\Data\\Test.txt")

#Build the model using Random Forest algorithm
#This model predicts the sales using Day, Customers, Promo and Holiday
Model <- randomForest(Sales ~ Day + Customers + Promo + Holiday, data = trgData)

#Predict the sales using the above model, for the test data
#predictSales is a vector which contains the predicted values
predictSales <- predict(Model,testData)

#Round off the sales numbers
predictSales <- round(predictSales,digits=2)

#Add additional column to the test data and copy the predicted sales values
testData$Sales <- predictSales

#Create an output file using the computed values.
write.csv(testData, file = "C:\\Data\\Result.txt",row.names=FALSE)

3.6 Predicted Output (Random Forest)

The contents of Test.txt are now extended with a predicted Sales column for every day.  For example, on day 5, 79 customers visited the shop, it was a holiday and the shop was running a promotion; the algorithm predicted sales of 7889.6 for that day.


Random Forest Model Results


4. Summary

I have briefly explained the core concepts of the R data frame in the context of machine learning algorithms, based on my experience in Kaggle competitions.  This article gives a quick head start in data analytics.  To learn more and gain expertise, I would suggest reading the books R Cookbook [2] and Machine Learning with R [3].  Hope this helps.

5. References

[1] R Data Frame Documentation – https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html

[2] R Cookbook – O’Reilly Media

[3] Machine Learning with R – https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition

[4] Download R – https://cran.rstudio.com

[5] Download R Studio – www.rstudio.com/products/rstudio/download/


File Handling in Amazon S3 with Python Boto Library

Understand Python Boto library for standard S3 workflows.

1. Introduction

Amazon Web Services (AWS) Simple Storage Service (S3) is a storage-as-a-service offering from Amazon.  It is a general-purpose object store in which objects are grouped under namespaces called “buckets”.  Bucket names are unique across all of AWS S3.

The Boto library is the official Python SDK for developing software against AWS.  It provides APIs to work with AWS services like EC2, S3 and others.

In this article we will focus on how to use Amazon S3 for regular file handling operations using Python and the Boto library.

2. Amazon S3 & Work Flows

In Amazon S3, the user first has to create a bucket.  The bucket is a namespace with a name that is unique across AWS.  Users can set access privileges on it based on their requirements.  Buckets contain objects.  Each object is stored as a key-value pair, where the key is the identifier used to operate on the object.  The key must be unique inside the bucket.  The object can be of any type: strings, integers, JSON, text files, sequence files, binary files, pictures and videos.  To understand more about Amazon S3, refer to the Amazon documentation [2].

The following operations make up the possible work flows in Amazon S3:

  • Create a Bucket
  • Upload a file to a bucket
  • List the contents of a bucket
  • Download a file from a bucket
  • Move files across buckets
  • Delete a file from a bucket
  • Delete a bucket
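
To make the bucket and key model concrete before touching AWS, here is a small in-memory sketch of the work flow listed above.  It is only a toy: the ToyObjectStore class and its method names are invented for illustration, it never contacts S3, and it ignores access privileges entirely.

```python
class ToyObjectStore(object):
    """In-memory stand-in for the bucket/key model; it never contacts S3."""

    def __init__(self):
        self._buckets = {}                      # bucket name -> {key: value}

    def create_bucket(self, name):
        if name in self._buckets:
            raise ValueError("bucket name already taken: %s" % name)
        self._buckets[name] = {}

    def upload(self, bucket, key, value):
        self._buckets[bucket][key] = value      # keys are unique inside a bucket

    def list_keys(self, bucket):
        return sorted(self._buckets[bucket])

    def download(self, bucket, key):
        return self._buckets[bucket][key]

    def move(self, src, dst, key):
        # A move is a copy into the destination followed by a delete at the source.
        self._buckets[dst][key] = self._buckets[src][key]
        del self._buckets[src][key]

    def delete(self, bucket, key):
        del self._buckets[bucket][key]

    def delete_bucket(self, name):
        if self._buckets[name]:
            raise ValueError("bucket is not empty: %s" % name)
        del self._buckets[name]


store = ToyObjectStore()
store.create_bucket("mybucket001")
store.create_bucket("mybucket002")
store.upload("mybucket001", "abc.txt", b"hello")
store.move("mybucket001", "mybucket002", "abc.txt")
print(store.list_keys("mybucket002"))
```

The dictionary of dictionaries mirrors the two levels of naming: a bucket name selects a namespace, and a key selects one object inside it.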

3. Python Boto Library

The boto library supports Python 2.7; work on Python 3.x support is ongoing.  The code snippets in this article were developed using boto v2.x.  To install the boto library, the pip command can be used as below:

pip install -U boto


In the code snippets below, I have used the connect_s3() API, passing the access credentials as arguments.  This returns the connection object to work with.  If you don’t want to hard-code the access credentials in your program, there are other ways to supply them.  One is to set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.  The other is to create a file named “credentials” under the .aws directory in the user’s home directory.  The file should contain the following:

File Name : ~/.aws/credentials

[default]
aws_access_key_id = ACCESS_KEY
aws_secret_access_key = SECRET_KEY
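
The two alternatives can be illustrated with a short sketch that looks for the keys in the environment first and then in the credentials file.  This is not boto's own code, and the lookup order is an assumption made for illustration; the function name load_aws_credentials, its parameters and the "default" section name are likewise hypothetical choices.

```python
import os
try:
    import configparser                      # Python 3
except ImportError:
    import ConfigParser as configparser      # Python 2.7


def load_aws_credentials(path=None, profile="default", environ=None):
    """Return (access_key, secret_key), or (None, None) if nothing is found.

    Hypothetical helper: environment variables are checked first, then the
    INI-style credentials file.
    """
    environ = os.environ if environ is None else environ
    key_id = environ.get("AWS_ACCESS_KEY_ID")
    secret = environ.get("AWS_SECRET_ACCESS_KEY")
    if key_id and secret:
        return key_id, secret

    path = path or os.path.expanduser("~/.aws/credentials")
    parser = configparser.RawConfigParser()
    parser.read(path)                        # a missing file is silently skipped
    if (parser.has_section(profile)
            and parser.has_option(profile, "aws_access_key_id")
            and parser.has_option(profile, "aws_secret_access_key")):
        return (parser.get(profile, "aws_access_key_id"),
                parser.get(profile, "aws_secret_access_key"))
    return None, None
```

The returned pair could then be passed to boto.connect_s3() in place of the hard-coded strings used in the snippets of this article.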

4. S3 Work flow Automation

4.1 Create a bucket

Before any other operation on S3, a bucket must be created.  The create_bucket() api of the connection object does this.  The bucket is the namespace under which all the objects of the user are stored.

import boto

keyId = "your_aws_key_id"
sKeyId = "your_aws_secret_key_id"
#Connect to S3 with access credentials 
conn = boto.connect_s3(keyId,sKeyId)  

#Create the bucket in a specific region.
bucket = conn.create_bucket('mybucket001',location='us-west-2')

In the create_bucket() api, the bucket name (‘mybucket001’) is the mandatory parameter.  The location is an optional parameter; if it is not given, the bucket is created in the default location.

The create_bucket() call might throw an error if a bucket with the same name already exists, because bucket names are unique across the system.  The naming convention for buckets depends on the rules enforced by the AWS region.  Generally, bucket names must be in lower case.
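
One way to avoid the error for a bucket you already own is to look the bucket up first and create it only when the lookup finds nothing.  The sketch below assumes conn behaves like a boto 2 S3 connection, whose lookup() returns None when the bucket cannot be found; the helper name get_or_create_bucket is hypothetical.

```python
def get_or_create_bucket(conn, name, location=""):
    """Return the bucket called `name`, creating it only if lookup finds none.

    Hypothetical helper; `conn` is expected to behave like a boto 2 S3
    connection.  An empty `location` means the default location.
    """
    bucket = conn.lookup(name)           # None when the bucket cannot be found
    if bucket is None:
        bucket = conn.create_bucket(name, location=location)
    return bucket
```

If the name is already taken by another AWS user, the lookup still comes back empty and create_bucket() raises its error, so the caller should be prepared to pick a different name.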

4.2 Upload a file

To upload a file into S3, we can use the set_contents_from_file() api of the Key object.  The Key object is created for a given bucket object.

import boto
from boto.s3.key import Key

keyId = "your_aws_key_id"
sKeyId= "your_aws_secret_key_id"
fileName = "abc.txt"          #Name of the local file to upload
bucketName = "mybucket001"    #Name of an existing bucket

conn = boto.connect_s3(keyId,sKeyId)
bucket = conn.get_bucket(bucketName)
#Get the Key object of the bucket
k = Key(bucket)
#Create a new key with id as the name of the file
k.key = fileName
#Open the file in binary mode and upload it
with open(fileName, 'rb') as fp:
    result = k.set_contents_from_file(fp)
#result contains the size of the file uploaded
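
The same steps can be wrapped in a small reusable function.  This is a sketch rather than the only way to do it: it assumes bucket behaves like a boto 2 bucket object (new_key() and set_contents_from_filename() are boto 2 calls), and the helper name upload_file and its default of using the file's base name as the key are hypothetical choices.

```python
import os


def upload_file(bucket, path, key_name=None):
    """Upload the local file at `path`; the key defaults to the file's base name.

    Hypothetical helper; `bucket` is expected to behave like a boto 2 bucket.
    """
    key = bucket.new_key(key_name or os.path.basename(path))
    key.set_contents_from_filename(path)     # opens and closes the file itself
    return key
```

Passing an explicit key_name such as "backup/abc.txt" stores the object under a different key than the local file name.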

4.3 Download a file

To download a file, we can use the get_contents_to_file() api of the Key object.

import boto
from boto.s3.key import Key

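keyId ="your_aws_key_id"
sKeyId = "your_aws_secret_key_id"
srcFileName = "abc.txt"      #Key of the object to download
dstFileName = "abc.txt"      #Local file to write the contents to
bucketName = "mybucket001"   #Name of the bucket, where the file resides

conn = boto.connect_s3(keyId,sKeyId)
bucket = conn.get_bucket(bucketName)

#Get the Key object of the given key, in the bucket
k = Key(bucket,srcFileName)

#Get the contents of the key into a file 
with open(dstFileName, 'wb') as fp:
    k.get_contents_to_file(fp)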

4.4 Move a file from one bucket to another

A file is moved from one bucket to another by copying the object into the destination bucket and then deleting it from the source bucket.  The copy_key() api of the bucket object copies an object from a given source bucket into the bucket on which it is called.

import boto

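keyId = "your_aws_access_key_id"
sKeyId = "your_aws_secret_key_id"

conn = boto.connect_s3(keyId,sKeyId)
srcBucket = conn.get_bucket('mybucket001')   #Source Bucket Object
dstBucket = conn.get_bucket('mybucket002')   #Destination Bucket Object
fileName = "abc.txt"
#Call the copy_key() from destination bucket
dstBucket.copy_key(fileName, srcBucket.name, fileName)
#Delete the original from the source bucket to complete the move
srcBucket.delete_key(fileName)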
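
When several files have to be moved at once, the copy and delete steps can be repeated for every key that shares a prefix.  The sketch below assumes both arguments behave like boto 2 bucket objects (list(), copy_key() and delete_key() are boto 2 calls); the helper name move_prefix is hypothetical.

```python
def move_prefix(src_bucket, dst_bucket, prefix=""):
    """Move every key starting with `prefix` from src_bucket to dst_bucket.

    Hypothetical helper; both buckets are expected to behave like boto 2
    bucket objects.  Returns the list of key names that were moved.
    """
    # Collect the names first so the listing is not changed while iterating.
    names = [key.name for key in src_bucket.list(prefix=prefix)]
    for name in names:
        dst_bucket.copy_key(name, src_bucket.name, name)
        # The source is removed only after the copy call returned without raising.
        src_bucket.delete_key(name)
    return names
```

Calling move_prefix(srcBucket, dstBucket, "logs/") would move only the keys whose names begin with "logs/"; the default empty prefix moves everything.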

4.5 Delete a file

To delete a file inside a bucket, we have to retrieve the Key object of the file and call its delete() API.  The Key object can be obtained by calling Key() with the bucket object and the object name.

import boto
from boto.s3.key import Key

keyId = "your_aws_access_key"
sKeyId = "your_aws_secret_key"
srcFileName="abc.txt"      #Name of the file to be deleted
bucketName="mybucket001"   #Name of the bucket, where the file resides

conn = boto.connect_s3(keyId,sKeyId)   #Connect to S3
bucket = conn.get_bucket(bucketName)   #Get the bucket object

k = Key(bucket,srcFileName)            #Get the key of the given object

k.delete()                             #Delete the object

4.6 Delete a bucket

The delete_bucket() api of the connection object deletes the bucket whose name is passed as the parameter.

import boto

keyId = "your_aws_access_key_id"
sKeyId= "your_aws_secret_key_id"
conn = boto.connect_s3(keyId,sKeyId)
conn.delete_bucket('mybucket002')

The delete_bucket() call will fail, if there are objects inside the bucket.

4.7 Empty a bucket

Emptying a bucket can be achieved by deleting all the objects inside the bucket.  The list() api of the bucket object (bucket.list()) provides all the objects inside the bucket.  By calling the delete() api on each of those objects, we can delete them.

import boto

keyId = "your_aws_access_key_id"
sKeyId= "your_aws_secret_key_id"
bucketName = "mybucket001"               #Name of the bucket to be emptied

conn = boto.connect_s3(keyId,sKeyId)     #Connect to S3
bucket = conn.get_bucket(bucketName)     #Get the bucket Object

for i in bucket.list():
    i.delete()                           #Delete the object
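
Since delete_bucket() fails on a bucket that still holds objects, emptying and deleting are often done together.  The sketch below combines the two steps and removes the objects in one batched call instead of one call per object.  It assumes conn behaves like a boto 2 S3 connection and uses the boto 2 delete_keys() call, whose result exposes deleted and errors lists; the helper name empty_and_delete_bucket is hypothetical.

```python
def empty_and_delete_bucket(conn, bucket_name):
    """Delete every object in the bucket, then delete the bucket itself.

    Hypothetical helper; `conn` is expected to behave like a boto 2 S3
    connection.  Returns the number of objects that were removed.
    """
    bucket = conn.get_bucket(bucket_name)
    names = [key.name for key in bucket.list()]
    if names:
        result = bucket.delete_keys(names)       # batched delete request
        if result.errors:
            raise RuntimeError("%d objects could not be deleted"
                               % len(result.errors))
    conn.delete_bucket(bucket_name)              # succeeds only on an empty bucket
    return len(names)
```

Raising before delete_bucket() is called keeps the bucket in place whenever some objects survive the batch delete.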

4.8 List All Buckets

The get_all_buckets() api of the connection object returns a list of all buckets for the user.  This can be used to check the existence of a bucket after you have created or deleted it.

import boto

keyId = "your_aws_access_key_id"
sKeyId= "your_aws_secret_key_id"

conn = boto.connect_s3(keyId,sKeyId)      #Connect to S3
buckets = conn.get_all_buckets()          #Get the bucket list
for i in buckets:
    print(i.name)                         #Print the name of each bucket
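
The validation mentioned above can be reduced to a one-line check on the returned list.  The sketch assumes conn behaves like a boto 2 S3 connection; the helper name bucket_exists is hypothetical, and it only sees buckets that belong to the connected user.

```python
def bucket_exists(conn, name):
    """Return True if the connected user owns a bucket called `name`.

    Hypothetical helper; `conn` is expected to behave like a boto 2 S3
    connection.
    """
    return any(bucket.name == name for bucket in conn.get_all_buckets())
```

For example, bucket_exists(conn, 'mybucket002') should be False once delete_bucket('mybucket002') has succeeded.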

5 Summary

The boto library provides connection, bucket and key objects, which closely mirror the design of S3.  By understanding the methods of these objects, we can perform all of the operations described above on S3 using the boto library.

Hope this helps.

6. References

[1] Boto S3 API Documentation – http://boto.cloudhackers.com/en/latest/ref/s3.html

[2] Amazon S3 Documentation – https://aws.amazon.com/documentation/s3/