This article introduces R's high-level data structure, the data frame, and shows how it is used when programming with predictive machine learning algorithms.
The R data frame is a high-level data structure that is equivalent to a table in a database system: each column holds one variable and each row holds one record. This makes it a flexible, easy-to-use container for the data that machine learning algorithms consume.
The R documentation describes data frames as “tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software.”
2. CRUD Operations
2.1.1 Creating a new Data Frame
The data.frame() function creates a new data frame. Called without arguments, it returns an empty data frame object, which can later be populated with rows and columns.
dataFrame <- data.frame()
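In practice the columns are often supplied when the data frame is created. A minimal sketch, assuming hypothetical column names and values chosen only for illustration:

```r
# An empty data frame has zero rows and zero columns
emptyFrame <- data.frame()
dim(emptyFrame)

# Hypothetical columns and values, for illustration only
shops <- data.frame(
  ShopId = c(100, 101, 102),
  City   = c("Pune", "Delhi", "Chennai"),
  stringsAsFactors = FALSE  # keep City as character on older R versions
)

nrow(shops)   # number of rows
ncol(shops)   # number of columns
names(shops)  # column names
```

Each named argument becomes one column, and all columns must have compatible lengths.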
2.1.2 Creating a DataFrame from a CSV file
This is the most common way of creating a data frame when working on machine learning problems. The read.csv() function creates a new data frame and loads it with the contents of the CSV file. By default, the columns are named from the first row of the CSV file. Note that backslashes in a Windows path must be escaped inside an R string.
dataFrame <- read.csv("C:\\DATA\\training.csv")
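To try read.csv() without a file on disk, the sketch below first writes a small CSV to a temporary location. The first data row reuses the day 1 figures from the sales example later in this article; the second row and the reduced set of columns are made up for illustration.

```r
# Write a small CSV to a temporary file so the example is self-contained
csvPath <- tempfile(fileext = ".csv")
writeLines(c("Day,Customers,Sales",
             "1,55,5488",
             "2,60,6000"),   # second row is invented
           csvPath)

# header = TRUE is the default: the first row supplies the column names
dataFrame <- read.csv(csvPath)

names(dataFrame)
nrow(dataFrame)
```

On Windows, forward slashes such as "C:/DATA/training.csv" are also accepted and avoid the need to escape backslashes.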
2.1.3 Creating a CSV file from Data Frame
The write.csv() function persists the contents of a data frame to a CSV file. In a machine learning problem, once the predictions have been computed we usually want to save them to storage, and write.csv() meets that requirement.
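A minimal sketch of persisting a data frame, assuming a hypothetical results data frame and a temporary output location rather than a real project path:

```r
# Hypothetical prediction results, for illustration only
results <- data.frame(Day = c(1, 2), Sales = c(5488.25, 6010.5))

# Assumed output location inside the session's temporary directory
outPath <- file.path(tempdir(), "results.csv")
write.csv(results, outPath)

file.exists(outPath)  # confirm the file was created
readLines(outPath)    # inspect what was written: header plus one line per row
```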
2.1.4 Creating a CSV file from Data Frame without row names
By default, write.csv() adds a column of row ids to the output file. If you don’t want it, set row.names to FALSE.
write.csv(dataFrame, "C:\\Data\\NewTraining.csv", row.names = FALSE)
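The effect of row.names is easiest to see by comparing the header line of the two outputs. A small sketch with made-up values, writing to temporary files:

```r
df <- data.frame(Day = c(1, 2), Sales = c(100, 200))  # made-up values

withIds    <- tempfile(fileext = ".csv")
withoutIds <- tempfile(fileext = ".csv")

write.csv(df, withIds)                        # default: row ids included
write.csv(df, withoutIds, row.names = FALSE)  # row ids suppressed

readLines(withIds)[1]     # header starts with an empty name for the id column
readLines(withoutIds)[1]  # header contains only the real column names
```

Suppressing the row ids matters when the file is read back later, because otherwise the ids come back as an extra unnamed column.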
2.1.5 Create a vector from a Data Frame
Each column of a data frame is a vector, so a new vector can be created directly from a column.
newVector <- dataFrame$ColumnName
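Once extracted, the column behaves like any other vector. A short sketch with invented values, showing the $ form next to the equivalent [[ ]] form:

```r
dataFrame <- data.frame(Day = 1:3, Sales = c(100, 200, 300))  # made-up values

salesVector <- dataFrame$Sales       # extract by name with $
sameVector  <- dataFrame[["Sales"]]  # equivalent, with the name as a string

is.vector(salesVector)
mean(salesVector)  # ordinary vector functions apply
```

The [[ ]] form is handy when the column name is held in a variable.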
2.2.1 The str() API
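The str() function displays the structure of the data frame in a compact form: the number of rows and columns and, for each column, its name, data type and first few values.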
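A small sketch of str() on an invented data frame, alongside head(), which prints the first rows as a table:

```r
# Invented values, for illustration only
dataFrame <- data.frame(
  Day   = 1:3,
  Promo = c("Yes", "No", "No"),
  stringsAsFactors = FALSE
)

str(dataFrame)      # one line per column: name, type, first values
head(dataFrame, 2)  # tabular preview of the first two rows
```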
2.2.2 Filter the DataFrame Based on the Value set for column
In the following example, we select only the records where ShopId is 100. Other relational operators can be used as well, and conditions can be combined with logical operators.
newFrame <- dataFrame[dataFrame$ShopId == '100',]
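Combining conditions could look like the sketch below. The Sales column and all values are hypothetical; & means "and", | means "or".

```r
# Hypothetical data, for illustration only
dataFrame <- data.frame(
  ShopId = c(100, 100, 200, 100),
  Sales  = c(500, 1500, 2500, 3000)
)

# Rows for shop 100 with sales above 1000
bigSales <- dataFrame[dataFrame$ShopId == 100 & dataFrame$Sales > 1000, ]

# Rows for shop 200, or any row with sales below 1000
either <- dataFrame[dataFrame$ShopId == 200 | dataFrame$Sales < 1000, ]

bigSales$Sales
either$Sales
```

The trailing comma inside the brackets is required: the condition selects rows, and the empty position after the comma keeps all columns.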
2.2.3 Sort the DataFrame on a column
The code snippet below sorts dataFrame by its Date column and stores the result in a new data frame object.
newFrame <- dataFrame[order(dataFrame$Date),]
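A sketch of ascending and descending sorts on invented data. It assumes the Date column has been converted with as.Date(); dates left as plain text are sorted alphabetically, which may not match calendar order.

```r
# Invented values, for illustration only
dataFrame <- data.frame(
  Date  = as.Date(c("2015-03-02", "2015-03-01", "2015-03-03")),
  Sales = c(200, 100, 300)
)

# Ascending by Date (the default direction)
ascending <- dataFrame[order(dataFrame$Date), ]

# Descending by Sales
descending <- dataFrame[order(dataFrame$Sales, decreasing = TRUE), ]

ascending$Sales
descending$Sales
```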
2.3.1 Adding Rows
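The rbind() function appends a new row to an existing data frame. It throws an error if the row does not have the same columns as the data frame. newRow is typically a one-row data frame or a list with matching columns; a plain vector is safe only when every column has the same data type, since a vector cannot hold mixed types.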
NewDataFrame <- rbind(dataFrame,newRow)
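A minimal sketch of appending a row, using a one-row data frame so that each column keeps its own type. The column names and values are hypothetical.

```r
# Hypothetical data, for illustration only
dataFrame <- data.frame(Day = c(1, 2), Promo = c("No", "Yes"),
                        stringsAsFactors = FALSE)

# The new row has the same column names and types
newRow <- data.frame(Day = 3, Promo = "No", stringsAsFactors = FALSE)
newDataFrame <- rbind(dataFrame, newRow)
nrow(newDataFrame)

# A row with a missing column is rejected with an error
failed <- tryCatch({
  rbind(dataFrame, data.frame(Day = 4))
  FALSE
}, error = function(e) TRUE)
failed
```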
2.3.2 Adding Columns
The cbind() function appends a new column to an existing data frame. It throws an error if the length of newColumn is incompatible with the number of rows in the data frame. newColumn is typically a vector.
newDataFrame <- cbind(dataFrame,newColumn)
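A short sketch with invented values. When the vector is passed without a name, the new column is named after the variable; passing it as a named argument sets the column name explicitly.

```r
dataFrame <- data.frame(Day = c(1, 2, 3))  # invented values
newColumn <- c(55, 60, 48)                 # one value per existing row

# Column is named after the variable
newDataFrame <- cbind(dataFrame, newColumn)
names(newDataFrame)

# Column name chosen explicitly
named <- cbind(dataFrame, Customers = newColumn)
names(named)
```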
2.3.3 Dynamically Adding Column
The following is an alternative way to add a new column to an existing data frame from a vector.
dataFrame$newColumn <- newVector
newColumn is appended to dataFrame, populated with the values of newVector.
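The same assignment form is often used to derive a column from existing ones. A sketch with hypothetical column names and values:

```r
# Hypothetical data, for illustration only
dataFrame <- data.frame(Customers = c(50, 100), Sales = c(5000, 12000))

# Derived column computed element by element from two existing columns
dataFrame$SalesPerCustomer <- dataFrame$Sales / dataFrame$Customers

# A single value is recycled to every row
dataFrame$Region <- "North"

dataFrame$SalesPerCustomer
names(dataFrame)
```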
The standard rm() function deletes the dataFrame object from the workspace.
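rm() removes the whole object. Dropping a single column or row is not covered above; the sketch below shows one common way to do each, on invented data, before removing the data frame itself.

```r
dataFrame <- data.frame(Day = 1:3, Sales = c(100, 200, 300))  # invented values

# Drop one column by assigning NULL to it
dataFrame$Sales <- NULL
remainingColumns <- names(dataFrame)

# Drop the first row with a negative index (drop = FALSE keeps a data frame)
dataFrame <- dataFrame[-1, , drop = FALSE]
remainingRows <- nrow(dataFrame)

# Remove the object itself from the workspace
rm(dataFrame)
exists("dataFrame")
```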
3. Prediction Models with DataFrame
We will take a simple sales prediction problem, in which a shop wants to predict its expected sales based on the history of the past 10 days. Most R scripts of this kind follow a general format: 1) load the training data, 2) load the test data, 3) build the model with the training data, 4) predict on the test data with the model and 5) write the predicted values to storage.
3.1 Training Data
The prediction model is built from training data: a machine learning algorithm is run on the training data to produce a model, which we can then use to make predictions.
We have the training data in a CSV file with the following columns: Day specifies the number of the day, Customers specifies the total number of customers who visited the shop, Promo specifies whether the shop ran a promotion on that day, Holiday specifies whether it is a holiday in that state, and Sales specifies the amount of sales on that day.
For example, the first row says that on day 1, 55 customers visited the shop and sales were 5488. It was a regular weekday with no promotion or holiday.
3.2 Test Data
The test data is also a CSV file, for which we have to predict the sales. It contains the same set of columns, except Sales. For example, we have to predict the sales for day 1, when 53 customers visited and it was a regular working day without a promotion or a holiday.
3.3 Linear Regression Model
# Create a data frame from Sales.txt to train the model
trainingData <- read.csv("C:\\Data\\Sales.txt")

# Create a data frame from Test.txt, for which the predictions are to be computed
testData <- read.csv("C:\\Data\\Test.txt")

# Build the linear regression model using the glm() function
# This model predicts Sales using Day, Customers, Promo and Holiday
Model <- glm(Sales ~ Day + Customers + Promo + Holiday, data = trainingData)

# Apply the model to the test data with predict()
# It returns the predicted sales in a vector
predictSales <- predict(Model, testData)

# Round off the predicted sales values
predictSales <- round(predictSales, digits = 2)

# Create a new column in the testData data frame with the predicted Sales values
testData$Sales <- predictSales

# Write the result to a file
write.csv(testData, file = "C:\\Result.txt", row.names = FALSE)
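The script above depends on files that are not included here. The sketch below runs the same steps on synthetic data, invented purely for illustration, with 0/1 coding assumed for Promo and Holiday. It also shows that glm() with its default gaussian family fits an ordinary linear regression, giving the same coefficients as lm().

```r
# Synthetic training data, invented purely for illustration
trainingData <- data.frame(
  Day       = 1:10,
  Customers = c(55, 60, 48, 70, 79, 52, 58, 66, 50, 62),
  Promo     = c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0),
  Holiday   = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1)
)
# Made-up sales: driven by customers and promotions, plus fixed noise
noise <- c(12, -30, 25, -8, 40, -15, 5, -22, 18, -10)
trainingData$Sales <- 100 * trainingData$Customers +
  500 * trainingData$Promo + noise

# Default family is gaussian, i.e. ordinary linear regression
glmModel <- glm(Sales ~ Day + Customers + Promo + Holiday, data = trainingData)
lmModel  <- lm(Sales ~ Day + Customers + Promo + Holiday, data = trainingData)
all.equal(coef(glmModel), coef(lmModel))

# One test row: day 1, 53 customers, no promotion, no holiday
testData <- data.frame(Day = 1, Customers = 53, Promo = 0, Holiday = 0)
predictSales <- round(predict(glmModel, testData), digits = 2)
testData$Sales <- predictSales
testData
```

summary(glmModel) prints the fitted coefficients and is a quick way to check which inputs the model relies on.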
3.4 Predicted Results
The results are now in a format similar to our Sales.txt: the file contains all the fields of Test.txt plus the Sales values predicted by the algorithm.
3.5 Random Forest Model
To run the Random Forest algorithm, the randomForest package must be installed in R. It can be installed from CRAN with install.packages("randomForest").
Below is the R script to predict the sales using the Random Forest algorithm:
# Load the randomForest library
library(randomForest)

# Create a data frame from the sales data to train the model
trgData <- read.csv("C:\\Data\\Sales.txt")

# Create a data frame from the test data, for which the sales are to be predicted
testData <- read.csv("C:\\Data\\Test.txt")

# Build the model using the Random Forest algorithm
# This model predicts Sales using Day, Customers, Promo and Holiday
Model <- randomForest(Sales ~ Day + Customers + Promo + Holiday, data = trgData)

# Predict the sales for the test data using the above model
# predictSales is a vector which contains the predicted values
predictSales <- predict(Model, testData)

# Round off the sales numbers
predictSales <- round(predictSales, digits = 2)

# Add a column to the test data and copy in the predicted sales values
testData$Sales <- predictSales

# Create an output file from the computed values
write.csv(testData, file = "C:\\Data\\Result.txt", row.names = FALSE)
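A random forest is built from random samples of the training data, so two runs of the script above can give slightly different predictions. A sketch of making the result repeatable with set.seed(), on synthetic data invented for illustration. It assumes the randomForest package is installed and does nothing otherwise; fitAndPredict is a hypothetical helper and ntree = 100 is an arbitrary choice.

```r
# Runs only if the randomForest package is installed
if (requireNamespace("randomForest", quietly = TRUE)) {
  # Synthetic data, invented purely for illustration
  trgData <- data.frame(
    Day       = 1:10,
    Customers = c(55, 60, 48, 70, 79, 52, 58, 66, 50, 62),
    Promo     = c(0, 0, 1, 0, 1, 0, 0, 1, 0, 0),
    Holiday   = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1)
  )
  trgData$Sales <- 100 * trgData$Customers + 500 * trgData$Promo
  testData <- data.frame(Day = 1, Customers = 53, Promo = 0, Holiday = 0)

  # Hypothetical helper: fix the seed, fit the forest, predict one row
  fitAndPredict <- function(seed) {
    set.seed(seed)
    model <- randomForest::randomForest(
      Sales ~ Day + Customers + Promo + Holiday,
      data = trgData, ntree = 100
    )
    predict(model, testData)
  }

  firstRun  <- fitAndPredict(42)
  secondRun <- fitAndPredict(42)
  identical(firstRun, secondRun)  # same seed, same prediction
}
```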
3.6 Predicted Output (Random Forest)
The results for Test.txt now include a predicted Sales column for each day. For example, on day 5, when 79 customers visited the shop, it was a holiday and the shop was running a promotion, the algorithm predicted sales of 7889.6.
I have briefly explained the core concepts of the R data frame in the context of machine learning algorithms, based on my experience in Kaggle competitions. This article should give you a quick head start in data analytics. To learn more and build expertise, I would suggest reading the books R Cookbook and Machine Learning with R. Hope this helps.
 R Data Frame Documentation – https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html
 R Cookbook – O'Reilly Media
 Machine Learning with R – https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-r-second-edition
 Download R – https://cran.rstudio.com
 Download R Studio – www.rstudio.com/products/rstudio/download/