Working with Data Frames in R (Part II)

In the first blog on data frames, we used existing data frames and saw how to access, filter and sort the members. However, many times you may need to create a new data frame and work with the same.

Creating a new Data Frame

As you know that the data frames is a list of vectors of equal lengths, let’s create few relevant vectors. Following commands create students data frames, students, with the column names grade, city and gender:


In above example, you just created a data frame using the statement data.frame. All though you used three separate vectors, they are kind of tightly related to form details for students. In our day-to-day life we come across such data quite often and hence data frame is one of the most important data structures in R.

Following is the complete syntax for data.frame function:

row.names = NULL, 
check.rows = FALSE,
           check.names = TRUE, 
fix.empty.names = TRUE,
             stringsAsFactors = default.stringsAsFactors())

Here, first you provide the list of equal length vectors that you want to join to form the data frame.

  • is null by default. However, you can provide the name of equal length as the column vectors.
  • Check.rows – allows you to ensure consistency of length and names
  • Check.names – allows you to ensure that names of the variables are syntactically correct and unique.
    • In the earlier section you might have noticed that when you added a duplicate entry, the row name was adjusted to make it unique
  • Fix.empty.names – allows you to mention automatically constructed names instead of empty name, when the name is not mentioned


Naming the rows

While you can specify the names of rows using the option of data.frame function, you don’t need to worry, if you didn’t do so at the time of data frame creation.

You can make use of rownames function at assign row names vector to a given data frame. Following example shows how to do this:


Adding a new column

Using function, you can add a new column vector in the data frame. Following example shows how to do this:


You can also make use of cbind function to add columns. For example, you can make use of sample function to generate some random age for the students and make use of the cbind function to add the new column into the students data frame.


Adding columns without recreating the data frame all over

While cbind allows you to add a new column, essentially, this approach also indicates that you are actually recreating a data frame with an additional column. It will be good, if you modify the same data frame and just add an additional column. To demonstrate this, we will copy an inbuilt data frame, women and add a new column called BMI, which uses the existing height and weight data to come up with a calculated value, as shown in below example:


Also, note that by default the numbers are considered to be upto 6-fractional digits. Hence, we have applied round function explicitly and tried to keep it more readable by putting the digits value of 2.

Adding a new row

Using the rbind you can add a new row in the data frame:


Note that by default rbind didn’t give any rowname for the new record. We have used the square bracket notation to access the row name and replace that value with the desired value.

New row with row name

In the previous example, the we didn’t explicitly gave a name to the list and hence R generated a default name. However, by assigning a name to the list, we can ensure that the same name becomes the row name for the new record. Following example demonstrates the same:


Creating a new Data Frame from a Matrix

Sometimes you may need to create a data frame from a matrix. The easiest way to do so is using the function.

Following example creates a 5 x 6 matrix of random numbers between 1 to 100 and converts that matrix into a data frame:


Note that function has built-in row naming and column naming capability. In above example, when we didn’t give the column name, it used V1, V2, … etc.

Deleting rows

Sometimes we may have a need for deleting few records. For example, let’s take a look at the students record, lets delete or remove the 2nd row:

Using the minus and index number for the given set of row, you can retrieve the new data frame, which will exclude those specific records and return rest of the records.


Updating the data elements

Earlier in this article, we learned how to access a specific row, column or a given element. Depending on your need, you can do one of the following to update elements in data frames

  • Assign a specific value to a given cell
  • Assign a vector to a given column
  • Assign a vector to a given set of rows for a given column
  • Assign a list to a given set of columns for a given row

Below example shows how you can update a specific cell of a data frame using the index notation:


You may encounter the below warning when doing the assignment and this is due to R creating a factor of the values. This can be resolved by converting to the appropriate column type as.character for below example.


You could have used following statements as well to achieve the similar results:

> students[8, ‘city’] <- ‘Maldives’

> students[‘Samantha’, ‘city’] <- ‘Maldives’

Following example shows how you can modify a set of columns of a given row by using the combine function and the list of data values:


And, below example shows how you can modify the whole column of a given data set:


In above example, a random gender has been assigned to the different students using the sample function(though you should not be doing this for gender in the real world scenario but using this example for ease of sample data generation).

Deleting a Column

By assigning a NULL value to a column, you can remove that column from the data frame. Following example demonstrates this case, where the age column has been deleted:


Alternately, you can use the same result by making use of the negative indexing, as shown in the below example:


Importing and exporting data frames

Exporting Data Frames

You can export an R-objects into various formats. In this section, I will show you how to export data frame in the .xlsx format and .csv format.

Exporting in CSV format

Use following commands to ensure that you are in the right directory.

  • > getwd() — to get the current working directory
  • > setwd(‘/users/abhilasha/R’) — to set the present working directory

You can run following command to create a CSV file:

> write.table(students, “./studentdata.csv”, sep=”,”)

For the students data being used in this article, you can see following output:


Exporting Data in XLSX format

You would be required to follow below steps:


write.xlsx(mydata,"D:/dummy.xlsx",sheetName = "Newdata")

If you get an error like below, you will need to check the Java version(32 bit  or 64 bit- it should be same as R version).


Also check the JAVA_HOME system variable set properly. You can set it explicitly as below to proceed.

Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre1.8.0_144')
write.xlsx(students,"D:/dummy.xlsx",sheetName = "Newdata")

The xlsx file is saved as below.


Importing Data Frames

You can import data from various data formats into R. This section shows how to import data from a csv file and xlsx file. However, you can use similar approach to read data from any format.

Importing data from csv file

In the previous section, we exported student data in a file called studentdata.csv. Let’s import the same data again in a data frame called students.csv. Following is the output on the console:

> students.csv <- read.csv('./studentdata.csv')
> students.csv
 grade gender cities major age
Jim 8.0 M Hyderabad Math 10
Terry 9.1 F Tokyo Bio 11
Steve 9.0 M Bangalore Phy 11
Veronica 9.2 F Chicago Chem 13
Betty 10.0 F Texas IT 15
Archie 8.8 M Calgary Phy 11
Jughead 8.9 M Minneapolis Math 14
Samantha 8.6 F Maldives IT 15
Joyce 8.8 F Chicago Phy 14

Merging Two Data Frames

The data frame allows you to merge two data frames. In above section we created a data frame called students.csv. Let’s update this data frame to randomize the gender and age.

The changed data look as shown below:

> students.csv[,'age'] <- sample(12:18,9, replace=T )
> students.csv[,'gender'] <- sample(c('M','F'),9, replace=T )
> students.csv
 grade gender cities major age
Jim 8.0 M Hyderabad Math 12
Terry 9.1 F Tokyo Bio 18
Steve 9.0 M Bangalore Phy 17
Veronica 9.2 M Chicago Chem 13
Betty 10.0 F Texas IT 13
Archie 8.8 F Calgary Phy 16
Jughead 8.9 M Minneapolis Math 12
Samantha 8.6 M Maldives IT 15
Joyce 8.8 F Chicago Phy 13
> students
 grade gender cities major age
Jim 8.0 M Hyderabad Math 10
Terry 9.1 F Tokyo Bio 12
Steve 9.0 M Bangalore Phy 11
Veronica 9.2 F Chicago Chem 10
Betty 10.0 F Texas IT 10
Archie 8.8 M Calgary Phy 11
Jughead 8.9 M Minneapolis Math 12
Samantha 8.6 F Chicago IT 15
Joyce 8.8 F Chicago Phy 14

Let’s run following merge command and understand the final outcome:

> students.merge <- merge(students, students.csv)
> students.merge
 grade gender cities major age
1 8.9 M Minneapolis Math 12

Above output looks like an intersection (similar to standard inner join between two tables) of the two data sets.

Let’s look at the syntax of merge to understand various possibilities:

merge( x, y, 

by = intersect(names(x), names(y)), by.x = by, by.y = by, 

all = FALSE, all.x = all, all.y = all,
        sort = TRUE, 

suffixes = c(".x",".y"),

incomparables = NULL, ...)


  • X and Y are data frames
  • By allows you to mention the columns on which you want to merge. The default is intersection between two data sets.
  • A TRUE value for the all attribute means that the records other than what is returned by the intersection will also be returned (full outer join).
    • A false value for the all attribute means standard inner join
    • Thus using all.x and all.y you can achieve left outer join and right outer join kind of logic

Here is an example of the left outer join through R-merge:

> students.merge <- merge(students, students.csv,all.x=TRUE)
> students.merge
 grade gender cities major age
1 8.0 M Hyderabad Math 10
2 8.6 F Chicago IT 15
3 8.8 F Chicago Phy 14
4 8.8 M Calgary Phy 11
5 8.9 M Minneapolis Math 12
6 9.0 M Bangalore Phy 11
7 9.1 F Tokyo Bio 12
8 9.2 F Chicago Chem 10
9 10.0 F Texas IT 10

Working with Missing Data

Generally in the data frame, the missing data will be shown as NA. Many times, you get such data when you import data from somewhere or you have merged two separate data frames. In such cases, it will make sense to fix such data before applying any analytics on the date.

Following built-in functions will be helpful:

  • any

Let’s export the mtcars data and make few mpg value as empty and then import this data again. The first thing you shall do is to check if you really have any data in your data frame, which is invalid. You can do this using the any function, as shown below:

> any(

[1] TRUE

A value of TRUE indicates that we do have at least one element, which is not valid.

Now. Run the command on the new data frame. The output will look like following:


Here a value of TRUE means that we have invalid value (NA) in these cells.

You can also pass the function to the data frame to retrieve a vector consisting of invalid values. For example, following code returns all the NA values in the data frame:

> mtcars.csv[]


Since we know that mpg column vector does have few invalid elements, we can fix these elements by assigning a calculated value (e.g. mean / median, etc) or a fixed value. In the below example, we have used the mean value to assign to the missing values:

> mtcars.csv$mpg[$mpg)] <- mean(mtcars.csv$mpg, na.rm = TRUE)

> mtcars.csv[]

[1] NA


  • While calculating the mean value, you do need to set na.rm value to TRUE, to strip the invalid values. Otherwise, the mean value will also become INVALID.
  • You can use the generic statements like mtcars.csv[]  <- 0 (or whatever value you want), however, you shall pay very close attention as it as effect across the columns

Applying Statistical functions on data frames

You can apply summary function to produce generic summaries of the results of various model fitting functions:

> summary(mtcars)
 mpg cyl disp hp drat wt 
 Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760 Min. :1.513 
 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 
 Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695 Median :3.325 
 Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597 Mean :3.217 
 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 
 Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930 Max. :5.424 
 qsec vs am gear carb 
 Min. :14.50 Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000 
 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000 
 Median :17.71 Median :0.0000 Median :0.0000 Median :4.000 Median :2.000 
 Mean :17.85 Mean :0.4375 Mean :0.4062 Mean :3.688 Mean :2.812 
 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000 
 Max. :22.90 Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :8.000

Similarly you can apply other statistical functions on the columns of the data frames as shown in below examples:

> max(mtcars$mpg)
[1] 33.9
> sd(mtcars$disp)
[1] 123.9387
> mean(mtcars[,'mpg'])
[1] 20.09062

Note that functions like mean or median require numeric data and you must use the notation which will provide the column vector consisting of the numeric data. For example, the single bracket notation to access mpg, may result into data frames and that may not help you in getting the mean or medians.


Read More


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s