In the first blog on data frames, we used existing data frames and saw how to access, filter and sort the members. However, many times you may need to create a new data frame and work with the same.
Creating a new Data Frame
As you know that the data frames is a list of vectors of equal lengths, let’s create few relevant vectors. Following commands create students data frames, students, with the column names grade, city and gender:
In above example, you just created a data frame using the statement data.frame. All though you used three separate vectors, they are kind of tightly related to form details for students. In our day-to-day life we come across such data quite often and hence data frame is one of the most important data structures in R.
Following is the complete syntax for data.frame function:
data.frame(..., row.names = NULL, check.rows = FALSE, check.names = TRUE, fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors())
Here, first you provide the list of equal length vectors that you want to join to form the data frame.
- Row.name is null by default. However, you can provide the name of equal length as the column vectors.
- Check.rows – allows you to ensure consistency of length and names
- Check.names – allows you to ensure that names of the variables are syntactically correct and unique.
- In the earlier section you might have noticed that when you added a duplicate entry, the row name was adjusted to make it unique
- Fix.empty.names – allows you to mention automatically constructed names instead of empty name, when the name is not mentioned
Naming the rows
While you can specify the names of rows using the row.name option of data.frame function, you don’t need to worry, if you didn’t do so at the time of data frame creation.
You can make use of rownames function at assign row names vector to a given data frame. Following example shows how to do this:
Adding a new column
Using cbind.data.frame function, you can add a new column vector in the data frame. Following example shows how to do this:
You can also make use of cbind function to add columns. For example, you can make use of sample function to generate some random age for the students and make use of the cbind function to add the new column into the students data frame.
Adding columns without recreating the data frame all over
While cbind allows you to add a new column, essentially, this approach also indicates that you are actually recreating a data frame with an additional column. It will be good, if you modify the same data frame and just add an additional column. To demonstrate this, we will copy an inbuilt data frame, women and add a new column called BMI, which uses the existing height and weight data to come up with a calculated value, as shown in below example:
Also, note that by default the numbers are considered to be upto 6-fractional digits. Hence, we have applied round function explicitly and tried to keep it more readable by putting the digits value of 2.
Adding a new row
Using the rbind you can add a new row in the data frame:
Note that by default rbind didn’t give any rowname for the new record. We have used the square bracket notation to access the row name and replace that value with the desired value.
New row with row name
In the previous example, the we didn’t explicitly gave a name to the list and hence R generated a default name. However, by assigning a name to the list, we can ensure that the same name becomes the row name for the new record. Following example demonstrates the same:
Creating a new Data Frame from a Matrix
Sometimes you may need to create a data frame from a matrix. The easiest way to do so is using the as.data.frame function.
Following example creates a 5 x 6 matrix of random numbers between 1 to 100 and converts that matrix into a data frame:
Note that as.data.frame function has built-in row naming and column naming capability. In above example, when we didn’t give the column name, it used V1, V2, … etc.
Sometimes we may have a need for deleting few records. For example, let’s take a look at the students record, lets delete or remove the 2nd row:
Using the minus and index number for the given set of row, you can retrieve the new data frame, which will exclude those specific records and return rest of the records.
Updating the data elements
Earlier in this article, we learned how to access a specific row, column or a given element. Depending on your need, you can do one of the following to update elements in data frames
- Assign a specific value to a given cell
- Assign a vector to a given column
- Assign a vector to a given set of rows for a given column
- Assign a list to a given set of columns for a given row
Below example shows how you can update a specific cell of a data frame using the index notation:
You may encounter the below warning when doing the assignment and this is due to R creating a factor of the values. This can be resolved by converting to the appropriate column type as.character for below example.
You could have used following statements as well to achieve the similar results:
> students[8, ‘city’] <- ‘Maldives’
> students[‘Samantha’, ‘city’] <- ‘Maldives’
Following example shows how you can modify a set of columns of a given row by using the combine function and the list of data values:
And, below example shows how you can modify the whole column of a given data set:
In above example, a random gender has been assigned to the different students using the sample function(though you should not be doing this for gender in the real world scenario but using this example for ease of sample data generation).
Deleting a Column
By assigning a NULL value to a column, you can remove that column from the data frame. Following example demonstrates this case, where the age column has been deleted:
Alternately, you can use the same result by making use of the negative indexing, as shown in the below example:
Importing and exporting data frames
Exporting Data Frames
You can export an R-objects into various formats. In this section, I will show you how to export data frame in the .xlsx format and .csv format.
Exporting in CSV format
Use following commands to ensure that you are in the right directory.
- > getwd() — to get the current working directory
- > setwd(‘/users/abhilasha/R’) — to set the present working directory
You can run following command to create a CSV file:
> write.table(students, “./studentdata.csv”, sep=”,”)
For the students data being used in this article, you can see following output:
Exporting Data in XLSX format
You would be required to follow below steps:
install.packages("xlsx") library(xlsx) write.xlsx(mydata,"D:/dummy.xlsx",sheetName = "Newdata")
If you get an error like below, you will need to check the Java version(32 bit or 64 bit- it should be same as R version).
Also check the JAVA_HOME system variable set properly. You can set it explicitly as below to proceed.
Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre1.8.0_144') library(xlsx) write.xlsx(students,"D:/dummy.xlsx",sheetName = "Newdata")
The xlsx file is saved as below.
Importing Data Frames
You can import data from various data formats into R. This section shows how to import data from a csv file and xlsx file. However, you can use similar approach to read data from any format.
Importing data from csv file
In the previous section, we exported student data in a file called studentdata.csv. Let’s import the same data again in a data frame called students.csv. Following is the output on the console:
> students.csv <- read.csv('./studentdata.csv') > students.csv grade gender cities major age Jim 8.0 M Hyderabad Math 10 Terry 9.1 F Tokyo Bio 11 Steve 9.0 M Bangalore Phy 11 Veronica 9.2 F Chicago Chem 13 Betty 10.0 F Texas IT 15 Archie 8.8 M Calgary Phy 11 Jughead 8.9 M Minneapolis Math 14 Samantha 8.6 F Maldives IT 15 Joyce 8.8 F Chicago Phy 14
Merging Two Data Frames
The data frame allows you to merge two data frames. In above section we created a data frame called students.csv. Let’s update this data frame to randomize the gender and age.
The changed data look as shown below:
> students.csv[,'age'] <- sample(12:18,9, replace=T ) > students.csv[,'gender'] <- sample(c('M','F'),9, replace=T ) > students.csv grade gender cities major age Jim 8.0 M Hyderabad Math 12 Terry 9.1 F Tokyo Bio 18 Steve 9.0 M Bangalore Phy 17 Veronica 9.2 M Chicago Chem 13 Betty 10.0 F Texas IT 13 Archie 8.8 F Calgary Phy 16 Jughead 8.9 M Minneapolis Math 12 Samantha 8.6 M Maldives IT 15 Joyce 8.8 F Chicago Phy 13 > students grade gender cities major age Jim 8.0 M Hyderabad Math 10 Terry 9.1 F Tokyo Bio 12 Steve 9.0 M Bangalore Phy 11 Veronica 9.2 F Chicago Chem 10 Betty 10.0 F Texas IT 10 Archie 8.8 M Calgary Phy 11 Jughead 8.9 M Minneapolis Math 12 Samantha 8.6 F Chicago IT 15 Joyce 8.8 F Chicago Phy 14
Let’s run following merge command and understand the final outcome:
> students.merge <- merge(students, students.csv) > students.merge grade gender cities major age 1 8.9 M Minneapolis Math 12
Above output looks like an intersection (similar to standard inner join between two tables) of the two data sets.
Let’s look at the syntax of merge to understand various possibilities:
merge( x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"), incomparables = NULL, ...)
- X and Y are data frames
- By allows you to mention the columns on which you want to merge. The default is intersection between two data sets.
- A TRUE value for the all attribute means that the records other than what is returned by the intersection will also be returned (full outer join).
- A false value for the all attribute means standard inner join
- Thus using all.x and all.y you can achieve left outer join and right outer join kind of logic
Here is an example of the left outer join through R-merge:
> students.merge <- merge(students, students.csv,all.x=TRUE) > students.merge grade gender cities major age 1 8.0 M Hyderabad Math 10 2 8.6 F Chicago IT 15 3 8.8 F Chicago Phy 14 4 8.8 M Calgary Phy 11 5 8.9 M Minneapolis Math 12 6 9.0 M Bangalore Phy 11 7 9.1 F Tokyo Bio 12 8 9.2 F Chicago Chem 10 9 10.0 F Texas IT 10
Working with Missing Data
Generally in the data frame, the missing data will be shown as NA. Many times, you get such data when you import data from somewhere or you have merged two separate data frames. In such cases, it will make sense to fix such data before applying any analytics on the date.
Following built-in functions will be helpful:
Let’s export the mtcars data and make few mpg value as empty and then import this data again. The first thing you shall do is to check if you really have any data in your data frame, which is invalid. You can do this using the any function, as shown below:
> any(is.na(mtcars.csv))  TRUE
A value of TRUE indicates that we do have at least one element, which is not valid.
Now. Run the is.na command on the new data frame. The output will look like following:
Here a value of TRUE means that we have invalid value (NA) in these cells.
You can also pass the is.na function to the data frame to retrieve a vector consisting of invalid values. For example, following code returns all the NA values in the data frame:
> mtcars.csv[is.na(mtcars.csv)]  NA NA NA NA NA
Since we know that mpg column vector does have few invalid elements, we can fix these elements by assigning a calculated value (e.g. mean / median, etc) or a fixed value. In the below example, we have used the mean value to assign to the missing values:
> mtcars.csv$mpg[is.na(mtcars.csv$mpg)] <- mean(mtcars.csv$mpg, na.rm = TRUE) > mtcars.csv[is.na(mtcars.csv)]  NA
- While calculating the mean value, you do need to set na.rm value to TRUE, to strip the invalid values. Otherwise, the mean value will also become INVALID.
- You can use the generic statements like mtcars.csv[is.na(mtcars.csv)] <- 0 (or whatever value you want), however, you shall pay very close attention as it as effect across the columns
Applying Statistical functions on data frames
You can apply summary function to produce generic summaries of the results of various model fitting functions:
> summary(mtcars) mpg cyl disp hp drat wt Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760 Min. :1.513 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581 Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695 Median :3.325 Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597 Mean :3.217 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610 Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930 Max. :5.424 qsec vs am gear carb Min. :14.50 Min. :0.0000 Min. :0.0000 Min. :3.000 Min. :1.000 1st Qu.:16.89 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000 Median :17.71 Median :0.0000 Median :0.0000 Median :4.000 Median :2.000 Mean :17.85 Mean :0.4375 Mean :0.4062 Mean :3.688 Mean :2.812 3rd Qu.:18.90 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000 Max. :22.90 Max. :1.0000 Max. :1.0000 Max. :5.000 Max. :8.000
Similarly you can apply other statistical functions on the columns of the data frames as shown in below examples:
> max(mtcars$mpg)  33.9 > sd(mtcars$disp)  123.9387 > mean(mtcars[,'mpg'])  20.09062
Note that functions like mean or median require numeric data and you must use the notation which will provide the column vector consisting of the numeric data. For example, the single bracket notation to access mpg, may result into data frames and that may not help you in getting the mean or medians.
- 15 Easy Solutions to Your Data Frame Problems in R