A data frame is a data type used for storing data in tabular format. Each column of a data frame is a vector, and all the columns have the same length.
While matrices also give you a tabular view of the data, you need to understand the following differences between matrices and data frames.
- Matrices hold a single type of data and are mostly used with numerical data, whereas each column of a data frame can hold a different type.
- Matrices have their own set of use cases where they are helpful, while a data frame can be thought of as an Excel spreadsheet, where you can perform different kinds of operations on tabular data.
How do I get the inbuilt data frames in R?
R provides a list of built-in data sets, many of them data frames, for you to play with. You can get the complete list using the data() command on the console. The list shows the name of each data set along with a short description.
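As a rough sketch of what this looks like in practice, the snippet below lists the data sets from the datasets package without opening the pager. The variable name available is only illustrative.

```r
# data() with a package argument returns an object whose $results
# component is a character matrix describing each data set.
available <- data(package = "datasets")$results

# Show the name and title of the first few data sets.
head(available[, c("Item", "Title")])

# Confirm that mtcars, used in the rest of this article, is a data frame.
is.data.frame(mtcars)
```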
For the purpose of explaining the data frame concepts, I will be using the mtcars data set which contains data for Motor Trend Car Road Tests.
Viewing an existing data frame
Just typing mtcars on the console will print the corresponding data frame.
You can use the following commands to find the size / dimensions of a data frame:
> dim(mtcars)
[1] 32 11
> nrow(mtcars)
[1] 32
> ncol(mtcars)
[1] 11
- As in a spreadsheet, the top row is often called the header, and it contains the column names
- The rows other than the top row are called data rows
- The first column contains the row names
- The individual values in a data frame are referred to as cells
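The terms above map onto a few simple calls. This is a minimal sketch using mtcars; the variable names are only illustrative.

```r
# The header: column names of the data frame.
header <- names(mtcars)
header

# The row names, which are printed as the first column.
car_names <- rownames(mtcars)
head(car_names)

# A single cell: the value in row 1, column 1.
first_cell <- mtcars[1, 1]
first_cell
```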
Knowing the structure of the data frame
You can use the str function to see the structure of a given data frame. For example, str(mtcars) shows the structure of mtcars.
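A short sketch of the call is given here. The sapply line is an extra, not part of str itself, and simply collects the column types in a vector named col_types (an illustrative name).

```r
# Compact summary: class, dimensions, and the type and
# first few values of every column.
str(mtcars)

# The column types can also be collected programmatically.
col_types <- sapply(mtcars, class)
col_types
```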
Using head and tail methods
A data frame often has a large number of records, and you may want to see just a few records from the top or the bottom. In such cases, you can use the head or tail functions to get the corresponding records.
- By default, head and tail return up to six records. If you need more or fewer records, you can pass the additional parameter n (e.g. head(mtcars, n = 10) or head(mtcars, 10L)) to get the desired number of records.
- You can provide a negative value of n to exclude that many records and return the rest. With head, the records are excluded from the end; with tail, they are excluded from the beginning.
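The behaviour described above can be sketched as follows. The variable names are only there to make the results easy to inspect.

```r
first_six  <- head(mtcars)           # default: first 6 records
first_ten  <- head(mtcars, n = 10)   # first 10 records
last_three <- tail(mtcars, n = 3)    # last 3 records

# Negative n: drop records and return the rest.
all_but_last_two  <- head(mtcars, n = -2)  # drops the last 2 records
all_but_first_two <- tail(mtcars, n = -2)  # drops the first 2 records

nrow(all_but_last_two)
nrow(all_but_first_two)
```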
Accessing data in specific rows / columns
Data in specific ranges
Using the colon notation, you can view data in specific row ranges and column ranges. For example, mtcars[1:5, 1:3] returns rows 1 to 5 of columns 1 to 3.
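Here is a small sketch of range access on mtcars. The particular ranges are arbitrary and chosen only for illustration.

```r
# Rows 1 to 5, columns 1 to 3.
top_left <- mtcars[1:5, 1:3]
top_left

# Rows 6 to 10, all columns (note the comma with nothing after it).
rows_6_to_10 <- mtcars[6:10, ]
rows_6_to_10
```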
- If you need all the columns, you must put the comma and leave the column value empty
- The commands mtcars[6:10] and mtcars[6:10, ] produce totally different results. mtcars[6:10] returns all the rows with only columns 6 to 10, whereas mtcars[6:10, ] returns rows 6 to 10 with all the columns.
- Even when you really intend to select all the rows and only a few columns, I recommend you use a comma followed by the column range (e.g. mtcars[, 6:10])
- Note that when you access all the rows but only one column using the comma notation (e.g. mtcars[, 1]), the result is a vector and not a data frame.
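The notes above can be checked with a quick sketch. The drop = FALSE argument in the last line is not mentioned in the text; it is a standard argument of the bracket operator that keeps the result as a data frame.

```r
cols_only <- mtcars[6:10]     # all rows, columns 6 to 10
rows_only <- mtcars[6:10, ]   # rows 6 to 10, all columns

dim(cols_only)
dim(rows_only)

# A single column selected with the comma notation is a plain vector...
one_col <- mtcars[, 1]
class(one_col)

# ...unless drop = FALSE is passed to keep it as a data frame.
one_col_df <- mtcars[, 1, drop = FALSE]
class(one_col_df)
```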
Accessing Data from specific rows and columns
While learning matrices, you learned how to use the combine function (c) to pass specific rows and columns and access just those values. A similar command can be used to select specific rows and columns of a data frame in the desired order. For example, mtcars[c(3, 1, 5), c(4, 2)] returns rows 3, 1 and 5 with columns 4 and 2, in that order.
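A minimal sketch of selecting rows and columns in a custom order is shown here. The indexes are arbitrary.

```r
# Rows 3, 1 and 5 (in that order) with columns 4 and 2 (in that order).
picked <- mtcars[c(3, 1, 5), c(4, 2)]
picked

names(picked)      # columns appear in the requested order
rownames(picked)   # rows appear in the requested order
```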
Also, if needed, you can select the same row or column more than once, for example mtcars[c(1, 1, 2), c(1, 1, 2)].
You need to pay close attention to the header and the row names here. For the duplicate columns and rows, R appends .1 (.2, .3, etc.) to keep the column and row names unique.
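The renaming can be seen in a small sketch like the one below, which repeats the first row and the first column of mtcars.

```r
# Select row 1 twice and column 1 twice.
dup <- mtcars[c(1, 1, 2), c(1, 1, 2)]
dup

# R makes the repeated names unique by appending .1, .2, ...
names(dup)
rownames(dup)
```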
Accessing data using names
When a data frame has row and column names, you can use the combine function or the bracket notation to access the data using those names. You can also mix the index notation with names. For example, mtcars["Mazda RX4", "mpg"] returns a single cell, and mtcars["Mazda RX4", 1:3] mixes a row name with a column range.
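A few ways of accessing data by name are sketched here. The rows and columns picked are arbitrary examples from mtcars.

```r
# A single cell by row name and column name.
one_cell <- mtcars["Mazda RX4", "mpg"]
one_cell

# Several rows and columns by name, using c().
by_name <- mtcars[c("Fiat 128", "Honda Civic"), c("mpg", "cyl", "gear")]
by_name

# Mixing names and indexes.
name_and_index <- mtcars["Mazda RX4", 1:3]
index_and_name <- mtcars[1:3, c("mpg", "hp")]
```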
Accessing data columns as vectors
Essentially, every column of a data frame is a vector (i.e. it contains data of a single type). Many times you will need to access specific column(s).
Using bracket notation
By default, if you pass only the name of a column inside single brackets, the result is a data frame with that specific column in it. For example, mtcars["mpg"] is a data frame with a single column.
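A quick sketch confirming the result type, using the mpg column as an illustration:

```r
# Single brackets with just a column name keep the data frame structure.
mpg_df <- mtcars["mpg"]

class(mpg_df)   # a data frame
dim(mpg_df)     # still 32 rows, but only 1 column
head(mpg_df)
```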
If you use the double square bracket notation ([[ ]]) instead, you get the column as a vector. For example, mtcars[["mpg"]] returns the mpg values as a numeric vector.
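The same column accessed with double brackets, as a small sketch:

```r
# Double brackets extract the column itself, which is a vector.
mpg_vec <- mtcars[["mpg"]]

class(mpg_vec)    # a numeric vector, not a data frame
length(mpg_vec)   # one value per row

# Being a vector, it can be passed straight to vector functions.
mean(mpg_vec)
```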
Using single square bracket
In the preceding example we saw that the double square bracket returns a vector. However, when you leave the row position empty (keeping the comma) and pass just one column name (or index), as in mtcars[, "mpg"], the single square bracket also returns the vector for that column.
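A short sketch comparing the two forms of single bracket access:

```r
# With the comma and an empty row position, one column comes back as a vector.
mpg_by_name  <- mtcars[, "mpg"]
mpg_by_index <- mtcars[, 1]

class(mpg_by_name)

# Without the comma, the same name gives a one-column data frame.
class(mtcars["mpg"])
```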
Using $ notation to access a data column as vector
The $ notation is a very convenient way of accessing data columns as vectors. For example, mtcars$mpg returns the mpg column as a vector.
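A sample usage is sketched here, again with the mpg column:

```r
# $ followed by the column name returns that column as a vector.
mpg <- mtcars$mpg
mpg

# It is the same vector that double brackets return.
identical(mpg, mtcars[["mpg"]])
```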
Applying filters on data frames
One of the most important needs with any data is to be able to filter it based on certain criteria. When filtering a data frame, you can use the data columns and their respective data types to create a logical expression, which is applied to the whole data frame to give you the desired results.
Understanding Logical Indexes
Earlier in this article, we discussed how to use the $ notation to access a column vector. In the Working with Vectors article we learned how to apply logical conditions on a vector. So, I am sure you can guess what to expect. Let’s take a look at the expression mtcars$mpg > 22.
What did we just do?
- We created a vector using the $ notation
- Then we created a logical expression using the > (greater than) comparison
- This gave us a vector of logical indexes (TRUE/FALSE), one for each element of the mpg vector
- We can make use of this vector as a filter to see only those cars whose mileage is more than 22
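The steps above can be sketched as follows. The names mpg and high_mileage are only illustrative.

```r
# Step 1: get the column as a vector using the $ notation.
mpg <- mtcars$mpg

# Steps 2 and 3: the comparison is applied element by element and
# yields a logical vector of the same length as mpg.
high_mileage <- mpg > 22
head(high_mileage)

# Number of cars that satisfy the condition (TRUE counts as 1).
sum(high_mileage)
```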
Applying the logical index as filter on the data frame
By passing the logical index vector in the row position of the single square bracket, you can keep all the records where the condition evaluated to TRUE. For example, mtcars[mtcars$mpg > 22, ] returns all the cars whose mileage is greater than 22.
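A minimal sketch of the filter, with the result stored in a variable named high_mileage_cars for inspection:

```r
# The logical vector goes in the row position; the empty column
# position keeps all the columns.
high_mileage_cars <- mtcars[mtcars$mpg > 22, ]
high_mileage_cars
```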
Using logical operators
You can use logical operators like & (and) and | (or) to combine conditions into a single logical index that you can apply to a data frame. For example, mtcars[mtcars$mpg > 22 & mtcars$gear == 4, ] returns all the cars with mileage of more than 22 and 4 gears.
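Both operators are sketched below. The | example is an extra illustration and is not taken from the original text.

```r
# Both conditions must hold: mileage above 22 AND exactly 4 gears.
both <- mtcars[mtcars$mpg > 22 & mtcars$gear == 4, ]
both

# At least one condition must hold: mileage above 22 OR exactly 4 gears.
either <- mtcars[mtcars$mpg > 22 | mtcars$gear == 4, ]
nrow(either)
```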
Using subset function
While subset is not exactly a traditional filter, you still use it to identify a subset of the desired (i.e. filtered) data. The subset function can be applied on vectors, matrices, and data frames. In this section, let’s see how to use it on data frames.
The following is the syntax of the subset function for data frames:
subset(x, subset, select, drop = FALSE, ...)
- x is the data frame
- subset is the logical criteria for keeping rows of the data frame
- select is the expression for selecting columns of the data frame
For example, applying subset to the inbuilt data frame mtcars with the criteria mpg > 20 & wt > 2.5 returns all the cars whose mileage is more than 20 and weight is more than 2.5. Passing a combine (c) expression as select keeps only a few columns.
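A sketch of such a call is shown here. The columns chosen in select (mpg, cyl, wt) are an assumption made for illustration; any column names of mtcars would work.

```r
# Rows: mileage above 20 and weight above 2.5.
# Columns: only mpg, cyl and wt (chosen here for illustration).
heavy_but_efficient <- subset(mtcars,
                              mpg > 20 & wt > 2.5,
                              select = c(mpg, cyl, wt))
heavy_but_efficient
```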
Instead of listing specific column names, when the columns you need are contiguous you can use the colon (:) operator with the starting column name and the ending column name, e.g. select = mpg:hp.
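A small sketch of the colon form follows. The range mpg:hp is picked only as an example; it covers the first four columns of mtcars.

```r
# select = mpg:hp keeps every column from mpg through hp.
range_cols <- subset(mtcars,
                     mpg > 20 & wt > 2.5,
                     select = mpg:hp)
names(range_cols)
```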
- Since the subset function is already aware of the data frame name, you don’t need to use $ notation to access the columns.
Ordering the data frame records
The order function, applied to a given column vector, returns the row indexes in sorted order, which you can then use to sort the rows of the data frame.
For example, mtcars[order(mtcars$mpg), ] sorts the mtcars data frame on mpg.
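The sort can be sketched in two steps to show what order actually returns. The names row_order and by_mpg are illustrative.

```r
# order() returns row positions, not the sorted values themselves.
row_order <- order(mtcars$mpg)
head(row_order)

# Using those positions in the row slot rearranges the data frame.
by_mpg <- mtcars[row_order, ]
head(by_mpg)
```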
By default, sorting happens in ascending order. If you need the records in descending order, you can use the negative (-) operator on the column to reverse the direction of the sort, as in mtcars[order(-mtcars$mpg), ].
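A sketch of the descending sort is given below. The second form, using the decreasing argument of order, is not mentioned in the text and is included only as an alternative that also works for non-numeric columns.

```r
# Negating the numeric column reverses the sort direction.
by_mpg_desc <- mtcars[order(-mtcars$mpg), ]
head(by_mpg_desc)

# Alternative (an addition to the text): ask order() for decreasing order.
by_mpg_desc_alt <- mtcars[order(mtcars$mpg, decreasing = TRUE), ]
head(by_mpg_desc_alt)
```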
In this post, we looked at different ways to access a data frame, apply filters, and order records. In the next post we will see what other operations can be performed with this amazing data structure.