Working with Data Frames in R (Part I)

A data frame is a data type used for storing data in tabular format. Each column of a data frame is a vector, and all columns have the same length.

While matrices also give you a tabular view of the data, you need to understand the following differences between matrices and data frames.

  • A matrix holds a single type of data and is mostly used with numerical data, whereas each column of a data frame can hold a different type.
  • Matrices have their own set of use cases, while a data frame can be thought of as an Excel spreadsheet on which you can perform different kinds of operations with tabular data.

How do I get the inbuilt data frames in R?

R provides a collection of built-in data sets, many of which are data frames, for you to play with. You can get the complete list of data sets using the data() command on the console. Some of them are shown below:

RDataSets
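As a rough sketch, the same listing can also be obtained from code instead of the interactive viewer. This assumes the standard datasets package that ships with R; the variable name ds is only for illustration.

```r
# data() returns an object whose 'results' matrix has one row per data set.
ds <- data(package = "datasets")$results

# Show the name and title of the first few entries.
head(ds[, c("Item", "Title")])

# Check that mtcars is one of the listed data sets.
"mtcars" %in% ds[, "Item"]
```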

For the purpose of explaining the data frame concepts, I will be using the mtcars data set which contains data for Motor Trend Car Road Tests.

Viewing an existing data frame

Just typing mtcars on the console will give you the corresponding data frame:

R_mtcars_dataset

You can use the following commands to find the size / dimensions of a data frame:

> dim(mtcars)
[1] 32 11
> nrow(mtcars)
[1] 32
> ncol(mtcars)
[1] 11

Note

  • As in a spreadsheet, the top row is often called the header, and it contains the column names
  • The rows below the header are called data rows
  • The leftmost entry of each printed row is the row name; row names are labels and not a data column (ncol(mtcars) does not count them)
  • The individual values in a data frame are referred to as cells
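The terms in the note above map onto a few base R calls. This is a small illustrative sketch using mtcars; the variable names are placeholders chosen here.

```r
# Column names (the header row).
header <- colnames(mtcars)
header

# Row names are stored separately from the data columns.
first_rows <- head(rownames(mtcars), 3)
first_rows

# A single cell: row 1, column 1 (mpg of the first car).
cell <- mtcars[1, 1]
cell
```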

 

Knowing the structure of the data frame

You can use the str function to see the structure of a given data frame. For example, the following shows the structure of mtcars:

R_str_mtrcars
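A minimal sketch of inspecting the structure from the console. The sapply call is an optional extra added here, not part of the original walkthrough; col_types is an illustrative name.

```r
# Compact summary: one line per column with its type and first few values.
str(mtcars)

# The class of each column, collected into a named character vector.
col_types <- sapply(mtcars, class)
col_types
```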

Using head and tail methods

A data frame often has a large number of records, and you may want to see only a few records from the top or the bottom. In such cases, you can use the head or tail functions to get the corresponding records.

R_head_tail_mtcars

Note

  • By default, head and tail return up to six records. If you need more or fewer records, pass the additional parameter n (e.g. n = 10 or n = 10L) with the desired number of records.
  • You can pass a negative value of n to exclude that many records and return the rest: head(mtcars, -5) drops the last five rows, while tail(mtcars, -5) drops the first five.
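The behaviour described in the notes can be sketched as follows. The variable names are illustrative only.

```r
# Default: the first and last six rows.
top6 <- head(mtcars)
bottom6 <- tail(mtcars)

# Ask for a specific number of rows with n.
top10 <- head(mtcars, n = 10)

# Negative n: head() drops rows from the end, tail() drops rows from the start.
all_but_last_30 <- head(mtcars, n = -30)   # keeps the first 2 rows
all_but_first_30 <- tail(mtcars, n = -30)  # keeps the last 2 rows

top10
all_but_last_30
all_but_first_30
```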

Accessing data in specific rows / columns

Data in specific ranges

Using the colon notation, you can view data in specific row ranges and column ranges. The example below shows how to access data in specific ranges:

R_Access_mtcars1

Note

  • If you need all the columns, you must still put the comma and leave the column value empty (e.g. mtcars[6:10, ])
  • The commands mtcars[6:10] and mtcars[6:10, ] produce totally different results. mtcars[6:10] returns all the rows with only columns 6 to 10, whereas mtcars[6:10, ] returns rows 6 to 10 with all the columns.
    • Even when you really intend to select all the rows and only a few columns, I recommend the notation with a comma followed by the column range (e.g. mtcars[, 6:10])
  • Note that when you access all the rows but only one column with the comma notation (e.g. mtcars[, 1]), the result is a vector and not a data frame.
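A short sketch of range access and of the differences listed in the notes. The variable names are placeholders introduced for this example.

```r
# Rows 1 to 5, columns 1 to 3.
block <- mtcars[1:5, 1:3]

# Rows 6 to 10, all columns (note the trailing comma).
rows_6_10 <- mtcars[6:10, ]

# Without a comma, the range selects columns: all rows, columns 6 to 10.
cols_6_10 <- mtcars[6:10]

# The explicit form of the same column selection.
cols_6_10_explicit <- mtcars[, 6:10]

# A single column with the comma notation comes back as a plain vector.
first_col <- mtcars[, 1]

dim(block)
dim(rows_6_10)
dim(cols_6_10)
is.data.frame(first_col)
```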

 

Accessing Data from specific rows and columns

While learning matrices, you learned to use the combine function (c) to pass specific rows and columns and access the corresponding values. A similar command can be used to select specific rows and columns of data frames in the desired order. The following examples show how you can use rows and columns in different orders:

R_mtcars_access2

Also, if needed, you can select the same row or column twice, as shown in the example below:

R_mtcars_access3

You need to pay close attention to the header and the row names here. For the duplicate columns and rows, R appends .1 (.2, .3, etc) to ensure the uniqueness of the column and row names.
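The two ideas above, custom ordering and repeated selection, can be sketched like this. The specific row and column indices are arbitrary choices for illustration.

```r
# Rows 3, 1 and 5 (in that order) with columns 4, 1 and 2 (in that order).
reordered <- mtcars[c(3, 1, 5), c(4, 1, 2)]
reordered

# Selecting the first row and the first column twice.
duplicated_sel <- mtcars[c(1, 1), c(1, 1)]
duplicated_sel

# R makes the repeated names unique by appending a numeric suffix.
colnames(duplicated_sel)
rownames(duplicated_sel)
```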

Accessing data using names

When a data frame has row and column names, you can use the combine function, or just the bracket notation, to access data by name. You can also mix index notation with names. The following example shows a few ways to access data using names:

R_mtcars_access4
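A few name-based lookups, sketched with car and column names that exist in mtcars. The variable names are illustrative.

```r
# One cell, addressed by row name and column name.
rx4_mpg <- mtcars["Mazda RX4", "mpg"]
rx4_mpg

# Several rows and columns by name, using c().
by_name <- mtcars[c("Mazda RX4", "Valiant"), c("mpg", "hp")]
by_name

# Mixing numeric row indices with column names.
mixed <- mtcars[1:3, c("mpg", "cyl")]
mixed
```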

Accessing data columns as vectors

Essentially, every column of a data frame is a vector (i.e. it contains data of a single type). You will often need to access specific column(s).

Using bracket notation

By default, if you pass only the name of a column in single bracket notation (without a comma), R returns a data frame containing just that column. The following example shows how single bracket access results in a data frame:

R_mtcars_access5

If you use the double square bracket notation ([[ ]]) in the above example, you get the column as a vector. The following example shows how the double square bracket notation helps in accessing the data as a vector:

R_mtcars_access6
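The difference between the two bracket forms can be checked directly. This is a minimal sketch; as_df and as_vec are names chosen here.

```r
# Single brackets with only a column name: a one-column data frame.
as_df <- mtcars["mpg"]
class(as_df)

# Double brackets: the column itself, as a numeric vector.
as_vec <- mtcars[["mpg"]]
class(as_vec)

head(as_vec)
```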

Using single square bracket

In the preceding example we saw that we need to use the double square bracket. However, when you leave the row index empty (keeping the comma) and pass just one column name (or index), the single square bracket also returns that column as a vector. The following example demonstrates this:

R_mtcars_access7
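A small sketch of the comma form. The drop = FALSE variant is an addition not covered in the text above; it is the standard base R argument for keeping the result as a data frame.

```r
# Empty row index, one column: the result is a vector.
vec_by_name <- mtcars[, "mpg"]
vec_by_index <- mtcars[, 1]
is.data.frame(vec_by_name)

# drop = FALSE keeps the one-column result as a data frame.
still_df <- mtcars[, "mpg", drop = FALSE]
is.data.frame(still_df)
```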

Using $ notation to access a data column as vector

The $ notation is a very convenient way of accessing a data column as a vector. The following example shows a sample usage:

R_mtcars_access8
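A minimal sketch of the $ notation on mtcars. The mean call is only there to show that the result behaves like any other numeric vector.

```r
# Extract the mpg column as a vector.
mpg_vec <- mtcars$mpg
head(mpg_vec)

# The result can be passed to any function that takes a numeric vector.
mean(mpg_vec)
```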

Applying Filtering on the data frames

One of the most important needs with any data is to be able to filter it based on certain criteria. When filtering a data frame, you can use the data columns and their respective data types to create a logical expression, which is applied to the whole data frame to give you the desired results.

Understanding Logical Indexes

Earlier in this article, we discussed how to use $ notation to access the column vector. In the Working with Vectors article we learned how to apply logical conditions on vector. So, I am sure you can guess what to expect. Let’s take a look at the below example:

R_mtcars_Logical_Filter1

What did we just do?

  • We created a vector using the $ notation
  • Then we created a logical expression using the > (greater than) comparison
  • This gave us a vector of logical values (TRUE/FALSE), one for each element of the mpg vector
  • We can make use of this vector as a filter to see only those cars whose mileage is more than 22
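The steps above can be sketched as follows, using the threshold of 22 from the text. The name high_mpg is illustrative.

```r
# Compare every element of the mpg column against 22.
high_mpg <- mtcars$mpg > 22

# One TRUE/FALSE value per row of the data frame.
head(high_mpg)
length(high_mpg)

# sum() counts the TRUE values, i.e. how many cars pass the condition.
sum(high_mpg)
```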

Applying the logical index as filter on the data frame

Using the single square bracket and passing the logical index vector as the row index, you can keep all the records where the condition evaluates to TRUE. The following example shows all the cars whose mileage is greater than 22:

R_mtcars_Logical_Filter2
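A minimal sketch of applying the logical index as a row filter. The variable name efficient is a placeholder.

```r
# The logical vector goes in the row position; the empty column position keeps all columns.
efficient <- mtcars[mtcars$mpg > 22, ]
efficient
```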

Using logical operators

You can use logical operators like & (and) and | (or) to combine conditions into the logical index that you apply to a data frame. The following example shows all the cars with mileage more than 22 and 4 gears:

R_mtcars_Logical_Filter3
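The combined condition can be sketched as below. The | variant is an extra illustration added here and is not part of the original example.

```r
# Both conditions must hold: mileage above 22 AND exactly 4 gears.
both <- mtcars[mtcars$mpg > 22 & mtcars$gear == 4, ]
both

# Either condition may hold: mileage above 22 OR exactly 4 gears.
either <- mtcars[mtcars$mpg > 22 | mtcars$gear == 4, ]
nrow(either)
```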

Using subset function

While subset is not exactly a traditional filter, you still use it to identify a subset of the desired (i.e. filtered) data. The subset function can be applied to vectors, matrices, and data frames. In this section, let’s see how to use it on data frames.

Following is the syntax for the subset function for the data frames:

subset( x, 
subset, 
select, 
drop = FALSE, ...)

Where

  • x is the data frame
  • subset is the logical criterion for keeping rows of the data frame
  • select is the expression for selecting columns of the data frame

In the example below, we apply subset to the built-in data frame mtcars to get all the cars whose mileage is more than 20 and whose weight (wt, in 1000 lbs) is more than 2.5. Using the combine function, we also provide the select expression to keep only a few columns:

R_mtcars_Logical_Filter4
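A sketch of the subset call described above. The particular columns chosen for select (mpg, wt, hp) are an assumption, since the original screenshot is not available.

```r
# Keep rows with mpg > 20 and wt > 2.5, and only the listed columns.
heavy_efficient <- subset(mtcars,
                          subset = mpg > 20 & wt > 2.5,
                          select = c(mpg, wt, hp))
heavy_efficient
```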

In the above example you used specific column names. However, if the columns you need are adjacent, you can use the colon (:) operator with the starting and ending column names, as shown in the example below:

R_mtcars_Logical_Filter5
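A sketch of the colon form of select. The range mpg:wt is an illustrative choice; in mtcars it covers the six adjacent columns from mpg through wt.

```r
# Same row condition as before, but select a run of adjacent columns by name.
range_cols <- subset(mtcars,
                     subset = mpg > 20 & wt > 2.5,
                     select = mpg:wt)
colnames(range_cols)
range_cols
```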

Note

  • Since the subset function evaluates its subset and select expressions inside the data frame, you don’t need the $ notation to access the columns.

 

Ordering the data frames records

Using the order function on a given column vector, you can decide the order in which the rows shall be sorted.

Following example sorts the mtcars dataframe on mpg as shown:

R_mtcars_Sorting1
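A minimal sketch of sorting with order. The order function returns row positions rather than sorted values, so its result is used as the row index; the variable names are illustrative.

```r
# order() gives the row positions that would put mpg in ascending order.
idx <- order(mtcars$mpg)
head(idx)

# Use those positions as the row index to get the sorted data frame.
by_mpg <- mtcars[idx, ]
head(by_mpg)
```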

By default, sorting happens in ascending order. If you need the records in descending order, you can apply the negative (-) operator to a numeric column to reverse the direction of the sort, e.g. mtcars[order(-mtcars$mpg), ].
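A sketch of the descending sort. The decreasing = TRUE form is an alternative added here; it is a standard argument of order and also works for non-numeric columns, where negation is not available.

```r
# Negate the numeric column to reverse the sort direction.
desc_neg <- mtcars[order(-mtcars$mpg), ]
head(desc_neg)

# Alternative: ask order() for a decreasing sort directly.
desc_arg <- mtcars[order(mtcars$mpg, decreasing = TRUE), ]
head(desc_arg)
```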

In this post, we looked at different ways to access a data frame, apply filters, and order records. In the next post we will see what other operations can be performed with this amazing data structure.
