Using Dplyr for Manipulating Data in R

dplyr is a grammar of data manipulation in R. By constraining your options, it helps you to:

  • Think about your data manipulation challenges
  • Express the solutions to those challenges as programs, and
  • Execute those programs

We will use the data set of flights that departed NYC in 2013, which is available from CRAN as the nycflights13 package. If you have already gone through my article on data frames, you may find some of the concepts repeated here. However, if you pay close attention, you will notice that dplyr does the same things a bit more cleanly.

Install the package

First, check whether the dplyr package is installed. If it is not, use the install.packages function to install it:

install.packages('dplyr')
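The check itself can also be scripted. This is a minimal sketch, assuming base R's requireNamespace() is an acceptable test for whether a package is already installed:

```r
# Install dplyr only when it is not already available.
# requireNamespace() returns FALSE (rather than raising an error)
# when the package cannot be found.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
```

The same pattern should work for any other package name, including the data package used in this article.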

Note

Install the data package using the install.packages function as well:

install.packages('nycflights13')

 

Attach Package

Use the library function to attach the installed package:

> library(dplyr)

When dplyr is attached, R reports that filter and lag (from the stats package) and intersect, setdiff, setequal and union (from the base package) are masked, because dplyr defines its own functions with those names. You need not be concerned about this message.

Similarly, you can attach the flight data package using:

> library(nycflights13)

With this, you are set with the dplyr package as well as the flight data.

 

Functions of Dplyr Package

dplyr provides a small set of verbs, named after the most common things you do with data, so that you can translate your thoughts into code.

These verbs are:

  • filter() to select cases based on their values
  • arrange() to reorder the cases
  • select() and rename() to select variables based on their names
  • mutate() and transmute() to add new variables that are functions of existing variables
  • summarise() to condense multiple values to a single value
  • sample_n() and sample_frac() to take random samples

The dplyr vignette on CRAN gives a very good explanation of each of these functions:

https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html

Each of these functions takes a data frame as its first argument, performs an operation on it, and returns another data frame. So at any point you can apply what you already know about data frames to the result.
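To make the "data frame in, data frame out" pattern concrete, here is a small sketch. The toy data frame is made up for illustration: it borrows a few column names from flights, but the values are invented.

```r
library(dplyr)

# Hypothetical stand-in for flights: same column names, invented values.
toy <- data.frame(
  carrier   = c("AA", "UA", "AA", "DL", "UA"),
  month     = c(1, 2, 2, 2, 3),
  dep_delay = c(5, -3, 12, 0, 40),
  distance  = c(1000, 500, 750, 2000, 1500)
)

# Each verb takes a data frame first and returns a new data frame.
feb      <- filter(toy, month == 2)                          # keep rows where month is 2
by_delay <- arrange(toy, desc(dep_delay))                    # sort rows, largest delay first
narrow   <- select(toy, carrier, dep_delay)                  # keep only two columns
with_km  <- mutate(toy, distance_km = distance * 1.609344)   # add a derived column
overall  <- summarise(toy, mean_delay = mean(dep_delay))     # collapse to a single row
```

Because every result is again a data frame, functions such as nrow(), names() and head() work on all of them.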

Using Pipe Operator

The vignette linked above shows each verb used on its own. In practice, however, you often need to combine two or more verbs to get the data in the format or order you are looking for.

Since every verb returns a data frame, chaining them is straightforward. You can do it using one of the following approaches:

  1. Use separate variables to store each result and pass that variable to the next verb as a parameter
  2. Use function chaining by calling one verb inside the other verb
  3. Use the pipe operator to process data using one verb and pass the processed data to the next verb

 

Let’s take a look at this using an example:

Using chaining of verbs

sample_n(arrange(select(filter(flights, month==2, dep_delay > 2), day, dep_time, carrier, distance, origin, dest), dep_time), size = 20)


Using the pipe operator

result <- filter(flights, month == 2, dep_delay > 2) %>%
  select(day, dep_time, carrier, distance, origin, dest) %>%
  arrange(dep_time) %>%
  sample_n(size = 20)

print(result)

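For comparison, the sketch below runs all three approaches from the list, including the first one (separate variables), on the same input. It uses a made-up toy data frame that reuses column names from flights, and it leaves out the random sampling step so that the three results can be compared exactly.

```r
library(dplyr)

# Hypothetical rows with flights-style column names; values are invented.
toy <- data.frame(
  month     = c(2, 2, 2, 3, 2),
  day       = c(1, 1, 2, 5, 3),
  dep_time  = c(900, 630, 1215, 700, 545),
  dep_delay = c(10, 4, 1, 30, 25),
  carrier   = c("AA", "UA", "DL", "AA", "B6")
)

# Approach 1: store each intermediate result in its own variable.
delayed  <- filter(toy, month == 2, dep_delay > 2)
selected <- select(delayed, day, dep_time, carrier)
by_steps <- arrange(selected, dep_time)

# Approach 2: nest the calls; read from the innermost call outwards.
nested <- arrange(
  select(filter(toy, month == 2, dep_delay > 2), day, dep_time, carrier),
  dep_time
)

# Approach 3: pipe each result into the next verb; read top to bottom.
piped <- toy %>%
  filter(month == 2, dep_delay > 2) %>%
  select(day, dep_time, carrier) %>%
  arrange(dep_time)

# All three approaches should produce the same data frame.
same_result <- identical(by_steps, nested) && identical(nested, piped)
print(same_result)
```

The results are the same, so the choice is mostly about readability: the pipe keeps the steps in the order you think about them and avoids naming throwaway variables.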

A use case to understand the verbs better

We will use the flights data for the year 2013 to find the five airlines with the lowest average departure delay.

We can break the problem into these steps:

  • Find the average departure delay of each airline
    • That is, take the mean of dep_delay after grouping the flights by carrier
  • Sort the result (using the arrange verb) on avg.dep_delay
  • Pick the first five rows

Here is what the code looks like:

result <- group_by(flights, carrier) %>%
  summarise(avg.dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(avg.dep_delay) %>%
  head(n = 5)

print(result)

 

When you execute this, the output is a five-row data frame with one row per carrier, showing each carrier's avg.dep_delay in ascending order.

Note

  • Unlike grouping in other query languages, applying group_by() on its own does not aggregate anything: it returns the same rows, marked with the grouping. You need to pass the grouped data to summarise(), which then computes one result per group.
  • You can pass several columns to group_by() to group on multiple variables.
  • Here we used mean() as the aggregate function. Apart from the standard aggregate functions, summarise() also supports functions like n(), n_distinct(), first(), last() and nth().
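The points in this note can be illustrated with a short sketch. The data frame below is made up (flights-style column names, invented values), and the summary column names are chosen only for illustration.

```r
library(dplyr)

# Hypothetical rows with flights-style column names; values are invented.
toy <- data.frame(
  carrier   = c("AA", "AA", "AA", "UA", "UA"),
  origin    = c("JFK", "JFK", "LGA", "EWR", "EWR"),
  dest      = c("LAX", "SFO", "ORD", "ORD", "ORD"),
  dep_delay = c(10, NA, 4, 0, 20)
)

# group_by() alone only records the grouping; all five rows are still there.
grouped <- group_by(toy, carrier, origin)   # group on two variables
print(nrow(grouped))

# summarise() collapses each group to one row.
per_group <- grouped %>%
  summarise(
    n_flights     = n(),                    # number of rows in the group
    destinations  = n_distinct(dest),       # number of distinct destinations
    first_dest    = first(dest),            # first destination in the group
    avg_dep_delay = mean(dep_delay, na.rm = TRUE)
  ) %>%
  ungroup()                                 # drop any remaining grouping

print(per_group)
```

Depending on the dplyr version, summarise() may print an informational message about the grouping of its output; the ungroup() call at the end removes that grouping either way.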
