The dplyr package is a grammar of data manipulation in R. By constraining your options, dplyr helps you to
- Think about your data manipulation challenges
- Express the solutions to those challenges as programs
- Execute those programs
We will use the data set of flights that departed NYC in 2013, which is available from CRAN. If you have already gone through my article on data frames, you may find some of the concepts repeated here. However, if you pay close attention, you will notice that dplyr does the same things a bit more cleanly.
Install the package
The first thing you need to do is check whether the dplyr package is installed. If it is not, use the install.packages function to install it.
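A minimal sketch of that check follows. The `repos` URL is an assumption on my part (any CRAN mirror should work); the article itself does not prescribe one.

```r
# Install dplyr only if it is not already available on this machine.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  # The mirror URL is an assumption; substitute your preferred CRAN mirror.
  install.packages("dplyr", repos = "https://cloud.r-project.org")
}
```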
- Sometimes you may face a problem related to the locale. If so, you may like to review the following thread on Stack Overflow: https://stackoverflow.com/questions/3907719/how-to-fix-tar-failed-to-set-default-locale-error
- Also, if restarting R from within RStudio doesn’t help, try quitting and restarting RStudio.
Install the data package using the same install.packages function.
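The text does not name the data package. The sketch below assumes it is nycflights13, the CRAN package that ships a `flights` table of 2013 departures from NYC; the mirror URL is likewise an assumption.

```r
# Install the flight data package only if it is missing.
# "nycflights13" is an assumed package name; the article does not state it.
if (!requireNamespace("nycflights13", quietly = TRUE)) {
  install.packages("nycflights13", repos = "https://cloud.r-project.org")
}
```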
Using the library function, you can attach the installed packages, starting with dplyr.
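For dplyr, attaching might look like the sketch below. On attach, R prints a message listing the objects that dplyr masks. The second statement is only an illustration that a masked function stays reachable through its namespace prefix.

```r
# Attach dplyr; R reports which existing objects are now masked.
library(dplyr)

# Masked functions are still available with an explicit namespace.
# Here stats::filter computes a 3-point moving average.
stats::filter(1:5, rep(1 / 3, 3))
```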
When dplyr is attached, R reports that filter and lag from the stats package are masked, because dplyr defines its own functions with those names. You need not be concerned about this message. The same applies to intersect, setdiff, setequal and union from the base package.
Similarly, you can attach the flight data package using the library function.
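Assuming the data package is nycflights13 (the article does not name it), attaching it and taking a first look at the data could be sketched as:

```r
# Attach the assumed data package; it exposes a data frame called `flights`.
library(nycflights13)

# Inspect the size and the column names of the flight data.
dim(flights)
names(flights)
```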
With this, you are set with the dplyr package as well as the flight data.
Functions of Dplyr Package
The dplyr package provides verbs for the most common day-to-day data manipulation tasks, so that you can translate your thoughts into code.
These verbs are
- filter() to select cases based on their values
- arrange() to reorder the cases
- select() and rename() to select variables based on their names
- mutate() and transmute() to add new variables that are functions of existing variables
- summarise() to condense multiple values to a single value
- sample_n() and sample_frac() to take random samples
You may like to refer to the following page for a very good explanation of each of these functions:
In each of these functions, the first argument is a data frame on which the operation is performed, and the result is another data frame. So, at any given moment, you can still apply everything you have learned about data frames.
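To make the list of verbs concrete, here is a short sketch that calls each one on the flight data. It assumes the `flights` data frame comes from the nycflights13 package; the result names (`jan_first`, `with_gain` and so on) and the `gain` column are hypothetical and chosen only for illustration.

```r
library(dplyr)
library(nycflights13)  # assumed source of the `flights` data frame

# filter(): keep the rows whose values match the conditions
jan_first <- filter(flights, month == 1, day == 1)

# arrange(): reorder rows, here by descending departure delay
most_delayed <- arrange(flights, desc(dep_delay))

# select() and rename(): pick or rename columns by name
times <- select(flights, year, month, day, dep_time)
renamed <- rename(flights, tail_num = tailnum)

# mutate() adds new columns; transmute() keeps only the new columns
with_gain <- mutate(flights, gain = dep_delay - arr_delay)
only_gain <- transmute(flights, gain = dep_delay - arr_delay)

# summarise(): condense many values into a single value
summarise(flights, avg_delay = mean(dep_delay, na.rm = TRUE))

# sample_n() and sample_frac(): random rows by count or by fraction
sample_n(flights, 10)
sample_frac(flights, 0.01)
```

Every call takes a data frame as its first argument and returns a data frame, which is what makes the verbs easy to combine.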
Using Pipe Operator
The previous section introduced the individual verbs. In practice, however, you often need to combine two or more verbs to get the data into the format or order you are looking for.
Since all these verbs return a data frame, chaining them is straightforward. You can do it using any of the following approaches:
- Use separate variables to store each intermediate result and pass that variable to the next verb as a parameter
- Use function nesting by calling one verb inside another verb
- Use the pipe operator (%>%) to process data with one verb and pass the processed data to the next verb
Let’s take a look at this using an example:
Using chaining of verbs
sample_n(arrange(select(filter(flights, month==2, dep_delay > 2), day, dep_time, carrier, distance, origin, dest), dep_time), size = 20)
Using the pipe operator
result <- filter(flights, month == 2, dep_delay > 2) %>%
  select(day, dep_time, carrier, distance, origin, dest) %>%
  arrange(dep_time) %>%
  sample_n(size = 20)
print(result)
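The first approach from the list, storing each intermediate result in its own variable, has no example above. A sketch of the same query in that style follows; the variable names are hypothetical, and `flights` is assumed to come from nycflights13.

```r
library(dplyr)
library(nycflights13)  # assumed source of the `flights` data frame

# Each verb's result is stored and then passed to the next verb.
delayed_feb <- filter(flights, month == 2, dep_delay > 2)
trimmed <- select(delayed_feb, day, dep_time, carrier, distance, origin, dest)
ordered <- arrange(trimmed, dep_time)
result <- sample_n(ordered, size = 20)
print(result)
```

This version is easy to debug step by step, but it leaves several throwaway variables behind, which is the clutter the pipe operator avoids.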
A use case to understand the verbs better
We will use the flights data for the year 2013 to find the five airlines with the lowest average departure delay.
We can approach the problem as follows:
- Find the average departure delay of each airline
- This means computing the mean of dep_delay after grouping the flights by carrier
- Sort the result (using the arrange verb) on avg.dep_delay
- Pick the top five rows
Here is what the code looks like:
result <- group_by(flights, carrier) %>%
  summarise(avg.dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(avg.dep_delay) %>%
  head(n = 5)
print(result)
When you execute this, you get a table with five rows and two columns, carrier and avg.dep_delay, sorted from the lowest average delay upward. A few points to note:
- Unlike in some query languages, applying group_by on its own does not aggregate anything; it only records the grouping. You need to pass the grouped data to summarise, which collapses each group into a single row and lets you see the grouped information.
- You can pass multiple columns to group_by to group on multiple variables
- Here we used mean as the aggregate function. Apart from the standard aggregate functions, summarise also supports functions such as n(), n_distinct(), first(), last() and nth()
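The three points above can be sketched together as follows. This is an illustration rather than part of the original use case: the summary column names are hypothetical, and `flights` is assumed to come from nycflights13.

```r
library(dplyr)
library(nycflights13)  # assumed source of the `flights` data frame

# group_by() alone only records the grouping; the rows are unchanged.
grouped <- group_by(flights, carrier, origin)
nrow(grouped) == nrow(flights)

# summarise() collapses each (carrier, origin) group into one row.
by_carrier_origin <- grouped %>%
  summarise(
    flight_count = n(),                  # number of rows in the group
    destinations = n_distinct(dest),     # number of distinct destinations
    first_dep_time = first(dep_time),    # first value within the group
    avg.dep_delay = mean(dep_delay, na.rm = TRUE)
  )
print(by_carrier_origin)
```

Recent dplyr versions print an informational message about the grouping of the result after summarise; it is not an error.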