Chapter 5 Exploratory Data Analysis (EDA)
5.1 tidyverse
package in R
The tidyverse
package in R is a very useful package for manipulating data. The tidyverse
is a collection of a set of packages. Of these packages, we will particularly focus on two pacakages - dplyr
and ggplot2
. The dplyr
is for data manipulation and the ggplot2
is for data visualization. In R, to install a package, you need to write install.packages ()
code and to load the package, you need to write library ()
code. Therefore, to install the package, you should write install.packages("tidyverse")
and to load the follwing code -
library(tidyverse)
5.2 dplyr
Package - Data Manipulation Tool
As stated above, the dplyr
package is for data manipulation. There are many functions in dplyr
; however, of these functions, six (06) functons are very much essential for data manipulation. In this project, we will learn those six necessary functions. These functions include - select
, filter
, arrange
, mutate
, summarize
and , group_by
. In additon to these functions, we will also use some other functions such as glimpse
, count
, dim
, str
and so on.
5.3 Data Set for Classroom Practice
There are many sources from which you can collect the data and manipulate in R. Some of the data sets are already included in some packages. In our class room, we will use a package called nycflights13
and install it by writing the code install.packages("nycflights13")
. In the package, there are several data set, but we will use the flights
data set. In order to get the data set in R Environment, you need to first install the package and load the data set and the following codes should be executed -
library(nycflights13)
<- flights flights
5.4 Assignment <-
& Pipe %>%
operator
Frequently, we will use the assignment <-
and pipe %>%
operator. The keyboard shortcurt for <-
is alt+- and %>%
is ctrl+shift+M
5.5 Meta Data - Data About Data
Once you load a data set in R, your next job should be to learn about some characteristcs about the data. To do so, you first need to load the tidyverse
package by running the code library(tidyverse)
. Then you should write the following code. See the Table 5.1.
library(tidyverse)
glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 201...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3...
## $ arr_time <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 1...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL...
## $ flight <int> 1545, 1714, 1141, 725, 461,...
## $ tailnum <chr> "N14228", "N24211", "N619AA...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK",...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN",...
## $ air_time <dbl> 227, 227, 160, 183, 116, 15...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, ...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-...
::kable(head(flights[,1:5]), caption = "A Table of First Five Variables.") knitr
year | month | day | dep_time | sched_dep_time |
---|---|---|---|---|
2013 | 1 | 1 | 517 | 515 |
2013 | 1 | 1 | 533 | 529 |
2013 | 1 | 1 | 542 | 540 |
2013 | 1 | 1 | 544 | 545 |
2013 | 1 | 1 | 554 | 600 |
2013 | 1 | 1 | 554 | 558 |
It can be seen that there are 336,776 observations and 19 variables. Additionally, the label of the variables can be identified. For examplem, the variable “year” is integer, the variable “carrier” is character variable, and the variable “time_hour” is date variable. This metadata is important for further manipulation of the data. You can also use dim ()
to see the number of rows and columns. Furhter, str()
can be used. The names ()
function gives you the names of the variables of the data set.
dim(flights)
## [1] 336776 19
str(flights)
## tibble [336,776 x 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" ...
names(flights)
## [1] "year" "month" "day"
## [4] "dep_time" "sched_dep_time" "dep_delay"
## [7] "arr_time" "sched_arr_time" "arr_delay"
## [10] "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time"
## [16] "distance" "hour" "minute"
## [19] "time_hour"
You can also know more about the flights data set (built-in data set in packages) by the following code - help ("flights")
or ?flights
If you want to see the data set in excel-like spread sheet, you have to write View (flights)
. This code will open the data in an excel-like spreadsheet. Note that the V in view is capital letter.
5.6 Changing the type of the variable
Sometimes we might need to change the type of the variable; e.g., converting an integer variable to a character variable. In such case, we need to write code. If we want to convert “flight” variable from int type to chr, you need to write the following code -
$flight <- as.character(flights$flight) flights
Other codes for the conversion should be like this - as.character()
, as.factor()
Also by writing code, you can check the type of the variable. For example -
is.character(flights$hour)
## [1] FALSE
is.numeric(flights$hour)
## [1] TRUE
Alternatively, you can use class()
function to know the type of the variable. For example -
class(flights$year)
## [1] "integer"
5.7 count()
function
To know the frequency of different variables (particularly categorical variables), we can use the count()
function. For example - we want to know whether the dataset includes information about American Airlines (AA); we should write -
%>%
flights count(carrier)
## # A tibble: 16 x 2
## carrier n
## <chr> <int>
## 1 9E 18460
## 2 AA 32729
## 3 AS 714
## 4 B6 54635
## 5 DL 48110
## 6 EV 54173
## 7 F9 685
## 8 FL 3260
## 9 HA 342
## 10 MQ 26397
## 11 OO 32
## 12 UA 58665
## 13 US 20536
## 14 VX 5162
## 15 WN 12275
## 16 YV 601
If we want to know the name and the numbers of airports the flights left, we need to use the “origin” variable -
%>%
flights count(origin)
## # A tibble: 3 x 2
## origin n
## <chr> <int>
## 1 EWR 120835
## 2 JFK 111279
## 3 LGA 104662
Simialrly, we can see where these flights go by the following code -
%>%
flights count(dest)
## # A tibble: 105 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 265
## 3 ALB 439
## 4 ANC 8
## 5 ATL 17215
## 6 AUS 2439
## 7 AVL 275
## 8 BDL 443
## 9 BGR 375
## 10 BHM 297
## # ... with 95 more rows
If we want to order the rows when we use count()
function, then we have to use additional argument such as sort = TRUE
in the function
%>%
flights count(dest, sort = TRUE)
## # A tibble: 105 x 2
## dest n
## <chr> <int>
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## 5 MCO 14082
## 6 CLT 14064
## 7 SFO 13331
## 8 FLL 12055
## 9 MIA 11728
## 10 DCA 9705
## # ... with 95 more rows
We can see from the table that most of the flights’ destination was Chicago Airport (ORD), followed by Atlanta Airport (ATL)
5.8 1st (First) verb - select ()
The select ()
function is used to select some columns from your data set. For example, if you want to select the variables year, month, day, dep_time from your data set. Then you should write the following code (We created a new data set called flights2)
<- flights %>%
flights2 select(year, month, day, dep_time)
glimpse(flights2)
## Rows: 336,776
## Columns: 4
## $ year <int> 2013, 2013, 2013, 2013, 2013, 201...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555...
Alternatively, you can write the following code to get the same results -
<- flights %>%
flights2 select(year:dep_time)
glimpse(flights2)
## Rows: 336,776
## Columns: 4
## $ year <int> 2013, 2013, 2013, 2013, 2013, 201...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555...
There is another function called starts_with()
, which we use to select those variables that start with some pre-selected phrases. For example - if we want to selet all variables that start with arr, then we should write the following code (the new data set is called - arr)
<- flights %>%
arr select(starts_with("arr"))
glimpse(arr)
## Rows: 336,776
## Columns: 2
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 9...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -1...
A similar function like starts_with()
is ends_with()
. contains()
function can be used as well to select those variables that contain specific pharases/words. matches()
function also serves the similar objective. Some of the applications of these arguments are given below -
%>%
flights select(starts_with("dep")) %>%
glimpse()
## Rows: 336,776
## Columns: 2
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 55...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3,...
%>%
flights select(ends_with("time")) %>%
glimpse()
## Rows: 336,776
## Columns: 5
## $ dep_time <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ arr_time <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ air_time <dbl> 227, 227, 160, 183, 116, 15...
%>%
flights select(contains("time")) %>%
glimpse()
## Rows: 336,776
## Columns: 6
## $ dep_time <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ arr_time <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ air_time <dbl> 227, 227, 160, 183, 116, 15...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-...
%>%
flights select(matches("time")) %>%
glimpse()
## Rows: 336,776
## Columns: 6
## $ dep_time <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ arr_time <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ air_time <dbl> 227, 227, 160, 183, 116, 15...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-...
If you want to rearrange the column (Variables) of your data set, then you can use everything()
function. For example - you want to first put the carrier and flight variable and all variables after these two variables. In such case, you have to write the following code -
<- flights %>%
flights select(carrier, flight, everything())
glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL...
## $ flight <chr> "1545", "1714", "1141", "72...
## $ year <int> 2013, 2013, 2013, 2013, 201...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3...
## $ arr_time <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 1...
## $ tailnum <chr> "N14228", "N24211", "N619AA...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK",...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN",...
## $ air_time <dbl> 227, 227, 160, 183, 116, 15...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, ...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-...
5.9 2nd (Second) verb - filter ()
If we want to subset our dataset by rows, then filter ()
is used. For example - we want the flights whose destination was Chicgo Airport (ORD).
<- flights %>%
chicago filter(dest == "ORD")
glimpse(chicago)
## Rows: 17,283
## Columns: 19
## $ carrier <chr> "UA", "AA", "MQ", "AA", "AA...
## $ flight <chr> "1696", "301", "3768", "303...
## $ year <int> 2013, 2013, 2013, 2013, 201...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time <int> 554, 558, 608, 629, 656, 70...
## $ sched_dep_time <int> 558, 600, 600, 630, 700, 70...
## $ dep_delay <dbl> -4, -2, 8, -1, -4, 9, 2, -6...
## $ arr_time <int> 740, 753, 807, 824, 854, 85...
## $ sched_arr_time <int> 728, 745, 735, 810, 850, 83...
## $ arr_delay <dbl> 12, 8, 32, 14, 4, 20, 21, -...
## $ tailnum <chr> "N39463", "N3ALAA", "N9EAMQ...
## $ origin <chr> "EWR", "LGA", "EWR", "LGA",...
## $ dest <chr> "ORD", "ORD", "ORD", "ORD",...
## $ air_time <dbl> 150, 138, 139, 140, 143, 13...
## $ distance <dbl> 719, 733, 719, 733, 733, 73...
## $ hour <dbl> 5, 6, 6, 6, 7, 7, 7, 7, 7, ...
## $ minute <dbl> 58, 0, 0, 30, 0, 0, 13, 45,...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-...
Similarly, we want to subset the data for the month of January.
<- flights %>%
january filter(month == "1")
glimpse(january)
## Rows: 27,004
## Columns: 19
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL...
## $ flight <chr> "1545", "1714", "1141", "72...
## $ year <int> 2013, 2013, 2013, 2013, 201...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3...
## $ arr_time <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 1...
## $ tailnum <chr> "N14228", "N24211", "N619AA...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK",...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN",...
## $ air_time <dbl> 227, 227, 160, 183, 116, 15...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, ...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-...
If we to subset the data for the month of January and February, the following code should be run
<- flights %>%
jan_feb filter(month %in% c(1, 2))
glimpse(jan_feb)
## Rows: 51,955
## Columns: 19
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL...
## $ flight <chr> "1545", "1714", "1141", "72...
## $ year <int> 2013, 2013, 2013, 2013, 201...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3...
## $ arr_time <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 1...
## $ tailnum <chr> "N14228", "N24211", "N619AA...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK",...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN",...
## $ air_time <dbl> 227, 227, 160, 183, 116, 15...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, ...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-...
If we want to subset the data for all airlines other than American Airlines (AA), for the month of January and February and for the distance greater than 100, then the following code should be executed.
<- flights %>%
naa filter(carrier != "AA", month %in% c(1, 2), distance > 1000
)glimpse(naa)
## Rows: 18,456
## Columns: 19
## $ carrier <chr> "UA", "UA", "B6", "B6", "B6...
## $ flight <chr> "1545", "1714", "725", "507...
## $ year <int> 2013, 2013, 2013, 2013, 201...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time <int> 517, 533, 544, 555, 558, 55...
## $ sched_dep_time <int> 515, 529, 545, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, -1, -5, -2, -2, -2, -...
## $ arr_time <int> 830, 850, 1004, 913, 849, 8...
## $ sched_arr_time <int> 819, 830, 1022, 854, 851, 8...
## $ arr_delay <dbl> 11, 20, -18, 19, -2, -3, 7,...
## $ tailnum <chr> "N14228", "N24211", "N804JB...
## $ origin <chr> "EWR", "LGA", "JFK", "EWR",...
## $ dest <chr> "IAH", "IAH", "BQN", "FLL",...
## $ air_time <dbl> 227, 227, 183, 158, 149, 15...
## $ distance <dbl> 1400, 1416, 1576, 1065, 102...
## $ hour <dbl> 5, 5, 5, 6, 6, 6, 6, 6, 6, ...
## $ minute <dbl> 15, 29, 45, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-...
Like the select_*
, the verb filter
has scoped versions such as filter_if
, filter_at
, and filter_all
. If we want to learn more about these functions, we can use help functions such as ?filter_if
or help(filter_if)
to know more about these functions.
5.10 3rd (Third) verb - arrange ()
The arrange ()
function allows you to reorder your data set by one or more variables.
For example, if you want to reorder the flights
dataset by distance, you need to execute the following code -
%>%
flights arrange(distance)
## # A tibble: 336,776 x 19
## carrier flight year month day dep_time
## <chr> <chr> <int> <int> <int> <int>
## 1 US 1632 2013 7 27 NA
## 2 EV 3833 2013 1 3 2127
## 3 EV 4193 2013 1 4 1240
## 4 EV 4502 2013 1 4 1829
## 5 EV 4645 2013 1 4 2128
## 6 EV 4193 2013 1 5 1155
## 7 EV 4619 2013 1 6 2125
## 8 EV 4619 2013 1 7 2124
## 9 EV 4619 2013 1 8 2127
## 10 EV 4619 2013 1 9 2126
## # ... with 336,766 more rows, and 13 more variables:
## # sched_dep_time <int>, dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>,
## # arr_delay <dbl>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
From the results, we can see the lowest distance was 17-mile flight between EWR and LGA. The next lowest distance was 80-mile flight between EWR and PHL. However, if you want to see the longest distance, then you have to use desc ()
function becasue arrange ()
function reorders the rows in ascending order (from lowest to highest).
%>%
flights arrange(desc(distance))
## # A tibble: 336,776 x 19
## carrier flight year month day dep_time
## <chr> <chr> <int> <int> <int> <int>
## 1 HA 51 2013 1 1 857
## 2 HA 51 2013 1 2 909
## 3 HA 51 2013 1 3 914
## 4 HA 51 2013 1 4 900
## 5 HA 51 2013 1 5 858
## 6 HA 51 2013 1 6 1019
## 7 HA 51 2013 1 7 1042
## 8 HA 51 2013 1 8 901
## 9 HA 51 2013 1 9 641
## 10 HA 51 2013 1 10 859
## # ... with 336,766 more rows, and 13 more variables:
## # sched_dep_time <int>, dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>,
## # arr_delay <dbl>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
It is evident that the highest distance was 4983 miles between JFK and HNL.
5.11 4th (Fourth) verb - mutate ()
The function mutate ()
is used to create new variables (columns). For example, we want to know the total delay
, which is the sum of the dep_delay
and arr_delay
; then, we should write the following code -
<- flights %>%
flights mutate(total_delay = dep_delay + arr_delay)
glimpse(flights)
## Rows: 336,776
## Columns: 20
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL...
## $ flight <chr> "1545", "1714", "1141", "72...
## $ year <int> 2013, 2013, 2013, 2013, 201...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3...
## $ arr_time <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 1...
## $ tailnum <chr> "N14228", "N24211", "N619AA...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK",...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN",...
## $ air_time <dbl> 227, 227, 160, 183, 116, 15...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, ...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-...
## $ total_delay <dbl> 13, 24, 35, -19, -31, 8, 14...
We can use mutate
function with other functions such as ifelse
or case_when
to create new variables. Examples are given below -
<- flights %>%
flights mutate(total_delay_Dummy = ifelse(total_delay > mean(total_delay, na.rm = TRUE), "Delayed", "Not Delayed"))
glimpse(flights)
## Rows: 336,776
## Columns: 21
## $ carrier <chr> "UA", "UA", "AA", "B6", ...
## $ flight <chr> "1545", "1714", "1141", ...
## $ year <int> 2013, 2013, 2013, 2013, ...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time <int> 517, 533, 542, 544, 554,...
## $ sched_dep_time <int> 515, 529, 540, 545, 600,...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5,...
## $ arr_time <int> 830, 850, 923, 1004, 812...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12...
## $ tailnum <chr> "N14228", "N24211", "N61...
## $ origin <chr> "EWR", "LGA", "JFK", "JF...
## $ dest <chr> "IAH", "IAH", "MIA", "BQ...
## $ air_time <dbl> 227, 227, 160, 183, 116,...
## $ distance <dbl> 1400, 1416, 1089, 1576, ...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, ...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0...
## $ time_hour <dttm> 2013-01-01 05:00:00, 20...
## $ total_delay <dbl> 13, 24, 35, -19, -31, 8,...
## $ total_delay_Dummy <chr> "Not Delayed", "Delayed"...
Note that we can now use the count
function to know the number of flights delayed -
%>%
flights count(total_delay_Dummy)
## # A tibble: 3 x 2
## total_delay_Dummy n
## <chr> <int>
## 1 Delayed 86255
## 2 Not Delayed 241091
## 3 <NA> 9430
5.12 5th (Fifth) verb - summarize ()
The summarize ()
function is used to calculate different statistics such as mean, median, standard deviation, maximum, and minimum value. For example, we want to calucate the average distance and average delay of all flights in the month of January -
%>%
flights filter(month == "1") %>%
summarise(avg_distance = mean (distance),
avg_delay = mean (total_delay, na.rm = TRUE),
max_distance = max (distance),
min_distance = min (distance, na.rm = TRUE),
std_distance = sd (distance),
med_distance = median (distance)
)
## # A tibble: 1 x 6
## avg_distance avg_delay max_distance min_distance
## <dbl> <dbl> <dbl> <dbl>
## 1 1007. 16.1 4983 80
## # ... with 2 more variables: std_distance <dbl>,
## # med_distance <dbl>
5.13 6th (Sixth) verb - group_by ()
The group_by ()
function is very useful when it is used with summarize ()
function. For example, we want to know the average delay of each airport in New York in descending order; then, we should write the following code -
%>%
flights group_by(origin) %>%
summarize(avg_delay = mean (total_delay, na.rm = TRUE)) %>%
arrange(desc(avg_delay))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## origin avg_delay
## <chr> <dbl>
## 1 EWR 24.1
## 2 JFK 17.6
## 3 LGA 16.1
If you want to know the average delay of each carrier, then you need to write the following code -
%>%
flights group_by(carrier) %>%
summarize(avg_delay = mean (total_delay, na.rm = TRUE)) %>%
arrange(desc(avg_delay))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 16 x 2
## carrier avg_delay
## <chr> <dbl>
## 1 F9 42.1
## 2 FL 38.7
## 3 EV 35.6
## 4 YV 34.5
## 5 WN 27.3
## 6 OO 24.5
## 7 9E 23.8
## 8 B6 22.4
## 9 MQ 21.2
## 10 UA 15.6
## 11 VX 14.5
## 12 DL 10.9
## 13 AA 8.93
## 14 US 5.87
## 15 HA -2.01
## 16 AS -4.10
If you want to know the average delay of each month, then you need to write the following code -
%>%
flights group_by(month) %>%
summarize(avg_delay = mean (total_delay, na.rm = TRUE)) %>%
arrange(desc(avg_delay))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
## month avg_delay
## <int> <dbl>
## 1 7 38.2
## 2 6 37.2
## 3 12 31.4
## 4 4 25.0
## 5 3 19.0
## 6 8 18.6
## 7 5 16.4
## 8 2 16.4
## 9 1 16.1
## 10 10 6.07
## 11 11 5.88
## 12 9 2.61