Chapter 5 Exploratory Data Analysis (EDA)

5.1 `tidyverse` package in R

The tidyverse package in R is a very useful package for manipulating data. The tidyverse is a collection of a set of packages. Of these packages, we will particularly focus on two pacakages - dplyr and ggplot2. The dplyr is for data manipulation and the ggplot2 is for data visualization. In R, to install a package, you need to write install.packages () code and to load the package, you need to write library () code. Therefore, to install the package, you should write install.packages("tidyverse") and to load the follwing code -

library(tidyverse)

5.2 `dplyr` Package - Data Manipulation Tool

As stated above, the dplyr package is for data manipulation. There are many functions in dplyr; however, of these functions, six (06) functons are very much essential for data manipulation. In this project, we will learn those six necessary functions. These functions include - select, filter, arrange, mutate, summarize and , group_by. In additon to these functions, we will also use some other functions such as glimpse, count, dim, str and so on.

5.3 Data Set for Classroom Practice

There are many sources from which you can collect the data and manipulate in R. Some of the data sets are already included in some packages. In our class room, we will use a package called nycflights13 and install it by writing the code install.packages("nycflights13"). In the package, there are several data set, but we will use the flights data set. In order to get the data set in R Environment, you need to first install the package and load the data set and the following codes should be executed -

library(nycflights13)
flights <- flights

5.4 Assignment `<-` & Pipe `%>%` operator

Frequently, we will use the assignment <- and pipe %>% operator. The keyboard shortcurt for <- is alt+- and %>% is ctrl+shift+M

5.5 Meta Data - Data About Data

Once you load a data set in R, your next job should be to learn about some characteristcs about the data. To do so, you first need to load the tidyverse package by running the code library(tidyverse). Then you should write the following code. See the Table 5.1.

library(tidyverse)
glimpse(flights)

## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 201...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time       <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 1...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL...
## $ flight         <int> 1545, 1714, 1141, 725, 461,...
## $ tailnum        <chr> "N14228", "N24211", "N619AA...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK",...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN",...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 15...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, ...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-...

knitr::kable(head(flights[,1:5]), caption = "A Table of First Five Variables.")

TABLE5.1: A Table of First Five Variables.
year	month	day	dep_time	sched_dep_time
2013	1	1	517	515
2013	1	1	533	529
2013	1	1	542	540
2013	1	1	544	545
2013	1	1	554	600
2013	1	1	554	558

It can be seen that there are 336,776 observations and 19 variables. Additionally, the label of the variables can be identified. For examplem, the variable “year” is integer, the variable “carrier” is character variable, and the variable “time_hour” is date variable. This metadata is important for further manipulation of the data. You can also use dim () to see the number of rows and columns. Furhter, str() can be used. The names () function gives you the names of the variables of the data set.

dim(flights)

## [1] 336776     19

str(flights)

## tibble [336,776 x 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" ...

names(flights)

##  [1] "year"           "month"          "day"           
##  [4] "dep_time"       "sched_dep_time" "dep_delay"     
##  [7] "arr_time"       "sched_arr_time" "arr_delay"     
## [10] "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"      
## [16] "distance"       "hour"           "minute"        
## [19] "time_hour"

You can also know more about the flights data set (built-in data set in packages) by the following code - help ("flights") or ?flights

If you want to see the data set in excel-like spread sheet, you have to write View (flights). This code will open the data in an excel-like spreadsheet. Note that the V in view is capital letter.

5.6 Changing the type of the variable

Sometimes we might need to change the type of the variable; e.g., converting an integer variable to a character variable. In such case, we need to write code. If we want to convert “flight” variable from int type to chr, you need to write the following code -

flights$flight <- as.character(flights$flight)

Other codes for the conversion should be like this - as.character(), as.factor()

Also by writing code, you can check the type of the variable. For example -

is.character(flights$hour)

## [1] FALSE

is.numeric(flights$hour)

## [1] TRUE

Alternatively, you can use class() function to know the type of the variable. For example -

class(flights$year)

## [1] "integer"

5.7 `count()` function

To know the frequency of different variables (particularly categorical variables), we can use the count() function. For example - we want to know whether the dataset includes information about American Airlines (AA); we should write -

flights %>% 
  count(carrier)

## # A tibble: 16 x 2
##    carrier     n
##    <chr>   <int>
##  1 9E      18460
##  2 AA      32729
##  3 AS        714
##  4 B6      54635
##  5 DL      48110
##  6 EV      54173
##  7 F9        685
##  8 FL       3260
##  9 HA        342
## 10 MQ      26397
## 11 OO         32
## 12 UA      58665
## 13 US      20536
## 14 VX       5162
## 15 WN      12275
## 16 YV        601

If we want to know the name and the numbers of airports the flights left, we need to use the “origin” variable -

flights %>% 
  count(origin)

## # A tibble: 3 x 2
##   origin      n
##   <chr>   <int>
## 1 EWR    120835
## 2 JFK    111279
## 3 LGA    104662

Simialrly, we can see where these flights go by the following code -

flights %>% 
  count(dest)

## # A tibble: 105 x 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     265
##  3 ALB     439
##  4 ANC       8
##  5 ATL   17215
##  6 AUS    2439
##  7 AVL     275
##  8 BDL     443
##  9 BGR     375
## 10 BHM     297
## # ... with 95 more rows

If we want to order the rows when we use count() function, then we have to use additional argument such as sort = TRUE in the function

flights %>% 
  count(dest, sort = TRUE)

## # A tibble: 105 x 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705
## # ... with 95 more rows

We can see from the table that most of the flights’ destination was Chicago Airport (ORD), followed by Atlanta Airport (ATL)

5.8 1st (First) verb - `select ()`

The select () function is used to select some columns from your data set. For example, if you want to select the variables year, month, day, dep_time from your data set. Then you should write the following code (We created a new data set called flights2)

flights2 <- flights %>% 
  select(year, month, day, dep_time)
glimpse(flights2)

## Rows: 336,776
## Columns: 4
## $ year     <int> 2013, 2013, 2013, 2013, 2013, 201...
## $ month    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555...

Alternatively, you can write the following code to get the same results -

flights2 <- flights %>% 
  select(year:dep_time)
glimpse(flights2)

## Rows: 336,776
## Columns: 4
## $ year     <int> 2013, 2013, 2013, 2013, 2013, 201...
## $ month    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555...

There is another function called starts_with(), which we use to select those variables that start with some pre-selected phrases. For example - if we want to selet all variables that start with arr, then we should write the following code (the new data set is called - arr)

arr <- flights %>% 
  select(starts_with("arr"))
glimpse(arr)

## Rows: 336,776
## Columns: 2
## $ arr_time  <int> 830, 850, 923, 1004, 812, 740, 9...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -1...

A similar function like starts_with() is ends_with(). contains() function can be used as well to select those variables that contain specific pharases/words. matches() function also serves the similar objective. Some of the applications of these arguments are given below -

flights %>% 
  select(starts_with("dep")) %>% 
  glimpse()

## Rows: 336,776
## Columns: 2
## $ dep_time  <int> 517, 533, 542, 544, 554, 554, 55...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3,...

flights %>% 
  select(ends_with("time")) %>% 
  glimpse()

## Rows: 336,776
## Columns: 5
## $ dep_time       <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 15...

flights %>% 
  select(contains("time")) %>% 
  glimpse()

## Rows: 336,776
## Columns: 6
## $ dep_time       <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 15...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-...

flights %>% 
  select(matches("time")) %>% 
  glimpse()

## Rows: 336,776
## Columns: 6
## $ dep_time       <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 15...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-...

If you want to rearrange the column (Variables) of your data set, then you can use everything() function. For example - you want to first put the carrier and flight variable and all variables after these two variables. In such case, you have to write the following code -

flights <- flights %>% 
  select(carrier, flight, everything())
glimpse(flights)

## Rows: 336,776
## Columns: 19
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL...
## $ flight         <chr> "1545", "1714", "1141", "72...
## $ year           <int> 2013, 2013, 2013, 2013, 201...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time       <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 1...
## $ tailnum        <chr> "N14228", "N24211", "N619AA...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK",...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN",...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 15...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, ...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-...

5.9 2nd (Second) verb - `filter ()`

If we want to subset our dataset by rows, then filter () is used. For example - we want the flights whose destination was Chicgo Airport (ORD).

chicago <- flights %>% 
  filter(dest == "ORD")
glimpse(chicago)

## Rows: 17,283
## Columns: 19
## $ carrier        <chr> "UA", "AA", "MQ", "AA", "AA...
## $ flight         <chr> "1696", "301", "3768", "303...
## $ year           <int> 2013, 2013, 2013, 2013, 201...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time       <int> 554, 558, 608, 629, 656, 70...
## $ sched_dep_time <int> 558, 600, 600, 630, 700, 70...
## $ dep_delay      <dbl> -4, -2, 8, -1, -4, 9, 2, -6...
## $ arr_time       <int> 740, 753, 807, 824, 854, 85...
## $ sched_arr_time <int> 728, 745, 735, 810, 850, 83...
## $ arr_delay      <dbl> 12, 8, 32, 14, 4, 20, 21, -...
## $ tailnum        <chr> "N39463", "N3ALAA", "N9EAMQ...
## $ origin         <chr> "EWR", "LGA", "EWR", "LGA",...
## $ dest           <chr> "ORD", "ORD", "ORD", "ORD",...
## $ air_time       <dbl> 150, 138, 139, 140, 143, 13...
## $ distance       <dbl> 719, 733, 719, 733, 733, 73...
## $ hour           <dbl> 5, 6, 6, 6, 7, 7, 7, 7, 7, ...
## $ minute         <dbl> 58, 0, 0, 30, 0, 0, 13, 45,...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-...

Similarly, we want to subset the data for the month of January.

january <- flights %>% 
  filter(month == "1")
glimpse(january)

## Rows: 27,004
## Columns: 19
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL...
## $ flight         <chr> "1545", "1714", "1141", "72...
## $ year           <int> 2013, 2013, 2013, 2013, 201...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time       <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 1...
## $ tailnum        <chr> "N14228", "N24211", "N619AA...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK",...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN",...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 15...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, ...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-...

If we to subset the data for the month of January and February, the following code should be run

jan_feb <- flights %>% 
  filter(month %in% c(1, 2))
glimpse(jan_feb)

## Rows: 51,955
## Columns: 19
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL...
## $ flight         <chr> "1545", "1714", "1141", "72...
## $ year           <int> 2013, 2013, 2013, 2013, 201...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time       <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 1...
## $ tailnum        <chr> "N14228", "N24211", "N619AA...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK",...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN",...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 15...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, ...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-...

If we want to subset the data for all airlines other than American Airlines (AA), for the month of January and February and for the distance greater than 100, then the following code should be executed.

naa <- flights %>% 
  filter(carrier != "AA", month %in% c(1, 2), distance > 1000
  )
glimpse(naa)

## Rows: 18,456
## Columns: 19
## $ carrier        <chr> "UA", "UA", "B6", "B6", "B6...
## $ flight         <chr> "1545", "1714", "725", "507...
## $ year           <int> 2013, 2013, 2013, 2013, 201...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time       <int> 517, 533, 544, 555, 558, 55...
## $ sched_dep_time <int> 515, 529, 545, 600, 600, 60...
## $ dep_delay      <dbl> 2, 4, -1, -5, -2, -2, -2, -...
## $ arr_time       <int> 830, 850, 1004, 913, 849, 8...
## $ sched_arr_time <int> 819, 830, 1022, 854, 851, 8...
## $ arr_delay      <dbl> 11, 20, -18, 19, -2, -3, 7,...
## $ tailnum        <chr> "N14228", "N24211", "N804JB...
## $ origin         <chr> "EWR", "LGA", "JFK", "EWR",...
## $ dest           <chr> "IAH", "IAH", "BQN", "FLL",...
## $ air_time       <dbl> 227, 227, 183, 158, 149, 15...
## $ distance       <dbl> 1400, 1416, 1576, 1065, 102...
## $ hour           <dbl> 5, 5, 5, 6, 6, 6, 6, 6, 6, ...
## $ minute         <dbl> 15, 29, 45, 0, 0, 0, 0, 0, ...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-...

Like the select_*, the verb filter has scoped versions such as filter_if, filter_at, and filter_all. If we want to learn more about these functions, we can use help functions such as ?filter_if or help(filter_if) to know more about these functions.

5.10 3rd (Third) verb - `arrange ()`

The arrange ()function allows you to reorder your data set by one or more variables. For example, if you want to reorder the flights dataset by distance, you need to execute the following code -

flights %>% 
  arrange(distance)

## # A tibble: 336,776 x 19
##    carrier flight  year month   day dep_time
##    <chr>   <chr>  <int> <int> <int>    <int>
##  1 US      1632    2013     7    27       NA
##  2 EV      3833    2013     1     3     2127
##  3 EV      4193    2013     1     4     1240
##  4 EV      4502    2013     1     4     1829
##  5 EV      4645    2013     1     4     2128
##  6 EV      4193    2013     1     5     1155
##  7 EV      4619    2013     1     6     2125
##  8 EV      4619    2013     1     7     2124
##  9 EV      4619    2013     1     8     2127
## 10 EV      4619    2013     1     9     2126
## # ... with 336,766 more rows, and 13 more variables:
## #   sched_dep_time <int>, dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>,
## #   arr_delay <dbl>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

From the results, we can see the lowest distance was 17-mile flight between EWR and LGA. The next lowest distance was 80-mile flight between EWR and PHL. However, if you want to see the longest distance, then you have to use desc () function becasue arrange () function reorders the rows in ascending order (from lowest to highest).

flights %>% 
  arrange(desc(distance))

## # A tibble: 336,776 x 19
##    carrier flight  year month   day dep_time
##    <chr>   <chr>  <int> <int> <int>    <int>
##  1 HA      51      2013     1     1      857
##  2 HA      51      2013     1     2      909
##  3 HA      51      2013     1     3      914
##  4 HA      51      2013     1     4      900
##  5 HA      51      2013     1     5      858
##  6 HA      51      2013     1     6     1019
##  7 HA      51      2013     1     7     1042
##  8 HA      51      2013     1     8      901
##  9 HA      51      2013     1     9      641
## 10 HA      51      2013     1    10      859
## # ... with 336,766 more rows, and 13 more variables:
## #   sched_dep_time <int>, dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>,
## #   arr_delay <dbl>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

It is evident that the highest distance was 4983 miles between JFK and HNL.

5.11 4th (Fourth) verb - `mutate ()`

The function mutate () is used to create new variables (columns). For example, we want to know the total delay, which is the sum of the dep_delay and arr_delay; then, we should write the following code -

flights <- flights %>% 
  mutate(total_delay = dep_delay + arr_delay)
glimpse(flights)

## Rows: 336,776
## Columns: 20
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL...
## $ flight         <chr> "1545", "1714", "1141", "72...
## $ year           <int> 2013, 2013, 2013, 2013, 201...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time       <int> 517, 533, 542, 544, 554, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 55...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 1...
## $ tailnum        <chr> "N14228", "N24211", "N619AA...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK",...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN",...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 15...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, ...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-...
## $ total_delay    <dbl> 13, 24, 35, -19, -31, 8, 14...

We can use mutate function with other functions such as ifelse or case_when to create new variables. Examples are given below -

flights <- flights %>% 
  mutate(total_delay_Dummy = ifelse(total_delay > mean(total_delay, na.rm = TRUE), "Delayed", "Not Delayed")) 
glimpse(flights)

## Rows: 336,776
## Columns: 21
## $ carrier           <chr> "UA", "UA", "AA", "B6", ...
## $ flight            <chr> "1545", "1714", "1141", ...
## $ year              <int> 2013, 2013, 2013, 2013, ...
## $ month             <int> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ day               <int> 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ dep_time          <int> 517, 533, 542, 544, 554,...
## $ sched_dep_time    <int> 515, 529, 540, 545, 600,...
## $ dep_delay         <dbl> 2, 4, 2, -1, -6, -4, -5,...
## $ arr_time          <int> 830, 850, 923, 1004, 812...
## $ sched_arr_time    <int> 819, 830, 850, 1022, 837...
## $ arr_delay         <dbl> 11, 20, 33, -18, -25, 12...
## $ tailnum           <chr> "N14228", "N24211", "N61...
## $ origin            <chr> "EWR", "LGA", "JFK", "JF...
## $ dest              <chr> "IAH", "IAH", "MIA", "BQ...
## $ air_time          <dbl> 227, 227, 160, 183, 116,...
## $ distance          <dbl> 1400, 1416, 1089, 1576, ...
## $ hour              <dbl> 5, 5, 5, 5, 6, 5, 6, 6, ...
## $ minute            <dbl> 15, 29, 40, 45, 0, 58, 0...
## $ time_hour         <dttm> 2013-01-01 05:00:00, 20...
## $ total_delay       <dbl> 13, 24, 35, -19, -31, 8,...
## $ total_delay_Dummy <chr> "Not Delayed", "Delayed"...

Note that we can now use the count function to know the number of flights delayed -

flights %>% 
  count(total_delay_Dummy)

## # A tibble: 3 x 2
##   total_delay_Dummy      n
##   <chr>              <int>
## 1 Delayed            86255
## 2 Not Delayed       241091
## 3 <NA>                9430

5.12 5th (Fifth) verb - `summarize ()`

The summarize () function is used to calculate different statistics such as mean, median, standard deviation, maximum, and minimum value. For example, we want to calucate the average distance and average delay of all flights in the month of January -

flights %>% 
  filter(month == "1") %>% 
  summarise(avg_distance = mean (distance),
            avg_delay = mean (total_delay, na.rm = TRUE),
            max_distance = max (distance),
            min_distance = min (distance, na.rm = TRUE),
            std_distance = sd (distance),
            med_distance = median (distance)
            )

## # A tibble: 1 x 6
##   avg_distance avg_delay max_distance min_distance
##          <dbl>     <dbl>        <dbl>        <dbl>
## 1        1007.      16.1         4983           80
## # ... with 2 more variables: std_distance <dbl>,
## #   med_distance <dbl>

5.13 6th (Sixth) verb - `group_by ()`

The group_by () function is very useful when it is used with summarize () function. For example, we want to know the average delay of each airport in New York in descending order; then, we should write the following code -

flights %>% 
  group_by(origin) %>% 
  summarize(avg_delay = mean (total_delay, na.rm = TRUE)) %>% 
  arrange(desc(avg_delay))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 2
##   origin avg_delay
##   <chr>      <dbl>
## 1 EWR         24.1
## 2 JFK         17.6
## 3 LGA         16.1

If you want to know the average delay of each carrier, then you need to write the following code -

flights %>% 
  group_by(carrier) %>% 
  summarize(avg_delay = mean (total_delay, na.rm = TRUE)) %>% 
  arrange(desc(avg_delay))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 16 x 2
##    carrier avg_delay
##    <chr>       <dbl>
##  1 F9          42.1 
##  2 FL          38.7 
##  3 EV          35.6 
##  4 YV          34.5 
##  5 WN          27.3 
##  6 OO          24.5 
##  7 9E          23.8 
##  8 B6          22.4 
##  9 MQ          21.2 
## 10 UA          15.6 
## 11 VX          14.5 
## 12 DL          10.9 
## 13 AA           8.93
## 14 US           5.87
## 15 HA          -2.01
## 16 AS          -4.10

If you want to know the average delay of each month, then you need to write the following code -

flights %>% 
  group_by(month) %>% 
  summarize(avg_delay = mean (total_delay, na.rm = TRUE)) %>% 
  arrange(desc(avg_delay))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 12 x 2
##    month avg_delay
##    <int>     <dbl>
##  1     7     38.2 
##  2     6     37.2 
##  3    12     31.4 
##  4     4     25.0 
##  5     3     19.0 
##  6     8     18.6 
##  7     5     16.4 
##  8     2     16.4 
##  9     1     16.1 
## 10    10      6.07
## 11    11      5.88
## 12     9      2.61

Chapter 5 Exploratory Data Analysis (EDA)

5.1 tidyverse package in R

5.2 dplyr Package - Data Manipulation Tool

5.3 Data Set for Classroom Practice

5.4 Assignment <- & Pipe %>% operator

5.5 Meta Data - Data About Data

5.6 Changing the type of the variable

5.7 count() function

5.8 1st (First) verb - select ()

5.9 2nd (Second) verb - filter ()

5.10 3rd (Third) verb - arrange ()

5.11 4th (Fourth) verb - mutate ()

5.12 5th (Fifth) verb - summarize ()

5.13 6th (Sixth) verb - group_by ()