Chapter 4 Data Wrangling

Data wrangling is the process of cleaning data, so that data become ready for further manipulation such as visualization and modeling. Sometimes, data wrangling is also called Data Munging. More specifically, data wrangling involves transforming and mapping data from one from to another form - particularly from raw form to tidy form.

There is an old saying that 90% of data science involves data wrangling. In many cases, data wrangling is difficult as it involves dealing with missing entries, ambiguous values, and different types of mixed data. In data analytics ecosystem in R, data wrangling involves three jobs - importing data into R, cleaning (tidying) the data, and transforming the data. Please see Figure 4.1 to learn about the data wrangling process.

Data Wrangling in R

FIGURE4.1: Data Wrangling in R

The first job - importing data into R - is discussed in chapter 03. In this chapter, cleaning (tidying) the data in R will be discussed. The last job will be discussed in next chapter - Exploratory Data Analysis (EDA).

4.1 tidy data

Data wrangling or data munging results in tidy data - which is storing data that makes further manipulation on data such as transformation, visualization, and modeling easier. The following rules make a dataset tidy -

  • Each variable must have its own column
  • Each observation must have its own row
  • Each value must have its own cell

4.2 Same Data, but Different Formats (Presentations)

4.3 Tidying Messy Data

According to Hadley Wickham, “Tidy datasets are all alike, but every messy dataset is messy in its own way” (Wickham and Grolemund 2017). Messy data can be in many forms; for example - Wickham (2014) mentions the five most common problems with messy datasets. These problems include -

  • Column headers are values, not variable names
  • Multiple variables are stored in one column
  • Variables are stored in both rows and columns
  • Multiple types of observational units are stored in the same table
  • A single observational unit is stored in multiple tables

4.3.1 Column Hearders are Values, not Variable Names

The following dataset (total_assets) is an example of this case, in which the name of the variables (columns) are numbers. Though in some cases, this data format might be useful, in many cases usually it is not useful.

total_assets
## # A tibble: 5 x 7
##   TICKER COMPANYNAME `2000` `2001` `2002` `2003` `2004`
##   <chr>  <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 AAPL   APPLE INC    6803   6021   6298   6815  8.05e3
## 2 WMT    WALMART INC 70349  78130  83451  94685  1.05e5
## 3 AABA   ALTABA INC   2270.  2379.  2790.  5932. 9.18e3
## 4 AMZN   AMAZON.COM~  2135.  1638.  1990.  2162. 3.25e3
## 5 GOOGL  ALPHABET I~    NA     NA    287.   871. 3.31e3

4.3.2 Multiple Variables are Stored in One Column

The netincome_asset dataset is a good example of this kind of dataset. For example, the variable VALUES includes net income and total assets separated by /. This variable is very hard to manipulate. For example, if we want to calculate Return on Assets (ROA), which equals to net income divided by total assets, then it is not possible to use VALUES column to calculate ROA.

netincome_asset
## # A tibble: 23 x 4
##     year TICKER COMPANYNAME VALUES     
##    <dbl> <chr>  <chr>       <chr>      
##  1  2000 AAPL   APPLE INC   786/6803   
##  2  2001 AAPL   APPLE INC   -25/6021   
##  3  2002 AAPL   APPLE INC   65/6298    
##  4  2003 AAPL   APPLE INC   69/6815    
##  5  2004 AAPL   APPLE INC   276/8050   
##  6  2000 WMT    WALMART INC 5377/70349 
##  7  2001 WMT    WALMART INC 6295/78130 
##  8  2002 WMT    WALMART INC 6671/83451 
##  9  2003 WMT    WALMART INC 8039/94685 
## 10  2004 WMT    WALMART INC 9054/104912
## # ... with 13 more rows

4.3.3 Variables are Stored in Both Rows and Columns

The sales_profit dataset is an example of this type. Here in the ITEM column, actually variables are included (such as SALES and NETINCOME). Also, the columns such as 2000 through 2004 should be a variable (e.g., year).

sales_profit 
## # A tibble: 10 x 8
##    TICKER COMPANYNAME ITEM   `2000`  `2001`  `2002`
##    <chr>  <chr>       <chr>   <dbl>   <dbl>   <dbl>
##  1 AAPL   APPLE INC   SALES  7.98e3  5.36e3  5.74e3
##  2 AAPL   APPLE INC   NETI~  7.86e2 -2.50e1  6.50e1
##  3 WMT    WALMART INC SALES  1.66e5  1.92e5  2.19e5
##  4 WMT    WALMART INC NETI~  5.38e3  6.30e3  6.67e3
##  5 AABA   ALTABA INC  SALES  1.11e3  7.17e2  9.53e2
##  6 AABA   ALTABA INC  NETI~  7.08e1 -9.28e1  4.28e1
##  7 AMZN   AMAZON.COM~ SALES  2.76e3  3.12e3  3.93e3
##  8 AMZN   AMAZON.COM~ NETI~ -1.41e3 -5.67e2 -1.49e2
##  9 GOOGL  ALPHABET I~ SALES NA      NA       4.40e2
## 10 GOOGL  ALPHABET I~ NETI~ NA      NA       9.97e1
## # ... with 2 more variables: `2003` <dbl>,
## #   `2004` <dbl>

4.3.4 Multiple Types of Observational Units are Stored in the Same Table

4.3.5

4.4 tidyr Package for tidy Data

The tidyr package is widley used to tidy data in R. Specifically, pivot_longer, pivot_wider, separate, and extract functions are widely used to make a messy data into tidy.

4.4.1 pivot_longer () function

If the total_assets dataset above (Column headers are values, not variable names) is made tidy using pivot_longer function, then it will be look like this (Assuming the name of the dataset is variable_number) -

# using pivot_longer () function 
total_assets %>% 
  pivot_longer(cols = -c("TICKER", "COMPANYNAME"),
               names_to = "year",
               values_to = "TOTALASSETS"
               ) %>% 
  head()
## # A tibble: 6 x 4
##   TICKER COMPANYNAME year  TOTALASSETS
##   <chr>  <chr>       <chr>       <dbl>
## 1 AAPL   APPLE INC   2000         6803
## 2 AAPL   APPLE INC   2001         6021
## 3 AAPL   APPLE INC   2002         6298
## 4 AAPL   APPLE INC   2003         6815
## 5 AAPL   APPLE INC   2004         8050
## 6 WMT    WALMART INC 2000        70349

There are several arguments of pivot_longer function. Here three arguments are used. The cols argument specifies which columns should (not) be used in pivot_longer function. In this case, we select the variables that should not be used while using pivot_longer function. The second argument names_to specifies the name of the variable in which existing column values will be stored and finally values_to specifies the name of the column in which the cell values will be stored. Now, the dataset is a tidy dataset.

4.4.2 pivot_wider () function

The sales_profit dataset can be put into tidy format using both pivot_longer and pivot_wider function.

sales_profit %>% 
  pivot_longer(
    cols = !c ("TICKER", "COMPANYNAME", "ITEM"),
    names_to = "year",
    values_to = "AMOUNT"
  ) %>% 
  pivot_wider(
    names_from = ITEM, 
    values_from = AMOUNT
  )
## # A tibble: 25 x 5
##    TICKER COMPANYNAME year   SALES NETINCOME
##    <chr>  <chr>       <chr>  <dbl>     <dbl>
##  1 AAPL   APPLE INC   2000    7983       786
##  2 AAPL   APPLE INC   2001    5363       -25
##  3 AAPL   APPLE INC   2002    5742        65
##  4 AAPL   APPLE INC   2003    6207        69
##  5 AAPL   APPLE INC   2004    8279       276
##  6 WMT    WALMART INC 2000  165639      5377
##  7 WMT    WALMART INC 2001  192003      6295
##  8 WMT    WALMART INC 2002  218529      6671
##  9 WMT    WALMART INC 2003  245308      8039
## 10 WMT    WALMART INC 2004  257157      9054
## # ... with 15 more rows

4.4.3 separate () function

The separate function can be used when multiple variables are stored in one column. The function separates one column into multiple columns. For example, in netincome_asset data, the column VALUES should be converted into two columns called NETINCOME and TOTALASSETS. In separate function, col argument is ued to select the column that needs to be broken; into argument is used to create new columns; and sep argument is ued to identify the character in which the column will be separated. In this case, it is /. Finally remove argument is used to decide whether the column (here VALUES) that is separated should be in new dataset.

netincome_asset %>% 
  separate(
    col = VALUES, into = c ("NETINCMOE", "TOTALASSETS"), sep ="/",
    remove =TRUE
  )
## # A tibble: 23 x 5
##     year TICKER COMPANYNAME NETINCMOE TOTALASSETS
##    <dbl> <chr>  <chr>       <chr>     <chr>      
##  1  2000 AAPL   APPLE INC   786       6803       
##  2  2001 AAPL   APPLE INC   -25       6021       
##  3  2002 AAPL   APPLE INC   65        6298       
##  4  2003 AAPL   APPLE INC   69        6815       
##  5  2004 AAPL   APPLE INC   276       8050       
##  6  2000 WMT    WALMART INC 5377      70349      
##  7  2001 WMT    WALMART INC 6295      78130      
##  8  2002 WMT    WALMART INC 6671      83451      
##  9  2003 WMT    WALMART INC 8039      94685      
## 10  2004 WMT    WALMART INC 9054      104912     
## # ... with 13 more rows

It is clear from the above dataset that the type of columns NETINCOME and TOTALASSETS are chr, but they should be number (e.g., dbl). Actually, separate functions maintain the type of the columns that are separated. The type of VALUES column was chr and it is maintained in new dataset. In order to get the true type, we can use convert argument, which is done below.

netincome_asset  %>% 
  separate(
    col = VALUES, into = c ("NETINCMOE", "TOTALASSETS"), sep ="/",
    remove =TRUE,convert = TRUE
  )
## # A tibble: 23 x 5
##     year TICKER COMPANYNAME NETINCMOE TOTALASSETS
##    <dbl> <chr>  <chr>           <dbl>       <dbl>
##  1  2000 AAPL   APPLE INC         786        6803
##  2  2001 AAPL   APPLE INC         -25        6021
##  3  2002 AAPL   APPLE INC          65        6298
##  4  2003 AAPL   APPLE INC          69        6815
##  5  2004 AAPL   APPLE INC         276        8050
##  6  2000 WMT    WALMART INC      5377       70349
##  7  2001 WMT    WALMART INC      6295       78130
##  8  2002 WMT    WALMART INC      6671       83451
##  9  2003 WMT    WALMART INC      8039       94685
## 10  2004 WMT    WALMART INC      9054      104912
## # ... with 13 more rows

4.4.4 extract () function

This is a very good source to discuss about data wrangling - https://dsapps-2020.github.io/Class_Slides/

This is the best data wrangling website - https://dcl-wrangle.stanford.edu/

The above link is also best about why EXCEL is not be best for Accounting & Audit Analytics.