Chapter 3 Getting Data into R

Data lie in different places and in different formats. Therefore, getting them into R is not always the same. Some data are stored in local directory in our computer, while other data are available online. Moreover, sometimes we need extract data from databases. To know how to extract different forms of data from different sources is important because the explosion of social media and similar platforms has given birth to tremendous amounts of data in different formats and in different places. Though accountants and auditors are good at dealing with structured data, they should also know how to deal with unstructured and non-conventional data (Richins et al. 2017).

3.1 Packages That Could Be Used to Import Data into R.

There are many packages that can be used to import data into R from many different sources. Of those packages, readr, readxl, and purrr will be discussed here. Additionally, data can be imported into R using RStudio's Import Wizard. To import data using RStudio's Import Wizard, one needs to go to the Environment tab and select Import Dataset; then, select appropriate type of data one wants to import and finally browse the data one wants to import. Please see Figure 3.1 to learn about how to use RStudio's Import Wizard.

Import Data Using RStudio's Import Wizard

FIGURE3.1: Import Data Using RStudio’s Import Wizard

3.2 Knowing & Setting Your Current (Working) Directory

When one wants to import the data from Personal Computer (PC), then knowing where the data reside (which is also called “Path”) is necessary as this will help import the data easily. The function getwd() will help to know the current working directory. Unless a specific directory is set as a current (working) directory, R will always look for a file in the current working directory. The function setwd() helps to set any directory (folder) as current working directory. If you are working in an R project (The discussion about R project is in Appendix B.1), then the project folder (directory) is the current directory. Figure 3.2 is an example of the address of a directory (folder).

Directory

FIGURE3.2: Directory

If you click on the address bar of the directory (highlighted in Red), it will look like - C:\Documents\Project Docs. If you want to set it as your current working directory, then you have to write the code setwd("C:/Documents/Project Docs"). Note that though the address has back slash (\), in setwd() function, we use forward slash (/) as R cannot deal with back slash. Once you set the working directory, running getwd() will show you your current working directory, which is the default directory for importing and exporting data into R (unless you specifically mention a different path).

3.3 Importing Data into R from Excel

To import data from excel, readxl package can be excellent. You can install readxl package by install.packages("readxl") or it will be installed when tidyverse package is installed. readxl package can deal with both xls and xlsx files. There are some built-in datasets with readxl package. The function readxl_example generates the names of the built in datasets in readxl package.

library(readxl)
readxl_example() # These are the example files from readxl package. 
##  [1] "clippy.xls"    "clippy.xlsx"   "datasets.xls" 
##  [4] "datasets.xlsx" "deaths.xls"    "deaths.xlsx"  
##  [7] "geometry.xls"  "geometry.xlsx" "type-me.xls"  
## [10] "type-me.xlsx"

The read_excel function will read the data from excel. For example - if we want to import an excel file such as climate change data from World Bank (https://datacatalog.worldbank.org/dataset/climate-change-data), we can use read_excel function.

climate_change <- read_excel("data/climate_change_download_0.xls")
# To know the names of the variables of the dataset 
names(climate_change)
##  [1] "Country code" "Country name" "Series code" 
##  [4] "Series name"  "SCALE"        "Decimals"    
##  [7] "1990"         "1991"         "1992"        
## [10] "1993"         "1994"         "1995"        
## [13] "1996"         "1997"         "1998"        
## [16] "1999"         "2000"         "2001"        
## [19] "2002"         "2003"         "2004"        
## [22] "2005"         "2006"         "2007"        
## [25] "2008"         "2009"         "2010"        
## [28] "2011"
# OR
colnames(climate_change)
##  [1] "Country code" "Country name" "Series code" 
##  [4] "Series name"  "SCALE"        "Decimals"    
##  [7] "1990"         "1991"         "1992"        
## [10] "1993"         "1994"         "1995"        
## [13] "1996"         "1997"         "1998"        
## [16] "1999"         "2000"         "2001"        
## [19] "2002"         "2003"         "2004"        
## [22] "2005"         "2006"         "2007"        
## [25] "2008"         "2009"         "2010"        
## [28] "2011"

If the excel file climate_change_download_0 is opened in excel, we can see that there are 3 worksheets in the files, but read_excel function in climate_change dataset only imports the first worksheet called Data. We can use sheet argument in read_excel function to specify which worksheet one wants to import. In sheet argument the position of the worksheet can be specified as well rather than the name of the worksheet. Also, excel_sheets functions can be used to know the names of all worksheets in a given dataset. The function excel_format can be used to know the format of the excel files (xls or xlsx).

excel_sheets("data/climate_change_download_0.xls")
## [1] "Data"    "Country" "Series"
excel_format("data/climate_change_download_0.xls")
## [1] "xls"
climate_change_country <- read_excel("data/climate_change_download_0.xls", 
                                     sheet = "Country")
# OR
climate_change_country <- read_excel("data/climate_change_download_0.xls", 
                                     sheet = 2)

Other arguments in read_excel function such as range can be used to import a subset of the excel file. The argument na are used for missing values (NA). In our dataset climate_change there are missing values represented by ... This can be replaced by naargument.

climate_change2 <- read_excel("data/climate_change_download_0.xls",
                             na = "..")

Also, we can use writexl package to export (save) a dataset from R in excel format. The function write_xlsx is usually used to save the data in excel format in desired directory.

library(writexl)
write_xlsx(climate_change_country, "data/climatechange_country.xlsx")

Alternatively, one can use file.choose() function within read_excel function to manually import an excel file into R.

data <- read_excel(file.choose())

Also, using read.table function allows one to copy and paste an excel file in R.

df <- read.table(file = "clipboard", 
                      sep = "\t", header=TRUE)

3.4 Importing Data into R from Local Directory of Computer

Importing data from local directory of computer can be done in couple of ways. For example, we can specify the path in which the data is stored. Alternatively, we can set the working directory first; then import the data by specifying the name of the data files.

3.5 Importing Data into R from Internet

3.6 Importing Data into R from WRDS Database

3.7 Some Other Sources

Very good website for the chapter - https://rc2e.com/inputandoutput#recipe-id245