Reading and Writing Files in R
Soon after you start using R as a biologist, you will likely want to analyze some data. These data are often stored as external files on your computer, most commonly as Excel spreadsheet files (.xls(x)), comma-separated value files (.csv), or plain text files (.txt).
Since it is best practice to never directly edit your raw data files (because what if you need to go back to the original version!?), a common workflow is to:
- Read your data into R
- Perform some kind of analysis, creating new data objects, figures, or outputs, then
- Save (aka “write”) those data objects/figures/outputs to your computer for later use
The first and last steps (reading and writing) are covered here.
As with most things in programming, there are multiple ways to perform these tasks, both in base R and via packages.
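As a minimal sketch of this workflow using only base R, the example below reads a raw file, computes a summary, and writes the result to a new file so the raw data stay untouched. The file names and the use of tempdir() are assumptions made so the example runs anywhere; in practice you would point to your own project folders.

```r
# Stand-in for a raw data file (assumption: saved in a temporary folder
# so the example is self-contained)
raw_path <- file.path(tempdir(), "raw-data.csv")
write.csv(airquality, raw_path, row.names = FALSE)

# 1. Read the raw data into R
raw <- read.csv(raw_path)

# 2. Analyze: mean temperature for each month
monthly_temp <- aggregate(Temp ~ Month, data = raw, FUN = mean)

# 3. Write the result to a NEW file, leaving the raw file unchanged
out_path <- file.path(tempdir(), "monthly-temp.csv")
write.csv(monthly_temp, out_path, row.names = FALSE)
```

Writing results to a separate file means the raw file can always be re-read to start the analysis again from scratch.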
Reading Files
Before reading in a new file, open it manually and check that it is formatted in a readable way, then note the path to the file on your computer. NOTE: see the section on workflow and the here package.
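One way to check a path before using it is sketched below with base R functions only. The folder and file names are placeholders, not real locations, so file.exists() is expected to return FALSE until you substitute your own path.

```r
# Build a path from its parts; file.path() joins them with "/"
data_path <- file.path("path", "to", "file", "file-name.csv")  # placeholder location
data_path

# Check whether a file is really at that location before trying to read it
file.exists(data_path)

# Show the folder R is currently working in; relative paths start from here
getwd()
```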
Comma-Separated Value .csv Files
This is the recommended file type for reading data into R as it is flexible and easily transferred directly into a dataframe.
We can read in the file and assign it directly to an object.
Base R
data <- read.csv("/path/to/file/file-name.csv", header = TRUE)
Here the first argument to the read.csv() function is the path pointing to the file on your computer. The second argument, header = TRUE, tells R that the first row of the .csv file contains the column names and is not part of the data themselves.
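To see what the header argument changes, here is a small sketch that writes a tiny file to a temporary location and reads it both ways. The species counts are made-up illustration data, not part of any real dataset.

```r
# Write a tiny example file to a temporary location (made-up data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("species,count", "finch,12", "wren,7"), tmp)

# header = TRUE: the first line becomes the column names
with_header <- read.csv(tmp, header = TRUE)
names(with_header)
nrow(with_header)

# header = FALSE: the first line is treated as data,
# and the columns get the default names V1 and V2
no_header <- read.csv(tmp, header = FALSE)
names(no_header)
nrow(no_header)
```

With header = FALSE the data frame has one extra row, because the column names were read as if they were an observation.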
Tidyverse
data <- readr::read_csv("/path/to/file/file-name.csv")
Excel sheets .xls(x)
The best way to read an Excel sheet into R is with the readxl package. The call is almost identical to those above:
data <- readxl::read_xls("/path/to/file/file-name.xls", sheet = "SHEETNAME")
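If you are not sure what the sheets in a workbook are called, or whether the file is .xls or .xlsx, the sketch below may help. It assumes the readxl package is installed and uses an example workbook that ships with readxl in place of your own file.

```r
# Assumes readxl is installed: install.packages("readxl")
if (requireNamespace("readxl", quietly = TRUE)) {
  # Example workbook bundled with readxl, standing in for your own file
  xl_path <- readxl::readxl_example("datasets.xlsx")

  # List the sheet names so you know what to pass to `sheet`
  sheets <- readxl::excel_sheets(xl_path)
  print(sheets)

  # read_excel() works out whether the file is .xls or .xlsx for you
  first_sheet <- readxl::read_excel(xl_path, sheet = sheets[1])
  print(head(first_sheet))
}
```

Note that read_xls() is meant for .xls files and read_xlsx() for .xlsx files, while read_excel() handles both.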
Plain Text .txt
Reading a text file is similar to reading a .csv file:
data <- read.delim("/path/to/file/file-name.txt")
The option above assumes that the file is tab-delimited.
We could also use the function read.table(), which is flexible in that we can define which delimiter the file uses. For example, a space-delimited text file is a data file similar to a .csv, but instead of a comma, a tab, or some other delimiter, it simply uses spaces. We could use read.table() for this task like so:
data <- read.table("/path/to/file/file-name.txt", sep = " ")
The same approach extends to other, less common delimiters supported by the function.
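As an illustration of another delimiter, the sketch below reads a semicolon-delimited file. The file contents are made up for the example and written to a temporary location. Unlike read.csv(), read.table() assumes there is no header row unless told otherwise, so header = TRUE is set explicitly.

```r
# Write a small semicolon-delimited file to a temporary location (made-up data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("site;depth_m;temp_c", "A;1.5;12.3", "B;3.0;10.8"), tmp)

# Tell read.table() which delimiter to split on and that row one holds names
dat <- read.table(tmp, sep = ";", header = TRUE)
str(dat)
```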
Tidyverse
data <- readr::read_tsv("/path/to/file/file-name.txt")
Writing Files
Writing files closely mirrors reading them: each writing function takes the data object to save and the path of the file to create.
Comma-Separated Value .csv Files
Base R
write.csv(data_object, "/path/to/new/file/file-name.csv")
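One detail worth knowing is that write.csv() also writes the row names by default, which show up as an extra column when the file is read back in. The sketch below shows the difference using the first rows of the built-in iris data and temporary files; setting row.names = FALSE is a common choice, not a requirement.

```r
df <- head(iris, 3)

# Default behaviour: row names are written as an unnamed first column
with_rn <- tempfile(fileext = ".csv")
write.csv(df, with_rn)
names(read.csv(with_rn))  # the extra first column is read back as "X"

# row.names = FALSE writes only the data columns
no_rn <- tempfile(fileext = ".csv")
write.csv(df, no_rn, row.names = FALSE)
names(read.csv(no_rn))
```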
Tidyverse
readr::write_csv(data_object, "/path/to/new/file/file-name.csv")
Excel sheets .xls(x)
Writing data to an Excel sheet is slightly unusual and generally not recommended (.csv files are almost always a better choice), but if you need to, there are a number of packages that can help, the easiest of which is writexl.
library(writexl)
writexl::write_xlsx(data_object, "/path/to/new/file/file-name.xlsx")
Plain Text .txt
Base R
write.table(data_object, "/path/to/new/file/file-name.txt")
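By default, write.table() separates values with spaces, wraps text in quotes, and includes row names. If you want a plain tab-delimited file instead, one possible set of arguments is sketched below, using the first rows of the built-in airquality data and a temporary file.

```r
df <- head(airquality, 4)
out <- tempfile(fileext = ".txt")

# Tab-separated, no quotes around text, no row names
write.table(df, out, sep = "\t", quote = FALSE, row.names = FALSE)

# Peek at the raw text that was written
readLines(out, n = 2)

# The file can be read straight back with read.delim()
back <- read.delim(out)
head(back)
```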
Tidyverse
readr::write_delim(data_object, "/path/to/new/file/file-name.txt")
Writing Figures
It is common to create a figure in R and want to save it to the local machine. If you are plotting with ggplot2, there are specific functions for those objects, but if you are plotting in base R, you will use built-in functions to save to whatever file type you want.
Using the built-in dataset airquality, we can make an example plot to save.
Saving an image created in base R
plot(x = airquality$Day, y = airquality$Temp, col = "red")
If this is the plot we wish to save, we can save it as any common figure file type (e.g. .png, .jpg, .tif, .pdf, etc.).
# first open a graphics device for the file type you want
png(file = "/path/to/plot/saving_plot.png")
# then create the plot
plot(x = airquality$Day, y = airquality$Temp, col = "red")
# this command writes the file and closes the device
dev.off()
Above, the first line, png(file = "/path/to/plot/saving_plot.png"), opens a connection to a file (that is, makes a place for it on the computer). The plot function, plot(x = airquality$Day, y = airquality$Temp, col = "red"), then draws the plot into that file instead of on screen. Finally, dev.off() writes the actual file and closes the connection opened by the first line of code.
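The same open, draw, close pattern works for the other file types by swapping the first function. Here is a sketch using pdf() instead of png(), saving to a temporary folder (an assumption made so the example runs anywhere) and then checking that the file was created.

```r
plot_path <- file.path(tempdir(), "saving_plot.pdf")  # temporary location

# Open a PDF graphics device; width and height are given in inches
pdf(file = plot_path, width = 6, height = 4)

# Draw the plot; it goes into the file rather than onto the screen
plot(x = airquality$Day, y = airquality$Temp, col = "red")

# Close the device so the file is written out completely
dev.off()

# Confirm the file now exists
file.exists(plot_path)
```

If you forget dev.off(), the file may be empty or unreadable, and later plots will keep going to the file instead of the screen.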
Saving a ggplot2 object
It is best to use the set of functions from ggplot2 to save these plots.
ggplot2::ggsave("/path/to/plot/saving_plot.png", plot_object)
Note here that we can change the type of file that gets saved by simply changing the extension on the file name.
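For a fuller picture, the sketch below builds a small plot object and saves it with an explicit size. It assumes ggplot2 is installed; the plot itself and the temporary output folder are illustrative choices, and the .pdf extension could be swapped for another supported type.

```r
# Assumes ggplot2 is installed: install.packages("ggplot2")
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # An example plot object built from the airquality data
  plot_object <- ggplot(airquality, aes(x = Day, y = Temp)) +
    geom_point(colour = "red")

  # The file type is taken from the extension; size is set in inches
  gg_path <- file.path(tempdir(), "saving_plot.pdf")
  ggsave(gg_path, plot = plot_object, width = 6, height = 4, units = "in")
}
```

Setting width and height explicitly makes the saved figure reproducible; otherwise ggsave() uses the size of the current plotting window.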
Reading & Writing Examples
To get some practice with reading and writing files, and to make sure you understand the process, let’s use a dataframe that ships with R.
We’ll select a built-in dataframe, write it to our computers, and then read it back in. Let’s use the iris dataset.
df <- iris
head(df)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
Now that we have our dataframe df, let’s write it out to a place on our computers. Here is the tricky bit: as we mentioned in the workflow section, everyone’s computer will have slightly different “paths”, that is, the set of folders and directions that the computer uses to point to a particular file in a particular place. On my computer, I’ll be using a simple location in my documents folder. For me, that location is “~/Documents/iris.csv”, but it will be slightly different on each operating system (e.g. Windows vs. macOS). So, go ahead now and find the path that you want to save to. If you’re still unclear on paths, you can check out the “Working Directories” section on the workflow page.
We’ll use the Tidyverse method with the readr package, so we need to load it first. If you don’t have it installed yet, use install.packages("readr") to install it.
library(readr)
Let’s write out our file:
readr::write_csv(df, "~/Documents/iris.csv")
This returns no message, which is a good sign. To check that it worked, we could manually go to that location on our computers, but better yet, let’s try reading the file back in and see if it works!
We’ll assign it a different name when we read it in for clarity.
iris_df <- readr::read_csv("~/Documents/iris.csv")
## Rows: 150 Columns: 5
## ── Column specification ──────────────────────────────────────────────
## Delimiter: ","
## chr (1): Species
## dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
And we see the file has been read into our R session.
If the file hasn’t been read in, DO NOT PANIC. The most likely source of the error is a misspelling. Double check that you have spelt the paths correctly in both your writing and reading steps, and manually check in your computer to see if you can find the file where you think you saved it. You might have accidentally saved it elsewhere!
Here’s an example of a common error we might encounter during this type of task:
iris_df <- readr::read_csv("~/Docoments/iris.csv")
## Error: '~/Docoments/iris.csv' does not exist.
We see that no file was read in! How come? Can you spot the error in the above code? It turns out that “Documents” is misspelt as “Docoments”. This type of small and easily solved error is the most common when reading and writing files.
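One way to catch this kind of mistake early is to test the path before reading. The sketch below uses base R only and reuses the misspelt path from the example, so on most computers it will take the "not found" branch. The folder listed at the end is the one used in this walkthrough and may differ on your machine.

```r
csv_path <- "~/Docoments/iris.csv"  # the misspelt path from the example above

# TRUE only if a file really exists at that path
path_ok <- file.exists(csv_path)

if (path_ok) {
  iris_df <- read.csv(csv_path)
} else {
  # Show the full path R looked for, with "~" expanded to your home folder
  message("No file found at: ", path.expand(csv_path))

  # List the .csv files in the folder you meant to use
  print(list.files("~/Documents", pattern = "\\.csv$"))
}
```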
Once we’ve read in our data, it’s almost always important to take a quick look to make sure they look as we expect. There are two easy ways to do so. The first is to look at the first 6 rows of data using the head() function:
head(iris_df)
## # A tibble: 6 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>
## 1          5.1         3.5          1.4         0.2 setosa
## 2          4.9         3            1.4         0.2 setosa
## 3          4.7         3.2          1.3         0.2 setosa
## 4          4.6         3.1          1.5         0.2 setosa
## 5          5           3.6          1.4         0.2 setosa
## 6          5.4         3.9          1.7         0.4 setosa
The other useful thing to do is to double-check the structure of the data we’re working with by calling the str() function. This gives us the structure of our data object, including the type of the object itself (i.e. is it a data frame, a matrix, a list?) as well as its sub-components. For a data frame, we’ll see what type of data each column holds. This is important since particular methods only work on columns of particular types (e.g. you can’t take a numerical mean of a character column).
str(iris_df)
## spc_tbl_ [150 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : chr [1:150] "setosa" "setosa" "setosa" "setosa" ...
## - attr(*, "spec")=
## .. cols(
## .. Sepal.Length = col_double(),
## .. Sepal.Width = col_double(),
## .. Petal.Length = col_double(),
## .. Petal.Width = col_double(),
## .. Species = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
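To make the point about column types concrete, here is a short sketch in base R. It repeats the write-then-read round trip with a temporary file so it is self-contained, and assumes R 4.0 or later, where read.csv() reads text columns as character rather than factor.

```r
# Round trip through a temporary .csv file
tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)
iris_df <- read.csv(tmp)

# The class of each column after reading
# (in R 4.0 or later, Species comes back as character)
sapply(iris_df, class)

# A numeric column supports numeric summaries
mean(iris_df$Sepal.Length)

# Species was a factor in the original iris data; convert it back if needed
iris_df$Species <- factor(iris_df$Species)
levels(iris_df$Species)
```

Because a .csv file stores only text, information such as factor levels is not preserved when writing and has to be restored after reading.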