Packages
Packages are essential to the work we do in R. Whether you are using R for the first time, or you’re a seasoned R expert, using packages makes up a great deal of your daily workload in R. It’s important to be familiar with the concept of a package, how we go about using them, and what to do when they don’t work how we want them to.
What are Packages?
R packages are collections of related functions, code, and data that any R user can download and use, so that we don’t have to write all the code for a task ourselves. In fact, if you’ve opened R and read in some data, you’ve almost certainly already used a package! A set of standard packages, including base, ships with R when you install it on your computer. Several of these are loaded automatically at the start of every session, so when you use a function that exists natively in R, such as mean(), you are actually using a function from a package that R has already loaded for you.
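As a quick check of the point above, the following minimal sketch asks R where mean() lives and which packages are attached to the current session. The variable names are illustrative only.

```r
# Ask R which package's namespace the mean() function belongs to
mean_home <- environmentName(environment(mean))
print(mean_home)

# List everything attached to the current session;
# entries beginning with "package:" are loaded packages
attached <- search()
print(attached)
```

The exact contents of the second result depend on how your R session was started.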
CRAN (the Comprehensive R Archive Network) is the main official repository for R packages, and packages submitted to it must pass a set of checks before they are accepted. Packages are typically sets of functions that perform related tasks (e.g. data management). For example, if we begin to type the function subset() in an R script or .Rmd file, we get the following:
We see here that R is suggesting some functions whose names start with sub. The word in the curly braces, {base}, indicates that the function is native to “base” R, i.e. the base set of functions and methods that we have access to without loading other packages. When you install R, a set of packages comes with it, including base, stats, and some others. Packages that do not come with R need to be installed and then explicitly loaded.
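If you are unsure which package a function comes from, you can ask R directly. This is a small sketch using find() from the utils package and the defaultPackages option; the variable names are illustrative.

```r
# Which attached package provides subset()?
subset_home <- utils::find("subset")
print(subset_home)

# Which packages does R attach by default at startup (in addition to base)?
default_pkgs <- getOption("defaultPackages")
print(default_pkgs)
```

The list of default packages can differ between installations, so treat the second result as specific to your own setup.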
There are a great many packages, the vast majority of which we will not even talk about. In fact, as of November 2020, more than 16,000 packages were available on CRAN (and that’s not including the non-CRAN packages that are available too!).
Note that many packages are dependencies: we do not use them directly, but they must be installed on our computers for the packages we do want to use to work properly.
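To make the idea of dependencies concrete, here is a hedged sketch that looks up what an installed package depends on. It uses the stats package as the example only because it is present in every R installation; you could substitute any package you have installed.

```r
# Build a table of the packages installed on this computer
installed_db <- installed.packages()

# Look up the packages that "stats" needs in order to work
# (swap in another installed package name to inspect it instead)
deps <- tools::package_dependencies("stats", db = installed_db)
print(deps)
```

The result is a list with one entry per package you asked about, each holding the names of the packages it relies on.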
Installing Packages
To see what packages we have installed already on our computer, we can run the library() command.
library()
## Warning in library(): libraries '/usr/local/lib/R/site-library', '/
## usr/lib/R/site-library' contain no packages
It turns out on this computer, that’s a pretty large number of packages, but on your computer it may return fewer packages, and that’s perfectly fine.
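If you would rather work with the list of installed packages as data, rather than reading it in a viewer, installed.packages() returns the same information as a table. A minimal sketch, with illustrative variable names:

```r
# One row per installed package, with columns such as Package and Version
pkg_table <- installed.packages()
installed_names <- rownames(pkg_table)

# How many packages are installed on this computer?
print(length(installed_names))

# Peek at the first few package names and versions
print(head(pkg_table[, c("Package", "Version")]))

# Check whether a particular package is already installed
print("dplyr" %in% installed_names)
```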
The EEB R Manual will frequently make use of packages that are part of the Tidyverse. One of the first packages we’ll use is dplyr, a very common and useful data cleaning package.
The Tidyverse
The Tidyverse is “…an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures” that is funded and developed by a company called Posit, which was previously known as RStudio. It is a good starting point for those learning R, as there are a ton of educational resources available.
Okay, let’s install the package. Any time we install a package, we call install.packages() and put the package name in quotation marks inside the parentheses.
install.packages("dplyr")
## Error in install.packages : Updating loaded packages
We see here a long output as the package installs and tells us how the process is going. We don’t need to worry too much about the specifics of this output, but the end part, where it tells us that The downloaded source packages are in … tells us that the installation process worked, and the software now exists on our computer. This process is the same for any new package we want.
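Because a package only needs to be installed once, a common pattern is to install it only when it is missing. The helper below is a hypothetical convenience function (install_if_missing is not part of R or of any package mentioned here), sketched under the assumption that you have a working internet connection when an installation is actually needed.

```r
# Hypothetical helper: install a package only if it is not already available
install_if_missing <- function(pkg) {
  # requireNamespace() returns TRUE when the package can be loaded
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # nothing to do
  }
  install.packages(pkg)
  invisible(TRUE)             # an installation was attempted
}

# "stats" ships with R, so this call never downloads anything;
# in practice you would call install_if_missing("dplyr")
did_install <- install_if_missing("stats")
print(did_install)
```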
Loading a Package
Now that the package has been installed, we never need to install it again. But each time we start a new R session and want to use the package, we need to load it. That is, we tell R that we want to use the functions inside the package. We do this via the library() function. We only need to call library() once per R session. If we load the package and then close R and restart it, we’ll need to call library() again.
As we discuss in the workflow section, it’s best to call all the packages that you want to use at the top of your script. Why? Well, R works its way through a script from top to bottom. So if we try to use a function in a package before we’ve actually called it, we’ll get an error. Let’s load the dplyr
package now:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
There is some important information in this output. First of all, we see that two packages, package:stats and package:base, now have “masked objects”. The masked objects are listed for each package. For stats, filter() and lag() are masked, and for base, the functions intersect(), setdiff(), setequal(), and union() are masked.
Don’t worry, you don’t have to do anything about this, but know that this means there are functions in the dplyr package with the same names (i.e. filter() and lag(), and intersect(), setdiff(), setequal(), and union()) as functions that were already loaded in R. So, if you just call filter(), R will default to using the function named filter() from the dplyr package, and NOT the one from the stats package.
It is good practice to be explicit about which package you are using for which task with the package::function() notation. That is, if I wanted to use the filter() function from dplyr, I would write dplyr::filter(), but if I wanted to use the function from the stats package, I would write stats::filter(). This way there is no ambiguity about which function R will use, and no risk of silently defaulting to the wrong package!
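Here is a small sketch of the package::function() notation in action. It only uses functions from stats and base, so it runs whether or not dplyr is loaded; the input values are made up for illustration.

```r
x <- c(1, 2, 3, 4, 5)

# Explicitly use the time-series filter from stats:
# a centred moving average over three values
smoothed <- stats::filter(x, rep(1 / 3, 3))
print(smoothed)

# Explicitly use the base version of union()
combined <- base::union(c(1, 2), c(2, 3))
print(combined)
```

With dplyr loaded, a bare filter(x, rep(1 / 3, 3)) would instead call dplyr::filter(), which expects a data frame as its first argument, so the explicit prefix is what keeps this code working.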
Using Packages in Scripts
As we will discuss in the workflow section, any packages that we need for an analysis should always be loaded at the top of the script we’re working in. For example, if I wanted to use two packages in a script, here and dplyr, I would load them both at the very top of my new R script like this:
# install.packages(c("here", "dplyr"))
library(here)
library(dplyr)
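If you share scripts with others, it can help to check that the required packages are installed before trying to load them. The following is one possible sketch of such a check, not a required convention; the variable names are illustrative.

```r
# Packages this script relies on
needed <- c("here", "dplyr")

# TRUE for each package that is installed and can be loaded
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  # Tell the user what to install rather than failing part-way through
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
} else {
  library(here)
  library(dplyr)
}
```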
Troubleshooting Packages
There are a couple of common problems that we all encounter when trying to use packages. The most common is a failed installation, usually stemming from version issues (i.e. the installed version of R doesn’t support the version of the package), which ends up producing frustrating messages like Warning in install.packages : package "magittr" is not available (for R version 3.3.3). The data blogger Laura Ellis has an excellent flowchart of how to go about installing a package that is worth taking some time to consider.
Let’s walk through this process for another package we’ll use a lot in this R Manual, ggplot2. First, let’s head over to CRAN and go to the list of available packages, sorted by name:
And we can scroll to find the package ggplot2:
We see here that it Depends: on R (≥ 3.3). We can see what version we’re running by putting the following command in the console:
version
## _
## platform x86_64-pc-linux-gnu
## arch x86_64
## os linux-gnu
## system x86_64, linux-gnu
## status Patched
## major 4
## minor 2.2
## year 2022
## month 11
## day 10
## svn rev 83330
## language R
## version.string R version 4.2.2 Patched (2022-11-10 r83330)
## nickname Innocent and Trusting
And we see here that (at least on this computer), we’re running version 4.2.2, which is ≥ 3.3, so we know we’re okay to download this package. We can do so via the method shown earlier:
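The same comparison can be done in code rather than by eye. This is a minimal sketch using getRversion(); the required version "3.3" is taken from the Depends: field discussed above, and the variable names are illustrative.

```r
# The version of R that is currently running
current <- getRversion()

# Minimum version listed in the package's Depends: field
required <- "3.3"

# TRUE if the running version is at least the required one
meets_requirement <- current >= required

print(R.version.string)
print(meets_requirement)
```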
install.packages("ggplot2")
## Error in install.packages : Updating loaded packages
And we see this worked! On to the next step: Celebrate 🙂
If you receive messages on your computer about specific “dependencies” that you’re missing, the first thing to try is manually installing the packages named in the error message.
Once you have a package loaded in R, it’s usually easy to check out the package’s functions and find out more about them. For example, since we’ve installed ggplot2, we can load the package and then use the ? help operator to take a look at some of its functions.
library(ggplot2)
Now that the package is loaded, let’s take a look at the main function in this package: ggplot()
?ggplot
If you do this in RStudio, the help window on the bottom right will open up a help page that looks something like this:
And if you scroll down, you’ll see examples of this function at work. It’s very important to get some sense of how a package or set of functions works before you use it with your own data; otherwise you risk introducing errors that you will be unaware of, simply through unintentional misuse of a function.
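Beyond the help page for a single function, you can also list everything a package makes available. The sketch below uses the stats package so that it runs on any installation; assuming ggplot2 is installed, you could substitute "ggplot2" to explore that package instead.

```r
# Names of all functions and objects the package exports
exported <- getNamespaceExports("stats")
print(head(sort(exported)))

# Show the arguments a particular function accepts
print(args(stats::filter))

# To open the package's help index in an interactive session, you could run:
# help(package = "stats")
```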