*This is part two of a series on ggplot2.*

In today’s post, we will look at some of the basics of ggplot. As mentioned in the previous post, ggplot has a number of layers. In today’s post, we will look at two of these layers: the basic mapping layer and the geometric object (geom) layer.

### Loading Libraries

The first step is to load the libraries that we’ll use today. I’m using the tidyverse library, which includes a number of useful packages including ggplot2. I’m also loading the gapminder library, which has interesting periodic, cross-country data covering a 50-year time frame.

if(!require(tidyverse)) { install.packages("tidyverse", repos = "http://cran.us.r-project.org") library(tidyverse) } if(!require(gapminder)) { install.packages("gapminder", repos = "http://cran.us.r-project.org") library(gapminder) }

Once I have these libraries loaded, I can start messing around with the data.

Let’s first take a quick look at what’s in the gapminder data frame, using the head and glimpse functions:

We can see that there are three descriptive variables—country, continent and year—and three measures of interest—life expectancy, population, and GDP per capita. The gapminder data set collects data for countries around the world from the period 1952 through 2007.

### Mapping Data

The first question I want to ask is, what was the average life expectancy by continent in 1952? I’m imagining a column chart to display this data (for a brief review of column charts, check out my post on various types of visuals).

The first thing we want to do is to get our data into the right format. We’ll see some of ggplot2’s ability to reshape data soon, but I want to start by feeding it the final data set, as that makes it easier for us to follow.

lifeExp_by_continent_1952 <- gapminder %>% filter(year == 1952) %>% group_by(continent) %>% summarize(avg_lifeExp = mean(lifeExp)) %>% select(continent, avg_lifeExp)

Now that we have my source data prepared, we can get to work on plotting. The first thing we need is a mapping. Here’s the mapping that we’re going to use:

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = continent, y = avg_lifeExp))

We have two parameter that we’re passing into the ggplot function: data and mapping. We can specify the data frame that we’re using as the data parameter; doing so means that we do not need to keep specifying it down the line. The mapping, as you’ll recall, lets us represent parts of the graph. Specifically, we’re going to define an aesthetic which lays out what the X and Y axes contain: the X axis will list the individual continents, and the Y axis will cover average life expectancy.

If we run that first line of code, we already have a visual:

Yeah, it’s not a very useful visual, but it’s a start.

### Adding Geometric Objects

The next step is to add a geometric object. I mentioned that I want a column for each continent, and that column’s value represents the average life expectancy. This leads to our first geom: the column.

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = continent, y = avg_lifeExp)) + geom_col()

We can display a simple column chart in two lines of code.This gives us a column chart. The order of the columns is alphabetical, which could be the way that we want to display the data, but probably isn’t. We probably want to reorder the X axis by average life expectancy descending, and that’s what the reorder function lets us do.

ggplot(data = lifeExp_by_continent_1952, mapping = aes(x = reorder(continent, desc(avg_lifeExp)), y = avg_lifeExp)) + geom_col()

This might not be absolutely necessary for this particular visual, but it’s a good principle to follow and definitely helps us when there are more categories or several categories which are very close in mean life expectancy.

### More Geometric Objects

We’re going to look at a few more geoms. If you want to see even more, check out the ggplot2 cheat sheet or sape’s geom reference.

#### Scatter Plots And Smoothers

Let’s say that I want to test a conjecture that higher GDP per capita (measured here in USD) correlates with higher life expectancy. I can plot out GDP per capita versus life expectancy pretty easily.

ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()

Well, that’s not extremely clear… But I made a mistake that should be clear to people who have done data analytics on economic data: money typically should be expressed as a logarithmic function. There’s a good way to do this but we won’t cover it today. I’m going to cover the bad method today and save the good method for the next post in the series. The bad method is to modify the X-axis variable and call the log function

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) + geom_point()

Now we see a much clearer relationship between the log of GDP per capita and average life expectancy. It’s not a perfect relationship, but there’s definitely a positive line that we could draw. So let’s draw that positive line!

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) + geom_point() + geom_smooth(method = "lm", se = FALSE)

We have used a new geom here, geom_smooth. The geom_smooth function creates a smoothed conditional mean. Basically, we’re drawing some line as a result of a function based on this input data. Notice that there are two parameters that I set: method and se. The method parameter tells the function which method to use. There are five methods available, including using a Generalized Additive Model (gam), Locally Weighted Scatterplot Smoothing (loess), and three varieties of Linear Models (lm, glm, and rlm). The se parameter controls whether we want to see the standard error bar. Here’s our graph with the standard error bar turned on:

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) + geom_point() + geom_smooth(method = "lm", se = TRUE)

This model also represents the first time that we’ve created a complex visual. This is a visual with dots as well as a line. It was really easy to create this because we can lay out the two layers independent of one another: I can have geom_point() without geom_smooth() or vice versa, so if I need to work on one layer, I can comment out the other and hide it until I’m ready. This also allowed us to step through the visual iteratively.

Let’s turn off standard errors again and look at the scatter plot. One trick we can use to see the line more clearly is to change the alpha channel for our scatter plot dots. We can use the alpha parameter on geom_point to do just this.

ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp)) + geom_point(alpha = 0.2) + geom_smooth(method = "lm", se = FALSE)

Now we can see the line more clearly without losing the scatter plot. This has a second beneficial effect for us: there was some overplotting of dots, where several country-year combos had roughly the same GDP and life expectancy. By toning down the alpha channel a bit, we can see the overlap much more clearly.

Zooming in a bit, let’s filter down to one country, Germany. We could create a new data frame with just the Germany data, but we don’t need to do that. We can just simply use the filter function in dplyr and get data for Germany:

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = gdpPercap, y = lifeExp)) + geom_point()

Note that I switched back to normal format for GDP per capita so we can see dollar amounts.

#### Other Charts

Creating a line chart is pretty easy as well. Let’s graph GDP per capita in Germany from 1952 to 2007.

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = year, y = gdpPercap)) + geom_line()

We can easily switch it over to an area chart with geom_area:

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = year, y = gdpPercap)) + geom_area(alpha = 0.4)

Note that I’ve set the alpha on geom_area, mostly so that the amount of black doesn’t overwhelm the eyes.

We can create a step chart as well. This is helpful in gauging the magnitude of changes from period to period a little more clearly than on a line chart:

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = year, y = gdpPercap)) + geom_step()

We can also compare what the step output looks like versus the line output. In this case, I’m coloring the step output as red and leaving the line as black.

ggplot(data = filter(gapminder, country == "Germany"), mapping = aes(x = year, y = gdpPercap)) + geom_step(color="Red") + geom_line()

Taking another tack, let’s see what the spread for GDP per capita is across continents in the year 1997. We will once again use the log of GDP, and will create a box and whiskers plot using geom_boxplot.

ggplot(data = filter(gapminder, year == 1997), mapping = aes(x = continent, y = log(gdpPercap))) + geom_boxplot()

### Conclusion

In today’s post, we looked at several of the geometric objects available within ggplot2. We’re able to create simplistic but functional graphs with just two or three lines of code. Starting with the next post, we’ll begin to improve some of these charts by looking at scales and coordinates.