Geo-faceted population pyramids with tidycensus 0.3

by Kyle Walker

Version 0.3 of the tidycensus R package is now available on CRAN. The big change in this release is the ability to fetch entire tables of Census or ACS data without constructing a list of variable names. To do so, pass the table prefix to the new table parameter of get_decennial() or get_acs().
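As a rough sketch of what a table request might look like on the ACS side, the wrapper below asks for every variable in table B01001 (Sex by Age) in a single call. The helper name fetch_sex_by_age is made up for this illustration and is not part of tidycensus; actually calling it assumes tidycensus is installed and a Census API key is already set up.

```r
# Hypothetical helper (not part of tidycensus): request every variable in
# ACS table B01001 (Sex by Age) without listing variable names by hand.
# Calling it needs network access and an installed Census API key.
fetch_sex_by_age <- function(geography = "state") {
  tidycensus::get_acs(geography = geography, table = "B01001")
}

# Usage (commented out because it contacts the Census API):
# acs_age <- fetch_sex_by_age("state")
```

Wrapping the call in a function only keeps the sketch from hitting the API until you choose to run it; calling get_acs() directly with table = "B01001" should behave the same way.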

I’ll illustrate this by creating faceted population pyramids with the geofacet R package, which arranges faceted ggplot2 plots so that each panel’s position reflects the geographic location of its data.

To get started, let’s get data on age and sex, which are required to create population pyramids, using the new table parameter in tidycensus.

library(tidycensus)
library(tidyverse)
library(stringr)

# If not installed, install your Census API key with `census_api_key("KEY", install = TRUE)`

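# Request the 2010 Census explicitly, since the default year may differ across tidycensus versions
age <- get_decennial(geography = "state", table = "P012", year = 2010,
                     summary_var = "P0010001") %>%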
  mutate(variable = str_replace(variable, "P01200", "")) %>%
  filter(!variable %in% c("01", "02", "26")) %>%
  arrange(NAME, variable)

head(age)
## # A tibble: 6 x 5
##   GEOID    NAME variable  value summary_value
##   <chr>   <chr>    <chr>  <dbl>         <dbl>
## 1    01 Alabama       03 155265       4779736
## 2    01 Alabama       04 157340       4779736
## 3    01 Alabama       05 163417       4779736
## 4    01 Alabama       06 102627       4779736
## 5    01 Alabama       07  72524       4779736
## 6    01 Alabama       08  36159       4779736

I’ve fetched all age and sex data from Census 2010 table P012, then removed three variables in the table, representing total population, total male population, and total female population, respectively.
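To make the filtering step concrete, here is a small offline sketch on a handful of made-up variable codes in the P012 format. It uses base R's sub() as a stand-in for the str_replace() call above so that it runs without any packages; the vector names are illustrative only.

```r
# Toy variable codes in the same format as Census 2010 table P012.
codes <- c("P0120001", "P0120002", "P0120003", "P0120026", "P0120027")

# Strip the shared prefix, leaving a two-digit suffix
# (base R stand-in for str_replace(variable, "P01200", "")).
suffix <- sub("P01200", "", codes, fixed = TRUE)

# Drop the total, male-total, and female-total rows.
keep <- !suffix %in% c("01", "02", "26")
suffix[keep]
```

Only the suffixes for actual age bands survive the filter, which is what the rest of the workflow relies on.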

Next, I do some data wrangling to get group percentages by state for 5-year age bands, since the Census table reports some age bands that are narrower than 5 years (15-17 and 18-19, for example). I define my desired age categories, calculate a group sum and then a percentage, and set all male values to negative so they display on the left-hand side of the population pyramids.

agegroups <- c("0-4", "5-9", "10-14", "15-19", "15-19", "20-24", "20-24", 
               "20-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54", 
               "55-59", "60-64", "60-64", "65-69", "65-69", "70-74", "75-79", 
               "80-84", "85+")

agesex <- c(paste("Male", agegroups), 
            paste("Female", agegroups))

age$group <- rep(agesex, length(unique(age$NAME)))

age2 <- age %>%
  group_by(NAME, group) %>%
  mutate(group_est = sum(value)) %>%
  distinct(NAME, group, .keep_all = TRUE) %>%
  ungroup() %>%
  mutate(percent = 100 * (group_est / summary_value)) %>%
  select(name = NAME, group, percent) %>%
  separate(group, into = c("sex", "age"), sep = " ") %>%
  mutate(age = factor(age, levels = unique(age)), 
         percent = ifelse(sex == "Female", percent, -percent)) 

head(age2)
## # A tibble: 6 x 4
##      name   sex    age   percent
##     <chr> <chr> <fctr>     <dbl>
## 1 Alabama  Male    0-4 -3.248401
## 2 Alabama  Male    5-9 -3.291814
## 3 Alabama  Male  10-14 -3.418955
## 4 Alabama  Male  15-19 -3.664449
## 5 Alabama  Male  20-24 -3.504796
## 6 Alabama  Male  25-29 -3.215994
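The aggregation logic can be easier to follow on a tiny example. The sketch below uses made-up counts for a single hypothetical state with a total population of 1000, and base R's aggregate() in place of the dplyr pipeline above; all object names are illustrative.

```r
# Made-up counts for one hypothetical state (total population 1000).
# Two narrow bands per sex map onto the same 5-year label, "15-19".
toy <- data.frame(
  group = c("Male 15-19", "Male 15-19", "Male 20-24",
            "Female 15-19", "Female 15-19", "Female 20-24"),
  value = c(30, 20, 60, 28, 22, 58),
  stringsAsFactors = FALSE
)
total <- 1000

# Sum the narrower Census bands into one row per sex and 5-year band.
bands <- aggregate(value ~ group, data = toy, FUN = sum)

# Convert each group sum to a share of the total population.
bands$percent <- 100 * bands$value / total

# Male shares go negative so they plot to the left of zero.
is_male <- startsWith(bands$group, "Male")
bands$percent[is_male] <- -bands$percent[is_male]

bands
```

Because the group labels in the real workflow are assigned by position with rep(), it may also be worth confirming that every state has the same number of rows before attaching them.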

I can now create a geofaceted plot with ggplot2. The population pyramids are back-to-back bar charts categorized by sex, and the facet_geo() function in the geofacet package places each plot in its geographically appropriate position.

library(geofacet)
library(extrafont)

xlabs <- c("0-4" = "0-4", "5-9" = "", "10-14" = "", "15-19" = "", "20-24" = "", 
          "25-29" = "", "30-34" = "", "35-39" = "", "40-44" = "", "45-49" = "", 
          "50-54" = "", "55-59" = "", "60-64" = "", "65-69" = "", "70-74" = "", 
          "75-79" = "", "80-84" = "", "85+" = "85+")

ggplot(data = age2, aes(x = age, y = percent, fill = sex)) +
  geom_bar(stat = "identity", width = 1) + 
  scale_y_continuous(breaks = c(-5, 0, 5), labels = c("5%", "0%", "5%")) + 
  coord_flip() + 
  theme_minimal(base_family = "Tahoma") + 
  scale_x_discrete(labels = xlabs) + 
  scale_fill_manual(values = c("red", "navy")) + 
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        strip.text.x = element_text(size = 6)) + 
  labs(x = "", y = "", fill = "", 
       title = "Demographic structure of US states", 
       caption = "Data source: 2010 US Census, tidycensus R package.  Chart by @kyle_e_walker.") + 
  facet_geo(~ name, grid = "us_state_grid2", move_axes = TRUE) 
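The xlabs vector in the plotting code is typed out by hand. As one possible alternative, it could be built from the age bands themselves, keeping a label only for the first and last band. This is a sketch in base R; the names bands, shown, and xlabs_auto are made up here.

```r
# The 5-year age bands, in plotting order.
bands <- c("0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34",
           "35-39", "40-44", "45-49", "50-54", "55-59", "60-64", "65-69",
           "70-74", "75-79", "80-84", "85+")

# Keep the label for the youngest and oldest band only; blank out the rest.
shown <- c("0-4", "85+")
xlabs_auto <- setNames(ifelse(bands %in% shown, bands, ""), bands)

xlabs_auto
```

In the workflow above, levels(age2$age) could presumably take the place of the hard-coded bands vector, and xlabs_auto could then be passed to scale_x_discrete(labels = ...).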

Many states look quite similar, though there are a few notable outliers. These include high-fertility states like Utah and Idaho, reflected in their proportionally larger young populations, as well as DC’s urban profile with a sizeable population of residents in their 20s and 30s.

Within-state differences are perhaps more interesting; I’m in the process of creating these types of graphs at the county level by state and posting them to Twitter as I go.

I’ll be creating a website that eventually will show the demographic structure of counties across the US.