Doing ‘GIS’ and making maps with US Census data in R

2024 SSDAN Webinar Series

Kyle Walker

2024-03-07

About me

SSDAN webinar series

Today’s agenda

  • Hour 1: How to get and explore spatial US Census data using R

  • Hour 2: A tour of map types with R and US Census data

  • Hour 3: Advanced workflows: automated mapping and spatial analysis

US Census data: an overview

R and RStudio

  • R: programming language and software environment for data analysis (and wherever else your imagination can take you!)

  • RStudio: integrated development environment (IDE) for R developed by Posit

  • Posit Cloud: run RStudio with today’s workshop pre-configured at https://posit.cloud/content/7549022

Setup: RStudio and basic data structures in R

The Decennial US Census

  • Complete count of the US population mandated by Article 1, Sections 2 and 9 in the US Constitution

  • Directed by the US Census Bureau (US Department of Commerce); conducted every 10 years since 1790

  • Used for proportional representation / congressional redistricting

  • Limited set of questions asked about race, ethnicity, age, sex, and housing tenure

The American Community Survey

  • Annual survey of 3.5 million US households

  • Covers topics not available in decennial US Census data (e.g. income, education, language, housing characteristics)

  • Available as 1-year estimates (for geographies of population 65,000 and greater) and 5-year estimates (for geographies down to the block group)

  • Data delivered as estimates characterized by margins of error
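Because ACS margins of error are published at a 90 percent confidence level, converting them to another level is simple arithmetic (rescale by the ratio of critical z-values). A minimal base-R sketch, using the Anderson County figures shown later in this workshop:

```r
# ACS MOEs are published at 90% confidence (z = 1.645).
# Rescale to 95% confidence (z = 1.96):
moe_90_to_95 <- function(moe) moe * (1.96 / 1.645)

# Hypothetical example: median income estimate of 57445, 90% MOE of 4562
estimate <- 57445
moe_95 <- moe_90_to_95(4562)

# 95% confidence interval around the estimate
ci <- c(estimate - moe_95, estimate + moe_95)
```

tidycensus also ships helpers like moe_sum() for aggregating margins of error across derived estimates.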

How to get Census data

tidycensus

  • R interface to the Decennial Census, American Community Survey, Population Estimates Program, Migration Flows, and Public Use Microdata Sample (PUMS) APIs

  • First released in 2017; over 500,000 downloads from the Posit CRAN mirror

tidycensus: key features

  • Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested);

  • Automatically downloads and merges Census geometries to data for mapping;

  • Includes a variety of analytic tools to support common Census workflows;

  • States and counties can be requested by name (no more looking up FIPS codes!)

Getting started with tidycensus

  • To get started, install the packages you’ll need for today’s workshop

  • If you are using the Posit Cloud environment, these packages are already installed for you

install.packages(c("tidycensus", "tidyverse", "mapview", "ggspatial", "leafsync"))

Optional: your Census API key

  • tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day

  • Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.

  • Once activated, use the census_api_key() function to set your key as an environment variable

library(tidycensus)

census_api_key("YOUR KEY GOES HERE", install = TRUE)

Getting spatial US Census data

Spatial Census data: the old way

Traditionally, getting “spatial” Census data required:

  • Fetching shapefiles from the Census website;

  • Downloading a CSV of data, then cleaning and formatting it;

  • Loading geometries and data into your GIS of choice;

  • Aligning key fields in your GIS and joining your data

Getting started with tidycensus

  • Your core functions in tidycensus are get_decennial() for decennial Census data, and get_acs() for ACS data

  • Required arguments are geography and variables

texas_income <- get_acs(
  geography = "county",
  variables = "B19013_001",
  state = "TX",
  year = 2022
)
  • ACS data are returned with five columns: GEOID, NAME, variable, estimate, and moe
texas_income
# A tibble: 254 × 5
   GEOID NAME                    variable   estimate   moe
   <chr> <chr>                   <chr>         <dbl> <dbl>
 1 48001 Anderson County, Texas  B19013_001    57445  4562
 2 48003 Andrews County, Texas   B19013_001    86458 16116
 3 48005 Angelina County, Texas  B19013_001    57055  2484
 4 48007 Aransas County, Texas   B19013_001    58168  6458
 5 48009 Archer County, Texas    B19013_001    69954  8482
 6 48011 Armstrong County, Texas B19013_001    70417 14574
 7 48013 Atascosa County, Texas  B19013_001    67442  4309
 8 48015 Austin County, Texas    B19013_001    73556  4757
 9 48017 Bailey County, Texas    B19013_001    69830 13120
10 48019 Bandera County, Texas   B19013_001    70965  5710
# ℹ 244 more rows

Spatial Census data with tidycensus

  • Use the argument geometry = TRUE to get pre-joined geometry along with your data!
texas_income_sf <- get_acs(
  geography = "county",
  variables = "B19013_001",
  state = "TX",
  year = 2022,
  geometry = TRUE
)
  • We can now make a simple map with plot():
plot(texas_income_sf['estimate'])

Looking under the hood: simple features in R

  • The sf package implements a simple features data model for vector spatial data in R

  • Vector geometries: points, lines, and polygons stored in a list-column of a data frame

  • Spatial data are returned with five data columns: GEOID, NAME, variable, estimate, and moe, along with a geometry column representing the shapes of locations
texas_income_sf
Simple feature collection with 254 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -106.6456 ymin: 25.83738 xmax: -93.50829 ymax: 36.5007
Geodetic CRS:  NAD83
First 10 features:
   GEOID                     NAME   variable estimate   moe
1  48273    Kleberg County, Texas B19013_001    52487  4987
2  48391    Refugio County, Texas B19013_001    54304  2650
3  48201     Harris County, Texas B19013_001    70789   482
4  48443    Terrell County, Texas B19013_001    52813 11107
5  48229   Hudspeth County, Texas B19013_001    35163  8159
6  48205    Hartley County, Texas B19013_001    78065 21076
7  48351     Newton County, Texas B19013_001    38871  6573
8  48373       Polk County, Texas B19013_001    57315  2976
9  48139      Ellis County, Texas B19013_001    93248  2485
10 48491 Williamson County, Texas B19013_001   102851  1462
                         geometry
1  MULTIPOLYGON (((-97.3178 27...
2  MULTIPOLYGON (((-97.54085 2...
3  MULTIPOLYGON (((-94.97839 2...
4  MULTIPOLYGON (((-102.5669 3...
5  MULTIPOLYGON (((-105.998 32...
6  MULTIPOLYGON (((-103.0422 3...
7  MULTIPOLYGON (((-93.91113 3...
8  MULTIPOLYGON (((-95.20018 3...
9  MULTIPOLYGON (((-97.08703 3...
10 MULTIPOLYGON (((-98.04989 3...

Interactive viewing with mapview()

  • The mapview package allows for interactive viewing of spatial data in R
library(mapview)

mapview(
  texas_income_sf, 
  zcol = "estimate"
)

Understanding geography and variables in tidycensus

US Census Geography

Source: US Census Bureau


Geography in tidycensus

Searching for variables

  • To search for variables, use the load_variables() function along with a year and dataset

  • For the 2022 5-year ACS, use "acs5" for the Detailed Tables; "acs5/profile" for the Data Profile; "acs5/subject" for the Subject Tables; and "acs5/cprofile" for the Comparison Profile

  • The View() function in RStudio allows for interactive browsing and filtering

vars <- load_variables(2022, "acs5")

View(vars)

Small-area spatial demographic data

  • Smaller areas like Census tracts or block groups are available with geography = "tract" or geography = "block group"; one or more counties can optionally be specified to focus your query
nyc_income <- get_acs(
  geography = "tract",
  variables = "B19013_001",
  state = "NY",
  county = c("New York", "Kings", "Queens",
             "Bronx", "Richmond"),
  year = 2022,
  geometry = TRUE
)
mapview(nyc_income, zcol = "estimate")

“Tidy” or long-form data

  • The default data structure returned by tidycensus is “tidy” or long-form data, with variables by geography stacked by row

  • For spatial data, this means that geometries will also be stacked, which is helpful for group-wise analysis and visualization
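The difference between the two formats can be sketched on a toy table with base R (hypothetical values; tidycensus performs this reshaping internally when you request wide output):

```r
# Toy long-form ("tidy") table mimicking tidycensus output:
# one row per geography/variable combination
long <- data.frame(
  GEOID = c("001", "001", "002", "002"),
  variable = c("Hispanic", "White", "Hispanic", "White"),
  estimate = c(87.0, 6.9, 19.4, 68.3)
)

# Spread to wide form: one row per GEOID, one column per variable
wide <- reshape(long, idvar = "GEOID", timevar = "variable",
                direction = "wide")
names(wide) <- sub("^estimate\\.", "", names(wide))
wide
```

In the tidyverse you would reach for tidyr::pivot_wider() instead; base reshape() is used here only to keep the sketch dependency-free.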

“Tidy” or long-form data

san_diego_race <- get_acs(
  geography = "tract",
  variables = c(
    Hispanic = "DP05_0073P",
    White = "DP05_0079P",
    Black = "DP05_0080P",
    Asian = "DP05_0082P"
  ),
  state = "CA",
  county = "San Diego",
  year = 2022,
  geometry = TRUE
)
san_diego_race
Simple feature collection with 2948 features and 5 fields (with 4 geometries empty)
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -117.596 ymin: 32.53444 xmax: -116.0809 ymax: 33.50503
Geodetic CRS:  NAD83
First 10 features:
         GEOID                                              NAME variable
1  06073013104 Census Tract 131.04; San Diego County; California Hispanic
2  06073013104 Census Tract 131.04; San Diego County; California    White
3  06073013104 Census Tract 131.04; San Diego County; California    Black
4  06073013104 Census Tract 131.04; San Diego County; California    Asian
5  06073020902 Census Tract 209.02; San Diego County; California Hispanic
6  06073020902 Census Tract 209.02; San Diego County; California    White
7  06073020902 Census Tract 209.02; San Diego County; California    Black
8  06073020902 Census Tract 209.02; San Diego County; California    Asian
9  06073015902 Census Tract 159.02; San Diego County; California Hispanic
10 06073015902 Census Tract 159.02; San Diego County; California    White
   estimate  moe                       geometry
1      87.0  4.8 MULTIPOLYGON (((-117.0841 3...
2       6.9  2.9 MULTIPOLYGON (((-117.0841 3...
3       3.5  3.0 MULTIPOLYGON (((-117.0841 3...
4       2.6  1.9 MULTIPOLYGON (((-117.0841 3...
5      19.4 10.9 MULTIPOLYGON (((-116.7902 3...
6      68.3 12.2 MULTIPOLYGON (((-116.7902 3...
7       2.6  1.6 MULTIPOLYGON (((-116.7902 3...
8       1.3  1.5 MULTIPOLYGON (((-116.7902 3...
9      36.3 12.0 MULTIPOLYGON (((-116.9788 3...
10     54.6 11.8 MULTIPOLYGON (((-116.9788 3...

“Wide” data

  • The argument output = "wide" spreads Census variables across the columns, returning one row per geographic unit and one column per variable

  • This will be a more familiar data structure for traditional desktop GIS users

“Wide” data

san_diego_race_wide <- get_acs(
  geography = "tract",
  variables = c(
    Hispanic = "DP05_0073P",
    White = "DP05_0079P",
    Black = "DP05_0080P",
    Asian = "DP05_0082P"
  ),
  state = "CA",
  county = "San Diego",
  geometry = TRUE,
  output = "wide",
  year = 2022
)
san_diego_race_wide
Simple feature collection with 737 features and 10 fields (with 1 geometry empty)
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -117.596 ymin: 32.53444 xmax: -116.0809 ymax: 33.50503
Geodetic CRS:  NAD83
First 10 features:
         GEOID                                              NAME HispanicE
1  06073013104 Census Tract 131.04; San Diego County; California      87.0
2  06073020902 Census Tract 209.02; San Diego County; California      19.4
3  06073015902 Census Tract 159.02; San Diego County; California      36.3
4  06073005700     Census Tract 57; San Diego County; California      24.6
5  06073003001  Census Tract 30.01; San Diego County; California      44.5
6  06073015801 Census Tract 158.01; San Diego County; California      36.9
7  06073011601 Census Tract 116.01; San Diego County; California      85.3
8  06073010111 Census Tract 101.11; San Diego County; California      92.3
9  06073009000     Census Tract 90; San Diego County; California      39.3
10 06073008510  Census Tract 85.10; San Diego County; California      23.2
   HispanicM WhiteE WhiteM BlackE BlackM AsianE AsianM
1        4.8    6.9    2.9    3.5    3.0    2.6    1.9
2       10.9   68.3   12.2    2.6    1.6    1.3    1.5
3       12.0   54.6   11.8    2.9    2.1    3.7    3.5
4        8.0   52.1    8.0   12.0    6.3    5.9    4.1
5        6.9   10.6    2.7   26.7    5.7    3.1    2.7
6       12.1   39.0    8.7   13.5    7.2    2.2    1.6
7        4.4    3.5    2.3    2.8    1.9    4.2    2.3
8        5.1    4.4    4.0    2.1    3.1    0.5    0.6
9        9.6   40.4    7.2    2.3    1.7   11.0    4.3
10       7.9   47.8    7.9    5.0    3.5   19.6    5.0
                         geometry
1  MULTIPOLYGON (((-117.0841 3...
2  MULTIPOLYGON (((-116.7902 3...
3  MULTIPOLYGON (((-116.9788 3...
4  MULTIPOLYGON (((-117.1657 3...
5  MULTIPOLYGON (((-117.085 32...
6  MULTIPOLYGON (((-116.9666 3...
7  MULTIPOLYGON (((-117.1036 3...
8  MULTIPOLYGON (((-117.0739 3...
9  MULTIPOLYGON (((-117.1872 3...
10 MULTIPOLYGON (((-117.1753 3...

Census geometry and the tigris R package

  • tidycensus uses the tigris R package internally to acquire Census shapefiles

  • By default, the Cartographic Boundary shapefiles are used, which are pre-clipped to the US shoreline

Problem: interior water areas

  • Let’s re-visit the NYC income map

  • Water areas throughout the city are included within tract polygons, making patterns difficult to observe for those who know the area

  • erase_water() in the tigris package offers a solution; it automates the removal of water areas from your shapes

  • Tips: tune the area_threshold to selectively remove water area; use cb = FALSE to avoid sliver polygons

nyc_income_tiger <- get_acs(
  geography = "tract",
  variables = "B19013_001",
  state = "NY",
  county = c("New York", "Kings", "Queens",
             "Bronx", "Richmond"),
  year = 2022,
  cb = FALSE,
  geometry = TRUE
)
  • area_threshold = 0.5 means we remove water areas at or above the 50th percentile of size, i.e. the largest 50% of water areas in the region
library(tigris)
library(sf)
sf_use_s2(FALSE)

nyc_erase <- erase_water(
  nyc_income_tiger,
  area_threshold = 0.5,
  year = 2022
)
mapview(nyc_erase, zcol = "estimate")

Part 1 exercises

  1. Use the load_variables() function to find a variable that interests you that we haven’t used yet.

  2. Use get_acs() to fetch spatial ACS data on that variable for a geography and location of your choice, then use mapview() to display your data interactively.

A tour of map types with R and US Census data

Mapping in R

  • R has a robust set of tools for cartographic visualization that make it a suitable alternative to desktop GIS software in many instances

  • Popular packages for cartography include ggplot2, tmap, and mapsf

  • Today, we’ll be focusing on ggplot2; see Chapter 6 of my book for similar examples using tmap

ggplot2 and geom_sf()

  • ggplot2: R’s most popular visualization package (over 137 million downloads!)

  • ggplot2 graphics are defined by an aesthetic mapping and one or more geoms

  • geom_sf() is a special geom that interprets the geometry type of your spatial data and visualizes it accordingly

  • As a result, we can make attractive maps using familiar ggplot2 syntax

Mapping Census data with ggplot2

Continuous choropleth

  • By default, ggplot2 will apply a continuous color palette to create choropleth maps

  • Choropleth maps: the shading of a polygon / shape is mapped to a data attribute

library(tidyverse)

san_diego_asian <- filter(san_diego_race, variable == "Asian")

ggplot(san_diego_asian, aes(fill = estimate)) + 
  geom_sf()

Continuous choropleth with styling

  • We can apply some styling to customize our choropleth maps

  • Used here: a viridis color palette, which is built in to ggplot2

cont_choro <- ggplot(san_diego_asian, aes(fill = estimate)) + 
  geom_sf() + 
  theme_void() + 
  scale_fill_viridis_c(option = "rocket") + 
  labs(title = "Percent Asian by Census tract",
       subtitle = "San Diego County, CA",
       fill = "ACS estimate",
       caption = "2018-2022 ACS | tidycensus R package")
cont_choro

Classed choropleth

  • We can also create a binned choropleth; ggplot2 will identify “pretty” breaks, or custom breaks can be supplied
classed_choro <- ggplot(san_diego_asian, aes(fill = estimate)) + 
  geom_sf() + 
  theme_void() + 
  scale_fill_viridis_b(option = "rocket", n.breaks = 6) + 
  labs(title = "Percent Asian by Census tract",
       subtitle = "San Diego County, CA",
       fill = "ACS estimate",
       caption = "2018-2022 ACS | tidycensus R package")
classed_choro

Faceted choropleth maps

  • Spatial ACS data in tidy (long) format can be faceted by a grouping variable, allowing for comparative mapping
faceted_choro <- ggplot(san_diego_race, aes(fill = estimate)) + 
  geom_sf(color = NA) + 
  theme_void() + 
  scale_fill_viridis_c(option = "rocket") + 
  facet_wrap(~variable) + 
  labs(title = "Race / ethnicity by Census tract",
       subtitle = "San Diego County, California",
       fill = "ACS estimate (%)",
       caption = "2018-2022 ACS | tidycensus R package")
faceted_choro

Mapping count data

  • At times, you’ll want to show variations in counts rather than rates on your maps of ACS data

  • Choropleth maps are poorly suited for count data

  • Let’s grab some count data for race / ethnicity and consider some alternatives

san_diego_race_counts <- get_acs(
  geography = "tract",
  variables = c(
    Hispanic = "DP05_0073",
    White = "DP05_0079",
    Black = "DP05_0080",
    Asian = "DP05_0082"
  ),
  state = "CA",
  county = "San Diego",
  geometry = TRUE,
  year = 2022
)

Graduated symbol maps

  • Graduated symbol maps show difference on a map with the size of symbols (often circles)

  • They are better for count data than choropleth maps as the shapes are directly comparable (unlike differentially-sized polygons)

  • We’ll need to convert our data to centroids to plot graduated symbols in ggplot2

library(sf)

san_diego_hispanic <- filter(
  san_diego_race_counts, 
  variable == "Hispanic"
)

centroids <- st_centroid(san_diego_hispanic)

Graduated symbol maps

  • We’ll first plot a base layer of Census tracts, then a layer of graduated symbols on top

  • Use scale_size_area() to plot proportional symbols

grad_symbol <- ggplot() + 
  geom_sf(data = san_diego_hispanic, color = "black", fill = "lightgrey") + 
  geom_sf(data = centroids, aes(size = estimate),
          alpha = 0.7, color = "navy") + 
  theme_void() + 
  labs(title = "Hispanic population by Census tract",
       subtitle = "2018-2022 ACS, San Diego County, California",
       size = "ACS estimate") + 
  scale_size_area(max_size = 6) 
grad_symbol

Dot-density mapping

  • It can be difficult to show heterogeneity or mixing of different categories on maps

  • Dot-density maps scatter dots proportionally to data size; dots can be colored to show mixing of categories

  • Dot-density maps have traditionally been slow to make in R; tidycensus’s as_dot_density() function addresses this
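The dot counts behind a dot-density map are simple arithmetic: roughly one dot per values_per_dot units of the estimate, with each dot then placed randomly within its polygon. A base-R sketch of that first step, using hypothetical tract estimates:

```r
# One dot represents values_per_dot people
values_per_dot <- 200

# Hypothetical tract-level counts for one group
estimates <- c(3604, 1257)

# Approximate number of dots generated per tract
n_dots <- round(estimates / values_per_dot)
n_dots
```

as_dot_density() handles the second step (random placement within polygons) and shuffles dot order so that no group is systematically drawn on top.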

Dot-density mapping

san_diego_race_dots <- as_dot_density(
  san_diego_race_counts,
  value = "estimate",
  values_per_dot = 200,
  group = "variable"
)
san_diego_race_dots
Simple feature collection with 15495 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -117.5762 ymin: 32.53794 xmax: -116.0835 ymax: 33.50345
Geodetic CRS:  NAD83
First 10 features:
         GEOID                                              NAME variable
1  06073010010 Census Tract 100.10; San Diego County; California Hispanic
2  06073014200    Census Tract 142; San Diego County; California    Black
3  06073012102 Census Tract 121.02; San Diego County; California Hispanic
4  06073003101  Census Tract 31.01; San Diego County; California Hispanic
5  06073014500    Census Tract 145; San Diego County; California    White
6  06073003305  Census Tract 33.05; San Diego County; California    Asian
7  06073017501 Census Tract 175.01; San Diego County; California    White
8  06073018523 Census Tract 185.23; San Diego County; California Hispanic
9  06073002100     Census Tract 21; San Diego County; California    White
10 06073016615 Census Tract 166.15; San Diego County; California    White
   estimate moe                   geometry
1      3604 664 POINT (-117.0613 32.58091)
2      1257 596 POINT (-117.0429 32.73077)
3      2026 386 POINT (-117.0754 32.67362)
4      2202 687 POINT (-117.0837 32.69957)
5      1174 199 POINT (-117.0395 32.74568)
6       701 331 POINT (-117.0946 32.70061)
7      2281 467 POINT (-117.2859 33.04079)
8      1243 418 POINT (-117.3123 33.22678)
9      2382 403 POINT (-117.1083 32.75751)
10     2475 374 POINT (-116.9835 32.84615)

Dot-density mapping

  • Like the graduated symbol map, we plot points over a base layer, but in this case with a much smaller size

  • Use override.aes in guide_legend() to plot visible colors in the legend

dot_density_map <- ggplot() + 
  geom_sf(data = san_diego_hispanic, color = "lightgrey", fill = "white") + 
  geom_sf(data = san_diego_race_dots, aes(color = variable), size = 0.01) + 
  scale_color_brewer(palette = "Set1") + 
  guides(color = guide_legend(override.aes = list(size = 3))) + 
  theme_void() + 
  labs(color = "Race / ethnicity",
       caption = "2018-2022 ACS | 1 dot = approximately 200 people")
dot_density_map

Adding a basemap

  • You may instead want a basemap as a reference layer; this is straightforward with the ggspatial package and annotation_map_tile()
library(ggspatial)

dot_density_with_basemap <- ggplot() + 
  annotation_map_tile(type = "cartolight", zoom = 9) + 
  geom_sf(data = san_diego_race_dots, aes(color = variable), size = 0.01) + 
  scale_color_brewer(palette = "Set1") + 
  guides(color = guide_legend(override.aes = list(size = 3))) + 
  theme_void() + 
  labs(color = "Race / ethnicity",
       caption = "2018-2022 ACS | 1 dot = approximately 200 people")
dot_density_with_basemap

Customizing interactive maps with mapview()

  • mapview() accepts custom color palettes and labels, making it a suitable engine for interactive maps for presentations!
library(viridisLite)

colors <- rocket(n = 100)

mv1 <- mapview(san_diego_asian, zcol = "estimate", 
        layer.name = "Percent Asian<br>2018-2022 ACS",
        col.regions = colors)
mv1

Linked interactive maps with mapview()

  • In mapview, layers can be stacked with the + operator or swiped between with the | operator

  • leafsync takes this one step further by creating side-by-side synced maps

library(leafsync)

san_diego_white <- filter(san_diego_race, variable == "White")

m1 <- mapview(san_diego_asian, zcol = "estimate", 
        layer.name = "Percent Asian<br/>2018-2022 ACS",
        col.regions = colors)

m2 <- mapview(san_diego_white, zcol = "estimate", 
        layer.name = "Percent White<br/>2018-2022 ACS",
        col.regions = colors)

mv2 <- sync(m1, m2)
mv2

Bonus: migration flow mapping

Installing the rdeck package

  • To run this example, you’ll need the rdeck package, which is not on CRAN
install.packages("remotes")
library(remotes)
install_github("qfes/rdeck")

Getting migration flow data

  • The get_flows() function obtains origin-destination flow data from the 5-year ACS
library(tidyverse)

fulton_inflow <- get_flows(
  geography = "county",
  state = "GA",
  county = "Fulton",
  geometry = TRUE,
  year = 2020
) %>%
  filter(variable == "MOVEDIN") %>%
  na.omit()

fulton_top_origins <- fulton_inflow %>%
  slice_max(estimate, n = 30) 

Mapping in 3D with rdeck!

library(rdeck)

Sys.setenv("MAPBOX_ACCESS_TOKEN" = "pk.eyJ1Ijoia3dhbGtlcnRjdSIsImEiOiJjbHRoYm12eDQwMzZ1MnNvN2JyMzZqYXBpIn0.Sdd16XvAh70IwBrqDD7MzQ")

fulton_top_origins$centroid1 <- st_transform(fulton_top_origins$centroid1, 4326)
fulton_top_origins$centroid2 <- st_transform(fulton_top_origins$centroid2, 4326)

flow_map <- rdeck(map_style = mapbox_light(), 
      initial_view_state = view_state(center = c(-98.422, 38.606), zoom = 3, 
                                      pitch = 45)) %>%
  add_arc_layer(get_source_position = centroid2,
          get_target_position = centroid1,
          data = as_tibble(fulton_top_origins),
          get_source_color = "#274f8f",
          get_target_color = "#274f8f",
          get_height = 1,
          get_width = scale_linear(estimate, range = 1:5),
          great_circle = TRUE
          )
flow_map

Part 2 exercises

  • Try reproducing one of the maps you created in this section with a different state / county combination. What patterns do you observe?

Advanced workflows: automated mapping and small-area time series analysis

Automated mapping

Challenge: making 100 maps at once

Let’s tackle a hypothetical example you might get at work. Your boss has assigned you the following task:

I’d like you to look at geographic patterns in working from home for the 100 largest counties by population in the US. Make me maps for each county.

Related blog post: https://walker-data.com/posts/iterative-mapping/

Warning: may be too memory-intensive for Posit Cloud users

Challenge: making 100 maps at once

At first, this may seem like a significant task! You’ll need to:

  • Generate a list of the 100 largest counties by population in the US;

  • Get data for those counties on working from home at a sufficiently granular geographic level to show patterns;

  • Make and deliver 100 maps.

Step 1: find the 100 largest counties in the US by population

library(tidycensus)
library(tidyverse)

top100counties <- get_acs(
  geography = "county",
  variables = "B01003_001",
  year = 2022,
  survey = "acs1"
) %>%
  slice_max(estimate, n = 100)
top100counties
# A tibble: 100 × 5
   GEOID NAME                           variable   estimate   moe
   <chr> <chr>                          <chr>         <dbl> <dbl>
 1 06037 Los Angeles County, California B01003_001  9721138    NA
 2 17031 Cook County, Illinois          B01003_001  5109292    NA
 3 48201 Harris County, Texas           B01003_001  4780913    NA
 4 04013 Maricopa County, Arizona       B01003_001  4551524    NA
 5 06073 San Diego County, California   B01003_001  3276208    NA
 6 06059 Orange County, California      B01003_001  3151184    NA
 7 12086 Miami-Dade County, Florida     B01003_001  2673837    NA
 8 48113 Dallas County, Texas           B01003_001  2600840    NA
 9 36047 Kings County, New York         B01003_001  2590516    NA
10 06065 Riverside County, California   B01003_001  2473902    NA
# ℹ 90 more rows

Step 2: pull tract-level Census data by county

  • We now need to organize tract data county-by-county using the information in the table. But how can we do this efficiently?

  • Something to remember (from Allen Downey, Think Python):

Repeating identical or similar tasks without making errors is something that computers do well and people do poorly

  • My preference for iteration in R: the purrr package

Step 2: pull tract-level data by county

  • Here’s how I’d do it:
wfh_tract_list <- top100counties %>%
  split(~NAME) %>%
  map(function(county) {
    state_fips <- str_sub(county$GEOID, 1, 2)
    county_fips <- str_sub(county$GEOID, 3, 5)
    
    get_acs(
      geography = "tract",
      variables = "DP03_0024P",
      state = state_fips,
      county = county_fips,
      year = 2022,
      geometry = TRUE
    )
  })

What just happened?

  1. We split our dataset of counties into separate datasets by NAME with split()

  2. We used map() to set up iteration over each county dataset

  3. We extract the state and county FIPS codes for each of the top 100 counties

  4. Finally, we call get_acs() using those state and county FIPS codes.
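The split-then-iterate pattern can be previewed without any API calls using base R’s split() and lapply() on a toy table (purrr’s map() works analogously); the FIPS extraction mirrors the workshop code:

```r
# Toy stand-in for the table of top counties
counties <- data.frame(
  GEOID = c("06037", "17031", "48201"),
  NAME = c("Los Angeles", "Cook", "Harris")
)

# Split into a named list of one-row data frames, then iterate over each
results <- lapply(split(counties, counties$NAME), function(county) {
  list(
    state  = substr(county$GEOID, 1, 2),  # first 2 digits: state FIPS
    county = substr(county$GEOID, 3, 5)   # last 3 digits: county FIPS
  )
})

# Results are accessible by county name
results$Cook$state
```

In the real workflow, the body of the function is a get_acs() call using those two FIPS codes, so each list element becomes a tract-level dataset for one county.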

This is why you want a Census API key!

Now, let’s make some maps:

  • We can again use map() to… well… make some maps
library(mapview)

wfh_maps <- map(wfh_tract_list, function(county) {
  mapview(
    county, 
    zcol = "estimate",
    layer.name = "% working from home"
  ) 
})
wfh_maps$`San Mateo County, California`

Small-area time-series analysis

Prompt: analyzing change over time

Nice work! Now, you have a new follow-up from your boss:

Interesting stuff. However, I’m really interested in knowing how work-from-home is changing over time in our target markets. Could you show me where work-from-home has increased the most in Salt Lake City?

Time-series and the ACS: some notes

  • 5-year ACS data represent overlapping survey years, meaning that direct comparison between overlapping datasets is not typically recommended

  • The Census Bureau recommends comparing non-overlapping years in the 5-year ACS

  • For 2018-2022 data, available 5-year intervals for comparison are 2008-2012 and 2013-2017

The ACS Comparison Profile

  • The Comparison Profile includes data on the current 5-year ACS and the 5-year ACS that ends 5 years prior to help with time-series comparisons
utah_wfh_compare <- get_acs(
  geography = "county",
  variables = c(
    wfh17 = "CP03_2017_024",
    wfh22 = "CP03_2022_024"
  ),
  state = "UT",
  year = 2022
)
utah_wfh_compare
# A tibble: 50 × 4
   GEOID NAME                   variable estimate
   <chr> <chr>                  <chr>       <dbl>
 1 49001 Beaver County, Utah    wfh17         3.4
 2 49001 Beaver County, Utah    wfh22         8.7
 3 49003 Box Elder County, Utah wfh17         4  
 4 49003 Box Elder County, Utah wfh22         9.8
 5 49005 Cache County, Utah     wfh17         5.2
 6 49005 Cache County, Utah     wfh22        10.7
 7 49007 Carbon County, Utah    wfh17         3.3
 8 49007 Carbon County, Utah    wfh22         6.2
 9 49011 Davis County, Utah     wfh17         5.5
10 49011 Davis County, Utah     wfh22        14.2
# ℹ 40 more rows

Geography and making comparisons

  • Data in the Comparison Profile tables are only available down to the county level (though county boundaries can change from ACS to ACS!)

  • Comparing neighborhood-level change across ACS datasets can introduce additional challenges, as Census tract and block group boundaries may differ across years

New boundaries drawn with the 2020 Census

  • Locations with significant population growth (e.g. suburban Collin County, Texas) will have Census tracts / block groups with large populations subdivided in the 2020 Census

  • Example: total population by Census tract in Collin County in the 2013-2017 ACS (on the left) and the 2018-2022 ACS (on the right)

Data setup

  • Let’s grab data on the population working from home by Census tract in Salt Lake City

  • Use a projected coordinate reference system to speed things up!

library(sf)

wfh_17 <- get_acs(geography = "tract", variables = "B08006_017", year = 2017,
                  state = "UT", county = "Salt Lake", geometry = TRUE) %>%
  st_transform(6620)

wfh_22 <- get_acs(geography = "tract", variables = "B08006_017", year = 2022,
                  state = "UT", county = "Salt Lake", geometry = TRUE) %>%
  st_transform(6620)

Method 1: area-weighted interpolation

  • Interpolating data between sets of boundaries involves the use of weights to re-distribute data from one geography to another

  • A common method is area-weighted interpolation, which allocates information from one geography to another weighted by the area of overlap

  • Typically more accurate when going backward, as many new tracts will “roll up” within parent tracts from a previous Census (though not always)

library(sf)

wfh_22_to_17 <- wfh_22 %>%
  select(estimate) %>%
  st_interpolate_aw(to = wfh_17, extensive = TRUE)
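For an extensive (count) variable, the allocation behind st_interpolate_aw() is proportional arithmetic. A minimal sketch with hypothetical overlap shares:

```r
# A source tract holds 1000 home-based workers (an extensive variable).
# Suppose 30% of its area overlaps target tract A and 70% overlaps tract B.
source_value <- 1000
overlap_share <- c(A = 0.3, B = 0.7)

# Extensive interpolation allocates the count in proportion
# to the area of overlap; the total is preserved
allocated <- source_value * overlap_share
allocated
```

The implicit assumption is that workers are spread evenly across the source tract's area, which is exactly where area weighting can go wrong in low-density or empty areas.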

Area-weighted interpolation

library(mapview)
library(leafsync)

m22a <- mapview(wfh_22, zcol = "estimate", layer.name = "2022 geographies")
m17a <- mapview(wfh_22_to_17, zcol = "estimate", layer.name = "2017 geographies")

sync(m22a, m17a)

Method 2: population-weighted interpolation

  • If you need to go forward, area-weighted interpolation may be very inaccurate as it can incorrectly allocate large values to low-density / empty areas

  • Population-weighted interpolation will instead use an underlying dataset explaining the population distribution as weights

  • Example: population-weighted interpolation using block population weights from the 2020 Census
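The contrast with area weighting can be shown with the same proportional arithmetic, but with block populations as the weights (hypothetical numbers):

```r
# A 2017 tract with 500 home-based workers overlaps two 2022 tracts.
# Area weighting might split it ~50/50, but suppose the underlying
# 2020 blocks put 90% of the population on one side.
source_value <- 500
pop_in_overlap <- c(tract_A = 900, tract_B = 100)

# Population-weighted interpolation allocates by population share,
# not by area share
weights <- pop_in_overlap / sum(pop_in_overlap)
allocated <- source_value * weights
allocated
```

This is why interpolate_pw() asks for a weights layer (here, 2020 Census blocks with a POP20 column): the blocks tell it where people actually live within each overlap.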

In tidycensus: interpolate_pw()

library(tigris)
options(tigris_use_cache = TRUE)

salt_lake_blocks <- blocks(
  "UT", 
  "Salt Lake", 
  year = 2020
)

wfh_17_to_22 <- interpolate_pw(
  from = wfh_17,
  to = wfh_22,
  to_id = "GEOID",
  weights = salt_lake_blocks,
  weight_column = "POP20",
  crs = 6620,
  extensive = TRUE
)

Evaluating the result

m17b <- mapview(wfh_17, zcol = "estimate", layer.name = "2017 geographies")
m22b <- mapview(wfh_17_to_22, zcol = "estimate", layer.name = "2022 geographies")

sync(m17b, m22b)

Evaluating change over time

Evaluating change over time

wfh_shift <- wfh_17_to_22 %>%
  select(GEOID, estimate17 = estimate) %>%
  left_join(
    select(st_drop_geometry(wfh_22), 
           GEOID, estimate22 = estimate), by = "GEOID"
  ) %>%
  mutate(
    shift = estimate22 - estimate17,
    pct_shift = 100 * (shift / estimate17)
  )
mapview(wfh_shift, zcol = "shift")

Thank you!