Doing ‘GIS’ and making maps with US Census data in R

2024 SSDAN Webinar Series

Kyle Walker

2024-03-07

About me

SSDAN webinar series

Today’s agenda

  • Hour 1: How to get and explore spatial US Census data using R

  • Hour 2: A tour of map types with R and US Census data

  • Hour 3: Advanced workflows: automated mapping and spatial analysis

US Census data: an overview

R and RStudio

  • R: programming language and software environment for data analysis (and wherever else your imagination can take you!)

  • RStudio: integrated development environment (IDE) for R developed by Posit

  • Posit Cloud: run RStudio with today’s workshop pre-configured at https://posit.cloud/content/7549022

Setup: RStudio and basic data structures in R

The Decennial US Census

  • Complete count of the US population mandated by Article 1, Sections 2 and 9 in the US Constitution

  • Directed by the US Census Bureau (US Department of Commerce); conducted every 10 years since 1790

  • Used for proportional representation / congressional redistricting

  • Limited set of questions asked about race, ethnicity, age, sex, and housing tenure

The American Community Survey

  • Annual survey of 3.5 million US households

  • Covers topics not available in decennial US Census data (e.g. income, education, language, housing characteristics)

  • Available as 1-year estimates (for geographies of population 65,000 and greater) and 5-year estimates (for geographies down to the block group)

  • Data delivered as estimates characterized by margins of error
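Because ACS margins of error are published at a 90 percent confidence level, converting them to another level is simple arithmetic (rescale by the ratio of critical z-values). A minimal base-R sketch, using the Anderson County figures shown later in this workshop:

```r
# ACS MOEs are published at 90% confidence (z = 1.645).
# Rescale to 95% confidence (z = 1.96):
moe_90_to_95 <- function(moe) moe * (1.96 / 1.645)

# Hypothetical example: median income estimate of 57445, 90% MOE of 4562
estimate <- 57445
moe_95 <- moe_90_to_95(4562)

# 95% confidence interval around the estimate
ci <- c(estimate - moe_95, estimate + moe_95)
```

tidycensus also ships helpers like moe_sum() for aggregating margins of error across derived estimates.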

How to get Census data

tidycensus

  • R interface to the Decennial Census, American Community Survey, Population Estimates Program, Migration Flows, and Public Use Microdata Sample (PUMS) APIs

  • First released in 2017; over 500,000 downloads from the Posit CRAN mirror

tidycensus: key features

  • Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested);

  • Automatically downloads and merges Census geometries to data for mapping;

  • Includes a variety of analytic tools to support common Census workflows;

  • States and counties can be requested by name (no more looking up FIPS codes!)

Getting started with tidycensus

  • To get started, install the packages you’ll need for today’s workshop

  • If you are using the Posit Cloud environment, these packages are already installed for you

install.packages(c("tidycensus", "tidyverse", "mapview", "ggspatial", "leafsync"))

Optional: your Census API key

  • tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day

  • Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.

  • Once activated, use the census_api_key() function to set your key as an environment variable

library(tidycensus)

census_api_key("YOUR KEY GOES HERE", install = TRUE)

Getting spatial US Census data

Spatial Census data: the old way

Traditionally, getting “spatial” Census data required:

  • Fetching shapefiles from the Census website;

  • Downloading a CSV of data, then cleaning and formatting it;

  • Loading geometries and data into your GIS of choice;

  • Aligning key fields in your GIS and joining your data

Getting started with tidycensus

  • Your core functions in tidycensus are get_decennial() for decennial Census data, and get_acs() for ACS data

  • Required arguments are geography and variables

texas_income <- get_acs(
  geography = "county",
  variables = "B19013_001",
  state = "TX",
  year = 2022
)
  • ACS data are returned with five columns: GEOID, NAME, variable, estimate, and moe
texas_income
# A tibble: 254 × 5
   GEOID NAME                    variable   estimate   moe
   <chr> <chr>                   <chr>         <dbl> <dbl>
 1 48001 Anderson County, Texas  B19013_001    57445  4562
 2 48003 Andrews County, Texas   B19013_001    86458 16116
 3 48005 Angelina County, Texas  B19013_001    57055  2484
 4 48007 Aransas County, Texas   B19013_001    58168  6458
 5 48009 Archer County, Texas    B19013_001    69954  8482
 6 48011 Armstrong County, Texas B19013_001    70417 14574
 7 48013 Atascosa County, Texas  B19013_001    67442  4309
 8 48015 Austin County, Texas    B19013_001    73556  4757
 9 48017 Bailey County, Texas    B19013_001    69830 13120
10 48019 Bandera County, Texas   B19013_001    70965  5710
# ℹ 244 more rows

Spatial Census data with tidycensus

  • Use the argument geometry = TRUE to get pre-joined geometry along with your data!
texas_income_sf <- get_acs(
  geography = "county",
  variables = "B19013_001",
  state = "TX",
  year = 2022,
  geometry = TRUE
)
  • We can now make a simple map with plot():
plot(texas_income_sf['estimate'])

Looking under the hood: simple features in R

  • The sf package implements a simple features data model for vector spatial data in R

  • Vector geometries: points, lines, and polygons stored in a list-column of a data frame

  • Spatial data are returned with five data columns: GEOID, NAME, variable, estimate, and moe, along with a geometry column representing the shapes of locations
texas_income_sf
Simple feature collection with 254 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -106.6456 ymin: 25.83738 xmax: -93.50829 ymax: 36.5007
Geodetic CRS:  NAD83
First 10 features:
   GEOID                     NAME   variable estimate   moe
1  48273    Kleberg County, Texas B19013_001    52487  4987
2  48391    Refugio County, Texas B19013_001    54304  2650
3  48201     Harris County, Texas B19013_001    70789   482
4  48443    Terrell County, Texas B19013_001    52813 11107
5  48229   Hudspeth County, Texas B19013_001    35163  8159
6  48205    Hartley County, Texas B19013_001    78065 21076
7  48351     Newton County, Texas B19013_001    38871  6573
8  48373       Polk County, Texas B19013_001    57315  2976
9  48139      Ellis County, Texas B19013_001    93248  2485
10 48491 Williamson County, Texas B19013_001   102851  1462
                         geometry
1  MULTIPOLYGON (((-97.3178 27...
2  MULTIPOLYGON (((-97.54085 2...
3  MULTIPOLYGON (((-94.97839 2...
4  MULTIPOLYGON (((-102.5669 3...
5  MULTIPOLYGON (((-105.998 32...
6  MULTIPOLYGON (((-103.0422 3...
7  MULTIPOLYGON (((-93.91113 3...
8  MULTIPOLYGON (((-95.20018 3...
9  MULTIPOLYGON (((-97.08703 3...
10 MULTIPOLYGON (((-98.04989 3...

Interactive viewing with mapview()

  • The mapview package allows for interactive viewing of spatial data in R
library(mapview)

mapview(
  texas_income_sf, 
  zcol = "estimate"
)

Understanding geography and variables in tidycensus

US Census Geography

Source: US Census Bureau


Geography in tidycensus

Searching for variables

  • To search for variables, use the load_variables() function along with a year and dataset

  • For the 2022 5-year ACS, use "acs5" for the Detailed Tables; "acs5/profile" for the Data Profile; "acs5/subject" for the Subject Tables; and "acs5/cprofile" for the Comparison Profile

  • The View() function in RStudio allows for interactive browsing and filtering

vars <- load_variables(2022, "acs5")

View(vars)

Small-area spatial demographic data

  • Smaller areas like Census tracts or block groups are available with geography = "tract" or geography = "block group"; one or more counties can optionally be specified to focus your query
nyc_income <- get_acs(
  geography = "tract",
  variables = "B19013_001",
  state = "NY",
  county = c("New York", "Kings", "Queens",
             "Bronx", "Richmond"),
  year = 2022,
  geometry = TRUE
)
mapview(nyc_income, zcol = "estimate")

“Tidy” or long-form data

  • The default data structure returned by tidycensus is “tidy” or long-form data, with variables by geography stacked by row

  • For spatial data, this means that geometries will also be stacked, which is helpful for group-wise analysis and visualization
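The difference between the two formats can be sketched on a toy table with base R (hypothetical values; tidycensus performs this reshaping internally when you request wide output):

```r
# Toy long-form ("tidy") table mimicking tidycensus output:
# one row per geography/variable combination
long <- data.frame(
  GEOID = c("001", "001", "002", "002"),
  variable = c("Hispanic", "White", "Hispanic", "White"),
  estimate = c(87.0, 6.9, 19.4, 68.3)
)

# Spread to wide form: one row per GEOID, one column per variable
wide <- reshape(long, idvar = "GEOID", timevar = "variable",
                direction = "wide")
names(wide) <- sub("^estimate\\.", "", names(wide))
wide
```

In the tidyverse you would reach for tidyr::pivot_wider() instead; base reshape() is used here only to keep the sketch dependency-free.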

“Tidy” or long-form data

san_diego_race <- get_acs(
  geography = "tract",
  variables = c(
    Hispanic = "DP05_0073P",
    White = "DP05_0079P",
    Black = "DP05_0080P",
    Asian = "DP05_0082P"
  ),
  state = "CA",
  county = "San Diego",
  year = 2022,
  geometry = TRUE
)
san_diego_race
Simple feature collection with 2948 features and 5 fields (with 4 geometries empty)
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -117.596 ymin: 32.53444 xmax: -116.0809 ymax: 33.50503
Geodetic CRS:  NAD83
First 10 features:
         GEOID                                              NAME variable
1  06073013104 Census Tract 131.04; San Diego County; California Hispanic
2  06073013104 Census Tract 131.04; San Diego County; California    White
3  06073013104 Census Tract 131.04; San Diego County; California    Black
4  06073013104 Census Tract 131.04; San Diego County; California    Asian
5  06073020902 Census Tract 209.02; San Diego County; California Hispanic
6  06073020902 Census Tract 209.02; San Diego County; California    White
7  06073020902 Census Tract 209.02; San Diego County; California    Black
8  06073020902 Census Tract 209.02; San Diego County; California    Asian
9  06073015902 Census Tract 159.02; San Diego County; California Hispanic
10 06073015902 Census Tract 159.02; San Diego County; California    White
   estimate  moe                       geometry
1      87.0  4.8 MULTIPOLYGON (((-117.0841 3...
2       6.9  2.9 MULTIPOLYGON (((-117.0841 3...
3       3.5  3.0 MULTIPOLYGON (((-117.0841 3...
4       2.6  1.9 MULTIPOLYGON (((-117.0841 3...
5      19.4 10.9 MULTIPOLYGON (((-116.7902 3...
6      68.3 12.2 MULTIPOLYGON (((-116.7902 3...
7       2.6  1.6 MULTIPOLYGON (((-116.7902 3...
8       1.3  1.5 MULTIPOLYGON (((-116.7902 3...
9      36.3 12.0 MULTIPOLYGON (((-116.9788 3...
10     54.6 11.8 MULTIPOLYGON (((-116.9788 3...

“Wide” data

  • The argument output = "wide" spreads Census variables across the columns, returning one row per geographic unit and one column per variable

  • This will be a more familiar data structure for traditional desktop GIS users

“Wide” data

san_diego_race_wide <- get_acs(
  geography = "tract",
  variables = c(
    Hispanic = "DP05_0073P",
    White = "DP05_0079P",
    Black = "DP05_0080P",
    Asian = "DP05_0082P"
  ),
  state = "CA",
  county = "San Diego",
  geometry = TRUE,
  output = "wide",
  year = 2022
)
san_diego_race_wide
Simple feature collection with 737 features and 10 fields (with 1 geometry empty)
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -117.596 ymin: 32.53444 xmax: -116.0809 ymax: 33.50503
Geodetic CRS:  NAD83
First 10 features:
         GEOID                                              NAME HispanicE
1  06073013104 Census Tract 131.04; San Diego County; California      87.0
2  06073020902 Census Tract 209.02; San Diego County; California      19.4
3  06073015902 Census Tract 159.02; San Diego County; California      36.3
4  06073005700     Census Tract 57; San Diego County; California      24.6
5  06073003001  Census Tract 30.01; San Diego County; California      44.5
6  06073015801 Census Tract 158.01; San Diego County; California      36.9
7  06073011601 Census Tract 116.01; San Diego County; California      85.3
8  06073010111 Census Tract 101.11; San Diego County; California      92.3
9  06073009000     Census Tract 90; San Diego County; California      39.3
10 06073008510  Census Tract 85.10; San Diego County; California      23.2
   HispanicM WhiteE WhiteM BlackE BlackM AsianE AsianM
1        4.8    6.9    2.9    3.5    3.0    2.6    1.9
2       10.9   68.3   12.2    2.6    1.6    1.3    1.5
3       12.0   54.6   11.8    2.9    2.1    3.7    3.5
4        8.0   52.1    8.0   12.0    6.3    5.9    4.1
5        6.9   10.6    2.7   26.7    5.7    3.1    2.7
6       12.1   39.0    8.7   13.5    7.2    2.2    1.6
7        4.4    3.5    2.3    2.8    1.9    4.2    2.3
8        5.1    4.4    4.0    2.1    3.1    0.5    0.6
9        9.6   40.4    7.2    2.3    1.7   11.0    4.3
10       7.9   47.8    7.9    5.0    3.5   19.6    5.0
                         geometry
1  MULTIPOLYGON (((-117.0841 3...
2  MULTIPOLYGON (((-116.7902 3...
3  MULTIPOLYGON (((-116.9788 3...
4  MULTIPOLYGON (((-117.1657 3...
5  MULTIPOLYGON (((-117.085 32...
6  MULTIPOLYGON (((-116.9666 3...
7  MULTIPOLYGON (((-117.1036 3...
8  MULTIPOLYGON (((-117.0739 3...
9  MULTIPOLYGON (((-117.1872 3...
10 MULTIPOLYGON (((-117.1753 3...

Census geometry and the tigris R package

  • tidycensus uses the tigris R package internally to acquire Census shapefiles

  • By default, the Cartographic Boundary shapefiles are used, which are pre-clipped to the US shoreline

Problem: interior water areas

  • Let’s re-visit the NYC income map

  • Water areas throughout the city are included within tract polygons, making patterns difficult to observe for those who know the area

  • erase_water() in the tigris package offers a solution; it automates the removal of water areas from your shapes

  • Tips: tune the area_threshold to selectively remove water area; use cb = FALSE to avoid sliver polygons

nyc_income_tiger <- get_acs(
  geography = "tract",
  variables = "B19013_001",
  state = "NY",
  county = c("New York", "Kings", "Queens",
             "Bronx", "Richmond"),
  year = 2022,
  cb = FALSE,
  geometry = TRUE
)
  • area_threshold = 0.5 means we remove water areas at or above the 50th percentile of size, i.e. the largest 50% of water areas in the region
library(tigris)
library(sf)
sf_use_s2(FALSE)

nyc_erase <- erase_water(
  nyc_income_tiger,
  area_threshold = 0.5,
  year = 2022
)
mapview(nyc_erase, zcol = "estimate")

Part 1 exercises

  1. Use the load_variables() function to find a variable that interests you that we haven’t used yet.

  2. Use get_acs() to fetch spatial ACS data on that variable for a geography and location of your choice, then use mapview() to display your data interactively.

A tour of map types with R and US Census data

Mapping in R

  • R has a robust set of tools for cartographic visualization that make it a suitable alternative to desktop GIS software in many instances

  • Popular packages for cartography include ggplot2, tmap, and mapsf

  • Today, we’ll be focusing on ggplot2; see Chapter 6 of my book for similar examples using tmap

ggplot2 and geom_sf()

  • ggplot2: R’s most popular visualization package (over 137 million downloads!)

  • ggplot2 graphics are defined by an aesthetic mapping and one or more geoms

  • geom_sf() is a special geom that interprets the geometry type of your spatial data and visualizes it accordingly

  • As a result, we can make attractive maps using familiar ggplot2 syntax

Mapping Census data with ggplot2

Continuous choropleth

  • By default, ggplot2 will apply a continuous color palette to create choropleth maps

  • Choropleth maps: the shading of a polygon / shape is mapped to a data attribute

library(tidyverse)

san_diego_asian <- filter(san_diego_race, variable == "Asian")

ggplot(san_diego_asian, aes(fill = estimate)) + 
  geom_sf()

Continuous choropleth with styling

  • We can apply some styling to customize our choropleth maps

  • Used here: a viridis color palette, which is built in to ggplot2

cont_choro <- ggplot(san_diego_asian, aes(fill = estimate)) + 
  geom_sf() + 
  theme_void() + 
  scale_fill_viridis_c(option = "rocket") + 
  labs(title = "Percent Asian by Census tract",
       subtitle = "San Diego County, CA",
       fill = "ACS estimate",
       caption = "2018-2022 ACS | tidycensus R package")
cont_choro

Classed choropleth

  • We can also create a binned choropleth; ggplot2 will identify “pretty” breaks, or custom breaks can be supplied
classed_choro <- ggplot(san_diego_asian, aes(fill = estimate)) + 
  geom_sf() + 
  theme_void() + 
  scale_fill_viridis_b(option = "rocket", n.breaks = 6) + 
  labs(title = "Percent Asian by Census tract",
       subtitle = "San Diego County, CA",
       fill = "ACS estimate",
       caption = "2018-2022 ACS | tidycensus R package")
classed_choro

Faceted choropleth maps

  • Spatial ACS data in tidy (long) format can be faceted by a grouping variable, allowing for comparative mapping
faceted_choro <- ggplot(san_diego_race, aes(fill = estimate)) + 
  geom_sf(color = NA) + 
  theme_void() + 
  scale_fill_viridis_c(option = "rocket") + 
  facet_wrap(~variable) + 
  labs(title = "Race / ethnicity by Census tract",
       subtitle = "San Diego County, California",
       fill = "ACS estimate (%)",
       caption = "2018-2022 ACS | tidycensus R package")
faceted_choro

Mapping count data

  • At times, you’ll want to show variations in counts rather than rates on your maps of ACS data

  • Choropleth maps are poorly suited for count data

  • Let’s grab some count data for race / ethnicity and consider some alternatives

san_diego_race_counts <- get_acs(
  geography = "tract",
  variables = c(
    Hispanic = "DP05_0073",
    White = "DP05_0079",
    Black = "DP05_0080",
    Asian = "DP05_0082"
  ),
  state = "CA",
  county = "San Diego",
  geometry = TRUE,
  year = 2022
)

Graduated symbol maps

  • Graduated symbol maps show difference on a map with the size of symbols (often circles)

  • They are better for count data than choropleth maps as the shapes are directly comparable (unlike differentially-sized polygons)

  • We’ll need to convert our data to centroids to plot graduated symbols in ggplot2

library(sf)

san_diego_hispanic <- filter(
  san_diego_race_counts, 
  variable == "Hispanic"
)

centroids <- st_centroid(san_diego_hispanic)

Graduated symbol maps

  • We’ll first plot a base layer of Census tracts, then a layer of graduated symbols on top

  • Use scale_size_area() to plot proportional symbols

grad_symbol <- ggplot() + 
  geom_sf(data = san_diego_hispanic, color = "black", fill = "lightgrey") + 
  geom_sf(data = centroids, aes(size = estimate),
          alpha = 0.7, color = "navy") + 
  theme_void() + 
  labs(title = "Hispanic population by Census tract",
       subtitle = "2018-2022 ACS, San Diego County, California",
       size = "ACS estimate") + 
  scale_size_area(max_size = 6) 
grad_symbol

Dot-density mapping

  • It can be difficult to show heterogeneity or mixing of different categories on maps

  • Dot-density maps scatter dots proportionally to data size; dots can be colored to show mixing of categories

  • Dot-density maps have traditionally been slow to make in R; tidycensus’s as_dot_density() function addresses this
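The dot counts behind a dot-density map are simple arithmetic: roughly one dot per values_per_dot units of the estimate, with each dot then placed randomly within its polygon. A base-R sketch of that first step, using hypothetical tract estimates:

```r
# One dot represents values_per_dot people
values_per_dot <- 200

# Hypothetical tract-level counts for one group
estimates <- c(3604, 1257)

# Approximate number of dots generated per tract
n_dots <- round(estimates / values_per_dot)
n_dots
```

as_dot_density() handles the second step (random placement within polygons) and shuffles dot order so that no group is systematically drawn on top.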

Dot-density mapping

san_diego_race_dots <- as_dot_density(
  san_diego_race_counts,
  value = "estimate",
  values_per_dot = 200,
  group = "variable"
)
san_diego_race_dots
Simple feature collection with 15495 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -117.5762 ymin: 32.53794 xmax: -116.0835 ymax: 33.50345
Geodetic CRS:  NAD83
First 10 features:
         GEOID                                              NAME variable
1  06073010010 Census Tract 100.10; San Diego County; California Hispanic
2  06073014200    Census Tract 142; San Diego County; California    Black
3  06073012102 Census Tract 121.02; San Diego County; California Hispanic
4  06073003101  Census Tract 31.01; San Diego County; California Hispanic
5  06073014500    Census Tract 145; San Diego County; California    White
6  06073003305  Census Tract 33.05; San Diego County; California    Asian
7  06073017501 Census Tract 175.01; San Diego County; California    White
8  06073018523 Census Tract 185.23; San Diego County; California Hispanic
9  06073002100     Census Tract 21; San Diego County; California    White
10 06073016615 Census Tract 166.15; San Diego County; California    White
   estimate moe                   geometry
1      3604 664 POINT (-117.0613 32.58091)
2      1257 596 POINT (-117.0429 32.73077)
3      2026 386 POINT (-117.0754 32.67362)
4      2202 687 POINT (-117.0837 32.69957)
5      1174 199 POINT (-117.0395 32.74568)
6       701 331 POINT (-117.0946 32.70061)
7      2281 467 POINT (-117.2859 33.04079)
8      1243 418 POINT (-117.3123 33.22678)
9      2382 403 POINT (-117.1083 32.75751)
10     2475 374 POINT (-116.9835 32.84615)

Dot-density mapping

  • Like the graduated symbol map, we plot points over a base layer, but in this case with a much smaller size

  • Use override.aes in guide_legend() to plot visible colors in the legend

dot_density_map <- ggplot() + 
  geom_sf(data = san_diego_hispanic, color = "lightgrey", fill = "white") + 
  geom_sf(data = san_diego_race_dots, aes(color = variable), size = 0.01) + 
  scale_color_brewer(palette = "Set1") + 
  guides(color = guide_legend(override.aes = list(size = 3))) + 
  theme_void() + 
  labs(color = "Race / ethnicity",
       caption = "2018-2022 ACS | 1 dot = approximately 200 people")
dot_density_map

Adding a basemap

  • You may instead want a basemap as a reference layer; this is straightforward with the ggspatial package and annotation_map_tile()
library(ggspatial)

dot_density_with_basemap <- ggplot() + 
  annotation_map_tile(type = "cartolight", zoom = 9) + 
  geom_sf(data = san_diego_race_dots, aes(color = variable), size = 0.01) + 
  scale_color_brewer(palette = "Set1") + 
  guides(color = guide_legend(override.aes = list(size = 3))) + 
  theme_void() + 
  labs(color = "Race / ethnicity",
       caption = "2018-2022 ACS | 1 dot = approximately 200 people")
dot_density_with_basemap

Customizing interactive maps with mapview()

  • mapview() accepts custom color palettes and labels, making it a suitable engine for interactive maps for presentations!
library(viridisLite)

colors <- rocket(n = 100)

mv1 <- mapview(san_diego_asian, zcol = "estimate", 
        layer.name = "Percent Asian<br>2018-2022 ACS",
        col.regions = colors)
mv1

Linked interactive maps with mapview()

  • In mapview, layers can be stacked with the + operator or swiped between with the | operator

  • leafsync takes this one step further by creating side-by-side synced maps

library(leafsync)

san_diego_white <- filter(san_diego_race, variable == "White")

m1 <- mapview(san_diego_asian, zcol = "estimate", 
        layer.name = "Percent Asian<br/>2018-2022 ACS",
        col.regions = colors)

m2 <- mapview(san_diego_white, zcol = "estimate", 
        layer.name = "Percent White<br/>2018-2022 ACS",
        col.regions = colors)

mv2 <- sync(m1, m2)
mv2

Bonus: migration flow mapping

Installing the rdeck package

  • To run this example, you’ll need the rdeck package, which is not on CRAN
install.packages("remotes")
library(remotes)
install_github("qfes/rdeck")

Getting migration flow data

  • The get_flows() function obtains origin-destination flow data from the 5-year ACS
library(tidyverse)

fulton_inflow <- get_flows(
  geography = "county",
  state = "GA",
  county = "Fulton",
  geometry = TRUE,
  year = 2020
) %>%
  filter(variable == "MOVEDIN") %>%
  na.omit()

fulton_top_origins <- fulton_inflow %>%
  slice_max(estimate, n = 30) 

Mapping in 3D with rdeck!

library(rdeck)

Sys.setenv("MAPBOX_ACCESS_TOKEN" = "pk.eyJ1Ijoia3dhbGtlcnRjdSIsImEiOiJjbHRoYm12eDQwMzZ1MnNvN2JyMzZqYXBpIn0.Sdd16XvAh70IwBrqDD7MzQ")

fulton_top_origins$centroid1 <- st_transform(fulton_top_origins$centroid1, 4326)
fulton_top_origins$centroid2 <- st_transform(fulton_top_origins$centroid2, 4326)

flow_map <- rdeck(map_style = mapbox_light(), 
      initial_view_state = view_state(center = c(-98.422, 38.606), zoom = 3, 
                                      pitch = 45)) %>%
  add_arc_layer(get_source_position = centroid2,
          get_target_position = centroid1,
          data = as_tibble(fulton_top_origins),
          get_source_color = "#274f8f",
          get_target_color = "#274f8f",
          get_height = 1,
          get_width = scale_linear(estimate, range = 1:5),
          great_circle = TRUE
          )
flow_map

Part 2 exercises

  • Try reproducing one of the maps you created in this section with a different state / county combination. What patterns do you observe?

Advanced workflows: automated mapping and small-area time series analysis

Automated mapping

Challenge: making 100 maps at once

Let’s tackle a hypothetical example you might get at work. Your boss has assigned you the following task:

I’d like you to look at geographic patterns in working from home for the 100 largest counties by population in the US. Make me maps for each county.

Related blog post: https://walker-data.com/posts/iterative-mapping/

Warning: may be too memory-intensive for Posit Cloud users

Challenge: making 100 maps at once

At first, this may seem like a significant task! You’ll need to:

  • Generate a list of the 100 largest counties by population in the US;

  • Get data for those counties on working from home at a sufficiently granular geographic level to show patterns;

  • Make and deliver 100 maps.

Step 1: find the 100 largest counties in the US by population

library(tidycensus)
library(tidyverse)

top100counties <- get_acs(
  geography = "county",
  variables = "B01003_001",
  year = 2022,
  survey = "acs1"
) %>%
  slice_max(estimate, n = 100)
top100counties
# A tibble: 100 × 5
   GEOID NAME                           variable   estimate   moe
   <chr> <chr>                          <chr>         <dbl> <dbl>
 1 06037 Los Angeles County, California B01003_001  9721138    NA
 2 17031 Cook County, Illinois          B01003_001  5109292    NA
 3 48201 Harris County, Texas           B01003_001  4780913    NA
 4 04013 Maricopa County, Arizona       B01003_001  4551524    NA
 5 06073 San Diego County, California   B01003_001  3276208    NA
 6 06059 Orange County, California      B01003_001  3151184    NA
 7 12086 Miami-Dade County, Florida     B01003_001  2673837    NA
 8 48113 Dallas County, Texas           B01003_001  2600840    NA
 9 36047 Kings County, New York         B01003_001  2590516    NA
10 06065 Riverside County, California   B01003_001  2473902    NA
# ℹ 90 more rows

Step 2: pull tract-level Census data by county

  • We now need to organize tract data county-by-county using the information in the table. But how can we do this efficiently?

  • Something to remember (from Allen Downey, Think Python):

Repeating identical or similar tasks without making errors is something that computers do well and people do poorly

  • My preference for iteration in R: the purrr package

Step 2: pull tract-level data by county

  • Here’s how I’d do it:
wfh_tract_list <- top100counties %>%
  split(~NAME) %>%
  map(function(county) {
    state_fips <- str_sub(county$GEOID, 1, 2)
    county_fips <- str_sub(county$GEOID, 3, 5)
    
    get_acs(
      geography = "tract",
      variables = "DP03_0024P",
      state = state_fips,
      county = county_fips,
      year = 2022,
      geometry = TRUE
    )
  })

What just happened?

  1. We split our dataset of counties into separate datasets by NAME with split()

  2. We used map() to set up iteration over each county dataset

  3. We extract the state and county FIPS codes for each of the top 100 counties

  4. Finally, we call get_acs() using those state and county FIPS codes.
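The split-then-iterate pattern can be previewed without any API calls using base R’s split() and lapply() on a toy table (purrr’s map() works analogously); the FIPS extraction mirrors the workshop code:

```r
# Toy stand-in for the table of top counties
counties <- data.frame(
  GEOID = c("06037", "17031", "48201"),
  NAME = c("Los Angeles", "Cook", "Harris")
)

# Split into a named list of one-row data frames, then iterate over each
results <- lapply(split(counties, counties$NAME), function(county) {
  list(
    state  = substr(county$GEOID, 1, 2),  # first 2 digits: state FIPS
    county = substr(county$GEOID, 3, 5)   # last 3 digits: county FIPS
  )
})

# Results are accessible by county name
results$Cook$state
```

In the real workflow, the body of the function is a get_acs() call using those two FIPS codes, so each list element becomes a tract-level dataset for one county.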

This is why you want a Census API key!

Now, let’s make some maps:

  • We can again use map() to… well… make some maps
library(mapview)

wfh_maps <- map(wfh_tract_list, function(county) {
  mapview(
    county, 
    zcol = "estimate",
    layer.name = "% working from home"
  ) 
})
wfh_maps$`San Mateo County, California`

Small-area time-series analysis

Prompt: analyzing change over time

Nice work! Now, you have a new follow-up from your boss:

Interesting stuff. However, I’m really interested in knowing how work-from-home is changing over time in our target markets. Could you show me where work-from-home has increased the most in Salt Lake City?

Time-series and the ACS: some notes

  • 5-year ACS data represent overlapping survey years, meaning that direct comparison between overlapping datasets is not typically recommended

  • The Census Bureau recommends comparing non-overlapping years in the 5-year ACS

  • For 2018-2022 data, available 5-year intervals for comparison are 2008-2012 and 2013-2017

The ACS Comparison Profile

  • The Comparison Profile includes data on the current 5-year ACS and the 5-year ACS that ends 5 years prior to help with time-series comparisons
utah_wfh_compare <- get_acs(
  geography = "county",
  variables = c(
    wfh17 = "CP03_2017_024",
    wfh22 = "CP03_2022_024"
  ),
  state = "UT",
  year = 2022
)
utah_wfh_compare
# A tibble: 50 × 4
   GEOID NAME                   variable estimate
   <chr> <chr>                  <chr>       <dbl>
 1 49001 Beaver County, Utah    wfh17         3.4
 2 49001 Beaver County, Utah    wfh22         8.7
 3 49003 Box Elder County, Utah wfh17         4  
 4 49003 Box Elder County, Utah wfh22         9.8
 5 49005 Cache County, Utah     wfh17         5.2
 6 49005 Cache County, Utah     wfh22        10.7
 7 49007 Carbon County, Utah    wfh17         3.3
 8 49007 Carbon County, Utah    wfh22         6.2
 9 49011 Davis County, Utah     wfh17         5.5
10 49011 Davis County, Utah     wfh22        14.2
# ℹ 40 more rows

Geography and making comparisons

  • Data in the Comparison Profile tables are only available down to the county level (though county boundaries can change from ACS to ACS!)

  • Comparing neighborhood-level change across ACS datasets can introduce additional challenges, as Census tract and block group boundaries may differ across years

New boundaries drawn with the 2020 Census

  • Locations with significant population growth (e.g. suburban Collin County, Texas) will have Census tracts / block groups with large populations subdivided in the 2020 Census

  • Example: total population by Census tract in Collin County in the 2013-2017 ACS (on the left) and the 2018-2022 ACS (on the right)

Data setup

  • Let’s grab data on the population working from home by Census tract in Salt Lake City

  • Use a projected coordinate reference system to speed things up!

library(sf)

wfh_17 <- get_acs(geography = "tract", variables = "B08006_017", year = 2017,
                  state = "UT", county = "Salt Lake", geometry = TRUE) %>%
  st_transform(6620)

wfh_22 <- get_acs(geography = "tract", variables = "B08006_017", year = 2022,
                  state = "UT", county = "Salt Lake", geometry = TRUE) %>%
  st_transform(6620)

Method 1: area-weighted interpolation

  • Interpolating data between sets of boundaries involves the use of weights to re-distribute data from one geography to another

  • A common method is area-weighted interpolation, which allocates information from one geography to another weighted by the area of overlap

  • Typically more accurate when going backward, as many new tracts will “roll up” within parent tracts from a previous Census (though not always)

library(sf)

wfh_22_to_17 <- wfh_22 %>%
  select(estimate) %>%
  st_interpolate_aw(to = wfh_17, extensive = TRUE)
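For an extensive (count) variable, the allocation behind st_interpolate_aw() is proportional arithmetic. A minimal sketch with hypothetical overlap shares:

```r
# A source tract holds 1000 home-based workers (an extensive variable).
# Suppose 30% of its area overlaps target tract A and 70% overlaps tract B.
source_value <- 1000
overlap_share <- c(A = 0.3, B = 0.7)

# Extensive interpolation allocates the count in proportion
# to the area of overlap; the total is preserved
allocated <- source_value * overlap_share
allocated
```

The implicit assumption is that workers are spread evenly across the source tract's area, which is exactly where area weighting can go wrong in low-density or empty areas.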

Area-weighted interpolation

library(mapview)
library(leafsync)

m22a <- mapview(wfh_22, zcol = "estimate", layer.name = "2022 geographies")
m17a <- mapview(wfh_22_to_17, zcol = "estimate", layer.name = "2017 geographies")

sync(m22a, m17a)

Method 2: population-weighted interpolation

  • If you need to go forward, area-weighted interpolation may be very inaccurate as it can incorrectly allocate large values to low-density / empty areas

  • Population-weighted interpolation will instead use an underlying dataset explaining the population distribution as weights

  • Example: population-weighted interpolation using block population weights from the 2020 Census
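The contrast with area weighting can be shown with the same proportional arithmetic, but with block populations as the weights (hypothetical numbers):

```r
# A 2017 tract with 500 home-based workers overlaps two 2022 tracts.
# Area weighting might split it ~50/50, but suppose the underlying
# 2020 blocks put 90% of the population on one side.
source_value <- 500
pop_in_overlap <- c(tract_A = 900, tract_B = 100)

# Population-weighted interpolation allocates by population share,
# not by area share
weights <- pop_in_overlap / sum(pop_in_overlap)
allocated <- source_value * weights
allocated
```

This is why interpolate_pw() asks for a weights layer (here, 2020 Census blocks with a POP20 column): the blocks tell it where people actually live within each overlap.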

In tidycensus: interpolate_pw()

library(tigris)
options(tigris_use_cache = TRUE)

salt_lake_blocks <- blocks(
  "UT", 
  "Salt Lake", 
  year = 2020
)

wfh_17_to_22 <- interpolate_pw(
  from = wfh_17,
  to = wfh_22,
  to_id = "GEOID",
  weights = salt_lake_blocks,
  weight_column = "POP20",
  crs = 6620,
  extensive = TRUE
)

Evaluating the result

m17b <- mapview(wfh_17, zcol = "estimate", layer.name = "2017 geographies")
m22b <- mapview(wfh_17_to_22, zcol = "estimate", layer.name = "2022 geographies")

sync(m17b, m22b)

Evaluating change over time

Evaluating change over time

wfh_shift <- wfh_17_to_22 %>%
  select(GEOID, estimate17 = estimate) %>%
  left_join(
    select(st_drop_geometry(wfh_22), 
           GEOID, estimate22 = estimate), by = "GEOID"
  ) %>%
  mutate(
    shift = estimate22 - estimate17,
    pct_shift = 100 * (shift / estimate17)
  )
mapview(wfh_shift, zcol = "shift")

Thank you!