class: center, middle, inverse, title-slide

# Using the new 2016-2020 ACS estimates in R

### Kyle Walker

### March 25, 2022

---

## About me

.pull-left[

* Associate Professor of Geography at TCU
* Spatial data science researcher and consultant
* R package developer: __tidycensus__, __tigris__, __mapboxapi__
* Book: [_Analyzing US Census Data: Methods, Maps, and Models in R_](https://walker-data.com/census-r/)
  - Available for free online right now;
  - To be published in print with CRC Press in fall 2022

]

.pull-right[

<img src=img/photo.jpeg>

]

---

## SSDAN workshop series

* March 10: an introduction to 2020 US Census data
  - [If you missed the workshop, view the slides here](https://walker-data.com/umich-workshop-2022/intro-2020-census/#1)
* Last week: Mapping 2020 Census data
  - [If you missed the workshop, view the slides here](https://walker-data.com/umich-workshop-2022/mapping-census-data/#1)
* __Right now__: a first look at the 2016-2020 American Community Survey data with R and __tidycensus__

---

## Today's agenda

* Hour 1: An overview of the 2016-2020 ACS data in R and __tidycensus__
* Hour 2: Visualizing 2016-2020 ACS data
* Hour 3: Advanced topics: margins of error & time-series analysis

---

## General setup

* Packages required for today's workshop:

```r
install.packages(c("tidycensus", "tidyverse", "mapview", "leafsync", "ggspatial"))

# For brand-new features:
install.packages("remotes")
remotes::install_github("walkerke/tidycensus")
```

* Other required packages will be picked up as dependencies of these packages
* Or use the pre-built RStudio Cloud environment at https://rstudio.cloud/project/3705005

---

class: middle, center, inverse

## Part 1: An overview of the 2016-2020 ACS data in R and __tidycensus__

---

## What is the American Community Survey (ACS)?

* Annual survey of 3.5 million US households
* Covers topics not available in decennial US Census data (e.g. income, education, language, housing characteristics)
* Available as 1-year estimates (for geographies of population 65,000 and greater) and 5-year estimates (for geographies down to the block group)
  - 2020 1-year data only available as [experimental estimates](https://www.census.gov/programs-surveys/acs/data/experimental-data.html)
* Data delivered as _estimates_ characterized by _margins of error_

---

## The __tidycensus__ R package

* R interface to the Decennial Census, American Community Survey, Population Estimates Program, and Public Use Microdata Sample APIs
* Key features:
  - Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested);
  - Automatically downloads and merges Census geometries to data for mapping;
  - Includes tools for handling margins of error in the ACS and working with survey weights in the ACS PUMS;
  - States and counties can be requested by name (no more looking up FIPS codes!)

---

## ACS data in __tidycensus__

* The `get_acs()` function in __tidycensus__ allows you to access ACS data from 2005 through 2020
* Our focus today: the 2016-2020 ACS estimates, available with the argument `year = 2020`
* Other required arguments: `geography`, for the level of aggregation, and `variables`, for one or more ACS variables
* `get_acs()` defaults to the 5-year ACS with the argument `survey = "acs5"`; `survey = "acs1"` is used for 1-year ACS data

---

## Review: __tidycensus__ and the Census API

* To use tidycensus, you will need a Census API key. Visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.
* Once activated, use the `census_api_key()` function to set your key as an environment variable

```r
library(tidycensus)

census_api_key("YOUR KEY GOES HERE", install = TRUE)
```

---

## Example: population born in Mexico by state

```r
library(tidycensus)

born_in_mexico <- get_acs(
  geography = "state",
  variables = "B05006_150",
* year = 2020
)
```

---

```r
born_in_mexico
```

```
## # A tibble: 52 × 5
##    GEOID NAME                 variable   estimate   moe
##    <chr> <chr>                <chr>         <dbl> <dbl>
##  1 01    Alabama              B05006_150    46927  1846
##  2 02    Alaska               B05006_150     4181   709
##  3 04    Arizona              B05006_150   510639  8028
##  4 05    Arkansas             B05006_150    60236  2182
##  5 06    California           B05006_150  3962910 25353
##  6 08    Colorado             B05006_150   215778  4888
##  7 09    Connecticut          B05006_150    28086  2144
##  8 10    Delaware             B05006_150    14616  1065
##  9 11    District of Columbia B05006_150     4026   761
## 10 12    Florida              B05006_150   257933  6418
## # … with 42 more rows
```

---

## Example: median age by state

```r
median_age <- get_acs(
  geography = "state",
  variables = "B01002_001",
* year = 2020
)
```

---

```r
median_age
```

```
## # A tibble: 52 × 5
##    GEOID NAME                 variable   estimate   moe
##    <chr> <chr>                <chr>         <dbl> <dbl>
##  1 01    Alabama              B01002_001     39.2   0.1
##  2 02    Alaska               B01002_001     34.6   0.2
##  3 04    Arizona              B01002_001     37.9   0.2
##  4 05    Arkansas             B01002_001     38.3   0.2
##  5 06    California           B01002_001     36.7   0.1
##  6 08    Colorado             B01002_001     36.9   0.1
##  7 09    Connecticut          B01002_001     41.1   0.2
##  8 10    Delaware             B01002_001     41     0.2
##  9 11    District of Columbia B01002_001     34.1   0.1
## 10 12    Florida              B01002_001     42.2   0.2
## # … with 42 more rows
```

---

## Understanding `get_acs()` output

By default, ACS data are returned by `get_acs()` with the following five columns:

* `GEOID`: unique identifier for the Census geographic unit
* `NAME`: A descriptive name for the geographic unit (e.g. a state name)
* `variable`: the ACS variable ID
* `estimate`: The ACS estimate. Estimates are interpreted as _characteristics_ rather than true counts.
* `moe`: The margin of error associated with the estimate at a 90 percent confidence level.

Using the argument `output = "wide"` will spread the estimates and MOEs across the columns of the dataset

---

## Geography and the ACS

.pull-left[

* The `geography` argument in `get_acs()` allows users to access data aggregated to varying levels of the Census hierarchy
* Geographies should be formatted [as specified in the _Analyzing US Census Data_ book](https://walker-data.com/census-r/an-introduction-to-tidycensus.html#geography-and-variables-in-tidycensus)
* ACS data availability __starts at the block group__; not all variables are available for all geographies

]

.pull-right[

<img src=img/census_diagram.png>

]

---

## Example: geography aliases

* The commonly-requested "metropolitan statistical area/micropolitan statistical area" and "zip code tabulation area" geographies can be aliased as "cbsa" and "zcta", respectively

```r
median_age_cbsa <- get_acs(
* geography = "cbsa",
  variables = "B01002_001",
  year = 2020
)
```

---

```r
median_age_cbsa
```

```
## # A tibble: 939 × 5
##    GEOID NAME                             variable   estimate   moe
##    <chr> <chr>                            <chr>         <dbl> <dbl>
##  1 10100 Aberdeen, SD Micro Area          B01002_001     37.7   0.9
##  2 10140 Aberdeen, WA Micro Area          B01002_001     44.3   0.3
##  3 10180 Abilene, TX Metro Area           B01002_001     34.1   0.2
##  4 10220 Ada, OK Micro Area               B01002_001     36     0.4
##  5 10300 Adrian, MI Micro Area            B01002_001     42.1   0.4
##  6 10380 Aguadilla-Isabela, PR Metro Area B01002_001     43.5   0.2
##  7 10420 Akron, OH Metro Area             B01002_001     40.3   0.2
##  8 10460 Alamogordo, NM Micro Area        B01002_001     36.2   0.3
##  9 10500 Albany, GA Metro Area            B01002_001     37.3   0.4
## 10 10540 Albany-Lebanon, OR Metro Area    B01002_001     39.9   0.4
## # … with 929 more rows
```

---

## Example: geographic subsets

* Smaller geographies (e.g. tracts, block groups) are available by state and county using `state` and `county` arguments
* Access with common names - no need for FIPS codes!
```r
washtenaw_median_age <- get_acs(
  geography = "block group",
  variables = "B01002_001",
* state = "MI",
* county = "Washtenaw",
  year = 2020
)
```

---

```r
washtenaw_median_age
```

```
## # A tibble: 308 × 5
##    GEOID        NAME                                     variable estimate   moe
##    <chr>        <chr>                                    <chr>       <dbl> <dbl>
##  1 261614001001 Block Group 1, Census Tract 4001, Washt… B01002_…     30.9   4.4
##  2 261614001002 Block Group 2, Census Tract 4001, Washt… B01002_…     21.8   0.7
##  3 261614003001 Block Group 1, Census Tract 4003, Washt… B01002_…     21.5   0.5
##  4 261614003002 Block Group 2, Census Tract 4003, Washt… B01002_…     21.6   0.7
##  5 261614003003 Block Group 3, Census Tract 4003, Washt… B01002_…     21.1   0.4
##  6 261614004001 Block Group 1, Census Tract 4004, Washt… B01002_…     35     11.7
##  7 261614004002 Block Group 2, Census Tract 4004, Washt… B01002_…     23.7   3
##  8 261614004003 Block Group 3, Census Tract 4004, Washt… B01002_…     50.2  10
##  9 261614005001 Block Group 1, Census Tract 4005, Washt… B01002_…     NA    NA
## 10 261614005002 Block Group 2, Census Tract 4005, Washt… B01002_…     21.6   0.7
## # … with 298 more rows
```

---

## Variables and datasets in the 5-year ACS

* The ACS Detailed Tables, Data Profile, and Subject Tables are available with `get_acs()`; __tidycensus__ will auto-detect the dataset from the variable name and fetch from the appropriate location
* To look up variable IDs from a dataset, use:
  - Detailed Tables: `load_variables(2020, "acs5")`
  - Data Profile: `load_variables(2020, "acs5/profile")`
  - Subject Tables: `load_variables(2020, "acs5/subject")`
  - Comparison Profile: `load_variables(2020, "acs5/cprofile")` (GitHub only)
* __New__: The `geography` column tells you the smallest geography at which a variable is available; the Data Profile and Subject tables are available at the tract and up, and the Comparison Profile is available at the county and up

---

class: middle, center, inverse

## Analyzing 2016-2020 ACS data

---

## Viewing ACS data

* Basic analysis of data in RStudio can be accomplished with the interactive table generated by `View()` (or Ctrl+click on the data object in your script)
* Example: median home value by county in the 2016-2020 ACS

```r
median_home_value <- get_acs(
  geography = "county",
  variables = "B25077_001",
  year = 2020
)
```

---

## Exploring data with the tidyverse

* ACS data returned by __tidycensus__ are designed for analysis with __tidyverse__ packages like __dplyr__
* [Basic examples covered in the first workshop will work for ACS data as well](https://walker-data.com/umich-workshop-2022/intro-2020-census/#42)
* Example question: which counties are in the top 10 percent nationally for median home values, and how does this break down by state?

---

## Step 1: extract the top 10 percent

```r
library(tidyverse)

top10percent <- median_home_value %>%
* mutate(percentile = percent_rank(estimate)) %>%
* filter(percentile >= 0.9)
```

---

```r
top10percent
```

```
## # A tibble: 321 × 6
##    GEOID NAME                               variable   estimate   moe percentile
##    <chr> <chr>                              <chr>         <dbl> <dbl>      <dbl>
##  1 02016 Aleutians West Census Area, Alaska B25077_001   316700 61170      0.943
##  2 02020 Anchorage Municipality, Alaska     B25077_001   320100  4445      0.945
##  3 02100 Haines Borough, Alaska             B25077_001   263900 41626      0.902
##  4 02110 Juneau City and Borough, Alaska    B25077_001   355100 11524      0.959
##  5 02130 Ketchikan Gateway Borough, Alaska  B25077_001   299500 12386      0.933
##  6 02150 Kodiak Island Borough, Alaska      B25077_001   295100 22848      0.931
##  7 02220 Sitka City and Borough, Alaska     B25077_001   365800 18682      0.963
##  8 02230 Skagway Municipality, Alaska       B25077_001   397200 38489      0.973
##  9 04005 Coconino County, Arizona           B25077_001   299100 10357      0.933
## 10 04013 Maricopa County, Arizona           B25077_001   278700  1401      0.915
## # … with 311 more rows
```

---

## Step 2: extract state information from the `NAME` column

* The `separate()` function splits a column into multiple columns based on a separator; useful for parsing the `NAME` column returned by __tidycensus__

```r
top10percent <- top10percent %>%
* separate(
*   NAME,
*   into = c("county", "state"),
*   sep = ", "
* )
```

---

```r
top10percent
```

```
## # A tibble: 321 × 7
##    GEOID county                     state   variable   estimate   moe percentile
##    <chr> <chr>                      <chr>   <chr>         <dbl> <dbl>      <dbl>
##  1 02016 Aleutians West Census Area Alaska  B25077_001   316700 61170      0.943
##  2 02020 Anchorage Municipality     Alaska  B25077_001   320100  4445      0.945
##  3 02100 Haines Borough             Alaska  B25077_001   263900 41626      0.902
##  4 02110 Juneau City and Borough    Alaska  B25077_001   355100 11524      0.959
##  5 02130 Ketchikan Gateway Borough  Alaska  B25077_001   299500 12386      0.933
##  6 02150 Kodiak Island Borough      Alaska  B25077_001   295100 22848      0.931
##  7 02220 Sitka City and Borough     Alaska  B25077_001   365800 18682      0.963
##  8 02230 Skagway Municipality       Alaska  B25077_001   397200 38489      0.973
##  9 04005 Coconino County            Arizona B25077_001   299100 10357      0.933
## 10 04013 Maricopa County            Arizona B25077_001   278700  1401      0.915
## # … with 311 more rows
```

---

## Step 3: group by `state` and summarize

```r
state_summary <- top10percent %>%
* group_by(state) %>%
* summarize(n = n()) %>%
* arrange(desc(n))
```

---

```r
state_summary
```

```
## # A tibble: 42 × 2
##    state             n
##    <chr>         <int>
##  1 California       43
##  2 Virginia         36
##  3 Colorado         30
##  4 Oregon           19
##  5 Washington       17
##  6 New Jersey       15
##  7 Maryland         14
##  8 New York         13
##  9 Massachusetts    11
## 10 Texas            11
## # … with 32 more rows
```

---

## Part 1 exercises

1. Use `load_variables()` to browse the available variables in the 2016-2020 ACS. Find a variable that interests you and try reproducing one of the examples in this section with that variable.

2. Use the __tidyverse__ functions covered in this section to explore that variable for all counties in the US. Which counties have the largest estimate for your variable? Which have the smallest?
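If you want a starting point, the exercises can be sketched roughly as follows. This assumes an activated Census API key and an internet connection; the variable `B25035_001` (median year structure built) is just one illustrative choice, not part of the exercises:

```r
library(tidycensus)
library(tidyverse)

# Browse Detailed Tables variables for the 2016-2020 ACS
vars_2020 <- load_variables(2020, "acs5")

# Search labels for a topic of interest
filter(vars_2020, str_detect(label, "Median year structure built"))

# Pull the chosen variable for all US counties, then rank the estimates
year_built <- get_acs(
  geography = "county",
  variables = "B25035_001",
  year = 2020
)

arrange(year_built, desc(estimate))
```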
---

class: middle, center, inverse

## Part 2: Visualizing 2016-2020 ACS Data

---

## Visualization in our workshops: an overview

* In the first workshop, [we covered a range of methods using __ggplot2__ to visualize decennial US Census data](https://walker-data.com/umich-workshop-2022/intro-2020-census/#65)
* The second workshop focused on [cartography, largely with the __tmap__ package](https://walker-data.com/umich-workshop-2022/mapping-census-data/#30)
* Those methods will work for ACS data as well; we'll re-visit some of the basics here and introduce some new visualization methods with a focus on __ggplot2__

---

## __ggplot2__: a review

.pull-left[

* Graphics in __ggplot2__ are initialized with the `ggplot()` function, in which a user typically supplies a dataset and aesthetic mapping with `aes()`
* Graphical elements are then "layered" onto the ggplot object, consisting of a "geom", or geometric object (`geom_*()`), and custom styling elements linked with the `+` operator
* Example: a histogram of county median home values generated with `geom_histogram()`

]

.pull-right[

```r
library(tidyverse)
library(scales)

ggplot(median_home_value, aes(x = estimate)) + 
  geom_histogram(bins = 30) + 
  scale_x_continuous(labels = label_dollar())
```

![](index_files/figure-html/histogram-1.png)<!-- -->

]

---

class: middle, center, inverse

## Example #1: visualizing margins of error

---

## Visualizing ACS estimates

* As opposed to decennial US Census data, ACS estimates include information on uncertainty, represented by the _margin of error_ in the `moe` column
* This means that in some cases, visualization of estimates without reference to the margin of error can be misleading
* Example: a sorted bar chart ([drawing from an example in the first workshop](https://walker-data.com/umich-workshop-2022/intro-2020-census/#79))

---

## Visualizing ACS estimates

.pull-left[

* Data on median household income by county can be readily acquired...

```r
nj_income <- get_acs(
  geography = "county",
  variables = "B19013_001",
  state = "NJ",
  year = 2020
) %>%
  mutate(NAME = str_remove(NAME, " County, New Jersey"))
```

]

--

.pull-right[

* ...and visualized comparatively with __ggplot2__

```r
nj_income_bar <- ggplot(nj_income, 
                        aes(x = estimate, 
                            y = reorder(NAME, estimate))) + 
  geom_col() + 
  labs(title = "Median household income, 2016-2020 ACS", 
       subtitle = "Counties in New Jersey", 
       x = "ACS estimate", 
       y = "") + 
  theme_minimal(base_size = 18) + 
  scale_x_continuous(labels = dollar_format(scale = 0.001, suffix = "K"))
```

]

---

```r
nj_income_bar
```

![](index_files/figure-html/nj-income-bar-show-1.png)<!-- -->

---

## Problem: which county has the highest income in New Jersey?

* The chart suggests that Hunterdon County has the highest income in the state, but Morris and Somerset Counties are well within Hunterdon's margin of error

```r
nj_income %>% 
  arrange(desc(estimate)) %>% 
  head(3)
```

```
## # A tibble: 3 × 5
##   GEOID NAME      variable   estimate   moe
##   <chr> <chr>     <chr>         <dbl> <dbl>
## 1 34019 Hunterdon B19013_001   117858  4348
## 2 34027 Morris    B19013_001   117298  2241
## 3 34035 Somerset  B19013_001   116510  2194
```

* How to visualize uncertainty in an intuitive way?
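The overlap can also be checked directly. This sketch hard-codes the three estimates and MOEs shown above rather than re-querying the API:

```r
library(dplyr)

# Estimates and 90% MOEs for the top three counties, copied from the
# nj_income output above
top3 <- tibble(
  NAME = c("Hunterdon", "Morris", "Somerset"),
  estimate = c(117858, 117298, 116510),
  moe = c(4348, 2241, 2194)
)

# A county is statistically indistinguishable from the top-ranked county
# if its MOE interval overlaps Hunterdon's interval of 117858 +/- 4348
top3 %>%
  mutate(
    lower = estimate - moe,
    upper = estimate + moe,
    overlaps_top = upper >= max(estimate) - moe[which.max(estimate)]
  )
```

All three intervals overlap, so the apparent ranking among these counties is not meaningful at this confidence level.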
---

## Visualizing margins of error

```r
nj_income_errorbar <- ggplot(nj_income, 
                             aes(x = estimate, 
                                 y = reorder(NAME, estimate))) + 
* geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe), 
*               width = 0.5, size = 0.5) + 
* geom_point(color = "navy", size = 3) + 
  labs(title = "Median household income, 2016-2020 ACS", 
       subtitle = "Counties in New Jersey", 
       x = "ACS estimate", 
       y = "", 
*      caption = "Error bars reflect the margin of error around the ACS estimate") + 
  theme_minimal(base_size = 18) + 
  scale_x_continuous(labels = dollar_format(scale = 0.001, suffix = "K"))
```

---

```r
nj_income_errorbar
```

![](index_files/figure-html/nj-income-errorbar-show-1.png)<!-- -->

---

class: middle, center, inverse

## Example #2: building population pyramids

---

## Identifying relevant data

* The ACS Data Profile and ACS Subject Tables will commonly include pre-aggregated versions of popular statistics, helping analysts avoid doing the aggregations on their own

```r
subject <- load_variables(2020, "acs5/subject")
```

---

## Preparing the data request

```r
cohort_names <- c("0-4", "5-9", "10-14", "15-19", 
                  "20-24", "25-29", "30-34", "35-39", 
                  "40-44", "45-49", "50-54", "55-59", 
                  "60-64", "65-69", "70-74", "75-79", 
                  "80-84", "85+")

male_vars <- 2:19 %>% 
  str_pad(2, "left", "0") %>% 
  paste0("S0101_C03_0", .) %>% 
  set_names(cohort_names)

female_vars <- 2:19 %>% 
  str_pad(2, "left", "0") %>% 
  paste0("S0101_C05_0", .) %>% 
  set_names(cohort_names)
```

---

## Assembling the data

```r
male_data <- get_acs(
  geography = "county",
  variables = male_vars,
  state = "MI",
  county = "Washtenaw",
  year = 2020
) %>%
  mutate(sex = "Male",
         estimate = estimate * -1)

female_data <- get_acs(
  geography = "county",
  variables = female_vars,
  state = "MI",
  county = "Washtenaw",
  year = 2020
) %>%
  mutate(sex = "Female")

pyramid_data <- bind_rows(male_data, female_data) %>%
* mutate(variable = factor(variable, levels = cohort_names))
```

---

## Designing the visualization

```r
washtenaw_pyramid <- ggplot(pyramid_data, 
                            aes(x = estimate, 
                                y = variable, 
                                fill = sex)) + 
  geom_col(width = 0.95, alpha = 0.75) + 
  theme_minimal(base_size = 18) + 
  scale_x_continuous(labels = function(x) paste0(abs(x / 1000), "k")) + 
  scale_fill_manual(values = c("#00274C", "#FFCB05")) + 
  labs(x = "ACS estimate", 
       y = "", 
       title = "Population structure in Washtenaw County, Michigan", 
       fill = "", 
       caption = "Data source: 2016-2020 ACS & tidycensus R package")
```

---

```r
washtenaw_pyramid
```

![](index_files/figure-html/washtenaw-pyramid-show-1.png)<!-- -->

---

class: middle, center, inverse

## Example #3: Mapping ACS data

---

## Review: mapping Census data

* [In last week's workshop](https://walker-data.com/umich-workshop-2022/mapping-census-data/#42), participants learned how to make maps of decennial Census data with a focus on the __tmap__ R package
* __tidycensus__ functions return pre-joined geometry and Census attribute data with the argument `geometry = TRUE`, helping you map data faster
* `geometry = TRUE` works with `get_acs()`!
---

## Example: using `geometry = TRUE` with `get_acs()`

```r
orleans_income <- get_acs(
  geography = "tract",
  variables = "B19013_001",
  state = "LA",
  county = "Orleans",
  year = 2020,
  geometry = TRUE
)
```

---

```r
orleans_income
```

```
## Simple feature collection with 184 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -90.14007 ymin: 29.86666 xmax: -89.62552 ymax: 30.19866
## Geodetic CRS:  NAD83
## First 10 features:
##          GEOID                                           NAME   variable
## 1  22071003307  Census Tract 33.07, Orleans Parish, Louisiana B19013_001
## 2  22071008400     Census Tract 84, Orleans Parish, Louisiana B19013_001
## 3  22071009000     Census Tract 90, Orleans Parish, Louisiana B19013_001
## 4  22071000902   Census Tract 9.02, Orleans Parish, Louisiana B19013_001
## 5  22071009900     Census Tract 99, Orleans Parish, Louisiana B19013_001
## 6  22071004600     Census Tract 46, Orleans Parish, Louisiana B19013_001
## 7  22071010700    Census Tract 107, Orleans Parish, Louisiana B19013_001
## 8  22071001301  Census Tract 13.01, Orleans Parish, Louisiana B19013_001
## 9  22071012102 Census Tract 121.02, Orleans Parish, Louisiana B19013_001
## 10 22071001754  Census Tract 17.54, Orleans Parish, Louisiana B19013_001
##    estimate   moe                       geometry
## 1     14753  9772 MULTIPOLYGON (((-90.07408 3...
## 2     42125  6372 MULTIPOLYGON (((-90.08514 2...
## 3     78088 38392 MULTIPOLYGON (((-90.09018 2...
## 4        NA    NA MULTIPOLYGON (((-90.01691 2...
## 5     79798 13899 MULTIPOLYGON (((-90.10192 2...
## 6     75246 23556 MULTIPOLYGON (((-90.10509 2...
## 7     63250 15300 MULTIPOLYGON (((-90.10895 2...
## 8     19464  8693 MULTIPOLYGON (((-90.04479 2...
## 9     66000 12138 MULTIPOLYGON (((-90.12815 2...
## 10    26439  9283 MULTIPOLYGON (((-89.99241 3...
```

---

## Example: mapping with __ggplot2__

.pull-left[

* The `geom_sf()` function in __ggplot2__ can be used to quickly make exploratory maps
* For ACS data, specify `aes(fill = estimate)` to produce a choropleth map

]

.pull-right[

```r
ggplot(orleans_income, aes(fill = estimate)) + 
  geom_sf()
```

![](index_files/figure-html/orleans-income-first-1.png)<!-- -->

]

---

## Customizing colors

.pull-left[

* A wide range of color palettes are available with the `scale_fill_*()` family of functions
* The viridis color palettes (available with `scale_fill_viridis_c()` for continuous versions) offer a series of colorblind and print-friendly palettes for mapping

]

.pull-right[

```r
ggplot(orleans_income, aes(fill = estimate)) + 
  geom_sf() + 
  scale_fill_viridis_c(option = "mako")
```

![](index_files/figure-html/orleans-income-show-1.png)<!-- -->

]

---

## Refining your cartography

.pull-left[

* [As discussed in the previous workshop](https://walker-data.com/umich-workshop-2022/mapping-census-data/#82), areas with significant water coverage, like New Orleans, will look best when water area is removed
* The `erase_water()` function in the __tigris__ package handles this; use the argument `cb = FALSE` when fetching ACS data to avoid sliver polygons

]

.pull-right[

```r
library(tigris)
library(sf)

orleans_erase <- get_acs(
  geography = "tract",
  variables = "B19013_001",
  state = "LA",
  county = "Orleans",
  geometry = TRUE,
  year = 2020,
* cb = FALSE
) %>%
* st_transform(26982) %>%
* erase_water(area_threshold = 0.99)
```

]

---

```r
ggplot(orleans_erase, aes(fill = estimate)) + 
  geom_sf() + 
  scale_fill_viridis_c(option = "mako")
```

![](index_files/figure-html/orleans-erase-show-1.png)<!-- -->

---

## Adding a basemap

* The easiest way to add a basemap to a __ggplot2__ map is with `annotation_map_tile()` in the __ggspatial__ package

```r
library(ggspatial)

final_map <- ggplot(orleans_erase, aes(fill = estimate)) + 
* annotation_map_tile(type = "cartolight", zoom = 11) + 
  theme_void(base_size = 20) + 
  geom_sf(alpha = 0.6, lwd = 0.1) + 
  scale_fill_viridis_c(option = "mako", labels = label_dollar()) + 
  labs(title = "Median household income, Orleans Parish LA", 
       subtitle = "2016-2020 ACS estimates", 
       caption = "Tile source: CARTO / OpenStreetMap contributors", 
       fill = "ACS estimate ")
```

---

```r
final_map
```

![](index_files/figure-html/final-map-show-1.png)<!-- -->

---

## Part 2 exercises

1. Try re-creating one of the three charts presented in this section with a different state and/or county.

2. Customize your chart with different colors or styles; [ggplot2 documentation on themes is found here](https://ggplot2.tidyverse.org/reference/ggtheme.html).

---

class: middle, center, inverse

## Part 3: margins of error and time-series analysis

---

## Margins of error in the 2016-2020 ACS

* As discussed earlier, __tidycensus__ will always return the margin of error for an estimate when applicable
* Due to data collection issues influenced by the COVID-19 pandemic, the 2020 ACS had a lower response rate than in previous years
* This means that margins of error are larger in the 2016-2020 ACS than in previous datasets

---

## Comparing margins of error

<img src=img/margins_of_error.png style = "width: 700px">

---

## Modifying the returned margin of error

* By default, margins of error are found in the `moe` column; in wide-form data, MOEs are found in columns that end with `M`
* The `moe_level` parameter controls the confidence level of the MOE; choose `90` (the default), `95`, or `99`

---

## Margins of error for small geographies

.pull-left[

* For small geographies like Census tracts or block groups, margins of error may be proportionally large for small populations
* Example: granular population age bands by sex for a Census tract in Salt Lake County, Utah

]

.pull-right[

```r
vars <- paste0("B01001_0", c(20:25, 44:49))

salt_lake <- get_acs(
  geography = "tract",
  variables = vars,
  state = "Utah",
  county = "Salt Lake",
  year = 2019
)

example_tract <- salt_lake %>%
  filter(GEOID == "49035100100")
```

]
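To quantify how imprecise these small-area estimates are, note that a 90 percent MOE corresponds to roughly 1.645 standard errors, so a coefficient of variation (CV) can be derived from any estimate/MOE pair. The values below are illustrative, not pulled from the API:

```r
# Illustrative estimate/MOE pair for a small age band in a single tract
estimate <- 35
moe <- 11.7

se <- moe / 1.645   # back out the standard error from the 90% MOE
cv <- se / estimate # coefficient of variation

round(cv, 3)        # about 0.203, i.e. a CV above 20 percent
```

A common rule of thumb treats estimates with a CV above roughly 30 percent as unreliable.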
---

## Using margin of error functions in __tidycensus__

* __tidycensus__ includes helper functions for calculating derived margins of error based on Census-supplied formulas. These functions include `moe_sum()`, `moe_product()`, `moe_ratio()`, and `moe_prop()`

Example:

```r
moe_prop(25, 100, 5, 3)
```

```
## [1] 0.0494343
```

---

## Calculating derived MOEs tidyverse-style

* `moe_sum()` is designed to work within a `group_by() %>% summarize()` pipeline, calculating derived margins of error along with rolled-up estimates

```r
salt_lake_grouped <- salt_lake %>%
  mutate(sex = case_when(
    str_sub(variable, start = -2) < "26" ~ "Male",
    TRUE ~ "Female"
  )) %>%
  group_by(GEOID, sex) %>%
  summarize(sum_est = sum(estimate), 
            sum_moe = moe_sum(moe, estimate))
```

---

```r
salt_lake_grouped
```

```
## # A tibble: 424 × 4
## # Groups:   GEOID [212]
##    GEOID       sex    sum_est sum_moe
##    <chr>       <chr>    <dbl>   <dbl>
##  1 49035100100 Female      55    30.9
##  2 49035100100 Male        83    39.2
##  3 49035100200 Female     167    57.5
##  4 49035100200 Male       153    50.9
##  5 49035100306 Female     273   109. 
##  6 49035100306 Male       225    90.3
##  7 49035100307 Female     188    70.2
##  8 49035100307 Male       117    64.5
##  9 49035100308 Female     164    98.7
## 10 49035100308 Male       129    77.9
## # … with 414 more rows
```

---

## A note from Census on derived MOEs

* A note of caution [from the US Census Bureau](https://www2.census.gov/programs-surveys/acs/tech_docs/statistical_testing/2019_Instructions_for_Stat_Testing_ACS.pdf?):

> All [derived MOE methods] are approximations and users should be cautious in using them. This is because these methods do not consider the correlation or covariance between the basic estimates. They may be overestimates or underestimates of the derived estimate’s standard error depending on whether the two basic estimates are highly correlated in either the positive or negative direction. As a result, the approximated standard error may not match direct calculations of standard errors or calculations obtained through other methods.
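For reference, the proportion formula behind `moe_prop()` can be reproduced by hand; this sketch re-computes the `moe_prop(25, 100, 5, 3)` example in base R:

```r
# Census approximation: MOE_prop = sqrt(MOE_num^2 - p^2 * MOE_denom^2) / denom
num <- 25
denom <- 100
moe_num <- 5
moe_denom <- 3

p <- num / denom
moe_prop_manual <- sqrt(moe_num^2 - p^2 * moe_denom^2) / denom

moe_prop_manual  # 0.0494343, matching moe_prop(25, 100, 5, 3)
```

When the term under the square root is negative, the Census Bureau recommends falling back to the ratio formula (as in `moe_ratio()`) instead.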
---

class: middle, center, inverse

## Time-series analysis with the ACS

---

## Time-series and the ACS: some notes

* 5-year ACS data represent _overlapping survey years_, meaning that direct comparison between overlapping datasets is not typically recommended
* The Census Bureau recommends comparing non-overlapping years in the 5-year ACS
* For 2016-2020 data, available 5-year intervals for comparison are 2006-2010 and 2011-2015

---

## Getting data for multiple years

* The `map_dfr()` function in the __purrr__ package (a member of the tidyverse) is a great way to iterate over a sequence of years and return the combined result

```r
years <- c(2010, 2015, 2020)

nebraska_series <- map_dfr(years, function(year) {
  get_acs(
    geography = "county",
    state = "NE",
    variables = "B01002_001",
    year = year
  ) %>%
    mutate(year = year)
}) %>%
  arrange(NAME)
```

---

```r
nebraska_series
```

```
## # A tibble: 279 × 6
##    GEOID NAME                      variable   estimate   moe  year
##    <chr> <chr>                     <chr>         <dbl> <dbl> <dbl>
##  1 31001 Adams County, Nebraska    B01002_001     37.5   0.5  2010
##  2 31001 Adams County, Nebraska    B01002_001     37.8   0.7  2015
##  3 31001 Adams County, Nebraska    B01002_001     38     0.7  2020
##  4 31003 Antelope County, Nebraska B01002_001     45.2   0.2  2010
##  5 31003 Antelope County, Nebraska B01002_001     46.5   0.4  2015
##  6 31003 Antelope County, Nebraska B01002_001     44.8   0.2  2020
##  7 31005 Arthur County, Nebraska   B01002_001     39.4   5.1  2010
##  8 31005 Arthur County, Nebraska   B01002_001     39.9   6.1  2015
##  9 31005 Arthur County, Nebraska   B01002_001     48     5.6  2020
## 10 31007 Banner County, Nebraska   B01002_001     46.4   4.5  2010
## # … with 269 more rows
```

---

## Caution: inconsistent variable names across years

* The ACS Data Profile and Subject Tables are not guaranteed to have the same variable names across years, and not all Detailed Tables variables are available in every year
* For example: try swapping in `"DP02_0068P"` for the variable in the example above (representing % of the population age 25 and up with a bachelor's degree in 2020)

---

## New feature: comparison profile tables

.pull-left[

* In the development version of __tidycensus__, I _just added_ support for the ACS Comparison Profile
* The Comparison Profile includes data on the current 5-year ACS and the 5-year ACS that ends 5 years prior to help with time-series comparisons

]

.pull-right[

```r
ak_income_compare <- get_acs(
  geography = "county",
  variables = c(
    income15 = "CP03_2015_062",
    income20 = "CP03_2020_062"
  ),
  state = "AK",
  year = 2020
)
```

]

---

```r
ak_income_compare
```

```
## # A tibble: 34 × 4
##    GEOID NAME                                 variable estimate
##    <chr> <chr>                                <chr>       <dbl>
##  1 02016 Aleutians West Census Area, Alaska   income15    92500
##  2 02016 Aleutians West Census Area, Alaska   income20    87443
##  3 02020 Anchorage Municipality, Alaska       income15    85534
##  4 02020 Anchorage Municipality, Alaska       income20    84813
##  5 02050 Bethel Census Area, Alaska           income15    55692
##  6 02050 Bethel Census Area, Alaska           income20    54400
##  7 02090 Fairbanks North Star Borough, Alaska income15    77590
##  8 02090 Fairbanks North Star Borough, Alaska income20    76464
##  9 02110 Juneau City and Borough, Alaska      income15    93836
## 10 02110 Juneau City and Borough, Alaska      income20    88077
## # … with 24 more rows
```

---

## Geography and making comparisons

* Data in the Comparison Profile tables are only available down to the county level ([though counties can change from ACS to ACS!](https://walker-data.com/umich-workshop-2022/intro-2020-census/#63))
* Comparing _neighborhood-level_ change across ACS datasets introduces additional challenges with the 2016-2020 data, as Census tract and block group boundaries differ from previous years

---

## New boundaries in the 2016-2020 data

* Locations with significant population growth (e.g. suburban Collin County, Texas) will have high-population Census tracts / block groups that were subdivided for the 2020 Census
* Example: total population by Census tract in Collin County in the 2015-2019 ACS (on the left) and the 2016-2020 ACS (on the right)

---

![](index_files/figure-html/collin-compare-1.png)<!-- -->

---

class: middle, center, inverse

## Discussion: interpolating data between different sets of Census geographies

---

## Data setup

* Let's grab data on the population working from home by Census tract in Maricopa County, AZ
* Use a projected coordinate reference system to speed things up!

```r
library(sf)

wfh_15 <- get_acs(geography = "tract", variables = "B08006_017", year = 2015,
                  state = "AZ", county = "Maricopa", geometry = TRUE) %>%
  st_transform(26950)

wfh_20 <- get_acs(geography = "tract", variables = "B08006_017", year = 2020,
                  state = "AZ", county = "Maricopa", geometry = TRUE) %>%
  st_transform(26950)
```

---

## Method 1: area-weighted interpolation

.pull-left[

* _Interpolating_ data between sets of boundaries involves the use of _weights_ to re-distribute data from one geography to another
* A common method is _area-weighted interpolation_, which allocates information from one geography to another weighted by the area of overlap
* Typically more accurate when going _backward_, as many new tracts will "roll up" within parent tracts from a previous Census (though not always)

]

.pull-right[

```r
library(sf)

wfh_20_to_15 <- wfh_20 %>%
  select(estimate) %>%
  st_interpolate_aw(to = wfh_15, extensive = TRUE)
```

]

---

## Method 1: area-weighted interpolation

* Let's compare the two maps:

```r
library(mapview)
library(leafsync)

m20a <- mapview(wfh_20, zcol = "estimate", layer.name = "2020 geographies")
m15a <- mapview(wfh_20_to_15, zcol = "estimate", layer.name = "2015 geographies")

sync(m20a, m15a)
```

---

## Method 2: population-weighted interpolation

* If you need to go _forward_, area-weighted interpolation may be very inaccurate as it can incorrectly allocate large values to low-density / empty areas
* _Population-weighted interpolation_ will instead use an underlying dataset explaining the population distribution as weights
* Example: population-weighted interpolation using block population weights from the 2020 Census

---

## A function for population-weighted interpolation

.pull-left[

* Population-weighted interpolation is more complicated than would fit in a slide deck, so I've written a function named `interpolate_pw()` found in the companion script `population_weighted_interpolation.R`
* Let's give it a try:

]

.pull-right[

```r
# source("population_weighted_interpolation.R")
library(tigris)
options(tigris_use_cache = TRUE)

maricopa_blocks <- blocks(
  "AZ",
  "Maricopa",
  year = 2020
)

wfh_15_to_20 <- interpolate_pw(
  from = wfh_15,
  to = wfh_20,
  weights = maricopa_blocks,
  weight_column = "POP20",
  crs = 26950,
  extensive = TRUE
)
```

]

---

## Evaluating the result

* Let's compare the two maps:

```r
m15b <- mapview(wfh_15, zcol = "estimate", layer.name = "2015 geographies")
m20b <- mapview(wfh_15_to_20, zcol = "estimate", layer.name = "2020 geographies")

sync(m15b, m20b)
```

---

## Concluding thoughts

* While this workflow uses population-weighted interpolation, [differential privacy in the 2020 Census may introduce some inaccuracies](https://walker-data.com/umich-workshop-2022/intro-2020-census/#17)
* NHGIS uses a methodology that incorporates land cover data for adjusting data between Census years; [read through their methodology here](https://www.nhgis.org/2000-block-data-standardized-2010-geography)

---

class: middle, center, inverse

## Thank you!