Working with the 2022 ACS

Workflows with R and tidycensus

Kyle Walker

2024-02-08

About me

Associate Professor of Geography at TCU
Spatial data science researcher and consultant
Package developer: tidycensus, tigris, mapboxapi, crsuggest, idbr (R), pygris (Python)
Book: Analyzing US Census Data: Methods, Maps and Models in R

SSDAN webinar series

Today: Working with the 2022 American Community Survey with R and tidycensus
Thursday, February 22nd: Analyzing 2020 Decennial US Census Data in R
Thursday, March 7th: Doing “GIS” and making maps with US Census Data in R

Today’s agenda

Hour 1: The American Community Survey, R, and tidycensus
Hour 2: ACS data workflows
Hour 3: An introduction to ACS microdata

The American Community Survey, R, and tidycensus

What is the ACS?

Annual survey of 3.5 million US households
Covers topics not available in decennial US Census data (e.g. income, education, language, housing characteristics)
Available as 1-year estimates (for geographies of population 65,000 and greater) and 5-year estimates (for geographies down to the block group)
Data delivered as estimates characterized by margins of error

How to get ACS data

data.census.gov is the main, revamped interactive data portal for browsing and downloading Census datasets, including the ACS
The US Census Application Programming Interface (API) allows developers to access Census data resources programmatically

tidycensus

R interface to the Decennial Census, American Community Survey, Population Estimates Program, and Public Use Microdata Series APIs
First released in 2017; nearly 500,000 downloads from the Posit CRAN mirror

tidycensus: key features

Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested);
Automatically downloads and merges Census geometries to data for mapping;
Includes tools for handling margins of error in the ACS and working with survey weights in the ACS PUMS;
States and counties can be requested by name (no more looking up FIPS codes!)

R and RStudio

R: programming language and software environment for data analysis (and wherever else your imagination can take you!)
RStudio: integrated development environment (IDE) for R developed by Posit
Posit Cloud: run RStudio with today’s workshop pre-configured at https://posit.cloud/content/7549022

Getting started with tidycensus

To get started, install the packages you’ll need for today’s workshop
If you are using the Posit Cloud environment, these packages are already installed for you

install.packages(c("tidycensus", "tidyverse"))

Optional, to run advanced examples:

install.packages(c("mapview", "survey", "srvyr"))

Optional: your Census API key

tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day
Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.
Once activated, use the census_api_key() function to set your key as an environment variable

library(tidycensus)

census_api_key("YOUR KEY GOES HERE", install = TRUE)

Getting started with ACS data in tidycensus

Using the `get_acs()` function

The get_acs() function is your portal to access ACS data using tidycensus
The two required arguments are geography and variables. As of v1.6, the function defaults to the 2018-2022 5-year ACS

library(tidycensus)

median_value <- get_acs(
  geography = "county",
  variables = "B25077_001",
  year = 2022
)

ACS data are returned with five columns: GEOID, NAME, variable, estimate, and moe

median_value

# A tibble: 3,222 × 5
   GEOID NAME                     variable   estimate   moe
   <chr> <chr>                    <chr>         <dbl> <dbl>
 1 01001 Autauga County, Alabama  B25077_001   191800  7996
 2 01003 Baldwin County, Alabama  B25077_001   266000  6916
 3 01005 Barbour County, Alabama  B25077_001   102700 11171
 4 01007 Bibb County, Alabama     B25077_001   120100 13377
 5 01009 Blount County, Alabama   B25077_001   159800  6189
 6 01011 Bullock County, Alabama  B25077_001    87700 20560
 7 01013 Butler County, Alabama   B25077_001    94800  5984
 8 01015 Calhoun County, Alabama  B25077_001   140500  5181
 9 01017 Chambers County, Alabama B25077_001   116900  9814
10 01019 Cherokee County, Alabama B25077_001   158700  8550
# ℹ 3,212 more rows

1-year ACS data

1-year ACS data are more current, but are only available for geographies of population 65,000 and greater
Access 1-year ACS data with the argument survey = "acs1"; defaults to "acs5"

median_value_1yr <- get_acs(
  geography = "place",
  variables = "B25077_001",
  year = 2022,
  survey = "acs1"
)

median_value_1yr

# A tibble: 646 × 5
   GEOID   NAME                           variable   estimate   moe
   <chr>   <chr>                          <chr>         <dbl> <dbl>
 1 0103076 Auburn city, Alabama           B25077_001   335200 22622
 2 0107000 Birmingham city, Alabama       B25077_001   125500 14964
 3 0121184 Dothan city, Alabama           B25077_001   190800  8133
 4 0135896 Hoover city, Alabama           B25077_001   393400 19743
 5 0137000 Huntsville city, Alabama       B25077_001   294700 16881
 6 0150000 Mobile city, Alabama           B25077_001   178800 11552
 7 0151000 Montgomery city, Alabama       B25077_001   155200 10868
 8 0177256 Tuscaloosa city, Alabama       B25077_001   297600 30475
 9 0203000 Anchorage municipality, Alaska B25077_001   367900 10111
10 0404720 Avondale city, Arizona         B25077_001   400300 22495
# ℹ 636 more rows

Requesting tables of variables

The table parameter can be used to obtain all related variables in a “table” at once

income_table <- get_acs(
  geography = "county", 
  table = "B19001", 
  year = 2022
)

income_table

# A tibble: 54,774 × 5
   GEOID NAME                    variable   estimate   moe
   <chr> <chr>                   <chr>         <dbl> <dbl>
 1 01001 Autauga County, Alabama B19001_001    22308   369
 2 01001 Autauga County, Alabama B19001_002      990   265
 3 01001 Autauga County, Alabama B19001_003      656   187
 4 01001 Autauga County, Alabama B19001_004     1026   303
 5 01001 Autauga County, Alabama B19001_005     1335   329
 6 01001 Autauga County, Alabama B19001_006      741   205
 7 01001 Autauga County, Alabama B19001_007      822   218
 8 01001 Autauga County, Alabama B19001_008      840   270
 9 01001 Autauga County, Alabama B19001_009      921   260
10 01001 Autauga County, Alabama B19001_010      962   279
# ℹ 54,764 more rows

Understanding geography and variables in tidycensus

US Census Geography

Geography in tidycensus

Information on available geographies, and how to specify them, can be found in the tidycensus documentation

Querying by state

For geographies available below the state level, the state parameter allows you to query data for a specific state
For smaller geographies (Census tracts, block groups), a county can also be requested
tidycensus translates state names and postal abbreviations internally, so you don’t need to remember the FIPS codes!
Example: data on median home value in San Diego County, California by Census tract

sd_value <- get_acs(
  geography = "tract", 
  variables = "B25077_001", 
  state = "CA", 
  county = "San Diego",
  year = 2022
)

sd_value

# A tibble: 737 × 5
   GEOID       NAME                                     variable estimate    moe
   <chr>       <chr>                                    <chr>       <dbl>  <dbl>
 1 06073000100 Census Tract 1; San Diego County; Calif… B25077_…  1633800  71171
 2 06073000201 Census Tract 2.01; San Diego County; Ca… B25077_…  1331000 147432
 3 06073000202 Census Tract 2.02; San Diego County; Ca… B25077_…   891100  97240
 4 06073000301 Census Tract 3.01; San Diego County; Ca… B25077_…   957500 232555
 5 06073000302 Census Tract 3.02; San Diego County; Ca… B25077_…   761700 108681
 6 06073000400 Census Tract 4; San Diego County; Calif… B25077_…   799100  94490
 7 06073000500 Census Tract 5; San Diego County; Calif… B25077_…  1025000  81768
 8 06073000600 Census Tract 6; San Diego County; Calif… B25077_…   727700  92078
 9 06073000700 Census Tract 7; San Diego County; Calif… B25077_…   736400 102788
10 06073000800 Census Tract 8; San Diego County; Calif… B25077_…   678400 119751
# ℹ 727 more rows

Searching for variables

To search for variables, use the load_variables() function along with a year and dataset
The View() function in RStudio allows for interactive browsing and filtering

vars <- load_variables(2022, "acs5")

View(vars)

Available ACS datasets in tidycensus

Detailed Tables
Data Profile (add "/profile" for variable lookup)
Subject Tables (add "/subject")
Comparison Profile (add "/cprofile")
Supplemental Estimates (use "acsse")
Migration Flows (access with get_flows())

“Tidy” or long-form data

The default data structure returned by tidycensus is “tidy” or long-form data, with variables by geography stacked by row

age_sex_table <- get_acs(
  geography = "state", 
  table = "B01001", 
  year = 2022,
  survey = "acs1",
)

age_sex_table

# A tibble: 2,548 × 5
   GEOID NAME    variable   estimate   moe
   <chr> <chr>   <chr>         <dbl> <dbl>
 1 01    Alabama B01001_001  5074296    NA
 2 01    Alabama B01001_002  2461248  6178
 3 01    Alabama B01001_003   146169  3134
 4 01    Alabama B01001_004   158767  6029
 5 01    Alabama B01001_005   164578  5689
 6 01    Alabama B01001_006    97834  3029
 7 01    Alabama B01001_007    70450  2897
 8 01    Alabama B01001_008    42597  4156
 9 01    Alabama B01001_009    34623  3440
10 01    Alabama B01001_010    97373  4627
# ℹ 2,538 more rows

“Wide” data

The argument output = "wide" spreads Census variables across the columns, returning one row per geographic unit and one column per variable

age_sex_table_wide <- get_acs(
  geography = "state", 
  table = "B01001", 
  year = 2022,
  survey = "acs1",
  output = "wide" 
)

age_sex_table_wide

# A tibble: 52 × 100
   GEOID NAME        B01001_001E B01001_001M B01001_002E B01001_002M B01001_003E
   <chr> <chr>             <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
 1 01    Alabama         5074296          NA     2461248        6178      146169
 2 02    Alaska           733583          NA      385667        2351       23043
 3 04    Arizona         7359197          NA     3678381        2695      201423
 4 05    Arkansas        3045637          NA     1504488        4216       90239
 5 06    California     39029342          NA    19536425        6410     1081904
 6 08    Colorado        5839926          NA     2960896        4278      154565
 7 09    Connecticut     3626205          NA     1776689        2237       91513
 8 10    Delaware        1018396          NA      494657        1092       27456
 9 11    District o…      671803          NA      319763         733       20038
10 12    Florida        22244823          NA    10953468        6169      563703
# ℹ 42 more rows
# ℹ 93 more variables: B01001_003M <dbl>, B01001_004E <dbl>, B01001_004M <dbl>,
#   B01001_005E <dbl>, B01001_005M <dbl>, B01001_006E <dbl>, B01001_006M <dbl>,
#   B01001_007E <dbl>, B01001_007M <dbl>, B01001_008E <dbl>, B01001_008M <dbl>,
#   B01001_009E <dbl>, B01001_009M <dbl>, B01001_010E <dbl>, B01001_010M <dbl>,
#   B01001_011E <dbl>, B01001_011M <dbl>, B01001_012E <dbl>, B01001_012M <dbl>,
#   B01001_013E <dbl>, B01001_013M <dbl>, B01001_014E <dbl>, …

Using named vectors of variables

Census variables can be hard to remember; using a named vector to request variables will replace the Census IDs with a custom input
In long form, these custom inputs will populate the variable column; in wide form, they will replace the column names

ca_education <- get_acs(
  geography = "county",
  state = "CA",
  variables = c(percent_high_school = "DP02_0062P", 
                percent_bachelors = "DP02_0065P",
                percent_graduate = "DP02_0066P"), 
  year = 2021
)

ca_education

# A tibble: 174 × 5
   GEOID NAME                       variable            estimate   moe
   <chr> <chr>                      <chr>                  <dbl> <dbl>
 1 06001 Alameda County, California percent_high_school     16.7   0.4
 2 06001 Alameda County, California percent_bachelors       28.3   0.3
 3 06001 Alameda County, California percent_graduate        21.3   0.3
 4 06003 Alpine County, California  percent_high_school     25.7   7.5
 5 06003 Alpine County, California  percent_bachelors       20.6   7.5
 6 06003 Alpine County, California  percent_graduate        18.7   8.5
 7 06005 Amador County, California  percent_high_school     30.7   2.2
 8 06005 Amador County, California  percent_bachelors       13.6   1.8
 9 06005 Amador County, California  percent_graduate         5.9   1.1
10 06007 Butte County, California   percent_high_school     22.3   0.9
# ℹ 164 more rows

Part 1 exercises

Use the load_variables() function to find a variable that interests you that we haven’t used yet.
Use get_acs() to fetch data on that variable from the ACS for counties, similar to how we did for median household income.

Part 2: ACS data workflows

Understanding limitations of the 1-year ACS

The 1-year American Community Survey is only available for geographies with population 65,000 and greater. This means:

Only 848 of 3,221 counties are available
Only 646 of 31,908 cities / Census-designated places are available
No data for Census tracts, block groups, ZCTAs, or any other geographies that typically have populations below 65,000

Data sparsity and margins of error

You may encounter data issues in the 1-year ACS data that are less pronounced in the 5-year ACS. For example:

Values available in the 5-year ACS may not be available in the corresponding 1-year ACS tables
If available, they will likely have larger margins of error
Your job as an analyst: balance need for certainty vs. need for recency in estimates

Example: Punjabi speakers by state (1-year ACS)

get_acs(
  geography = "state",
  variables = "B16001_054",
  year = 2022,
  survey = "acs1"
)

# A tibble: 52 × 5
   GEOID NAME                 variable   estimate   moe
   <chr> <chr>                <chr>         <dbl> <dbl>
 1 01    Alabama              B16001_054      666   556
 2 02    Alaska               B16001_054       NA    NA
 3 04    Arizona              B16001_054     1906  1342
 4 05    Arkansas             B16001_054       NA    NA
 5 06    California           B16001_054   154917 14153
 6 08    Colorado             B16001_054     1643  1968
 7 09    Connecticut          B16001_054     4039  2965
 8 10    Delaware             B16001_054        0   203
 9 11    District of Columbia B16001_054       NA    NA
10 12    Florida              B16001_054     3311  1969
# ℹ 42 more rows

Punjabi speakers by state (5-year ACS)

get_acs(
  geography = "state",
  variables = "B16001_054",
  year = 2022,
  survey = "acs5"
)

# A tibble: 52 × 5
   GEOID NAME                 variable   estimate   moe
   <chr> <chr>                <chr>         <dbl> <dbl>
 1 01    Alabama              B16001_054      493   256
 2 02    Alaska               B16001_054       24    39
 3 04    Arizona              B16001_054     2894   626
 4 05    Arkansas             B16001_054      641   308
 5 06    California           B16001_054   146406  5724
 6 08    Colorado             B16001_054     1358   472
 7 09    Connecticut          B16001_054     2093   715
 8 10    Delaware             B16001_054        7    12
 9 11    District of Columbia B16001_054       22    24
10 12    Florida              B16001_054     2466   636
# ℹ 42 more rows

Visualizing ACS data

Visualizing ACS estimates

As opposed to decennial US Census data, ACS estimates include information on uncertainty, represented by the margin of error in the moe column
This means that in some cases, visualization of estimates without reference to the margin of error can be misleading
Walkthrough: building a margin of error visualization with ggplot2

Visualizing ACS estimates

Let’s get some data on median household income by county in Utah

utah_income <- get_acs(
  geography = "county",
  variables = "B19013_001",
  state = "UT",
  year = 2022
)

A basic plot

To visualize a dataset with ggplot2, we define an aesthetic and a geom

library(ggplot2)

utah_plot <- ggplot(utah_income, aes(x = estimate, y = NAME)) + 
  geom_point()

utah_plot

Problems with our basic plot

The data are not sorted by value, making comparisons difficult
The axis and tick labels are not intuitive
The Y-axis labels contain repetitive information (” County, Utah”)
We’ve made no attempt to customize the styling

Sorting by value

We use reorder() to sort counties by the value of their ACS estimates, improving legibility

utah_plot <- ggplot(utah_income, aes(x = estimate, 
                                y = reorder(NAME, estimate))) + 
  geom_point(color = "darkblue", size = 2)

Cleaning up tick labels

Using a combination of functions in the scales package and custom-defined functions, tick labels can be formatted any way you want

library(scales)
library(stringr)

utah_plot <- utah_plot + 
  scale_x_continuous(labels = label_dollar()) + 
  scale_y_discrete(labels = function(x) str_remove(x, " County, Utah"))

Improving formatting and theming

Use labs() to label the plot and its axes, and change the theme to one of several built-in options

utah_plot <- utah_plot + 
  labs(title = "Median household income, 2018-2022 ACS",
       subtitle = "Counties in Utah",
       caption = "Data acquired with R and tidycensus",
       x = "ACS estimate",
       y = "") + 
  theme_minimal(base_size = 12)

Problem: comparing ACS estimates

The chart suggests that Juab County has lower income than Salt Lake County but its margin of error is quite large

View(utah_income)

How to visualize uncertainty in an intuitive way?

Visualizing margins of error

utah_plot_errorbar <- ggplot(utah_income, aes(x = estimate, 
                                        y = reorder(NAME, estimate))) + 
  geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe), #<<
                width = 0.5, linewidth = 0.5) + #<<
  geom_point(color = "darkblue", size = 2) + 
  scale_x_continuous(labels = label_dollar()) + 
  scale_y_discrete(labels = function(x) str_remove(x, " County, Utah")) + 
  labs(title = "Median household income, 2018-2022 ACS",
       subtitle = "Counties in Utah",
       caption = "Data acquired with R and tidycensus. Error bars represent margin of error around estimates.",
       x = "ACS estimate",
       y = "") + 
  theme_minimal(base_size = 12)

utah_plot_errorbar

Exploring maps of ACS data

“Spatial” ACS data

One of the best features of tidycensus is the argument geometry = TRUE, which gets you the correct Census geometries with no hassle
get_acs() with geometry = TRUE returns a spatial Census dataset containing simple feature geometries; learn more on March 7
Let’s take a look at some examples

“Spatial” ACS data

geometry = TRUE does the hard work for you of acquiring and pre-joining spatial Census data

cook_education <- get_acs(
  geography = "tract",
  variables = "DP02_0068P",
  state = "IL",
  county = "Cook",
  year = 2022,
  geometry = TRUE
)

We get back a simple features data frame (more about this on March 7)

cook_education

Simple feature collection with 1332 features and 5 fields (with 1 geometry empty)
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -88.26364 ymin: 41.46971 xmax: -87.52416 ymax: 42.15426
Geodetic CRS:  NAD83
First 10 features:
         GEOID                                        NAME   variable estimate
1  17031822400    Census Tract 8224; Cook County; Illinois DP02_0068P     16.4
2  17031740100    Census Tract 7401; Cook County; Illinois DP02_0068P     40.9
3  17031828100    Census Tract 8281; Cook County; Illinois DP02_0068P     19.4
4  17031826600    Census Tract 8266; Cook County; Illinois DP02_0068P     17.1
5  17031720500    Census Tract 7205; Cook County; Illinois DP02_0068P     58.5
6  17031750300    Census Tract 7503; Cook County; Illinois DP02_0068P     64.5
7  17031826500    Census Tract 8265; Cook County; Illinois DP02_0068P     20.1
8  17031825504 Census Tract 8255.04; Cook County; Illinois DP02_0068P     36.4
9  17031827100    Census Tract 8271; Cook County; Illinois DP02_0068P     14.2
10 17031824900    Census Tract 8249; Cook County; Illinois DP02_0068P     11.3
    moe                       geometry
1   5.9 MULTIPOLYGON (((-87.79876 4...
2   8.5 MULTIPOLYGON (((-87.70137 4...
3   6.4 MULTIPOLYGON (((-87.549 41....
4   4.4 MULTIPOLYGON (((-87.63033 4...
5   7.6 MULTIPOLYGON (((-87.69645 4...
6   7.9 MULTIPOLYGON (((-87.6912 41...
7   6.8 MULTIPOLYGON (((-87.63427 4...
8  11.9 MULTIPOLYGON (((-87.71377 4...
9   6.4 MULTIPOLYGON (((-87.656 41....
10  5.0 MULTIPOLYGON (((-87.71712 4...

Exploring spatial data

Mapping, GIS, and spatial data is the subject of our March 7 workshop - so be sure to check that out!
Even before we dive deeper into spatial data, it is very useful to be able to explore your results on an interactive map
Our solution: mapview()

Exploring spatial data

library(mapview)

mapview(cook_education)

Creating a shaded map with `zcol`

mapview(cook_education, zcol = "estimate")

What about mapping 1-year ACS data?

Typically it is difficult to map 1-year ACS data below the state level as your data will have gaps due to the population restrictions

Example: “mapping” 1-year ACS data

tx_education <- get_acs(
  geography = "county",
  variables = "DP02_0068P",
  state = "TX",
  year = 2022,
  survey = "acs1",
  geometry = TRUE
)

Example: “mapping” 1-year ACS data

mapview(tx_education, zcol = "estimate")

Mapping small(er) areas with PUMAs

Consider using Public Use Microdata Areas (PUMAs) for geographically-consistent substate mapping
PUMAs are typically used for microdata geography; however, I find them quite useful to approximate real state submarkets, planning areas, etc.

wa_wfh <- get_acs(
  geography = "puma",
  variables = "DP03_0024P",
  state = "WA",
  survey = "acs1",
  year = 2022,
  geometry = TRUE
)

library(mapview)

mapview(wa_wfh, zcol = "estimate")

Bonus: new Connecticut county-equivalents

The 2022 ACS is the first to include the new Connecticut Planning Regions in the “county” geography

ct_income <- get_acs(
  geography = "county",
  variables = "B19013_001",
  state = "CT",
  year = 2022,
  survey = "acs1",
  geometry = TRUE
)

mapview(ct_income, zcol = "estimate")

Time-series analysis with the 1-year ACS: some notes

Variables in the Data Profile and Subject Tables can change names over time
You’ll need to watch out for the Connecticut issue and changing geographies
The 2020 1-year ACS was not released (and is not in tidycensus), so your time-series can break if you are using iteration to pull data

Part 2 exercises

Swap in a variable from Part 1, "B25077_001" (median home value) for the analysis in this section, and try the following:

For a state of your choosing, how do margins of error differ among counties for median home values in the 1-year and 5-year ACS?
Can you visualize trends in median home value for a county of your choosing using mapview()?

Part 3: Working with ACS microdata

What is “microdata?”

Microdata: individual-level survey responses made available to researchers
The ACS Public Use Microdata Series (PUMS) allows for detailed cross-tabulations not available in aggregated data
The 1-year PUMS covers about 1 percent of the US population; the 5-year PUMS covers about 5 percent (so, not the full ACS)
Data downloads available in bulk from the Census FTP server or from data.census.gov’s MDAT tool
Other resource for cleaned, time-series microdata: IPUMS

Using microdata in tidycensus

Basic usage of `get_pums()`

get_pums() requires specifying one or more variables and the state for which you’d like to request data. state = 'all' can get data for the entire USA, but it takes a while!
The function defaults to the 5-year ACS with survey = "acs5"; 1-year ACS data is available with survey = "acs1".
The default year is 2022 in the latest version of tidycensus; data are available back to 2005 (1-year ACS) and 2005-2009 (5-year ACS). 2020 1-year data are not available.

Basic usage of `get_pums()`

library(tidycensus)

or_pums <- get_pums(
  variables = c("SEX", "AGEP", "HHT"),
  state = "OR",
  survey = "acs1",
  year = 2022
)

or_pums

# A tibble: 43,708 × 8
   SERIALNO      SPORDER  WGTP PWGTP  AGEP ST    HHT   SEX  
   <chr>           <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
 1 2022GQ0000103       1     0    15    43 41    b     1    
 2 2022GQ0000264       1     0    11    29 41    b     1    
 3 2022GQ0000368       1     0   115    79 41    b     2    
 4 2022GQ0000407       1     0     2    20 41    b     2    
 5 2022GQ0000504       1     0    18    59 41    b     1    
 6 2022GQ0000523       1     0    29    60 41    b     1    
 7 2022GQ0000539       1     0    73    27 41    b     1    
 8 2022GQ0000570       1     0    73    90 41    b     2    
 9 2022GQ0000718       1     0    14    59 41    b     1    
10 2022GQ0000776       1     0     4    85 41    b     1    
# ℹ 43,698 more rows

Understanding default data from `get_pums()`

get_pums() returns some technical variables by default without the user needing to request them specifically. These include:

SERIALNO: a serial number that uniquely identifies households in the sample;
SPORDER: the order of the person in the household; when combined with SERIALNO, uniquely identifies a person;
WGTP: the household weight;
PWGTP: the person weight

Weights and ACS microdata

Given that PUMS data are a sample of the US population, the weights columns must be used for analysis

library(tidyverse)

or_age_40 <- filter(or_pums, AGEP == 40)

print(sum(or_pums$PWGTP))

[1] 4240137

print(sum(or_age_40$PWGTP))

[1] 59948

Are these estimates accurate?

PUMS weights are calibrated to population and household totals, so larger tabulations should align with published estimates

get_acs("state", "B01003_001", state = "OR", survey = "acs1", year = 2022)

# A tibble: 1 × 5
  GEOID NAME   variable   estimate   moe
  <chr> <chr>  <chr>         <dbl> <dbl>
1 41    Oregon B01003_001  4240137    NA

Smaller tabulations will be characterized by more uncertainty, and may deviate from published estimates

Variables available in the ACS PUMS

View(pums_variables)

The pums_variables dataset is your one-stop shop for browsing variables in the ACS PUMS
It is a long-form dataset that organizes specific value codes by variable so you know what you can get. You’ll use information in the var_code column to fetch variables, but pay attention to the var_label, val_code, val_label, and data_type columns

Recoding PUMS variables

The recode = TRUE argument in get_pums() appends recoded columns to your returned dataset based on information available in pums_variables

or_pums_recoded <- get_pums(
  variables = c("SEX", "AGEP", "HHT"),
  state = "OR",
  survey = "acs1",
  year = 2022,
  recode = TRUE
)

or_pums_recoded

# A tibble: 43,708 × 11
   SERIALNO      SPORDER  WGTP PWGTP  AGEP ST    HHT   SEX   ST_label  HHT_label
   <chr>           <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <ord>     <ord>    
 1 2022GQ0000103       1     0    15    43 41    b     1     Oregon/OR N/A (GQ/…
 2 2022GQ0000264       1     0    11    29 41    b     1     Oregon/OR N/A (GQ/…
 3 2022GQ0000368       1     0   115    79 41    b     2     Oregon/OR N/A (GQ/…
 4 2022GQ0000407       1     0     2    20 41    b     2     Oregon/OR N/A (GQ/…
 5 2022GQ0000504       1     0    18    59 41    b     1     Oregon/OR N/A (GQ/…
 6 2022GQ0000523       1     0    29    60 41    b     1     Oregon/OR N/A (GQ/…
 7 2022GQ0000539       1     0    73    27 41    b     1     Oregon/OR N/A (GQ/…
 8 2022GQ0000570       1     0    73    90 41    b     2     Oregon/OR N/A (GQ/…
 9 2022GQ0000718       1     0    14    59 41    b     1     Oregon/OR N/A (GQ/…
10 2022GQ0000776       1     0     4    85 41    b     1     Oregon/OR N/A (GQ/…
# ℹ 43,698 more rows
# ℹ 1 more variable: SEX_label <ord>

Using variables filters

PUMS datasets - especially from the 5-year ACS - can get quite large. The variables_filter argument can return a subset of data from the API, reducing long download times
variables_filter is specified as a named list where the name represents the PUMS variable and the value represents a vector of values you are requesting from the API

Using variables filters

or_pums_filtered <- get_pums(
  variables = c("SEX", "AGEP", "HHT"),
  state = "OR",
  survey = "acs5",
  variables_filter = list(
    SEX = 2,
    AGEP = 30:49
  ),
  year = 2022
)

or_pums_filtered

# A tibble: 25,703 × 8
   SERIALNO      SPORDER  WGTP PWGTP  AGEP ST    HHT   SEX  
   <chr>           <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
 1 2018GQ0003965       1     0     8    32 41    b     2    
 2 2018GQ0004720       1     0     6    46 41    b     2    
 3 2018GQ0004942       1     0     5    42 41    b     2    
 4 2018GQ0006933       1     0     3    30 41    b     2    
 5 2018GQ0006965       1     0    23    32 41    b     2    
 6 2018GQ0007480       1     0    11    40 41    b     2    
 7 2018GQ0007813       1     0    19    34 41    b     2    
 8 2018GQ0010876       1     0    28    41 41    b     2    
 9 2018GQ0012843       1     0    14    37 41    b     2    
10 2018GQ0018588       1     0    67    32 41    b     2    
# ℹ 25,693 more rows

Working with PUMAs in PUMS data

In the previous hour, you were introduced to PUMAs
Public Use Microdata Areas (PUMAs) are the smallest available geographies at which records are identifiable in the PUMS datasets
PUMAs are redrawn with each decennial US Census, and typically are home to 100,000-200,000 people. The 2022 ACS is the first that aligns with the new 2020 PUMAs

Working with PUMAs in PUMS data

To get PUMA information in your output data, use the variable code PUMA

or_age_by_puma <- get_pums(
  variables = c("PUMA", "AGEP"),
  state = "OR",
  survey = "acs1",
  year = 2022
)

or_age_by_puma

# A tibble: 43,708 × 7
   SERIALNO      SPORDER  WGTP PWGTP  AGEP PUMA  ST   
   <chr>           <dbl> <dbl> <dbl> <dbl> <chr> <chr>
 1 2022HU0427155       1   124   124    55 04703 41   
 2 2022HU0427210       1   117   117    57 06721 41   
 3 2022HU0427210       2   117   128    56 06721 41   
 4 2022HU0427249       1    56    55    72 06722 41   
 5 2022HU0427249       2    56    57    69 06722 41   
 6 2022HU0427278       1    52    52    69 00503 41   
 7 2022HU0427278       2    52    57    77 00503 41   
 8 2022HU0427304       1   193   194    64 03904 41   
 9 2022HU0427314       1    57    57    73 05114 41   
10 2022HU0427314       2    57    59    71 05114 41   
# ℹ 43,698 more rows

Discussion: problems with PUMAs in the 2018-2022 5-year PUMS

Handling uncertainty in tabulated PUMS estimates

Uncertainty in PUMS data

PUMS data represent a smaller sample than the regular ACS, so understanding error around tabulated estimates is critical
The Census Bureau recommends using successive difference replication to calculate standard errors, and provides replicate weights to do this
tidycensus includes tools to help you get replicate weights and format your data for appropriate survey-weighted analysis

Getting replicate weights

We can acquire either housing or person replicate weights with the rep_weights argument

or_pums_replicate <- get_pums(
  variables = c("AGEP", "PUMA"),
  state = "OR",
  survey = "acs1",
  year = 2022,
  rep_weights = "person" 
)

or_pums_replicate

# A tibble: 43,708 × 87
   SERIALNO    SPORDER  AGEP PUMA  ST     WGTP PWGTP PWGTP1 PWGTP2 PWGTP3 PWGTP4
   <chr>         <dbl> <dbl> <chr> <chr> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 2022GQ0000…       1    43 00503 41        0    15     16     16     17     18
 2 2022GQ0000…       1    29 04704 41        0    11     25     13      2     10
 3 2022GQ0000…       1    79 03904 41        0   115    115    121    115    124
 4 2022GQ0000…       1    20 06724 41        0     2      2      2      3      3
 5 2022GQ0000…       1    59 01900 41        0    18     19     20     17     17
 6 2022GQ0000…       1    60 03905 41        0    29     28     30     30     26
 7 2022GQ0000…       1    27 04705 41        0    73     71     71     71     48
 8 2022GQ0000…       1    90 03904 41        0    73     76     73     74     73
 9 2022GQ0000…       1    59 09100 41        0    14     13     14     14     16
10 2022GQ0000…       1    85 05116 41        0     4      4      2      3      3
# ℹ 43,698 more rows
# ℹ 76 more variables: PWGTP5 <dbl>, PWGTP6 <dbl>, PWGTP7 <dbl>, PWGTP8 <dbl>,
#   PWGTP9 <dbl>, PWGTP10 <dbl>, PWGTP11 <dbl>, PWGTP12 <dbl>, PWGTP13 <dbl>,
#   PWGTP14 <dbl>, PWGTP15 <dbl>, PWGTP16 <dbl>, PWGTP17 <dbl>, PWGTP18 <dbl>,
#   PWGTP19 <dbl>, PWGTP20 <dbl>, PWGTP21 <dbl>, PWGTP22 <dbl>, PWGTP23 <dbl>,
#   PWGTP24 <dbl>, PWGTP25 <dbl>, PWGTP26 <dbl>, PWGTP27 <dbl>, PWGTP28 <dbl>,
#   PWGTP29 <dbl>, PWGTP30 <dbl>, PWGTP31 <dbl>, PWGTP32 <dbl>, …

Handling complex survey samples

tidycensus links to the survey and srvyr packages for managing PUMS data as complex survey samples
The to_survey() function will format your data with replicate weights for correct survey-weighted estimation

or_survey <- to_survey(
  or_pums_replicate,
  type = "person"
)

class(or_survey)

[1] "tbl_svy"       "svyrep.design"

Survey-weighted tabulations

srvyr conveniently links R’s survey infrastructure to familiar tidyverse-style workflows
Standard errors can be multiplied by 1.645 to get familiar 90% confidence level margins of error

library(srvyr)

or_survey %>%
  filter(AGEP == 40) %>%
  survey_count() %>%
  mutate(n_moe = n_se * 1.645)

# A tibble: 1 × 3
      n  n_se n_moe
  <dbl> <dbl> <dbl>
1 59948 3080. 5066.

Group-wise survey data analysis

A group-wise tidyverse workflow can be applied correctly by srvyr for the calculation of medians and other summary statistics

or_survey %>%
  group_by(PUMA) %>%
  summarize(median_age = survey_median(AGEP)) %>%
  mutate(median_age_moe = median_age_se * 1.645)

# A tibble: 35 × 4
   PUMA  median_age median_age_se median_age_moe
   <chr>      <dbl>         <dbl>          <dbl>
 1 00301         33         0.754          1.24 
 2 00501         40         1.51           2.48 
 3 00502         41         1.26           2.07 
 4 00503         43         1.51           2.48 
 5 00504         43         1.26           2.07 
 6 01701         41         0.754          1.24 
 7 01702         48         1.51           2.48 
 8 01900         46         0.502          0.826
 9 02901         40         1.26           2.07 
10 02902         44         1.76           2.89 
# ℹ 25 more rows

Checking our answers

Tabulated median ages are not identical to published estimates, but are very close
Use published estimates if available; use PUMS data to generate estimates that aren’t available in the published tables

or_age_puma <- get_acs(
  geography = "puma",
  variables = "B01002_001",
  state = "OR",
  year = 2022,
  survey = "acs1"
)

Working with the 2022 ACS

About me

SSDAN webinar series

Today’s agenda

The American Community Survey, R, and tidycensus

What is the ACS?

How to get ACS data

tidycensus

tidycensus: key features

R and RStudio

Getting started with tidycensus

Optional: your Census API key

Getting started with ACS data in tidycensus

Using the get_acs() function

1-year ACS data

Requesting tables of variables

Understanding geography and variables in tidycensus

US Census Geography

Geography in tidycensus

Querying by state

Searching for variables

Available ACS datasets in tidycensus

“Tidy” or long-form data

“Wide” data

Using named vectors of variables

Part 1 exercises

Part 2: ACS data workflows

Understanding limitations of the 1-year ACS

Data sparsity and margins of error

Example: Punjabi speakers by state (1-year ACS)

Punjabi speakers by state (5-year ACS)

Visualizing ACS data

Visualizing ACS estimates

Visualizing ACS estimates

A basic plot

Problems with our basic plot

Sorting by value

Cleaning up tick labels

Improving formatting and theming

Problem: comparing ACS estimates

Visualizing margins of error

Exploring maps of ACS data

“Spatial” ACS data

“Spatial” ACS data

Exploring spatial data

Exploring spatial data

Creating a shaded map with zcol

What about mapping 1-year ACS data?

Example: “mapping” 1-year ACS data

Example: “mapping” 1-year ACS data

Mapping small(er) areas with PUMAs

Bonus: new Connecticut county-equivalents

Time-series analysis with the 1-year ACS: some notes

Part 2 exercises

Part 3: Working with ACS microdata

What is “microdata?”

Using microdata in tidycensus

Basic usage of get_pums()

Basic usage of get_pums()

Understanding default data from get_pums()

Weights and ACS microdata

Are these estimates accurate?

Variables available in the ACS PUMS

Recoding PUMS variables

Using variables filters

Using variables filters

Working with PUMAs in PUMS data

Working with PUMAs in PUMS data

Discussion: problems with PUMAs in the 2018-2022 5-year PUMS

Handling uncertainty in tabulated PUMS estimates

Uncertainty in PUMS data

Getting replicate weights

Handling complex survey samples

Survey-weighted tabulations

Group-wise survey data analysis

Checking our answers

Thank you!

Using the `get_acs()` function

Creating a shaded map with `zcol`

Basic usage of `get_pums()`

Basic usage of `get_pums()`

Understanding default data from `get_pums()`