Working with the 2022 ACS

Workflows with R and tidycensus

Kyle Walker

2024-02-08

About me

SSDAN webinar series

  • Today: Working with the 2022 American Community Survey with R and tidycensus

  • Thursday, February 22nd: Analyzing 2020 Decennial US Census Data in R

  • Thursday, March 7th: Doing “GIS” and making maps with US Census Data in R

Today’s agenda

  • Hour 1: The American Community Survey, R, and tidycensus

  • Hour 2: ACS data workflows

  • Hour 3: An introduction to ACS microdata

The American Community Survey, R, and tidycensus

What is the ACS?

  • Annual survey of 3.5 million US households

  • Covers topics not available in decennial US Census data (e.g. income, education, language, housing characteristics)

  • Available as 1-year estimates (for geographies of population 65,000 and greater) and 5-year estimates (for geographies down to the block group)

  • Data delivered as estimates characterized by margins of error

How to get ACS data

tidycensus

  • R interface to the Decennial Census, American Community Survey, Population Estimates Program, and Public Use Microdata Series APIs

  • First released in 2017; nearly 500,000 downloads from the Posit CRAN mirror

tidycensus: key features

  • Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested);

  • Automatically downloads and merges Census geometries to data for mapping;

  • Includes tools for handling margins of error in the ACS and working with survey weights in the ACS PUMS;

  • States and counties can be requested by name (no more looking up FIPS codes!)

R and RStudio

  • R: programming language and software environment for data analysis (and wherever else your imagination can take you!)

  • RStudio: integrated development environment (IDE) for R developed by Posit

  • Posit Cloud: run RStudio with today’s workshop pre-configured at https://posit.cloud/content/7549022

Getting started with tidycensus

  • To get started, install the packages you’ll need for today’s workshop

  • If you are using the Posit Cloud environment, these packages are already installed for you

install.packages(c("tidycensus", "tidyverse"))
  • Optional, to run advanced examples:
install.packages(c("mapview", "survey", "srvyr"))

Optional: your Census API key

  • tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day

  • Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.

  • Once activated, use the census_api_key() function to set your key as an environment variable

library(tidycensus)

census_api_key("YOUR KEY GOES HERE", install = TRUE)

Getting started with ACS data in tidycensus

Using the get_acs() function

  • The get_acs() function is your portal to access ACS data using tidycensus

  • The two required arguments are geography and variables. As of v1.6, the function defaults to the 2018-2022 5-year ACS

library(tidycensus)

median_value <- get_acs(
  geography = "county",
  variables = "B25077_001",
  year = 2022
)
  • ACS data are returned with five columns: GEOID, NAME, variable, estimate, and moe
median_value
# A tibble: 3,222 × 5
   GEOID NAME                     variable   estimate   moe
   <chr> <chr>                    <chr>         <dbl> <dbl>
 1 01001 Autauga County, Alabama  B25077_001   191800  7996
 2 01003 Baldwin County, Alabama  B25077_001   266000  6916
 3 01005 Barbour County, Alabama  B25077_001   102700 11171
 4 01007 Bibb County, Alabama     B25077_001   120100 13377
 5 01009 Blount County, Alabama   B25077_001   159800  6189
 6 01011 Bullock County, Alabama  B25077_001    87700 20560
 7 01013 Butler County, Alabama   B25077_001    94800  5984
 8 01015 Calhoun County, Alabama  B25077_001   140500  5181
 9 01017 Chambers County, Alabama B25077_001   116900  9814
10 01019 Cherokee County, Alabama B25077_001   158700  8550
# ℹ 3,212 more rows

1-year ACS data

  • 1-year ACS data are more current, but are only available for geographies of population 65,000 and greater

  • Access 1-year ACS data with the argument survey = "acs1"; defaults to "acs5"

median_value_1yr <- get_acs(
  geography = "place",
  variables = "B25077_001",
  year = 2022,
  survey = "acs1"
)
median_value_1yr
# A tibble: 646 × 5
   GEOID   NAME                           variable   estimate   moe
   <chr>   <chr>                          <chr>         <dbl> <dbl>
 1 0103076 Auburn city, Alabama           B25077_001   335200 22622
 2 0107000 Birmingham city, Alabama       B25077_001   125500 14964
 3 0121184 Dothan city, Alabama           B25077_001   190800  8133
 4 0135896 Hoover city, Alabama           B25077_001   393400 19743
 5 0137000 Huntsville city, Alabama       B25077_001   294700 16881
 6 0150000 Mobile city, Alabama           B25077_001   178800 11552
 7 0151000 Montgomery city, Alabama       B25077_001   155200 10868
 8 0177256 Tuscaloosa city, Alabama       B25077_001   297600 30475
 9 0203000 Anchorage municipality, Alaska B25077_001   367900 10111
10 0404720 Avondale city, Arizona         B25077_001   400300 22495
# ℹ 636 more rows

Requesting tables of variables

  • The table parameter can be used to obtain all related variables in a “table” at once
income_table <- get_acs(
  geography = "county", 
  table = "B19001", 
  year = 2022
)
income_table
# A tibble: 54,774 × 5
   GEOID NAME                    variable   estimate   moe
   <chr> <chr>                   <chr>         <dbl> <dbl>
 1 01001 Autauga County, Alabama B19001_001    22308   369
 2 01001 Autauga County, Alabama B19001_002      990   265
 3 01001 Autauga County, Alabama B19001_003      656   187
 4 01001 Autauga County, Alabama B19001_004     1026   303
 5 01001 Autauga County, Alabama B19001_005     1335   329
 6 01001 Autauga County, Alabama B19001_006      741   205
 7 01001 Autauga County, Alabama B19001_007      822   218
 8 01001 Autauga County, Alabama B19001_008      840   270
 9 01001 Autauga County, Alabama B19001_009      921   260
10 01001 Autauga County, Alabama B19001_010      962   279
# ℹ 54,764 more rows

Understanding geography and variables in tidycensus

US Census Geography

Source: US Census Bureau

Source: US Census Bureau

Geography in tidycensus

Querying by state

  • For geographies available below the state level, the state parameter allows you to query data for a specific state

  • For smaller geographies (Census tracts, block groups), a county can also be requested

  • tidycensus translates state names and postal abbreviations internally, so you don’t need to remember the FIPS codes!

  • Example: data on median home value in San Diego County, California by Census tract

sd_value <- get_acs(
  geography = "tract", 
  variables = "B25077_001", 
  state = "CA", 
  county = "San Diego",
  year = 2022
)
sd_value
# A tibble: 737 × 5
   GEOID       NAME                                     variable estimate    moe
   <chr>       <chr>                                    <chr>       <dbl>  <dbl>
 1 06073000100 Census Tract 1; San Diego County; Calif… B25077_…  1633800  71171
 2 06073000201 Census Tract 2.01; San Diego County; Ca… B25077_…  1331000 147432
 3 06073000202 Census Tract 2.02; San Diego County; Ca… B25077_…   891100  97240
 4 06073000301 Census Tract 3.01; San Diego County; Ca… B25077_…   957500 232555
 5 06073000302 Census Tract 3.02; San Diego County; Ca… B25077_…   761700 108681
 6 06073000400 Census Tract 4; San Diego County; Calif… B25077_…   799100  94490
 7 06073000500 Census Tract 5; San Diego County; Calif… B25077_…  1025000  81768
 8 06073000600 Census Tract 6; San Diego County; Calif… B25077_…   727700  92078
 9 06073000700 Census Tract 7; San Diego County; Calif… B25077_…   736400 102788
10 06073000800 Census Tract 8; San Diego County; Calif… B25077_…   678400 119751
# ℹ 727 more rows

Searching for variables

  • To search for variables, use the load_variables() function along with a year and dataset

  • The View() function in RStudio allows for interactive browsing and filtering

vars <- load_variables(2022, "acs5")

View(vars)

Available ACS datasets in tidycensus

  • Detailed Tables

  • Data Profile (add "/profile" for variable lookup)

  • Subject Tables (add "/subject")

  • Comparison Profile (add "/cprofile")

  • Supplemental Estimates (use "acsse")

  • Migration Flows (access with get_flows())

“Tidy” or long-form data

  • The default data structure returned by tidycensus is “tidy” or long-form data, with variables by geography stacked by row
age_sex_table <- get_acs(
  geography = "state", 
  table = "B01001", 
  year = 2022,
  survey = "acs1",
)
age_sex_table
# A tibble: 2,548 × 5
   GEOID NAME    variable   estimate   moe
   <chr> <chr>   <chr>         <dbl> <dbl>
 1 01    Alabama B01001_001  5074296    NA
 2 01    Alabama B01001_002  2461248  6178
 3 01    Alabama B01001_003   146169  3134
 4 01    Alabama B01001_004   158767  6029
 5 01    Alabama B01001_005   164578  5689
 6 01    Alabama B01001_006    97834  3029
 7 01    Alabama B01001_007    70450  2897
 8 01    Alabama B01001_008    42597  4156
 9 01    Alabama B01001_009    34623  3440
10 01    Alabama B01001_010    97373  4627
# ℹ 2,538 more rows

“Wide” data

  • The argument output = "wide" spreads Census variables across the columns, returning one row per geographic unit and one column per variable
age_sex_table_wide <- get_acs(
  geography = "state", 
  table = "B01001", 
  year = 2022,
  survey = "acs1",
  output = "wide" 
)
age_sex_table_wide
# A tibble: 52 × 100
   GEOID NAME        B01001_001E B01001_001M B01001_002E B01001_002M B01001_003E
   <chr> <chr>             <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
 1 01    Alabama         5074296          NA     2461248        6178      146169
 2 02    Alaska           733583          NA      385667        2351       23043
 3 04    Arizona         7359197          NA     3678381        2695      201423
 4 05    Arkansas        3045637          NA     1504488        4216       90239
 5 06    California     39029342          NA    19536425        6410     1081904
 6 08    Colorado        5839926          NA     2960896        4278      154565
 7 09    Connecticut     3626205          NA     1776689        2237       91513
 8 10    Delaware        1018396          NA      494657        1092       27456
 9 11    District o…      671803          NA      319763         733       20038
10 12    Florida        22244823          NA    10953468        6169      563703
# ℹ 42 more rows
# ℹ 93 more variables: B01001_003M <dbl>, B01001_004E <dbl>, B01001_004M <dbl>,
#   B01001_005E <dbl>, B01001_005M <dbl>, B01001_006E <dbl>, B01001_006M <dbl>,
#   B01001_007E <dbl>, B01001_007M <dbl>, B01001_008E <dbl>, B01001_008M <dbl>,
#   B01001_009E <dbl>, B01001_009M <dbl>, B01001_010E <dbl>, B01001_010M <dbl>,
#   B01001_011E <dbl>, B01001_011M <dbl>, B01001_012E <dbl>, B01001_012M <dbl>,
#   B01001_013E <dbl>, B01001_013M <dbl>, B01001_014E <dbl>, …

Using named vectors of variables

  • Census variables can be hard to remember; using a named vector to request variables will replace the Census IDs with a custom input

  • In long form, these custom inputs will populate the variable column; in wide form, they will replace the column names

ca_education <- get_acs(
  geography = "county",
  state = "CA",
  variables = c(percent_high_school = "DP02_0062P", 
                percent_bachelors = "DP02_0065P",
                percent_graduate = "DP02_0066P"), 
  year = 2021
)
ca_education
# A tibble: 174 × 5
   GEOID NAME                       variable            estimate   moe
   <chr> <chr>                      <chr>                  <dbl> <dbl>
 1 06001 Alameda County, California percent_high_school     16.7   0.4
 2 06001 Alameda County, California percent_bachelors       28.3   0.3
 3 06001 Alameda County, California percent_graduate        21.3   0.3
 4 06003 Alpine County, California  percent_high_school     25.7   7.5
 5 06003 Alpine County, California  percent_bachelors       20.6   7.5
 6 06003 Alpine County, California  percent_graduate        18.7   8.5
 7 06005 Amador County, California  percent_high_school     30.7   2.2
 8 06005 Amador County, California  percent_bachelors       13.6   1.8
 9 06005 Amador County, California  percent_graduate         5.9   1.1
10 06007 Butte County, California   percent_high_school     22.3   0.9
# ℹ 164 more rows

Part 1 exercises

  1. Use the load_variables() function to find a variable that interests you that we haven’t used yet.

  2. Use get_acs() to fetch data on that variable from the ACS for counties, similar to how we did for median household income.

Part 2: ACS data workflows

Understanding limitations of the 1-year ACS

  • The 1-year American Community Survey is only available for geographies with population 65,000 and greater. This means:
  • Only 848 of 3,221 counties are available
  • Only 646 of 31,908 cities / Census-designated places are available
  • No data for Census tracts, block groups, ZCTAs, or any other geographies that typically have populations below 65,000

Data sparsity and margins of error

  • You may encounter data issues in the 1-year ACS data that are less pronounced in the 5-year ACS. For example:
  • Values available in the 5-year ACS may not be available in the corresponding 1-year ACS tables

  • If available, they will likely have larger margins of error

  • Your job as an analyst: balance need for certainty vs. need for recency in estimates

Example: Punjabi speakers by state (1-year ACS)

get_acs(
  geography = "state",
  variables = "B16001_054",
  year = 2022,
  survey = "acs1"
)
# A tibble: 52 × 5
   GEOID NAME                 variable   estimate   moe
   <chr> <chr>                <chr>         <dbl> <dbl>
 1 01    Alabama              B16001_054      666   556
 2 02    Alaska               B16001_054       NA    NA
 3 04    Arizona              B16001_054     1906  1342
 4 05    Arkansas             B16001_054       NA    NA
 5 06    California           B16001_054   154917 14153
 6 08    Colorado             B16001_054     1643  1968
 7 09    Connecticut          B16001_054     4039  2965
 8 10    Delaware             B16001_054        0   203
 9 11    District of Columbia B16001_054       NA    NA
10 12    Florida              B16001_054     3311  1969
# ℹ 42 more rows

Punjabi speakers by state (5-year ACS)

get_acs(
  geography = "state",
  variables = "B16001_054",
  year = 2022,
  survey = "acs5"
)
# A tibble: 52 × 5
   GEOID NAME                 variable   estimate   moe
   <chr> <chr>                <chr>         <dbl> <dbl>
 1 01    Alabama              B16001_054      493   256
 2 02    Alaska               B16001_054       24    39
 3 04    Arizona              B16001_054     2894   626
 4 05    Arkansas             B16001_054      641   308
 5 06    California           B16001_054   146406  5724
 6 08    Colorado             B16001_054     1358   472
 7 09    Connecticut          B16001_054     2093   715
 8 10    Delaware             B16001_054        7    12
 9 11    District of Columbia B16001_054       22    24
10 12    Florida              B16001_054     2466   636
# ℹ 42 more rows

Visualizing ACS data

Visualizing ACS estimates

  • As opposed to decennial US Census data, ACS estimates include information on uncertainty, represented by the margin of error in the moe column

  • This means that in some cases, visualization of estimates without reference to the margin of error can be misleading

  • Walkthrough: building a margin of error visualization with ggplot2

Visualizing ACS estimates

  • Let’s get some data on median household income by county in Utah
utah_income <- get_acs(
  geography = "county",
  variables = "B19013_001",
  state = "UT",
  year = 2022
) 

A basic plot

  • To visualize a dataset with ggplot2, we define an aesthetic and a geom
library(ggplot2)

utah_plot <- ggplot(utah_income, aes(x = estimate, y = NAME)) + 
  geom_point()
utah_plot

Problems with our basic plot

  • The data are not sorted by value, making comparisons difficult

  • The axis and tick labels are not intuitive

  • The Y-axis labels contain repetitive information (” County, Utah”)

  • We’ve made no attempt to customize the styling

Sorting by value

  • We use reorder() to sort counties by the value of their ACS estimates, improving legibility
utah_plot <- ggplot(utah_income, aes(x = estimate, 
                                y = reorder(NAME, estimate))) + 
  geom_point(color = "darkblue", size = 2)

Cleaning up tick labels

  • Using a combination of functions in the scales package and custom-defined functions, tick labels can be formatted any way you want
library(scales)
library(stringr)

utah_plot <- utah_plot + 
  scale_x_continuous(labels = label_dollar()) + 
  scale_y_discrete(labels = function(x) str_remove(x, " County, Utah")) 

Improving formatting and theming

  • Use labs() to label the plot and its axes, and change the theme to one of several built-in options
utah_plot <- utah_plot + 
  labs(title = "Median household income, 2018-2022 ACS",
       subtitle = "Counties in Utah",
       caption = "Data acquired with R and tidycensus",
       x = "ACS estimate",
       y = "") + 
  theme_minimal(base_size = 12)

Problem: comparing ACS estimates

  • The chart suggests that Juab County has lower income than Salt Lake County but its margin of error is quite large
View(utah_income)
  • How to visualize uncertainty in an intuitive way?

Visualizing margins of error

utah_plot_errorbar <- ggplot(utah_income, aes(x = estimate, 
                                        y = reorder(NAME, estimate))) + 
  geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe), #<<
                width = 0.5, linewidth = 0.5) + #<<
  geom_point(color = "darkblue", size = 2) + 
  scale_x_continuous(labels = label_dollar()) + 
  scale_y_discrete(labels = function(x) str_remove(x, " County, Utah")) + 
  labs(title = "Median household income, 2018-2022 ACS",
       subtitle = "Counties in Utah",
       caption = "Data acquired with R and tidycensus. Error bars represent margin of error around estimates.",
       x = "ACS estimate",
       y = "") + 
  theme_minimal(base_size = 12)
utah_plot_errorbar

Exploring maps of ACS data

“Spatial” ACS data

  • One of the best features of tidycensus is the argument geometry = TRUE, which gets you the correct Census geometries with no hassle

  • get_acs() with geometry = TRUE returns a spatial Census dataset containing simple feature geometries; learn more on March 7

  • Let’s take a look at some examples

“Spatial” ACS data

  • geometry = TRUE does the hard work for you of acquiring and pre-joining spatial Census data
cook_education <- get_acs(
  geography = "tract",
  variables = "DP02_0068P",
  state = "IL",
  county = "Cook",
  year = 2022,
  geometry = TRUE
)
  • We get back a simple features data frame (more about this on March 7)
cook_education
Simple feature collection with 1332 features and 5 fields (with 1 geometry empty)
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -88.26364 ymin: 41.46971 xmax: -87.52416 ymax: 42.15426
Geodetic CRS:  NAD83
First 10 features:
         GEOID                                        NAME   variable estimate
1  17031822400    Census Tract 8224; Cook County; Illinois DP02_0068P     16.4
2  17031740100    Census Tract 7401; Cook County; Illinois DP02_0068P     40.9
3  17031828100    Census Tract 8281; Cook County; Illinois DP02_0068P     19.4
4  17031826600    Census Tract 8266; Cook County; Illinois DP02_0068P     17.1
5  17031720500    Census Tract 7205; Cook County; Illinois DP02_0068P     58.5
6  17031750300    Census Tract 7503; Cook County; Illinois DP02_0068P     64.5
7  17031826500    Census Tract 8265; Cook County; Illinois DP02_0068P     20.1
8  17031825504 Census Tract 8255.04; Cook County; Illinois DP02_0068P     36.4
9  17031827100    Census Tract 8271; Cook County; Illinois DP02_0068P     14.2
10 17031824900    Census Tract 8249; Cook County; Illinois DP02_0068P     11.3
    moe                       geometry
1   5.9 MULTIPOLYGON (((-87.79876 4...
2   8.5 MULTIPOLYGON (((-87.70137 4...
3   6.4 MULTIPOLYGON (((-87.549 41....
4   4.4 MULTIPOLYGON (((-87.63033 4...
5   7.6 MULTIPOLYGON (((-87.69645 4...
6   7.9 MULTIPOLYGON (((-87.6912 41...
7   6.8 MULTIPOLYGON (((-87.63427 4...
8  11.9 MULTIPOLYGON (((-87.71377 4...
9   6.4 MULTIPOLYGON (((-87.656 41....
10  5.0 MULTIPOLYGON (((-87.71712 4...

Exploring spatial data

  • Mapping, GIS, and spatial data is the subject of our March 7 workshop - so be sure to check that out!

  • Even before we dive deeper into spatial data, it is very useful to be able to explore your results on an interactive map

  • Our solution: mapview()

Exploring spatial data

library(mapview)

mapview(cook_education)

Creating a shaded map with zcol

mapview(cook_education, zcol = "estimate")

What about mapping 1-year ACS data?

  • Typically it is difficult to map 1-year ACS data below the state level as your data will have gaps due to the population restrictions

Example: “mapping” 1-year ACS data

tx_education <- get_acs(
  geography = "county",
  variables = "DP02_0068P",
  state = "TX",
  year = 2022,
  survey = "acs1",
  geometry = TRUE
)

Example: “mapping” 1-year ACS data

mapview(tx_education, zcol = "estimate")

Mapping small(er) areas with PUMAs

  • Consider using Public Use Microdata Areas (PUMAs) for geographically-consistent substate mapping

  • PUMAs are typically used for microdata geography; however, I find them quite useful to approximate real state submarkets, planning areas, etc.

wa_wfh <- get_acs(
  geography = "puma",
  variables = "DP03_0024P",
  state = "WA",
  survey = "acs1",
  year = 2022,
  geometry = TRUE
)
library(mapview)

mapview(wa_wfh, zcol = "estimate")

Bonus: new Connecticut county-equivalents

  • The 2022 ACS is the first to include the new Connecticut Planning Regions in the “county” geography
ct_income <- get_acs(
  geography = "county",
  variables = "B19013_001",
  state = "CT",
  year = 2022,
  survey = "acs1",
  geometry = TRUE
)
mapview(ct_income, zcol = "estimate")

Time-series analysis with the 1-year ACS: some notes

  • Variables in the Data Profile and Subject Tables can change names over time

  • You’ll need to watch out for the Connecticut issue and changing geographies

  • The 2020 1-year ACS was not released (and is not in tidycensus), so your time-series can break if you are using iteration to pull data

Part 2 exercises

Swap in a variable from Part 1, "B25077_001" (median home value) for the analysis in this section, and try the following:

  • For a state of your choosing, how do margins of error differ among counties for median home values in the 1-year and 5-year ACS?

  • Can you visualize trends in median home value for a county of your choosing using mapview()?

Part 3: Working with ACS microdata

What is “microdata?”

Using microdata in tidycensus

Basic usage of get_pums()

  • get_pums() requires specifying one or more variables and the state for which you’d like to request data. state = 'all' can get data for the entire USA, but it takes a while!

  • The function defaults to the 5-year ACS with survey = "acs5"; 1-year ACS data is available with survey = "acs1".

  • The default year is 2022 in the latest version of tidycensus; data are available back to 2005 (1-year ACS) and 2005-2009 (5-year ACS). 2020 1-year data are not available.

Basic usage of get_pums()

library(tidycensus)

or_pums <- get_pums(
  variables = c("SEX", "AGEP", "HHT"),
  state = "OR",
  survey = "acs1",
  year = 2022
)
or_pums
# A tibble: 43,708 × 8
   SERIALNO      SPORDER  WGTP PWGTP  AGEP ST    HHT   SEX  
   <chr>           <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
 1 2022GQ0000103       1     0    15    43 41    b     1    
 2 2022GQ0000264       1     0    11    29 41    b     1    
 3 2022GQ0000368       1     0   115    79 41    b     2    
 4 2022GQ0000407       1     0     2    20 41    b     2    
 5 2022GQ0000504       1     0    18    59 41    b     1    
 6 2022GQ0000523       1     0    29    60 41    b     1    
 7 2022GQ0000539       1     0    73    27 41    b     1    
 8 2022GQ0000570       1     0    73    90 41    b     2    
 9 2022GQ0000718       1     0    14    59 41    b     1    
10 2022GQ0000776       1     0     4    85 41    b     1    
# ℹ 43,698 more rows

Understanding default data from get_pums()

get_pums() returns some technical variables by default without the user needing to request them specifically. These include:

  • SERIALNO: a serial number that uniquely identifies households in the sample;

  • SPORDER: the order of the person in the household; when combined with SERIALNO, uniquely identifies a person;

  • WGTP: the household weight;

  • PWGTP: the person weight

Weights and ACS microdata

  • Given that PUMS data are a sample of the US population, the weights columns must be used for analysis
library(tidyverse)

or_age_40 <- filter(or_pums, AGEP == 40)

print(sum(or_pums$PWGTP))
[1] 4240137
print(sum(or_age_40$PWGTP))
[1] 59948

Are these estimates accurate?

  • PUMS weights are calibrated to population and household totals, so larger tabulations should align with published estimates
get_acs("state", "B01003_001", state = "OR", survey = "acs1", year = 2022)
# A tibble: 1 × 5
  GEOID NAME   variable   estimate   moe
  <chr> <chr>  <chr>         <dbl> <dbl>
1 41    Oregon B01003_001  4240137    NA
  • Smaller tabulations will be characterized by more uncertainty, and may deviate from published estimates

Variables available in the ACS PUMS

View(pums_variables)
  • The pums_variables dataset is your one-stop shop for browsing variables in the ACS PUMS

  • It is a long-form dataset that organizes specific value codes by variable so you know what you can get. You’ll use information in the var_code column to fetch variables, but pay attention to the var_label, val_code, val_label, and data_type columns

Recoding PUMS variables

  • The recode = TRUE argument in get_pums() appends recoded columns to your returned dataset based on information available in pums_variables
or_pums_recoded <- get_pums(
  variables = c("SEX", "AGEP", "HHT"),
  state = "OR",
  survey = "acs1",
  year = 2022,
  recode = TRUE
)
or_pums_recoded
# A tibble: 43,708 × 11
   SERIALNO      SPORDER  WGTP PWGTP  AGEP ST    HHT   SEX   ST_label  HHT_label
   <chr>           <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <ord>     <ord>    
 1 2022GQ0000103       1     0    15    43 41    b     1     Oregon/OR N/A (GQ/…
 2 2022GQ0000264       1     0    11    29 41    b     1     Oregon/OR N/A (GQ/…
 3 2022GQ0000368       1     0   115    79 41    b     2     Oregon/OR N/A (GQ/…
 4 2022GQ0000407       1     0     2    20 41    b     2     Oregon/OR N/A (GQ/…
 5 2022GQ0000504       1     0    18    59 41    b     1     Oregon/OR N/A (GQ/…
 6 2022GQ0000523       1     0    29    60 41    b     1     Oregon/OR N/A (GQ/…
 7 2022GQ0000539       1     0    73    27 41    b     1     Oregon/OR N/A (GQ/…
 8 2022GQ0000570       1     0    73    90 41    b     2     Oregon/OR N/A (GQ/…
 9 2022GQ0000718       1     0    14    59 41    b     1     Oregon/OR N/A (GQ/…
10 2022GQ0000776       1     0     4    85 41    b     1     Oregon/OR N/A (GQ/…
# ℹ 43,698 more rows
# ℹ 1 more variable: SEX_label <ord>

Using variables filters

  • PUMS datasets - especially from the 5-year ACS - can get quite large. The variables_filter argument can return a subset of data from the API, reducing long download times

  • variables_filter is specified as a named list where the name represents the PUMS variable and the value represents a vector of values you are requesting from the API

Using variables filters

or_pums_filtered <- get_pums(
  variables = c("SEX", "AGEP", "HHT"),
  state = "OR",
  survey = "acs5",
  variables_filter = list(
    SEX = 2,
    AGEP = 30:49
  ),
  year = 2022
)
or_pums_filtered
# A tibble: 25,703 × 8
   SERIALNO      SPORDER  WGTP PWGTP  AGEP ST    HHT   SEX  
   <chr>           <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
 1 2018GQ0003965       1     0     8    32 41    b     2    
 2 2018GQ0004720       1     0     6    46 41    b     2    
 3 2018GQ0004942       1     0     5    42 41    b     2    
 4 2018GQ0006933       1     0     3    30 41    b     2    
 5 2018GQ0006965       1     0    23    32 41    b     2    
 6 2018GQ0007480       1     0    11    40 41    b     2    
 7 2018GQ0007813       1     0    19    34 41    b     2    
 8 2018GQ0010876       1     0    28    41 41    b     2    
 9 2018GQ0012843       1     0    14    37 41    b     2    
10 2018GQ0018588       1     0    67    32 41    b     2    
# ℹ 25,693 more rows

Working with PUMAs in PUMS data

  • In the previous hour, you were introduced to PUMAs

  • Public Use Microdata Areas (PUMAs) are the smallest available geographies at which records are identifiable in the PUMS datasets

  • PUMAs are redrawn with each decennial US Census, and typically are home to 100,000-200,000 people. The 2022 ACS is the first that aligns with the new 2020 PUMAs

Working with PUMAs in PUMS data

  • To get PUMA information in your output data, use the variable code PUMA
or_age_by_puma <- get_pums(
  variables = c("PUMA", "AGEP"),
  state = "OR",
  survey = "acs1",
  year = 2022
)
or_age_by_puma
# A tibble: 43,708 × 7
   SERIALNO      SPORDER  WGTP PWGTP  AGEP PUMA  ST   
   <chr>           <dbl> <dbl> <dbl> <dbl> <chr> <chr>
 1 2022HU0427155       1   124   124    55 04703 41   
 2 2022HU0427210       1   117   117    57 06721 41   
 3 2022HU0427210       2   117   128    56 06721 41   
 4 2022HU0427249       1    56    55    72 06722 41   
 5 2022HU0427249       2    56    57    69 06722 41   
 6 2022HU0427278       1    52    52    69 00503 41   
 7 2022HU0427278       2    52    57    77 00503 41   
 8 2022HU0427304       1   193   194    64 03904 41   
 9 2022HU0427314       1    57    57    73 05114 41   
10 2022HU0427314       2    57    59    71 05114 41   
# ℹ 43,698 more rows

Discussion: problems with PUMAs in the 2018-2022 5-year PUMS

Handling uncertainty in tabulated PUMS estimates

Uncertainty in PUMS data

  • PUMS data represent a smaller sample than the regular ACS, so understanding error around tabulated estimates is critical

  • The Census Bureau recommends using successive difference replication to calculate standard errors, and provides replicate weights to do this

  • tidycensus includes tools to help you get replicate weights and format your data for appropriate survey-weighted analysis

Getting replicate weights

  • We can acquire either housing or person replicate weights with the rep_weights argument
or_pums_replicate <- get_pums(
  variables = c("AGEP", "PUMA"),
  state = "OR",
  survey = "acs1",
  year = 2022,
  rep_weights = "person" 
)
or_pums_replicate
# A tibble: 43,708 × 87
   SERIALNO    SPORDER  AGEP PUMA  ST     WGTP PWGTP PWGTP1 PWGTP2 PWGTP3 PWGTP4
   <chr>         <dbl> <dbl> <chr> <chr> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 2022GQ0000…       1    43 00503 41        0    15     16     16     17     18
 2 2022GQ0000…       1    29 04704 41        0    11     25     13      2     10
 3 2022GQ0000…       1    79 03904 41        0   115    115    121    115    124
 4 2022GQ0000…       1    20 06724 41        0     2      2      2      3      3
 5 2022GQ0000…       1    59 01900 41        0    18     19     20     17     17
 6 2022GQ0000…       1    60 03905 41        0    29     28     30     30     26
 7 2022GQ0000…       1    27 04705 41        0    73     71     71     71     48
 8 2022GQ0000…       1    90 03904 41        0    73     76     73     74     73
 9 2022GQ0000…       1    59 09100 41        0    14     13     14     14     16
10 2022GQ0000…       1    85 05116 41        0     4      4      2      3      3
# ℹ 43,698 more rows
# ℹ 76 more variables: PWGTP5 <dbl>, PWGTP6 <dbl>, PWGTP7 <dbl>, PWGTP8 <dbl>,
#   PWGTP9 <dbl>, PWGTP10 <dbl>, PWGTP11 <dbl>, PWGTP12 <dbl>, PWGTP13 <dbl>,
#   PWGTP14 <dbl>, PWGTP15 <dbl>, PWGTP16 <dbl>, PWGTP17 <dbl>, PWGTP18 <dbl>,
#   PWGTP19 <dbl>, PWGTP20 <dbl>, PWGTP21 <dbl>, PWGTP22 <dbl>, PWGTP23 <dbl>,
#   PWGTP24 <dbl>, PWGTP25 <dbl>, PWGTP26 <dbl>, PWGTP27 <dbl>, PWGTP28 <dbl>,
#   PWGTP29 <dbl>, PWGTP30 <dbl>, PWGTP31 <dbl>, PWGTP32 <dbl>, …

Handling complex survey samples

  • tidycensus links to the survey and srvyr packages for managing PUMS data as complex survey samples

  • The to_survey() function will format your data with replicate weights for correct survey-weighted estimation

or_survey <- to_survey(
  or_pums_replicate,
  type = "person"
)

class(or_survey)
[1] "tbl_svy"       "svyrep.design"

Survey-weighted tabulations

  • srvyr conveniently links R’s survey infrastructure to familiar tidyverse-style workflows

  • Standard errors can be multiplied by 1.645 to get familiar 90% confidence level margins of error

library(srvyr)

or_survey %>%
  filter(AGEP == 40) %>%
  survey_count() %>%
  mutate(n_moe = n_se * 1.645)
# A tibble: 1 × 3
      n  n_se n_moe
  <dbl> <dbl> <dbl>
1 59948 3080. 5066.

Group-wise survey data analysis

  • A group-wise tidyverse workflow can be applied correctly by srvyr for the calculation of medians and other summary statistics
or_survey %>%
  group_by(PUMA) %>%
  summarize(median_age = survey_median(AGEP)) %>%
  mutate(median_age_moe = median_age_se * 1.645)
# A tibble: 35 × 4
   PUMA  median_age median_age_se median_age_moe
   <chr>      <dbl>         <dbl>          <dbl>
 1 00301         33         0.754          1.24 
 2 00501         40         1.51           2.48 
 3 00502         41         1.26           2.07 
 4 00503         43         1.51           2.48 
 5 00504         43         1.26           2.07 
 6 01701         41         0.754          1.24 
 7 01702         48         1.51           2.48 
 8 01900         46         0.502          0.826
 9 02901         40         1.26           2.07 
10 02902         44         1.76           2.89 
# ℹ 25 more rows

Checking our answers

  • Tabulated median ages are not identical to published estimates, but are very close

  • Use published estimates if available; use PUMS data to generate estimates that aren’t available in the published tables

or_age_puma <- get_acs(
  geography = "puma",
  variables = "B01002_001",
  state = "OR",
  year = 2022,
  survey = "acs1"
)

Thank you!