Analyzing Data from the 2023 American Community Survey in R

Kyle Walker

2025-02-05

About me

SSDAN webinar series

  • Today: Analyzing Data from the 2023 American Community Survey in R

  • Wednesday, February 12th: Working with Decennial Census Data in R

  • Wednesday, February 26th: Mapping and Spatial Analysis with US Census Data in R

Today’s agenda

  • Hour 1: The American Community Survey, R, and tidycensus

  • Hour 2: ACS data workflows

  • Hour 3: An introduction to ACS microdata

The American Community Survey, R, and tidycensus

What is the ACS?

  • Annual survey of 3.5 million US households

  • Covers topics not available in decennial US Census data (e.g. income, education, language, housing characteristics)

  • Available as 1-year estimates (for geographies of population 65,000 and greater) and 5-year estimates (for geographies down to the block group)

  • Data delivered as estimates characterized by margins of error

How to get ACS data

tidycensus

  • R interface to the Decennial Census, American Community Survey, Population Estimates Program, and Public Use Microdata Series APIs

  • First released in 2017; nearly 600,000 downloads from the Posit CRAN mirror

tidycensus: key features

  • Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested);

  • Automatically downloads and merges Census geometries to data for mapping;

  • Includes tools for handling margins of error in the ACS and working with survey weights in the ACS PUMS;

  • States and counties can be requested by name (no more looking up FIPS codes!)

R and RStudio

  • R: programming language and software environment for data analysis (and wherever else your imagination can take you!)

  • RStudio: integrated development environment (IDE) for R developed by Posit

  • Posit Cloud: run RStudio with today’s workshop pre-configured at https://posit.cloud/content/9689451

Getting started with tidycensus

  • To get started, install the packages you’ll need for today’s workshop

  • If you are using the Posit Cloud environment, these packages are already installed for you

install.packages(c("tidycensus", "tidyverse"))
  • Optional, to run advanced examples:
install.packages(c("mapview", "survey", "srvyr", "arcgislayers"))

Optional: your Census API key

  • tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day

  • Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.

  • Once activated, use the census_api_key() function to set your key as an environment variable

  • As of February 2025, the API key service appears to be unavailable

library(tidycensus)

census_api_key("YOUR KEY GOES HERE", install = TRUE)

Getting started with ACS data in tidycensus

Using the get_acs() function

  • The get_acs() function is your portal to access ACS data using tidycensus

  • The two required arguments are geography and variables. As of v1.7.1, the function defaults to the 2019-2023 5-year ACS

library(tidycensus)

median_value <- get_acs(
  geography = "county",
  variables = "B25077_001",
  year = 2023
)
  • ACS data are returned with five columns: GEOID, NAME, variable, estimate, and moe
median_value
# A tibble: 3,222 × 5
   GEOID NAME                     variable   estimate   moe
   <chr> <chr>                    <chr>         <dbl> <dbl>
 1 01001 Autauga County, Alabama  B25077_001   197900  8280
 2 01003 Baldwin County, Alabama  B25077_001   287000  6910
 3 01005 Barbour County, Alabama  B25077_001   109900 11374
 4 01007 Bibb County, Alabama     B25077_001   132600 18638
 5 01009 Blount County, Alabama   B25077_001   169700  5354
 6 01011 Bullock County, Alabama  B25077_001    79400 15850
 7 01013 Butler County, Alabama   B25077_001    99700  8484
 8 01015 Calhoun County, Alabama  B25077_001   149500  6196
 9 01017 Chambers County, Alabama B25077_001   129700 12036
10 01019 Cherokee County, Alabama B25077_001   165900  9786
# ℹ 3,212 more rows

1-year ACS data

  • 1-year ACS data are more current, but are only available for geographies of population 65,000 and greater

  • Access 1-year ACS data with the argument survey = "acs1"; defaults to "acs5"

median_value_1yr <- get_acs(
  geography = "place",
  variables = "B25077_001",
  year = 2023,
  survey = "acs1"
)
median_value_1yr
# A tibble: 649 × 5
   GEOID   NAME                           variable   estimate   moe
   <chr>   <chr>                          <chr>         <dbl> <dbl>
 1 0103076 Auburn city, Alabama           B25077_001   365500 39357
 2 0107000 Birmingham city, Alabama       B25077_001   157200 14921
 3 0121184 Dothan city, Alabama           B25077_001   190000 10687
 4 0135896 Hoover city, Alabama           B25077_001   415300 20389
 5 0137000 Huntsville city, Alabama       B25077_001   309900 15846
 6 0150000 Mobile city, Alabama           B25077_001   189000 20112
 7 0151000 Montgomery city, Alabama       B25077_001   158500  8737
 8 0177256 Tuscaloosa city, Alabama       B25077_001   252200 26811
 9 0203000 Anchorage municipality, Alaska B25077_001   385900 11025
10 0404720 Avondale city, Arizona         B25077_001   404900 17356
# ℹ 639 more rows

Requesting tables of variables

  • The table parameter can be used to obtain all related variables in a “table” at once
income_table <- get_acs(
  geography = "county", 
  table = "B19001", 
  year = 2023
)
income_table
# A tibble: 54,774 × 5
   GEOID NAME                    variable   estimate   moe
   <chr> <chr>                   <chr>         <dbl> <dbl>
 1 01001 Autauga County, Alabama B19001_001    22523   431
 2 01001 Autauga County, Alabama B19001_002      831   221
 3 01001 Autauga County, Alabama B19001_003      676   211
 4 01001 Autauga County, Alabama B19001_004      801   312
 5 01001 Autauga County, Alabama B19001_005      984   281
 6 01001 Autauga County, Alabama B19001_006      983   322
 7 01001 Autauga County, Alabama B19001_007     1057   255
 8 01001 Autauga County, Alabama B19001_008      768   241
 9 01001 Autauga County, Alabama B19001_009      961   266
10 01001 Autauga County, Alabama B19001_010      837   248
# ℹ 54,764 more rows

Understanding geography and variables in tidycensus

US Census Geography

Source: US Census Bureau

Source: US Census Bureau

Geography in tidycensus

Querying by state

  • For geographies available below the state level, the state parameter allows you to query data for a specific state

  • For smaller geographies (Census tracts, block groups), a county can also be requested

  • tidycensus translates state names and postal abbreviations internally, so you don’t need to remember the FIPS codes!

  • Example: data on median home value in San Diego County, California by Census tract

sd_value <- get_acs(
  geography = "tract", 
  variables = "B25077_001", 
  state = "CA", 
  county = "San Diego",
  year = 2023
)
sd_value
# A tibble: 737 × 5
   GEOID       NAME                                     variable estimate    moe
   <chr>       <chr>                                    <chr>       <dbl>  <dbl>
 1 06073000100 Census Tract 1; San Diego County; Calif… B25077_…  1825500 159981
 2 06073000201 Census Tract 2.01; San Diego County; Ca… B25077_…  1445900 203254
 3 06073000202 Census Tract 2.02; San Diego County; Ca… B25077_…   966200 175650
 4 06073000301 Census Tract 3.01; San Diego County; Ca… B25077_…   895600 152793
 5 06073000302 Census Tract 3.02; San Diego County; Ca… B25077_…   820800 132232
 6 06073000400 Census Tract 4; San Diego County; Calif… B25077_…   802500 108414
 7 06073000500 Census Tract 5; San Diego County; Calif… B25077_…  1031800  96156
 8 06073000600 Census Tract 6; San Diego County; Calif… B25077_…   723300  78902
 9 06073000700 Census Tract 7; San Diego County; Calif… B25077_…   833900 107773
10 06073000800 Census Tract 8; San Diego County; Calif… B25077_…   795900 246242
# ℹ 727 more rows

Searching for variables

  • To search for variables, use the load_variables() function along with a year and dataset

  • The View() function in RStudio allows for interactive browsing and filtering

vars <- load_variables(2023, "acs5")

View(vars)

Available ACS datasets in tidycensus

  • Detailed Tables

  • Data Profile (add "/profile" for variable lookup)

  • Subject Tables (add "/subject")

  • Comparison Profile (add "/cprofile")

  • Supplemental Estimates (use "acsse")

  • Migration Flows (access with get_flows())

“Tidy” or long-form data

  • The default data structure returned by tidycensus is “tidy” or long-form data, with variables by geography stacked by row
age_sex_table <- get_acs(
  geography = "state", 
  table = "B01001", 
  year = 2023,
  survey = "acs1",
)
age_sex_table
# A tibble: 2,548 × 5
   GEOID NAME    variable   estimate   moe
   <chr> <chr>   <chr>         <dbl> <dbl>
 1 01    Alabama B01001_001  5108468    NA
 2 01    Alabama B01001_002  2471801  5359
 3 01    Alabama B01001_003   143309  3322
 4 01    Alabama B01001_004   154161  6072
 5 01    Alabama B01001_005   169656  6167
 6 01    Alabama B01001_006   104280  3039
 7 01    Alabama B01001_007    73719  3352
 8 01    Alabama B01001_008    33655  3913
 9 01    Alabama B01001_009    32913  3242
10 01    Alabama B01001_010   100629  5077
# ℹ 2,538 more rows

“Wide” data

  • The argument output = "wide" spreads Census variables across the columns, returning one row per geographic unit and one column per variable
age_sex_table_wide <- get_acs(
  geography = "state", 
  table = "B01001", 
  year = 2023,
  survey = "acs1",
  output = "wide" 
)
age_sex_table_wide
# A tibble: 52 × 100
   GEOID NAME        B01001_001E B01001_001M B01001_002E B01001_002M B01001_003E
   <chr> <chr>             <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
 1 01    Alabama         5108468          NA     2471801        5359      143309
 2 02    Alaska           733406          NA      385855        2547       24360
 3 04    Arizona         7431344          NA     3708672        2905      198924
 4 05    Arkansas        3067732          NA     1512201        4594       89183
 5 06    California     38965193          NA    19450698        5477     1067828
 6 08    Colorado        5877610          NA     2980108        4746      159079
 7 09    Connecticut     3617176          NA     1775718        2410       92614
 8 10    Delaware        1031890          NA      498994        1839       26971
 9 11    District o…      678972          NA      321590         265       19516
10 12    Florida        22610726          NA    11116575        6588      577538
# ℹ 42 more rows
# ℹ 93 more variables: B01001_003M <dbl>, B01001_004E <dbl>, B01001_004M <dbl>,
#   B01001_005E <dbl>, B01001_005M <dbl>, B01001_006E <dbl>, B01001_006M <dbl>,
#   B01001_007E <dbl>, B01001_007M <dbl>, B01001_008E <dbl>, B01001_008M <dbl>,
#   B01001_009E <dbl>, B01001_009M <dbl>, B01001_010E <dbl>, B01001_010M <dbl>,
#   B01001_011E <dbl>, B01001_011M <dbl>, B01001_012E <dbl>, B01001_012M <dbl>,
#   B01001_013E <dbl>, B01001_013M <dbl>, B01001_014E <dbl>, …

Using named vectors of variables

  • Census variables can be hard to remember; using a named vector to request variables will replace the Census IDs with a custom input

  • In long form, these custom inputs will populate the variable column; in wide form, they will replace the column names

ca_education <- get_acs(
  geography = "county",
  state = "CA",
  variables = c(percent_high_school = "DP02_0062P", 
                percent_bachelors = "DP02_0065P",
                percent_graduate = "DP02_0066P"), 
  year = 2023
)
ca_education
# A tibble: 174 × 5
   GEOID NAME                       variable            estimate   moe
   <chr> <chr>                      <chr>                  <dbl> <dbl>
 1 06001 Alameda County, California percent_high_school     15.9   0.3
 2 06001 Alameda County, California percent_bachelors       28.7   0.3
 3 06001 Alameda County, California percent_graduate        22.8   0.3
 4 06003 Alpine County, California  percent_high_school     18.5   6.1
 5 06003 Alpine County, California  percent_bachelors       22.4   7.3
 6 06003 Alpine County, California  percent_graduate        19.7   9.3
 7 06005 Amador County, California  percent_high_school     26.5   1.6
 8 06005 Amador County, California  percent_bachelors       15.5   1.5
 9 06005 Amador County, California  percent_graduate         7     1.1
10 06007 Butte County, California   percent_high_school     22.3   0.8
# ℹ 164 more rows

Part 1 exercises

  1. Use the load_variables() function to find a variable that interests you that we haven’t used yet.

  2. Use get_acs() to fetch data on that variable from the ACS for counties, similar to how we did for median household income.

Part 2: ACS data workflows

Understanding limitations of the 1-year ACS

  • The 1-year American Community Survey is only available for geographies with population 65,000 and greater. This means:
  • Only 854 of 3,222 counties are available
  • Only 649 of 32,325 cities / Census-designated places are available
  • No data for Census tracts, block groups, ZCTAs, or any other geographies that typically have populations below 65,000

Data sparsity and margins of error

  • You may encounter data issues in the 1-year ACS data that are less pronounced in the 5-year ACS. For example:
  • Values available in the 5-year ACS may not be available in the corresponding 1-year ACS tables

  • If available, they will likely have larger margins of error

  • Your job as an analyst: balance need for certainty vs. need for recency in estimates

Example: Punjabi speakers by state (1-year ACS)

get_acs(
  geography = "state",
  variables = "B16001_054",
  year = 2023,
  survey = "acs1"
)
# A tibble: 52 × 5
   GEOID NAME                 variable   estimate   moe
   <chr> <chr>                <chr>         <dbl> <dbl>
 1 01    Alabama              B16001_054     1149   947
 2 02    Alaska               B16001_054       NA    NA
 3 04    Arizona              B16001_054     2180  1477
 4 05    Arkansas             B16001_054        0   208
 5 06    California           B16001_054   171384 13741
 6 08    Colorado             B16001_054     1997  1973
 7 09    Connecticut          B16001_054     2062  1095
 8 10    Delaware             B16001_054     1643  1814
 9 11    District of Columbia B16001_054       67    74
10 12    Florida              B16001_054     5148  2555
# ℹ 42 more rows

Punjabi speakers by state (5-year ACS)

get_acs(
  geography = "state",
  variables = "B16001_054",
  year = 2023,
  survey = "acs5"
)
# A tibble: 52 × 5
   GEOID NAME                 variable   estimate   moe
   <chr> <chr>                <chr>         <dbl> <dbl>
 1 01    Alabama              B16001_054      704   321
 2 02    Alaska               B16001_054        0    24
 3 04    Arizona              B16001_054     2824   594
 4 05    Arkansas             B16001_054      519   257
 5 06    California           B16001_054   151139  6187
 6 08    Colorado             B16001_054     1406   505
 7 09    Connecticut          B16001_054     1945   529
 8 10    Delaware             B16001_054      236   176
 9 11    District of Columbia B16001_054       39    34
10 12    Florida              B16001_054     2959   764
# ℹ 42 more rows

Visualizing ACS data

Visualizing ACS estimates

  • As opposed to decennial US Census data, ACS estimates include information on uncertainty, represented by the margin of error in the moe column

  • This means that in some cases, visualization of estimates without reference to the margin of error can be misleading

  • Walkthrough: building a margin of error visualization with ggplot2

Visualizing ACS estimates

  • Let’s get some data on median household income by county in Utah
utah_income <- get_acs(
  geography = "county",
  variables = "B19013_001",
  state = "UT",
  year = 2023
) 

A basic plot

  • To visualize a dataset with ggplot2, we define an aesthetic and a geom
library(ggplot2)

utah_plot <- ggplot(utah_income, aes(x = estimate, y = NAME)) + 
  geom_point()
utah_plot

Problems with our basic plot

  • The data are not sorted by value, making comparisons difficult

  • The axis and tick labels are not intuitive

  • The Y-axis labels contain repetitive information (” County, Utah”)

  • We’ve made no attempt to customize the styling

Sorting by value

  • We use reorder() to sort counties by the value of their ACS estimates, improving legibility
utah_plot <- ggplot(utah_income, aes(x = estimate, 
                                y = reorder(NAME, estimate))) + 
  geom_point(color = "darkblue", size = 2)

Cleaning up tick labels

  • Using a combination of functions in the scales package and custom-defined functions, tick labels can be formatted any way you want
library(scales)
library(stringr)

utah_plot <- utah_plot + 
  scale_x_continuous(labels = label_dollar()) + 
  scale_y_discrete(labels = function(x) str_remove(x, " County, Utah")) 

Improving formatting and theming

  • Use labs() to label the plot and its axes, and change the theme to one of several built-in options
utah_plot <- utah_plot + 
  labs(title = "Median household income, 2019-2023 ACS",
       subtitle = "Counties in Utah",
       caption = "Data acquired with R and tidycensus",
       x = "ACS estimate",
       y = "") + 
  theme_minimal(base_size = 12)

Problem: comparing ACS estimates

  • The chart suggests that Juab County has lower income than Salt Lake County but its margin of error is quite large
View(utah_income)
  • How to visualize uncertainty in an intuitive way?

Visualizing margins of error

utah_plot_errorbar <- ggplot(utah_income, aes(x = estimate, 
                                        y = reorder(NAME, estimate))) + 
  geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe), #<<
                width = 0.5, linewidth = 0.5) + #<<
  geom_point(color = "darkblue", size = 2) + 
  scale_x_continuous(labels = label_dollar()) + 
  scale_y_discrete(labels = function(x) str_remove(x, " County, Utah")) + 
  labs(title = "Median household income, 2019-2023 ACS",
       subtitle = "Counties in Utah",
       caption = "Data acquired with R and tidycensus. Error bars represent margin of error around estimates.",
       x = "ACS estimate",
       y = "") + 
  theme_minimal(base_size = 12)
utah_plot_errorbar

Exploring maps of ACS data

“Spatial” ACS data

  • One of the best features of tidycensus is the argument geometry = TRUE, which gets you the correct Census geometries with no hassle

  • get_acs() with geometry = TRUE returns a spatial Census dataset containing simple feature geometries

  • As of February 2025, the Census geographic shapefiles used by tidycensus have been taken offline, so we’ll be taking a look at some workarounds

“Spatial” ACS data

cook_education <- get_acs(
  geography = "tract",
  variables = "DP02_0068P",
  state = "IL",
  county = "Cook",
  year = 2023
)

Geometry workaround

  • Corresponding geometries are available from the TigerWeb API, accessed with the {arcgislayers} package
library(arcgislayers)

cook_geo <- arc_read("https://tigerweb.geo.census.gov/arcgis/rest/services/Generalized_ACS2023/Tracts_Blocks/MapServer/4", where = "COUNTY = '031' AND STATE = '17'")
  • We get back a simple features data frame (more about this on February 26th)
cook_geo
Simple feature collection with 1331 features and 17 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -9825464 ymin: 5081873 xmax: -9743145 ymax: 5184116
Projected CRS: WGS 84 / Pseudo-Mercator
First 10 features:
   MTFCC             OID       GEOID STATE COUNTY  TRACT BASENAME
1  G5020 207903704492010 17031030101    17    031 030101   301.01
2  G5020 207903704492021 17031030701    17    031 030701   307.01
3  G5020 207903704492051 17031070103    17    031 070103   701.03
4  G5020  20790197959947 17031807100    17    031 807100     8071
5  G5020  20790197959946 17031807200    17    031 807200     8072
6  G5020  20790197959945 17031807300    17    031 807300     8073
7  G5020  20790197959944 17031807400    17    031 807400     8074
8  G5020  20790197959943 17031807500    17    031 807500     8075
9  G5020  20790197959941 17031807700    17    031 807700     8077
10 G5020  20790197959939 17031807900    17    031 807900     8079
                  NAME LSADC FUNCSTAT AREALAND AREAWATER OBJECTID     CENTLAT
1  Census Tract 301.01    CT        S   127283    110999     8482 +41.9912045
2  Census Tract 307.01    CT        S    69878         0     8483 +41.9827453
3  Census Tract 701.03    CT        S    69094         0     8484 +41.9268213
4    Census Tract 8071    CT        S  1528033         0     8487 +42.0444541
5    Census Tract 8072    CT        S  2489125         0     8488 +42.0371765
6    Census Tract 8073    CT        S  2037391         0     8514 +42.0337542
7    Census Tract 8074    CT        S  1396762         0     8515 +42.0281982
8    Census Tract 8075    CT        S   988887         0     8516 +42.0279560
9    Census Tract 8077    CT        S  1697095         0     8955 +42.0207253
10   Census Tract 8079    CT        S  2013730         0     8956 +42.0041682
        CENTLON    INTPTLAT     INTPTLON                       geometry
1  -087.6573633 +41.9910538 -087.6548790 POLYGON ((-9758304 5159824,...
2  -087.6579659 +41.9827453 -087.6579659 POLYGON ((-9758274 5158526,...
3  -087.6426050 +41.9268110 -087.6426051 POLYGON ((-9756510 5150208,...
4  -087.7210876 +42.0444535 -087.7210786 POLYGON ((-9766345 5167360,...
5  -087.7285090 +42.0371751 -087.7285103 POLYGON ((-9767978 5166271,...
6  -087.7548822 +42.0337614 -087.7549133 POLYGON ((-9769870 5166885,...
7  -087.7398421 +42.0281935 -087.7398439 POLYGON ((-9767999 5165384,...
8  -087.7281340 +42.0279578 -087.7281380 POLYGON ((-9766395 5164912,...
9  -087.7542264 +42.0207639 -087.7542195 POLYGON ((-9770177 5164408,...
10 -087.7487805 +42.0041886 -087.7487798 POLYGON ((-9769616 5161377,...

Tutorial: using TIGERweb

  • Link: https://tigerweb.geo.census.gov/tigerwebmain/TIGERweb_main.html

Spatial ACS data

  • With the geometries in hand, we can merge the ACS data using left_join()
library(dplyr)

cook_education_geo <- cook_geo %>%
  select(GEOID) %>%
  left_join(cook_education, by = "GEOID")

cook_education_geo
Simple feature collection with 1331 features and 5 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -9825464 ymin: 5081873 xmax: -9743145 ymax: 5184116
Projected CRS: WGS 84 / Pseudo-Mercator
First 10 features:
         GEOID                                       NAME   variable estimate
1  17031030101 Census Tract 301.01; Cook County; Illinois DP02_0068P     50.0
2  17031030701 Census Tract 307.01; Cook County; Illinois DP02_0068P     48.2
3  17031070103 Census Tract 701.03; Cook County; Illinois DP02_0068P     88.4
4  17031807100   Census Tract 8071; Cook County; Illinois DP02_0068P     73.2
5  17031807200   Census Tract 8072; Cook County; Illinois DP02_0068P     52.7
6  17031807300   Census Tract 8073; Cook County; Illinois DP02_0068P     52.5
7  17031807400   Census Tract 8074; Cook County; Illinois DP02_0068P     49.8
8  17031807500   Census Tract 8075; Cook County; Illinois DP02_0068P     48.0
9  17031807700   Census Tract 8077; Cook County; Illinois DP02_0068P     44.1
10 17031807900   Census Tract 8079; Cook County; Illinois DP02_0068P     63.3
   moe                       geometry
1  9.5 POLYGON ((-9758304 5159824,...
2  6.1 POLYGON ((-9758274 5158526,...
3  5.3 POLYGON ((-9756510 5150208,...
4  6.5 POLYGON ((-9766345 5167360,...
5  7.0 POLYGON ((-9767978 5166271,...
6  7.1 POLYGON ((-9769870 5166885,...
7  8.1 POLYGON ((-9767999 5165384,...
8  6.7 POLYGON ((-9766395 5164912,...
9  6.8 POLYGON ((-9770177 5164408,...
10 7.8 POLYGON ((-9769616 5161377,...

Exploring spatial data

  • Mapping, GIS, and spatial data is the subject of our February 26 workshop - so be sure to check that out!

  • Even before we dive deeper into spatial data, it is very useful to be able to explore your results on an interactive map

  • Our solution: mapview()

Exploring spatial data

library(mapview)

mapview(cook_education_geo)

Creating a shaded map with zcol

mapview(cook_education_geo, zcol = "estimate")

What about mapping 1-year ACS data?

  • Typically it is difficult to map 1-year ACS data below the state level as your data will have gaps due to the population restrictions

Example: “mapping” 1-year ACS data

tx_education <- get_acs(
  geography = "county",
  variables = "DP02_0068P",
  state = "TX",
  year = 2023,
  survey = "acs1"
)

tx_counties <- arc_read("https://tigerweb.geo.census.gov/arcgis/rest/services/Generalized_ACS2023/State_County/MapServer/11", where = "STATE = '48'")

tx_geo <- left_join(tx_counties, tx_education, by = "GEOID")

Example: “mapping” 1-year ACS data

mapview(tx_geo, zcol = "estimate")

Mapping small(er) areas with PUMAs

  • Consider using Public Use Microdata Areas (PUMAs) for geographically-consistent substate mapping

  • PUMAs are typically used for microdata geography; however, I find them quite useful to approximate real state submarkets, planning areas, etc.

tx_wfh <- get_acs(
  geography = "puma",
  variables = "DP03_0024P",
  state = "TX",
  survey = "acs1",
  year = 2023
)

tx_pumas <- arc_read("https://tigerweb.geo.census.gov/arcgis/rest/services/Generalized_TAB2020/PUMA_TAD_TAZ_UGA_ZCTA/MapServer/4", where = "STATE = '48'")

tx_geo <- left_join(tx_pumas, tx_wfh, by = "GEOID")
library(mapview)

mapview(tx_geo, zcol = "estimate")

Time-series analysis with the 1-year ACS: some notes

  • Variables in the Data Profile and Subject Tables can change names over time

  • You’ll need to watch out for the Connecticut issue and changing geographies

  • The 2020 1-year ACS was not released (and is not in tidycensus), so your time-series can break if you are using iteration to pull data

Part 2 exercises

Swap in a variable from Part 1, "B25077_001" (median home value) for the analysis in this section, and try the following:

  • For a state of your choosing, how do margins of error differ among counties for median home values in the 1-year and 5-year ACS?

  • Can you visualize trends in median home value for a county of your choosing using mapview()?

Part 3: Working with ACS microdata

What is “microdata?”

Using microdata in tidycensus

Basic usage of get_pums()

  • get_pums() requires specifying one or more variables and the state for which you’d like to request data. state = 'all' can get data for the entire USA, but it takes a while!

  • The function defaults to the 5-year ACS with survey = "acs5"; 1-year ACS data is available with survey = "acs1".

  • The default year is 2023 in the latest version of tidycensus; data are available back to 2005 (1-year ACS) and 2005-2009 (5-year ACS). 2020 1-year data are not available.

Basic usage of get_pums()

library(tidycensus)

la_pums <- get_pums(
  variables = c("SEX", "AGEP", "HHT"),
  state = "LA",
  survey = "acs1",
  year = 2023
)
la_pums
# A tibble: 45,026 × 8
   SERIALNO      SPORDER  WGTP PWGTP  AGEP STATE HHT   SEX  
   <chr>           <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
 1 2023GQ0000080       1     0    22    35 22    b     1    
 2 2023GQ0000129       1     0   138    58 22    b     2    
 3 2023GQ0000171       1     0    37    19 22    b     1    
 4 2023GQ0000183       1     0    97    22 22    b     2    
 5 2023GQ0000200       1     0    62    19 22    b     2    
 6 2023GQ0000299       1     0    32    82 22    b     1    
 7 2023GQ0000387       1     0    58    44 22    b     1    
 8 2023GQ0000401       1     0    40    21 22    b     2    
 9 2023GQ0000406       1     0    50    32 22    b     1    
10 2023GQ0000477       1     0    26    27 22    b     1    
# ℹ 45,016 more rows

Understanding default data from get_pums()

get_pums() returns some technical variables by default without the user needing to request them specifically. These include:

  • SERIALNO: a serial number that uniquely identifies households in the sample;

  • SPORDER: the order of the person in the household; when combined with SERIALNO, uniquely identifies a person;

  • WGTP: the household weight;

  • PWGTP: the person weight

Weights and ACS microdata

  • Given that PUMS data are a sample of the US population, the weights columns must be used for analysis
library(tidyverse)

la_age_41 <- filter(la_pums, AGEP == 41)

print(sum(la_pums$PWGTP))
[1] 4573749
print(sum(la_age_41$PWGTP))
[1] 63826

Are these estimates accurate?

  • PUMS weights are calibrated to population and household totals, so larger tabulations should align with published estimates
get_acs("state", "B01003_001", state = "LA", survey = "acs1", year = 2023)
# A tibble: 1 × 5
  GEOID NAME      variable   estimate   moe
  <chr> <chr>     <chr>         <dbl> <dbl>
1 22    Louisiana B01003_001  4573749    NA
  • Smaller tabulations will be characterized by more uncertainty, and may deviate from published estimates

Variables available in the ACS PUMS

View(pums_variables)
  • The pums_variables dataset is your one-stop shop for browsing variables in the ACS PUMS

  • It is a long-form dataset that organizes specific value codes by variable so you know what you can get. You’ll use information in the var_code column to fetch variables, but pay attention to the var_label, val_code, val_label, and data_type columns

Recoding PUMS variables

  • The recode = TRUE argument in get_pums() appends recoded columns to your returned dataset based on information available in pums_variables
la_pums_recoded <- get_pums(
  variables = c("SEX", "AGEP", "HHT"),
  state = "LA",
  survey = "acs1",
  year = 2023,
  recode = TRUE
)
la_pums_recoded
# A tibble: 45,026 × 11
   SERIALNO    SPORDER  WGTP PWGTP  AGEP STATE HHT   SEX   STATE_label HHT_label
   <chr>         <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <ord>       <ord>    
 1 2023GQ0000…       1     0    22    35 22    b     1     Louisiana/… N/A (GQ/…
 2 2023GQ0000…       1     0   138    58 22    b     2     Louisiana/… N/A (GQ/…
 3 2023GQ0000…       1     0    37    19 22    b     1     Louisiana/… N/A (GQ/…
 4 2023GQ0000…       1     0    97    22 22    b     2     Louisiana/… N/A (GQ/…
 5 2023GQ0000…       1     0    62    19 22    b     2     Louisiana/… N/A (GQ/…
 6 2023GQ0000…       1     0    32    82 22    b     1     Louisiana/… N/A (GQ/…
 7 2023GQ0000…       1     0    58    44 22    b     1     Louisiana/… N/A (GQ/…
 8 2023GQ0000…       1     0    40    21 22    b     2     Louisiana/… N/A (GQ/…
 9 2023GQ0000…       1     0    50    32 22    b     1     Louisiana/… N/A (GQ/…
10 2023GQ0000…       1     0    26    27 22    b     1     Louisiana/… N/A (GQ/…
# ℹ 45,016 more rows
# ℹ 1 more variable: SEX_label <ord>

Using variables filters

  • PUMS datasets - especially from the 5-year ACS - can get quite large. The variables_filter argument can return a subset of data from the API, reducing long download times

  • variables_filter is specified as a named list where the name represents the PUMS variable and the value represents a vector of values you are requesting from the API

Using variables filters

la_pums_filtered <- get_pums(
  variables = c("SEX", "AGEP", "HHT"),
  state = "LA",
  survey = "acs5",
  variables_filter = list(
    SEX = 2,
    AGEP = 30:49
  ),
  year = 2023
)
la_pums_filtered
# A tibble: 24,518 × 8
   SERIALNO      SPORDER  WGTP PWGTP  AGEP STATE HHT   SEX  
   <chr>           <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
 1 2019GQ0001753       1     0     5    44 22    b     2    
 2 2019GQ0002125       1     0    14    34 22    b     2    
 3 2019GQ0004757       1     0    19    34 22    b     2    
 4 2019GQ0008205       1     0    16    38 22    b     2    
 5 2019GQ0009735       1     0     1    35 22    b     2    
 6 2019GQ0010191       1     0    33    32 22    b     2    
 7 2019GQ0010776       1     0    13    34 22    b     2    
 8 2019GQ0013041       1     0    13    42 22    b     2    
 9 2019GQ0014180       1     0     4    47 22    b     2    
10 2019GQ0022448       1     0    11    47 22    b     2    
# ℹ 24,508 more rows

Working with PUMAs in PUMS data

  • In the previous hour, you were introduced to PUMAs

  • Public Use Microdata Areas (PUMAs) are the smallest available geographies at which records are identifiable in the PUMS datasets

  • PUMAs are redrawn with each decennial US Census, and typically are home to 100,000-200,000 people. The 2023 ACS aligns with the 2020 PUMAs

Working with PUMAs in PUMS data

  • To get PUMA information in your output data, use the variable code PUMA
la_age_by_puma <- get_pums(
  variables = c("PUMA", "AGEP"),
  state = "LA",
  survey = "acs1",
  year = 2023
)
la_age_by_puma
# A tibble: 45,026 × 7
   SERIALNO      SPORDER  WGTP PWGTP  AGEP PUMA  STATE
   <chr>           <dbl> <dbl> <dbl> <dbl> <chr> <chr>
 1 2023GQ0000080       1     0    22    35 01400 22   
 2 2023GQ0000129       1     0   138    58 01400 22   
 3 2023GQ0000171       1     0    37    19 00300 22   
 4 2023GQ0000183       1     0    97    22 00400 22   
 5 2023GQ0000200       1     0    62    19 00300 22   
 6 2023GQ0000299       1     0    32    82 02000 22   
 7 2023GQ0000387       1     0    58    44 01800 22   
 8 2023GQ0000401       1     0    40    21 01800 22   
 9 2023GQ0000406       1     0    50    32 01800 22   
10 2023GQ0000477       1     0    26    27 00400 22   
# ℹ 45,016 more rows

Handling uncertainty in tabulated PUMS estimates

Uncertainty in PUMS data

  • PUMS data represent a smaller sample than the regular ACS, so understanding error around tabulated estimates is critical

  • The Census Bureau recommends using successive difference replication to calculate standard errors, and provides replicate weights to do this

  • tidycensus includes tools to help you get replicate weights and format your data for appropriate survey-weighted analysis

Getting replicate weights

  • We can acquire either housing or person replicate weights with the rep_weights argument
la_pums_replicate <- get_pums(
  variables = c("AGEP", "PUMA"),
  state = "LA",
  survey = "acs1",
  year = 2023,
  rep_weights = "person" 
)
la_pums_replicate
# A tibble: 45,026 × 87
   SERIALNO    SPORDER  AGEP PUMA  STATE  WGTP PWGTP PWGTP1 PWGTP2 PWGTP3 PWGTP4
   <chr>         <dbl> <dbl> <chr> <chr> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 2023GQ0000…       1    35 01400 22        0    22      4      5     35     20
 2 2023GQ0000…       1    58 01400 22        0   138    251    322     15     13
 3 2023GQ0000…       1    19 00300 22        0    37      5     34     32     74
 4 2023GQ0000…       1    22 00400 22        0    97      9      9     75    161
 5 2023GQ0000…       1    19 00300 22        0    62    110      4    124    113
 6 2023GQ0000…       1    82 02000 22        0    32     28     27     35     29
 7 2023GQ0000…       1    44 01800 22        0    58     84    106     51     47
 8 2023GQ0000…       1    21 01800 22        0    40      4      4     39     87
 9 2023GQ0000…       1    32 01800 22        0    50     87     13     65     92
10 2023GQ0000…       1    27 00400 22        0    26     20     29      4     25
# ℹ 45,016 more rows
# ℹ 76 more variables: PWGTP5 <dbl>, PWGTP6 <dbl>, PWGTP7 <dbl>, PWGTP8 <dbl>,
#   PWGTP9 <dbl>, PWGTP10 <dbl>, PWGTP11 <dbl>, PWGTP12 <dbl>, PWGTP13 <dbl>,
#   PWGTP14 <dbl>, PWGTP15 <dbl>, PWGTP16 <dbl>, PWGTP17 <dbl>, PWGTP18 <dbl>,
#   PWGTP19 <dbl>, PWGTP20 <dbl>, PWGTP21 <dbl>, PWGTP22 <dbl>, PWGTP23 <dbl>,
#   PWGTP24 <dbl>, PWGTP25 <dbl>, PWGTP26 <dbl>, PWGTP27 <dbl>, PWGTP28 <dbl>,
#   PWGTP29 <dbl>, PWGTP30 <dbl>, PWGTP31 <dbl>, PWGTP32 <dbl>, …

Handling complex survey samples

  • tidycensus links to the survey and srvyr packages for managing PUMS data as complex survey samples

  • The to_survey() function will format your data with replicate weights for correct survey-weighted estimation

la_survey <- to_survey(
  la_pums_replicate,
  type = "person"
)

class(la_survey)
[1] "tbl_svy"       "svyrep.design"

Survey-weighted tabulations

  • srvyr conveniently links R’s survey infrastructure to familiar tidyverse-style workflows

  • Standard errors can be multiplied by 1.645 to get familiar 90% confidence level margins of error

library(srvyr)

la_survey %>%
  filter(AGEP == 41) %>%
  survey_count() %>%
  mutate(n_moe = n_se * 1.645)
# A tibble: 1 × 3
      n  n_se n_moe
  <dbl> <dbl> <dbl>
1 63826 3662. 6024.

Group-wise survey data analysis

  • A group-wise tidyverse workflow can be applied correctly by srvyr for the calculation of medians and other summary statistics
la_survey %>%
  group_by(PUMA) %>%
  summarize(median_age = survey_median(AGEP)) %>%
  mutate(median_age_moe = median_age_se * 1.645)
# A tibble: 35 × 4
   PUMA  median_age median_age_se median_age_moe
   <chr>      <dbl>         <dbl>          <dbl>
 1 00101         39         1.76           2.89 
 2 00102         41         1.26           2.07 
 3 00200         40         0.502          0.826
 4 00300         40         0.754          1.24 
 5 00400         36         0.502          0.826
 6 00500         42         1.00           1.65 
 7 00600         36         1.00           1.65 
 8 00700         36         0.502          0.826
 9 00801         40         1.00           1.65 
10 00901         37         1.00           1.65 
# ℹ 25 more rows

Checking our answers

  • Tabulated median ages are not identical to published estimates, but are very close

  • Use published estimates if available; use PUMS data to generate estimates that aren’t available in the published tables

la_age_puma <- get_acs(
  geography = "puma",
  variables = "B01002_001",
  state = "LA",
  year = 2023,
  survey = "acs1"
)

Thank you!