Analyzing Data from the 2023 American Community Survey in R

Kyle Walker

2025-02-05

About me

SSDAN webinar series

  • Today: Analyzing Data from the 2023 American Community Survey in R

  • Wednesday, February 12th: Working with Decennial Census Data in R

  • Wednesday, February 26th: Mapping and Spatial Analysis with US Census Data in R

Today’s agenda

  • Hour 1: The American Community Survey, R, and tidycensus

  • Hour 2: ACS data workflows

  • Hour 3: An introduction to ACS microdata

The American Community Survey, R, and tidycensus

What is the ACS?

  • Annual survey of 3.5 million US households

  • Covers topics not available in decennial US Census data (e.g. income, education, language, housing characteristics)

  • Available as 1-year estimates (for geographies of population 65,000 and greater) and 5-year estimates (for geographies down to the block group)

  • Data delivered as estimates characterized by margins of error

How to get ACS data

tidycensus

  • R interface to the Decennial Census, American Community Survey, Population Estimates Program, and Public Use Microdata Sample (PUMS) APIs

  • First released in 2017; nearly 600,000 downloads from the Posit CRAN mirror

tidycensus: key features

  • Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested);

  • Automatically downloads and merges Census geometries to data for mapping;

  • Includes tools for handling margins of error in the ACS and working with survey weights in the ACS PUMS;

  • States and counties can be requested by name (no more looking up FIPS codes!)

R and RStudio

  • R: programming language and software environment for data analysis (and wherever else your imagination can take you!)

  • RStudio: integrated development environment (IDE) for R developed by Posit

  • Posit Cloud: run RStudio with today’s workshop pre-configured at https://posit.cloud/content/9689451

Getting started with tidycensus

  • To get started, install the packages you’ll need for today’s workshop

  • If you are using the Posit Cloud environment, these packages are already installed for you

install.packages(c("tidycensus", "tidyverse"))
  • Optional, to run advanced examples:
install.packages(c("mapview", "survey", "srvyr", "arcgislayers"))

Optional: your Census API key

  • tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day

  • Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.

  • Once activated, use the census_api_key() function to set your key as an environment variable

  • As of February 2025, the API key service appears to be unavailable

library(tidycensus)

census_api_key("YOUR KEY GOES HERE", install = TRUE)
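
  • A quick way to confirm the key was stored: with install = TRUE, tidycensus writes the key to your .Renviron file under the name CENSUS_API_KEY, so it should be readable with base R after you restart R. A minimal check, assuming nothing beyond base R:

```r
# Read the key that census_api_key(..., install = TRUE) stores in .Renviron.
# An empty string means no key is visible in the current R session.
key <- Sys.getenv("CENSUS_API_KEY")
has_key <- nzchar(key)

if (has_key) {
  message("Census API key found.")
} else {
  message("No Census API key found; requests will run without a key.")
}
```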

Getting started with ACS data in tidycensus

Using the get_acs() function

  • The get_acs() function is your portal to access ACS data using tidycensus

  • The two required arguments are geography and variables. As of v1.7.1, the function defaults to the 2019-2023 5-year ACS

library(tidycensus)

median_value <- get_acs(
  geography = "county",
  variables = "B25077_001",
  year = 2023
)
  • ACS data are returned with five columns: GEOID, NAME, variable, estimate, and moe
median_value
# A tibble: 3,222 × 5
   GEOID NAME                     variable   estimate   moe
   <chr> <chr>                    <chr>         <dbl> <dbl>
 1 01001 Autauga County, Alabama  B25077_001   197900  8280
 2 01003 Baldwin County, Alabama  B25077_001   287000  6910
 3 01005 Barbour County, Alabama  B25077_001   109900 11374
 4 01007 Bibb County, Alabama     B25077_001   132600 18638
 5 01009 Blount County, Alabama   B25077_001   169700  5354
 6 01011 Bullock County, Alabama  B25077_001    79400 15850
 7 01013 Butler County, Alabama   B25077_001    99700  8484
 8 01015 Calhoun County, Alabama  B25077_001   149500  6196
 9 01017 Chambers County, Alabama B25077_001   129700 12036
10 01019 Cherokee County, Alabama B25077_001   165900  9786
# ℹ 3,212 more rows
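
  • The moe column is the margin of error around each estimate (get_acs() uses a 90 percent confidence level by default). A small sketch of turning it into lower and upper bounds, using the first three rows printed above typed in by hand so it runs without an API call:

```r
# First three rows of the output above, entered by hand.
median_value_head <- data.frame(
  NAME = c("Autauga County, Alabama",
           "Baldwin County, Alabama",
           "Barbour County, Alabama"),
  estimate = c(197900, 287000, 109900),
  moe = c(8280, 6910, 11374)
)

# Lower and upper bounds of the interval implied by the margin of error.
median_value_head$lower <- median_value_head$estimate - median_value_head$moe
median_value_head$upper <- median_value_head$estimate + median_value_head$moe

median_value_head
```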

1-year ACS data

  • 1-year ACS data are more current, but are only available for geographies of population 65,000 and greater

  • Access 1-year ACS data with the argument survey = "acs1"; the default is "acs5"

median_value_1yr <- get_acs(
  geography = "place",
  variables = "B25077_001",
  year = 2023,
  survey = "acs1"
)
median_value_1yr
# A tibble: 649 × 5
   GEOID   NAME                           variable   estimate   moe
   <chr>   <chr>                          <chr>         <dbl> <dbl>
 1 0103076 Auburn city, Alabama           B25077_001   365500 39357
 2 0107000 Birmingham city, Alabama       B25077_001   157200 14921
 3 0121184 Dothan city, Alabama           B25077_001   190000 10687
 4 0135896 Hoover city, Alabama           B25077_001   415300 20389
 5 0137000 Huntsville city, Alabama       B25077_001   309900 15846
 6 0150000 Mobile city, Alabama           B25077_001   189000 20112
 7 0151000 Montgomery city, Alabama       B25077_001   158500  8737
 8 0177256 Tuscaloosa city, Alabama       B25077_001   252200 26811
 9 0203000 Anchorage municipality, Alaska B25077_001   385900 11025
10 0404720 Avondale city, Arizona         B25077_001   404900 17356
# ℹ 639 more rows

Requesting tables of variables

  • The table parameter can be used to obtain all related variables in a “table” at once
income_table <- get_acs(
  geography = "county", 
  table = "B19001", 
  year = 2023
)
income_table
# A tibble: 54,774 × 5
   GEOID NAME                    variable   estimate   moe
   <chr> <chr>                   <chr>         <dbl> <dbl>
 1 01001 Autauga County, Alabama B19001_001    22523   431
 2 01001 Autauga County, Alabama B19001_002      831   221
 3 01001 Autauga County, Alabama B19001_003      676   211
 4 01001 Autauga County, Alabama B19001_004      801   312
 5 01001 Autauga County, Alabama B19001_005      984   281
 6 01001 Autauga County, Alabama B19001_006      983   322
 7 01001 Autauga County, Alabama B19001_007     1057   255
 8 01001 Autauga County, Alabama B19001_008      768   241
 9 01001 Autauga County, Alabama B19001_009      961   266
10 01001 Autauga County, Alabama B19001_010      837   248
# ℹ 54,764 more rows
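
  • Because a table request returns the table total (the _001 variable) alongside its components, counts can be converted to shares. A sketch with dplyr, using the first four Autauga County rows above entered by hand; the column names total and percent are my own choices:

```r
library(dplyr)

# First four rows of the B19001 output above, entered by hand.
autauga_income <- tibble(
  GEOID = "01001",
  variable = c("B19001_001", "B19001_002", "B19001_003", "B19001_004"),
  estimate = c(22523, 831, 676, 801)
)

# B19001_001 is the table total; divide each row by it within each county.
autauga_shares <- autauga_income %>%
  group_by(GEOID) %>%
  mutate(
    total = estimate[variable == "B19001_001"],
    percent = 100 * estimate / total
  ) %>%
  ungroup()

autauga_shares
```

  • Note that this sketch ignores the margins of error; tidycensus includes moe_prop() for deriving a margin of error around a calculated proportion.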

Understanding geography and variables in tidycensus

US Census Geography

Source: US Census Bureau


Geography in tidycensus

Querying by state

  • For geographies available below the state level, the state parameter allows you to query data for a specific state

  • For smaller geographies (Census tracts, block groups), a county can also be requested

  • tidycensus translates state names and postal abbreviations internally, so you don’t need to remember the FIPS codes!

  • Example: data on median home value in San Diego County, California by Census tract

sd_value <- get_acs(
  geography = "tract", 
  variables = "B25077_001", 
  state = "CA", 
  county = "San Diego",
  year = 2023
)
sd_value
# A tibble: 737 × 5
   GEOID       NAME                                     variable estimate    moe
   <chr>       <chr>                                    <chr>       <dbl>  <dbl>
 1 06073000100 Census Tract 1; San Diego County; Calif… B25077_…  1825500 159981
 2 06073000201 Census Tract 2.01; San Diego County; Ca… B25077_…  1445900 203254
 3 06073000202 Census Tract 2.02; San Diego County; Ca… B25077_…   966200 175650
 4 06073000301 Census Tract 3.01; San Diego County; Ca… B25077_…   895600 152793
 5 06073000302 Census Tract 3.02; San Diego County; Ca… B25077_…   820800 132232
 6 06073000400 Census Tract 4; San Diego County; Calif… B25077_…   802500 108414
 7 06073000500 Census Tract 5; San Diego County; Calif… B25077_…  1031800  96156
 8 06073000600 Census Tract 6; San Diego County; Calif… B25077_…   723300  78902
 9 06073000700 Census Tract 7; San Diego County; Calif… B25077_…   833900 107773
10 06073000800 Census Tract 8; San Diego County; Calif… B25077_…   795900 246242
# ℹ 727 more rows
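
  • The GEOID column encodes the geographic hierarchy: for Census tracts, it is the 2-digit state FIPS code, the 3-digit county FIPS code, and the 6-digit tract code pasted together. A base R sketch that splits the first three GEOIDs from the output above:

```r
# Tract GEOIDs from the output above, entered by hand.
geoids <- c("06073000100", "06073000201", "06073000202")

tract_parts <- data.frame(
  GEOID  = geoids,
  state  = substr(geoids, 1, 2),   # 2-digit state FIPS code (06 = California)
  county = substr(geoids, 3, 5),   # 3-digit county FIPS code (073 = San Diego)
  tract  = substr(geoids, 6, 11)   # 6-digit tract code
)

tract_parts
```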

Searching for variables

  • To search for variables, use the load_variables() function along with a year and dataset

  • The View() function in RStudio allows for interactive browsing and filtering

vars <- load_variables(2023, "acs5")

View(vars)
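
  • Outside of View(), the variables table can be searched in code. The sketch below filters on the label and concept columns that load_variables() returns; to stay runnable offline it uses a tiny hand-made stand-in (vars_demo) whose label and concept text is abbreviated for illustration rather than copied from the Census API:

```r
library(dplyr)
library(stringr)

# Hand-made stand-in with the name / label / concept columns that
# load_variables() returns. The text is illustrative, not the real labels.
vars_demo <- tibble(
  name = c("B25077_001", "B19013_001", "B01001_001"),
  label = c("Median value (dollars)", "Median household income", "Total"),
  concept = c("Median Value", "Median Household Income", "Sex by Age")
)

# Case-insensitive search across both text columns.
income_vars <- vars_demo %>%
  filter(
    str_detect(label, regex("income", ignore_case = TRUE)) |
      str_detect(concept, regex("income", ignore_case = TRUE))
  )

income_vars
```

  • With the real table, replace vars_demo with the vars object created above.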

Available ACS datasets in tidycensus

  • Detailed Tables

  • Data Profile (add "/profile" for variable lookup)

  • Subject Tables (add "/subject")

  • Comparison Profile (add "/cprofile")

  • Supplemental Estimates (use "acsse")

  • Migration Flows (access with get_flows())
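
  • The suffixes above are appended to the survey name to form the dataset argument of load_variables(). A sketch, assuming the 5-year ACS as the base; lookup_vars() is a hypothetical helper written for this example, not part of tidycensus:

```r
# Dataset strings for load_variables(), built from the suffixes listed above.
# Using "acs5" as the base is an assumption; swap in "acs1" for 1-year data.
acs_datasets <- c(
  detailed = "acs5",
  profile  = "acs5/profile",
  subject  = "acs5/subject",
  cprofile = "acs5/cprofile"
)

# Hypothetical helper: fetch the variable list for one of the datasets above.
# Calling it requires the tidycensus package and an internet connection.
lookup_vars <- function(dataset, year = 2023) {
  tidycensus::load_variables(year, acs_datasets[[dataset]])
}

# Example (not run here):
# profile_vars <- lookup_vars("profile")
```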

“Tidy” or long-form data

  • The default data structure returned by tidycensus is “tidy” or long-form data, with variables by geography stacked by row
age_sex_table <- get_acs(
  geography = "state", 
  table = "B01001", 
  year = 2023,
  survey = "acs1"
)
age_sex_table
# A tibble: 2,548 × 5
   GEOID NAME    variable   estimate   moe
   <chr> <chr>   <chr>         <dbl> <dbl>
 1 01    Alabama B01001_001  5108468    NA
 2 01    Alabama B01001_002  2471801  5359
 3 01    Alabama B01001_003   143309  3322
 4 01    Alabama B01001_004   154161  6072
 5 01    Alabama B01001_005   169656  6167
 6 01    Alabama B01001_006   104280  3039
 7 01    Alabama B01001_007    73719  3352
 8 01    Alabama B01001_008    33655  3913
 9 01    Alabama B01001_009    32913  3242
10 01    Alabama B01001_010   100629  5077
# ℹ 2,538 more rows

“Wide” data

  • The argument output = "wide" spreads Census variables across the columns, returning one row per geographic unit and one column per variable
age_sex_table_wide <- get_acs(
  geography = "state", 
  table = "B01001", 
  year = 2023,
  survey = "acs1",
  output = "wide" 
)
age_sex_table_wide
# A tibble: 52 × 100
   GEOID NAME        B01001_001E B01001_001M B01001_002E B01001_002M B01001_003E
   <chr> <chr>             <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
 1 01    Alabama         5108468          NA     2471801        5359      143309
 2 02    Alaska           733406          NA      385855        2547       24360
 3 04    Arizona         7431344          NA     3708672        2905      198924
 4 05    Arkansas        3067732          NA     1512201        4594       89183
 5 06    California     38965193          NA    19450698        5477     1067828
 6 08    Colorado        5877610          NA     2980108        4746      159079
 7 09    Connecticut     3617176          NA     1775718        2410       92614
 8 10    Delaware        1031890          NA      498994        1839       26971
 9 11    District o…      678972          NA      321590         265       19516
10 12    Florida        22610726          NA    11116575        6588      577538
# ℹ 42 more rows
# ℹ 93 more variables: B01001_003M <dbl>, B01001_004E <dbl>, B01001_004M <dbl>,
#   B01001_005E <dbl>, B01001_005M <dbl>, B01001_006E <dbl>, B01001_006M <dbl>,
#   B01001_007E <dbl>, B01001_007M <dbl>, B01001_008E <dbl>, B01001_008M <dbl>,
#   B01001_009E <dbl>, B01001_009M <dbl>, B01001_010E <dbl>, B01001_010M <dbl>,
#   B01001_011E <dbl>, B01001_011M <dbl>, B01001_012E <dbl>, B01001_012M <dbl>,
#   B01001_013E <dbl>, B01001_013M <dbl>, B01001_014E <dbl>, …
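
  • If you already have long-form data, the same reshaping can be done after the fact with tidyr. A sketch using two states and two variables from the output above, entered by hand; note that pivot_wider() names the columns estimate_<variable> and moe_<variable> rather than using the E and M suffixes that get_acs() produces:

```r
library(tidyr)

# Two states and two variables from the outputs above, entered by hand.
age_sex_long <- data.frame(
  GEOID = c("01", "01", "02", "02"),
  NAME = c("Alabama", "Alabama", "Alaska", "Alaska"),
  variable = c("B01001_001", "B01001_002", "B01001_001", "B01001_002"),
  estimate = c(5108468, 2471801, 733406, 385855),
  moe = c(NA, 5359, NA, 2547)
)

# One row per state; one estimate column and one moe column per variable.
age_sex_wide <- pivot_wider(
  age_sex_long,
  names_from = variable,
  values_from = c(estimate, moe)
)

age_sex_wide
```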

Using named vectors of variables

  • Census variables can be hard to remember; using a named vector to request variables will replace the Census IDs with a custom input

  • In long form, these custom inputs will populate the variable column; in wide form, they will replace the column names

ca_education <- get_acs(
  geography = "county",
  state = "CA",
  variables = c(percent_high_school = "DP02_0062P", 
                percent_bachelors = "DP02_0065P",
                percent_graduate = "DP02_0066P"), 
  year = 2023
)
ca_education
# A tibble: 174 × 5
   GEOID NAME                       variable            estimate   moe
   <chr> <chr>                      <chr>                  <dbl> <dbl>
 1 06001 Alameda County, California percent_high_school     15.9   0.3
 2 06001 Alameda County, California percent_bachelors       28.7   0.3
 3 06001 Alameda County, California percent_graduate        22.8   0.3
 4 06003 Alpine County, California  percent_high_school     18.5   6.1
 5 06003 Alpine County, California  percent_bachelors       22.4   7.3
 6 06003 Alpine County, California  percent_graduate        19.7   9.3
 7 06005 Amador County, California  percent_high_school     26.5   1.6
 8 06005 Amador County, California  percent_bachelors       15.5   1.5
 9 06005 Amador County, California  percent_graduate         7     1.1
10 06007 Butte County, California   percent_high_school     22.3   0.8
# ℹ 164 more rows
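
  • A named vector is an ordinary character vector in which each value (the Census ID) carries a name (your label). The base R sketch below shows the same lookup done by hand, which can be handy for relabeling data that was fetched with raw IDs; id_to_label and raw_ids are names made up for this example:

```r
# The named vector from the request above: names are the custom labels,
# values are the Census variable IDs.
education_vars <- c(
  percent_high_school = "DP02_0062P",
  percent_bachelors = "DP02_0065P",
  percent_graduate = "DP02_0066P"
)

# Reverse lookup table: Census ID -> custom label.
id_to_label <- setNames(names(education_vars), education_vars)

# Relabel a vector of raw IDs by hand.
raw_ids <- c("DP02_0065P", "DP02_0062P")
relabeled <- unname(id_to_label[raw_ids])
relabeled
```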

Part 1 exercises

  1. Use the load_variables() function to find a variable that interests you that we haven’t used yet.

  2. Use get_acs() to fetch data on that variable from the ACS for counties, similar to how we did for median home value.

Part 2: ACS data workflows

Understanding limitations of the 1-year ACS

  • The 1-year American Community Survey is only available for geographies with population 65,000 and greater. This means:
  • Only 854 of 3,222 counties are available
  • Only 649 of 32,325 cities / Census-designated places are available
  • No data for Census tracts, block groups, ZCTAs, or any other geographies that typically have populations below 65,000

Data sparsity and margins of error

  • You may encounter data issues in the 1-year ACS data that are less pronounced in the 5-year ACS. For example:
  • Values available in the 5-year ACS may not be available in the corresponding 1-year ACS tables

  • If available, they will likely have larger margins of error

  • Your job as an analyst: balance need for certainty vs. need for recency in estimates

Example: Punjabi speakers by state (1-year ACS)

get_acs(
  geography = "state",
  variables = "B16001_054",
  year = 2023,
  survey = "acs1"
)
# A tibble: 52 × 5
   GEOID NAME                 variable   estimate   moe
   <chr> <chr>                <chr>         <dbl> <dbl>
 1 01    Alabama              B16001_054     1149   947
 2 02    Alaska               B16001_054       NA    NA
 3 04    Arizona              B16001_054     2180  1477
 4 05    Arkansas             B16001_054        0   208
 5 06    California           B16001_054   171384 13741
 6 08    Colorado             B16001_054     1997  1973
 7 09    Connecticut          B16001_054     2062  1095
 8 10    Delaware             B16001_054     1643  1814
 9 11    District of Columbia B16001_054       67    74
10 12    Florida              B16001_054     5148  2555
# ℹ 42 more rows

Punjabi speakers by state (5-year ACS)

get_acs(
  geography = "state",
  variables = "B16001_054",
  year = 2023,
  survey = "acs5"
)
# A tibble: 52 × 5
   GEOID NAME                 variable   estimate   moe
   <chr> <chr>                <chr>         <dbl> <dbl>
 1 01    Alabama              B16001_054      704   321
 2 02    Alaska               B16001_054        0    24
 3 04    Arizona              B16001_054     2824   594
 4 05    Arkansas             B16001_054      519   257
 5 06    California           B16001_054   151139  6187
 6 08    Colorado             B16001_054     1406   505
 7 09    Connecticut          B16001_054     1945   529
 8 10    Delaware             B16001_054      236   176
 9 11    District of Columbia B16001_054       39    34
10 12    Florida              B16001_054     2959   764
# ℹ 42 more rows
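
  • One rough way to compare reliability across the two surveys is to express each margin of error as a percentage of its estimate. A sketch with dplyr, using three states from the two outputs above entered by hand; moe_pct is a column name chosen for this example:

```r
library(dplyr)

# California, Florida, and Delaware from the 1-year and 5-year outputs above.
punjabi <- tibble(
  NAME = rep(c("California", "Florida", "Delaware"), times = 2),
  survey = rep(c("acs1", "acs5"), each = 3),
  estimate = c(171384, 5148, 1643, 151139, 2959, 236),
  moe = c(13741, 2555, 1814, 6187, 764, 176)
)

# Margin of error as a percentage of the estimate; smaller is more reliable.
punjabi_reliability <- punjabi %>%
  mutate(moe_pct = 100 * moe / estimate) %>%
  arrange(NAME, survey)

punjabi_reliability
```

  • For all three states the 5-year margin of error is smaller relative to its estimate, which is the certainty vs. recency trade-off described above.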

Visualizing ACS data

Visualizing ACS estimates

  • As opposed to decennial US Census data, ACS estimates include information on uncertainty, represented by the margin of error in the moe column

  • This means that in some cases, visualization of estimates without reference to the margin of error can be misleading

  • Walkthrough: building a margin of error visualization with ggplot2

Visualizing ACS estimates

  • Let’s get some data on median household income by county in Utah
utah_income <- get_acs(
  geography = "county",
  variables = "B19013_001",
  state = "UT",
  year = 2023
) 

A basic plot

  • To visualize a dataset with ggplot2, we define an aesthetic and a geom
library(ggplot2)

utah_plot <- ggplot(utah_income, aes(x = estimate, y = NAME)) + 
  geom_point()
utah_plot

Problems with our basic plot

  • The data are not sorted by value, making comparisons difficult

  • The axis and tick labels are not intuitive

  • The Y-axis labels contain repetitive information (“ County, Utah”)

  • We’ve made no attempt to customize the styling

Sorting by value

  • We use reorder() to sort counties by the value of their ACS estimates, improving legibility
utah_plot <- ggplot(utah_income, aes(x = estimate, 
                                y = reorder(NAME, estimate))) + 
  geom_point(color = "darkblue", size = 2)

Cleaning up tick labels

  • Using a combination of functions in the scales package and custom-defined functions, tick labels can be formatted any way you want
library(scales)
library(stringr)

utah_plot <- utah_plot + 
  scale_x_continuous(labels = label_dollar()) + 
  scale_y_discrete(labels = function(x) str_remove(x, " County, Utah")) 

Improving formatting and theming

  • Use labs() to label the plot and its axes, and change the theme to one of several built-in options
utah_plot <- utah_plot + 
  labs(title = "Median household income, 2019-2023 ACS",
       subtitle = "Counties in Utah",
       caption = "Data acquired with R and tidycensus",
       x = "ACS estimate",
       y = "") + 
  theme_minimal(base_size = 12)

Problem: comparing ACS estimates

  • The chart suggests that Juab County has a lower median household income than Salt Lake County, but its margin of error is quite large
View(utah_income)
  • How to visualize uncertainty in an intuitive way?
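
  • Alongside a chart, a comparison can be checked numerically. The sketch below applies a standard two-estimate z-test: convert each 90 percent margin of error to a standard error by dividing by 1.645, then compare the difference to the combined standard error. acs_differs() is a hypothetical helper written for this example, and the inputs are median home values from the Alabama county output earlier in the workshop rather than the Utah data:

```r
# Hypothetical helper: TRUE when two ACS estimates differ at the 90 percent
# confidence level, the default level for margins of error from get_acs().
acs_differs <- function(est1, moe1, est2, moe2, z_crit = 1.645) {
  se1 <- moe1 / 1.645  # convert a 90 percent MOE to a standard error
  se2 <- moe2 / 1.645
  z <- abs(est1 - est2) / sqrt(se1^2 + se2^2)
  z > z_crit
}

# Bibb County (132600, MOE 18638) vs. Chambers County (129700, MOE 12036)
acs_differs(132600, 18638, 129700, 12036)  # FALSE

# Autauga County (197900, MOE 8280) vs. Baldwin County (287000, MOE 6910)
acs_differs(197900, 8280, 287000, 6910)    # TRUE
```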

Visualizing margins of error

utah_plot_errorbar <- ggplot(utah_income, aes(x = estimate, 
                                        y = reorder(NAME, estimate))) + 
  geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe),
                width = 0.5, linewidth = 0.5) +
  geom_point(color = "darkblue", size = 2) + 
  scale_x_continuous(labels = label_dollar()) + 
  scale_y_discrete(labels = function(x) str_remove(x, " County, Utah")) + 
  labs(title = "Median household income, 2019-2023 ACS",
       subtitle = "Counties in Utah",
       caption = "Data acquired with R and tidycensus. Error bars represent margin of error around estimates.",
       x = "ACS estimate",
       y = "") + 
  theme_minimal(base_size = 12)
utah_plot_errorbar