2025-02-05
Professor of Geography at TCU
Spatial data science researcher and consultant
Package developer: tidycensus, tigris, mapgl, mapboxapi, crsuggest, idbr (R), pygris (Python)
Book: Analyzing US Census Data: Methods, Maps and Models in R
Today: Analyzing Data from the 2023 American Community Survey in R
Wednesday, February 12th: Working with Decennial Census Data in R
Wednesday, February 26th: Mapping and Spatial Analysis with US Census Data in R
Hour 1: The American Community Survey, R, and tidycensus
Hour 2: ACS data workflows
Hour 3: An introduction to ACS microdata
Annual survey of 3.5 million US households
Covers topics not available in decennial US Census data (e.g. income, education, language, housing characteristics)
Available as 1-year estimates (for geographies of population 65,000 and greater) and 5-year estimates (for geographies down to the block group)
Data delivered as estimates characterized by margins of error
data.census.gov is the main, revamped interactive data portal for browsing and downloading Census datasets, including the ACS
The US Census Application Programming Interface (API) allows developers to access Census data resources programmatically
Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested);
Automatically downloads and merges Census geometries to data for mapping;
Includes tools for handling margins of error in the ACS and working with survey weights in the ACS PUMS;
States and counties can be requested by name (no more looking up FIPS codes!)
R: programming language and software environment for data analysis (and wherever else your imagination can take you!)
RStudio: integrated development environment (IDE) for R developed by Posit
Posit Cloud: run RStudio with today’s workshop pre-configured at https://posit.cloud/content/9689451
To get started, install the packages you’ll need for today’s workshop
If you are using the Posit Cloud environment, these packages are already installed for you
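A minimal setup sketch for a local R session. The package list is an assumption based on the tools named in this workshop (tidycensus for data access, the tidyverse for wrangling and plotting); later parts of the workshop may need more.

```r
# Assumed package list, not the official workshop list
install.packages(c("tidycensus", "tidyverse"))

# Load the packages for the current session
library(tidycensus)
library(tidyverse)
```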
tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day
Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.
Once activated, use the census_api_key() function to set your key as an environment variable
As of February 2025, the API key service appears to be unavailable
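A short sketch of setting a key, for whenever you have one. The string below is a placeholder, not a real key.

```r
library(tidycensus)

# Placeholder string; substitute the key from your activation email.
# Adding install = TRUE would also store the key in your .Renviron
# so that future sessions pick it up automatically.
census_api_key("YOUR KEY GOES HERE")
```

Without install = TRUE, the key is set for the current R session only.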
The get_acs() function
The get_acs() function is your portal to access ACS data using tidycensus
The two required arguments are geography and variables. As of v1.7.1, the function defaults to the 2019-2023 5-year ACS
The output has five columns: GEOID, NAME, variable, estimate, and moe
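A sketch of a basic request, assuming county-level data for variable B25077_001 (median home value, the variable used in the examples here) from the 2019-2023 5-year ACS. The object name median_value is illustrative.

```r
library(tidycensus)

# geography and variables are the two required arguments;
# year is stated explicitly rather than relying on the package default
median_value <- get_acs(
  geography = "county",
  variables = "B25077_001",
  year = 2023
)

median_value
```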
# A tibble: 3,222 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01001 Autauga County, Alabama B25077_001 197900 8280
2 01003 Baldwin County, Alabama B25077_001 287000 6910
3 01005 Barbour County, Alabama B25077_001 109900 11374
4 01007 Bibb County, Alabama B25077_001 132600 18638
5 01009 Blount County, Alabama B25077_001 169700 5354
6 01011 Bullock County, Alabama B25077_001 79400 15850
7 01013 Butler County, Alabama B25077_001 99700 8484
8 01015 Calhoun County, Alabama B25077_001 149500 6196
9 01017 Chambers County, Alabama B25077_001 129700 12036
10 01019 Cherokee County, Alabama B25077_001 165900 9786
# ℹ 3,212 more rows
1-year ACS data are more current, but are only available for geographies of population 65,000 and greater
Access 1-year ACS data with the argument survey = "acs1"; the default is "acs5"
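A sketch of a 1-year request, assuming Census places as the geography and the same median home value variable. Only places that meet the 65,000 population threshold should come back.

```r
library(tidycensus)

# survey = "acs1" switches from the default 5-year ACS to the 1-year ACS
median_value_1yr <- get_acs(
  geography = "place",
  variables = "B25077_001",
  year = 2023,
  survey = "acs1"
)

median_value_1yr
```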
# A tibble: 649 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 0103076 Auburn city, Alabama B25077_001 365500 39357
2 0107000 Birmingham city, Alabama B25077_001 157200 14921
3 0121184 Dothan city, Alabama B25077_001 190000 10687
4 0135896 Hoover city, Alabama B25077_001 415300 20389
5 0137000 Huntsville city, Alabama B25077_001 309900 15846
6 0150000 Mobile city, Alabama B25077_001 189000 20112
7 0151000 Montgomery city, Alabama B25077_001 158500 8737
8 0177256 Tuscaloosa city, Alabama B25077_001 252200 26811
9 0203000 Anchorage municipality, Alaska B25077_001 385900 11025
10 0404720 Avondale city, Arizona B25077_001 404900 17356
# ℹ 639 more rows
The table parameter can be used to obtain all related variables in a “table” at once
# A tibble: 54,774 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01001 Autauga County, Alabama B19001_001 22523 431
2 01001 Autauga County, Alabama B19001_002 831 221
3 01001 Autauga County, Alabama B19001_003 676 211
4 01001 Autauga County, Alabama B19001_004 801 312
5 01001 Autauga County, Alabama B19001_005 984 281
6 01001 Autauga County, Alabama B19001_006 983 322
7 01001 Autauga County, Alabama B19001_007 1057 255
8 01001 Autauga County, Alabama B19001_008 768 241
9 01001 Autauga County, Alabama B19001_009 961 266
10 01001 Autauga County, Alabama B19001_010 837 248
# ℹ 54,764 more rows
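A sketch of a whole-table request that would return output of this shape, assuming table B19001 for counties from the 2019-2023 5-year ACS. The object name income_table is illustrative.

```r
library(tidycensus)

# table replaces variables; every variable in B19001 is returned
income_table <- get_acs(
  geography = "county",
  table = "B19001",
  year = 2023
)

income_table
```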
For geographies available below the state level, the state parameter allows you to query data for a specific state
For smaller geographies (Census tracts, block groups), a county can also be requested
tidycensus translates state names and postal abbreviations internally, so you don’t need to remember the FIPS codes!
Example: data on median home value in San Diego County, California by Census tract
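A sketch of that request, assuming the 2019-2023 5-year ACS. The state is given as a postal abbreviation and the county by name, so no FIPS codes are needed.

```r
library(tidycensus)

# Tract-level data requires a state; county narrows the request further
sd_value <- get_acs(
  geography = "tract",
  variables = "B25077_001",
  state = "CA",
  county = "San Diego",
  year = 2023
)

sd_value
```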
# A tibble: 737 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 06073000100 Census Tract 1; San Diego County; Calif… B25077_… 1825500 159981
2 06073000201 Census Tract 2.01; San Diego County; Ca… B25077_… 1445900 203254
3 06073000202 Census Tract 2.02; San Diego County; Ca… B25077_… 966200 175650
4 06073000301 Census Tract 3.01; San Diego County; Ca… B25077_… 895600 152793
5 06073000302 Census Tract 3.02; San Diego County; Ca… B25077_… 820800 132232
6 06073000400 Census Tract 4; San Diego County; Calif… B25077_… 802500 108414
7 06073000500 Census Tract 5; San Diego County; Calif… B25077_… 1031800 96156
8 06073000600 Census Tract 6; San Diego County; Calif… B25077_… 723300 78902
9 06073000700 Census Tract 7; San Diego County; Calif… B25077_… 833900 107773
10 06073000800 Census Tract 8; San Diego County; Calif… B25077_… 795900 246242
# ℹ 727 more rows
To search for variables, use the load_variables() function along with a year and dataset
The View() function in RStudio allows for interactive browsing and filtering
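A sketch of a variable lookup, assuming the 2023 5-year ACS Detailed Tables. The object name vars is illustrative.

```r
library(tidycensus)

# Returns a data frame of variable IDs with their labels and table concepts
vars <- load_variables(2023, "acs5")

# Preview the first rows in the console
head(vars)

# In RStudio, browse and filter interactively with:
# View(vars)
```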
Detailed Tables
Data Profile (add "/profile" for variable lookup)
Subject Tables (add "/subject")
Comparison Profile (add "/cprofile")
Supplemental Estimates (use "acsse")
Migration Flows (access with get_flows())
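A sketch of how the suffixes above might be used in a variable lookup, assuming the 2023 5-year ACS. Only the Data Profile and Subject Tables are shown; the object names are illustrative.

```r
library(tidycensus)

# Data Profile variables: append "/profile" to the dataset name
profile_vars <- load_variables(2023, "acs5/profile")

# Subject Table variables: append "/subject"
subject_vars <- load_variables(2023, "acs5/subject")

head(profile_vars)
head(subject_vars)
```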
# A tibble: 2,548 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama B01001_001 5108468 NA
2 01 Alabama B01001_002 2471801 5359
3 01 Alabama B01001_003 143309 3322
4 01 Alabama B01001_004 154161 6072
5 01 Alabama B01001_005 169656 6167
6 01 Alabama B01001_006 104280 3039
7 01 Alabama B01001_007 73719 3352
8 01 Alabama B01001_008 33655 3913
9 01 Alabama B01001_009 32913 3242
10 01 Alabama B01001_010 100629 5077
# ℹ 2,538 more rows
# A tibble: 52 × 100
GEOID NAME B01001_001E B01001_001M B01001_002E B01001_002M B01001_003E
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 01 Alabama 5108468 NA 2471801 5359 143309
2 02 Alaska 733406 NA 385855 2547 24360
3 04 Arizona 7431344 NA 3708672 2905 198924
4 05 Arkansas 3067732 NA 1512201 4594 89183
5 06 California 38965193 NA 19450698 5477 1067828
6 08 Colorado 5877610 NA 2980108 4746 159079
7 09 Connecticut 3617176 NA 1775718 2410 92614
8 10 Delaware 1031890 NA 498994 1839 26971
9 11 District o… 678972 NA 321590 265 19516
10 12 Florida 22610726 NA 11116575 6588 577538
# ℹ 42 more rows
# ℹ 93 more variables: B01001_003M <dbl>, B01001_004E <dbl>, B01001_004M <dbl>,
# B01001_005E <dbl>, B01001_005M <dbl>, B01001_006E <dbl>, B01001_006M <dbl>,
# B01001_007E <dbl>, B01001_007M <dbl>, B01001_008E <dbl>, B01001_008M <dbl>,
# B01001_009E <dbl>, B01001_009M <dbl>, B01001_010E <dbl>, B01001_010M <dbl>,
# B01001_011E <dbl>, B01001_011M <dbl>, B01001_012E <dbl>, B01001_012M <dbl>,
# B01001_013E <dbl>, B01001_013M <dbl>, B01001_014E <dbl>, …
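The two tables above show the same data in long form and in wide form. A sketch of requests that would return these two shapes, assuming table B01001 by state from the 2023 1-year ACS (the survey choice is an assumption):

```r
library(tidycensus)

# Long ("tidy") form: one row per state and variable
age_sex_long <- get_acs(
  geography = "state",
  table = "B01001",
  year = 2023,
  survey = "acs1"
)

# Wide form: one row per state; estimate columns end in E, margins of error in M
age_sex_wide <- get_acs(
  geography = "state",
  table = "B01001",
  year = 2023,
  survey = "acs1",
  output = "wide"
)
```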
Census variables can be hard to remember; using a named vector to request variables will replace the Census IDs with a custom input
In long form, these custom inputs will populate the variable column; in wide form, they will replace the column names
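A sketch of a named-vector request for California counties. The Data Profile variable IDs are assumptions chosen to match the custom names and should be checked with load_variables() before use.

```r
library(tidycensus)

# Names on the left replace the Census IDs on the right in the output.
# The DP02 IDs are assumed, not confirmed by the workshop text.
ca_education <- get_acs(
  geography = "county",
  state = "CA",
  variables = c(
    percent_high_school = "DP02_0062P",
    percent_bachelors = "DP02_0065P",
    percent_graduate = "DP02_0066P"
  ),
  year = 2023
)

ca_education
```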
# A tibble: 174 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 06001 Alameda County, California percent_high_school 15.9 0.3
2 06001 Alameda County, California percent_bachelors 28.7 0.3
3 06001 Alameda County, California percent_graduate 22.8 0.3
4 06003 Alpine County, California percent_high_school 18.5 6.1
5 06003 Alpine County, California percent_bachelors 22.4 7.3
6 06003 Alpine County, California percent_graduate 19.7 9.3
7 06005 Amador County, California percent_high_school 26.5 1.6
8 06005 Amador County, California percent_bachelors 15.5 1.5
9 06005 Amador County, California percent_graduate 7 1.1
10 06007 Butte County, California percent_high_school 22.3 0.8
# ℹ 164 more rows
Use the load_variables() function to find a variable that interests you that we haven’t used yet.
Use get_acs() to fetch data on that variable from the ACS for counties, similar to how we did for median household income.
Values available in the 5-year ACS may not be available in the corresponding 1-year ACS tables
If available, they will likely have larger margins of error
Your job as an analyst: balance need for certainty vs. need for recency in estimates
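A sketch of the comparison, requesting the same variable by state from the 1-year and 5-year ACS. The object names are illustrative.

```r
library(tidycensus)

# 1-year ACS: more current, but values may be missing (NA) and
# margins of error are typically larger
est_1yr <- get_acs(
  geography = "state",
  variables = "B16001_054",
  year = 2023,
  survey = "acs1"
)

# 5-year ACS: pooled over 2019-2023, with smaller margins of error
est_5yr <- get_acs(
  geography = "state",
  variables = "B16001_054",
  year = 2023,
  survey = "acs5"
)
```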
# A tibble: 52 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama B16001_054 1149 947
2 02 Alaska B16001_054 NA NA
3 04 Arizona B16001_054 2180 1477
4 05 Arkansas B16001_054 0 208
5 06 California B16001_054 171384 13741
6 08 Colorado B16001_054 1997 1973
7 09 Connecticut B16001_054 2062 1095
8 10 Delaware B16001_054 1643 1814
9 11 District of Columbia B16001_054 67 74
10 12 Florida B16001_054 5148 2555
# ℹ 42 more rows
# A tibble: 52 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama B16001_054 704 321
2 02 Alaska B16001_054 0 24
3 04 Arizona B16001_054 2824 594
4 05 Arkansas B16001_054 519 257
5 06 California B16001_054 151139 6187
6 08 Colorado B16001_054 1406 505
7 09 Connecticut B16001_054 1945 529
8 10 Delaware B16001_054 236 176
9 11 District of Columbia B16001_054 39 34
10 12 Florida B16001_054 2959 764
# ℹ 42 more rows
As opposed to decennial US Census data, ACS estimates include information on uncertainty, represented by the margin of error in the moe column
This means that in some cases, visualization of estimates without reference to the margin of error can be misleading
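A small illustration of why, using two rows copied from the county median home value table shown earlier. Bibb County has the higher estimate, but the ranges implied by the margins of error overlap, so the ranking is not certain.

```r
# Estimates and margins of error copied from the county table shown earlier
homes <- data.frame(
  NAME = c("Bibb County, Alabama", "Chambers County, Alabama"),
  estimate = c(132600, 129700),
  moe = c(18638, 12036)
)

# Range implied by each margin of error: estimate minus and plus the MOE
homes$lower <- homes$estimate - homes$moe
homes$upper <- homes$estimate + homes$moe

# Two ranges overlap when each lower bound sits at or below the other's upper bound
ranges_overlap <- homes$lower[1] <= homes$upper[2] &&
  homes$lower[2] <= homes$upper[1]

homes
ranges_overlap
```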
Walkthrough: building a margin of error visualization with ggplot2
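A sketch of a starting point for the walkthrough: fetch median household income for Utah counties, then draw a basic dot plot. The variable ID B19013_001 and the object names are assumptions.

```r
library(tidycensus)
library(ggplot2)

# Median household income by county in Utah, 2019-2023 5-year ACS
utah_income <- get_acs(
  geography = "county",
  variables = "B19013_001",
  state = "UT",
  year = 2023
)

# A first, unstyled plot: one point per county
utah_plot <- ggplot(utah_income, aes(x = estimate, y = NAME)) +
  geom_point()

utah_plot
```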
The data are not sorted by value, making comparisons difficult
The axis and tick labels are not intuitive
The Y-axis labels contain repetitive information (" County, Utah")
We’ve made no attempt to customize the styling
Use reorder() to sort counties by the value of their ACS estimates, improving legibility
Use labs() to label the plot and its axes, and change the theme to one of several built-in options
library(tidyverse)  # ggplot2 for plotting, stringr for str_remove()
library(scales)     # label_dollar()

# Assumes utah_income holds county-level ACS estimates for Utah
# with NAME, estimate, and moe columns
utah_plot_errorbar <- ggplot(utah_income, aes(x = estimate,
                                              y = reorder(NAME, estimate))) +
  # Horizontal error bars spanning estimate minus and plus the margin of error
  geom_errorbar(aes(xmin = estimate - moe, xmax = estimate + moe),
                width = 0.5, linewidth = 0.5) +
  geom_point(color = "darkblue", size = 2) +
  scale_x_continuous(labels = label_dollar()) +
  # Drop the repeated " County, Utah" suffix from the axis labels
  scale_y_discrete(labels = function(x) str_remove(x, " County, Utah")) +
  labs(title = "Median household income, 2019-2023 ACS",
       subtitle = "Counties in Utah",
       caption = "Data acquired with R and tidycensus. Error bars represent margin of error around estimates.",
       x = "ACS estimate",
       y = "") +
  theme_minimal(base_size = 12)