Working with Decennial Census Data in R

2025 SSDAN Webinar Series

Kyle Walker

2025-02-12

About me

SSDAN webinar series

Today’s agenda

  • Hour 1: Getting started with 2020 Decennial US Census data in R

  • Hour 2: Analysis workflows with 2020 Census data

  • Hour 3: The detailed DHC data files and time-series analysis

Getting started with 2020 Decennial US Census data in R

R and RStudio

  • R: programming language and software environment for data analysis (and wherever else your imagination can take you!)

  • RStudio: integrated development environment (IDE) for R developed by Posit

  • Posit Cloud: run RStudio with today’s workshop pre-configured at https://posit.cloud/content/9689451

What is the decennial US Census?

  • Complete count of the US population mandated by Article 1, Sections 2 and 9 of the US Constitution

  • Directed by the US Census Bureau (US Department of Commerce); conducted every 10 years since 1790

  • Used for proportional representation / congressional redistricting

  • Limited set of questions asked about race, ethnicity, age, sex, and housing tenure

The 2020 US Census: available datasets

Available datasets from the 2020 US Census include:

  • The PL 94-171 Redistricting Data
  • The Demographic and Housing Characteristics (DHC) file
  • The Demographic Profile (for pre-tabulated variables)
  • Tabulations for the 118th Congress & for Island Areas
  • The Detailed DHC-A file (with very detailed racial & ethnic categories)
  • The Detailed DHC-B file (with housing characteristics for those detailed categories)

How to get decennial Census data

tidycensus

  • R interface to the Decennial Census, American Community Survey, Population Estimates Program, and Public Use Microdata Sample (PUMS) APIs

  • First released in 2017; over 600,000 downloads from the Posit CRAN mirror

tidycensus: key features

  • Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested);

  • Automatically downloads and merges Census geometries to data for mapping;

  • Includes a variety of analytic tools to support common Census workflows;

  • States and counties can be requested by name (no more looking up FIPS codes!)

Getting started with tidycensus

  • To get started, install the packages you’ll need for today’s workshop

  • If you are using the Posit Cloud environment, these packages are already installed for you

install.packages(c("tidycensus", "tidyverse", "mapview"))

Optional: your Census API key

  • tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day

  • Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.

  • Once activated, use the census_api_key() function to set your key as an environment variable

library(tidycensus)

census_api_key("YOUR KEY GOES HERE", install = TRUE)
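
  • With install = TRUE, the key is saved to your .Renviron file for future sessions; to use it right away without restarting R, you can reload that file with base R's readRenviron():

readRenviron("~/.Renviron")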

Getting started with Census data in tidycensus

2020 Census data in tidycensus

  • The get_decennial() function is used to acquire data from the decennial US Census

  • The two required arguments are geography and variables; for 2020 Census data, use year = 2020.

pop20 <- get_decennial(
  geography = "state",
  variables = "P1_001N",
  year = 2020
)
  • Decennial Census data are returned with four columns: GEOID, NAME, variable, and value
pop20
# A tibble: 52 × 4
   GEOID NAME                 variable    value
   <chr> <chr>                <chr>       <dbl>
 1 42    Pennsylvania         P1_001N  13002700
 2 06    California           P1_001N  39538223
 3 54    West Virginia        P1_001N   1793716
 4 49    Utah                 P1_001N   3271616
 5 36    New York             P1_001N  20201249
 6 11    District of Columbia P1_001N    689545
 7 02    Alaska               P1_001N    733391
 8 12    Florida              P1_001N  21538187
 9 45    South Carolina       P1_001N   5118425
10 38    North Dakota         P1_001N    779094
# ℹ 42 more rows

Understanding the printed messages

  • When we run get_decennial() for the 2020 Census for the first time, we see the following messages:
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data summary file
Note: 2020 decennial Census data use differential privacy, a technique that
introduces errors into data to preserve respondent confidentiality.
ℹ Small counts should be interpreted with caution.
ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
This message is displayed once per session.

Understanding the printed messages

  • The Census Bureau is using differential privacy in an attempt to preserve respondent confidentiality in the 2020 Census data, which is required under US Code Title 13

  • Intentional errors are introduced into data, impacting the accuracy of small area counts (e.g. some blocks with children, but no adults)

  • Advocates argue that differential privacy is necessary to satisfy Title 13 requirements given modern database reconstruction technologies; critics contend that the method makes data less useful with no tangible privacy benefit

Requesting tables of variables

  • The table parameter can be used to obtain all related variables in a “table” at once
table_p2 <- get_decennial(
  geography = "state", 
  table = "P2", 
  year = 2020
)
table_p2
# A tibble: 3,796 × 4
   GEOID NAME                 variable    value
   <chr> <chr>                <chr>       <dbl>
 1 42    Pennsylvania         P2_001N  13002700
 2 06    California           P2_001N  39538223
 3 54    West Virginia        P2_001N   1793716
 4 49    Utah                 P2_001N   3271616
 5 36    New York             P2_001N  20201249
 6 11    District of Columbia P2_001N    689545
 7 02    Alaska               P2_001N    733391
 8 12    Florida              P2_001N  21538187
 9 45    South Carolina       P2_001N   5118425
10 38    North Dakota         P2_001N    779094
# ℹ 3,786 more rows

US Census Geography

[Figures: US Census geographic hierarchy diagrams. Source: US Census Bureau]

Geography in tidycensus

  • Information on available geographies, and how to specify them, can be found in the tidycensus documentation

  • The 2020 Census allows you to get data down to the Census block (unlike the ACS, covered last week)


Querying by state

  • For geographies available below the state level, the state parameter allows you to query data for a specific state

  • tidycensus translates state names and postal abbreviations internally, so you don’t need to remember the FIPS codes!

  • Example: total population in Missouri by county

mo_population <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  state = "MO",
  sumfile = "dhc",
  year = 2020
)
mo_population
# A tibble: 115 × 4
   GEOID NAME                       variable  value
   <chr> <chr>                      <chr>     <dbl>
 1 29001 Adair County, Missouri     P1_001N   25314
 2 29003 Andrew County, Missouri    P1_001N   18135
 3 29005 Atchison County, Missouri  P1_001N    5305
 4 29007 Audrain County, Missouri   P1_001N   24962
 5 29009 Barry County, Missouri     P1_001N   34534
 6 29011 Barton County, Missouri    P1_001N   11637
 7 29013 Bates County, Missouri     P1_001N   16042
 8 29015 Benton County, Missouri    P1_001N   19394
 9 29017 Bollinger County, Missouri P1_001N   10567
10 29019 Boone County, Missouri     P1_001N  183610
# ℹ 105 more rows

Querying by state and county

  • County names are also translated internally by tidycensus for sub-county queries, e.g. for Census tracts, block groups, and blocks
stl_blocks <- get_decennial(
  geography = "block",
  variables = "P1_001N",
  state = "MO",
  county = "St. Louis city",
  sumfile = "dhc",
  year = 2020
)
stl_blocks
# A tibble: 8,057 × 4
   GEOID           NAME                                           variable value
   <chr>           <chr>                                          <chr>    <dbl>
 1 295101156001003 Block 1003, Block Group 1, Census Tract 1156,… P1_001N     18
 2 295101156001004 Block 1004, Block Group 1, Census Tract 1156,… P1_001N      0
 3 295101156001005 Block 1005, Block Group 1, Census Tract 1156,… P1_001N     60
 4 295101156001006 Block 1006, Block Group 1, Census Tract 1156,… P1_001N     68
 5 295101156001007 Block 1007, Block Group 1, Census Tract 1156,… P1_001N    566
 6 295101156001008 Block 1008, Block Group 1, Census Tract 1156,… P1_001N      0
 7 295101164004000 Block 4000, Block Group 4, Census Tract 1164,… P1_001N     52
 8 295101164003019 Block 3019, Block Group 3, Census Tract 1164,… P1_001N     51
 9 295101164003020 Block 3020, Block Group 3, Census Tract 1164,… P1_001N    114
10 295101164003021 Block 3021, Block Group 3, Census Tract 1164,… P1_001N     35
# ℹ 8,047 more rows

Searching for variables

  • To search for variables, use the load_variables() function along with a year and dataset

  • The View() function in RStudio allows for interactive browsing and filtering

vars <- load_variables(2020, "dhc")

View(vars)
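
  • If you prefer a programmatic search, the tibble returned by load_variables() can also be filtered with tidyverse tools; a minimal sketch (using "AGE" purely as an illustrative search term):

library(tidyverse)

age_vars <- filter(vars, str_detect(concept, "AGE"))

age_vars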

Available decennial Census datasets in tidycensus

The different datasets in the 2020 Census are accessible by specifying a sumfile in get_decennial(). The datasets we’ll cover today include (besides the default PL 94-171 Redistricting Data):

  • The DHC data (sumfile = "dhc")
  • The Demographic Profile (sumfile = "dp")
  • The CD118 data (sumfile = "cd118")
  • The Detailed DHC-A data (sumfile = "ddhca")
  • The Detailed DHC-B data (sumfile = "ddhcb")
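
  • For example, a quick sketch of switching summary files with the CD118 tabulation (assuming the total population variable P1_001N is carried over from the PL tables into that file):

cd118_pop <- get_decennial(
  geography = "congressional district",
  variables = "P1_001N",
  sumfile = "cd118",
  year = 2020
)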

Data structure in tidycensus

“Tidy” or long-form data

  • The default data structure returned by tidycensus is “tidy” or long-form data, with variables by geography stacked by row
single_year_age <- get_decennial(
  geography = "state",
  table = "PCT12",
  year = 2020,
  sumfile = "dhc"
)
single_year_age
# A tibble: 10,868 × 4
   GEOID NAME                 variable      value
   <chr> <chr>                <chr>         <dbl>
 1 09    Connecticut          PCT12_001N  3605944
 2 10    Delaware             PCT12_001N   989948
 3 11    District of Columbia PCT12_001N   689545
 4 12    Florida              PCT12_001N 21538187
 5 13    Georgia              PCT12_001N 10711908
 6 15    Hawaii               PCT12_001N  1455271
 7 16    Idaho                PCT12_001N  1839106
 8 17    Illinois             PCT12_001N 12812508
 9 18    Indiana              PCT12_001N  6785528
10 19    Iowa                 PCT12_001N  3190369
# ℹ 10,858 more rows

“Wide” data

  • The argument output = "wide" spreads Census variables across the columns, returning one row per geographic unit and one column per variable
single_year_age_wide <- get_decennial(
  geography = "state",
  table = "PCT12",
  year = 2020,
  sumfile = "dhc",
  output = "wide" 
)
single_year_age_wide
# A tibble: 52 × 211
   GEOID NAME  PCT12_001N PCT12_002N PCT12_003N PCT12_004N PCT12_005N PCT12_006N
   <chr> <chr>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
 1 09    Conn…    3605944    1749853      16946      17489      17974      18571
 2 10    Dela…     989948     476719       4847       5045       5239       5427
 3 11    Dist…     689545     322777       4116       3730       3620       3723
 4 12    Flor…   21538187   10464234      98241     100543     104962     108555
 5 13    Geor…   10711908    5188570      58614      59817      62719      64271
 6 15    Hawa…    1455271     727844       7757       7565       7701       8303
 7 16    Idaho    1839106     919196      11029      10953      11558      12056
 8 17    Illi…   12812508    6283130      68115      69120      71823      74165
 9 18    Indi…    6785528    3344660      39308      40232      41971      43310
10 19    Iowa     3190369    1586092      18038      18566      19349      19864
# ℹ 42 more rows
# ℹ 203 more variables: PCT12_007N <dbl>, PCT12_008N <dbl>, PCT12_009N <dbl>,
#   PCT12_010N <dbl>, PCT12_011N <dbl>, PCT12_012N <dbl>, PCT12_013N <dbl>,
#   PCT12_014N <dbl>, PCT12_015N <dbl>, PCT12_016N <dbl>, PCT12_017N <dbl>,
#   PCT12_018N <dbl>, PCT12_019N <dbl>, PCT12_020N <dbl>, PCT12_021N <dbl>,
#   PCT12_022N <dbl>, PCT12_023N <dbl>, PCT12_024N <dbl>, PCT12_025N <dbl>,
#   PCT12_026N <dbl>, PCT12_027N <dbl>, PCT12_028N <dbl>, PCT12_029N <dbl>, …

Using named vectors of variables

  • Census variable IDs can be hard to remember; requesting variables with a named vector replaces the Census IDs with your custom names in the output

  • In long form, these custom inputs will populate the variable column; in wide form, they will replace the column names

fl_samesex <- get_decennial(
  geography = "county",
  state = "FL",
  variables = c(married = "DP1_0116P",
                partnered = "DP1_0118P"),
  year = 2020,
  sumfile = "dp",
  output = "wide"
)
fl_samesex
# A tibble: 67 × 4
   GEOID NAME                      married partnered
   <chr> <chr>                       <dbl>     <dbl>
 1 12001 Alachua County, Florida       0.3       0.2
 2 12003 Baker County, Florida         0.1       0.1
 3 12005 Bay County, Florida           0.2       0.2
 4 12007 Bradford County, Florida      0.1       0.1
 5 12009 Brevard County, Florida       0.2       0.1
 6 12011 Broward County, Florida       0.4       0.3
 7 12013 Calhoun County, Florida       0.1       0.1
 8 12015 Charlotte County, Florida     0.2       0.1
 9 12017 Citrus County, Florida        0.2       0.1
10 12019 Clay County, Florida          0.2       0.1
# ℹ 57 more rows

Part 1 exercises

  1. Use load_variables(2020, "dhc") to find a variable that interests you from the Demographic and Housing Characteristics file.

  2. Use get_decennial() to fetch data on that variable from the decennial US Census for counties in a state of your choosing.
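
  • A starter template you can adapt (the state and variable below are placeholders; substitute your own choices from exercise 1):

my_exercise_data <- get_decennial(
  geography = "county",
  variables = "P1_001N",  # swap in the DHC variable you found
  state = "MI",           # swap in your chosen state
  sumfile = "dhc",
  year = 2020
)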

Part 2: Analysis workflows with 2020 Census data

The tidyverse

library(tidyverse)

tidyverse_logo()
⬢ __  _    __   .    ⬡           ⬢  . 
 / /_(_)__/ /_ ___  _____ _______ ___ 
/ __/ / _  / // / |/ / -_) __(_-</ -_)
\__/_/\_,_/\_, /|___/\__/_/ /___/\__/ 
     ⬢  . /___/      ⬡      .       ⬢ 
  • The tidyverse: an integrated set of packages developed primarily by Hadley Wickham and the Posit team

tidycensus and the tidyverse

  • Census data are commonly used in wide format, with categories spread across the columns

  • tidyverse tools work better with data that are in “tidy”, or long format; this format is returned by tidycensus by default

  • Goal: return data “ready to go” for use with tidyverse tools

Exploring 2020 Census data with tidyverse tools

Finding the largest values

  • dplyr’s arrange() function sorts data based on values in one or more columns, and filter() helps you query data based on column values

  • Example: what are the largest and smallest counties in Missouri by population?

library(tidycensus)
library(tidyverse)

mo_population <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  year = 2020,
  state = "MO",
  sumfile = "dhc"
)
arrange(mo_population, value)
# A tibble: 115 × 4
   GEOID NAME                      variable value
   <chr> <chr>                     <chr>    <dbl>
 1 29227 Worth County, Missouri    P1_001N   1973
 2 29129 Mercer County, Missouri   P1_001N   3538
 3 29103 Knox County, Missouri     P1_001N   3744
 4 29197 Schuyler County, Missouri P1_001N   4032
 5 29087 Holt County, Missouri     P1_001N   4223
 6 29171 Putnam County, Missouri   P1_001N   4681
 7 29199 Scotland County, Missouri P1_001N   4716
 8 29035 Carter County, Missouri   P1_001N   5202
 9 29005 Atchison County, Missouri P1_001N   5305
10 29211 Sullivan County, Missouri P1_001N   5999
# ℹ 105 more rows
arrange(mo_population, desc(value))
# A tibble: 115 × 4
   GEOID NAME                         variable   value
   <chr> <chr>                        <chr>      <dbl>
 1 29189 St. Louis County, Missouri   P1_001N  1004125
 2 29095 Jackson County, Missouri     P1_001N   717204
 3 29183 St. Charles County, Missouri P1_001N   405262
 4 29510 St. Louis city, Missouri     P1_001N   301578
 5 29077 Greene County, Missouri      P1_001N   298915
 6 29047 Clay County, Missouri        P1_001N   253335
 7 29099 Jefferson County, Missouri   P1_001N   226739
 8 29019 Boone County, Missouri       P1_001N   183610
 9 29097 Jasper County, Missouri      P1_001N   122761
10 29037 Cass County, Missouri        P1_001N   107824
# ℹ 105 more rows

What are the counties with a population below 1,000?

  • The filter() function subsets data according to a specified condition, much like a SQL query
below1000 <- filter(mo_population, value < 1000)

below1000
# A tibble: 0 × 4
# ℹ 4 variables: GEOID <chr>, NAME <chr>, variable <chr>, value <dbl>
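
  • filter() accepts multiple conditions at once, which are combined with a logical AND; a short sketch with the same Missouri data:

between_1k_5k <- filter(mo_population, value >= 1000, value < 5000)

between_1k_5k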

Using summary variables

  • Many decennial Census and ACS variables are organized in tables in which the first variable represents a summary variable, or denominator for the others

  • The parameter summary_var can be used to generate a new column in long-form data for a requested denominator, which works well for normalizing estimates

Using summary variables

race_vars <- c(
  Hispanic = "P5_010N",
  White = "P5_003N",
  Black = "P5_004N",
  Native = "P5_005N",
  Asian = "P5_006N",
  HIPI = "P5_007N"
)

cbsa_race <- get_decennial(
  geography = "cbsa",
  variables = race_vars,
  summary_var = "P5_001N", 
  year = 2020,
  sumfile = "dhc"
)
cbsa_race
# A tibble: 5,634 × 5
   GEOID NAME                                      variable  value summary_value
   <chr> <chr>                                     <chr>     <dbl>         <dbl>
 1 49420 Yakima, WA Metro Area                     Hispanic 130049        256728
 2 49460 Yankton, SD Micro Area                    Hispanic   1234         23310
 3 49500 Yauco, PR Metro Area                      Hispanic  85665         86142
 4 49620 York-Hanover, PA Metro Area               Hispanic  39360        456438
 5 49660 Youngstown-Warren-Boardman, OH-PA Metro … Hispanic  19881        541243
 6 49700 Yuba City, CA Metro Area                  Hispanic  55088        181208
 7 49740 Yuma, AZ Metro Area                       Hispanic 130003        203881
 8 49780 Zanesville, OH Micro Area                 Hispanic   1055         86410
 9 49820 Zapata, TX Micro Area                     Hispanic  12999         13889
10 48300 Wenatchee, WA Metro Area                  Hispanic  36741        122012
# ℹ 5,624 more rows

Normalizing columns with mutate()

  • dplyr’s mutate() function is used to calculate new columns in your data; the select() function keeps or drops columns by name

  • In a tidyverse workflow, these steps are commonly linked using the pipe operator (%>%) from the magrittr package

cbsa_race_percent <- cbsa_race %>%
  mutate(percent = 100 * (value / summary_value)) %>% 
  select(NAME, variable, percent) 
cbsa_race_percent
# A tibble: 5,634 × 3
   NAME                                         variable percent
   <chr>                                        <chr>      <dbl>
 1 Yakima, WA Metro Area                        Hispanic   50.7 
 2 Yankton, SD Micro Area                       Hispanic    5.29
 3 Yauco, PR Metro Area                         Hispanic   99.4 
 4 York-Hanover, PA Metro Area                  Hispanic    8.62
 5 Youngstown-Warren-Boardman, OH-PA Metro Area Hispanic    3.67
 6 Yuba City, CA Metro Area                     Hispanic   30.4 
 7 Yuma, AZ Metro Area                          Hispanic   63.8 
 8 Zanesville, OH Micro Area                    Hispanic    1.22
 9 Zapata, TX Micro Area                        Hispanic   93.6 
10 Wenatchee, WA Metro Area                     Hispanic   30.1 
# ℹ 5,624 more rows

Group-wise Census data analysis

  • The group_by() and summarize() functions in dplyr are used to implement the split-apply-combine method of data analysis

  • The default “tidy” format returned by tidycensus is designed to work well with group-wise Census data analysis workflows

What is the largest group by CBSA?

largest_group <- cbsa_race_percent %>%
  group_by(NAME) %>% 
  filter(percent == max(percent)) 

# Optionally, use `.by`: 
# largest_group <- cbsa_race_percent %>%
#   filter(percent == max(percent), .by = NAME) 
largest_group
# A tibble: 939 × 3
# Groups:   NAME [939]
   NAME                      variable percent
   <chr>                     <chr>      <dbl>
 1 Yakima, WA Metro Area     Hispanic    50.7
 2 Yauco, PR Metro Area      Hispanic    99.4
 3 Yuma, AZ Metro Area       Hispanic    63.8
 4 Zapata, TX Micro Area     Hispanic    93.6
 5 Del Rio, TX Micro Area    Hispanic    80.3
 6 Deming, NM Micro Area     Hispanic    65.6
 7 Dodge City, KS Micro Area Hispanic    57.4
 8 Dumas, TX Micro Area      Hispanic    59.2
 9 Eagle Pass, TX Micro Area Hispanic    94.9
10 El Centro, CA Metro Area  Hispanic    85.2
# ℹ 929 more rows
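
  • An equivalent approach uses dplyr’s slice_max(), which keeps the top row (or rows, in the case of ties) per group; a minimal sketch:

largest_group_alt <- cbsa_race_percent %>%
  group_by(NAME) %>%
  slice_max(percent, n = 1)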

What are the median percentages by group?

cbsa_race_percent %>%
  group_by(variable) %>% 
  summarize(median_pct = median(percent, na.rm = TRUE)) 
# A tibble: 6 × 2
  variable median_pct
  <chr>         <dbl>
1 Asian        1.03  
2 Black        3.61  
3 HIPI         0.0349
4 Hispanic     6.97  
5 Native       0.287 
6 White       75.2   

Exploring maps of Census data

“Spatial” Census data

  • One of the best features of tidycensus is the argument geometry = TRUE, which gets you the correct Census geometries with no hassle

  • get_decennial() with geometry = TRUE returns a spatial Census dataset containing simple feature geometries; learn more on February 26

  • Let’s take a look at some examples

“Spatial” Census data

  • geometry = TRUE does the hard work for you of acquiring and pre-joining spatial Census data

  • Consider using the Demographic Profile for pre-tabulated percentages

wv_over_65 <- get_decennial(
  geography = "tract",
  variables = "DP1_0024P",
  state = "WV",
  geometry = TRUE,
  sumfile = "dp",
  year = 2020
)
  • We get back a simple features data frame (more about this on February 26)
wv_over_65
Simple feature collection with 546 features and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -82.64474 ymin: 37.20148 xmax: -77.71952 ymax: 40.6388
Geodetic CRS:  NAD83
# A tibble: 546 × 5
   GEOID       NAME                     variable value                  geometry
   <chr>       <chr>                    <chr>    <dbl>             <POLYGON [°]>
 1 54001965700 Census Tract 9657; Barb… DP1_002…  22.3 ((-80.07606 39.1016, -80…
 2 54029020700 Census Tract 207; Hanco… DP1_002…  22.4 ((-80.57509 40.41198, -8…
 3 54069000200 Census Tract 2; Ohio Co… DP1_002…  22.3 ((-80.71044 40.12813, -8…
 4 54011001400 Census Tract 14; Cabell… DP1_002…  15.1 ((-82.43736 38.41681, -8…
 5 54099005100 Census Tract 51; Wayne … DP1_002…  23.3 ((-82.54137 38.39631, -8…
 6 54099005200 Census Tract 52; Wayne … DP1_002…  21.8 ((-82.55603 38.39246, -8…
 7 54059957200 Census Tract 9572; Ming… DP1_002…  20.3 ((-82.35141 37.78997, -8…
 8 54071970600 Census Tract 9706; Pend… DP1_002…  29.2 ((-79.47433 38.46251, -7…
 9 54083965900 Census Tract 9659; Rand… DP1_002…  20.1 ((-80.26471 38.71209, -8…
10 54085962400 Census Tract 9624; Ritc… DP1_002…  25   ((-81.32879 39.15258, -8…
# ℹ 536 more rows

Exploring spatial data

  • Mapping, GIS, and spatial data are the subject of our February 26 workshop, so be sure to check that out!

  • Even before we dive deeper into spatial data, it is very useful to be able to explore your results on an interactive map

  • Our solution: mapview()

Exploring spatial data

library(mapview)

mapview(wv_over_65)
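
  • Optionally, use mapview's zcol argument to color the tracts by a column in the data; here "value" is the percentage column returned by get_decennial():

mapview(wv_over_65, zcol = "value")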