Analyzing 2020 Decennial US Census Data in R

2024 SSDAN Webinar Series

Kyle Walker

2024-02-22

About me

Associate Professor of Geography at TCU
Spatial data science researcher and consultant
Package developer: tidycensus, tigris, mapboxapi, crsuggest, idbr (R), pygris (Python)
Book: Analyzing US Census Data: Methods, Maps and Models in R

SSDAN webinar series

Thurday, February 8: Working with the 2022 American Community Survey with R and tidycensus
Today: Analyzing 2020 Decennial US Census Data in R
Thursday, March 7th: Doing “GIS” and making maps with US Census Data in R

Today’s agenda

Hour 1: Getting started with 2020 Decennial US Census data in R
Hour 2: Analysis workflows with 2020 Census data
Hour 3: The detailed DHC-A data and time-series analysis

Getting started with 2020 Decennial US Census data in R

R and RStudio

R: programming language and software environment for data analysis (and wherever else your imagination can take you!)
RStudio: integrated development environment (IDE) for R developed by Posit
Posit Cloud: run RStudio with today’s workshop pre-configured at https://posit.cloud/content/7549022

What is the decennial US Census?

Complete count of the US population mandated by Article 1, Sections 2 and 9 in the US Constitution
Directed by the US Census Bureau (US Department of Commerce); conducted every 10 years since 1790
Used for proportional representation / congressional redistricting
Limited set of questions asked about race, ethnicity, age, sex, and housing tenure

The 2020 US Census: available datasets

Available datasets from the 2020 US Census include:

The PL 94-171 Redistricting Data
The Demographic and Housing Characteristics (DHC) file
The Demographic Profile (for pre-tabulated variables)
Tabulations for the 118th Congress & for Island Areas
The Detailed DHC-A file (with very detailed racial & ethnic categories)

How to get decennial Census data

data.census.gov is the main, revamped interactive data portal for browsing and downloading Census datasets
The US Census Application Programming Interface (API) allows developers to access Census data resources programmatically

tidycensus

R interface to the Decennial Census, American Community Survey, Population Estimates Program, and Public Use Microdata Series APIs
First released in 2017; nearly 500,000 downloads from the Posit CRAN mirror

tidycensus: key features

Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested);
Automatically downloads and merges Census geometries to data for mapping;
Includes a variety of analytic tools to support common Census workflows;
States and counties can be requested by name (no more looking up FIPS codes!)

Getting started with tidycensus

To get started, install the packages you’ll need for today’s workshop
If you are using the Posit Cloud environment, these packages are already installed for you

install.packages(c("tidycensus", "tidyverse", "mapview"))

Optional: your Census API key

tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day
Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.
Once activated, use the census_api_key() function to set your key as an environment variable

library(tidycensus)

census_api_key("YOUR KEY GOES HERE", install = TRUE)

Getting started with Census data in tidycensus

2020 Census data in tidycensus

The get_decennial() function is used to acquire data from the decennial US Census
The two required arguments are geography and variables for the functions to work; for 2020 Census data, use year = 2020.

pop20 <- get_decennial(
  geography = "state",
  variables = "P1_001N",
  year = 2020
)

Decennial Census data are returned with four columns: GEOID, NAME, variable, and value

pop20

# A tibble: 52 × 4
   GEOID NAME                 variable    value
   <chr> <chr>                <chr>       <dbl>
 1 42    Pennsylvania         P1_001N  13002700
 2 06    California           P1_001N  39538223
 3 54    West Virginia        P1_001N   1793716
 4 49    Utah                 P1_001N   3271616
 5 36    New York             P1_001N  20201249
 6 11    District of Columbia P1_001N    689545
 7 02    Alaska               P1_001N    733391
 8 12    Florida              P1_001N  21538187
 9 45    South Carolina       P1_001N   5118425
10 38    North Dakota         P1_001N    779094
# ℹ 42 more rows

Understanding the printed messages

When we run get_decennial() for the 2020 Census for the first time, we see the following messages:

Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data summary file
Note: 2020 decennial Census data use differential privacy, a technique that
introduces errors into data to preserve respondent confidentiality.
ℹ Small counts should be interpreted with caution.
ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
This message is displayed once per session.

Understanding the printed messages

The Census Bureau is using differential privacy in an attempt to preserve respondent confidentiality in the 2020 Census data, which is required under US Code Title 13
Intentional errors are introduced into data, impacting the accuracy of small area counts (e.g. some blocks with children, but no adults)
Advocates argue that differential privacy is necessary to satisfy Title 13 requirements given modern database reconstruction technologies; critics contend that the method makes data less useful with no tangible privacy benefit

Requesting tables of variables

The table parameter can be used to obtain all related variables in a “table” at once

table_p2 <- get_decennial(
  geography = "state", 
  table = "P2", 
  year = 2020
)

table_p2

# A tibble: 3,796 × 4
   GEOID NAME                 variable    value
   <chr> <chr>                <chr>       <dbl>
 1 42    Pennsylvania         P2_001N  13002700
 2 06    California           P2_001N  39538223
 3 54    West Virginia        P2_001N   1793716
 4 49    Utah                 P2_001N   3271616
 5 36    New York             P2_001N  20201249
 6 11    District of Columbia P2_001N    689545
 7 02    Alaska               P2_001N    733391
 8 12    Florida              P2_001N  21538187
 9 45    South Carolina       P2_001N   5118425
10 38    North Dakota         P2_001N    779094
# ℹ 3,786 more rows

US Census Geography

Geography in tidycensus

Information on available geographies, and how to specify them, can be found in the tidycensus documentation
The 2020 Census allows you to get data down to the Census block (unlike the ACS, covered last week)

{width: 400}

Querying by state

For geographies available below the state level, the state parameter allows you to query data for a specific state
tidycensus translates state names and postal abbreviations internally, so you don’t need to remember the FIPS codes!
Example: total population in Texas by county

tx_population <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  state = "TX",
  sumfile = "dhc",
  year = 2020
)

tx_population

# A tibble: 254 × 4
   GEOID NAME                     variable  value
   <chr> <chr>                    <chr>     <dbl>
 1 48239 Jackson County, Texas    P1_001N   14988
 2 48233 Hutchinson County, Texas P1_001N   20617
 3 48235 Irion County, Texas      P1_001N    1513
 4 48237 Jack County, Texas       P1_001N    8472
 5 48241 Jasper County, Texas     P1_001N   32980
 6 48243 Jeff Davis County, Texas P1_001N    1996
 7 48245 Jefferson County, Texas  P1_001N  256526
 8 48247 Jim Hogg County, Texas   P1_001N    4838
 9 48249 Jim Wells County, Texas  P1_001N   38891
10 48251 Johnson County, Texas    P1_001N  179927
# ℹ 244 more rows

Querying by state and county

County names are also translated internally by tidycensus for sub-county queries, e.g. for Census tracts, block groups, and blocks

matagorda_blocks <- get_decennial(
  geography = "block",
  variables = "P1_001N",
  state = "TX",
  county = "Matagorda",
  sumfile = "dhc",
  year = 2020
)

matagorda_blocks

# A tibble: 2,620 × 4
   GEOID           NAME                                           variable value
   <chr>           <chr>                                          <chr>    <dbl>
 1 483217303013065 Block 3065, Block Group 3, Census Tract 7303.… P1_001N     12
 2 483217303013062 Block 3062, Block Group 3, Census Tract 7303.… P1_001N      6
 3 483217303013063 Block 3063, Block Group 3, Census Tract 7303.… P1_001N     30
 4 483217303013064 Block 3064, Block Group 3, Census Tract 7303.… P1_001N     31
 5 483217303013066 Block 3066, Block Group 3, Census Tract 7303.… P1_001N     14
 6 483217303013067 Block 3067, Block Group 3, Census Tract 7303.… P1_001N     19
 7 483217303013068 Block 3068, Block Group 3, Census Tract 7303.… P1_001N     19
 8 483217303013069 Block 3069, Block Group 3, Census Tract 7303.… P1_001N     15
 9 483217303013070 Block 3070, Block Group 3, Census Tract 7303.… P1_001N     24
10 483217303013071 Block 3071, Block Group 3, Census Tract 7303.… P1_001N      3
# ℹ 2,610 more rows

Searching for variables

To search for variables, use the load_variables() function along with a year and dataset
The View() function in RStudio allows for interactive browsing and filtering

vars <- load_variables(2020, "dhc")

View(vars)

Available decennial Census datasets in tidycensus

The different datasets in the 2020 Census are accessible by specifying a sumfile in get_decennial(). The datasets we’ll cover today include:

The DHC data (sumfile = "dhc")
The Demographic Profile (sumfile = "dp")
The CD118 data (sumfile = "cd118)
The Detailed DHC-A data (sumfile = "ddhca")

Data structure in tidycensus

“Tidy” or long-form data

The default data structure returned by tidycensus is “tidy” or long-form data, with variables by geography stacked by row

single_year_age <- get_decennial(
  geography = "state",
  table = "PCT12",
  year = 2020,
  sumfile = "dhc"
)

single_year_age

# A tibble: 10,868 × 4
   GEOID NAME                 variable      value
   <chr> <chr>                <chr>         <dbl>
 1 09    Connecticut          PCT12_001N  3605944
 2 10    Delaware             PCT12_001N   989948
 3 11    District of Columbia PCT12_001N   689545
 4 12    Florida              PCT12_001N 21538187
 5 13    Georgia              PCT12_001N 10711908
 6 15    Hawaii               PCT12_001N  1455271
 7 16    Idaho                PCT12_001N  1839106
 8 17    Illinois             PCT12_001N 12812508
 9 18    Indiana              PCT12_001N  6785528
10 19    Iowa                 PCT12_001N  3190369
# ℹ 10,858 more rows

“Wide” data

The argument output = "wide" spreads Census variables across the columns, returning one row per geographic unit and one column per variable

single_year_age_wide <- get_decennial(
  geography = "state",
  table = "PCT12",
  year = 2020,
  sumfile = "dhc",
  output = "wide" 
)

single_year_age_wide

# A tibble: 52 × 211
   GEOID NAME  PCT12_001N PCT12_002N PCT12_003N PCT12_004N PCT12_005N PCT12_006N
   <chr> <chr>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
 1 09    Conn…    3605944    1749853      16946      17489      17974      18571
 2 10    Dela…     989948     476719       4847       5045       5239       5427
 3 11    Dist…     689545     322777       4116       3730       3620       3723
 4 12    Flor…   21538187   10464234      98241     100543     104962     108555
 5 13    Geor…   10711908    5188570      58614      59817      62719      64271
 6 15    Hawa…    1455271     727844       7757       7565       7701       8303
 7 16    Idaho    1839106     919196      11029      10953      11558      12056
 8 17    Illi…   12812508    6283130      68115      69120      71823      74165
 9 18    Indi…    6785528    3344660      39308      40232      41971      43310
10 19    Iowa     3190369    1586092      18038      18566      19349      19864
# ℹ 42 more rows
# ℹ 203 more variables: PCT12_007N <dbl>, PCT12_008N <dbl>, PCT12_009N <dbl>,
#   PCT12_010N <dbl>, PCT12_011N <dbl>, PCT12_012N <dbl>, PCT12_013N <dbl>,
#   PCT12_014N <dbl>, PCT12_015N <dbl>, PCT12_016N <dbl>, PCT12_017N <dbl>,
#   PCT12_018N <dbl>, PCT12_019N <dbl>, PCT12_020N <dbl>, PCT12_021N <dbl>,
#   PCT12_022N <dbl>, PCT12_023N <dbl>, PCT12_024N <dbl>, PCT12_025N <dbl>,
#   PCT12_026N <dbl>, PCT12_027N <dbl>, PCT12_028N <dbl>, PCT12_029N <dbl>, …

Using named vectors of variables

Census variables can be hard to remember; using a named vector to request variables will replace the Census IDs with a custom input
In long form, these custom inputs will populate the variable column; in wide form, they will replace the column names

ca_samesex <- get_decennial(
  geography = "county",
  state = "CA",
  variables = c(married = "DP1_0116P",
                partnered = "DP1_0118P"),
  year = 2020,
  sumfile = "dp",
  output = "wide"
)

ca_samesex

# A tibble: 58 × 4
   GEOID NAME                            married partnered
   <chr> <chr>                             <dbl>     <dbl>
 1 06001 Alameda County, California          0.4       0.2
 2 06003 Alpine County, California           0.3       0.2
 3 06005 Amador County, California           0.2       0.1
 4 06007 Butte County, California            0.2       0.1
 5 06009 Calaveras County, California        0.2       0.1
 6 06011 Colusa County, California           0.1       0  
 7 06013 Contra Costa County, California     0.3       0.1
 8 06015 Del Norte County, California        0.1       0.1
 9 06017 El Dorado County, California        0.2       0.1
10 06019 Fresno County, California           0.2       0.1
# ℹ 48 more rows

Part 1 exercises

Use load_variables(2020, "dhc") to find a variable that interests you from the Demographic and Housing Characteristics file.
Use get_decennial() to fetch data on that variable from the decennial US Census for counties in a state of your choosing.

Part 2: Analysis workflows with 2020 Census data

The tidyverse

library(tidyverse)

tidyverse_logo()

⬢ __  _    __   .    ⬡           ⬢  . 
 / /_(_)__/ /_ ___  _____ _______ ___ 
/ __/ / _  / // / |/ / -_) __(_-</ -_)
\__/_/\_,_/\_, /|___/\__/_/ /___/\__/ 
     ⬢  . /___/      ⬡      .       ⬢

The tidyverse: an integrated set of packages developed primarily by Hadley Wickham and the Posit team

tidycensus and the tidyverse

Census data are commonly used in wide format, with categories spread across the columns
tidyverse tools work better with data that are in “tidy”, or long format; this format is returned by tidycensus by default
Goal: return data “ready to go” for use with tidyverse tools

Exploring 2020 Census data with tidyverse tools

Finding the largest values

dplyr’s arrange() function sorts data based on values in one or more columns, and filter() helps you query data based on column values
Example: what are the largest and smallest counties in Texas by population?

library(tidycensus)
library(tidyverse)

tx_population <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  year = 2020,
  state = "TX",
  sumfile = "dhc"
)

arrange(tx_population, value)

# A tibble: 254 × 4
   GEOID NAME                   variable value
   <chr> <chr>                  <chr>    <dbl>
 1 48301 Loving County, Texas   P1_001N     64
 2 48269 King County, Texas     P1_001N    265
 3 48261 Kenedy County, Texas   P1_001N    350
 4 48311 McMullen County, Texas P1_001N    600
 5 48033 Borden County, Texas   P1_001N    631
 6 48263 Kent County, Texas     P1_001N    753
 7 48443 Terrell County, Texas  P1_001N    760
 8 48393 Roberts County, Texas  P1_001N    827
 9 48345 Motley County, Texas   P1_001N   1063
10 48155 Foard County, Texas    P1_001N   1095
# ℹ 244 more rows

arrange(tx_population, desc(value))

# A tibble: 254 × 4
   GEOID NAME                    variable   value
   <chr> <chr>                   <chr>      <dbl>
 1 48201 Harris County, Texas    P1_001N  4731145
 2 48113 Dallas County, Texas    P1_001N  2613539
 3 48439 Tarrant County, Texas   P1_001N  2110640
 4 48029 Bexar County, Texas     P1_001N  2009324
 5 48453 Travis County, Texas    P1_001N  1290188
 6 48085 Collin County, Texas    P1_001N  1064465
 7 48121 Denton County, Texas    P1_001N   906422
 8 48215 Hidalgo County, Texas   P1_001N   870781
 9 48141 El Paso County, Texas   P1_001N   865657
10 48157 Fort Bend County, Texas P1_001N   822779
# ℹ 244 more rows

What are the counties with a population below 1,000?

The filter() function subsets data according to a specified condition, much like a SQL query

below1000 <- filter(tx_population, value < 1000)

below1000

# A tibble: 8 × 4
  GEOID NAME                   variable value
  <chr> <chr>                  <chr>    <dbl>
1 48261 Kenedy County, Texas   P1_001N    350
2 48263 Kent County, Texas     P1_001N    753
3 48269 King County, Texas     P1_001N    265
4 48301 Loving County, Texas   P1_001N     64
5 48311 McMullen County, Texas P1_001N    600
6 48393 Roberts County, Texas  P1_001N    827
7 48443 Terrell County, Texas  P1_001N    760
8 48033 Borden County, Texas   P1_001N    631

Using summary variables

Many decennial Census and ACS variables are organized in tables in which the first variable represents a summary variable, or denominator for the others
The parameter summary_var can be used to generate a new column in long-form data for a requested denominator, which works well for normalizing estimates

Using summary variables

race_vars <- c(
  Hispanic = "P5_010N",
  White = "P5_003N",
  Black = "P5_004N",
  Native = "P5_005N",
  Asian = "P5_006N",
  HIPI = "P5_007N"
)

cd_race <- get_decennial(
  geography = "congressional district",
  variables = race_vars,
  summary_var = "P5_001N", 
  year = 2020,
  sumfile = "cd118"
)

cd_race

# A tibble: 2,640 × 5
   GEOID NAME                                      variable  value summary_value
   <chr> <chr>                                     <chr>     <dbl>         <dbl>
 1 0101  Congressional District 1 (118th Congress… Hispanic  27196        717754
 2 0102  Congressional District 2 (118th Congress… Hispanic  31708        717755
 3 0103  Congressional District 3 (118th Congress… Hispanic  26849        717754
 4 0104  Congressional District 4 (118th Congress… Hispanic  53874        717754
 5 0105  Congressional District 5 (118th Congress… Hispanic  47307        717754
 6 0106  Congressional District 6 (118th Congress… Hispanic  46872        717754
 7 0107  Congressional District 7 (118th Congress… Hispanic  30241        717754
 8 0200  Congressional District (at Large) (118th… Hispanic  49824        733391
 9 0401  Congressional District 1 (118th Congress… Hispanic 129929        794611
10 0402  Congressional District 2 (118th Congress… Hispanic 133934        794612
# ℹ 2,630 more rows

Normalizing columns with `mutate()`

dplyr’s mutate() function is used to calculate new columns in your data; the select() column can keep or drop columns by name
In a tidyverse workflow, these steps are commonly linked using the pipe operator (%>%) from the magrittr package

cd_race_percent <- cd_race %>%
  mutate(percent = 100 * (value / summary_value)) %>% 
  select(NAME, variable, percent)

cd_race_percent

# A tibble: 2,640 × 3
   NAME                                                       variable percent
   <chr>                                                      <chr>      <dbl>
 1 Congressional District 1 (118th Congress), Alabama         Hispanic    3.79
 2 Congressional District 2 (118th Congress), Alabama         Hispanic    4.42
 3 Congressional District 3 (118th Congress), Alabama         Hispanic    3.74
 4 Congressional District 4 (118th Congress), Alabama         Hispanic    7.51
 5 Congressional District 5 (118th Congress), Alabama         Hispanic    6.59
 6 Congressional District 6 (118th Congress), Alabama         Hispanic    6.53
 7 Congressional District 7 (118th Congress), Alabama         Hispanic    4.21
 8 Congressional District (at Large) (118th Congress), Alaska Hispanic    6.79
 9 Congressional District 1 (118th Congress), Arizona         Hispanic   16.4 
10 Congressional District 2 (118th Congress), Arizona         Hispanic   16.9 
# ℹ 2,630 more rows

Group-wise Census data analysis

The group_by() and summarize() functions in dplyr are used to implement the split-apply-combine method of data analysis
The default “tidy” format returned by tidycensus is designed to work well with group-wise Census data analysis workflows

What is the largest group by congressional district?

largest_group <- cd_race_percent %>%
  group_by(NAME) %>% 
  filter(percent == max(percent)) 

# Optionally, use `.by`: 
# largest_group <- cd_race_percent %>%
#   filter(percent == max(percent), .by = NAME)

largest_group

# A tibble: 437 × 3
# Groups:   NAME [437]
   NAME                                                   variable percent
   <chr>                                                  <chr>      <dbl>
 1 Congressional District 3 (118th Congress), Arizona     Hispanic    62.6
 2 Congressional District 7 (118th Congress), Arizona     Hispanic    59.8
 3 Congressional District 8 (118th Congress), California  Hispanic    35.2
 4 Congressional District 9 (118th Congress), California  Hispanic    41.5
 5 Congressional District 13 (118th Congress), California Hispanic    65.1
 6 Congressional District 18 (118th Congress), California Hispanic    65.3
 7 Congressional District 21 (118th Congress), California Hispanic    64.3
 8 Congressional District 22 (118th Congress), California Hispanic    73.2
 9 Congressional District 23 (118th Congress), California Hispanic    41.6
10 Congressional District 25 (118th Congress), California Hispanic    64.8
# ℹ 427 more rows

What are the median percentages by group?

cd_race_percent %>%
  group_by(variable) %>% 
  summarize(median_pct = median(percent, na.rm = TRUE))

# A tibble: 6 × 2
  variable median_pct
  <chr>         <dbl>
1 Asian        3.40  
2 Black        7.49  
3 HIPI         0.0468
4 Hispanic    12.8   
5 Native       0.235 
6 White       61.4

Exploring maps of Census data

“Spatial” Census data

One of the best features of tidycensus is the argument geometry = TRUE, which gets you the correct Census geometries with no hassle
get_decennial() with geometry = TRUE returns a spatial Census dataset containing simple feature geometries; learn more on March 7
Let’s take a look at some examples

“Spatial” Census data

geometry = TRUE does the hard work for you of acquiring and pre-joining spatial Census data
Consider using the Demographic Profile for pre-tabulated percentages

iowa_over_65 <- get_decennial(
  geography = "tract",
  variables = "DP1_0024P",
  state = "IA",
  geometry = TRUE,
  sumfile = "dp",
  year = 2020
)

We get back a simple features data frame (more about this on March 7)

iowa_over_65

Simple feature collection with 896 features and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -96.63862 ymin: 40.37566 xmax: -90.14006 ymax: 43.5012
Geodetic CRS:  NAD83
# A tibble: 896 × 5
   GEOID       NAME                     variable value                  geometry
   <chr>       <chr>                    <chr>    <dbl>             <POLYGON [°]>
 1 19163010600 Census Tract 106; Scott… DP1_002…   7.1 ((-90.57555 41.52573, -9…
 2 19163011700 Census Tract 117; Scott… DP1_002…  13.1 ((-90.5766 41.55272, -90…
 3 19163013500 Census Tract 135; Scott… DP1_002…  17.2 ((-90.52247 41.53852, -9…
 4 19057000300 Census Tract 3; Des Moi… DP1_002…  23.1 ((-91.15105 40.8279, -91…
 5 19145490300 Census Tract 4903; Page… DP1_002…  26.2 ((-95.38534 40.74262, -9…
 6 19105070100 Census Tract 701; Jones… DP1_002…  20.2 ((-91.36461 42.15154, -9…
 7 19103001200 Census Tract 12; Johnso… DP1_002…  17.6 ((-91.52284 41.66125, -9…
 8 19123950400 Census Tract 9504; Maha… DP1_002…  21.4 ((-92.64872 41.33564, -9…
 9 19169000300 Census Tract 3; Story C… DP1_002…  24.9 ((-93.63794 42.05642, -9…
10 19187000400 Census Tract 4; Webster… DP1_002…  12.7 ((-94.19502 42.50798, -9…
# ℹ 886 more rows

Exploring spatial data

Mapping, GIS, and spatial data is the subject of our March 7 workshop - so be sure to check that out!
Even before we dive deeper into spatial data, it is very useful to be able to explore your results on an interactive map
Our solution: mapview()

Exploring spatial data

library(mapview)

mapview(iowa_over_65)

Creating a shaded map with `zcol`

mapview(iowa_over_65, zcol = "value")

Customizing your mapview output

mapview(iowa_over_65, zcol = "value",
        layer.name = "% age 65 and up<br>Census tracts in Iowa")

Customizing your mapview output

library(viridisLite)

mapview(iowa_over_65, zcol = "value",
        layer.name = "% age 65 and up<br>Census tracts in Iowa",
        col.regions = inferno(100))

Saving and using interactive maps

Use the saveWidget() function over the map slot of your mapview map to save out a standalone HTML file, which you can embed in websites

library(htmlwidgets)

m1 <- mapview(iowa_over_65, zcol = "value",
        layer.name = "% age 65 and up<br>Census tracts in Iowa",
        col.regions = inferno(100))

saveWidget(m1@map, "iowa_over_65.html")

Part 2 exercise

Try making an interactive map of a different variable from the Demographic Profile (use load_variables(2020, "dp") to look them up) for a different state, or state / county combination.

Part 3: The Detailed DHC-A File and time-series analysis

The 2020 Decennial Census Detailed DHC-A File

The Detailed DHC-A File

Tabulation of 2020 Decennial Census results for population by sex and age
Key feature: break-outs for thousands of racial and ethnic groups

Limitations of the DDHC-A File

An “adaptive design” is used, meaning that data for different groups / geographies may be found in different tables
There is considerable sparsity in the data, especially when going down to the Census tract level

Using the DDHC-A File in tidycensus

You’ll query the DDHC-A file with the argument sumfile = "ddhca" in get_decennial()
A new argument, pop_group, is required to use the DDHC-A; it takes a population group code.
Use pop_group = "all" to query for all groups; set pop_group_label = TRUE to return the label for the population group
Look up variables with load_variables(2020, "ddhca")

Example usage of the DDHC-A File

mn_population_groups <- get_decennial(
  geography = "state",
  variables = "T01001_001N",
  state = "MN",
  year = 2020,
  sumfile = "ddhca",
  pop_group = "all",
  pop_group_label = TRUE
)

mn_population_groups

# A tibble: 2,996 × 6
   GEOID NAME      pop_group pop_group_label   variable      value
   <chr> <chr>     <chr>     <chr>             <chr>         <dbl>
 1 27    Minnesota 1002      European alone    T01001_001N 3162905
 2 27    Minnesota 1003      Albanian alone    T01001_001N     512
 3 27    Minnesota 1004      Alsatian alone    T01001_001N      27
 4 27    Minnesota 1005      Andorran alone    T01001_001N      NA
 5 27    Minnesota 1006      Armenian alone    T01001_001N     605
 6 27    Minnesota 1007      Austrian alone    T01001_001N    2552
 7 27    Minnesota 1008      Azerbaijani alone T01001_001N     103
 8 27    Minnesota 1009      Basque alone      T01001_001N      52
 9 27    Minnesota 1010      Belarusian alone  T01001_001N    1579
10 27    Minnesota 1011      Belgian alone     T01001_001N    3864
# ℹ 2,986 more rows

Looking up group codes

A new function, get_pop_groups(), helps you look up population group codes
It works for SF2/SF4 in 2000 and SF2 in 2010 as well!

available_groups <- get_pop_groups(2020, "ddhca")

Understanding sparsity in the DDHC-A File

The DDHC-A File uses an “adaptive design” that makes certain tables available for specific geographies

You may see this error…

get_decennial(
  geography = "county",
  variables = "T02001_001N",
  state = "MN",
  county = "Hennepin",
  pop_group = "1325",
  year = 2020,
  sumfile = "ddhca"
)

Error in `get_decennial()`:
! Error in load_data_decennial(geography, variables, key, year, sumfile,  : 
  Your DDHC-A request returned No Content from the API.
ℹ The DDHC-A file uses an 'adaptive design' where data availability varies by geography and by population group.
ℹ Read Section 3-1 at https://www2.census.gov/programs-surveys/decennial/2020/technical-documentation/complete-tech-docs/detailed-demographic-and-housing-characteristics-file-a/2020census-detailed-dhc-a-techdoc.pdf for more information.
ℹ In tidycensus, use the function `check_ddhca_groups()` to see if your data is available.

How to check for data availability

A new function, check_ddhca_groups(), can be used to see which tables to use for the data you want

check_ddhca_groups(
  geography = "county", 
  pop_group = "1325", 
  state = "MN", 
  county = "Hennepin"
)

Mapping DDHC-A data

Given data sparsity in the DDHC-A data, should you make maps with it?
I’m not personally a fan of mapping data that are geographically sparse. But…

I think it is OK to map DDHC-A data if you think through the data limitations in your map design

Example: Somali populations by Census tract in Minneapolis

library(tidycensus)

hennepin_somali <- get_decennial(
  geography = "tract",
  variables = "T01001_001N",
  state = "MN",
  county = "Hennepin",
  year = 2020,
  sumfile = "ddhca",
  pop_group = "1325",
  pop_group_label = TRUE,
  geometry = TRUE
)

mapview(hennepin_somali, zcol = "value")

Alternative approach: dot-density mapping

I don’t think choropleth maps are advisable with geographically incomplete data in most cases
Other map types - like graduated symbols or dot-density maps - may be more appropriate
The tidycensus function as_dot_density() allows you to specify the number of people represented in each dot, which means you can represent data-suppressed areas as 0 more confidently

somali_dots <- as_dot_density(
  hennepin_somali,
  value = "value",
  values_per_dot = 25
)

mapview(somali_dots, cex = 0.01, layer.name = "Somali population<br>1 dot = 25 people",
        col.regions = "navy", color = "navy")

Time-series analysis

How have areas changed since the 2010 Census?

A common use-case for the 2020 decennial Census data is to assess change over time
For example: which areas have experienced the most population growth, and which have experienced the steepest declines?
tidycensus allows users to access the 2000 and 2010 decennial Census data for comparison, though variable IDs will differ

Getting data from the 2010 Census

county_pop_10 <- get_decennial(
  geography = "county",
  variables = "P001001", 
  year = 2010,
  sumfile = "sf1"
)

county_pop_10

# A tibble: 3,221 × 4
   GEOID NAME                        variable  value
   <chr> <chr>                       <chr>     <dbl>
 1 05131 Sebastian County, Arkansas  P001001  125744
 2 05133 Sevier County, Arkansas     P001001   17058
 3 05135 Sharp County, Arkansas      P001001   17264
 4 05137 Stone County, Arkansas      P001001   12394
 5 05139 Union County, Arkansas      P001001   41639
 6 05141 Van Buren County, Arkansas  P001001   17295
 7 05143 Washington County, Arkansas P001001  203065
 8 05145 White County, Arkansas      P001001   77076
 9 05149 Yell County, Arkansas       P001001   22185
10 06011 Colusa County, California   P001001   21419
# ℹ 3,211 more rows

Cleanup before joining

The select() function can both subset datasets by column and rename columns, “cleaning up” a dataset before joining to another dataset

county_pop_10_clean <- county_pop_10 %>%
  select(GEOID, value10 = value) 

county_pop_10_clean

# A tibble: 3,221 × 2
   GEOID value10
   <chr>   <dbl>
 1 05131  125744
 2 05133   17058
 3 05135   17264
 4 05137   12394
 5 05139   41639
 6 05141   17295
 7 05143  203065
 8 05145   77076
 9 05149   22185
10 06011   21419
# ℹ 3,211 more rows

Joining data

In dplyr, joins are implemented with the *_join() family of functions

county_pop_20 <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  year = 2020,
  sumfile = "dhc"
) %>%
  select(GEOID, NAME, value20 = value)

county_joined <- county_pop_20 %>%
  left_join(county_pop_10_clean, by = "GEOID")

county_joined

# A tibble: 3,221 × 4
   GEOID NAME                         value20 value10
   <chr> <chr>                          <dbl>   <dbl>
 1 06039 Madera County, California     156255  150865
 2 06041 Marin County, California      262321  252409
 3 06043 Mariposa County, California    17131   18251
 4 06045 Mendocino County, California   91601   87841
 5 06047 Merced County, California     281202  255793
 6 06049 Modoc County, California        8700    9686
 7 06051 Mono County, California        13195   14202
 8 06053 Monterey County, California   439035  415057
 9 06055 Napa County, California       138019  136484
10 06057 Nevada County, California     102241   98764
# ℹ 3,211 more rows

Calculating change

dplyr’s mutate() function can be used to calculate new columns, allowing for assessment of change over time

county_change <- county_joined %>%
  mutate( 
    total_change = value20 - value10, 
    percent_change = 100 * (total_change / value10) 
  )

county_change

# A tibble: 3,221 × 6
   GEOID NAME                        value20 value10 total_change percent_change
   <chr> <chr>                         <dbl>   <dbl>        <dbl>          <dbl>
 1 06039 Madera County, California    156255  150865         5390           3.57
 2 06041 Marin County, California     262321  252409         9912           3.93
 3 06043 Mariposa County, California   17131   18251        -1120          -6.14
 4 06045 Mendocino County, Californ…   91601   87841         3760           4.28
 5 06047 Merced County, California    281202  255793        25409           9.93
 6 06049 Modoc County, California       8700    9686         -986         -10.2 
 7 06051 Mono County, California       13195   14202        -1007          -7.09
 8 06053 Monterey County, California  439035  415057        23978           5.78
 9 06055 Napa County, California      138019  136484         1535           1.12
10 06057 Nevada County, California    102241   98764         3477           3.52
# ℹ 3,211 more rows

Caveat: changing geographies!

County names and boundaries can change from year to year, introducing potential problems in time-series analysis
This is particularly acute for small geographies like Census tracts & block groups, which we’ll cover on March 7!

filter(county_change, is.na(value10))

# A tibble: 4 × 6
  GEOID NAME                         value20 value10 total_change percent_change
  <chr> <chr>                          <dbl>   <dbl>        <dbl>          <dbl>
1 02063 Chugach Census Area, Alaska     7102      NA           NA             NA
2 02066 Copper River Census Area, A…    2617      NA           NA             NA
3 02158 Kusilvak Census Area, Alaska    8368      NA           NA             NA
4 46102 Oglala Lakota County, South…   13672      NA           NA             NA

Analyzing 2020 Decennial US Census Data in R

About me

SSDAN webinar series

Today’s agenda

Getting started with 2020 Decennial US Census data in R

R and RStudio

What is the decennial US Census?

The 2020 US Census: available datasets

How to get decennial Census data

tidycensus

tidycensus: key features

Getting started with tidycensus

Optional: your Census API key

Getting started with Census data in tidycensus

2020 Census data in tidycensus

Understanding the printed messages

Understanding the printed messages

Requesting tables of variables

US Census Geography

Geography in tidycensus

Querying by state

Querying by state and county

Searching for variables

Available decennial Census datasets in tidycensus

Data structure in tidycensus

“Tidy” or long-form data

“Wide” data

Using named vectors of variables

Part 1 exercises

Part 2: Analysis workflows with 2020 Census data

The tidyverse

tidycensus and the tidyverse

Exploring 2020 Census data with tidyverse tools

Finding the largest values

What are the counties with a population below 1,000?

Using summary variables

Using summary variables

Normalizing columns with mutate()

Group-wise Census data analysis

What is the largest group by congressional district?

What are the median percentages by group?

Exploring maps of Census data

“Spatial” Census data

“Spatial” Census data

Exploring spatial data

Exploring spatial data

Creating a shaded map with zcol

Customizing your mapview output

Customizing your mapview output

Saving and using interactive maps

Part 2 exercise

Part 3: The Detailed DHC-A File and time-series analysis

The 2020 Decennial Census Detailed DHC-A File

The Detailed DHC-A File

Limitations of the DDHC-A File

Using the DDHC-A File in tidycensus

Example usage of the DDHC-A File

Looking up group codes

Understanding sparsity in the DDHC-A File

You may see this error…

How to check for data availability

Mapping DDHC-A data

Example: Somali populations by Census tract in Minneapolis

Alternative approach: dot-density mapping

Time-series analysis

How have areas changed since the 2010 Census?

Getting data from the 2010 Census

Cleanup before joining

Joining data

Calculating change

Caveat: changing geographies!

Bonus example: creating this plot

Thank you!

Normalizing columns with `mutate()`

Creating a shaded map with `zcol`