Analyzing 2020 Decennial US Census Data in R

2024 SSDAN Webinar Series

Kyle Walker

2024-02-22

About me

SSDAN webinar series

Today’s agenda

  • Hour 1: Getting started with 2020 Decennial US Census data in R

  • Hour 2: Analysis workflows with 2020 Census data

  • Hour 3: The Detailed DHC-A data and time-series analysis

Getting started with 2020 Decennial US Census data in R

R and RStudio

  • R: programming language and software environment for data analysis (and wherever else your imagination can take you!)

  • RStudio: integrated development environment (IDE) for R developed by Posit

  • Posit Cloud: run RStudio with today’s workshop pre-configured at https://posit.cloud/content/7549022

What is the decennial US Census?

  • Complete count of the US population mandated by Article I, Sections 2 and 9 of the US Constitution

  • Directed by the US Census Bureau (US Department of Commerce); conducted every 10 years since 1790

  • Used for proportional representation / congressional redistricting

  • Limited set of questions asked about race, ethnicity, age, sex, and housing tenure

The 2020 US Census: available datasets

Available datasets from the 2020 US Census include:

  • The PL 94-171 Redistricting Data
  • The Demographic and Housing Characteristics (DHC) file
  • The Demographic Profile (for pre-tabulated variables)
  • Tabulations for the 118th Congress & for Island Areas
  • The Detailed DHC-A file (with very detailed racial & ethnic categories)

How to get decennial Census data

tidycensus

  • R interface to the Decennial Census, American Community Survey, Population Estimates Program, and Public Use Microdata Sample (PUMS) APIs

  • First released in 2017; nearly 500,000 downloads from the Posit CRAN mirror

tidycensus: key features

  • Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested);

  • Automatically downloads and merges Census geometries to data for mapping;

  • Includes a variety of analytic tools to support common Census workflows;

  • States and counties can be requested by name (no more looking up FIPS codes!)

Getting started with tidycensus

  • To get started, install the packages you’ll need for today’s workshop

  • If you are using the Posit Cloud environment, these packages are already installed for you

install.packages(c("tidycensus", "tidyverse", "mapview"))

Optional: your Census API key

  • tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day

  • Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.

  • Once activated, use the census_api_key() function to set your key as an environment variable

library(tidycensus)

census_api_key("YOUR KEY GOES HERE", install = TRUE)
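With install = TRUE, the key is written to your .Renviron file for future R sessions. A small follow-up sketch: to use the key right away without restarting R, reload your environment file (the CENSUS_API_KEY variable name follows tidycensus conventions):

```r
# Reload .Renviron so the newly installed key is available in this session
readRenviron("~/.Renviron")

# Check that the key was set; returns "" if it was not
Sys.getenv("CENSUS_API_KEY")
```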

Getting started with Census data in tidycensus

2020 Census data in tidycensus

  • The get_decennial() function is used to acquire data from the decennial US Census

  • The two required arguments are geography and variables; for 2020 Census data, also set year = 2020.

pop20 <- get_decennial(
  geography = "state",
  variables = "P1_001N",
  year = 2020
)
  • Decennial Census data are returned with four columns: GEOID, NAME, variable, and value
pop20
# A tibble: 52 × 4
   GEOID NAME                 variable    value
   <chr> <chr>                <chr>       <dbl>
 1 42    Pennsylvania         P1_001N  13002700
 2 06    California           P1_001N  39538223
 3 54    West Virginia        P1_001N   1793716
 4 49    Utah                 P1_001N   3271616
 5 36    New York             P1_001N  20201249
 6 11    District of Columbia P1_001N    689545
 7 02    Alaska               P1_001N    733391
 8 12    Florida              P1_001N  21538187
 9 45    South Carolina       P1_001N   5118425
10 38    North Dakota         P1_001N    779094
# ℹ 42 more rows

Understanding the printed messages

  • When we run get_decennial() for the 2020 Census for the first time, we see the following messages:
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data summary file
Note: 2020 decennial Census data use differential privacy, a technique that
introduces errors into data to preserve respondent confidentiality.
ℹ Small counts should be interpreted with caution.
ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
This message is displayed once per session.

Understanding the printed messages

  • The Census Bureau is using differential privacy in an attempt to preserve respondent confidentiality in the 2020 Census data, which is required under US Code Title 13

  • Intentional errors are introduced into data, impacting the accuracy of small area counts (e.g. some blocks with children, but no adults)

  • Advocates argue that differential privacy is necessary to satisfy Title 13 requirements given modern database reconstruction technologies; critics contend that the method makes data less useful with no tangible privacy benefit

Requesting tables of variables

  • The table parameter can be used to obtain all related variables in a “table” at once
table_p2 <- get_decennial(
  geography = "state", 
  table = "P2", 
  year = 2020
)
table_p2
# A tibble: 3,796 × 4
   GEOID NAME                 variable    value
   <chr> <chr>                <chr>       <dbl>
 1 42    Pennsylvania         P2_001N  13002700
 2 06    California           P2_001N  39538223
 3 54    West Virginia        P2_001N   1793716
 4 49    Utah                 P2_001N   3271616
 5 36    New York             P2_001N  20201249
 6 11    District of Columbia P2_001N    689545
 7 02    Alaska               P2_001N    733391
 8 12    Florida              P2_001N  21538187
 9 45    South Carolina       P2_001N   5118425
10 38    North Dakota         P2_001N    779094
# ℹ 3,786 more rows

US Census Geography

Source: US Census Bureau

Geography in tidycensus

  • Information on available geographies, and how to specify them, can be found in the tidycensus documentation

  • The 2020 Census allows you to get data down to the Census block (unlike the ACS, covered last week)


Querying by state

  • For geographies available below the state level, the state parameter allows you to query data for a specific state

  • tidycensus translates state names and postal abbreviations internally, so you don’t need to remember the FIPS codes!

  • Example: total population in Texas by county

tx_population <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  state = "TX",
  sumfile = "dhc",
  year = 2020
)
tx_population
# A tibble: 254 × 4
   GEOID NAME                     variable  value
   <chr> <chr>                    <chr>     <dbl>
 1 48239 Jackson County, Texas    P1_001N   14988
 2 48233 Hutchinson County, Texas P1_001N   20617
 3 48235 Irion County, Texas      P1_001N    1513
 4 48237 Jack County, Texas       P1_001N    8472
 5 48241 Jasper County, Texas     P1_001N   32980
 6 48243 Jeff Davis County, Texas P1_001N    1996
 7 48245 Jefferson County, Texas  P1_001N  256526
 8 48247 Jim Hogg County, Texas   P1_001N    4838
 9 48249 Jim Wells County, Texas  P1_001N   38891
10 48251 Johnson County, Texas    P1_001N  179927
# ℹ 244 more rows

Querying by state and county

  • County names are also translated internally by tidycensus for sub-county queries, e.g. for Census tracts, block groups, and blocks
matagorda_blocks <- get_decennial(
  geography = "block",
  variables = "P1_001N",
  state = "TX",
  county = "Matagorda",
  sumfile = "dhc",
  year = 2020
)
matagorda_blocks
# A tibble: 2,620 × 4
   GEOID           NAME                                           variable value
   <chr>           <chr>                                          <chr>    <dbl>
 1 483217303013065 Block 3065, Block Group 3, Census Tract 7303.… P1_001N     12
 2 483217303013062 Block 3062, Block Group 3, Census Tract 7303.… P1_001N      6
 3 483217303013063 Block 3063, Block Group 3, Census Tract 7303.… P1_001N     30
 4 483217303013064 Block 3064, Block Group 3, Census Tract 7303.… P1_001N     31
 5 483217303013066 Block 3066, Block Group 3, Census Tract 7303.… P1_001N     14
 6 483217303013067 Block 3067, Block Group 3, Census Tract 7303.… P1_001N     19
 7 483217303013068 Block 3068, Block Group 3, Census Tract 7303.… P1_001N     19
 8 483217303013069 Block 3069, Block Group 3, Census Tract 7303.… P1_001N     15
 9 483217303013070 Block 3070, Block Group 3, Census Tract 7303.… P1_001N     24
10 483217303013071 Block 3071, Block Group 3, Census Tract 7303.… P1_001N      3
# ℹ 2,610 more rows

Searching for variables

  • To search for variables, use the load_variables() function along with a year and dataset

  • The View() function in RStudio allows for interactive browsing and filtering

vars <- load_variables(2020, "dhc")

View(vars)
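Because load_variables() returns an ordinary tibble, you can also filter it programmatically instead of browsing interactively. A minimal sketch, assuming the tidyverse is loaded; the "renter" search term is just an example:

```r
library(tidycensus)
library(tidyverse)

# Load the DHC variable catalog (cached after the first call)
vars <- load_variables(2020, "dhc")

# Keep only variables whose label mentions renters (case-insensitive)
renter_vars <- filter(vars, str_detect(tolower(label), "renter"))

renter_vars
```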

Available decennial Census datasets in tidycensus

The different datasets in the 2020 Census are accessible by specifying a sumfile in get_decennial(). The datasets we’ll cover today include:

  • The DHC data (sumfile = "dhc")
  • The Demographic Profile (sumfile = "dp")
  • The CD118 data (sumfile = "cd118")
  • The Detailed DHC-A data (sumfile = "ddhca")

Data structure in tidycensus

“Tidy” or long-form data

  • The default data structure returned by tidycensus is “tidy” or long-form data, with variables by geography stacked by row
single_year_age <- get_decennial(
  geography = "state",
  table = "PCT12",
  year = 2020,
  sumfile = "dhc"
)
single_year_age
# A tibble: 10,868 × 4
   GEOID NAME                 variable      value
   <chr> <chr>                <chr>         <dbl>
 1 09    Connecticut          PCT12_001N  3605944
 2 10    Delaware             PCT12_001N   989948
 3 11    District of Columbia PCT12_001N   689545
 4 12    Florida              PCT12_001N 21538187
 5 13    Georgia              PCT12_001N 10711908
 6 15    Hawaii               PCT12_001N  1455271
 7 16    Idaho                PCT12_001N  1839106
 8 17    Illinois             PCT12_001N 12812508
 9 18    Indiana              PCT12_001N  6785528
10 19    Iowa                 PCT12_001N  3190369
# ℹ 10,858 more rows

“Wide” data

  • The argument output = "wide" spreads Census variables across the columns, returning one row per geographic unit and one column per variable
single_year_age_wide <- get_decennial(
  geography = "state",
  table = "PCT12",
  year = 2020,
  sumfile = "dhc",
  output = "wide" 
)
single_year_age_wide
# A tibble: 52 × 211
   GEOID NAME  PCT12_001N PCT12_002N PCT12_003N PCT12_004N PCT12_005N PCT12_006N
   <chr> <chr>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
 1 09    Conn…    3605944    1749853      16946      17489      17974      18571
 2 10    Dela…     989948     476719       4847       5045       5239       5427
 3 11    Dist…     689545     322777       4116       3730       3620       3723
 4 12    Flor…   21538187   10464234      98241     100543     104962     108555
 5 13    Geor…   10711908    5188570      58614      59817      62719      64271
 6 15    Hawa…    1455271     727844       7757       7565       7701       8303
 7 16    Idaho    1839106     919196      11029      10953      11558      12056
 8 17    Illi…   12812508    6283130      68115      69120      71823      74165
 9 18    Indi…    6785528    3344660      39308      40232      41971      43310
10 19    Iowa     3190369    1586092      18038      18566      19349      19864
# ℹ 42 more rows
# ℹ 203 more variables: PCT12_007N <dbl>, PCT12_008N <dbl>, PCT12_009N <dbl>,
#   PCT12_010N <dbl>, PCT12_011N <dbl>, PCT12_012N <dbl>, PCT12_013N <dbl>,
#   PCT12_014N <dbl>, PCT12_015N <dbl>, PCT12_016N <dbl>, PCT12_017N <dbl>,
#   PCT12_018N <dbl>, PCT12_019N <dbl>, PCT12_020N <dbl>, PCT12_021N <dbl>,
#   PCT12_022N <dbl>, PCT12_023N <dbl>, PCT12_024N <dbl>, PCT12_025N <dbl>,
#   PCT12_026N <dbl>, PCT12_027N <dbl>, PCT12_028N <dbl>, PCT12_029N <dbl>, …

Using named vectors of variables

  • Census variable IDs can be hard to remember; passing a named vector to the variables argument replaces the IDs with your custom names

  • In long form, these custom inputs will populate the variable column; in wide form, they will replace the column names

ca_samesex <- get_decennial(
  geography = "county",
  state = "CA",
  variables = c(married = "DP1_0116P",
                partnered = "DP1_0118P"),
  year = 2020,
  sumfile = "dp",
  output = "wide"
)
ca_samesex
# A tibble: 58 × 4
   GEOID NAME                            married partnered
   <chr> <chr>                             <dbl>     <dbl>
 1 06001 Alameda County, California          0.4       0.2
 2 06003 Alpine County, California           0.3       0.2
 3 06005 Amador County, California           0.2       0.1
 4 06007 Butte County, California            0.2       0.1
 5 06009 Calaveras County, California        0.2       0.1
 6 06011 Colusa County, California           0.1       0  
 7 06013 Contra Costa County, California     0.3       0.1
 8 06015 Del Norte County, California        0.1       0.1
 9 06017 El Dorado County, California        0.2       0.1
10 06019 Fresno County, California           0.2       0.1
# ℹ 48 more rows

Part 1 exercises

  1. Use load_variables(2020, "dhc") to find a variable that interests you from the Demographic and Housing Characteristics file.

  2. Use get_decennial() to fetch data on that variable from the decennial US Census for counties in a state of your choosing.

Part 2: Analysis workflows with 2020 Census data

The tidyverse

library(tidyverse)

tidyverse_logo()
⬢ __  _    __   .    ⬡           ⬢  . 
 / /_(_)__/ /_ ___  _____ _______ ___ 
/ __/ / _  / // / |/ / -_) __(_-</ -_)
\__/_/\_,_/\_, /|___/\__/_/ /___/\__/ 
     ⬢  . /___/      ⬡      .       ⬢ 
  • The tidyverse: an integrated set of packages developed primarily by Hadley Wickham and the Posit team

tidycensus and the tidyverse

  • Census data are commonly used in wide format, with categories spread across the columns

  • tidyverse tools work better with data that are in “tidy”, or long format; this format is returned by tidycensus by default

  • Goal: return data “ready to go” for use with tidyverse tools

Exploring 2020 Census data with tidyverse tools

Finding the largest values

  • dplyr’s arrange() function sorts data based on values in one or more columns, and filter() helps you query data based on column values

  • Example: what are the largest and smallest counties in Texas by population?

library(tidycensus)
library(tidyverse)

tx_population <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  year = 2020,
  state = "TX",
  sumfile = "dhc"
)
arrange(tx_population, value)
# A tibble: 254 × 4
   GEOID NAME                   variable value
   <chr> <chr>                  <chr>    <dbl>
 1 48301 Loving County, Texas   P1_001N     64
 2 48269 King County, Texas     P1_001N    265
 3 48261 Kenedy County, Texas   P1_001N    350
 4 48311 McMullen County, Texas P1_001N    600
 5 48033 Borden County, Texas   P1_001N    631
 6 48263 Kent County, Texas     P1_001N    753
 7 48443 Terrell County, Texas  P1_001N    760
 8 48393 Roberts County, Texas  P1_001N    827
 9 48345 Motley County, Texas   P1_001N   1063
10 48155 Foard County, Texas    P1_001N   1095
# ℹ 244 more rows
arrange(tx_population, desc(value))
# A tibble: 254 × 4
   GEOID NAME                    variable   value
   <chr> <chr>                   <chr>      <dbl>
 1 48201 Harris County, Texas    P1_001N  4731145
 2 48113 Dallas County, Texas    P1_001N  2613539
 3 48439 Tarrant County, Texas   P1_001N  2110640
 4 48029 Bexar County, Texas     P1_001N  2009324
 5 48453 Travis County, Texas    P1_001N  1290188
 6 48085 Collin County, Texas    P1_001N  1064465
 7 48121 Denton County, Texas    P1_001N   906422
 8 48215 Hidalgo County, Texas   P1_001N   870781
 9 48141 El Paso County, Texas   P1_001N   865657
10 48157 Fort Bend County, Texas P1_001N   822779
# ℹ 244 more rows

What are the counties with a population below 1,000?

  • The filter() function subsets data according to a specified condition, much like a SQL query
below1000 <- filter(tx_population, value < 1000)

below1000
# A tibble: 8 × 4
  GEOID NAME                   variable value
  <chr> <chr>                  <chr>    <dbl>
1 48261 Kenedy County, Texas   P1_001N    350
2 48263 Kent County, Texas     P1_001N    753
3 48269 King County, Texas     P1_001N    265
4 48301 Loving County, Texas   P1_001N     64
5 48311 McMullen County, Texas P1_001N    600
6 48393 Roberts County, Texas  P1_001N    827
7 48443 Terrell County, Texas  P1_001N    760
8 48033 Borden County, Texas   P1_001N    631

Using summary variables

  • Many decennial Census and ACS variables are organized in tables in which the first variable represents a summary variable, or denominator for the others

  • The parameter summary_var can be used to generate a new column in long-form data for a requested denominator, which works well for normalizing estimates

Using summary variables

race_vars <- c(
  Hispanic = "P5_010N",
  White = "P5_003N",
  Black = "P5_004N",
  Native = "P5_005N",
  Asian = "P5_006N",
  HIPI = "P5_007N"
)

cd_race <- get_decennial(
  geography = "congressional district",
  variables = race_vars,
  summary_var = "P5_001N", 
  year = 2020,
  sumfile = "cd118"
)
cd_race
# A tibble: 2,640 × 5
   GEOID NAME                                      variable  value summary_value
   <chr> <chr>                                     <chr>     <dbl>         <dbl>
 1 0101  Congressional District 1 (118th Congress… Hispanic  27196        717754
 2 0102  Congressional District 2 (118th Congress… Hispanic  31708        717755
 3 0103  Congressional District 3 (118th Congress… Hispanic  26849        717754
 4 0104  Congressional District 4 (118th Congress… Hispanic  53874        717754
 5 0105  Congressional District 5 (118th Congress… Hispanic  47307        717754
 6 0106  Congressional District 6 (118th Congress… Hispanic  46872        717754
 7 0107  Congressional District 7 (118th Congress… Hispanic  30241        717754
 8 0200  Congressional District (at Large) (118th… Hispanic  49824        733391
 9 0401  Congressional District 1 (118th Congress… Hispanic 129929        794611
10 0402  Congressional District 2 (118th Congress… Hispanic 133934        794612
# ℹ 2,630 more rows

Normalizing columns with mutate()

  • dplyr’s mutate() function is used to calculate new columns in your data; the select() function can keep or drop columns by name

  • In a tidyverse workflow, these steps are commonly linked using the pipe operator (%>%) from the magrittr package

cd_race_percent <- cd_race %>%
  mutate(percent = 100 * (value / summary_value)) %>% 
  select(NAME, variable, percent) 
cd_race_percent
# A tibble: 2,640 × 3
   NAME                                                       variable percent
   <chr>                                                      <chr>      <dbl>
 1 Congressional District 1 (118th Congress), Alabama         Hispanic    3.79
 2 Congressional District 2 (118th Congress), Alabama         Hispanic    4.42
 3 Congressional District 3 (118th Congress), Alabama         Hispanic    3.74
 4 Congressional District 4 (118th Congress), Alabama         Hispanic    7.51
 5 Congressional District 5 (118th Congress), Alabama         Hispanic    6.59
 6 Congressional District 6 (118th Congress), Alabama         Hispanic    6.53
 7 Congressional District 7 (118th Congress), Alabama         Hispanic    4.21
 8 Congressional District (at Large) (118th Congress), Alaska Hispanic    6.79
 9 Congressional District 1 (118th Congress), Arizona         Hispanic   16.4 
10 Congressional District 2 (118th Congress), Arizona         Hispanic   16.9 
# ℹ 2,630 more rows

Group-wise Census data analysis

  • The group_by() and summarize() functions in dplyr are used to implement the split-apply-combine method of data analysis

  • The default “tidy” format returned by tidycensus is designed to work well with group-wise Census data analysis workflows

What is the largest group by congressional district?

largest_group <- cd_race_percent %>%
  group_by(NAME) %>% 
  filter(percent == max(percent)) 

# Optionally, use `.by`: 
# largest_group <- cd_race_percent %>%
#   filter(percent == max(percent), .by = NAME) 
largest_group
# A tibble: 437 × 3
# Groups:   NAME [437]
   NAME                                                   variable percent
   <chr>                                                  <chr>      <dbl>
 1 Congressional District 3 (118th Congress), Arizona     Hispanic    62.6
 2 Congressional District 7 (118th Congress), Arizona     Hispanic    59.8
 3 Congressional District 8 (118th Congress), California  Hispanic    35.2
 4 Congressional District 9 (118th Congress), California  Hispanic    41.5
 5 Congressional District 13 (118th Congress), California Hispanic    65.1
 6 Congressional District 18 (118th Congress), California Hispanic    65.3
 7 Congressional District 21 (118th Congress), California Hispanic    64.3
 8 Congressional District 22 (118th Congress), California Hispanic    73.2
 9 Congressional District 23 (118th Congress), California Hispanic    41.6
10 Congressional District 25 (118th Congress), California Hispanic    64.8
# ℹ 427 more rows
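As a quick follow-up, you can tally how often each group appears as the largest; a sketch using the largest_group data from above:

```r
# Count how many districts each group tops, sorted in descending order
largest_group %>%
  ungroup() %>%
  count(variable, sort = TRUE)
```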

What are the median percentages by group?

cd_race_percent %>%
  group_by(variable) %>% 
  summarize(median_pct = median(percent, na.rm = TRUE)) 
# A tibble: 6 × 2
  variable median_pct
  <chr>         <dbl>
1 Asian        3.40  
2 Black        7.49  
3 HIPI         0.0468
4 Hispanic    12.8   
5 Native       0.235 
6 White       61.4   

Exploring maps of Census data

“Spatial” Census data

  • One of the best features of tidycensus is the argument geometry = TRUE, which gets you the correct Census geometries with no hassle

  • get_decennial() with geometry = TRUE returns a spatial Census dataset containing simple feature geometries; learn more on March 7

  • Let’s take a look at some examples

“Spatial” Census data

  • geometry = TRUE does the hard work for you of acquiring and pre-joining spatial Census data

  • Consider using the Demographic Profile for pre-tabulated percentages

iowa_over_65 <- get_decennial(
  geography = "tract",
  variables = "DP1_0024P",
  state = "IA",
  geometry = TRUE,
  sumfile = "dp",
  year = 2020
)
  • We get back a simple features data frame (more about this on March 7)
iowa_over_65
Simple feature collection with 896 features and 4 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -96.63862 ymin: 40.37566 xmax: -90.14006 ymax: 43.5012
Geodetic CRS:  NAD83
# A tibble: 896 × 5
   GEOID       NAME                     variable value                  geometry
   <chr>       <chr>                    <chr>    <dbl>             <POLYGON [°]>
 1 19163010600 Census Tract 106; Scott… DP1_002…   7.1 ((-90.57555 41.52573, -9…
 2 19163011700 Census Tract 117; Scott… DP1_002…  13.1 ((-90.5766 41.55272, -90…
 3 19163013500 Census Tract 135; Scott… DP1_002…  17.2 ((-90.52247 41.53852, -9…
 4 19057000300 Census Tract 3; Des Moi… DP1_002…  23.1 ((-91.15105 40.8279, -91…
 5 19145490300 Census Tract 4903; Page… DP1_002…  26.2 ((-95.38534 40.74262, -9…
 6 19105070100 Census Tract 701; Jones… DP1_002…  20.2 ((-91.36461 42.15154, -9…
 7 19103001200 Census Tract 12; Johnso… DP1_002…  17.6 ((-91.52284 41.66125, -9…
 8 19123950400 Census Tract 9504; Maha… DP1_002…  21.4 ((-92.64872 41.33564, -9…
 9 19169000300 Census Tract 3; Story C… DP1_002…  24.9 ((-93.63794 42.05642, -9…
10 19187000400 Census Tract 4; Webster… DP1_002…  12.7 ((-94.19502 42.50798, -9…
# ℹ 886 more rows

Exploring spatial data

  • Mapping, GIS, and spatial data are the subject of our March 7 workshop - so be sure to check that out!

  • Even before we dive deeper into spatial data, it is very useful to be able to explore your results on an interactive map

  • Our solution: mapview()

Exploring spatial data

library(mapview)

mapview(iowa_over_65)

Creating a shaded map with zcol

mapview(iowa_over_65, zcol = "value")

Customizing your mapview output

mapview(iowa_over_65, zcol = "value",
        layer.name = "% age 65 and up<br>Census tracts in Iowa")

Customizing your mapview output

library(viridisLite)

mapview(iowa_over_65, zcol = "value",
        layer.name = "% age 65 and up<br>Census tracts in Iowa",
        col.regions = inferno(100))

Saving and using interactive maps

Use the saveWidget() function on the map slot (@map) of your mapview map to save a standalone HTML file, which you can embed in websites

library(htmlwidgets)

m1 <- mapview(iowa_over_65, zcol = "value",
        layer.name = "% age 65 and up<br>Census tracts in Iowa",
        col.regions = inferno(100))

saveWidget(m1@map, "iowa_over_65.html")

Part 2 exercise

  • Try making an interactive map of a different variable from the Demographic Profile (use load_variables(2020, "dp") to look them up) for a different state, or state / county combination.

Part 3: The Detailed DHC-A File and time-series analysis

The 2020 Decennial Census Detailed DHC-A File

The Detailed DHC-A File

  • Tabulation of 2020 Decennial Census results for population by sex and age

  • Key feature: break-outs for thousands of racial and ethnic groups

Limitations of the DDHC-A File

  • An “adaptive design” is used, meaning that data for different groups / geographies may be found in different tables

  • There is considerable sparsity in the data, especially when going down to the Census tract level

Using the DDHC-A File in tidycensus

  • You’ll query the DDHC-A file with the argument sumfile = "ddhca" in get_decennial()

  • A new argument, pop_group, is required to use the DDHC-A; it takes a population group code.

  • Use pop_group = "all" to query for all groups; set pop_group_label = TRUE to return the label for the population group

  • Look up variables with load_variables(2020, "ddhca")

Example usage of the DDHC-A File

mn_population_groups <- get_decennial(
  geography = "state",
  variables = "T01001_001N",
  state = "MN",
  year = 2020,
  sumfile = "ddhca",
  pop_group = "all",
  pop_group_label = TRUE
)
mn_population_groups
# A tibble: 2,996 × 6
   GEOID NAME      pop_group pop_group_label   variable      value
   <chr> <chr>     <chr>     <chr>             <chr>         <dbl>
 1 27    Minnesota 1002      European alone    T01001_001N 3162905
 2 27    Minnesota 1003      Albanian alone    T01001_001N     512
 3 27    Minnesota 1004      Alsatian alone    T01001_001N      27
 4 27    Minnesota 1005      Andorran alone    T01001_001N      NA
 5 27    Minnesota 1006      Armenian alone    T01001_001N     605
 6 27    Minnesota 1007      Austrian alone    T01001_001N    2552
 7 27    Minnesota 1008      Azerbaijani alone T01001_001N     103
 8 27    Minnesota 1009      Basque alone      T01001_001N      52
 9 27    Minnesota 1010      Belarusian alone  T01001_001N    1579
10 27    Minnesota 1011      Belgian alone     T01001_001N    3864
# ℹ 2,986 more rows

Looking up group codes

  • A new function, get_pop_groups(), helps you look up population group codes

  • It works for SF2/SF4 in 2000 and SF2 in 2010 as well!

available_groups <- get_pop_groups(2020, "ddhca")
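You can then search the returned table for a specific group; for example, a sketch that looks up the Somali population group used later in this workshop (the pop_group_label column name follows the tidycensus output shown earlier):

```r
library(tidyverse)

# Search the group labels for "Somali"; the matching code
# is what gets passed to the pop_group argument
available_groups %>%
  filter(str_detect(pop_group_label, "Somali"))
```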

Understanding sparsity in the DDHC-A File

  • The DDHC-A File uses an “adaptive design” that makes certain tables available for specific geographies

You may see this error…

get_decennial(
  geography = "county",
  variables = "T02001_001N",
  state = "MN",
  county = "Hennepin",
  pop_group = "1325",
  year = 2020,
  sumfile = "ddhca"
)
Error in `get_decennial()`:
! Error in load_data_decennial(geography, variables, key, year, sumfile,  : 
  Your DDHC-A request returned No Content from the API.
ℹ The DDHC-A file uses an 'adaptive design' where data availability varies by geography and by population group.
ℹ Read Section 3-1 at https://www2.census.gov/programs-surveys/decennial/2020/technical-documentation/complete-tech-docs/detailed-demographic-and-housing-characteristics-file-a/2020census-detailed-dhc-a-techdoc.pdf for more information.
ℹ In tidycensus, use the function `check_ddhca_groups()` to see if your data is available.

How to check for data availability

  • A new function, check_ddhca_groups(), can be used to see which tables to use for the data you want
check_ddhca_groups(
  geography = "county", 
  pop_group = "1325", 
  state = "MN", 
  county = "Hennepin"
)

Mapping DDHC-A data

  • Given data sparsity in the DDHC-A data, should you make maps with it?

  • I’m not personally a fan of mapping data that are geographically sparse. But…

  • I think it is OK to map DDHC-A data if you think through the data limitations in your map design

Example: Somali populations by Census tract in Minneapolis

library(tidycensus)

hennepin_somali <- get_decennial(
  geography = "tract",
  variables = "T01001_001N",
  state = "MN",
  county = "Hennepin",
  year = 2020,
  sumfile = "ddhca",
  pop_group = "1325",
  pop_group_label = TRUE,
  geometry = TRUE
)
mapview(hennepin_somali, zcol = "value")

Alternative approach: dot-density mapping

  • I don’t think choropleth maps are advisable with geographically incomplete data in most cases

  • Other map types - like graduated symbols or dot-density maps - may be more appropriate

  • The tidycensus function as_dot_density() allows you to specify the number of people represented in each dot, which means you can represent data-suppressed areas as 0 more confidently

somali_dots <- as_dot_density(
  hennepin_somali,
  value = "value",
  values_per_dot = 25
)

mapview(somali_dots, cex = 0.01, layer.name = "Somali population<br>1 dot = 25 people",
        col.regions = "navy", color = "navy")

Time-series analysis

How have areas changed since the 2010 Census?

  • A common use-case for the 2020 decennial Census data is to assess change over time

  • For example: which areas have experienced the most population growth, and which have experienced the steepest declines?

  • tidycensus allows users to access the 2000 and 2010 decennial Census data for comparison, though variable IDs will differ

Getting data from the 2010 Census

county_pop_10 <- get_decennial(
  geography = "county",
  variables = "P001001", 
  year = 2010,
  sumfile = "sf1"
)
county_pop_10
# A tibble: 3,221 × 4
   GEOID NAME                        variable  value
   <chr> <chr>                       <chr>     <dbl>
 1 05131 Sebastian County, Arkansas  P001001  125744
 2 05133 Sevier County, Arkansas     P001001   17058
 3 05135 Sharp County, Arkansas      P001001   17264
 4 05137 Stone County, Arkansas      P001001   12394
 5 05139 Union County, Arkansas      P001001   41639
 6 05141 Van Buren County, Arkansas  P001001   17295
 7 05143 Washington County, Arkansas P001001  203065
 8 05145 White County, Arkansas      P001001   77076
 9 05149 Yell County, Arkansas       P001001   22185
10 06011 Colusa County, California   P001001   21419
# ℹ 3,211 more rows

Cleanup before joining

  • The select() function can both subset datasets by column and rename columns, “cleaning up” a dataset before joining to another dataset
county_pop_10_clean <- county_pop_10 %>%
  select(GEOID, value10 = value) 

county_pop_10_clean
# A tibble: 3,221 × 2
   GEOID value10
   <chr>   <dbl>
 1 05131  125744
 2 05133   17058
 3 05135   17264
 4 05137   12394
 5 05139   41639
 6 05141   17295
 7 05143  203065
 8 05145   77076
 9 05149   22185
10 06011   21419
# ℹ 3,211 more rows

Joining data

  • In dplyr, joins are implemented with the *_join() family of functions
county_pop_20 <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  year = 2020,
  sumfile = "dhc"
) %>%
  select(GEOID, NAME, value20 = value)

county_joined <- county_pop_20 %>%
  left_join(county_pop_10_clean, by = "GEOID") 
county_joined
# A tibble: 3,221 × 4
   GEOID NAME                         value20 value10
   <chr> <chr>                          <dbl>   <dbl>
 1 06039 Madera County, California     156255  150865
 2 06041 Marin County, California      262321  252409
 3 06043 Mariposa County, California    17131   18251
 4 06045 Mendocino County, California   91601   87841
 5 06047 Merced County, California     281202  255793
 6 06049 Modoc County, California        8700    9686
 7 06051 Mono County, California        13195   14202
 8 06053 Monterey County, California   439035  415057
 9 06055 Napa County, California       138019  136484
10 06057 Nevada County, California     102241   98764
# ℹ 3,211 more rows

Calculating change

  • dplyr’s mutate() function can be used to calculate new columns, allowing for assessment of change over time
county_change <- county_joined %>%
  mutate( 
    total_change = value20 - value10, 
    percent_change = 100 * (total_change / value10) 
  ) 
county_change
# A tibble: 3,221 × 6
   GEOID NAME                        value20 value10 total_change percent_change
   <chr> <chr>                         <dbl>   <dbl>        <dbl>          <dbl>
 1 06039 Madera County, California    156255  150865         5390           3.57
 2 06041 Marin County, California     262321  252409         9912           3.93
 3 06043 Mariposa County, California   17131   18251        -1120          -6.14
 4 06045 Mendocino County, Californ…   91601   87841         3760           4.28
 5 06047 Merced County, California    281202  255793        25409           9.93
 6 06049 Modoc County, California       8700    9686         -986         -10.2 
 7 06051 Mono County, California       13195   14202        -1007          -7.09
 8 06053 Monterey County, California  439035  415057        23978           5.78
 9 06055 Napa County, California      138019  136484         1535           1.12
10 06057 Nevada County, California    102241   98764         3477           3.52
# ℹ 3,211 more rows

Caveat: changing geographies!

  • County names and boundaries can change from year to year, introducing potential problems in time-series analysis

  • This is particularly acute for small geographies like Census tracts & block groups, which we’ll cover on March 7!

filter(county_change, is.na(value10))
# A tibble: 4 × 6
  GEOID NAME                         value20 value10 total_change percent_change
  <chr> <chr>                          <dbl>   <dbl>        <dbl>          <dbl>
1 02063 Chugach Census Area, Alaska     7102      NA           NA             NA
2 02066 Copper River Census Area, A…    2617      NA           NA             NA
3 02158 Kusilvak Census Area, Alaska    8368      NA           NA             NA
4 46102 Oglala Lakota County, South…   13672      NA           NA             NA
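One way to handle this, sketched below, is to restrict the comparison to geographies present in both years before computing change:

```r
# Drop counties with no 2010 match, then recompute the change measures
county_change_matched <- county_joined %>%
  filter(!is.na(value10)) %>%
  mutate(
    total_change = value20 - value10,
    percent_change = 100 * (total_change / value10)
  )
```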

Bonus example: creating this plot
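The plot itself is not reproduced in this document; what follows is a hedged ggplot2 sketch of one way such a plot could be built from the county_change data in Part 3 (the choice of the top ten counties and the styling are assumptions, not the original plot):

```r
library(tidyverse)

county_change %>%
  filter(!is.na(percent_change)) %>%
  # Take the ten counties with the largest percent growth
  slice_max(percent_change, n = 10) %>%
  ggplot(aes(x = percent_change, y = reorder(NAME, percent_change))) +
  geom_col(fill = "navy") +
  labs(
    x = "% change in population, 2010 to 2020",
    y = NULL,
    title = "Fastest-growing US counties, 2010-2020 Census"
  )
```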

Thank you!