2025 SSDAN Webinar Series
2025-02-12
Professor of Geography at TCU
Spatial data science researcher and consultant
Package developer: tidycensus, tigris, mapgl, mapboxapi, crsuggest, idbr (R), pygris (Python)
Book: Analyzing US Census Data: Methods, Maps and Models in R
Wednesday, February 5: Analyzing Data from the 2023 American Community Survey in R
Today: Working with Decennial Census Data in R
Wednesday, February 26: Mapping and Spatial Analysis with US Census Data in R
Hour 1: Getting started with 2020 Decennial US Census data in R
Hour 2: Analysis workflows with 2020 Census data
Hour 3: The detailed DHC data files and time-series analysis
R: programming language and software environment for data analysis (and wherever else your imagination can take you!)
RStudio: integrated development environment (IDE) for R developed by Posit
Posit Cloud: run RStudio with today’s workshop pre-configured at https://posit.cloud/content/9689451
Complete count of the US population mandated by Article 1, Sections 2 and 9 in the US Constitution
Directed by the US Census Bureau (US Department of Commerce); conducted every 10 years since 1790
Used for proportional representation / congressional redistricting
Limited set of questions asked about race, ethnicity, age, sex, and housing tenure
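Available datasets from the 2020 US Census include the PL 94-171 Redistricting Data, the Demographic and Housing Characteristics (DHC) file, the Demographic Profile, and the Detailed DHC files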
data.census.gov is the main, revamped interactive data portal for browsing and downloading Census datasets
The US Census Application Programming Interface (API) allows developers to access Census data resources programmatically
tidycensus wrangles Census data internally to return a tidyverse-ready format (or traditional wide format if requested);
Automatically downloads and merges Census geometries to data for mapping;
Includes a variety of analytic tools to support common Census workflows;
States and counties can be requested by name (no more looking up FIPS codes!)
To get started, install the packages you’ll need for today’s workshop
If you are using the Posit Cloud environment, these packages are already installed for you
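A minimal sketch of the install step, assuming the workshop relies on tidycensus, tidyverse, and mapview (this package list is an assumption; adjust it to match the workshop materials). It installs only the packages that are missing.

```r
# Assumed package list for this workshop; edit as needed
pkgs <- c("tidycensus", "tidyverse", "mapview")

# Install only the packages that are not already available
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}
```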
tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day
Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.
Once activated, use the census_api_key() function to set your key as an environment variable
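As a sketch, setting the key for the current session might look like this. The key string is a placeholder, not a real key.

```r
library(tidycensus)

# Placeholder value: replace with the key emailed to you by the Census Bureau
census_api_key("YOUR KEY GOES HERE")
```

Adding install = TRUE to the call stores the key in your .Renviron file so it is available in future sessions.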
The get_decennial() function is used to acquire data from the decennial US Census
The two required arguments are geography and variables; for 2020 Census data, use year = 2020.
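A minimal sketch of such a call, requesting total population (P1_001N) for every state. The object name pop20 is illustrative.

```r
library(tidycensus)

# Total population (variable P1_001N) by state from the 2020 Census
pop20 <- get_decennial(
  geography = "state",
  variables = "P1_001N",
  year = 2020
)

pop20
```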
By default, the function returns a tibble with four columns: GEOID, NAME, variable, and value
# A tibble: 52 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 42 Pennsylvania P1_001N 13002700
2 06 California P1_001N 39538223
3 54 West Virginia P1_001N 1793716
4 49 Utah P1_001N 3271616
5 36 New York P1_001N 20201249
6 11 District of Columbia P1_001N 689545
7 02 Alaska P1_001N 733391
8 12 Florida P1_001N 21538187
9 45 South Carolina P1_001N 5118425
10 38 North Dakota P1_001N 779094
# ℹ 42 more rows
When get_decennial() is run for the 2020 Census for the first time in a session, we see the following messages:
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data summary file
Note: 2020 decennial Census data use differential privacy, a technique that
introduces errors into data to preserve respondent confidentiality.
ℹ Small counts should be interpreted with caution.
ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
This message is displayed once per session.
The Census Bureau is using differential privacy in an attempt to preserve respondent confidentiality in the 2020 Census data, which is required under US Code Title 13
Intentional errors are introduced into data, impacting the accuracy of small area counts (e.g. some blocks with children, but no adults)
Advocates argue that differential privacy is necessary to satisfy Title 13 requirements given modern database reconstruction technologies; critics contend that the method makes data less useful with no tangible privacy benefit
The table parameter can be used to obtain all related variables in a “table” at once
# A tibble: 3,796 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 42 Pennsylvania P2_001N 13002700
2 06 California P2_001N 39538223
3 54 West Virginia P2_001N 1793716
4 49 Utah P2_001N 3271616
5 36 New York P2_001N 20201249
6 11 District of Columbia P2_001N 689545
7 02 Alaska P2_001N 733391
8 12 Florida P2_001N 21538187
9 45 South Carolina P2_001N 5118425
10 38 North Dakota P2_001N 779094
# ℹ 3,786 more rows
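Output like the above could come from a call along these lines, which requests every variable in table P2 by state. This is a sketch; the object name table_p2 is illustrative.

```r
library(tidycensus)

# All variables in table P2, for every state, in long (tidy) form
table_p2 <- get_decennial(
  geography = "state",
  table = "P2",
  year = 2020
)

table_p2
```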
Information on available geographies, and how to specify them, can be found in the tidycensus documentation
The 2020 Census allows you to get data down to the Census block (unlike the ACS, covered last week)
For geographies available below the state level, the state parameter allows you to query data for a specific state
tidycensus translates state names and postal abbreviations internally, so you don’t need to remember the FIPS codes!
Example: total population in Missouri by county
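A sketch of that request, using the postal abbreviation for Missouri. The object name mo_population is illustrative.

```r
library(tidycensus)

# Total population for every county in Missouri
mo_population <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  state = "MO",
  year = 2020
)

mo_population
```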
# A tibble: 115 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 29001 Adair County, Missouri P1_001N 25314
2 29003 Andrew County, Missouri P1_001N 18135
3 29005 Atchison County, Missouri P1_001N 5305
4 29007 Audrain County, Missouri P1_001N 24962
5 29009 Barry County, Missouri P1_001N 34534
6 29011 Barton County, Missouri P1_001N 11637
7 29013 Bates County, Missouri P1_001N 16042
8 29015 Benton County, Missouri P1_001N 19394
9 29017 Bollinger County, Missouri P1_001N 10567
10 29019 Boone County, Missouri P1_001N 183610
# ℹ 105 more rows
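The block-level output that follows has GEOIDs beginning with 29510, which is St. Louis city, Missouri. A request of roughly this shape would return data at that level; the exact call used on the original slide is not preserved, so treat this as a sketch, and the object name stl_blocks as illustrative.

```r
library(tidycensus)

# Total population for every Census block in St. Louis city, Missouri
stl_blocks <- get_decennial(
  geography = "block",
  variables = "P1_001N",
  state = "MO",
  county = "St. Louis city",
  year = 2020
)

stl_blocks
```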
# A tibble: 8,057 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 295101156001003 Block 1003, Block Group 1, Census Tract 1156,… P1_001N 18
2 295101156001004 Block 1004, Block Group 1, Census Tract 1156,… P1_001N 0
3 295101156001005 Block 1005, Block Group 1, Census Tract 1156,… P1_001N 60
4 295101156001006 Block 1006, Block Group 1, Census Tract 1156,… P1_001N 68
5 295101156001007 Block 1007, Block Group 1, Census Tract 1156,… P1_001N 566
6 295101156001008 Block 1008, Block Group 1, Census Tract 1156,… P1_001N 0
7 295101164004000 Block 4000, Block Group 4, Census Tract 1164,… P1_001N 52
8 295101164003019 Block 3019, Block Group 3, Census Tract 1164,… P1_001N 51
9 295101164003020 Block 3020, Block Group 3, Census Tract 1164,… P1_001N 114
10 295101164003021 Block 3021, Block Group 3, Census Tract 1164,… P1_001N 35
# ℹ 8,047 more rows
To search for variables, use the load_variables() function along with a year and dataset
The View() function in RStudio allows for interactive browsing and filtering
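For example, a sketch that loads the variable list for the 2020 PL 94-171 file and opens it in the viewer when running interactively (the object name vars is illustrative):

```r
library(tidycensus)

# Variable IDs, labels, and concepts for the 2020 redistricting file
vars <- load_variables(2020, "pl")

# View() needs an interactive session such as RStudio
if (interactive()) {
  View(vars)
}
```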
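The different datasets in the 2020 Census are accessible by specifying a sumfile in get_decennial(). The datasets we’ll cover today include (besides the default PL 94-171 Redistricting Data):
Demographic and Housing Characteristics file (sumfile = "dhc")
Demographic Profile (sumfile = "dp")
118th Congress Congressional District file (sumfile = "cd118")
Detailed DHC-A file (sumfile = "ddhca")
Detailed DHC-B file (sumfile = "ddhcb")
# A tibble: 10,868 × 4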
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 09 Connecticut PCT12_001N 3605944
2 10 Delaware PCT12_001N 989948
3 11 District of Columbia PCT12_001N 689545
4 12 Florida PCT12_001N 21538187
5 13 Georgia PCT12_001N 10711908
6 15 Hawaii PCT12_001N 1455271
7 16 Idaho PCT12_001N 1839106
8 17 Illinois PCT12_001N 12812508
9 18 Indiana PCT12_001N 6785528
10 19 Iowa PCT12_001N 3190369
# ℹ 10,858 more rows
# A tibble: 52 × 211
GEOID NAME PCT12_001N PCT12_002N PCT12_003N PCT12_004N PCT12_005N PCT12_006N
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 09 Conn… 3605944 1749853 16946 17489 17974 18571
2 10 Dela… 989948 476719 4847 5045 5239 5427
3 11 Dist… 689545 322777 4116 3730 3620 3723
4 12 Flor… 21538187 10464234 98241 100543 104962 108555
5 13 Geor… 10711908 5188570 58614 59817 62719 64271
6 15 Hawa… 1455271 727844 7757 7565 7701 8303
7 16 Idaho 1839106 919196 11029 10953 11558 12056
8 17 Illi… 12812508 6283130 68115 69120 71823 74165
9 18 Indi… 6785528 3344660 39308 40232 41971 43310
10 19 Iowa 3190369 1586092 18038 18566 19349 19864
# ℹ 42 more rows
# ℹ 203 more variables: PCT12_007N <dbl>, PCT12_008N <dbl>, PCT12_009N <dbl>,
# PCT12_010N <dbl>, PCT12_011N <dbl>, PCT12_012N <dbl>, PCT12_013N <dbl>,
# PCT12_014N <dbl>, PCT12_015N <dbl>, PCT12_016N <dbl>, PCT12_017N <dbl>,
# PCT12_018N <dbl>, PCT12_019N <dbl>, PCT12_020N <dbl>, PCT12_021N <dbl>,
# PCT12_022N <dbl>, PCT12_023N <dbl>, PCT12_024N <dbl>, PCT12_025N <dbl>,
# PCT12_026N <dbl>, PCT12_027N <dbl>, PCT12_028N <dbl>, PCT12_029N <dbl>, …
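The two outputs above show table PCT12 from the DHC file in long and wide form. A sketch of calls that would produce data of this shape follows; the object names are illustrative, and output = "wide" is the argument that switches from the default tidy format.

```r
library(tidycensus)

# Long (tidy) form: one row per state-variable combination
sex_by_age <- get_decennial(
  geography = "state",
  table = "PCT12",
  year = 2020,
  sumfile = "dhc"
)

# Wide form: one row per state, one column per variable
sex_by_age_wide <- get_decennial(
  geography = "state",
  table = "PCT12",
  year = 2020,
  sumfile = "dhc",
  output = "wide"
)
```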
Census variables can be hard to remember; using a named vector to request variables will replace the Census IDs with a custom input
In long form, these custom inputs will populate the variable column; in wide form, they will replace the column names
# A tibble: 67 × 4
GEOID NAME married partnered
<chr> <chr> <dbl> <dbl>
1 12001 Alachua County, Florida 0.3 0.2
2 12003 Baker County, Florida 0.1 0.1
3 12005 Bay County, Florida 0.2 0.2
4 12007 Bradford County, Florida 0.1 0.1
5 12009 Brevard County, Florida 0.2 0.1
6 12011 Broward County, Florida 0.4 0.3
7 12013 Calhoun County, Florida 0.1 0.1
8 12015 Charlotte County, Florida 0.2 0.1
9 12017 Citrus County, Florida 0.2 0.1
10 12019 Clay County, Florida 0.2 0.1
# ℹ 57 more rows
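A sketch of the named-vector pattern follows. It deliberately uses different variables from the output above: total population (P1_001N) and total housing units (H1_001N) from the PL 94-171 file. The names population and housing_units are custom labels chosen for illustration.

```r
library(tidycensus)

# The names on the left replace the Census IDs in the result
fl_counts <- get_decennial(
  geography = "county",
  state = "FL",
  variables = c(population = "P1_001N", housing_units = "H1_001N"),
  year = 2020,
  output = "wide"
)

fl_counts
```

Because output = "wide" is used, the custom names become column names; in the default long form they would appear in the variable column instead.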
Use load_variables(2020, "dhc") to find a variable that interests you from the Demographic and Housing Characteristics file.
Use get_decennial() to fetch data on that variable from the decennial US Census for counties in a state of your choosing.
Census data are commonly used in wide format, with categories spread across the columns
tidyverse tools work better with data that are in “tidy”, or long format; this format is returned by tidycensus by default
Goal: return data “ready to go” for use with tidyverse tools
dplyr’s arrange() function sorts data based on values in one or more columns, and filter() helps you query data based on column values
Example: what are the largest and smallest counties in Missouri by population?
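A sketch of how this question might be answered with arrange(), plus a filter() step. The 5,000-person threshold in the filter is an arbitrary value chosen for illustration, and the object names are illustrative.

```r
library(tidycensus)
library(dplyr)

# Total population for every county in Missouri
mo_population <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  state = "MO",
  year = 2020
)

# Smallest counties first (ascending is the default)
smallest <- arrange(mo_population, value)

# Largest counties first
largest <- arrange(mo_population, desc(value))

# Keep only counties with fewer than 5,000 residents
below5000 <- filter(mo_population, value < 5000)
```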
# A tibble: 115 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 29227 Worth County, Missouri P1_001N 1973
2 29129 Mercer County, Missouri P1_001N 3538
3 29103 Knox County, Missouri P1_001N 3744
4 29197 Schuyler County, Missouri P1_001N 4032
5 29087 Holt County, Missouri P1_001N 4223
6 29171 Putnam County, Missouri P1_001N 4681
7 29199 Scotland County, Missouri P1_001N 4716
8 29035 Carter County, Missouri P1_001N 5202
9 29005 Atchison County, Missouri P1_001N 5305
10 29211 Sullivan County, Missouri P1_001N 5999
# ℹ 105 more rows
# A tibble: 115 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 29189 St. Louis County, Missouri P1_001N 1004125
2 29095 Jackson County, Missouri P1_001N 717204
3 29183 St. Charles County, Missouri P1_001N 405262
4 29510 St. Louis city, Missouri P1_001N 301578
5 29077 Greene County, Missouri P1_001N 298915
6 29047 Clay County, Missouri P1_001N 253335
7 29099 Jefferson County, Missouri P1_001N 226739
8 29019 Boone County, Missouri P1_001N 183610
9 29097 Jasper County, Missouri P1_001N 122761
10 29037 Cass County, Missouri P1_001N 107824
# ℹ 105 more rows
The filter() function subsets data according to a specified condition, much like a SQL query
Many decennial Census and ACS variables are organized in tables in which the first variable represents a summary variable, or denominator for the others
The parameter summary_var can be used to generate a new column in long-form data for a requested denominator, which works well for normalizing estimates
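A sketch of a summary_var request for metropolitan and micropolitan areas, assuming table P5 (Hispanic or Latino origin by race) in the DHC file with P5_001N as the total-population denominator. The variable IDs and group labels here are assumptions worth confirming with load_variables(2020, "dhc").

```r
library(tidycensus)

# Assumed variable IDs from DHC table P5; labels are custom names
race_vars <- c(
  Hispanic = "P5_010N",
  White = "P5_003N",
  Black = "P5_004N",
  Native = "P5_005N",
  Asian = "P5_006N",
  HIPI = "P5_007N"
)

# summary_var adds the denominator as a summary_value column
cbsa_race <- get_decennial(
  geography = "cbsa",
  variables = race_vars,
  summary_var = "P5_001N",
  year = 2020,
  sumfile = "dhc"
)

cbsa_race
```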
# A tibble: 5,634 × 5
GEOID NAME variable value summary_value
<chr> <chr> <chr> <dbl> <dbl>
1 49420 Yakima, WA Metro Area Hispanic 130049 256728
2 49460 Yankton, SD Micro Area Hispanic 1234 23310
3 49500 Yauco, PR Metro Area Hispanic 85665 86142
4 49620 York-Hanover, PA Metro Area Hispanic 39360 456438
5 49660 Youngstown-Warren-Boardman, OH-PA Metro … Hispanic 19881 541243
6 49700 Yuba City, CA Metro Area Hispanic 55088 181208
7 49740 Yuma, AZ Metro Area Hispanic 130003 203881
8 49780 Zanesville, OH Micro Area Hispanic 1055 86410
9 49820 Zapata, TX Micro Area Hispanic 12999 13889
10 48300 Wenatchee, WA Metro Area Hispanic 36741 122012
# ℹ 5,624 more rows
dplyr’s mutate() function is used to calculate new columns in your data; the select() function can keep or drop columns by name
In a tidyverse workflow, these steps are commonly linked using the pipe operator (%>%) from the magrittr package
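To illustrate the pattern without another API call, this sketch rebuilds three rows from the summary_var output shown earlier and computes a percentage from them. The object names are illustrative.

```r
library(dplyr)

# Three rows copied from the earlier output (count and denominator)
cbsa_hispanic <- tibble::tibble(
  NAME = c("Yakima, WA Metro Area", "Yankton, SD Micro Area", "Yauco, PR Metro Area"),
  variable = "Hispanic",
  value = c(130049, 1234, 85665),
  summary_value = c(256728, 23310, 86142)
)

# Calculate the percentage, then keep only the columns we need
cbsa_hispanic_percent <- cbsa_hispanic %>%
  mutate(percent = 100 * (value / summary_value)) %>%
  select(NAME, variable, percent)

cbsa_hispanic_percent
```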
# A tibble: 5,634 × 3
NAME variable percent
<chr> <chr> <dbl>
1 Yakima, WA Metro Area Hispanic 50.7
2 Yankton, SD Micro Area Hispanic 5.29
3 Yauco, PR Metro Area Hispanic 99.4
4 York-Hanover, PA Metro Area Hispanic 8.62
5 Youngstown-Warren-Boardman, OH-PA Metro Area Hispanic 3.67
6 Yuba City, CA Metro Area Hispanic 30.4
7 Yuma, AZ Metro Area Hispanic 63.8
8 Zanesville, OH Micro Area Hispanic 1.22
9 Zapata, TX Micro Area Hispanic 93.6
10 Wenatchee, WA Metro Area Hispanic 30.1
# ℹ 5,624 more rows
The group_by() and summarize() functions in dplyr are used to implement the split-apply-combine method of data analysis
The default “tidy” format returned by tidycensus is designed to work well with group-wise Census data analysis workflows
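A sketch of a group-wise workflow: find the largest of the requested groups within each metropolitan or micropolitan area, then count how often each group is the largest. It assumes the same DHC table P5 variable IDs as a typical race and ethnicity request; confirm them with load_variables(2020, "dhc"). Object names are illustrative.

```r
library(tidycensus)
library(dplyr)

# Assumed variable IDs from DHC table P5; labels are custom names
race_vars <- c(
  Hispanic = "P5_010N", White = "P5_003N", Black = "P5_004N",
  Native = "P5_005N", Asian = "P5_006N", HIPI = "P5_007N"
)

cbsa_race <- get_decennial(
  geography = "cbsa",
  variables = race_vars,
  summary_var = "P5_001N",
  year = 2020,
  sumfile = "dhc"
)

# Split by area, keep the row with the highest percentage in each
largest_group <- cbsa_race %>%
  mutate(percent = 100 * (value / summary_value)) %>%
  select(NAME, variable, percent) %>%
  group_by(NAME) %>%
  filter(percent == max(percent))

# Combine: number of areas in which each group is the largest
group_counts <- largest_group %>%
  group_by(variable) %>%
  summarize(n = n())
```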
# A tibble: 939 × 3
# Groups: NAME [939]
NAME variable percent
<chr> <chr> <dbl>
1 Yakima, WA Metro Area Hispanic 50.7
2 Yauco, PR Metro Area Hispanic 99.4
3 Yuma, AZ Metro Area Hispanic 63.8
4 Zapata, TX Micro Area Hispanic 93.6
5 Del Rio, TX Micro Area Hispanic 80.3
6 Deming, NM Micro Area Hispanic 65.6
7 Dodge City, KS Micro Area Hispanic 57.4
8 Dumas, TX Micro Area Hispanic 59.2
9 Eagle Pass, TX Micro Area Hispanic 94.9
10 El Centro, CA Metro Area Hispanic 85.2
# ℹ 929 more rows
One of the best features of tidycensus is the argument geometry = TRUE, which gets you the correct Census geometries with no hassle
get_decennial() with geometry = TRUE returns a spatial Census dataset containing simple feature geometries; learn more on February 26
Let’s take a look at some examples
geometry = TRUE does the hard work for you of acquiring and pre-joining spatial Census data
Consider using the Demographic Profile for pre-tabulated percentages
Simple feature collection with 546 features and 4 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -82.64474 ymin: 37.20148 xmax: -77.71952 ymax: 40.6388
Geodetic CRS: NAD83
# A tibble: 546 × 5
GEOID NAME variable value geometry
<chr> <chr> <chr> <dbl> <POLYGON [°]>
1 54001965700 Census Tract 9657; Barb… DP1_002… 22.3 ((-80.07606 39.1016, -80…
2 54029020700 Census Tract 207; Hanco… DP1_002… 22.4 ((-80.57509 40.41198, -8…
3 54069000200 Census Tract 2; Ohio Co… DP1_002… 22.3 ((-80.71044 40.12813, -8…
4 54011001400 Census Tract 14; Cabell… DP1_002… 15.1 ((-82.43736 38.41681, -8…
5 54099005100 Census Tract 51; Wayne … DP1_002… 23.3 ((-82.54137 38.39631, -8…
6 54099005200 Census Tract 52; Wayne … DP1_002… 21.8 ((-82.55603 38.39246, -8…
7 54059957200 Census Tract 9572; Ming… DP1_002… 20.3 ((-82.35141 37.78997, -8…
8 54071970600 Census Tract 9706; Pend… DP1_002… 29.2 ((-79.47433 38.46251, -7…
9 54083965900 Census Tract 9659; Rand… DP1_002… 20.1 ((-80.26471 38.71209, -8…
10 54085962400 Census Tract 9624; Ritc… DP1_002… 25 ((-81.32879 39.15258, -8…
# ℹ 536 more rows
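A call of the following shape returns a simple features object like the one printed above. This sketch requests total population from the PL 94-171 file for West Virginia tracts rather than a Demographic Profile variable, since the variable ID in the printed output is truncated. The object name wv_tracts is illustrative.

```r
library(tidycensus)

# Tract-level total population with tract polygons attached
wv_tracts <- get_decennial(
  geography = "tract",
  variables = "P1_001N",
  state = "WV",
  year = 2020,
  geometry = TRUE
)

wv_tracts
```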
Mapping, GIS, and spatial data are the subject of our February 26 workshop, so be sure to check that out!
Even before we dive deeper into spatial data, it is very useful to be able to explore your results on an interactive map
Our solution: mapview()
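A minimal sketch of that approach, assuming the mapview package is installed. It maps tract-level total population in West Virginia, shaded by the value column; the variable choice and object names are illustrative.

```r
library(tidycensus)
library(mapview)

# Spatial dataset: tract polygons with total population
wv_tracts <- get_decennial(
  geography = "tract",
  variables = "P1_001N",
  state = "WV",
  year = 2020,
  geometry = TRUE
)

# Interactive map, colored by the value column
m <- mapview(wv_tracts, zcol = "value")
m
```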