2024 SSDAN Webinar Series
2024-02-22
Associate Professor of Geography at TCU
Spatial data science researcher and consultant
Package developer: tidycensus, tigris, mapboxapi, crsuggest, idbr (R), pygris (Python)
Book: Analyzing US Census Data: Methods, Maps, and Models in R
Thursday, February 8: Working with the 2022 American Community Survey with R and tidycensus
Today: Analyzing 2020 Decennial US Census Data in R
Thursday, March 7th: Doing “GIS” and making maps with US Census Data in R
Hour 1: Getting started with 2020 Decennial US Census data in R
Hour 2: Analysis workflows with 2020 Census data
Hour 3: The detailed DHC-A data and time-series analysis
R: programming language and software environment for data analysis (and wherever else your imagination can take you!)
RStudio: integrated development environment (IDE) for R developed by Posit
Posit Cloud: run RStudio with today’s workshop pre-configured at https://posit.cloud/content/7549022
Complete count of the US population mandated by Article 1, Sections 2 and 9 in the US Constitution
Directed by the US Census Bureau (US Department of Commerce); conducted every 10 years since 1790
Used for proportional representation / congressional redistricting
Limited set of questions asked about race, ethnicity, age, sex, and housing tenure
Available datasets from the 2020 US Census include:
data.census.gov is the main, revamped interactive data portal for browsing and downloading Census datasets
The US Census Application Programming Interface (API) allows developers to access Census data resources programmatically
Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested);
Automatically downloads and merges Census geometries to data for mapping;
Includes a variety of analytic tools to support common Census workflows;
States and counties can be requested by name (no more looking up FIPS codes!)
To get started, install the packages you’ll need for today’s workshop
If you are using the Posit Cloud environment, these packages are already installed for you
tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day
Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.
Once activated, use the census_api_key() function to set your key as an environment variable
The get_decennial() function is used to acquire data from the decennial US Census
The two required arguments for the function to work are geography and variables; for 2020 Census data, use year = 2020
By default, get_decennial() returns a data frame with four columns: GEOID, NAME, variable, and value
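A call like the following sketch, using the total population variable P1_001N from the output below, returns data in this format:

```r
library(tidycensus)

# Total population by state from the 2020 decennial Census
pop20 <- get_decennial(
  geography = "state",
  variables = "P1_001N",
  year = 2020
)
```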
# A tibble: 52 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 42 Pennsylvania P1_001N 13002700
2 06 California P1_001N 39538223
3 54 West Virginia P1_001N 1793716
4 49 Utah P1_001N 3271616
5 36 New York P1_001N 20201249
6 11 District of Columbia P1_001N 689545
7 02 Alaska P1_001N 733391
8 12 Florida P1_001N 21538187
9 45 South Carolina P1_001N 5118425
10 38 North Dakota P1_001N 779094
# ℹ 42 more rows
When we use get_decennial() for the 2020 Census for the first time, we see the following messages:
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data summary file
Note: 2020 decennial Census data use differential privacy, a technique that
introduces errors into data to preserve respondent confidentiality.
ℹ Small counts should be interpreted with caution.
ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
This message is displayed once per session.
The Census Bureau is using differential privacy in an attempt to preserve respondent confidentiality in the 2020 Census data, which is required under US Code Title 13
Intentional errors are introduced into data, impacting the accuracy of small area counts (e.g. some blocks with children, but no adults)
Advocates argue that differential privacy is necessary to satisfy Title 13 requirements given modern database reconstruction technologies; critics contend that the method makes data less useful with no tangible privacy benefit
The table parameter can be used to obtain all related variables in a “table” at once
# A tibble: 3,796 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 42 Pennsylvania P2_001N 13002700
2 06 California P2_001N 39538223
3 54 West Virginia P2_001N 1793716
4 49 Utah P2_001N 3271616
5 36 New York P2_001N 20201249
6 11 District of Columbia P2_001N 689545
7 02 Alaska P2_001N 733391
8 12 Florida P2_001N 21538187
9 45 South Carolina P2_001N 5118425
10 38 North Dakota P2_001N 779094
# ℹ 3,786 more rows
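The output above reflects a request like this sketch, which pulls every variable in table P2:

```r
library(tidycensus)

# Request all variables in table P2 at once
table_p2 <- get_decennial(
  geography = "state",
  table = "P2",
  year = 2020
)
```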
Information on available geographies, and how to specify them, can be found in the tidycensus documentation
The 2020 Census allows you to get data down to the Census block (unlike the ACS, covered last week)
For geographies available below the state level, the state parameter allows you to query data for a specific state
tidycensus translates state names and postal abbreviations internally, so you don’t need to remember the FIPS codes!
Example: total population in Texas by county
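A minimal sketch of that request:

```r
library(tidycensus)

# Total population for all 254 Texas counties
tx_population <- get_decennial(
  geography = "county",
  variables = "P1_001N",
  state = "TX",
  year = 2020
)
```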
# A tibble: 254 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 48239 Jackson County, Texas P1_001N 14988
2 48233 Hutchinson County, Texas P1_001N 20617
3 48235 Irion County, Texas P1_001N 1513
4 48237 Jack County, Texas P1_001N 8472
5 48241 Jasper County, Texas P1_001N 32980
6 48243 Jeff Davis County, Texas P1_001N 1996
7 48245 Jefferson County, Texas P1_001N 256526
8 48247 Jim Hogg County, Texas P1_001N 4838
9 48249 Jim Wells County, Texas P1_001N 38891
10 48251 Johnson County, Texas P1_001N 179927
# ℹ 244 more rows
# A tibble: 2,620 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 483217303013065 Block 3065, Block Group 3, Census Tract 7303.… P1_001N 12
2 483217303013062 Block 3062, Block Group 3, Census Tract 7303.… P1_001N 6
3 483217303013063 Block 3063, Block Group 3, Census Tract 7303.… P1_001N 30
4 483217303013064 Block 3064, Block Group 3, Census Tract 7303.… P1_001N 31
5 483217303013066 Block 3066, Block Group 3, Census Tract 7303.… P1_001N 14
6 483217303013067 Block 3067, Block Group 3, Census Tract 7303.… P1_001N 19
7 483217303013068 Block 3068, Block Group 3, Census Tract 7303.… P1_001N 19
8 483217303013069 Block 3069, Block Group 3, Census Tract 7303.… P1_001N 15
9 483217303013070 Block 3070, Block Group 3, Census Tract 7303.… P1_001N 24
10 483217303013071 Block 3071, Block Group 3, Census Tract 7303.… P1_001N 3
# ℹ 2,610 more rows
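The block-level output above reflects a request like the following sketch; the county here is specified by the FIPS code 321 visible in the GEOIDs (a county name works as well):

```r
library(tidycensus)

# Block-level population for a single Texas county
tx_blocks <- get_decennial(
  geography = "block",
  variables = "P1_001N",
  state = "TX",
  county = "321",
  year = 2020
)
```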
To search for variables, use the load_variables() function along with a year and dataset
The View() function in RStudio allows for interactive browsing and filtering
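For example, to browse the variables in the 2020 PL 94-171 redistricting file:

```r
library(tidycensus)

# Load the variable lookup table for the 2020 PL 94-171 file
vars_pl <- load_variables(2020, "pl")

# In RStudio, open an interactive, filterable spreadsheet view
View(vars_pl)
```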
The different datasets in the 2020 Census are accessible by specifying a sumfile in get_decennial(). The datasets we’ll cover today include:
Demographic and Housing Characteristics file (sumfile = "dhc")
Demographic Profile (sumfile = "dp")
118th Congress congressional districts file (sumfile = "cd118")
Detailed Demographic and Housing Characteristics file A (sumfile = "ddhca")
# A tibble: 10,868 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 09 Connecticut PCT12_001N 3605944
2 10 Delaware PCT12_001N 989948
3 11 District of Columbia PCT12_001N 689545
4 12 Florida PCT12_001N 21538187
5 13 Georgia PCT12_001N 10711908
6 15 Hawaii PCT12_001N 1455271
7 16 Idaho PCT12_001N 1839106
8 17 Illinois PCT12_001N 12812508
9 18 Indiana PCT12_001N 6785528
10 19 Iowa PCT12_001N 3190369
# ℹ 10,858 more rows
# A tibble: 52 × 211
GEOID NAME PCT12_001N PCT12_002N PCT12_003N PCT12_004N PCT12_005N PCT12_006N
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 09 Conn… 3605944 1749853 16946 17489 17974 18571
2 10 Dela… 989948 476719 4847 5045 5239 5427
3 11 Dist… 689545 322777 4116 3730 3620 3723
4 12 Flor… 21538187 10464234 98241 100543 104962 108555
5 13 Geor… 10711908 5188570 58614 59817 62719 64271
6 15 Hawa… 1455271 727844 7757 7565 7701 8303
7 16 Idaho 1839106 919196 11029 10953 11558 12056
8 17 Illi… 12812508 6283130 68115 69120 71823 74165
9 18 Indi… 6785528 3344660 39308 40232 41971 43310
10 19 Iowa 3190369 1586092 18038 18566 19349 19864
# ℹ 42 more rows
# ℹ 203 more variables: PCT12_007N <dbl>, PCT12_008N <dbl>, PCT12_009N <dbl>,
# PCT12_010N <dbl>, PCT12_011N <dbl>, PCT12_012N <dbl>, PCT12_013N <dbl>,
# PCT12_014N <dbl>, PCT12_015N <dbl>, PCT12_016N <dbl>, PCT12_017N <dbl>,
# PCT12_018N <dbl>, PCT12_019N <dbl>, PCT12_020N <dbl>, PCT12_021N <dbl>,
# PCT12_022N <dbl>, PCT12_023N <dbl>, PCT12_024N <dbl>, PCT12_025N <dbl>,
# PCT12_026N <dbl>, PCT12_027N <dbl>, PCT12_028N <dbl>, PCT12_029N <dbl>, …
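The wide-format output above can be requested with output = "wide"; a sketch pulling the sex-by-age table PCT12 from the DHC file:

```r
library(tidycensus)

# One column per variable instead of the default long format
pct12_wide <- get_decennial(
  geography = "state",
  table = "PCT12",
  year = 2020,
  sumfile = "dhc",
  output = "wide"
)
```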
Census variables can be hard to remember; using a named vector to request variables will replace the Census IDs with a custom input
In long form, these custom inputs will populate the variable column; in wide form, they will replace the column names
# A tibble: 58 × 4
GEOID NAME married partnered
<chr> <chr> <dbl> <dbl>
1 06001 Alameda County, California 0.4 0.2
2 06003 Alpine County, California 0.3 0.2
3 06005 Amador County, California 0.2 0.1
4 06007 Butte County, California 0.2 0.1
5 06009 Calaveras County, California 0.2 0.1
6 06011 Colusa County, California 0.1 0
7 06013 Contra Costa County, California 0.3 0.1
8 06015 Del Norte County, California 0.1 0.1
9 06017 El Dorado County, California 0.2 0.1
10 06019 Fresno County, California 0.2 0.1
# ℹ 48 more rows
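The pattern looks like the sketch below; the IDs "DHC_VAR_1" and "DHC_VAR_2" are placeholders, not real variable IDs (look up real ones with load_variables(2020, "dhc")). The names in the vector become the labels in the result:

```r
library(tidycensus)

# Named vector: "married" and "partnered" replace the raw Census IDs.
# "DHC_VAR_1" / "DHC_VAR_2" are hypothetical placeholder IDs.
ca_hh <- get_decennial(
  geography = "county",
  variables = c(married = "DHC_VAR_1", partnered = "DHC_VAR_2"),
  state = "CA",
  year = 2020,
  sumfile = "dhc",
  output = "wide"
)
```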
Use load_variables(2020, "dhc") to find a variable that interests you from the Demographic and Housing Characteristics file.
Use get_decennial() to fetch data on that variable from the decennial US Census for counties in a state of your choosing.
Census data are commonly used in wide format, with categories spread across the columns
tidyverse tools work better with data that are in “tidy”, or long format; this format is returned by tidycensus by default
Goal: return data “ready to go” for use with tidyverse tools
dplyr’s arrange() function sorts data based on values in one or more columns, and filter() helps you query data based on column values
Example: what are the largest and smallest counties in Texas by population?
# A tibble: 254 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 48301 Loving County, Texas P1_001N 64
2 48269 King County, Texas P1_001N 265
3 48261 Kenedy County, Texas P1_001N 350
4 48311 McMullen County, Texas P1_001N 600
5 48033 Borden County, Texas P1_001N 631
6 48263 Kent County, Texas P1_001N 753
7 48443 Terrell County, Texas P1_001N 760
8 48393 Roberts County, Texas P1_001N 827
9 48345 Motley County, Texas P1_001N 1063
10 48155 Foard County, Texas P1_001N 1095
# ℹ 244 more rows
# A tibble: 254 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 48201 Harris County, Texas P1_001N 4731145
2 48113 Dallas County, Texas P1_001N 2613539
3 48439 Tarrant County, Texas P1_001N 2110640
4 48029 Bexar County, Texas P1_001N 2009324
5 48453 Travis County, Texas P1_001N 1290188
6 48085 Collin County, Texas P1_001N 1064465
7 48121 Denton County, Texas P1_001N 906422
8 48215 Hidalgo County, Texas P1_001N 870781
9 48141 El Paso County, Texas P1_001N 865657
10 48157 Fort Bend County, Texas P1_001N 822779
# ℹ 244 more rows
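Assuming the Texas county data from earlier are stored as tx_population (a name chosen here for illustration), the two orderings above come from arrange():

```r
library(dplyr)

# Smallest counties first (ascending by population)
arrange(tx_population, value)

# Largest counties first (descending by population)
arrange(tx_population, desc(value))
```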
The filter() function subsets data according to a specified condition, much like a SQL query
# A tibble: 8 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 48261 Kenedy County, Texas P1_001N 350
2 48263 Kent County, Texas P1_001N 753
3 48269 King County, Texas P1_001N 265
4 48301 Loving County, Texas P1_001N 64
5 48311 McMullen County, Texas P1_001N 600
6 48393 Roberts County, Texas P1_001N 827
7 48443 Terrell County, Texas P1_001N 760
8 48033 Borden County, Texas P1_001N 631
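The eight counties above, all with fewer than 1,000 residents, can be pulled with a condition like the following (again assuming the Texas data are stored as tx_population):

```r
library(dplyr)

# Keep only counties with a 2020 population below 1,000
filter(tx_population, value < 1000)
```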
Many decennial Census and ACS variables are organized in tables in which the first variable represents a summary variable, or denominator for the others
The summary_var parameter can be used to generate a new column in long-form data for a requested denominator, which works well for normalizing estimates
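A sketch of such a request, assuming P2_002N is the Hispanic or Latino count in table P2 with P2_001N (total population) as its denominator:

```r
library(tidycensus)

# Hispanic population by 118th Congress district, with total population
# attached as a summary_value column
hispanic_cd <- get_decennial(
  geography = "congressional district",
  variables = c(Hispanic = "P2_002N"),
  summary_var = "P2_001N",
  year = 2020,
  sumfile = "cd118"
)
```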
# A tibble: 2,640 × 5
GEOID NAME variable value summary_value
<chr> <chr> <chr> <dbl> <dbl>
1 0101 Congressional District 1 (118th Congress… Hispanic 27196 717754
2 0102 Congressional District 2 (118th Congress… Hispanic 31708 717755
3 0103 Congressional District 3 (118th Congress… Hispanic 26849 717754
4 0104 Congressional District 4 (118th Congress… Hispanic 53874 717754
5 0105 Congressional District 5 (118th Congress… Hispanic 47307 717754
6 0106 Congressional District 6 (118th Congress… Hispanic 46872 717754
7 0107 Congressional District 7 (118th Congress… Hispanic 30241 717754
8 0200 Congressional District (at Large) (118th… Hispanic 49824 733391
9 0401 Congressional District 1 (118th Congress… Hispanic 129929 794611
10 0402 Congressional District 2 (118th Congress… Hispanic 133934 794612
# ℹ 2,630 more rows
dplyr’s mutate() function is used to calculate new columns in your data; the select() function can keep or drop columns by name
In a tidyverse workflow, these steps are commonly linked using the pipe operator (%>%) from the magrittr package
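A piped sketch, assuming the congressional district data with its summary_value column are stored as hispanic_cd (a name chosen here for illustration):

```r
library(dplyr)

# Normalize the count by the summary column, then keep three columns
hispanic_pct <- hispanic_cd %>%
  mutate(percent = 100 * (value / summary_value)) %>%
  select(NAME, variable, percent)
```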
# A tibble: 2,640 × 3
NAME variable percent
<chr> <chr> <dbl>
1 Congressional District 1 (118th Congress), Alabama Hispanic 3.79
2 Congressional District 2 (118th Congress), Alabama Hispanic 4.42
3 Congressional District 3 (118th Congress), Alabama Hispanic 3.74
4 Congressional District 4 (118th Congress), Alabama Hispanic 7.51
5 Congressional District 5 (118th Congress), Alabama Hispanic 6.59
6 Congressional District 6 (118th Congress), Alabama Hispanic 6.53
7 Congressional District 7 (118th Congress), Alabama Hispanic 4.21
8 Congressional District (at Large) (118th Congress), Alaska Hispanic 6.79
9 Congressional District 1 (118th Congress), Arizona Hispanic 16.4
10 Congressional District 2 (118th Congress), Arizona Hispanic 16.9
# ℹ 2,630 more rows
The group_by() and summarize() functions in dplyr are used to implement the split-apply-combine method of data analysis
The default “tidy” format returned by tidycensus is designed to work well with group-wise Census data analysis workflows
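One common group-wise pattern is finding the largest group in each district; a sketch, assuming a data frame cd_pct (a hypothetical name) with one row per district/group combination and a percent column:

```r
library(dplyr)

# Within each district, keep the row for the group with the highest share
largest_group <- cd_pct %>%
  group_by(NAME) %>%
  filter(percent == max(percent))
```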
# A tibble: 437 × 3
# Groups: NAME [437]
NAME variable percent
<chr> <chr> <dbl>
1 Congressional District 3 (118th Congress), Arizona Hispanic 62.6
2 Congressional District 7 (118th Congress), Arizona Hispanic 59.8
3 Congressional District 8 (118th Congress), California Hispanic 35.2
4 Congressional District 9 (118th Congress), California Hispanic 41.5
5 Congressional District 13 (118th Congress), California Hispanic 65.1
6 Congressional District 18 (118th Congress), California Hispanic 65.3
7 Congressional District 21 (118th Congress), California Hispanic 64.3
8 Congressional District 22 (118th Congress), California Hispanic 73.2
9 Congressional District 23 (118th Congress), California Hispanic 41.6
10 Congressional District 25 (118th Congress), California Hispanic 64.8
# ℹ 427 more rows
One of the best features of tidycensus is the argument geometry = TRUE, which gets you the correct Census geometries with no hassle
get_decennial() with geometry = TRUE returns a spatial Census dataset containing simple feature geometries; learn more on March 7
Let’s take a look at some examples
geometry = TRUE does the hard work for you of acquiring and pre-joining spatial Census data
Consider using the Demographic Profile for pre-tabulated percentages
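A sketch of such a request for Iowa Census tracts; "DP1_EXAMPLE" is a placeholder, not a real variable ID (look up Demographic Profile IDs with load_variables(2020, "dp")):

```r
library(tidycensus)

# geometry = TRUE attaches simple feature geometries to the result.
# "DP1_EXAMPLE" is a hypothetical placeholder variable ID.
ia_data <- get_decennial(
  geography = "tract",
  variables = "DP1_EXAMPLE",
  state = "IA",
  year = 2020,
  sumfile = "dp",
  geometry = TRUE
)
```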
Simple feature collection with 896 features and 4 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -96.63862 ymin: 40.37566 xmax: -90.14006 ymax: 43.5012
Geodetic CRS: NAD83
# A tibble: 896 × 5
GEOID NAME variable value geometry
<chr> <chr> <chr> <dbl> <POLYGON [°]>
1 19163010600 Census Tract 106; Scott… DP1_002… 7.1 ((-90.57555 41.52573, -9…
2 19163011700 Census Tract 117; Scott… DP1_002… 13.1 ((-90.5766 41.55272, -90…
3 19163013500 Census Tract 135; Scott… DP1_002… 17.2 ((-90.52247 41.53852, -9…
4 19057000300 Census Tract 3; Des Moi… DP1_002… 23.1 ((-91.15105 40.8279, -91…
5 19145490300 Census Tract 4903; Page… DP1_002… 26.2 ((-95.38534 40.74262, -9…
6 19105070100 Census Tract 701; Jones… DP1_002… 20.2 ((-91.36461 42.15154, -9…
7 19103001200 Census Tract 12; Johnso… DP1_002… 17.6 ((-91.52284 41.66125, -9…
8 19123950400 Census Tract 9504; Maha… DP1_002… 21.4 ((-92.64872 41.33564, -9…
9 19169000300 Census Tract 3; Story C… DP1_002… 24.9 ((-93.63794 42.05642, -9…
10 19187000400 Census Tract 4; Webster… DP1_002… 12.7 ((-94.19502 42.50798, -9…
# ℹ 886 more rows
Mapping, GIS, and spatial data is the subject of our March 7 workshop - so be sure to check that out!
Even before we dive deeper into spatial data, it is very useful to be able to explore your results on an interactive map
Our solution: the mapview() function; use the zcol argument to choose the column to visualize on the map
Use the saveWidget() function over the map slot of your mapview map to save out a standalone HTML file, which you can embed in websites
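A sketch, assuming a spatial dataset ia_data obtained with geometry = TRUE:

```r
library(mapview)
library(htmlwidgets)

# Interactive map shaded by the "value" column
m <- mapview(ia_data, zcol = "value")

# Save a standalone HTML file from the map slot of the mapview object
saveWidget(m@map, "ia_map.html")
```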
Try out a different variable (use load_variables(2020, "dp") to look them up) for a different state, or state / county combination.
The Detailed DHC-A file: a tabulation of 2020 Decennial Census results for population by sex and age
Key feature: break-outs for thousands of racial and ethnic groups
An “adaptive design” is used, meaning that data for different groups / geographies may be found in different tables
There is considerable sparsity in the data, especially when going down to the Census tract level
You’ll query the DDHC-A file with the argument sumfile = "ddhca" in get_decennial()
A new argument, pop_group, is required to use the DDHC-A; it takes a population group code.
Use pop_group = "all" to query for all groups; set pop_group_label = TRUE to return the label for the population group
Look up variables with load_variables(2020, "ddhca")
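The Minnesota output below reflects a request like:

```r
library(tidycensus)

# All detailed population groups in Minnesota, with readable labels
mn_groups <- get_decennial(
  geography = "state",
  variables = "T01001_001N",
  state = "MN",
  year = 2020,
  sumfile = "ddhca",
  pop_group = "all",
  pop_group_label = TRUE
)
```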
# A tibble: 2,996 × 6
GEOID NAME pop_group pop_group_label variable value
<chr> <chr> <chr> <chr> <chr> <dbl>
1 27 Minnesota 1002 European alone T01001_001N 3162905
2 27 Minnesota 1003 Albanian alone T01001_001N 512
3 27 Minnesota 1004 Alsatian alone T01001_001N 27
4 27 Minnesota 1005 Andorran alone T01001_001N NA
5 27 Minnesota 1006 Armenian alone T01001_001N 605
6 27 Minnesota 1007 Austrian alone T01001_001N 2552
7 27 Minnesota 1008 Azerbaijani alone T01001_001N 103
8 27 Minnesota 1009 Basque alone T01001_001N 52
9 27 Minnesota 1010 Belarusian alone T01001_001N 1579
10 27 Minnesota 1011 Belgian alone T01001_001N 3864
# ℹ 2,986 more rows
A new function, get_pop_groups(), helps you look up population group codes
It works for SF2/SF4 in 2000 and SF2 in 2010 as well!
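A sketch of the lookup, assuming get_pop_groups() takes a year and a sumfile in the same style as load_variables():

```r
library(tidycensus)

# Look up available population group codes for the 2020 DDHC-A
ddhca_groups <- get_pop_groups(2020, "ddhca")
```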
get_decennial(
geography = "county",
variables = "T02001_001N",
state = "MN",
county = "Hennepin",
pop_group = "1325",
year = 2020,
sumfile = "ddhca"
)
Error in `get_decennial()`:
! Error in load_data_decennial(geography, variables, key, year, sumfile, :
Your DDHC-A request returned No Content from the API.
ℹ The DDHC-A file uses an 'adaptive design' where data availability varies by geography and by population group.
ℹ Read Section 3-1 at https://www2.census.gov/programs-surveys/decennial/2020/technical-documentation/complete-tech-docs/detailed-demographic-and-housing-characteristics-file-a/2020census-detailed-dhc-a-techdoc.pdf for more information.
ℹ In tidycensus, use the function `check_ddhca_groups()` to see if your data is available.
The function check_ddhca_groups() can be used to see which tables to use for the data you want
Given data sparsity in the DDHC-A data, should you make maps with it?
I’m not personally a fan of mapping data that are geographically sparse. But…
I don’t think choropleth maps are advisable with geographically incomplete data in most cases
Other map types - like graduated symbols or dot-density maps - may be more appropriate
The tidycensus function as_dot_density() allows you to specify the number of people represented in each dot, which means you can represent data-suppressed areas as 0 more confidently
A common use-case for the 2020 decennial Census data is to assess change over time
For example: which areas have experienced the most population growth, and which have experienced the steepest declines?
tidycensus allows users to access the 2000 and 2010 decennial Census data for comparison, though variable IDs will differ
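The 2010 output below comes from a request like this sketch; note the older variable ID P001001 for total population:

```r
library(tidycensus)

# Total population by county from the 2010 decennial Census
county_pop_10 <- get_decennial(
  geography = "county",
  variables = "P001001",
  year = 2010
)
```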
# A tibble: 3,221 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 05131 Sebastian County, Arkansas P001001 125744
2 05133 Sevier County, Arkansas P001001 17058
3 05135 Sharp County, Arkansas P001001 17264
4 05137 Stone County, Arkansas P001001 12394
5 05139 Union County, Arkansas P001001 41639
6 05141 Van Buren County, Arkansas P001001 17295
7 05143 Washington County, Arkansas P001001 203065
8 05145 White County, Arkansas P001001 77076
9 05149 Yell County, Arkansas P001001 22185
10 06011 Colusa County, California P001001 21419
# ℹ 3,211 more rows
The select() function can both subset datasets by column and rename columns, “cleaning up” a dataset before joining to another dataset
# A tibble: 3,221 × 2
GEOID value10
<chr> <dbl>
1 05131 125744
2 05133 17058
3 05135 17264
4 05137 12394
5 05139 41639
6 05141 17295
7 05143 203065
8 05145 77076
9 05149 22185
10 06011 21419
# ℹ 3,211 more rows
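Assuming the 2010 data are stored as county_pop_10 (a name chosen here for illustration), the rename-and-subset step above looks like:

```r
library(dplyr)

# Keep GEOID as the join key and rename value to value10
county_pop_10_clean <- county_pop_10 %>%
  select(GEOID, value10 = value)
```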
The two datasets can then be merged with dplyr’s *_join() family of functions
# A tibble: 3,221 × 4
GEOID NAME value20 value10
<chr> <chr> <dbl> <dbl>
1 06039 Madera County, California 156255 150865
2 06041 Marin County, California 262321 252409
3 06043 Mariposa County, California 17131 18251
4 06045 Mendocino County, California 91601 87841
5 06047 Merced County, California 281202 255793
6 06049 Modoc County, California 8700 9686
7 06051 Mono County, California 13195 14202
8 06053 Monterey County, California 439035 415057
9 06055 Napa County, California 138019 136484
10 06057 Nevada County, California 102241 98764
# ℹ 3,211 more rows
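A sketch of the join, assuming a 2020 data frame county_pop_20 with its value column renamed to value20 (names chosen here for illustration):

```r
library(dplyr)

# Match rows on GEOID; counties with no 2010 match get NA for value10
county_joined <- county_pop_20 %>%
  left_join(county_pop_10_clean, by = "GEOID")
```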
The mutate() function can be used to calculate new columns, allowing for assessment of change over time
# A tibble: 3,221 × 6
GEOID NAME value20 value10 total_change percent_change
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 06039 Madera County, California 156255 150865 5390 3.57
2 06041 Marin County, California 262321 252409 9912 3.93
3 06043 Mariposa County, California 17131 18251 -1120 -6.14
4 06045 Mendocino County, Californ… 91601 87841 3760 4.28
5 06047 Merced County, California 281202 255793 25409 9.93
6 06049 Modoc County, California 8700 9686 -986 -10.2
7 06051 Mono County, California 13195 14202 -1007 -7.09
8 06053 Monterey County, California 439035 415057 23978 5.78
9 06055 Napa County, California 138019 136484 1535 1.12
10 06057 Nevada County, California 102241 98764 3477 3.52
# ℹ 3,211 more rows
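Assuming the joined data frame county_joined from the previous step, the change columns above are computed with:

```r
library(dplyr)

# Absolute and percentage change from 2010 to 2020
county_change <- county_joined %>%
  mutate(
    total_change = value20 - value10,
    percent_change = 100 * (total_change / value10)
  )
```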
County names and boundaries can change from year to year, introducing potential problems in time-series analysis
This is particularly acute for small geographies like Census tracts & block groups, which we’ll cover on March 7!
# A tibble: 4 × 6
GEOID NAME value20 value10 total_change percent_change
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 02063 Chugach Census Area, Alaska 7102 NA NA NA
2 02066 Copper River Census Area, A… 2617 NA NA NA
3 02158 Kusilvak Census Area, Alaska 8368 NA NA NA
4 46102 Oglala Lakota County, South… 13672 NA NA NA