2 An introduction to tidycensus

The tidycensus package (K. Walker and Herman 2021), first released in 2017, is an R package designed to facilitate the process of acquiring and working with US Census Bureau population data in the R environment. The package has two distinct goals. First, tidycensus aims to make Census data available to R users in a tidyverse-friendly format, helping kick-start the process of generating insights from US Census data. Second, the package is designed to streamline the data wrangling process for spatial Census data analysts. With tidycensus, R users can request geometry along with attributes for their Census data, helping facilitate mapping and spatial analysis. This functionality of tidycensus is covered in more depth in Chapters 6, 7, and 8.

As discussed in the previous chapter, the US Census Bureau makes a wide range of datasets available to the user community through their APIs and other data download resources. tidycensus is not a comprehensive portal to these data resources; instead, it focuses on a select number of datasets implemented in a series of core functions. These core functions in tidycensus include:

get_decennial(), which requests data from the US Decennial Census APIs for 2000, 2010, and 2020.
get_acs(), which requests data from the 1-year and 5-year American Community Survey samples. Data are available from the 1-year ACS back to 2005 and the 5-year ACS back to 2005-2009.
get_estimates(), an interface to the Population Estimates APIs. These datasets include yearly estimates of population characteristics by state, county, and metropolitan area, along with components-of-change estimates such as births, deaths, and migration rates.
get_pums(), which accesses data from the ACS Public Use Microdata Sample APIs. These samples include anonymized individual-level records from the ACS organized by household and are highly useful for many different social science analyses. get_pums() is covered in more depth in Chapters 9 and 10.
get_flows(), an interface to the ACS Migration Flows APIs. It provides information on in- and out-flows from various geographies for the 5-year ACS samples, enabling origin-destination analyses.

2.1 Getting started with tidycensus

To get started with tidycensus, users should install the package with install.packages("tidycensus") if not yet installed; load the package with library("tidycensus"); and set their Census API key with the census_api_key() function. API keys can be obtained at https://api.census.gov/data/key_signup.html. After you’ve signed up for an API key, be sure to activate the key from the email you receive from the Census Bureau so it works correctly. Declaring install = TRUE when calling census_api_key() will install the key for use in future R sessions, which may be convenient for many users.

library(tidycensus)
# census_api_key("YOUR KEY GOES HERE", install = TRUE)
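Before requesting data, it can help to confirm that a key is actually visible to the current R session. The check below is a minimal sketch in base R; it assumes the key is stored in the CENSUS_API_KEY environment variable, which is where census_api_key() places it, and the name key_is_set is only illustrative.

```r
# Look up the environment variable that holds the Census API key.
# Sys.getenv() returns an empty string when the variable is not set.
key <- Sys.getenv("CENSUS_API_KEY")
key_is_set <- nzchar(key)

if (key_is_set) {
  message("A Census API key is available in this session.")
} else {
  message("No key found; call census_api_key() before requesting data.")
}
```

If the key was installed with install = TRUE during the current session, R may need to be restarted (or the .Renviron file re-read) before the check reports a key.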

2.1.1 Decennial Census

Once an API key is installed, users can obtain decennial Census or ACS data with a single function call. Let’s start with get_decennial(), which is used to access decennial Census data from the 2000, 2010, and 2020 decennial US Censuses.

To get data from the decennial US Census, users must specify a string representing the requested geography, passed to geography, along with either a vector of Census variable IDs, passed to variables, or a Census table ID, passed to table. The code below gets data on total population by state from the 2010 decennial Census.

total_population_10 <- get_decennial(
  geography = "state", 
  variables = "P001001",
  year = 2010
)

Table 2.1: Total population by state, 2010 Census
GEOID	NAME	variable	value
01	Alabama	P001001	4779736
02	Alaska	P001001	710231
04	Arizona	P001001	6392017
05	Arkansas	P001001	2915918
06	California	P001001	37253956
22	Louisiana	P001001	4533372
21	Kentucky	P001001	4339367
08	Colorado	P001001	5029196
09	Connecticut	P001001	3574097
10	Delaware	P001001	897934

The function returns a tibble of data from the 2010 US Census (the function default year) with information on total population by state, and assigns it to the object total_population_10. Data for 2000 or 2020 can also be obtained by supplying the appropriate year to the year parameter.

2.1.1.1 Summary files in the decennial Census

By default, get_decennial() uses the argument sumfile = "sf1", which fetches data from the decennial Census Summary File 1. This summary file exists for the 2000 and 2010 decennial US Censuses, and includes core demographic characteristics for Census geographies. The 2000 and 2010 decennial Census data also include Summary File 2, which contains information on a range of population and housing unit characteristics and is specified as "sf2". Detailed demographic information in the 2000 decennial Census such as income and occupation can be found in Summary Files 3 ("sf3") and 4 ("sf4"). Data from the 2000 and 2010 Decennial Censuses for island territories other than Puerto Rico must be accessed at their corresponding summary files: "as" for American Samoa, "mp" for the Northern Mariana Islands, "gu" for Guam, and "vi" for the US Virgin Islands.

2020 Decennial Census data are available from the PL 94-171 Redistricting summary file, which is specified with sumfile = "pl" and is also available for 2010. The Redistricting summary files include a limited subset of variables from the decennial US Census to be used for legislative redistricting. These variables include total population and housing units; race and ethnicity; voting-age population; and group quarters population. For example, the code below retrieves information on the American Indian & Alaska Native population by state from the 2020 decennial Census.

aian_2020 <- get_decennial(
  geography = "state",
  variables = "P1_005N",
  year = 2020,
  sumfile = "pl"
)

Table 2.2: American Indian or Alaska Native alone population by state from the 2020 decennial Census
GEOID	NAME	variable	value
42	Pennsylvania	P1_005N	31052
06	California	P1_005N	631016
54	West Virginia	P1_005N	3706
49	Utah	P1_005N	41644
36	New York	P1_005N	149690
11	District of Columbia	P1_005N	3193

The argument sumfile = "pl" is assumed (and in turn not required) when users request data for 2020 and will remain so until the main Demographic and Housing Characteristics File is released in mid-to-late 2022.

When users request data from the 2020 decennial Census for the first time in a given R session, get_decennial() prints out the following message:

Note: 2020 decennial Census data use differential privacy, a technique that
introduces errors into data to preserve respondent confidentiality.
ℹ Small counts should be interpreted with caution.
ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.

This message alerts users that 2020 decennial Census data use differential privacy as a method to preserve confidentiality of individuals who responded to the Census. This can lead to inaccuracies in small area analyses using 2020 Census data and also can make comparisons of small counts across years difficult. A more in-depth discussion of differential privacy and the 2020 Census is found in the conclusion of this book.

2.1.2 American Community Survey

Similarly, get_acs() retrieves data from the American Community Survey. As discussed in the previous chapter, the ACS includes a wide variety of variables detailing characteristics of the US population not found in the decennial Census. The example below fetches data on the number of residents born in Mexico by state.

born_in_mexico <- get_acs(
  geography = "state", 
  variables = "B05006_150",
  year = 2020
)

Table 2.3: Mexican-born population by state, 2016-2020 5-year ACS
GEOID	NAME	variable	estimate	moe
01	Alabama	B05006_150	46927	1846
02	Alaska	B05006_150	4181	709
04	Arizona	B05006_150	510639	8028
05	Arkansas	B05006_150	60236	2182
06	California	B05006_150	3962910	25353
08	Colorado	B05006_150	215778	4888
09	Connecticut	B05006_150	28086	2144
10	Delaware	B05006_150	14616	1065
11	District of Columbia	B05006_150	4026	761
12	Florida	B05006_150	257933	6418

If the year is not specified, get_acs() defaults to the most recent five-year ACS sample, which at the time of this writing is 2016-2020. The data returned is similar in structure to that returned by get_decennial(), but includes an estimate column (for the ACS estimate) and a moe column (for the margin of error around that estimate) instead of a value column. Different years and different surveys are available by adjusting the year and survey parameters. survey defaults to the 5-year ACS; however, this can be changed to the 1-year ACS by using the argument survey = "acs1". For example, the following code will fetch data from the 1-year ACS for 2019:

born_in_mexico_1yr <- get_acs(
  geography = "state", 
  variables = "B05006_150", 
  survey = "acs1",
  year = 2019
)

Table 2.4: Mexican-born population by state, 2019 1-year ACS
GEOID	NAME	variable	estimate	moe
01	Alabama	B05006_150	NA	NA
02	Alaska	B05006_150	NA	NA
04	Arizona	B05006_150	516618	15863
05	Arkansas	B05006_150	NA	NA
06	California	B05006_150	3951224	40506
08	Colorado	B05006_150	209408	12214
09	Connecticut	B05006_150	26371	4816
10	Delaware	B05006_150	NA	NA
11	District of Columbia	B05006_150	NA	NA
12	Florida	B05006_150	261614	17571

Note the differences between the 5-year ACS estimates and the 1-year ACS estimates shown. For states with larger Mexican-born populations like Arizona, California, and Colorado, the 1-year ACS data will represent the most up-to-date estimates, albeit characterized by larger margins of error relative to their estimates. For states with smaller Mexican-born populations like Alabama, Alaska, and Arkansas, however, the estimate returns NA, R’s notation representing missing data. If you encounter this in your data’s estimate column, it will generally mean that the estimate is too small for a given geography to be deemed reliable by the Census Bureau. In this case, only the states with the largest Mexican-born populations have data available for that variable in the 1-year ACS, meaning that the 5-year ACS should be used to make full state-wise comparisons if desired.
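The comparison described above can be checked directly from the two tables. The sketch below copies six rows from Tables 2.3 and 2.4 into a base R data frame (the object and column names are illustrative, not tidycensus output), lists the states whose 1-year estimate is NA, and expresses each margin of error as a share of its estimate.

```r
# Rows copied from Table 2.3 (5-year ACS) and Table 2.4 (1-year ACS)
acs <- data.frame(
  NAME = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  estimate_5yr = c(46927, 4181, 510639, 60236, 3962910, 215778),
  moe_5yr      = c(1846, 709, 8028, 2182, 25353, 4888),
  estimate_1yr = c(NA, NA, 516618, NA, 3951224, 209408),
  moe_1yr      = c(NA, NA, 15863, NA, 40506, 12214)
)

# States with no 1-year estimate for this variable
missing_1yr <- acs$NAME[is.na(acs$estimate_1yr)]

# Margin of error as a share of the estimate, for each survey
acs$moe_share_5yr <- acs$moe_5yr / acs$estimate_5yr
acs$moe_share_1yr <- acs$moe_1yr / acs$estimate_1yr

# Keep only states where both surveys report a value
available <- acs[!is.na(acs$estimate_1yr), ]
available[, c("NAME", "moe_share_5yr", "moe_share_1yr")]
```

For Arizona, California, and Colorado, the 1-year share is larger than the 5-year share, which is the trade-off between currency and precision noted above.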

If users try accessing data from the 2020 1-year ACS in tidycensus, they will encounter the following error:

Error: The regular 1-year ACS was not released in 2020 due to low response rates.
The Census Bureau released a set of experimental estimates for the 2020 1-year ACS
that are not available in tidycensus.
These estimates can be downloaded at https://www.census.gov/programs-surveys/acs/data/experimental-data/1-year.html.

This means that for 1-year ACS data, tidycensus users will need to use older datasets (2019 and earlier) or access 2021 data once it is released in late 2022.

Variables from the ACS detailed tables, data profiles, subject tables, comparison profile, and supplemental estimates are available through tidycensus’s get_acs() function; the function will auto-detect which dataset to query based on the variable names. Alternatively, users can supply a table name to the table parameter in get_acs(); this will return data for every variable in that table. For example, to get all variables associated with table B01001, which covers sex broken down by age, from the 2016-2020 5-year ACS:

age_table <- get_acs(
  geography = "state", 
  table = "B01001",
  year = 2020
)

Table 2.5: Table B01001 by state from the 2016-2020 5-year ACS
GEOID	NAME	variable	estimate	moe
01	Alabama	B01001_001	4893186	NA
01	Alabama	B01001_002	2365734	1090
01	Alabama	B01001_003	149579	672
01	Alabama	B01001_004	150937	2202
01	Alabama	B01001_005	160287	2159
01	Alabama	B01001_006	96832	565
01	Alabama	B01001_007	65459	961
01	Alabama	B01001_008	36705	1467
01	Alabama	B01001_009	33089	1547
01	Alabama	B01001_010	93871	2045

To find all of the variables associated with a given ACS table, tidycensus downloads a dataset of variables from the Census Bureau website and looks up the variable codes for download. If the cache_table parameter is set to TRUE, the function instructs tidycensus to cache this dataset on the user’s computer for faster future access. This only needs to be done once per ACS or Census dataset if the user would like to specify this option.

2.2 Geography and variables in tidycensus

The geography parameter in get_acs() and get_decennial() allows users to request data aggregated to common Census enumeration units. At the time of this writing, tidycensus accepts enumeration units nested within states and/or counties, when applicable. Census blocks are available in get_decennial() but not in get_acs() as block-level data are not available from the American Community Survey. To request data within states and/or counties, state and county names can be supplied to the state and county parameters, respectively. Arguments should be formatted in the way that they are accepted by the US Census Bureau API, specified in the table below. If an “Available by” geography is in bold, that argument is required for that geography.

The only geographies available in 2000 are "state", "county", "county subdivision", "tract", "block group", and "place". Some geographies available from the Census API are not available in tidycensus at the moment as they require more complex hierarchy specification than the package supports, and not all variables are available at every geography.

Geography	Definition	Available by	Available in
`"us"`	United States		`get_acs()`, `get_decennial()`, `get_estimates()`
`"region"`	Census region		`get_acs()`, `get_decennial()`, `get_estimates()`
`"division"`	Census division		`get_acs()`, `get_decennial()`, `get_estimates()`
`"state"`	State or equivalent	state	`get_acs()`, `get_decennial()`, `get_estimates()`, `get_flows()`
`"county"`	County or equivalent	state, county	`get_acs()`, `get_decennial()`, `get_estimates()`, `get_flows()`
`"county subdivision"`	County subdivision	state, county	`get_acs()`, `get_decennial()`, `get_estimates()`, `get_flows()`
`"tract"`	Census tract	state, county	`get_acs()`, `get_decennial()`
`"block group"`	Census block group	state, county	`get_acs()` (2013-), `get_decennial()`
`"block"`	Census block	state, county	`get_decennial()`
`"place"`	Census-designated place	state	`get_acs()`, `get_decennial()`, `get_estimates()`
`"alaska native regional corporation"`	Alaska native regional corporation	state	`get_acs()`, `get_decennial()`
`"american indian area/alaska native area/hawaiian home land"`	Federal and state-recognized American Indian reservations and Hawaiian home lands	state	`get_acs()`, `get_decennial()`
`"american indian area/alaska native area (reservation or statistical entity only)"`	Only reservations and statistical entities	state	`get_acs()`, `get_decennial()`
`"american indian area (off-reservation trust land only)/hawaiian home land"`	Only off-reservation trust lands and Hawaiian home lands	state	`get_acs()`
`"metropolitan statistical area/micropolitan statistical area"` OR `"cbsa"`	Core-based statistical area	state	`get_acs()`, `get_decennial()`, `get_estimates()`, `get_flows()`
`"combined statistical area"`	Combined statistical area	state	`get_acs()`, `get_decennial()`, `get_estimates()`
`"new england city and town area"`	New England city/town area	state	`get_acs()`, `get_decennial()`
`"combined new england city and town area"`	Combined New England area	state	`get_acs()`, `get_decennial()`
`"urban area"`	Census-defined urbanized areas		`get_acs()`, `get_decennial()`
`"congressional district"`	Congressional district for the year-appropriate Congress	state	`get_acs()`, `get_decennial()`
`"school district (elementary)"`	Elementary school district	state	`get_acs()`, `get_decennial()`
`"school district (secondary)"`	Secondary school district	state	`get_acs()`, `get_decennial()`
`"school district (unified)"`	Unified school district	state	`get_acs()`, `get_decennial()`
`"public use microdata area"`	PUMA (geography associated with Census microdata samples)	state	`get_acs()`
`"zip code tabulation area"` OR `"zcta"`	Zip code tabulation area	state	`get_acs()`, `get_decennial()`
`"state legislative district (upper chamber)"`	State senate districts	state	`get_acs()`, `get_decennial()`
`"state legislative district (lower chamber)"`	State house districts	state	`get_acs()`, `get_decennial()`
`"voting district"`	Voting districts (2020 only)	state	`get_decennial()`

The geography parameter must be typed exactly as specified in the table above to request data correctly from the Census API; use the guide above as a reference and copy-paste for longer strings. For core-based statistical areas and zip code tabulation areas, two heavily-requested geographies, the aliases "cbsa" and "zcta" can be used, respectively, to fetch data for those geographies.

cbsa_population <- get_acs(
  geography = "cbsa",
  variables = "B01003_001",
  year = 2020
)

Table 2.6: Population by CBSA
GEOID	NAME	variable	estimate	moe
10100	Aberdeen, SD Micro Area	B01003_001	42864	NA
10140	Aberdeen, WA Micro Area	B01003_001	73769	NA
10180	Abilene, TX Metro Area	B01003_001	171354	NA
10220	Ada, OK Micro Area	B01003_001	38385	NA
10300	Adrian, MI Micro Area	B01003_001	98310	NA
10380	Aguadilla-Isabela, PR Metro Area	B01003_001	295172	NA
10420	Akron, OH Metro Area	B01003_001	703286	NA
10460	Alamogordo, NM Micro Area	B01003_001	66804	NA
10500	Albany, GA Metro Area	B01003_001	147431	NA
10540	Albany-Lebanon, OR Metro Area	B01003_001	127216	NA

2.2.1 Geographic subsets

For many geographies, tidycensus supports more granular requests that are subsetted by state or even by county, if supported by the API. This information is found in the “Available by” column in the guide above. If a geographic subset is in bold, it is required; if not, it is optional.

For example, an analyst might be interested in studying variations in household income in the state of Wisconsin. Although the analyst can request all counties in the United States, this is not necessary for this specific task. In turn, they can use the state parameter to subset the request for a specific state.

wi_income <- get_acs(
  geography = "county", 
  variables = "B19013_001", 
  state = "WI",
  year = 2020
)

Table 2.7: Median household income by county in Wisconsin
GEOID	NAME	variable	estimate	moe
55001	Adams County, Wisconsin	B19013_001	48906	2387
55003	Ashland County, Wisconsin	B19013_001	47869	3190
55005	Barron County, Wisconsin	B19013_001	52346	2092
55007	Bayfield County, Wisconsin	B19013_001	57257	2496
55009	Brown County, Wisconsin	B19013_001	64728	1419
55011	Buffalo County, Wisconsin	B19013_001	58364	1871
55013	Burnett County, Wisconsin	B19013_001	53555	2513
55015	Calumet County, Wisconsin	B19013_001	76065	2314
55017	Chippewa County, Wisconsin	B19013_001	61215	2064
55019	Clark County, Wisconsin	B19013_001	54463	1089

tidycensus accepts state names (e.g. "Wisconsin"), state postal codes (e.g. "WI"), and state FIPS codes (e.g. "55"), so an analyst can use what they are most comfortable with.

Smaller geographies like Census tracts can also be subsetted by county. Given that Census tracts nest neatly within counties (and do not cross county boundaries), we can request all Census tracts for a given county by using the optional county parameter. Dane County, home to Wisconsin’s capital city of Madison, is shown below. Note that the name of the county can be supplied as well as the FIPS code. If a state has two counties with similar names (e.g. “Collin” and “Collingsworth” in Texas) you’ll need to spell out the full county string and type "Collin County".

dane_income <- get_acs(
  geography = "tract", 
  variables = "B19013_001", 
  state = "WI", 
  county = "Dane",
  year = 2020
)

Table 2.8: Median household income in Dane County by Census tract
GEOID	NAME	variable	estimate	moe
55025000100	Census Tract 1, Dane County, Wisconsin	B19013_001	74054	15662
55025000201	Census Tract 2.01, Dane County, Wisconsin	B19013_001	92460	27067
55025000202	Census Tract 2.02, Dane County, Wisconsin	B19013_001	88092	5189
55025000204	Census Tract 2.04, Dane County, Wisconsin	B19013_001	82717	12175
55025000205	Census Tract 2.05, Dane County, Wisconsin	B19013_001	100000	17506
55025000301	Census Tract 3.01, Dane County, Wisconsin	B19013_001	37016	11524
55025000302	Census Tract 3.02, Dane County, Wisconsin	B19013_001	117321	28723
55025000401	Census Tract 4.01, Dane County, Wisconsin	B19013_001	100434	12108
55025000402	Census Tract 4.02, Dane County, Wisconsin	B19013_001	105850	12205
55025000406	Census Tract 4.06, Dane County, Wisconsin	B19013_001	74009	2811
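The county-name ambiguity mentioned above is easy to reproduce with plain string matching. The sketch below assumes, purely for illustration, that a supplied name is compared against candidate county names as a substring; the matching that tidycensus performs internally may differ.

```r
# Two Texas counties whose names share a common prefix
county_names <- c("Collin County", "Collingsworth County")

# The short string is found in both names, so it is ambiguous
matches_short <- grepl("Collin", county_names, fixed = TRUE)

# The full county string is found in only one name
matches_full <- grepl("Collin County", county_names, fixed = TRUE)

county_names[matches_short]
county_names[matches_full]
```

Supplying the county FIPS code instead of a name sidesteps the problem entirely.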

With respect to geography and the American Community Survey, users should be aware that whereas the 5-year ACS covers geographies down to the block group, the 1-year ACS only returns data for geographies of population 65,000 and greater. This means that some geographies (e.g. Census tracts) will never be available in the 1-year ACS, and that other geographies such as counties are only partially available. To illustrate this, we can check the number of rows in the object wi_income:

nrow(wi_income)

## [1] 72

There are 72 rows in this dataset, one for each county in Wisconsin. However, if the same data were requested from the 2019 1-year ACS:

wi_income_1yr <- get_acs(
  geography = "county", 
  variables = "B19013_001", 
  state = "WI",
  year = 2019,
  survey = "acs1"
)

nrow(wi_income_1yr)

## [1] 23

There are only 23 rows in this dataset, representing the 23 counties that meet the “total population of 65,000 or greater” threshold required to be included in the 1-year ACS data.

2.3 Searching for variables in tidycensus

One additional challenge when searching for Census variables is understanding variable IDs, which are required to fetch data from the Census and ACS APIs. There are thousands of variables available across the different datasets and summary files. To make searching easier for R users, tidycensus offers the load_variables() function. This function obtains a dataset of variables from the Census Bureau website and formats it for fast searching, ideally in RStudio.

The function takes two required arguments: year, which takes the year or endyear of the Census dataset or ACS sample, and dataset, which references the dataset name. For the 2000 or 2010 Decennial Census, use "sf1" or "sf2" as the dataset name to access variables from Summary Files 1 and 2, respectively. The 2000 Decennial Census also accepts "sf3" and "sf4" for Summary Files 3 and 4. For 2020, the only dataset supported at the time of this writing is "pl" for the PL 94-171 Redistricting dataset; more datasets will be supported as the 2020 Census data are released. An example request would look like load_variables(year = 2020, dataset = "pl") for variables from the 2020 Decennial Census Redistricting data.

For variables from the American Community Survey, users should specify the dataset as "acs1" for the 1-year ACS or "acs5" for the 5-year ACS. If no suffix to these dataset names is specified, users will retrieve variables from the ACS Detailed Tables. Variables from the ACS Data Profile, Subject Tables, and Comparison Profile are also available by appending the suffixes /profile, /subject, or /cprofile, respectively. For example, a user requesting variables from the 2020 5-year ACS Detailed Tables would specify load_variables(year = 2020, dataset = "acs5"); a request for variables from the Data Profile then would be load_variables(year = 2020, dataset = "acs5/profile"). In addition to these datasets, the ACS Supplemental Estimates variables can be accessed with the dataset name "acsse".

As this function requires processing thousands of variables from the Census Bureau which may take a few moments depending on the user’s internet connection, the user can specify cache = TRUE in the function call to store the data in the user’s cache directory for future access. On subsequent calls of the load_variables() function, cache = TRUE will direct the function to look in the cache directory for the variables rather than the Census website.

An example of how load_variables() works is as follows:

v16 <- load_variables(2016, "acs5", cache = TRUE)

Table 2.9: Variables in the 2012-2016 5-year ACS
name	label	concept	geography
B00001_001	Estimate!!Total	UNWEIGHTED SAMPLE COUNT OF THE POPULATION	block group
B00002_001	Estimate!!Total	UNWEIGHTED SAMPLE HOUSING UNITS	block group
B01001_001	Estimate!!Total	SEX BY AGE	block group
B01001_002	Estimate!!Total!!Male	SEX BY AGE	block group
B01001_003	Estimate!!Total!!Male!!Under 5 years	SEX BY AGE	block group
B01001_004	Estimate!!Total!!Male!!5 to 9 years	SEX BY AGE	block group
B01001_005	Estimate!!Total!!Male!!10 to 14 years	SEX BY AGE	block group
B01001_006	Estimate!!Total!!Male!!15 to 17 years	SEX BY AGE	block group
B01001_007	Estimate!!Total!!Male!!18 and 19 years	SEX BY AGE	block group
B01001_008	Estimate!!Total!!Male!!20 years	SEX BY AGE	block group

The returned data frame always has three columns: name, which refers to the Census variable ID; label, which is a descriptive data label for the variable; and concept, which refers to the topic of the data and often corresponds to a table of Census data. For the 5-year ACS detailed tables, the returned data frame also includes a fourth column, geography, which specifies the smallest geography at which a given variable is available from the Census API. The data frame can be filtered using tidyverse tools for variable exploration. In addition, the RStudio integrated development environment includes an interactive data viewer that is well suited to browsing this dataset and allows for interactive sorting and filtering. The data viewer can be accessed with the View() function:

View(v16)

Figure 2.1: Variable viewer in RStudio

By browsing the table in this way, users can identify the appropriate variable IDs (found in the name column) that can be passed to the variables parameter in get_acs() or get_decennial(). Users may note that the raw variable IDs in the ACS, as consumed by the API, require a suffix of E or M. tidycensus does not require this suffix, as it will automatically return both the estimate and margin of error for a given requested variable. Additionally, users who want an entire table of related variables from the ACS should supply the characters prior to the underscore from a variable ID to the table parameter.
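As an alternative to the interactive viewer, the variables data frame can be searched in code. The sketch below rebuilds the first ten rows of Table 2.9 by hand so that it runs without an API request (v16_sample is an illustrative stand-in for the object returned by load_variables()), filters it with dplyr, and then derives the table ID by dropping everything from the underscore onward.

```r
library(dplyr)

# First ten rows of Table 2.9, entered by hand
v16_sample <- data.frame(
  name = c("B00001_001", "B00002_001", "B01001_001", "B01001_002",
           "B01001_003", "B01001_004", "B01001_005", "B01001_006",
           "B01001_007", "B01001_008"),
  label = c("Estimate!!Total", "Estimate!!Total", "Estimate!!Total",
            "Estimate!!Total!!Male",
            "Estimate!!Total!!Male!!Under 5 years",
            "Estimate!!Total!!Male!!5 to 9 years",
            "Estimate!!Total!!Male!!10 to 14 years",
            "Estimate!!Total!!Male!!15 to 17 years",
            "Estimate!!Total!!Male!!18 and 19 years",
            "Estimate!!Total!!Male!!20 years"),
  concept = c("UNWEIGHTED SAMPLE COUNT OF THE POPULATION",
              "UNWEIGHTED SAMPLE HOUSING UNITS",
              rep("SEX BY AGE", 8))
)

# Keep SEX BY AGE variables whose label mentions "Male"
male_vars <- v16_sample %>%
  filter(concept == "SEX BY AGE", grepl("Male", label, fixed = TRUE))

# The table ID is the part of the variable ID before the underscore
table_ids <- unique(sub("_.*$", "", male_vars$name))
```

The values in male_vars$name could be passed to the variables parameter, and table_ids to the table parameter, of get_acs().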

2.4 Data structure in tidycensus

Key to the design philosophy of tidycensus is its interpretation of tidy data. Following Wickham (2014), “tidy” data are defined as follows:

Each observation forms a row;
Each variable forms a column;
Each observational unit forms a table.

By default, tidycensus returns a tibble of ACS or decennial Census data in “tidy” format. For decennial Census data, this will include four columns:

GEOID, representing the Census ID code that uniquely identifies the geographic unit;
NAME, which represents a descriptive name of the unit;
variable, which contains information on the Census variable name corresponding to that row;
value, which contains the data values for each unit-variable combination.

For ACS data, two columns replace the value column: estimate, which represents the ACS estimate, and moe, representing the margin of error around that estimate.

Given the terminology used by the Census Bureau to distinguish data, it is important to provide some clarifications of nomenclature here. Census or ACS variables, which are specific series of data available by enumeration unit, are interpreted in tidycensus as characteristics of those enumeration units. In turn, rows in datasets returned when output = "tidy", which is the default setting in the get_acs() and get_decennial() functions, represent data for unique unit-variable combinations. An example of this is illustrated below with income groups by state for the 2016 1-year American Community Survey.

hhinc <- get_acs(
  geography = "state", 
  table = "B19001", 
  survey = "acs1",
  year = 2016
)

Table 2.10: Household income groups by state, 2016 1-year ACS
GEOID	NAME	variable	estimate	moe
01	Alabama	B19001_001	1852518	12189
01	Alabama	B19001_002	176641	6328
01	Alabama	B19001_003	120590	5347
01	Alabama	B19001_004	117332	5956
01	Alabama	B19001_005	108912	5308
01	Alabama	B19001_006	102080	4740
01	Alabama	B19001_007	103366	5246
01	Alabama	B19001_008	91011	4699
01	Alabama	B19001_009	86996	4418
01	Alabama	B19001_010	74864	4210

In this example, each row represents state-characteristic combinations, consistent with the tidy data model. Alternatively, if a user desires the variables spread across the columns of the dataset, the setting output = "wide" will enable this. For ACS data, estimates and margins of error for each ACS variable will be found in their own columns. For example:

hhinc_wide <- get_acs(
  geography = "state", 
  table = "B19001", 
  survey = "acs1", 
  year = 2016,
  output = "wide"
)

Table 2.11: Income table in wide form
GEOID	NAME	B19001_001E	B19001_001M	B19001_002E	B19001_002M	B19001_003E	B19001_003M	B19001_004E	B19001_004M	B19001_005E	B19001_005M	B19001_006E	B19001_006M	B19001_007E	B19001_007M	B19001_008E	B19001_008M	B19001_009E	B19001_009M	B19001_010E	B19001_010M	B19001_011E	B19001_011M	B19001_012E	B19001_012M	B19001_013E	B19001_013M	B19001_014E	B19001_014M	B19001_015E	B19001_015M	B19001_016E	B19001_016M	B19001_017E	B19001_017M
28	Mississippi	1091245	8803	113124	4835	87136	5004	71206	4058	70160	4560	59619	4105	62688	4149	55973	4422	57215	4119	41870	3427	86198	4669	98865	5983	117664	5168	68367	4079	37809	2983	34786	3038	28565	2396
29	Missouri	2372190	10844	160615	6705	122649	4654	123789	5201	128270	5714	123224	4726	133429	5639	123373	4564	117476	5796	107254	4130	200473	6468	248099	6281	296437	7492	188700	6361	102034	4905	102670	4935	93698	4434
30	Montana	416125	4426	26734	2183	24786	2391	22330	2391	22193	2098	22568	2191	24449	2343	22135	2094	22241	1974	20513	1987	33707	2860	43775	3112	50902	2878	29940	2823	18585	1928	15669	1603	15598	1511
31	Nebraska	747562	4452	45794	3116	33266	2466	31084	2533	37602	2475	38037	3067	40412	2841	36761	2757	35558	2474	33429	2688	57950	3212	83173	4291	99028	4389	69003	3272	37347	2482	37665	2540	31453	2166
32	Nevada	1055158	6433	68507	4886	42720	3071	53143	3653	53188	3403	56693	3758	57215	3909	50798	4207	53783	3826	44637	3558	87876	4032	116975	4704	135242	4728	88474	4750	54275	3382	45943	3019	45689	3255
33	New Hampshire	520643	5191	20890	2566	15933	1908	18190	2315	18067	1841	21680	2292	22695	2067	21064	2112	17717	2340	21086	2454	39534	3108	57994	3587	75337	4214	56445	3069	33685	2445	41092	3028	39234	2925
34	New Jersey	3194519	10274	170029	6836	118862	5855	123335	6065	121889	4670	120881	5562	113762	5328	112003	5795	110312	6000	100527	4994	207103	6096	277719	8225	390127	9002	328144	8879	220764	7203	295764	7663	383298	7529
35	New Mexico	758364	6296	66983	4439	48930	3220	50025	4091	48054	3477	40353	3418	38164	2931	35107	2934	37564	2815	34581	2684	59437	3388	73011	3581	87486	4182	55708	3629	29307	2585	26732	2351	26922	2608
36	New York	7209054	17665	543763	12132	352029	9607	322683	7756	327051	8184	297201	8689	316465	9191	285531	8078	277776	8886	239908	8368	485826	10467	648930	12717	864777	14413	646586	12798	432309	11182	521545	11193	646674	9931
37	North Carolina	3882423	16063	282491	7816	228088	7916	209825	6844	212659	7095	206371	7190	215759	6349	190497	7507	199257	6269	170320	6503	318567	7932	395160	9069	468022	10041	288626	7339	160589	6395	166800	5286	169392	5628

The wide-form dataset includes GEOID and NAME columns, as in the tidy dataset, but is also characterized by estimate/margin of error pairs across the columns for each Census variable in the table.
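The relationship between the two layouts can be illustrated by reshaping a few tidy rows by hand. The sketch below uses tidyr::pivot_wider() on three Alabama rows copied from Table 2.10; it is only meant to show how the E and M columns correspond to the estimate and moe columns, and is not a description of how tidycensus builds its wide output internally. The object names are illustrative.

```r
library(tidyr)

# Three rows copied from Table 2.10 (tidy form)
hhinc_sample <- data.frame(
  GEOID = "01",
  NAME = "Alabama",
  variable = c("B19001_001", "B19001_002", "B19001_003"),
  estimate = c(1852518, 176641, 120590),
  moe = c(12189, 6328, 5347)
)

# Rename the value columns so they can serve as the E / M suffixes
names(hhinc_sample)[names(hhinc_sample) == "estimate"] <- "E"
names(hhinc_sample)[names(hhinc_sample) == "moe"] <- "M"

# Spread variables across columns, one E and one M column per variable
hhinc_sample_wide <- pivot_wider(
  hhinc_sample,
  names_from = variable,
  values_from = c(E, M),
  names_glue = "{variable}{.value}"
)

names(hhinc_sample_wide)
```

The result has a single row for Alabama, with columns such as B19001_001E and B19001_001M, although the column order may differ from the interleaved order shown in Table 2.11.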

2.4.1 Understanding GEOIDs

The GEOID column returned by default in tidycensus can be used to uniquely identify geographic units in a given dataset. For geographies within the core Census hierarchy (Census block through state, as discussed in Section 1.2), GEOIDs can be used to uniquely identify specific units as well as units’ parent geographies. Let’s take the example of housing units by Census block from the 2020 Census in Cimarron County, Oklahoma.

cimarron_blocks <- get_decennial(
  geography = "block",
  variables = "H1_001N",
  state = "OK",
  county = "Cimarron",
  year = 2020,
  sumfile = "pl"
)

Table 2.12: Households by block in Cimarron County, Oklahoma
GEOID	NAME	variable	value
400259501001984	Block 1984, Block Group 1, Census Tract 9501, Cimarron County, Oklahoma	H1_001N	0
400259503001035	Block 1035, Block Group 1, Census Tract 9503, Cimarron County, Oklahoma	H1_001N	0
400259503001068	Block 1068, Block Group 1, Census Tract 9503, Cimarron County, Oklahoma	H1_001N	5
400259503001146	Block 1146, Block Group 1, Census Tract 9503, Cimarron County, Oklahoma	H1_001N	0
400259503001197	Block 1197, Block Group 1, Census Tract 9503, Cimarron County, Oklahoma	H1_001N	7
400259503001218	Block 1218, Block Group 1, Census Tract 9503, Cimarron County, Oklahoma	H1_001N	2
400259501001067	Block 1067, Block Group 1, Census Tract 9501, Cimarron County, Oklahoma	H1_001N	0
400259501001118	Block 1118, Block Group 1, Census Tract 9501, Cimarron County, Oklahoma	H1_001N	0
400259501001141	Block 1141, Block Group 1, Census Tract 9501, Cimarron County, Oklahoma	H1_001N	0
400259501001223	Block 1223, Block Group 1, Census Tract 9501, Cimarron County, Oklahoma	H1_001N	0

The mapping between the GEOID and NAME columns in the returned 2020 Census block data offers some insight into how GEOIDs work for geographies within the core Census hierarchy. Take as an example Block 1110 in Census Tract 9503, one of the blocks returned for the county, which has a GEOID of 400259503001110. The GEOID value breaks down as follows:

The first two digits, 40, correspond to the Federal Information Processing Series (FIPS) code for the state of Oklahoma. All states and US territories, along with other geographies at which the Census Bureau tabulates data, will have a FIPS code that can uniquely identify that geography.
Digits 3 through 5, 025, are representative of Cimarron County. These three digits will uniquely identify Cimarron County within Oklahoma. County codes are generally combined with their corresponding state codes to uniquely identify a county within the United States, as three-digit codes will be repeated across states. Cimarron County’s code in this example would be 40025.
The next six digits, 950300, represent the block’s Census tract. The tract name in the NAME column is Census Tract 9503; the six-digit tract ID is right-padded with zeroes.
The twelfth digit, 1, represents the parent block group of the Census block. As there are no more than nine block groups in any Census tract, the block group name will not exceed 9.
The last three digits, 110, represent the individual Census block, though these digits are combined with the parent block group digit to form the block’s name.
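The breakdown above can be reproduced with base R string functions. The following is a minimal sketch; parse_block_geoid() is a hypothetical helper written for illustration and is not part of tidycensus. It assumes 15-character block GEOIDs stored as character strings, as tidycensus returns them.

```r
# Split a 15-digit Census block GEOID into its component codes.
# parse_block_geoid() is a hypothetical helper, not a tidycensus function.
parse_block_geoid <- function(geoid) {
  stopifnot(is.character(geoid), all(nchar(geoid) == 15))
  data.frame(
    state        = substr(geoid, 1, 2),    # state FIPS code
    county       = substr(geoid, 3, 5),    # county code, unique within the state
    county_geoid = substr(geoid, 1, 5),    # state + county, unique within the US
    tract        = substr(geoid, 6, 11),   # six-digit tract ID
    block_group  = substr(geoid, 12, 12),  # parent block group
    block        = substr(geoid, 12, 15),  # block name: block group digit + last three digits
    stringsAsFactors = FALSE
  )
}

parts <- parse_block_geoid("400259503001110")
print(parts)
```

Because each parent code is a prefix of the child GEOID, the same idea can be used to aggregate or join block data to tracts or counties by truncating the GEOID.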

For geographies outside the core Census hierarchy, GEOIDs will uniquely identify geographic units but will only include IDs of parent geographies to the degree to which they nest within them. For example, a geography such as a school district, which nests within states but may cross county boundaries, will begin its GEOID with the state code, followed by digits unique to that unit. Geographies like core-based statistical areas that do not nest within states will have fully unique GEOIDs, independent of any other geographic level of aggregation such as states.

2.4.2 Renaming variable IDs

Census variable IDs can be cumbersome to type and remember in the course of an R session. As such, tidycensus has built-in tools to automatically rename the variable IDs if requested by a user. For example, let’s say that a user is requesting data on median household income (variable ID B19013_001) and median age (variable ID B01002_001). By passing a named vector to the variables parameter in get_acs() or get_decennial(), the functions will return the desired names rather than the Census variable IDs. Let’s examine this for counties in Georgia from the 2016-2020 five-year ACS.

ga <- get_acs(
  geography = "county",
  state = "Georgia",
  variables = c(medinc = "B19013_001",
                medage = "B01002_001"),
  year = 2020
)

Table 2.13: Multi-variable dataset for Georgia counties
GEOID	NAME	variable	estimate	moe
13001	Appling County, Georgia	medage	39.9	1.7
13001	Appling County, Georgia	medinc	37924.0	4761.0
13003	Atkinson County, Georgia	medage	35.9	1.5
13003	Atkinson County, Georgia	medinc	35703.0	5493.0
13005	Bacon County, Georgia	medage	36.5	1.0
13005	Bacon County, Georgia	medinc	36692.0	3774.0
13007	Baker County, Georgia	medage	52.2	4.8
13007	Baker County, Georgia	medinc	34034.0	9879.0
13009	Baldwin County, Georgia	medage	35.8	0.5
13009	Baldwin County, Georgia	medinc	46250.0	4707.0

ACS variable IDs, which would be found in the variable column, are replaced by medage and medinc, as requested. When a wide-form dataset is requested, tidycensus will still append E and M to the specified column names, as illustrated below.

ga_wide <- get_acs(
  geography = "county",
  state = "Georgia",
  variables = c(medinc = "B19013_001",
                medage = "B01002_001"),
  output = "wide",
  year = 2020
)

Table 2.14: Georgia dataset in wide form
GEOID	NAME	medincE	medincM	medageE	medageM
13001	Appling County, Georgia	37924	4761	39.9	1.7
13003	Atkinson County, Georgia	35703	5493	35.9	1.5
13005	Bacon County, Georgia	36692	3774	36.5	1.0
13007	Baker County, Georgia	34034	9879	52.2	4.8
13011	Banks County, Georgia	50912	4278	41.5	1.1
13013	Barrow County, Georgia	62990	2562	36.0	0.3
13017	Ben Hill County, Georgia	32077	4008	39.5	1.4
13021	Bibb County, Georgia	41317	1220	36.3	0.3
13023	Bleckley County, Georgia	46992	6279	36.0	1.5
13027	Brooks County, Georgia	37516	4438	43.6	0.9

Median household income for each county is represented by medincE, for the estimate, and medincM, for the margin of error. At the time of this writing, custom variable names are only available with the variables argument and not with the table argument, as users will not always know the number of variables found in a table beforehand.
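When names are needed for data that still carry the original variable IDs, one possible workaround is to recode the variable column after the fact with a named lookup vector. The sketch below is an assumption-laden illustration rather than a tidycensus feature: it uses a small mock data frame built from the first two rows of Table 2.13, with the Census variable IDs restored by hand.

```r
# Mock tidy-format rows (values from Table 2.13; IDs restored for illustration).
acs_rows <- data.frame(
  GEOID    = c("13001", "13001"),
  variable = c("B01002_001", "B19013_001"),
  estimate = c(39.9, 37924),
  stringsAsFactors = FALSE
)

# Named lookup: names are Census variable IDs, values are the custom labels.
lookup <- c(B19013_001 = "medinc", B01002_001 = "medage")

# Index the lookup by the ID column to swap IDs for readable names.
acs_rows$variable <- unname(lookup[acs_rows$variable])
print(acs_rows)
```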

2.5 Other Census Bureau datasets in tidycensus

As mentioned earlier in this chapter, tidycensus does not grant access to all of the datasets available from the Census API; users should look at the censusapi package (Recht 2021) for that functionality. However, the Population Estimates and ACS Migration Flows APIs are accessible with the get_estimates() and get_flows() functions, respectively. This section includes brief examples of each.

2.5.1 Using get_estimates()

The Population Estimates Program, or PEP, provides yearly estimates of the US population and its components between decennial Censuses. It differs from the ACS in that it is not directly based on a dedicated survey, but rather projects forward data from the most recent decennial Census based on birth, death, and migration rates. In turn, estimates in the PEP will differ slightly from what you may see in data returned by get_acs(), as the estimates are produced using a different methodology.

One advantage of using the PEP to retrieve data is that it allows you to access the indicators used to produce the intercensal population estimates. These indicators can be specified as variables directly in the get_estimates() function in tidycensus, or requested in bulk by using the product argument. The products available include "population", "components", "housing", and "characteristics". For example, we can request all components of change population estimates for 2019 for a specific county:

library(tidycensus)
library(tidyverse)

queens_components <- get_estimates(
  geography = "county",
  product = "components",
  state = "NY",
  county = "Queens",
  year = 2019
)

Table 2.15: Components of change estimates for Queens County, NY
NAME	GEOID	variable	value
Queens County, New York	36081	BIRTHS	27453.000000
Queens County, New York	36081	DEATHS	16380.000000
Queens County, New York	36081	DOMESTICMIG	-41789.000000
Queens County, New York	36081	INTERNATIONALMIG	9883.000000
Queens County, New York	36081	NATURALINC	11073.000000
Queens County, New York	36081	NETMIG	-31906.000000
Queens County, New York	36081	RBIRTH	12.124644
Queens County, New York	36081	RDEATH	7.234243
Queens County, New York	36081	RDOMESTICMIG	-18.456152
Queens County, New York	36081	RINTERNATIONALMIG	4.364836

The returned variables include raw values for births and deaths (BIRTHS and DEATHS) during the previous 12 months, defined as mid-year 2018 (July 1) to mid-year 2019. Crude rates per 1000 people in Queens County are also available with RBIRTH and RDEATH. NATURALINC, the natural increase, then measures the number of births minus the number of deaths. Net domestic and international migration are also available as counts and rates, and the NETMIG variable accounts for the overall migration, domestic and international included. Alternatively, a single variable or vector of variables can be requested with the variables argument, and the output = "wide" argument can also be used to spread the variable names across the columns.
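The relationships among these components can be checked with simple arithmetic. The following sketch hard-codes the counts from Table 2.15 rather than calling the API, so it runs without a Census API key; the object names are chosen for illustration.

```r
# Counts copied from Table 2.15 (Queens County, NY, 2019).
components <- c(
  BIRTHS = 27453,
  DEATHS = 16380,
  DOMESTICMIG = -41789,
  INTERNATIONALMIG = 9883,
  NATURALINC = 11073,
  NETMIG = -31906
)

# Natural increase: births minus deaths.
natural_increase <- components[["BIRTHS"]] - components[["DEATHS"]]

# Net migration: domestic plus international migration.
net_migration <- components[["DOMESTICMIG"]] + components[["INTERNATIONALMIG"]]

# Both derived values match the published NATURALINC and NETMIG rows.
natural_increase == components[["NATURALINC"]]
net_migration == components[["NETMIG"]]
```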

The product = "characteristics" argument also has some unique options. The breakdown argument lets users get breakdowns of population estimates for the US, states, and counties by "AGEGROUP", "RACE", "SEX", or "HISP" (Hispanic origin). If set to TRUE, the breakdown_labels argument will return informative labels for the population estimates. For example, to get population estimates by sex and Hispanic origin for the state of Louisiana, we can use the following code:

louisiana_sex_hisp <- get_estimates(
  geography = "state",
  product = "characteristics",
  breakdown = c("SEX", "HISP"),
  breakdown_labels = TRUE,
  state = "LA",
  year = 2019
)

Table 2.16: Population characteristics for Louisiana
GEOID	NAME	value	SEX	HISP
22	Louisiana	4648794	Both sexes	Both Hispanic Origins
22	Louisiana	4401822	Both sexes	Non-Hispanic
22	Louisiana	246972	Both sexes	Hispanic
22	Louisiana	2267050	Male	Both Hispanic Origins
22	Louisiana	2135979	Male	Non-Hispanic
22	Louisiana	131071	Male	Hispanic
22	Louisiana	2381744	Female	Both Hispanic Origins
22	Louisiana	2265843	Female	Non-Hispanic
22	Louisiana	115901	Female	Hispanic

The value column gives the estimate characterized by the population labels in the SEX and HISP columns. For example, the estimated population value in 2019 for Hispanic males in Louisiana was 131,071.
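Because the output includes total rows ("Both sexes", "Both Hispanic Origins") alongside the detailed categories, some care is needed when filtering or summing. The sketch below rebuilds Table 2.16 by hand as a data frame, so it runs without an API call, and shows one way to pull a single group and to avoid double counting.

```r
# Values and labels copied from Table 2.16.
louisiana <- data.frame(
  value = c(4648794, 4401822, 246972,
            2267050, 2135979, 131071,
            2381744, 2265843, 115901),
  SEX  = rep(c("Both sexes", "Male", "Female"), each = 3),
  HISP = rep(c("Both Hispanic Origins", "Non-Hispanic", "Hispanic"), times = 3),
  stringsAsFactors = FALSE
)

# Estimated population of Hispanic males.
hispanic_males <- louisiana$value[
  louisiana$SEX == "Male" & louisiana$HISP == "Hispanic"
]

# Drop the total rows before summing so no one is counted twice.
detail <- subset(louisiana,
                 SEX != "Both sexes" & HISP != "Both Hispanic Origins")
total_from_detail <- sum(detail$value)

hispanic_males
total_from_detail
```

The sum of the four detailed rows equals the value in the "Both sexes", "Both Hispanic Origins" row, which is a quick consistency check on the filter.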

2.5.2 Using get_flows()

As of version 1.0, tidycensus also includes support for the ACS Migration Flows API. The flows API returns information on both in- and out-migration for states, counties, and metropolitan areas. By default, the function allows for analysis of in-migrants, emigrants, and net migration for a given geography using data from a given 5-year ACS sample. In the example below, we request migration data for Honolulu County, Hawaii. In-migration for world regions is available along with out-migration and net migration for US locations.

honolulu_migration <- get_flows(
  geography = "county",
  state = "HI",
  county = "Honolulu",
  year = 2019
)

Table 2.17: Migration flows data for Honolulu, HI
GEOID1	GEOID2	FULL1_NAME	FULL2_NAME	variable	estimate	moe
15003	NA	Honolulu County, Hawaii	Africa	MOVEDIN	152	156
15003	NA	Honolulu County, Hawaii	Africa	MOVEDOUT	NA	NA
15003	NA	Honolulu County, Hawaii	Africa	MOVEDNET	NA	NA
15003	NA	Honolulu County, Hawaii	Asia	MOVEDIN	7680	884
15003	NA	Honolulu County, Hawaii	Asia	MOVEDOUT	NA	NA
15003	NA	Honolulu County, Hawaii	Asia	MOVEDNET	NA	NA
15003	NA	Honolulu County, Hawaii	Central America	MOVEDIN	192	100
15003	NA	Honolulu County, Hawaii	Central America	MOVEDOUT	NA	NA
15003	NA	Honolulu County, Hawaii	Central America	MOVEDNET	NA	NA
15003	NA	Honolulu County, Hawaii	Caribbean	MOVEDIN	97	78

get_flows() also includes functionality for migration flow mapping; this advanced feature will be covered in Section 6.6.1.

2.6 Debugging tidycensus errors

At times, you may think that you’ve formatted your use of a tidycensus function correctly but the Census API doesn’t return the data you expected. Whenever possible, tidycensus carries through the error message from the Census API or translates common errors for the user. In the example below, a user has mis-typed the variable ID:

state_pop <- get_decennial(
  geography = "state",
  variables = "P01001",
  year = 2010
)

## Error : Your API call has errors.  The API message returned is error: error: unknown variable 'P01001'.

## Error in UseMethod("gather"): no applicable method for 'gather' applied to an object of class "character"

The “unknown variable” error message from the Census API is carried through to the user. In other instances, users might request geographies that are not available in a given dataset:

cbsa_ohio <- get_acs(
  geography = "cbsa",
  variables = "DP02_0068P",
  state = "OH",
  year = 2019
)

## Error: Your API call has errors.  The API message returned is error: unknown/unsupported geography heirarchy.

The user above has attempted to get bachelor’s degree attainment by CBSA in Ohio from the ACS Data Profile. However, CBSA geographies are not available by state given that many CBSAs cross state boundaries. In response, the API returns an “unsupported geography hierarchy” error.

To assist with debugging errors, or more generally to help users understand how tidycensus functions are being translated to Census API calls, tidycensus offers a parameter show_call that when set to TRUE prints out the actual API call that tidycensus is making to the Census API.

cbsa_bachelors <- get_acs(
  geography = "cbsa",
  variables = "DP02_0068P",
  year = 2019,
  show_call = TRUE
)

## Getting data from the 2015-2019 5-year ACS

## Using the ACS Data Profile

## Census API call: https://api.census.gov/data/2019/acs/acs5/profile?get=DP02_0068PE%2CDP02_0068PM%2CNAME&for=metropolitan%20statistical%20area%2Fmicropolitan%20statistical%20area%3A%2A

The printed URL https://api.census.gov/data/2019/acs/acs5/profile?get=DP02_0068PE%2CDP02_0068PM%2CNAME&for=metropolitan%20statistical%20area%2Fmicropolitan%20statistical%20area%3A%2A can be copy-pasted into a web browser where users can see the raw JSON returned by the Census API and inspect the results.

[["DP02_0068PE","DP02_0068PM","NAME","metropolitan statistical area/micropolitan statistical area"],
["15.7","1.5","Big Stone Gap, VA Micro Area","13720"],
["31.6","1.0","Billings, MT Metro Area","13740"],
["27.9","0.7","Binghamton, NY Metro Area","13780"],
["31.4","0.4","Birmingham-Hoover, AL Metro Area","13820"],
["33.3","1.0","Bismarck, ND Metro Area","13900"],
["21.2","2.0","Blackfoot, ID Micro Area","13940"],
["35.2","1.1","Blacksburg-Christiansburg, VA Metro Area","13980"],
["44.8","1.1","Bloomington, IL Metro Area","14010"],
["40.8","1.2","Bloomington, IN Metro Area","14020"],
["24.9","1.0","Bloomsburg-Berwick, PA Metro Area","14100"],
...

A common use case for show_call = TRUE is to understand what data are available from the API, especially if functions in tidycensus are returning NA in certain rows. If the raw API response itself contains missing values for given variables, this confirms that the requested data are not available from the API at a given geography.
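The percent-encoded URL printed by show_call = TRUE can be hard to read. As a small sketch, base R's utils::URLdecode() can translate the query parameters back into plain text; the URL below is the one shown in the example above, and the object names are illustrative.

```r
# The API call printed by show_call = TRUE in the example above.
api_call <- paste0(
  "https://api.census.gov/data/2019/acs/acs5/profile",
  "?get=DP02_0068PE%2CDP02_0068PM%2CNAME",
  "&for=metropolitan%20statistical%20area%2Fmicropolitan%20statistical%20area%3A%2A"
)

# Keep only the query string (everything after the first "?").
query <- sub("^[^?]*\\?", "", api_call)

# Split into key=value pairs, then decode each value.
pairs <- strsplit(strsplit(query, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
params <- vapply(pairs, function(p) utils::URLdecode(p[2]), character(1))
names(params) <- vapply(pairs, function(p) p[1], character(1))

print(params)
```

The decoded get parameter lists the estimate and margin of error columns requested for DP02_0068P, and the for parameter shows the geography, with the * wildcard requesting every unit.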

2.7 Exercises

Review the available geographies in tidycensus from the geography table in this chapter. Acquire data on median age (variable B01002_001) for a geography we have not yet used.
Use the load_variables() function to find a variable that interests you that we haven’t used yet. Use get_acs() to fetch data from the 2016-2020 ACS for counties in the state where you live, where you have visited, or where you would like to visit.
