2 An introduction to tidycensus

The tidycensus package (K. Walker and Herman 2021), first released in 2017, is an R package designed to facilitate the process of acquiring and working with US Census Bureau population data in the R environment. The package has two distinct goals. First, tidycensus aims to make Census data available to R users in a tidyverse-friendly format, helping kick-start the process of generating insights from US Census data. Second, the package is designed to streamline the data wrangling process for spatial Census data analysts. With tidycensus, R users can request geometry along with attributes for their Census data, helping facilitate mapping and spatial analysis. This functionality of tidycensus is covered in more depth in Chapters 6, 7, and 8.

As discussed in the previous chapter, the US Census Bureau makes a wide range of datasets available to the user community through their APIs and other data download resources. tidycensus is not a comprehensive portal to these data resources; instead, it focuses on a select number of datasets implemented in a series of core functions. These core functions in tidycensus include:

  • get_decennial(), which requests data from the US Decennial Census APIs for 2000, 2010, and 2020.

  • get_acs(), which requests data from the 1-year and 5-year American Community Survey samples. Data are available from the 1-year ACS back to 2005 and the 5-year ACS back to 2005-2009.

  • get_estimates(), an interface to the Population Estimates APIs. These datasets include yearly estimates of population characteristics by state, county, and metropolitan area, along with components of change demographic estimates like births, deaths, and migration rates.

  • get_pums(), which accesses data from the ACS Public Use Microdata Sample APIs. These samples include anonymized individual-level records from the ACS organized by household and are highly useful for many different social science analyses. get_pums() is covered in more depth in Chapters 9 and 10.

  • get_flows(), an interface to the ACS Migration Flows APIs. It includes information on in- and out-flows from various geographies for the 5-year ACS samples, enabling origin-destination analyses.

2.1 Getting started with tidycensus

To get started with tidycensus, users should install the package with install.packages("tidycensus") if not yet installed; load the package with library("tidycensus"); and set their Census API key with the census_api_key() function. API keys can be obtained at https://api.census.gov/data/key_signup.html. After you’ve signed up for an API key, be sure to activate the key from the email you receive from the Census Bureau so it works correctly. Declaring install = TRUE when calling census_api_key() will install the key for use in future R sessions, which may be convenient for many users.

library(tidycensus)
# census_api_key("YOUR KEY GOES HERE", install = TRUE)
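As a rough sketch, one way to confirm that a key is visible to the current session is to check the CENSUS_API_KEY environment variable, which is where census_api_key() stores the key (and, with install = TRUE, writes it to the user’s .Renviron file). The helper name has_census_key() is hypothetical and is not part of tidycensus.

```r
# Hypothetical helper (not part of tidycensus): report whether a Census API
# key is visible to the current R session via the CENSUS_API_KEY variable.
has_census_key <- function() {
  nzchar(Sys.getenv("CENSUS_API_KEY"))
}

if (has_census_key()) {
  message("Census API key found.")
} else {
  message("No key found; call census_api_key() or restart R after installing one.")
}
```

A key installed with install = TRUE is read from .Renviron when R starts, so a restart may be needed before the check succeeds.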

2.1.1 Decennial Census

Once an API key is installed, users can obtain decennial Census or ACS data with a single function call. Let’s start with get_decennial(), which is used to access decennial Census data from the 2000, 2010, and 2020 decennial US Censuses.

To get data from the decennial US Census, users must specify a string representing the requested geography, along with either a vector of Census variable IDs, passed to variables, or a Census table ID, passed to table. The code below gets data on total population by state from the 2010 decennial Census.

total_population_10 <- get_decennial(
  geography = "state", 
  variables = "P001001",
  year = 2010
)
Table 2.1: Total population by state, 2010 Census
GEOID NAME variable value
01 Alabama P001001 4779736
02 Alaska P001001 710231
04 Arizona P001001 6392017
05 Arkansas P001001 2915918
06 California P001001 37253956
22 Louisiana P001001 4533372
21 Kentucky P001001 4339367
08 Colorado P001001 5029196
09 Connecticut P001001 3574097
10 Delaware P001001 897934

The function returns a tibble of data from the 2010 US Census (the function default year) with information on total population by state, and assigns it to the object total_population_10. Data for 2000 or 2020 can also be obtained by supplying the appropriate year to the year parameter.

2.1.1.1 Summary files in the decennial Census

By default, get_decennial() uses the argument sumfile = "sf1", which fetches data from the decennial Census Summary File 1. This summary file exists for the 2000 and 2010 decennial US Censuses, and includes core demographic characteristics for Census geographies. The 2000 and 2010 decennial Census data also include Summary File 2, which contains information on a range of population and housing unit characteristics and is specified as "sf2". Detailed demographic information in the 2000 decennial Census such as income and occupation can be found in Summary Files 3 ("sf3") and 4 ("sf4"). Data from the 2000 and 2010 Decennial Censuses for island territories other than Puerto Rico must be accessed at their corresponding summary files: "as" for American Samoa, "mp" for the Northern Mariana Islands, "gu" for Guam, and "vi" for the US Virgin Islands.
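The territory codes above lend themselves to a small lookup. The following is a minimal sketch in base R; the names territory_sumfiles and sumfile_for() are hypothetical, and the fallback to "sf1" simply mirrors the get_decennial() default described in the text.

```r
# Summary file codes for island territories, as listed in the text above.
territory_sumfiles <- c(
  "American Samoa" = "as",
  "Northern Mariana Islands" = "mp",
  "Guam" = "gu",
  "US Virgin Islands" = "vi"
)

# Hypothetical helper: return the sumfile code for a territory,
# falling back to the get_decennial() default of "sf1" elsewhere.
sumfile_for <- function(area) {
  code <- territory_sumfiles[area]
  if (is.na(code)) "sf1" else unname(code)
}

sumfile_for("Guam")
sumfile_for("Wisconsin")
```

The returned string could then be supplied to the sumfile argument of get_decennial().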

2020 Decennial Census data are available from the PL 94-171 Redistricting summary file, which is specified with sumfile = "pl" and is also available for 2010. The Redistricting summary files include a limited subset of variables from the decennial US Census to be used for legislative redistricting. These variables include total population and housing units; race and ethnicity; voting-age population; and group quarters population. For example, the code below retrieves information on the American Indian & Alaska Native population by state from the 2020 decennial Census.

aian_2020 <- get_decennial(
  geography = "state",
  variables = "P1_005N",
  year = 2020,
  sumfile = "pl"
)
Table 2.2: American Indian or Alaska Native alone population by state from the 2020 decennial Census
GEOID NAME variable value
01 Alabama P1_005N 33625
02 Alaska P1_005N 111575
04 Arizona P1_005N 319512
05 Arkansas P1_005N 27177
06 California P1_005N 631016
08 Colorado P1_005N 74129

The argument sumfile = "pl" is assumed (and in turn not required) when users request data for 2020 and will remain so until the main Demographic and Housing Characteristics File is released in mid-to-late 2022.

When users request data from the 2020 decennial Census for the first time in a given R session, get_decennial() prints out the following message:

Note: 2020 decennial Census data use differential privacy, a technique that
introduces errors into data to preserve respondent confidentiality.
ℹ Small counts should be interpreted with caution.
ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.

This message alerts users that 2020 decennial Census data use differential privacy as a method to preserve confidentiality of individuals who responded to the Census. This can lead to inaccuracies in small area analyses using 2020 Census data and also can make comparisons of small counts across years difficult. A more in-depth discussion of differential privacy and the 2020 Census is found in the conclusion of this book.

2.1.2 American Community Survey

Similarly, get_acs() retrieves data from the American Community Survey. As discussed in the previous chapter, the ACS includes a wide variety of variables detailing characteristics of the US population not found in the decennial Census. The example below fetches data on the number of residents born in Mexico by state.

born_in_mexico <- get_acs(
  geography = "state", 
  variables = "B05006_150",
  year = 2019
)
Table 2.3: Mexican-born population by state, 2015-2019 5-year ACS
GEOID NAME variable estimate moe
01 Alabama B05006_150 49052 2202
02 Alaska B05006_150 4528 798
04 Arizona B05006_150 511861 7420
05 Arkansas B05006_150 59271 2101
06 California B05006_150 4076121 23048
08 Colorado B05006_150 225372 4757
09 Connecticut B05006_150 27399 2056
10 Delaware B05006_150 16663 1145
11 District of Columbia B05006_150 4022 856
12 Florida B05006_150 266547 6349

If the year is not specified, get_acs() defaults to the most recent five-year ACS sample, which at the time of this writing is 2015-2019. The data returned is similar in structure to that returned by get_decennial(), but includes an estimate column (for the ACS estimate) and moe column (for the margin of error around that estimate) instead of a value column. Different years and different surveys are available by adjusting the year and survey parameters. survey defaults to the 5-year ACS; however this can be changed to the 1-year ACS by using the argument survey = "acs1". For example, the following code will fetch data from the 1-year ACS for 2019:

born_in_mexico_1yr <- get_acs(
  geography = "state", 
  variables = "B05006_150", 
  survey = "acs1",
  year = 2019
)
Table 2.4: Mexican-born population by state, 2019 1-year ACS
GEOID NAME variable estimate moe
01 Alabama B05006_150 NA NA
02 Alaska B05006_150 NA NA
04 Arizona B05006_150 516618 15863
05 Arkansas B05006_150 NA NA
06 California B05006_150 3951224 40506
08 Colorado B05006_150 209408 12214
09 Connecticut B05006_150 26371 4816
10 Delaware B05006_150 NA NA
11 District of Columbia B05006_150 NA NA
12 Florida B05006_150 261614 17571

Note the differences between the 5-year ACS estimates and the 1-year ACS estimates shown. For states with larger Mexican-born populations like Arizona, California, and Colorado, the 1-year ACS data will represent the most up-to-date estimates, albeit characterized by larger margins of error relative to their estimates. For states with smaller Mexican-born populations like Alabama, Alaska, and Arkansas, however, the estimate returns NA, R’s notation representing missing data. If you encounter this in your data’s estimate column, it will generally mean that the estimate is too small for a given geography to be deemed reliable by the Census Bureau. In this case, only the states with the largest Mexican-born populations have data available for that variable in the 1-year ACS, meaning that the 5-year ACS should be used to make full state-wise comparisons if desired.
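To make the comparison concrete, the sketch below takes four states from Tables 2.3 and 2.4 and expresses each margin of error as a percentage of its estimate. The values are typed in by hand from those tables rather than fetched from the API, and the column names are chosen for illustration only.

```r
# Estimates and margins of error copied from Tables 2.3 and 2.4.
mexico <- data.frame(
  NAME = c("Alabama", "Arizona", "California", "Colorado"),
  est_5yr = c(49052, 511861, 4076121, 225372),
  moe_5yr = c(2202, 7420, 23048, 4757),
  est_1yr = c(NA, 516618, 3951224, 209408),
  moe_1yr = c(NA, 15863, 40506, 12214),
  stringsAsFactors = FALSE
)

# Margin of error as a percentage of the estimate for each sample.
mexico$moe_pct_5yr <- round(100 * mexico$moe_5yr / mexico$est_5yr, 2)
mexico$moe_pct_1yr <- round(100 * mexico$moe_1yr / mexico$est_1yr, 2)

# States with no 1-year estimate show up as NA.
missing_1yr <- mexico$NAME[is.na(mexico$est_1yr)]

mexico[, c("NAME", "moe_pct_5yr", "moe_pct_1yr")]
missing_1yr
```

For each state with data in both samples, the 1-year percentage is larger than the 5-year percentage, which is the trade-off described above.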

Variables from the ACS detailed tables, data profiles, summary tables, and supplemental estimates are available through tidycensus’s get_acs() function; the function auto-detects which dataset to search for variables based on their names. Alternatively, users can supply a table name to the table parameter in get_acs(); this will return data for every variable in that table. For example, to get all variables associated with table B01001, which covers sex broken down by age, from the 2015-2019 5-year ACS:

age_table <- get_acs(
  geography = "state", 
  table = "B01001",
  year = 2019
)
Table 2.5: Table B01001 by state from the 2015-2019 5-year ACS
GEOID NAME variable estimate moe
01 Alabama B01001_001 4876250 NA
01 Alabama B01001_002 2359355 1270
01 Alabama B01001_003 149090 704
01 Alabama B01001_004 153494 2290
01 Alabama B01001_005 158617 2274
01 Alabama B01001_006 98257 468
01 Alabama B01001_007 64980 834
01 Alabama B01001_008 35870 1436
01 Alabama B01001_009 35040 1472
01 Alabama B01001_010 95065 1916

To find all of the variables associated with a given ACS table, tidycensus downloads a dataset of variables from the Census Bureau website and looks up the variable codes for download. If the cache_table parameter is set to TRUE, the function instructs tidycensus to cache this dataset on the user’s computer for faster future access. This only needs to be done once per ACS or Census dataset if the user would like to specify this option.

2.2 Geography and variables in tidycensus

The geography parameter in get_acs() and get_decennial() allows users to request data aggregated to common Census enumeration units. At the time of this writing, tidycensus accepts enumeration units nested within states and/or counties, when applicable. Census blocks are available in get_decennial() but not in get_acs() as block-level data are not available from the American Community Survey. To request data within states and/or counties, state and county names can be supplied to the state and county parameters, respectively. Arguments should be formatted in the way that they are accepted by the US Census Bureau API, specified in the table below. If an “Available by” geography is in bold, that argument is required for that geography.

The only geographies available in 2000 are "state", "county", "county subdivision", "tract", "block group", and "place". Some geographies available from the Census API are not available in tidycensus at the moment as they require more complex hierarchy specification than the package supports, and not all variables are available at every geography.

Geography Definition Available by Available in
"us" United States get_acs(), get_decennial(), get_estimates()
"region" Census region get_acs(), get_decennial(), get_estimates()
"division" Census division get_acs(), get_decennial(), get_estimates()
"state" State or equivalent state get_acs(), get_decennial(), get_estimates(), get_flows()
"county" County or equivalent state, county get_acs(), get_decennial(), get_estimates(), get_flows()
"county subdivision" County subdivision state, county get_acs(), get_decennial(), get_estimates(), get_flows()
"tract" Census tract state, county get_acs(), get_decennial()
"block group" Census block group state, county get_acs() (2013-), get_decennial()
"block" Census block state, county get_decennial()
"place" Census-designated place state get_acs(), get_decennial(), get_estimates()
"alaska native regional corporation" Alaska native regional corporation state get_acs(), get_decennial()
"american indian area/alaska native area/hawaiian home land" Federal and state-recognized American Indian reservations and Hawaiian home lands state get_acs(), get_decennial()
"american indian area/alaska native area (reservation or statistical entity only)" Only reservations and statistical entities state get_acs(), get_decennial()
"american indian area (off-reservation trust land only)/hawaiian home land" Only off-reservation trust lands and Hawaiian home lands state get_acs()
"metropolitan statistical area/micropolitan statistical area" OR "cbsa" Core-based statistical area state get_acs(), get_decennial(), get_estimates(), get_flows()
"combined statistical area" Combined statistical area state get_acs(), get_decennial(), get_estimates()
"new england city and town area" New England city/town area state get_acs(), get_decennial()
"combined new england city and town area" Combined New England area state get_acs(), get_decennial()
"urban area" Census-defined urbanized areas get_acs(), get_decennial()
"congressional district" Congressional district for the year-appropriate Congress state get_acs(), get_decennial()
"school district (elementary)" Elementary school district state get_acs(), get_decennial()
"school district (secondary)" Secondary school district state get_acs(), get_decennial()
"school district (unified)" Unified school district state get_acs(), get_decennial()
"public use microdata area" PUMA (geography associated with Census microdata samples) state get_acs()
"zip code tabulation area" OR "zcta" Zip code tabulation area state get_acs(), get_decennial()
"state legislative district (upper chamber)" State senate districts state get_acs(), get_decennial()
"state legislative district (lower chamber)" State house districts state get_acs(), get_decennial()
"voting district" Voting districts (2020 only) state get_decennial()

The geography parameter must be typed exactly as specified in the table above to request data correctly from the Census API; use the guide above as a reference and copy-paste for longer strings. For core-based statistical areas and zip code tabulation areas, two heavily-requested geographies, the aliases "cbsa" and "zcta" can be used, respectively, to fetch data for those geographies.

cbsa_population <- get_acs(
  geography = "cbsa",
  variables = "B01003_001",
  year = 2019
)
Table 2.6: Population by CBSA
GEOID NAME variable estimate moe
10100 Aberdeen, SD Micro Area B01003_001 42824 NA
10140 Aberdeen, WA Micro Area B01003_001 72779 NA
10180 Abilene, TX Metro Area B01003_001 170669 NA
10220 Ada, OK Micro Area B01003_001 38355 NA
10300 Adrian, MI Micro Area B01003_001 98381 NA
10380 Aguadilla-Isabela, PR Metro Area B01003_001 301107 NA
10420 Akron, OH Metro Area B01003_001 703845 NA
10460 Alamogordo, NM Micro Area B01003_001 66137 NA
10500 Albany, GA Metro Area B01003_001 148436 NA
10540 Albany-Lebanon, OR Metro Area B01003_001 125048 NA
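The two aliases mentioned above can be pictured as a simple lookup from a short name to the full geography string. The sketch below is an illustration only: geography_aliases and expand_geography() are hypothetical names, and this is not a description of how tidycensus resolves aliases internally.

```r
# Aliases named in the text, mapped to the full geography strings.
geography_aliases <- c(
  cbsa = "metropolitan statistical area/micropolitan statistical area",
  zcta = "zip code tabulation area"
)

# Hypothetical helper: expand an alias, or return the input unchanged.
expand_geography <- function(geography) {
  if (geography %in% names(geography_aliases)) {
    unname(geography_aliases[geography])
  } else {
    geography
  }
}

expand_geography("cbsa")
expand_geography("tract")
```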

2.2.1 Geographic subsets

For many geographies, tidycensus supports more granular requests that are subsetted by state or even by county, if supported by the API. This information is found in the “Available by” column in the guide above. If a geographic subset is in bold, it is required; if not, it is optional.

For example, an analyst might be interested in studying variations in household income in the state of Wisconsin. Although the analyst can request all counties in the United States, this is not necessary for this specific task. Instead, they can use the state parameter to restrict the request to a specific state.

wi_income <- get_acs(
  geography = "county", 
  variables = "B19013_001", 
  state = "WI",
  year = 2019
)
Table 2.7: Median household income by county in Wisconsin
GEOID NAME variable estimate moe
55001 Adams County, Wisconsin B19013_001 46369 1834
55003 Ashland County, Wisconsin B19013_001 42510 2858
55005 Barron County, Wisconsin B19013_001 52703 2104
55007 Bayfield County, Wisconsin B19013_001 56096 1877
55009 Brown County, Wisconsin B19013_001 62340 1112
55011 Buffalo County, Wisconsin B19013_001 57829 1873
55013 Burnett County, Wisconsin B19013_001 52672 1388
55015 Calumet County, Wisconsin B19013_001 75814 2425
55017 Chippewa County, Wisconsin B19013_001 59742 1759
55019 Clark County, Wisconsin B19013_001 54012 1223

tidycensus accepts state names (e.g. "Wisconsin"), state postal codes (e.g. "WI"), and state FIPS codes (e.g. "55"), so an analyst can use what they are most comfortable with.

Smaller geographies like Census tracts can also be subsetted by county. Given that Census tracts nest neatly within counties (and do not cross county boundaries), we can request all Census tracts for a given county by using the optional county parameter. Dane County, home to Wisconsin’s capital city of Madison, is shown below. Note that the name of the county can be supplied as well as the FIPS code. If a state has two counties with similar names (e.g. “Collin” and “Collingsworth” in Texas) you’ll need to spell out the full county string and type "Collin County".

dane_income <- get_acs(
  geography = "tract", 
  variables = "B19013_001", 
  state = "WI", 
  county = "Dane",
  year = 2019  # pin the 2015-2019 5-year ACS, the default at the time of writing
)
Table 2.8: Median household income in Dane County by Census tract
GEOID NAME variable estimate moe
55025000100 Census Tract 1, Dane County, Wisconsin B19013_001 72471 12984
55025000201 Census Tract 2.01, Dane County, Wisconsin B19013_001 94821 11860
55025000202 Census Tract 2.02, Dane County, Wisconsin B19013_001 84145 7021
55025000204 Census Tract 2.04, Dane County, Wisconsin B19013_001 79617 11823
55025000205 Census Tract 2.05, Dane County, Wisconsin B19013_001 91326 13453
55025000300 Census Tract 3, Dane County, Wisconsin B19013_001 53778 7593
55025000401 Census Tract 4.01, Dane County, Wisconsin B19013_001 98178 7330
55025000402 Census Tract 4.02, Dane County, Wisconsin B19013_001 107440 6585
55025000405 Census Tract 4.05, Dane County, Wisconsin B19013_001 68911 4141
55025000406 Census Tract 4.06, Dane County, Wisconsin B19013_001 74489 10451
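The nesting of tracts within counties is visible in the GEOID column itself: a tract GEOID is the 2-digit state code, followed by the 3-digit county code, followed by a 6-digit tract code. A minimal sketch in base R, using three GEOIDs copied from Table 2.8 (the column names are chosen for illustration):

```r
# Tract GEOIDs copied from Table 2.8.
geoids <- c("55025000100", "55025000201", "55025000402")

# Split each 11-digit tract GEOID into its nested components:
# 2-digit state, 3-digit county, 6-digit tract.
parts <- data.frame(
  GEOID = geoids,
  state = substr(geoids, 1, 2),
  county = substr(geoids, 3, 5),
  tract = substr(geoids, 6, 11),
  stringsAsFactors = FALSE
)

# The first five digits form the county GEOID.
parts$county_geoid <- paste0(parts$state, parts$county)
parts
```

Keeping GEOIDs as character strings, as tidycensus does, preserves leading zeros such as the "01" used for Alabama.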

With respect to geography and the American Community Survey, users should be aware that whereas the 5-year ACS covers geographies down to the block group, the 1-year ACS only returns data for geographies of population 65,000 and greater. This means that some geographies (e.g. Census tracts) will never be available in the 1-year ACS, and that other geographies such as counties are only partially available. To illustrate this, we can check the number of rows in the object wi_income:

nrow(wi_income)
## [1] 72

There are 72 rows in this dataset, one for each county in Wisconsin. However, if the same data were requested from the 2019 1-year ACS:

wi_income_1yr <- get_acs(
  geography = "county", 
  variables = "B19013_001", 
  state = "WI",
  year = 2019,
  survey = "acs1"
)

nrow(wi_income_1yr)
## [1] 23

There are only 23 rows in this dataset, representing the 23 counties that meet the “total population of 65,000 or greater” threshold required to be included in the 1-year ACS data.

2.3 Searching for variables in tidycensus

One additional challenge when searching for Census variables is understanding variable IDs, which are required to fetch data from the Census and ACS APIs. There are thousands of variables available across the different datasets and summary files. To make searching easier for R users, tidycensus offers the load_variables() function. This function obtains a dataset of variables from the Census Bureau website and formats it for fast searching, ideally in RStudio.

The function takes two required arguments: year, which takes the year or endyear of the Census dataset or ACS sample, and dataset, which takes the dataset name: one of "sf1", "sf3", "pl", "acs1", or "acs5". For the ACS datasets, append /profile for the Data Profile and /subject for the Subject Tables. Because the function processes thousands of variables from the Census Bureau, it may take a few moments depending on the user’s internet connection; the user can specify cache = TRUE in the function call to store the data in the user’s cache directory for future access. On subsequent calls of the load_variables() function, cache = TRUE will direct the function to look in the cache directory for the variables rather than the Census website.

load_variables() works as follows:

v16 <- load_variables(2016, "acs5", cache = TRUE)
Table 2.9: Variables in the 2012-2016 5-year ACS
name label concept
B00001_001 Estimate!!Total UNWEIGHTED SAMPLE COUNT OF THE POPULATION
B00002_001 Estimate!!Total UNWEIGHTED SAMPLE HOUSING UNITS
B01001_001 Estimate!!Total SEX BY AGE
B01001_002 Estimate!!Total!!Male SEX BY AGE
B01001_003 Estimate!!Total!!Male!!Under 5 years SEX BY AGE
B01001_004 Estimate!!Total!!Male!!5 to 9 years SEX BY AGE
B01001_005 Estimate!!Total!!Male!!10 to 14 years SEX BY AGE
B01001_006 Estimate!!Total!!Male!!15 to 17 years SEX BY AGE
B01001_007 Estimate!!Total!!Male!!18 and 19 years SEX BY AGE
B01001_008 Estimate!!Total!!Male!!20 years SEX BY AGE

The returned data frame has three columns: name, which refers to the Census variable ID; label, which is a descriptive data label for the variable; and concept, which refers to the topic of the data and often corresponds to a table of Census data. The data frame can be filtered using tidyverse tools for variable exploration. However, the RStudio integrated development environment includes an interactive data viewer which is ideal for browsing this dataset, and allows for interactive sorting and filtering. The data viewer can be accessed with the View() function:

View(v16)

Figure 2.1: Variable viewer in RStudio

By browsing the table in this way, users can identify the appropriate variable IDs (found in the name column) that can be passed to the variables parameter in get_acs() or get_decennial(). Users may note that the raw variable IDs in the ACS, as consumed by the API, require a suffix of E or M. tidycensus does not require this suffix, as it will automatically return both the estimate and margin of error for a given requested variable. Additionally, users who want an entire table of related variables from the ACS should supply the characters prior to the underscore from a variable ID to the table parameter.
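The same search can be done in code rather than in the viewer. The sketch below filters a variables data frame by concept and label and then derives the table ID from the matching variable IDs. It uses base R and a few rows typed in from Table 2.9 as a stand-in for the output of load_variables(), so it runs without an API request.

```r
# A few rows copied from Table 2.9, standing in for load_variables() output.
vars <- data.frame(
  name = c("B00001_001", "B01001_001", "B01001_002",
           "B01001_003", "B01001_004"),
  label = c("Estimate!!Total", "Estimate!!Total", "Estimate!!Total!!Male",
            "Estimate!!Total!!Male!!Under 5 years",
            "Estimate!!Total!!Male!!5 to 9 years"),
  concept = c("UNWEIGHTED SAMPLE COUNT OF THE POPULATION",
              rep("SEX BY AGE", 4)),
  stringsAsFactors = FALSE
)

# Keep "SEX BY AGE" variables whose label mentions "Male".
keep <- vars$concept == "SEX BY AGE" & grepl("Male", vars$label, fixed = TRUE)
male_vars <- vars[keep, ]

# The table ID is everything before the underscore in a variable ID.
table_ids <- unique(sub("_.*$", "", male_vars$name))

male_vars$name
table_ids
```

The values in male_vars$name could be passed to the variables parameter, and the value in table_ids to the table parameter.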

2.4 Data structure in tidycensus

Key to the design philosophy of tidycensus is its interpretation of tidy data. Following Wickham (2014), “tidy” data are defined as follows:

  1. Each observation forms a row;
  2. Each variable forms a column;
  3. Each observational unit forms a table.

By default, tidycensus returns a tibble of ACS or decennial Census data in “tidy” format. For decennial Census data, this will include four columns:

  • GEOID, representing the Census ID code that uniquely identifies the geographic unit;

  • NAME, which represents a descriptive name of the unit;

  • variable, which contains information on the Census variable name corresponding to that row;

  • value, which contains the data values for each unit-variable combination. For ACS data, two columns replace the value column: estimate, which represents the ACS estimate, and moe, representing the margin of error around that estimate.

Given the terminology used by the Census Bureau to distinguish data, it is important to provide some clarifications of nomenclature here. Census or ACS variables, which are specific series of data available by enumeration unit, are interpreted in tidycensus as characteristics of those enumeration units. In turn, rows in datasets returned when output = "tidy", which is the default setting in the get_acs() and get_decennial() functions, represent data for unique unit-variable combinations. An example of this is illustrated below with income groups by state for the 2016 1-year American Community Survey.

hhinc <- get_acs(
  geography = "state", 
  table = "B19001", 
  survey = "acs1",
  year = 2016
)
Table 2.10: Household income groups by state, 2016 1-year ACS
GEOID NAME variable estimate moe
01 Alabama B19001_001 1852518 12189
01 Alabama B19001_002 176641 6328
01 Alabama B19001_003 120590 5347
01 Alabama B19001_004 117332 5956
01 Alabama B19001_005 108912 5308
01 Alabama B19001_006 102080 4740
01 Alabama B19001_007 103366 5246
01 Alabama B19001_008 91011 4699
01 Alabama B19001_009 86996 4418
01 Alabama B19001_010 74864 4210

In this example, each row represents a state-characteristic combination, consistent with the tidy data model. Alternatively, if a user desires the variables spread across the columns of the dataset, the setting output = "wide" will enable this. For ACS data, estimates and margins of error for each ACS variable will be found in their own columns. For example:

hhinc_wide <- get_acs(
  geography = "state", 
  table = "B19001", 
  survey = "acs1", 
  year = 2016,
  output = "wide"
)
Table 2.11: Income table in wide form
GEOID NAME B19001_001E B19001_001M B19001_002E B19001_002M B19001_003E B19001_003M B19001_004E B19001_004M B19001_005E B19001_005M B19001_006E B19001_006M B19001_007E B19001_007M B19001_008E B19001_008M B19001_009E B19001_009M B19001_010E B19001_010M B19001_011E B19001_011M B19001_012E B19001_012M B19001_013E B19001_013M B19001_014E B19001_014M B19001_015E B19001_015M B19001_016E B19001_016M B19001_017E B19001_017M
28 Mississippi 1091245 8803 113124 4835 87136 5004 71206 4058 70160 4560 59619 4105 62688 4149 55973 4422 57215 4119 41870 3427 86198 4669 98865 5983 117664 5168 68367 4079 37809 2983 34786 3038 28565 2396
29 Missouri 2372190 10844 160615 6705 122649 4654 123789 5201 128270 5714 123224 4726 133429 5639 123373 4564 117476 5796 107254 4130 200473 6468 248099 6281 296437 7492 188700 6361 102034 4905 102670 4935 93698 4434
30 Montana 416125 4426 26734 2183 24786 2391 22330 2391 22193 2098 22568 2191 24449 2343 22135 2094 22241 1974 20513 1987 33707 2860 43775 3112 50902 2878 29940 2823 18585 1928 15669 1603 15598 1511
31 Nebraska 747562 4452 45794 3116 33266 2466 31084 2533 37602 2475 38037 3067 40412 2841 36761 2757 35558 2474 33429 2688 57950 3212 83173 4291 99028 4389 69003 3272 37347 2482 37665 2540 31453 2166
32 Nevada 1055158 6433 68507 4886 42720 3071 53143 3653 53188 3403 56693 3758 57215 3909 50798 4207 53783 3826 44637 3558 87876 4032 116975 4704 135242 4728 88474 4750 54275 3382 45943 3019 45689 3255
33 New Hampshire 520643 5191 20890 2566 15933 1908 18190 2315 18067 1841 21680 2292 22695 2067 21064 2112 17717 2340 21086 2454 39534 3108 57994 3587 75337 4214 56445 3069 33685 2445 41092 3028 39234 2925
34 New Jersey 3194519 10274 170029 6836 118862 5855 123335 6065 121889 4670 120881 5562 113762 5328 112003 5795 110312 6000 100527 4994 207103 6096 277719 8225 390127 9002 328144 8879 220764 7203 295764 7663 383298 7529
35 New Mexico 758364 6296 66983 4439 48930 3220 50025 4091 48054 3477 40353 3418 38164 2931 35107 2934 37564 2815 34581 2684 59437 3388 73011 3581 87486 4182 55708 3629 29307 2585 26732 2351 26922 2608
36 New York 7209054 17665 543763 12132 352029 9607 322683 7756 327051 8184 297201 8689 316465 9191 285531 8078 277776 8886 239908 8368 485826 10467 648930 12717 864777 14413 646586 12798 432309 11182 521545 11193 646674 9931
37 North Carolina 3882423 16063 282491 7816 228088 7916 209825 6844 212659 7095 206371 7190 215759 6349 190497 7507 199257 6269 170320 6503 318567 7932 395160 9069 468022 10041 288626 7339 160589 6395 166800 5286 169392 5628

The wide-form dataset includes GEOID and NAME columns, as in the tidy dataset, but is also characterized by estimate/margin of error pairs across the columns for each Census variable in the table.
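To make the relationship between the two layouts concrete, the following sketch reshapes three tidy rows copied from Table 2.10 into wide form with base R’s reshape(), then renames the columns to the E/M convention. This is only an illustration of the data structure; it is not a description of how tidycensus builds its wide output internally.

```r
# Three tidy rows for Alabama, copied from Table 2.10.
tidy <- data.frame(
  GEOID = "01",
  NAME = "Alabama",
  variable = c("B19001_001", "B19001_002", "B19001_003"),
  estimate = c(1852518, 176641, 120590),
  moe = c(12189, 6328, 5347),
  stringsAsFactors = FALSE
)

# One row per geographic unit; one estimate/moe column pair per variable.
wide <- reshape(
  tidy,
  idvar = c("GEOID", "NAME"),
  timevar = "variable",
  direction = "wide"
)

# reshape() names the new columns "estimate.<ID>" and "moe.<ID>";
# rename them to the "<ID>E" / "<ID>M" style used in the wide table.
names(wide) <- sub("^estimate\\.(.*)$", "\\1E", names(wide))
names(wide) <- sub("^moe\\.(.*)$", "\\1M", names(wide))

names(wide)
```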

2.4.1 Renaming variable IDs

Census variable IDs can be cumbersome to type and remember in the course of an R session. As such, tidycensus has built-in tools to automatically rename the variable IDs if requested by a user. For example, let’s say that a user is requesting data on median household income (variable ID B19013_001) and median age (variable ID B01002_001). By passing a named vector to the variables parameter in get_acs() or get_decennial(), the functions will return the desired names rather than the Census variable IDs. Let’s examine this for counties in Georgia from the 2015-2019 five-year ACS.

ga <- get_acs(
  geography = "county",
  state = "Georgia",
  variables = c(medinc = "B19013_001",
                medage = "B01002_001"),
  year = 2019
)
Table 2.12: Multi-variable dataset for Georgia counties
GEOID NAME variable estimate moe
13001 Appling County, Georgia medage 40.3 1.4
13001 Appling County, Georgia medinc 40304.0 5180.0
13003 Atkinson County, Georgia medage 36.4 1.0
13003 Atkinson County, Georgia medinc 37197.0 3686.0
13005 Bacon County, Georgia medage 36.7 0.7
13005 Bacon County, Georgia medinc 37519.0 5492.0
13007 Baker County, Georgia medage 46.1 5.5
13007 Baker County, Georgia medinc 32917.0 6967.0
13009 Baldwin County, Georgia medage 35.5 0.3
13009 Baldwin County, Georgia medinc 43672.0 3736.0

ACS variable IDs, which would be found in the variable column, are replaced by medage and medinc, as requested. When a wide-form dataset is requested, tidycensus will still append E and M to the specified column names, as illustrated below.

ga_wide <- get_acs(
  geography = "county",
  state = "Georgia",
  variables = c(medinc = "B19013_001",
                medage = "B01002_001"),
  year = 2019,
  output = "wide"
)
Table 2.13: Georgia dataset in wide form
GEOID NAME medincE medincM medageE medageM
13005 Bacon County, Georgia 37519 5492 36.7 0.7
13025 Brantley County, Georgia 38857 3480 41.1 0.8
13017 Ben Hill County, Georgia 32229 3845 39.9 1.1
13033 Burke County, Georgia 44151 2438 37.4 0.6
13047 Catoosa County, Georgia 56235 2290 40.4 0.4
13053 Chattahoochee County, Georgia 47096 5158 24.5 0.5
13055 Chattooga County, Georgia 36807 2268 39.4 0.7
13073 Columbia County, Georgia 82339 3532 36.9 0.4
13087 Decatur County, Georgia 41481 3584 37.8 0.6
13115 Floyd County, Georgia 48336 2266 38.3 0.3

Median household income for each county is represented by medincE, for the estimate, and medincM, for the margin of error. At the time of this writing, custom variable names are only available for variables and not for table, as users will not always know the number of variables found in a table beforehand.
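Because custom names are not available for table requests, one possible workaround is to recode the variable column after the data are returned. The sketch below shows the lookup with base R; raw_ids is a made-up stand-in for a returned variable column, not real output.

```r
# Named vector in the same form as the one passed to variables above.
vars <- c(medinc = "B19013_001", medage = "B01002_001")

# Stand-in for the variable column of a result that still holds raw IDs.
raw_ids <- c("B01002_001", "B19013_001", "B01002_001", "B19013_001")

# Look up each raw ID in the named vector and take the matching name.
friendly <- names(vars)[match(raw_ids, vars)]
friendly
```

Any ID that does not appear in the named vector would come back as NA under this approach, so the lookup vector needs an entry for every variable in the table.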

2.5 Other Census Bureau datasets in tidycensus

As mentioned earlier in this chapter, tidycensus does not grant access to all of the datasets available from the Census API; users should look at the censusapi package (Recht 2021) for that functionality. However, the Population Estimates and ACS Migration Flows APIs are accessible with the get_estimates() and get_flows() functions, respectively. This section includes brief examples of each.

2.5.1 Using get_estimates()

The Population Estimates Program, or PEP, provides yearly estimates of the US population and its components between decennial Censuses. It differs from the ACS in that it is not directly based on a dedicated survey, but rather projects forward data from the most recent decennial Census based on birth, death, and migration rates.

One advantage of using the PEP to retrieve data is that it allows you to access the indicators used to produce the intercensal population estimates. These indicators can be specified as variables directly in the get_estimates() function in tidycensus, or requested in bulk by using the product argument. The products available include "population", "components", "housing", and "characteristics". For example, we can request all components of change population estimates for 2019 for a specific county:

library(tidycensus)
library(tidyverse)

queens_components <- get_estimates(
  geography = "county",
  product = "components",
  state = "NY",
  county = "Queens",
  year = 2019
)
Table 2.14: Components of change estimates for Queens County, NY
NAME GEOID variable value
Queens County, New York 36081 BIRTHS 27453.000000
Queens County, New York 36081 DEATHS 16380.000000
Queens County, New York 36081 DOMESTICMIG -41789.000000
Queens County, New York 36081 INTERNATIONALMIG 9883.000000
Queens County, New York 36081 NATURALINC 11073.000000
Queens County, New York 36081 NETMIG -31906.000000
Queens County, New York 36081 RBIRTH 12.124644
Queens County, New York 36081 RDEATH 7.234243
Queens County, New York 36081 RDOMESTICMIG -18.456152
Queens County, New York 36081 RINTERNATIONALMIG 4.364836
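
The count-based components in Table 2.14 are related to one another by simple arithmetic. The sketch below checks two of those relationships on the values printed in the table, assuming the usual demographic definitions (natural increase is births minus deaths; net migration is domestic plus international migration). The data are typed in by hand, and the names queens_counts and queens_check are hypothetical.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Count-based rows from Table 2.14, entered by hand (no API call)
queens_counts <- tribble(
  ~NAME,                     ~GEOID,  ~variable,          ~value,
  "Queens County, New York", "36081", "BIRTHS",           27453,
  "Queens County, New York", "36081", "DEATHS",           16380,
  "Queens County, New York", "36081", "DOMESTICMIG",      -41789,
  "Queens County, New York", "36081", "INTERNATIONALMIG", 9883,
  "Queens County, New York", "36081", "NATURALINC",       11073,
  "Queens County, New York", "36081", "NETMIG",           -31906
)

# Put each component in its own column, then recompute the derived ones
queens_check <- queens_counts %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  mutate(
    natural_check = BIRTHS - DEATHS,
    netmig_check  = DOMESTICMIG + INTERNATIONALMIG
  )

# Both comparisons should be TRUE for the values in Table 2.14
queens_check$natural_check == queens_check$NATURALINC
queens_check$netmig_check == queens_check$NETMIG
```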

Alternatively, a single variable or vector of variables can be requested with the variables argument, and the output = "wide" argument can also be used to spread the variable names across the columns.

The product = "characteristics" argument also has some unique options. The argument breakdown lets users get breakdowns of population estimates for the US, states, and counties by "AGEGROUP", "RACE", "SEX", or "HISP" (Hispanic origin). If set to TRUE, the breakdown_labels argument will return informative labels for the population estimates. For example, to get population estimates by sex and Hispanic origin for the state of Louisiana, we can use the following code:

louisiana_sex_hisp <- get_estimates(
  geography = "state",
  product = "characteristics",
  breakdown = c("SEX", "HISP"),
  breakdown_labels = TRUE,
  state = "LA",
  year = 2019
)
Table 2.15: Population characteristics for Louisiana
GEOID NAME value SEX HISP
22 Louisiana 4648794 Both sexes Both Hispanic Origins
22 Louisiana 4401822 Both sexes Non-Hispanic
22 Louisiana 246972 Both sexes Hispanic
22 Louisiana 2267050 Male Both Hispanic Origins
22 Louisiana 2135979 Male Non-Hispanic
22 Louisiana 131071 Male Hispanic
22 Louisiana 2381744 Female Both Hispanic Origins
22 Louisiana 2265843 Female Non-Hispanic
22 Louisiana 115901 Female Hispanic
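
Note that the output in Table 2.15 mixes total rows ("Both sexes", "Both Hispanic Origins") with the individual sex-by-origin cells, so summing the value column directly would count people more than once. A minimal sketch of one way to handle this is shown below; the rows are copied by hand from Table 2.15, and the object names la_chars and la_cells are hypothetical.

```r
library(tibble)
library(dplyr)

# Rows of Table 2.15, entered by hand (no API call)
la_chars <- tribble(
  ~GEOID, ~NAME,       ~value,  ~SEX,         ~HISP,
  "22",   "Louisiana", 4648794, "Both sexes", "Both Hispanic Origins",
  "22",   "Louisiana", 4401822, "Both sexes", "Non-Hispanic",
  "22",   "Louisiana", 246972,  "Both sexes", "Hispanic",
  "22",   "Louisiana", 2267050, "Male",       "Both Hispanic Origins",
  "22",   "Louisiana", 2135979, "Male",       "Non-Hispanic",
  "22",   "Louisiana", 131071,  "Male",       "Hispanic",
  "22",   "Louisiana", 2381744, "Female",     "Both Hispanic Origins",
  "22",   "Louisiana", 2265843, "Female",     "Non-Hispanic",
  "22",   "Louisiana", 115901,  "Female",     "Hispanic"
)

# Keep only the four mutually exclusive cells, then compute each cell's share
la_cells <- la_chars %>%
  filter(SEX != "Both sexes", HISP != "Both Hispanic Origins") %>%
  mutate(share = value / sum(value))

# The four cells add back up to the statewide total in the first row
sum(la_cells$value) == la_chars$value[1]
```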

2.5.2 Using get_flows()

As of version 1.0, tidycensus also includes support for the ACS Migration Flows API. The flows API returns information on both in- and out-migration for states, counties, and metropolitan areas. By default, the function allows for analysis of in-migrants, emigrants, and net migration for a given geography using data from a given 5-year ACS sample. In the example below, we request migration data for Honolulu County, Hawaii. For origins outside the US, which are reported as world regions, only in-migration is available; out-migration and net migration are reported for US locations only.

honolulu_migration <- get_flows(
  geography = "county",
  state = "HI",
  county = "Honolulu",
  year = 2018,
  show_call = TRUE
)
Table 2.16: Migration flows data for Honolulu, HI
GEOID1 GEOID2 FULL1_NAME FULL2_NAME variable estimate moe
15003 NA Honolulu County, Hawaii Africa MOVEDIN 86 70
15003 NA Honolulu County, Hawaii Africa MOVEDOUT NA NA
15003 NA Honolulu County, Hawaii Africa MOVEDNET NA NA
15003 NA Honolulu County, Hawaii Asia MOVEDIN 7620 891
15003 NA Honolulu County, Hawaii Asia MOVEDOUT NA NA
15003 NA Honolulu County, Hawaii Asia MOVEDNET NA NA
15003 NA Honolulu County, Hawaii Central America MOVEDIN 228 98
15003 NA Honolulu County, Hawaii Central America MOVEDOUT NA NA
15003 NA Honolulu County, Hawaii Central America MOVEDNET NA NA
15003 NA Honolulu County, Hawaii Caribbean MOVEDIN 106 94
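
Because the flows data come back in long form with NA values where a measure is unavailable, a common first step is to isolate one measure and sort it. The sketch below does this for the rows shown in Table 2.16, which are typed in by hand so it runs without an API key; the object names honolulu_rows and inbound are hypothetical.

```r
library(tibble)
library(dplyr)

# Rows of Table 2.16, entered by hand (no API call)
honolulu_rows <- tribble(
  ~GEOID1, ~GEOID2,       ~FULL1_NAME,               ~FULL2_NAME,       ~variable,  ~estimate, ~moe,
  "15003", NA_character_, "Honolulu County, Hawaii", "Africa",          "MOVEDIN",  86,        70,
  "15003", NA_character_, "Honolulu County, Hawaii", "Africa",          "MOVEDOUT", NA_real_,  NA_real_,
  "15003", NA_character_, "Honolulu County, Hawaii", "Africa",          "MOVEDNET", NA_real_,  NA_real_,
  "15003", NA_character_, "Honolulu County, Hawaii", "Asia",            "MOVEDIN",  7620,      891,
  "15003", NA_character_, "Honolulu County, Hawaii", "Asia",            "MOVEDOUT", NA_real_,  NA_real_,
  "15003", NA_character_, "Honolulu County, Hawaii", "Asia",            "MOVEDNET", NA_real_,  NA_real_,
  "15003", NA_character_, "Honolulu County, Hawaii", "Central America", "MOVEDIN",  228,       98,
  "15003", NA_character_, "Honolulu County, Hawaii", "Central America", "MOVEDOUT", NA_real_,  NA_real_,
  "15003", NA_character_, "Honolulu County, Hawaii", "Central America", "MOVEDNET", NA_real_,  NA_real_,
  "15003", NA_character_, "Honolulu County, Hawaii", "Caribbean",       "MOVEDIN",  106,       94
)

# Keep in-migration only and rank origins from largest to smallest flow
inbound <- honolulu_rows %>%
  filter(variable == "MOVEDIN") %>%
  arrange(desc(estimate)) %>%
  select(FULL2_NAME, estimate, moe)

inbound
```

Several of these estimates have margins of error nearly as large as the estimates themselves, so rankings among the smaller flows should be read with caution.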

get_flows() also includes functionality for migration flow mapping; this advanced feature will be covered in Section 6.6.1.

2.6 Debugging tidycensus errors

At times, you may think that you’ve formatted your use of a tidycensus function correctly but the Census API doesn’t return the data you expected. Whenever possible, tidycensus carries through the error message from the Census API or translates common errors for the user. In the example below, a user has mistyped the variable ID:

state_pop <- get_decennial(
  geography = "state",
  variables = "P01001",
  year = 2010
)
## Error : Your API call has errors.  The API message returned is error: error: unknown variable 'P01001'.
## Error in UseMethod("gather"): no applicable method for 'gather' applied to an object of class "character"

The “unknown variable” error message from the Census API is carried through to the user; here the ID is missing a digit, as the 2010 total population variable is P001001. In other instances, users might request geographies that are not available in a given dataset:

cbsa_ohio <- get_acs(
  geography = "cbsa",
  variables = "DP02_0068P",
  state = "OH",
  year = 2019
)
## Error: Your API call has errors.  The API message returned is error: unknown/unsupported geography heirarchy.

The user above has attempted to get bachelor’s degree attainment by CBSA in Ohio from the ACS Data Profile. However, CBSA geographies are not available by state given that many CBSAs cross state boundaries. In response, the API returns an “unsupported geography hierarchy” error.
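
When several requests are run in a script, it can help to capture these error messages rather than let the first failure halt execution. The sketch below shows one way to do that with base R's tryCatch(). The helper try_census() is hypothetical and not part of tidycensus, and the failing request is simulated with stop() using the message shown above so that the example runs without an API key.

```r
# Hypothetical helper: run a request and return either its result or the error text
try_census <- function(request) {
  tryCatch(
    list(ok = TRUE, result = request(), message = NULL),
    error = function(e) {
      list(ok = FALSE, result = NULL, message = conditionMessage(e))
    }
  )
}

# Stand-in for a failing call; in practice this could be
# function() get_acs(geography = "cbsa", variables = "DP02_0068P", state = "OH", year = 2019)
failing_request <- function() {
  stop("Your API call has errors.  The API message returned is error: unknown/unsupported geography heirarchy.")
}

outcome <- try_census(failing_request)

# React to the captured message instead of stopping the script
if (!outcome$ok && grepl("geography", outcome$message, fixed = TRUE)) {
  message("Request failed; check whether this geography can be requested by state.")
}
```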

To assist with debugging errors, or more generally to help users understand how tidycensus functions are translated to Census API calls, tidycensus offers a show_call parameter that, when set to TRUE, prints the actual API call that tidycensus makes to the Census API.

cbsa_bachelors <- get_acs(
  geography = "cbsa",
  variables = "DP02_0068P",
  year = 2019,
  show_call = TRUE
)
## Getting data from the 2015-2019 5-year ACS
## Using the ACS Data Profile
## Census API call: https://api.census.gov/data/2019/acs/acs5/profile?get=DP02_0068PE%2CDP02_0068PM%2CNAME&for=metropolitan%20statistical%20area%2Fmicropolitan%20statistical%20area%3A%2A

The printed URL https://api.census.gov/data/2019/acs/acs5/profile?get=DP02_0068PE%2CDP02_0068PM%2CNAME&for=metropolitan%20statistical%20area%2Fmicropolitan%20statistical%20area%3A%2A can be copy-pasted into a web browser where users can see the raw JSON returned by the Census API and inspect the results.

[["DP02_0068PE","DP02_0068PM","NAME","metropolitan statistical area/micropolitan statistical area"],
["15.7","1.5","Big Stone Gap, VA Micro Area","13720"],
["31.6","1.0","Billings, MT Metro Area","13740"],
["27.9","0.7","Binghamton, NY Metro Area","13780"],
["31.4","0.4","Birmingham-Hoover, AL Metro Area","13820"],
["33.3","1.0","Bismarck, ND Metro Area","13900"],
["21.2","2.0","Blackfoot, ID Micro Area","13940"],
["35.2","1.1","Blacksburg-Christiansburg, VA Metro Area","13980"],
["44.8","1.1","Bloomington, IL Metro Area","14010"],
["40.8","1.2","Bloomington, IN Metro Area","14020"],
["24.9","1.0","Bloomsburg-Berwick, PA Metro Area","14100"],
...
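
The same inspection can be done from R. The sketch below decodes the percent-encoded URL into a readable form with utils::URLdecode() and parses the first few rows of the JSON shown above with the jsonlite package. Using jsonlite here is an assumption made for illustration and is not a description of how tidycensus parses responses internally; the JSON is pasted in as a string so the example runs offline, and the object names are hypothetical.

```r
library(jsonlite)

# URL printed by show_call = TRUE
api_url <- paste0(
  "https://api.census.gov/data/2019/acs/acs5/profile",
  "?get=DP02_0068PE%2CDP02_0068PM%2CNAME",
  "&for=metropolitan%20statistical%20area%2Fmicropolitan%20statistical%20area%3A%2A"
)

# Decode %2C (comma), %20 (space), %2F (slash), %3A (colon), and %2A (asterisk)
readable_url <- utils::URLdecode(api_url)

# First rows of the raw response, pasted in as text
raw_json <- '[["DP02_0068PE","DP02_0068PM","NAME","metropolitan statistical area/micropolitan statistical area"],
["15.7","1.5","Big Stone Gap, VA Micro Area","13720"],
["31.6","1.0","Billings, MT Metro Area","13740"],
["27.9","0.7","Binghamton, NY Metro Area","13780"]]'

# An array of equal-length string arrays is simplified to a character matrix
parsed <- fromJSON(raw_json)

# The first row holds the column names; the remaining rows hold the data
cbsa_df <- as.data.frame(parsed[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(cbsa_df) <- parsed[1, ]

# Values arrive as strings, so convert the estimate and margin of error to numbers
cbsa_df$DP02_0068PE <- as.numeric(cbsa_df$DP02_0068PE)
cbsa_df$DP02_0068PM <- as.numeric(cbsa_df$DP02_0068PM)

cbsa_df
```

Seeing the raw structure this way makes clear that the E and M suffixes in the request correspond to separate estimate and margin of error columns in the response.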

2.7 Exercises

  1. Review the available geographies in tidycensus from the geography table in this chapter. Acquire data on median age (variable B01002_001) for a geography we have not yet used.

  2. Use the load_variables() function to find a variable that interests you that we haven’t used yet. Use get_acs() to fetch data from the 2015-2019 ACS for counties in the state where you live, where you have visited, or where you would like to visit.