3 Wrangling Census data with tidyverse tools
One of the most popular frameworks for data analysis in R is the tidyverse, a suite of packages designed for integrated data wrangling, visualization, and modeling. The “tidy” or long-form data returned by default in tidycensus is designed to work well with tidyverse analytic workflows. This chapter provides an overview of how to use tidyverse tools to gain additional insights about US Census data retrieved with tidycensus. It concludes with discussion about margins of error (MOEs) in the American Community Survey and how to wrangle and interpret MOEs appropriately.
3.1 The tidyverse
The tidyverse is a collection of R packages that are designed to work together in common data wrangling, analysis, and visualization projects. Many of these R packages, maintained by RStudio, are among the most popular R packages worldwide. Some of the key packages you’ll use in the tidyverse include:
readr (Wickham and Hester 2021), which contains tools for importing and exporting datasets;
dplyr (Wickham et al. 2021), a powerful framework for data wrangling tasks;
tidyr (Wickham 2021b), a package for reshaping data;
purrr (Henry and Wickham 2020), a comprehensive framework for functional programming and iteration;
ggplot2 (Wickham 2016), a data visualization package based on the Grammar of Graphics.
The core data structure used in the tidyverse is the tibble, which is an R data frame with some small enhancements to improve the user experience. tidycensus returns tibbles by default.
A full treatment of the tidyverse and its functionality is beyond the scope of this book; however, the examples in this chapter will introduce you to several key tidyverse features using US Census Bureau data. For a more general and broader treatment of the tidyverse, I recommend the R for Data Science book (Wickham and Grolemund 2017).
3.2 Exploring Census data with tidyverse tools
Census data queries using tidycensus, combined with core tidyverse functions, are excellent ways to explore downloaded Census data. Chapter 2 covered how to download data from various Census datasets using tidycensus and return the data in a desired format. A common next step in an analytic process will involve data exploration, which is handled by a wide range of tools in the tidyverse.
To get started, the tidycensus and tidyverse packages are loaded. “tidyverse” is not an analytic package itself, but rather a meta-package that attaches several core tidyverse packages.
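The loading step itself, not shown above, amounts to two library() calls:

library(tidycensus)
library(tidyverse)

The package load message gives you more information: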
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Eight tidyverse packages are loaded: ggplot2, tibble (Müller and Wickham 2021), purrr, dplyr, readr, and tidyr are included along with stringr (Wickham 2019a) for string manipulation and forcats (Wickham 2021a) for working with factors. These tools collectively can be used for many core Census data analysis tasks.
3.2.1 Sorting and filtering data
For a first example, let’s request data on median age from the 2016-2020 ACS with get_acs() for all counties in the United States. This requires specifying geography = "county" and leaving state set to NULL, the default.
median_age <- get_acs(
geography = "county",
variables = "B01002_001",
year = 2020
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
01001 | Autauga County, Alabama | B01002_001 | 38.6 | 0.6 |
01003 | Baldwin County, Alabama | B01002_001 | 43.2 | 0.4 |
01005 | Barbour County, Alabama | B01002_001 | 40.1 | 0.6 |
01007 | Bibb County, Alabama | B01002_001 | 39.9 | 1.2 |
01009 | Blount County, Alabama | B01002_001 | 41.0 | 0.5 |
01011 | Bullock County, Alabama | B01002_001 | 39.7 | 1.9 |
01013 | Butler County, Alabama | B01002_001 | 41.2 | 0.6 |
01015 | Calhoun County, Alabama | B01002_001 | 39.5 | 0.4 |
01017 | Chambers County, Alabama | B01002_001 | 41.9 | 0.7 |
01019 | Cherokee County, Alabama | B01002_001 | 46.8 | 0.5 |
The default method for printing data used by the tibble package shows the first 10 rows of the dataset, which in this case prints counties in Alabama. A first exploratory data analysis question might involve understanding which counties are the youngest and oldest in the United States as measured by median age. This task can be accomplished with the arrange() function found in the dplyr package. arrange() sorts a dataset by values in one or more columns and returns the sorted result. To view the dataset in ascending order of a given column, supply the data object and a column name to arrange().
arrange(median_age, estimate)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
35011 | De Baca County, New Mexico | B01002_001 | 22.2 | 6.9 |
51678 | Lexington city, Virginia | B01002_001 | 22.2 | 0.8 |
16065 | Madison County, Idaho | B01002_001 | 23.5 | 0.2 |
46121 | Todd County, South Dakota | B01002_001 | 23.6 | 0.6 |
51750 | Radford city, Virginia | B01002_001 | 23.7 | 0.6 |
13053 | Chattahoochee County, Georgia | B01002_001 | 24.0 | 0.7 |
02158 | Kusilvak Census Area, Alaska | B01002_001 | 24.1 | 0.2 |
49049 | Utah County, Utah | B01002_001 | 25.0 | 0.1 |
46027 | Clay County, South Dakota | B01002_001 | 25.2 | 0.5 |
53075 | Whitman County, Washington | B01002_001 | 25.2 | 0.2 |
Per the 2016-2020 ACS, the youngest county is De Baca County, New Mexico. Two of the five youngest “counties” in the United States are independent cities in Virginia, which are treated as county-equivalents. Both Lexington and Radford are college towns; Lexington is home to both Washington & Lee University and the Virginia Military Institute, and Radford houses Radford University.
To retrieve the oldest counties in the United States by median age, an analyst can use the desc() function available in dplyr to sort the estimate column in descending order.
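A sketch of that call, wrapping the sort column in desc() within arrange():

arrange(median_age, desc(estimate))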
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
12119 | Sumter County, Florida | B01002_001 | 68.0 | 0.3 |
48301 | Loving County, Texas | B01002_001 | 62.2 | 37.8 |
48243 | Jeff Davis County, Texas | B01002_001 | 61.3 | 36.8 |
08027 | Custer County, Colorado | B01002_001 | 60.1 | 3.4 |
12015 | Charlotte County, Florida | B01002_001 | 59.5 | 0.2 |
51091 | Highland County, Virginia | B01002_001 | 59.5 | 4.8 |
35003 | Catron County, New Mexico | B01002_001 | 59.4 | 2.6 |
51133 | Northumberland County, Virginia | B01002_001 | 59.3 | 0.7 |
26131 | Ontonagon County, Michigan | B01002_001 | 59.1 | 0.5 |
48443 | Terrell County, Texas | B01002_001 | 59.1 | 9.4 |
The oldest county in the United States is Sumter County, Florida. Sumter County is home to The Villages, a Census-designated place that includes a large age-restricted community also called The Villages.
The tidyverse includes several tools for parsing datasets that allow for exploration beyond sorting and browsing data. The filter() function in dplyr queries a dataset for rows where a given condition evaluates to TRUE, and retains only those rows. For analysts who are familiar with databases and SQL, this is equivalent to a WHERE clause. This helps analysts subset their data for specific areas by their characteristics, and answer questions like “how many counties in the US have a median age of 50 or older?”
filter(median_age, estimate >= 50)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
02105 | Hoonah-Angoon Census Area, Alaska | B01002_001 | 52.1 | 2.9 |
04007 | Gila County, Arizona | B01002_001 | 50.4 | 0.2 |
04012 | La Paz County, Arizona | B01002_001 | 57.4 | 0.6 |
04015 | Mohave County, Arizona | B01002_001 | 52.3 | 0.2 |
04025 | Yavapai County, Arizona | B01002_001 | 54.1 | 0.2 |
05005 | Baxter County, Arkansas | B01002_001 | 52.3 | 0.5 |
05089 | Marion County, Arkansas | B01002_001 | 52.1 | 0.8 |
05097 | Montgomery County, Arkansas | B01002_001 | 50.6 | 0.8 |
05137 | Stone County, Arkansas | B01002_001 | 50.0 | 0.5 |
06009 | Calaveras County, California | B01002_001 | 52.8 | 0.6 |
Functions like arrange() and filter() operate on row values and organize data by row. Other tidyverse functions, like tidyr’s separate(), operate on columns. The NAME column, returned by default by most tidycensus functions, contains a basic description of the location that can be more intuitive than the GEOID. For the 2016-2020 ACS, NAME is formatted as “X County, Y”, where X is the county name and Y is the state name. separate() can split this column into two columns, one retaining the county name and the other the state; this can be useful for analysts who need to complete a comparative analysis by state.
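A minimal sketch of the separate() call that would produce the table below, splitting NAME on its comma-space delimiter:

separate(
  median_age,
  NAME,
  into = c("county", "state"),
  sep = ", "
)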
GEOID | county | state | variable | estimate | moe |
---|---|---|---|---|---|
01001 | Autauga County | Alabama | B01002_001 | 38.6 | 0.6 |
01003 | Baldwin County | Alabama | B01002_001 | 43.2 | 0.4 |
01005 | Barbour County | Alabama | B01002_001 | 40.1 | 0.6 |
01007 | Bibb County | Alabama | B01002_001 | 39.9 | 1.2 |
01009 | Blount County | Alabama | B01002_001 | 41.0 | 0.5 |
01011 | Bullock County | Alabama | B01002_001 | 39.7 | 1.9 |
01013 | Butler County | Alabama | B01002_001 | 41.2 | 0.6 |
01015 | Calhoun County | Alabama | B01002_001 | 39.5 | 0.4 |
01017 | Chambers County | Alabama | B01002_001 | 41.9 | 0.7 |
01019 | Cherokee County | Alabama | B01002_001 | 46.8 | 0.5 |
You may have noticed above that existing variable names are unquoted when referenced in tidyverse functions. Many tidyverse functions use non-standard evaluation to refer to column names, which means that column names can be used as arguments directly without quotation marks. Non-standard evaluation makes interactive programming faster, especially for beginners; however, it can introduce some complications when writing your own functions or R packages. A full treatment of non-standard evaluation is beyond the scope of this book; Hadley Wickham’s Advanced R (Wickham 2019b) is the best resource on the topic if you’d like to learn more.
3.2.2 Using summary variables and calculating new columns
Data in Census and ACS tables, as in the example above, are frequently composed of variables that individually constitute sub-categories, such as the numbers of households in different household income bands. One limitation of this approach, however, is that the data and the resulting analysis return estimated counts, which are difficult to compare across geographies. For example, Maricopa County in Arizona is the state’s most populous county with 4.3 million residents; the second-largest county, Pima, has just over 1 million residents, and six of the state’s 15 counties have fewer than 100,000 residents. In turn, comparing Maricopa’s estimates with those of smaller counties in the state would often be inappropriate.
A solution to this issue might involve normalizing the estimated count data by dividing it by the overall population from which the sub-group is derived. Appropriate denominators for ACS tables are frequently found in the tables themselves as variables. In ACS table B19001, which covers the number of households by income bands, the variable B19001_001 represents the total number of households in a given enumeration unit. Given that this variable is an appropriate denominator for the other variables in the table, it merits its own column to facilitate the calculation of proportions or percentages.
In tidycensus, this can be accomplished by supplying a variable ID to the summary_var parameter in both the get_acs() and get_decennial() functions. When using get_decennial(), doing so will create two new columns for the decennial Census datasets, summary_var and summary_value, representing the summary variable ID and the summary variable’s value. When using get_acs(), using summary_var creates three new columns for the ACS datasets, summary_var, summary_est, and summary_moe, which include the summary variable ID along with the ACS estimate and margin of error for the summary variable.
With this information in hand, normalizing data is straightforward. The following example uses the summary_var parameter to compare the population of counties in Arizona by race and Hispanic origin with their baseline populations, using data from the 2016-2020 ACS.
race_vars <- c(
White = "B03002_003",
Black = "B03002_004",
Native = "B03002_005",
Asian = "B03002_006",
HIPI = "B03002_007",
Hispanic = "B03002_012"
)
az_race <- get_acs(
geography = "county",
state = "AZ",
variables = race_vars,
summary_var = "B03002_001",
year = 2020
)
GEOID | NAME | variable | estimate | moe | summary_est | summary_moe |
---|---|---|---|---|---|---|
04001 | Apache County, Arizona | White | 12993 | 56 | 71714 | NA |
04001 | Apache County, Arizona | Black | 544 | 56 | 71714 | NA |
04001 | Apache County, Arizona | Native | 51979 | 327 | 71714 | NA |
04001 | Apache County, Arizona | Asian | 262 | 76 | 71714 | NA |
04001 | Apache County, Arizona | HIPI | 49 | 14 | 71714 | NA |
04001 | Apache County, Arizona | Hispanic | 4751 | NA | 71714 | NA |
04003 | Cochise County, Arizona | White | 69095 | 350 | 126442 | NA |
04003 | Cochise County, Arizona | Black | 4512 | 378 | 126442 | NA |
04003 | Cochise County, Arizona | Native | 1058 | 176 | 126442 | NA |
04003 | Cochise County, Arizona | Asian | 2371 | 241 | 126442 | NA |
By using dplyr’s mutate() function, we calculate a new column, percent, representing the percentage of each county’s population that corresponds to each racial/ethnic group in 2016-2020. The select() function, also in dplyr, retains only those columns that we need to view.
az_race_percent <- az_race %>%
mutate(percent = 100 * (estimate / summary_est)) %>%
select(NAME, variable, percent)
NAME | variable | percent |
---|---|---|
Apache County, Arizona | White | 18.1178013 |
Apache County, Arizona | Black | 0.7585688 |
Apache County, Arizona | Native | 72.4809661 |
Apache County, Arizona | Asian | 0.3653401 |
Apache County, Arizona | HIPI | 0.0683270 |
Apache County, Arizona | Hispanic | 6.6249268 |
Cochise County, Arizona | White | 54.6456083 |
Cochise County, Arizona | Black | 3.5684345 |
Cochise County, Arizona | Native | 0.8367473 |
Cochise County, Arizona | Asian | 1.8751681 |
The above example introduces some additional syntax common to tidyverse data analyses. The %>% operator from the magrittr R package (Bache and Wickham 2020) is a pipe operator that allows analysts to develop analytic pipelines, which are deeply embedded in tidyverse-centric data analysis workflows. The pipe operator passes the result of a given line of code as the first argument of the code on the next line. In turn, analysts can develop pipelines of related operations that fit together in a coherent way.
tidyverse developers recommend that the pipe operator be read as “then”. The above code can in turn be interpreted as “Create a new data object az_race_percent by using the existing data object az_race THEN creating a new percent column THEN selecting the NAME, variable, and percent columns.”
Since R version 4.1, the base installation of R also includes a pipe operator, |>. It works much the same way as the magrittr pipe %>%, though %>% has some small additional features that make it work well within tidyverse analysis pipelines. In turn, %>% will be used in the examples throughout this book.
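A quick sketch of their interchangeability for simple calls, reusing the median_age data from above:

# Both lines return counties with a median age of 50 or above
median_age %>% filter(estimate >= 50)
median_age |> filter(estimate >= 50)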
3.3 Group-wise Census data analysis
The split-apply-combine model of data analysis, as discussed in Wickham (2011), is a powerful framework for analyzing demographic data. In general terms, an analyst will apply this framework as follows:
The analyst identifies salient groups in a dataset between which they want to make comparisons. The dataset is then split into multiple pieces, one for each group.
A function is then applied to each group in turn. This might be a simple summary function, such as taking the maximum or calculating the mean, or a custom function defined by the analyst.
Finally, the results of the function applied to each group are combined back into a single dataset, allowing the analyst to compare the results by group.
Given the hierarchical nature of US Census Bureau data, “groups” across which analysts can make comparisons are found in just about every analytic task. In many cases, the split-apply-combine model of data analysis will be useful to analysts as they make sense of patterns and trends found in Census data.
In the tidyverse, split-apply-combine is implemented with the group_by() function in the dplyr package. group_by() does the work of splitting a dataset into groups, allowing subsequent functions in an analytic pipeline to be applied to each group and then combined back into a single dataset. The examples that follow illustrate some common group-wise analyses.
3.3.1 Making group-wise comparisons
The az_race_percent dataset created above is an example of a dataset suitable for group-wise data analysis. It includes two columns that could be used as group definitions: NAME, representing the county, and variable, representing the racial or ethnic group. Split-apply-combine could be used for either group definition to make comparisons for data in Arizona across these categories.
In a first example, we can deploy group-wise data analysis to identify the largest racial or ethnic group in each county in Arizona. This involves setting up a data analysis pipeline with the magrittr pipe and performing a grouped filter, where the filter() operation is applied separately to each group. In this example, the filter condition will be specified as percent == max(percent). We can read the analytic pipeline as “Create a new dataset, largest_group, by using the az_race_percent dataset THEN grouping the dataset by the NAME column THEN filtering for rows that are equal to the maximum value of percent for each group.”
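A sketch of that pipeline:

largest_group <- az_race_percent %>%
  group_by(NAME) %>%
  filter(percent == max(percent))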
NAME | variable | percent |
---|---|---|
Apache County, Arizona | Native | 72.48097 |
Cochise County, Arizona | White | 54.64561 |
Coconino County, Arizona | White | 53.80798 |
Gila County, Arizona | White | 61.85232 |
Graham County, Arizona | White | 50.88764 |
Greenlee County, Arizona | Hispanic | 47.26889 |
La Paz County, Arizona | White | 57.18089 |
Maricopa County, Arizona | White | 54.55515 |
Mohave County, Arizona | White | 76.69694 |
Navajo County, Arizona | Native | 42.71386 |
The result of the grouped filter allows us to review the most common racial or ethnic group in each Arizona county along with how their percentages vary. For example, in two Arizona counties (Greenlee and Navajo), none of the racial or ethnic groups form a majority of the population.
group_by() is commonly paired with the summarize() function in data analysis pipelines. summarize() generates a new, condensed dataset that by default returns a column for the grouping variable(s) and columns representing the results of one or more functions applied to those groups. In the example below, the median() function is used to identify the median percentage for each of the racial and ethnic groups in the dataset across counties in Arizona. In turn, variable is passed to group_by() as the grouping variable.
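A sketch of that summarization:

az_race_percent %>%
  group_by(variable) %>%
  summarize(median_pct = median(percent))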
variable | median_pct |
---|---|
Asian | 0.9918415 |
Black | 1.2857283 |
HIPI | 0.1070384 |
Hispanic | 30.4721836 |
Native | 3.6344427 |
White | 53.8079773 |
The result of this operation tells us the median county percentage of each racial and ethnic group for the state of Arizona. A broader analysis might involve the calculation of these percentages hierarchically, finding the median county percentage of given attributes across states, for example.
3.3.2 Tabulating new groups
In the examples above, suitable groups in the NAME and variable columns were already found in the data retrieved with get_acs(). Commonly, analysts will also need to calculate new custom groups to address specific analytic questions. For example, variables in ACS table B19001 represent groups of households whose household incomes fall into a variety of categories: less than $10,000/year, between $10,000/year and $19,999/year, and so forth. These categories may be more granular than needed by an analyst. As such, an analyst might take the following steps: 1) recode the ACS variables into wider income bands; 2) group the data by the wider income bands; 3) calculate grouped sums to generate new estimates.
Consider the following example, using household income data for Minnesota counties from the 2012-2016 ACS:
mn_hh_income <- get_acs(
geography = "county",
table = "B19001",
state = "MN",
year = 2016
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
27001 | Aitkin County, Minnesota | B19001_001 | 7640 | 262 |
27001 | Aitkin County, Minnesota | B19001_002 | 562 | 77 |
27001 | Aitkin County, Minnesota | B19001_003 | 544 | 72 |
27001 | Aitkin County, Minnesota | B19001_004 | 472 | 69 |
27001 | Aitkin County, Minnesota | B19001_005 | 508 | 68 |
27001 | Aitkin County, Minnesota | B19001_006 | 522 | 92 |
27001 | Aitkin County, Minnesota | B19001_007 | 447 | 61 |
27001 | Aitkin County, Minnesota | B19001_008 | 390 | 49 |
27001 | Aitkin County, Minnesota | B19001_009 | 426 | 64 |
27001 | Aitkin County, Minnesota | B19001_010 | 415 | 65 |
Our data include household income categories for each county in the rows. However, let’s say we only need three income categories for purposes of analysis: below $35,000/year, between $35,000/year and $75,000/year, and $75,000/year and up.
We first need to do some transformation of our data to recode the variables appropriately. First, we will remove variable B19001_001, which represents the total number of households for each county. Second, we use the case_when() function from the dplyr package to identify groups of variables that correspond to our desired groupings. Given that the variables are ordered in the ACS table in relation to the household income values, the less-than operator can be used to identify groups.
The syntax of case_when() can appear complex to beginners, so it is worth stepping through how the function works. Inside the mutate() function, which is used to create a new variable named incgroup, case_when() steps through a series of logical conditions that are evaluated in order, similar to a series of if/else statements. The first condition is evaluated, telling the function to assign the value below35k to all rows with a variable value that comes before "B19001_008" - which in this case will be B19001_002 (income less than $10,000) through B19001_007 (income between $30,000 and $34,999). The second condition is then evaluated for all those rows not accounted for by the first condition. This means that case_when() knows not to assign "bw35kand75k" to the income group of $10,000 and below even though its variable comes before B19001_013. The final condition in case_when() can be set to TRUE, which in this scenario translates to “all other values.”
mn_hh_income_recode <- mn_hh_income %>%
filter(variable != "B19001_001") %>%
mutate(incgroup = case_when(
variable < "B19001_008" ~ "below35k",
variable < "B19001_013" ~ "bw35kand75k",
TRUE ~ "above75k"
))
GEOID | NAME | variable | estimate | moe | incgroup |
---|---|---|---|---|---|
27001 | Aitkin County, Minnesota | B19001_002 | 562 | 77 | below35k |
27001 | Aitkin County, Minnesota | B19001_003 | 544 | 72 | below35k |
27001 | Aitkin County, Minnesota | B19001_004 | 472 | 69 | below35k |
27001 | Aitkin County, Minnesota | B19001_005 | 508 | 68 | below35k |
27001 | Aitkin County, Minnesota | B19001_006 | 522 | 92 | below35k |
27001 | Aitkin County, Minnesota | B19001_007 | 447 | 61 | below35k |
27001 | Aitkin County, Minnesota | B19001_008 | 390 | 49 | bw35kand75k |
27001 | Aitkin County, Minnesota | B19001_009 | 426 | 64 | bw35kand75k |
27001 | Aitkin County, Minnesota | B19001_010 | 415 | 65 | bw35kand75k |
27001 | Aitkin County, Minnesota | B19001_011 | 706 | 81 | bw35kand75k |
Our result illustrates how the different variable IDs are mapped to the new, recoded categories that we specified in case_when(). The group_by() %>% summarize() workflow can now be applied to the recoded categories by county to tabulate the data into a smaller number of groups.
mn_group_sums <- mn_hh_income_recode %>%
group_by(GEOID, incgroup) %>%
summarize(estimate = sum(estimate))
GEOID | incgroup | estimate |
---|---|---|
27001 | above75k | 1706 |
27001 | below35k | 3055 |
27001 | bw35kand75k | 2879 |
27003 | above75k | 61403 |
27003 | below35k | 24546 |
27003 | bw35kand75k | 39311 |
27005 | above75k | 4390 |
27005 | below35k | 4528 |
27005 | bw35kand75k | 4577 |
27007 | above75k | 4491 |
Our data now reflect the new estimates by group by county.
3.4 Comparing ACS estimates over time
A common task when working with Census data is to examine demographic change over time. Data from the Census API - and consequently tidycensus - only go back to the 2000 Decennial Census. For analysts who want to go even further back, decennial Census data are available from 1790 onward from the National Historical Geographic Information System, or NHGIS, which will be covered in detail in Chapter 11.
3.4.1 Time-series analysis: some cautions
Before engaging in any sort of time-series analysis of Census data, analysts need to account for potential problems that can emerge when using Census data longitudinally. One major issue is that geographies can change over time. For example, let’s say we are interested in analyzing data on Oglala Lakota County, South Dakota. We can get recent data from the ACS using tools learned in Chapter 2:
oglala_lakota_age <- get_acs(
geography = "county",
state = "SD",
county = "Oglala Lakota",
table = "B01001",
year = 2020
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
46102 | Oglala Lakota County, South Dakota | B01001_001 | 14277 | NA |
46102 | Oglala Lakota County, South Dakota | B01001_002 | 6930 | 132 |
46102 | Oglala Lakota County, South Dakota | B01001_003 | 761 | 66 |
46102 | Oglala Lakota County, South Dakota | B01001_004 | 794 | 128 |
46102 | Oglala Lakota County, South Dakota | B01001_005 | 707 | 123 |
46102 | Oglala Lakota County, South Dakota | B01001_006 | 394 | 20 |
46102 | Oglala Lakota County, South Dakota | B01001_007 | 227 | 15 |
46102 | Oglala Lakota County, South Dakota | B01001_008 | 85 | 53 |
46102 | Oglala Lakota County, South Dakota | B01001_009 | 165 | 70 |
46102 | Oglala Lakota County, South Dakota | B01001_010 | 356 | 101 |
To understand how the age composition of the county has changed over the past 10 years, we may want to look at the 2006-2010 ACS for the county. Normally, we would just change the year argument to 2010:
oglala_lakota_age_10 <- get_acs(
geography = "county",
state = "SD",
county = "Oglala Lakota",
table = "B01001",
year = 2010
)
## Error in `map()`:
## ℹ In index: 1.
## ℹ With name: 1.
## Caused by error:
## ! Your API call has errors. The API message returned is .
The request errors, and we don’t get an informative error message back from the API as was discussed in Section 2.6. The problem here is that Oglala Lakota County had a different name in 2010, Shannon County, meaning that the county = "Oglala Lakota" argument will not return any data. In turn, the equivalent code for the 2006-2010 ACS would use county = "Shannon".
oglala_lakota_age_10 <- get_acs(
geography = "county",
state = "SD",
county = "Shannon",
table = "B01001",
year = 2010
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
46113 | Shannon County, South Dakota | B01001_001 | 13437 | NA |
46113 | Shannon County, South Dakota | B01001_002 | 6553 | 47 |
46113 | Shannon County, South Dakota | B01001_003 | 770 | 99 |
46113 | Shannon County, South Dakota | B01001_004 | 565 | 151 |
46113 | Shannon County, South Dakota | B01001_005 | 833 | 151 |
46113 | Shannon County, South Dakota | B01001_006 | 541 | 47 |
46113 | Shannon County, South Dakota | B01001_007 | 275 | 99 |
46113 | Shannon County, South Dakota | B01001_008 | 164 | 89 |
46113 | Shannon County, South Dakota | B01001_009 | 143 | 54 |
46113 | Shannon County, South Dakota | B01001_010 | 342 | 83 |
Note the differences in the GEOID column between the two tables of data. When a county or geographic entity changes its name, the Census Bureau assigns it a new GEOID, meaning that analysts need to take care when dealing with those changes. A full listing of geography changes is available on the Census website for each year.
In addition to changes in geographic identifiers, variable IDs can change over time as well. For example, the ACS Data Profile is commonly used for pre-computed, normalized ACS estimates. Let’s say that we are interested in analyzing the percentage of residents age 25 and up with a 4-year college degree for counties in Colorado from the 2019 1-year ACS. We’d first look up the appropriate variable ID with load_variables(2019, "acs1/profile"), then use get_acs():
co_college19 <- get_acs(
geography = "county",
variables = "DP02_0068P",
state = "CO",
survey = "acs1",
year = 2019
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
08001 | Adams County, Colorado | DP02_0068P | 25.4 | 1.3 |
08005 | Arapahoe County, Colorado | DP02_0068P | 43.8 | 1.3 |
08013 | Boulder County, Colorado | DP02_0068P | 64.8 | 1.8 |
08014 | Broomfield County, Colorado | DP02_0068P | 56.9 | 3.3 |
08031 | Denver County, Colorado | DP02_0068P | 53.1 | 1.1 |
We get back data for counties with populations of 65,000 and greater, as these are the geographies available in the 1-year ACS. The data make sense: Boulder County, home to the University of Colorado, has a very high percentage of its population with a 4-year degree or higher. However, when we run the exact same query for the 2018 1-year ACS:
co_college18 <- get_acs(
geography = "county",
variables = "DP02_0068P",
state = "CO",
survey = "acs1",
year = 2018
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
08001 | Adams County, Colorado | DP02_0068P | 375798 | NA |
08005 | Arapahoe County, Colorado | DP02_0068P | 497198 | NA |
08013 | Boulder County, Colorado | DP02_0068P | 263938 | NA |
08014 | Broomfield County, Colorado | DP02_0068P | 53400 | NA |
08031 | Denver County, Colorado | DP02_0068P | 575870 | NA |
The values are completely different, and clearly not percentages! This is because variable IDs for the Data Profile are unique to each year and in turn should not be used for time-series analysis. The returned results above represent the civilian population age 18 and up, and have nothing to do with educational attainment.
3.4.2 Preparing time-series ACS estimates
The safest option for time-series analysis in the ACS is to use the Comparison Profile Tables. These tables are available for both the 1-year and 5-year ACS, and allow for comparison of demographic indicators over the past five years for a given year. Using the Comparison Profile tables also brings the benefit of additional variable harmonization, such as inflation-adjusted income estimates.
Data from the Comparison Profile are accessed just like other ACS variables using get_acs(). The example below illustrates how to get data from the ACS Comparison Profile on inflation-adjusted median household income for counties and county-equivalents in Alaska.
ak_income_compare <- get_acs(
geography = "county",
variables = c(
income15 = "CP03_2015_062",
income20 = "CP03_2020_062"
),
state = "AK",
year = 2020
)
For the 2016-2020 ACS, the “comparison year” is 2015, representing the closest non-overlapping 5-year dataset, which in this case is 2011-2015. We can examine the results, which are inflation-adjusted for appropriate comparison:
GEOID | NAME | variable | estimate |
---|---|---|---|
02016 | Aleutians West Census Area, Alaska | income15 | 92500 |
02016 | Aleutians West Census Area, Alaska | income20 | 87443 |
02020 | Anchorage Municipality, Alaska | income15 | 85534 |
02020 | Anchorage Municipality, Alaska | income20 | 84813 |
02050 | Bethel Census Area, Alaska | income15 | 55692 |
02050 | Bethel Census Area, Alaska | income20 | 54400 |
02090 | Fairbanks North Star Borough, Alaska | income15 | 77590 |
02090 | Fairbanks North Star Borough, Alaska | income20 | 76464 |
3.4.2.1 Iterating over ACS years with tidyverse tools
Using the Detailed Tables also represents a safer option than the Data Profile, as it ensures that variable IDs will remain consistent across years, allowing for consistent and correct analysis. That said, there are still some potential pitfalls to account for when using the Detailed Tables. The Census Bureau will add and remove variables from survey to survey depending on data needs and data availability. For example, questions are sometimes added to and removed from the ACS, meaning that you won’t always be able to get every data point for every year and geography combination. In turn, it is still important to check on data availability using load_variables() for the years you plan to analyze before carrying out your time-series analysis.
Let’s re-engineer the analysis above on educational attainment in Colorado counties, this time computing a time series from 2010 to 2019. Information on “bachelor’s degree or higher” is split by sex and across different tiers of educational attainment in the Detailed Tables, found in ACS table B15002. Given that we only need a few variables (representing estimates of populations age 25+ who have finished a 4-year degree or graduate degrees, by sex), we’ll request those variables directly rather than the entire B15002 table.
college_vars <- c("B15002_015",
"B15002_016",
"B15002_017",
"B15002_018",
"B15002_032",
"B15002_033",
"B15002_034",
"B15002_035")
We’ll now use these variables to request data on college degree holders from the ACS for counties in Colorado for each of the 1-year ACS surveys from 2010 to 2019. In most cases, this process should be streamlined with iteration. Thus far, we are familiar with using the year argument in get_acs() to request data for a specific year. Writing out ten different calls to get_acs(), however - one for each year - would be tedious and would require a fair amount of repetitive code! Iteration helps us avoid repetitive coding, as it allows us to carry out the same process over a sequence of values. Programmers familiar with iteration will likely know of “loop” operators like for and while, which are available in base R and most other programming languages in some variety. Base R also includes the *apply() family of functions (e.g. lapply(), mapply(), sapply()), which iterate over a sequence of values and apply a given function to each value.
The tidyverse approach to iteration is found in the purrr package. purrr includes a variety of functions that are designed to integrate well in workflows that require iteration and use other tidyverse tools. The map_*() family of functions iterate over values and return a desired result; map() returns a list, map_int() returns an integer vector, and map_chr() returns a character vector, for example. With tidycensus, the map_dfr() function is particularly useful. map_dfr() iterates over an input, applies a function or process defined by the user to each element, then row-binds the results into a single data frame. The example below illustrates how this works for the years 2010 through 2019.
years <- 2010:2019
names(years) <- years
college_by_year <- map_dfr(years, ~{
get_acs(
geography = "county",
variables = college_vars,
state = "CO",
summary_var = "B15002_001",
survey = "acs1",
year = .x
)
}, .id = "year")
For users newer to R, iteration and purrr syntax can feel complex, so it is worth stepping through how the above code sample works.
- First, a numeric vector of years is defined with the syntax 2010:2019. This will create a vector of years at 1-year intervals. These values are set as the names of the vector as well, as map_dfr() has additional functionality for working with named objects.
- map_dfr() then takes three arguments above:
  - The first argument is the object that map_dfr() will iterate over, which in this case is our years vector. This means that the process we set up will be run once for each element of years.
  - The second argument is a formula we specify with the tilde (~) operator and curly braces ({...}). The code inside the curly braces will be run once for each element of years. The local variable .x, used inside the formula, takes on each value of years sequentially. In turn, we are running the equivalent of get_acs() with year = 2010, year = 2011, and so forth. Once get_acs() is run for each year, the results are combined into a single output data frame.
  - The .id argument, which is optional but used here, creates a new column in the output data frame that contains values equivalent to the names of the input object, which in this case is years. By setting .id = "year", we tell map_dfr() to name the new column that will contain these values year.
Let’s review the result:
year | GEOID | NAME | variable | estimate | moe | summary_est | summary_moe |
---|---|---|---|---|---|---|---|
2010 | 08001 | Adams County, Colorado | B15002_015 | 20501 | 1983 | 275849 | 790 |
2011 | 08001 | Adams County, Colorado | B15002_015 | 21233 | 2124 | 281231 | 865 |
2012 | 08001 | Adams County, Colorado | B15002_015 | 19238 | 2020 | 287924 | 693 |
2013 | 08001 | Adams County, Colorado | B15002_015 | 23818 | 2445 | 295122 | 673 |
2014 | 08001 | Adams County, Colorado | B15002_015 | 20255 | 1928 | 304394 | 541 |
2015 | 08001 | Adams County, Colorado | B15002_015 | 22962 | 2018 | 312281 | 705 |
2016 | 08001 | Adams County, Colorado | B15002_015 | 25744 | 2149 | 318077 | 525 |
2017 | 08001 | Adams County, Colorado | B15002_015 | 26159 | 2320 | 324185 | 562 |
2018 | 08001 | Adams County, Colorado | B15002_015 | 28113 | 2078 | 331247 | 955 |
2019 | 08001 | Adams County, Colorado | B15002_015 | 27552 | 2070 | 336931 | 705 |
The result is a long-form dataset that contains a time series of each requested ACS variable for each county in Colorado that is available in the 1-year ACS. The code below outlines a group_by() %>% summarize() workflow for calculating the percentage of the population age 25 and up with a 4-year college degree, then uses the pivot_wider() function from the tidyr package to spread the years across the columns for tabular data display.
percent_college_by_year <- college_by_year %>%
group_by(NAME, year) %>%
summarize(numerator = sum(estimate),
denominator = first(summary_est)) %>%
mutate(pct_college = 100 * (numerator / denominator)) %>%
pivot_wider(id_cols = NAME,
names_from = year,
values_from = pct_college)
NAME | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
---|---|---|---|---|---|---|---|---|---|---|
Adams County, Colorado | 20.57394 | 20.51801 | 20.64538 | 23.09384 | 22.16929 | 22.79742 | 22.95293 | 22.87552 | 25.74152 | 25.39956 |
Arapahoe County, Colorado | 37.03001 | 38.24506 | 39.28435 | 39.42478 | 40.94194 | 41.03578 | 41.48359 | 43.69387 | 42.71285 | 43.78332 |
Boulder County, Colorado | 57.50285 | 59.05601 | 57.88284 | 58.53214 | 58.04066 | 60.57147 | 60.63005 | 63.18150 | 62.51394 | 64.80486 |
Broomfield County, Colorado | NA | NA | NA | NA | NA | 56.07776 | 51.94338 | 55.13359 | 56.27740 | 56.87181 |
Denver County, Colorado | 40.87971 | 42.97122 | 44.65358 | 44.35340 | 44.25600 | 47.10820 | 47.39683 | 49.32692 | 51.34580 | 53.10088 |
Douglas County, Colorado | 54.96800 | 53.27936 | 55.09223 | 57.66999 | 56.48866 | 56.06928 | 59.42687 | 58.53342 | 58.37539 | 58.13747 |
El Paso County, Colorado | 34.11467 | 35.69184 | 34.91315 | 35.47612 | 36.49302 | 36.43089 | 38.67864 | 39.19931 | 38.83872 | 39.04942 |
Jefferson County, Colorado | 40.83113 | 39.54961 | 41.43825 | 41.04234 | 41.99768 | 43.20923 | 43.51953 | 45.57140 | 45.83786 | 47.61074 |
Larimer County, Colorado | 45.80197 | 42.83543 | 44.71423 | 43.33800 | 42.67180 | 46.16705 | 46.78871 | 47.90465 | 47.62330 | 49.01654 |
Mesa County, Colorado | 24.99285 | 25.82724 | 23.01511 | 27.63325 | 25.14875 | 30.27630 | 25.02980 | 25.82455 | 29.95739 | 29.78537 |
This particular format is suitable for data display or writing to an Excel spreadsheet for colleagues who are not R-based. Methods for visualization of time-series estimates from the ACS will be covered in Section 4.4.
3.5 Handling margins of error in the American Community Survey with tidycensus
A topic of critical importance when working with data from the American Community Survey is the margin of error. As opposed to the decennial US Census, which is based on a complete enumeration of the US population, the ACS is based on a sample with estimates characterized by margins of error. By default, MOEs are returned at a 90 percent confidence level. This can be translated roughly as “we are 90 percent sure that the true value falls within a range defined by the estimate plus or minus the margin of error.”
As discussed in Chapter 2, tidycensus takes an opinionated approach to margins of error. When applicable, tidycensus will always return the margin of error associated with an estimate, and does not have an option available to return estimates only. For “tidy” or long-form data, these margins of error will be found in the moe column; for wide-form data, margins of error will be found in columns with an M suffix.
The confidence level of the MOE can be controlled with the moe_level argument in get_acs(). The default moe_level is 90, which is what the Census Bureau returns by default. tidycensus can also return MOEs at a confidence level of 95 or 99, using Census Bureau-recommended formulas to adjust the MOE. For example, we might look at data on median household income by county in Rhode Island using the default moe_level of 90:
get_acs(
geography = "county",
state = "Rhode Island",
variables = "B19013_001",
year = 2020
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
44001 | Bristol County, Rhode Island | B19013_001 | 85413 | 6122 |
44003 | Kent County, Rhode Island | B19013_001 | 75857 | 2022 |
44005 | Newport County, Rhode Island | B19013_001 | 84282 | 2629 |
44007 | Providence County, Rhode Island | B19013_001 | 62323 | 1270 |
44009 | Washington County, Rhode Island | B19013_001 | 86970 | 3651 |
A stricter confidence level will increase the size of the MOE relative to its estimate. For example, the Bristol County estimate of $85,413 with a 90 percent MOE of $6,122 implies that we are 90 percent sure the true median household income falls between $79,291 and $91,535; requesting a 99 percent confidence level widens that range:
get_acs(
geography = "county",
state = "Rhode Island",
variables = "B19013_001",
year = 2020,
moe_level = 99
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
44001 | Bristol County, Rhode Island | B19013_001 | 85413 | 9586.791 |
44003 | Kent County, Rhode Island | B19013_001 | 75857 | 3166.366 |
44005 | Newport County, Rhode Island | B19013_001 | 84282 | 4116.902 |
44007 | Providence County, Rhode Island | B19013_001 | 62323 | 1988.766 |
44009 | Washington County, Rhode Island | B19013_001 | 86970 | 5717.311 |
3.5.1 Calculating derived margins of error in tidycensus
For small geographies or small populations, margins of error can get quite large, in some cases exceeding their corresponding estimates. In the example below, we can examine data on age groups by sex for the population age 65 and older for Census tracts in Salt Lake County, Utah. We will first generate a vector of variable IDs for which we want to request data from the ACS using some base R functionality.
In this workflow, an analyst has used load_variables() to look up the variables that represent estimates for populations age 65 and up; this includes B01001_020 through B01001_025 for males, and B01001_044 through B01001_049 for females. Typing out each variable individually would be tedious, so an analyst can use string concatenation to generate the required vector of variable IDs.
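A sketch of that concatenation, storing the result in the vars object used later in this section:

vars <- paste0("B01001_0", c(20:25, 44:49))

vars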
## [1] "B01001_020" "B01001_021" "B01001_022" "B01001_023" "B01001_024"
## [6] "B01001_025" "B01001_044" "B01001_045" "B01001_046" "B01001_047"
## [11] "B01001_048" "B01001_049"
When R evaluates nested expressions like this, it starts with the innermost expressions and works outward. The steps taken to assemble the correct vector of variable IDs are as follows:
- First, the expressions 20:25 and 44:49 are evaluated. The colon operator : in base R, when used between two numbers, will generate a vector of numbers from the first to the second at intervals of 1. This creates vectors of the integers 20 through 25 and 44 through 49, which will serve as the suffixes for the variable IDs.
- Second, the c() function is used to combine the two vectors of integers into a single vector.
- Third, the paste0() function concatenates the string prefix "B01001_0" with each of the integers in the vector created with c() and returns a vector of variable IDs. paste0() is a convenient extension of the more flexible paste() function that concatenates strings with no spaces in between them by default.
The resulting variables object, named vars, can now be used to request variables in a call to get_acs().
salt_lake <- get_acs(
geography = "tract",
variables = vars,
state = "Utah",
county = "Salt Lake",
year = 2020
)
We will now want to examine the margins of error around the estimates in the returned data. Let’s focus on a specific Census tract in Salt Lake County using filter().
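A minimal sketch of that subset, dropping the NAME column to match the display below (the object name example_tract is illustrative):

example_tract <- salt_lake %>%
  filter(GEOID == "49035100100") %>%
  select(-NAME)

example_tract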
GEOID | variable | estimate | moe |
---|---|---|---|
49035100100 | B01001_020 | 11 | 13 |
49035100100 | B01001_021 | 25 | 18 |
49035100100 | B01001_022 | 7 | 10 |
49035100100 | B01001_023 | 4 | 7 |
49035100100 | B01001_024 | 0 | 12 |
49035100100 | B01001_025 | 17 | 20 |
49035100100 | B01001_044 | 0 | 12 |
49035100100 | B01001_045 | 4 | 7 |
49035100100 | B01001_046 | 21 | 17 |
49035100100 | B01001_047 | 123 | 168 |
In many cases, the margins of error exceed their corresponding estimates. For example, the ACS data suggest that in Census tract 49035100100, for the male population age 85 and up (variable ID B01001_025), there are anywhere between 0 and 37 people in that Census tract (an estimate of 17 with a margin of error of 20, with the lower bound truncated at 0). This can make ACS data for small geographies problematic for planning and analysis purposes.
A potential solution to large margins of error for small estimates in the ACS is to aggregate data upwards until a satisfactory margin of error to estimate ratio is reached. The US Census Bureau publishes formulas for appropriately calculating margins of error around such derived estimates, which are included in tidycensus with the following functions:
- moe_sum(): calculates a margin of error for a derived sum;
- moe_product(): calculates a margin of error for a derived product;
- moe_ratio(): calculates a margin of error for a derived ratio;
- moe_prop(): calculates a margin of error for a derived proportion.
In their most basic form, these functions can be used with constants. For example, let’s say we had an ACS estimate of 25 with a margin of error of 5 around that estimate. The appropriate denominator for this estimate is 100 with a margin of error of 3. To determine the margin of error around the derived proportion of 0.25, we can use moe_prop():
moe_prop(25, 100, 5, 3)
## [1] 0.0494343
Our margin of error around the derived estimate of 0.25 is approximately 0.049.
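Under the hood, moe_prop() follows the Census Bureau’s approximation formula for a derived proportion: the square root of the numerator’s squared MOE minus the squared proportion times the denominator’s squared MOE, divided by the denominator estimate. A quick sketch verifying the result by hand:

# p = 25 / 100 = 0.25
sqrt(5^2 - 0.25^2 * 3^2) / 100
## [1] 0.0494343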
3.5.2 Calculating group-wise margins of error
These margin of error functions in tidycensus can in turn be integrated into tidyverse-centric analytic pipelines to handle large margins of error around estimates. Given that the smaller age bands in the Salt Lake County dataset are characterized by too much uncertainty for our analysis, we decide in this scenario to aggregate our data upwards to represent populations aged 65 and older by sex.
In the code below, we use the case_when() function to create a new column, sex, that represents a mapping of the variables we pulled from the ACS to their sex categories. We then employ a familiar group_by() %>% summarize() method to aggregate our data by Census tract and sex. Notably, the call to summarize() includes a call to tidycensus’s moe_sum() function, which will generate a new column representing the margin of error around the derived sum.
salt_lake_grouped <- salt_lake %>%
mutate(sex = case_when(
str_sub(variable, start = -2) < "26" ~ "Male",
TRUE ~ "Female"
)) %>%
group_by(GEOID, sex) %>%
summarize(sum_est = sum(estimate),
sum_moe = moe_sum(moe, estimate))
GEOID | sex | sum_est | sum_moe |
---|---|---|---|
49035100100 | Female | 165 | 170.72493 |
49035100100 | Male | 64 | 34.43835 |
49035100200 | Female | 170 | 57.26255 |
49035100200 | Male | 128 | 47.51842 |
49035100306 | Female | 155 | 77.48548 |
49035100306 | Male | 136 | 85.53362 |
49035100307 | Female | 207 | 82.03658 |
49035100307 | Male | 110 | 58.89822 |
49035100308 | Female | 157 | 110.97297 |
49035100308 | Male | 91 | 59.97499 |
The margins of error relative to their estimates are now much more reasonable than in the disaggregated data.
That said, the Census Bureau issues a note of caution (American Community Survey Office 2020):
All derived MOE methods are approximations and users should be cautious in using them. This is because these methods do not consider the correlation or covariance between the basic estimates. They may be overestimates or underestimates of the derived estimate’s standard error depending on whether the two basic estimates are highly correlated in either the positive or negative direction. As a result, the approximated standard error may not match direct calculations of standard errors or calculations obtained through other methods.
This means that your “best bet” is to first search the ACS tables to see if your data are found in aggregated form elsewhere before doing the aggregation and MOE estimation yourself. In many cases, you’ll find aggregated information in the ACS combined tables, Data Profile, or Subject Tables that will include pre-computed margins of error for you.
3.6 Exercises
- The ACS Data Profile includes a number of pre-computed percentages, which can reduce your data wrangling time. The variable in the 2015-2019 ACS for “percent of the population age 25 and up with a bachelor’s degree” is DP02_0068P. For a state of your choosing, use this variable to determine:
  - The county with the highest percentage in the state;
  - The county with the lowest percentage in the state;
  - The median value for counties in your chosen state.