3 Wrangling Census data with tidyverse tools

One of the most popular frameworks for data analysis in R is the tidyverse, a suite of packages designed for integrated data wrangling, visualization, and modeling. The “tidy” or long-form data returned by default in tidycensus is designed to work well with tidyverse analytic workflows. This chapter provides an overview of how to use tidyverse tools to gain additional insights about US Census data retrieved with tidycensus. It concludes with discussion about margins of error (MOEs) in the American Community Survey and how to wrangle and interpret MOEs appropriately.

3.1 The tidyverse

The tidyverse is a collection of R packages that are designed to work together in common data wrangling, analysis, and visualization projects. Many of these R packages, maintained by RStudio, are among the most popular R packages worldwide. Some of the key packages you’ll use in the tidyverse include readr for data import, dplyr for data wrangling, tidyr for data reshaping, ggplot2 for visualization, purrr for iteration and functional programming, and stringr for string manipulation.

The core data structure used in the tidyverse is the tibble, which is an R data frame with some small enhancements to improve the user experience. tidycensus returns tibbles by default.

A full treatment of the tidyverse and its functionality is beyond the scope of this book; however, the examples in this chapter will introduce you to several key tidyverse features using US Census Bureau data. For a more general and broader treatment of the tidyverse, I recommend the R for Data Science book (Wickham and Grolemund 2017).

3.2 Exploring Census data with tidyverse tools

Census data queries using tidycensus, combined with core tidyverse functions, are excellent ways to explore downloaded Census data. Chapter 2 covered how to download data from various Census datasets using tidycensus and return the data in a desired format. A common next step in an analytic process will involve data exploration, which is handled by a wide range of tools in the tidyverse.

To get started, the tidycensus and tidyverse packages are loaded as follows. tidyverse is a meta-package: loading it attaches several core packages within the tidyverse. The package load message gives you more information:
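
library(tidycensus)
library(tidyverse)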

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Eight core tidyverse packages are loaded: ggplot2, tibble (Müller and Wickham 2021), purrr, dplyr, readr, and tidyr, along with stringr (Wickham 2019a) for string manipulation and forcats (Wickham 2021a) for working with factors. Collectively, these tools can handle many core Census data analysis tasks.

3.2.1 Sorting and filtering data

For a first example, let’s request data on median age from the 2016-2020 ACS with get_acs() for all counties in the United States. This requires specifying geography = "county" and leaving state set to NULL, the default.

median_age <- get_acs(
  geography = "county",
  variables = "B01002_001",
  year = 2020
)
Table 3.1: Median age for US counties
GEOID NAME variable estimate moe
01001 Autauga County, Alabama B01002_001 38.6 0.6
01003 Baldwin County, Alabama B01002_001 43.2 0.4
01005 Barbour County, Alabama B01002_001 40.1 0.6
01007 Bibb County, Alabama B01002_001 39.9 1.2
01009 Blount County, Alabama B01002_001 41.0 0.5
01011 Bullock County, Alabama B01002_001 39.7 1.9
01013 Butler County, Alabama B01002_001 41.2 0.6
01015 Calhoun County, Alabama B01002_001 39.5 0.4
01017 Chambers County, Alabama B01002_001 41.9 0.7
01019 Cherokee County, Alabama B01002_001 46.8 0.5

The default method for printing data used by the tibble package shows the first 10 rows of the dataset, which in this case prints counties in Alabama. A first exploratory data analysis question might involve understanding which counties are the youngest and oldest in the United States as measured by median age. This task can be accomplished with the arrange() function found in the dplyr package. arrange() sorts a dataset by values in one or more columns and returns the sorted result. To view the dataset in ascending order of a given column, supply the data object and a column name to the arrange() function.

arrange(median_age, estimate)
Table 3.2: The youngest counties in the US by median age
GEOID NAME variable estimate moe
35011 De Baca County, New Mexico B01002_001 22.2 6.9
51678 Lexington city, Virginia B01002_001 22.2 0.8
16065 Madison County, Idaho B01002_001 23.5 0.2
46121 Todd County, South Dakota B01002_001 23.6 0.6
51750 Radford city, Virginia B01002_001 23.7 0.6
13053 Chattahoochee County, Georgia B01002_001 24.0 0.7
02158 Kusilvak Census Area, Alaska B01002_001 24.1 0.2
49049 Utah County, Utah B01002_001 25.0 0.1
46027 Clay County, South Dakota B01002_001 25.2 0.5
53075 Whitman County, Washington B01002_001 25.2 0.2

Per the 2016-2020 ACS, the youngest county is De Baca County, New Mexico. Two of the five youngest “counties” in the United States are independent cities in Virginia, which are treated as county-equivalents. Both Lexington and Radford are college towns; Lexington is home to both Washington & Lee University and the Virginia Military Institute, and Radford houses Radford University.

To retrieve the oldest counties in the United States by median age, an analyst can use the desc() function available in dplyr to sort the estimate column in descending order.

arrange(median_age, desc(estimate))
Table 3.3: The oldest counties in the US by median age
GEOID NAME variable estimate moe
12119 Sumter County, Florida B01002_001 68.0 0.3
48301 Loving County, Texas B01002_001 62.2 37.8
48243 Jeff Davis County, Texas B01002_001 61.3 36.8
08027 Custer County, Colorado B01002_001 60.1 3.4
12015 Charlotte County, Florida B01002_001 59.5 0.2
51091 Highland County, Virginia B01002_001 59.5 4.8
35003 Catron County, New Mexico B01002_001 59.4 2.6
51133 Northumberland County, Virginia B01002_001 59.3 0.7
26131 Ontonagon County, Michigan B01002_001 59.1 0.5
48443 Terrell County, Texas B01002_001 59.1 9.4

The oldest county in the United States is Sumter County, Florida. Sumter County is home to The Villages, a Census-designated place that includes a large age-restricted community also called The Villages.

The tidyverse includes several tools for parsing datasets that allow for exploration beyond sorting and browsing. The filter() function in dplyr queries a dataset for rows where a given condition evaluates to TRUE, and retains only those rows. For analysts who are familiar with databases and SQL, this is equivalent to a WHERE clause. Filtering helps analysts subset their data for specific areas by their characteristics, and answer questions like “how many counties in the US have a median age of 50 or older?”

filter(median_age, estimate >= 50)
Table 3.4: Counties with a median age of 50 or above
GEOID NAME variable estimate moe
02105 Hoonah-Angoon Census Area, Alaska B01002_001 52.1 2.9
04007 Gila County, Arizona B01002_001 50.4 0.2
04012 La Paz County, Arizona B01002_001 57.4 0.6
04015 Mohave County, Arizona B01002_001 52.3 0.2
04025 Yavapai County, Arizona B01002_001 54.1 0.2
05005 Baxter County, Arkansas B01002_001 52.3 0.5
05089 Marion County, Arkansas B01002_001 52.1 0.8
05097 Montgomery County, Arkansas B01002_001 50.6 0.8
05137 Stone County, Arkansas B01002_001 50.0 0.5
06009 Calaveras County, California B01002_001 52.8 0.6

Functions like arrange() and filter() operate on row values and organize data by row. Other tidyverse functions, like tidyr’s separate(), operate on columns. The NAME column, returned by default by most tidycensus functions, contains a basic description of the location that can be more intuitive than the GEOID. For the 2016-2020 ACS, NAME is formatted as “X County, Y”, where X is the county name and Y is the state name. separate() can split this column into two columns where one retains the county name and the other retains the state; this can be useful for analysts who need to complete a comparative analysis by state.

separate(
  median_age,
  NAME,
  into = c("county", "state"),
  sep = ", "
)
Table 3.5: Separate columns for county and state
GEOID county state variable estimate moe
01001 Autauga County Alabama B01002_001 38.6 0.6
01003 Baldwin County Alabama B01002_001 43.2 0.4
01005 Barbour County Alabama B01002_001 40.1 0.6
01007 Bibb County Alabama B01002_001 39.9 1.2
01009 Blount County Alabama B01002_001 41.0 0.5
01011 Bullock County Alabama B01002_001 39.7 1.9
01013 Butler County Alabama B01002_001 41.2 0.6
01015 Calhoun County Alabama B01002_001 39.5 0.4
01017 Chambers County Alabama B01002_001 41.9 0.7
01019 Cherokee County Alabama B01002_001 46.8 0.5

You may have noticed above that existing variable names are unquoted when referenced in tidyverse functions. Many tidyverse functions use non-standard evaluation to refer to column names, which means that column names can be used as arguments directly without quotation marks. Non-standard evaluation makes interactive programming faster, especially for beginners; however, it can introduce some complications when writing your own functions or R packages. A full treatment of non-standard evaluation is beyond the scope of this book; Hadley Wickham’s Advanced R (Wickham 2019b) is the best resource on the topic if you’d like to learn more.
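
As an illustration, consider a minimal sketch of writing your own function around a tidyverse verb (the function name top_n_rows is hypothetical): to pass an unquoted column name through to dplyr, you “embrace” it with {{ }}, an operator from the rlang package that tidyverse functions understand.

top_n_rows <- function(data, column, n = 5) {
  # {{ }} forwards the unquoted column argument to arrange()
  data %>%
    arrange(desc({{ column }})) %>%
    head(n)
}

# The five counties with the oldest median age
top_n_rows(median_age, estimate)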

3.2.2 Using summary variables and calculating new columns

Data in Census and ACS tables are frequently composed of variables that individually represent sub-categories, such as the numbers of households in different household income bands. One limitation of raw estimated counts, however, is that they are difficult to compare across geographies. For example, Maricopa County in Arizona is the state’s most populous county with 4.3 million residents; the second-largest county, Pima, has just over 1 million residents, and six of the state’s 15 counties have fewer than 100,000 residents. In turn, comparing Maricopa’s estimates with those of smaller counties in the state would often be inappropriate.

A solution to this issue might involve normalizing the estimated count data by dividing it by the overall population from which the sub-group is derived. Appropriate denominators for ACS tables are frequently found in the tables themselves as variables. In ACS table B19001, which covers the number of households by income bands, the variable B19001_001 represents the total number of households in a given enumeration unit. Given that this variable is an appropriate denominator for the other variables in the table, it merits its own column to facilitate the calculation of proportions or percentages.

In tidycensus, this can be accomplished by supplying a variable ID to the summary_var parameter in both the get_acs() and get_decennial() functions. When using get_decennial(), doing so will create two new columns, summary_var and summary_value, representing the summary variable ID and the summary variable’s value. When using get_acs(), summary_var adds two new columns, summary_est and summary_moe, containing the ACS estimate and margin of error for the summary variable.

With this information in hand, normalizing data is straightforward. The following example uses the summary_var parameter to compare the population of counties in Arizona by race & Hispanic origin with their baseline populations, using data from the 2016-2020 ACS.

race_vars <- c(
  White = "B03002_003",
  Black = "B03002_004",
  Native = "B03002_005",
  Asian = "B03002_006",
  HIPI = "B03002_007",
  Hispanic = "B03002_012"
)

az_race <- get_acs(
  geography = "county",
  state = "AZ",
  variables = race_vars,
  summary_var = "B03002_001",
  year = 2020
) 
Table 3.6: Race and ethnicity in Arizona
GEOID NAME variable estimate moe summary_est summary_moe
04001 Apache County, Arizona White 12993 56 71714 NA
04001 Apache County, Arizona Black 544 56 71714 NA
04001 Apache County, Arizona Native 51979 327 71714 NA
04001 Apache County, Arizona Asian 262 76 71714 NA
04001 Apache County, Arizona HIPI 49 14 71714 NA
04001 Apache County, Arizona Hispanic 4751 NA 71714 NA
04003 Cochise County, Arizona White 69095 350 126442 NA
04003 Cochise County, Arizona Black 4512 378 126442 NA
04003 Cochise County, Arizona Native 1058 176 126442 NA
04003 Cochise County, Arizona Asian 2371 241 126442 NA

By using dplyr’s mutate() function, we calculate a new column, percent, representing the percentage of each county’s population that corresponds to each racial/ethnic group in 2016-2020. The select() function, also in dplyr, retains only those columns that we need to view.

az_race_percent <- az_race %>%
  mutate(percent = 100 * (estimate / summary_est)) %>%
  select(NAME, variable, percent)
Table 3.7: Race and ethnicity in Arizona as percentages
NAME variable percent
Apache County, Arizona White 18.1178013
Apache County, Arizona Black 0.7585688
Apache County, Arizona Native 72.4809661
Apache County, Arizona Asian 0.3653401
Apache County, Arizona HIPI 0.0683270
Apache County, Arizona Hispanic 6.6249268
Cochise County, Arizona White 54.6456083
Cochise County, Arizona Black 3.5684345
Cochise County, Arizona Native 0.8367473
Cochise County, Arizona Asian 1.8751681

The above example introduces some additional syntax common to tidyverse data analyses. The %>% operator from the magrittr R package (Bache and Wickham 2020) is a pipe operator that allows analysts to develop analytic pipelines, which are deeply embedded in tidyverse-centric data analytic workflows. The pipe operator passes the result of a given line of code as the first argument of the code on the next line. In turn, analysts can develop pipelines of related operations that fit together in a coherent way.

tidyverse developers recommend that the pipe operator be read as “then”. The above code can in turn be interpreted as: “Create a new data object az_race_percent by using the existing data object az_race, THEN create a new percent column, THEN select the NAME, variable, and percent columns.”

Since R version 4.1, the base installation of R also includes a pipe operator, |>. It works much the same way as the magrittr pipe %>%, though %>% has some small additional features that make it work well within tidyverse analysis pipelines. In turn, %>% will be used in the examples throughout this book.
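
For illustration, the pipeline above could be rewritten with the base pipe and would produce the same result:

az_race_percent <- az_race |>
  mutate(percent = 100 * (estimate / summary_est)) |>
  select(NAME, variable, percent)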

3.3 Group-wise Census data analysis

The split-apply-combine model of data analysis, as discussed in Wickham (2011), is a powerful framework for analyzing demographic data. In general terms, an analyst will apply this framework as follows:

  • The analyst identifies salient groups in a dataset between which they want to make comparisons. The dataset is then split into multiple pieces, one for each group.

  • A function is then applied to each group in turn. This might be a simple summary function, such as taking the maximum or calculating the mean, or a custom function defined by the analyst.

  • Finally, the results of the function applied to each group are combined back into a single dataset, allowing the analyst to compare the results by group.

Given the hierarchical nature of US Census Bureau data, “groups” across which analysts can make comparisons are found in just about every analytic task. In many cases, the split-apply-combine model of data analysis will be useful to analysts as they make sense of patterns and trends found in Census data.

In the tidyverse, split-apply-combine is implemented with the group_by() function in the dplyr package. group_by() does the work for the analyst of splitting a dataset into groups, allowing subsequent functions used by the analyst in an analytic pipeline to be applied to each group then combined back into a single dataset. The examples that follow illustrate some common group-wise analyses.

3.3.1 Making group-wise comparisons

The az_race_percent dataset created above is an example of a dataset suitable for group-wise data analysis. It includes two columns that could be used as group definitions: NAME, representing the county, and variable, representing the racial or ethnic group. Split-apply-combine could be used for either group definition to make comparisons for data in Arizona across these categories.

In a first example, we can deploy group-wise data analysis to identify the largest racial or ethnic group in each county in Arizona. This involves setting up a data analysis pipeline with the magrittr pipe and a grouped filter, in which the filter() operation is applied separately to each group. In this example, the filter condition is specified as percent == max(percent). We can read the analytic pipeline as: “Create a new dataset, largest_group, by using the az_race_percent dataset THEN grouping the dataset by the NAME column THEN filtering for rows that are equal to the maximum value of percent for each group.”

largest_group <- az_race_percent %>%
  group_by(NAME) %>%
  filter(percent == max(percent))
Table 3.8: Largest group by county in Arizona
NAME variable percent
Apache County, Arizona Native 72.48097
Cochise County, Arizona White 54.64561
Coconino County, Arizona White 53.80798
Gila County, Arizona White 61.85232
Graham County, Arizona White 50.88764
Greenlee County, Arizona Hispanic 47.26889
La Paz County, Arizona White 57.18089
Maricopa County, Arizona White 54.55515
Mohave County, Arizona White 76.69694
Navajo County, Arizona Native 42.71386

The result of the grouped filter allows us to review the most common racial or ethnic group in each Arizona county along with how their percentages vary. For example, in two Arizona counties (Greenlee and Navajo), none of the racial or ethnic groups forms a majority of the population.

group_by() is commonly paired with the summarize() function in data analysis pipelines. summarize() generates a new, condensed dataset that by default returns a column for the grouping variable(s) and columns representing the results of one or more functions applied to those groups. In the example below, the median() function is used to identify the median percentage for each of the racial & ethnic groups in the dataset across counties in Arizona. In turn, variable is passed to group_by() as the grouping variable.

az_race_percent %>%
  group_by(variable) %>%
  summarize(median_pct = median(percent))
Table 3.9: Median percentage by group
variable median_pct
Asian 0.9918415
Black 1.2857283
HIPI 0.1070384
Hispanic 30.4721836
Native 3.6344427
White 53.8079773

The result of this operation tells us the median county percentage of each racial and ethnic group for the state of Arizona. A broader analysis might involve the calculation of these percentages hierarchically, finding the median county percentage of given attributes across states, for example.
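
As a sketch of that broader idea, assuming a hypothetical us_race_percent dataset built the same way as az_race_percent but for all states, the hierarchical calculation might look like this:

us_race_percent %>%
  separate(NAME, into = c("county", "state"), sep = ", ") %>%
  group_by(state, variable) %>%
  summarize(median_pct = median(percent))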

3.3.2 Tabulating new groups

In the examples above, suitable groups in the NAME and variable columns were already found in the data retrieved with get_acs(). Commonly, analysts will also need to calculate new custom groups to address specific analytic questions. For example, variables in ACS table B19001 represent groups of households whose household incomes fall into a variety of categories: less than $10,000/year, between $10,000/year and $19,999/year, and so forth. These categories may be more granular than needed by an analyst. As such, an analyst might take the following steps: 1) recode the ACS variables into wider income bands; 2) group the data by the wider income bands; 3) calculate grouped sums to generate new estimates.

Consider the following example, using household income data for Minnesota counties from the 2012-2016 ACS:

mn_hh_income <- get_acs(
  geography = "county",
  table = "B19001",
  state = "MN",
  year = 2016
)
Table 3.10: Table B19001 for counties in Minnesota, 2012-2016 ACS
GEOID NAME variable estimate moe
27001 Aitkin County, Minnesota B19001_001 7640 262
27001 Aitkin County, Minnesota B19001_002 562 77
27001 Aitkin County, Minnesota B19001_003 544 72
27001 Aitkin County, Minnesota B19001_004 472 69
27001 Aitkin County, Minnesota B19001_005 508 68
27001 Aitkin County, Minnesota B19001_006 522 92
27001 Aitkin County, Minnesota B19001_007 447 61
27001 Aitkin County, Minnesota B19001_008 390 49
27001 Aitkin County, Minnesota B19001_009 426 64
27001 Aitkin County, Minnesota B19001_010 415 65

Our data include household income categories for each county in the rows. However, let’s say we only need three income categories for purposes of analysis: below $35,000/year, between $35,000/year and $75,000/year, and $75,000/year and up.

We first need to do some transformation of our data to recode the variables appropriately. First, we will remove variable B19001_001, which represents the total number of households for each county. Second, we use the case_when() function from the dplyr package to identify groups of variables that correspond to our desired groupings. Given that the variable IDs are ordered in the ACS table in relation to the household income values, the less-than operator can be used on the variable IDs to identify groups.

The syntax of case_when() can appear complex to beginners, so it is worth stepping through how the function works. Inside the mutate() function, which is used to create a new variable named incgroup, case_when() steps through a series of logical conditions that are evaluated in order, similar to a series of if/else statements. The first condition tells the function to assign the value "below35k" to all rows whose variable value sorts before "B19001_008" - in this case, B19001_002 (income less than $10,000) through B19001_007 (income between $30,000 and $34,999). The second condition is then evaluated only for rows not captured by the first condition; this is why case_when() knows not to assign "bw35kand75k" to the below-$10,000 group even though its variable ID also sorts before "B19001_013". The final condition can be set to TRUE, which in this scenario translates as “all other values.”

mn_hh_income_recode <- mn_hh_income %>%
  filter(variable != "B19001_001") %>%
  mutate(incgroup = case_when(
    variable < "B19001_008" ~ "below35k", 
    variable < "B19001_013" ~ "bw35kand75k", 
    TRUE ~ "above75k"
  )) 
Table 3.11: Recoded household income categories
GEOID NAME variable estimate moe incgroup
27001 Aitkin County, Minnesota B19001_002 562 77 below35k
27001 Aitkin County, Minnesota B19001_003 544 72 below35k
27001 Aitkin County, Minnesota B19001_004 472 69 below35k
27001 Aitkin County, Minnesota B19001_005 508 68 below35k
27001 Aitkin County, Minnesota B19001_006 522 92 below35k
27001 Aitkin County, Minnesota B19001_007 447 61 below35k
27001 Aitkin County, Minnesota B19001_008 390 49 bw35kand75k
27001 Aitkin County, Minnesota B19001_009 426 64 bw35kand75k
27001 Aitkin County, Minnesota B19001_010 415 65 bw35kand75k
27001 Aitkin County, Minnesota B19001_011 706 81 bw35kand75k

Our result illustrates how the different variable IDs are mapped to the new, recoded categories that we specified in case_when(). The group_by() %>% summarize() workflow can now be applied to the recoded categories by county to tabulate the data into a smaller number of groups.

mn_group_sums <- mn_hh_income_recode %>%
  group_by(GEOID, incgroup) %>%
  summarize(estimate = sum(estimate))
Table 3.12: Grouped sums by income bands
GEOID incgroup estimate
27001 above75k 1706
27001 below35k 3055
27001 bw35kand75k 2879
27003 above75k 61403
27003 below35k 24546
27003 bw35kand75k 39311
27005 above75k 4390
27005 below35k 4528
27005 bw35kand75k 4577
27007 above75k 4491

Our data now reflect the new estimates by group for each county.

3.4 Comparing ACS estimates over time

A common task when working with Census data is to examine demographic change over time. Data from the Census API - and consequently tidycensus - only go back to the 2000 Decennial Census. For analysts who want to go even further back, decennial Census data are available from 1790 onward through the National Historical Geographic Information System, or NHGIS, which will be covered in detail in Chapter 11.

3.4.1 Time-series analysis: some cautions

Before engaging in any sort of time series analysis of Census data, analysts need to account for potential problems that can emerge when using Census data longitudinally. One major issue that can emerge is geography changes over time. For example, let’s say we are interested in analyzing data on Oglala Lakota County, South Dakota. We can get recent data from the ACS using tools learned in Chapter 2:

oglala_lakota_age <- get_acs(
  geography = "county",
  state = "SD",
  county = "Oglala Lakota",
  table = "B01001",
  year = 2020
)
Table 3.13: 2016-2020 age table for Oglala Lakota County, SD
GEOID NAME variable estimate moe
46102 Oglala Lakota County, South Dakota B01001_001 14277 NA
46102 Oglala Lakota County, South Dakota B01001_002 6930 132
46102 Oglala Lakota County, South Dakota B01001_003 761 66
46102 Oglala Lakota County, South Dakota B01001_004 794 128
46102 Oglala Lakota County, South Dakota B01001_005 707 123
46102 Oglala Lakota County, South Dakota B01001_006 394 20
46102 Oglala Lakota County, South Dakota B01001_007 227 15
46102 Oglala Lakota County, South Dakota B01001_008 85 53
46102 Oglala Lakota County, South Dakota B01001_009 165 70
46102 Oglala Lakota County, South Dakota B01001_010 356 101

To understand how the age composition of the county has changed over the past 10 years, we may want to look at the 2006-2010 ACS for the county. Normally, we would just change the year argument to 2010:

oglala_lakota_age_10 <- get_acs(
  geography = "county",
  state = "SD",
  county = "Oglala Lakota",
  table = "B01001",
  year = 2010
)
## Error in `map()`:
## ℹ In index: 1.
## ℹ With name: 1.
## Caused by error:
## ! Your API call has errors.  The API message returned is .

The request errors, and we don’t get an informative error message back from the API as was discussed in Section 2.6. The problem here is that Oglala Lakota County had a different name in 2010, Shannon County, meaning that the county = "Oglala Lakota" argument will not return any data. In turn, the equivalent code for the 2006-2010 ACS would use county = "Shannon".

oglala_lakota_age_10 <- get_acs(
  geography = "county",
  state = "SD",
  county = "Shannon",
  table = "B01001",
  year = 2010
)
Table 3.14: 2006-2010 age table for Oglala Lakota County, SD (then named Shannon County)
GEOID NAME variable estimate moe
46113 Shannon County, South Dakota B01001_001 13437 NA
46113 Shannon County, South Dakota B01001_002 6553 47
46113 Shannon County, South Dakota B01001_003 770 99
46113 Shannon County, South Dakota B01001_004 565 151
46113 Shannon County, South Dakota B01001_005 833 151
46113 Shannon County, South Dakota B01001_006 541 47
46113 Shannon County, South Dakota B01001_007 275 99
46113 Shannon County, South Dakota B01001_008 164 89
46113 Shannon County, South Dakota B01001_009 143 54
46113 Shannon County, South Dakota B01001_010 342 83

Note the differences in the GEOID column between the two tables of data. When a county or geographic entity changes its name, the Census Bureau assigns it a new GEOID, meaning that analysts need to take care when dealing with those changes. A full listing of geography changes is available on the Census website for each year.

In addition to changes in geographic identifiers, variable IDs can change over time as well. For example, the ACS Data Profile is commonly used for pre-computed normalized ACS estimates. Let’s say that we are interested in analyzing the percentage of residents age 25 and up with a 4-year college degree for counties in Colorado from the 2019 1-year ACS. We’d first look up the appropriate variable ID with load_variables(2019, "acs1/profile") then use get_acs():

co_college19 <- get_acs(
  geography = "county",
  variables = "DP02_0068P",
  state = "CO",
  survey = "acs1",
  year = 2019
)
Table 3.15: ACS Data Profile data in 2019
GEOID NAME variable estimate moe
08001 Adams County, Colorado DP02_0068P 25.4 1.3
08005 Arapahoe County, Colorado DP02_0068P 43.8 1.3
08013 Boulder County, Colorado DP02_0068P 64.8 1.8
08014 Broomfield County, Colorado DP02_0068P 56.9 3.3
08031 Denver County, Colorado DP02_0068P 53.1 1.1

We get back data for counties of population 65,000 and greater as these are the geographies available in the 1-year ACS. The data make sense: Boulder County, home to the University of Colorado, has a very high percentage of its population with a 4-year degree or higher. However, when we run the exact same query for the 2018 1-year ACS:

co_college18 <- get_acs(
  geography = "county",
  variables = "DP02_0068P",
  state = "CO",
  survey = "acs1",
  year = 2018
)
Table 3.16: ACS Data Profile data in 2018
GEOID NAME variable estimate moe
08001 Adams County, Colorado DP02_0068P 375798 NA
08005 Arapahoe County, Colorado DP02_0068P 497198 NA
08013 Boulder County, Colorado DP02_0068P 263938 NA
08014 Broomfield County, Colorado DP02_0068P 53400 NA
08031 Denver County, Colorado DP02_0068P 575870 NA

The values are completely different, and clearly not percentages! This is because variable IDs for the Data Profile are unique to each year and in turn should not be used for time-series analysis. The returned results above represent the civilian population age 18 and up, and have nothing to do with educational attainment.

3.4.2 Preparing time-series ACS estimates

The safest option for time-series analysis in the ACS is to use the Comparison Profile Tables. These tables are available for both the 1-year and 5-year ACS, and allow for comparison of demographic indicators over the past five years for a given year. Using the Comparison Profile tables also brings the benefit of additional variable harmonization, such as inflation-adjusted income estimates.

Data from the Comparison Profile are accessed just like other ACS variables using get_acs(). The example below illustrates how to get data from the ACS Comparison Profile on inflation-adjusted median household income for counties and county-equivalents in Alaska.

ak_income_compare <- get_acs(
  geography = "county",
  variables = c(
    income15 = "CP03_2015_062",
    income20 = "CP03_2020_062"
  ),
  state = "AK",
  year = 2020
)

For the 2016-2020 ACS, the “comparison year” is 2015, representing the closest non-overlapping 5-year dataset, which in this case is 2011-2015. We can examine the results, which are inflation-adjusted for appropriate comparison:

Table 3.17: Comparative income data from the ACS CP tables
GEOID NAME variable estimate
02016 Aleutians West Census Area, Alaska income15 92500
02016 Aleutians West Census Area, Alaska income20 87443
02020 Anchorage Municipality, Alaska income15 85534
02020 Anchorage Municipality, Alaska income20 84813
02050 Bethel Census Area, Alaska income15 55692
02050 Bethel Census Area, Alaska income20 54400
02090 Fairbanks North Star Borough, Alaska income15 77590
02090 Fairbanks North Star Borough, Alaska income20 76464

3.4.2.1 Iterating over ACS years with tidyverse tools

Using the Detailed Tables also represents a safer option than the Data Profile, as variable IDs in the Detailed Tables remain consistent across years, allowing for consistent and correct analysis. That said, there are still some potential pitfalls to account for when using the Detailed Tables. The Census Bureau will add and remove variables from survey to survey depending on data needs and data availability. For example, questions are sometimes added to and removed from the ACS, meaning that you won’t always be able to get every data point for every year and geography combination. In turn, it is still important to check on data availability using load_variables() for the years you plan to analyze before carrying out your time-series analysis; a quick check of this kind is sketched below.

Let’s re-engineer the analysis above on educational attainment in Colorado counties, which below will be computed for a time series from 2010 to 2019. Information on “bachelor’s degree or higher” is split by sex and across different tiers of educational attainment in the Detailed Tables, found in ACS table B15002. Given that we only need a few variables (representing estimates of populations age 25+ who have finished a 4-year degree or graduate degrees, by sex), we’ll request those variables directly rather than the entire B15002 table.

college_vars <- c("B15002_015",
                  "B15002_016",
                  "B15002_017",
                  "B15002_018",
                  "B15002_032",
                  "B15002_033",
                  "B15002_034",
                  "B15002_035")

We’ll now use these variables to request data on college degree holders from the ACS for counties in Colorado for each of the 1-year ACS surveys from 2010 to 2019. In most cases, this process should be streamlined with iteration. Thus far, we are familiar with using the year argument in get_acs() to request data for a specific year. Writing out ten different calls to get_acs(), however - one for each year - would be tedious and would require a fair amount of repetitive code! Iteration helps us avoid repetitive coding as it allows us to carry out the same process over a sequence of values. Programmers familiar with iteration will likely know of “loop” operators like for and while, which are available in base R and most other programming languages in some variety. Base R also includes the *apply() family of functions (e.g. lapply(), mapply(), sapply()), which iterates over a sequence of values and applies a given function to each value.

The tidyverse approach to iteration is found in the purrr package. purrr includes a variety of functions that are designed to integrate well in workflows that require iteration and use other tidyverse tools. The map_*() family of functions iterate over values and return a desired type of result; map() returns a list, map_int() returns an integer vector, and map_chr() returns a character vector, for example. With tidycensus, the map_dfr() function is particularly useful: it iterates over an input, applies a user-defined function or process to each element, then row-binds the results into a single data frame. The example below illustrates how this works for the years 2010 through 2019.

years <- 2010:2019
names(years) <- years

college_by_year <- map_dfr(years, ~{
  get_acs(
    geography = "county",
    variables = college_vars,
    state = "CO",
    summary_var = "B15002_001",
    survey = "acs1",
    year = .x
  )
}, .id = "year")

For users newer to R, iteration and purrr syntax can feel complex, so it is worth stepping through how the above code sample works.

  • First, a numeric vector of years is defined with the syntax 2010:2019. This will create a vector of years at 1-year intervals. These values are set as the names of the vector as well, as map_dfr() has additional functionality for working with named objects.

  • map_dfr() then takes three arguments above.

    • The first argument is the object that map_dfr() will iterate over, which in this case is our years vector. This means that the process we set up will be run once for each element of years.
    • The second argument is a formula we specify with the tilde (~) operator and curly braces ({...}). The code inside the curly braces will be run once for each element of years. The local variable .x, used inside the formula, takes on each value of years sequentially. In turn, we are running the equivalent of get_acs() with year = 2010, year = 2011, and so forth. Once get_acs() is run for each year, the result is combined into a single output data frame.
    • The .id argument, which is optional but used here, creates a new column in the output data frame that contains values equivalent to the names of the input object, which in this case is years. By setting .id = "year", we tell map_dfr() to name the new column that will contain these values year.

Let’s review the result:

college_by_year %>% 
  arrange(NAME, variable, year)
Table 3.18: Educational attainment over time
year GEOID NAME variable estimate moe summary_est summary_moe
2010 08001 Adams County, Colorado B15002_015 20501 1983 275849 790
2011 08001 Adams County, Colorado B15002_015 21233 2124 281231 865
2012 08001 Adams County, Colorado B15002_015 19238 2020 287924 693
2013 08001 Adams County, Colorado B15002_015 23818 2445 295122 673
2014 08001 Adams County, Colorado B15002_015 20255 1928 304394 541
2015 08001 Adams County, Colorado B15002_015 22962 2018 312281 705
2016 08001 Adams County, Colorado B15002_015 25744 2149 318077 525
2017 08001 Adams County, Colorado B15002_015 26159 2320 324185 562
2018 08001 Adams County, Colorado B15002_015 28113 2078 331247 955
2019 08001 Adams County, Colorado B15002_015 27552 2070 336931 705

The result is a long-form dataset that contains a time series of each requested ACS variable for each county in Colorado that is available in the 1-year ACS. The code below outlines a group_by() %>% summarize() workflow for calculating the percentage of the population age 25 and up with a 4-year college degree, then uses the pivot_wider() function from the tidyr package to spread the years across the columns for tabular data display.

percent_college_by_year <- college_by_year %>%
  group_by(NAME, year) %>%
  summarize(numerator = sum(estimate),
            denominator = first(summary_est)) %>%
  mutate(pct_college = 100 * (numerator / denominator)) %>%
  pivot_wider(id_cols = NAME,
              names_from = year,
              values_from = pct_college)
Table 3.19: Percent college by year
NAME 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
Adams County, Colorado 20.57394 20.51801 20.64538 23.09384 22.16929 22.79742 22.95293 22.87552 25.74152 25.39956
Arapahoe County, Colorado 37.03001 38.24506 39.28435 39.42478 40.94194 41.03578 41.48359 43.69387 42.71285 43.78332
Boulder County, Colorado 57.50285 59.05601 57.88284 58.53214 58.04066 60.57147 60.63005 63.18150 62.51394 64.80486
Broomfield County, Colorado NA NA NA NA NA 56.07776 51.94338 55.13359 56.27740 56.87181
Denver County, Colorado 40.87971 42.97122 44.65358 44.35340 44.25600 47.10820 47.39683 49.32692 51.34580 53.10088
Douglas County, Colorado 54.96800 53.27936 55.09223 57.66999 56.48866 56.06928 59.42687 58.53342 58.37539 58.13747
El Paso County, Colorado 34.11467 35.69184 34.91315 35.47612 36.49302 36.43089 38.67864 39.19931 38.83872 39.04942
Jefferson County, Colorado 40.83113 39.54961 41.43825 41.04234 41.99768 43.20923 43.51953 45.57140 45.83786 47.61074
Larimer County, Colorado 45.80197 42.83543 44.71423 43.33800 42.67180 46.16705 46.78871 47.90465 47.62330 49.01654
Mesa County, Colorado 24.99285 25.82724 23.01511 27.63325 25.14875 30.27630 25.02980 25.82455 29.95739 29.78537

This particular format is suitable for data display or writing to an Excel spreadsheet for colleagues who are not R-based. Methods for visualization of time-series estimates from the ACS will be covered in Section 4.4.

3.5 Handling margins of error in the American Community Survey with tidycensus

A topic of critical importance when working with data from the American Community Survey is the margin of error. As opposed to the decennial US Census, which is based on a complete enumeration of the US population, the ACS is based on a sample with estimates characterized by margins of error. By default, MOEs are returned at a 90 percent confidence level. This can be translated roughly as “we are 90 percent sure that the true value falls within a range defined by the estimate plus or minus the margin of error.”
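
A 90 percent MOE can also be converted to a standard error by dividing it by 1.645, the critical value associated with that confidence level. A quick sketch with made-up numbers:

est <- 100
moe <- 10

# Standard error implied by a 90 percent MOE
moe / 1.645

# The range described above: estimate plus or minus the MOE
c(lower = est - moe, upper = est + moe)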

As discussed in Chapter 2, tidycensus takes an opinionated approach to margins of error. When applicable, tidycensus will always return the margin of error associated with an estimate, and does not have an option available to return estimates only. For “tidy” or long-form data, these margins of error will be found in the moe column; for wide-form data, margins of error will be found in columns with an M suffix.

The confidence level of the MOE can be controlled with the moe_level argument in get_acs(). The default moe_level is 90, which is what the Census Bureau returns by default. tidycensus can also return MOEs at a confidence level of 95 or 99, using Census Bureau-recommended formulas to adjust the MOE. For example, we might look at data on median household income by county in Rhode Island using the default moe_level of 90:

get_acs(
  geography = "county",
  state = "Rhode Island",
  variables = "B19013_001",
  year = 2020
)
Table 3.20: Default MOE at 90 percent confidence level
GEOID NAME variable estimate moe
44001 Bristol County, Rhode Island B19013_001 85413 6122
44003 Kent County, Rhode Island B19013_001 75857 2022
44005 Newport County, Rhode Island B19013_001 84282 2629
44007 Providence County, Rhode Island B19013_001 62323 1270
44009 Washington County, Rhode Island B19013_001 86970 3651

A stricter (higher) confidence level will increase the size of the MOE relative to its estimate.

get_acs(
  geography = "county",
  state = "Rhode Island",
  variables = "B19013_001",
  year = 2020,
  moe_level = 99
)
Table 3.21: MOE at 99 percent confidence level
GEOID NAME variable estimate moe
44001 Bristol County, Rhode Island B19013_001 85413 9586.791
44003 Kent County, Rhode Island B19013_001 75857 3166.366
44005 Newport County, Rhode Island B19013_001 84282 4116.902
44007 Providence County, Rhode Island B19013_001 62323 1988.766
44009 Washington County, Rhode Island B19013_001 86970 5717.311
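
This adjustment rescales the default 90 percent MOE by the ratio of the normal critical values for the two confidence levels, which for 99 percent is 2.576 / 1.645. We can verify this against the Bristol County row above:

6122 * (2.576 / 1.645)
## [1] 9586.791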

3.5.1 Calculating derived margins of error in tidycensus

For small geographies or small populations, margins of error can get quite large, in some cases exceeding their corresponding estimates. In the example below, we can examine data on age groups by sex for the population age 65 and older for Census tracts in Salt Lake County, Utah. We will first generate a vector of variable IDs for which we want to request data from the ACS using some base R functionality.

In this workflow, an analyst has used load_variables() to look up the variables that represent estimates for populations age 65 and up; this includes B01001_020 through B01001_025 for males, and B01001_044 through B01001_049 for females. Typing out each variable individually would be tedious, so an analyst can use string concatenation to generate the required vector of variable IDs as follows:

vars <- paste0("B01001_0", c(20:25, 44:49))

vars
##  [1] "B01001_020" "B01001_021" "B01001_022" "B01001_023" "B01001_024"
##  [6] "B01001_025" "B01001_044" "B01001_045" "B01001_046" "B01001_047"
## [11] "B01001_048" "B01001_049"

When R evaluates nested expressions like this, it works from the inside out, starting with the inner-most expressions. The steps taken to assemble the correct vector of variable IDs are as follows:

  • First, the expressions 20:25 and 44:49 are evaluated. The colon operator : in base R, when used between two numbers, will generate a vector of numbers from the first to the second at intervals of 1. This creates vectors of the integers 20 through 25 and 44 through 49, which will serve as the suffixes for the variable IDs.

  • Second, the c() function is used to combine the two vectors of integers into a single vector.

  • Third, the paste0() function concatenates the string prefix "B01001_0" with each of the integers in the vector created with c() and returns a vector of variable IDs. paste0() is a convenient shortcut for the more flexible paste() function; it concatenates strings with no separator between them.
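
For example, the equivalence can be seen directly, as paste() with sep = "" behaves like paste0():

paste("B01001_0", c(20:25, 44:49), sep = "")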

The resulting variables object, named vars, can now be used to request variables in a call to get_acs().

salt_lake <- get_acs(
  geography = "tract",
  variables = vars,
  state = "Utah",
  county = "Salt Lake",
  year = 2020
)

We will now want to examine the margins of error around the estimates in the returned data. Let’s focus on a specific Census tract in Salt Lake County using filter():

example_tract <- salt_lake %>%
  filter(GEOID == "49035100100")

example_tract %>% 
  select(-NAME)
Table 3.22: Example Census tract in Salt Lake City
GEOID variable estimate moe
49035100100 B01001_020 11 13
49035100100 B01001_021 25 18
49035100100 B01001_022 7 10
49035100100 B01001_023 4 7
49035100100 B01001_024 0 12
49035100100 B01001_025 17 20
49035100100 B01001_044 0 12
49035100100 B01001_045 4 7
49035100100 B01001_046 21 17
49035100100 B01001_047 123 168

In many cases, the margins of error exceed their corresponding estimates. For example, the ACS data suggest that in Census tract 49035100100, the male population age 85 and up (variable ID B01001_025) is anywhere between 0 and 37 people, given an estimate of 17 and a margin of error of 20 (the lower bound is clipped at zero, as a negative population is impossible). This can make ACS data for small geographies problematic for planning and analysis purposes.

A potential solution to large margins of error for small estimates in the ACS is to aggregate data upwards until a satisfactory margin of error to estimate ratio is reached. The US Census Bureau publishes formulas for appropriately calculating margins of error around such derived estimates, which are included in tidycensus with the following functions:

  • moe_sum(): calculates a margin of error for a derived sum;
  • moe_product(): calculates a margin of error for a derived product;
  • moe_ratio(): calculates a margin of error for a derived ratio;
  • moe_prop(): calculates a margin of error for a derived proportion.

In their most basic form, these functions can be used with constants. For example, let’s say we had an ACS estimate of 25 with a margin of error of 5 around that estimate. The appropriate denominator for this estimate is 100 with a margin of error of 3. To determine the margin of error around the derived proportion of 0.25, we can use moe_prop():

moe_prop(25, 100, 5, 3)
## [1] 0.0494343

Our margin of error around the derived estimate of 0.25 is approximately 0.049.
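
Under the hood, moe_prop() implements the Census Bureau’s approximation for a derived proportion: the numerator and denominator MOEs are combined as sqrt(moe_num^2 - prop^2 * moe_denom^2), then divided by the denominator. Reproducing the result by hand:

prop <- 25 / 100

# Census Bureau formula for the MOE of a derived proportion
sqrt(5^2 - prop^2 * 3^2) / 100
## [1] 0.0494343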

3.5.2 Calculating group-wise margins of error

These margin of error functions in tidycensus can in turn be integrated into tidyverse-centric analytic pipelines to handle large margins of error around estimates. Given that the smaller age bands in the Salt Lake County dataset are characterized by too much uncertainty for our analysis, we decide in this scenario to aggregate our data upwards to represent populations aged 65 and older by sex.

In the code below, we use the case_when() function to create a new column, sex, that represents a mapping of the variables we pulled from the ACS to their sex categories. We then employ a familiar group_by() %>% summarize() method to aggregate our data by Census tract and sex. Notably, the call to summarize() includes a call to tidycensus’s moe_sum() function, which will generate a new column that represents the margin of error around the derived sum.

salt_lake_grouped <- salt_lake %>%
  mutate(sex = case_when(
    str_sub(variable, start = -2) < "26" ~ "Male",
    TRUE ~ "Female"
  )) %>%
  group_by(GEOID, sex) %>%
  summarize(sum_est = sum(estimate), 
            sum_moe = moe_sum(moe, estimate))
Table 3.23: Grouped margins of error
GEOID sex sum_est sum_moe
49035100100 Female 165 170.72493
49035100100 Male 64 34.43835
49035100200 Female 170 57.26255
49035100200 Male 128 47.51842
49035100306 Female 155 77.48548
49035100306 Male 136 85.53362
49035100307 Female 207 82.03658
49035100307 Male 110 58.89822
49035100308 Female 157 110.97297
49035100308 Male 91 59.97499

The margins of error relative to their estimates are now much more reasonable than in the disaggregated data.

That said, the Census Bureau issues a note of caution (American Community Survey Office 2020):

All derived MOE methods are approximations and users should be cautious in using them. This is because these methods do not consider the correlation or covariance between the basic estimates. They may be overestimates or underestimates of the derived estimate’s standard error depending on whether the two basic estimates are highly correlated in either the positive or negative direction. As a result, the approximated standard error may not match direct calculations of standard errors or calculations obtained through other methods.

This means that your “best bet” is to first search the ACS tables to see if your data are found in aggregated form elsewhere before doing the aggregation and MOE estimation yourself. In many cases, you’ll find aggregated information in the ACS combined tables, Data Profile, or Subject Tables that will include pre-computed margins of error for you.
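
For example, a sketch of how you might search the Subject Tables variable list for a pre-aggregated alternative before rolling your own:

# Browse 2016-2020 ACS Subject Tables variables that reference
# the population age 65 and over
v20_subject <- load_variables(2020, "acs5/subject")
filter(v20_subject, str_detect(label, "65 years and over"))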

3.6 Exercises

  • The ACS Data Profile includes a number of pre-computed percentages which can reduce your data wrangling time. The variable in the 2015-2019 ACS for “percent of the population age 25 and up with a bachelor’s degree” is DP02_0068P. For a state of your choosing, use this variable to determine:

    • The county with the highest percentage in the state;

    • The county with the lowest percentage in the state;

    • The median value for counties in your chosen state.