3 Wrangling Census data with tidyverse tools

One of the most popular frameworks for data analysis in R is the tidyverse, a suite of packages designed for integrated data wrangling, visualization, and modeling. The “tidy” or long-form data returned by default in tidycensus is designed to work well with tidyverse analytic workflows. This chapter provides an overview of how to use tidyverse tools to gain additional insights about US Census data retrieved with tidycensus. It concludes with discussion about margins of error (MOEs) in the American Community Survey and how to wrangle and interpret MOEs appropriately.

3.1 The tidyverse

The tidyverse is a collection of R packages that are designed to work together in common data wrangling, analysis, and visualization projects. Many of these R packages, maintained by RStudio, are among the most popular R packages worldwide. Some of the key packages you’ll use in the tidyverse include:

  • dplyr, for data wrangling tasks such as sorting, filtering, grouping, and summarizing;
  • tidyr, for reshaping data between long and wide representations;
  • readr, for reading in datasets;
  • ggplot2, for data visualization;
  • purrr, for functional programming and iteration;
  • stringr, for working with character strings;
  • forcats, for working with factors;
  • tibble, for the tidyverse’s enhanced data frame.

The core data structure used in the tidyverse is the tibble, which is an R data frame with some small enhancements to improve the user experience. tidycensus returns tibbles by default.

A full treatment of the tidyverse and its functionality is beyond the scope of this book; however, the examples in this chapter will introduce you to several key tidyverse features using US Census Bureau data. For a more general and broader treatment of the tidyverse, I recommend the R for Data Science book (Wickham and Grolemund 2017).

3.2 Exploring Census data with tidyverse tools

Census data queries using tidycensus, combined with core tidyverse functions, are excellent ways to explore downloaded Census data. Chapter 2 covered how to download data from various Census datasets using tidycensus and return the data in a desired format. A common next step in an analytic process will involve data exploration, which is handled by a wide range of tools in the tidyverse.

To get started, the tidycensus and tidyverse packages are loaded. tidyverse is a meta-package: loading it attaches several core packages within the tidyverse in a single call. The package load message gives you more information:
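
library(tidycensus)
library(tidyverse)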

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Eight core tidyverse packages are attached: ggplot2, tibble (Müller and Wickham 2021), purrr, dplyr, readr, and tidyr, along with stringr (Wickham 2019a) for string manipulation and forcats (Wickham 2021a) for working with factors. These tools collectively can be used for many core Census data analysis tasks.

3.2.1 Sorting and filtering data

For a first example, let’s request data on median age from the 2015-2019 ACS with get_acs() for all counties in the United States. This requires specifying geography = "county" and leaving state set to NULL, the default.

median_age <- get_acs(
  geography = "county",
  variables = "B01002_001",
  year = 2019
)
Table 3.1: Median age for US counties
GEOID NAME variable estimate moe
01001 Autauga County, Alabama B01002_001 38.2 0.6
01003 Baldwin County, Alabama B01002_001 43.0 0.3
01005 Barbour County, Alabama B01002_001 40.4 0.5
01007 Bibb County, Alabama B01002_001 40.9 1.3
01009 Blount County, Alabama B01002_001 40.7 0.3
01011 Bullock County, Alabama B01002_001 40.2 2.3
01013 Butler County, Alabama B01002_001 40.8 0.7
01015 Calhoun County, Alabama B01002_001 39.6 0.3
01017 Chambers County, Alabama B01002_001 42.0 0.7
01019 Cherokee County, Alabama B01002_001 46.5 0.5

The default method for printing data used by the tibble package shows the first 10 rows of the dataset, which in this case prints counties in Alabama. A first exploratory data analysis question might involve understanding which counties are the youngest and oldest in the United States as measured by median age. This task can be accomplished with the arrange() function found in the dplyr package. arrange() sorts a dataset by values in one or more columns and returns the sorted result. To view the dataset in ascending order of a given column, supply the data object and a column name to the arrange() function.

arrange(median_age, estimate)
Table 3.2: The youngest counties in the US by median age
GEOID NAME variable estimate moe
51678 Lexington city, Virginia B01002_001 22.3 0.7
51750 Radford city, Virginia B01002_001 23.4 0.5
16065 Madison County, Idaho B01002_001 23.5 0.2
46121 Todd County, South Dakota B01002_001 23.8 0.4
02158 Kusilvak Census Area, Alaska B01002_001 24.1 0.2
13053 Chattahoochee County, Georgia B01002_001 24.5 0.5
53075 Whitman County, Washington B01002_001 24.7 0.3
49049 Utah County, Utah B01002_001 24.8 0.1
46027 Clay County, South Dakota B01002_001 24.9 0.4
51830 Williamsburg city, Virginia B01002_001 24.9 0.7

Per the 2015-2019 ACS, the two youngest “counties” in the United States are independent cities in Virginia, which are treated as county-equivalents. Both Lexington and Radford are college towns; Lexington is home to both Washington & Lee University and the Virginia Military Institute, and Radford is home to Radford University. The youngest county proper by median age is Madison County, Idaho.

To retrieve the oldest counties in the United States by median age, an analyst can use the desc() function available in dplyr to sort the estimate column in descending order.

arrange(median_age, desc(estimate))
Table 3.3: The oldest counties in the US by median age
GEOID NAME variable estimate moe
12119 Sumter County, Florida B01002_001 67.4 0.2
51091 Highland County, Virginia B01002_001 60.9 3.5
08027 Custer County, Colorado B01002_001 59.7 2.6
12015 Charlotte County, Florida B01002_001 59.1 0.2
41069 Wheeler County, Oregon B01002_001 59.0 3.3
51133 Northumberland County, Virginia B01002_001 58.9 0.7
26131 Ontonagon County, Michigan B01002_001 58.6 0.4
35021 Harding County, New Mexico B01002_001 58.5 5.5
53031 Jefferson County, Washington B01002_001 58.3 0.7
26001 Alcona County, Michigan B01002_001 58.2 0.3

The oldest county in the United States, by a margin of nearly 7 years over the second-oldest, is Sumter County, Florida. Sumter County is home to The Villages, a Census-designated place that includes a large age-restricted community also called The Villages.

The tidyverse includes several tools for parsing datasets that allow for exploration beyond sorting and browsing data. The filter() function in dplyr queries a dataset for rows where a given condition evaluates to TRUE, and retains those rows only. For analysts who are familiar with databases and SQL, this is equivalent to a WHERE clause. This helps analysts subset their data for specific areas by their characteristics, and answer questions like “how many counties in the US have a median age of 50 or older?”

filter(median_age, estimate >= 50)
Table 3.4: Counties with a median age of 50 or above
GEOID NAME variable estimate moe
04007 Gila County, Arizona B01002_001 50.2 0.2
04012 La Paz County, Arizona B01002_001 56.5 0.5
04015 Mohave County, Arizona B01002_001 51.6 0.3
04025 Yavapai County, Arizona B01002_001 53.4 0.1
05005 Baxter County, Arkansas B01002_001 52.2 0.3
05089 Marion County, Arkansas B01002_001 52.2 0.5
05097 Montgomery County, Arkansas B01002_001 50.4 0.8
05137 Stone County, Arkansas B01002_001 50.1 0.7
06003 Alpine County, California B01002_001 52.2 8.8
06005 Amador County, California B01002_001 50.5 0.4

Functions like arrange() and filter() operate on row values and organize data by row. Other tidyverse functions, like tidyr’s separate(), operate on columns. The NAME column, returned by default by most tidycensus functions, contains a basic description of the location that can be more intuitive than the GEOID. For the 2015-2019 ACS, NAME is formatted as “X County, Y,” where X is the county name and Y is the state name. separate() can split this column into two columns where one retains the county name and the other retains the state; this can be useful for analysts who need to complete a comparative analysis by state.

separate(
  median_age,
  NAME,
  into = c("county", "state"),
  sep = ", "
)
Table 3.5: Separate columns for county and state
GEOID county state variable estimate moe
01001 Autauga County Alabama B01002_001 38.2 0.6
01003 Baldwin County Alabama B01002_001 43.0 0.3
01005 Barbour County Alabama B01002_001 40.4 0.5
01007 Bibb County Alabama B01002_001 40.9 1.3
01009 Blount County Alabama B01002_001 40.7 0.3
01011 Bullock County Alabama B01002_001 40.2 2.3
01013 Butler County Alabama B01002_001 40.8 0.7
01015 Calhoun County Alabama B01002_001 39.6 0.3
01017 Chambers County Alabama B01002_001 42.0 0.7
01019 Cherokee County Alabama B01002_001 46.5 0.5

You may have noticed above that existing variable names are unquoted when referenced in tidyverse functions. Many tidyverse functions use non-standard evaluation to refer to column names, which means that column names can be used as arguments directly without quotation marks. Non-standard evaluation makes interactive programming faster, especially for beginners; however, it can introduce some complications when writing your own functions or R packages. A full treatment of non-standard evaluation is beyond the scope of this book; Hadley Wickham’s Advanced R (Wickham 2019b) is the best resource on the topic if you’d like to learn more.
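
To see the contrast, consider sorting the median_age tibble both ways; the base R version must reference the column through the data object, whereas dplyr’s arrange() takes the bare column name. A quick illustrative sketch:

# dplyr: bare column name (non-standard evaluation)
arrange(median_age, estimate)

# base R equivalent: the column is referenced explicitly with $
median_age[order(median_age$estimate), ]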

3.2.2 Using summary variables and calculating new columns

Data in Census and ACS tables are frequently composed of variables that individually represent sub-categories, such as the numbers of households in different household income bands. One limitation of working with raw estimated counts, however, is that they are difficult to compare across geographies. For example, Maricopa County in Arizona is the state’s most populous county with 4.3 million residents; the second-largest county, Pima, has just over 1 million residents, and six of the state’s 15 counties have fewer than 100,000 residents. In turn, comparing Maricopa’s estimates with those of smaller counties in the state would often be inappropriate.

A solution to this issue might involve normalizing the estimated count data by dividing it by the overall population from which the sub-group is derived. Appropriate denominators for ACS tables are frequently found in the tables themselves as variables. In ACS table B19001, which covers the number of households by income bands, the variable B19001_001 represents the total number of households in a given enumeration unit. Given that this variable is an appropriate denominator for the other variables in the table, it merits its own column to facilitate the calculation of proportions or percentages.

In tidycensus, this can be accomplished by supplying a variable ID to the summary_var parameter in both the get_acs() and get_decennial() functions. Doing so will create two new columns for decennial Census datasets - summary_var and summary_value, representing the summary variable ID and the summary variable’s value - and three new columns for ACS datasets - summary_var, summary_est, and summary_moe - which include the summary variable ID along with the ACS estimate and margin of error for the summary variable.
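
For the decennial case, a minimal sketch of this workflow might look like the following; the 2010 SF1 variable IDs shown here are assumptions for illustration and should be verified with load_variables(2010, "sf1").

# Hypothetical sketch: non-Hispanic white and Hispanic populations by county
# in Arizona from the 2010 Census, with total population as the summary variable
az_race_2010 <- get_decennial(
  geography = "county",
  state = "AZ",
  variables = c(White = "P005003", Hispanic = "P005010"),
  summary_var = "P005001",
  year = 2010
)

The result includes value and summary_value columns, which can be used together in the same way as the ACS columns in the example that follows.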

With this information in hand, normalizing data is straightforward. The following example uses the summary_var parameter to compare the population of counties in Arizona by race & Hispanic origin with their baseline populations, using data from the 2015-2019 ACS.

race_vars <- c(
  White = "B03002_003",
  Black = "B03002_004",
  Native = "B03002_005",
  Asian = "B03002_006",
  HIPI = "B03002_007",
  Hispanic = "B03002_012"
)

az_race <- get_acs(
  geography = "county",
  state = "AZ",
  variables = race_vars,
  summary_var = "B03002_001",
  year = 2019
)
Table 3.6: Race and ethnicity in Arizona
GEOID NAME variable estimate moe summary_est summary_moe
04001 Apache County, Arizona White 13022 4 71511 NA
04001 Apache County, Arizona Black 373 138 71511 NA
04001 Apache County, Arizona Native 52285 234 71511 NA
04001 Apache County, Arizona Asian 246 78 71511 NA
04001 Apache County, Arizona HIPI 16 16 71511 NA
04001 Apache County, Arizona Hispanic 4531 NA 71511 NA
04003 Cochise County, Arizona White 69216 235 125867 NA
04003 Cochise County, Arizona Black 4620 247 125867 NA
04003 Cochise County, Arizona Native 1142 191 125867 NA
04003 Cochise County, Arizona Asian 2431 162 125867 NA

By using dplyr’s mutate() function, we calculate a new column, percent, representing the percentage of each county’s population that corresponds to each racial/ethnic group in 2015-2019. The select() function, also in dplyr, retains only those columns that we need to view.

az_race_percent <- az_race %>%
  mutate(percent = 100 * (estimate / summary_est)) %>%
  select(NAME, variable, percent)
Table 3.7: Race and ethnicity in Arizona as percentages
NAME variable percent
Apache County, Arizona White 18.2097859
Apache County, Arizona Black 0.5215981
Apache County, Arizona Native 73.1146257
Apache County, Arizona Asian 0.3440030
Apache County, Arizona HIPI 0.0223742
Apache County, Arizona Hispanic 6.3360882
Cochise County, Arizona White 54.9913798
Cochise County, Arizona Black 3.6705411
Cochise County, Arizona Native 0.9073069
Cochise County, Arizona Asian 1.9314038

The above example introduces some additional syntax common to tidyverse data analyses. The %>% operator from the magrittr R package (Bache and Wickham 2020) is a pipe operator that allows analysts to develop analytic pipelines, which are deeply embedded in tidyverse-centric data analytic workflows. The pipe operator passes the result of a given line of code as the first argument of the code on the next line. In turn, analysts can develop data analysis pipelines of related operations that fit together in a coherent way.

tidyverse developers recommend that the pipe operator be read as “then.” The above code can in turn be interpreted as “Create a new data object az_race_percent by using the existing data object az_race THEN creating a new percent column THEN selecting the NAME, variable, and percent columns.”

Since R version 4.1, the base installation of R also includes a pipe operator, |>. It works much the same way as the magrittr pipe %>%, though %>% has some small additional features that make it work well within tidyverse analysis pipelines. In turn, %>% will be used in the examples throughout this book.
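
As a quick illustration, the two pipes can be used interchangeably with the median_age object created earlier; both lines below return the same sorted tibble.

# magrittr pipe, loaded with the tidyverse
median_age %>% arrange(desc(estimate))

# base R pipe, available since R 4.1
median_age |> arrange(desc(estimate))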

3.3 Group-wise Census data analysis

The split-apply-combine model of data analysis, as discussed in Wickham (2011), is a powerful framework for analyzing demographic data. In general terms, an analyst will apply this framework as follows:

  • The analyst identifies salient groups in a dataset between which they want to make comparisons. The dataset is then split into multiple pieces, one for each group.

  • A function is then applied to each group in turn. This might be a simple summary function, such as taking the maximum or calculating the mean, or a custom function defined by the analyst.

  • Finally, the results of the function applied to each group are combined back into a single dataset, allowing the analyst to compare the results by group.

Given the hierarchical nature of US Census Bureau data, “groups” across which analysts can make comparisons are found in just about every analytic task. In many cases, the split-apply-combine model of data analysis will be useful to analysts as they make sense of patterns and trends found in Census data.

In the tidyverse, split-apply-combine is implemented with the group_by() function in the dplyr package. group_by() does the work for the analyst of splitting a dataset into groups, allowing subsequent functions used by the analyst in an analytic pipeline to be applied to each group then combined back into a single dataset. The examples that follow illustrate some common group-wise analyses.

3.3.1 Making group-wise comparisons

The az_race_percent dataset created above is an example of a dataset suitable for group-wise data analysis. It includes two columns that could be used as group definitions: NAME, representing the county, and variable, representing the racial or ethnic group. Split-apply-combine could be used for either group definition to make comparisons for data in Arizona across these categories.

In a first example, we can deploy group-wise data analysis to identify the largest racial or ethnic group in each county in Arizona. This involves setting up a data analysis pipeline with the magrittr pipe and calculating a grouped filter, in which the filter() operation is applied separately within each group. In this example, the filter condition will be specified as percent == max(percent). We can read the analytic pipeline then as “Create a new dataset, largest_group, by using the az_race_percent dataset THEN grouping the dataset by the NAME column THEN filtering for rows that are equal to the maximum value of percent for each group.”

largest_group <- az_race_percent %>%
  group_by(NAME) %>%
  filter(percent == max(percent))
Table 3.8: Largest group by county in Arizona
NAME variable percent
Apache County, Arizona Native 73.11463
Cochise County, Arizona White 54.99138
Coconino County, Arizona White 54.05170
Gila County, Arizona White 62.25302
Graham County, Arizona White 50.85799
Greenlee County, Arizona Hispanic 46.79689
La Paz County, Arizona White 57.39912
Maricopa County, Arizona White 55.24102
Mohave County, Arizona White 77.29074
Navajo County, Arizona Native 43.52613

The result of the grouped filter allows us to review the most common racial or ethnic group in each Arizona county along with how their percentages vary. For example, in two Arizona counties (Greenlee and Navajo), no racial or ethnic group forms a majority of the population.

group_by() is commonly paired with the summarize() function in data analysis pipelines. summarize() generates a new, condensed dataset that by default returns a column for the grouping variable(s) and columns representing the results of one or more functions applied to those groups. In the example below, the median() function is used to identify the median percentage for each of the racial & ethnic groups in the dataset across counties in Arizona. In turn, variable is passed to group_by() as the grouping variable.

az_race_percent %>%
  group_by(variable) %>%
  summarize(median_pct = median(percent))
Table 3.9: Median percentage by group
variable median_pct
Asian 0.9238513
Black 1.1155627
HIPI 0.1210654
Hispanic 30.1555247
Native 3.5811804
White 54.0517009

The result of this operation tells us the median county percentage of each racial and ethnic group for the state of Arizona. A broader analysis might involve the calculation of these percentages hierarchically, finding the median county percentage of given attributes across states, for example.

3.3.2 Tabulating new groups

In the examples above, suitable groups in the NAME and variable columns were already found in the data retrieved with get_acs(). Commonly, analysts will also need to calculate new custom groups to address specific analytic questions. For example, variables in ACS table B19001 represent groups of households whose household incomes fall into a variety of categories: less than $10,000/year, between $10,000/year and $19,999/year, and so forth. These categories may be more granular than needed by an analyst. As such, an analyst might take the following steps: 1) recode the ACS variables into wider income bands; 2) group the data by the wider income bands; 3) calculate grouped sums to generate new estimates.

Consider the following example, using household income data for Minnesota counties from the 2012-2016 ACS:

mn_hh_income <- get_acs(
  geography = "county",
  table = "B19001",
  state = "MN",
  year = 2016
)
Table 3.10: Table B19001 for counties in Minnesota
GEOID NAME variable estimate moe
27001 Aitkin County, Minnesota B19001_001 7640 262
27001 Aitkin County, Minnesota B19001_002 562 77
27001 Aitkin County, Minnesota B19001_003 544 72
27001 Aitkin County, Minnesota B19001_004 472 69
27001 Aitkin County, Minnesota B19001_005 508 68
27001 Aitkin County, Minnesota B19001_006 522 92
27001 Aitkin County, Minnesota B19001_007 447 61
27001 Aitkin County, Minnesota B19001_008 390 49
27001 Aitkin County, Minnesota B19001_009 426 64
27001 Aitkin County, Minnesota B19001_010 415 65

Our data include household income categories for each county in the rows. However, let’s say we only need three income categories for purposes of analysis: below $35,000/year, between $35,000/year and $75,000/year, and $75,000/year and up.

We first need to do some transformation of our data to recode the variables appropriately. First, we will remove variable B19001_001, which represents the total number of households for each county. Second, we use the case_when() function from the dplyr package to identify groups of variables that correspond to our desired groupings. Given that the variables are ordered in the ACS table in relationship to the household income values, the less than operator can be used to identify groups.

The syntax of case_when() can appear complex to beginners, so it is worth stepping through how the function works. Inside the mutate() function, which is used to create a new variable named incgroup, case_when() steps through a series of logical conditions that are evaluated in order, similar to a series of if/else statements. The first condition tells the function to assign the value "below35k" to all rows with a variable value that comes before "B19001_008" - which in this case will be B19001_002 (income less than $10,000) through B19001_007 (income between $30,000 and $34,999). The second condition is then evaluated only for those rows not accounted for by the first condition. This means that case_when() knows not to assign "bw35kand75k" to the income group of $10,000 and below even though its variable comes before B19001_013. The final condition in case_when() can be set to TRUE, which in this scenario translates to “all other values.”

mn_hh_income_recode <- mn_hh_income %>%
  filter(variable != "B19001_001") %>%
  mutate(incgroup = case_when(
    variable < "B19001_008" ~ "below35k", 
    variable < "B19001_013" ~ "bw35kand75k", 
    TRUE ~ "above75k"
  )) 
Table 3.11: Recoded household income categories
GEOID NAME variable estimate moe incgroup
27001 Aitkin County, Minnesota B19001_002 562 77 below35k
27001 Aitkin County, Minnesota B19001_003 544 72 below35k
27001 Aitkin County, Minnesota B19001_004 472 69 below35k
27001 Aitkin County, Minnesota B19001_005 508 68 below35k
27001 Aitkin County, Minnesota B19001_006 522 92 below35k
27001 Aitkin County, Minnesota B19001_007 447 61 below35k
27001 Aitkin County, Minnesota B19001_008 390 49 bw35kand75k
27001 Aitkin County, Minnesota B19001_009 426 64 bw35kand75k
27001 Aitkin County, Minnesota B19001_010 415 65 bw35kand75k
27001 Aitkin County, Minnesota B19001_011 706 81 bw35kand75k

Our result illustrates how the different variable IDs are mapped to the new, recoded categories that we specified in case_when(). The group_by() %>% summarize() workflow can now be applied to the recoded categories by county to tabulate the data into a smaller number of groups.

mn_group_sums <- mn_hh_income_recode %>%
  group_by(GEOID, incgroup) %>%
  summarize(estimate = sum(estimate))
Table 3.12: Grouped sums by income bands
GEOID incgroup estimate
27001 above75k 1706
27001 below35k 3055
27001 bw35kand75k 2879
27003 above75k 61403
27003 below35k 24546
27003 bw35kand75k 39311
27005 above75k 4390
27005 below35k 4528
27005 bw35kand75k 4577
27007 above75k 4491

Our data now reflect the new estimates by group by county.

3.4 Comparing ACS estimates over time

A common task when working with Census data is to examine demographic change over time. Data from the Census API - and consequently tidycensus - only go back to the 2000 Decennial Census. For analysts who want to go back further, decennial Census data are available from 1790 onward through the National Historical Geographic Information System, or NHGIS, which will be covered in detail in Chapter 11.

3.4.1 Time-series analysis: some cautions

Before engaging in any sort of time series analysis of Census data, analysts need to account for potential problems that can emerge when using Census data longitudinally. One major issue that can emerge is geography changes over time. For example, let’s say we are interested in analyzing data on Oglala Lakota County, South Dakota. We can get recent data from the ACS using tools learned in Chapter 2:

oglala_lakota_age <- get_acs(
  geography = "county",
  state = "SD",
  county = "Oglala Lakota",
  table = "B01001",
  year = 2019
)
Table 3.13: 2015-2019 age table for Oglala Lakota County, SD
GEOID NAME variable estimate moe
46102 Oglala Lakota County, South Dakota B01001_001 14335 NA
46102 Oglala Lakota County, South Dakota B01001_002 7024 126
46102 Oglala Lakota County, South Dakota B01001_003 771 58
46102 Oglala Lakota County, South Dakota B01001_004 697 97
46102 Oglala Lakota County, South Dakota B01001_005 829 88
46102 Oglala Lakota County, South Dakota B01001_006 393 14
46102 Oglala Lakota County, South Dakota B01001_007 263 41
46102 Oglala Lakota County, South Dakota B01001_008 98 59
46102 Oglala Lakota County, South Dakota B01001_009 158 59
46102 Oglala Lakota County, South Dakota B01001_010 375 88

To understand how the age composition of the county has changed over the past 10 years, we may want to look at the 2005-2009 ACS for the county. Normally, we would just change the year argument to 2009:

oglala_lakota_age_09 <- get_acs(
  geography = "county",
  state = "SD",
  county = "Oglala Lakota",
  table = "B01001",
  year = 2009
)
## Error: Your API call has errors.  The API message returned is .

The request fails, and we don’t get an informative error message back from the API, as was discussed in Section 2.6. The problem here is that Oglala Lakota County had a different name in 2009, Shannon County, meaning that the county = "Oglala Lakota" argument will not return any data. In turn, the equivalent code for the 2005-2009 ACS would use county = "Shannon".

oglala_lakota_age_09 <- get_acs(
  geography = "county",
  state = "SD",
  county = "Shannon",
  table = "B01001",
  year = 2009
)
Table 3.14: 2005-2009 age table for Oglala Lakota County, SD (then named Shannon County)
GEOID NAME variable estimate moe
46113 Shannon County, South Dakota B01001_001 13593 0
46113 Shannon County, South Dakota B01001_002 6469 238
46113 Shannon County, South Dakota B01001_003 739 99
46113 Shannon County, South Dakota B01001_004 574 144
46113 Shannon County, South Dakota B01001_005 823 122
46113 Shannon County, South Dakota B01001_006 498 129
46113 Shannon County, South Dakota B01001_007 291 39
46113 Shannon County, South Dakota B01001_008 213 98
46113 Shannon County, South Dakota B01001_009 147 74
46113 Shannon County, South Dakota B01001_010 341 110

Note the differences in the GEOID column between the two tables of data. When a county or geographic entity changes its name, the Census Bureau assigns it a new GEOID, meaning that analysts need to take care when dealing with those changes. A full listing of geography changes is available on the Census website for each year.

In addition to changes in geographic identifiers, variable IDs can change over time as well. For example, the ACS Data Profile is commonly used for pre-computed normalized ACS estimates. Let’s say that we are interested in analyzing the percentage of residents age 25 and up with a 4-year college degree for counties in Colorado from the 2019 1-year ACS. We’d first look up the appropriate variable ID with load_variables(2019, "acs1/profile") then use get_acs():

co_college19 <- get_acs(
  geography = "county",
  variables = "DP02_0068P",
  state = "CO",
  survey = "acs1",
  year = 2019
)
Table 3.15: ACS Data Profile data in 2019
GEOID NAME variable estimate moe
08001 Adams County, Colorado DP02_0068P 25.4 1.3
08005 Arapahoe County, Colorado DP02_0068P 43.8 1.3
08013 Boulder County, Colorado DP02_0068P 64.8 1.8
08014 Broomfield County, Colorado DP02_0068P 56.9 3.3
08031 Denver County, Colorado DP02_0068P 53.1 1.1

We get back data for counties of population 65,000 and greater as these are the geographies available in the 1-year ACS. The data make sense: Boulder County, home to the University of Colorado, has a very high percentage of its population with a 4-year degree or higher. However, when we run the exact same query for the 2018 1-year ACS:

co_college18 <- get_acs(
  geography = "county",
  variables = "DP02_0068P",
  state = "CO",
  survey = "acs1",
  year = 2018
)
Table 3.16: ACS Data Profile data in 2018
GEOID NAME variable estimate moe
08001 Adams County, Colorado DP02_0068P 375798 NA
08005 Arapahoe County, Colorado DP02_0068P 497198 NA
08013 Boulder County, Colorado DP02_0068P 263938 NA
08014 Broomfield County, Colorado DP02_0068P 53400 NA
08031 Denver County, Colorado DP02_0068P 575870 NA

The values are completely different, and clearly not percentages! This is because variable IDs for the Data Profile are unique to each year and in turn should not be used for time-series analysis. The returned results above represent the civilian population age 18 and up, and have nothing to do with educational attainment.

3.4.2 Preparing time-series ACS estimates

A safer way to perform time-series analysis of the ACS, then, is to use the Detailed Tables. While this option lacks the convenience of the pre-computed estimates in the Data Profile, it ensures that variable IDs will remain consistent across years allowing for consistent and correct analysis. That said, there still are some potential pitfalls to account for when using the Detailed Tables. The Census Bureau will add and remove variables from survey to survey depending on data needs and data availability. For example, questions are sometimes added and removed from the ACS survey meaning that you won’t always be able to get every data point for every year and geography combination. In turn, it is still important to check on data availability using load_variables() for the years you plan to analyze before carrying out your time-series analysis.
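
For example, before trusting a variable ID across years, one might compare how it is defined in each year’s variable list. A quick sketch using the Data Profile variable from the previous section:

# Compare the variable's label across years; if the labels differ,
# the ID should not be used longitudinally
v18 <- load_variables(2018, "acs1/profile")
v19 <- load_variables(2019, "acs1/profile")

filter(v18, name == "DP02_0068P")
filter(v19, name == "DP02_0068P")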

Let’s re-engineer the analysis above on educational attainment in Colorado counties, which below will be computed for a time series from 2010 to 2019. Information on “bachelor’s degree or higher” is split by sex and across different tiers of educational attainment in the Detailed Tables, found in ACS table B15002. Given that we only need a few variables (representing estimates of populations age 25+ who have finished a 4-year degree or graduate degrees, by sex), we’ll request those variables directly rather than the entire B15002 table.

college_vars <- c("B15002_015",
                  "B15002_016",
                  "B15002_017",
                  "B15002_018",
                  "B15002_032",
                  "B15002_033",
                  "B15002_034",
                  "B15002_035")

We’ll now use these variables to request data on college degree holders from the ACS for counties in Colorado for each of the 1-year ACS surveys from 2010 to 2019. In most cases, this process should be streamlined with iteration. Thus far, we are familiar with using the year argument in get_acs() to request data for a specific year. Writing out ten different calls to get_acs(), however - one for each year - would be tedious and would require a fair amount of repetitive code! Iteration helps us avoid repetitive coding as it allows us to carry out the same process over a sequence of values. Programmers familiar with iteration will likely know of “loop” operators like for and while, which are available in base R and most other programming languages in some variety. Base R also includes the *apply() family of functions (e.g. lapply(), mapply(), sapply()), which iterates over a sequence of values and applies a given function to each value.

The tidyverse approach to iteration is found in the purrr package. purrr includes a variety of functions that are designed to integrate well in workflows that require iteration and use other tidyverse tools. The map_*() family of functions iterate over values and try to return a desired result; map() returns a list, map_int() returns an integer vector, and map_chr() returns a character vector, for example. With tidycensus, the map_dfr() function is particularly useful. map_dfr() iterates over an input and applies it to a function or process defined by the user, then row-binds the result into a single data frame. The example below illustrates how this works for the years 2010 through 2019.

years <- 2010:2019
names(years) <- years

college_by_year <- map_dfr(years, ~{
  get_acs(
    geography = "county",
    variables = college_vars,
    state = "CO",
    summary_var = "B15002_001",
    survey = "acs1",
    year = .x
  )
}, .id = "year")

For users newer to R, iteration and purrr syntax can feel complex, so it is worth stepping through how the above code sample works.

  • First, a numeric vector of years is defined with the syntax 2010:2019. This will create a vector of years at 1-year intervals. These values are set as the names of the vector as well, as map_dfr() has additional functionality for working with named objects.

  • map_dfr() then takes three arguments above.

    • The first argument is the object that map_dfr() will iterate over, which in this case is our years vector. This means that the process we set up will be run once for each element of years.
    • The second argument is a formula we specify with the tilde (~) operator and curly braces ({...}). The code inside the curly braces will be run once for each element of years. The local variable .x, used inside the formula, takes on each value of years sequentially. In turn, we are running the equivalent of get_acs() with year = 2010, year = 2011, and so forth. Once get_acs() is run for each year, the result is combined into a single output data frame.
    • The .id argument, which is optional but used here, creates a new column in the output data frame that contains values equivalent to the names of the input object, which in this case is years. By setting .id = "year", we tell map_dfr() to name the new column that will contain these values year.

Let’s review the result:

college_by_year %>% 
  arrange(NAME, variable, year)
Table 3.17: Educational attainment over time
year GEOID NAME variable estimate moe summary_est summary_moe
2010 08001 Adams County, Colorado B15002_015 20501 1983 275849 790
2011 08001 Adams County, Colorado B15002_015 21233 2124 281231 865
2012 08001 Adams County, Colorado B15002_015 19238 2020 287924 693
2013 08001 Adams County, Colorado B15002_015 23818 2445 295122 673
2014 08001 Adams County, Colorado B15002_015 20255 1928 304394 541
2015 08001 Adams County, Colorado B15002_015 22962 2018 312281 705
2016 08001 Adams County, Colorado B15002_015 25744 2149 318077 525
2017 08001 Adams County, Colorado B15002_015 26159 2320 324185 562
2018 08001 Adams County, Colorado B15002_015 28113 2078 331247 955
2019 08001 Adams County, Colorado B15002_015 27552 2070 336931 705

The result is a long-form dataset that contains a time series of each requested ACS variable for each county in Colorado that is available in the 1-year ACS. The code below outlines a group_by() %>% summarize() workflow for calculating the percentage of the population age 25 and up with a 4-year college degree, then uses the pivot_wider() function from the tidyr package to spread the years across the columns for tabular data display.

percent_college_by_year <- college_by_year %>%
  group_by(NAME, year) %>%
  summarize(numerator = sum(estimate),
            denominator = first(summary_est)) %>%
  mutate(pct_college = 100 * (numerator / denominator)) %>%
  pivot_wider(id_cols = NAME,
              names_from = year,
              values_from = pct_college)
Table 3.18: Percent college by year
NAME 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
Adams County, Colorado 20.57394 20.51801 20.64538 23.09384 22.16929 22.79742 22.95293 22.87552 25.74152 25.39956
Arapahoe County, Colorado 37.03001 38.24506 39.28435 39.42478 40.94194 41.03578 41.48359 43.69387 42.71285 43.78332
Boulder County, Colorado 57.50285 59.05601 57.88284 58.53214 58.04066 60.57147 60.63005 63.18150 62.51394 64.80486
Broomfield County, Colorado NA NA NA NA NA 56.07776 51.94338 55.13359 56.27740 56.87181
Denver County, Colorado 40.87971 42.97122 44.65358 44.35340 44.25600 47.10820 47.39683 49.32692 51.34580 53.10088
Douglas County, Colorado 54.96800 53.27936 55.09223 57.66999 56.48866 56.06928 59.42687 58.53342 58.37539 58.13747
El Paso County, Colorado 34.11467 35.69184 34.91315 35.47612 36.49302 36.43089 38.67864 39.19931 38.83872 39.04942
Jefferson County, Colorado 40.83113 39.54961 41.43825 41.04234 41.99768 43.20923 43.51953 45.57140 45.83786 47.61074
Larimer County, Colorado 45.80197 42.83543 44.71423 43.33800 42.67180 46.16705 46.78871 47.90465 47.62330 49.01654
Mesa County, Colorado 24.99285 25.82724 23.01511 27.63325 25.14875 30.27630 25.02980 25.82455 29.95739 29.78537

This particular format is suitable for data display or writing to an Excel spreadsheet for colleagues who are not R-based. Methods for visualization of time-series estimates from the ACS will be covered in Section 4.4.
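
For instance, a minimal export sketch using the writexl package (writexl is not part of the core tidyverse and must be installed separately; the file name here is arbitrary):

library(writexl)
write_xlsx(percent_college_by_year, "percent_college_by_year.xlsx")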

3.5 Handling margins of error in the American Community Survey with tidycensus

A topic of critical importance when working with data from the American Community Survey is the margin of error. As opposed to the decennial US Census, which is based on a complete enumeration of the US population, the ACS is based on a sample, with estimates characterized by margins of error. By default, MOEs are returned at a 90 percent confidence level. This can be translated roughly as “we are 90 percent sure that the true value falls within a range defined by the estimate plus or minus the margin of error.”

As discussed in Chapter 2, tidycensus takes an opinionated approach to margins of error. When applicable, tidycensus will always return the margin of error associated with an estimate, and does not have an option available to return estimates only. For “tidy” or long-form data, these margins of error will be found in the moe column; for wide-form data, margins of error will be found in columns with an M suffix.
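
For instance, requesting the same variable as in the Rhode Island query shown just below, but with output = "wide" added, would return columns B19013_001E and B19013_001M for the estimate and MOE respectively. A quick sketch:

get_acs(
  geography = "county",
  state = "Rhode Island",
  variables = "B19013_001",
  output = "wide"
)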

The confidence level of the MOE can be controlled with the moe_level argument in get_acs(). The default moe_level is 90, which is what the Census Bureau returns by default. tidycensus can also return MOEs at a confidence level of 95 or 99, using Census Bureau-recommended formulas to adjust the MOE. For example, we might look at data on median household income by county in Rhode Island using the default moe_level of 90:

get_acs(
  geography = "county",
  state = "Rhode Island",
  variables = "B19013_001"
)
Table 3.19: Default MOE at 90 percent confidence level
GEOID NAME variable estimate moe
44001 Bristol County, Rhode Island B19013_001 83092 4339
44003 Kent County, Rhode Island B19013_001 73521 1703
44005 Newport County, Rhode Island B19013_001 79454 2611
44007 Providence County, Rhode Island B19013_001 58974 1051
44009 Washington County, Rhode Island B19013_001 85531 2042

A stricter (higher) confidence level will increase the size of the MOE relative to its estimate.

get_acs(
  geography = "county",
  state = "Rhode Island",
  variables = "B19013_001",
  moe_level = 99
)
Table 3.20: MOE at 99 percent confidence level
GEOID NAME variable estimate moe
44001 Bristol County, Rhode Island B19013_001 83092 6752.486
44003 Kent County, Rhode Island B19013_001 73521 2650.261
44005 Newport County, Rhode Island B19013_001 79454 4063.319
44007 Providence County, Rhode Island B19013_001 58974 1635.599
44009 Washington County, Rhode Island B19013_001 85531 3177.824
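
Conceptually, these adjustments rescale the 90 percent MOE by the ratio of normal critical values for the two confidence levels. The rough sketch below converts Bristol County’s 90 percent MOE to an approximate 95 percent MOE; the exact constants tidycensus uses internally may differ slightly.

# Approximate conversion of a 90 percent MOE to a 95 percent MOE
moe_90 <- 4339
moe_95 <- moe_90 * (1.96 / 1.645)
moe_95
## [1] 5169.872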

3.5.1 Calculating derived margins of error in tidycensus

For small geographies or small populations, margins of error can get quite large, in some cases exceeding their corresponding estimates. In the example below, we can examine data on age groups by sex for the population age 65 and older for Census tracts in Salt Lake County, Utah.

vars <- paste0("B01001_0", c(20:25, 44:49))

salt_lake <- get_acs(
  geography = "tract",
  variables = vars,
  state = "Utah",
  county = "Salt Lake",
  year = 2019
)

Let’s focus on a specific Census tract in Salt Lake County using filter():

example_tract <- salt_lake %>%
  filter(GEOID == "49035100100")

example_tract %>% 
  select(-NAME)
Table 3.21: Example Census tract in Salt Lake City
GEOID variable estimate moe
49035100100 B01001_020 12 13
49035100100 B01001_021 36 23
49035100100 B01001_022 8 11
49035100100 B01001_023 5 8
49035100100 B01001_024 0 11
49035100100 B01001_025 22 23
49035100100 B01001_044 0 11
49035100100 B01001_045 11 13
49035100100 B01001_046 27 20
49035100100 B01001_047 10 12

In many cases, the margins of error exceed their corresponding estimates. For example, the ACS data suggest that in Census tract 49035100100, for the male population age 85 and up (variable ID B01001_025), there are anywhere between 0 and 45 people in that Census tract. This can make ACS data for small geographies problematic for planning and analysis purposes.

A potential solution to large margins of error for small estimates in the ACS is to aggregate data upwards until a satisfactory margin of error to estimate ratio is reached. The US Census Bureau publishes formulas for appropriately calculating margins of error around such derived estimates, which are included in tidycensus with the following functions:

  • moe_sum(): calculates a margin of error for a derived sum;
  • moe_product(): calculates a margin of error for a derived product;
  • moe_ratio(): calculates a margin of error for a derived ratio;
  • moe_prop(): calculates a margin of error for a derived proportion.

In their most basic form, these functions can be used with constants. For example, let’s say we had an ACS estimate of 25 with a margin of error of 5 around that estimate. The appropriate denominator for this estimate is 100 with a margin of error of 3. To determine the margin of error around the derived proportion of 0.25, we can use moe_prop():

moe_prop(25, 100, 5, 3)
## [1] 0.0494343

Our margin of error around the derived estimate of 0.25 is approximately 0.049.
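
This value follows from the Census Bureau’s published approximation for derived proportions, which subtracts the squared product of the proportion and the denominator MOE from the squared numerator MOE, then takes the square root and divides by the denominator. The quick check by hand below reproduces the result returned by moe_prop().

# Worked arithmetic behind the derived proportion MOE above
p <- 25 / 100
sqrt(5^2 - p^2 * 3^2) / 100
## [1] 0.0494343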

3.5.2 Calculating group-wise margins of error

These margin of error functions in tidycensus can in turn be integrated into tidyverse-centric analytic pipelines to handle large margins of error around estimates. Given that the smaller age bands in the Salt Lake County dataset are characterized by too much uncertainty for our analysis, we decide in this scenario to aggregate our data upwards to represent populations aged 65 and older by sex.

In the code below, we use the case_when() function to create a new column, sex, that represents a mapping of the variables we pulled from the ACS to their sex categories. We then employ a familiar group_by() %>% summarize() method to aggregate our data by Census tract and sex. Notably, the call to summarize() includes a call to tidycensus’s moe_sum() function, which will generate a new column that represents the margin of error around the derived sum.

salt_lake_grouped <- salt_lake %>%
  mutate(sex = case_when(
    str_sub(variable, start = -2) < "26" ~ "Male",
    TRUE ~ "Female"
  )) %>%
  group_by(GEOID, sex) %>%
  summarize(sum_est = sum(estimate), 
            sum_moe = moe_sum(moe, estimate))
Table 3.22: Grouped margins of error
GEOID sex sum_est sum_moe
49035100100 Female 55 30.90307
49035100100 Male 83 39.15354
49035100200 Female 167 57.49783
49035100200 Male 153 50.85273
49035100306 Female 273 108.80257
49035100306 Male 225 90.27181
49035100307 Female 188 70.24956
49035100307 Male 117 64.54456
49035100308 Female 164 98.66610
49035100308 Male 129 77.89095

The margins of error relative to their estimates are now much more reasonable than in the disaggregated data.
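
One quick way to confirm this is to compute the ratio of each derived MOE to its corresponding estimate; a short sketch:

# MOE-to-estimate ratio for the aggregated 65-and-older populations
salt_lake_grouped %>%
  mutate(moe_ratio = sum_moe / sum_est)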

That said, the Census Bureau issues a note of caution (American Community Survey Office 2020):

All [derived MOE methods] are approximations and users should be cautious in using them. This is because these methods do not consider the correlation or covariance between the basic estimates. They may be overestimates or underestimates of the derived estimate’s standard error depending on whether the two basic estimates are highly correlated in either the positive or negative direction. As a result, the approximated standard error may not match direct calculations of standard errors or calculations obtained through other methods.

This means that your “best bet” is to first search the ACS tables to see if your data are found in aggregated form elsewhere before doing the aggregation and MOE estimation yourself. In many cases, you’ll find aggregated information in the ACS combined tables, Data Profile, or Subject Tables that will include pre-computed margins of error for you.

3.6 Exercises

  • The ACS Data Profile includes a number of pre-computed percentages which can reduce your data wrangling time. The variable in the 2015-2019 ACS for “percent of the population age 25 and up with a bachelor’s degree” is DP02_0068P. For a state of your choosing, use this variable to determine:

    • The county with the highest percentage in the state;

    • The county with the lowest percentage in the state;

    • The median value for counties in your chosen state.