3 Wrangling Census data with tidyverse tools
One of the most popular frameworks for data analysis in R is the tidyverse, a suite of packages designed for integrated data wrangling, visualization, and modeling. The “tidy” or long-form data returned by default in tidycensus is designed to work well with tidyverse analytic workflows. This chapter provides an overview of how to use tidyverse tools to gain additional insights about US Census data retrieved with tidycensus. It concludes with discussion about margins of error (MOEs) in the American Community Survey and how to wrangle and interpret MOEs appropriately.
3.1 The tidyverse
The tidyverse is a collection of R packages that are designed to work together in common data wrangling, analysis, and visualization projects. Many of these R packages, maintained by RStudio, are among the most popular R packages worldwide. Some of the key packages you’ll use in the tidyverse include:
readr (Wickham and Hester 2021), which contains tools for importing and exporting datasets;
dplyr (Wickham et al. 2021), a powerful framework for data wrangling tasks;
tidyr (Wickham 2021b), a package for reshaping data;
purrr (Henry and Wickham 2020), a comprehensive framework for functional programming and iteration;
ggplot2 (Wickham 2016), a data visualization package based on the Grammar of Graphics.
The core data structure used in the tidyverse is the tibble, which is an R data frame with some small enhancements to improve the user experience. tidycensus returns tibbles by default.
A full treatment of the tidyverse and its functionality is beyond the scope of this book; however, the examples in this chapter will introduce you to several key tidyverse features using US Census Bureau data. For a more general and broader treatment of the tidyverse, I recommend the R for Data Science book (Wickham and Grolemund 2017).
3.2 Exploring Census data with tidyverse tools
Census data queries using tidycensus, combined with core tidyverse functions, are excellent ways to explore downloaded Census data. Chapter 2 covered how to download data from various Census datasets using tidycensus and return the data in a desired format. A common next step in an analytic process will involve data exploration, which is handled by a wide range of tools in the tidyverse.
To get started, the tidycensus and tidyverse packages are loaded. “tidyverse” is not an analytic package itself, but rather a meta-package that attaches several core tidyverse packages.
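The loading step itself, not shown above, amounts to two library() calls:

library(tidycensus)
library(tidyverse)

The package load message gives you more information: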
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Eight tidyverse packages are loaded: ggplot2, tibble (Müller and Wickham 2021), purrr, dplyr, readr, and tidyr are included along with stringr (Wickham 2019a) for string manipulation and forcats (Wickham 2021a) for working with factors. These tools collectively can be used for many core Census data analysis tasks.
3.2.1 Sorting and filtering data
For a first example, let’s request data on median age from the 2016-2020 ACS with get_acs() for all counties in the United States. This requires specifying geography = "county" and leaving state set to NULL, the default.
median_age <- get_acs(
geography = "county",
variables = "B01002_001",
year = 2020
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
01001 | Autauga County, Alabama | B01002_001 | 38.6 | 0.6 |
01003 | Baldwin County, Alabama | B01002_001 | 43.2 | 0.4 |
01005 | Barbour County, Alabama | B01002_001 | 40.1 | 0.6 |
01007 | Bibb County, Alabama | B01002_001 | 39.9 | 1.2 |
01009 | Blount County, Alabama | B01002_001 | 41.0 | 0.5 |
01011 | Bullock County, Alabama | B01002_001 | 39.7 | 1.9 |
01013 | Butler County, Alabama | B01002_001 | 41.2 | 0.6 |
01015 | Calhoun County, Alabama | B01002_001 | 39.5 | 0.4 |
01017 | Chambers County, Alabama | B01002_001 | 41.9 | 0.7 |
01019 | Cherokee County, Alabama | B01002_001 | 46.8 | 0.5 |
The default method for printing data used by the tibble package shows the first 10 rows of the dataset, which in this case prints counties in Alabama. A first exploratory data analysis question might involve understanding which counties are the youngest and oldest in the United States as measured by median age. This task can be accomplished with the arrange() function found in the dplyr package. arrange() sorts a dataset by values in one or more columns and returns the sorted result. To view the dataset in ascending order of a given column, supply the data object and a column name to arrange().
arrange(median_age, estimate)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
35011 | De Baca County, New Mexico | B01002_001 | 22.2 | 6.9 |
51678 | Lexington city, Virginia | B01002_001 | 22.2 | 0.8 |
16065 | Madison County, Idaho | B01002_001 | 23.5 | 0.2 |
46121 | Todd County, South Dakota | B01002_001 | 23.6 | 0.6 |
51750 | Radford city, Virginia | B01002_001 | 23.7 | 0.6 |
13053 | Chattahoochee County, Georgia | B01002_001 | 24.0 | 0.7 |
02158 | Kusilvak Census Area, Alaska | B01002_001 | 24.1 | 0.2 |
49049 | Utah County, Utah | B01002_001 | 25.0 | 0.1 |
46027 | Clay County, South Dakota | B01002_001 | 25.2 | 0.5 |
53075 | Whitman County, Washington | B01002_001 | 25.2 | 0.2 |
Per the 2016-2020 ACS, the youngest county is De Baca County, New Mexico. Two of the five youngest “counties” in the United States are independent cities in Virginia, which are treated as county-equivalents. Both Lexington and Radford are college towns; Lexington is home to both Washington & Lee University and the Virginia Military Institute, and Radford houses Radford University.
To retrieve the oldest counties in the United States by median age, an analyst can use the desc() function available in dplyr to sort the estimate column in descending order.
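A sketch of that call, wrapping the sort column in desc() within arrange():

arrange(median_age, desc(estimate))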
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
12119 | Sumter County, Florida | B01002_001 | 68.0 | 0.3 |
48301 | Loving County, Texas | B01002_001 | 62.2 | 37.8 |
48243 | Jeff Davis County, Texas | B01002_001 | 61.3 | 36.8 |
08027 | Custer County, Colorado | B01002_001 | 60.1 | 3.4 |
12015 | Charlotte County, Florida | B01002_001 | 59.5 | 0.2 |
51091 | Highland County, Virginia | B01002_001 | 59.5 | 4.8 |
35003 | Catron County, New Mexico | B01002_001 | 59.4 | 2.6 |
51133 | Northumberland County, Virginia | B01002_001 | 59.3 | 0.7 |
26131 | Ontonagon County, Michigan | B01002_001 | 59.1 | 0.5 |
48443 | Terrell County, Texas | B01002_001 | 59.1 | 9.4 |
The oldest county in the United States is Sumter County, Florida. Sumter County is home to The Villages, a Census-designated place that includes a large age-restricted community also called The Villages.
The tidyverse includes several tools for parsing datasets that allow for exploration beyond sorting and browsing data. The filter() function in dplyr queries a dataset for rows where a given condition evaluates to TRUE, and retains only those rows. For analysts who are familiar with databases and SQL, this is equivalent to a WHERE clause. This helps analysts subset their data for specific areas by their characteristics, and answer questions like “how many counties in the US have a median age of 50 or older?”
filter(median_age, estimate >= 50)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
02105 | Hoonah-Angoon Census Area, Alaska | B01002_001 | 52.1 | 2.9 |
04007 | Gila County, Arizona | B01002_001 | 50.4 | 0.2 |
04012 | La Paz County, Arizona | B01002_001 | 57.4 | 0.6 |
04015 | Mohave County, Arizona | B01002_001 | 52.3 | 0.2 |
04025 | Yavapai County, Arizona | B01002_001 | 54.1 | 0.2 |
05005 | Baxter County, Arkansas | B01002_001 | 52.3 | 0.5 |
05089 | Marion County, Arkansas | B01002_001 | 52.1 | 0.8 |
05097 | Montgomery County, Arkansas | B01002_001 | 50.6 | 0.8 |
05137 | Stone County, Arkansas | B01002_001 | 50.0 | 0.5 |
06009 | Calaveras County, California | B01002_001 | 52.8 | 0.6 |
Functions like arrange() and filter() operate on row values and organize data by row. Other tidyverse functions, like tidyr’s separate(), operate on columns. The NAME column, returned by default by most tidycensus functions, contains a basic description of the location that can be more intuitive than the GEOID. For the 2016-2020 ACS, NAME is formatted as “X County, Y”, where X is the county name and Y is the state name. separate() can split this column into two columns, one retaining the county name and the other the state; this can be useful for analysts who need to complete a comparative analysis by state.
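A minimal sketch of the separate() call that would produce the table below, splitting NAME on its comma-space delimiter:

separate(
  median_age,
  NAME,
  into = c("county", "state"),
  sep = ", "
)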
GEOID | county | state | variable | estimate | moe |
---|---|---|---|---|---|
01001 | Autauga County | Alabama | B01002_001 | 38.6 | 0.6 |
01003 | Baldwin County | Alabama | B01002_001 | 43.2 | 0.4 |
01005 | Barbour County | Alabama | B01002_001 | 40.1 | 0.6 |
01007 | Bibb County | Alabama | B01002_001 | 39.9 | 1.2 |
01009 | Blount County | Alabama | B01002_001 | 41.0 | 0.5 |
01011 | Bullock County | Alabama | B01002_001 | 39.7 | 1.9 |
01013 | Butler County | Alabama | B01002_001 | 41.2 | 0.6 |
01015 | Calhoun County | Alabama | B01002_001 | 39.5 | 0.4 |
01017 | Chambers County | Alabama | B01002_001 | 41.9 | 0.7 |
01019 | Cherokee County | Alabama | B01002_001 | 46.8 | 0.5 |
You may have noticed above that existing variable names are unquoted when referenced in tidyverse functions. Many tidyverse functions use non-standard evaluation to refer to column names, which means that column names can be used as arguments directly without quotation marks. Non-standard evaluation makes interactive programming faster, especially for beginners; however, it can introduce some complications when writing your own functions or R packages. A full treatment of non-standard evaluation is beyond the scope of this book; Hadley Wickham’s Advanced R (Wickham 2019b) is the best resource on the topic if you’d like to learn more.
3.2.2 Using summary variables and calculating new columns
Data in Census and ACS tables, as in the example above, are frequently composed of variables that individually constitute sub-categories, such as the numbers of households in different household income bands. One limitation of this approach, however, is that the data and the resulting analysis return estimated counts, which are difficult to compare across geographies. For example, Maricopa County in Arizona is the state’s most populous county with 4.3 million residents; the second-largest county, Pima, has just over 1 million residents, and six of the state’s 15 counties have fewer than 100,000 residents. In turn, comparing Maricopa’s estimates with those of smaller counties in the state would often be inappropriate.
A solution to this issue might involve normalizing the estimated count data by dividing it by the overall population from which the sub-group is derived. Appropriate denominators for ACS tables are frequently found in the tables themselves as variables. In ACS table B19001, which covers the number of households by income bands, the variable B19001_001 represents the total number of households in a given enumeration unit. Given that this variable is an appropriate denominator for the other variables in the table, it merits its own column to facilitate the calculation of proportions or percentages.
In tidycensus, this can be accomplished by supplying a variable ID to the summary_var parameter in both the get_acs() and get_decennial() functions. When using get_decennial(), doing so will create two new columns for the decennial Census datasets, summary_var and summary_value, representing the summary variable ID and the summary variable’s value. When using get_acs(), using summary_var creates three new columns for the ACS datasets, summary_var, summary_est, and summary_moe, which include the summary variable ID along with the ACS estimate and margin of error for the summary variable.
With this information in hand, normalizing data is straightforward. The following example uses the summary_var parameter to compare the population of counties in Arizona by race and Hispanic origin with their baseline populations, using data from the 2016-2020 ACS.
race_vars <- c(
White = "B03002_003",
Black = "B03002_004",
Native = "B03002_005",
Asian = "B03002_006",
HIPI = "B03002_007",
Hispanic = "B03002_012"
)
az_race <- get_acs(
geography = "county",
state = "AZ",
variables = race_vars,
summary_var = "B03002_001",
year = 2020
)
GEOID | NAME | variable | estimate | moe | summary_est | summary_moe |
---|---|---|---|---|---|---|
04001 | Apache County, Arizona | White | 12993 | 56 | 71714 | NA |
04001 | Apache County, Arizona | Black | 544 | 56 | 71714 | NA |
04001 | Apache County, Arizona | Native | 51979 | 327 | 71714 | NA |
04001 | Apache County, Arizona | Asian | 262 | 76 | 71714 | NA |
04001 | Apache County, Arizona | HIPI | 49 | 14 | 71714 | NA |
04001 | Apache County, Arizona | Hispanic | 4751 | NA | 71714 | NA |
04003 | Cochise County, Arizona | White | 69095 | 350 | 126442 | NA |
04003 | Cochise County, Arizona | Black | 4512 | 378 | 126442 | NA |
04003 | Cochise County, Arizona | Native | 1058 | 176 | 126442 | NA |
04003 | Cochise County, Arizona | Asian | 2371 | 241 | 126442 | NA |
By using dplyr’s mutate() function, we calculate a new column, percent, representing the percentage of each county’s population that corresponds to each racial/ethnic group in 2016-2020. The select() function, also in dplyr, retains only those columns that we need to view.
az_race_percent <- az_race %>%
mutate(percent = 100 * (estimate / summary_est)) %>%
select(NAME, variable, percent)
NAME | variable | percent |
---|---|---|
Apache County, Arizona | White | 18.1178013 |
Apache County, Arizona | Black | 0.7585688 |
Apache County, Arizona | Native | 72.4809661 |
Apache County, Arizona | Asian | 0.3653401 |
Apache County, Arizona | HIPI | 0.0683270 |
Apache County, Arizona | Hispanic | 6.6249268 |
Cochise County, Arizona | White | 54.6456083 |
Cochise County, Arizona | Black | 3.5684345 |
Cochise County, Arizona | Native | 0.8367473 |
Cochise County, Arizona | Asian | 1.8751681 |
The above example introduces some additional syntax common to tidyverse data analyses. The %>% operator from the magrittr R package (Bache and Wickham 2020) is a pipe operator that allows analysts to develop analytic pipelines, which are deeply embedded in tidyverse-centric data analysis workflows. The pipe operator passes the result of a given line of code as the first argument of the code on the next line. In turn, analysts can develop pipelines of related operations that fit together in a coherent way.
tidyverse developers recommend that the pipe operator be read as “then”. The above code can in turn be interpreted as “Create a new data object az_race_percent by using the existing data object az_race THEN creating a new percent column THEN selecting the NAME, variable, and percent columns.”
Since R version 4.1, the base installation of R also includes a pipe operator, |>. It works much the same way as the magrittr pipe %>%, though %>% has some small additional features that make it work well within tidyverse analysis pipelines. In turn, %>% will be used in the examples throughout this book.
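A quick sketch of their interchangeability for simple calls, reusing the median_age data from above:

# Both lines return counties with a median age of 50 or above
median_age %>% filter(estimate >= 50)
median_age |> filter(estimate >= 50)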
3.3 Group-wise Census data analysis
The split-apply-combine model of data analysis, as discussed in Wickham (2011), is a powerful framework for analyzing demographic data. In general terms, an analyst will apply this framework as follows:
The analyst identifies salient groups in a dataset between which they want to make comparisons. The dataset is then split into multiple pieces, one for each group.
A function is then applied to each group in turn. This might be a simple summary function, such as taking the maximum or calculating the mean, or a custom function defined by the analyst.
Finally, the results of the function applied to each group are combined back into a single dataset, allowing the analyst to compare the results by group.
Given the hierarchical nature of US Census Bureau data, “groups” across which analysts can make comparisons are found in just about every analytic task. In many cases, the split-apply-combine model of data analysis will be useful to analysts as they make sense of patterns and trends found in Census data.
In the tidyverse, split-apply-combine is implemented with the group_by() function in the dplyr package. group_by() does the work of splitting a dataset into groups, allowing subsequent functions in an analytic pipeline to be applied to each group and then combined back into a single dataset. The examples that follow illustrate some common group-wise analyses.
3.3.1 Making group-wise comparisons
The az_race_percent dataset created above is an example of a dataset suitable for group-wise data analysis. It includes two columns that could be used as group definitions: NAME, representing the county, and variable, representing the racial or ethnic group. Split-apply-combine could be used for either group definition to make comparisons for data in Arizona across these categories.
In a first example, we can deploy group-wise data analysis to identify the largest racial or ethnic group in each county in Arizona. This involves setting up a data analysis pipeline with the magrittr pipe and performing a grouped filter, where the filter() operation is applied separately to each group. In this example, the filter condition will be specified as percent == max(percent). We can read the analytic pipeline as “Create a new dataset, largest_group, by using the az_race_percent dataset THEN grouping the dataset by the NAME column THEN filtering for rows that are equal to the maximum value of percent for each group.”
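A sketch of that pipeline:

largest_group <- az_race_percent %>%
  group_by(NAME) %>%
  filter(percent == max(percent))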
NAME | variable | percent |
---|---|---|
Apache County, Arizona | Native | 72.48097 |
Cochise County, Arizona | White | 54.64561 |
Coconino County, Arizona | White | 53.80798 |
Gila County, Arizona | White | 61.85232 |
Graham County, Arizona | White | 50.88764 |
Greenlee County, Arizona | Hispanic | 47.26889 |
La Paz County, Arizona | White | 57.18089 |
Maricopa County, Arizona | White | 54.55515 |
Mohave County, Arizona | White | 76.69694 |
Navajo County, Arizona | Native | 42.71386 |
The result of the grouped filter allows us to review the most common racial or ethnic group in each Arizona county along with how their percentages vary. For example, in two Arizona counties (Greenlee and Navajo), none of the racial or ethnic groups form a majority of the population.
group_by() is commonly paired with the summarize() function in data analysis pipelines. summarize() generates a new, condensed dataset that by default returns a column for the grouping variable(s) and columns representing the results of one or more functions applied to those groups. In the example below, the median() function is used to identify the median percentage for each of the racial and ethnic groups in the dataset across counties in Arizona. In turn, variable is passed to group_by() as the grouping variable.
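A sketch of that summarization:

az_race_percent %>%
  group_by(variable) %>%
  summarize(median_pct = median(percent))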
variable | median_pct |
---|---|
Asian | 0.9918415 |
Black | 1.2857283 |
HIPI | 0.1070384 |
Hispanic | 30.4721836 |
Native | 3.6344427 |
White | 53.8079773 |
The result of this operation tells us the median county percentage of each racial and ethnic group for the state of Arizona. A broader analysis might involve the calculation of these percentages hierarchically, finding the median county percentage of given attributes across states, for example.
3.3.2 Tabulating new groups
In the examples above, suitable groups in the NAME and variable columns were already found in the data retrieved with get_acs(). Commonly, analysts will also need to calculate new custom groups to address specific analytic questions. For example, variables in ACS table B19001 represent groups of households whose household incomes fall into a variety of categories: less than $10,000/year, between $10,000/year and $19,999/year, and so forth. These categories may be more granular than needed by an analyst. As such, an analyst might take the following steps: 1) recode the ACS variables into wider income bands; 2) group the data by the wider income bands; 3) calculate grouped sums to generate new estimates.
Consider the following example, using household income data for Minnesota counties from the 2012-2016 ACS:
mn_hh_income <- get_acs(
geography = "county",
table = "B19001",
state = "MN",
year = 2016
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
27001 | Aitkin County, Minnesota | B19001_001 | 7640 | 262 |
27001 | Aitkin County, Minnesota | B19001_002 | 562 | 77 |
27001 | Aitkin County, Minnesota | B19001_003 | 544 | 72 |
27001 | Aitkin County, Minnesota | B19001_004 | 472 | 69 |
27001 | Aitkin County, Minnesota | B19001_005 | 508 | 68 |
27001 | Aitkin County, Minnesota | B19001_006 | 522 | 92 |
27001 | Aitkin County, Minnesota | B19001_007 | 447 | 61 |
27001 | Aitkin County, Minnesota | B19001_008 | 390 | 49 |
27001 | Aitkin County, Minnesota | B19001_009 | 426 | 64 |
27001 | Aitkin County, Minnesota | B19001_010 | 415 | 65 |
Our data include household income categories for each county in the rows. However, let’s say we only need three income categories for purposes of analysis: below $35,000/year, between $35,000/year and $75,000/year, and $75,000/year and up.
We first need to do some transformation of our data to recode the variables appropriately. First, we will remove variable B19001_001, which represents the total number of households for each county. Second, we use the case_when() function from the dplyr package to identify groups of variables that correspond to our desired groupings. Given that the variables are ordered in the ACS table in relation to the household income values, the less-than operator can be used to identify groups.
The syntax of case_when() can appear complex to beginners, so it is worth stepping through how the function works. Inside the mutate() function, which is used to create a new variable named incgroup, case_when() steps through a series of logical conditions that are evaluated in order, similar to a series of if/else statements. The first condition is evaluated, telling the function to assign the value below35k to all rows with a variable value that comes before "B19001_008" - which in this case will be B19001_002 (income less than $10,000) through B19001_007 (income between $30,000 and $34,999). The second condition is then evaluated for all those rows not accounted for by the first condition. This means that case_when() knows not to assign "bw35kand75k" to the income group of $10,000 and below even though its variable comes before B19001_013. The final condition in case_when() can be set to TRUE, which in this scenario translates to “all other values.”
mn_hh_income_recode <- mn_hh_income %>%
filter(variable != "B19001_001") %>%
mutate(incgroup = case_when(
variable < "B19001_008" ~ "below35k",
variable < "B19001_013" ~ "bw35kand75k",
TRUE ~ "above75k"
))
GEOID | NAME | variable | estimate | moe | incgroup |
---|---|---|---|---|---|
27001 | Aitkin County, Minnesota | B19001_002 | 562 | 77 | below35k |
27001 | Aitkin County, Minnesota | B19001_003 | 544 | 72 | below35k |
27001 | Aitkin County, Minnesota | B19001_004 | 472 | 69 | below35k |
27001 | Aitkin County, Minnesota | B19001_005 | 508 | 68 | below35k |
27001 | Aitkin County, Minnesota | B19001_006 | 522 | 92 | below35k |
27001 | Aitkin County, Minnesota | B19001_007 | 447 | 61 | below35k |
27001 | Aitkin County, Minnesota | B19001_008 | 390 | 49 | bw35kand75k |
27001 | Aitkin County, Minnesota | B19001_009 | 426 | 64 | bw35kand75k |
27001 | Aitkin County, Minnesota | B19001_010 | 415 | 65 | bw35kand75k |
27001 | Aitkin County, Minnesota | B19001_011 | 706 | 81 | bw35kand75k |
Our result illustrates how the different variable IDs are mapped to the new, recoded categories that we specified in case_when(). The group_by() %>% summarize() workflow can now be applied to the recoded categories by county to tabulate the data into a smaller number of groups.
mn_group_sums <- mn_hh_income_recode %>%
group_by(GEOID, incgroup) %>%
summarize(estimate = sum(estimate))
GEOID | incgroup | estimate |
---|---|---|
27001 | above75k | 1706 |
27001 | below35k | 3055 |
27001 | bw35kand75k | 2879 |
27003 | above75k | 61403 |
27003 | below35k | 24546 |
27003 | bw35kand75k | 39311 |
27005 | above75k | 4390 |
27005 | below35k | 4528 |
27005 | bw35kand75k | 4577 |
27007 | above75k | 4491 |
Our data now reflect the new estimates by group by county.
3.4 Comparing ACS estimates over time
A common task when working with Census data is to examine demographic change over time. Data from the Census API - and consequently tidycensus - only go back to the 2000 Decennial Census. For analysts who want to go even further back, decennial Census data are available from 1790 onward from the National Historical Geographic Information System, or NHGIS, which will be covered in detail in Chapter 11.
3.4.1 Time-series analysis: some cautions
Before engaging in any sort of time-series analysis of Census data, analysts need to account for potential problems that can emerge when using Census data longitudinally. One major issue is that geographies can change over time. For example, let’s say we are interested in analyzing data on Oglala Lakota County, South Dakota. We can get recent data from the ACS using tools learned in Chapter 2:
oglala_lakota_age <- get_acs(
geography = "county",
state = "SD",
county = "Oglala Lakota",
table = "B01001",
year = 2020
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
46102 | Oglala Lakota County, South Dakota | B01001_001 | 14277 | NA |
46102 | Oglala Lakota County, South Dakota | B01001_002 | 6930 | 132 |
46102 | Oglala Lakota County, South Dakota | B01001_003 | 761 | 66 |
46102 | Oglala Lakota County, South Dakota | B01001_004 | 794 | 128 |
46102 | Oglala Lakota County, South Dakota | B01001_005 | 707 | 123 |
46102 | Oglala Lakota County, South Dakota | B01001_006 | 394 | 20 |
46102 | Oglala Lakota County, South Dakota | B01001_007 | 227 | 15 |
46102 | Oglala Lakota County, South Dakota | B01001_008 | 85 | 53 |
46102 | Oglala Lakota County, South Dakota | B01001_009 | 165 | 70 |
46102 | Oglala Lakota County, South Dakota | B01001_010 | 356 | 101 |
To understand how the age composition of the county has changed over the past 10 years, we may want to look at the 2006-2010 ACS for the county. Normally, we would just change the year argument to 2010:
oglala_lakota_age_10 <- get_acs(
geography = "county",
state = "SD",
county = "Oglala Lakota",
table = "B01001",
year = 2010
)
## Error in `map()`:
## ℹ In index: 1.
## ℹ With name: 1.
## Caused by error:
## ! Your API call has errors. The API message returned is .
The request errors, and we don’t get an informative error message back from the API as was discussed in Section 2.6. The problem here is that Oglala Lakota County had a different name in 2010, Shannon County, meaning that the county = "Oglala Lakota" argument will not return any data. In turn, the equivalent code for the 2006-2010 ACS would use county = "Shannon".
oglala_lakota_age_10 <- get_acs(
geography = "county",
state = "SD",
county = "Shannon",
table = "B01001",
year = 2010
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
46113 | Shannon County, South Dakota | B01001_001 | 13437 | NA |
46113 | Shannon County, South Dakota | B01001_002 | 6553 | 47 |
46113 | Shannon County, South Dakota | B01001_003 | 770 | 99 |
46113 | Shannon County, South Dakota | B01001_004 | 565 | 151 |
46113 | Shannon County, South Dakota | B01001_005 | 833 | 151 |
46113 | Shannon County, South Dakota | B01001_006 | 541 | 47 |
46113 | Shannon County, South Dakota | B01001_007 | 275 | 99 |
46113 | Shannon County, South Dakota | B01001_008 | 164 | 89 |
46113 | Shannon County, South Dakota | B01001_009 | 143 | 54 |
46113 | Shannon County, South Dakota | B01001_010 | 342 | 83 |
Note the differences in the GEOID column between the two tables of data. When a county or geographic entity changes its name, the Census Bureau assigns it a new GEOID, meaning that analysts need to take care when dealing with those changes. A full listing of geography changes is available on the Census website for each year.
In addition to changes in geographic identifiers, variable IDs can change over time as well. For example, the ACS Data Profile is commonly used for pre-computed, normalized ACS estimates. Let’s say that we are interested in analyzing the percentage of residents age 25 and up with a 4-year college degree for counties in Colorado from the 2019 1-year ACS. We’d first look up the appropriate variable ID with load_variables(2019, "acs1/profile"), then use get_acs():
co_college19 <- get_acs(
geography = "county",
variables = "DP02_0068P",
state = "CO",
survey = "acs1",
year = 2019
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
08001 | Adams County, Colorado | DP02_0068P | 25.4 | 1.3 |
08005 | Arapahoe County, Colorado | DP02_0068P | 43.8 | 1.3 |
08013 | Boulder County, Colorado | DP02_0068P | 64.8 | 1.8 |
08014 | Broomfield County, Colorado | DP02_0068P | 56.9 | 3.3 |
08031 | Denver County, Colorado | DP02_0068P | 53.1 | 1.1 |
We get back data for counties with populations of 65,000 and greater, as these are the geographies available in the 1-year ACS. The data make sense: Boulder County, home to the University of Colorado, has a very high percentage of its population with a 4-year degree or higher. However, when we run the exact same query for the 2018 1-year ACS:
co_college18 <- get_acs(
geography = "county",
variables = "DP02_0068P",
state = "CO",
survey = "acs1",
year = 2018
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
08001 | Adams County, Colorado | DP02_0068P | 375798 | NA |
08005 | Arapahoe County, Colorado | DP02_0068P | 497198 | NA |
08013 | Boulder County, Colorado | DP02_0068P | 263938 | NA |
08014 | Broomfield County, Colorado | DP02_0068P | 53400 | NA |
08031 | Denver County, Colorado | DP02_0068P | 575870 | NA |
The values are completely different, and clearly not percentages! This is because variable IDs for the Data Profile are unique to each year and in turn should not be used for time-series analysis. The returned results above represent the civilian population age 18 and up, and have nothing to do with educational attainment.
3.4.2 Preparing time-series ACS estimates
The safest option for time-series analysis in the ACS is to use the Comparison Profile Tables. These tables are available for both the 1-year and 5-year ACS, and allow for comparison of demographic indicators over the past five years for a given year. Using the Comparison Profile tables also brings the benefit of additional variable harmonization, such as inflation-adjusted income estimates.
Data from the Comparison Profile are accessed just like other ACS variables using get_acs(). The example below illustrates how to get data from the ACS Comparison Profile on inflation-adjusted median household income for counties and county-equivalents in Alaska.
ak_income_compare <- get_acs(
geography = "county",
variables = c(
income15 = "CP03_2015_062",
income20 = "CP03_2020_062"
),
state = "AK",
year = 2020
)
For the 2016-2020 ACS, the “comparison year” is 2015, representing the closest non-overlapping 5-year dataset, which in this case is 2011-2015. We can examine the results, which are inflation-adjusted for appropriate comparison:
GEOID | NAME | variable | estimate |
---|---|---|---|
02016 | Aleutians West Census Area, Alaska | income15 | 92500 |
02016 | Aleutians West Census Area, Alaska | income20 | 87443 |
02020 | Anchorage Municipality, Alaska | income15 | 85534 |
02020 | Anchorage Municipality, Alaska | income20 | 84813 |
02050 | Bethel Census Area, Alaska | income15 | 55692 |
02050 | Bethel Census Area, Alaska | income20 | 54400 |
02090 | Fairbanks North Star Borough, Alaska | income15 | 77590 |
02090 | Fairbanks North Star Borough, Alaska | income20 | 76464 |
3.4.2.1 Iterating over ACS years with tidyverse tools
Using the Detailed Tables also represents a safer option than the Data Profile, as it ensures that variable IDs will remain consistent across years, allowing for consistent and correct analysis. That said, there are still some potential pitfalls to account for when using the Detailed Tables. The Census Bureau will add and remove variables from survey to survey depending on data needs and data availability. For example, questions are sometimes added to and removed from the ACS, meaning that you won’t always be able to get every data point for every year and geography combination. In turn, it is still important to check on data availability using load_variables() for the years you plan to analyze before carrying out your time-series analysis.
Let’s re-engineer the analysis above on educational attainment in Colorado counties, this time computing a time series from 2010 to 2019. Information on “bachelor’s degree or higher” is split by sex and across different tiers of educational attainment in the Detailed Tables, found in ACS table B15002. Given that we only need a few variables (representing estimates of populations age 25+ who have finished a 4-year degree or graduate degrees, by sex), we’ll request those variables directly rather than the entire B15002 table.
college_vars <- c("B15002_015",
"B15002_016",
"B15002_017",
"B15002_018",
"B15002_032",
"B15002_033",
"B15002_034",
"B15002_035")
We’ll now use these variables to request data on college degree holders from the ACS for counties in Colorado for each of the 1-year ACS surveys from 2010 to 2019. In most cases, this process should be streamlined with iteration. Thus far, we are familiar with using the year argument in get_acs() to request data for a specific year. Writing out ten different calls to get_acs(), however - one for each year - would be tedious and would require a fair amount of repetitive code! Iteration helps us avoid repetitive coding, as it allows us to carry out the same process over a sequence of values. Programmers familiar with iteration will likely know of “loop” operators like for and while, which are available in base R and most other programming languages in some variety. Base R also includes the *apply() family of functions (e.g. lapply(), mapply(), sapply()), which iterate over a sequence of values and apply a given function to each value.
The tidyverse approach to iteration is found in the purrr package. purrr includes a variety of functions that are designed to integrate well in workflows that require iteration and use other tidyverse tools. The map_*() family of functions iterate over values and return a desired result; map() returns a list, map_int() returns an integer vector, and map_chr() returns a character vector, for example. With tidycensus, the map_dfr() function is particularly useful. map_dfr() iterates over an input, applies a function or process defined by the user to each element, then row-binds the results into a single data frame. The example below illustrates how this works for the years 2010 through 2019.
years <- 2010:2019
names(years) <- years
college_by_year <- map_dfr(years, ~{
get_acs(
geography = "county",
variables = college_vars,
state = "CO",
summary_var = "B15002_001",
survey = "acs1",
year = .x
)
}, .id = "year")
For users newer to R, iteration and purrr syntax can feel complex, so it is worth stepping through how the above code sample works.
- First, a numeric vector of years is defined with the syntax 2010:2019. This will create a vector of years at 1-year intervals. These values are set as the names of the vector as well, as map_dfr() has additional functionality for working with named objects.
- map_dfr() then takes three arguments above:
  - The first argument is the object that map_dfr() will iterate over, which in this case is our years vector. This means that the process we set up will be run once for each element of years.
  - The second argument is a formula we specify with the tilde (~) operator and curly braces ({...}). The code inside the curly braces will be run once for each element of years. The local variable .x, used inside the formula, takes on each value of years sequentially. In turn, we are running the equivalent of get_acs() with year = 2010, year = 2011, and so forth. Once get_acs() is run for each year, the results are combined into a single output data frame.
  - The .id argument, which is optional but used here, creates a new column in the output data frame that contains values equivalent to the names of the input object, which in this case is years. By setting .id = "year", we tell map_dfr() to name the new column that will contain these values year.
Let’s review the result:
year | GEOID | NAME | variable | estimate | moe | summary_est | summary_moe |
---|---|---|---|---|---|---|---|
2010 | 08001 | Adams County, Colorado | B15002_015 | 20501 | 1983 | 275849 | 790 |
2011 | 08001 | Adams County, Colorado | B15002_015 | 21233 | 2124 | 281231 | 865 |
2012 | 08001 | Adams County, Colorado | B15002_015 | 19238 | 2020 | 287924 | 693 |
2013 | 08001 | Adams County, Colorado | B15002_015 | 23818 | 2445 | 295122 | 673 |
2014 | 08001 | Adams County, Colorado | B15002_015 | 20255 | 1928 | 304394 | 541 |
2015 | 08001 | Adams County, Colorado | B15002_015 | 22962 | 2018 | 312281 | 705 |
2016 | 08001 | Adams County, Colorado | B15002_015 | 25744 | 2149 | 318077 | 525 |
2017 | 08001 | Adams County, Colorado | B15002_015 | 26159 | 2320 | 324185 | 562 |
2018 | 08001 | Adams County, Colorado | B15002_015 | 28113 | 2078 | 331247 | 955 |
2019 | 08001 | Adams County, Colorado | B15002_015 | 27552 | 2070 | 336931 | 705 |
The result is a long-form dataset that contains a time series of each requested ACS variable for each county in Colorado that is available in the 1-year ACS. The code below outlines a group_by() %>% summarize() workflow for calculating the percentage of the population age 25 and up with a 4-year college degree, then uses the pivot_wider() function from the tidyr package to spread the years across the columns for tabular data display.
percent_college_by_year <- college_by_year %>%
group_by(NAME, year) %>%
summarize(numerator = sum(estimate),
denominator = first(summary_est)) %>%
mutate(pct_college = 100 * (numerator / denominator)) %>%
pivot_wider(id_cols = NAME,
names_from = year,
values_from = pct_college)
NAME | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
---|---|---|---|---|---|---|---|---|---|---|
Adams County, Colorado | 20.57394 | 20.51801 | 20.64538 | 23.09384 | 22.16929 | 22.79742 | 22.95293 | 22.87552 | 25.74152 | 25.39956 |
Arapahoe County, Colorado | 37.03001 | 38.24506 | 39.28435 | 39.42478 | 40.94194 | 41.03578 | 41.48359 | 43.69387 | 42.71285 | 43.78332 |
Boulder County, Colorado | 57.50285 | 59.05601 | 57.88284 | 58.53214 | 58.04066 | 60.57147 | 60.63005 | 63.18150 | 62.51394 | 64.80486 |
Broomfield County, Colorado | NA | NA | NA | NA | NA | 56.07776 | 51.94338 | 55.13359 | 56.27740 | 56.87181 |
Denver County, Colorado | 40.87971 | 42.97122 | 44.65358 | 44.35340 | 44.25600 | 47.10820 | 47.39683 | 49.32692 | 51.34580 | 53.10088 |
Douglas County, Colorado | 54.96800 | 53.27936 | 55.09223 | 57.66999 | 56.48866 | 56.06928 | 59.42687 | 58.53342 | 58.37539 | 58.13747 |
El Paso County, Colorado | 34.11467 | 35.69184 | 34.91315 | 35.47612 | 36.49302 | 36.43089 | 38.67864 | 39.19931 | 38.83872 | 39.04942 |
Jefferson County, Colorado | 40.83113 | 39.54961 | 41.43825 | 41.04234 | 41.99768 | 43.20923 | 43.51953 | 45.57140 | 45.83786 | 47.61074 |
Larimer County, Colorado | 45.80197 | 42.83543 | 44.71423 | 43.33800 | 42.67180 | 46.16705 | 46.78871 | 47.90465 | 47.62330 | 49.01654 |
Mesa County, Colorado | 24.99285 | 25.82724 | 23.01511 | 27.63325 | 25.14875 | 30.27630 | 25.02980 | 25.82455 | 29.95739 | 29.78537 |
This particular format is suitable for data display or writing to an Excel spreadsheet for colleagues who are not R-based. Methods for visualization of time-series estimates from the ACS will be covered in Section 4.4.
3.5 Handling margins of error in the American Community Survey with tidycensus
A topic of critical importance when working with data from the American Community Survey is the margin of error. As opposed to the decennial US Census, which is based on a complete enumeration of the US population, the ACS is based on a sample with estimates characterized by margins of error. By default, MOEs are returned at a 90 percent confidence level. This can be translated roughly as “we are 90 percent sure that the true value falls within a range defined by the estimate plus or minus the margin of error.”
As discussed in Chapter 2, tidycensus takes an opinionated approach to margins of error. When applicable, tidycensus will always return the margin of error associated with an estimate, and does not have an option available to return estimates only. For “tidy” or long-form data, these margins of error will be found in the moe column; for wide-form data, margins of error will be found in columns with an M suffix.
The confidence level of the MOE can be controlled with the moe_level argument in get_acs(). The default moe_level is 90, which is what the Census Bureau returns by default. tidycensus can also return MOEs at a confidence level of 95 or 99, using Census Bureau-recommended formulas to adjust the MOE. For example, we might look at data on median household income by county in Rhode Island using the default moe_level of 90:
get_acs(
geography = "county",
state = "Rhode Island",
variables = "B19013_001",
year = 2020
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
44001 | Bristol County, Rhode Island | B19013_001 | 85413 | 6122 |
44003 | Kent County, Rhode Island | B19013_001 | 75857 | 2022 |
44005 | Newport County, Rhode Island | B19013_001 | 84282 | 2629 |
44007 | Providence County, Rhode Island | B19013_001 | 62323 | 1270 |
44009 | Washington County, Rhode Island | B19013_001 | 86970 | 3651 |
A stricter confidence level will increase the size of the MOE relative to its estimate. For example, the Bristol County estimate of $85,413 with a 90 percent MOE of $6,122 implies that we are 90 percent sure the true median household income falls between $79,291 and $91,535; requesting a 99 percent confidence level widens that range:
get_acs(
geography = "county",
state = "Rhode Island",
variables = "B19013_001",
year = 2020,
moe_level = 99
)
GEOID | NAME | variable | estimate | moe |
---|---|---|---|---|
44001 | Bristol County, Rhode Island | B19013_001 | 85413 | 9586.791 |
44003 | Kent County, Rhode Island | B19013_001 | 75857 | 3166.366 |
44005 | Newport County, Rhode Island | B19013_001 | 84282 | 4116.902 |
44007 | Providence County, Rhode Island | B19013_001 | 62323 | 1988.766 |
44009 | Washington County, Rhode Island | B19013_001 | 86970 | 5717.311 |
3.5.1 Calculating derived margins of error in tidycensus
For small geographies or small populations, margins of error can get quite large, in some cases exceeding their corresponding estimates. In the example below, we can examine data on age groups by sex for the population age 65 and older for Census tracts in Salt Lake County, Utah. We will first generate a vector of variable IDs for which we want to request data from the ACS using some base R functionality.
In this workflow, an analyst has used load_variables() to look up the variables that represent estimates for populations age 65 and up; this includes B01001_020 through B01001_025 for males, and B01001_044 through B01001_049 for females. Typing out each variable individually would be tedious, so an analyst can use string concatenation to generate the required vector of variable IDs.
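A sketch of that concatenation, storing the result in the vars object used later in this section:

vars <- paste0("B01001_0", c(20:25, 44:49))

vars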
## [1] "B01001_020" "B01001_021" "B01001_022" "B01001_023" "B01001_024"
## [6] "B01001_025" "B01001_044" "B01001_045" "B01001_046" "B01001_047"
## [11] "B01001_048" "B01001_049"
When R evaluates nested expressions like this, it starts with the innermost expressions and works outward. The steps taken to assemble the correct vector of variable IDs are as follows:
- First, the expressions 20:25 and 44:49 are evaluated. The colon operator : in base R, when used between two numbers, will generate a vector of numbers from the first to the second at intervals of 1. This creates vectors of the integers 20 through 25 and 44 through 49, which will serve as the suffixes for the variable IDs.
- Second, the c() function is used to combine the two vectors of integers into a single vector.
- Third, the paste0() function concatenates the string prefix "B01001_0" with each of the integers in the vector created with c() and returns a vector of variable IDs. paste0() is a convenient extension of the more flexible paste() function that concatenates strings with no spaces in between them by default.
The resulting variables object, named vars, can now be used to request variables in a call to get_acs().
salt_lake <- get_acs(
geography = "tract",
variables = vars,
state = "Utah",
county = "Salt Lake",
year = 2020
)
We will now want to examine the margins of error around the estimates in the returned data. Let’s focus on a specific Census tract in Salt Lake County using filter().
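A minimal sketch of that subset, dropping the NAME column to match the display below (the object name example_tract is illustrative):

example_tract <- salt_lake %>%
  filter(GEOID == "49035100100") %>%
  select(-NAME)

example_tract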
GEOID | variable | estimate | moe |
---|---|---|---|
49035100100 | B01001_020 | 11 | 13 |
49035100100 | B01001_021 | 25 | 18 |
49035100100 | B01001_022 | 7 | 10 |
49035100100 | B01001_023 | 4 | 7 |
49035100100 | B01001_024 | 0 | 12 |
49035100100 | B01001_025 | 17 | 20 |
49035100100 | B01001_044 | 0 | 12 |
49035100100 | B01001_045 | 4 | 7 |
49035100100 | B01001_046 | 21 | 17 |
49035100100 | B01001_047 | 123 | 168 |
In many cases, the margins of error exceed their corresponding estimates. For example, the ACS data suggest that in Census tract 49035100100, for the male population age 85 and up (variable ID B01001_025), there are anywhere between 0 and 37 people in that Census tract (an estimate of 17 with a margin of error of 20, with the lower bound truncated at 0). This can make ACS data for small geographies problematic for planning and analysis purposes.
A potential solution to large margins of error for small estimates in the ACS is to aggregate data upwards until a satisfactory margin of error to estimate ratio is reached. The US Census Bureau publishes formulas for appropriately calculating margins of error around such derived estimates, which are included in tidycensus with the following functions:
- moe_sum(): calculates a margin of error for a derived sum;
- moe_product(): calculates a margin of error for a derived product;
- moe_ratio(): calculates a margin of error for a derived ratio;
- moe_prop(): calculates a margin of error for a derived proportion.
In their most basic form, these functions can be used with constants. For example, let’s say we had an ACS estimate of 25 with a margin of error of 5 around that estimate. The appropriate denominator for this estimate is 100 with a margin of error of 3. To determine the margin of error around the derived proportion of 0.25, we can use moe_prop():
moe_prop(25, 100, 5, 3)
## [1] 0.0494343
Our margin of error around the derived estimate of 0.25 is approximately 0.049.
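Under the hood, moe_prop() follows the Census Bureau’s approximation formula for a derived proportion: the square root of the numerator’s squared MOE minus the squared proportion times the denominator’s squared MOE, divided by the denominator estimate. A quick sketch verifying the result by hand:

# p = 25 / 100 = 0.25
sqrt(5^2 - 0.25^2 * 3^2) / 100
## [1] 0.0494343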
3.5.2 Calculating group-wise margins of error
These margin of error functions in tidycensus can in turn be integrated into tidyverse-centric analytic pipelines to handle large margins of error around estimates. Given that the smaller age bands in the Salt Lake County dataset are characterized by too much uncertainty for our analysis, we decide in this scenario to aggregate our data upwards to represent populations aged 65 and older by sex.
In the code below, we use the case_when() function to create a new column, sex, that represents a mapping of the variables we pulled from the ACS to their sex categories. We then employ a familiar group_by() %>% summarize() method to aggregate our data by Census tract and sex. Notably, the call to summarize() includes a call to tidycensus’s moe_sum() function, which will generate a new column representing the margin of error around the derived sum.
salt_lake_grouped <- salt_lake %>%
mutate(sex = case_when(
str_sub(variable, start = -2) < "26" ~ "Male",
TRUE ~ "Female"
)) %>%
group_by(GEOID, sex) %>%
summarize(sum_est = sum(estimate),
sum_moe = moe_sum(moe, estimate))
GEOID | sex | sum_est | sum_moe |
---|---|---|---|
49035100100 | Female | 165 | 170.72493 |
49035100100 | Male | 64 | 34.43835 |
49035100200 | Female | 170 | 57.26255 |
49035100200 | Male | 128 | 47.51842 |
49035100306 | Female | 155 | 77.48548 |
49035100306 | Male | 136 | 85.53362 |
49035100307 | Female | 207 | 82.03658 |
49035100307 | Male | 110 | 58.89822 |
49035100308 | Female | 157 | 110.97297 |
49035100308 | Male | 91 | 59.97499 |
The margins of error relative to their estimates are now much more reasonable than in the disaggregated data.
That said, the Census Bureau issues a note of caution (American Community Survey Office 2020):
All derived MOE methods are approximations and users should be cautious in using them. This is because these methods do not consider the correlation or covariance between the basic estimates. They may be overestimates or underestimates of the derived estimate’s standard error depending on whether the two basic estimates are highly correlated in either the positive or negative direction. As a result, the approximated standard error may not match direct calculations of standard errors or calculations obtained through other methods.
This means that your “best bet” is to first search the ACS tables to see if your data are found in aggregated form elsewhere before doing the aggregation and MOE estimation yourself. In many cases, you’ll find aggregated information in the ACS combined tables, Data Profile, or Subject Tables that will include pre-computed margins of error for you.
3.6 Exercises
- The ACS Data Profile includes a number of pre-computed percentages, which can reduce your data wrangling time. The variable in the 2015-2019 ACS for “percent of the population age 25 and up with a bachelor’s degree” is DP02_0068P. For a state of your choosing, use this variable to determine:
  - The county with the highest percentage in the state;
  - The county with the lowest percentage in the state;
  - The median value for counties in your chosen state.